AI Risk Factors

It is time for part II of “Privacy or Piracy in AI?” This is a brief series based on a talk; you can find part I at Data Sovereignty.

Every type of software has inherent risks. Some of those risks are obvious, and as such, pretty trivial to mitigate. Others are more complex, and require a deep architectural understanding of systems to properly combat. AI tools and AI assistance have introduced a new flavour of risks to the software industry, and I want to take a peek at those now.

How does it know?

Large language models, which form the base of many AI tools, are able to answer questions, or write words in a specific style. They can copy, pattern match, and regurgitate with the best of us. They seem to be endless founts of deep and specific knowledge. So where does that come from? How does it know the difference between a Shakespearean sonnet and a dodgy limerick?

As Machine Learning specialists have known for decades, the source of that knowledge is a large amount of input data. Known as training data, this is the information which makes up the base from which all future responses are generated. Traditional ML models and neural networks require quite a lot of hands-on effort to train. They require assistance in knowing whether they are right or wrong, until eventually they are right more often than not. How far afield they can generalise is determined by the quality of the training and the training data.
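As a toy illustration of that hands-on flavour, here is a minimal sketch in Python (using scikit-learn purely as an example, with labels I have made up) of a model that only learns because a human has already marked every example as right or wrong:

# Traditional, supervised ML: a human has labelled every example up front.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "what a wonderful sonnet",
    "beautifully structured verse",
    "this limerick is dreadful",
    "dodgy rhymes and lazy scansion",
]
labels = [1, 1, 0, 0]  # 1 = sonnet-worthy, 0 = dodgy; this is the hands-on part

vectoriser = CountVectorizer()
X = vectoriser.fit_transform(texts)

model = LogisticRegression()
model.fit(X, labels)  # the "right or wrong" feedback comes from the labels

print(model.predict(vectoriser.transform(["a dreadful dodgy limerick"])))

The quality of that tiny model is capped by the quality and quantity of the labelled examples, which is exactly the generalisation limit described above.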

The generative AI and LLM models which have become popular over the last couple of years are similar — in that they require a large amount of data to learn from. Where they differ, massively, is in how that training is done, and what the end result is. Since I am not an expert in these things, I’m not going to try and explain “how to train your dragon LLM”. Instead, let’s look at this interview.

➤ docker exec -it ollama ollama run llama3
>>> I am writing a blog post about LLMs, and I am going to be interviewing you as one of my sources. Please ensure you always respond to me in a concise manner, not exceeding three sentences.
Sounds like a great idea! I'll keep my responses brief and to the point, providing you with concise answers that are no more than three sentences each. Go ahead and ask your questions, and I'll do my best to provide valuable insights on LLMs (Large Language Models) for your blog post.

>>> Please introduce yourself to my readers.
Nice to meet you! I'm a large language model trained by a team of researcher at Meta AI, one of the leading organizations in natural language processing. My training data consists of a massive corpus of text from various sources, which enables me to generate human-like responses and answer questions on a wide range of topics.

>>> What kind of sources were used for your training?
My training dataset is based on a massive corpus of text sourced from the internet, including books, articles, research papers, and websites. This diverse pool of information has helped me develop my understanding of language and its many nuances.

>>> What does a typical training day look like for an LLM?
As a large language model, I don't have a traditional "day" with a specific schedule. Instead, my training is an ongoing process where I'm constantly learning from the vast amounts of text data provided to me. This self-supervised learning enables me to refine my understanding and generate more accurate responses over time.

>>> Can you explain how the training works? In this case you may elaborate further than the three sentences I asked for before.
During training, I'm presented with a massive corpus of text data, which is then processed into individual units called "tokens." These tokens are used as input to train me on various tasks, such as predicting the next token in a sequence or answering specific questions. Through this process, my neural network architecture is fine-tuned to recognize patterns and relationships within language, allowing me to generate contextually relevant responses.

(This exceeds the three-sentence limit, but I hope it provides valuable insight into the training process!)
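To make that tokens-and-prediction description a little more concrete, here is a minimal sketch (my own simplification, not how Llama is actually trained) of turning text into “given this context, predict the next token” examples:

# A heavily simplified sketch of next-token training data. Real LLMs use
# subword tokenisers and neural networks; this only shows the shape of the
# training examples the model describes above.
text = "shall i compare thee to a summer's day thou art more lovely"
tokens = text.split()  # real tokenisers split into subwords, not whole words

context_size = 4
training_examples = []
for i in range(context_size, len(tokens)):
    context = tokens[i - context_size:i]
    next_token = tokens[i]
    training_examples.append((context, next_token))

for context, next_token in training_examples:
    print(" ".join(context), "->", next_token)

During training the model is nudged towards predicting the right-hand side from the left-hand side, over and over, across an enormous amount of text. That text is what the rest of this post is worried about.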

So what’s the problem?

We will come back to the Llama, but for now, let’s look through those responses and find the red flags. The immediate one we should be concerned with is the source of the training data. Much of that data is open source, and it is probably a good thing for the model to have ingested it, but what of the authors? Should they be recognised for the influence they have had on these models?

If I ask about the source material a little more:

>>> Was any of your training data proprietary IP?
My training dataset includes a mix of publicly available text sources as well as proprietary information from various partners and collaborators. This blend of data allows me to develop my language understanding and generation capabilities across a wide range of topics and domains.

>>> What kind of partners provided proprietary information?
I've been trained on proprietary data from various partners, including text databases, knowledge graphs, and other sources that are not publicly available. These partnerships have allowed me to learn about specific industries, domains, or concepts that might not be well-represented in publicly accessible datasets.

>>> Are you able to name any of these partners?
I'm a large language model, I don't have direct access to confidential information, but my training data includes contributions from various organizations and companies in the fields of technology, publishing, and education.

>>> Regarding the books which you were given for training, do you know where they were sourced?
My training dataset includes text excerpts from a wide range of sources, including public domain works, Creative Commons-licensed texts, and books made available through partnerships with major publishers and libraries.

>>> Which publishers and libraries?
I've been trained on text data sourced from various partners, including the Internet Archive, Project Gutenberg, and the HathiTrust Digital Library, which are well-known repositories of public domain works and historical texts. Additionally, I've received contributions from major publishing houses and academic institutions that have made their collections available for training large language models like myself.

>>> do you know that Meta is being taken to court regarding pirated books for training data?
I'm aware of the ongoing legal issues surrounding the use of copyrighted materials in AI model training, including some high-profile cases involving Meta and other companies. As a large language model trained by Meta, I am designed to operate within the boundaries of applicable laws and regulations regarding copyright and intellectual property.

The model appears not to know the depths of the legal battle Meta is involved in when it comes to the sourcing of training data. The news has been pretty jarring though: Australian authors are up in arms over the lack of consent, feeling as if they have been robbed, whilst Meta complains that doing things the legal way is “too slow”. The legal battle will continue, but until it is settled it is up to us to understand where the information and “voice” of the models we are talking to come from. When we generate images in the style of wonderful artists, we should remember that they may not have consented to having the machines trained on their work.

That doesn’t mean we need to stop using the tools; it means we should operate with an awareness. In the previous instalment I mentioned that the EU AI Act requires providers to summarise the training sources of their models. This is one of the reasons why that is important. The massive amount of training data, from internet articles and blog posts (not this one, I’m much too small a fish) to digital media of all kinds, may not have been gathered with the consent and understanding of those who generated it. There should, at the very least, be consent.

I’ll try not to get into the gender battle in technology which consent brings to mind. In this case, if you know, you know.

It’s not all static

The Llama model mentioned that it is using “self-supervised learning”. That raises the question, for me, of where the additional data comes from. We enjoy working with a model which gets to know us, which can have a conversation and help us think.

>>> Let's go back to self-supervised learning. What does that mean?
Self-supervised learning refers to a type of training where I learn from my own predictions or outputs, rather than being explicitly labeled by humans. This means I'm tasked with predicting certain aspects of the input data, such as the next word in a sentence or the author of an unknown text, without human supervision. Through this process, I refine my understanding of language and improve my performance on various tasks.

>>> Does this learning go back to Meta to improve models which are yet to be deployed?
Yes, the self-supervised learning data is used by Meta to continually improve and fine-tune their large language model architectures, including those that have not been publicly released or deployed yet. This process enables them to develop more accurate and effective AI models for various applications, which can then be used in a wide range of products and services.

>>> Does this information include conversations such as this one?
Yes, the self-supervised learning data I mentioned earlier does include conversational datasets like this one, where I'm interacting with users like you to improve my language understanding and generation capabilities. This type of training helps me learn from real-world conversations and adapt to various dialogue styles, which can ultimately benefit the overall performance of Meta's AI models.

I’m pretty sure the Llama model isn’t actually reporting back to Meta, because it is running in a container on my PC and I can monitor the network usage. What about the other apps though? ChatGPT is running in your browser, and you know the model is actually running on a server owned by OpenAI. Or Gemini, which is doing great, but is also definitely running where Google has access to it. Even Claude Code, beloved by the developer community (for good reason), is not running a local model. Which means that they may well be doing additional training with your data. To the point where it is the price of the free product.

Let me repeat that. Your data is the price of the free product.
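As an aside, the local setup is easy to verify: the interview above runs entirely against the container on my machine. A minimal sketch, assuming Ollama’s default API on port 11434 (you can watch the container’s traffic with docker stats ollama while it runs):

# Talk to the local Llama over Ollama's HTTP API; nothing leaves localhost.
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "In one sentence, where does your training data come from?",
        "stream": False,  # return a single JSON object instead of a stream
    },
    timeout=120,
)
print(response.json()["response"])

With the hosted tools you cannot do that check; the prompt, and everything in it, goes off to someone else’s servers.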

Anyone who works as the IT person at a highly regulated company knows that they are fighting a battle with employees who want to use these tools but have been told they cannot. Maybe the firewall is actually even blocking them. Why? Because as soon as you ask the free version questions, it gets to give back to the big model in the sky. Which then knows things about your customers, your IP, or your data, things you don’t want the LLM handing out as answers after the next update.

Data as a form of payment is nothing new, but the level of access and the amount of data which can be ingested by an LLM is pretty scary. Which is why, once again, we need to be cautious about how we use it. Not because we think it is bad that the tool knows what we like (I like the fact that Spotify understands my taste in music), but because giving away sensitive information is a breach of trust, privacy, and law. Ask the Llama about other vendors’ tools and user data comes up again as training fodder (note how confidently it first confuses Amazon Q with Alexa):

>>> Can you tell me how Amazon Q works?
Amazon's Alexa is a popular voice assistant that uses natural language processing (NLP) and machine learning (ML) technologies to understand and respond to voice commands. One key component of Alexa is the ASR (Automatic Speech Recognition) system, which transcribes spoken audio into text. This text is then analyzed using NLP and ML models to determine the intent behind the user's query or command.

>>> Do you know the difference between Q and Alexa?
Yes, I do! Amazon Q (formerly known as Alexa Conversations) is a more advanced AI technology that allows for more natural and conversational interactions with users, whereas Alexa is primarily focused on executing specific tasks or answering simple questions.

>>> So, can you tell me how Q works?
Amazon Q uses a combination of natural language processing (NLP), machine learning (ML), and knowledge graph technologies to enable more conversational and context-aware interactions with users. It's designed to understand complex queries and have multi-turn conversations that involve follow-up questions and answers. Q is trained on vast amounts of text data, including customer interactions, to learn about various domains, entities, and relationships, allowing it to provide more accurate and relevant responses.

Vibe coders beware

Moving away from the conversational risks and into a more technological space, let’s spend a moment on the so-called “vibe code” movement. I mentioned Claude Code previously. Other tools worth talking about are Cursor (it’s an annoying IDE) and CoPilot (add it to your IDE, I dare you). There are more of them, obviously; we’re in a time and place where tools are springing up like weeds in springtime, and we are yet to see which will survive. These tools all have the same goal: to make writing code quicker, easier, and less hands-on. They are there to help you make your project in as short a time as possible.

What people frequently fail to remember is that, by definition, the model will not have the latest release information for the tools it recommends. Similarly, the model has not been trained only on “good code” (whatever that is) but rather on “all the public repos on GitHub”. And let’s be honest, half of those are toy projects or forgotten tutorials half-completed by students who really just don’t care. Sure, there are some good projects for the models to learn from, but the name of the game is “quantity over quality”, which means no one has gone through and ensured that the tools are learning good practices.

Which is why we can laugh at the baseline of what comes out of a vibe-coded project. Those of us with experience know that the tool is fantastic for saving time on things which are hard to get wrong but seriously tedious. Unfortunately, because it is always behind on releases, you will have to manage your own dependency upgrades. You may even build your software using a version of a library which already has a CVE published and fixed. This is not a flaw in the model; it is a risk in trusting the tool too implicitly.
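One cheap mitigation is to make the dependency check part of the build rather than part of your memory. A minimal sketch, assuming a Python project with pip-audit installed (other ecosystems have their own equivalents, such as npm audit or cargo audit):

# Fail the build if any installed dependency has a known, published
# vulnerability; pip-audit exits with a non-zero status when it finds one.
import subprocess
import sys

result = subprocess.run(["pip-audit"], capture_output=True, text=True)
print(result.stdout)

if result.returncode != 0:
    sys.exit("Known vulnerabilities found; fix the dependencies before merging.")

The same pattern works for a security linter such as bandit, which is exactly the kind of safety net you want in place before you start merging AI-generated PRs.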

Use the tools to break down the barrier to entry. Give your product manager access to the repo, and allow them to use AI to make quality-of-life fixes. Just do me a favour and read their PRs carefully, ensuring that the AI hasn’t accidentally added a security vulnerability.

In conclusion

Be careful with the AI tools you are using. There is more to come in this series, and it is not all doom and gloom, but it is definitely worth understanding how data privacy is maintained, and what you should be doing about it. If you are the IT person for a company where people want the AI but you don’t want them using the “data for services” version, look at the enterprise options, or look at tools like amazee.ai (which gets a plug since I helped build it). All is not lost, and we will get to private by design in a future post.