Data Sovereignty

This post begins an experiment: conceptually, a short series of posts. The idea is that a normal-length blog post is roughly a ten-minute read, which means a good conference-length talk (30 to 40 minutes) should translate to three or four posts. So, herewith part one of Privacy or Piracy in AI?

We begin with a topic made of two overloaded words: Data Sovereignty. It is a legal term concerned with where data is stored. But what does it really mean? How does it apply in different contexts? How does it apply to the GenAI and LLM bubble?

Data

How do we even define data? We start by constraining our system with a set of assumptions. The first assumption is that we treat information and data interchangeably. Data scientists are going to groan and be sad as soon as I say this, but in the context of businesses operating on the internet, information is data, and vice versa. This allows us the flexibility to claim this post is data even if it turns out to be a wonky shape. My second assumption is that we are in fact speaking in the context of digital business, meaning that the brokers of this data are connected to the internet, are making use of one or more pieces of software not hosted on the machine where the data is captured, and are using that data to provide goods or services to customers.

Because we’re nerds here, let’s look at a couple of dictionary definitions of the word.

1: factual information (such as measurements or statistics) used as a basis for reasoning, discussion, or calculation
2: information in digital form that can be transmitted or processed
3: information output by a sensing device or organ that includes both useful and irrelevant or redundant information and must be processed to be meaningful

— Merriam-Webster

information, especially facts or numbers, collected to be examined and considered and used to help decision-making, or information in an electronic form that can be stored and used by a computer

— Cambridge Dictionary

At least the dictionary agrees with my assumptions. What the definitions miss is that some data is more sensitive than other data. You may hurt its feelings if you say mean things about it. Or, you may hurt the people it references if bad actors get hold of it. This is why the information which makes up your identity is considered sensitive, and not to be shared unencrypted on the internet with everyone and their uncle. This is why we are careful with credit card information. Personally Identifiable Information (PII) is a strictly controlled substance, not to be traded without the explicit permission of the one it identifies. When it leaks, people freak out, as they should.

Sovereignty

Said “sov-rin-tea”, this is one of those words where the English language adds letters. Living in a republic like South Africa, one questions the relevance of the word. So, let’s start with what it means.

1 a: supreme power especially over a body politic
b: freedom from external control : autonomy
c: controlling influence
2: one that is sovereign; especially: an autonomous state
3 (obsolete): supreme excellence or an example of it

— Merriam-Webster

the power of a country to control its own government.

— Cambridge Dictionary

Um. Is this what we expected contextually? This is political for the most part. It doesn’t even go into the associations with Kings and Queens which the word Sovereign might bring up. The concept of control is interesting, though. Let’s latch on to that part of the definition. If we let go of the government aspects, and instead look at the idea of “local control”, we start putting together something sensible.

Data sovereignty is about a country being in control of its own data. It means information about the people of a country is stored within the country it is relevant to. Much as a person’s physical location means they are subject to, and protected by, the laws of that land, data is protected and restricted by legal principles determined locally.

Legal Matters

I mentioned “the laws of the land”, and they matter quite a lot. It turns out that even though the internet is a global phenomenon, and data can be shared between countries at the speed of light (not instantly), geopolitics matter. Anyone who works in a regulated industry (such as with children, or in a medical field) will tell you that the laws are unique to the country, and sometimes even more finely grained, down to the state or province or city. That goes for physical protection as well as digital. There are three influential pieces of legislation I want us to take a quick look at. These are not new to most of us, but worth remembering.

POPIA

The Protection Of Personal Information Act in South Africa came into effect in 2020. In the simplest terms, it provides the legal foundation for South African data sovereignty. Unless specific criteria are met, data should not leave the country (network traffic notwithstanding). If the data gatherer garners informed consent from the user, that data may then be processed and stored out of country, to varying degrees depending on further conditions. There is a bunch of legal-speak involved, but from a development perspective what it comes down to is this (a small sketch follows the list).

  1. Do everything in your power to ensure no data is stored in another country.
  2. If your data must be stored outside of South Africa, ensure it is in a location which is fully GDPR compliant, as GDPR is actually stricter than POPIA.
  3. Ensure your users consent to the external processing of their data.
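To make point 1 concrete, here is a minimal sketch of pinning storage to an in-country region and verifying it before use. It assumes AWS’s Cape Town region (af-south-1); the bucket name and the helper function are my own illustrative choices, not a prescribed POPIA pattern.

```python
# A minimal sketch: pin object storage to an in-country region and
# verify the pin before trusting it. Assumes AWS's Cape Town region
# (af-south-1); the bucket name is illustrative.
import boto3

REGION = "af-south-1"  # AWS Cape Town: keeps data at rest in South Africa
s3 = boto3.client("s3", region_name=REGION)

def ensure_local_bucket(bucket: str) -> None:
    # Explicitly request the local region rather than trusting defaults.
    s3.create_bucket(
        Bucket=bucket,
        CreateBucketConfiguration={"LocationConstraint": REGION},
    )
    # Fail loudly if the bucket somehow lives elsewhere.
    location = s3.get_bucket_location(Bucket=bucket)["LocationConstraint"]
    if location != REGION:
        raise RuntimeError(f"Bucket {bucket} is in {location}, not {REGION}")

ensure_local_bucket("example-popia-records")
```

The same idea applies to any cloud provider with an in-country region: choose the local region explicitly, and verify it, instead of relying on defaults that may route data abroad.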

One of my favourite aspects of POPIA is how deeply it is rooted in the South African Constitution, a document lauded around the world as one of the best-written constitutions, and one which protects us from many different aspects of governmental decay. There is a whole side post to be written on the history of the South African Constitution, which I won’t bore you with now. But what I want to highlight is this:

Everyone has the right to privacy, which includes the right not to have— (a) their person or home searched; (b) their property searched; (c) their possessions seized; or (d) the privacy of their communications infringed.

— Section 14, Constitution of the Republic of South Africa, 1996

Privacy is acknowledged as a human right in a legal document. When the first version of this document was published in 1996, we didn’t have the concept of generative AI, but there were lots of other uses for personal data and information. Even in 2020, when POPIA came into effect, it wasn’t about generative AI so much as data being sold for dubious purposes.

GDPR

Most of us have heard of GDPR. There are even “sleep stories” where someone reads the GDPR document as a way of helping you fall asleep. In fact, the document is a dense legal text of 99 articles and 173 recitals, breaking down what is meant by the General Data Protection Regulation and how it impacts people and businesses. Coming into effect in 2018, GDPR predates POPIA and set the standard for data protection across the world.

Again, human rights enter the picture, this time with the 1950 European Convention on Human Rights (whose Article 8 protects the right to respect for private life), which no doubt influenced South Africa.

GDPR, in a nutshell, gives users more control over the life cycle of their data. Data retention policies are becoming the norm, as are data location policies. The fines which can be levied if a group is found not to be compliant are hefty enough that even the giants who sometimes believe themselves above the law have taken action. The most noticeable change has been to the way cookies are stored in our browsers. All those super annoying cookie popups where you have to click “accept” or go in and turn off extra cookies? Yep, those exist because it is now legally required that you consent before non-essential cookies can be stored on your device. Once again, informed consent.
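As a rough illustration of what consent-gating looks like in code, here is a minimal sketch assuming a Python Flask app; the cookie names and the /accept-cookies endpoint are illustrative assumptions, not a mandated pattern.

```python
# A minimal sketch of consent-gated cookies, assuming a Flask app.
# Cookie names and the /accept-cookies route are illustrative only.
from flask import Flask, request, make_response

app = Flask(__name__)

@app.route("/")
def index():
    resp = make_response("<h1>Hello</h1>")
    # Strictly necessary cookies (e.g. a session token) may be set
    # without consent; anything else must wait for an explicit opt-in.
    if request.cookies.get("cookie_consent") == "granted":
        resp.set_cookie("analytics_id", "abc123", max_age=60 * 60 * 24 * 30)
    return resp

@app.route("/accept-cookies", methods=["POST"])
def accept_cookies():
    # Record the user's informed consent itself as a cookie.
    resp = make_response("", 204)
    resp.set_cookie("cookie_consent", "granted", max_age=60 * 60 * 24 * 365)
    return resp
```

The design point is simply that the default is “no extra cookies”, and only an explicit user action flips that switch.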

EU AI Act

The EU AI Act regulates AI use for all businesses in the EU.

Software Improvement Group

One of the first such acts, but certainly not the last, the EU AI Act takes the principles we have mentioned regarding POPIA and GDPR and moves them forward with the changing times of generative AI and Large Language Models (LLMs). It works to ensure that there are no gross breaches of privacy in the name of progress. The act has many parts, each coming into effect over a period of time. The strictest rules (e.g. no facial recognition without a court order) are already in place, with the less pressing aspects still to come.

As of Aug 2nd, 2025, every GPAI provider must keep a private “black-box” dossier that shows regulators exactly how the model was built and tested; publish a short, public summary of the copyrighted material used for training; give customers a compact “model card” that spells out what the model is (and isn’t) meant to do; and prove that EU copyright rules are respected—whether by licences, opt-outs or clear attribution.

This part of the act is fascinating. The model trainers are being held to account for the data they are ingesting. If they cannot prove that they are following copyright law, they are going to have to pay fines, even if those fines only come into effect next year. This puts pressure on the model providers, and in my opinion, it will help ensure intellectual property is respected.
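To make the “model card” idea a little more concrete, here is a hypothetical sketch of the kind of information such a card might carry. The act does not prescribe this schema; every field and value below is an illustrative assumption.

```python
# A hypothetical model card structure. The EU AI Act does not define
# this exact schema; fields and values here are assumptions only.
from dataclasses import dataclass, asdict
import json

@dataclass
class ModelCard:
    name: str
    version: str
    intended_use: list[str]
    out_of_scope_use: list[str]   # what the model is NOT meant to do
    training_data_summary: str    # public summary, incl. copyrighted sources
    evaluation_notes: str

card = ModelCard(
    name="example-gpai-model",
    version="1.0",
    intended_use=["drafting text", "summarisation"],
    out_of_scope_use=["biometric identification", "legal advice"],
    training_data_summary="Licensed corpora plus an opt-out-respecting web crawl.",
    evaluation_notes="Details live in the private technical dossier kept for regulators.",
)

print(json.dumps(asdict(card), indent=2))
```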

So what’s the impact?

Essentially, all the laws which previously protected data and information on the internet still apply to AI applications, even if providers are not building with that in mind. This opens those providers up to heavy risk, because if the data does leak they will be held liable. Some of them cover their bases with interesting Ts&Cs. Others are looking at enterprise only, and still others are building new solutions. One of the interesting things to watch as we go through this series is the impact of the open source community. The internet is a different shape now than it was during the dot-com bubble, and whilst many people are drawing very accurate parallels (go find your own sources), there is a chance this will balance out in new and interesting ways.

We’ll get more into the risks of the software, and how that plays into the AI literacy aspects of the EU AI Act, in future posts. For now, if you have decision-making power about AI product use, take into account the security certifications associated with each product. Not every provider is “safe for use” out of the box, and some of the fees you might face for enterprise-grade security are steep. I personally don’t feel OK using “just any” free AI products, in my personal capacity or in my professional capacity. The more I can lock it down, the better.

If I can get to a point where I know the AI model is running in South Africa I will be even happier.