Private By Design

Herewith the conclusion of my slow and meandering “series” on Private AI. If you want to see the other parts, you can find them in Data Sovereignty and AI Risk Factors. There was supposed to be a post on prompt injection, but someone else did it better, and they even referenced much of the same source material I would have used. So consider that part III and this part IV.

In Principle

The concept of “Private by Design” is very similar to the better-known “secure by default”. It means that right at the beginning of the project, before any decisions are made, we’re thinking about how to maintain privacy. We’re considering where and how software is deployed, choosing providers with transparent privacy policies, and opting for a smaller feature set when we cannot offset the risks already mentioned.

As with many concepts in software development, the high-level understanding is simple. Doing it right, on the other hand, may not be easy. When designing software as a service (SaaS), we have to look at both the data we store as metadata about customers and the data stored in an isolated location for each customer. Once you start accounting for sovereignty laws, you may be limited in that second option.
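To make the split concrete, here is a minimal sketch, assuming a Python service; the names (CustomerRecord, REGIONAL_STORES, data_region) are illustrative, not any particular framework. The shared control plane keeps only metadata, while the customer’s actual content is routed to a store pinned to a region of their choosing.

```python
from dataclasses import dataclass

# Hypothetical split between a shared control plane (metadata only) and
# per-customer, region-pinned stores for the actual content.
@dataclass
class CustomerRecord:
    customer_id: str
    billing_email: str   # metadata: lives in the shared control plane
    data_region: str     # e.g. "eu-central", chosen by the customer

# Region-pinned stores; in a real deployment these would be separate
# databases or buckets deployed in the named jurisdiction.
REGIONAL_STORES = {
    "eu-central": {},
    "ch": {},
}

def store_customer_content(record: CustomerRecord, doc_id: str, content: str) -> None:
    """Route customer content to the store in the region they selected."""
    store = REGIONAL_STORES.get(record.data_region)
    if store is None:
        # Sovereignty rules may simply mean we cannot serve this region yet.
        raise ValueError(f"no store available in region {record.data_region!r}")
    store[(record.customer_id, doc_id)] = content

customer = CustomerRecord("c-42", "ops@example.com", "eu-central")
store_customer_content(customer, "doc-1", "private notes")
```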

Designing for privacy means building in controls around data retention (for GDPR and similar regulations). It means keeping a close eye on what is logged and what is emitted as metrics. It means knowing what data you can and cannot expose to the engineering team.
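One small example of what “keeping a close eye on what is logged” can look like: a logging filter that scrubs prompt-like fields and obvious PII before a record is emitted. This is a sketch using Python’s standard logging module; the field names and redaction pattern are assumptions, not a prescribed scheme.

```python
import logging
import re

class RedactPromptFilter(logging.Filter):
    """Scrub prompt-like fields and obvious PII before a record is emitted."""
    PROMPT_FIELDS = ("prompt", "completion", "messages")  # assumed field names
    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    def filter(self, record: logging.LogRecord) -> bool:
        for field in self.PROMPT_FIELDS:
            if hasattr(record, field):
                setattr(record, field, "[REDACTED]")
        record.msg = self.EMAIL.sub("[EMAIL]", str(record.msg))
        return True

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("saas")
logger.addFilter(RedactPromptFilter())

# The log line keeps the fact that a request happened, not its content.
logger.info("chat request handled for alice@example.com",
            extra={"prompt": "the customer's actual question"})
```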

AI changed the shape of the game. We need access to prompts in order to tune the responses and ensure the LLM is not hallucinating. Yet we may not have access to those prompts, because they are by definition the customer’s data, the very private stuff we’re supposed to be securing. Not only do we have to encrypt the DB at rest, we also need to understand that many of the tools we are wrapping up for resale in a SaaS app were designed for a team or enterprise building in-house tools.

Reduced Features

If we start thinking carefully about what we should and should not do, and if we consider the risks in agentic systems, we see there are problems not yet solved. Now, we can do our risk analysis and suggest it might well be alright to ship anyway. It won’t be exploited; we’re too small. Well, that’s not how it works. This is the internet. People scan everything, everywhere, all at once.

So, if we don’t know how to protect the system sufficiently from prompt injection via web search, we will have to disallow that functionality. For now. Is it still worth building the app? Probably, if only because there are a lot of places which are sufficiently regulated to appreciate privacy as a feature, and that may offset the features which are missing.

The one surefire way to combat the lethal trifecta is to reduce what the system has access to. In our case, we want to give it private data; that’s the whole point. So that means we have to restrict External Communication and Untrusted Content. Obviously we cannot really stop users putting whatever they want into the chat, but we can make that the only means of input. The system only knows what it has been given by a user. Guardrails can be bypassed, and will be bypassed. So it becomes likely that simply by allowing users to interact with the service there will be some amount of untrusted content. That leaves us with cutting out external communication. If the system never returns data to the training provider, and if it is not allowed to make external tool calls, then the blast radius of the untrusted content can be reduced.

At the cost of functionality which people are selling in other apps.
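Here is a minimal sketch of that last step, with illustrative names (Tool, external, build_toolset) rather than any real agent framework: tools that communicate outside the privacy boundary are simply never registered, so injected instructions have nothing external to trigger.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    handler: Callable[[str], str]
    external: bool  # does this tool communicate outside the privacy boundary?

def build_toolset(candidates: list[Tool]) -> dict[str, Tool]:
    """Register only tools that stay inside the boundary; drop the rest."""
    allowed = {t.name: t for t in candidates if not t.external}
    dropped = [t.name for t in candidates if t.external]
    if dropped:
        print(f"refusing to register external tools: {dropped}")
    return allowed

toolset = build_toolset([
    Tool("search_private_docs", lambda q: "results from the customer's own data", external=False),
    Tool("web_search", lambda q: "", external=True),   # never reaches the model
    Tool("send_email", lambda q: "", external=True),   # never reaches the model
])
# Only 'search_private_docs' remains; the blast radius of injected content
# is limited to data the customer already owns.
```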

That’s the business

Choosing to ship a product with fewer features than the big competitors may appear counterintuitive. We had some great conversations in the amazee.ai engineering space about whether this is a good idea or not. The deal here is that we take private by design seriously. So do our customers.

If you are building a system in-house, you can afford to tell your users they can’t have everything they want. They might even be used to that. If you’re shopping around for an app which takes privacy as seriously as you do, look for the ones which are private by design. Look for the ones which allow you to define where they are hosted. Look for the ones where the feature set is reduced in order to maintain the safety of your data. Look for the ones which display an understanding of the laws around data handling.

Don’t just trust. Much as you don’t trust just any encryption algorithm or password safe, you should understand the risks of giving your valuable data to the tools you are using, and if they are in any way reporting back to places you don’t trust, maybe stop using them.

If you want the real reason I only use LLM coding tools and not other tools, this is the one. The data is less private (when working in open source), and the external aspects are better understood.