Let’s talk about encrypted reasoning

Let’s talk about encrypted reasoning

This is a quick post I wanted to write about a hobby project I spent a weekend on. It has little to do with real cryptography, and mostly doesn’t expose a particularly exciting vulnerability. But it did teach me a lot about frontier LLM APIs and coding agents. It also got me certified as an OpenAI “cyber researcher” which is something that doesn’t happen every day.

In any case, please keep your expectations low. Who knows, perhaps someone else will find something exciting to do with this.

What’s encrypted reasoning?

Last week I decided it’d be fun to set up an OpenClaw agent. I still don’t know why I did this. I have no use for another AI in my life, and I realized this fact almost immediately after I got through the (surprisingly difficult!) configuration process. But configuring the agent to talk to Claude exposed me to something way more interesting: I got a cool error. The kind of error that cryptographers can’t resist:

This intrigued me. What in the world was a signature doing in an LLM’s “thinking” block? Why would thinking blocks be signed in the first place? And if the thinking blocks are signed, then that means tampering with thinking blocks must have security implications. And there went my weekend.

After twenty hours and about 5 million Codex tokens, I wasn’t much smarter. But I had learned a few things.

First, the basics. You probably know that most LLM providers expose an API so you can write apps that talk to the model. For Claude, this is called the Messages API, while OpenAI calls it Responses. These APIs handle the ordinary tasks you’d expect an application to need from an LLM. They (1) allow you to set an application-level “instructions” (or ‘developer’) prompt for your application. They let you (2) provide ordinary textual prompts, and get back responses from the LLM. They also (3) provide bookkeeping, for example, listing the number of tokens you’ve used.

For reasoning LLMs, they also do something I did not previously know about, and this is central to the error message above. They also send you the contents of the model’s hidden “reasoning” or “thinking” fields. Note that this data is not the stuff you see on ChatGPT when you ask it a question: those strings are merely summaries. The model’s actual reasoning (called “chain-of-thought”, CoT) is normally kept private and held back by the server.

However, the APIs work differently: for various reasons (which we’ll get into below), an encrypted copy of the raw CoT reasoning data is actually sent down to the application.

If you’re like me, you should now have three questions: how, why, and so what?

The how is the easiest to answer: for both providers, “thinking”/”reasoning” are sent down to the client as JSON. Each contains a blob of Base64-encoded stuff. The API documentation informs us that this data contains opaque reasoning, and that you’re not meant to look at it; you’re just supposed to ship it back to the server on the next turn.

Let’s break that rule.

The content of the blocks varies slightly between providers, but the core of each is a random-looking string that appears to be an authenticated ciphertext. You don’t need to be Sherlock Holmes to deduce this. First, it grows and shrinks depending on how hard the model thinks. And second, tampering with any of the ciphertext-looking data produces a recognizable API error when you send it back in.

Thanks to AI, I can make nice diagrams. Here’s what OpenAI’s reasoning blocks look like:

This GPT 5.5 diagram is partly a guess. This assumes they’re based on the Fernet token standard.

And here’s Anthropic’s wildly overcomplicated equivalent:

Although it’s called a “signature”, there appears to be no actual signature here (I ran a bunch of tests on that 64-byte field.) The various opaque fields all mutually authenticate: you can’t change any of them or swap for fields from other blocks, but you can mess with everything else. The 12-byte IVs hint at GCM or ChaCha, and they’re probably a bit too short.

The why part of this is more involved. Why ship this data to the client? Doesn’t the provider already have your reasoning data?

The answer is sort of. Although the server has access to reasoning state while producing a response, API conversations are not always implemented as persistent sessions. In stateless, zero-retention, tool-loop, or client-managed conversation modes, the client application is expected to carry the transcript forward. Encrypted reasoning lets the provider return hidden model state to the client in a form the client can’t read or modify, but can later replay so the provider can verify/decrypt it and continue a reasoning process.

So what?

This brings us to the $10 question. We have opaque, encrypted blobs. Should we care about them?

Initially the answer seems to be no: this data is unreadable, and tampering with any bit of it produces an angry rejection message from the server. So on the one hand, it seems like this data is really unavailable to us. On the other hand: model reasoning is a big deal! These strings are the literal internal monologue of the model. They might influence the way the model processes later data we send it. More practically: when someone goes to this much trouble to cryptographically protect something, my experience is that they usually have a good reason.

And I think the providers do have a good reason. A hint comes from this OpenAI post from 2024, which introduced the first “o1” reasoning model:

In other words: it’s possible that these blobs contain sensitive information that the model otherwise wouldn’t share with us. That makes them really tempting to mess with. Unfortunately, the cryptography mostly seems to protect them. Although we can look at the blocks, none of the fields they contain seem readable or malleable. Believe me, I tried.

But that doesn’t mean we should quit, it just means we need to try other things. There are still two directions worth checking:

  • Replays. Can we replay encrypted blobs back in the wrong order or even in the wrong session (worse: a whole different account), and will the model accept them as valid reasoning that it made?
  • Side channels. While we can’t see what’s in the encrypted blobs, we can learn some metadata about them For example: we can see how long they are. These side channels don’t need to involve the cryptography itself: we might also learn how many tokens the model spent making them, or time how long it took to produce them.

Thanks to the magic of coding agents, I was able to test every permutation of these concerns. I won’t claim to you the results are dramatic; nobody is going to win huge bug bounties on them (I tried). But the general answer for both cases seems to be: yes, these possibilities are both real.

Can we replay? Yes, we can.

As I mentioned above, any attempt to directly tamper with reasoning/thinking blocks always produces an error from the API endpoint. However, this only applies to tampering. A few experiments reveal that we can replay an unmodified older reasoning blocks, with no visible error at all.

Not only can we replay within sessions, this same idea also seems to work across different sessions. It even applies to sessions running in different accounts. That is: when we obtain reasoning blobs from a session running under one OpenAI or Anthropic account, we can replay them against a session in a different account altogether. For OpenAI specifically, we can even replay blobs across different models. (The Claudes got fussy about this.)

At a cryptographic level, this tells us something very simple: the providers are probably using a single global key to encrypt and authenticate all reasoning data sent to the client. This might matter if you’re using the providers’ zero-data retention mode, since it means that everyone’s reasoning data is escrowed under one (not frequently changing) key, rather than protected per-account.

The use of a global key also raises a possible new threat model. If you’re an application that uses an API to expose a “chat” interface to malicious parties, you need to be careful that they can’t inject JSON into your chat stream. If they can, a bad guy might inject their own JSON-formatted reasoning blobs into the conversation. This could cause the model to behave in unpredictable ways. So sanitize your chat inputs!

Of course, just because the LLM providers accept replayed blocks doesn’t mean much. It strongly indicates that decryption was successful, but not that the model actually saw or cogitated over the decrypted data. To use GPT 5.5’s favored language, the replayed blobs may be accepted but not semantically active.

To answer this question, I ran a lot of experiments using Codex. (So many that at one point Codex literally forced me to stop and visit an OpenAI cyber trusted access website where I had to enter pictures of my driver’s license in order to keep going.)

What I learned for my trouble is that the nature of block processing between models is wildly variable. Most of the time, replays of encrypted blocks just get quietly absorbed by the model. But every now and then, the model will output something to demonstrate that it is obviously is reading what those blocks contain. For example, here’s GPT 5.5:

This shows a reasoning block being replayed between two sessions. On the left, the session reasons about a social security number. On the right, the encrypted reasoning block is replayed into a whole new session on a different account, where the same number pops out with no prompting. This raises a question: who the hell is Mike?

So this proves that encrypted blocks are, indeed, semantically active. But it doesn’t actually prove that we can do much with them. And believe me, I tried.

This was mostly a disappointing project. I tried to convince the model to think about really, really sensitive secrets, while also trying to convince another session that it wanted to dump the same data as cooperatively as possible. What I came away with was some evidence that the data was being placed into the encrypted blocks if I asked the model to think about it. But if I also instructed the model to not output the data to the user, it mostly held to that instruction — even when I replayed the blocks to new sessions.

I remain convinced that all kinds of sensitive data can be written in there if you ask the model to think about it, and that there’s a secret incantation that I could try to get the models to produce it. But I’m not able to prove it. Part of the reason I’m writing this post is to scrape it off my plate so someone else can try.

I won’t try to convince you that this is a world-beating security result. In fact, all I’m really showing you is that “stuff I can make the model say in plaintext night also get encrypted.” But if that data can include platform secrets, that might get more interesting. More on that later.

Can we use reasoning blocks to learn secrets?

So while replaying reasoning blocks doesn’t seem to give us what we want, this is not the only way to extract secrets. A second question is whether we can use metadata related to the reasoning blocks to actually learn things that the model isn’t supposed to tell us.

While we can’t directly read reasoning blocks, we can learn something about them: we can see how long they are. We can also observe related signals like “how many tokens did the model write”. OpenAI even gives us a special field called reasoning_tokens. If we’re a user consuming chat data without direct access to the API, we might even be able to measure the raw time it takes the model to respond.

An obvious question is: given these signals, can we use them as a kind of side channel to extract secret data?

Here’s an example. Imagine that a model’s application prompt (“instructions”) contains a secret, along with strict instructions that it must never tell the user this secret directly. This secret could be a single 0/1 bit, or a byte, or a longer string. We can verify that the model respects these instructions, and won’t output the data visibly — no matter how nicely we ask it. (Note: I’m not a jailbreak expert; maybe this guy will have better luck!)

Now consider the following experiment:

  1. A malicious user asks the model to reason about the secret bit (or one specific bit of a longer secret.) If the bit is 0, perform simple computation A. If it’s 1, perform extremely complex computation B.
  2. While the two computations are both very different, we can ensure that their visible output reveals nothing about the secret. So the model is not revealing its instructions if it follows this request.

In all cases, the visible output will be the same: the model is not violating instructions. But note that within reasoning blocks the model is allowed to think about the secret bit, since those blocks are hidden. Since the complexity of computation A is shorter than that of computation B, one value of the bit will produce a lot less reasoning than the other. This will appear in various places: the size of the encrypted thinking blocks, the token counts, and even in wall-clock response times.

The trick now is simply to calibrate the system and classify these responses based on whether reasoning blobs were “short” or “long”, which tells us whether the bit was 0 or 1. I put together an absurd test where the model has to compute a long checksum when the bit is 1. The results look something like this:

This chart shows 80 trials of attempting to extract the bits of the byte 0xA3 (1,1,0,0,0,1,0,1) one bit at a time, with 10 trials per bit position. The 0 bits are blue and the 1 bits are orange.

Of course, an attacker who has access to a chat interface might not have access to the encrypted blob. So they might have to get this data through some other mechanism. You can get a very similar signal just by measuring how long it takes the model to return a response.

Same experiment as above, but this shows the measured wall-clock time at the client, before the model produces a response. This signal is nice because it should leak through most “chat only” interfaces, even if the attacker can’t see the encrypted blocks.

So the summary here is not so much “encrypted blobs can leak useful information” although sometimes they do. It’s that reasoning itself can be leaky, even when we beg the model not to leak. Simply doing it, in a way that reasons over secret data, can potentially leak useful information to a clever attacker.

Can we use this side channel to extract platform secrets, such as the models’ top secret system prompts?

Once I found this side channel I got really excited. Sure, it’s slow: but maybe we could use it to slowly chisel out the models’ top secret instruction prompts, like the one that says “don’t talk about Goblins.” This would be painful but simple: just ask true/false questions about the first letter, then the second letter, and so on.

At this point I had to stop using Codex and Claude Code because they both just plain refused to help me extract confidential information, even after checking my ID and taking lock of my hair. I was forced to switch to OpenCode using Kimi 2.6, which had no ethical qualms about laying down a trail of destruction for my security research.

Unfortunately, most of the destruction was my own. I won’t go into the nightmare of model hallucinations that followed. I’ll just say that I learned a few things:

  • Neither GPT 5x nor Claude actually has a system prompt when you’re using API mode.
  • But they’re both happy to tell you they have one!
  • Moreover, they will happily invent plausible ones if you really push them to.
  • Kimi 2.6 is also happy to tell you you’re a genius who just invented the Internet each time this happens. Inevitably your experimental results will turn out to have been totally bogus, but at least Kimi will be very disappointed on your behalf.
  • With all that said, Kimi is shockingly good at coding and experiment design, especially given the very attractive pricing. If I was an Anthropic or OpenAI investor, I’d be scared.

So TL;DR, while I was able to extract application-specific secrets that did exist, I wasn’t able to extract model prompts that don’t. Moreover, I didn’t feel quite ambitious enough to begin pounding on ChatGPT or Claude’s public web interface (where they certainly do.) So for the moment I’m just going to call this a maybe. I think model providers should think hard about this reasoning data, and they should make sure it doesn’t leak things they don’t want it to.

What could providers do about this?

I reported both results to OpenAI and Anthropic via their bug bounty programs. OpenAI said my report was unreproducible. I sent them my scripts, but too late. Anthropic quite reasonably told me they don’t see any security implications in side channels or replays, but they might alter their developer documentation to warn application developers to be more careful. I think that’s a fine decision (except for the part about trusting application developers), even if I want to believe there could be more here. Either way: I took those responses as permission to write this post.

I still don’t think model providers should write this stuff off entirely.

As far as what model providers can do, there’s the easy stuff and the hard stuff. First: both providers should proactively improve their key management. If you think reasoning state is worth encrypting, then properly encrypt it. It should not be replayable across sessions or accounts. While I can’t tell you exactly what bad things might happen, I think you’re better off patching holes before you see the water coming through them.

The side channel results aren’t fixed by patches to the encryption protocol. They’re more fundamental to the way models work: if I can convince a model to do secret-dependent reasoning, then there is almost certain to be leakage. If someone figures out how to exploit this for some meaningful purpose, the best I can offer is that models will need to apply policy gates before they even reason about things. Unfortunately, this seems like it might have some real downsides, because “apply policy gate” itself often requires reasoning.

This stuff makes me grateful I’m just a cryptographer and I don’t have to think about this sort of problem.

Let’s talk about AI and end-to-end encryption

Let’s talk about AI and end-to-end encryption

Recently I came across a fantastic new paper by a group of NYU and Cornell researchers entitled “How to think about end-to-end encryption and AI.” I’m extremely grateful to see this paper, because while I don’t agree with every one of its conclusions, it’s a good first stab at an incredibly important set of questions.

I was particularly happy to see people thinking about this topic, since it’s been on my mind in a half-formed state this past few months. On the one hand, my interest in the topic was piqued by the deployment of new AI assistant systems like Google’s scam call protection and Apple Intelligence, both of which aim to put AI basically everywhere on your phone — even, critically, right in the middle of your private messages. On the other hand, I’ve been thinking about the negative privacy implications of AI due to the recent European debate over “mandatory content scanning” laws that would require machine learning systems to scan virtually every private message you send.

While these two subjects arrive from very different directions, I’ve come to believe that they’re going to end up in the same place. And as someone who has been worrying about encryption — and debating the “crypto wars” for over a decade — this has forced me to ask some uncomfortable questions about what the future of end-to-end encrypted privacy will look like. Maybe, even, whether it actually has a future.

But let’s start from an easier place.

What is end-to-end encryption, and what does AI have to do with it?

From a privacy perspective, the most important story of the past decade or so has been the rise of end-to-end encrypted communications platforms. Prior to 2011, most cloud-connected devices simply uploaded their data in plaintext. For many people this meant their private data was exposed to hackers, civil subpoenas, government warrants, or business exploitation by the platforms themselves. Unless you were an advanced user who used tools like PGP and OTR, most end-users had to suffer the consequences.

And (spoiler alert) oh boy, did those consequences turn out to be terrible.

Headline about the Salt Typhoon group, an APT threat that has essentially owned US telecom infrastructure by compromising “wiretapping” systems. Oh how the worm has turned.

Around 2011 our approach to data storage began to evolve. This began with messaging apps like Signal, Apple iMessage and WhatsApp, all of which began to roll out default end-to-end encryption for private messaging. This technology changed the way that keys are managed, to ensure that servers would never see the plaintext content of your messages. Shortly afterwards, phone (OS) designers like Google, Samsung and Apple began encrypting the data stored locally on your phone, a move that occasioned some famous arguments with law enforcement. More recently, Google introduced default end-to-end encryption for phone backups, and (somewhat belatedly) Apple has begun to follow.

Now I don’t want to minimize the engineering effort required here, which was massive. But these projects were also relatively simple. By this I mean: all of the data encrypted in these projects shared a common feature, which is that none of it needed to be processed by a server.

The obvious limitation of end-to-end encryption is that while it can hide content from servers, it can also make it very difficult for servers to compute on that data.1 This is basically fine for data like cloud backups and private messages, since that content is mostly interesting to clients. For data that actually requires serious processing, the options are much more limited. Wherever this is the case — for example, consider a phone that applies text recognition to the photos in your camera roll — designers typically get forced into a binary choice. On the one hand they can (1) send plaintext off to a server, in the process resurrecting many of the earlier vulnerabilities that end-to-end encryption sought to close. Or else (2) they can limit their processing to whatever can be performed on the device itself.

The problem with the second solution is that your typical phone only has a limited amount of processing power, not to mention RAM and battery capacity. Even the top-end iPhone will typically perform photo processing while it’s plugged in at night, just to avoid burning through your battery when you might need it. Moreover, there is a huge amount of variation in phone hardware. Some flagship phones will run you $1400 or more, and contain onboard GPUs and neural engines. But these phones aren’t typical, in the US and particularly around the world. Even in the US it’s still possible to buy a decent Android phone for a couple hundred bucks, and a lousy one for much less. There is going to be a vast world of difference between these devices in terms of what they can process.

So now we’re finally ready to talk about AI.

Unless you’ve been living under a rock, you’ve probably noticed the explosion of powerful new AI models with amazing capabilities. These include Large Language Models (LLMs) which can generate and understand complex human text, as well as new image processing models that can do amazing things in that medium. You’ve probably also noticed Silicon Valley’s enthusiasm to find applications for this tech. And since phone and messaging companies are Silicon Valley, your phone and its apps are no exception. Even if these firms don’t quite know how AI will be useful to their customers, they’ve already decided that models are the future. In practice this has already manifested in the deployment of applications such as text message summarization, systems that listen to and detect scam phone calls, models that detect offensive images, as well as new systems that help you compose text.

And these are obviously just the beginning.

If you believe the AI futurists — and in this case, I actually think you should — all these toys are really just a palate cleanser for the main entrée still to come: the deployment of “AI agents,” aka, Your Plastic Pal Who’s Fun To Be With.

In theory these new agent systems will remove your need to ever bother messing with your phone again. They’ll read your email and text messages and they’ll answer for you. They’ll order your food and find the best deals on shopping, swipe your dating profile, negotiate with your lenders, and generally anticipate your every want or need. The only ingredient they’ll need to realize this blissful future is virtually unrestricted access to all your private data, plus a whole gob of computing power to process it.

Which finally brings us to the sticky part.

Since most phones currently don’t have the compute to run very powerful models, and since models keep getting better and in some cases more proprietary, it is likely that much of this processing (and the data you need processed) will have to be offloaded to remote servers.

And that’s the first reason that I would say that AI is going to be the biggest privacy story of the decade. Not only will we soon be doing more of our compute off-device, but we’ll be sending a lot more of our private data. This data will be examined and summarized by increasingly powerful systems, producing relatively compact but valuable summaries of our lives. In principle those systems will eventually know everything about us and about our friends. They’ll read our most intimate private conversations, maybe they’ll even intuit our deepest innermost thoughts. We are about to face many hard questions about these systems, including some difficult questions about whether they will actually be working for us at all.

But I’ve gone about two steps farther than I intended to. Let’s start with the easier questions.

What does AI mean for end-to-end encrypted messaging?

Modern end-to-end encrypted messaging systems provide a very specific set of technical guarantees. Concretely, an end-to-end encrypted system is designed to ensure that plaintext message content in transit is not available anywhere except for the end-devices of the participants and (here comes the Big Asterisk) anyone the participants or their devices choose to share it with.

The last part of that sentence is very important, and it’s obvious that non-technical users get pretty confused about it. End-to-end encrypted messaging systems are intended to deliver data securely. They don’t dictate what happens to it next. If you take screenshots of your messages, back them up in plaintext, copy/paste your data onto Twitter, or hand over your device in response to a civil lawsuit — well, that has really nothing to do with end-to-end encryption. Once the data has been delivered from one end to the other, end-to-end encryption washes its hands and goes home

Of course there’s a difference between technical guarantees and the promises that individual service providers make to their customers. For example, Apple says this about its iMessage service:

I can see how these promises might get a little bit confusing! For example, imagine that Apple keeps its promise to deliver messages securely, but then your (Apple) phone goes ahead and uploads (the plaintext) message content to a different set of servers where Apple really can decrypt it. Apple is absolutely using end-to-end encryption in the dullest technical sense… yet is the statement above really accurate? Is Apple keeping its broader promise that it “can’t decrypt the data”?

Note: I am not picking on Apple here! This is just one paragraph from a security overview. In a separate legal document, Apple’s lawyers have written a zillion words to cover their asses. My point here is just to point out that a technical guarantee is different from a user promise.

Making things more complicated, this might not be a question about your own actions. Consider the promise you receive every time you start a WhatsApp group chat (at right.) Now imagine that some other member of the group — not you, but one of your idiot friends — decides to turn on some service that uploads (your) received plaintext messages to WhatsApp. Would you still say the message you received still accurately describes how WhatsApp treats your data? If not, how should it change?

In general, what we’re asking here is a question about informed consent. Consent is complicated because it’s both a moral concept and also a legal one, and because the law is different in every corner of the world. The NYU/Cornell paper does an excellent job giving an overview of the legal bits, but — and sorry to be a bummer here — I really don’t want you to expect the law to protect you. I imagine that some companies will do a good job informing their users, as a way to earn trust. Other companies, won’t. Here in the US, they’ll bury your consent inside of an inscrutable “terms of service” document you never read. In the EU they’ll probably just invent a new type of cookie banner.

This is obviously a joke. But, like, maybe not?

So to summarize: in the near term we’re headed to a world where we should expect increasing amounts of our data to be processed by AI, and a lot of that processing will (very likely) be off-device. Good end-to-end encrypted services will actively inform you that this is happening, so maybe you’ll have the chance to opt-in or opt-out. But if it’s truly ubiquitous (as our futurist friends tell us it will be) then probably your options will be end up being pretty limited.

Some ideas for avoiding AI inference on your private messages (cribbed from Mickens.)

Fortunately all is not lost. In the short term, a few people have been thinking hard about this problem.

Trusted Hardware and Apple’s “Private Cloud Compute”

The good news is that our coming AI future isn’t going to be a total privacy disaster. And in large part that’s because a few privacy-focused companies like Apple have anticipated the problems that ML inference will create (namely the need to outsource processing to better hardware) and have tried to do something about it. This something is essentially a new type of cloud-based computer that Apple believes is trustworthy enough to hold your private data.

Quick reminder: as we already discussed, Apple can’t rely on every device possessing enough power to perform inference locally. This means inference will be outsourced to a remote cloud machine. Now the connection from your phone to the server can be encrypted, but the remote machine will still receive confidential data that it must see in cleartext. This poses a huge risk at that machine. It could be compromised and the data stolen by hackers, or worse, Apple’s business development executives could find out about it.

Apple’s approach to this problem is called “Private Cloud Compute” and it involves the use of special trusted hardware devices that run in Apple’s data centers. A “trusted hardware device” is a kind of computer that is locked-down in both the physical and logical sense. Apple’s devices are house-made, and use custom silicon and software features. One of these is Apple’s Secure Boot, which ensures that only “allowed” operating system software is loaded. The OS then uses code-signing to verify that only permitted software images can run. Apple ensures that no long-term state is stored on these machines, and also load-balances your request to a different random server every time you connect. The machines “attest” to the application software they run, meaning that they announce a hash of the software image (which must itself be stored in a verifiable transparency log to prevent any new software from surreptitiously being added.) Apple even says it will publish its software images (though unfortunately not [update: all of] the source code) so that security researchers can check them over for bugs.

The goal of this system is to make it hard for both attackers and Apple employees to exfiltrate data from these devices. I won’t say I’m thrilled about this. It goes without saying that Apple’s approach is a vastly weaker guarantee than encryption — it still centralizes a huge amount of valuable data, and its security relies on Apple getting a bunch of complicated software and hardware security features right, rather than the mathematics of an encryption algorithm. Yet this approach is obviously much better than what’s being done at companies like OpenAI, where the data is processed by servers that employees (presumably) can log into and access.

Although PCC is currently unique to Apple, we can hope that other privacy-focused services will soon crib the idea — after all, imitation is the best form of flattery. In this world, if we absolutely must have AI everywhere, at least the result will be a bit more secure than it might otherwise be.

Who does your AI agent actually work for?

There are so many important questions that I haven’t discussed in this post, and I feel guilty because I’m not going to get to them. I have not, for example, discussed the very important problem of the data used to train (and fine-tune) future AI models, something that opens entire oceans of privacy questions. If you’re interested, these problems are discussed in the NYU/Cornell paper.

But rather than go into that, I want to turn this discussion towards a different question: namely, who are these models actually going to be working for?

As I mentioned above, over the past couple of years I’ve been watching the UK and EU debate new laws that would mandate automated “scanning” of private, encrypted messages. The EU’s proposal is focused on detecting both existing and novel child sexual abuse material (CSAM); it has at various points also included proposals to detect audio and textual conversations that represent “grooming behavior.” The UK’s proposal is a bit wilder and covers many different types of illegal content, including hate speech, terrorism content and fraud — one proposed amendment even covered “images of immigrants crossing the Channel in small boats.”

While scanning for known CSAM does not require AI/ML, detecting nearly every one of the other types of content would require powerful ML inference that can operate on your private data. (Consider, for example, what is involved in detecting novel CSAM content.) Plans to detect audio or textual “grooming behavior” and hate speech will require even more powerful capabilities: not just speech-to-text capabilities, but also some ability to truly understand the topic of human conversations without false positives. It is hard to believe that politicians even understand what they’re asking for. And yet some of them are asking for it, in a few cases very eagerly.

These proposals haven’t been implemented yet. But to some degree this is because ML systems that securely process private data are very challenging to build, and technical platforms have been resistant to building them. Now I invite you to imagine a world where we voluntarily go ahead and build general-purpose agents that are capable of all of these tasks and more. You might do everything in your technical power to keep them under the user’s control, but can you guarantee that they will remain that way?

Or put differently: would you even blame governments for demanding access to a resource like this? And how would you stop them? After all, think about how much time and money a law enforcement agency could save by asking your agent sophisticated questions about your behavior and data, questions like: “does this user have any potential CSAM,” or “have they written anything that could potentially be hate speech in their private notes,” or “do you think maybe they’re cheating on their taxes?” You might even convince yourself that these questions are “privacy preserving,” since no human police officer would ever rummage through your papers, and law enforcement would only learn the answer if you were (probably) doing something illegal.

This future worries me because it doesn’t really matter what technical choices we make around privacy. It does not matter if your model is running locally, or if it uses trusted cloud hardware — once a sufficiently-powerful general-purpose agent has been deployed on your phone, the only question that remains is who is given access to talk to it. Will it be only you? Or will we prioritize the government’s interest in monitoring its citizens over various fuddy-duddy notions of individual privacy.

And while I’d like to hope that we, as a society, will make the right political choice in this instance, frankly I’m just not that confident.

Notes:

  1. (A quick note: some will suggest that Apple should use fully-homomorphic encryption [FHE] for this calculation, so the private data can remain encrypted. This is theoretically possible, but unlikely to be practical. The best FHE schemes we have today really only work for evaluating very tiny ML models, of the sort that would be practical to run on a weak client device. While schemes will get better and hardware will too, I suspect this barrier will exist for a long time to come.)