# Can end-to-end encrypted systems detect child sexual abuse imagery?

A few weeks ago, U.S. Attorney General William Barr joined his counterparts from the U.K. and Australia to publish an open letter addressed to Facebook. The Barr letter represents the latest salvo in an ongoing debate between law enforcement and the tech industry over the deployment of end-to-end (E2E) encryption systems — a debate that will soon be moving into Congress.

The latest round is a response to Facebook’s recent announcement that it plans to extend end-to-end encryption to more of its services. It should hardly come as a surprise that law enforcement agencies are unhappy with these plans. In fact, governments around the world have been displeased by the increasing deployment of end-to-end encryption systems, largely because they fear losing access to the trove of surveillance data that online services and smartphone usage has lately provided them. The FBI even has a website devoted to the topic.

If there’s any surprise in the Barr letter, it’s not the government’s opposition to encryption. Rather, it’s the new reasoning that Barr provides to justify these concern. In past episodes, law enforcement has called for the deployment of “exceptional access” mechanisms that would allow law enforcement access to plaintext data. As that term implies, such systems are designed to treat data access as the exception rather than the rule. They would need to be used only in rare circumstances, such as when a judge issued a warrant.

The Barr letter appears to call for something much more agressive.

Rather than focusing on the need for exceptional access to plaintext, Barr focuses instead on the need for routine, automated scanning systems that can detect child sexual abuse imagery (or CSAI). From the letter:

More than 99% of the content Facebook takes action against – both for child sexual exploitation and terrorism – is identified by your safety systems, rather than by reports from users. …

We therefore call on Facebook and other companies to take the following steps:

Embed the safety of the public in system designs, thereby enabling you to continue to act against illegal content effectively with no reduction to safety, and facilitating the prosecution of offenders and safeguarding of victims;

To many people, Barr’s request might seem reasonable. After all, nobody wants to see this type of media flowing around the world’s communications systems. The ability to surgically detect it seems like it could do some real good. And Barr is correct that true that end-to-end encrypted messaging systems will make that sort of scanning much, much difficult.

What’s worrying in Barr’s letter is the claim we can somehow square this circle: that we can somehow preserve the confidentiality of end-to-end encrypted messaging services, while still allowing for the (highly non-exceptional) automated scanning for CSAI. Unfortunately, this turns out to be a very difficult problem — given the current state of our technology.

In the remainder of this post, I’m going to talk specifically about that problem. Since this might be a long discussion, I’ll briefly list the questions I plan to address:

1. How do automated CSAI scanning techniques work?
2. Is there a way to implement these techniques while preserving the security of end-to-end encryption?
3. Could these image scanning systems be subject to abuse?

I want to stress that this is a (high-level) technical post, and as a result I’m going to go out of my way not to discuss the ethical questions around this technology, i.e., whether or not I think any sort of routine image scanning is a good idea. I’m sure that readers will have their own opinions. Please don’t take my silence as an endorsement.

### How do automated CSAI scanning techniques work?

Facebook, Google, Dropbox and Microsoft, among others, currently perform various forms of automated scanning on images (and sometimes video) that are uploaded to their servers. The goal of these scans is to identify content that contains child sexual abuse imagery (resp. material), which is called CSAI (or CSAM). The actual techniques used vary quite a bit.

The most famous scanning technology is based on PhotoDNA, an algorithm that was developed by Microsoft Research and Dr. Hany Farid. The full details of PhotoDNA aren’t public — this point is significant — but at a high level, PhotoDNA is just a specialized “hashing” algorithm. It derives a short fingerprint that is designed to closely summarize a photograph. Unlike cryptographic hashing, which is sensitive to even the tiniest changes in a file, PhotoDNA fingerprints are designed to be robust even against complex image transformations like re-encoding or resizing.

The key benefit of PhotoDNA is that it gives providers a way to quickly scan incoming photos, without the need to actually deal with a library of known CSAI themselves. When a new customer image arrives, the provider hashes the file using PhotoDNA, and then compares the resulting fingerprint against a list of known CSAI hashes that are curated by the National Center for Missing and Exploited Children (NCMEC). If a match is found, the photo gets reported to a human, and ultimately to NCMEC or law enforcement.

The obvious limitation of the PhotoDNA approach is that it can only detect CSAI images that are already in the NCMEC database. This means it only finds existing CSAI, not anything new. (And yes, even with that restriction it does find a lot of it.)

To address that problem, Google recently pioneered a new approach based on machine learning techniques. Google’s system is based on a deep neural network, which is trained on a corpus of known CSAI examples. Once trained — a process that is presumably  ongoing and continuous — the network can be applied against fresh images in to flag  media that has similar characteristics. As with the PhotoDNA approach, images that score highly can be marked for further human review. Google even provides an API that authorized third parties can use for this purpose.

Both of these approaches are very different. But they have an obvious commonality: they only work if providers to have access to the plaintext of the images for scanning, typically at the platform’s servers. End-to-end encrypted (E2E) messaging throws a monkey wrench into these systems. If the provider can’t read the image file, then none of these systems will work.

For platforms that don’t support E2E — such as (the default mode) of Facebook Messenger, Dropbox or Instagram — this isn’t an issue. But popular encrypted messaging systems like Apple’s iMessage and WhatsApp are already “dark” to those server-side scanning technologies, and presumably future encrypted systems will be too.

### Is there some way to support image scanning while preserving the ability to perform E2E encryption?

Some experts have proposed a solution to this problem: rather than scanning images on the server side, they suggest that providers can instead push the image scanning out to the client devices (i.e., your phone), which already has the cleartext data. The client device can then perform the scan, and report only images that get flagged as CSAI. This approach removes the need for servers to see most of your data, at the cost of enlisting every client device into a distributed surveillance network.

The idea of conducting image recognition locally is not without precedent. Some device manufacturers (notably Apple) have already moved their neural-network-based image classification onto the device itself, specifically to eliminate the need to transmit your photos out to a cloud provider.

Unfortunately, while the concept may be easy to explain, actually realizing it for CSAI-detection immediately runs into a very big technical challenge. This is the result of a particular requirement that seems to be present across all existing CSAI scanners. Namely: the algorithms are all secret.

While I’ve done my best to describe how PhotoDNA and Google’s techniques work, you’ll note that my descriptions were vague. This wasn’t due to some lack of  curiosity on my part. It reflects the fact that all the details of these algorithms — as well as the associated data they use, such as the database of image hashes curated by NCMEC and any trained neural network weights — are kept under strict control by the organizations that manage them. Even the final PhotoDNA algorithm, which is ostensibly the output of an industry-academic collaboration, is not public.

While the organizations don’t explicitly state this, the reason for this secrecy seems to be a simple one: these technologies are probably very fragile.

Presumably, the concern is that criminals who gain free access to these algorithms and  databases might be able to subtly modify their CSAI content so that it looks the same to humans but no longer triggers detection algorithms. Alternatively, some criminals might just use this access to avoid transmitting flagged content altogether.

This need for secrecy makes client-side scanning fundamentally much more difficult. While it might be possible to cram Google’s neural network onto a user’s phone, it’s hugely more difficult to do so on a billion different phones, while also ensuring that nobody obtains a copy of it.

### Can fancy cryptography help here?

The good news is that cryptographers have spent a lot of time thinking about this exact sort of problem: namely, finding ways to allow mutually-distrustful parties to jointly compute over data that each, individually, wants to keep secret. The name for this class of technologies is secure multi-party computation, or MPC for short.

CSAI scanning is exactly the sort of application you might look to MPC to implement. In this case, both client and service provider have a secret. The client has an image it wants to keep confidential, and the server has some private algorithms or neural network weights.* All the parties want in the end is a “True/False” output from the detection algorithm. If the scanner reports “False”, then the image can remain encrypted and hidden from the service provider.

So far this seems simple. The devil is in the (performance) details.

While general MPC (and two-party, or 2PC) techniques have long existed, some recent papers have investigated the costs of implementing secure image classification via neural networks using these techniques. (This approach is much more reminiscent of the Google technique, rather than PhotoDNA, which would have its own complexities. See notes below.***)

The papers in question (e.g., CryptoNets, MiniONN, Gazelle, Chameleon and XONN) employ sophisticated cryptographic tools such as leveled fully-homomorphic encryption, partially-homomorphic encryption, and circuit garbling, in many cases making specific alterations to the neural network structure in order to allow for efficient evaluation. The tools are also interactive: a client and server must exchange data in order to perform the classification task, and the result appears only after this exchange of data.

All of this work is remarkable, and really deserves a much more in-depth discussion. Unfortunately I’m only here to answer a basic question — are these techniques practical yet? To do that I’m going to do a serious disservice to all this excellent research. Indeed, the current state of the art can largely be summarized by following table from one of the most recent papers:

What this table shows is the bandwidth and computational cost of securely computing a single image classification using several of the tools I mentioned above. A key thing to note here is that the images to be classified are fairly simple — each is a 32×32-bit pixel color thumbnail. And while I’m no judge of such things, the neural networks architectures used for the classification also seem relatively simple. (At very least, it’s hard to imagine that a CSAI detection neural network is going to be that much less complex.)

Despite the relatively small size of these problem instances, the overhead of using MPC turns out to be pretty spectacular. Each classification requires several seconds to minutes of actual computation time on a reasonably powerful machine — not a trivial cost, when you consider how many images most providers transmit every second. But the computational costs pale next to the bandwidth cost of each classification. Even the most efficient platform requires the two parties to exchange more than 1.2 gigabytes of data.**

Hopefully you’ve paid for a good data plan.

Now this is just one data point. And the purpose here is certainly not to poo-poo the idea that MPC/2PC could someday be practical for image classification at scale. My point here is simply that, at present, doing this sort of classification efficiently (and privately) remains firmly in the domain of “hard research problems to be solved”, and will probably continue to be there for at least several more years. Nobody should bank on using this technology anytime soon. So client-side classification seems to be off the table for the near future.

But let’s imagine it does become efficient. There’s one more question we need to consider.

Are (private) scanning systems subject to abuse?

As I noted above, I’ve made an effort here to dodge the ethical and policy questions that surround client-side CSAI scanning technologies. I’ve done this not because I back the idea, but because these are complicated questions — and I don’t really feel qualified to answer them.

Still, I can’t help but be concerned about two things. First, that today’s CSAI scanning infrastructure represents perhaps the most powerful and ubiquitous surveillance technology ever to be deployed by a democratic society. And secondly, that the providers who implement this technology are so dependent on secrecy.

This raises the following question. Even if we accept that everyone involved today has only the best intentions, how can we possibly make sure that everyone stays honest?

Unfortunately, simple multi-party computation techniques, no matter how sophisticated, don’t really answer this question. If you don’t trust the provider, and the provider chooses the (hidden) algorithm, then all the cryptography in the world won’t save you.

Abuse of a CSAI scanning system might range from outsider attacks by parties who generate CSAI that simply collides with non-CSAI content; to insider attacks that alter the database to surveil specific content. These concerns reach fever pitch if you imaginea corrupt government or agency forcing providers to alter their algorithms to abuse this capability.  While that last possibility seems like a long shot in this country today, it’s not out of bounds for the whole world. And systems designed for surveillance should contemplate their own misuse.

Which means that, ultimately, these systems will need some mechanism to ensure that service providers are being honest. Right now I don’t quite know how to do this. But someone will have to figure it out, long before these systems can be put into practice.

Notes:

* Most descriptions of MPC assume that the function (algorithm) to be computed is known to both parties, and only the inputs (data) is secret. Of course, this can be generalized to secret algorithms simply by specifying the algorithm as a piece of data, and computing an algorithm that interprets and executes that algorithm. In practice, this type of “general computation” is likely to be pretty costly, however, and so there would be a huge benefit to avoiding it.

** It’s possible that this cost could be somewhat amortized across many images, though it’s not immediately obvious to me that this works for all of the techniques.

*** Photo hashing might or might not feasible to implement using MPC/2PC. The relatively limited information about PhotoDNA describes is as including a number of extremely complex image manipulation phases, followed by a calculation that occurs on subregions of the image. Some sub-portions of this operation might be easy to move into an MPC system, while others could be left “in the clear” for the client to compute on its own. Unfortunately, it’s difficult to know which portions of the algorithm the designers would be willing to reveal, which is why I can’t really speculate on the complexity of such a system.

# How safe is Apple’s Safe Browsing?

This morning brings new and exciting news from the land of Apple. It appears that, at least on iOS 13, Apple is sharing some portion of your web browsing history with the Chinese conglomerate Tencent. This is being done as part of Apple’s “Fraudulent Website Warning”, which uses the Google-developed Safe Browsing technology as the back end. This feature appears to be “on” by default in iOS Safari, meaning that millions of users could potentially be affected.

As is the standard for this sort of news, Apple hasn’t provided much — well, any — detail on whose browsing history this will affect, or what sort of privacy mechanisms are in place to protect its users. The changes probably affect only Chinese-localized users (see Github commits, courtesy Eric Romang), although it’s difficult to know for certain. However, it’s notable that Apple’s warning appears on U.S.-registered iPhones.

Regardless of which users are affected, Apple hasn’t said much about the privacy implications of shifting Safe Browsing to use Tencent’s servers. Since we lack concrete information, the best we can do is talk a bit about the technology and its implications. That’s what I’m going to do below.

### What is “Safe Browsing”, and is it actually safe?

Several years ago Google noticed that web users tended to blunder into malicious sites as they browsed the web. This included phishing pages, as well as sites that attempted to push malware at users. Google also realized that, due to its unique vantage point, it had the most comprehensive list of those sites. Surely this could be deployed to protect users.

The result was Google’s “safe browsing”. In the earliest version, this was simply an API at Google that would allow your browser to ask Google about the safety of any URL you visited. Since Google’s servers received the full URL, as well as your IP address (and possibly a tracking cookie to prevent denial of service), this first API was kind of a privacy nightmare. (This API still exists, and is supported today as the “Lookup API“.)

To address these concerns, Google quickly came up with a safer approach to, um, “safe browsing”. The new approach was called the “Update API”, and it works like this:

1. Google first computes the SHA256 hash of each unsafe URL in its database, and truncates each hash down to a 32-bit prefix to save space.
2. Google sends the database of truncated hashes down to your browser.
3. Each time you visit a URL, your browser hashes it and checks if its 32-bit prefix is contained in your local database.
4. If the prefix is found in the browser’s local copy, your browser now sends the prefix to Google’s servers, which ship back a list of all full 256-bit hashes of the matching  URLs, so your browser can check for an exact match.

At each of these requests, Google’s servers see your IP address, as well as other identifying information such as database state. It’s also possible that Google may drop a cookie into your browser during some of these requests. The Safe Browsing API doesn’t say much about this today, but Ashkan Soltani noted this was happening back in 2012.

It goes without saying that Lookup API is a privacy disaster. The “Update API” is much more private: in principle, Google should only learn the 32-bit hashes of some browsing requests. Moreover, those truncated 32-bit hashes won’t precisely reveal the identity of the URL you’re accessing, since there are likely to be many collisions in such a short identifier. This provides a form of k-anonymity.

The weakness in this approach is that it only provides some privacy. The typical user won’t just visit a single URL, they’ll browse thousands of URLs over time. This means a malicious provider will have many “bites at the apple” (no pun intended) in order to de-anonymize that user. A user who browses many related websites — say, these websites — will gradually leak details about their browsing history to the provider, assuming the provider is malicious and can link the requests. (Updated to add: There has been some academic research on such threats.)

And this is why it’s so important to know who your provider actually is.

### What does this mean for Apple and Tencent?

That’s ultimately the question we should all be asking.

The problem is that Safe Browsing “update API” has never been exactly “safe”. Its purpose was never to provide total privacy to users, but rather to degrade the quality of browsing data that providers collect. Within the threat model of Google, we (as a privacy-focused community) largely concluded that protecting users from malicious sites was worth the risk. That’s because, while Google certainly has the brainpower to extract a signal from the noisy Safe Browsing results, it seemed unlikely that they would bother. (Or at least, we hoped that someone would blow the whistle if they tried.)

But Tencent isn’t Google. While they may be just as trustworthy, we deserve to be informed about this kind of change and to make choices about it. At very least, users should learn about these changes before Apple pushes the feature into production, and thus asks millions of their customers to trust them.

### We shouldn’t have to read the fine print

When Apple wants to advertise a major privacy feature, they’re damned good at it. As an example:  this past summer the company announced the release of the privacy-preserving “Find My” feature at WWDC, to widespread acclaim. They’ve also been happy to claim credit for their work on encryption, including technology such as iCloud Keychain.

But lately there’s been a troubling silence out of Cupertino, mostly related to the company’s interactions with China. Two years ago, the company moved much of iCloud server infrastructure into mainland China, for default use by Chinese users. It seems that Apple had no choice in this, since the move was mandated by Chinese law. But their silence was deafening. Did the move involve transferring key servers for end-to-end encryption? Would non-Chinese users be affected? Reporters had to drag the answers out of the company, and we still don’t know many of them.

In the Safe Browsing change we have another example of Apple making significant modifications to its privacy infrastructure, largely without publicity or announcement. We have learn about this stuff from the fine print. This approach to privacy issues does users around the world a disservice.

It increasingly feels like Apple is two different companies: one that puts the freedom of its users first, and another that treats its users very differently. Maybe Apple feels it can navigate this split personality disorder and still maintain its integrity.

I very much doubt it will work.

# Looking back at the Snowden revelations

Edward Snowden recently released his memoirs. In some parts of the Internet, this has rekindled an ancient debate: namely, was it all worth it? Did Snowden’s leaks make us better off, or did Snowden just embarass us and set back U.S. security by decades? Most of the arguments are so familiar that they’re boring at this point. But no matter how many times I read them, I still feel that there’s something important missing.

It’s no coincidence that this is a cryptography blog, which means that I’m not concerned with the same things as the general public. That is, I’m not terribly interested in debating the value of whistleblower laws (for some of that, see this excellent Twitter thread by Jake Williams). Instead, when it comes to Snowden’s leaks, I think the question we should be asking ourselves is very different. Namely:

What did the Snowden leaks tell us about modern surveillance capabilities? And what did we learn about our ability to defend against them?

And while the leaks themselves have receded into the past a bit — and the world has continued to get more complicated — the technical concerns that Snowden alerted us to are only getting more salient.

### Life before June 2013

It’s difficult to believe that the Snowden revelations began over six years ago. It’s also easy to forget how much things have changed in the intervening years.

Six years ago, vast portions of our communication were done in plaintext. It’s hard to believe how bad things were, but back in 2013, Google was one of the only major tech companies who had deployed HTTPS in its services by default, and even there they had some major exceptions. Web clients were even worse. These graphs (source and source) don’t cover the whole time period, but they give some of the flavor:

Outside of HTTPS, the story was even worse. In 2013 the vast majority of text messages were sent via unencrypted SMS/MMS or poorly-encrypted IM services, which were a privacy nightmare. Future developments like the inclusion of default end-to-end encryption in WhatsApp were years away. Probably the sole (and surprising) exception to was Apple, which had been ahead of the curve in deploying end-to-end encryption. This was largely counterbalanced by the tire fire that was Android back in those days.

But even these raw facts don’t tell the full story.

What’s harder to present in a chart is how different attitudes were towards surveillance back before Snowden. The idea that governments would conduct large-scale interception of our communications traffic was a point of view that relatively few “normal people” spent time thinking about — it was mostly confined to security mailing lists and X-Files scripts. Sure, everyone understood that government surveillance was a thing, in the abstract. But actually talking about this was bound to make you look a little silly, even in paranoid circles.

That these concerns have been granted respectability is one of the most important things Snowden did for us.

### So what did Snowden’s leaks really tell us?

The brilliant thing about the Snowden leaks was that he didn’t tell us much of anything. He showed us. Most of the revelations came in the form of a Powerpoint slide deck, the misery of which somehow made it all more real. And despite all the revelation fatigue, the things he showed us were remarkable. I’m going to hit a few of the highlights from my perspective. Many are cryptography-related, just because that’s what this blog is about. Others tell a more basic story about how vulnerable our networks are.

### “Collect it all”

Prior to Snowden, even surveillance-skeptics would probably concede that, yes, the NSA collects data on specific targets. But even the most paranoid observers were shocked by the sheer scale of what the NSA was actually doing out there.

The Snowden revelations detailed several programs that were so astonishing in the breadth and scale of the data being collected, the only real limits on them were caused by technical limitations in the NSA’s hardware. Most of us are familiar with the famous examples, like nationwide phone metadata collection. But it’s the bizarre, obscure leaks that really drive this home. For example:

“Optic Nerve”. From 2008-2010 the NSA and GCHQ collected millions of still images from every Yahoo! Messenger webchat stream, and used them to build a massive database for facial recognition. The collection of data had no particular rhyme or reason — i.e., it didn’t target specific users who might be a national security threat. It was just… everything. Don’t believe me? Here’s how we know how indiscriminate this was: the program didn’t even necessarily target faces. It got… other things:

MYSTIC/SOMALGET. In addition to collecting massive quantities of Internet metadata, the NSA recorded the full audio every cellular call made in the Bahamas. (Note: this is not simply calls to the Bahamas, which might be sort of a thing. They abused a law enforcement access feature in order to record all the mobile calls made within the country.) Needless to say, the Bahamian government was not party to this secret.

MUSCULAR. In case anyone thought the NSA avoided attacks on American providers, a series of leaks in 2014 documented that the NSA had tapped the internal leased lines used to connect Google and Yahoo datacenters. This gave the agencies access to vast and likely indiscriminate access to torrents of data on U.S. and European users, information was likely above and beyond the data that these companies already shared with the U.S. under existing programs like PRISM. This leak is probably most famous for this slide:

Yahoo!, post-Snowden. And in case you believe that this all ended after Snowden’s leaks, we’ve learned even more disturbing things since. For example, in 2015, Yahoo got caught installing what has been described as a “rootkit” that scanned every single email in its database for specific selectors, at the request of the U.S. government. This was so egregious that the company didn’t even tell it’s CISO, who left the next week. In fact, we know a lot more about Yahoo’s collaboration during this time period, thanks to Snowden.

These examples are not necessarily the worst things we learned from the Snowden leaks. I chose them only to illustrate how completely indiscriminate the agency’s surveillance really was. And not because the NSA was especially evil, but just because it was easy to do. If you had any illusions that this data was being carefully filtered to exclude capturing data belonging to U.S. citizens, or U.S. companies, the Snowden leaks should have set you straight.

### SIGINT Enabling

The Snowden leaks also helped shatter a second illusion: the idea that the NSA was on the side of the angels when it comes to making the Internet more secure. I’ve written about this plenty on this blog (with sometimes exciting results), but maybe this needs to be said again.

One of the most important lessons we learned from the Snowden leaks was that the NSA very much prioritizes its surveillance mission, to the point where it is willing to actively insert vulnerabilities into encryption products and standards used on U.S. networks. And this kind of thing wasn’t just an occasional crime of opportunity — the agency spent $250 million per year on a program called the SIGINT Enabling Project. Its goal was, basically, to bypass our commercial encryption at any cost. This kind of sabotage is, needless to say, something that not even the most paranoid security researchers would have predicted from our own intelligence agencies. Agencies that, ostensibly have a mission to protect U.S. networks. The Snowden reporting not only revealed the existence of these overall programs, but they uncovered a lot of unpleasant specifics, leading to a great deal of follow-up investigation. For example, the Snowden leaks contained specific allegations of a vulnerability in a NIST standard called Dual EC. The possibility of such a vulnerability had previously been noted by U.S. security researchers Dan Shumow and Niels Ferguson a few years earlier. But despite making a reasonable case for re-designing this algorithm, those researchers (and others) were basically brushed off by the “serious” people at NIST. The Snowden documents changed all that. The leaks were a devastating embarassment to the U.S. cryptographic establishment, and led to some actual changes. Not only does it appear that the NSA deliberately backdoored Dual EC, it seems that they did so (and used NIST) in order to deploy the backdoor into U.S. security products. Later investigations would show that Dual EC was present in software by RSA Security (allegedly because of a secret contract with the NSA) and in firewalls made by Juniper Networks. (Just to make everything a bit more horrifying, Juniper’s Dual EC backdoor would later be hijacked and turned against the United States by unknown hackers — illustrating exactly how reckless this all was.) And finally, there are the mysteries. Snowden slides indicate that the NSA has been decrypting SSL/TLS and IPsec connections at vast scale. Even beyond the SIGINT Enabling-type sabotage, this raises huge questions about what the hell is actually going on here. There are theories. These may or may not be correct, but at least now people are thinking about them. At very least, it’s clear that something is very, very wrong. ### Have things improved? This is the$250 million question.

Some of the top-level indicators are surprisingly healthy. HTTPS adoption has taken off like a rocket, driven in part by Google’s willingness to use it as a signal for search rankings — and the rise of free Certificate Authorities like LetsEncrypt. It’s possible that these things would have happened eventually without Snowden, but it’s less likely.

End-to-end encrypted messaging has also taken off, largely due to adoption by WhatsApp and a host of relatively new apps. It’s reached the point where law enforcement agencies have begun to freak out, as the slide below illustrates.

Does Snowden deserve credit for this? Maybe not directly, but it’s almost certain that concerns over the surveillance he revealed did play a role. (It’s worth noting that this adoption is not evenly distributed across the globe.)

It’s also worth pointing out that at least in the open source community the quality of our encryption software has improved enormously, largely due to the fact that major companies made well-funded efforts to harden their systems, in part as a result of serious flaws like Heartbleed — and in part as a response to the company’s own concerns about surveillance.

It might very well be that the NSA has lost a significant portion of its capability since Snowden.

### The future isn’t American

I’ve said this before, as have many others: even if you support the NSA’s mission, and believe that the U.S. is doing everything right, it doesn’t matter. Unfortunately, the future of surveillance has very little to do with what happens in Ft. Meade, Maryland. In fact, the world that Snowden brought to our attention isn’t necessarily a world that Americans have much say in.

As an example: today the U.S. government is in the midst of forcing a standoff with China over the global deployment of Huawei’s 5G wireless networks around the world. This is a complicated issue, and financial interest probably plays a big role. But global security also matters here. This conflict is perhaps the clearest acknowledgement we’re likely to see that our own government knows how much control of communications networks really matters, and our inability to secure communications on these networks could really hurt us. This means that we, here in the West, had better get our stuff together — or else we should be prepared to get a taste of our own medicine.

If nothing else, we owe Snowden for helping us to understand how high the stakes might be.

# How does Apple (privately) find your offline devices?

At Monday’s WWDC conference, Apple announced a cool new feature called “Find My”. Unlike Apple’s “Find my iPhone“, which uses cellular communication and the lost device’s own GPS to identify the location of a missing phone, “Find My” also lets you find devices that don’t have cellular support or internal GPS — things like laptops, or (and Apple has hinted at this only broadly) even “dumb” location tags that you can attach to your non-electronic physical belongings.

The idea of the new system is to turn Apple’s existing network of iPhones into a massive crowdsourced location tracking system. Every active iPhone will continuously monitor for BLE beacon messages that might be coming from a lost device. When it picks up one of these signals, the participating phone tags the data with its own current GPS location; then it sends the whole package up to Apple’s servers. This will be great for people like me, who are constantly losing their stuff: if I leave my backpack on a tour bus in China in my office, sooner or later someone else will stumble on its signal and I’ll instantly know where to find it.

(It’s worth mentioning that Apple didn’t invent this idea. In fact, companies like Tile have been doing this for quite a while. And yes, they should probably be worried.)

If you haven’t already been inspired by the description above, let me phrase the question you ought to be asking: how is this system going to avoid being a massive privacy nightmare?

Let me count the concerns:

• If your device is constantly emitting a BLE signal that uniquely identifies it, the whole world is going to have (yet another) way to track you. Marketers already use WiFi and Bluetooth MAC addresses to do this: Find My could create yet another tracking channel.
• It also exposes the phones who are doing the tracking. These people are now going to be sending their current location to Apple (which they may or may not already be doing). Now they’ll also be potentially sharing this information with strangers who “lose” their devices. That could go badly.
• Scammers might also run active attacks in which they fake the location of your device. While this seems unlikely, people will always surprise you.

The good news is that Apple claims that their system actually does provide strong privacy, and that it accomplishes this using clever cryptography. But as is typical, they’ve declined to give out the details how they’re going to do it. Andy Greenberg talked me through an incomplete technical description that Apple provided to Wired, so that provides many hints. Unfortunately, what Apple provided still leaves huge gaps. It’s into those gaps that I’m going to fill in my best guess for what Apple is actually doing.

A big caveat: much of this could be totally wrong. I’ll update it relentlessly when Apple tells us more.

### Some quick problem-setting

To lay out our scenario, we need to bring several devices into the picture. For inspiration, we’ll draw from the 1950s television series “Lassie”.

A first device, which we’ll call Timmy, is “lost”. Timmy has a BLE radio but no GPS or connection to the Internet. Fortunately, he’s been previously paired with a second device called Ruth, who wants to find him. Our protagonist is Lassie: she’s a random (and unknowing) stranger’s iPhone, and we’ll  assume that she has at least an occasional Internet connection and solid GPS. She is also a very good girl. The networked devices communicate via Apple’s iCloud servers, as shown below:

(Since Timmy and Ruth have to be paired ahead of time, it’s likely they’ll both be devices owned by the same person. Did I mention that you’ll need to buy two Apple devices to make this system work? That’s also just fine for Apple.)

Since this is a security system, the first question you should ask is: who’s the bad guy? The answer in this setting is unfortunate: everyone is potentially a bad guy. That’s what makes this  problem so exciting.

### Keeping Timmy anonymous

The most critical aspect of this system is that we need to keep unauthorized third parties from tracking Timmy, especially when he’s not lost. This precludes some pretty obvious solutions, like having the Timmy device simply shout “Hi my name is Timmy, please call my mom Ruth and let her know I’m lost.” It also precludes just about any unchanging static identifier, even an opaque and random-looking one.

This last requirement is inspired by the development of services that abuse static identifiers broadcast by your devices (e.g., your WiFi MAC address) to track devices as you walk around. Apple has been fighting this — with mixed success — by randomizing things like MAC addresses. If Apple added a static tracking identifier to support the “Find My” system, all of these problems could get much worse.

This requirement means that any messages broadcast by Timmy have to be opaque — and moreover, the contents of these messages must change, relatively frequently, to new values that can’t be linked to the old ones. One obvious way to realize this is to have Timmy and Ruth agree on a long list of random “pseudonyms” for Timmy, and have Timmy pick a different one each time.

This helps a lot. Each time Lassie sees some (unknown) device broadcasting an identifier, she won’t know if it belongs to Timmy: but she can send it up to Apple’s servers along with her own GPS location. In the event that Timmy ever gets lost, Ruth can ask Apple to search for every single one of Timmy‘s possible pseudonyms. Since neither nobody outside of Apple ever learns this list, and even Apple only learns it after someone gets lost, this approach prevents most tracking.

A slightly more efficient way to implement this idea is to use a cryptographic function (like a MAC or hash function) in order to generate the list of pseudonyms from a single short “seed” that both Timmy and Ruth will keep a copy of. This is nice because the data stored by each party will be very small. However, to find Timmy, Ruth must still send all of the pseudonyms — or her “seed” — up to Apple, who will have to search its database for each one.

### Hiding Lassie’s location

The pseudonym approach described above should work well to keep Timmy‘s identity hidden from Lassie, and even from Apple (up until the point that Ruth searches for it.) However, it’s got a big drawback: it doesn’t hide Lassie‘s GPS coordinates.

This is bad for at least a couple of reasons. Each time Lassie detects some device broadcasting a message, she needs to transmit her current position (along with the pseudonym she sees) to Apple’s servers. This means Lassie is constantly telling Apple where she is. And moreover, even if Apple promises not to store Lassie‘s identity, the result of all these messages is a huge centralized database that shows every GPS location where some Apple device has been detected.

Note that this data, in the aggregate, can be pretty revealing. Yes, the identifiers of the devices might be pseudonyms — but that doesn’t make the information useless. For example: a record showing that some Apple device is broadcasting from my home address at certain hours of the day would probably reveal when I’m in my house.

An obvious way to prevent this data from being revealed to Apple is to encrypt it — so that only parties who actually need to know the location of a device can see this information. If Lassie picks up a broadcast from Timmy, then the only person who actually needs to know Lassie‘s GPS location is Ruth. To keep this information private, Lassie should encrypt her coordinates under Ruth’s encryption key.

This, of course, raises a problem: how does Lassie get Ruth‘s key? An obvious solution is for Timmy to shout out Ruth’s public key as part of every broadcast he makes. Of course, this would produce a static identifier that would make Timmy‘s broadcasts linkable again.

To solve that problem, we need Ruth to have many unlinkable public keys, so that Timmy can give out a different one with each broadcast. One way to do this is have Ruth and Timmy generate many different shared keypairs (or generate many from some shared seed). But this is annoying and involves Ruth storing many secret keys. And in fact, the identifiers we mentioned in the previous section can be derived by hashing each public key.

A slightly better approach (that Apple may not employ) makes use of  key randomization. This is a feature provided by cryptosystems like Elgamal: it allows any party to randomize a public key, so that the randomized key is completely unlinkable to the original. The best part of this feature is that Ruth can use a single secret key regardless of which randomized version of her public key was used to encrypt.

All of this  leads to a final protocol idea. Each time Timmy broadcasts, he uses a fresh pseudonym and a randomized copy of Ruth‘s public key. When Lassie receives a broadcast, she encrypts her GPS coordinates under the public key, and sends the encrypted message to Apple. Ruth can send in Timmy‘s pseudonyms to Apple’s servers, and if Apple finds a match, she can obtain and decrypt the GPS coordinates.

### Does this solve all the problems?

The nasty thing about this problem setting is that, with many weird edge cases, there just isn’t a perfect solution. For example, what if Timmy is evil and wants to make Lassie reveal her location to Apple? What if Old Man Smithers tries to kidnap Lassie?

At a certain point, the answer to these question is just to say that we’ve done our best: any remaining problems will have to be outside the threat model. Sometimes even Lassie knows when to quit.

# Attack of the week: searchable encryption and the ever-expanding leakage function

A few days ago I had the pleasure of hosting Kenny Paterson, who braved snow and historic cold (by Baltimore standards) to come talk to us about encrypted databases.

Kenny’s newest result is with first authors Paul Grubbs, Marie-Sarah Lacharité and Brice Minaud (let’s call it GLMP). It isn’t so much about building encrypted databases, as it is about the risks of building them badly. And — for reasons I will get into shortly — there have been a lot of badly-constructed encrypted database schemes going around. What GLMP point out is that this weakness isn’t so much a knock against the authors of those schemes, but rather, an indication that they may just be trying to do the impossible.

Hopefully this is a good enough start to get you drawn in. Which is excellent, because I’m going to need to give you a lot of background.

### What’s an “encrypted” database, and why are they a problem?

Databases (both relational and otherwise) are a pretty important part of the computing experience. Modern systems make vast use of databases and their accompanying query technology in order to power just about every software application we depend on.

Because these databases often contain sensitive information, there has been a strong push to secure that data. A key goal is to encrypt the contents of the database, so that a malicious database operator (or a hacker) can’t get access to it if they compromise a single machine. If we lived in a world where security was all that mattered, the encryption part would be pretty easy: database records are, after all, just blobs of data — and we know how to encrypt those. So we could generate a cryptographic key on our local machine, encrypt the data before we upload it to a vulnerable database server, and just keep that key locally on our client computer.

Voila: we’re safe against a database hack!

The problem with this approach is that encrypting the database records leaves us with a database full of opaque, unreadable encrypted junk. Since we have the decryption key on our client, we can decrypt and read those records after we’ve downloaded them. But this approach completely disables one of the most useful features of modern databases: the ability for the database server itself to search (or query) the database for specific records, so that the client doesn’t have to.

Unfortunately, standard encryption borks search capability pretty badly. If I want to search a database for, say, employees whose salary is between $50,000 and$100,000, my database is helpless: all it sees is row after row of encrypted gibberish. In the worst case, the client will have to download all of the data rows and search them itself — yuck.

This has led to much wailing and gnashing of teeth in the database community. As a result, many cryptographers (and a distressing number of non-cryptographers) have tried to fix the problem with “fancier” crypto. This has not gone very well.

It would take me a hundred years to detail all of various solutions that have been put forward. But let me just hit a few of the high points:

• Some proposals have suggested using deterministic encryption to encrypt database records. Deterministic encryption ensures that a given plaintext will always encrypt to a single ciphertext value, at least for a given key. This enables exact-match queries: a client can simply encrypt the exact value (“John Smith”) that it’s searching for, and ask the database to identify encrypted rows that match it.
• Of course, exact-match queries don’t support more powerful features. Most databases also need to support range queries. One approach to this is something called order revealing encryption (or its weaker sibling, order preserving encryption). These do exactly what they say they do: they allow the database to compare two encrypted records to determine which plaintext is greater than the other.
• Some people have proposed to use trusted hardware to solve these problems in a “simpler” way, but as we like to say in cryptography: if we actually had trusted hardware, nobody would pay our salaries. And, speaking more seriously, even hardware might not stop the leakage-based attacks discussed below.

This summary barely scratches the surface of this problem, and frankly you don’t need to know all the details for the purpose of this blog post.

What you do need to know is that each of the above proposals entails has some degree of “leakage”. Namely, if I’m an attacker who is able to compromise the database, both to see its contents and to see how it responds when you (a legitimate user) makes a query, then I can learn something about the data being queried.

### What some examples of leakage, and what’s a leakage function?

Leakage is a (nearly) unavoidable byproduct of an encrypted database that supports queries. It can happen when the attacker simply looks at the encrypted data, as she might if she was able to dump the contents of your database and post them on the dark web. But a more powerful type of leakage occurs when the attacker is able to compromise your database server and observe the query interaction between legitimate client(s) and your database.

Take deterministic encryption, for instance.

Deterministic encryption has the very useful, but also unpleasant feature that the same plaintext will always encrypt to the same ciphertext. This leads to very obvious types of leakage, in the sense that an attacker can see repeated records in the dataset itself. Extending this to the active setting, if a legitimate client queries on a specific encrypted value, the attacker can see exactly which records match the attacker’s encrypted value. She can see how often each value occurs, which gives and indication of what value it might be (e.g., the last name “Smith” is more common than “Azriel”.) All of these vectors leak valuable information to an attacker.

Other systems leak more. Order-preserving encryption leaks the exact order of a list of underlying records, because it causes the resulting ciphertexts to have the same order. This is great for searching and sorting, but unfortunately it leaks tons of useful information to an attacker. Indeed, researchers have shown that, in real datasets, an ordering can be combined with knowledge about the record distribution in order to (approximately) reconstruct the contents of an encrypted database.

Fancier order-revealing encryption schemes aren’t quite so careless with your confidentiality: they enable the legitimate client to perform range queries, but without leaking the full ordering so trivially. This approach can leak less information: but a persistent attacker will still learn some data from observing a query and its response — at a minimum, she will learn which rows constitute the response to a query, since the database must pack up the matching records and send them over to the client.

If you’re having trouble visualizing what this last type of leakage might look like, here’s a picture that shows what an attacker might see when a user queries an unencrypted database vs. what the attacker might see with a really “good” encrypted database that supports range queries:

So the TL;DR here is that many encrypted database schemes have some sort of “leakage”, and this leakage can potentially reveal information about (a) what a client is querying on, and (b) what data is in the actual database.

### But surely cryptographers don’t build leaky schemes?

Sometimes the perfect is the enemy of the good.

Cryptographers could spend a million years stressing themselves to death over the practical impact of different types of leakage. They could also try to do things perfectly using expensive techniques like fully-homomorphic encryption and oblivious RAM — but the results would be highly inefficient. So a common view in the field is researchers should do the very best we can, and then carefully explain to users what the risks are.

For example, a real database system might provide the following guarantee:

“Records are opaque. If the user queries for all records BETWEEN some hidden values X AND Y then all the database will learn is the row numbers of the records that match this range, and nothing else.”

This is a pretty awesome guarantee, particularly if you can formalize it and prove that a scheme achieves it. And indeed, this is something that researchers have tried to do. The formalized description is typically achieved by defining something called a leakage function. It might not be possible to prove that a scheme is absolutely private, but we can prove that it only leaks as much as the leakage function allows.

Now, I may be overdoing this slightly, but I want to be very clear about this next part:

Proving your encrypted database protocol is secure with respect to a specific leakage function does not mean it is safe to use in practice. What it means is that you are punting that question to the application developer, who is presumed to know how this leakage will affect their dataset and their security needs. Your leakage function and proof simply tell the app developer what information your scheme is (provably) going to protect, and what it won’t.

The obvious problem with this approach is that application developers probably don’t have any idea what’s safe to use either. Helping them to figure this out is one goal of this new GLMP paper and its related work.

### So what leaks from these schemes?

GLMP don’t look at a specific encryption scheme. Rather, they ask a more general question: let’s imagine that we can only see that a legitimate user has made a range query — but not what the actual queried range values are. Further, let’s assume we can also see which records the database returns for that query, but not their actual values.

How much does just this information tell us about the contents of the database?

You can see that this is a very limited amount of leakage. Indeed, it is possibly the least amount of leakage you could imagine for any system that supports range queries, and is also efficient. So in one sense, you could say authors are asking a different and much more important question: are any of these encrypted databases actually secure?

### Can you give me a simple, illuminating example?

Let’s say I’m an attacker who has compromised a database, and observes the following two range queries/results from a legitimate client:

Query 1: SELECT * FROM Salaries BETWEEN ⚙️ and 🕹    Result 1: (rows 1, 3, 5)
Query 2: SELECT * FROM Salaries BETWEEN 😨 and 🎱    Result 2: (rows 1, 43, 3, 5)

Here I’m using the emoji to illustrate that an attacker can’t see the actual values submitted within the range queries — those are protected by the scheme — nor can she see the actual values of the result rows, since the fancy encryption scheme hides all this stuff. All the attacker sees is that a range query came in, and some specific rows were scooped up off disk after running the fancy search protocol.

So what can the attacker learn from the above queries? Surprisingly: quite a bit.

At very minimum, the attacker learns that Query 2 returned all of the same records as Query 1. Thus the range of the latter query clearly somewhat overlaps with the range of the former.  There is an additional record (row 43) that is not within the range of Query 1. That tells us that row 43 must must be either the “next” greater or smaller record than each of rows (1, 3, 5). That’s useful information.

Get enough useful information, it turns out that it starts to add up. In 2016, Kellaris, Kollios, Nissim and O’Neill showed that if you know the distribution of the query range endpoints — for example, if you assumed that they were uniformly random — then you can get more than just the order of records. You can reconstruct the exact value of every record in the database.

This result is statistical in nature. If I know that the queries are uniformly random, then I can model how often a given value (say, Age=34 out of a range 1-120) should be responsive to a given random query results. By counting the actual occurrences of a specific row after many such queries, I can guess which rows correlate to specific record values. The more queries I see, the more certain I can be.The Kellaris et al. paper shows that this takes $O(N^4~log~N)$ queries, where is the number of possible values your data can take on (e.g., the ages of your employees, ranging between 1 and 100 would give N=100.) This is assuming an arbitrary dataset. The results get much better if the database is “dense”, meaning every possible value occurs once.

In practice the Kellaris et al. results mean that database fields with small domains (like ages) could be quickly reconstructed after observing a reasonable number of queries from a legitimate user, albeit one who likes to query everything randomly.

The main bright spot in this research —- at least up until recently — was that many types of data have much larger domains. If you’re dealing with salary data ranging from, say, $1 to$200,000, then N=200,000 and this dominant $N^4$ tends to make Kellaris et al. attacks impractical, simply because they’ll take too long. Similarly, data like employee last names (encoded as a form that can be sorted and range-queries) gives you even vaster domains like $N=26^{12}$, say, and so perhaps we could pleasantly ignore these results and spend our time on more amusing engagements.

### I bet we can’t ignore these results, can we?

Indeed, it seems that we can’t. The reason we can’t sit on our laurels and hope for an attacker to die of old age recovering large-domain data sets is due to something called approximate database reconstruction, or $\epsilon-$ADR.

The setting here is the same: an attacker sits and watches an attacker make (uniformly random) range queries. The critical difference is that this attacker isn’t trying to get every database record back at its exact value: she’s willing to tolerate some degree of error, up to an additive $\epsilon N$. For example, if I’m trying to recover employee salaries, I don’t need them to be exact: getting them within 1% or 5% is probably good enough for my purposes. Similarly, reconstructing nearly all of the letters in your last name probably lets me guess the rest, especially if I know the distribution of common last names.

Which finally brings us to this new GLMP paper, which puts $\epsilon-$ADR on steroids. What it shows is that the same setting, if one is willing to “sacrifice” a few of the highest and lowest values in the database, an attacker can reconstruct nearly the full database in a much smaller (asymptotic) number of queries, specifically: $O(\epsilon^{-2} log~\epsilon^{-1})$ queries, where $\epsilon$ is the error parameter.

The important thing to notice about these results is that the value N has dropped out of the equation. The only term that’s left is the error term $\epsilon$. That means these results are “scale-free”, and (asymptotically, at least), they work just as well for small values of N as large ones, and large databases and small ones. This is really remarkable.

### Big-O notation doesn’t do anything for me: what does this even mean?

Big-O notation is beloved by computer scientists, but potentially meaningless in practice. There could be huge constants in these terms that render these attacks completely impractical. Besides, weird equations involving epsilon characters are impossible for humans to understand.

Sometimes the easiest way to understand a theoretical result is to plug some actual numbers in and see what happens. GLMP were kind enough to do this for us, by first generating several random databases — each containing 1,000 records, for different values of N. They then ran their recovery algorithm against a simulated batch of random range queries to see what the actual error rate looked like as the query count increased.

Here are their results:

Even after just 100 queries, the error in the dataset has been hugely reduced, and after 500 queries the contents of the database — excluding the tails — can be recovered with only about a 1-2% error rate.

Moreover, these experimental results illustrate the fact that recovery works at many scales: that is, they work nearly as well for very different values of N, ranging from 100 to 100,000. This means that the only variable you really need to think about as an attacker is: how close do I need my reconstruction to be? This is probably not very good news for any real data set.

### How do these techniques actually work?

The answer is both very straightforward and deeply complex. The straightforward part is simple; the complex part requires an understanding of Vapnik-Chervonenkis learning theory (VC-theory) which is beyond the scope of this blog post, but is explained in the paper.

At the very highest level the recovery approach is similar to what’s been done in the past: using response probabilities to obtain record values. This paper does it much more efficiently and approximately, using some fancy learning theory results while making a few assumptions.

At the highest level: we are going to assume that the range queries are made on random endpoints ranging from 1 to N. This is a big assumption, and more on it later! Yet with just this knowledge in hand, we learn quite a bit. For example: we can compute the probability that a potential record value (say, the specific salary $34,234) is going to be sent back, provided we know the total value lies in the range 1-N (say, we know all salaries are between$1 and $200,000). If we draw the resulting probability curve in freehand, it might look something like the chart below. This isn’t actually to scale or (probably) even accurate, but it illustrates a key point: by the nature of (random) range queries, records near the center are going to have a higher overall chance of being responsive to any given query, since the “center” values are more frequently covered by random ranges, and records near the extreme high- and low values will be chosen less frequently. The high-level goal of database reconstruction is to match the observed response rate for a given row (say, row 41) to the number of responses we’d expect see for different specific concrete values in the range. Clearly the accuracy of this approach is going to depend on the number of queries you, the attacker, can observe — more is better. And since the response rates are lower at the highest and lowest values, it will take more queries to guess outlying data values. You might also notice that there is one major pitfall here. Since the graph above is symmetric around its midpoint, the expected response rate will be the same for a record at .25*N and a record at .75*N — that is, a$50,000 salary will be responsive to random queries at precisely same rate as a $150,000 salary. So even if you get every database row pegged precisely to its response rate, your results might still be “flipped” horizontally around the midpoint. Usually this isn’t the end of the world, because databases aren’t normally full of unstructured random data — high salaries will be less common than low salaries in most organizations, for example, so you can probably figure out the ordering based on that assumption. But this last “bit” of information is technically not guaranteed to come back, minus some assumptions about the data set. Thus, the recovery algorithm breaks down into two steps: first, observe the response rate for each record as random range queries arrive. For each record that responds to such a query, try to solve for a concrete value that minimizes the difference between the expected response rate on that value, and the observed rate. The probability estimation can be made more efficient (eliminating a quadratic term) by assuming that there is at least one record in the database within the range .2N-.3N (or .7N-.8N, due to symmetry). Using this “anchor” record requires a mild assumption about the database contents. What remains is to show that the resulting attack is efficient. You can do this by simply implementing it — as illustrated by the charts above. Or you can prove that it’s efficient. The GLMP paper uses some very heavy statistical machinery to do the latter. Specifically, they make use of a result from Vapnik-Chervonenkis learning theory (VC-theory), which shows that the bound can be derived from something called the VC-dimension (which is a small number, in this case) and is unrelated to the actual value of N. That proof forms the bulk of the result, but the empirical results are also pretty good. ### Is there anything else in the paper? Yes. It gets worse. There’s so much in this paper that I cannot possibly include it all here without risking carpal tunnel and boredom, and all of it is bad news for the field of encrypted databases. The biggest additional result is one that shows that if all you want is an approximate ordering of the database rows, then you can do this efficiently using something called a PQ tree. Asymptotically, this requires $O(\epsilon^{-1} log~\epsilon^{-1})$ queries, and experimentally the results are again even better than one would expect. What’s even more important about this ordering result is that it works independently of the query distribution. That is: we do not need to have random range queries in order for this to work: it works reasonably well regardless of how the client puts its queries together (up to a point). Even better, the authors show that this ordering, along with some knowledge of the underlying database distribution — for example, let’s say we know that it consists of U.S. citizen last names — can also be used to obtain approximate database reconstruction. Oy vey! And there’s still even more: • The authors show how to obtain even more efficient database recovery in a setting where the query range values are known to the attacker, using PAC learning. This is a more generous setting than previous work, but it could be realistic in some cases. • Finally, they extend this result to prefix and suffix queries, as well as range queries, and show that they can run their attacks on a dataset from the Fraternal Order of Police, obtaining record recovery in a few hundred queries. In short: this is all really bad for the field of encrypted databases. ### So what do we do about this? I don’t know. Ignore these results? Fake our own deaths and move into a submarine? In all seriousness: database encryption has been a controversial subject in our field. I wish I could say that there’s been an actual debate, but it’s more that different researchers have fallen into different camps, and nobody has really had the data to make their position in a compelling way. There have actually been some very personal arguments made about it. The schools of thought are as follows: The first holds that any kind of database encryption is better than storing records in plaintext and we should stop demanding things be perfect, when the alternative is a world of constant data breaches and sadness. To me this is a supportable position, given that the current attack model for plaintext databases is something like “copy the database files, or just run a local SELECT * query”, and the threat model for an encrypted database is “gain persistence on the server and run sophisticated statistical attacks.” Most attackers are pretty lazy, so even a weak system is probably better than nothing. The countervailing school of thought has two points: sometimes the good is much worse than the perfect, particularly if it gives application developers an outsized degree of confidence of the security that their encryption system is going to provide them. If even the best encryption protocol is only throwing a tiny roadblock in the attacker’s way, why risk this at all? Just let the database community come up with some kind of ROT13 encryption that everyone knows to be crap and stop throwing good research time into a problem that has no good solution. I don’t really know who is right in this debate. I’m just glad to see we’re getting closer to having it. # On Ghost Users and Messaging Backdoors The past few years have been an amazing time for the deployment of encryption. In ten years, encrypted web connections have gone from a novelty into a requirement for running a modern website. Smartphone manufacturers deployed default storage encryption to billions of phones. End-to-end encrypted messaging and phone calls are now deployed to billions of users. While this progress is exciting to cryptographers and privacy advocates, not everyone sees it this way. A few countries, like the U.K. and Australia, have passed laws in an attempt to gain access to this data, and at least one U.S. proposal has made it to Congress. The Department of Justice recently added its own branding to the mix, asking tech companies to deploy “responsible encryption“. What, exactly, is “responsible encryption”? Well, that’s a bit of a problem. Nobody on the government’s side of the debate has really been willing to get very specific about that. In fact, a recent speech by U.S. Deputy Attorney General Rod Rosenstein implored cryptographers to go figure it out. With this as background, a recent article by GCHQ’s Ian Levy and Crispin Robinson reads like a breath of fresh air. Unlike their American colleagues, the British folks at GCHQ — essentially, the U.K.’s equivalent of NSA — seem eager to engage with the technical community and to put forward serious ideas. Indeed, Levy and Robinson make a concrete proposal in the article above: they offer a new solution designed to surveil both encrypted messaging and phone calls. In this post I’m going to talk about that proposal as fairly as I can — given that I only have a high-level understanding of the idea. Then I’ll discuss what I think could go wrong. ### A brief, illustrated primer on E2E The GCHQ proposal deals with law-enforcement interception on messaging systems and phone calls. To give some intuition about the proposal, I first need to give a very brief (and ultra-simplified) explanation of how those systems actually work. The basic idea in any E2E communication systems is that each participant encrypts messages (or audio/video data) directly from one device to the other. This layer of encryption reduces the need to trust your provider’s infrastructure — ranging from telephone lines to servers to undersea cables — which gives added assurance against malicious service providers and hackers. If you’ll forgive a few silly illustrations, the intuitive result is a picture that looks something like this: If we consider the group chat/call setting, the picture changes slightly, but only slightly. Each participant still encrypts data to the other participants directly, bypassing the provider. The actual details (specific algorithms, key choices) vary between different systems. But the concept remains the same. The problem with the simplified pictures above is that there’s actually a lot more going on in an E2E system than just encryption. In practice, one of the most challenging problems in encrypted messaging stems is getting the key you need to actually perform the encryption. This problem, which is generally known as key distribution, is an age-old concern in the field of computer security. There are many ways for it to go wrong. In the olden days, we used to ask users to manage and exchange their own keys, and then select which users they wanted to encrypt to. This was terrible and everyone hated it. Modern E2E systems have become popular largely because they hide all of this detail from their users. This comes at the cost of some extra provider-operated infrastructure. In practice, systems like Apple iMessage, WhatsApp and Facebook Messenger actually look more like this: The Apple at the top of the picture above stands in for Apple’s “identity service”, which is a cluster of servers running in Apple’s various data centers. These servers perform many tasks, but most notably: they act as a directory for looking up the encryption key of the person you’re talking to. If that service misfires and gives you the wrong key, the best ciphers in the world won’t help you. You’ll just be encrypting to the wrong person. These identity services do more than look up keys. In at least some group messaging systems like WhatsApp and iMessage, they also control the membership of group conversations. In poorly-designed systems, the server can add and remove users from a group conversation at will, even if none of the participants have requested this. It’s as though you’re having a conversation in a very private room — but the door is unlocked and the building manager controls who can come enter and join you. (A technical note: while these two aspects of the identity system serve different purposes, in practice they’re often closely related. For example, in many systems there is little distinction between “group” and “two-participant” messaging. For example, in systems that support multiple devices connected to a single account, like Apple’s iMessage, every single device attached to your user account is treated as a separate party to the conversation. Provided either party has more than one device on their account [say, an iPhone and an iPad] , you can think of every iMessage conversation as being a group conversation.) Most E2E systems have basic countermeasures against bad behavior by the identity service. For example, client applications will typically alert you when a new user joins your group chat, or when someone adds a new device to your iMessage account. Similarly, both WhatsApp and Signal expose “safety numbers” that allow participants to verify that they received the right cryptographic keys, which offers a check against dishonest providers. But these countermeasures are not perfect, and not every service offers them. Which brings me to the GCHQ proposal. ### What GCHQ wants The Lawfare article by Levy and Robinson does not present GCHQ’s proposal in great detail. Fortunately, both authors have spent most of the touring the U.S., giving several public talks about their ideas. I had the privilege of speaking to both of them earlier this summer when they visited Johns Hopkins, so I think I have a rough handle on what they’re thinking. In its outlines, the idea they propose is extremely simple. The goal is to take advantage of existing the weaknesses in the identity management systems of group chat and calling systems. This would allow law enforcement — with the participation of the service provider — to add a “ghost user” (or in some cases, a “ghost device”) to an existing group chat or calling session. In systems where group membership can be modified by the provider infrastructure, this could mostly be done via changes to the server-side components of the provider’s system. I say that it could mostly be done server-side, because there’s a wrinkle. Even if you modify the provider infrastructure to add unauthorized users to a conversation, most existing E2E systems do notify users when a new participant (or device) joins a conversation. Generally speaking, having a stranger wander into your conversation is a great way to notify criminals that the game’s afoot or what have you, so you’ll absolutely want to block this warning. While the GCHQ proposal doesn’t go into great detail, it seems to follow that any workable proposal will require providers to suppress those warning messages at the target’s device. This means the proposal will also require changes to the client application as well as the server-side infrastructure. (Certain apps like Signal are already somewhat hardened against these changes, because group chat setup is handled in an end-to-end encrypted/authenticated fashion by clients. This prevents the server from inserting new users without the collaboration of at least one group participant. At the moment, however, both WhatsApp and iMessage seem vulnerable to GCHQ’s proposed approach.) Due to this need for extensive server and client modifications, the GCHQ proposal actually represents a very significant change to the design of messaging systems. It seems likely that the client-side code changes would need to be deployed to all users, since you can’t do targeted software updates just against criminals. (Or rather, if you could rely on such targeted software updates, you would just use that capability instead of the thing that GCHQ is proposing.) Which brings us to the last piece: how do get providers to go along with all of this? While optimism and cooperation are nice in principle, it seems unlikely that communication providers are going to to voluntarily insert a powerful eavesdropping capability into their encrypted services, if only because it represents a huge and risky modification. Presumably this means that the UK government will have to compel cooperation. One potential avenue for this is to use Technical Capability Notices from the UK’s Investigatory Powers Act. Those notices mandate that a provider offer real-time decryption for sets of users ranging from 1-10,000 users, and moreover, that providers must design their systems to ensure this such a capability remains available. And herein lies the problem. ### Providers are already closing this loophole The real problem with the GCHQ proposal is that it targets a weakness in messaging/calling systems that’s already well-known to providers, and moreover, a weakness that providers have been working to close — perhaps because they’re worried that someone just like GCHQ (or probably, much worse) will try to exploit it. By making this proposal, the folks at GCHQ have virtually guaranteed that those providers will move much, much faster on this. And they have quite a few options at their disposal. Over the past several years researchers have proposed several designs that offer transparency to users regarding which keys they’re obtaining from a provider’s identity service. These systems operate by having the identity service commit to the keys that are associated with individual users, such that it’s very hard for the provider to change a user’s keys (or to add a device) without everyone in the world noticing. As mentioned above, advanced messengers like Signal have “submerged” the group chat management into the encrypted communications flow, so that the server cannot add new users without the digitally authenticated approval of one of the existing participants. This design, if ported to in more popular services like WhatsApp, would seem to kill the GCHQ proposal dead. Of course, these solutions highlight the tricky nature of GCHQ’s proposal. Note that in order to take advantage of existing vulnerabilities, GCHQ is going to have to require that providers change their system. And of course, once you’ve opened the door to forcing providers to change their system, why stop with small changes? What stops the UK government from, say, taking things a step farther, and using the force of law to compel providers not to harden their systems against this type of attack? Which brings us to the real problem with the GCHQ proposal. As far as I can see, there are two likely outcomes. In the first, providers rapidly harden their system — which is good! — and in the process kill off the vulnerabilities that make GCHQ’s proposal viable (which is bad, at least for GCHQ). The more interest that governments express towards the proposal, the more likely this first outcome is. In the second outcome, the UK government, perhaps along with other governments, solve this problem by forcing the providers to keep their systems vulnerable. This second outcome is what I worry about. More concretely, it’s true that today’s systems include existing flaws that are easy to exploit. But that does not mean we should entomb those flaws in concrete. And once law enforcement begins to rely on them, we will effectively have done so. Over time what seems like a “modest proposal” using current flaws will rapidly become an ossifying influence that holds ancient flaws in place. In the worst-case outcome, we’ll be appointing agencies like GCHQ as the ultimate architect of Apple and Facebook’s communication systems. That is not a good outcome. In fact, it’s one that will likely slow down progress for years to come. # Let’s talk about PAKE The first rule of PAKE is: nobody ever wants to talk about PAKE. The second rule of PAKE is that this is a shame, because PAKE — which stands for Password Authenticated Key Exchange — is actually one of the most useful technologies that (almost) never gets used. It should be deployed everywhere, and yet it isn’t. To understand why this is such a damn shame, let’s start by describing a very real problem. Imagine I’m operating a server that has to store user passwords. The traditional way to do this is to hash each user password and store the result in a password database. There are many schools of thought on how to handle the hashing process; the most common recommendation these days is to use a memory-hard password hashing function like scrypt or argon2 (with a unique per-password salt), and then store only the hashed result. There are various arguments about which hash function to use, and whether it could help to also use some secret value (called “pepper“), but we’ll ignore these for the moment. Regardless of the approach you take, all of these solutions have a single achilles heel: When the user comes back to log into your website, they will still need to send over their (cleartext) password, since this is required in order for the server to do the check. This requirement can lead to disaster if your server is ever persistently compromised, or if your developers make a simple mistake. For example, earlier this year Twitter asked all of its (330 million!) users to change their passwords — because it turned out that company had been logging cleartext (unhashed) passwords. Now, the login problem doesn’t negate the advantage of password hashing in any way. But it does demand a better solution: one where the user’s password never has to go to the server in cleartext. The cryptographic tool that can give this to us is PAKE, and in particular a new protocol called OPAQUE, which I’ll get to at the end of this post. ### What’s a PAKE? A PAKE protocol, first introduced by Bellovin and Merritt, is a special form of cryptographic key exchange protocol. Key exchange (or “key agreement”) protocols are designed to help two parties (call them a client and server) agree on a shared key, using public-key cryptography. The earliest key exchange protocols — like classical Diffie-Hellman — were unauthenticated, which made them vulnerable to man-in-the-middle attacks. The distinguishing feature of PAKE protocols is the client will authenticate herself to the server using a password. For obvious reasons, the password, or a hash of it, is assumed to be already known to the server, which is what allows for checking. If this was all we required, PAKE protocols would be easy to build. What makes a PAKE truly useful is that it should also provide protection for the client’s password. A stronger version of this guarantee can be stated as follows: after a login attempt (valid, or invalid) both the client and server should learn only whether the client’s password matched the server’s expected value, and no additional information. This is a powerful guarantee. In fact, it’s not dissimilar to what we ask for from a zero knowledge proof. Of course, the obvious problem with PAKE is that many people don’t want to run a “key exchange” protocol in the first place! They just want to verify that a user knows a password. The great thing about PAKE is that the simpler “login only” use-case is easy to achieve. If I have a standard PAKE protocol that allows a client and server to agree on a shared key K if (and only if) the client knows the right password, then all we need add is a simple check that both parties have arrived at the same key. (This can be done, for example, by having the parties compute some cryptographic function with it and check the results.) So PAKE is useful even if all you’ve got in mind is password checking. ### SRP: The PAKE that Time Forgot The PAKE concept seems like it provides an obvious security benefit when compared to the naive approach we use to log into servers today. And the techniques are old, in the sense that PAKEs have been known since way back in 1992! Despite this, they’ve seen from almost no adoption. What’s going on? There are a few obvious reasons for this. The most obvious has to do with the limitations of the web: it’s much easier to put a password form onto a web page than it is to do fancy crypto in the browser. But this explanation isn’t sufficient. Even native applications rarely implement PAKE for their logins. Another potential explanation has to do with patents, though most of these are expired now. To me there are two likely reasons for the ongoing absence of PAKE: (1) there’s a lack of good PAKE implementations in useful languages, which makes it a hassle to use, and (2) cryptographers are bad at communicating the value of their work, so most people don’t know PAKE is even an option. Even though I said PAKE isn’t deployed, there are some exceptions to the rule. One of the remarkable ones is a 1998 protocol designed by Tom Wu [correction: not Tim Wu] and called “SRP”. Short for “Secure Remote Password“, this is a simple three-round PAKE with a few elegant features that were not found in the earliest works. Moreover, SRP has the distinction of being (as far as I know) the most widely-deployed PAKE protocol in the world. I cite two pieces of evidence for this claim: 1. SRP has been standardized as a TLS ciphersuite, and is actually implemented in libraries like OpenSSL, even though nobody seems to use it much. 2. Apple uses SRP extensively in their iCloud Key Vault. This second fact by itself could make SRP one of the most widely used cryptographic protocols in the world, so vast is the number of devices that Apple ships. So this is nothing to sneer at. Industry adoption of SRP is nice, but also kind of a bummer: mainly because while any PAKE adoption is cool, SRP itself isn’t the best PAKE we can deploy. I was planning to go into the weeds about why I feel so strongly about SRP, but it got longwinded and it distracted from the really nice protocol I actually want to talk about further below. If you’re still interested, I moved the discussion onto this page. In lieu of those details, let me give a quick and dirty TL;DR on SRP: 1. SRP does some stuff “right”. For one thing, unlike early PAKEs it does not require you to store a raw password on the server (or, equivalently, a hash that could be used by a malicious client in place of the password). Instead, the server stores a “verifier” which is a one-way function of the password hash. This means a leak of the password database does not (immediately) allow the attacker to impersonate the user — unless they conduct further expensive dictionary attacks. (The technical name for this is “asymmetric” PAKE.) 2. Even better, the current version of SRP (v4 v6a) isn’t obviously broken! 3. However (and with no offense to the designers) the SRP protocol design is completely bonkers, and earlier versions have been broken several times — which is why we’re now at revision 6a. Plus the “security proof” in the original research paper doesn’t really prove anything meaningful. 4. SRP currently relies on integer (finite field) arithmetic, and for various reasons (see point 3 above) the construction is not obviously transferable to the elliptic curve setting. This requires more bandwidth and computation, and thus SRP can’t take advantage of the many efficiency improvements we’ve developed in settings like Curve25519. 5. SRP is vulnerable to pre-computation attacks, due to the fact that it hands over the user’s “salt” to any attacker who can start an SRP session. This means I can ask a server for your salt, and build a dictionary of potential password hashes even before the server is compromised. 6. Despite all these drawbacks, SRP is simple — and actually ships with working code. Plus there’s working code in OpenSSL that even integrates with TLS, which makes it relatively easy to adopt. Out of all these points, the final one is almost certainly responsible for the (relatively) high degree of commercial success that SRP has seen when compared to other PAKE protocols. It’s not ideal, but it’s real. This is something for cryptographers to keep in mind. ### OPAQUE: The PAKE of a new generation When I started thinking about PAKEs a few months ago, I couldn’t help but notice that most of the existing work was kind of crummy. It either had weird problems like SRP, or it required the user to store the password (or an effective password) on the server, or it revealed the salt to an attacker — allowing pre-computation attacks. Then earlier this year, Jarecki, Krawczyk and Xu proposed a new protocol called OPAQUE. Opaque has a number of extremely nice advantages: 1. It can be implemented in any setting where Diffie-Hellman and discrete log (type) problems are hard. This means that, unlike SRP, it can be easily instantiated using efficient elliptic curves. 2. Even better: OPAQUE does not reveal the salt to the attacker. It solves this problem by using an efficient “oblivious PRF” to combine the salt with the password, in a way that ensures the client does not learn the salt and the server does not learn the password. 3. OPAQUE works with any password hashing function. Even better, since all the hashing work is done on the client, OPAQUE can actually take load off the server, freeing an online service up to use much strong security settings — for example, configuring scrypt with large RAM parameters. 4. In terms of number of messages and exponentiations, OPAQUE is not much different from SRP. But since it can be implemented in more efficient settings, it’s likely to be a lot more efficient. 5. Unlike SRP, OPAQUE has a reasonable security proof (in a very strong model). There’s even an Internet Draft proposal for OPAQUE, which you can read here. Unfortunately, at this point I’m not aware of any production quality implementations of the code (if you know of one, please link to it in the comments and I’ll update). (Update: There are several potential implementations listed in the comments — I haven’t looked closely enough to endorse any, but this is great!) But that should soon change. The full OPAQUE protocol is given a little bit further below. In the rest of this section I’m going to go into the weeds on how OPAQUE works. Problem 1: Keeping the salt secret. As I mentioned above, the main problem with earlier PAKEs is the need to transmit the salt from a server to a (so far unauthenticated) client. This enables an attacker to run pre-computation attacks, where they can build an offline dictionary based on this salt. The challenge here is that the salt is typically fed into a hash function (like scrypt) along with the password. Intuitively someone has to compute that function. If it’s the server, then the server needs to see the password — which defeats the whole purpose. If it’s the client, then the client needs the salt. In theory one could get around this problem by computing the password hashing function using secure two-party computation (2PC). In practice, solutions like this are almost certainly not going to be efficient — most notably because password hashing functions are designed to be complex and time consuming, which will basically explode the complexity of any 2PC system. OPAQUE gets around this with the following clever trick. They leave the password hash on the client’s side, but they don’t feed it the stored salt. Instead, they use a special two-party protocol called an oblivious PRF to calculate a second salt (call it salt2) so that the client can use salt2 in the hash function — but does not learn the original salt. The basic idea of such a function is that the server and client can jointly compute a function PRF(salt, password), where the server knows “salt” and the client knows “password”. Only the client learns the output of this function. Neither party learns anything about the other party’s input. The gory details: The actual implementation of the oblivious PRF relies on the idea that the client has the password $P$ and the server has the salt, which is expressed as a scalar value $s$. The output of the PRF function should be of the form $H(P)^s$, where $H:\{0,1\}^* \rightarrow {\mathcal G}$ is a special hash function that hashes passwords into elements of a cyclic (prime-order) group. To compute this PRF requires a protocol between the client and server. In this protocol, the client first computes $H(P)$ and then “blinds” this password by selecting a random scalar value r, and blinding the result to obtain $C = H(P)^r$. At this point, the client can send the blinded value C over to the server, secure in the understanding that (in a prime-order group), the blinding by r hides all partial information about the underlying password. The server, which has a salt value s, now further exponentiates this calue to obtain $R = C^s$ and sends the result R back to the client. If we write this out in detail, the result can be expressed as$R = H(P)^{rs}$. The client now computes the inverse of its own blinding value r and exponentiates one more time as follows: $R' = R^{r^{-1}} = H(P)^s$. This element $R'$, which consists of the hash of the password exponentiated by the salt, is the output of the desired PRF function. A nice feature of this protocol is that, if the client enters the wrong password into the protocol, she should obtain a value that is very different from the actual value she wants. This guarantee comes from the fact that the hash function is likely to produce wildly different outputs for distinct passwords. Problem 2: Proving that the client got the right key K. Of course, at this point, the client has derived a key K, but the server has no idea what it is. Nor does the server know whether it’s the right key. The solution OPAQUE uses based an old idea due to Gentry, Mackenzie and Ramzan. When the user first registers with the server, she generates a strong public and private key for a secure agreement protocol (like HMQV), and encrypts the resulting private key under K, along with the server’s public key. The resulting authenticated ciphertext (and the public key) is stored in the password database. C = Encrypt(K, client secret key | server’s public key) When the client wishes to authenticate using the OPAQUE protocol, the server sends it the stored ciphertext C. If the client entered the right password into the first phase, she can derive K, and now decrypt this ciphertext. Otherwise it’s useless. Using the embedded secret key, she can now run a standard authenticated key agreement protocol to complete the handshake. (The server verifies the clients’ inputs against its copy of the client’s public key, and the client does similarly.) Putting it all together. All of these different steps can be merged together into a single protocol that has the same number of rounds as SRP. Leaving aside the key verification steps, it looks like the protocol above. Basically, just two messages: one from the client and one returned to the server. The final aspect of the OPAQUE work is that it includes a strong security proof that shows the resulting protocol can be proven secure under the 1-more discrete logarithm assumption in the random oracle model, which is a (well, relatively) standard assumption that appears to hold in the settings we work with. ### In conclusion So in summary, we have this neat technology that could make the process of using passwords much easier, and could allow us to do it in a much more efficient way — with larger hashing parameters, and more work done by the client? Why isn’t this everywhere? Maybe in the next few years it will be. # Why I’m done with Chrome This blog is mainly reserved for cryptography, and I try to avoid filling it with random “someone is wrong on the Internet” posts. After all, that’s what Twitter is for! But from time to time something bothers me enough that I have to make an exception. Today I wanted to write specifically about Google Chrome, how much I’ve loved it in the past, and why — due to Chrome’s new user-unfriendly forced login policy — I won’t be using it going forward. ### A brief history of Chrome When Google launched Chrome ten years ago, it seemed like one of those rare cases where everyone wins. In 2008, the browser market was dominated by Microsoft, a company with an ugly history of using browser dominance to crush their competitors. Worse, Microsoft was making noises about getting into the search business. This posed an existential threat to Google’s internet properties. In this setting, Chrome was a beautiful solution. Even if the browser never produced a scrap of revenue for Google, it served its purpose just by keeping the Internet open to Google’s other products. As a benefit, the Internet community would receive a terrific open source browser with the best development team money could buy. This might be kind of sad for Mozilla (who have paid a high price due to Chrome) but overall it would be a good thing for Internet standards. For many years this is exactly how things played out. Sure, Google offered an optional “sign in” feature for Chrome, which presumably vacuumed up your browsing data and shipped it off to Google, but that was an option. An option you could easily ignore. If you didn’t take advantage of this option, Google’s privacy policy was clear: your data would stay on your computer where it belonged. ### What changed? A few weeks ago Google shipped an update to Chrome that fundamentally changes the sign-in experience. From now on, every time you log into a Google property (for example, Gmail), Chrome will automatically sign the browser into your Google account for you. It’ll do this without asking, or even explicitly notifying you. (However, and this is important: Google developers claim this will not actually start synchronizing your data to Google — yet. See further below.) Your sole warning — in the event that you’re looking for it — is that your Google profile picture will appear in the upper-right hand corner of the browser window. I noticed mine the other day: The change hasn’t gone entirely unnoticed: it received some vigorous discussion on sites like Hacker News. But the mainstream tech press seems to have ignored it completely. This is unfortunate — and I hope it changes — because this update has huge implications for Google and the future of Chrome. In the rest of this post, I’m going to talk about why this matters. From my perspective, this comes down to basically four points: 1. Nobody on the Chrome development team can provide a clear rationale for why this change was necessary, and the explanations they’ve given don’t make any sense. 2. This change has enormous implications for user privacy and trust, and Google seems unable to grapple with this. 3. The change makes a hash out of Google’s own privacy policies for Chrome. 4. Google needs to stop treating customer trust like it’s a renewable resource, because they’re screwing up badly. I warn you that this will get a bit ranty. Please read on anyway. ### Google’s stated rationale makes no sense The new feature that triggers this auto-login behavior is called “Identity consistency between browser and cookie jar” (HN). After conversations with two separate Chrome developers on Twitter (who will remain nameless — mostly because I don’t want them to hate me), I was given the following rationale for the change: To paraphrase this explanation: if you’re in a situation where you’ve already signed into Chrome and your friend shares your computer, then you can wind up accidentally having your friend’s Google cookies get uploaded into your account. This seems bad, and sure, we want to avoid that. But note something critical about this scenario. In order for this problem to apply to you, you already have to be signed into Chrome. There is absolutely nothing in this problem description that seems to affect users who chose not to sign into the browser in the first place. So if signed-in users are your problem, why would you make a change that forces unsignedin users to become signed-in? I could waste a lot more ink wondering about the mismatch between the stated “problem” and the “fix”, but I won’t bother: because nobody on the public-facing side of the Chrome team has been able to offer an explanation that squares this circle. And this matters, because “sync” or not… ### The change has serious implications for privacy and trust The Chrome team has offered a single defense of the change. They point out that just because your browser is “signed in” does not mean it’s uploading your data to Google’s servers. Specifically: While Chrome will now log into your Google account without your consent (following a Gmail login), Chrome will not activate the “sync” feature that sends your data to Google. That requires an additional consent step. So in theory your data should remain local. This is my paraphrase. But I think it’s fair to characterize the general stance of the Chrome developers I spoke with as: without this “sync” feature, there’s nothing wrong with the change they’ve made, and everything is just fine. This is nuts, for several reasons. User consent matters. For ten years I’ve been asked a single question by the Chrome browser: “Do you want to log in with your Google account?” And for ten years I’ve said no thanks. Chrome still asks me that question — it’s just that now it doesn’t honor my decision. The Chrome developers want me to believe that this is fine, since (phew!) I’m still protected by one additional consent guardrail. The problem here is obvious: If you didn’t respect my lack of consent on the biggest user-facing privacy option in Chrome (and didn’t even notify me that you had stopped respecting it!) why should I trust any other consent option you give me? What stops you from changing your mind on that option in a few months, when we’ve all stopped paying attention? The fact of the matter is that I’d never even heard of Chrome’s “sync” option — for the simple reason that up until September 2018, I had never logged into Chrome. Now I’m forced to learn these new terms, and hope that the Chrome team keeps promises to keep all of my data local as the barriers between “signed in” and “not signed in” are gradually eroded away. The Chrome sync UI is a dark pattern. Now that I’m forced to log into Chrome, I’m faced with a brand new menu I’ve never seen before. It looks like this: Does that big blue button indicate that I’m already synchronizing my data to Google? That’s scary! Wait, maybe it’s an invitation to synchronize! If so, what happens to my data if I click it by accident? (I won’t give it the answer away, you should go find out. Just make sure you don’t accidentally upload all your data in the process. It can happen quickly.) In short, Google has transformed the question of consenting to data upload from something affirmative that I actually had to put effort into — entering my Google credentials and signing into Chrome — into something I can now do with a single accidental click. This is a dark pattern. Whether intentional or not, it has the effect of making it easy for people to activate sync without knowing it, or to think they’re already syncing and thus there’s no additional cost to increasing Google’s access to their data. Don’t take my word for it. It even gives (former) Google people the creeps. Big brother doesn’t need to actually watch you. We tell things to our web browsers that we wouldn’t tell our best friends. We do this with some vague understanding that yes, the Internet spies on us. But we also believe that this spying is weak and probabilistic. It’s not like someone’s standing over our shoulder checking our driver’s license with each click. What happens if you take that belief away? There are numerous studies indicating that even the perception of surveillance can significantly greatly magnify the degree of self-censorship users force on themselves. Will user feel comfortable browsing for information on sensitive mental health conditions — if their real name and picture are always loaded into the corner of their browser? The Chrome development team says “yes”. I think they’re wrong. For all we know, the new approach has privacy implications even if sync is off. The Chrome developers claim that with “sync” off, a Chrome has no privacy implications. This might be true. But when pressed on the actual details, nobody seems quite sure. For example, if I have my browser logged out, then I log in and turn on “sync”, does all my past (logged-out) data get pushed to Google? What happens if I’m forced to be logged in, and then subsequently turn on “sync”? Nobody can quite tell me if the data uploaded in these conditions is the same. These differences could really matter. ### The changes make hash of the Chrome privacy policy The Chrome privacy policy is a remarkably simple document. Unlike most privacy policies, it was clearly written as a promise to Chrome’s users — rather than as the usual lawyer CYA. Functionally, it describes two browsing modes: “Basic browser mode” and “signed-in mode”. These modes have very different properties. Read for yourself: In “basic browser mode”, your data is stored locally. In “signed-in” mode, your data gets shipped to Google’s servers. This is easy to understand. If you want privacy, don’t sign in. But what happens if your browser decides to switch you from one mode to the other, all on its own? Technically, the privacy policy is still accurate. If you’re in basic browsing mode, your data is still stored locally. The problem is that you no longer get to decide which mode you’re in. This makes a mockery out of whatever intentions the original drafters had. Maybe Google will update the document to reflect the new “sync” distinction that the Chrome developers have shared with me. We’ll see. Update: After I tweeted about my concerns, I received a DM on Sunday from two different Chrome developers, each telling me the good news: Google is updating their privacy policy to reflect the new operation of Chrome. I think that’s, um, good news. But I also can’t help but note that updating a privacy policy on a weekend is an awful lot of trouble to go to for a change that… apparently doesn’t even solve a problem for signed-out users. ### Trust is not a renewable resource For a company that sustains itself by collecting massive amounts of user data, Google has managed to avoid the negative privacy connotations we associate with, say, Facebook. This isn’t because Google collects less data, it’s just that Google has consistently been more circumspect and responsible with it. Where Facebook will routinely change privacy settings and apologize later, Google has upheld clear privacy policies that it doesn’t routinely change. Sure, when it collects, it collects gobs of data, but in the cases where Google explicitly makes user security and privacy promises — it tends to keep them. This seems to be changing. Google’s reputation is hard-earned, and it can be easily lost. Changes like this burn a lot of trust with users. If the change is solving an absolutely critical problem for users , then maybe a loss of trust is worth it. I wish Google could convince me that was the case. ### Conclusion This post has gone on more than long enough, but before I finish I want to address two common counterarguments I’ve heard from people I generally respect in this area. One argument is that Google already spies on you via cookies and its pervasive advertising network and partnerships, so what’s the big deal if they force your browser into a logged-in state? One individual I respect described the Chrome change as “making you wear two name tags instead of one”. I think this objection is silly both on moral grounds — just because you’re violating my privacy doesn’t make it ok to add a massive new violation — but also because it’s objectively silly. Google has spent millions of dollars adding additional tracking features to both Chrome and Android. They aren’t doing this for fun; they’re doing this because it clearly produces data they want. The other counterargument (if you want to call it that) goes like this: I’m a n00b for using Google products at all, and of course they were always going to do this. The extreme version holds that I ought to be using lynx+Tor and DJB’s custom search engine, and if I’m not I pretty much deserve what’s coming to me. I reject this argument. I think It’s entirely possible for a company like Google to make good, usable open source software that doesn’t massively violate user privacy. For ten years I believe Google Chrome did just this. Why they’ve decided to change, I don’t know. It makes me sad. # Friday Dachshund Blogging For over a year this blog has failed to deliver on an essential promise — that there would someday be pictures of dachshunds. Today we deliver. This is Callie (short for Calliope) working her way through a bit of summer crypto reading: But sometimes that’s exhausting and you’ve gotta take a break. A visit from a strange metallic dachshund: Summer: And in memoriam, Zoe and Sophie, who helped me start this blog. # Wonk post: chosen ciphertext security in public-key encryption (Part 2) This continues the post from Part 1. Note that this is a work in progress, and may have some bugs in it 🙂 I’ll try to patch them up as I go along. In the previous post I discussed the problem of building CCA-secure public key encryption. Here’s a quick summary of what we discussed in the first part: • We covered the definition of CCA2 security. • We described how you can easily achieve this notion in the symmetric encryption setting using a CPA-secure encryption scheme, plus a secure MAC. • We talked about why this same approach doesn’t work for standard public-key encryption. In this post I’m going to discuss a few different techniques that actually do provide CCA security for public key encryption. We’ll be covering these in no particular order. A quick note on security proofs. There are obviously a lot of different ways you could try to hack together a CCA2 secure scheme out of different components. Some of those might be secure, or they might not be. In general, the key difference between a “secure” and “maybe secure” scheme is the fact that we can construct some kind of security proof for it. The phrase “some kind” will turn out to be an important caveat, because these proofs might require a modest amount of cheating. ### The bad and the ugly Before we get to the constructive details, it’s worth talking a bit about some ideas that don’t work to achieve CCA security. The most obvious place to start is with some of the early RSA padding schemes, particularly the PKCS#1v1.5 padding standard. PKCS#1 padding was developed in the 1980s, when it was obvious that public key encryption was going to become widely deployed. It was intended as a pre-processing stage for messages that were going to be encrypted using an RSA public key. This padding scheme had two features. First, it added randomness to the message prior to encrypting it. This was designed to defeat the simple ciphertext guessing attacks that come from deterministic encryption schemes like RSA. It can be easily shown that randomized encryption is absolutely necessary for any IND-CPA (and implicitly, IND-CCA) secure public key encryption scheme. Second, the padding added some “check” bytes that were intended to help detect mangled ciphertexts after decryption; this was designed (presumably) to shore the scheme up against invalid decryption attempts. PKCS#1v1.5 is still widely used in protocols, including all versions of TLS prior to TLS 1.3. The diagram below shows what the padding scheme looks like when used in TLS with a 2048-bit RSA key. The section labeled “48 bytes PMS” (pre-master secret) in this example represents the plaintext being encrypted. The 205 “non-zero padding” consists of purely random bytes that exclude the byte “0”, because that value is reserved to indicate the end of the padding section and the beginning of the plaintext. After using the RSA secret key to recover the padded message, the decryptor is supposed to parse the message and verify that the first two bytes (“00 02”) and the boundary “00” byte are all correct and in not violating any rules. The decryptor may optionally conduct other checks like verifying the length and structure of the plaintext, in case that’s known in advance. One of the most immediate observations about PKCS#1v1.5 is that the designers kind of intuitively understood that chosen ciphertext attacks were a thing. They clearly added some small checks to make sure that it would be hard for an attacker to modify a given ciphertext (e.g., by multiplying it by a chosen value). It’s also obvious that these checks aren’t very strong. In the standardized version of the padding scheme, there are essentially three bytes to check — and one of them (the “00” byte after the padding) can “float” at a large number of different positions, depending on how much padding and plaintext there is in the message. The use of a weak integrity check leads to a powerful CCA2 attack on the encryption scheme that was first discovered by Daniel Bleichenbacher. The attack is powerful due to the fact that it actually leverages the padding check as a way to learn information about the plaintext. That is: the attacker “mauls” a ciphertext and sends it to be decrypted, and relies on the fact that the decryptor will simply perform the decryption checks they’re supposed to perform — and output a noticable error if they fail. Given only this one bit of information per decryption, the attack can gradually recover the full plaintext of a specific ciphertext by (a) multiplying it with some value, (b) sending the result to be decrypted, (c) recording the success/failure result, (d) adaptively picking a new value and repeating step (a) many thousands or millions of times. The PKCS#1v1.5 padding scheme is mainly valuable to us today because it provides an excellent warning to cryptographic engineers, who would otherwise continue to follow the “you can just hack together something that looks safe” school of building protocols. Bleichenbacher-style attacks have largely scared the crypto community straight. Rather than continuing to use this approach, the crypto community has (mostly) moved towards techniques that at least offer some semblance of provable security. That’s what we’ll cover in just a moment. ### A few quick notes on achieving CCA2-secure public key encryption Before we get to a laundry list of specific techniques and schemes, it’s worth asking what types of design features we might be looking for in a CCA2 public key encryption scheme. Historically there have been two common requirements: • It would be super convenient if we could start with an existing encryption scheme, like RSA or Elgamal encryption, and generically tweak (or “compile”) that scheme into a CCA2-secure scheme. (Re-usable generic techniques are particularly useful in the event that someone comes up with new underlying encryption schemes, like post-quantum secure ones.) • The resulting scheme should be pretty efficient. That rules out most of the early theoretical techniques that use huge zero knowledge proofs (as cool as they are). Before we get to the details, I also want to repeat the intuitive description of the CCA2 security game, which I gave in the previous post. The game (or “experiment”) works like this: 1. I generate an encryption keypair for a public-key scheme and give you the public key. 2. You can send me (sequentially and adaptively) many ciphertexts, which I will decrypt with my secret key. I’ll give you the result of each decryption. 3. Eventually you’ll send me a pair of messages (of equal length) $M_0, M_1$ and I’ll pick a bit $b$ at random, and return to you the encryption of $M_b$, which I will denote as $C^* \leftarrow {\sf Encrypt}(pk, M_b)$. 4. You’ll repeat step (2), sending me ciphertexts to decrypt. If you send me $C^*$ I’ll reject your attempt. But I’ll decrypt any other ciphertext you send me, even if it’s only slightly different from $C^*$. 5. You (the attacker) will output your guess $b'$. They “win” the game if $b'=b$. 6. We say a scheme is IND-CCA2 secure if the attacker wins with probability “not much greater” than 1/2 (which is the best an attacker can do if they just guess randomly.) A quick review of this definition shows that we need a CCA2-encryption scheme to provide at least two major features. First off, it should be obvious that the scheme must not leak information about the secret key, even when I’m using it to decrypt arbitrary chosen ciphertexts of your choice. There are obvious examples of schemes that fail to meet this requirement: the most famous is the (textbook) Rabin cryptosystem — where the attacker’s ability to obtain the decryption of a single chosen ciphertext can leak the entire secret key. More subtly, it seems obvious that CCA2 security is related to non-malleabilityHere’s why: suppose I receive a challenge ciphertext $C^*$ at step (3). It must be the case that I cannot easily “maul” that ciphertext into a new ciphertext $C'$ that contains a closely related plaintext (and that the challenger will be able and willing to meaingfully decrypt). It’s easy to see that if I could get away with this, by the rules of the game I could probably win at step (4), simply by sending $C'$ in to be decrypted, getting the result, and seeing whether it’s more closely related to $M_0$ or $M_1$. (This is, in fact, a very weak explanation of what the Bleichenbacher attack does.) It turns out that an even stronger property that helps achieve both of these conditions is something called plaintext awareness. There are various subtly-different mathematical formulations of this idea, but here I’ll try to give only the English-language intuition: If the attacker is able to submit a (valid) ciphertext to be decrypted, it must be the case that she already knows the plaintext of that message. This guarantee is very powerful, because it helps us to be sure that the decryption process doesn’t give the attacker any new information that she doesn’t already have. She can submit any messages she wants (including mauling the challenge ciphertext $C^*$) but if this plaintext-awareness property holds in the strongest sense, those decryptions won’t tell her anything she doesn’t already know. Of course, just because your scheme appears to satisfy the above conditions does not mean it’s secure. Both rules above are heuristics: that is, they’re necessary conditions to prevent attacks, but they may or may not be sufficient. To really trust a scheme (in the cryptographic sense) we should be able to offer a proof (under some assumptions) that these guarantees hold. We’ll address that a bit as we go forward. ### Technique 1: Optimal Asymmetric Encryption Padding One of the earlier practical CCA2 transforms was developed by Bellare and Rogaway as a direct replacement for the PKCS#1v1.5 padding scheme in RSA encryption. The scheme they developed — called Optimal Asymmetric Encryption Padding — represents a “drop-in” replacement for the earlier, broken v1.5 padding scheme. It also features a security proof. (Mostly. We’ll come back to this last point.) (Confusingly, OAEP was adopted into the PKCS#1 standards as of version 2.0, so sometimes you’ll see it referred to as PKCS#1v2.0-OAEP.) OAEP’s most obvious advance over the previous state of the art is the addition of not one, but two separate hash functions $G()$ and $H()$ that don’t exist in the v1.5 predecessor. (These are sometimes referred to as “mask generation functions”, which is just a fancy way of saying they’re hash functions with outputs of a custom, chosen size. Such functions can be easily built from existing hash functions like SHA256.) Expressed graphically, this is what OAEP it looks like: If you’ve ever seen the DES cipher, this structure should look familiar to you. Basically OAEP is a two-round (unkeyed) Feistel network that uses a pair of hash functions to implement the round functions. There are a few key observations you can make right off the bat: • Just looking at the diagram above, you can see that it’s very easy to compute this padding function forward (going from a plaintext $m$ and some random padding $r$ to a padded message) and backwards — that is, it’s an easily-invertible permutation. The key to this feature is the Feistel network structure. • Upon decryption, a decryptor can invert the padding of a given message and verify that the “check string” (the string of k1 “0” bits) is correctly structured. If this string is not structured properly, the decryptor can simply output an error. This comprises the primary decryption check. • Assuming some (strong) properties of the hash functions, it intuitively seems that the OAEP transform is designed to create a kind of “avalanche effect” where even a small modification of a padded message will result in a very different unpadded result when the transform is inverted. In practice any such modification should “trash” the check string with overwhelming probability. From an intuitive point of view, these last two properties are what makes OAEP secure against chosen-ciphertext attacks. The idea here is that, due to the random properties of the hash function, it should be hard for the attacker to construct a valid ciphertext (one that has a correct check string) if she does not already know the plaintext that goes into the transform. This should hold even if the attacker already has some known valid ciphertext (like $C^*$) that she wishes to maul. More specifically related to mauling: if I send an RSA-OAEP ciphertext $C^*$ that encrypts a specific message $m$, the attacker should not be able to easily maul that ciphertext into a different ciphertext $C'$ that will still pass the decryption checks. This is due to two facts: (1) because RSA is a (trapdoor) permutation, any change to $C^*$ will implicitly change the padded message your recover after inverting the RSA function. And (2) sending this altered padded message backwards through the OAEP transform should, with overwhelming probability, trash the check string (and the message $m$). The result is that the adversary can’t maul someone else’s ciphertext. This all assumes some very strong assumptions about the hash functions, which we’ll discuss below. The OAEP proof details (at the most ridiculously superficial level) Proving OAEP secure requires two basic techniques. Both fundamentally rely on the notion that the functions $G()$ and $H()$ are random oraclesThis is important for two very different reasons. First: assuming a function is a “random oracle” means that we’re assuming it to have the same behavior as a random function. This is an awesome property for a hash function to have! (Note: real hash functions don’t have it. This means that hypothetically they could have very ‘non-random’ behavior that would make RSA-OAEP insecure. In practice this has not yet been a practical concern for real OAEP implementations, but it’s worth keeping in mind. It’s easy to see that if the hash functions $G()$ and $H()$ were random functions, it would give OAEP some very powerful properties. Remember, one of the main intuitive goals of the OAEP scheme is to prevent attackers from successfully getting you to decrypt an improperly-constructed (e.g., mauled) ciphertext. If both hash functions are truly random, then this implies that any invalid ciphertext will almost certainly fail decryption, because the padding check will fail. At a much deeper level, the use of random oracles in RSA’s security proof gives the security reduction a great deal of “extra power” to handle things like decrypt chosen ciphertexts. This is due to the fact that, in a random oracle proof, the proof reduction is allowed to both “see” every value hashed through those hash functions, and also to “program” the functions so that they will produce specific outputs. This would not be possible if $G()$ and $H()$ were implemented using real hash functions, and so the entire security proof would break down. These properties provide a tool in the security proof to enable decryption even when the secret key is unknown. In a traditional proof of the RSA-OAEP scheme, the idea is to show that an attacker who breaks the encryption (in the IND-CCA2 sense) can be used to construct a second attacker who solves the RSA problem. This is done by taking some random values $(N, e, C)$ where $N, e$ is an RSA public key of unknown factorization and “programming” the random oracles such that $C^* = C$. The intuitive idea is that an attacker who is able to learn something about the underlying message must query the functions $G()$ and $H()$ on correct inputs that, ultimately will allow the security reduction to obtain the RSA inverse of $C^*$ even when the reduction does not know the RSA secret key, That is, such an attacker will allow us to find an integer $M'$ such that $M'^e = C$. (There turned out to be some issues in the original OAEP proof that make it not quite work for arbitrary trapdoor permutations. Shoup fixed these by providing a new padding padding scheme called OAEP+, but the original OAEP had since gone into heavy usage within standards! It turns out that RSA-OAEP does work, however, for RSA with public exponents 3 and other exponents, though proving this required some ugly band-aids. This whole story is part of a cautionary tail about provably security, which Koblitz discusses here.) ### Technique 2: The Fujisaki-Okamoto Transform One limitation of OAEP (and OAEP+) padding is that it requires a trapdoor permutation in order to work. This applies nicely to RSA encryption, but does not necessarily work with every existing public-key encryption scheme. This motivates the need for other CCA transforms that work with arbitrary existing (non-CCA) encryption schemes. One of the nicest generic techniques for building CCA2-secure public-key encryption is due to Eiichiro Fujisaki and Tatsuaki Okamoto. The idea of this transform is to begin with a scheme that already meets the definition of IND-CPA security — that is, it is semantically secure, but not against chosen ciphertext attacks. (For this description, we’ll also require that this scheme has a large [exponentially-sized] message space and some specific properties related to randomness.) The beauty of the “Fujisaki-Okamoto transform” (henceforth: F-O) is that, like OAEP before it, given a working public-key encryption scheme, it requires only the addition of some hash functions, and can be proven secure in the random oracle model. Let’s imagine that we have an IND-CPA encryption public-key encryption algorithm that consists of the algorithms ${\sf KeyGen}, {\sf Encrypt}, {\sf Decrypt}$. We’ll also make use of two independent hash functions $H_1, H_2$. A key observation here is that in every IND-CPA (semantically secure) public key encryption scheme, the ${\sf Encrypt}$ algorithm is randomized. This actually has to be the case, due to the definition of IND-CPA. (See here for a discussion of why that is.) Put more explicitly, what this means is that the encryption algorithm must have acccess to some set of random bits that will be used to produce the ciphertext. The main trick that the F-O transform uses is to de-randomize this public-key encryption algorithm. Instead of using real random bits to encrypt, it will instead use the output of the hash function $H_1$ to produce the random bits that will be used for encryption. This turns a randomized encryption into a deterministic one. (This, of course, requires that both the input and the internals of $H_1$ are capable of producing bits that “look” random.) Let’s get to the nuts and bolts. The F-O transform does not change the key generation algorithm of the original encryption scheme at all, except to specify the hash functions $H_1, H_2$. The main changes come in the new encryption and decryption algorithms. I’m going to present one variant of the transform, though there are others. This one works as follows. To encrypt a message $M$, which we’ll express as some fixed-length string of bits: 1. Instead of encrypting the actual message $M$, we instead sample a random message $R$ from the message space of the original CPA-secure scheme. 2. We hash the random message $R$ together with the original message $M$ using that first hash function $H_1$. The output of this function will give us a ‘random’ bitstring. Let’s denote this as: $r \leftarrow H_1(R \| M)$. 3. Next, we’ll encrypt the new random message $R$ using the original (CPA-secure) encryption scheme’s ${\sf Encrypt}$ algorithm, but critically: we will use the bits $r$ as the randomness for that encryption. The result of this process will give the first part of the ciphertext: $C_1 \leftarrow {\sf Encrypt}(pk, R; r)$. Note that here $r$ just refers to the randomness for the encryption algorithm, not an actual message being encrypted. 4. Finally, we derive a “key” for encrypting the real message we want to send. We can compute this as $K \leftarrow H_2(R)$. 5. We now encrypt the original message $M$ we want to send using some secure encryption scheme, for example the simple one-time pad: $C_2 \leftarrow M \oplus K$. 6. We output the “ciphertext” $C = (C_1, C_2)$. To decrypt $C = (C_1, C_2)$, we would perform the following steps: 1. First, use the original public-key encryption scheme’s secret key to decrypt the ciphertext $C_1$, which (if all is well) should give us $R' \leftarrow {\sf Decrypt}(sk, C_1)$. 2. Now use knowledge of $R'$ to recover the key $K' \leftarrow H_2(R')$ and thus the message $M'$ which we can obtain as $M' \leftarrow C_2 \oplus K'$. 3. Now check that both $R', M'$ are valid by re-computing the randomness $r' \leftarrow H_1(R' \| M')$ and verifying the condition $C_1 = {\sf Encrypt}(pk, R'; r')$. If this final check fails, simply output a decryption error. Phew. So what the heck is going on here? Let’s tackle this scheme from a practical perspective. Earlier in this post, we said that to achieve IND-CCA2 security, a scheme must have two features. First, it must be plaintext aware, which means that in order to construct a valid ciphertext (that passes all decryption checks) the attacker should already know the plaintext. Does F-O have this property? Well, intuitively we would hope that the answer is “yes”. Note for some valid F-O ciphertext $C = (C_1, C_2)$ the decrypted plaintext is implicitly defined as $M' \leftarrow C_2 \oplus H_2(R')$. So what we really want to prove is that in order to construct a valid ciphertext the attacker must already know $R'$ and $M'$ prior to sending the message for decryption. This guarantee (with high probability) comes from the structure of $C_1$. In order for the ciphertext to be considered valid by the decryptor, it must be the case that $C_1$ satisfies the check $C_1 = {\sf Encrypt}(pk, M'; r' = H_1(R' \| M'))$. The idea of this proof is that it should be hard for an attacker to construct such a $C_1$ unless she has previously called the hash function $H_1$ on input $(R', M')$. If she has called the hash function to produce this portion of the ciphertext, then she already knows those values and the decryption oracle provides her with no additional information she didn’t already have. (Alternatively, if she did not call the hash function, then her probability of making a valid $C_1$ should be extremely low.) Of course, this is only one strategy available to the attacker. She could also maul an existing ciphertext like $C^* = (C_1^*, C_2^*)$. In this case her strategy is twofold: she can tamper with the first portion of the ciphertext and/or she can tamper with the second. But it’s easy to see that this will tend to break some portion of the decryption checks: 1. If she tampers with any bit of $C_2^*$, she will change the recovered message into a new value that we can call $M''$. However this will in turn (with overwhelming probability) cause the decryptor to recover very different random coins $r'' \leftarrow H_1(R' \| M'')$ than were used in the original construction of $C_1^*$, and thus decryption check on that piece will probably fail. 2. If she tampers with any bit of $C_1^*$, the decryption check$latex C_1^* = {\sf Encrypt}(pk, M’; r’) ought not to pass, and decryption will just produce an error.
3. She might try to tamper with both parts of the ciphertext, of course. But this would seem even more challenging.

The problem with the exercise above is that none of this constitutes a proof that the approach works. There is an awful lot of should and probably in this argument, and none of this ought to make you very happy. A rough sketch of the proof for an F-O scheme can be found here. (I warn you that it’s probably got some bugs in it, and I’m offering it mainly as an intuition.)

The F-O scheme has many variants. A slightly different and much more formal treatment by Hofheinz and Kiltz can be found here, and deals with some other requirements on the underlying CPA-secure scheme.

### To be continued…

So far in this discussion we’ve covered two basic techniques — both at a very superficial level — that achieve CCA2 security under the ridiculously strong assumption that random oracles exist. Unfortunately, they don’t. This motivates the need for better approaches that don’t require random oracles at all.

There are a couple of those that, sadly, nobody uses. Those will have to wait until the next post.