Tuesday, September 23, 2014

Slate piece

Blogging has been slow, but only because some of it has been redirected. There's good stuff coming, including a neat post on the subject of RSA encryption and how it relates to the German army in World War II. In the meantime, please go read this (somewhat non-technical) piece I wrote for Slate on the importance of Apple's new encryption policy.

Wednesday, August 13, 2014

What's the matter with PGP?

Image source: @bcrypt.
Last Thursday, Yahoo announced their plans to support end-to-end encryption using a fork of Google's end-to-end email extension. This is a Big Deal. With providers like Google and Yahoo onboard, email encryption is bound to get a big kick in the ass. This is something email badly needs.

So great work by Google and Yahoo! Which is why the following complaint is going to seem awfully ungrateful. I realize this and I couldn't feel worse about it.

As transparent and user-friendly as the new email extensions are, they're fundamentally just re-implementations of OpenPGP -- and non-legacy-compatible ones, too. The problem with this is that, for all the good PGP has done in the past, it's a model of email encryption that's fundamentally broken.

It's time for PGP to die. 

In the remainder of this post I'm going to explain why this is so, what it means for the future of email encryption, and some of the things we should do about it. Nothing I'm going to say here will surprise anyone who's familiar with the technology -- in fact, this will barely be a technical post. That's because, fundamentally, most of the problems with email encryption aren't hyper-technical problems. They're still baked into the cake.

Background: PGP

Back in the late 1980s a few visionaries realized that this new 'e-mail' thing was awfully convenient and would likely be the future -- but that Internet mail protocols made virtually no effort to protect the content of transmitted messages. In those days (and still in these days) email transited the Internet in cleartext, often coming to rest in poorly-secured mailspools.

This inspired folks like Phil Zimmermann to create tools to deal with the problem. Zimmermann's PGP was a revolution. It gave users access to efficient public-key cryptography and fast symmetric ciphers in a package you could install on a standard PC. Even better, PGP was compatible with legacy email systems: it would convert your ciphertext into a convenient ASCII armored format that could be easily pasted into the sophisticated email clients of the day -- things like "mail", "pine" or "the Compuserve e-mail client".

It's hard to explain what a big deal PGP was. Sure, it sucked badly to use. But in those days, everything sucked badly to use. Possession of a PGP key was a badge of technical merit. Folks held key signing parties. If you were a geek and wanted to discreetly share this fact with other geeks, there was no better time to be alive.

We've come a long way since the 1990s, but PGP mostly hasn't. While the protocol has evolved technically -- IDEA replaced BassOMatic, and was in turn replaced by better ciphers -- the fundamental concepts of PGP remain depressingly similar to what Zimmermann offered us in 1991. This has become a problem, and sadly one that's difficult to change.

Let's get specific.

PGP keys suck

Before we can communicate via PGP, we first need to exchange keys. PGP makes this downright unpleasant. In some cases, dangerously so.

Part of the problem lies in the nature of PGP public keys themselves. For historical reasons they tend to be large and contain lots of extraneous information, which makes it difficult to print them on a business card or compare them manually. You can write this off to a quirk of older technology, but even modern elliptic curve implementations still produce surprisingly large keys.
Three public keys offering roughly the same security level. From top-left: (1) Base58-encoded Curve25519 public key used in miniLock. (2) OpenPGP 256-bit elliptic curve public key format. (3a) GnuPG 3,072 bit RSA key and (3b) key fingerprint.
Since PGP keys aren't designed for humans, you need to move them electronically. But of course humans still need to verify the authenticity of received keys, as accepting an attacker-provided public key can be catastrophic.

PGP addresses this with a hodgepodge of key servers and public key fingerprints. These components respectively provide (untrustworthy) data transfer and a short token that human beings can manually verify. While in theory this is sound, in practice it adds complexity, which is always the enemy of security.

Now you may think this is purely academic. It's not. It can bite you in the ass.

Imagine, for example, you're a source looking to send secure email to a reporter at the Washington Post. This reporter publishes his fingerprint via Twitter, which means the most obvious (and recommended) approach is to ask your PGP client to retrieve the key by fingerprint from a PGP key server. On the GnuPG command line this can be done roughly as follows (the keyserver and fingerprint here are placeholders):
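
    $ gpg --keyserver pgp.mit.edu --recv-keys "<fingerprint from the tweet>"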


Now let's ignore the fact that you've just leaked your key request to an untrusted server via HTTP. At the end of this process you should have the right key with high reliability. Right?

Except maybe not: if you happen to do this with GnuPG 2.0.18 -- one version off from the very latest GnuPG -- the client won't actually bother to check the fingerprint of the received key. A malicious server (or HTTP attacker) can ship you back the wrong key and you'll get no warning. This is fixed in the very latest versions of GPG but... Oy Vey.

PGP Key IDs are also pretty terrible, due to the short length and continued support for the broken V3 key format.
You can say that it's unfair to pick on all of PGP over an implementation flaw in GnuPG, but I would argue it speaks to a fundamental issue with the PGP design. PGP assumes keys are too big and complicated to be managed by mortals, but in practice it practically begs users to handle them anyway. This means we manage them through a layer of machinery, and it happens that our machinery is far from infallible.

Which raises the question: why are we bothering with all this crap infrastructure in the first place? If we must exchange things via Twitter, why not simply exchange keys? Modern EC public keys are tiny. You could easily fit three or four of them in the space of this paragraph. If we must use an infrastructure layer, let's just use it to shunt all the key metadata around.
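
To put numbers on "tiny": here's a quick sketch using the PyNaCl library (an arbitrary choice on my part -- any Curve25519/X25519 library will do):

    # A Curve25519 public key is just 32 bytes, which Base64-encodes to
    # roughly 44 characters -- short enough to paste into a tweet.
    from nacl.public import PrivateKey
    from nacl.encoding import Base64Encoder

    keypair = PrivateKey.generate()
    print(len(bytes(keypair.public_key)))                     # 32
    print(keypair.public_key.encode(Base64Encoder).decode())  # the whole key, ~44 chars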

PGP key management sucks

If you can't trust Phil, who can you trust?
Manual key management is a mug's game. Transparent (or at least translucent) key management is the hallmark of every successful end-to-end secure encryption system. Now often this does involve some tradeoffs -- e.g., the need to trust a central authority to distribute keys -- but even this level of security would be lightyears better than the current situation with webmail.

To their credit, both Google and Yahoo have the opportunity to build their own key management solutions (at least, for those who trust Google and Yahoo), and they may still do so in the future. But today's solutions don't offer any of this, and it's not clear when they will. Key management, not pretty web interfaces, is the real weakness holding back widespread secure email.

ZRTP authentication string as used in Signal.
For the record, classic PGP does have a solution to the problem. It's called the "web of trust", and it involves individuals signing each others' keys. I refuse to go into the problems with WoT because, frankly, life is too short. The TL;DR is that 'trust' means different things to you than it does to me. Most OpenPGP implementations do a lousy job of presenting any of this data to their users anyway.

The lack of transparent key management in PGP isn't unfixable. For those who don't trust Google or Yahoo, there are experimental systems like Keybase.io that attempt to tie keys to user identities. In theory we could even exchange our offline encryption keys through voice-authenticated channels using apps like OpenWhisperSystems' Signal. So far, nobody's bothered to do this -- all of these modern encryption tools are islands with no connection to the mainland. Connecting them together represents one of the real challenges facing widespread encrypted communications.

No forward secrecy

Try something: go delete some mail from your Gmail account. You've hit the archive button. Presumably you've also permanently wiped your Deleted Items folder. Now make sure you wipe your browser cache and the mailbox files for any IMAP clients you might be running (e.g., on your phone). Do any of your devices use SSD drives? Probably a safe bet to securely wipe those devices entirely. And at the end of this Google may still have a copy which could be vulnerable to law enforcement request or civil subpoena.

(Let's not get into the NSA's collect-it-all policy for encrypted messages. If the NSA is your adversary just forget about PGP.)

Forward secrecy (usually misnamed "perfect forward secrecy") ensures that if you can't destroy the ciphertexts, you can at least dispose of keys when you're done with them. Many online messaging systems like Off-the-Record Messaging (OTR) use PFS by default, essentially deriving a new key with each message volley sent. Newer 'ratcheting' systems like Trevor Perrin's Axolotl (used by TextSecure) have also begun to address the offline case.

Adding forward secrecy to asynchronous offline email is a much bigger challenge, but fundamentally it's at least possible to some degree. While securing the initial 'introduction' message between two participants may be challenging*, each subsequent reply can carry a new ephemeral key to be used in future communications. However this requires breaking changes to the PGP protocol and to clients -- changes that aren't likely to happen in a world where webmail providers have doubled down on the PGP model.
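
To give a flavor of what 'ratcheting' means, here's a toy symmetric ratchet. It's in the spirit of Axolotl but deliberately simplified -- don't mistake it for the real construction:

    # Toy symmetric ratchet: each message gets its own key, and the chain key
    # is advanced (and the old one erased) as we go. Leak today's state and
    # you still can't recover yesterday's message keys.
    import hmac, hashlib

    def ratchet_step(chain_key):
        message_key = hmac.new(chain_key, b"message-key", hashlib.sha256).digest()
        next_chain_key = hmac.new(chain_key, b"chain-key", hashlib.sha256).digest()
        return next_chain_key, message_key

    chain = hashlib.sha256(b"shared secret from the initial handshake").digest()
    for i in range(3):
        chain, mk = ratchet_step(chain)   # mk protects exactly one message, then is erased
        print(i, mk.hex()[:16])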

The OpenPGP format and defaults suck
If Will Smith looked like this when your cryptography was current, you need better cryptography.

Poking through a modern OpenPGP implementation is like visiting a museum of 1990s crypto. For legacy compatibility reasons, many clients use old ciphers like CAST5 (a cipher that predates the AES competition). RSA encryption uses padding that looks disturbingly like PKCS#1v1.5 -- a format that's been relentlessly exploited in the past. Key size defaults don't reach the 128-bit security level. MACs are optional. Compression is often on by default. Elliptic curve crypto is (still!) barely supported.

Most of these issues are not exploitable unless you use PGP in a non-standard way, e.g., for instant messaging or online applications. And some people do use PGP this way.

But even if you're only using PGP to send one-off emails to your grandmother, these bad defaults are pointless and unnecessary. It's one thing to provide optional backwards compatibility for that one friend who runs PGP on his Amiga. But few of my contacts do -- and moreover, client versions are clearly indicated in public keys.** Even if these archaic ciphers and formats aren't exploitable today, the current trajectory guarantees we'll still be using them a decade from now. Then all bets are off.

On the bright side, both Google and Yahoo seem to be pushing towards modern implementations that break compatibility with the old. Which raises a different question. If you're going to break compatibility with most PGP implementations, why bother with PGP at all?

Terrible mail client implementations

This is by far the worst aspect of the PGP ecosystem, and also the one I'd like to spend the least time on. In part this is because UX isn't technically PGP's problem; in part because the experience is inconsistent between implementations, and in part because it's inconsistent between users: one person's 'usable' is another person's technical nightmare.

But for what it's worth, many PGP-enabled mail clients make it ridiculously easy to send confidential messages with encryption turned off, to send unimportant messages with encryption turned on, to accidentally send to the wrong person's key (or the wrong subkey within a given person's key). They demand you encrypt your key with a passphrase, but routinely bug you to enter that passphrase in order to sign outgoing mail -- exposing your decryption keys in memory even when you're not reading secure email.

Most of these problems stem from the fact that PGP was designed to retain compatibility with standard (non-encrypted) email. If there's one lesson from the past ten years, it's that people are comfortable moving past email. We now use purpose-built messaging systems on a day-to-day basis. The startup cost of a secure-by-default environment is, at this point, basically an app store download.

Incidentally, the new Google/Yahoo web-based end-to-end clients dodge this problem by providing essentially no user interface at all. You enter your message into a separate box, and then plop the resulting encrypted data into the Compose box. This avoids many of the nastier interface problems, but only by making encryption non-transparent. This may change; it's too soon to know how.

So what should we be doing?

Quite a lot actually. The path to a proper encrypted email system isn't that far off. At minimum, any real solution needs:

  • A proper approach to key management. This could be anything from centralized key management as in Apple's iMessage -- which would still be better than nothing -- to a decentralized (but still usable) approach like the one offered by Signal or OTR. Whatever the solution, in order to achieve mass deployment, keys need to be made much more manageable or else submerged from the user altogether.
  • Forward secrecy baked into the protocol. This should be a pre-condition to any secure messaging system. 
  • Cryptography that post-dates the Fresh Prince. Enough said.
  • Screw backwards compatibility. Securing both encrypted and unencrypted email is too hard. We need dedicated networks that handle this from the start.

A number of projects are already going in this direction. Aside from the above-mentioned projects like Axolotl and TextSecure -- which pretend to be text messaging systems, but are really email in disguise -- projects like Mailpile are trying to re-architect the client interface (though they're sticking with the PGP paradigm). Projects like SMIMP are trying to attack this at the protocol level.*** At least in theory projects like DarkMail are also trying to adapt text messaging protocols to the email case, though details remain few and far between.

It also bears noting that many of the issues above could, in principle at least, be addressed within the confines of the OpenPGP format. Indeed, if you view 'PGP' to mean nothing more than the OpenPGP transport, a lot of the above seems easy to fix -- with the exception of forward secrecy, which really does seem hard to add without some serious hacks. But in practice, this is rarely all that people mean when they implement 'PGP'. 

Conclusion

I realize I sound a bit cranky about this stuff. But as they say: a PGP critic is just a PGP user who's actually used the software for a while. At this point there's so much potential in this area and so many opportunities to do better. It's time for us to adopt those ideas and stop looking backwards.

Notes:

* Forward security even for introduction messages can be implemented, though it requires either additional offline key distribution (e.g., TextSecure's 'pre-keys') or else the use of advanced primitives. For the purposes of a better PGP, just handling the second message in a conversation would be sufficient.

** Most PGP keys indicate the precise version of the client that generated them (which seems like a dumb thing to do). However if you want to add metadata to your key that indicates which ciphers you prefer, you have to use an optional command.

*** Thanks to Taylor Hornby for reminding me of this.

Saturday, July 26, 2014

Noodling about IM protocols

The last couple of months have been a bit slow in the blogging department. It's hard to blog when there are exciting things going on. But also: I've been a bit blocked. I have two or three posts half-written, none of which I can quite get out the door.

Instead of writing and re-writing the same posts again, I figured I might break the impasse by changing the subject. Usually the easiest way to do this is to pick some random protocol and poke at it for a while to see what we learn. 

The protocols I'm going to look at today aren't particularly 'random' -- they're both popular encrypted instant messaging protocols. The first is OTR (Off the Record Messaging). The second is Cryptocat's group chat protocol. Each of these protocols has a similar end-goal, but they get there in slightly different ways.

I want to be clear from the start that this post has absolutely no destination. If you're looking for exciting vulnerabilities in protocols, go check out someone else's blog. This is pure noodling.

The OTR protocol

OTR is probably the most widely-used protocol for encrypting instant messages. If you use IM clients like Adium, Pidgin or ChatSecure, you already have OTR support. You can enable it in some other clients through plugins and overlays.

OTR was originally developed by Borisov, Goldberg and Brewer and has rapidly come to dominate its niche. Mostly this is because Borisov et al. are smart researchers who know what they’re doing. Also: they picked a cool name and released working code.

OTR works within the technical and usage constraints of your typical IM system. Roughly speaking, these are:
  1. Messages must be ASCII-formatted and have some (short) maximum length.
  2. Users won't bother to exchange keys, so authentication should be "lazy" (i.e., you can authenticate your partners after the fact).
  3. Your chat partners are all FBI informants so your chat transcripts must be plausibly deniable -- so as to keep them from being used as evidence against you in a court of law.
Coming to this problem fresh, you might find goal (3) a bit odd. In fact, to the best of my knowledge no court in the history of law has ever used a cryptographic transcript as evidence that a conversation occurred. However it must be noted that this requirement makes the problem a bit more sexy. So let's go with it!
"Dammit, they used a deniable key exchange protocol" said no Federal prosecutor ever.
The OTR (version 2/3) handshake is based on the SIGMA key exchange protocol. Briefly, it assumes that both parties generate long-term DSA public keys which we'll denote by (pubA, pubB). Next the parties interact as follows:

The OTRv2/v3 AKE. Diagram by Bonneau and Morrison, all colorful stuff added. There's also an OTRv1 protocol that's too horrible to talk about here.
There are five elements to this protocol:
  1. Hash commitment. First, Bob commits to his share of a Diffie-Hellman key exchange (g^x) by encrypting it under a random AES key r and sending the ciphertext and a hash of g^x over to Alice. (A toy sketch of this commit/open step appears just after this list.)
     
  2. Diffie-Hellman Key Exchange. Next, Alice sends her half of the key exchange protocol (g^y). Bob can now 'open' his share to Alice by sending the AES key r that he used to encrypt it in the previous step. Alice can decrypt this value and check that it matches the hash Bob sent in the first message.

    Now that both sides have the shares (g^x, g^y) they each use their secrets to compute a shared secret g^{xy} and hash the value several ways to establish shared encryption keys (c', Km2, Km'2) for subsequent messages. In addition, each party hashes g^{xy} to obtain a short "session ID".

    The sole purpose of the commitment phase (step 1) is to prevent either Alice or Bob from controlling the value of the shared secret g^{xy}. Since the session ID value is derived by hashing the Diffie-Hellman shared secret, it's possible to use a relatively short session ID value to authenticate the channel, since neither Alice nor Bob will be able to force this ID to a specific value.
     
  3. Exchange of long-term keys and signatures. So far Alice and Bob have not actually authenticated that they're talking to each other, hence their Diffie-Hellman exchange could have been intercepted by a man-in-the-middle attacker. Using the encrypted channel they've previously established, they now set about to fix this.

    Alice and Bob each send their long-term DSA public key (pubA, pubB) and key identifiers, as well as a signature on (a MAC of) the specific elements of the Diffie-Hellman message (g^x, g^y) and their view of which party they're communicating with. They can each verify these signatures and abort the connection if something's amiss.**
     
  4. Revealing MAC keys. After sending a MAC, each party waits for an authenticated response from its partner. It then reveals the MAC keys for the previous messages.
     
  5. Lazy authentication. Of course if Alice and Bob never exchange public keys, this whole protocol execution is still vulnerable to a man-in-the-middle (MITM) attack. To verify that nothing's amiss, both Alice and Bob should eventually authenticate each other. OTR provides three mechanisms for doing this: parties may exchange fingerprints (essentially hashes) of (pubA, pubB) via a second channel. Alternatively, they can exchange the "session ID" calculated in the second phase of the protocol. A final approach is to use the Socialist Millionaires' Problem to prove that both parties share the same secret.
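To make step (1) concrete, here's a toy sketch of the commit/open dance. The group encoding, hash choice and wire format are placeholders -- consult the OTR spec for the real details:

    # Toy sketch of OTR's hash commitment: Bob sends AES_r(g^x) and a hash of
    # g^x, then later "opens" the commitment by revealing the one-time key r.
    # Not the real OTR wire format.
    import os, hashlib
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    def aes_ctr(key, data):
        # One-time random key, so a fixed all-zero counter block is fine here.
        ctx = Cipher(algorithms.AES(key), modes.CTR(b"\x00" * 16)).encryptor()
        return ctx.update(data) + ctx.finalize()

    g_x = os.urandom(192)                  # stand-in for Bob's encoded DH share g^x
    r = os.urandom(16)                     # one-time AES-128 key
    commitment = (aes_ctr(r, g_x), hashlib.sha256(g_x).digest())   # Bob -> Alice

    # ...Alice replies with g^y, then Bob opens by revealing r...
    ciphertext, gx_hash = commitment
    opened = aes_ctr(r, ciphertext)        # CTR decryption == encryption
    assert hashlib.sha256(opened).digest() == gx_hash   # Alice's check before accepting g^x

Since neither side can see the other's share before committing to its own, neither can steer the resulting g^{xy} (or the short session ID derived from it) to a chosen value.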
The OTR key exchange provides the following properties:

Protecting user identities. No user-identifying information (e.g., long-term public keys) is sent until the parties have first established a secure channel using Diffie-Hellman. The upshot is that a purely passive attacker doesn't learn the identity of the communicating partners -- beyond what's revealed by the higher-level IM transport protocol.*

Unfortunately this protection fails against an active attacker, who can easily smash an existing OTR connection to force a new key agreement and run an MITM on the Diffie-Hellman used during the next key agreement. This does not allow the attacker to intercept actual message content -- she'll get caught when the signatures don't check out -- but she can view the public keys being exchanged. From the client point of view the likely symptoms are a mysterious OTR error, followed immediately by a successful handshake.

One consequence of this is that an attacker could conceivably determine which of several clients you're using to initiate a connection.

Weak deniability. The main goal of the OTR designers is plausible deniability. Roughly, this means that when you and I communicate there should be no binding evidence that we really had the conversation. This rules out obvious solutions like GPG-based chats, where individual messages would be digitally signed, making them non-repudiable.

Properly defining deniability is a bit complex. The standard approach is to show the existence of an efficient 'simulator' -- in plain English, an algorithm for making fake transcripts. The theory is simple: if it's trivial to make fake transcripts, then a transcript can hardly be viewed as evidence that a conversation really occurred.

OTR's handshake doesn't quite achieve 'strong' deniability -- meaning that anyone can fake a transcript between any two parties -- mainly because it uses signatures. As signatures are non-repudiable, there's no way to fake one without access to your signing key. This reveals that we did, in fact, communicate at some point. Moreover, it's possible to create an evidence trail that I communicated with you, e.g., by encoding my identity into my Diffie-Hellman share (g^x). At the very least I can show that at some point you were online and we did have contact.

But proving contact is not the same thing as proving that a specific conversation occurred. And this is what OTR works to prevent. The guarantee OTR provides is that if the target was online at some point and you could contact them, there's an algorithm that can fake just about any conversation with the individual. Since OTR clients are, by design, willing to initiate a key exchange with just about anyone, merely putting your client online makes it easy for people to fake such transcripts.***

Towards strong deniability. The 'weak' deniability of OTR requires at least tacit participation of the user (Bob) for which we're faking the transcript. This isn't a bad property, but in practice it means that fake transcripts can only be produced by either Bob himself, or someone interacting online with Bob. This certainly cuts down on your degree of deniability.

A related concept is 'strong deniability', which ensures that any party can fake a transcript using only public information (e.g., your public keys).

OTR doesn't try to achieve strong deniability -- but it does try for something in between. The OTR version of deniability holds that an attacker who obtains the network traffic of a real conversation -- even if they aren't one of the participants -- should be able to alter the conversation to say anything he wants. Sort of.

The rough outline of the OTR deniability process is to generate a new message authentication key for each message (using Diffie-Hellman) and then reveal those keys once they've been used up. In theory, a third party can obtain this transcript and -- if they know the original message content -- they can 'maul' the AES-CTR encrypted messages into messages of their choice, then they can forge their own MACs on the new messages.
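
If the 'mauling' step sounds mysterious, here's a toy sketch of it. Any keystream-XOR encryption (AES-CTR included) behaves this way once the original plaintext is known; the strings and library choice here are mine:

    # Mauling a CTR ciphertext without the key: CTR is plaintext XOR keystream,
    # so anyone who knows the original plaintext can flip the ciphertext into
    # an encryption of any other equal-length message.
    import os
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    key, nonce = os.urandom(16), os.urandom(16)
    original = b"would you like a pizza tonight??"
    ct = Cipher(algorithms.AES(key), modes.CTR(nonce)).encryptor().update(original)

    forged_pt = b"would you like to hack STRATFOR?"   # same length as the original
    forged_ct = bytes(c ^ o ^ f for c, o, f in zip(ct, original, forged_pt))

    # Sanity check (the key is used here only to show the forgery decrypts as planned):
    assert Cipher(algorithms.AES(key), modes.CTR(nonce)).decryptor().update(forged_ct) == forged_pt

With the revealed MAC keys in hand, the forger can then slap a fresh, valid MAC on the mauled ciphertext.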
OTR message transport (source: Bonneau and Morrison, all colored stuff added).
Thus our hypothetical transcript forger can take an old transcript that says "would you like a Pizza" and turn it into a valid transcript that says, for example, "would you like to hack STRATFOR"... Except that they probably can't, since the first message is too short and... oh lord, this whole thing is a stupid idea -- let's stop talking about it.

The OTRv1 handshake. Oh yes, there's also an OTRv1 protocol that has a few issues and isn't really deniable. Even better, an MITM attacker can force two clients to downgrade to it, provided both support that version. Yuck.

So that's the OTR protocol. While I've pointed out a few minor issues above, the truth is that the protocol is generally an excellent way to communicate. In fact it's such a good idea that if you really care about secrecy it's probably one of the best options out there.

Cryptocat

Since we're looking at IM protocols I thought it might be nice to contrast with another fairly popular chat protocol: Cryptocat's group chat. Cryptocat is a web-based encrypted chat app that now runs on iOS (and also in Thomas Ptacek's darkest nightmares).

Cryptocat implements OTR for 'private' two-party conversations. However OTR is not the default. If you use Cryptocat in its default configuration, you'll be using its hand-rolled protocol for group chats.

The Cryptocat group chat specification can be found here, and it's remarkably simple. There are no "long-term" keys in Cryptocat. Diffie-Hellman keys are generated at the beginning of each session and re-used for all conversations until the app quits. Here's the handshake between two parties:

Cryptocat group chat handshake (current revision). Setting is Curve25519. Keys are generated when the application launches, and re-used through the session.
If multiple people join the room, every pair of users repeats this handshake to derive a shared secret between every pair of users. Individuals are expected to verify each others' keys by checking fingerprints and/or running the Socialist Millionaire protocol.
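
Here's a rough sketch of that pairwise derivation, using the `cryptography` package's X25519 as a stand-in; the key-derivation step is illustrative rather than the exact Cryptocat construction:

    # Each user has one Curve25519 keypair per app session; every *pair* of
    # users derives a shared secret from it. (KDF details are illustrative.)
    import hashlib
    from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey

    users = {name: X25519PrivateKey.generate() for name in ("alice", "bob", "carol")}

    def pair_keys(me, peer_public_key):
        shared = users[me].exchange(peer_public_key)   # raw X25519 shared secret
        digest = hashlib.sha512(shared).digest()
        return digest[:32], digest[32:]                # (encryption key, MAC key)

    # Alice and Bob arrive at the same key pair independently:
    assert pair_keys("alice", users["bob"].public_key()) == \
           pair_keys("bob", users["alice"].public_key())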

Unlike OTR, the Cryptocat handshake includes no key confirmation messages, nor does it attempt to bind users to their identity or chat room. One implication of this is that I can transmit someone else's public key as if it were my own -- and the recipients of this transmission will believe that the person is actually part of the chat.

Moreover, since public keys aren't bound to the user's identity or the chat room, you could potentially relay messages from a different user (even a user in a different chat room) while making it look like they're talking to you. Since Cryptocat is a group chat protocol, there might be some interesting things you could do to manipulate the conversation in this setting.****


Does any of this matter? Probably not that much, but it would be relatively easy (and good) to fix these issues.

Message transmission and consistency. The next interesting aspect of Cryptocat is the way it transmits encrypted chat messages. One of the core goals of Cryptocat is to ensure that messages are consistent between individual users. This means that every user should be able to verify that the other participants are receiving the same data they are.

Cryptocat uses a slightly complex mechanism to achieve this. For each pair of users in the chat, Cryptocat derives an AES key and a MAC key from the Diffie-Hellman shared secret. To send a message, the client:

  1. Pads the message by appending 64 bytes of random padding.
  2. Generates a random 12-byte Initialization Vector for each of the N users in the chat.
  3. Encrypts the message using AES-CTR under the shared encryption key for each user.
  4. Concatenates all of the N resulting ciphertexts/IVs and computes an HMAC of the whole blob under each recipient's key.
  5. Calculates a 'tag' for the message by hashing the following data:

    padded plaintext || HMAC-SHA512_Alice || HMAC-SHA512_Bob || HMAC-SHA512_Carol || ...
  6. Broadcasts the ciphertexts, IVs, MACs and the single 'tag' value to all users in the conversation.
When a recipient receives a message from another user, it verifies that:
  1. The message contains a valid HMAC under its shared key.
  2. This IV has not been received before from this sender.
  3. The decrypted plaintext is consistent with the 'tag'.
Roughly speaking, the idea here is to make sure that every user receives the same message. The use of a hashed plaintext is a bit ugly, but the argument here is that the random padding protects the message from guessing attacks. Make what you will of this.
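
For the curious, here's a compressed sketch of that send path for a two-recipient room. The helper names and structure are mine, and the real spec pins down encodings this sketch glosses over:

    # Cryptocat-style send path (illustrative). The pairwise (enc_key, mac_key)
    # tuples would come from the handshake sketched earlier.
    import os, hmac, hashlib
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    def aes_ctr(key, iv12, data):
        # 12-byte IV; the remaining 4 counter bytes start at zero.
        ctx = Cipher(algorithms.AES(key), modes.CTR(iv12 + b"\x00" * 4)).encryptor()
        return ctx.update(data) + ctx.finalize()

    def send(message, recipients):            # recipients: {name: (enc_key, mac_key)}
        padded = message + os.urandom(64)     # 1. append 64 bytes of random padding
        per_user, blob = {}, b""
        for name, (enc_key, _) in recipients.items():
            iv = os.urandom(12)               # 2. fresh IV per recipient
            ct = aes_ctr(enc_key, iv, padded) # 3. AES-CTR under the pairwise key
            per_user[name] = (iv, ct)
            blob += iv + ct                   # 4. concatenate all the IVs/ciphertexts...
        macs = {name: hmac.new(mk, blob, hashlib.sha512).digest()
                for name, (_, mk) in recipients.items()}   # ...and HMAC the blob per recipient
        tag = hashlib.sha512(padded + b"".join(macs.values())).digest()   # 5. the 'tag'
        return per_user, macs, tag            # 6. broadcast the lot

    room = {"bob": (os.urandom(32), os.urandom(64)), "carol": (os.urandom(32), os.urandom(64))}
    ciphertexts, macs, tag = send(b"hi room", room)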

Anti-replay. Cryptocat also seeks to prevent replay attacks, e.g., where an attacker manipulates a conversation by simply replaying (or reflecting) messages between users, so that users appear to be repeating statements. For example, consider the following chat transcripts:

Replays and reflection attacks. 
Replay attacks are prevented through the use of a global 'IV array' that stores all previously received and sent IVs to/from all users. If a duplicate IV arrives, Cryptocat will reject the message. This is unwieldy but it generally seems ok to prevent replays and reflection.

A limitation of this approach is that the IV array does not live forever. In fact, from time to time Cryptocat will reset the IV array without regenerating the client key. This means that if Alice and Bob both stay online, they can repeat the key exchange and wind up using the same key again -- which makes them both vulnerable to subsequent replays and reflections. (Update: This issue has since been fixed).

In general the solution to these issues is threefold:

  1. Keys shouldn't be long-term, but should be regenerated using new random components for each key exchange.
  2. Different keys should be derived for the Alice->Bob and Bob->Alice directions.
  3. It would be more elegant to use a message counter than to use this big, unwieldy IV array. (A minimal sketch of the last two points appears just below.)
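A minimal sketch of points (2) and (3) -- directional keys plus a per-sender counter -- might look like this (illustrative only, not Cryptocat code):

    # Directional keys mean a reflected message can never authenticate, and a
    # monotonically increasing counter replaces the ever-growing IV array.
    import hmac, hashlib

    def directional_key(shared_secret, sender, receiver):
        label = ("%s->%s" % (sender, receiver)).encode()
        return hmac.new(shared_secret, label, hashlib.sha256).digest()

    class ReplayCheck:
        def __init__(self):
            self.highest_seen = {}                 # sender -> last accepted counter
        def accept(self, sender, counter):
            if counter <= self.highest_seen.get(sender, -1):
                return False                       # replay (or rewound counter): reject
            self.highest_seen[sender] = counter
            return True

    rc = ReplayCheck()
    assert rc.accept("alice", 0) and rc.accept("alice", 1)
    assert not rc.accept("alice", 1)               # replayed message is rejected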
The good news is that the Cryptocat developers are working on a totally new version of the multi-party chat protocol that should be enormously better.

In conclusion

I said this would be a post that goes nowhere, and I delivered! But I have to admit, it helps to push it out of my system.

None of the issues I note above are the biggest deal in the world. They're all subtle issues, which illustrates two things: first, that crypto is hard to get right. But also: that crypto rarely fails catastrophically. The exciting crypto bugs that cause you real pain are still few and far between.

Notes:

* In practice, you might argue that the higher-level IM protocol already leaks user identities (e.g., Jabber nicknames). However this is very much an implementation choice. Moreover, even when using Jabber with known nicknames, you might access the Jabber server using one of several different clients (your computer, phone, etc.). Assuming you use Tor, the only indication of this might be the public key you use during OTR. So there's certainly useful information in this protocol.

** Notice that OTR signs a MAC instead of a hash of the user identity information. This happens to be a safe choice given that the MAC used is based on HMAC-SHA2, but it's not generally a safe choice. Swapping the HMAC out for a different MAC function (e.g., CBC-MAC) would be catastrophic.

*** To get specific, imagine I wanted to produce a simulated transcript for some conversation with Bob. Provided that Bob's client is online, I can send Bob any g^x value I want. It doesn't matter if he really wants to talk to me -- by default, his client will cheerfully send me back his own g^y and a signature on (g^x, g^y, pub_B, keyid_B) which, notably, does not include my identity. From this point on all future authentication is performed using MACs and encrypted under keys that are known to both of us. There's nothing stopping me from faking the rest of the conversation.

**** Incidentally, a similar problem exists in the OTRv1 protocol. 

Thursday, April 24, 2014

Attack of the Week: Triple Handshakes (3Shake)

3Shake logo designed by @Raed667
The other day Apple released a major security update that fixes a number of terrifying things that can happen to your OS X and iOS devices. You should install it. Not only does this fix a possible remote code execution vulnerability in the JPEG parser (!), it also patches a TLS/SSL protocol bug known as the "Triple Handshake" vulnerability. And this is great timing, since Triple Handshakes are something I've been meaning (and failing) to write about for over a month now.

But before we get there: a few points of order.

First, if Heartbleed taught us one thing, it's that when it comes to TLS vulnerabilities, branding is key. Henceforth, and with apologies to Bhargavan, Delignat-Lavaud, Pironti,  Fournet and Strub (who actually discovered the attack*), for the rest of this post I will be referring to the vulnerability simply as "3Shake". I've also taken the liberty of commissioning a logo. I hope you like it.

On a more serious note, 3Shake is not Heartbleed. That's both good and bad. It's good because Heartbleed was nasty and 3Shake really isn't anywhere near as dangerous. It's bad since, awful as it was, Heartbleed was only an implementation vulnerability -- and one in a single TLS library to boot. 3Shake represents a novel and fundamental bug in the TLS protocol.

The final thing you should know about 3Shake is that, according to the cryptographic literature, it shouldn't exist.

You see, in the last few years there have been at least three or four major crypto papers purporting to prove the TLS protocol secure. The existence of 3Shake doesn't make those results wrong. It may, however, indicate that cryptographers need to think a bit more about what 'secure' and 'TLS' actually mean. For me, that's the most fascinating implication of this new attack.

I'll proceed with the usual 'fun' question-and-answer format I save for this sort of thing.

What is TLS and why should you care?

Since you're reading this blog, you probably already know something about TLS. You might even realize how much of our infrastructure is protected by this crazy protocol.

In case you don't: TLS is a secure transport protocol that's designed to establish communications between two parties, who we'll refer to as the Client and the Server. The protocol consists of two sub-protocols called the handshake protocol and the record protocol. The handshake is intended to authenticate the two communicating parties and establish shared encryption keys between them. The record protocol uses those keys to exchange data securely.

For the purposes of this blog post, we're going to focus primarily on the handshake protocol, which has (at least) two major variants: the RSA handshake and the Diffie-Hellman handshake (ECDHE/DHE). These are illustrated below.

Simplified description of the TLS handshakes. Top: RSA handshake. Bottom: Diffie-Hellman handshake.
As much as I love TLS, the protocol is a hot mess. For one thing, it inherits a lot of awful cryptography from its ancient predecessors (SSLv1-3). For another, it's only really beginning to be subjected to rigorous, formal analysis.

All this means we're just now starting to uncover some of the bugs that have been present in the protocol since it was first designed. And we're likely to discover more! That's partly because this analysis is at a very early stage. It's also partly because, from an analyst's point of view, we're still trying to figure out exactly what the TLS handshake is supposed to do.

Well, what is the TLS handshake supposed to do?

Up until this result, we thought we had a reasonable understanding of the purpose of the TLS handshake. It was intended to authenticate one or both sides of the connection, then establish a shared cryptographic secret (called the Master Secret) that could be used to derive cryptographic keys for encrypting application data.

The first problem with this understanding is that it's a bit too simple. There isn't just one TLS handshake, there are several variants of it. Worse, multiple different handshake types can be used within a single connection.

The standard handshake flow is illustrated -- without crypto -- in the diagram below. In virtually every TLS connection, the server authenticates to the client by sending a public key embedded in a certificate. The client, for its part, can optionally authenticate itself by sending a corresponding certificate and proving it has the signing key. However this client authentication is by no means common. Many TLS connections are authenticated only in one direction.

Common TLS handshakes. Left: only server authenticates. Right: client and server both authenticate with certificates.
TLS also supports a "renegotiation" handshake that can switch an open connection from one mode to the other. This is typically used to change a connection that was authenticated only in one direction (Server->Client) into a connection that's authenticated in both directions. The server usually initiates renegotiation when the client has e.g., asked for a protected resource.

Renegotiating a session. A new handshake causes the existing connection to be mutually authenticated.
Renegotiation has had problems before. Back in 2009, Ray and Dispensa showed that a man-in-the-middle attacker could actually establish a (non-authenticated) connection with some server; inject some data; and when the server asks for authentication, the attacker could then "splice" on a real connection with an authorized client by simply forwarding the new handshake messages to the legitimate client. From the server's point of view, both communications would seem to be coming from the same (now authenticated) person:

Ray/Dispensa attack from 2009. The attacker first establishes an unauthenticated connection and injects some traffic ("drop table *"). When the server initiates a renegotiation for client authentication, the attacker forwards the new handshake messages to an honest client Alice who then sends real traffic. Since the handshakes are not bound together, Bob views this as a single connection to one party.
To fix this, a "secure renegotiation" band-aid to TLS was proposed. The rough idea of this extension was to 'bind' the renegotiation handshake to the previous handshake, by having the client present the "Finished" message of the previous handshake. Since the Finished value is (essentially) a hash of the Master Secret and the (hash of) the previous handshake messages, this allows the client to prove that it -- not an attacker -- truly negotiated the previous connection.

All of this brings us back to the question of what the TLS handshake is supposed to do.

You see, the renegotiation band-aid now adds some pretty interesting new requirements to the TLS handshake. For one thing, the security of this extension depends on the idea that (1) no two distinct handshakes will happen to use the same Master Secret, and (2) that no two handshakes will have the same handshake messages, ergo (3) no two handshakes will have the same Finished message.

Intuitively, this seemed like a pretty safe thing to assume -- and indeed, many other systems that do 'channel binding' on TLS connections also make this assumption. The 3Shake attack shows that this is not safe to assume at all.

So what's the problem here?

It turns out that TLS does a pretty good job of establishing keys with people you've authenticated. Unfortunately there's a caveat. It doesn't truly guarantee the established key will be unique to your connection. This is a pretty big violation of the assumptions that underlie the "secure renegotiation" fix described above.

For example: imagine that Alice is (knowingly) establishing a TLS connection to a server Mallory. It turns out that Mallory can simultaneously -- and unknown to Alice -- establish a different connection to a second server Bob. Moreover, if Mallory is clever, she can force both connections to use the same "Master Secret" (MS).

Mallory creates two connections that use the same Master Secret.
The first observation of the 3Shake attack is that this trick can be played if Alice supports either the RSA or DHE handshakes -- or both (it does not seem to work on ECDHE). Here's the RSA version:**

RSA protocol flow from the triple handshake attack (source). The attacker is in the middle, while the client and server are on the left/right respectively. MS is computed as a function of (pms, cr, sr) which are identical in both handshakes. 
So already we have a flaw in the logic undergirding secure renegotiation. The Master Secret (MS) values are not necessarily distinct between different handshakes.

Fortunately, the above attack does not let us resurrect the Ray/Dispensa injection attack. While the attacker has tricked the client into using a specific MS value, the handshake Finished messages -- which the client will attach to the renegotiation handshake -- will not be the same in both handshakes. That's because (among other things) the certificates sent on each connection were very different, hence the handshake hashes are not identical. In theory we're safe.
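
To see both halves of this at once -- identical inputs give an identical Master Secret, while the Finished values still track the transcript hash -- here's a sketch of the relevant TLS 1.2 derivations (RFC 5246's PRF, stripped down to the pieces that matter; no extended master secret):

    # TLS 1.2 PRF (P_SHA256) and the two derivations that matter for 3Shake.
    import os, hmac, hashlib

    def prf(secret, label, seed, length):
        out, a = b"", label + seed
        while len(out) < length:
            a = hmac.new(secret, a, hashlib.sha256).digest()
            out += hmac.new(secret, a + label + seed, hashlib.sha256).digest()
        return out[:length]

    pms, cr, sr = os.urandom(48), os.urandom(32), os.urandom(32)
    # Same (pms, cr, sr) on both connections -> same Master Secret:
    ms = prf(pms, b"master secret", cr + sr, 48)
    assert ms == prf(pms, b"master secret", cr + sr, 48)

    # But Finished depends on the handshake transcript, which includes the
    # (different) certificates -- so the two Finished values don't match:
    fin1 = prf(ms, b"client finished", hashlib.sha256(b"transcript with cert A").digest(), 12)
    fin2 = prf(ms, b"client finished", hashlib.sha256(b"transcript with cert B").digest(), 12)
    assert fin1 != fin2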

But here is where TLS gets awesome.

You see, there is yet another handshake I haven't told you about. It's called the "session resumption handshake", and it allows two parties who've previously established a master secret (and still remember it) to resume their session with new encryption keys. The advantage of resumption is that it uses no public-key cryptography or certificates at all, which is supposed to make it faster.



It turns out that if an attacker knows the previous MS and has caused it to be the same on both sides, it can now wait until the client initiates a session resumption. Then it can replay messages between the client and server in order to update both connections with new keys:

An attacker replays the session resumption handshake to ensure the same key on both sides. Note that the handshake messages are identical in both connections. (source)
Which brings us to the neat thing about this handshake. Not only is the MS the same on both connections, but both connections now see exactly the same (resumption) handshake messages. Hence the hash of these handshakes will be identical, which means in turn that their "Finished" message will be identical.

By combining all of these tricks, a clever attacker can pull off the following -- and absolutely insane -- "triple handshake" injection attack:

Triple handshake attack. The attacker mediates two handshakes that give the same MS on both sides, but two different handshake hashes. The resumption handshake leaves the same MS and an identical handshake hash on both sides. This means that the Finished message from the resumption handshake will be the same for the connections on either side of the attacker. Now he can hook up the two without anyone noticing that he previously injected traffic.
In the above scenario, an attacker first runs a (standard) handshake to force both sides of the connection to use the same MS. It then causes both sides to perform session resumption, which results in both sides using the same MS and having the same handshake hash and Finished messages on both sides. When the server initiates renegotiation, the attacker can forward the third (renegotiation) handshake on to the legitimate client as in the Ray/Dispensa attack -- secure in the knowledge that both client and server will expect the same Finished token.

And that's the ballgame.

What's the fix?

There are several, and you can read about them here.

One proposed fix is to change the derivation of the Master Secret such that it includes the handshake hash. This should wipe out most of the attacks above. Another fix is to bind the "session resumption" handshake to the original handshake that led to it.

Wait, why should I care about injection attacks?

You probably don't, unless you happen to be one of the critical applications that relies on the client authentication and renegotiation features of TLS. In that case, like most applications, you probably assumed that a TLS connection opened with a remote user was actually from that user the whole time, and not from two different users.

If you -- like most applications -- made that assumption, you might also forget to treat the early part of the connection (prior to client authentication) as a completely untrusted bunch of crap. And then you'd be in a world of hurt.

But don't take my word for it. There's video! (See here for the source, background and details):

[embedded video]


What does this have to do with the provable security of TLS?

Of all the questions 3Shake raises, this one is the most interesting. As I mentioned earlier, there have been several recent works that purport to prove things about the security of TLS. They're all quite good, so don't take any of this as criticism.

However, they (with one exception, the miTLS project) didn't find this attack. Why is that?

The first reason is simple: many of these works analyze only the basic TLS handshake, or they omit at least one of the possible handshakes (e.g., resumption). This means they don't catch the subtle interactions between the resumption handshake, the renegotiation handshake, and extensions -- all of which are the exact ingredients that make most TLS attacks possible.

The second problem is that we don't quite know what standard we're holding TLS to. For example, the common definition of security for TLS is called "Authenticated and Confidential Channel Establishment" (ACCE). Roughly speaking this ensures that two parties can establish a channel and that nobody will be able to determine what data is being sent over said channel.

The problem with ACCE is that it's a definition that was developed specifically so that TLS could satisfy it. As a result, it's necessarily weak. For example, ACCE does not actually require that each handshake produces a unique Master Secret -- one of the flaws that enables this attack -- because such a definition was not possible to achieve with the existing TLS protocol. In general this is what happens when you design a protocol first and prove things about it later.

What's the future for TLS? Can't we throw the whole thing out and start over again?

Sure, go ahead and make TLS Rev 2. It can strip out all of this nonsense and start fresh.

But before you get cocky, remember -- all these crazy features in TLS were put there for a reason. Someone wanted and demanded them. And sadly, this is the difference between a successful, widely-used protocol and your protocol.

Your new replacement for TLS might be simple and wonderful today, but that's only because nobody uses it. Get it out into the wild and before long it too will be every bit as crazy as TLS.

Notes:

* An earlier version of this post incorrectly identified the researchers who discovered the attack.

** The Diffie-Hellman (DHE) version is somewhat more clever. It relies on the attacker manipulating the D-H parameters such that they will force the client to use a particular key. Since DHE parameters sent down from the server are usually 'trusted' by TLS implementations, this trick is relatively easy to pull off.

Tuesday, April 8, 2014

Attack of the week: OpenSSL Heartbleed

Ouch. (Logo from heartbleed.com)
I start every lecture in my security class by asking the students to give us any interesting security or crypto news they've seen recently, preferably with a focus on vulnerabilities. The start of my last class was pretty lame, which meant either (1) we'd finally learned how to make our crypto software secure, or (2) something terrible was about to happen.

I'll let you take a guess which one it was.

The news today is all about the Heartbleed bug in OpenSSL (nb: a truly awful name that makes me glad there are no other vital organs referenced in the TLS code). Unlike all the fancy crypto-related attacks we've seen recently -- BEAST, CRIME, Lucky13, etc. -- Heartbleed doesn't require any interesting crypto. In fact it's the result of a relatively mundane coding error. And predictably, this makes it more devastating than all of those fancy attacks put together.

I'm going to talk about Heartbleed using the 'fun' question/answer format I save for this kind of post.

What's Heartbleed and why should I care about OpenSSL?

In case you haven't read the Heartbleed website, go do that. Here I'll just give a quick overview.

Heartbleed is a surprisingly small bug in a piece of logic that relates to OpenSSL's implementation of the TLS 'heartbeat' mechanism. The bug is present in OpenSSL versions 1.0.1 through 1.0.1f (and not in other versions). Sadly, these versions have seen a great deal of adoption lately, because security professionals have been urging developers to implement more recent versions of TLS (1.1 and 1.2). Deployment of TLS 1.2 currently stands at about 30% of the SSL Pulse data set. Many of those servers are likely vulnerable.

The problem is fairly simple: there's a tiny vulnerability -- a simple missing bounds check -- in the code that handles TLS 'heartbeat' messages. By abusing this mechanism, an attacker can request that a running TLS server hand over a relatively large slice (up to 64KB) of its private memory space. Since this is the same memory space where OpenSSL also stores the server's private key material, an attacker can potentially obtain (a) long-term server private keys, (b) TLS session keys, (c) confidential data like passwords, (d) session ticket keys.

Alleged Yahoo user credentials visible due to Heartbleed (source: Mark Loman).
Any of the above may allow an attacker to decrypt ongoing TLS sessions or steal useful information. However item (a) above is by far the worst, since an attacker who obtains the server's main private keys can potentially decrypt past sessions (if made using the non-PFS RSA handshake) or impersonate the server going forward. Worst of all, the exploit leaves no trace.

You should care about this because -- whether you realize it or not -- a hell of a lot of the security infrastructure you rely on is dependent in some way on OpenSSL. This includes many of the websites that store your personal information. And for better or for worse, industry's reliance on OpenSSL is only increasing.

What's the remedy?

Unfortunately it's pretty nasty.

You can test if a given server is vulnerable using one of these tools (note: use at your own risk). Having identified a problem, the first step is to patch OpenSSL. Fortunately this is relatively easy. The 1.0.1g version is not vulnerable, and Debian has a patch. You can also recompile OpenSSL with the -DOPENSSL_NO_HEARTBEATS option.

Sadly, this is only the beginning. Since there's no way to tell whether a server has been exploited (and exploit code is now in the wild) you need to assume that it is. This means the safe move is to revoke your certificate and get a new one. Have fun.

What's the bug?

The TLS Heartbeat mechanism is designed to keep connections alive even when no data is being transmitted. Heartbeat messages sent by one peer contain random data and a payload length. The other peer is supposed to respond with a mirror of exactly the same data.

   When a HeartbeatRequest message is received and sending a
   HeartbeatResponse is not prohibited as described elsewhere in this
   document, the receiver MUST send a corresponding HeartbeatResponse
   message carrying an exact copy of the payload of the received
   HeartbeatRequest.

The data structure of the request looks like this. Note the two-byte payload length:

   struct {
      HeartbeatMessageType type;
      uint16 payload_length;
      opaque payload[HeartbeatMessage.payload_length];
      opaque padding[padding_length]; 
   } HeartbeatMessage;

Which brings us to the bug. The original bug was introduced in this Git commit. The code appears in different files (for DTLS and TLS). Here we'll look at the file t1_lib.c:

2412         /* Read type and payload length first */
2413         hbtype = *p++;
2414         n2s(p, payload);
2415         pl = p;
2417         if (s->msg_callback)
2418                 s->msg_callback(0, s->version, TLS1_RT_HEARTBEAT,
2419                         &s->s3->rrec.data[0], s->s3->rrec.length,
2420                         s, s->msg_callback_arg);
2422         if (hbtype == TLS1_HB_REQUEST)
2423                 {
2424                 unsigned char *buffer, *bp;
2425                 int r;
2427                 /* Allocate memory for the response, size is 1 bytes
2428                  * message type, plus 2 bytes payload length, plus
2429                  * payload, plus padding
2430                  */
2431                 buffer = OPENSSL_malloc(1 + 2 + payload + padding);
2432                 bp = buffer;
2433                 
2434                 /* Enter response type, length and copy payload */
2435                 *bp++ = TLS1_HB_RESPONSE;
2436                 s2n(payload, bp);
2437                 memcpy(bp, pl, payload);
2438                 
2439                 r = ssl3_write_bytes(s, TLS1_RT_HEARTBEAT, buffer, 3 + payload + padding);

As you can see, the incoming (adversarially-generated) data contains a payload length ("payload") which is trusted without bounds checks. OpenSSL then allocates a buffer for its response, and copies "payload" data bytes from the pointer "pl" into it. Unfortunately, there's no check to make sure that there are actually "payload" bytes in data, or that this is in bounds. Hence the attacker gets a slice of data from main memory -- one that's up to 64KB in length.

The fix is equally simple. Just add a bounds check:

+    /* Read type and payload length first */
+    if (1 + 2 + 16 > s->s3->rrec.length)
+        return 0; /* silently discard */
+    hbtype = *p++;
+    n2s(p, payload);
+    if (1 + 2 + payload + 16 > s->s3->rrec.length)
+        return 0; /* silently discard per RFC 6520 sec. 4 */
+    pl = p;
+

Wasn't that dull?

Come on guys, next time please: use a cache timing weakness in AES-GCM encryption. This whole 'effectively breaking OpenSSL by finding simple C coding bugs' isn't even challenging.

Should we rail against the OpenSSL developers for this?

Don't even think about it. The OpenSSL team, which is surprisingly small, has been given the task of maintaining the world's most popular TLS library. It's a hard job with essentially no pay. It involves taking other folks' code (as in the case of Heartbeat) and doing a best-possible job of reviewing it. Then you hope that any bugs which slip through will be noticed and disclosed responsibly before disasters happen.

The OpenSSL developers have a pretty amazing record considering the amount of use this library gets and the quantity of legacy cruft and the number of platforms (over eighty!) they have to support. Maybe in the midst of patching their servers, some of the big companies that use OpenSSL will think of tossing them some real no-strings-attached funding so they can keep doing their job.

Wednesday, March 19, 2014

How do you know if an RNG is working?

No matter how much cryptographers accomplish, we're always building on a questionable foundation. (illustration: Marc S. Rousseau)
Last week, Edward Snowden spoke to a packed crowd at SXSW about the many problems (and limited solutions) facing those of us who want to keep our communications private. Snowden said a number of things -- including a shout out to Moxie's company Whisper Systems, who certainly deserve it. But instead of talking about that, I wanted to focus on (in my opinion) one of Snowden's most important quotes:
We need all those brilliant Belgian cryptographers to go "alright we know that these encryption algorithms we are using today work, typically it is the random number generators that are attacked as opposed to the encryption algorithms themselves. How can we make them [secure], how can we test them?"
Now it's possible I'm a little biased, but it seems to me this cuts to the core of our problems with building secure systems in an increasingly hostile world. Namely: most encryption relies on some source of "random" numbers, either to generate keys or (particularly in the case of public key encryption) to provide semantic security for our ciphertexts.

What this means is that an attacker who can predict the output of your RNG -- perhaps by taking advantage of a bug, or even compromising it at a design level -- can often completely decrypt your communications. The Debian project learned this firsthand, as have many others. This certainly hasn't escaped NSA's notice, if the allegations regarding its Dual EC random number generator are true.

All of this brings us back to Snowden's quote above, and the question he throws open for us. How do you know that an RNG is working? What kind of tests can we run on our code to avoid flaws ranging from the idiotic to the highly malicious? Unfortunately this question does not have an easy answer. In the rest of this post I'm going to try to explain why.

Background: Random and Pseudorandom Number Generation

I've written quite a bit about random number generation on this blog, but before we go forward it's worth summarizing a few basic facts about random number generation.

First off, the 'random' numbers we use in most deployed cryptographic systems actually come from two different systems:
  1. A 'true' random number generator (or entropy generator) that collects entropy from the physical world. This can include entropy collected from low-level physical effects like thermal noise and shot noise, or it can include goofy stuff like mouse movements and hard disk seek times.
  2. An algorithmic 'pseudorandom number generator' (PRNG) that typically processes the output of (1) to both stretch the output to provide more bits and, in some cases, provide additional security protections in case the output of (1) proves to be biased.
It's important to note that pseudorandom number generators aren't "random number generators" at all. These generators typically use cryptographic algorithms (e.g., block ciphers or hash functions) to process a seed value from the RNG into many apparently random-looking and unpredictable bytes.
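
To make this concrete, here's a minimal sketch of the idea -- not the Linux design, not Intel's, and not any NIST construction, just an illustration I've named toy_prng -- that stretches a 32-byte seed into arbitrarily many output bytes by encrypting an incrementing counter with AES-256 (via OpenSSL's EVP API; cleanup and most error handling omitted):

#include <stdint.h>
#include <string.h>
#include <openssl/evp.h>

typedef struct {
    EVP_CIPHER_CTX *ctx;
    uint8_t counter[16];
} toy_prng;

/* Seed with 32 bytes drawn from a true RNG / entropy collector. */
int toy_prng_seed(toy_prng *g, const uint8_t seed[32])
{
    memset(g->counter, 0, sizeof(g->counter));
    g->ctx = EVP_CIPHER_CTX_new();
    if (g->ctx == NULL)
        return 0;
    return EVP_EncryptInit_ex(g->ctx, EVP_aes_256_ecb(), NULL, seed, NULL);
}

/* Produce output by encrypting successive counter values under the seed key. */
int toy_prng_generate(toy_prng *g, uint8_t *out, size_t len)
{
    uint8_t block[16];
    int outl;

    while (len > 0) {
        size_t n = len < 16 ? len : 16;

        /* Increment the 128-bit counter (big-endian). */
        for (int i = 15; i >= 0 && ++g->counter[i] == 0; i--)
            ;
        if (!EVP_EncryptUpdate(g->ctx, block, &outl, g->counter, 16))
            return 0;
        memcpy(out, block, n);
        out += n;
        len -= n;
    }
    return 1;
}

In practice the seed would come from the entropy collector in (1), and a careful implementation would also reseed itself periodically.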

In most cases, it's quite rare for your application to ever see the raw output of a true random number generator.* Even the low-level entropy collector within Linux's RNG uses cryptographic constructs like hash functions in order to 'mix' the output of various entropy sources. To produce the bits delivered by /dev/random or /dev/urandom, Linux then seeds a PRNG like Yarrow or Fortuna.**

Another similar pattern occurs inside of the Intel "secure key" random number generator included in Intel Ivy Bridge processors. When you buy one of these processors, you're getting (free!) a hardware '1-shot' circuit that collects high-entropy electronic noise, which is then measured and processed into useful RNG output. The design looks like this:

Hardware random number generator used on Intel Ivy Bridge processors. Left: the '1-shot' circuit used to collect physical entropy. Right: the data flow from generator to output, including health checks and PRNG computation. (source)
Once again, with Intel's design you (i.e., the application developer) don't get access to this raw randomness. It's first used to seed a PRNG based on AES (CTR-DRBG from NIST SP800-90A). What you actually get as an application developer is the processed output of that algorithm.

In practice, this typical design has some implications. On the positive side, the presence of a PRNG means that the underlying RNG circuit can get pretty borked (e.g., biased) without the results being detectable by your application. On the negative side, the underlying RNG circuit can get pretty borked without the results being detectable in your application.

In other words, with only a few ugly glitches -- things that can happen in real life -- you can easily get a broken random number generator that nobody notices until it's way too late. And that's without deliberate tampering, which makes things way, way worse.

Which brings us back to our fundamental question: how do systems know that their RNG is working? This turns out to be a question without a perfect answer.

Statistical Tests

If you look at the literature on random number generators, you'll find a lot of references to statistical randomness testing suites like Diehard or NIST's SP 800-22. The gist of these systems is that they look at the output of an RNG and run tests to determine whether the output is, from a statistical perspective, "good enough" for government work (very literally, in the case of the NIST suite).

The nature of these tests varies. Some look at simple factors like bias (the number of 1s and 0s) while others look for more sophisticated features such as the distribution of numbers when mapped into 3-D space.
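
For a flavor of what the simplest of these look like in code, here's a sketch of a bias ("monobit") check in the spirit of the NIST suite's frequency test -- the four-standard-deviation threshold below is a back-of-the-envelope cut-off, not the exact statistic NIST specifies:

#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Count the ones in the sample; return 1 if the count looks balanced,
 * 0 if it is suspiciously biased. */
int monobit_test(const uint8_t *buf, size_t len)
{
    long ones = 0;

    for (size_t i = 0; i < len; i++)
        for (int b = 0; b < 8; b++)
            ones += (buf[i] >> b) & 1;

    double bits = (double)len * 8;
    /* For unbiased bits the standard deviation of 'ones' is sqrt(bits)/2;
     * reject anything more than ~4 standard deviations from bits/2. */
    return fabs(ones - bits / 2) <= 2 * sqrt(bits);
}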

Now I don't want to knock these tests. They're a perfectly valid way to detect serious flaws in a (true) RNG -- I can attest to this, since I've built one that failed the tests miserably -- but they probably won't detect flaws in your deployed system. That's because, as I said above, most deployed systems include a combination of RNG and PRNG, or even RNG plus "conditioning" via cryptographic hash functions or ciphers. The nature of these cryptographic, algorithmic processes is such that virtually every processed output will pass statistical tests with flying colors -- even if the PRNG is initialized with 'garbage' input.

This means, unfortunately, that it can be very hard to use statistical tests to detect a broken RNG unless you apply them directly to the raw, low-level output of the true RNG -- before any cryptographic conditioning. And even there you won't rule out intentional backdoors -- as I'll discuss in a moment.

Known Answer Tests (KATs)

Assuming that you've tested your true RNG properly and it's passing all tests, it's still important to test your PRNG. One approach to doing this is to use Known Answer Tests (KATs) that are essentially test vectors. These contain some input seed material as well as a set of output bytes that should be the algorithmic result of running the PRNG on that seed.
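
In code, a KAT is about as simple as testing gets: seed the generator with a fixed value, run it, and compare against output computed ahead of time. A sketch against the toy_prng from earlier (the "expected" bytes are a placeholder to be precomputed from the fixed seed, not a vector from any standard):

#include <stdint.h>
#include <string.h>

static const uint8_t kat_seed[32]     = { 0 };  /* fixed, publicly known seed */
static const uint8_t kat_expected[32] = { 0 };  /* placeholder: fill in the precomputed output */

/* Returns 1 if the PRNG reproduces the known answer, 0 otherwise. */
int prng_known_answer_test(void)
{
    toy_prng g;
    uint8_t out[32];

    if (!toy_prng_seed(&g, kat_seed) ||
        !toy_prng_generate(&g, out, sizeof(out)))
        return 0;
    return memcmp(out, kat_expected, sizeof(out)) == 0;
}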

Since PRNGs are purely algorithmic, the theory here is that you can test them like algorithms. While this approach is valid, it raises two potential issues (both of which I've seen in practice).

First, you can only test your PRNG on so many points. Thus it's quite possible that your PRNG will succeed on one particular test vector (i.e., produce exactly the expected output bytes) but go completely off the rails on some other input. This is unlikely, but not impossible in normal conditions. It's very possible if someone is trying to build a malicious backdoor into your PRNG implementation.

Second, the process of instrumenting your PRNG implementation for testing can actually introduce vulnerabilities into your deployed system! Think about this for a second. Normal PRNGs take in real random seeds from your RNG. The last thing you'd ever want to do is run your PRNG on some predictable seed -- if you did, everyone would be able to predict the PRNG's outputs. Yet adding a test harness to your system means building in logic to re-seed your RNG to something predictable!

This is like adding an ejection seat to your car. Might make you safer -- unless it goes off while you're driving to work.

A quick glance through, e.g., the OpenSSL code shows that indeed, exactly this sort of code exists and ships in some versions of the library. Of course, experienced developers will note that such features could surely be surrounded by pre-processor directives (or the equivalent in your language of choice), ensuring that they'll never be activated in production code. Sadly, at least in the case of FIPS, this is not possible -- for reasons I'll explain next.
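
For concreteness, the instrumentation in question looks roughly like the sketch below (the flag and function names are hypothetical, not anything OpenSSL actually ships): a deterministic-seed path guarded by a build flag -- which, as just noted, FIPS won't let you compile out.

#ifdef PRNG_SELF_TEST   /* hypothetical build flag */
#include <stdint.h>

static const uint8_t test_seed[32] = { 0 };  /* fixed, publicly known seed */

/* Re-seed the generator deterministically so its output can be compared
 * against a known answer. If this path ever runs in production, every
 * "random" byte the application sees becomes predictable. */
void prng_init_for_test(toy_prng *g)
{
    toy_prng_seed(g, test_seed);
}
#endif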

Runtime Health Checks

Another approach to testing RNGs is to test them while the system is running. This isn't intended to rule out design-level flaws (as the above statistical and KAT tests are) but it is intended to catch situations where the RNG becomes broken during normal operation. This can occur for a variety of reasons, e.g., manufacturing defects, system damage, and even exposure to outside radiation.

Health checks can take different forms. FIPS 140, for example, mandates that all approved RNGs be tested at startup time using KATs. (This is why you can't make your test harness conditional on compilation flags -- it must ship in your production code!) It subsequently mandates a runtime health check that verifies the generator has not become 'stuck', i.e., is not spitting out the same bytes over and over again.
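
The 'stuck' check itself is tiny. A sketch of its general shape (this mirrors the FIPS 140-2 continuous test in spirit, not its exact wording): hold on to the previous output block and fail if the next one is identical.

#include <stdint.h>
#include <string.h>

#define RNG_BLOCK 16

static uint8_t prev_block[RNG_BLOCK];
static int have_prev;

/* Returns 1 if the block looks fine, 0 if the generator appears stuck. */
int continuous_rng_test(const uint8_t block[RNG_BLOCK])
{
    if (have_prev && memcmp(block, prev_block, RNG_BLOCK) == 0)
        return 0;                               /* repeated block: fail */
    memcpy(prev_block, block, RNG_BLOCK);
    have_prev = 1;
    return 1;
}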

While I'm sure this last test may have saved someone, somewhere, it seems totally inappropriate and useless when applied to the output of an RNG/PRNG pair, which is how NIST recommends it be used. This is because even the most broken algorithmic PRNGs will almost never spit out duplicate values -- even if the underlying RNG fails completely.

The upshot of this decision is that NIST (FIPS) recommends a check that will almost never succeed in catching anything useful from a PRNG, but does introduce a whole bunch of extra logic that can suffer from flaws and/or malicious circumvention. I'm sure the good folks at NIST realize this, but they recommend it anyway -- after all, what else are they going to do?

Tampering

Which brings us to the $10 million question. What happens if an attacker is deliberately tampering with our RNG/PRNG in order to make it fail? Note that this is not an academic question. We have excellent reason to believe it's happened in some real systems.

Just for fun, let's go back to the Intel Ivy Bridge RNG described above. We'll take a look specifically at the PRNG portion of the design, which uses the NIST CTR-DRBG random number generator with AES:

Portion of the Intel Ivy Bridge design, with a few annotations added by yours truly. (original source)
The CTR-DRBG design relies on two features. First, an AES key is selected at random along with some input seed. This pair goes into the AES cipher, where it is processed to derive a new key and data. The result should be unpredictable to most attackers.

But if you were able to change the way keys were updated (in the key_in_mux highlighted above) so that instead of updating the key and/or using an unpredictable one, it chose a fixed key known to the attacker, you would now have a very powerful backdoor. Specifically, the output would still look statistically perfectly random. But an attacker who knows this key could simply decrypt one block of RNG output to obtain all future and past outputs of the generator until the next time it was reseeded.
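
To see how little the attacker would need, consider the toy counter-mode generator sketched earlier and suppose its key were a fixed value known to the attacker (this simplification ignores the real CTR-DRBG's key-update step entirely). One decryption of a single observed output block recovers the counter, and with key and counter in hand the attacker can regenerate every other block until the next reseed:

#include <stdint.h>
#include <openssl/evp.h>

/* Recover the generator's internal counter from one observed output block,
 * assuming a fixed, attacker-known AES-256 key. */
int recover_counter(const uint8_t fixed_key[32],
                    const uint8_t observed_block[16],
                    uint8_t counter_out[16])
{
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int outl, ok = 0;

    if (ctx != NULL &&
        EVP_DecryptInit_ex(ctx, EVP_aes_256_ecb(), NULL, fixed_key, NULL) &&
        EVP_CIPHER_CTX_set_padding(ctx, 0) &&
        EVP_DecryptUpdate(ctx, counter_out, &outl, observed_block, 16))
        ok = 1;                 /* counter_out now holds the internal state */
    EVP_CIPHER_CTX_free(ctx);
    return ok;
}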

Note that I am not saying the Intel system has a backdoor in it -- far from it. I'm only considering how easily it might be made to have one if you were an attacker with control of Intel's fabrication plants (or their microcode updates). And this is hardly Intel's fault. It's just the nature of this particular RNG design. Others could be just as vulnerable.

Actually using this knowledge to attack applications would be more complex, since many system-level RNGs (including the Linux Kernel RNG) combine the output of the RNG with other system entropy (through XOR, unfortunately, not hashing). But Intel has pushed hard to see their RNG output used directly, and there exist plugins for OpenSSL that allow you to use it similarly. If you used such a method, these hypothetical flaws could easily make their way all the way into your cryptography.

Designing against these issues

Unfortunately, so far all I've done is call out the challenges with building trustworthy RNGs. And there's a reason for this: the challenges are easy to identify, while the solutions themselves are hard. And unfortunately at this time, they're quite manual.

Building secure RNG/PRNGs still requires a combination of design expertise, careful low-level (true) RNG testing -- using expert design and statistical tests -- and the use of certified algorithms with proper tests. All of the techniques above contribute to building a secure RNG, but none of them are quite sufficient.

Solving this problem, at least in software, so we can ensure that code is correct and does not contain hidden 'easter eggs', represents one of the more significant research challenges facing those of us who depend on secure cryptographic primitives. I do hope some enterprising graduate students will give these issues the attention they deserve.

Notes:

* Though there are some exceptions. See, for example, this FIPS certified smart card that included a bad RNG which was used to generate cryptographic secrets. In general FIPS disallows this except for a very small number of approved RNGs. Perhaps this was one.

** The original version of this post claimed that /dev/random seeds /dev/urandom. This is actually a mistake -- both /dev/random and /dev/urandom use the same PRNG, but /dev/random simply keeps track of how much 'entropy' is in the pool and blocks when you have drawn too many bits. Thanks to Brendan Long and Thomas Ptacek for setting me straight.