Tuesday, April 8, 2014

Attack of the week: OpenSSL Heartbleed

Ouch. (Logo from heartbleed.com)
I start every lecture in my security class by asking the students to give us any interesting security or crypto news they've seen recently, preferably with a focus on vulnerabilities. The start of my last class was pretty lame, which meant either (1) we'd finally learned how to make our crypto software secure, or (2) something terrible was about to happen.

I'll let you take a guess which one it was.

The news today is all about the Heartbleed bug in OpenSSL (nb: a truly awful name that makes me glad there are no other vital organs referenced in the TLS code). Unlike all the fancy crypto-related attacks we've seen recently -- BEAST, CRIME, Lucky13, etc. -- Heartbleed doesn't require any interesting crypto. In fact it's the result of a relatively mundane coding error. And predictably, this makes it more devastating than all of those fancy attacks put together.

I'm going to talk about Heartbleed using the 'fun' question/answer format I save for this kind of post.

What's Heartbleed and why should I care about OpenSSL?

In case you haven't read the Heartbleed website, go do that. Here I'll just give a quick overview.

Heartbleed is a surprisingly small bug in a piece of logic that relates to OpenSSL's implementation of the TLS 'heartbeat' mechanism. The bug is present in OpenSSL versions 1.0.1 through 1.0.1f (and not in other versions). Sadly, these versions have seen a great deal of adoption lately, because security professionals have been urging developers to implement more recent versions of TLS (1.1 and 1.2). Deployment of TLS 1.2 currently stands at about 30% of the SSL Pulse data set. Many of those servers are likely vulnerable.

The problem is fairly simple: there's a tiny vulnerability -- a simple missing bounds check -- in the code that handles TLS 'heartbeat' messages. By abusing this mechanism, an attacker can request that a running TLS server hand over a relatively large slice (up to 64KB) of its private memory space. Since this is the same memory space where OpenSSL also stores the server's private key material, an attacker can potentially obtain (a) long-term server private keys, (b) TLS session keys, (c) confidential data like passwords, (d) session ticket keys.

Alleged Yahoo user credentials visible due to Heartbleed (source: Mark Loman).
Any of the above may allow an attacker to decrypt ongoing TLS sessions or steal useful information. However item (a) above is by far the worst, since an attacker who obtains the server's main private keys can potentially decrypt past sessions (if made using the non-PFS RSA handshake) or impersonate the server going forward. Worst of all, the exploit leaves no trace.

You should care about this because -- whether you realize it or not -- a hell of a lot of the security infrastructure you rely on is dependent in some way on OpenSSL. This includes many of the websites that store your personal information. And for better or for worse, industry's reliance on OpenSSL is only increasing.

What's the remedy?

Unfortunately it's pretty nasty.

You can test if a given server is vulnerable using one of these tools (note: use at your own risk). Having identified a problem, the first step is to patch OpenSSL. Fortunately this is relatively easy. The 1.0.1g version is not vulnerable, and Debian has a patch. You can also recompile OpenSSL with the -DOPENSSL_NO_HEARTBEATS option.

Sadly, this is only the beginning. Since there's no way to tell whether a server has been exploited (and exploit code is now in the wild) you need to assume that it has been. This means the safe move is to revoke your certificate and get a new one. Have fun.

What's the bug?

The TLS Heartbeat mechanism is designed to keep connections alive even when no data is being transmitted. Heartbeat messages sent by one peer contain random data and a payload length. The other peer is supposed to respond with a mirror of exactly the same data.

   When a HeartbeatRequest message is received and sending a
   HeartbeatResponse is not prohibited as described elsewhere in this
   document, the receiver MUST send a corresponding HeartbeatResponse
   message carrying an exact copy of the payload of the received
   HeartbeatRequest.

The data structure of the request looks like this. Note the two-byte payload length:

   struct {
      HeartbeatMessageType type;
      uint16 payload_length;
      opaque payload[HeartbeatMessage.payload_length];
      opaque padding[padding_length]; 
   } HeartbeatMessage;

Which brings us to the bug. The original bug was introduced in this Git commit. The code appears in different files (for DTLS and TLS). Here we'll look at the file t1_lib.c:

        /* Read type and payload length first */
        hbtype = *p++;
        n2s(p, payload);
        pl = p;

        if (s->msg_callback)
                s->msg_callback(0, s->version, TLS1_RT_HEARTBEAT,
                        &s->s3->rrec.data[0], s->s3->rrec.length,
                        s, s->msg_callback_arg);

        if (hbtype == TLS1_HB_REQUEST)
                {
                unsigned char *buffer, *bp;
                int r;

                /* Allocate memory for the response, size is 1 bytes
                 * message type, plus 2 bytes payload length, plus
                 * payload, plus padding
                 */
                buffer = OPENSSL_malloc(1 + 2 + payload + padding);
                bp = buffer;

                /* Enter response type, length and copy payload */
                *bp++ = TLS1_HB_RESPONSE;
                s2n(payload, bp);
                memcpy(bp, pl, payload);

                r = ssl3_write_bytes(s, TLS1_RT_HEARTBEAT, buffer, 3 + payload + padding);

As you can see, the incoming (adversarially-generated) data contains a payload length ("payload") which is trusted without bounds checks. OpenSSL then allocates a buffer for its response, and copies "payload" data bytes from the pointer "pl" into it. Unfortunately, there's no check to make sure the received record actually contains "payload" bytes of data, or that the copy stays within bounds. Hence the attacker gets a slice of the server process's memory -- one that's up to 64KB in length.
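To make the mismatch concrete, here's a rough sketch (in Python) of the kind of heartbeat record an attacker can send: a tiny actual payload with a wildly inflated payload_length field. The framing below is simplified -- it's just the HeartbeatMessage body from the struct above, not the full TLS record -- so treat it as an illustration rather than working exploit code.

import struct

TLS1_HB_REQUEST = 0x01

def evil_heartbeat(declared_len=0xFFFF, actual_payload=b"hi"):
    # HeartbeatMessage body: type (1 byte) || payload_length (2 bytes) || payload.
    # An honest request sets payload_length == len(actual_payload). The attack
    # lies: it declares ~64KB while shipping two bytes, so the unpatched
    # memcpy() above reads ~64KB past the end of the real payload.
    return struct.pack("!BH", TLS1_HB_REQUEST, declared_len) + actual_payload

msg = evil_heartbeat()
print(len(msg), "bytes sent; an unpatched server echoes back", 0xFFFF, "bytes")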

The fix is equally simple. Just add a bounds check:

+    /* Read type and payload length first */
+    if (1 + 2 + 16 > s->s3->rrec.length)
+        return 0; /* silently discard */
+    hbtype = *p++;
+    n2s(p, payload);
+    if (1 + 2 + payload + 16 > s->s3->rrec.length)
+        return 0; /* silently discard per RFC 6520 sec. 4 */
+    pl = p;
+

Wasn't that dull?

Come on guys, next time please: use a cache timing weakness in AES-GCM encryption. This whole 'effectively breaking OpenSSL by finding simple C coding bugs' isn't even challenging.

Should we rail against the OpenSSL developers for this?

Don't even think about it. The OpenSSL team, which is surprisingly small, has been given the task of maintaining the world's most popular TLS library. It's a hard job with essentially no pay. It involves taking other folks' code (as in the case of Heartbeat) and doing a best-possible job of reviewing it. Then you hope that any bugs which slip through will be noticed and disclosed responsibly before disaster strikes.

The OpenSSL developers have a pretty amazing record considering the amount of use this library gets and the quantity of legacy cruft and the number of platforms (over eighty!) they have to support. Maybe in the midst of patching their servers, some of the big companies that use OpenSSL will think of tossing them some real no-strings-attached funding so they can keep doing their job.

Wednesday, March 19, 2014

How do you know if an RNG is working?

No matter how much cryptographers accomplish, we're always
building on a questionable foundation. (illustration: Marc S. Rousseau)
Last week, Edward Snowden spoke to a packed crowd at SXSW about the many problems (and limited solutions) facing those of us who want to keep our communications private. Snowden said a number of things -- including a shout out to Moxie's company Whisper Systems, who certainly deserve it. But instead of talking about that, I wanted to focus on (in my opinion) one of Snowden's most important quotes:
We need all those brilliant Belgian cryptographers to go "alright we know that these encryption algorithms we are using today work, typically it is the random number generators that are attacked as opposed to the encryption algorithms themselves. How can we make them [secure], how can we test them?"
Now it's possible I'm a little biased, but it seems to me this cuts to the core of our problems with building secure systems in an increasingly hostile world. Namely: most encryption relies on some source of "random" numbers, either to generate keys or (particularly in the case of public key encryption) to provide semantic security for our ciphertexts.

What this means is that an attacker who can predict the output of your RNG -- perhaps by taking advantage of a bug, or even compromising it at a design level -- can often completely decrypt your communications. The Debian project learned this firsthand, as have many others. This certainly hasn't escaped NSA's notice, if the allegations regarding its Dual EC random number generator are true.

All of this brings us back to Snowden's quote above, and the question he throws open for us. How do you know that an RNG is working? What kind of tests can we run on our code to avoid flaws ranging from the idiotic to the highly malicious? Unfortunately this question does not have an easy answer. In the rest of this post I'm going to try to explain why.

Background: Random and Pseudorandom Number Generation

I've written quite a bit about random number generation on this blog, but before we go forward it's worth summarizing a few basic facts about random number generation.

First off, the 'random' numbers we use in most deployed cryptographic systems actually come from two different systems:
  1. A 'true' random number generator (or entropy generator) that collects entropy from the physical world. This can include entropy collected from low-level physical effects like thermal noise and shot noise, or it can include goofy stuff like mouse movements and hard disk seek times.
  2. An algorithmic 'pseudorandom number generator' (PRNG) that typically processes the output of (1) to both stretch the output to provide more bits and, in some cases, provide additional security protections in case the output of (1) proves to be biased.
It's important to note that pseudorandom number generators aren't "random number generators" at all. These generators typically use cryptographic algorithms (e.g., block ciphers or hash functions) to process a seed value from the RNG into a long stream of apparently random, unpredictable bytes.
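To illustrate just the 'stretching' part -- and only that; this is a toy, not any standardized DRBG -- here are a few lines of Python showing how a hash function can expand a short seed into as many output bytes as you like:

import hashlib

def toy_prng(seed: bytes, nbytes: int) -> bytes:
    # Toy hash-based generator: output block i = SHA-256(seed || counter_i).
    # Anyone who learns the seed can reproduce every byte -- which is the
    # whole point, and also the whole danger.
    out = bytearray()
    counter = 0
    while len(out) < nbytes:
        out += hashlib.sha256(seed + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:nbytes])

stream = toy_prng(b"\x01" * 16, 1024)   # 16 'entropy' bytes in, 1KB of output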

In most cases, it's quite rare for your application to ever see the raw output of a true random number generator.* Even the low-level entropy collector within Linux's RNG uses cryptographic constructs like hash functions in order to 'mix' the output of various entropy sources. To produce the bits you read from /dev/random or /dev/urandom, Linux then seeds a PRNG like Yarrow or Fortuna.**

Another similar pattern occurs inside of the Intel "secure key" random number generator included in Intel Ivy Bridge processors. When you buy one of these processors, you're getting (free!) a hardware '1-shot' circuit that collects high-entropy electronic noise, which is then measured and processed into useful RNG output. The design looks like this:

Hardware random number generator used on Intel Ivy Bridge processors. Left: the '1-shot' circuit used to collect physical entropy. Right: the data flow from generator to output, including health checks and
PRNG computation. (source).
Once again, with Intel's design you (i.e., the application developer) don't get access to this raw randomness. It's first used to seed a PRNG based on AES (CTR-DRBG from NIST SP800-90A). What you actually get as an application developer is the processed output of that algorithm.

In practice this typical design has some implications. On the positive side, the presence of a PRNG means that the underlying RNG circuit can get pretty borked (e.g., biased) without the results being detectable by your application. On the negative side, the underlying RNG circuit can get pretty borked without the results being detectable in your application.

In other words, with only a few ugly glitches -- things that can happen in real life -- you can easily get a broken random number generator that nobody notices until it's way too late. And that's without deliberate tampering, which makes things way, way worse.

Which brings us back to our fundamental question: how do systems know that their RNG is working? This turns out to be a question without a perfect answer.

Statistical Tests

If you look at the literature on random number generators, you'll find a lot of references to statistical randomness testing suites like Diehard or NIST's SP 800-22. The gist of these systems is that they look at the output of an RNG and run tests to determine whether the output is, from a statistical perspective, "good enough" for government work (very literally, in the case of the NIST suite.)

The nature of these tests varies. Some look at simple factors like bias (the number of 1s and 0s) while others look for more sophisticated features such as the distribution of numbers when mapped into 3-D space.
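To give a flavor of the simplest case, here's a bare-bones frequency (monobit) check in Python. It counts ones versus zeros and asks whether the imbalance is plausible for a fair source; the real NIST suite wraps this same idea in a lot more machinery.

import math

def monobit_pvalue(bits: str) -> float:
    # Map bits to +/-1, sum, and normalize. Very small p-values suggest bias.
    n = len(bits)
    s = sum(1 if b == "1" else -1 for b in bits)
    s_obs = abs(s) / math.sqrt(n)
    return math.erfc(s_obs / math.sqrt(2))

print(monobit_pvalue("1" * 900 + "0" * 100))   # tiny p-value: badly biased
print(monobit_pvalue("10" * 500))              # p = 1.0: no evidence of bias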

Now I don't want to knock these tests. They're a perfectly valid way to detect serious flaws in a (true) RNG -- I can attest to this, since I've built one that failed the tests miserably -- but they probably won't detect flaws in your system. That's because like I said above, most deployed systems include a combination of RNG and PRNG, or even RNG plus "conditioning" via cryptographic hash functions or ciphers. The nature of these cryptographic, algorithmic processes is such that virtually every processed output will pass statistical tests with flying colors -- even if the PRNG is initialized with 'garbage' input.

This means, unfortunately, that it can be very hard to use statistical tests to detect a broken RNG unless you apply them directly to the raw, low-level output. And even there you won't rule out intentional backdoors -- as I'll discuss in a moment.

Known Answer Tests (KATs)

Assuming that you've tested your true RNG properly and it's passing all tests, it's still important to test your PRNG. One approach to doing this is to use Known Answer Tests (KATs) that are essentially test vectors. These contain some input seed material as well as a set of output bytes that should be the algorithmic result of running the PRNG on that seed.
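In code, a KAT is about as dull as it sounds. Here's a minimal sketch of the pattern; for illustration it tests the hash function a PRNG might be built on, using the published FIPS 180-2 "abc" vector for SHA-256. A real DRBG KAT works the same way, just with seed material and expected output bytes taken from the algorithm's specification.

import hashlib

# Published test vector for SHA-256 (the FIPS 180-2 "abc" example).
KAT_INPUT = b"abc"
KAT_EXPECTED = bytes.fromhex(
    "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad")

def run_kats() -> bool:
    # A pass proves the code computes this function correctly on this input --
    # and says nothing about other inputs, or about the quality of the seeds
    # you'll actually feed it in production.
    return hashlib.sha256(KAT_INPUT).digest() == KAT_EXPECTED

assert run_kats()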

Since PRNGs are purely algorithmic, the theory here is that you can test them like algorithms. While this approach is valid, it raises two potential issues (both of which I've seen in practice).

First, you can only test your PRNG on so many points. Thus it's quite possible that your PRNG will succeed on one particular test vector (i.e., it'll output the expected bytes) but go completely off the reservation on some other input. This is unlikely, but not impossible in normal conditions. It's very possible if someone is trying to build a malicious backdoor into your PRNG implementation.

Second, the process of instrumenting your PRNG implementation for testing can actually introduce vulnerabilities in your deployed system! Think about this for a second. Normal PRNGs take in real random seeds from your RNG. The last thing you'd ever want to do is run your PRNG on some predictable seed -- if you did, everyone would be able to predict the PRNG's outputs. Yet adding a test harness to your system means building in logic to re-seed your RNG to something predictable!

This is like adding an ejection seat to your car. Might make you safer -- unless it goes off while you're driving to work.

A quick glance through, e.g., the OpenSSL code shows that indeed, exactly this sort of code exists and ships in some versions of the library. Of course, experienced developers will note that such features could surely be surrounded by pre-processor directives (or the equivalent in your language of choice) ensuring that they'll never be activated in production code. Sadly, at least in the case of FIPS, this is not possible -- for reasons I'll explain next.

Runtime Health Checks

Another approach to testing RNGs is to test them while the system is running. This isn't intended to rule out design-level flaws (as the above statistical and KAT tests are) but it is intended to catch situations where the RNG becomes broken during normal operation. This can occur for a variety of reasons, e.g., manufacturing defects, system damage, and even exposure to outside radiation.

Health checks can take different forms. FIPS 140, for example, mandates that all approved RNGs be tested at startup time using KATs. (This is why you can't make your test harness conditional on compilation flags -- it must ship in your production code!) It subsequently mandates a runtime health check that verifies the generator has not become 'stuck', i.e., is spitting out the same bytes over and over again.
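For what it's worth, the 'stuck' test itself is trivial to sketch. Here's the spirit of it in Python (not the letter of the FIPS continuous test): compare each new output block against the previous one and refuse to hand out data if they ever match.

import os

class CheckedRNG:
    # Wraps a raw entropy source (os.urandom stands in here) with a continuous
    # health check: two identical consecutive blocks mean the source is
    # presumed broken.
    def __init__(self, source=os.urandom, block_size=16):
        self.source = source
        self.block_size = block_size
        self.last = self.source(block_size)   # compare-only block, never output

    def read(self) -> bytes:
        block = self.source(self.block_size)
        if block == self.last:
            raise RuntimeError("RNG health check failed: repeated output block")
        self.last = block
        return block

rng = CheckedRNG()
data = rng.read()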

While I'm sure this last test may have saved someone, somewhere, it seems totally inappropriate and useless when applied to the output of an RNG/PRNG pair, which is how NIST recommends it be used. This is because even the most broken algorithmic PRNGs will almost never spit out duplicate values -- even if the underlying RNG fails completely.

The upshot of this decision is that NIST (FIPS) recommends a check that will almost never succeed in catching anything useful from a PRNG, but does introduce a whole bunch of extra logic that can suffer from flaws and/or malicious circumvention. I'm sure the good folks at NIST realize this, but they recommend it anyway -- after all, what else are they going to do?

Tampering

Which brings us to the $10 million question. What happens if an attacker is deliberately tampering with our RNG/PRNG in order to make it fail? Note that this is not an academic question. We have excellent reason to believe it's happened in some real systems.

Just for fun, let's go back to the Intel Ivy Bridge RNG described above. We'll take a look specifically at the PRNG portion of the design, which uses the NIST CTR-DRBG random number generator with AES:

Portion of the Intel Ivy Bridge design, with a few annotations added by yours truly. (original source)
The CTR-DRBG design relies on two features. First, an AES key is selected at random along with some input seed. This pair goes into the AES cipher, where it is processed to derive a new key and data. The result should be unpredictable to most attackers.

But if you were able to change the way keys were updated (in the highlighted key_in_mux) so that instead of updating the key and/or using an unpredictable one, it chose a fixed key known to the attacker, you would now have a very powerful backdoor. Specifically, the output would still look statistically perfectly random. But an attacker who knows this key could simply decrypt one block of RNG output to obtain all future and past outputs of the generator until the next time it was reseeded.
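Here's a deliberately simplified sketch of why a known key is so devastating in a counter-style design. This is a toy model -- AES applied to an incrementing state, not the actual CTR-DRBG update function -- and it assumes the third-party 'cryptography' Python package for AES.

from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

FIXED_KEY = b"\x00" * 16   # the attacker's known key: this *is* the backdoor

def aes_block(key: bytes, block: bytes, decrypt=False) -> bytes:
    c = Cipher(algorithms.AES(key), modes.ECB())
    ctx = c.decryptor() if decrypt else c.encryptor()
    return ctx.update(block) + ctx.finalize()

def generate(state: int, nblocks: int):
    # Toy generator: output block i = AES_K(state + i). The output looks
    # statistically perfect to anyone who doesn't hold K.
    return [aes_block(FIXED_KEY, (state + i).to_bytes(16, "big"))
            for i in range(nblocks)]

# The attacker sees ONE output block, decrypts it with the known key,
# recovers the internal counter, and can now reproduce every past and
# future output until the generator is reseeded.
victim_outputs = generate(state=123456789, nblocks=4)
leaked = victim_outputs[1]
recovered = int.from_bytes(aes_block(FIXED_KEY, leaked, decrypt=True), "big")
assert generate(recovered - 1, 4) == victim_outputs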

Note that I am not saying the Intel system has a backdoor in it -- far from it. I'm only considering how easily it might be made to have one if you were an attacker with control of Intel's fabrication plants (or their microcode updates). And this is hardly Intel's fault. It's just the nature of this particular RNG design. Others could be just as vulnerable.

Actually using this knowledge to attack applications would be more complex, since many system-level RNGs (including the Linux Kernel RNG) combine the output of the RNG with other system entropy (through XOR, unfortunately, not hashing). But Intel has pushed hard to see their RNG output used directly, and there exist plugins for OpenSSL that allow you to use it similarly. If you used such a method, these hypothetical flaws could easily make their way all the way into your cryptography.

Designing against these issues

Unfortunately, so far all I've done is call out the challenges with building trustworthy RNGs. And there's a reason for this: the challenges are easy to identify, while the solutions themselves are hard. And unfortunately at this time, they're quite manual.

Building secure RNG/PRNGs still requires a combination of design expertise, careful low-level (true) RNG testing -- using expert design and statistical tests -- and the use of certified algorithms with proper tests. All of the techniques above contribute to building a secure RNG, but none of them are quite sufficient.

Solving this problem, at least in software, so we can ensure that code is correct and does not contain hidden 'easter eggs', represents one of the more significant research challenges facing those of us who depend on secure cryptographic primitives. I do hope some enterprising graduate students will give these issues the attention they deserve.

Notes:

* Though there are some exceptions. See, for example, this FIPS certified smart card that included a bad RNG which was used to generate cryptographic secrets. In general FIPS disallows this except for a very small number of approved RNGs. Perhaps this was one.

** The original version of this post claimed that /dev/random seeds /dev/urandom. This is actually a mistake -- both /dev/random and /dev/urandom use the same PRNG, but /dev/random simply keeps track of how much 'entropy' is in the pool and blocks when you have drawn too many bits. Thanks to Brendan Long and Thomas Ptacek for setting me straight.

Thursday, February 20, 2014

Cryptographic obfuscation and 'unhackable' software

I have a thing for over-the-top cryptography headlines -- mostly because I enjoy watching steam come out of researchers' ears when their work gets totally misrepresented. And although I've seen quite a few good ones, last week WIRED managed a doozy.

The headline in question, Cryptography Breakthrough Could Make Software Unhackable, managed to accomplish something that few cryptography headlines do. It sent its own protagonist, Amit Sahai, into the comments section to perform intellectual garbage pickup.

In contrast to the headline, which is quite bad, the article is actually pretty decent. Still, the discussion around it has definitely led to some confusion. As a result, many now think an amazing breakthrough has taken place -- one that will finally make software secure. They're probably right about the first part. They may be disappointed about the rest.

The truth, as usual, is complicated. There is, indeed, something very neat going on with the new obfuscation results. They're just not likely to make software 'unhackable' anytime soon. They might, however, radically expand what we can do with cryptography. Someday. When they're ready. And in ways we don't fully understand yet.

But before I go into all that, it's probably helpful to give some background.

Program obfuscation

The Wired article deals with the subject of 'program obfuscation', which is a term that software developers and cryptographers have long been interested in. The motivation here is pretty simple: find a way that we can give people programs they can run -- without letting them figure out how the programs work.

Note that the last part necessarily covers a lot of ground. In principle it includes aspects ranging from the nature of specific secret algorithms used -- which may be proprietary and thus worth money -- to secret information like passwords and cryptographic keys that might be hardcoded into the program.

For a simple example, consider the following routine:

// Check a password and print out top secret information if it's correct
//
SuperSecretPasswordProtectedStuff(string passwd) {
  if (passwd == "0xt438fh27266629zn28366492923aai3jnqobbyc4t!") {
    print("Congratulations. Here's some super secret private information: ....\n");
  } else {
    print("Wrong password, fool.\n");
  }
}

If you're like me, you probably wrote a program like this at some point in your life. You may have eventually realized how ineffective it would be -- against a smart attacker who could dump the program code. This is because most programs (and even compiled binaries) are pretty easy to read. An attacker could just look at this program to recover the secret password.

Program obfuscation is motivated by the idea that many useful programs would benefit if we could somehow 'stop' people from doing this, while still letting them possess and run the code on their own computers.

In real world software systems, 'obfuscation' usually refers to a collection of ad-hoc techniques that turn nice, sensible programs into a morass of GOTOs and spaghetti code. Sometimes important constants are chopped up and distributed around the code. Some portions of the code may even be encrypted -- though only temporarily, since decryption keys must ship with the program so it can actually be run. Malware authors and DRM folks love this kind of obfuscation.
A chunk of 'birken', one of the winning entries in the 2013 obfuscated C contest.
While these techniques are nice, they're not really 'unhackable' or even secure against reverse-engineering in the strong sense we'd like. Given enough time, brains and tools, you can get past most common software obfuscation techniques. If you have, say, Charlie Miller or Dion Blazakis on your side, you'll probably get through quickly. This isn't just because those guys are smart (though they are), it's because the techniques are not mathematically rigorous. Most practical obfuscation amounts to roadblocks -- designed to slow reverse-engineers down and make them give up.

So what does it mean to securely 'obfuscate' a program?

The poor quality of existing software obfuscation set cryptographers up with a neat problem. Specifically, they asked, could we formulate a strong definition of program obfuscation that would improve on, to put it politely, the crap people were actually using? Given such an obfuscator I could hand you my obfuscated program to run while provably protecting all partial information -- except for the legitimate inputs and outputs.

If this could be done, it would have an incredible number of applications. Stock traders could obfuscate their proprietary trading algorithms and send them to the cloud or to end-customers. The resulting programs would still work -- in the sense that they produced the right results -- but customers would never learn anything about 'how they worked'. The internals of the program would be secret sauce.

Unfortunately, when the first researchers started looking at this problem, they ran into a very serious problem: nobody had any idea what it meant to securely 'obfuscate' a program in the first place.

Do think about it for a second before you roll your eyes. What do you think it means to obfuscate a program? You're probably thinking something like 'people shouldn't learn stuff' about the program. But can you explain what stuff? And does 'stuff' depend on the program? What about programs where you can efficiently learn the whole program from sending it inputs and seeing the results? What does it mean to obfuscate those?

Clearly before progress could begin on solving the problem, cryptographers needed to sit down and figure out what they were trying to do. And indeed, several cryptographers immediately began to do exactly this.

Black-box cryptographic obfuscation

The first definitions cryptographers came up with addressed a very powerful type of obfuscation called 'virtual black box obfuscation'. Roughly speaking, it starts from the following intuitive thought experiment.

Imagine you have a program P, as well as some code obfuscation technique you've developed. It's important that the obfuscation be efficient, meaning it doesn't slow the program down too much. To determine whether the obfuscation is successful, you can conduct the following experiment.
  1. Give Alice a copy of the obfuscated program code Obf(P), in a form that she can look at and run on her own computer.               
  2. Take the original (unobfuscated) program P, and seal it in a special computer located inside an armored 'black box'. Let a user Sam interact with the program by sending it inputs and receiving outputs. But don't let him access the code.
Roughly speaking, what we want from secure obfuscation is that Alice should learn 'no more' information after seeing the obfuscated program code, than could some corresponding user Sam who interacts with the same program locked up in the black box.

What's nice here is that we have the beginnings of an intuitive definition. In some sense the obfuscation should render the program itself basically unintelligible -- Alice should not get any more information from seeing the obfuscated program than Sam could get simply by interacting with its input/output interface. If Sam can 'learn' how the program works just by talking to it, even that's ok. What's not ok is for Alice to learn more than Sam.

The problem with this intuition is, of course, that it's hard to formalize. One thing you may have noticed is that the user Alice who views the obfuscated program truly does learn something more than the user Sam, who only interacts with it via the black box. When the experiment is over, the user who got the obfuscated program still has a copy of the obfuscated program. The user who interacted with the black box does not.

To give a practical example, let's say this program P is a certificate signing program that has a digital signing key hard-coded inside of it -- say, Trustwave's -- that will happily attach valid digital signatures on certificates (CSRs) of the user's devising. Both Alice and Sam can formulate certificate signing requests and send them into the program (either the obfuscated copy or the version in the 'black box') and get valid, signed certificates out the other end.

But here's the thing: when the experiment is over and we shut down Sam's access to the black box, he won't be able to sign any more certificates. On the other hand, Alice, who actually learned the obfuscated program Obf(P), will still have it! This means she can keep signing certificates forevermore. Knowing a program that can sign arbitrary certificates is a pretty darn significant 'piece' of information for Alice to have!

Worse, there's no way the 'black box' user Sam can ever learn a similar piece of information -- at least not if the signature scheme is any good. If Sam could ask the black box for a bunch of signatures, then somehow learn a program that could make more signatures, this would imply a fundamental weakness in the digital signature scheme. This cuts against a basic property we expect of secure signatures.

Barak et al. proposed a clever way to get around this problem -- rather than outputting an arbitrary piece of information, Alice and Sam would be restricted to outputting a single bit at the end of their experiment (i.e., computing a predicate function). This helps to avoid the problems above and seems generally like the "right" definition.

An impossibility result

Having proposed this nice definition, Barak et al. went on to do an irritating thing that cryptographers sometimes do: they proved that even their definition doesn't work. Specifically, they showed that there exist programs that simply can't be obfuscated under this definition.

The reason is a little wonky, but I'm going to try to give the flavor for it below.

Imagine that you have two programs A and B, where A is similar to our password program above. That is, it contains a hard-coded, cryptographic secret which we'll denote by password. When you run A(x), the program checks whether (x == password) and if so, it outputs a second cryptographically strong password, which we'll cleverly denote by password_two. If you run A on any other input, it doesn't output anything.

Now imagine the second program B works similarly to the first one. It contains both password and password_two hardcoded within it. A major difference is that B doesn't take a string as input. It takes another computer program. You feed a program in as input to B, which then:
  1. Executes the given program on input password to get a result r
  2. If (r == password_two), it outputs a secret bit.
It should be apparent to you that if you run the program B on input the program A -- that is, you compute B(A) -- it will always output the secret bit.* It doesn't even matter if either B or A has been obfuscated, since obfuscation doesn't change the behavior of the programs.
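If it helps, here's a hedged little sketch of the two programs in Python. The 'passwords' are stand-ins generated at random (in the actual proof they're unguessable cryptographic strings), and passing the Python function object stands in for handing over the program's code.

import secrets

password = secrets.token_hex(16)       # hard-coded secrets in the real programs
password_two = secrets.token_hex(16)
SECRET_BIT = 1

def A(x):
    # Outputs password_two only on the single magic input.
    return password_two if x == password else None

def B(prog):
    # Takes a *program* as input, runs it on the hidden password, and
    # releases the secret bit if the result checks out.
    return SECRET_BIT if prog(password) == password_two else None

print(B(A))   # always prints 1 -- and obfuscating A or B doesn't change that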

At the same time, Barak et al. pointed out that this trick only works if you actually have code for the program A. If I give you access to a black box that contains the programs A, B and will simply let you query them on chosen inputs, you're screwed. As long as password and password_two are cryptographically strong, you won't be able to get A to output anything in any reasonable amount of time, and hence won't be able to get B to output the secret bit.

What this means is that Alice, who gets the obfuscated copies of both programs, will always learn the secret bit. But if Sam is a reasonable user (that is, he can't run long enough to brute-force the passwords) he won't be able to learn the secret bit. Fundamentally, the obfuscated pair of programs always gives more information than black-box access to them.

It remains only to show that the two programs can be combined into one program that still can't be obfuscated. And this is what Barak et al. did, completing their work and showing that general programs can't be obfuscated using their definition.

Now you can be forgiven for looking askance at this. You might, for example, point out that the example above only shows that it's hard to obfuscate a specific and mildly ridiculous program. Or that this program is some kind of weird exception. But in general it's not. There are many other useful programs that also can't be obfuscated, particularly when you try to use them in real systems. These limitations were quickly pointed out by Barak et al., and later in other practical settings by Goldwasser and Kalai, among others.

You might also think that obfuscation is plain impossible. But here cryptography strikes again -- it's never quite that simple.

We can obfuscate some things!

Before we completely write off black box obfuscation, let's take a moment to go back to the password checking program I showed at the beginning of this post.

Here's a slightly simpler version:

// Check a password and return true if it's correct, false otherwise
//
bool SuperSecretPasswordProtectedStuff(string passwd) {
  if (passwd == "0xt438fh27266629zn28366492923aai3jnqobbyc4t!") {
    return true;
  } else {
    return false;
  }
}

This function is an example of a 'point function': that is, a function that returns false on most inputs, but returns true at exactly one point. As you can see, there is exactly one point (the correct password) that makes this routine happy.

Point functions are interesting for two simple reasons: the first is that we use them all the time in real systems -- password checking programs being the obvious example. The second is that it turns out we can obfuscate them, at least under some strong assumptions. Even better, we can do it in a way that should be pretty familiar to real system designers.

Let H be a secure hash function -- we'll get back to what that means in a second -- and consider the following obfuscated password checking routine:

// Check a password and return 'true' if it's correct, 'false' otherwise
// But do it all *obfuscated* and stuff
//
bool ObfuscatedSuperSecretPasswordProtectedStuff(string passwd) {
  static string HARDCODED_SALT          = 0x......; // this is a salt value
  static string HARDCODED_PASSWORD_HASH = 0x......; // this value is H(salt + target password)

  // First, hash the input password with the salt
  hashedPass = H(HARDCODED_SALT + passwd);

  if (hashedPass == HARDCODED_PASSWORD_HASH) {
    return true;
  } else {
    return false;
  }
}

Note that our 'obfuscated' program no longer stores the plaintext of the password. Instead we store its hash only (and salt), and wrap it inside a program that simply compares this hash to a hash of the user's input. This should be familiar, since it's the way you should be storing password files today.**
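For the concrete-minded, here's that same 'obfuscated point function' as a few lines of runnable Python using an ordinary hash. The salt and stored hash would be generated once, when the password is set; the constant-time comparison is just good hygiene.

import hashlib, hmac, os

# Produced once, at 'obfuscation' (password-setting) time.
SALT = os.urandom(16)
PASSWORD_HASH = hashlib.sha256(
    SALT + b"0xt438fh27266629zn28366492923aai3jnqobbyc4t!").digest()

def obfuscated_check(passwd: bytes) -> bool:
    # The program carries only SALT and PASSWORD_HASH -- never the password.
    candidate = hashlib.sha256(SALT + passwd).digest()
    return hmac.compare_digest(candidate, PASSWORD_HASH)

print(obfuscated_check(b"guess"))                                          # False
print(obfuscated_check(b"0xt438fh27266629zn28366492923aai3jnqobbyc4t!"))   # True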

A number of cryptographers looked at formulations like this and showed the following: if the password is very hard to guess (for example, it's secret and drawn at random from an exponentially-sized space of passwords) and the hash function is 'strong' enough, then the hash-checking program counts as a secure 'obfuscation' of the basic password comparison program above it.

The intuition for this is fairly simple: imagine Alice and Sam don't know the password. If the password is strong, i.e., it's drawn from a large enough space, then their probability of ever getting the program to output 'true' is negligible. If we assume an ideal hash function -- in the simplest case, a random oracle -- then the hard-coded hash value that Alice learns is basically just a useless random string, and hides all partial information about the real password. This means, in practice, that any general question Alice can answer at the end of the experiment, Sam can answer with about the same probability.***

Finding better definitions

More than a decade after the definition was formulated, there are basically two kinds of results about 'strong' virtual-black-box obfuscation. The first set shows that it's impossible to do it for general programs, and moreover, that many of the interesting functions we want to obfuscate (like some signatures and pseudorandom functions) can't be obfuscated in this powerful sense.

The second class of results shows that we can black-box obfuscate certain functions, but only very limited ones like, say, point functions and re-encryption. These results are neat, but they're hardly going to set the world on fire.

What cryptographers have been trying to achieve since then is a different -- and necessarily weaker -- definition that could capture interesting things we wanted from obfuscating general programs, but without all the nasty impossibility results.

And that, finally, brings us to the advances described in the WIRED article.

Indistinguishability Obfuscation

In shooting down their own proposed definitions, Barak et al. also left some breadcrumbs towards a type of obfuscation that might actually work for general programs. They called their definition "indistinguishability obfuscation" (IO) and roughly speaking it says the following:

Imagine we have two programs C1, C2 -- which we'll describe as similarly-sized circuits -- that compute the same function. More concretely, let's say they have exactly the same input/output behavior, although they may be implemented very differently inside. The definition of indistinguishability obfuscation states that it should be possible to obfuscate the two circuits C1, C2 such that no efficient algorithm will be able to tell the difference between Obf(C1) and Obf(C2).

While this idea was proposed years ago, nobody actually knew how to build any such thing, and it was left as one of those 'open problems' that cryptographers love to tear their hair out over. This remained the case until just last year, when a group of authors from UCLA, UT Austin and IBM Research proposed a 'candidate construction' for building such obfuscators, based on the new area of multilinear map-based cryptography.

Another interesting variant of this notion is called extractability obfuscation (EO), which implies not only that you can't distinguish between Obf(C1) and Obf(C2), but moreover, that if you could distinguish the two, then you could necessarily find an input value on which C1 and C2 produce different outputs. Moreover, other work indicates IO and EO give essentially the 'best possible' obfuscation you can provide for general programs.

The question you're probably asking is: so what? What can we do with indistinguishability obfuscation?

And this is where the WIRED article differs substantially from the reality. The truth is that IO will probably bring us major cryptographic and software advances -- it's already been shown to bring about succinct functional encryption for all circuits, as well as advances in deniable encryption. Moreover it can be used to build known forms of encryption such as public-key encryption based on new techniques that differ from what we use today -- for example, by obfuscating 'symmetric' primitives like pseudorandom functions.****

And if this doesn't seem quite as exciting as the WIRED article would imply, that's because we're still learning what the possibilities are. The new techniques could, for example, allow us to build exciting new types of security systems -- or else show us radical new ways to build systems that we already have. Even the latter is very important, in case advances in our field result in 'breaks' to our existing constructions.

What IO and EO will probably not do is make programs unhackable, since that implies something a whole lot stronger than either of these techniques currently provide. In fact, it's not clear what the heck you'd need to make programs unhackable.

And the reason for that is quite simple: even obfuscated software can still suck.

Notes:

Thanks to Susan Hohenberger and Zooko for help with this post. 

* The bit in the Barak et al. proof doesn't have to be secret.

** With a pretty damn big caveat, which is that in real life (as opposed to cryptographic examples) users pick terrible passwords, which makes your password hashes vulnerable to dictionary attacks. We have a variety of countermeasures to slow down these attacks, including salt and computationally intensive password hashing functions.

*** God I hate footnotes. But it's worth unpacking this. Imagine that Alice learns 'some information' from seeing an average program containing the hash of a strong password. If the password is strong, then Alice can only guess the right input password with negligible probability. Then Sam can 'use' Alice to get the same information as follows. He formulates a 'fake' program that contains a completely bogus password hash and mails it to Alice. Whatever information she learns from his program, he takes and outputs as the information he 'learned'. It remains to show that if the password hash is strong (a random oracle), Alice isn't going to be able to learn more from a random hash than she would from the hash of a strong password.

**** The basic idea here is that symmetric encryption (like AES, say) can't be used for public key encryption, since anyone who learns your encryption key also learns your decryption key. But if you can obfuscate an encryption program that contains your symmetric encryption key, then the new program becomes like a public key. You can hand it out to people to encrypt with.

Saturday, January 25, 2014

A letter from US security researchers

This week a group of more than fifty prominent security and cryptography researchers signed a letter protesting the mass surveillance efforts of the NSA, and attempts by NSA to weaken cryptography and privacy protections on the Internet. The full letter can be found here.

Most of you have already formed your own opinions on the issue over the past several months, and it's unlikely that one letter is going to change that. Nonetheless, I'd like a chance to explain why this statement matters.

For academic professionals in the information security field, the relationship with NSA has always been a bit complicated. However, for the most part the public side of that relationship has been generally positive. Up until 2013 if you'd asked most US security researchers for their opinions on NSA, you would, of course, have heard a range of views. But you also might have heard notes of (perhaps grudging) respect. This is because many of the NSA's public activities have been obviously in everyone's interest -- helping to fund research and secure our information systems.

Even where evidence indicated the possibility of unfair dealing, most researchers were content to dismiss these allegations as conspiracy theories. We believed the NSA would stay between the lines. Putting backdoors into US information standards was possible, of course. But would they do it? We thought nobody would be that foolish. We were wrong.

In my opinion this letter represents more than just an appeal to conscience. It measures the priceless trust and goodwill the NSA has lost -- and continues to lose while our country fails to make serious reforms to this agency.

While I'm certain the NSA itself will survive this loss of faith in the short term, in the long term our economic and electronic security depend very much on the cooperation of academia, industry and private citizens. The NSA's actions have destroyed this trust. And ironically, that makes us all less safe.

Saturday, December 28, 2013

A few more notes on NSA random number generators

Last Friday, Joseph Menn from Reuters published an article claiming that RSA, the pioneering security firm and division of EMC, accepted $10 million to include the Dual EC random number generator as the default in their flagship BSAFE library. I've written a bit about Dual EC on this blog, so readers will know that I don't think highly of it. For one thing, it's a rotten choice for a commercial product, due to its lousy software performance. Much worse: it may introduce a practical backdoor into any system that uses it.

Given the numerous problems with Dual EC it's baffling that RSA would select this particular generator as the default for their software. It's particularly strange given that BSAFE is designed for use in Java-based and embedded systems where performance truly is at a premium. And none of this can be explained by the needs of a single RSA customer, since those needs could be satisfied merely by making BSAFE an option, rather than the default.

Of course, many people have viewed RSA's decision as potentially malicious. Unsupported rumors have been floating around since long before Reuters got involved. What's new this time is that RSA insiders appear to be going on the record.

I suppose time will tell how the industry reacts to this news. In the interim I have a few small facts to add to the discussion, so I thought I'd sketch them out in this post.

#1: Dual_EC_DRBG's 'backdoor' was known as of January 2005

It's widely believed that the 'vulnerability' in Dual_EC was first identified by Microsoft employees Dan Shumow and Niels Ferguson in the summer of 2007. Tanja Lange (and nymble) recently tipped me off to the fact that this isn't precisely true.

In point of fact, the possibility of a backdoor was known to at least some members of the ANSI X9.82 standardization committee as far back as January 2005. This surprising news comes via a patent application filed by Certicom employees Dan Brown and Scott Vanstone. The application claims a priority date of January 2005. Here's the scary bit:
If P and Q are established in a security domain controlled by an administrator, and the entity who generates Q for the domain does so with knowledge of e (or indirectly via knowledge of d). The administrator will have an escrow key for every ECRNG that follows that standard. 
Escrow keys are known to have advantages in some contexts. They can provide a backup functionality. If a cryptographic key is lost, then data encrypted under that key is also lost. However, encryption keys are generally the output of random number generators. Therefore, if the ECRNG is used to generate the encryption key K, then it may be possible that the escrow key e can be used to recover the encryption key K. Escrow keys can provide other functionality, such as for use in a wiretap. In this case, trusted law enforcement agents may need to decrypt encrypted traffic of criminals, and to do this they may want to be able to use an escrow key to recover an encryption key.
...
For example, in the SSL and TLS protocols, which are used for securing web (HTTP) traffic, a client and server perform a handshake in which their first actions are to exchange random values sent in the clear.
The patent also describes a number of ways to close the backdoor in Dual_EC_DRBG. Indeed, it may be due to Brown and Vanstone that the NIST standard includes an alternative method to close the backdoor (by generating a random Q point).

The existence of this patent does not mean that Brown and Vanstone were responsible for Dual EC. In fact, the generator appears to be an NSA invention, and may date back to the early 2000s. What this patent demonstrates is that some members of the ANSI committee, of which RSA was also a member, had reason to at least suspect that Dual EC could be used to create a wiretapping backdoor. (Update: John Kelsey confirms this.) It would be curious to know how widely this information was shared, and whether anyone on the committee ever inquired as to the provenance of the default parameters.

#2. Dual_EC_DRBG is not really a NIST standard.

This is hardly a secret, but it's something that hasn't been widely acknowledged in the press. Dual_EC_DRBG is generally viewed as a NIST standard, since it was published in NIST Special Publication 800-90. But that's not the only place it appears, nor was it developed at NIST.

A complete history of Dual_EC_DRBG would begin with NSA's drive to include it in the ANSI X9.82 DRBG standard, with a standardization process kicked off in the early 2000s. The draft ANSI standard includes Dual_EC_DRBG with all of the known parameters, along with several additional elliptic curve parameters that were not included in the NIST standards.

Members of the ANSI X9F1 Tool Standards and Guidelines Group which wrote ANSI X9.82.
However you'll also find Dual_EC_DRBG in the international standard ISO 18031. In other words, Dual EC was widely propagated in numerous standards long before NIST finalized it, and it's hardly limited to US borders. The ANSI and ISO drafts are in some sense worse, since they don't include a technique for generating your own Q parameter.

#3. Dual_EC_DRBG is not the only asymmetric random number generator in the ANSI and ISO standards.

Cryptographers generally think of Dual EC as the only 'public key' random number generator to be widely standardized. We also point to NSA's generation of the public parameters as evidence that Dual_EC may be compromised.

But in fact, the ANSI X9.82 and ISO standards each include a second generator based on public key cryptographic techniques. And like Dual EC, this one ships with a complete set of default parameters! The additional generator is based on an algorithm due to Micali and Schnorr, and relies for its security on assumptions related to the hardness of factoring large composite numbers. It requires an RSA-type modulus, several of which are conveniently provided in the specification.

Two default MS-DRBG moduli from the ISO 18031 specification.
There's no reason to suspect that MS-DRBG is used by any real products, let alone that there's a backdoor in the standard. In fact, there's a curious fact about it: it's not obvious from the public literature how one would attack the generator even if one knew the factorization of the n values above -- though it seems intuitively likely that an attack does exist. Solving the problem, and finding a practical attack for known factorization, would be a fun project for an enthusiastic mathematician.

Since MS-DRBG comes from the same people who brought you Dual EC, if you are using it you might want to think twice.

Friday, December 27, 2013

Can hackers decrypt Target's PIN data?

Short answer: probably not.

Slightly longer answer: it depends on whether they have access to the encryption key, or to a machine that contains the encryption key.

In case you have no idea what I'm talking about: there was recently a massive credit card breach at Target. If you're like many people you probably heard about this three times. First in the news, then again in your email when Target notified you that you were a victim, and finally a third time when you checked your credit card bill. Not a proud day for our nation's retailers.

The news got a bit messier today when Target announced the thieves had also managed to get their hands on the PIN numbers of unfortunate debit card customers. But this time there's a silver lining: according to Target, the PIN data was encrypted under a key the hackers don't have.
Snyder said PIN data is encrypted at a retail location’s keypad with Triple-DES [3DES] encryption and that data remains encrypted over the wire until it reaches its payment processor. Attackers would have to have compromised the point-of-sale system and intercepted the PIN data before it is encrypted in order to have accessed it.
Several folks on Twitter have noted that 3DES is no spring chicken, but that's not very important. Aside from a few highly impractical attacks, there isn't much to worry about with 3DES. Moreover, PCI standards appear to mandate unique keys for every payment terminal, which means that the attackers would need to compromise the terminals themselves, or else break into the back-end payment processor. If Target is to be believed, this has not happened.

Others have pointed out that PINs are pretty short. For example, there are only 10,000 4-digit PINs -- so surely the attackers can "brute-force" through this space to figure out your PIN. The good news is that encryption is decidedly not the same thing as password hashing, which means this is unlikely to be a serious concern. Provided that Target is being proactive and makes sure to change the keys now.

Of course you shouldn't take my word for this. It helps to take a quick look at the PCI PIN encryption standards themselves. Before you encrypt a 4-digit PIN, the PIN is first processed and in some cases padded to increase the complexity of the data being encrypted. There are four possible encryption formats:
  • Format 0. XOR the PIN number together with the Primary Account Number (PAN), usually the rightmost twelve digits of the card number, not including the last digit. Then encrypt the result using 3DES in ECB mode.
  • Format 1. Concatenate the PIN number with a unique transaction number and encrypt using 3DES in ECB mode.
  • Format 2. Pad with some fixed (non-random) padding, then encrypt in 3DES/ECB with a unique per-transaction key derived via DUKPT (Derived Unique Key Per Transaction). Update: only used for EMV cards.
  • Format 3. Pad with a bunch of random bytes, then 3DES/ECB encrypt.
Notice that in each case the encryption is ECB mode, but in Formats 0, 1 and 3 the plaintext has been formatted in such a way that two users with PIN "1234" are unlikely to encrypt exactly the same value under the same key. For example, consider the Format 0 encryptions for two users with the same PIN (1234) but different PANs:
(PIN) 0x1234FFFFFFFF ⊕ (PAN) 0x937492492032 = 0x81406DB6DFCD
(PIN) 0x1234FFFFFFFF ⊕ (PAN) 0x274965382343 = 0x357D9AC7DCBC
Notice that the values being encrypted (at right) will be quite different. ECB mode has many flaws, but one nice feature is that the encryption of two different values (even under the same key) should lead to effectively unrelated ciphertexts. This means that even an attacker who learns the user's PAN shouldn't be able to recover the PIN from the encrypted block without knowledge of the key. Their best hope would be to gain access to the terminal, hope that it was still configured to use the same key, and build a dictionary -- encrypting every possible PIN under a specific user's PAN -- before they could learn anything useful about that user's PIN.
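To see why the per-user PAN matters, here's a rough Python sketch of Format 0 and the dictionary attack just described. It follows the full ISO 9564-1 layout (including the length nibble that the abbreviated example above skips), and the terminal key and the hash-based stand-in for 3DES-ECB are made up purely for illustration.

import hashlib

def format0_block(pin: str, pan: str) -> bytes:
    # (control nibble 0, PIN length, PIN digits, F-padding) XORed with
    # (0000 + rightmost 12 PAN digits, excluding the check digit).
    pin_field = bytes.fromhex(("0" + format(len(pin), "x") + pin).ljust(16, "f"))
    pan_field = bytes.fromhex("0000" + pan[-13:-1])
    return bytes(a ^ b for a, b in zip(pin_field, pan_field))

TERMINAL_KEY = b"\x01" * 16   # hypothetical terminal key

def encrypt(block: bytes) -> bytes:
    # Stand-in for 3DES-ECB under the terminal key; the shape of the attack
    # is the same whatever the block cipher actually is.
    return hashlib.sha256(TERMINAL_KEY + block).digest()[:8]

# An attacker who has the terminal key (or the terminal itself) and a victim's
# PAN can build a 10,000-entry dictionary and look the PIN right up.
victim_pan = "4000001234567899"
dictionary = {encrypt(format0_block(f"{p:04d}", victim_pan)): f"{p:04d}"
              for p in range(10000)}
stolen = encrypt(format0_block("1234", victim_pan))
print(dictionary[stolen])   # "1234"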

This does not seem practical.

The one exception to the above rule is Format 2, which does not add any unpredictable padding to the plaintext at all. The PIN is padded out, but there are still exactly 10,000 possible plaintexts going into the encryption. PCI deals with this by mandating that the payment terminal derive a unique key per transaction, hopefully using a secure key derivation function. Update: this one probably isn't used by Target.

All of this is a long, dull way of saying that encryption is not like password hashing. Provided that you can keep the keys secret, it's perfectly fine to encrypt messages drawn from even small message spaces -- like PINs -- provided you're not an idiot about it. The PCI standards clearly skirt the borders of idiocy, but they mostly steer clear of disaster.

So in summary, Target debit card users are probably just fine. Until tomorrow, when we learn that the thieves also have the decryption keys. Then we can panic.

Friday, December 20, 2013

An update on Truecrypt

Several people have been asking for an update on our public audit of the Truecrypt disk encryption software. I'm happy to say that the project is on track and proceeding apace. Here I wanted to give a few quick updates:
  1. Thanks to the amazingly generous donations of 1,434 individual donors from over 90 countries, as of today, we've collected $62,104 USD and 32.6 BTC* towards this effort. This is an unbelievable response and I can't thank our donors enough. I'm blown away that this is happening.
  2. We've assembled a stellar technical advisory board to make sure we spend this money properly and generally to keep us honest. More details shortly.
  3. In order to make best use of the donated funds and manage on-going governance of the project, we've incorporated as a non-profit corporation in North Carolina—the Open Crypto Audit Project (OCAP)—and are currently seeking 501c(3) tax-exempt designation. Board members include myself, Kenn White (who has been doing most of the heavy organizational lifting) and the amazing Marcia Hoffman. We have high hopes that OCAP will find a purpose beyond this Truecrypt audit.
  4. The Open Technology Fund has generously agreed to donate a substantial amount of contracted evaluation time to our effort.
  5. And finally, the most exciting news: we've signed a first contract with iSEC partners to evaluate large portions of the Windows software and bootloader code. This review will begin in January.
Despite the progress above, there's still a lot of work to do. The iSEC review will cover a lot of the thorniest bits of the code, but we are still working to audit the core cryptographic routines of Truecrypt and move the project onto a secure (deterministic) build footing. We hope to have further announcements in the next few weeks.

Let me add one more personal note.

I usually take a pretty skeptical attitude on this blog when it comes to Internet security. For the most part we do things wrong, and I used to think most people didn't care. The fact is that I was wrong. If the response to our audit call is any evidence, you do care. You care a lot.
Donations
I can't tell you how amazed I am that any of this is happening. As far as I know, this is the first time that the Internet has come together in this way for the purposes of making us all a bit safer. I hope it's the beginning of a trend. 

More updates to come.

* Incidentally, determining the dollar value of BTC is fun, fun fun. We've been trying to responsibly sell these at the 'best' price. But, ugh.