Thursday, February 20, 2014

Cryptographic obfuscation and 'unhackable' software

I have a thing for over-the-top cryptography headlines -- mostly because I enjoy watching steam come out of researchers' ears when their work gets totally misrepresented. And although I've seen quite a few good ones, last week WIRED managed a doozy.

The headline in question, Cryptography Breakthrough Could Make Software Unhackable, managed to accomplish something that few cryptography headlines do. It sent its own protagonist, Amit Sahai, into the comments section to perform intellectual garbage pickup.

In contrast to the headline, which is quite bad, the article is actually pretty decent. Still, the discussion around it has definitely led to some confusion. As a result, many now think an amazing breakthrough has taken place -- one that will finally make software secure. They're probably right about the first part. They may be disappointed about the rest.

The truth, as usual, is complicated. There is, indeed, something very neat going on with the new obfuscation results. They're just not likely to make software 'unhackable' anytime soon. They might, however, radically expand what we can do with cryptography. Someday. When they're ready. And in ways we don't fully understand yet.

But before I go into all that, it's probably helpful to give some background.

Program obfuscation

The Wired article deals with the subject of 'program obfuscation', which is a term that software developers and cryptographers have long been interested in. The motivation here is pretty simple: find a way that we can give people programs they can run -- without letting them figure out how the programs work.

Note that the last part necessarily covers a lot of ground. In principle it includes aspects ranging from the nature of specific secret algorithms used -- which may be proprietary and thus worth money -- to secret information like passwords and cryptographic keys that might be hardcoded into the program.

For a simple example, consider the following routine:

// Check a password and print out top secret information if it's correct
//
SuperSecretPasswordProtectedStuff(string passwd) {
  if (password == "0xt438fh27266629zn28366492923aai3jnqobbyc4t!") {
    print("Congratulations. Here's some super secret private information: ....\n");
  } else {
    print("Wrong password, fool.\n");
  }
}

If you're like me, you probably wrote a program like this at some point in your life. You may have eventually realized how ineffective it would be -- against a smart attacker who could dump the program code. This is because most programs (and even compiled binaries) are pretty easy to read. An attacker could just look at this program to recover the secret password.

Program obfuscation is motivated by the idea that many useful programs would benefit if we could somehow 'stop' people from doing this, while still letting them possess and run the code on their own computers.

In real world software systems, 'obfuscation' usually refers to a collection of ad-hoc techniques that turn nice, sensible programs into a morass of GOTOs and spaghetti code. Sometimes important constants are chopped up and distributed around the code. Some portions of the code may even be encrypted -- though only temporarily, since decryption keys must ship with the program so it can actually be run. Malware authors and DRM folks love this kind of obfuscation.
A chunk of 'birken', one of the winning entries in the 2013 obfuscated C contest.
While these techniques are nice, they're not really 'unhackable' or even secure against reverse-engineering in the strong sense we'd like. Given enough time, brains and tools, you can get past most common software obfuscation techniques. If you have, say, Charlie Miller or Dion Blazakis on your side, you probably do it quickly. This isn't just because those guys are smart (though they are), it's because the techniques are not mathematically rigorous. Most practical obfuscation amounts to roadblocks -- designed to slow reverse-engineers down and make them give up.

So what does it mean to securely 'obfuscate' a program?

The poor quality of existing software obfuscation set cryptographers up with a neat problem. Specifically, they asked, could we define a strong definition of program obfuscation that would improve on, to put it politely, the crap people were actually using? Given such an obfuscator I could hand you my obfuscated program to run while provably protecting all partial information -- except for the legitimate inputs and outputs.

If this could be done, it would have an incredible number of applications. Stock traders could obfuscate their proprietary trading algorithms and send them to the cloud or to end-customers. The resulting programs would still work -- in the sense that they produced the right results -- but customers would never learn anything about 'how they worked'. The internals of the program would be secret sauce.

Unfortunately, when the first researchers started looking at this problem, they ran into a very serious problem: nobody had any idea what it meant to securely 'obfuscate' a program in the first place.

Do think about it for a second before you roll your eyes. What do you think it means to obfuscate a program? You're probably thinking something like 'people shouldn't learn stuff' about the program. But can you explain what stuff? And does 'stuff' depend on the program? What about programs where you can efficiently learn the whole program from sending it inputs and seeing the results? What does it mean to obfuscate those?

Clearly before progress could begin on solving the problem, cryptographers needed to sit down and figure out what they were trying to do. And indeed, several cryptographers immediately began to do exactly this.

Black-box cryptographic obfuscation

The first definitions cryptographers came up addressed a very powerful type of obfuscation called 'virtual black box obfuscation'. Roughly speaking, it starts from the following intuitive thought experiment.

Imagine you have a program P, as well as some code obfuscation technique you've developed. It's important that the obfuscation be efficient, meaning it doesn't slow the program down too much. To determine whether the obfuscation is successful, you can conduct the following experiment.
  1. Give Alice a copy of the obfuscated program code Obf(P), in a form that she can look at and run on her own computer.               
  2. Take the original (unobfuscated) program P, and seal it in a special computer located inside an armored 'black box'. Let a user Sam interact with the program by sending it inputs and receiving outputs. But don't let him access the code.
Roughly speaking, what we want from secure obfuscation is that Alice should learn 'no more' information after seeing the obfuscated program code, than could some corresponding user Sam who interacts with the same program locked up in the black box.

What's nice here is that this is we have the beginnings of an intuitive definition. In some sense the obfuscation should render the program itself basically unintelligible -- Alice should not get any more information from seeing the obfuscated program than Sam could get simply by interacting with its input/output interface. If Sam can 'learn' how the program works just by talking to it, even that's ok. What's not ok is for Alice to learn more than Sam.

The problem with this intuition is, that of course, it's hard to formulate. One thing you may have noticed is that the user Alice who views the obfuscated program truly does learn something more than the user Sam, who only interacts with it via the black box. When the experiment is over, the user who got the obfuscated program still has a copy of the obfuscated program. The user who interacted with the black box does not.

To give a practical example, let's say this program P is a certificate signing program that has a digital signing key hard-coded inside of it -- say, Trustwave's -- that will happily attach valid digital signatures on certificates (CSRs) of the user's devising. Both Alice and Sam can formulate certificate signing requests and send them into the program (either the obfuscated copy or the version in the 'black box') and get valid, signed certificates out the other end.

But here's the thing: when the experiment is over and we shut down Sam's access to the black box, he won't be able to sign any more certificates. On the other hand, Alice, who actually who learned the obfuscated program Obf(P), will still have the program! This means she can keep signing certificates on and forevermore. Knowing a program that can sign arbitrary certificates is a pretty darn significant 'piece' of information for Alice to have!

Worse, there's no way the 'black box' user Sam can ever learn a similar piece of information -- at least not if the signature scheme is any good. If Sam could ask the black box for a bunch of signatures, then somehow learn a program that could make more signatures, this would imply a fundamental weakness in the digital signature scheme. This cuts against a basic property we expect of secure signatures.

Barak et al. proposed a clever way to get around this problem -- rather than outputting an arbitrary piece of information, Alice and Sam would be restricted to outputting a single bit at the end of their experiment (i.e., computing a predicate function). This helps to avoid the problems above and seems generally like the "right" definition.

An impossibility result

Having proposed this nice definition, Barak et al. went on to do an irritating thing that cryptographers sometimes do: they proved that even their definition doesn't work. Specifically, they showed that there exist programs that simply can't be obfuscated under this definition.

The reason is a little wonky, but I'm going to try to give the flavor for it below.

Imagine that you have two programs A and B, where A is similar to our password program above. That is, it contains a hard-coded, cryptographic secret which we'll denote by password. When you run A(x), the program checks whether (x == password) and if so, it outputs a second cryptographically strong password, which we'll cleverly denote by password_two. If you run A on any other input, it doesn't output anything.

Now imagine the second program B works similarly to the first one. It contains both password and password_two hardcoded within it. A major difference is that B doesn't take a string as input. It takes another computer program. You feed a program in as input to B, which then:
  1. Executes the given program on input password to get a result r
  2. If (r == password_two)it outputs a secret bit.
It should be apparent to you that if you run the program B on input the program A -- that is, you compute B(A) -- it will always output the secret bit.* It doesn't even matter if either B or A has been obfuscated, since obfuscation doesn't change the behavior of the programs.

At the same time, Barak et al. pointed out that this trick only works if you actually have code for the program A. If I give you access to a black box that contains the programs A, B and will simply let you query them on chosen inputs, you're screwed. As long as password and password_two are cryptographically strong, you won't be able to get A to output anything in any reasonable amount of time, and hence won't be able to get B to output the secret bit.

What this means is that Alice, who gets the obfuscated copies of both programs, will always learn the secret bit. But if Sam is a reasonable user (that is, he can't run long enough to brute-force the passwords) he won't be able to learn the secret bit. Fundamentally, the obfuscated pair of programs always gives more information than black-box access to them.

It remains only to show that the two programs can be combined into one program that still can't be obfuscated. And this is what Barak et al. did, completing their work and showing the general programs can't be obfuscated using their definition.

Now you can be forgiven for looking askance at this. You might, for example, point out that the example above only shows that it's hard to obfuscate a specific and mildly ridiculous program. Or that this program is some kind of weird exception. But in general it's not. There are many other useful programs that also can't obfuscate, particularly when you try to use them in real systems. These points were quickly pointed out by Barak et al., and later in other practical settings by Goldwasser and Kalai, among others.

You might also think that obfuscation is plain impossible. But here cryptography strikes again -- it's never quite that simple.

We can obfuscate some things!

Before we completely write off black box obfuscation, let's take a moment to go back to the password checking program I showed at the beginning of this post.

Here's a slightly simpler version:

// Check a password and return true if it's correct, false otherwise
//
bool SuperSecretPasswordProtectedStuff(string passwd) {
  if (password == "0xt438fh27266629zn28366492923aai3jnqobbyc4t!") {
    return true;
  } else {
    return false;
  }
}

This function is an example of a 'point function': that is, a functions that returns false on most inputs, but returns true at exactly one point. As you can see, there is exactly one point (the correct password) that makes this routine happy.

Point functions are interesting for a two simple reasons: the first is that we use them all the time in real systems -- password checking programs being the obvious example. The second is that it turns out we can obfuscate them, at least under some strong assumptions. Even better, we can do it in a way that should be pretty familiar to real system designers.

Let be a secure hash function -- we'll get back to what that means in a second -- and consider the following obfuscated password checking routine:

// Check a password and return 'true' if it's correct, 'false' otherwise
// But do it all *obfuscated* and stuff
//
bool ObfuscatedSuperSecretPasswordProtectedStuff(string passwd) {
  static string HARDCODED_SALT          = 0x......; // this is a salt value
  static string HARDCODED_PASSWORD_HASH = 0x......; // this value is H(salt + target password)

  // First, hash the input password with the salt
  hashedPass = H(HARDCODED_SALT + passwd);

  if (hashedPass == HARDCODED_PASSWORD_HASH) {
    return true;
  } else {
    return false;
  }
}

Note that our 'obfuscated' program no longer stores the plaintext of the password. Instead we store its hash only (and salt), and wrap it inside a program that simply compares this hash to a hash of the user's input. This should be familiar, since it's the way you should be storing password files today.**

A number of cryptographers looked at formulations like this and showed the following: if the password is very hard to guess (for example, it's secret and drawn at random from an exponentially-sized space of passwords) and the hash function is 'strong' enough, then the hash-checking program counts as a secure 'obfuscation' of the basic password comparison program above it.

The intuition for this is fairly simple: imagine Alice and Sam don't know the password. If the password is strong, i.e., it's drawn from a large enough space, then their probability of ever getting the program to output 'true' is negligible. If we assume an ideal hash function -- in the simplest case, a random oracle -- then the hard-coded hash value that Alice learns is basically just a useless random string, and hides all partial input about the real password. This means, in practice, that any general question Alice can answer at the end of the experiment, Sam can answer with about the same probability.***

Finding better definitions

More than a decade after the definition was formulated, there are basically two kinds of results about 'strong' virtual-black-box obfuscation. The first set shows that it's impossible to do it for general programs, and moreover, that many of the interesting functions we want to obfuscate (like some signatures and pseudorandom functions) can't be obfuscated in this powerful sense.

The second class of results shows that we can black-box obfuscate certain functions, but only very limited ones like, say, point functions and re-encryption. These results are neat, but they're hardly going to set the world on fire.

What cryptographers have been trying to achieve since then is a different -- and necessarily weaker -- definition that could capture interesting things we wanted from obfuscating general programs, but without all the nasty impossibility results.

And that, finally, brings us to the advances described in the WIRED article.

Indistinguishability Obfuscation

In shooting down their own proposed definitions, Barak et al. also left some breadcrumbs towards a type of obfuscation that might actually work for general programs. They called their definition "indistinguishability obfuscation" (IO) and roughly speaking it says the following:

Imagine we have two programs C1, C2 -- which we'll describe as similarly-sized circuits -- that compute the same function. More concretely, let's say they have exactly the same input/output behavior, although they may be implemented very differently inside. The definition of indistinguishability obfuscation states that it should be possible to obfuscate the two circuits C1, C2 such that no efficient algorithm will be able to tell the difference between Obf(C1) from Obf(C2). 

While this idea was proposed years ago, nobody actually knew how to build any such thing, and it was left as one of those 'open problems' that cryptographers love to tear their hair out over. This remained the case until just last year, when a group of authors from UCLA, UT Austin and IBM Research proposed a 'candidate construction' for building such obfuscators, based on the new area of multilinear map-based cryptography.

Another interesting variant of this notion is called extractability obfuscation (EO), which implies not only that you can't distinguish between Obf(C1) and Obf(C2), but moreover, that if you could distinguish the two, then you could necessarily find an input value on which both C1 and C2 would produce different outputs. Moreover, other work indicates IO and EO give essentially the 'best possible' obfuscation you can provide for general programs.

The question you're probably asking is: so what? What can we do with indistinguishability obfuscation?

And this is where the WIRED article differs substantially from the reality. The truth is that IO will probably bring us major cryptographic and software advances -- it's already been shown to bring about succinct functional encryption for all circuits, as well as advances in deniable encryption. Moreover it can be used to build known forms of encryption such as public-key encryption based on new techniques that differ from what we use today -- for example, by obfuscating 'symmetric' primitives like pseudorandom functions.****

And if this doesn't seem quite as exciting as the WIRED article would imply, that's because we're still learning what the possibilities are. The new techniques could, for example, allow us to build exciting new types of security systems -- or else show us radical new ways to build systems that we already have. Even the latter is very important, in case advances in our field result in 'breaks' to our existing constructions.

What IO and EO will probably not do is make programs unhackable, since that implies something a whole lot stronger than either of these techniques currently provide. In fact, it's not clear what the heck you'd need to make programs unhackable.

And the reason for that is quite simple: even obfuscated software can still suck.

Notes:

Thanks to Susan Hohenberger and Zooko for help with this post. 

The bit in the Barak et al. proof doesn't have to be secret.

** With a pretty damn big caveat, which is that in real life (as opposed to cryptographic examples) users pick terrible passwords, which makes your password hashes vulnerable to dictionary attacks. We have a variety of countermeasures to slow down these attacks, including salt and computationally intensive password hashing functions.

*** God I hate footnotes. But it's worth unpacking this. Imagine that Alice learns 'some information' from seeing an average program containing the hash of a strong password. If the password is strong, then Alice can only guess the right input password with negligible probability. Then Sam can 'use' Alice to get the same information as follows. He formulates a 'fake' program that contains a completely bogus password hash and mails it to Alice. Whatever information she learns from his program, he takes and outputs as the information he 'learned'. It remains to show that if the password hash is strong (a random oracle), Alice isn't going to be able to learn more from a random hash than she would from the hash of a strong password.

**** The basic idea here is that symmetric encryption (like AES, say) can't be used for public key encryption, since anyone who learns your encryption key also learns your decryption key. But if you can obfuscate an encryption program that contains your symmetric encryption key, then the new program becomes like a public key. You can hand it out to people to encrypt with.

25 comments:

  1. Thanks for this explanation!

    One thing confuses me in the example of the obfuscated password checking routine: if Alice gets to view the obfuscated program, while Sam only has access to a black box, then Alice STILL learns more than Sam: Alice can see from the source code that there indeed exists SOME input which will return true, while Sam cannot know even that, since whatever input he passes to the black box returns false (unless he gets VERY lucky and gives it an appropriate password).

    Since Alice knows more than Sam, it therefore doesn't seem to meet the stated black-box obfuscation criterion. What am I missing?

    ReplyDelete
    Replies
    1. Alice would need to prove something suitable about the hash function, and that seems (at least potentially) difficult enough to satisfy the criterion. (Just because it looks like a function might return true doesn't necessarily mean that there are inputs which will actually cause it to do so.)

      Delete
    2. I guess that depends on what the hash function is and what is known about it.

      IF we make some (seemingly reasonable) assumptions about the hash function (i.e. that each possible hash value has multiple inputs which can produce that output value - otherwise the hash function doesn't seem very good), Alice knows that there are multiple inputs which can return true (as opposed to the original non-obfuscated version, where Alice knows that there is exactly one input value which can return true), whereas Sam doesn't even know that much.

      But even if Alice can't figure out those aspects of the apparent hash function, Alice can surely figure out OTHER things about it which Sam cannot, e.g. that the function's return value does not depend on the current date/time. (For all Sam knows, the function returns true if you happen to call it at the right date/time.) Or e.g. that the function has no memory (each call is independent of other calls). (For all Sam knows, if you call it with the same value 137 times in a row, it returns true.)

      Delete
    3. I imagine you'd need to disallow such trivial properties. (Trivial in the sense that they require only surface syntactic examination: "does the code contain a data/time function", for example.) Much as you have to for Rice's theorem.

      Delete
    4. The formulation of using a single bit return is a bit annoying, as you'd expect the bit to indicate whether there is a meaningful output, as well as what that meaningful output is.

      As the toy function, use something like

      if (hashedPW == HARDCODE_1)
      return 0;
      else if (hashedPW == HARDCODE_2)
      return 1;
      else
      return random_zero_or_one();

      Then, someone knowing the original hashed passwords can control/predict the output, but neither Alice nor Sam can.

      Delete
    5. @Russ: I suppose the setting of the experiment is such that both Alice and Sam know that they have a "point function", and the definition of "point function" is such that the output only depends on the input, and is true for exactly one input, false for all others. The remaining open question then is only "which input gives true".

      Delete
    6. Perhaps, but then we're seemingly violating the setting of the experiment by using the hash technique, since multiple inputs, not just exactly one input, will return true (assuming a sane hash function), as Alice can see by knowing the implementation works by using a hash of the input, but which Sam cannot know if all he knows is the specification that "exactly one input" returns true.

      PS: Saluton! :)

      Delete
  2. You're also assuming that the black-box is secure, which it isn't.

    Suppose that password check again, and, as you pointed out, some great "breaktrough" that allows people to get a lot more functions to behave that way...

    Alice doesn't need to crack the password. She just needs to replace the password check function. This could be done in a multitude of ways, among others, by replacing it in the program, or by modifying the "black box" to react always with true to string compares when it's in the password check function.

    Obfuscation, does exactly what it says on the label, it obfuscates. And that's the only thing it'll ever do.

    ReplyDelete
    Replies
    1. Instead of explicitly comparing the hash value, it would likely be possible to make the result of hashing the string affect the flow of the program (e.g. decode some code and try to run it, where the decoding process can not indicate success/failure).

      Intuitively, it feels like that would be possible, but hard to make provably secure.

      Disclaimer: I have no real background in this field.

      Delete
    2. It depends on exactly what you wish to obfuscate. If you wish to prevent use of the program unless a certain password is inputted, then yes, you're out of luck.

      But if you simply wish to obfuscate the plaintext password, which I think is what is desired in this scenario, then you can succeed at that.

      Delete
    3. Anonymous@10:44_feb21: are you meaning something like the things described in http://www.schneier.com/paper-clueless-agents.html ?

      Delete
  3. OMG! I have to say, your blog needs to be one of the best written blogs that I have learn in a protracted time. What I wouldn't give as a way to put up posts which might be as interesting| as yours. I guess I will have to continue studying yours and pray that someday I will write on a subject with as lot of wisdom as you have! Bravo!!

    ReplyDelete
  4. Should we downplay Kerckhoff's principle ?
    Should we see it as an exageration?
    Should we see it as a mistake?
    Should we see it as wrong?
    Should we suspect it to be a globally spread malicious recommendation that persists just for making easier spying worldwide?

    I would answer yes, to many of those questions..

    ReplyDelete
    Replies
    1. I would venture that the principle is simpy overkill in many practical scenarios. If you abide by it, you simply won't get anything done. Instead, you must often go to war with the soldiers you have. On the other hand, you might have the luxury of building up your army and waiting till they develop force fields and transphasic torpedoes and universe-imploding wormhole weapons and whatnot -- if you have the time (or if you're able to control and mold time to do your wishing).

      Delete
  5. It's not clear how that impossibility proof (or flavor thereof) is useful. After all, programs A and B are correlated (or colluding) in that B knows A's secrets. An even simpler example of this general notion is a program that only prints "Hello, world." It can't be obfuscated for much the same reason -- you already know its full functionality.

    ReplyDelete
  6. Just write your program in APL, and don't bother with the encryption.

    ReplyDelete
  7. Y0u would also have to obfuscate the hash function.

    Most of the examples are self-contained, trivial programs. And it compiles down to machine code operators.

    What it sounds like is they are adding a layer of encryption around a plaintext routine. But to read an email, or for a processor to read a program, it must be decrypted.

    Even if youncould obfuscate thendata using prime prodicts fields, modulos, some things would become recognizable as add, multiply, etc.

    ReplyDelete
  8. 1. Many will recall the difference between Thevenin and Norton equivalents: the Norton dissipates energy at idle.

    2. You can obfuscate the text, but the internal processing will still be observable in some ways.

    ReplyDelete
  9. Hey dude, I think you
    should increase
    the width of
    the article,
    it's making
    me dizzy trying
    to follow what
    you say when
    your sentence
    is chopped up
    in 15 lines.
    Oh and by the way where's the print button so i don't have to waste paper on comments by morons like me.

    ReplyDelete
  10. I don't hate footnotes but rather how you use paper books' format online with no links whatsoever. Wikipedia's footnotes work beautifully by linking back to the reference on same page. Please start implementing them.

    ReplyDelete
  11. Thanks for the explanation. I basically had the impression up until now that obfuscation is nothing I find exceptionally interesting, which was supported by DJBs comments at 30C3.

    However, you got me interested with your last footnote. Basically, you're saying we could turn every symmetric cipher into an asymmetric one? I have a few questions on that:
    a) Can we really do that or is this a "maybe we can do it someday if a lot more research happens"?
    b) What performance impact would that have? And what keysize? Is this practical in any way?
    c) if the answers for a) and b) are favorable: Is this the holy grail of post-quantum-cryptography? To me it sounds like it could be. But maybe I'm just completely wrong or things are much more complicated than that.

    ReplyDelete
  12. I see quite often that many people say that "in the end it's just machine code so you can learn anything anyway". If you have read the actual paper, you would see that it never says anything about actual impossibility of learning "stuff" (in case of efficient obfuscation). The claim is that it is computationally not feasible to learn that "stuff" (otherwise you would be also able to efficiently solve some problems believed to be hard).

    To put it in other way, this means (for example) that if you have an encrypted message (with some strong encryption), and if you have an obfuscated program that outputs the key for that message (e.g. if you give an appropriate key to that program), then you need roughly the same effort to learn the hardcoded key from the program as to break the encryption directly. In mathematical terms, this of course doesn't mean impossibility (theoretically you can always brute-force a key anyway).

    As for post-quantum-crypto (noting that I'm not an expert): the reduction doesn't show NP-hardness, and, at a glance, you could break it if you could efficiently invert some operations in some groups, so I'd bet that it is about as hard as discrete log.

    ReplyDelete
    Replies
    1. (clarification: "you" in the second sentence does not refer to the author, it refers to some commenters)

      Delete
  13. Out of curiosity, what do you think of this quine-with-payload approach against obfuscation: http://eprint.iacr.org/2011/497 ?

    ReplyDelete
  14. It is possible to make software unhackable.
    However, security is not built on obscurity. The direction is not obfuscation but making it run in a trusted execution engine. Trust is built from the bottom and you cannot drop stuff at the top of the software stack and simply expect it to be bulletproof.
    Note however, that those who in power do not want unhackable software and trusted system:
    1) Governments and organizations would have though time spying on what’s going on
    2) Security systems would have hard time in realizing what they are dealing with
    3) The big giants have an agenda to push data to the cloud claiming that the device cannot be trusted to hold all the data and algorithms.

    Yes we can make systems butter, but we don’t have the power to throw the big brothers over the ditch

    ReplyDelete