Let’s talk about ZRTP

I just checked the date on my last post and it seems that I haven’t blogged in nearly a month. Believe me, this isn’t for lack of trying. The world has just been a very, very busy place.

But this is the Thanksgiving holiday and every (US) American knows the best part of Thanksgiving is sitting around for hours waiting for a huge bird to cook thoroughly enough that it won’t kill your family. And that means I finally have time to write about my favorite wonky topic: cryptographic protocols. And how utterly confusing they can be.


The subject of today’s post is the ZRTP key agreement protocol. ZRTP has recently gotten some press for being the primary key agreement used by Silent Circle, a new encrypted voice/video/text service launched by PGP inventor Phil Zimmermann and some other bright folks. But it’s also used in other secure VoIP solutions, like Zfone and Moxie’s RedPhone.

Silent Circle’s an interesting case, since it’s gotten some gentle criticism lately from a variety of folks — well, mostly Nadim Kobeissi — for being partially closed-source and for having received no real security audits. Nadim’s typical, understated critique goes like this:

And is usually followed by the whole world telling Nadim he’s being a jerk. Eventually a tech reporter notices the fracas and chimes in to tell us the whole Infosec community is a bunch of jerks:

And the cycle repeats, without anyone having actually learned much at all. (Really, it’s enough to make you think Twitter isn’t the right place to get your Infosec news.)

Now, as unhelpful as all this is, maybe we can make lemonade and let all of this serve as a teaching moment. For one thing, it gives us a (wonky) chance to learn a little bit about this ZRTP protocol that so many people seem to be using.

Overview of ZRTP

The ZRTP protocol is fully laid out in RFC 6189. This is one of the more confusing specs I’ve read — partly because the critical information is spread out across so many different sections, and partly because ZRTP seems determined to address every possible attack simultaneously.

Fortunately the Intro does a good job of telling us what the protocol’s up to:

ZRTP is a key agreement protocol that performs a Diffie-Hellman key exchange during call setup in the media path and is transported over the same port as the Real-time Transport Protocol (RTP) media stream which has been established using a signaling protocol such as Session Initiation Protocol (SIP). This generates a shared secret, which is then used to generate keys and salt for a Secure RTP (SRTP) session.

So: simple enough. ZRTP lets us establish keys over an insecure channel using Diffie-Hellman.

However we all know that Diffie-Hellman isn’t secure against active (Man-in-the-Middle) attacks. Normally we prevent these by signing Diffie-Hellman messages using keys distributed via a PKI. ZRTP is having none of it:

Although it uses a public key algorithm, [ZRTP] does not rely on a public key infrastructure (PKI) … [ZRTP] allows the detection of man-in-the-middle (MiTM) attacks by displaying a short authentication string (SAS) for the users to read and verbally compare over the phone. … But even if the users are too lazy to bother with short authentication strings, we still get reasonable authentication against a MiTM attack, based on a form of key continuity. It does this by caching some key material to use in the next call, to be mixed in with the next call’s DH shared secret, giving it key continuity properties analogous to Secure SHell (SSH).

So our protection is twofold: (1) every time we establish a connection with some remote party, we can verbally compare a “short authentication string” (SAS) to ensure that we’ve both agreed on the same key. Assuming that our attacker isn’t a voice actor, this should let us easily detect a typical MiTM. And just in case we’re too lazy to bother with this, even completing one ‘un-attacked’ connection leaves us with (2) a long-term shared secret that we can use to validate future connection attempts.

So is this a reasonable model? Well, you can draw your own conclusions — but it works fine for me. Moreover, I’m willing to accept just about any assumption that allows us to not think about the ‘Bill Clinton’ or ‘Rich Little attacks‘. So let’s just assume that this voice thing works… and move on to the interesting bits.

The Guts of the Protocol

ZRTP is an interaction between two parties, defined as an Initiator and a Responder. This figure from the spec illustrates the flow of a typical transaction:

Note that the identity of the party acting as the Initiator is determined during the protocol run — it’s the one that sends the Commit message (F5). The protocol breaks down into roughly four phases:

  1. Discovery and protocol negotiation (F1-F4), in which the parties start up a protocol transaction and agree on a supported ZRTP version and ciphersuites.
  2. Diffie-Hellman key establishment (F5-F7). This is almost ‘missionary position’ Diffie-Hellman, with one exception (the F5 message), which we’ll talk more about in a second.
  3. Key confirmation (F8-F10), in which the parties verify that they’ve agreed on the same key.
  4. Secure communication. That’s the last section, labeled “SRTP begins”.

Discovery and Protocol Negotiation

Messages F1-F4 are responsible for setting up a ZRTP connection. This stuff is (almost) entirely non-cryptographic, which means this is the part of the protocol where stuff is most likely to go wrong.

Let me explain: prior to completing the Diffie-Hellman key exchange in messages F5-F7, we can’t assume that Alice or Bob share a cryptographic key yet. Without a key, they can’t authenticate their messages — at least not until after the Diffie-Hellman transaction is complete. That means an attacker can potentially re-write any portion of those messages and get away with it. At least for a while.
So what’s in those messages? Well, just a couple of small things:
  1. A 4-character string containing the version of the ZRTP protocol.
  2. A Client Identifier string (cid), which is 4 words long and identifies the vendor and release of the ZRTP software.
  3. The 96-bit-long unique identifier for the ZRTP endpoint (ZID).
  4. A Signature-capable flag (S) indicates this Hello message is sent from a ZRTP endpoint which is able to parse and verify digital signatures.
  5. The MiTM flag (M), which has something to do with PBXes.
  6. The Passive flag (P), which is set to true if and only if this Hello message is sent from a device that is configured to always act as a Responder (not Initiator).
  7. The supported Hash algorithms, Cipher algorithms (including Diffie-Hellman handshake type), SRTP Auth Tag Types, Key Agreement Types, and SAS Types.
Which is to say: quite a lot! And some of these values are pretty critical. Changing the version number might allow an attacker to roll us back to an earlier (potentially insecure) version of the protocol. Changing the ciphers might (theoretically) allow us to switch to a weaker set of Diffie-Hellman parameters, hash function or Short Authentication String algorithm.
At this point a careful protocol reviewer will be asking ‘what does ZRTP does to prevent this?’ ZRTP gives us several answers, both good and not-so-good:
  • On the bad side, ZRTP allows us to send a hash of the Initiator’s Hello message through a separate integrity-protected channel, e.g.,secure SIP. (This protection gets folded on into later messages using a crazy hash-chain and MAC construction.) In general I’m not a fan of this protection — it’s like saying a life-preserver will keep you alive… as long as you have a separate lifeboat to wear it in. If you have a secure channel in the first place, why are you using ZRTP? (Even the ZRTP authors seem a little sheepish about this one.)
  • On the confusing side, Hello messages are MACed using a MAC key that isn’t revealed until the subsequent message (e.g., Commit). In theory this means you can’t forge a MAC until the Commit message has been delivered. In practice, an MiTM attacker can just capture both the Initiator Hello (F3) and the Commit message (F5), learn the MAC key, and then forge the Hello message to its heart’s content… before sending both messages on to the recipient. This entire mechanism baffles me. The less said about it the better.
  • A more reasonable protection comes from the key derivation process. In a later phase of the protocol, both ZRTP parties compute a hash over the handshake messages. This value is folded into the Key Derivation Function (KDF) that’s ultimately to compute session keys. The hashing process looks like:  total_hash = hash(Hello of responder || Commit ||                  DHPart1 || DHPart2)

But… zounds! One important message is missing: the Initiator’s Hello (F3). I can’t for the life of me figure out why this message would be left out. And unless there’s something really clever that I’m missing, this means an attacker can tamper with the Initiator’s Hello message without affecting the key at all.*So is this a problem? Well, in theory it means you could roll back any field in the Initiator Hello message without detection. It’s not exactly clear what practical benefit this would have. You could certainly modify the Diffie-Hellman parameters or ciphers to turn off advanced options like ECDH or 256-bit AES. Fortunately ZRTP does not support ‘export weakened‘ ciphers, so even the ‘weak’ options are pretty strong. Still: this seems like a pretty big oversight, and possibly an unforced error.

For the moment, let’s file it under ‘this protocol is really complicated‘ or ‘don’t analyze protocols at Thanksgiving‘. At very least I think this could use a good explanation.

(Update 11/25: After some back and forth, Moxie points out that the Hash, SAS and Cipher information is repeated in the Commit message [F5], so that should provide an extra layer of protection against rollback on those fields — but not the other fields like version or signature capability. And of course, an implementation might have the Responder prune its list of supported algorithms by basing it off the Initiator’s list, which would be a big problem.)
Diffie-Hellman Key EstablishmentAssuming that the two parties have successfully completed the negotiation, the next phase of the protocol requires the two parties to establish a shared key. This is done using (nearly) straight-up Diffie-Hellman. The parameters are negotiated in the previous phase, and are drawn from a list defined by RFC3526 and a NIST publication.
Normally a Diffie-Hellman agreement would work something like this (all exponentiation is in a group, i.e., mod p):

  1. Alice picks a, sends g^a (message F6).
  2. Bob picks b, sends g^b (message F7).
  3. Both parties compute g^ab and HMAC this value together with lots of other things to derive a session key s0.
  4. The parties further compute a 32-bit Short Authentication value as a function of s0, convert this into a human-readable Short Authentication String (SAS), and compare notes verbally.

The problem with the traditional Diffie-Hellman protocol is that it doesn’t play nice with the Short Authentication String mechanism. Say an MiTM attacker Eve has established a connection with Bob, during which they agreed on SAS value X. Eve now tries to run the Diffie-Hellman protocol with Alice. Once Alice sends her choice of g^a, Eve can now run an offline attack, trying millions of candidate b values until she gets one such that the derived SAS between her and Alice is also X.This is a problem, since Alice and Bob will now see the same SAS, even though Eve is duping them.ZRTP deals with this by forcing one party (Bob) to commit to g^b before the protocol even begins. Thus, in ZRTP Bob picks b and sends H(g^b) before Alice sends her first message. This ‘commitment’ is transmitted in the “Commit” message (F5).

This prevents Bob from ‘changing his mind’ after seeing Alice’s input, and thus the remote party has at most one chance to get a colliding SAS per protocol run. Which means Eve is (probably) out of luck.**

Key Confirmation & Secure Communication

The rest of the protocol does all of the stuff you expect from a key agreement protocol: the two parties, having successfully completed a Diffie-Hellman exchange, now derive a session key by HMACing together the value g^ab with total_hash and any pre-shared secrets they hold from older sessions. If all goes well, the result should be a strong secret that only the two parties know — or an SAS mismatch that reveals Eve’s tampering.

From this point it’s all downhill, or it should be at least. Both parties now have a shared secret that they can use to derive secure encryption and MAC keys. They now construct “Confirm” messages (F8, F9), which they encrypt and (partially) MAC. In principle this exchange gives both parties a chance to detect a mismatch in keys and call the whole thing off.There’s not a ton to say about this section except for one detail: the MAC on these messages is computed only over the encrypted portion (between the == signs below), and leaves out critical details like the Initialization Vector that’s used to encrypt them: ***

Structure of a ZRTP “Confirm” message (source).

Once again it’s not clear if there’s any practical impact here, but at least technically speaking, someone could tamper with this IV and thus change the decryption of the message (specifically, the first 64 bits of H0). I have no idea if this matters — or even if it’s a real attack — but in general it seems like the kind of thing you want to avoid. It’s another example of a place where I just plain don’t understand ZRTP.


I think it should be obvious at this point that I have no real point with this post — I just love protocols. If you’re still reading at this point, you must also love protocols… or else you did something really wrong, and reading this post is your punishment.

ZRTP is a fun protocol to analyze, since it’s a particularly complicated spec that deals with many use-cases at once. Plus it’s an old protocol and it’s been subjected to lots of analysis (including analysis by robots). That’s why I’m so surprised to see loose ends at this date, even if they’re not particularly useful loose ends.

Regardless of the results, anyone who’s interested in cryptography should try their hands at this from time to time. There are so many protocols that we rely on in our day-to-day existence, and far too few of these have really been poked at.


* An older paper by Gupta and Shmatikov also notes the lack of authentication for ZID messages, and describes a clever attack that causes the peers to not detect mutually shared secrets. This seems to be fixed in later versions of the protocol, since the ZID values are now included in the Key Derivation Function. But not the rest of the fields in the Initiator Hello.

*** This is another place where the specification is a bit vague. It says that the MAC is computed over the “encrypted part of the message”, but isn’t entirely clear if the MAC is computed pre- or post- encryption. If it’s pre- encryption this is a bit non-standard, but not a major concern.

18 thoughts on “Let’s talk about ZRTP

  1. I like reading your blog, so thanks again and keep them coming!
    There are 2 things I can't understand:
    1) you say that Eve can “learn the MAC key, and then forge the Hello message” -> seeing the plaintext and the MACed version allows you to get the key only by bruteforcing it… so how long is the key??
    2) you say that “Eve can now run an offline attack, trying millions of candidate b values until she gets one such that the derived SAS between her and Alice is also X.”. But both exponents are randomly generated for each protocol run. If the exponents are large enough, how can Eve try enough possibilities without Alice or Bob hanging up the call?? It must take a long time to bruteforce the exponents, no?
    3) Is it possible for a MiTM Eve to force alice and bob to be de-syncrhonized (in terms of previous call secrets included in total_hash)? Say that Eve lets alice and bob run the protocol but at the end (__ should be blank but the blog removes the blank):

    F8 Confirm1 ———————–> F8

    <-- (TCP RST)--- BLOCKED <-------- F9 ___________ F10 Conf2ACK ——–> B stores the session secret for future sessions because it thinks F10 comes from A.

    A discards the session secret because the session was aborted. Next time they run the protocol, it aborts, hence DoS attack.

    I guess there is no workaround: even if Conf2ACK was MACed, A and B inherently need to store the session secret at different times (the signal needs to travel from A to B and hence can be intercepted and blocked by Eve) …

  2. Got the answer for 2), SAS is only 4 bytes long… it does not matter the size of the exponent. There is only 2^32 possibilities… and actually much less than that since if we only consider ASCII characters…

  3. 1. Eve doesn't have to brute-force the key. The MAC key /itself/ is in the Commit message — it's the value H2.

    3. I think you have to complete the session to store the secret. At least in this case both parties would notice a protocol abort.

  4. Please remind that when ZRTP is used along with SIPS (SIP with TLS) there is another check that can be done, by using such end-to-site encrypted/authenticated channel.

    So, when ZRTP is used:
    – With key continuity
    – Forcing the user to verify the trust between two peers
    – With SIPS key-hash relay verification

    You are basically mixing 3 multiple security authentication schema where each single schema must fail:
    – First Trust (key continuity)
    – End user human verification
    – SSL certificate validation

    At PrivateWave, where we made the Zorg opensource ZRTP implementation (http://www.zrtp.org ), we are using such kind of layered approach when using ZRTP.

    Fabio Pietrosanti (naif)
    Founder and Former CTO of PrivateWave

  5. It's good to see you decided to post another very informative article. Hopefully the timeframe between successive posts will be somewhat shorter in future, though.:)

  6. You say cryptography is hard. I agree but will add that it also is a big failure. Why we couldn't come up with a simple, secure and agreed-upon protocol for exchanging a key in such conditions (no PKI), and why every comm protocol on earth does it in its own little way is beyond me.
    Cryptographic primitives shouldn't be block ciphers, hash functions etc… but protocol building blocks, such as e.g. a simple protocol that establishes a tunnel between two parties. A bit like SSL but simpler.


  7. I like your blog post. Keep on writing this type of great stuff. I'll make sure to follow up on your blog in the future.Very good, informative piece. Smartly done as all the time.
    agen bola

  8. I am impressed. I don't think Ive met anyone who knows as much about this subject as you do. You are truly well informed and very intelligent. You wrote something that people could understand and made the subject intriguing for everyone. Really, great blog you have got here.
    Jasa SEO

  9. I am using linphpone library for ZRTP for encrypted call, but I am getting SAS value nil for linphone_call_get_authentication_token(call)

    Is any one have idea, why this happens ?

Comments are closed.