Over on twitter, some folks were chatting about the latest big security botch. A major service, called Evernote, had a security breach where a password file was stolen. Evernote has handled the situation quite well, being open about what happened, and explaining the risks.
In their description of the breach, they said that the stolen passwords were "both hashed and salted". Apparently this sounds funny to people outside of software. (Amazing how jargon becomes so ingrained that I didn't even notice the fact that it could be interpreted in a funny way until it was pointed out to me!)
Anyway, since discussion of this is going around, I thought I'd explain just what password hashing and salting means.
Let's start at the beginning. You're some kind of system that wants to have password security. Obviously, you need to save the passwords somewhere, right?
As we'll see, that's only partially correct - but let's go with it for now. You need to store something that lets you check if a user supplied the right password, so you need to store something.
The most naive approach is create a file (or database, or whatever) that contains the usernames and passwords of all of your users. Something like:
alice:abc mark:pass joe:123 jen:hello
Suppose you were a thief, and you wanted to crack this password file, what would you do? You'd try to steal that file! If you can get hold of that password file, then you'd have all of the passwords for all of the users of the system.
That means that this is a terrible way of storing the passwords. One step, and a thief has completely breached your system. We don't want that. So what should we do?
First, we could encrypt the file.
This might seem like a good idea at first. If a thief were the steal the file, the wouldn't be able to find your user's passwords without figuring out the encryption key! It's going to take a lot of work to figure out that key!
The first problem with this is that any password can be cracked given enough time and power. If there's only one encryption key for the entire file, then it's worth investing a lot of time and power into breaking it - and once it's broken, then everything is revealed.
The second problem is: how does your system check a user's password? It needs to decrypt the file! That means that the encryption key must be available to your system! So all that a thief needs to do is figure out where your system is getting the key from. You've got your entire security system for all of your users set up with a single point of protection, and somewhere in your system, everything that you need to break that protection is available!
What can we do to improve this? The answer is something called crypto graphic hashing.
Cryptographic hashing is a combination of two concepts: hashing, and one-way functions.
A really simple example of a not-very-good hash function of a string would be something like: convert all of the characters in the string to their numeric values, and exclusive-or the binary representation of those bits. With that hash function, you could take a string like "ABCD", and convert it to the numeric values of the characters ([65, 66, 67, 68]), and then x-or them together (1000001 xor 1000010 xor 1000011 xor 1000100 = 0000100) for a result of 4. Real practical hash functions are more complicated.
For example, at least some versions of Java use the following as the default hash for a string of characters:
There's a class of mathematical functions called one-way functions. A one way function is a function , where given , it's easy to compute , but given it's extremely difficult (or even impossible) to compute .
If we combine those two, we have what's called a crpytogrphic hash function: a function that takes an arbitrary input string, and converts it to a number, in a way where it's very difficult to figure out what the input string that produced the number was. That's great for storing passwords! We don't store the password at all. We just store the hash-value produced from the password. Then when a user comes and logs in, we take their password, hash it, and compute the result to the stored hash.
Instead of the file with explicit passwords, we get something like:
alice:7a2d28fc mark:dfd4e1c6 joe:ed849ee1 jen:bb76e739
This is much better than storing the encrypted password. There is no encryption key that a thief can use to decrypt the password. Even if a thief knows the hash values of your user's passwords, they can't get in to the system! And your system actually never stores the actual values of the user's passwords - just their hashcodes!
So again, let's look at this from the perspective of a thief. How can a thief break into a system with hashed passwords?
If they don't know what hash function you're using, then they're completely stuck. Sadly, they can probably figure it out. Designing new crpytographic hash functions is hard. Implementing cryptographic hash functions correctly is hard. As a result, most people just use a hash function from a library. That means that for a thief, it's usually pretty easy to figure out what hash function is being used by a system.
Once they know what hash function you used, their only choice to break your system is to try to guess the passwords. That is, they can guess passwords, compute their hash codes, and search through your password file to see if any of the users password hashes matches. If they find one, they're gold!
In fact, there's a common strategy based on this idea called a rainbow table. A rainbox table is a list of common passwords, and the numeric value that they hash to with a common crptographic hash value. Something like:
|Password String||Hash value|
If you can somehow steal the passwords file, then with a rainbow table, you can find users with common passwords. For example, in the table above, you can see that the hashcode "7a2d28fc" occurs in the passwords file for the username "alice", and it's also in the rainbow table for the password "abc". So a thief could determing that alice's password was "abc". Even with the best crpytographic hash, all it takes is one idiot user who uses "password" as their password, and your system's security is breached.
Salting passwords addresses that problem. In a salting strategy, you don't hash a user's password by itself: you combine it with some additional data, and then hash that combination. The additional information is called the salt..
You can use lots of different things for the salt. There's a complex set of tradeoffs in the exact salting strategy, which are beyond the scope of this post, but a few examples include:
- Always use a fixed salt string. This is weak, but better than nothing. It's got a similar weakness to the encrypted password system: you only need one salt to give you a handle on breaking all of the passwords, and that one salt needs to be in the system.
- Add a random piece of data for each password. The catch here is that you need to store the salt data for each password. This is what unix passwords used to use. They added 12 random bits to each password. In the passwords file, they stored the salt and the hashed password. The weakness of this is that the salt is right there with the password entry. But because each user has a different salt, that means that any attempt to breach the system needs to look at each user separately.
- Salt on metadata: that is, take information about the user that isn't part of their username, and use that as the salt. For example, you could use a person's birthday as the salt for their account.
If each user has a different salt, then even if you've got terrible passwords, a thief needs to do a lot of work to try to break your system. Even with a rainbow-table like strategy, they can't compute the hashcode for a given common password once, and then search the password hash list for that code - they need to recompute it for each possible salt value!
What salting does is, effectively, increase the amount of effort needed to break the passwords. If you add 12 bits of salt, then a rainbow table needs 4096 times more entries to find common passwords! If your salt is long enough, then it can make it effectively impossible to create a rainbox table at all. If they try to attack you without a rainbow table, a 12 bit salt means that your attacker needs to attack the passwords of each of your users seperately! Even if they know the value of the salt, you've made it much harder for them to breach your security.