I'm still editing this, but thought I'd at least open it up to read in the meantime.
Also, special thanks for helping me review this one to:
Dave Hoover @redsquirrel
Blake Smith @blacksmith
Jason Casillas @RAGEBARRAGE
The setting
Periodically I'm asked whether it's considered safe to store hashes of credit card numbers. The motivation is usually a secondary form of user authentication, or fraud detection to see if a card is used in different places by different users. In general, I am strongly against storing hashes of credit card numbers, because combinatorially it isn't that much safer against a clever attacker than storing the number outright.
The combinatorics
When it comes to security, or more specifically cryptanalysis, your system is only as safe as the narrowest/smallest brute-forceable surface. Consider an arbitrary and perfect hash algorithm h that generates 160 bits of output. If you had the hashed output c and the function h, you would need to find the input p such that h(p) = c and you'd have all the secrets. Well, the set of all possible inputs is infinite but the output has a fixed size, so where's the upper bound? It's 2^160 (the size of the hash; N bits can represent 2^N unique pieces of information: see Integers in Computer Science and the Pigeonhole principle).
Here we consider the difference between an infinite set and 2^160. Thanks to a great writeup by Jeff Bonwick we know that fully populating a 128 bit set would literally require more energy than it would take to boil the oceans of Earth, so 160 bits should be plenty safe, right?
Let's keep going and move on to look at the structure of a credit card number:
0000-0000-0000-0000
A typical credit card number is 16 digits long (and only digits, no letters/symbols), which gives us 10^16 total possibilities. Written out: 10,000,000,000,000,000. 10 quadrillion numbers. Not necessarily at the point at which we start calling things intractable, but still prohibitively large. Let's try and remove some entropy from this number.
First of all, there's the bank identification number (BIN) that accounts for the first 4-6 digits of the credit card number. I'm going to X out the digits that we may know ahead of time. In the event of a 4 digit bank number:
XXXX-0000-0000-0000
and a 6 digit:
XXXX-XX00-0000-0000
In the case of a 4 digit bank number we now have just 1,000,000,000,000 (10^12, 1 trillion) possibilities, and the 6 digit case leaves us with 10,000,000,000 (10^10, 10 billion) numbers.
So given a stack of 160 bit hashed credit card numbers, how much computing power would it take to reverse one of them out by brute forcing the hash? 10 quadrillion is certainly smaller than 2^160, so brute forcing the number is going to be the easier target. If we select a specific bank it's even easier. First, let's look at some code:
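To make the shrinking search space concrete, here is a minimal Python sketch of the arithmetic above. The function name `remaining_keyspace` is made up for illustration, and the bit counts are just log2 of the candidate counts, shown for comparison against a 160 bit hash.

```python
import math

HASH_BITS = 160    # size of the hash output discussed above
CARD_DIGITS = 16   # typical card number length


def remaining_keyspace(known_digits: int, total_digits: int = CARD_DIGITS) -> int:
    """Number of candidate card numbers left once `known_digits` are fixed."""
    return 10 ** (total_digits - known_digits)


for known in (0, 4, 6):
    space = remaining_keyspace(known)
    bits = math.log2(space)
    print(f"{known} known digits: {space:,} candidates (~{bits:.1f} bits)")

print(f"hash output space: 2^{HASH_BITS}")
```

Even with no digits known, the input space is only a little over 53 bits wide, nowhere near the 160 bits of the hash output.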
The CPU implementation
Assuming that we're going after a card with a 6 digit bank number, how long would it take us to try all the possibilities? We have three tasks:
- Generate a number
- Determine if it's valid
- Calculate the hash of the input and find a collision
The code to do that is here https://github.com/cchandler/cc-hash-probe .
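For readers who don't want to dig through the repository, the three steps can be sketched in a few lines of Python. This is not the code from the repository above, just an illustration under some assumptions: SHA-1 stands in for the 160 bit hash, the hash is taken over the unsalted ASCII digits, "valid" means passing the Luhn checksum, and the names `luhn_valid` and `probe` are hypothetical. The demo uses the well-known Visa test number 4111111111111111 and a 12 digit known prefix so it finishes instantly.

```python
import hashlib


def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:       # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def probe(prefix: str, target_hex: str, total_digits: int = 16):
    """Try every number starting with `prefix`; return the one whose SHA-1 matches."""
    unknown = total_digits - len(prefix)
    if unknown <= 0:
        raise ValueError("prefix must be shorter than the card number")
    for n in range(10 ** unknown):
        candidate = prefix + str(n).zfill(unknown)   # step 1: generate a number
        if not luhn_valid(candidate):                # step 2: is it valid?
            continue
        digest = hashlib.sha1(candidate.encode("ascii")).hexdigest()
        if digest == target_hex:                     # step 3: hash and compare
            return candidate
    return None


# Demo on a deliberately tiny search space (only 4 unknown digits).
target = hashlib.sha1(b"4111111111111111").hexdigest()
print(probe("411111111111", target))
```

Skipping candidates that fail the checksum means only about one in ten numbers ever needs to be hashed.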
Let's see how long it takes my 2.8GHz Intel Core i7 to do the above steps 10,000,000 times: about 15 seconds.
At 15 seconds per 10 million iterations, scanning every possible number for a 6 digit bank number would take us 4.16 hours. In the event of a 4 digit bank number we'd have to do about 100 times the work, or about 416 hours. Inconvenient, but not unthinkable, especially considering I have more cores I'm not using.
However, we have other faster/better options...
The GPU
My commodity laptop has an nVidia GeForce GT330M card inside it. I've highlighted the relevant stats:
That card, which most people aren't using, has 48 cores, is capable of executing 32 hardware threads at a time (the warp size), and runs at a clock rate of 1.1GHz by itself. How long does it take my GPU to calculate the above 3 steps 10M times?
1.9s
Only about 2 seconds to do an equivalent workload. My GPU code is available here: https://github.com/cchandler/cc-hash-probe/blob/master/gpu.cu
Let's do some math: at 2 seconds / 10M hashes we'll get through a 6 digit BIN in 33 minutes, down from 4.16 hours, and through a 4 digit BIN in 55.56 hours, down from 416.
But wait! It gets better. Amazon semi-recently announced the general availability of their High Performance Computing GPU cluster instances. If you haven't had a chance to play with them yet, you'll discover they come equipped with dual nVidia Tesla C2050 (Fermi) cards. For perspective, these cards have 448 cores apiece. For each instance you fire up, at $2.30/hr you get 896 cores at 1.15GHz, or roughly 20 times the compute power of my laptop (I'm rounding up).
Let's suppose these cards do 20 times the workload of my laptop; then in 2 seconds we'll have 200,000,000 hashes instead of 10,000,000. How long will it take us to go through that 6 digit card now? 200M hashes in 2 seconds will yield all possible outputs in a little over a minute and a half (1.67 minutes). The 4 digit card will be ours in 2.7 hours.
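The timing estimates so far all come from the same division, so here is a small Python sketch that reproduces them side by side. The rates are the measured or assumed figures from the text (15 seconds and 2 seconds per 10M iterations, and the assumed 20x speedup for the GPU instance); `scan_hours` is a hypothetical helper name.

```python
def scan_hours(keyspace: int, hashes_per_second: float) -> float:
    """Hours needed to try every candidate in `keyspace` at the given rate."""
    return keyspace / hashes_per_second / 3600


# Throughput figures taken from the text; the 20x factor is an assumption there too.
RATES = {
    "CPU (10M in 15 s)": 10_000_000 / 15,
    "laptop GPU (10M in 2 s)": 10_000_000 / 2,
    "GPU instance (assumed 20x laptop GPU)": 20 * 10_000_000 / 2,
}

SIX_DIGIT_BIN = 10**10    # candidates left with 6 known digits
FOUR_DIGIT_BIN = 10**12   # candidates left with 4 known digits

for name, rate in RATES.items():
    six = scan_hours(SIX_DIGIT_BIN, rate)
    four = scan_hours(FOUR_DIGIT_BIN, rate)
    print(f"{name}: 6 digit BIN {six:.2f} h, 4 digit BIN {four:.2f} h")
```

These are worst-case figures for exhausting the whole space; on average a single target number would turn up after about half that time.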
A more clever attack
So now we know how fast we can potentially recover all the possible values using our fancy-pants Amazon GPU instances. A total brute-force attack might still be a bit prohibitive because we don't want to shell out that much money for the compute instances.
Thanks to institutional banking, it turns out that a massive amount of money is deposited in very few banks. Here are the top 3:
- Bank of America
- JPMorgan Chase
- Citigroup
What if we only bothered to go after credit cards at these banks? A quick check at Wikipedia confirms that a list of known BIN numbers is available.
Here are all currently listed Bank of America BINs:
- 377311 - MBNA Europe Bank (Bank of America) bmi plus Credit Card (UK)
- 377311 - MBNA Europe Bank (Bank of America) Virgin Atlantic Credit Card (UK)
- 41177 - Bank of America (US; formerly FleetBoston Financial|Fleet) VISA Debit Card
- 414716 - Bank of America (US) - Alaska Airlines Signature Visa Credit Card
- 417008-11 - Bank of America (USA; Formerly Fleet) - Business Visa Card
- 421764 to 66 - Bank of America VISA Debit Card
- 4256 - Bank of America General Motors|GM Visa Check Card
- 426428 to 29 - Bank of America (formerly MBNA) Platinum Visa Credit Card
- 426451 to 52, 65 - Bank of America (formerly MBNA) Platinum Visa Credit Card
- 430536, 44, 46, 50, 94 - Bank of America (formerly Fleet) Visa Credit Card
- 431301 to 05, 07, 08 - Bank of America (formerly MBNA) Preferred Visa & Visa Signature Credit Cards
- 432624 to 30 - Bank of America (formerly Fleet National Bank) Visa Check Card, Debit
- 4342 - Bank of America Classic Visa Credit Card
- 4356 - Bank of America Visa Debit Card
- 435680 to 90 - Bank of America, Visa, Platinum Check Card, Debit
- 449533 - Bank of America (USA), National Association - Classic, Debit, Visa
- 4635 - Bank of America Business Platinum Debit
- 4744 - Bank of America Visa Debit
- 474480 - Bank of America Visa Debit, Midwest USA
- 4888** - Bank of America (US) - Visa Credit Card
- 5401 - Bank of America (formerly MBNA) MasterCard Gold Credit Card
- 549035 - MBNA American Bank [Now part of Bank of America]
- 549099 - MBNA American Bank [Now part of Bank of America]
- 587781 - Bank of America ATM Card
In total: 37 6-digit BINs and 7 4-digit BINs. If we used the high-powered compute instances, that means reversing out all the Bank of America cards would take:
2.7 hours * 7 cards + 0.02 hours * 37 cards = 19.64 hours
Just 20 hours of compute time to potentially recover all possible BofA numbers. Considering it's $2.30/hr for these instances that's about $46.00.
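As a cross-check, the same estimate can be recomputed in Python from the unrounded rate rather than from the rounded per-BIN times. The 100M hashes per second figure is the assumed GPU instance throughput from earlier, the price is the quoted $2.30/hr, and `total_hours` is a hypothetical helper name.

```python
GPU_INSTANCE_RATE = 200_000_000 / 2   # assumed hashes per second (200M in 2 s)
PRICE_PER_HOUR = 2.30                 # quoted instance price in USD


def total_hours(bins_4_digit: int, bins_6_digit: int) -> float:
    """Hours to exhaust every number under the given count of 4 and 6 digit BINs."""
    candidates = bins_4_digit * 10**12 + bins_6_digit * 10**10
    return candidates / GPU_INSTANCE_RATE / 3600


hours = total_hours(bins_4_digit=7, bins_6_digit=37)
cost = hours * PRICE_PER_HOUR
print(f"{hours:.2f} hours, about ${cost:.2f}")
```

Without rounding the per-BIN times down, the total comes out slightly above 20 hours and the bill a little higher, which doesn't change the conclusion.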
The takeaway
Don't hash credit card numbers. Or, if you insist on doing it, store/use them in a way that guarantees that if they are reversed, the would-be attacker can't connect them back to personally identifiable information that would make them usable elsewhere (e.g. avoid a foreign key to the user table, and don't store them with created_at/updated_at timestamps that can be linked to other tables/columns). GPU availability is only getting better and better, and they're cramming more and more cores into them. Over the next few years you can expect the cost of this kind of computing to drop and the attack to become easier and easier. What's "irritating" today at 2.7 hours is going to be trivial in the not-so-distant future.