Category: Cryptography

  • The SHAttered SHA-1 Collision, Explained

    On 23 February 2017, researchers at the Cryptology Group at CWI Amsterdam and Google Research published two different PDF files that share the same SHA-1 hash. The digest is 38762cf7f55934b34d179ae6a4c80cadccbb7f0a for both. The project was called SHAttered, and it was the first time anyone had produced a practical, public collision for the full SHA-1 hash function. You can download the two files from this site and check them yourself: shattered-1.pdf and shattered-2.pdf. They are visibly different documents, yet SHA-1 cannot tell them apart.

    That single fact ended a long argument about whether SHA-1 was still safe to trust. This article walks through what a collision is, why SHA-1 fell, what the team actually built, and what changed afterward.

    What a hash collision is, and why it matters

    A cryptographic hash function takes an input of any size and returns a fixed-length fingerprint. SHA-1 produces 160 bits, usually written as 40 hexadecimal characters. Three properties are supposed to hold: given only a digest, you cannot find an input that produces it (preimage resistance); given an input, you cannot find a different input with the same digest (second-preimage resistance); and you cannot find any two inputs that hash to the same value. That last property is collision resistance.

    Collisions always exist in a mathematical sense. There are infinitely many possible inputs and only a finite number of 160-bit outputs, so some inputs must share a digest. The security claim is not that collisions are absent. It is that finding one should be so expensive that no realistic attacker can do it. SHAttered broke that claim for SHA-1.
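
    The cost argument can be made concrete on a deliberately weakened target. The sketch below is an illustration only, not the SHAttered method: it keeps just the first 24 bits of SHA-1 (the truncation and the function names are assumptions made for the demo) and runs a plain birthday search, which tends to succeed after a few thousand attempts, near 2^12, rather than 2^24.

```python
import hashlib
from itertools import count


def truncated_sha1(data: bytes, bits: int) -> int:
    """Return the first `bits` bits of SHA-1(data) as an integer."""
    digest = int.from_bytes(hashlib.sha1(data).digest(), "big")
    return digest >> (160 - bits)


def find_truncated_collision(bits: int = 24):
    """Birthday search: hash distinct inputs until two share a truncated digest."""
    seen = {}  # truncated digest -> input that produced it
    for attempts in count(1):
        candidate = f"message-{attempts}".encode()
        tag = truncated_sha1(candidate, bits)
        if tag in seen:
            return seen[tag], candidate, attempts
        seen[tag] = candidate


first, second, attempts = find_truncated_collision(24)
print(first, second, attempts)
```

    The same search against all 160 bits would need on the order of 2^80 attempts, which is the brute-force wall the rest of this article refers to.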

    Why does this matter outside a lab? Digital signatures, certificates, and integrity checks rarely sign the document itself. They sign its hash. A signature, a Git commit ID, a certificate fingerprint, a “this file has not changed” guarantee all collapse the file down to a hash and trust that the hash is unique to that file. If two files share a hash, a signature over one is equally valid over the other. The link between “what was approved” and “what you received” quietly breaks.

    Why SHA-1 was vulnerable

    SHA-1 was designed by the NSA and standardized by NIST in 1995. It is built on the Merkle-Damgård construction, processing a message in 512-bit blocks and mixing each block into a 160-bit internal state through 80 rounds of additions, rotations, and bitwise operations.

    The trouble started early. In 2005, Wang et al. showed that collisions could be found in roughly 2^69 operations, well below the 2^80 that a generic birthday attack on a 160-bit hash would need. That was a theoretical result, far beyond the reach of the hardware of the day, but it marked SHA-1 as weakened. Over the following decade, cryptanalysts including Marc Stevens refined differential attacks that exploit how small, carefully chosen differences in the input propagate through the round function. Each refinement narrowed the gap between theory and a real, buildable attack.

    The core weakness is structural. SHA-1’s round function does not diffuse differences strongly enough to stop an attacker from constructing two message blocks whose internal disturbances cancel out by the end. Once you can engineer that cancellation, you can force the internal state to converge, and a collision follows.

    What the SHAttered team actually did

    The attack was an identical-prefix collision. Both PDFs begin with exactly the same bytes. The researchers then computed a pair of carefully crafted near-collision block sequences that, when appended to that shared prefix, drive SHA-1’s internal state to the same value. Because the internal states match at the point where the colliding blocks end, any identical suffix appended to both files keeps the hashes equal, a direct consequence of the Merkle-Damgård design.
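
    The suffix property is easy to reproduce on a toy hash. The sketch below is not SHA-1 and not the SHAttered technique: it assumes a made-up Merkle-Damgård hash with a 16-bit state (built by truncating SHA-256, purely so collisions are cheap), finds two different first blocks that leave the state equal, and then checks that an identical suffix keeps the final digests equal. All names and sizes are hypothetical.

```python
import hashlib
from itertools import count

BLOCK_BYTES = 16   # toy block size (SHA-1 uses 64-byte blocks)
STATE_BYTES = 2    # toy 16-bit state, so collisions are cheap to find
IV = b"\x00" * STATE_BYTES


def compress(state: bytes, block: bytes) -> bytes:
    """Toy compression function: truncate SHA-256(state || block)."""
    return hashlib.sha256(state + block).digest()[:STATE_BYTES]


def toy_md_hash(message: bytes) -> bytes:
    """Merkle-Damgard iteration with length padding over the toy compressor."""
    padded = message + b"\x80"
    padded += b"\x00" * ((BLOCK_BYTES - 8 - len(padded)) % BLOCK_BYTES)
    padded += (len(message) * 8).to_bytes(8, "big")
    state = IV
    for offset in range(0, len(padded), BLOCK_BYTES):
        state = compress(state, padded[offset:offset + BLOCK_BYTES])
    return state


def find_colliding_blocks():
    """Find two different single blocks that leave the internal state identical."""
    seen = {}  # state after one block -> block that produced it
    for i in count():
        block = i.to_bytes(BLOCK_BYTES, "big")
        state = compress(IV, block)
        if state in seen:
            return seen[state], block
        seen[state] = block


block_a, block_b = find_colliding_blocks()
suffix = b"any identical suffix appended to both messages"
print(toy_md_hash(block_a + suffix) == toy_md_hash(block_b + suffix))
```

    Because both messages have the same length, the padding is identical too, so once the states agree after the first block they never diverge again.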

    PDF was a deliberate choice. The format is forgiving enough to hide the colliding blocks inside an object that controls which image is displayed, so the same hash maps to two documents that render with different visible content. That is what turns an abstract pair of byte strings into a believable abuse scenario.

    The team behind the work included Marc Stevens, Pierre Karpman, Elie Bursztein, Ange Albertini, and Yarik Markov, spanning CWI Amsterdam and Google. The full write-up, with the mathematics and the engineering detail, is hosted here as shattered.pdf.

    The scale of the computation

    SHAttered was not a clever shortcut that ran on a laptop. It was a genuinely large computation, and the numbers are the point.

    Approach | SHA-1 evaluations | Notes
    Generic birthday attack | about 2^80 | Brute-force baseline for a 160-bit hash
    SHAttered identical-prefix attack | about 2^63.1 (roughly 9.2 quintillion) | The actual demonstrated work
    Speedup over brute force | about 100,000x | Why the attack was feasible at all

    Google described the effort as the equivalent of around 6,500 CPU-years for the first phase and 110 GPU-years for the second. Spread across a large fleet, that is months of work, not centuries. The phased structure matters: the first phase searches for the first near-collision block pair, and the second phase finds the second pair, which cancels the remaining difference and completes the collision. Roughly 9.2 quintillion hash evaluations sounds astronomical, and it is, yet it sits far below the brute-force wall. That gap is exactly what cryptanalysis is meant to find.
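
    The speedup figure follows directly from the two exponents quoted above. A minimal check of the arithmetic, using only those numbers (the variable names are illustrative):

```python
import math

BIRTHDAY_EXPONENT = 160 / 2   # generic birthday bound for a 160-bit digest
SHATTERED_EXPONENT = 63.1     # attack cost quoted for SHAttered

# Dividing the two work factors means subtracting the exponents.
speedup = 2 ** (BIRTHDAY_EXPONENT - SHATTERED_EXPONENT)
print(f"speedup is about 2^{math.log2(speedup):.1f}, or {speedup:,.0f}x")
```

    The result lands a little above 100,000, which matches the rounded figure usually quoted for the attack.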

    Why two PDFs with the same hash is dangerous

    Picture a signing workflow. A reviewer approves a contract, and a system signs the SHA-1 hash of that PDF. With a collision pair in hand, an attacker can prepare two contracts in advance that share a hash: a benign one to be approved and a malicious one to be substituted later. The signature taken over the approved file validates perfectly against the swapped file, because the signature only ever covered the hash, and the hash is identical.

    The same logic threatens any system that uses SHA-1 as an identity:

    • Certificates. A certificate authority signing a SHA-1 certificate could be tricked into vouching for a colliding certificate it never intended to issue.
    • Software distribution. A SHA-1 checksum that “proves” a download is authentic proves nothing if a colliding payload exists.
    • Version control. Git identifies every commit and object by a SHA-1 hash. A collision means two different trees or blobs could claim the same identifier, which puts repository integrity in question. Git’s reliance on SHA-1 drew immediate scrutiny after the announcement.

    The attack does not let someone forge a hash for an arbitrary file you already hold. It lets an attacker who controls both documents produce a matching pair from the start. In signing, escrow, notarization, and supply-chain settings, control of both documents is a perfectly ordinary situation, which is what made the result so uncomfortable.

    Real-world fallout and the move to SHA-256

    The deprecation of SHA-1 had been on paper for years, but SHAttered turned a recommendation into an emergency. A reproducible artifact is far more persuasive than a complexity estimate.

    The response was quick and broad. Browser vendors finished removing trust for SHA-1 TLS certificates, and certificate authorities completed their migration to SHA-256. Within a few months, Git shipped a built-in collision detector based on the sha1collisiondetection library, which flags inputs bearing the fingerprints of this class of attack, and the project began its longer effort toward a hardened object format. Protocols, package managers, and signing tools accelerated their own retirements of SHA-1.

    The destination for most of that migration was SHA-256, part of the SHA-2 family. SHA-256 has no known practical collision attack, a wider 256-bit output, and a stronger internal design, which is why it became the default for certificates, signatures, and integrity checks. For a fuller picture of how these primitives fit together, see the cryptography pillar.

    What SHAttered means today

    SHA-1 should not be used where collision resistance matters. That includes signatures, certificates, and any “has this been tampered with” check on data an adversary might influence. For those uses, SHA-256 or stronger is the baseline.

    A few nuances are worth keeping straight. SHA-1’s preimage resistance, the difficulty of reversing a hash or matching a hash you did not help create, has not been broken. So a non-security use such as a deduplication key on trusted data is a different risk profile from a signature. The honest guidance is still simple: if security depends on the hash, move on from SHA-1.

    The broader lesson outlived the specific break. Cryptographic primitives age. An attack that looks purely theoretical, like the 2005 result, tends to creep toward practicality as analysis sharpens and hardware grows cheaper. SHAttered is the cleanest illustration of that arc: twelve years from “weakened on paper” to “two real files, one hash, downloadable today.” Modern systems that need to prove data has not changed now lean on SHA-256, and approaches such as provably fair verification use SHA-256 commitments so that anyone can independently confirm integrity rather than take it on trust.

    If you want to see the break with your own eyes, grab shattered-1.pdf and shattered-2.pdf, run sha1sum on each, and watch two different documents return the same 40-character digest.
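
    If sha1sum is not available, the same check can be sketched with Python's standard hashlib module. The helper names below are illustrative, and the script assumes both PDFs have been downloaded into the current directory. It also hashes each file with SHA-256, where the two documents should come out different.

```python
import hashlib
from pathlib import Path


def file_digest(path: str, algorithm: str) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def compare_files(path_a: str, path_b: str) -> None:
    """Print SHA-1 and SHA-256 digests of two files side by side."""
    for algorithm in ("sha1", "sha256"):
        digest_a = file_digest(path_a, algorithm)
        digest_b = file_digest(path_b, algorithm)
        verdict = "same" if digest_a == digest_b else "different"
        print(f"{algorithm}: {verdict}\n  {digest_a}\n  {digest_b}")


# Only runs when both files are present in the working directory.
if Path("shattered-1.pdf").exists() and Path("shattered-2.pdf").exists():
    compare_files("shattered-1.pdf", "shattered-2.pdf")
```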

    FAQ

    Was SHA-1 completely broken by SHAttered?

    Its collision resistance was broken in practice, which is the property that protects signatures and certificates. The attack produces a pair of files with a matching hash. It does not reverse hashes or let an attacker match a file they had no hand in creating, so SHA-1’s preimage resistance remains intact. For anything security-sensitive, that distinction does not save it: move to SHA-256.

    Can someone use this to forge a hash for a file I already have?

    No. SHAttered is an identical-prefix collision, meaning the attacker constructs both documents together so they share a hash. It cannot take an existing file you control and manufacture a second file matching it. The danger lives in workflows where an attacker supplies the documents, such as signing, notarization, or escrow.

    Why did the researchers use PDF files?

    PDF is flexible enough to embed the colliding blocks inside an object that selects which content is shown, so a single hash can correspond to two documents that look different on screen. That makes the threat concrete: it models a benign file being approved and a malicious one being swapped in under the same signature. The pair lives at shattered-1.pdf and shattered-2.pdf.

    Is SHA-256 affected by the same attack?

    No. SHAttered exploits weaknesses specific to SHA-1’s round function and 160-bit output. SHA-256 uses a different design with a 256-bit digest and has no known practical collision attack, which is why it became the standard replacement across browsers, certificate authorities, and version control.

  • SHA-256 Explained: How It Works and Why It Matters

    SHA-256 is the hash function that quietly secures most of the systems you touch every day. It signs the certificate behind the padlock in your browser, fingerprints the software you download, and anchors every block in the Bitcoin ledger. When researchers broke SHA-1 with the SHAttered collision, SHA-256 was already the recommended replacement, and it remains unbroken in practice. This article explains what it does, how it works under the hood, and why it earned that trust.

    What SHA-256 Is

    SHA-256 stands for Secure Hash Algorithm 256-bit. It belongs to the SHA-2 family, a set of hash functions designed by the United States National Security Agency (NSA) and published by the National Institute of Standards and Technology (NIST) in 2001 as part of FIPS 180-2. The “256” is the size of its output: every input, whether a single character or a multi-gigabyte disk image, produces a digest of exactly 256 bits. That is 32 bytes, usually written as 64 hexadecimal characters.

    A hash function takes data of any length and reduces it to a fixed-length fingerprint. SHA-256 is one specific, standardized way of doing that. For the broader picture of how this category of algorithm behaves, the hash functions overview covers the general model; here the focus stays on the SHA-256 design itself.

    The SHA-2 family also includes SHA-224, SHA-384, and SHA-512, which differ mainly in output length and internal word size. SHA-256 is the most widely deployed because its 256-bit output hits a practical balance: long enough to resist brute force for the foreseeable future, short enough to store and transmit cheaply.

    The Core Properties

    A cryptographic hash function is only useful if it holds a handful of strict guarantees. SHA-256 was designed to satisfy all of them.

    Deterministic and Fixed-Length

    The same input always produces the same output. Hash the word cryptography today, next year, or on a different machine, and you get an identical 64-character digest every time. The output length never changes either: an empty string and a full novel both hash to exactly 256 bits.
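
    Both guarantees can be checked in a few lines with Python's hashlib. This is a small sketch; the helper name and sample inputs are arbitrary.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as 64 hexadecimal characters."""
    return hashlib.sha256(data).hexdigest()


samples = [b"", b"cryptography", b"a full novel would go here " * 10_000]
for data in samples:
    digest = sha256_hex(data)
    repeat = sha256_hex(data)
    # Same input, same digest; any input length, same digest length.
    print(len(data), len(digest), digest == repeat)
```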

    One-Way (Preimage Resistance)

    Given a digest, there is no feasible way to work backward to the original input. The function discards structure as it runs, so reversing it would mean searching an astronomically large space of possible inputs. This is what lets a system store a fingerprint of sensitive data without storing the data itself.

    Collision Resistance

    A collision is two different inputs that produce the same digest. For a secure 256-bit hash, finding one by brute force would take roughly 2^128 operations thanks to the birthday bound, far beyond any current or projected computing power. SHA-256 has no known practical collision.

    The Avalanche Effect

    Change one bit of the input and, on average, half the output bits flip. There is no gradual drift; a tiny edit produces a digest that looks completely unrelated to the original. This is what makes a hash useful for detecting tampering: any change, however small, is loud.

    A Worked Conceptual Example

    The avalanche effect is easiest to see with a short string, watching what happens when one character changes:

    Input:  "shattered"
    SHA-256: 4c9c8f3b... (a fixed 64-character hex digest)
    
    Input:  "shattereD"   (only the last letter changed case)
    SHA-256: e1b7a402... (a completely different 64-character digest)
    

    Both inputs are nine characters long and differ by a single bit (lowercase d versus uppercase D). Yet the two digests share no resemblance, and both are still exactly 64 hex characters because the output length is fixed. Total input sensitivity plus a constant output size is the whole point.
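
    To reproduce the comparison and get the exact digests, a short sketch with hashlib is enough. It also counts how many bits differ between the two inputs and between the two outputs; the helper name is illustrative.

```python
import hashlib


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


lower, upper = b"shattered", b"shattereD"
digest_lower = hashlib.sha256(lower).digest()
digest_upper = hashlib.sha256(upper).digest()

print("SHA-256(shattered):", digest_lower.hex())
print("SHA-256(shattereD):", digest_upper.hex())
print("input bits changed: ", differing_bits(lower, upper))
print("output bits changed:", differing_bits(digest_lower, digest_upper), "of 256")
```

    The output count should sit somewhere near 128, since each output bit flips with probability of about one half.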

    How SHA-256 Works

    You do not need the full specification to follow the algorithm. SHA-256 processes a message in five conceptual stages.

    1. Padding the Message

    The input first gets padded so its total length is a multiple of 512 bits. The padding appends a single 1 bit, then enough 0 bits, then a 64-bit value recording the original message length. Encoding the length into the padding is a deliberate defense, known as Merkle-Damgård strengthening: two messages of different lengths can never pad out to the same sequence of blocks.
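
    A minimal sketch of the padding rule, assuming the message is a whole number of bytes (the function name is illustrative):

```python
def sha256_pad(message: bytes) -> bytes:
    """Pad a byte string the way SHA-256 does before splitting it into blocks."""
    bit_length = len(message) * 8
    padded = message + b"\x80"                     # a 1 bit followed by seven 0 bits
    padded += b"\x00" * ((56 - len(padded)) % 64)  # zero bytes up to 56 mod 64
    padded += bit_length.to_bytes(8, "big")        # original length, 64-bit big-endian
    return padded


padded = sha256_pad(b"abc")
print(len(padded), padded.hex())
```

    A 55-byte message still fits in one block, while a 56-byte message spills into a second block because the marker bit and the length field no longer fit.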

    2. Splitting Into 512-Bit Blocks

    The padded message is divided into blocks of 512 bits each, which SHA-256 processes one at a time, in sequence. Each block updates an internal state of eight 32-bit words. That state starts from eight fixed initial values, derived from the fractional parts of the square roots of the first eight prime numbers.
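
    The initial values can be regenerated from that description. The sketch below uses exact integer arithmetic to take the first 32 bits of the fractional part of each square root; the function name is illustrative.

```python
import math

FIRST_EIGHT_PRIMES = [2, 3, 5, 7, 11, 13, 17, 19]


def fractional_sqrt_word(prime: int) -> int:
    """First 32 bits of the fractional part of sqrt(prime)."""
    # isqrt(prime * 2**64) equals floor(sqrt(prime) * 2**32);
    # masking to 32 bits drops the integer part and keeps the fraction.
    return math.isqrt(prime << 64) & 0xFFFFFFFF


initial_state = [fractional_sqrt_word(p) for p in FIRST_EIGHT_PRIMES]
print([f"{word:08x}" for word in initial_state])
```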

    3. Building the Message Schedule

    For each 512-bit block, the algorithm expands the 16 incoming 32-bit words into 64. The extra 48 are generated by mixing earlier words using bit rotations, shifts, XOR, and addition modulo 2^32. This expanded array, the message schedule, ensures every part of the block influences many rounds of processing.

    4. Sixty-Four Rounds of Compression

    The heart of SHA-256 is its compression function, which runs 64 rounds per block. Each round takes the eight working variables (labeled a through h), mixes in one word from the message schedule and one round constant, and shuffles the state through additions, rotations, and the logical functions known as Ch and Maj. The 64 round constants are not arbitrary: they are the fractional parts of the cube roots of the first 64 prime numbers. Choosing well-known mathematical constants like this is a transparency measure, demonstrating the designers had no hidden structure to exploit, a property often called “nothing up my sleeve.”

    5. Producing the Digest

    After the final block, the eight 32-bit state words are concatenated into one 256-bit value: the digest. (At the end of each block, the working variables a through h are added back into that state, which is what carries it forward.) The whole construction, processing blocks in sequence while carrying state forward, follows the Merkle-Damgård model behind most classic hash designs.
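
    Putting the five stages together, here is a compact teaching sketch of SHA-256 in pure Python. It is meant for reading, not for production use (use hashlib for that), and the helper names are illustrative. The round constants are derived from cube roots as described in stage 4; the initial state is written out as literals.

```python
import struct

MASK = 0xFFFFFFFF
INITIAL_STATE = [0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
                 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19]


def _first_primes(n):
    found, candidate = [], 2
    while len(found) < n:
        if all(candidate % p for p in found):
            found.append(candidate)
        candidate += 1
    return found


def _icbrt(n):
    """Integer cube root by binary search."""
    lo, hi = 0, 1 << (n.bit_length() // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


# First 32 fractional bits of the cube roots of the first 64 primes.
K = [_icbrt(p << 96) & MASK for p in _first_primes(64)]


def _rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def sha256_hex(message: bytes) -> str:
    # Stage 1: padding.
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    padded += struct.pack(">Q", len(message) * 8)
    state = list(INITIAL_STATE)
    # Stage 2: one 512-bit (64-byte) block at a time.
    for offset in range(0, len(padded), 64):
        w = list(struct.unpack(">16I", padded[offset:offset + 64]))
        # Stage 3: expand 16 words into the 64-word message schedule.
        for t in range(16, 64):
            s0 = _rotr(w[t - 15], 7) ^ _rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
            s1 = _rotr(w[t - 2], 17) ^ _rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
            w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)
        # Stage 4: 64 rounds over the working variables a..h.
        a, b, c, d, e, f, g, h = state
        for t in range(64):
            big_s1 = _rotr(e, 6) ^ _rotr(e, 11) ^ _rotr(e, 25)
            ch = (e & f) ^ (~e & MASK & g)
            t1 = (h + big_s1 + ch + K[t] + w[t]) & MASK
            big_s0 = _rotr(a, 2) ^ _rotr(a, 13) ^ _rotr(a, 22)
            maj = (a & b) ^ (a & c) ^ (b & c)
            t2 = (big_s0 + maj) & MASK
            a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
        # Add the working variables back into the running state.
        state = [(s + v) & MASK for s, v in zip(state, (a, b, c, d, e, f, g, h))]
    # Stage 5: concatenate the eight state words.
    return "".join(f"{word:08x}" for word in state)


print(sha256_hex(b"abc"))
```

    Comparing the result against hashlib.sha256 for a handful of inputs is a quick way to confirm the sketch matches the standard.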

    Why SHA-256 Replaced SHA-1

    SHA-1 produces a 160-bit digest and was the workhorse hash of the 1990s and 2000s. Theoretical weaknesses surfaced as early as 2005, but the decisive blow came in February 2017, when researchers at CWI Amsterdam and Google produced SHAttered: the first practical SHA-1 collision. They crafted two different PDF files that hashed to the same SHA-1 digest, proving its collision resistance was broken in the real world, not just on paper.

    That result accelerated an already-underway migration. Certificate authorities, browsers, and version-control systems moved to SHA-256, whose larger output and stronger design carry no comparable weakness. The MD5 vs SHA-256 comparison shows why the older 128-bit MD5 fell even earlier and harder.

    Algorithm | Output size | Year standardized | Status
    SHA-1 | 160-bit | 1995 | Broken (practical collision, 2017)
    SHA-256 | 256-bit | 2001 | Secure, widely used
    SHA-3 | Variable (224 to 512-bit) | 2015 | Secure, alternative design

    SHA-3, standardized by NIST in 2015, is worth a note. It is not a patch on SHA-2 but a different construction (a sponge based on the Keccak algorithm) chosen through a public competition, existing as a structural backup so the world is not tied to one design family. Both SHA-256 and SHA-3 are considered secure today.

    Where SHA-256 Is Used

    The algorithm shows up wherever a trustworthy fingerprint is needed.

    TLS and certificates. The digital certificates behind HTTPS are signed using SHA-256. When your browser validates a site, it relies on a SHA-256 hash inside that signature to confirm the certificate is unaltered.

    Password storage. Systems avoid storing raw passwords. SHA-256 appears in this context, though secure password storage layers it inside a deliberately slow, salted construction such as PBKDF2 with HMAC-SHA-256, or swaps it for a dedicated password hash such as bcrypt, scrypt, or Argon2. A bare hash is fast, which suits integrity checks but is a liability for passwords, so the slowdown is intentional.
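
    As a rough sketch of the slow, salted approach, Python's standard library exposes PBKDF2 through hashlib.pbkdf2_hmac. The iteration count, salt length, and function names below are assumptions for illustration; a real deployment should follow current guidance for its own hardware.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 600_000  # assumed work factor; tune to your hardware and policy


def hash_password(password: str, iterations: int = ITERATIONS):
    """Return (salt, iterations, derived_key) for storage."""
    salt = secrets.token_bytes(16)  # fresh random salt per password
    derived = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return salt, iterations, derived


def verify_password(password: str, salt: bytes, iterations: int, expected: bytes) -> bool:
    """Recompute the derived key and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return hmac.compare_digest(candidate, expected)
```

    Storing the salt and iteration count next to the derived key lets the work factor be raised later without invalidating existing records.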

    File integrity and checksums. Software projects publish a SHA-256 checksum alongside their downloads. After fetching a file, you hash it and compare. If the digests match, the file arrived intact. A single flipped bit changes the entire digest, so the check is unforgiving.

    Bitcoin. SHA-256 is the engine of Bitcoin. Miners repeatedly hash block headers searching for an output below a target value (the proof-of-work puzzle), and every block is identified by its hash. The blockchain hashing explainer covers how this chains blocks into a tamper-evident ledger.
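
    The puzzle can be sketched in a few lines. This toy version is not Bitcoin's actual header format or difficulty encoding: the header bytes, the 8-byte nonce, and the big-endian comparison are simplifying assumptions. It does keep one real detail, which is that Bitcoin applies SHA-256 twice.

```python
import hashlib


def double_sha256(data: bytes) -> bytes:
    """SHA-256 applied twice, as Bitcoin does for block hashes."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def mine(header: bytes, difficulty_bits: int):
    """Try nonces until the hash, read as an integer, falls below the target."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = double_sha256(header + nonce.to_bytes(8, "little"))
        if int.from_bytes(digest, "big") < target:
            return nonce, digest.hex()
        nonce += 1


# 16 leading zero bits takes tens of thousands of attempts on average.
nonce, digest_hex = mine(b"toy block header", difficulty_bits=16)
print(nonce, digest_hex)
```

    Finding the nonce takes many hashes, but checking it takes one call to double_sha256, which is the asymmetry proof-of-work relies on.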

    Provably-fair gaming. Online games can use SHA-256 to prove an outcome was decided in advance and never altered. The operator commits to a secret server seed by publishing its hash before play begins. After the round, the original seed is revealed, and the player hashes it to confirm it matches that earlier commitment. Because the hash is one-way and deterministic, the operator could not have rigged the seed without the change being caught. Our guide to provably-fair systems walks through verifying a result yourself.
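
    The commit-and-reveal step can be sketched as follows. The seed format and function names are hypothetical; real provably-fair schemes typically also mix in a client seed and a nonce, which this sketch leaves out.

```python
import hashlib
import hmac
import secrets


def commit(server_seed: str) -> str:
    """Commitment published before play: the SHA-256 hash of the secret seed."""
    return hashlib.sha256(server_seed.encode()).hexdigest()


def verify_reveal(revealed_seed: str, published_commitment: str) -> bool:
    """After the round, the player re-hashes the revealed seed and compares."""
    return hmac.compare_digest(commit(revealed_seed), published_commitment)


server_seed = secrets.token_hex(32)   # operator's secret, generated before play
commitment = commit(server_seed)      # published before the round starts
print(commitment, verify_reveal(server_seed, commitment))
```

    The seed needs enough randomness that nobody can guess it from the commitment by brute force, which is why the sketch draws 32 random bytes.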

    Is SHA-256 Still Secure?

    Yes. There is no known practical collision attack against SHA-256, and no feasible method to reverse it or find preimages. The best published attacks apply only to reduced-round variants studied in academic settings; against the full 64-round function, no known attack beats brute force. The wider cryptography hub tracks the state of the art as it evolves. Barring a fundamental mathematical breakthrough, SHA-256 should stay secure for years, which is exactly why it sits underneath so much of the modern internet.

    Frequently Asked Questions

    Is SHA-256 encryption?

    No. Encryption is reversible: you can decrypt ciphertext back to the original with the right key. SHA-256 is a one-way hash with no key and no inverse. It produces a fingerprint, not a recoverable message.

    Can two different files ever have the same SHA-256 hash?

    In theory, yes, because infinitely many inputs map to a finite set of 256-bit outputs. In practice, finding such a pair would take on the order of 2^128 operations, which is computationally infeasible, and no SHA-256 collision has ever been found.

    How long is a SHA-256 hash?

    Always 256 bits, which is 32 bytes or 64 hexadecimal characters. The length is fixed regardless of whether the input is one byte or one terabyte.

    Why does Bitcoin use SHA-256 specifically?

    SHA-256 was a mature, well-analyzed, and unbroken hash when Bitcoin launched in 2009. Its one-way property and avalanche effect make the proof-of-work puzzle hard to solve yet trivial to verify, which is precisely the asymmetry a decentralized network needs.

  • Hashing and Cryptography Explained

    Cryptography is the science of protecting information so that only the intended parties can read it, verify it, or trust where it came from. It powers nearly everything you do online, from logging into your bank to checking that a downloaded file has not been tampered with. This guide walks through the core ideas, with hashing and hash functions front and center, since that is the area where our own research left a mark: in 2017 we produced the first practical SHA-1 collision, proving a widely used hash function was broken in practice and not just in theory.

    What Cryptography Actually Does

    At its heart, cryptography solves a small set of related problems. It keeps data confidential so outsiders cannot read it. It protects integrity so you can tell when data has been altered. It provides authentication so you know who you are really talking to. And it supports non-repudiation, meaning someone cannot later deny they signed or sent something.

    Two families of tools do most of this work: hash functions and encryption. People often confuse them, so it helps to pin down the difference early.

    Hashing vs Encryption: The Key Distinction

    Encryption is reversible. You scramble data with a key, and anyone holding the right key can unscramble it back to the original. The whole point is that the message can be recovered.

    Hashing is one-way. A hash function takes an input of any size and produces a fixed-length fingerprint of it. There is no key, and there is no “unhashing” to get the original back. You use hashing when you want to verify something without storing or transmitting the thing itself.

    A quick way to remember it: encryption keeps a secret you intend to reveal later; hashing creates a fingerprint you never plan to reverse.

    What a Hash Function Does

    A cryptographic hash function takes arbitrary data and returns a short, fixed-size string of bytes, usually shown as hexadecimal. Good ones share several properties.

    Deterministic and Fixed-Length

    The same input always yields the same output, every time, on every machine. And no matter whether you feed in one byte or a full movie file, the digest is always the same length. SHA-256, for example, always returns 256 bits (64 hex characters).

    One-Way (Preimage Resistance)

    Given a digest, it should be computationally infeasible to find an input that produces it. You can go forward easily but not backward. This is why password systems store hashes rather than the passwords themselves.

    The Avalanche Effect

    Change a single bit of the input and roughly half the output bits flip. The new digest looks completely unrelated to the old one. This property means a hash cannot leak hints about how similar two inputs were.

    Collision Resistance

    A collision is two different inputs that produce the same digest. Because outputs are a fixed size and inputs are unlimited, collisions must exist mathematically. The security promise is only that nobody can find one within any practical amount of computing time. When that promise breaks, the function is considered broken, which is exactly what happened to SHA-1.

    The SHA Family and MD5

    Most hash functions you will meet in the wild belong to a handful of well-known designs.

    MD5 produces a 128-bit digest and was once everywhere. It is now thoroughly broken: collisions can be generated in seconds on a laptop. It still appears as a non-security checksum, but it must never be used where an attacker could benefit from forging a match.

    SHA-1 produces a 160-bit digest and was the workhorse of the web for years. Our SHAttered work demonstrated the first real-world collision, using two distinct PDF files that hashed to the same SHA-1 value. That proof pushed the industry to retire it for certificates, signatures, and version control trust anchors.

    SHA-256 is part of the SHA-2 family and is the current default for most applications. With a 256-bit output and no known practical attacks, it underpins TLS certificates, Bitcoin, and software signing. Our dedicated SHA-256 explainer digs into how it is built and why it has held up.

    SHA-3 is a newer standard based on a completely different internal design (the Keccak sponge construction) rather than the Merkle-Damgård structure used by MD5, SHA-1, and SHA-2. It was standardized as a backup, so that a future weakness in SHA-2 would not leave everyone stranded.

    Hash function | Output size | Status
    MD5 | 128-bit | Broken, avoid
    SHA-1 | 160-bit | Broken (collision found), retired
    SHA-256 (SHA-2) | 256-bit | Secure, recommended
    SHA-3 | 224 to 512-bit | Secure, alternative design

    Encryption: Symmetric vs Asymmetric

    Encryption splits into two approaches that solve different parts of the puzzle, and real systems usually combine them.

    Symmetric Encryption (AES)

    With symmetric encryption, the same secret key both locks and unlocks the data. It is fast and well suited to bulk data, which is why it handles the actual payload in most secure connections. The standard here is AES (Advanced Encryption Standard), available in 128, 192, and 256-bit key sizes and trusted for everything from disk encryption to government secrets. The catch is key distribution: both sides must already share the secret, and getting it to them safely is its own problem. Our AES guide covers how the cipher works in detail.

    Asymmetric Encryption (RSA)

    Asymmetric, or public-key, cryptography solves the distribution problem with a pair of mathematically linked keys. A public key, which you can share freely, encrypts data that only the matching private key can decrypt. RSA is the classic example, with its security resting on the difficulty of factoring very large numbers. Public-key methods are slower, so in practice they are used to exchange a symmetric key or to sign data, after which fast symmetric encryption takes over. The public-key cryptography overview explains the key-exchange dance more fully.

    Digital Signatures

    Signatures combine hashing and public-key cryptography to prove authorship and integrity at once. To sign, you hash the message, then transform that hash with your private key (for RSA, this is often loosely described as encrypting the hash). Anyone with your public key can hash the message themselves and check that the signature corresponds to that hash. If the check passes, the message is genuinely yours and has not been changed.

    This is precisely why a broken hash function is dangerous for signatures. If an attacker can craft two documents with the same hash, a signature on the harmless one is also a valid signature on the malicious one. That attack scenario is what made the SHA-1 collision more than an academic curiosity. The digital signatures explainer walks through the full mechanism and its failure modes.
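
    The failure mode can be shown end to end with a toy. The sketch below uses textbook RSA with tiny primes (p = 61, q = 53) and a deliberately weak hash, SHA-256 reduced modulo 3233, so a collision is easy to find. None of this is secure or representative of real signature schemes, which add padding and use large keys; the names and parameters are purely illustrative.

```python
import hashlib
from itertools import count

# Textbook RSA with tiny primes (p=61, q=53). Insecure by design.
N, E, D = 3233, 17, 2753


def toy_digest(message: bytes) -> int:
    """SHA-256 reduced mod N: a deliberately weak hash with only 3233 outputs."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message: bytes) -> int:
    """Hash-then-sign: apply the private exponent to the digest."""
    return pow(toy_digest(message), D, N)


def verify(message: bytes, signature: int) -> bool:
    """Apply the public exponent and compare with a freshly computed digest."""
    return pow(signature, E, N) == toy_digest(message)


def find_collision():
    """Birthday search for two different messages with the same toy digest."""
    seen = {}
    for i in count():
        message = f"contract draft {i}".encode()
        digest = toy_digest(message)
        if digest in seen:
            return seen[digest], message
        seen[digest] = message


benign, malicious = find_collision()
signature = sign(benign)
print(verify(benign, signature), verify(malicious, signature))
```

    The signer only ever saw the first message, yet the signature verifies for both, because verification never looks past the digest.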

    Where This Shows Up in Real Life

    These ideas are not abstract. They run quietly under the surface of ordinary computing.

    • TLS / HTTPS: The padlock in your browser relies on asymmetric crypto to authenticate the server and agree on a symmetric key, then AES to encrypt the session, with hashes verifying integrity along the way.
    • Passwords: Sensible services never store your password. They store a salted hash, so a database breach does not hand attackers your actual credentials.
    • Bitcoin and blockchains: SHA-256 chains blocks together and secures mining, while digital signatures authorize every transaction.
    • Software integrity: Download pages publish hashes (and signatures) so you can confirm an installer was not swapped out or corrupted in transit.

    Each of these gets its own deeper treatment across the cluster, but they all lean on the same building blocks described above.

    Frequently Asked Questions

    Is hashing a type of encryption?

    No. Encryption is reversible with a key, while hashing is a one-way fingerprint with no way back. They are often used together, but they are different tools for different jobs.

    Why is SHA-1 considered broken if it still produces a hash?

    It still produces output, but researchers (including our team) found a way to generate two different inputs with the same SHA-1 digest. Once collisions are practical, the function can no longer be trusted for signatures or certificates, even though it technically still runs.

    Should I use MD5 for anything?

    Only as a basic, non-security checksum to catch accidental corruption. Never use it where an attacker could gain from forging a matching hash, since MD5 collisions are trivial to produce today.

    What hash function should I use instead?

    SHA-256 is the safe default for most needs. SHA-3 is a sound alternative built on a different design, and for password storage specifically you want a purpose-built, slow function such as bcrypt, scrypt, or Argon2 rather than a raw fast hash.