Hash Functions

Saravanan Vijayakumaran

Department of Electrical Engineering, IIT Bombay

September 13, 2023

Cryptographic Hash Functions

  • Methods for deterministically mapping a long input string to a shorter output called a digest
  • Primary requirement is that it should be infeasible to find collisions, i.e. two distinct inputs having the same digest

Non-Cryptographic Hash Functions

  • Used to build hash tables — key-value stores with \mathcal{O}(1) lookup time

  • Example: Multiplicative hashing

    • Let M be the size of the hash table
    • Let a,W \in \mathbb{N} be positive integers such that W > M and \gcd(a,W) = 1
    • An integer key x is mapped to \{0,1,\ldots, M-1\} as h_a(x) = \left\lfloor \frac{a x \bmod W}{W} M \right\rfloor
    • Special case of W = 2^w and M = 2^m can be implemented as h_a(x) = (ax \bmod 2^w) \gg (w-m); if w is the machine word size, the reduction modulo 2^w is implicit in unsigned word multiplication
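
The two forms of multiplicative hashing above can be compared with a short sketch. The function names and the multiplier a = 2654435769 are illustrative assumptions, not part of the slides; any odd a satisfies \gcd(a, 2^w) = 1.

```python
def mult_hash_general(x: int, a: int, W: int, M: int) -> int:
    """General form: floor(((a*x) mod W) * M / W), in exact integer arithmetic."""
    return ((a * x) % W) * M // W


def mult_hash_shift(x: int, a: int, w: int, m: int) -> int:
    """Special case W = 2**w, M = 2**m, computed with a mask and a shift."""
    # Python integers are unbounded, so the reduction mod 2**w that a
    # w-bit machine word performs implicitly is written out as a mask.
    return ((a * x) & ((1 << w) - 1)) >> (w - m)


w, m = 32, 10
a = 2654435769  # hypothetical multiplier; odd, so gcd(a, 2**w) = 1
indices = [mult_hash_shift(x, a, w, m) for x in range(5)]
```

Both functions return the same index whenever W and M are powers of two, since multiplying by 2^m and dividing by 2^w is the same as shifting right by w - m bits.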

Collision Avoidance

  • A collision is said to occur if h(x) = h(x') for x \neq x'
  • For non-cryptographic hash functions, the goal is to minimize collisions
  • For cryptographic hash functions, the goal is to make collisions infeasible to find (collisions must exist because the domain is larger than the co-domain)
    • Achieved by defining a large co-domain for h

Co-Domain for 256-bit Output Length

SHA-256

  • SHA = Secure Hash Algorithm, 256-bit output length
  • Accepts bit strings of length up to 2^{64}-1 bits
  • Announced in 2001 by NIST

Two Stages

  • Output calculation has two stages
    • Preprocessing
    • Hash Computation

Preprocessing

  • The input M is padded to a multiple of 512 bits
  • A 256-bit state variable H^{(0)} is set to \begin{align*} \begin{split} H_0^{(0)} = \texttt{0x6A09E667},& \quad H_1^{(0)} = \texttt{0xBB67AE85},\\ H_2^{(0)} = \texttt{0x3C6EF372},& \quad H_3^{(0)} = \texttt{0xA54FF53A},\\ H_4^{(0)} = \texttt{0x510E527F},& \quad H_5^{(0)} = \texttt{0x9B05688C},\\ H_6^{(0)} = \texttt{0x1F83D9AB},& \quad H_7^{(0)} = \texttt{0x5BE0CD19}. \end{split} \end{align*}
  • First 32 bits in the fractional parts of the square roots of the first eight primes
  • NUMS = Nothing Up My Sleeve
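
The NUMS claim can be checked directly. The sketch below recomputes the eight initial words from the square roots of the first eight primes using exact integer arithmetic; the helper names are assumptions made for illustration.

```python
from math import isqrt


def first_primes(count):
    """Return the first `count` primes by trial division."""
    primes, candidate = [], 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def frac_sqrt_bits(p):
    """First 32 bits of the fractional part of sqrt(p)."""
    # isqrt(p * 2**64) equals floor(sqrt(p) * 2**32); the mask drops the
    # integer part and keeps the 32 bits after the binary point.
    return isqrt(p << 64) & 0xFFFFFFFF


H0 = [frac_sqrt_bits(p) for p in first_primes(8)]
H0_hex = [f"0x{word:08X}" for word in H0]
```

The list `H0_hex` should reproduce the constants H_0^{(0)}, \ldots, H_7^{(0)} listed above.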

Hash Computation

  • A function f: \{0,1\}^{512} \times \{0,1\}^{256} \rightarrow \{0,1\}^{256} is used. It is called the compression function

  • Padded input is split into N 512-bit blocks M^{(1)}, M^{(2)}, \ldots, M^{(N)}

  • Given H^{(i-1)}, the next H^{(i)} is calculated as \begin{equation*} H^{(i)} = f(M^{(i)}, H^{(i-1)}), \quad 1 \le i \le N. \end{equation*}

  • H^{(N)} is the output of SHA-256 for input M

Building Blocks of f

  • U, V, W are 32-bit words
  • U \land V, U \lor V, U \oplus V denote bitwise AND, OR, XOR
  • U + V denotes integer sum modulo 2^{32}
  • \lnot U denotes bitwise complement
  • For 0 < n < 32 and U = u_0 u_1 \cdots u_{31} written with the most significant bit first, the shift right and rotate right operations are \begin{align*} \textsf{SHR}^n(U) & = \underbrace{000 \cdots 000}_{n \textrm{ zeros}} u_0 u_1 \cdots u_{30-n} u_{31-n}, \\ \textsf{ROTR}^n(U) & = u_{31-n+1} u_{31-n+2} \cdots u_{30} u_{31} u_0 u_1 \cdots u_{30-n} u_{31-n}. \end{align*}

  • Bitwise choice and majority functions \begin{align*} \textsf{Ch}(U,V,W) & = (U \land V) \oplus (\lnot U \land W), \\ \textsf{Maj}(U,V,W) & = (U \land V) \oplus (U \land W) \oplus (V \land W), \end{align*}
  • Let \begin{align*} \Sigma_0(U) & = \textsf{ROTR}^{2}(U) \oplus \textsf{ROTR}^{13}(U) \oplus \textsf{ROTR}^{22}(U) \\ \Sigma_1(U) & = \textsf{ROTR}^{6}(U) \oplus \textsf{ROTR}^{11}(U) \oplus \textsf{ROTR}^{25}(U) \\ \sigma_0(U) & = \textsf{ROTR}^{7}(U) \oplus \textsf{ROTR}^{18}(U) \oplus \textsf{SHR}^{3}(U) \\ \sigma_1(U) & = \textsf{ROTR}^{17}(U) \oplus \textsf{ROTR}^{19}(U) \oplus \textsf{SHR}^{10}(U) \end{align*}
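
The building blocks translate almost literally into code. This is a minimal sketch that stores each 32-bit word as a non-negative Python integer; the function names are chosen here for illustration.

```python
MASK = 0xFFFFFFFF  # keeps results within 32 bits


def shr(u, n):
    """Shift right: the n most significant bits become zero."""
    return u >> n


def rotr(u, n):
    """Rotate right: bits shifted out on the right re-enter on the left."""
    return ((u >> n) | (u << (32 - n))) & MASK


def ch(u, v, w):
    """Choice: each bit of u selects the matching bit of v (if 1) or w (if 0)."""
    return (u & v) ^ (~u & w & MASK)


def maj(u, v, w):
    """Majority: each output bit is the majority vote of the three input bits."""
    return (u & v) ^ (u & w) ^ (v & w)


def big_sigma0(u):
    return rotr(u, 2) ^ rotr(u, 13) ^ rotr(u, 22)


def big_sigma1(u):
    return rotr(u, 6) ^ rotr(u, 11) ^ rotr(u, 25)


def small_sigma0(u):
    return rotr(u, 7) ^ rotr(u, 18) ^ shr(u, 3)


def small_sigma1(u):
    return rotr(u, 17) ^ rotr(u, 19) ^ shr(u, 10)
```

For example, `rotr(0x12345678, 8)` moves the low byte to the top of the word, and `ch(0xFFFFFFFF, v, w)` returns `v` unchanged.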

f Calculation

  • Maintains internal state of 64 words \{W_j \mid j = 0,1,\ldots,63\}
  • Uses 64 constant words K_0, K_1, \ldots, K_{63} derived from the first 64 primes (first 32 bits of the fractional parts of their cube roots)
  • f(M^{(i)}, H^{(i-1)}) proceeds as follows
    1. Internal state initialization \begin{equation*} W_j = \begin{cases} M_j^{(i)} & 0 \le j \le 15, \\ \sigma_1(W_{j-2}) + W_{j-7} + \sigma_0(W_{j-15}) + W_{j-16} & 16 \le j \le 63. \end{cases} \end{equation*}
    2. Initialize eight 32-bit words \begin{equation*} (A,B,C,D,E,F,G,H) = \left(H_0^{(i-1)}, H_1^{(i-1)}, \ldots, H_6^{(i-1)}, H_7^{(i-1)}\right). \end{equation*}
    3. For j = 0,1,\ldots,63, iteratively update A,B,\ldots,H \begin{align*} \begin{split} & T_1 = H + \Sigma_1(E) + \textsf{Ch}(E,F,G) + K_j + W_j \\ & T_2 = \Sigma_0(A) + \textsf{Maj}(A,B,C) \\ & (A,B,C,D,E,F,G,H) = (T_1+T_2, A, B, C, D+T_1, E, F, G) \end{split} \end{align*}
    4. Calculate H^{(i)} from H^{(i-1)} \begin{equation*} (H_0^{(i)}, H_1^{(i)}, \ldots, H_7^{(i)}) = \left(A+H_0^{(i-1)}, B+H_1^{(i-1)}, \ldots, H+H_7^{(i-1)}\right). \end{equation*}
  • Demo: https://sha256algorithm.com/
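
The four steps above can be put together into a compact reference sketch. It is written for clarity, not speed. Two details are not stated on the slides and are taken from the SHA-256 standard (FIPS 180-4): the padding rule (a single 1 bit, then zeros, then the 64-bit message length) and the big-endian byte order. The constants are recomputed from primes, and the result is compared against Python's `hashlib`.

```python
import hashlib
from math import isqrt

MASK = 0xFFFFFFFF

def first_primes(count):
    primes, candidate = [], 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

def icbrt(n):
    """Largest integer r with r**3 <= n, found by binary search."""
    lo, hi = 0, 1 << (n.bit_length() // 3 + 2)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

PRIMES = first_primes(64)
H0 = [isqrt(p << 64) & MASK for p in PRIMES[:8]]  # square-root fractions
K = [icbrt(p << 96) & MASK for p in PRIMES]       # cube-root fractions

def rotr(u, n): return ((u >> n) | (u << (32 - n))) & MASK
def ch(u, v, w): return (u & v) ^ (~u & w)
def maj(u, v, w): return (u & v) ^ (u & w) ^ (v & w)
def big_sigma0(u): return rotr(u, 2) ^ rotr(u, 13) ^ rotr(u, 22)
def big_sigma1(u): return rotr(u, 6) ^ rotr(u, 11) ^ rotr(u, 25)
def small_sigma0(u): return rotr(u, 7) ^ rotr(u, 18) ^ (u >> 3)
def small_sigma1(u): return rotr(u, 17) ^ rotr(u, 19) ^ (u >> 10)

def pad(message: bytes) -> bytes:
    """Pad to a multiple of 64 bytes (512 bits), per FIPS 180-4."""
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    return padded + (8 * len(message)).to_bytes(8, "big")

def compress(block: bytes, state):
    """f(M^(i), H^(i-1)): one 512-bit block and a 256-bit state in, new state out."""
    # Step 1: internal state W_0, ..., W_63
    w = [int.from_bytes(block[4 * j:4 * j + 4], "big") for j in range(16)]
    for j in range(16, 64):
        w.append((small_sigma1(w[j - 2]) + w[j - 7]
                  + small_sigma0(w[j - 15]) + w[j - 16]) & MASK)
    # Step 2: working variables A, ..., H
    a, b, c, d, e, f, g, h = state
    # Step 3: 64 rounds
    for j in range(64):
        t1 = (h + big_sigma1(e) + ch(e, f, g) + K[j] + w[j]) & MASK
        t2 = (big_sigma0(a) + maj(a, b, c)) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    # Step 4: add the previous state word by word
    return [(x + y) & MASK for x, y in zip((a, b, c, d, e, f, g, h), state)]

def sha256(message: bytes) -> bytes:
    padded = pad(message)
    state = list(H0)
    for i in range(0, len(padded), 64):
        state = compress(padded[i:i + 64], state)
    return b"".join(word.to_bytes(4, "big") for word in state)

samples = [b"", b"abc", b"a" * 55, b"a" * 56, b"a" * 64, b"a" * 200]
matches = all(sha256(s) == hashlib.sha256(s).digest() for s in samples)
```

The sample lengths 55, 56 and 64 bytes sit around the padding boundary, where an off-by-one error in `pad` would show up first.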

Observations

  • SHA-256 looks complicated but is efficient to compute
  • Difficult to invert
  • Difficult to find collisions

Collision Probabilities for Random Functions

  • If a random function with co-domain of size M is queried Q times, the probability of at least one collision is \epsilon \approx 1 - e^{-\frac{Q(Q-1)}{2M}} \implies Q \approx \sqrt{2M \ln \frac{1}{1-\epsilon}}
  • For \epsilon = 0.5, Q \approx 1.17 \sqrt{M}
  • For M = 2^{256}, we require approximately 2^{128} queries
  • For an ideal cryptographic hash function (CHF) with n-bit output, a generic birthday attack finds a collision in \mathcal{O}(2^{n/2}) evaluations
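
The numbers above follow from the approximation by direct substitution. A small sketch, with function names chosen for illustration:

```python
import math


def collision_probability(Q: int, M: int) -> float:
    """Approximate probability of at least one collision after Q queries."""
    return -math.expm1(-Q * (Q - 1) / (2 * M))


def queries_needed(eps: float, M: int) -> float:
    """Approximate number of queries for collision probability eps."""
    return math.sqrt(2 * M * math.log(1 / (1 - eps)))


# Constant in front of sqrt(M) for eps = 0.5: sqrt(2 ln 2), about 1.177
factor = queries_needed(0.5, 1)

# For a 256-bit co-domain, the answer is a little above 2**128 queries
log2_queries = math.log2(queries_needed(0.5, 2**256))

# Classic birthday setting: 23 people, 365 days
p_birthday = collision_probability(23, 365)
```

Here `factor` evaluates to \sqrt{2 \ln 2}, which the slide quotes as 1.17, and `log2_queries` is slightly above 128.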

Formal Definitions

Need for Keyed Hash Functions

  • A hash function is a deterministic function H : \{0,1\}^* \rightarrow \{0,1\}^l

  • For any such H, there is always a constant-time algorithm that outputs a collision

    • The algorithm simply outputs a colliding pair (x,x') hardcoded in the algorithm itself
  • In practice, finding a colliding pair may be hard, but because such an algorithm always exists we cannot define collision resistance of a fixed H against all PPT adversaries

  • Keyed hash functions are used as a workaround

Hash Function Definition

  • A hash function with output length l(n) is a pair of PPT algorithms (\textsf{Gen}, H) such that

    • \textsf{Gen} takes 1^n as input and outputs a key s. We assume s depends on n.
    • H is a deterministic algorithm that takes key s and x \in \{0,1\}^* as inputs and outputs a string H^s(x) \in \{0,1\}^{l(n)}
  • If H^s is defined only for inputs x of length l'(n) > l(n), then we call H a compression function

Collision-finding Experiment

  • For hash function \mathcal{H} = (\textsf{Gen}, H) and adversary \mathcal{A}, consider the following experiment \textsf{Hash-coll}_{\mathcal{A}, \mathcal{H}}(n)

    1. A key s is generated by running \textsf{Gen}(1^n)
    2. \mathcal{A} is given s, and outputs x, x' from the domain of H
    3. The output of the experiment is 1 iff x \neq x' and H^s(x) = H^s(x'). In this case, we say \mathcal{A} has found a collision.
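
The experiment can be simulated against a deliberately weak toy hash. Everything below is an assumption made for illustration: H^s(x) is modeled as the first 16 bits of SHA-256(s \| x), and the adversary is a simple birthday search. With such a short output the adversary always wins, so this toy function is not collision resistant.

```python
import hashlib
import os

OUT_BYTES = 2  # toy 16-bit output; far too short to be secure


def gen(n: int = 16) -> bytes:
    """Gen(1^n): sample a uniform n-byte key s."""
    return os.urandom(n)


def H(s: bytes, x: bytes) -> bytes:
    """Toy keyed hash H^s(x): truncated SHA-256 of s || x (assumption)."""
    return hashlib.sha256(s + x).digest()[:OUT_BYTES]


def birthday_adversary(s: bytes):
    """Hash distinct inputs until two of them share a digest."""
    seen = {}
    counter = 0
    while True:
        x = counter.to_bytes(8, "big")
        digest = H(s, x)
        if digest in seen:
            return seen[digest], x
        seen[digest] = x
        counter += 1


def hash_coll(adversary) -> int:
    """Run the collision-finding experiment once and return its output."""
    s = gen()                       # step 1: generate the key
    x, x_prime = adversary(s)       # step 2: adversary sees s, outputs x, x'
    return int(x != x_prime and H(s, x) == H(s, x_prime))  # step 3


result = hash_coll(birthday_adversary)
```

The search terminates after at most 2^{16} + 1 inputs by the pigeonhole principle, and typically after a few hundred, in line with the birthday bound.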

Collision Resistance Definition

  • A hash function \mathcal{H} = (\textsf{Gen}, H) is collision resistant if for all PPT adversaries \mathcal{A} there is a negligible function \textsf{negl} such that \Pr[\textsf{Hash-coll}_{\mathcal{A}, \mathcal{H}}(n) = 1] \le \textsf{negl}(n).

Weaker Notions of Security

  • Security requirements weaker than collision resistance

    • Second-preimage resistance: A hash function is said to be second-preimage resistant if given s and a uniform x it is infeasible for a PPT adversary to find x' \neq x such that H^s(x') = H^s(x)
    • Preimage resistance: A hash function is said to be preimage resistant if given s and y = H^s(x) for a uniform x it is infeasible for a PPT adversary to find x' such that H^s(x') = y

Applications of Hash Functions

Hash-and-MAC

  • Suppose we have a fixed-length secure MAC for l(n)-bit messages and a collision-resistant hash function with l(n)-bit output length
  • We can construct a secure MAC for an arbitrary-length message m by authenticating the hash of m
  • For key (k,s) and message m \in \{0,1\}^*, the tag is t \leftarrow \textsf{Mac}_k(H^s(m))
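
A minimal sketch of the hash-and-MAC construction, under stated assumptions: H^s(m) is modeled as SHA-256(s \| m), and HMAC-SHA256 restricted to 32-byte inputs stands in for the fixed-length MAC. HMAC already accepts arbitrary-length messages on its own, so it is used here only to make the composition concrete.

```python
import hashlib
import hmac
import os


def gen():
    """Sample the key pair (k, s): a MAC key and a hash key."""
    return os.urandom(32), os.urandom(16)


def keyed_hash(s: bytes, m: bytes) -> bytes:
    """H^s(m), modeled as SHA-256(s || m) (assumption)."""
    return hashlib.sha256(s + m).digest()


def fixed_length_mac(k: bytes, block: bytes) -> bytes:
    """Stand-in for Mac_k on 256-bit messages (assumption: HMAC-SHA256)."""
    if len(block) != 32:
        raise ValueError("fixed-length MAC accepts exactly 32 bytes")
    return hmac.new(k, block, hashlib.sha256).digest()


def mac(key, m: bytes) -> bytes:
    """Tag an arbitrary-length message: t = Mac_k(H^s(m))."""
    k, s = key
    return fixed_length_mac(k, keyed_hash(s, m))


def verify(key, m: bytes, t: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(mac(key, m), t)


key = gen()
tag = mac(key, b"an arbitrarily long message" * 100)
```

A forgery on a new message m' would require either a collision H^s(m') = H^s(m) or a forgery against the fixed-length MAC, which is the intuition behind the security of the construction.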

Fingerprinting and Deduplication

  • Virus fingerprinting
    • Virus scanners can identify viruses by comparing hashes of incoming files to hashes of known viruses
  • Deduplication
    • Cloud storage providers can identify duplicate files by comparing their hashes
    • User first uploads hash of a file; the file itself is uploaded only if it is not in the cloud
  • P2P file sharing
    • Hashes of files are used as unique identifiers for advertising
    • Called content-based addressing

Fingerprinting Multiple Files

  • Suppose a user uploads a file x to a server and retains H(x)
  • When user retrieves x, she checks if the hash matches
  • What if the user wants to store multiple files x_1, \ldots, x_t?
  • User needs to store as many hashes as files
  • One option is to store h = H(x_1, x_2, \ldots, x_t)
  • Merkle trees are a better option

Merkle Trees
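
A minimal sketch of how a Merkle tree reduces the hashes of t files to a single root hash that the user can store. The slides do not fix a tree layout, so the choices here are assumptions: a binary tree over SHA-256 leaf hashes, with the last node duplicated when a level has an odd number of nodes.

```python
import hashlib


def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(files) -> bytes:
    """Hash each file, then hash adjacent pairs until one root remains."""
    if not files:
        raise ValueError("need at least one file")
    level = [H(x) for x in files]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: duplicate the last node
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


files = [b"file one", b"file two", b"file three", b"file four"]
root = merkle_root(files)
```

The advantage over storing H(x_1, x_2, \ldots, x_t) is that a single file can be checked against the root using only the sibling hashes along its path, about \log_2 t values, without fetching the other files.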

Password Hashing

  • Web servers store hashes of passwords instead of the passwords in plaintext
  • Even if a web server is breached, the attacker has to find hash function preimages to recover the password
  • Passwords are salted before hashing so that precomputed tables of hashes are useless and identical passwords do not produce identical hashes
  • For some random salt r \leftarrow \{0,1\}^n, the server will store \langle r, H(r \| \textsf{password})\rangle
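
A minimal sketch of the stored record \langle r, H(r \| \textsf{password})\rangle, assuming H is SHA-256 and a 16-byte salt. The function names are illustrative. Production systems commonly replace the single hash with a deliberately slow password hashing function, which this sketch does not attempt.

```python
import hashlib
import hmac
import os


def store_password(password: str, salt_bytes: int = 16):
    """Return the record (r, H(r || password)) for a fresh random salt r."""
    r = os.urandom(salt_bytes)
    return r, hashlib.sha256(r + password.encode("utf-8")).digest()


def check_password(record, attempt: str) -> bool:
    """Recompute the salted hash of the attempt and compare in constant time."""
    r, digest = record
    candidate = hashlib.sha256(r + attempt.encode("utf-8")).digest()
    return hmac.compare_digest(candidate, digest)


record_alice = store_password("correct horse")
record_bob = store_password("correct horse")  # same password, different salt
```

Because the salts differ, `record_alice` and `record_bob` hold different digests even though the passwords are equal.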

Commitment Schemes

  • Commitment schemes allow a user to commit to a secret that can be revealed at a later time

  • A commitment scheme is a pair of PPT algorithms (\textsf{Gen}, \textsf{Com})

    • \textsf{Gen} takes 1^n as input and outputs public parameters \textsf{params}
    • \textsf{Com} takes \textsf{params}, a message m, and randomness r as inputs and outputs a commitment \textsf{com} = \textsf{Com}(\textsf{params}, m; r)

Commitment Scheme Workflow

  • The committer creates \textsf{com} and sends it to a receiver

  • The committer can later reveal m by sending m and r to the receiver

  • The receiver can verify the commitment by checking that \textsf{com} = \textsf{Com}(\textsf{params}, m; r)

Desirable Properties of Commitment Schemes

  • Hiding: The commitment \textsf{com} reveals nothing about m
  • Binding: It is infeasible for the committer to create a commitment \textsf{com} that can be opened later to two different messages m, m'

Commitment Schemes using Hash Functions

  • \textsf{Gen} = Choosing a hash function of appropriate output length; \textsf{params} = Hash function H
    • Recall that finding collisions for a hash function with l(n)-bit output requires \mathcal{O}(2^{l(n)/2}) operations
  • \textsf{Com}(\textsf{params}, m; r) = H(r \| m)
    • Randomness r should be long enough to prevent a PPT adversary from finding its value by exhaustive search
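
A minimal sketch of the hash-based commitment \textsf{Com}(\textsf{params}, m; r) = H(r \| m), assuming H is SHA-256 and r is a fixed 32 bytes so that r \| m splits unambiguously. The function names are illustrative.

```python
import hashlib
import hmac
import os

R_BYTES = 32  # length of the randomness r (assumption)


def commit(m: bytes):
    """Committer: sample r and return (com, r); only com is sent now."""
    r = os.urandom(R_BYTES)
    return hashlib.sha256(r + m).digest(), r


def verify(com: bytes, m: bytes, r: bytes) -> bool:
    """Receiver: check the revealed (m, r) against the earlier commitment."""
    if len(r) != R_BYTES:
        return False
    return hmac.compare_digest(com, hashlib.sha256(r + m).digest())


com, r = commit(b"my sealed bid: 42")
opened_ok = verify(com, b"my sealed bid: 42", r)
opened_other = verify(com, b"my sealed bid: 43", r)
```

Opening the commitment to a different message would require a collision in H, which is the binding argument. Hiding relies on r being too long to guess.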

Hashcash

  • Proposed in 1997 to combat email spam

  • Suppose a client wants to send an email to an email server

  • Client and server agree upon a hash function H

  • Email server sends the client a challenge string c

  • Client needs to find a string r such that H(c \| r) begins with k zero bits

  • Server accepts the email only if the client finds such an r

  • If H is modeled as a random function, the probability of success in a single trial is \frac{1}{2^k}

  • Around 2^k trials are required to find a satisfactory r

  • The r corresponding to c is a proof-of-work (PoW)

  • PoW is difficult to generate but easy to verify

  • Demo
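
A minimal sketch of the proof-of-work search, assuming H is SHA-256, the candidate r is an 8-byte counter, and "zeros" means leading zero bits of the digest. This illustrates the idea only and does not reproduce the actual Hashcash stamp format.

```python
import hashlib
import os


def leading_zero_bits(digest: bytes) -> int:
    """Number of zero bits at the start of the digest."""
    return 8 * len(digest) - int.from_bytes(digest, "big").bit_length()


def solve(c: bytes, k: int):
    """Client: try r = 0, 1, 2, ... until H(c || r) begins with k zero bits."""
    counter = 0
    while True:
        r = counter.to_bytes(8, "big")
        if leading_zero_bits(hashlib.sha256(c + r).digest()) >= k:
            return r, counter + 1  # the proof and the number of trials used
        counter += 1


def verify(c: bytes, r: bytes, k: int) -> bool:
    """Server: a single hash evaluation checks the proof."""
    return leading_zero_bits(hashlib.sha256(c + r).digest()) >= k


c = os.urandom(16)  # challenge string chosen by the server
k = 12              # difficulty: about 2**12 trials expected
r, trials = solve(c, k)
```

The asymmetry is visible in the code: `solve` needs around 2^k hash evaluations on average, while `verify` needs exactly one.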

Further Reading

Sections 6.1, 6.3.1, 6.6 of Katz and Lindell