Hash Functions

Saravanan Vijayakumaran

Department of Electrical Engineering, IIT Bombay

September 13, 2023

Cryptographic Hash Functions

Methods for deterministically mapping a long input string to a shorter output called a digest
Primary requirement is that it should be infeasible to find collisions, i.e. two distinct inputs having the same digest

Non-Cryptographic Hash Functions

Used to build hash tables, i.e. key-value stores with expected \mathcal{O}(1) lookup time
Example: Multiplicative hashing
- Let M be the size of the hash table
- Let a,W \in \mathbb{N} be positive integers such that W > M and \gcd(a,W) = 1
- An integer key x is mapped to \{0,1,\ldots, M-1\} as h_a(x) = \left\lfloor \frac{a x \bmod W}{W} M \right\rfloor
- Special case of W = 2^w and M = 2^m can be implemented as h_a(x) = (ax \bmod 2^w) \gg (w-m); if w is the machine word size, the reduction modulo 2^w comes for free from unsigned overflow
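
The two formulas above can be sketched in Python as follows. The multiplier `a`, the word size `w = 32`, the table size `2**10`, and the function names are illustrative assumptions, not values from the slides. Python integers do not overflow, so the reduction modulo 2^w is written out with a mask.

```python
from math import gcd

def multiplicative_hash(x: int, a: int, W: int, M: int) -> int:
    """General form: floor(((a*x) mod W) / W * M), computed with exact integers."""
    assert W > M and gcd(a, W) == 1
    return ((a * x) % W) * M // W

def multiplicative_hash_pow2(x: int, a: int, w: int, m: int) -> int:
    """Special case W = 2**w, M = 2**m: top m bits of the low w bits of a*x."""
    assert a % 2 == 1  # gcd(a, 2**w) = 1 forces an odd multiplier
    return ((a * x) & ((1 << w) - 1)) >> (w - m)

# Illustrative parameters (assumptions): 32-bit words, 1024 buckets, odd multiplier.
a, w, m = 2654435769, 32, 10
keys = [1, 2, 3, 1000, 123456789]
buckets = [multiplicative_hash_pow2(x, a, w, m) for x in keys]
print(buckets)
```

Both functions agree whenever W and M are powers of two, which is why the shift form is preferred in practice: it replaces a division by a single shift.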

Collision Avoidance

A collision is said to occur if h(x) = h(x') for x \neq x'
For non-cryptographic hash functions, the goal is to minimize collisions
For cryptographic hash functions, the goal is to make collisions infeasible to find (they must exist, since inputs can be longer than the digest)
- Achieved by defining a large co-domain for h

Co-Domain for 256-bit Output Length

SHA-256

SHA = Secure Hash Algorithm, 256-bit output length
Accepts bit strings of length up to 2^{64}-1
Announced in 2001 by NIST

Two Stages

Output calculation has two stages
- Preprocessing
- Hash Computation

Preprocessing

The input M is padded to a multiple of 512 bits
A 256-bit state variable H^{(0)} is set to \begin{align*} \begin{split} H_0^{(0)} = \texttt{0x6A09E667},& \quad H_1^{(0)} = \texttt{0xBB67AE85},\\ H_2^{(0)} = \texttt{0x3C6EF372},& \quad H_3^{(0)} = \texttt{0xA54FF53A},\\ H_4^{(0)} = \texttt{0x510E527F},& \quad H_5^{(0)} = \texttt{0x9B05688C},\\ H_6^{(0)} = \texttt{0x1F83D9AB},& \quad H_7^{(0)} = \texttt{0x5BE0CD19}. \end{split} \end{align*}
First 32 bits in the fractional parts of the square roots of the first eight primes
NUMS = Nothing Up My Sleeve
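
The nothing-up-my-sleeve claim can be checked directly. The sketch below recomputes the eight words of H^{(0)} from the square roots of the first eight primes using exact integer arithmetic; the helper names `first_primes` and `sqrt_fraction_word` are hypothetical.

```python
from math import isqrt

def first_primes(count: int) -> list:
    """Return the first `count` primes by trial division against earlier primes."""
    primes, candidate = [], 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

def sqrt_fraction_word(p: int) -> int:
    # isqrt(p << 64) equals floor(sqrt(p) * 2**32); its low 32 bits are the
    # first 32 bits of the fractional part of sqrt(p).
    return isqrt(p << 64) & 0xFFFFFFFF

H0 = [sqrt_fraction_word(p) for p in first_primes(8)]
print([f"0x{h:08X}" for h in H0])
```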

Hash Computation

A function f: \{0,1\}^{512} \times \{0,1\}^{256} \rightarrow \{0,1\}^{256} is used. It is called the compression function
Padded input is split into N 512-bit blocks M^{(1)}, M^{(2)}, \ldots, M^{(N)}
Given H^{(i-1)}, the next H^{(i)} is calculated as \begin{equation*} H^{(i)} = f(M^{(i)}, H^{(i-1)}), \quad 1 \le i \le N. \end{equation*}
H^{(N)} is the output of SHA-256 for input M

Building Blocks of f

U, V, W are 32-bit words
U \land V, U \lor V, U \oplus V denote bitwise AND, OR, XOR
U + V denotes integer sum modulo 2^{32}
\lnot U denotes bitwise complement
For 1 \le n \le 31 and U = u_0 u_1 \cdots u_{31}, the shift right and rotate right operations are defined as \begin{align*} \textsf{SHR}^n(U) & = \underbrace{000 \cdots 000}_{n \textrm{ zeros}} u_0 u_1 \cdots u_{30-n} u_{31-n}, \\ \textsf{ROTR}^n(U) & = u_{31-n+1} u_{31-n+2} \cdots u_{30} u_{31} u_0 u_1 \cdots u_{30-n} u_{31-n}. \end{align*}
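
A minimal sketch of the shift and rotate operations on 32-bit words, assuming words are held in Python integers and masked to 32 bits by hand; the names `shr` and `rotr` are illustrative.

```python
MASK32 = 0xFFFFFFFF

def shr(u: int, n: int) -> int:
    """SHR^n: shift right by n, filling the top n bits with zeros."""
    return (u & MASK32) >> n

def rotr(u: int, n: int) -> int:
    """ROTR^n: rotate right by n; the low n bits wrap around to the top."""
    u &= MASK32
    return ((u >> n) | (u << (32 - n))) & MASK32

U = 0x80000001
print(hex(shr(U, 1)))   # 0x40000000: the low bit is dropped
print(hex(rotr(U, 1)))  # 0xc0000000: the low bit reappears as the top bit
```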

Building Blocks of f

Bitwise choice and majority functions \begin{align*} \textsf{Ch}(U,V,W) & = (U \land V) \oplus (\lnot U \land W), \\ \textsf{Maj}(U,V,W) & = (U \land V) \oplus (U \land W) \oplus (V \land W), \end{align*}
Let \begin{align*} \Sigma_0(U) & = \textsf{ROTR}^{2}(U) \oplus \textsf{ROTR}^{13}(U) \oplus \textsf{ROTR}^{22}(U) \\ \Sigma_1(U) & = \textsf{ROTR}^{6}(U) \oplus \textsf{ROTR}^{11}(U) \oplus \textsf{ROTR}^{25}(U) \\ \sigma_0(U) & = \textsf{ROTR}^{7}(U) \oplus \textsf{ROTR}^{18}(U) \oplus \textsf{SHR}^{3}(U) \\ \sigma_1(U) & = \textsf{ROTR}^{17}(U) \oplus \textsf{ROTR}^{19}(U) \oplus \textsf{SHR}^{10}(U) \end{align*}
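
The names choice and majority describe what the two functions do bit by bit: in each position, Ch picks the bit of V where U has a 1 and the bit of W where U has a 0, while Maj outputs the value held by at least two of the three inputs. A small sketch with arbitrary example words:

```python
MASK32 = 0xFFFFFFFF

def ch(u: int, v: int, w: int) -> int:
    """Bitwise choice: u selects between v (where u is 1) and w (where u is 0)."""
    return ((u & v) ^ (~u & w)) & MASK32

def maj(u: int, v: int, w: int) -> int:
    """Bitwise majority of the three inputs."""
    return (u & v) ^ (u & w) ^ (v & w)

# U has ones in its top half, so Ch takes the top half of V and the bottom half of W.
U, V, W = 0xFFFF0000, 0x12345678, 0x9ABCDEF0
print(hex(ch(U, V, W)))   # 0x1234def0
print(hex(maj(U, V, W)))
```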

f Calculation

Maintains internal state of 64 words \{W_j \mid j = 0,1,\ldots,63\}
Uses 64 constant words K_0, K_1, \ldots, K_{63} derived from the first 64 primes (first 32 bits of the fractional parts of their cube roots)
f(M^{(i)}, H^{(i-1)}) proceeds as follows
1. Internal state initialization \begin{equation*} W_j = \begin{cases} M_j^{(i)} & 0 \le j \le 15, \\ \sigma_1(W_{j-2}) + W_{j-7} + \sigma_0(W_{j-15}) + W_{j-16} & 16 \le j \le 63. \end{cases} \end{equation*}
2. Initialize eight 32-bit words \begin{equation*} (A,B,C,D,E,F,G,H) = \left(H_0^{(i-1)}, H_1^{(i-1)}, \ldots, H_6^{(i-1)}, H_7^{(i-1)}\right). \end{equation*}
3. For j = 0,1,\ldots,63, iteratively update A,B,\ldots,H \begin{align*} \begin{split} & T_1 = H + \Sigma_1(E) + \textsf{Ch}(E,F,G) + K_j + W_j \\ & T_2 = \Sigma_0(A) + \textsf{Maj}(A,B,C) \\ & (A,B,C,D,E,F,G,H) = (T_1+T_2, A, B, C, D+T_1, E, F, G) \end{split} \end{align*}
4. Calculate H^{(i)} from H^{(i-1)} \begin{equation*} (H_0^{(i)}, H_1^{(i)}, \ldots, H_7^{(i)}) = \left(A+H_0^{(i-1)}, B+H_1^{(i-1)}, \ldots, H+H_7^{(i-1)}\right). \end{equation*}
Demo: https://sha256algorithm.com/
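
The four steps above, together with preprocessing, fit in a short program. The following is an educational sketch, not a production implementation. It assumes byte-aligned inputs and uses the padding rule from FIPS 180-4 (a single 1 bit, then zeros, then the 64-bit message length), which the slides do not spell out. All function names are illustrative.

```python
import hashlib
from math import isqrt

MASK32 = 0xFFFFFFFF

def first_primes(count):
    primes, candidate = [], 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

def icbrt(n):
    """Integer cube root: the largest r with r**3 <= n, found by binary search."""
    lo, hi = 0, 1 << (n.bit_length() // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

PRIMES = first_primes(64)
H_INIT = [isqrt(p << 64) & MASK32 for p in PRIMES[:8]]  # square-root fractions
K = [icbrt(p << 96) & MASK32 for p in PRIMES]           # cube-root fractions

def rotr(u, n):
    return ((u >> n) | (u << (32 - n))) & MASK32

def pad(message: bytes) -> bytes:
    """Append 0x80, zero bytes, and the 64-bit big-endian bit length."""
    zeros = (55 - len(message)) % 64
    return message + b"\x80" + b"\x00" * zeros + (8 * len(message)).to_bytes(8, "big")

def compress(block: bytes, h: list) -> list:
    # Step 1: message schedule W_0..W_63.
    w = [int.from_bytes(block[4 * j:4 * j + 4], "big") for j in range(16)]
    for j in range(16, 64):
        s0 = rotr(w[j - 15], 7) ^ rotr(w[j - 15], 18) ^ (w[j - 15] >> 3)
        s1 = rotr(w[j - 2], 17) ^ rotr(w[j - 2], 19) ^ (w[j - 2] >> 10)
        w.append((s1 + w[j - 7] + s0 + w[j - 16]) & MASK32)
    # Step 2: working variables; hh stands for the slide's H.
    a, b, c, d, e, f, g, hh = h
    # Step 3: 64 rounds.
    for j in range(64):
        big_s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        t1 = (hh + big_s1 + ((e & f) ^ (~e & g)) + K[j] + w[j]) & MASK32
        big_s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        t2 = (big_s0 + ((a & b) ^ (a & c) ^ (b & c))) & MASK32
        a, b, c, d, e, f, g, hh = (t1 + t2) & MASK32, a, b, c, (d + t1) & MASK32, e, f, g
    # Step 4: add the working variables to the previous state.
    return [(x + y) & MASK32 for x, y in zip([a, b, c, d, e, f, g, hh], h)]

def sha256(message: bytes) -> bytes:
    padded = pad(message)
    h = list(H_INIT)
    for i in range(0, len(padded), 64):
        h = compress(padded[i:i + 64], h)
    return b"".join(x.to_bytes(4, "big") for x in h)

print(sha256(b"abc").hex() == hashlib.sha256(b"abc").hexdigest())
```

Comparing against `hashlib.sha256` is a convenient way to validate each piece: a wrong rotation amount or a missing mask changes the digest completely.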

Observations

SHA-256 is complicated but can be computed easily
Difficult to invert
Difficult to find collisions

Collision Probabilities for Random Functions

If a random function with co-domain of size M is queried Q times, the probability of at least one collision is \epsilon \approx 1 - e^{-\frac{Q(Q-1)}{2M}} \implies Q \approx \sqrt{2M \ln \frac{1}{1-\epsilon}}
For \epsilon = 0.5, Q \approx 1.17 \sqrt{M}
For M = 2^{256}, we require approximately 2^{128} queries
To find collisions for CHFs with n-bit outputs, adversaries require \mathcal{O}(2^{n/2}) evaluations
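
The estimate can be checked empirically on a shrunken co-domain. The sketch below evaluates the formula for Q and runs a birthday search on SHA-256 truncated to 24 bits; the truncation length and the use of a counter as input are assumptions chosen so that the search finishes quickly.

```python
import hashlib
from math import log, sqrt

def queries_for_collision(M: float, eps: float) -> float:
    """Q ~ sqrt(2 M ln(1 / (1 - eps))) from the approximation above."""
    return sqrt(2 * M * log(1 / (1 - eps)))

def find_truncated_collision(bits: int):
    """Birthday search on SHA-256 truncated to `bits` bits (co-domain size 2**bits)."""
    seen = {}
    counter = 0
    while True:
        x = counter.to_bytes(8, "big")
        digest = int.from_bytes(hashlib.sha256(x).digest(), "big") >> (256 - bits)
        if digest in seen:
            return seen[digest], x, counter + 1
        seen[digest] = x
        counter += 1

x1, x2, queries = find_truncated_collision(24)
print("queries used:", queries)
print("estimate for eps = 0.5:", round(queries_for_collision(2 ** 24, 0.5)))
```

The number of queries used varies with the inputs tried, but it should be on the order of 2^{12}, far below the 2^{24} size of the co-domain.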

Formal Definitions

Need for Keyed Hash Functions

A hash function is a deterministic function H : \{0,1\}^* \rightarrow \{0,1\}^l
For any such H, there is always a constant-time algorithm that outputs a collision
- The algorithm simply outputs a colliding pair (x,x') hardcoded in the algorithm itself
In practice, finding a colliding pair may be hard but we cannot define collision resistance for all PPT adversaries
Keyed hash functions are used as a workaround

Hash Function Definition

A hash function with output length l(n) is a pair of PPT algorithms (\textsf{Gen}, H) such that
- \textsf{Gen} takes 1^n as input and outputs a key s. We assume s depends on n.
- H is a deterministic algorithm that takes key s and x \in \{0,1\}^* as inputs and outputs a string H^s(x) \in \{0,1\}^{l(n)}
If H^s is defined only for inputs x of length l'(n) > l(n), then we call H a compression function

Collision-finding Experiment

For hash function \mathcal{H} = (\textsf{Gen}, H) and adversary \mathcal{A}, consider the following experiment \textsf{Hash-coll}_{\mathcal{A}, \mathcal{H}}(n)
1. A key s is generated by running \textsf{Gen}(1^n)
2. \mathcal{A} is given s, and outputs x, x' from the domain of H
3. The output of the experiment is 1 iff x \neq x' and H^s(x) = H^s(x'). In this case, we say \mathcal{A} has found a collision.
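
The experiment translates almost line by line into code. In the sketch below, the keyed hash is a toy stand-in (SHA-256 of s \| x truncated to 16 bits), chosen only so that an adversary can actually win; `gen`, `keyed_hash`, and both adversaries are hypothetical names and constructions, not part of the definition.

```python
import hashlib
import secrets

OUTPUT_BYTES = 2  # toy output length of 16 bits (assumption: short enough to break)

def gen(n: int) -> bytes:
    """Gen(1^n): sample a uniform n-bit key s (n assumed to be a multiple of 8)."""
    return secrets.token_bytes(n // 8)

def keyed_hash(s: bytes, x: bytes) -> bytes:
    """Toy H^s(x): SHA-256 of s || x, truncated to OUTPUT_BYTES bytes."""
    return hashlib.sha256(s + x).digest()[:OUTPUT_BYTES]

def hash_coll(adversary, n: int) -> int:
    s = gen(n)                     # step 1: generate the key
    x, x_prime = adversary(s)      # step 2: the adversary sees s and outputs x, x'
    # step 3: output 1 iff the inputs differ and their digests agree
    return int(x != x_prime and keyed_hash(s, x) == keyed_hash(s, x_prime))

def birthday_adversary(s: bytes):
    """Hash successive counters until two of them collide."""
    seen = {}
    counter = 0
    while True:
        x = counter.to_bytes(4, "big")
        y = keyed_hash(s, x)
        if y in seen:
            return seen[y], x
        seen[y] = x
        counter += 1

def lazy_adversary(s: bytes):
    """Outputs the same string twice, which never counts as a collision."""
    return b"same", b"same"

print(hash_coll(birthday_adversary, 128), hash_coll(lazy_adversary, 128))
```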

Collision Resistance Definition

A hash function \mathcal{H} = (\textsf{Gen}, H) is collision resistant if for all PPT adversaries \mathcal{A} there is a negligible function \textsf{negl} such that \Pr[\textsf{Hash-coll}_{\mathcal{A}, \mathcal{H}}(n) = 1] \le \textsf{negl}(n).

Weaker Notions of Security

Security requirements weaker than collision resistance
- Second-preimage resistance: A hash function is said to be second-preimage resistant if given s and a uniform x it is infeasible for a PPT adversary to find x' \neq x such that H^s(x') = H^s(x)
- Preimage resistance: A hash function is said to be preimage resistant if given s and y = H^s(x) for a uniform x it is infeasible for a PPT adversary to find x' such that H^s(x') = y

Applications of Hash Functions

Hash-and-MAC

Suppose we have a fixed-length secure MAC for l(n)-bit messages and a collision-resistant hash function with l(n)-bit output length
We can construct a secure MAC for an arbitrary-length message m by authenticating the hash of m
For key (k,s) and message m \in \{0,1\}^*, the tag is t \leftarrow \textsf{Mac}_k(H^s(m))
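
A minimal sketch of the hash-and-MAC construction, assuming HMAC-SHA256 restricted to 32-byte inputs as the fixed-length MAC and SHA-256 over s \| m as the keyed hash H^s. Both instantiations are stand-ins picked for illustration.

```python
import hashlib
import hmac
import secrets

def hash_s(s: bytes, m: bytes) -> bytes:
    # Stand-in for the keyed hash H^s (assumption): SHA-256 over s || m.
    return hashlib.sha256(s + m).digest()

def fixed_length_mac(k: bytes, block: bytes) -> bytes:
    # Stand-in for the fixed-length MAC on 256-bit messages (assumption).
    assert len(block) == 32
    return hmac.new(k, block, hashlib.sha256).digest()

def mac(k: bytes, s: bytes, m: bytes) -> bytes:
    """t = Mac_k(H^s(m)) for a message m of arbitrary length."""
    return fixed_length_mac(k, hash_s(s, m))

def verify(k: bytes, s: bytes, m: bytes, t: bytes) -> bool:
    return hmac.compare_digest(mac(k, s, m), t)

k, s = secrets.token_bytes(32), secrets.token_bytes(32)
message = b"an arbitrarily long message " * 100
tag = mac(k, s, message)
print(verify(k, s, message, tag), verify(k, s, message + b"!", tag))
```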

Fingerprinting and Deduplication

Virus fingerprinting
- Virus scanners can identify viruses by comparing hashes of incoming files to hashes of known viruses
Deduplication
- Cloud storage providers can identify duplicate files by comparing their hashes
- User first uploads hash of a file; the file itself is uploaded only if it is not in the cloud
P2P file sharing
- Hashes of files are used as unique identifiers for advertising
- Called content-based addressing
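
The deduplication flow can be sketched with an in-memory store keyed by digest. The class `DedupStore` and the function `client_upload` are hypothetical; a real service would add authentication and proof that the client actually holds the file.

```python
import hashlib

class DedupStore:
    """Toy server: stores each distinct file once, addressed by its SHA-256 digest."""

    def __init__(self):
        self._blobs = {}

    def has(self, digest: str) -> bool:
        return digest in self._blobs

    def upload(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self._blobs[digest] = data
        return digest

    def fetch(self, digest: str) -> bytes:
        return self._blobs[digest]

def client_upload(store: DedupStore, data: bytes):
    """Send the hash first; send the file only if the server does not have it."""
    digest = hashlib.sha256(data).hexdigest()
    if store.has(digest):
        return digest, False   # duplicate: nothing uploaded
    store.upload(data)
    return digest, True        # new content: file uploaded

store = DedupStore()
first = client_upload(store, b"holiday-video-bytes")
second = client_upload(store, b"holiday-video-bytes")
print(first[1], second[1])
```

The digest doubles as the file's address, which is the same idea as content-based addressing in P2P systems.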

Fingerprinting Multiple Files

Suppose a user uploads a file x to a server and retains H(x)
When user retrieves x, she checks if the hash matches
What if the user wants to store multiple files x_1, \ldots, x_t?
User needs to store as many hashes as files
One option is to store h = H(x_1, x_2, \ldots, x_t)
Merkle trees are a better option
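
A sketch of one common binary Merkle tree construction, in which the user keeps only the root and the server supplies a short proof with each retrieved file. The choice of SHA-256, the duplication of the last node on odd-sized levels, and all function names are assumptions made for illustration.

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def _pad_level(level: list) -> list:
    # Duplicate the last node so every node has a sibling (a convention, assumed here).
    return level + [level[-1]] if len(level) % 2 else level

def merkle_root(files: list) -> bytes:
    level = [H(x) for x in files]
    while len(level) > 1:
        level = _pad_level(level)
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(files: list, index: int) -> list:
    """Sibling hashes on the path from leaf `index` up to the root."""
    level = [H(x) for x in files]
    proof = []
    while len(level) > 1:
        level = _pad_level(level)
        proof.append(level[index ^ 1])
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify_proof(x: bytes, index: int, proof: list, root: bytes) -> bool:
    node = H(x)
    for sibling in proof:
        node = H(sibling + node) if index % 2 else H(node + sibling)
        index //= 2
    return node == root

files = [b"file-%d" % i for i in range(5)]
root = merkle_root(files)            # the only value the user stores
proof = merkle_proof(files, 3)       # computed by the server on retrieval
print(verify_proof(files[3], 3, proof, root))
```

Compared with storing h = H(x_1, \ldots, x_t), verifying one file needs only about \log_2 t sibling hashes rather than all the other files.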

Merkle Trees

Password Hashing

Web servers store hashes of passwords instead of the passwords in plaintext
Even if a web server is breached, the attacker has to find hash function preimages to recover the password
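Passwords are salted before hashing to make preimages harder to find, e.g. to prevent one precomputed table of hashes of common passwords from being reused across users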
For some random salt r \leftarrow \{0,1\}^n, the server will store \langle r, H(r \| \textsf{password})\rangle
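
A minimal sketch of the stored record \langle r, H(r \| \textsf{password})\rangle, assuming SHA-256 for H and a 16-byte salt; the function names are illustrative.

```python
import hashlib
import hmac
import secrets

def register(password: str, salt_bytes: int = 16):
    """Return the record <r, H(r || password)> that the server stores."""
    r = secrets.token_bytes(salt_bytes)
    return r, hashlib.sha256(r + password.encode()).digest()

def check(password: str, record) -> bool:
    """Recompute the salted hash and compare in constant time."""
    r, stored = record
    candidate = hashlib.sha256(r + password.encode()).digest()
    return hmac.compare_digest(candidate, stored)

record_alice = register("correct horse battery staple")
record_bob = register("correct horse battery staple")
print(check("correct horse battery staple", record_alice))
print(record_alice[1] != record_bob[1])  # same password, different salts
```

Deployed systems usually replace the single hash with a deliberately slow function such as PBKDF2 or scrypt (available as `hashlib.pbkdf2_hmac` and `hashlib.scrypt`), so that each password guess costs the attacker more.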

Commitment Schemes

Commitment schemes allow a user to commit to a secret that can be revealed at a later time
A commitment scheme is a pair of PPT algorithms (\textsf{Gen}, \textsf{Com})
- \textsf{Gen} takes 1^n as input and outputs public parameters \textsf{params}
- \textsf{Com} takes \textsf{params}, a message m, and randomness r as inputs and outputs a commitment \textsf{com} = \textsf{Com}(\textsf{params}, m; r)

Commitment Scheme Workflow

The committer creates \textsf{com} and sends it to a receiver
The committer can later reveal m by sending m and r to the receiver
The receiver can verify the commitment by checking that \textsf{com} = \textsf{Com}(\textsf{params}, m; r)

Desirable Properties of Commitment Schemes

Hiding: The commitment \textsf{com} reveals nothing about m
Binding: It is infeasible for the committer to create a commitment \textsf{com} that can be opened later to two different messages m, m'

Commitment Schemes using Hash Functions

\textsf{Gen} = Choosing a hash function of appropriate output length; \textsf{params} = Hash function H
- Recall that a hash function with l(n)-bit output requires \mathcal{O}(2^{l(n)/2}) operations to find collisions
\textsf{Com}(\textsf{params}, m; r) = H(r \| m)
- Randomness r should be long enough to prevent a PPT adversary from finding its value by exhaustive search
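
The commit and reveal steps for \textsf{Com}(\textsf{params}, m; r) = H(r \| m) can be sketched as follows, assuming SHA-256 as H and a fixed 32-byte r so that the split between r and m is unambiguous.

```python
import hashlib
import hmac
import secrets

R_BYTES = 32  # fixed length of the randomness r (assumption)

def commit(m: bytes):
    """Committer: sample r and publish com = H(r || m); keep (m, r) private."""
    r = secrets.token_bytes(R_BYTES)
    return hashlib.sha256(r + m).digest(), r

def verify(com: bytes, m: bytes, r: bytes) -> bool:
    """Receiver: recompute H(r || m) once (m, r) is revealed."""
    return len(r) == R_BYTES and hmac.compare_digest(com, hashlib.sha256(r + m).digest())

com, r = commit(b"bid: 42")
print(verify(com, b"bid: 42", r))   # honest opening
print(verify(com, b"bid: 43", r))   # opening to a different message fails
```

Binding rests on collision resistance of H, while hiding relies on r being unpredictable; without r, a receiver could test guesses for a low-entropy m such as a bid.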

Hashcash

Proposed in 1997 to combat email spam
Suppose a client wants to send an email to an email server
Client and server agree upon a hash function H
Email server sends the client a challenge string c

Hashcash

Client needs to find a string r such that H(c \| r) begins with k zeros
Server accepts the email only if the client finds such an r

Hashcash

If H is modeled as a random function, the probability of success in a single trial is \frac{1}{2^k}
Around 2^k trials are required to find a satisfactory r
The r corresponding to c is a proof-of-work (PoW)
PoW is difficult to generate but easy to verify
Demo
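
A small proof-of-work sketch along these lines, assuming SHA-256 as H, interpreting "begins with k zeros" as k leading zero bits, and encoding r as an 8-byte counter. The original Hashcash proposal differs in its details.

```python
import hashlib
import secrets
from itertools import count

def has_leading_zero_bits(digest: bytes, k: int) -> bool:
    return int.from_bytes(digest, "big") >> (256 - k) == 0

def solve(c: bytes, k: int):
    """Client: try r = 0, 1, 2, ... until H(c || r) begins with k zero bits."""
    for trials, nonce in enumerate(count(), start=1):
        r = nonce.to_bytes(8, "big")
        if has_leading_zero_bits(hashlib.sha256(c + r).digest(), k):
            return r, trials

def verify(c: bytes, r: bytes, k: int) -> bool:
    """Server: a single hash evaluation checks the proof of work."""
    return has_leading_zero_bits(hashlib.sha256(c + r).digest(), k)

c = secrets.token_bytes(16)   # challenge string from the server
k = 16                        # difficulty: about 2**16 trials expected
r, trials = solve(c, k)
print("trials:", trials, "valid:", verify(c, r, k))
```

Raising k by one doubles the client's expected work while the server's verification cost stays at one hash.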
