Generating Order-Preserving Minimal Perfect Hash Functions

Aug 7, 2024

A hash function is a function that maps a set of strings to integers between an interval. If this function is injective, in other words the function produces no hash collisions, then we call the hash function perfect. If furthermore the function is bijective, then we call it a minimal perfect hash function, sometimes abbreviated as MPHF. In computer science we sometimes refer to strings as words.

I spent an evening implementing the CHM algorithm from a 1992 paper in C++. I think the algorithm is quite interesting, so I'll share it here. Given a set of words $W$ the algorithm can generate minimal perfect hash functions of the form:

$$h(w) = g(f_1(w)) + g(f_2(w)) \bmod{|W|}. $$

At a high-level, the algorithm randomly generates functions $f_1,f_2$ that map words to integers. It then derives a function $g:\mathbb{Z}\to\mathbb{Z}$ such that $h$ becomes a bijection.

Specifically, we start by randomly generating two arrays $T_1,T_2$ of length $\max\{|w| : w \in W\}$ by populating each element with a random integer between $[0,n-1]$ where $n\geq |W|$ is a number we will discuss in detail later. Then, define,

$$\begin{aligned} f_1(w) &= \sum_{i=1}^{|w|} w[i]\cdot T_1[i] \bmod {n} \\ f_2(w) &= \sum_{i=1}^{|w|} w[i]\cdot T_2[i] \bmod {n} \end{aligned}$$

Now, think of the graph $G=(V,E)$ with $V=\{0,\dotsc,n-1\}$ and $E=\{(f_1(w),f_2(w)) : w\in W\}$. This undirected graph $G$ associates each edge with a word. Test whether or not the graph is acyclic, if not re-initialize $T_1,T_2$ to random values. Proceed until you find arrays $T_1,T_2$ such that the graph $G$ is acyclic.

That was the first part of the algorithm. In particular, the value $n$ influences the probability that we get an acyclic graph. For more details, and proofs on the probabilities, see the paper.

In the second part, Note how the value of the hash function can be written as $g(f_2(w)) = h(w) - g(f_1(w))$, which permits us to find a function $g$ such that $h$ is a bijection by simply performing DFS on each connected component of the graph. This is sure to work because the graph is acyclic. In more detail,

Go through each vertex $v\in V$ and if $v$ has not been visited, set $g(v) = 0$ and run the procedure DFS on $v$.
In DFS(v), mark $v$ as visited and for each neighbour $n\in N(v)$ if $n$ has not been visited, then set $g(n) = h((v,n)\in E) - g(v)$ and recursively run DFS on $n$. You can set $h(e)$ to whatever number you want the word associated with the edge $e$ to be hashed to.

That was the algorithm. It consisted of two stages, first finding a suitable acyclic graph, then deducing a function $g$ that would make $h$ a bijection. Let's start our C++ implementation by defining $f$, which can be directly translated to C++.

int f(const std::string &w, const std::vector<int> &T) {
    int s = 0;
    for (int i = 0; i < w.size(); ++i) {
        s += T[i] * w[i];
    }
    return s;
}

In the global scope, start by setting up the devices we need to generate random numbers. I intended on practicing my C++23, hence the usage of views and ranges.

namespace rs = std::ranges;
using std::views::transform;
using std::views::iota;

std::random_device rd;
std::mt19937 gen(rd());

In the main function, we read the words for the hash function to accept in a file. Then comes the main loop to find an acyclic graph. Usually, with properly chosen $n$, a suitable graph can be found in no more than 1–2 iterations. The graph is represented edge-wise as an adjacency list.

std::ifstream ifs("words.in");
std::string line;
std::vector<std::string> words;
while (std::getline(ifs, line)) {
    words.push_back(line);
}

int n = words.size() * 2;
std::uniform_int_distribution<> dist(0, n - 1);

std::vector<std::tuple<int, int>> graph;
std::vector<int> T1, T2;
do {
    graph.clear();
    T1.clear();
    T2.clear();

    for (int i = 0; i < rs::max(words | transform(&std::string::size)); ++i) {
        T1.push_back(dist(gen));
        T2.push_back(dist(gen));
    }
    for (const auto &w : words) {
        graph.push_back({f(w, T1) % n, f(w, T2) % n});
    }
} while (has_cycle(n, graph));

The has_cycle function is implemented with a DSU, which is more suitable and natural for cycle detection than an algorithm such as DFS, especially if the graph is represented edge-wise.

class DSU {
    std::vector<int> parent, rank;
    int n;

public:
    DSU(int n) : n{n} {
        parent = rs::to<std::vector<int>>(iota(0, n - 1));
        rank = std::vector<int>(n, 0);
    }

    int find(int u) {
        if (u == parent[u]) return u;
        return parent[u] = find(parent[u]);
    }

    void unite(int u, int v) {
        u = find(u);
        v = find(v);
        if (u != v) {
            if (size[u] < size[v])
                std::swap(u, v);
            parent[v] = u;
            size[u] += size[v];
        }
    }
};

bool has_cycle(int n, const auto &graph) {
    DSU dsu(n);
    for (auto [u, v] : graph) {
        if (dsu.find(u) == dsu.find(v)) {
            return true;
        }
        dsu.unite(u, v);
    }
    return false;
}

Now, we deduce the value of the function $g$.

std::vector<int> deduce_g(int n, const auto &graph) {
    std::vector<int> g(n, 0);
    std::set<int> visited;

    auto dfs = [&](auto &self, int v) -> void {
        visited.insert(v);
        for (int i = 0; i < graph.size(); ++i) {
            auto [a, b] = graph[i];
            if (b == v) {
                std::swap(a, b);
            }
            if (a == v && !visited.contains(b)) {
                // CPP promotes dividend into unsigned integer,
                // causing me hours of pain.
                // g[b] = (i - g[a]) % graph.size();

                g[b] = i - g[a]; // Just use this, we do modulo later
                self(self, b);
            }
        }
    };

    for (int i = 0; i < n; ++i) {
        if (!visited.contains(i)) {
            g[i] = 0;
            dfs(dfs, i);
        }
    }
    return g;
}

Fun fact, I spent hours debugging my code. Turns out, C++ automatically promotes the dividend of the modulo operation into an unsigned integer because graph.size() was unsigned. Hence, if i-g[a] was a negative integer, the sign was just dropped unceremoniously.

auto g = deduce_g(n, graph);
emit_python(words.size(), T1, T2, g);

for (int i = 0; i < words.size(); ++i) {
    const auto w = words[i];
    assert((g[f(w, T1) % n] + g[f(w, T2) % n]) % graph.size() == i);
}

In the final step here, we can use the arrays $T_1,T_2$ and the function $g$ however we like. You can confirm that it is indeed a minimal perfect hash function or we can emit the hash function into a different language to use. I won't show the emit_python function here, it's straightforward but pretty messy.

The CHM algorithm is definitely not optimal; there exists better MPHF generation algorithms such as the CHD algorithm.