back to homepage

Longest Common Substring in Haskell

Rosalind is a website resembling Project Euler but for bioinformatics problems instead. On Rosalind, there is a problem that asks you to find the longest common substring between $k$ sequences of length $n$. I am aware of three different approaches to this problem:

  1. With dynamic programming, taking $O(n^k)$ time.
  2. By building a suffix tree, taking $O(nk)$ time.
  3. By conducting a binary search of the possible substring lengths. Substrings are detected in $O(nk)$ time on average by computing rolling hashes, similar to the approach used in Rabin–Karp string search. Takes $O(nk \log n)$ time in total.

I'm solving most of the problems on Rosalind with Haskell, as a programming exercise. I don't like dynamic programming or trees, so I chose to implement the third approach.

Let's get started with the code. Firstly, we need to import some standard fare:

{-# LANGUAGE DataKinds #-}

import Data.Char (ord)
import Data.List (elemIndex, groupBy)
import Data.Maybe (fromJust)
import Data.Mod
import Data.Set qualified as S

Then, we write a function that returns the rolling hash of a string where $k$ denotes the sliding window size. For example, if we have the string "hello" with $k=3$ then the first/starting hash is, with characters in ASCII,

$$H_0 = 104a^2 + 101a^1 + 108a^0$$

and the next hash is $H_1 = a(H_0 - 104a^2) + 108$ where we update the hash by striping off the first character (the "h") then shift the entire polynomial up by multiplying by $a$ before pushing a new character onto the newly created space on the right end, which in this case is the next "l" in "hello" producing the hash $H_1 = 101a^2 + 108a^1 + 108a^0$. This update proceduresmakes the algorithm much faster when $k$ is large compared to recomputing the hash from scratch for each sliding window.

In Haskell, we produce $a^0, \dotsc, a^{k-1}$ in a list called as and convert the input string s to a list of numbers called contents. We compute the startingHash and the middle variable, which stores the first and last character of each sliding window. Finally, we write a function to update a hash given the first and last characters of a sliding window. Then, a simple scanl gets the job done.

type ModP = Mod 2147483647 -- 2^31 - 1

a = 257 :: ModP

rollingKarp :: Int -> String -> [ModP]
rollingKarp k s = scanl update startingHash middle
  where
    as = reverse $ take k $ iterate (a *) 1
    contents = map (fromIntegral . ord) s
    middle = zip contents (drop k contents)
    startingHash = sum $ zipWith (*) contents as
    update ac (x, y) = (ac - x * head as) * a + y

Using this hash function, we can find common substrings of a certain length by taking a hash that is present in all strings. The function below assumes that there are no hash collisions, which suffices for the one-time challenge the website presents, but obviously does not ensure a correct solution. You should check manually that the substrings indicated by the hash are actually equal, but I was too lazy to implement that.

commonSubstringOfLength :: Int -> [String] -> Maybe String
commonSubstringOfLength n xs = hashToSubstring <$> candidateMatch
  where
    hashes = map (rollingKarp n) xs
    candidateMatch =
      let s = map S.fromList hashes
       in S.lookupMin $ foldl S.intersection (head s) s
    hashToSubstring h =
      let i = fromJust $ elemIndex h (head hashes)
       in take n $ drop i (head xs)

Now, we can implement a binary search to search for the maximum substring. If a common substring of length $n$ is found then we know that either the greatest common substring is of length $n$ or its greater than length $n$, so we search higher. Otherwise, if there is no common substring, we know that there can't be a common substring greater than length $n$, so we search lower.

search lo hi xs
  | hi - lo == 1 = fromJust result
  | otherwise =
      case result of
        Just x -> search (mid + 1) hi xs
        Nothing -> search lo mid xs
  where
    mid = (lo + hi) `div` 2
    result = commonSubstringOfLength mid xs

To solve the challenge presented by Rosalind, we just read in the FASTA file and run the binary search on the DNA strands.

fasta s = map (\x -> (tail (head x), concat (tail x))) d
  where
    d = groupBy (\a b -> head a /= '>' || head b /= '>') (lines s)

main = do
  s <- readFile "/home/kjc/Downloads/rosalind_lcsm.txt"
  let dna = map snd $ fasta s
  putStrLn $ search 0 (maximum (map length dna) - 1) dna

And we are done!