Yifei Li
MSCS @ UBC | Vancouver, BC
Rabin-Karp algorithm is commonly used in some string problem when it comes to finding the occurrences of a substring. The motivation of this algorithm is to devise a hash function that can efficiently represent distinctive strings. Therefore, this kind of hash function are sometimes refered to as a “rolling hash”. This hash function is usually paired with sliding window because it’s very efficient to compute the hashed result with small shifts.
Repeated DNA Sequences
The DNA sequence is composed of a series of nucleotides abbreviated as ‘A’, ‘C’, ‘G’, and ‘T’. Given a string s that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.
(from leetcode 187
Explanation
Given a fixed size 10, how can we find duplicated substrings that have a size of 10? An intuition is to inspect all possible 10-letters substrings with all the other 10-letters substrings. When comparing two substrings, the brute-force method has a time complexity of O(n) by comparing the characters one by one. We can use Rabin-Karp to improve the average time complexity. By applying a rolling hash function and comparing hash values, the time complexity of comparing two substrings will be O(1) but we cannot fully avoid the collision so the worst case is still O(n).
Step by step algorithm of Rabin-Karp is:
- Define the
basevalue. Usuallybasea prime number and >= maximum the maximum of all possible chars to avoid collision - Compute hash values of substring(s)
- Iterate through the whole string, if a substring’s hash value is seen, we compare the two substrings letter-by-letter. Update the substring’s hash value. The detailed computation step is shown in the example below
Example
For the simplicity of the example, I’ll find 3-letter-long duplicate instead of 10-letter-long one. And I’ll use a letter to number map as {'A': 1, 'C':2, 'G': 3, 'T':4}. You can use your own choice for this L2N map or use the ord() function as long as each character is represented by distinctive numbers. I’ll choose base=5 to avoid collision.
s: "TACGACGA"
result: ["ACG", "CGA"]
Python3 implementation
def findRepeatedDnaSequences(s: str) -> List[str]:
n = len(s)
if n<10:
return []
results = set()
base = 5
seen_hash = set()
power = base**9
hash_val = functools.reduce(lambda h,c:h*base+ord(c), s[:10], 0)
for i in range(10, n):
if hash_val in seen_hash:
results.add(s[i-10:i])
else:
seen_hash.add(hash_val)
# move window to right and update rolling hash
hash_val -= ord(s[i-10])*power
hash_val = hash_val*base + ord(s[i])
#inspect the last 10 letters' hash
if hash_val in seen_hash:
results.add(s[n-10:n])
return list(results)
Longest Duplicate Substring
Given a string s, consider all duplicated substrings: (contiguous) substrings of s that occur 2 or more times. The occurrences may overlap. Return any duplicated substring that has the longest possible length. If s does not have a duplicated substring, the answer is “”.
(from leetcode 1044)
Explanation
This problem can be decomposed into two steps: inspect substrings with different length + test if a substring with fixed length is duplicated. The first part can be quickly implemented by a binary search approach (for more info about binary search: Binary Search Basic and Advanced). The second part can be a helper function validating if such a fixed length substring exists. The helper function is implemented with Rabin-karp and base is selected to be 27 since all characters are lower case English letters.
Python3 implementation
def longestDupSubstring(s: str) -> str:
n = len(s)
left, right = 1, n
res = ''
def rabin_karp(size):
mapping = {} # fingerprint: actual value
base = 27
s_hash = functools.reduce(lambda h, c: h*base+ord(c), s[:size],0)
power_s = base**max(size-1,0)
# i is the index of the next char for the current substr
for i in range(size, n):
if s_hash in mapping:
return (True, s[i-size:i])
mapping[s_hash] = i-1
s_hash -= ord(s[i-size])*power_s
s_hash = ord(s[i]) + s_hash * base
if s_hash in mapping:
return (True, s[~size+1:])
return (False, '')
while left<=right:
size = left + (right-left)//2
flag, record = rabin_karp(size)
if flag:
if size > len(res):
res = record
# print(res)
left = size+1
else:
# move to smaller size
right = size - 1
return res