8 minute read

Longest common subsequence.

Problem description

Taken directly from LeetCode.

Given two strings text1 and text2, return the length of their longest common subsequence. If there is no common subsequence, return 0.

A subsequence of a string is a new string generated from the original string with some characters (can be none) deleted without changing the relative order of the remaining characters.

  • For example, "ace" is a subsequence of "abcde".

A common subsequence of two strings is a subsequence that is common to both strings.

Example 1:

Input: text1 = “abcde”, text2 = “ace”
Output: 3
Explanation: The longest common subsequence is “ace” and its length is 3.

Example 2:

Input: text1 = “abc”, text2 = “abc”
Output: 3
Explanation: The longest common subsequence is “abc” and its length is 3.

Example 3:

Input: text1 = “abc”, text2 = “def”
Output: 0
Explanation: There is no such common subsequence, so the result is 0.

Constraints:

  • 1 <= text1.length, text2.length <= 1000
  • text1 and text2 consist of only lowercase English characters.

Solution type

Caching, DP style.

Solution

Brute force is to scan the longer string with all \(2^{m}\) possible subsequences of the shorter string, where \(n\) is the length of the shorter string. Obviously this won’t work for an actual solution, and I’m not sure it tells us much about the DP approach.

My first thought is “if I look at the first x characters of text1 and the first y characters of text2, what is the longest common subsequence, and based on that, what does it look like if I look at the first x + 1 characters of test1 or the first y + 1 characters of test2 instead?”. Basically I have a 2D memo as my initial basis for a potential solution.

Our base case is if we only look at the very first characters of text1 and text2. If the characters match, then we have a longest common subsequence of length 1; else we have a longest common subsequence of length 0. Adding characters feels a little tricky; how do you know if a character is “used” already? I suppose what you could store is a triplet of (total length so far, subsequence end index in text1, subsequence end index in text2), but that feels a little ugly; you end up needing to do \(O(n)\) work to fill out a single position in the memo.

This feels similar to string Hamming distance, which I vaguely remember but can’t recite from memory at the moment.

So question is: do we need to do \(O(n)\) work for each character we potentially add to our subsequence? Right now, I don’t see a nice way to avoid it. Is there some \(O(1)\) way to check for the existence of a character in a string after a specific index? I don’t see a way to do it without messing with the order, and unless you do something funky with some storage for letter presence in substrings, I don’t see a way to avoid making it \(O(n)\) each time. We have improved dramatically on \(O(2^{m})\), so I’m okay writing this down for now and thinking of something more efficient later.

Hmm, here’s an issue. When you have two options (i.e. look at memo position [i - 1][j] or [i][j - 1]) and you get a subsequence out of either, which one is “better”? Let’s actually illustrate what that looks like first:

"abcde", "ace"

"abc", "ac" coming from "abc", "a" versus "ab", "ac". In both cases the earlier memo value should be (1, 1, 1), and the recalculated memo value should be (2, 3, 2). The difference is that in the former case, since we are expanding text2, we should scan in text1 from the position we stored for text1, and in the latter case we should switch which variables we are examining because we are expanding text1. Since we’re arriving at the two same substrings from different starting places, we should still get the same memo value, and if we don’t then something has gone wrong. Okay, this is fine.

Wait do I need a 2D memo?

def longestCommonSubsequence(self, text1: str, text2: str) -> int:
    memo = [[None] * len(text2)] * len(text1)
    memo[0][0] = (1, 1, 1) if text1[0] == text2[0] else (0, 1, 1)
    # text1 all the way
    for i in range(1, len(text1)):
        if memo[i - 1][0][0] == 1:
            memo[i][0] = memo[i - 1][0]
        else:    
            memo[i][0] = (1, i, 1) if text1[i] == text2[0] else (0, 1, 1)
    # text2 all the way
    for i in range(1, len(text2)):
        if memo[0][i - 1][0] == 1:
            memo[0][i] = memo[0][i - 1]
        else:    
            memo[0][i] = (1, 1, i) if text1[0] == text2[i] else (0, 1, 1)
    for text1Index in range(1, len(text1)):
        for text2Index in range(1, len(text2)):
            text1Expand, text2Expand = find()

    return memo[len(text1) - 1][len(text2) - 1][0]

No, this doesn’t work. Consider text1 := "abcde", text2 := "deabc".

def longestCommonSubsequence(self, text1: str, text2: str) -> int:
    if len(text1) < len(text2):
        text1, text2 = text2, text1
    length, subsequenceEnd = 0, 0
    for text2Char in text2:
        for text1Index in range(subsequenceEnd, len(text1)):
            if text2Char == text1[text1Index]:
                length += 1
                subsequenceEnd = text1Index + 1
                break
    return length

I’m looking at this the wrong way with the 2D memo as well though. There has to be some way to ask “should I start the subsequence here”? Right now I’m assuming the earlier you start, the longer subsequence you get; the example above shows that is clearly not true. Gah, I’m missing something obvious…

The recurrence relation must be wrong. It can’t just be based on the longest subsequence you’ve found so far. How do you use DP then; what are you caching?

What about a memo just on the substring of text2 where text2 is the shorter string? 2D triangle memo where we initialize for all single-character substrings, then fill in the triangle.

def longestCommonSubsequence(self, text1: str, text2: str) -> int:
    if len(text1) < len(text2):
        text1, text2 = text2, text1
    memo = [[None] * len(text2)] * len(text2)
    for i in range(len(text2)):
        text1Start = text1.find(text2, i, i + 1)
        memo[i][0] = (0, 0) if text1Start == -1 else (1, text1Start)
    for i in range(len(text2)):
        memo[i][i] 
    return memo[0][len(text2)]

Nope. Christ, I feel a bit dumb about this one.


Okay, fundamental issue here: I insisted on isolating pieces in the center instead of just starting from the start. Subsequences just care about the one “direction” when it comes to ordering, and there’s no particular need to consider cde from abcdefg separately from just the whole prefix abcde. So there’s no need to go out of your way to isolate “middle sections”.

So an actual solution looks like this:

def longestCommonSubsequence(self, text1: str, text2: str) -> int:
    memo = [[0] * (len(text2) + 1) for _ in range(len(text1) + 1)]
    for i in range(1, len(text1) + 1):
        for j in range(1, len(text2) + 1):
            if text1[i - 1] == text2[j - 1]:
                memo[i][j] = memo[i - 1][j - 1] + 1
            else:
                memo[i][j] = max(memo[i - 1][j], memo[i][j - 1])
    return memo[len(text1)][len(text2)]

The recurrence relation relies on comparing prefixes of text1 and text2. memo[i][j] stores the length of the longest common subsequence from the first i characters of text1 and the first j characters of text2. If for a given i,j pair the indices i - 1, j - 1 in text1 and text2 match, then we can add that character to our common subsequence (hence memo[i][j] = memo[i - 1][j - 1] + 1). Note that we compare to the memoized information from the prefixes that are each one character shorter, so indices i - 1 and j - 1 in the memo. Otherwise, we carry over the longest common subsequence length between the case of the first i - 1 and j characters each or the first i and j - 1 characters each. Base case is that for all pairs of indices where i = 0 and j = 0, the longest common subsequence is length 0 (since one of the prefixes is 0 characters).

Notably, I got thrown off and a bit tilted by some Python syntax. The way to declare a 2D array in Python is memo = [[0] * (len(text2) + 1) for _ in range(len(text1) + 1)], NOT memo = [[0] * (len(text2) + 1)] * *(len(text1) + 1). If you try it out in terminal they look the same:

>>> memo = [[0] * (2) for _ in range(5)]
>>> memo
[[0, 0], [0, 0], [0, 0], [0, 0], [0, 0]]
>>> memo = [[0] * (2)] * (5)
>>> memo
[[0, 0], [0, 0], [0, 0], [0, 0], [0, 0]]

…but because of how Python does references, each row here is actually the same object and modifying a single row will modify “all” of them (“all” is in quotation marks because they’re “all” really the same thing).

>>> memo[0][1] = 1
>>> memo
[[0, 1], [0, 1], [0, 1], [0, 1], [0, 1]]

The syntax with the for loop creates a new, distinct [0] * 2 object 5 times. There’s a good explanation on geeksforgeeks here, but this is essentially because when you use the [elt] * num_elts syntax for declaring arrays, each index of the array will point to the same elt object until reassigned. This saves on memory allocations and is basically invisible for 1D arrays, but causes problems as shown above for 2D arrays.

Moral of the story? Always use the list comprehension syntax in Python to do things. Really, though, I have no idea when you’d prefer a native 2D array over a NumPy implementation - not a lot of non-scientific computing requires multidimensional arrays, and if it does, then Python is probably not the best language for it? At least not for something production stage.

I got so frustrated (and also got thrown off kilter by a break in my routine) that I stopped doing leetcode for a week. Finally picking it back up though.

Leave a comment

Your email address will not be published. Required fields are marked. *

Loading...