Algorithm for Sentence Matching

I built an algorithm to determine whether two sentences are similar

Posted by Tejus Parikh on December 1, 2014

Use the Levenshtein distance on an encoded sentence

The application I build for work functions much better when there’s a historical track record of answers to a specific question. Since the users enter the questions themselves, I needed a way to alert them that significantly changing an existing question was not desired behavior. This was a rare opportunity to do some “real programming” and flex the normally dormant theory muscles. I took a morning and came up with an algorithm that fit my needs.

The Approach

What I wanted was an algorithm that would flag

What are your three priorities this week?

changing to

What do you think about your team this week?

but not flag the former sentence changing to

What are your three major goals this week? 

There’s a well-known metric for the difference between two sequences of letters: the Levenshtein distance. I did not feel that this metric alone would get the kinds of results I wanted.
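The Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. As a quick reference, here is a minimal pure-Ruby sketch of the classic dynamic-programming version (a stand-in for the gem-backed Text::Levenshtein.distance call that appears in the code below):

```ruby
# Minimal Levenshtein distance using the classic two-row DP table.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1,                 # deletion
               curr[j - 1] + 1,             # insertion
               prev[j - 1] + cost].min      # substitution (free on a match)
    end
    prev = curr
  end
  prev.last
end

levenshtein("kitten", "sitting") # => 3
```

The textbook example: "kitten" becomes "sitting" via two substitutions and one insertion, so the distance is 3.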

Here is where I made a few assumptions. I felt that differences in words were more important than differences in letters. I also assumed that a difference in words can be a suitable stand-in for a difference in meaning. My final assumption was that the two sentences I was comparing would have a combined total of fewer than 256 unique words.

Since words are really nothing more than symbols, I decided to encode each unique word as a byte, reconstruct the sentence as a byte array using the encoding, then use the Levenshtein algorithm to determine the distance between the two sentences. Prior to assigning a byte value to a word, I also ran the word through a stemmer to both reduce the number of unique words and to account for small grammatical changes.
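As a sketch of that encoding step (with stemming omitted for brevity; encode_pair is a hypothetical helper name, not from the code below):

```ruby
# Encode two sentences as strings of codepoints, one codepoint per unique word.
# Stemming is omitted here; the full implementation stems each token first.
def encode_pair(source, prime)
  source_tokens = source.downcase.scan(/[\w']+/)
  prime_tokens  = prime.downcase.scan(/[\w']+/)

  # Assign each unique word the next unused codepoint, in order of appearance
  token_map = Hash.new { |h, k| h[k] = h.length }

  [source_tokens.map { |t| token_map[t] }.pack('U*'),
   prime_tokens.map  { |t| token_map[t] }.pack('U*')]
end

a, b = encode_pair("the cat sat", "the cat ran")
a.codepoints # => [0, 1, 2]
b.codepoints # => [0, 1, 3]
```

The two encoded strings differ in exactly one "character," so a character-level Levenshtein distance on them is effectively a word-level distance on the sentences.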

The Code

I wrote the algorithm in Ruby. Since I’m looking for a dumb match, I set the difference threshold to a relatively high 40%. The code below requires the text Ruby gem for stemming and for computing the Levenshtein distance. The following is also available as a gist.

module ViJedi::SentenceUtil
    SIMILARITY_PERCENTAGE = 0.4

    # boolean check to see if the source and prime sentences are similar
    def self.similar(source, prime)
        return compute_values(source, prime)[:mismatch_percentage] < SIMILARITY_PERCENTAGE
    end

    # returns the word difference between the source and prime
    def self.word_distance(source, prime)
        compute_values(source, prime)[:distance]
    end

    private
    
    # returns an array representing the stemmed words from the string
    def self.tokenize(sentence)
        tokens = sentence.scan(/[\w']+/)
        tokens.collect do |token|
            # Stem each of the tokens using the Text gem
            Text::PorterStemming.stem(token)
        end
    end

    # returns a unicode string of the bytes representing the tokens
    def self.encode(tokens, token_map)
        bytes = tokens.collect do |token|
            token_map[token]
        end

        bytes.pack('U*')
    end

    # returns a hash containing the distance and the percent of mismatch
    def self.compute_values(source, prime)
        source_tokens = (source.present?) ? tokenize(source.downcase) : []
        prime_tokens = (prime.present?) ? tokenize(prime.downcase) : []

        # assign a numerical value to each unique word
        token_map = {}
        (source_tokens + prime_tokens).each do |token|
            token_map[token] ||= token_map.length
        end

        # just give up if the number of words is too large
        if token_map.length > 255
            return { 
                distance: -1,
                mismatch_percentage: 1
            }
        end

        encoded_source = encode(source_tokens, token_map)
        encoded_prime = encode(prime_tokens, token_map)

        distance = Text::Levenshtein.distance(encoded_source, encoded_prime)

        # since the maximum Levenshtein distance is the length of the longer
        # sentence, use the longer sentence as the base for the percentage
        mismatch_percentage = distance.to_f / ([encoded_source.length, encoded_prime.length].max)

        return {
            distance: distance,
            mismatch_percentage: mismatch_percentage
        }
    end
end
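To sanity-check the 40% threshold against the example sentences from earlier, here is a simplified, dependency-free version of the same logic. It skips stemming and computes the Levenshtein distance directly on the token arrays instead of packed strings, so it is not the gem-backed implementation above, just an approximation of it:

```ruby
SIMILARITY_PERCENTAGE = 0.4

# Levenshtein distance computed directly on arrays of words; equivalent to
# encoding each word as a codepoint and diffing the resulting strings.
def word_distance(a, b)
  prev = (0..b.length).to_a
  a.each_with_index do |wa, i|
    curr = [i + 1]
    b.each_with_index do |wb, j|
      cost = wa == wb ? 0 : 1
      curr << [prev[j + 1] + 1, curr[j] + 1, prev[j] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Simplified similarity check: no stemming, word-level distance only
def similar?(source, prime)
  source_tokens = source.downcase.scan(/[\w']+/)
  prime_tokens  = prime.downcase.scan(/[\w']+/)
  distance = word_distance(source_tokens, prime_tokens)
  distance.to_f / [source_tokens.length, prime_tokens.length].max < SIMILARITY_PERCENTAGE
end

question = "What are your three priorities this week?"
similar?(question, "What are your three major goals this week?")   # => true
similar?(question, "What do you think about your team this week?") # => false
```

The first rewrite is one substitution plus one insertion across eight words (25% mismatch), so it passes; the second shares only four words in sequence (67% mismatch), so it gets flagged.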

This algorithm is neither the best nor the fastest, but it does the job reasonably quickly for normal-length sentences.

Tejus Parikh

I'm a software engineer that writes occasionally about building software, software culture, and tech adjacent hobbies. If you want to get in touch, send me an email at [my_first_name]@tejusparikh.com.