
Use the Levenshtein distance on an encoded sentence
The application that I build for work functions much better when there’s a historical track record of answers to a specific question. Since the users enter the questions in themselves, I needed a way to alert them that significantly changing an existing question was not desired behavior. This was a rare opportunity to do some “real programming” and flex the normally dormant theory muscles. I took a morning and came up with an algorithm that fit my needs.
The Approach
What I wanted was an algorithm that would flag
What are your three priorities this week?
changing to
What do you think about your team this week?
but not flag the former sentence changing to
What are your three major goals this week?
There’s a well known metric for the difference between two sequences of letters, the Levenshtein Distance. I did not feel that this metric alone would get the kinds of results I wanted.
Here is where I made a few assumptions. I felt that differences in words were more important that differences in letters. I also assumed that a differences in words can be a suitable stand in for a difference in meaning. My final assumption was that the two sentences I was comparing would have a combined total of fewer than unique words.
Since words are really nothing more than symbols, I decided to encode each unique word as a byte, reconstruct the sentence as a byte array using the encoding, then use the Levenshtein algorithm to determine the distance between the two sentences. Prior to assigning a byte value to a word, I also ran the word through a stemmer to both reduce the number of unique words and to account for small grammatical changes.
The Code
I wrote the algorithm in Ruby. Since I’m looking for a dumb match, I set the difference threshold to a relatively high 40%. The code below requires the text ruby gem for stemming and computing the Levenshtein distance. The following is also available as a gist.
module ViJedi::SentenceUtil
SIMILARITY_PERCENTAGE = 0.4
# boolean check to see if the source and prime sentences are the same
def self.similar(source, prime)
return compute_values(source, prime)[:mismatch_percentage] < SIMILARITY_PERCENTAGE
end
# returns the word difference between the source and prime
def self.word_distance(source, prime)
compute_values(source, prime)[:distance]
end
private
# returns an array representing the stemmed words from the string
def self.tokenize(sentence)
tokens = sentence.scan(/[\w']+/)
tokens.collect do |token|
# Stem each of the tokens using the Text gem
Text::PorterStemming.stem(token)
end
end
# returns a unicode string of the bytes representing the tokens
def self.encode(tokens, token_map)
bytes = tokens.collect do |token|
token_map[token]
end
bytes.pack('U*')
end
# returns a hash containing the distance and the percent of mismatch
def self.compute_values(source, prime)
source_tokens = (source.present?) ? tokenize(source.downcase) : []
prime_tokens = (prime.present?) ? tokenize(prime.downcase) : []
# assign a numerical value to each unique word
token_map = {}
(source_tokens + prime_tokens).each do |token|
token_map[token] ||= token_map.length
end
# just give up if the number of words is too large
if token_map.length > 255
return {
distance: -1,
mismatch_percentage: 1
}
end
encoded_source = encode(source_tokens, token_map)
encoded_prime = encode(prime_tokens, token_map)
distance = Text::Levenshtein.distance(encoded_source, encoded_prime)
# since the maximum Levenshtein distance is the length of the longest word,
# use the longest word are the base to compute the distance.
mismatch_percentage = distance.to_f / ([encoded_source.length, encoded_prime.length].max)
return {
distance: distance,
mismatch_percentage: mismatch_percentage
}
end
end
This algorithm is not the best nor the fastest, but it does the job in a reasonably quick manner for relatively normal length sentences.
Did you like this content?
Related Posts:
Showing the Right Data with Out-of-Order Responses in Javascript
Javascript is an asynchronous language, which means sometimes things don't happen in the order you expect. This pattern ensures that the right response is shown.
The Rise and Fall of a Personal Cargo-Cult
Cargo-cults are common when confronted with the unfamiliar and new. This is not that story. Instead this is a story about forgotten knowledge around a complex interaction.
Pagination in AngularJS with Paginate-Anything and Kaminari
It takes a little bit of rails controller glue to connect angular-paginate-anything to Kaminari