The application that I build for work functions much better when there's a historical track record of answers to a specific question. Since the users enter the questions themselves, I needed a way to alert them that significantly changing an existing question was not desired behavior. This was a rare opportunity to do some "real programming" and flex the normally dormant theory muscles. I took a morning and came up with an algorithm that fit my needs.
What I wanted was an algorithm that would flag a change from
What are your three priorities this week?
to
What do you think about your team this week?
but not flag the former sentence changing to
What are your three major goals this week?
There's a well-known metric for the difference between two sequences of letters, the Levenshtein distance. I did not feel that this metric alone would get the kinds of results I wanted.
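For reference, here is a minimal sketch of the letter-level Levenshtein distance applied to the example questions. The `levenshtein` helper is written out by hand for illustration rather than taken from any library, and it accepts arrays so it can compare any kind of symbol, not just letters.

```ruby
# Classic dynamic-programming Levenshtein distance.
# Works on any two arrays, so it can compare letters or other symbols.
def levenshtein(a, b)
  # previous[j] is the distance between the items of a seen so far
  # and the first j items of b.
  previous = (0..b.length).to_a
  a.each_with_index do |item_a, i|
    current = [i + 1]
    b.each_with_index do |item_b, j|
      cost = item_a == item_b ? 0 : 1
      current << [
        previous[j + 1] + 1, # deletion
        current[j] + 1,      # insertion
        previous[j] + cost   # substitution
      ].min
    end
    previous = current
  end
  previous.last
end

original  = "What are your three priorities this week?"
rewording = "What are your three major goals this week?"
unrelated = "What do you think about your team this week?"

# Letter-by-letter distances for the rewording and for the new question.
puts levenshtein(original.chars, rewording.chars)
puts levenshtein(original.chars, unrelated.chars)
```

Comparing the two printed numbers gives a feel for how well letter-level edits separate a harmless rewording from a genuinely different question.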
Here is where I made a few assumptions. I felt that differences in words were more important than differences in letters. I also assumed that a difference in words can be a suitable stand-in for a difference in meaning. My final assumption was that the two sentences I was comparing would have a combined total of fewer than 256 unique words.
Since words are really nothing more than symbols, I decided to encode each unique word as a byte, reconstruct the sentence as a byte array using the encoding, then use the Levenshtein algorithm to determine the distance between the two sentences. Prior to assigning a byte value to a word, I also ran the word through a stemmer to both reduce the number of unique words and to account for small grammatical changes.
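The encoding step might look roughly like the sketch below. Everything in it is an assumption made for illustration: `naive_stem` is a crude stand-in for a real stemmer, `words` and `encode` are hypothetical helper names, and the "bytes" are kept as plain integers from 0 to 255 rather than packed into a binary string.

```ruby
# Hypothetical stand-in for a real stemmer: strips a few common English
# suffixes from longer words. A Porter stemmer handles far more cases.
def naive_stem(word)
  word.length > 4 ? word.sub(/(ies|es|s|ing|ed)\z/, "") : word
end

# Split a sentence into lowercase words, dropping punctuation.
def words(sentence)
  sentence.downcase.scan(/[a-z0-9']+/)
end

# Give each distinct stem a code shared by both sentences, then return
# each sentence as an array of those codes.
def encode(sentence_a, sentence_b)
  codes = {}
  encoded = [sentence_a, sentence_b].map do |sentence|
    words(sentence).map do |word|
      stem = naive_stem(word)
      codes[stem] ||= codes.size # next unused code for a new stem
    end
  end
  # A single byte can only distinguish 256 different stems.
  raise ArgumentError, "more than 256 unique words" if codes.size > 256
  encoded
end

before, after = encode(
  "What are your three priorities this week?",
  "What are your three major goals this week?"
)
p before # => [0, 1, 2, 3, 4, 5, 6]
p after  # => [0, 1, 2, 3, 7, 8, 5, 6]
```

Once both sentences are sequences of codes, any Levenshtein implementation that compares symbols one at a time will count word edits instead of letter edits.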
I wrote the algorithm in Ruby. Since I'm looking for a dumb match, I set the difference threshold to a relatively high 40%. The code below requires the "text" Ruby gem for stemming and computing the Levenshtein distance. The following is also available as a gist.
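As a rough illustration of the threshold step only, and not the original listing, the check could be sketched like this. The method name `significant_change?` is hypothetical, and dividing the distance by the word count of the longer sentence is an assumption about how the 40% is measured.

```ruby
# Assumption: the change ratio is the word-level edit distance divided
# by the word count of the longer sentence.
def significant_change?(distance, length_a, length_b, threshold: 0.4)
  longest = [length_a, length_b].max
  return false if longest.zero? # two empty sentences: nothing changed
  distance.to_f / longest > threshold
end

# Word-level numbers worked by hand for the example questions:
# the rewording is 2 word edits away (7 words vs. 8 words),
# the new question is 6 word edits away (7 words vs. 9 words).
puts significant_change?(2, 7, 8) # 2 / 8 = 0.25, below the threshold
puts significant_change?(6, 7, 9) # 6 / 9 is about 0.67, above the threshold
```

A high threshold like this keeps the check forgiving: a couple of swapped words pass, while a rewrite of most of the sentence gets flagged.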
This algorithm is neither the best nor the fastest, but it does the job reasonably quickly for sentences of normal length.