Okay. Let's say, for the sake of argument, that you have to import a whole whack of data into a new application. Those data have been maintained, yea, verily, passed down from generation to generation, on a series of clay tablets. Well, they might as well have been, anyway.
You see, just as your brand-spanking-new application emerges intact and approved from all of the various tire-kicking (the surprise "requirements" miraculously extrapolated from the font face used in the original request document, the haggling over the UI, and so forth), you are handed the old "system" for import. A finer collection of Excel spreadsheets there never was. Each is subtly different from all of the others. Oh, they all contain the same set of values, but the columns are in a different order on every sheet, and even from sheet to sheet within the same workbook, the column headers are slightly different. There is nothing really jarring about the differences. Any idiot could look at the sheets and understand what was there.
Your average computer doesn't quite make it up to the level of idiot, though, does it? When I saw that the columns were out of order, I thought it would be easy to simply look at the header row and do some field mapping there. Then, when I started to spot exceptions (abbreviations used in one place, full spellings elsewhere, plurals used at random, and so forth), I thought I could code for the oddballs. When I saw that they were all oddballs in one way or another, I had to rethink things. I could edit all of the spreadsheets, but could I be sure that I'd get them all, and get them all right? Manually entering the data was an even sillier idea, since it would be months before everything was entered, and the data are constantly changing. There would be duplicates everywhere, and we'd never be able to whittle it down to a single, correct set of data.
So I needed a programmatic solution, and I needed one that could tell whether the value it was reading was "close enough" to any one of the list of canonical labels I had assembled.
Have you ever played Word Morph? You know, the game where you try to get from one word to another by changing only one letter at a time? The Levenshtein Distance, or Edit Distance, uses the same sort of system to determine the difference between two strings. How many letters need to be changed, added or removed to get from one word or phrase to another? There are a lot of implementations out there on the web already, so there wouldn't be a whole lot of benefit in posting the code I used. Google, as ever, is your bestest friend.
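In case Google is feeling unhelpful, here's a minimal sketch of the classic dynamic-programming version (an illustration, not the code from the actual import):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    # prev[j] holds the distance between the prefix of `a` seen so far
    # and the first j characters of `b`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from `a`
                            curr[j - 1] + 1,      # insert into `a`
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]
```

The textbook example: `levenshtein("kitten", "sitting")` is 3, because it takes two substitutions and one insertion to morph one into the other.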
Data import is probably not the best use of Levenshtein. It's something you'd normally see in a search, returning results that are close but not quite what the user entered, or in a "did you mean ..." suggestion. In this case, though, I could count on the fact that the variations on a theme were closer to one another than they were to any of the other headings.
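The header-mapping idea can be sketched like so. The canonical labels below are made up for illustration, and the distance threshold is a guess you'd want to tune against your own pile of spreadsheets:

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


# Hypothetical canonical labels; the real list came from the spreadsheets.
CANONICAL = ["Part Number", "Description", "Quantity", "Unit Price"]


def normalize(s: str) -> str:
    """Lowercase and collapse whitespace so trivial differences cost nothing."""
    return re.sub(r"\s+", " ", s.strip().lower())


def best_match(header: str, canonical=CANONICAL, max_distance: int = 6):
    """Return the canonical label 'close enough' to `header`, else None.
    `max_distance` is an illustrative cutoff, not a magic number."""
    scored = [(levenshtein(normalize(header), normalize(c)), c)
              for c in canonical]
    distance, label = min(scored)
    return label if distance <= max_distance else None
```

With that in place, `best_match("Part No.")` lands on `"Part Number"` and `best_match("Descriptions")` on `"Description"`, while a genuinely foreign header falls past the threshold and comes back as `None`, flagging it for a human to look at instead of silently mis-mapping it.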
Now that I know it works, though, I have to go back and take a look at that PNL query code I was talking about earlier.