Tuesday, August 09, 2005

Leven Schtein all over the place....

Okay. Let's say, for the sake of argument, that you have to import a whole whack of data into a new application. Those data have been maintained, yea, verily, passed down from generation to generation, on a series of clay tablets. Well, they might as well have been, anyway.

You see, just as your brand-spanking-new application has emerged from all of the various tire-kicking (surprise "requirements" miraculously extrapolated from the font face used in the original request document, haggling over the UI, and so forth) intact and approved, you are handed the old "system" for import. A finer collection of Excel spreadsheets there never was. Each subtly different from all of the others. Oh, they all contain the same set of values, but the columns are in different orders on every sheet, and even from sheet to sheet within the same workbook, the column headers are slightly different. There is nothing really jarring about the differences. Any idiot could look at the sheets and understand what was there.

Your average computer doesn't quite make it up to the level of idiot, though, does it? When I saw that the columns were out of order, I thought it would be easy to simply look at the header row and do some field mapping there. Then when I started to spot exceptions — abbreviations used in one place, full spellings elsewhere, plurals used at random, and so forth — I thought I could code for the oddballs. When I saw that they were all oddballs in one way or another, I had to rethink things. I could edit all of the spreadsheets, but could I be sure that I'd get them all, and get them all right? Manually entering the data was an even sillier idea, since it would be months before everything was entered, and the data are constantly changing. There would be duplicates everywhere, and we'd never be able to whittle it down to a single, correct set of data.

So I needed a programmatic solution, and I needed one that could tell whether the value it was reading was "close enough" to any one of the list of canonical labels I had assembled.

Enter Levenshtein.

Have you ever played Word Morph? You know, the game where you try to get from one word to another by changing only one letter at a time? The Levenshtein Distance, or Edit Distance, uses the same sort of system to determine the difference between two strings. How many letters need to be changed, added or removed to get from one word or phrase to another? There are a lot of implementations out there on the web already, so there wouldn't be a whole lot of benefit posting the code I used. Google, as ever, is your bestest friend.
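For reference, the classic dynamic-programming version of the distance (not necessarily the exact code used here, which, as noted, came from one of the many implementations floating around) looks roughly like this:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the rolling row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute ca with cb
            ))
        previous = current
    return previous[-1]
```

Note that a pure edit distance charges two for a simple transposition ("Dtae" for "Date" costs a pair of substitutions), which is worth keeping in mind when picking a "close enough" threshold.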

Data import is probably not the best use of Levenshtein. It's something you'd normally see in a search, returning results that are close but not quite what the user entered, or in a "did you mean ..." suggestion. In this case, though, I could count on the fact that the variations on a theme were closer to one another than they were to any of the other headings.
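The "closest canonical label wins" idea can be sketched with Python's standard library alone; difflib's similarity ratio isn't Levenshtein, but it serves the same "close enough" purpose. The canonical labels below are invented examples, not the actual column headings:

```python
import difflib
from typing import Optional

# Invented example labels; the real canonical list came from the application.
CANONICAL = ["Part Number", "Description", "Quantity", "Delivery Date"]
_LOWERED = [c.lower() for c in CANONICAL]

def map_header(header: str, cutoff: float = 0.75) -> Optional[str]:
    """Map a messy spreadsheet header onto the closest canonical
    label, or return None if nothing is close enough."""
    matches = difflib.get_close_matches(
        header.strip().lower(), _LOWERED, n=1, cutoff=cutoff)
    return CANONICAL[_LOWERED.index(matches[0])] if matches else None
```

So `map_header("Delvery Date")` still lands on "Delivery Date", while something completely foreign falls through as None for a human to sort out.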

Now that I know it works, though, I have to go back and take a look at that PNL query code I was talking about earlier.

4 comments:

Mika Heinonen said...

This sounds to me like the classical concept of fuzzy logic and AI.
I've also done some similar applications in Notes which try to "understand" the content and convert it into computer-understandable, unconditional information. Beyond basic partial-word matching (so that, for example, "products" or "productions" would be matched with the base word "product"), I used a weighted array of word matches: if three partial words match words in the target document, that match gets a weight of 3, while a one- or two-word match stays below it, and the most heavily weighted combination "wins". I've also dealt with context-driven applications, which basically means that previously matched words are remembered, so new words only need to be an addition to the context.
The "easiest" way to get unstandardized sources, like the Excel sheets you mentioned, correctly understood by the computer is to tell the computer how humans recognize the content. It's not only the cell titles, but also the content. Count up the weight of a statistical probability from the partial title and content samples; the only information you need in the end is "Yes" or "No", 1 or 0, to tell the computer whether that data set is what it claims to be.
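The weighting scheme described above could be sketched roughly like this (the labels and word stems are invented examples, not Mika's actual code):

```python
def match_weight(text: str, keywords: list[str]) -> int:
    """Count how many keyword stems appear in the text, so that
    "product" matches "products" and "productions" alike."""
    words = text.lower().split()
    return sum(
        any(w.startswith(stem.lower()) for w in words)
        for stem in keywords
    )

def best_label(text: str, labels: dict[str, list[str]]) -> str:
    """Pick the label whose keyword stems score the highest weight."""
    return max(labels, key=lambda name: match_weight(text, labels[name]))
```

A three-stem match outweighs any one- or two-stem match, so the most heavily weighted label "wins", exactly as described.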

Stan Rogers said...

Yep, it's fuzzy logic, and in the end there is a weighting involved. The advantage of incorporating Levenshtein distances in the calculation is that it can also deal with spelling mistakes. That way, if the initial pattern matching doesn't get any direct hits (and a simple Instr, indexOf or scanning the char array can get you that far), there is still a way to match user input with the target data. It is relatively expensive, though, so using Levenshtein is a "fall through" method if pattern matching fails. In this case, it works only with unmatched items after a simple pattern scan.

I can look for, say, "deliver" and "date", and on most sheets it will match with either "Delivery Date", "Delivered Date" or "Date Delivered", but pattern matching alone won't find "Delvery Date" or "Delivery Dtae". And if the sheet is more than trivially wide, there is likely to be more than one occurrence of both "deliver" and "date", so pattern matching alone will give you several partial matches with equal match weights. That's where Levenshtein (or the even more expensive Soundex) comes in.
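The fall-through described above might look roughly like this as a sketch (the header names are invented examples, and this isn't the LotusScript actually used):

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def find_column(headers, tokens, canonical, max_dist=2):
    """Cheap substring scan first; fall through to edit distance
    only when the scan can't place the column unambiguously."""
    # Pass 1: every token appears verbatim in exactly one header.
    hits = [h for h in headers if all(t in h.lower() for t in tokens)]
    if len(hits) == 1:
        return hits[0]
    # Pass 2 (fall-through): closest header to the canonical label.
    best = min(headers, key=lambda h: levenshtein(h.lower(), canonical.lower()))
    if levenshtein(best.lower(), canonical.lower()) <= max_dist:
        return best
    return None
```

The cheap scan settles the clean sheets; only the misspelled or ambiguous headers pay the price of the distance calculation.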

Still, nothing's perfect -- if you have columns labelled "Delivery Date" and "Delivery Data" and both are misspelled, there's still a chance to look at the data rows below the heading, if the data types can be easily differentiated. When that fails, there's no alternative but to spit out the spreadsheet and have a human do the fix. But fixing one bad sheet out of a couple of hundred is still better than doing it all by hand. A lot better.

Ben Dubuc said...

Don't tell me this has to do with ... RASA???? ;)

Stan Rogers said...

Hey! No swearing in the comments, Ben :-D

(I just finished a fix/realignment -- yes, the GMS stuff -- on RASA yesterday. 340-odd views changed and rebuilt, took more than seven hours to implement. And it's still not fixed, of course, because I wasn't allowed to touch anything that would have actually made it a good app.)