Ruby - Methodology for identifying grocery items from an OCR read
I'm writing a Ruby application that reads text off of a grocery store receipt and allows the user to see how much they are paying per ounce, and possibly per serving based on the ingredients. I'm using the tesseract gem, and it's pretty straightforward. However, some line items come out wrong, comically so, as in the case of "burly parsley" for "curly parsley".
I assume that solving this problem is in some way a natural language processing problem, but I don't have the background to know which direction to go in. My first idea is to hack on the ideas of others: make a Google request and, if it suggests something different, use that. However, I'd like to read and learn how this problem might be solved correctly.
So how should I go about solving the burly parsley problem?
There are a lot of ways to go about dealing with a problem like this. Here's one off the top of my head:
Dictionaries - if you're restricting yourself to a vertical - retail in this case - it should be possible to build a dictionary of the possible items you'd encounter. You'd then proceed to compare the results of the OCR read with the words in the dictionary using some form of string similarity/matching. I'd written an article on this subject here a while ago covering approximate string matching techniques. It's a little old but still relevant and covers the basics.
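As a rough illustration of the dictionary comparison, here is a minimal sketch that uses Levenshtein (edit) distance as the similarity measure. That is only one of several approximate matching techniques and not necessarily the one the article covers. The `GROCERY_ITEMS` list, the `closest_item` helper and the `max_distance` threshold are all made up for the example.

```ruby
# Classic dynamic-programming Levenshtein (edit) distance between two strings.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      # Cheapest of: deletion, insertion, substitution.
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical dictionary of items for the retail vertical.
GROCERY_ITEMS = ["curly parsley", "flat parsley", "cilantro", "whole milk"].freeze

# Returns the closest dictionary entry, or nil when nothing is within
# max_distance edits (an assumed threshold; tune it for real data).
def closest_item(ocr_text, dictionary = GROCERY_ITEMS, max_distance: 2)
  normalized = ocr_text.downcase.strip
  best = dictionary.min_by { |item| levenshtein(normalized, item) }
  return nil if best.nil?

  levenshtein(normalized, best) <= max_distance ? best : nil
end

puts closest_item("burly parsley").inspect
puts closest_item("motor oil").inspect
```

A linear scan like this is fine for a small dictionary. With thousands of items you would probably want to prune candidates first, for example by length or by first letter.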
If you run into an item not existing in the dictionary and not a reasonable approximate match of any of the items there (that is, entirely new), temporarily treat it as a new item for the purposes of the current case, and flag it for review. On review later you can decide whether it's a new item altogether or a bad read. In the first case, add it to the dictionary, and in the second, map it to the original item.
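The review flow could be sketched roughly as follows. `ItemCatalog` and its method names are hypothetical, and a plain exact lookup stands in for the approximate matching step to keep the example short.

```ruby
require "set"

# Hypothetical catalog that tracks unmatched OCR reads for later review.
class ItemCatalog
  attr_reader :items, :variations, :pending_review

  def initialize(items)
    @items = Set.new(items)
    @variations = {}      # bad read => original item
    @pending_review = []  # reads waiting for a human decision
  end

  # Returns the known item name. Unknown text is flagged for review and
  # returned unchanged, i.e. treated as a provisional new item for now.
  def lookup(ocr_text)
    return ocr_text if @items.include?(ocr_text)
    return @variations[ocr_text] if @variations.key?(ocr_text)

    @pending_review << ocr_text unless @pending_review.include?(ocr_text)
    ocr_text
  end

  # Review outcome 1: it really is a new item, so add it to the dictionary.
  def approve_as_new(ocr_text)
    @pending_review.delete(ocr_text)
    @items << ocr_text
  end

  # Review outcome 2: it was a bad read, so map it to the original item.
  def mark_as_bad_read(ocr_text, original)
    @pending_review.delete(ocr_text)
    @variations[ocr_text] = original
  end
end

catalog = ItemCatalog.new(["curly parsley"])
catalog.lookup("dragon fruit")            # flagged for review
catalog.approve_as_new("dragon fruit")    # reviewer confirms a new item
```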
You can also create a data structure that maps variations to the original item. For example, let's take the "burly parsley" case. It would be picked up in step 1 outlined above as a match for "curly parsley". Typically, doing a bunch of string approximation comparisons is expensive. To save time the next time you encounter it, add "burly parsley" to a list of known variations of the item.
The next time you encounter "burly parsley" you'd see it's a variation of "curly parsley" and pick that without having to spend time doing the comparisons again.
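One simple way to hold the known variations is a hash keyed by the misread text. The sketch below assumes a hypothetical `VariationCache` class; the expensive approximate match is passed in as a block and only runs on a cache miss.

```ruby
# Hypothetical cache mapping known OCR variations to their original item.
class VariationCache
  def initialize
    @variations = {} # e.g. { "burly parsley" => "curly parsley" }
  end

  # Returns the original item for ocr_text. The block (the expensive
  # approximate match) is only called when the variation is not yet known.
  def resolve(ocr_text)
    key = ocr_text.downcase.strip
    return @variations[key] if @variations.key?(key)

    original = yield(key)
    @variations[key] = original if original
    original
  end
end

cache = VariationCache.new
# Stand-in for a real approximate matcher; it only knows one misread.
slow_match = ->(text) { text == "burly parsley" ? "curly parsley" : nil }

cache.resolve("burly parsley") { |t| slow_match.call(t) } # runs the matcher
cache.resolve("burly parsley") { |t| slow_match.call(t) } # served from the hash
```

Failed matches are deliberately not cached here, so an item added to the dictionary later can still be found on the next read.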