Building a text extraction tool using image processing and text mining
Abstract
Optical Character Recognition (OCR) technology, which became popular in the early 1990s, is widely used to recognize text in many types of documents, such as scanned documents and photos. An OCR engine processes scanned paper-based documents and converts them into machine-readable text. Before OCR technology was available, the only way to digitize documents from images was to retype their entire content manually. This was not only extremely time-consuming but also inaccurate because of typographical errors. With OCR, an enormous number of paper-based documents in a variety of languages and formats can be processed easily at one time.
Inspired by OCR, my thesis applies this technology by building a tool that recognizes text on any type of receipt (from supermarkets, restaurants, coffee shops, etc.) and extracts the printed information. More specifically, the tool retrieves relevant information such as the vendor name, date, product names with their prices, tax, and total. This information can be used in customer-oriented applications, for example to budget and categorize expenses automatically. A user inputs an image of a receipt into the tool, which processes it and displays the result. The user can also export the data as an Excel file if needed. Achieving highly accurate extraction results requires both Image Processing and Text Processing. In Image Processing, methods such as resizing, cropping, thresholding, and noise removal are applied to improve the quality of the image. In Text Processing, the fastest and most effective of the string search algorithms considered is applied to correct misspelled text, and the text is then analyzed to detect the desired information.
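As an illustration of the Text Processing step, the following minimal Python sketch corrects misread keywords and extracts the vendor, date, items, tax, and total from lines of OCR output. It is not the implementation used in the thesis: the keyword list, the regular expressions, the use of fuzzy matching from Python's standard `difflib` module in place of a specific string search algorithm, and the rule that the first non-empty line is the vendor name are all assumptions made for this example.

```python
import re
from difflib import get_close_matches

# Hypothetical vocabulary of keywords expected on a receipt (an assumption).
KEYWORDS = ["total", "subtotal", "tax", "cash", "change"]

# Assumed formats: dates like 12/03/2019 and amounts like 4.94 or 2,50
# at the end of a line. Real receipts vary widely.
DATE_RE = re.compile(r"\b(\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4})\b")
AMOUNT_RE = re.compile(r"(\d+[.,]\d{2})\s*$")


def correct_word(word, vocabulary=KEYWORDS, cutoff=0.7):
    """Return the closest vocabulary entry, or the word unchanged if none is close."""
    matches = get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def parse_receipt(lines):
    """Extract vendor, date, items, tax, and total from OCR text lines."""
    result = {"vendor": None, "date": None, "tax": None, "total": None, "items": []}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # Assumption: the first non-empty line is the vendor name.
        if result["vendor"] is None:
            result["vendor"] = line
            continue
        date_match = DATE_RE.search(line)
        if date_match and result["date"] is None:
            result["date"] = date_match.group(1)
            continue
        amount_match = AMOUNT_RE.search(line)
        if not amount_match:
            continue  # no price on this line, so nothing to record
        amount = float(amount_match.group(1).replace(",", "."))
        label = line[:amount_match.start()].strip()
        key = correct_word(label)  # e.g. a misread "T0tal" becomes "total"
        if key == "total":
            result["total"] = amount
        elif key == "tax":
            result["tax"] = amount
        elif key not in KEYWORDS:
            result["items"].append((label, amount))
    return result


if __name__ == "__main__":
    sample = ["SUPER MART", "12/03/2019", "Milk 1.99", "Bread 2,50",
              "Tax 0.45", "T0tal 4.94"]
    print(parse_receipt(sample))
```

Fuzzy matching against a small keyword list keeps the correction step simple, but the similarity cutoff is a tuning choice: set too low, it may mistake short product names for keywords.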