Text Mining from Unstructured Documents

Technologies used
Python
TensorFlow
Darknet
OpenCV
Challenges
Text mining from documents that are not machine-readable
The documents are image or PDF files received from hundreds of companies
All documents are dropped into a shared location on the server
The documents may be scanned copies or software-generated files
Solutions
Schedule pickup of incoming documents from a shared location on the server
Configure a template for each company’s document that marks its regions of interest
Check the input documents for type and quality
Reject documents that do not meet the quality requirements
Perform document classification using an AI model
Perform template matching on a classified document
Automatically detect boundaries on each document
Perform Smart OCR on each document
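
The intake steps above (scheduled pickup, per-company templates, and the type and quality check) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the accepted file types, the MIN_FILE_BYTES threshold, the Region class, the TEMPLATES entry, and all function names are hypothetical, and file size stands in for a real image-quality measure. Only the standard library is used.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Optional, Tuple

# Leading bytes ("magic numbers") of the file types assumed to be accepted.
MAGIC_BYTES = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"II*\x00": "tiff",
    b"MM\x00*": "tiff",
}

MIN_FILE_BYTES = 1024  # hypothetical quality floor; tune per deployment


@dataclass(frozen=True)
class Region:
    """A region of interest, stored as fractions of page width and height."""
    name: str
    left: float
    top: float
    width: float
    height: float

    def to_pixels(self, page_width: int, page_height: int) -> Tuple[int, int, int, int]:
        """Return (x, y, w, h) in pixels for a page of the given size."""
        return (
            round(self.left * page_width),
            round(self.top * page_height),
            round(self.width * page_width),
            round(self.height * page_height),
        )


# Hypothetical template: one list of regions per company.
TEMPLATES: Dict[str, List[Region]] = {
    "example_company": [Region("invoice_number", 0.6, 0.05, 0.3, 0.1)],
}


def detect_type(head: bytes) -> Optional[str]:
    """Identify the file type from its first bytes, or None if unknown."""
    for magic, kind in MAGIC_BYTES.items():
        if head.startswith(magic):
            return kind
    return None


def check_document(head: bytes, size: int) -> Tuple[bool, str]:
    """Return (accepted, detail): the file type if accepted, else the reason."""
    kind = detect_type(head)
    if kind is None:
        return False, "unsupported file type"
    if size < MIN_FILE_BYTES:
        return False, "file too small"
    return True, kind


def scan_inbox(inbox: Path) -> Dict[Path, Tuple[bool, str]]:
    """Check every file in the shared folder and report the decision per file."""
    results = {}
    for path in sorted(inbox.iterdir()):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            head = handle.read(16)  # enough for every signature above
        results[path] = check_document(head, path.stat().st_size)
    return results
```

A scheduler such as cron could call scan_inbox on the shared folder at a fixed interval; accepted files would then move on to classification, template matching, and OCR, while rejected files keep their reason for follow-up.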
Benefits
All extracted information available in near real time
Quality check on each document
Generate the results
- Store in database
- Store in Excel format
Generate MIS charts and reports
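
The result-storage step listed above can be sketched with Python's standard library alone. The table name, the column layout, and the function names are assumptions made for illustration. SQLite stands in for whichever database the project uses, and CSV (which Excel opens directly) stands in for a native Excel workbook, since writing .xlsx files would need a third-party library. The per-company count is one example of a figure that could feed an MIS chart.

```python
import csv
import io
import sqlite3
from typing import Iterable, List, Tuple

# Hypothetical schema: one row per field extracted from a document.
FIELDS = ("document", "company", "field", "value")
Row = Tuple[str, str, str, str]


def store_in_database(conn: sqlite3.Connection, rows: Iterable[Row]) -> None:
    """Create the results table if needed and append the extracted rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extracted_fields "
        "(document TEXT, company TEXT, field TEXT, value TEXT)"
    )
    conn.executemany("INSERT INTO extracted_fields VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def to_csv_text(conn: sqlite3.Connection) -> str:
    """Render all stored rows as CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(FIELDS)
    writer.writerows(
        conn.execute(
            "SELECT document, company, field, value "
            "FROM extracted_fields ORDER BY document, field"
        )
    )
    return buffer.getvalue()


def count_by_company(conn: sqlite3.Connection) -> List[Tuple[str, int]]:
    """Count distinct processed documents per company, sorted by company."""
    return conn.execute(
        "SELECT company, COUNT(DISTINCT document) "
        "FROM extracted_fields GROUP BY company ORDER BY company"
    ).fetchall()
```

The text returned by to_csv_text can be saved with a .csv extension and opened in Excel, and the pairs returned by count_by_company can be passed to any charting tool.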