Wednesday, December 15, 2021

AI Based PDF OCR using Microsoft Azure Form Recognizer

In real-world web applications, we often have many PDF files whose content must be read in order to prefill forms. To read these PDFs automatically and predict the field values, Microsoft offers a cognitive service called Form Recognizer. We pass a PDF file to this service and get back the extracted OCR values as JSON, together with bounding box coordinates.

Ref: https://azure.microsoft.com/en-in/services/form-recognizer/#features
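As a rough sketch of that round trip, the snippet below submits a PDF and polls for the JSON result using only the Python standard library. It assumes the v2.1 REST API and its Layout operation (the sample JSON in this post reports version 2.1.0, but the exact operation used for it is not stated), and the function names, polling interval and retry count are made up for illustration.

```python
import json
import time
import urllib.request


def build_analyze_url(endpoint: str) -> str:
    """Build the analyze URL. Assumption: v2.1 API, Layout operation."""
    return endpoint.rstrip("/") + "/formrecognizer/v2.1/layout/analyze"


def analyze_pdf(endpoint: str, key: str, pdf_bytes: bytes,
                poll_seconds: float = 2.0, max_polls: int = 30) -> dict:
    """Submit a PDF, then poll until the analysis succeeds or fails."""
    auth = {"Ocp-Apim-Subscription-Key": key}

    # Step 1: submit the document. The service answers with 202 Accepted
    # and puts the result URL in the Operation-Location header.
    submit = urllib.request.Request(
        build_analyze_url(endpoint),
        data=pdf_bytes,
        headers={**auth, "Content-Type": "application/pdf"},
        method="POST",
    )
    with urllib.request.urlopen(submit) as response:
        result_url = response.headers["Operation-Location"]

    # Step 2: poll the result URL until the status is final.
    for _ in range(max_polls):
        poll = urllib.request.Request(result_url, headers=auth)
        with urllib.request.urlopen(poll) as response:
            result = json.load(response)
        if result["status"] in ("succeeded", "failed"):
            return result
        time.sleep(poll_seconds)
    raise TimeoutError("analysis did not finish in time")
```

With a real resource, this would be called as analyze_pdf(endpoint, key, pdf_bytes), where endpoint and key come from the Azure portal and pdf_bytes is the file read in binary mode.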

We can use custom libraries to highlight the bounding boxes in the UI, drawn over the page image. To do this, we convert the PDF pages into images and display those images in the UI.


For example, an HTML image map (the <map> tag):

https://www.w3schools.com/tags/tag_map.asp
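One possible way to connect the two, sketched here in Python: turn a boundingBox array (eight numbers, four x,y corner points) into an <area shape="poly"> element for an image map. The coordinates are scaled because the displayed image may not have the same pixel size as the page reported by the service. The function name and parameters are hypothetical, not part of any library.

```python
import html


def bounding_box_to_area(bounding_box, page_width, page_height,
                         image_width, image_height, title=""):
    """Return an <area shape="poly"> tag for one boundingBox.

    bounding_box is [x1, y1, x2, y2, x3, y3, x4, y4] in page units;
    the points are rescaled to the size of the displayed image.
    """
    scale_x = image_width / page_width
    scale_y = image_height / page_height
    coords = []
    for i in range(0, len(bounding_box), 2):
        coords.append(round(bounding_box[i] * scale_x))
        coords.append(round(bounding_box[i + 1] * scale_y))
    coord_text = ",".join(str(c) for c in coords)
    safe_title = html.escape(title, quote=True)
    return f'<area shape="poly" coords="{coord_text}" title="{safe_title}">'


# Page is 1700x2200 pixels; the image is shown at the same size here.
area_tag = bounding_box_to_area(
    [114, 134, 466, 134, 466, 175, 115, 175],
    page_width=1700, page_height=2200,
    image_width=1700, image_height=2200,
    title="CONTOSO LTD.",
)
print(area_tag)
```

The generated tags would go inside a <map> element that the page image references through its usemap attribute. An image map only makes regions clickable; drawing a visible highlight still needs CSS or a script on top.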


Sample of the extracted JSON (truncated). The bounding box coordinates are in the boundingBox arrays:

{"status":"succeeded","createdDateTime":"2021-02-23T05:09:00Z","lastUpdatedDateTime":"2021-02-23T05:09:11Z","analyzeResult":{"version":"2.1.0","readResults":[{"page":1,"angle":0,"width":1700,"height":2200,"unit":"pixel","lines":[{"text":"CONTOSO LTD.","boundingBox":[114,134,466,134,466,175,115,175],"words":[{"text":"CONTOSO","boundingBox":[115,135,333,134,333,176,115,176],"confidence":0.994},{"text":"LTD.","boundingBox":[357,134,465,134,465,176,358,176],"confidence":0.994}],"appearance":{"style":{"name":"other","confidence":0.878}}},{"text":"INVOICE","boundingBox":[1410,114,1601,115,1601,155,1410,155],"words":[{"text":"INVOICE","boundingBox":[1411,115,1593,115,1592,156,1411,155],"confidence":0.995}],"appearance":{"style":{"name":"other","confidence":0.878}}},.......}
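To show how this structure can be consumed, here is a minimal Python sketch that walks readResults, collects every word, and reduces each eight-number boundingBox to a simple rectangle (left, top, right, bottom) that is easy to draw. The embedded JSON is a shortened, well-formed excerpt of the sample above; the helper names are illustrative.

```python
import json

# Shortened excerpt of the sample response (first line only).
SAMPLE = """
{"status":"succeeded","analyzeResult":{"version":"2.1.0","readResults":[
 {"page":1,"angle":0,"width":1700,"height":2200,"unit":"pixel","lines":[
  {"text":"CONTOSO LTD.","boundingBox":[114,134,466,134,466,175,115,175],
   "words":[
    {"text":"CONTOSO","boundingBox":[115,135,333,134,333,176,115,176],"confidence":0.994},
    {"text":"LTD.","boundingBox":[357,134,465,134,465,176,358,176],"confidence":0.994}
   ]}
 ]}
]}}
"""


def to_rect(bounding_box):
    """Collapse four corner points into (left, top, right, bottom)."""
    xs = bounding_box[0::2]  # even positions hold x values
    ys = bounding_box[1::2]  # odd positions hold y values
    return min(xs), min(ys), max(xs), max(ys)


def iter_words(result):
    """Yield one dict per recognized word, page by page."""
    for page in result["analyzeResult"]["readResults"]:
        for line in page.get("lines", []):
            for word in line.get("words", []):
                yield {
                    "page": page["page"],
                    "text": word["text"],
                    "confidence": word["confidence"],
                    "rect": to_rect(word["boundingBox"]),
                }


words = list(iter_words(json.loads(SAMPLE)))
for w in words:
    print(w["page"], w["text"], w["confidence"], w["rect"])
```

Taking the minimum and maximum of the corner points gives an axis-aligned rectangle, which is a reasonable approximation when the page angle is 0 as in this sample; for rotated pages the full polygon is more accurate.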