OCR Index Extraction with AI transformer

Donut versus Pix2Struct on custom data

Image by author (with)

How well do these two transformer models understand documents? In this second part, I show how to train them and compare their results on the task of key index extraction.

So let’s pick up from part 1, where I explained how to prepare the custom data. I zipped the two folders of the dataset and uploaded them into a new Hugging Face dataset here. The Colab notebook I used can be found here. It downloads the dataset, sets up the environment, loads the Donut model, and trains it.

After fine-tuning for 75 minutes, I stopped training when the validation metric (the edit distance) reached 0.116:

Image by author
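For readers unfamiliar with the metric, here is a minimal sketch of a normalized edit distance between a predicted string and its ground truth. It assumes the Levenshtein distance is divided by the length of the longer string, which would explain a value such as 0.116; the notebook may compute it differently, and the function names are my own for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, ground_truth: str) -> float:
    """Edit distance scaled to [0, 1]; 0 means a perfect match (assumed normalization)."""
    longest = max(len(prediction), len(ground_truth))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(prediction, ground_truth) / longest
```

With this normalization, a lower value is better: one wrong character in a four-character string gives 0.25.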

At field level, I get these results on the validation set:

Image by author

When we look at Doctype, we see that Donut always correctly identifies the documents as either a patent or a datasheet, so classification reaches 100% accuracy. Also note that although we have a class datasheet, a document does not need to contain this exact word to be classified as such. That does not matter to Donut, because it was fine-tuned to recognize the class that way.
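As an illustration of how a field-level score like the Doctype accuracy can be computed, here is a minimal sketch that counts exact matches per field. It assumes predictions and ground truths are available as lists of dictionaries with string values; the function name and the `Title` field are hypothetical, and the graph above may be based on a different measure than exact match.

```python
from collections import defaultdict


def field_accuracy(predictions, ground_truths):
    """Share of documents per field where the prediction equals the ground truth."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for predicted, truth in zip(predictions, ground_truths):
        for field, expected in truth.items():
            total[field] += 1
            # A field missing from the prediction counts as wrong.
            if str(predicted.get(field, "")).strip() == str(expected).strip():
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}


# Hypothetical example data: only "Doctype" and its two classes come from the article.
truths = [
    {"Doctype": "patent", "Title": "A"},
    {"Doctype": "datasheet", "Title": "B"},
]
preds = [
    {"Doctype": "patent", "Title": "A"},
    {"Doctype": "datasheet", "Title": "X"},
]
scores = field_accuracy(preds, truths)
```

In this toy example, `scores["Doctype"]` is 1.0 and `scores["Title"]` is 0.5.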

The other fields score quite well too, but from this graph alone it is hard to tell what goes on under the hood. I’d like to see where the model goes right and wrong in specific cases, so I created a routine in my notebook to generate an HTML-formatted report table. For every document in my validation set I have a row entry like this:

Image by author

On the left is the recognized (inferred) data together with its ground truth. On the right is the image. I also color-coded the entries to give a quick overview.
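A report of this kind could be generated roughly as follows. This is a minimal sketch, not the routine from my notebook: the function names, the green/red color scheme, and the image width are assumptions chosen for illustration.

```python
import html


def report_row(image_path, prediction, ground_truth):
    """One table row: inferred fields with ground truth on the left, image on the right."""
    lines = []
    for field, expected in ground_truth.items():
        predicted = prediction.get(field, "")
        # Assumed color scheme: green for an exact match, red otherwise.
        color = "green" if predicted == expected else "red"
        lines.append(
            f'<div style="color:{color}"><b>{html.escape(field)}</b>: '
            f"{html.escape(predicted)} (truth: {html.escape(expected)})</div>"
        )
    left = "".join(lines)
    right = f'<img src="{html.escape(image_path, quote=True)}" width="400">'
    return f"<tr><td>{left}</td><td>{right}</td></tr>"


def build_report(rows):
    """rows is an iterable of (image_path, prediction, ground_truth) tuples."""
    body = "\n".join(report_row(*row) for row in rows)
    return f'<table border="1">\n{body}\n</table>'


# Hypothetical example: file name and the "Title" field are placeholders.
report = build_report([
    ("doc_001.png",
     {"Doctype": "patent", "Title": "Widgt"},
     {"Doctype": "patent", "Title": "Widget"}),
])
```

Writing `report` to an `.html` file, or displaying it inline in a notebook, gives one row per validation document.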
