Transkribus: the learning process
Written by William Robertson and Dr. Natália da Silva Perez
Transkribus is software used mainly for transcription, handwriting recognition, and analysis of historical documents. Anyone working with historical documents or archives will find it helpful because it can speed up research, increase accuracy, and make historical records easier to access.
This blog post shares what we learned while using Transkribus (expert client) to transcribe images of historical newspapers into machine-readable text. The first section focuses on the importance of layout analysis before building a model for text recognition. The second section focuses on the text recognition model. We use 19th-century newspapers from the De Curaçaosche Courant collection. This project was the result of my internship with Dr. Natália da Silva Perez.
Organizing the layout:
Before beginning the transcription process, it is crucial to check the accuracy of the default layout analysis. Numerous layout models are available, but they are not very good at analysing newspaper layouts, and an incorrect reading order leads to incorrect transcriptions. For instance, the default layout analysis makes no distinction between the text regions of the various sections on a page. The computer then reads the text as one continuous section rather than as three distinct ones, which produces the wrong reading order. Image 1 shows what it looks like when we use the default layout analysis:
To solve these issues, we recommend training a P2PaLa model for the text regions and a separate baseline model for the baselines. The "regions and lines" option for P2PaLa should not be used, as it results in an even worse layout analysis overall, so we use the P2PaLa model only for text regions. The baselines are the blue lines that mark each line of text in the images. As you can see in Image 1, many words will be missing from the transcription unless a baseline model is trained first.
Before training a model with P2PaLa, it is recommended to have at least 50 pages of ground truth, and the same goes for the baseline training. The ground truth is created manually by adjusting the text regions and the baselines so that every character can be read by the HTR scan. It is also essential to "tag" the different text regions; otherwise, the computer will not perform the layout analysis accurately. To tag a region, right-click on it, choose to assign a structure type, and select the type. You can make new tags by going to Metadata>Structural>Customize. In Image 2 you can see a zoomed-in version of the manually edited layout:
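The reading-order problem described above can be illustrated with a small sketch. This is not how Transkribus works internally; the `Region` class, the pixel coordinates, and the fixed `column_width` are all assumptions made up for the illustration. It only shows why treating a three-column page as one continuous block scrambles the text, and why knowing the separate regions fixes it.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A hypothetical text region, described by its top-left corner."""
    name: str
    x: int  # left edge in pixels (assumed)
    y: int  # top edge in pixels (assumed)


def naive_order(regions):
    """Read strictly top to bottom across the whole page, ignoring columns."""
    return sorted(regions, key=lambda r: (r.y, r.x))


def column_order(regions, column_width):
    """Group regions by column first, then read each column top to bottom."""
    return sorted(regions, key=lambda r: (r.x // column_width, r.y))


# A made-up page with three columns, each split into two regions.
regions = [
    Region("col1-top", 0, 0),
    Region("col2-top", 320, 10),
    Region("col3-top", 640, 5),
    Region("col1-bottom", 0, 420),
    Region("col2-bottom", 320, 430),
    Region("col3-bottom", 640, 425),
]

naive = [r.name for r in naive_order(regions)]
by_column = [r.name for r in column_order(regions, column_width=320)]

print(naive)      # jumps between columns
print(by_column)  # finishes each column before moving to the next
```

In the naive ordering, the reader jumps from the top of one column to the top of another before finishing any of them, which is the kind of mistake we saw in the default analysis.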
In Image 3, you can see the outcome of the trained baseline model and P2PaLa model:
Recent Findings/Discoveries for P2PaLa improvement
We recently discovered ways to improve our P2PaLa model after some troubleshooting. For optimal results, we now recommend at least 200 pages of ground truth. We also discovered that ticking the “Rectify regions” box in the P2PaLa tool smooths the edges of the text regions. If the P2PaLa model creates several small boxes of text regions, increase the “Min area” value, since this raises the minimum size a text region must have. The “Rectify regions” and “Min area” options can be found at the bottom of the P2PaLa menu.
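To make the effect of a minimum-area setting concrete, here is a minimal sketch of the general idea: regions below a threshold are discarded. It is an illustration only, not the P2PaLa implementation. The box format `(left, top, right, bottom)`, the pixel units, and the threshold value are assumptions, and the actual “Min area” setting in Transkribus may use a different unit.

```python
def region_area(box):
    """Area of a box given as (left, top, right, bottom) in pixels (assumed format)."""
    left, top, right, bottom = box
    return max(0, right - left) * max(0, bottom - top)


def drop_small_regions(boxes, min_area):
    """Keep only the boxes whose area is at least min_area."""
    return [box for box in boxes if region_area(box) >= min_area]


# Made-up detections: two real columns and two tiny spurious boxes.
boxes = [
    (0, 0, 300, 400),      # column, area 120000
    (310, 0, 330, 15),     # speck, area 300
    (320, 20, 620, 420),   # column, area 120000
    (50, 410, 60, 418),    # speck, area 80
]

kept = drop_small_regions(boxes, min_area=1000)
print(kept)
```

Raising the threshold removes more of the small boxes, but setting it too high would also discard genuinely small regions such as short headings, so the value is worth tuning on a few sample pages.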
Training a Text Recognition Model
After working on models for the layout of the newspaper, we moved on to making a text recognition model for the De Curaçaosche Courant collection. Due to time limitations, we focused on a model for the years 1816-1880 and left out the pages from 1880-1882, because the collection underwent massive changes during that period.
Image 4 shows that there are a few errors (a few words and letters not highlighted) with the baseline model for the 1880-1882 collection. This is with 100 pages of baseline training.
To start training, you need around 25-75 pages of manually transcribed ground truth as training data. For newspaper collections, around 25 pages can be enough, because it is easier to train a model on printed newspapers than on handwritten texts. If a public model is already somewhat effective at transcribing your documents, you can use it as a base model for training. For example, we used Transkribus Print M1 as our base model, since it was the best fit among the multilingual public models that included Dutch, English, and Spanish.
Once your ground truth is ready, go to tools>model training>train a new model. That brings up the model training screen, where you fill in the information about the model. If you are training a text recognition model for multiple languages, make sure to include all of them under “languages” in the “Model Training” menu. In our case, we had 52 pages of ground truth and selected 10% of them as a validation set, which is used to test the accuracy of the model. Once you have filled out all the information and chosen the training and validation sets, you can click train. According to Transkribus, a model with a CER (character error rate) of 10% or below can be considered accurate enough for automated transcription. The model we trained has a CER of less than 1%, which means it is very accurate. This can be seen in Image 5:
Working on the layout analysis presented significant challenges due to the complex and frequently changing layout of the newspapers in the De Curaçaosche Courant collection. As a result, we focused on 1816-1880, as the layout changed completely during 1880-1882, with additional columns, images, and new fonts. Unfortunately, P2PaLa is still in its early stages of development, and we could not train an accurate layout analysis model for the newspapers from 1880-1882. Training a baseline model also proved quite difficult, as some letters were too small to be recognized and images were sometimes recognized as text. However, for most of the pages spanning 1816-1880, both the P2PaLa and baseline models were able to analyze the layout accurately.
Training the text recognition model was much simpler because we were working with printed newspapers, so we were able to make an accurate transcription model for the newspapers in the De Curaçaosche Courant collection from 1816-1880. The model's CER (character error rate) is less than 1%.
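For readers unfamiliar with the metric, the sketch below shows one common way to compute a character error rate: the edit (Levenshtein) distance between the ground truth and the model output, divided by the length of the ground truth. We assume this definition for illustration; the exact calculation Transkribus uses may differ in details such as how whitespace is handled. The function names and the sample strings are our own.

```python
def edit_distance(reference, hypothesis):
    """Minimum number of character insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the hypothesis
                current[j - 1] + 1,      # extra character in the hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the number of characters in the reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Made-up example: the model drops the cedilla in one word.
reference = "De Curaçaosche Courant"
hypothesis = "De Curacaosche Courant"

cer = character_error_rate(reference, hypothesis)
print(f"CER: {cer:.2%}")
```

Under this definition, a CER below 1% corresponds to fewer than one wrong character per hundred characters of ground truth.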
If you would like a more detailed explanation of the layout analysis, I made a video which complements this post: https://youtu.be/FOCtrCgeGuU
If you would like to watch a walkthrough on the text recognition model and training:
Read Coop. “How To Train and Apply Handwritten Text Recognition Models in Transkribus eXpert.” Accessed June 21, 2023. https://readcoop.eu/transkribus/howto/how-to-train-a-handwritten-text-recognition-model-in-transkribus/
Read Coop. “How To Use Transkribus in 10 Steps.” Accessed June 21, 2023. https://readcoop.eu/transkribus/howto/use-transkribus-in-10-steps/
Huygens Instituut. “Transkribus Webinar #2 - Advanced Use by Annemieke Romein.” Accessed June 21, 2023.