How optical character recognition (OCR) makes documents accessible

Interview with Michael Hoffmann, Head of the Processes and Quality Management department

In order to make all our document collections digitally searchable, all the data they contain must be completely recorded, i.e. indexed: places of birth, names, document types, places of origin… A mammoth project. We use various technologies for this, and they are constantly improving thanks to AI. One basic technology is optical character recognition (OCR). In this interview, Michael Hoffmann, Head of Processes and Quality Management, explains how we go about it.

Mr. Hoffmann, why is the online searchability of our documents so important?

Easy access is the key to a good archive. Both experts and occasional users who are only looking for something once need to find their way around – without any prior knowledge. Handling must be self-explanatory and simple, and the search options must be designed for both simple and complex queries. The reliability of the search is particularly important: will all the people with the name I am looking for be displayed? Is the search complete? We must always keep the target group in mind: how do they search, what do they need, what helps them? The archive is geared towards searching for people – we used to be a pure search service – but today, of course, it is also used for historical research with a focus on factual topics and for archival research. For all of this, it is necessary to fully record, i.e. index, all of our documents.

What is OCR being used for at the Arolsen Archives?

Optical character recognition is used to automatically read the information contained in our documents. This means that a scanned document, which initially exists only as an image, is made searchable; the technical term for this is indexing. To put it even more simply: an image is turned into units of information, i.e. text, which can in turn be found by entering text, i.e. search terms. This data ends up in an online database. With OCR, we can therefore quickly index documents that have not yet been captured. We add metadata to them and make them easy to find using the search function.

While handwriting was still a problem three years ago and lists could not be read automatically until recently, this is now possible.

Michael Hoffmann, Head of the Processes and Quality Management department

Is this as simple as it sounds?

Not entirely. One of the typical problems with optical character recognition is that it can mix up similar letters. For example, a lowercase l might be confused with a capital I, rn can become m, and hn can become lm. O and 0 are also inevitably problematic. The smarter, i.e. better trained, the OCR engine is, the fewer errors there are.
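One common way to deal with such confusion pairs is to generate candidate corrections and check them against a lexicon of known names. The sketch below illustrates this idea; it is not the archives' actual code, and the confusion table, function names, and sample names are all invented for illustration.

```python
# Illustrative sketch: correcting common OCR confusion pairs (l/I, rn/m, O/0,
# as mentioned in the interview) by checking variants against a lexicon of
# known names. All names and data here are hypothetical.

CONFUSIONS = [("rn", "m"), ("m", "rn"), ("l", "I"), ("I", "l"), ("O", "0"), ("0", "O")]

def candidate_corrections(token: str) -> set[str]:
    """Generate variants of an OCR token by swapping confusable sequences."""
    variants = {token}
    for wrong, right in CONFUSIONS:
        idx = token.find(wrong)
        while idx != -1:
            variants.add(token[:idx] + right + token[idx + len(wrong):])
            idx = token.find(wrong, idx + 1)
    return variants

def correct(token: str, lexicon: set[str]) -> str:
    """Return the token itself if valid, else the first variant found in the lexicon."""
    if token in lexicon:
        return token
    for variant in candidate_corrections(token):
        if variant in lexicon:
            return variant
    return token  # no confident correction found

lexicon = {"Hermann", "Meyer"}
print(correct("Herrnann", lexicon))  # "rn" misread for "m" -> "Hermann"
```

In practice such corrections would only ever be suggestions: a variant that happens to match a different real name would otherwise silently corrupt the index.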

How does the procedure work in practice?

First, the graphical information is binarized: the contrasts are increased and the colors are removed, so the image is reduced to pure black and white. This is automated via a script, although we manually adjust the parameters from document type to document type. Then, in what is known as segmentation, we eliminate all of the blank spaces and blank characters. Only then can the actual text recognition start. The source material therefore has to be analyzed and prepared first. If a document is scanned at an angle, the lines can get jumbled up. This is why documents need to be carefully prepared: garbage characters are eliminated and the text is straightened.
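The binarization step described above can be sketched in a few lines: every grayscale pixel is pushed to pure black or pure white, with a threshold parameter that can be tuned per document type. This is a minimal illustration, not the archives' production script; real pipelines typically also apply adaptive thresholding and deskewing.

```python
# Minimal binarization sketch: grayscale pixel values (0 = black, 255 = white)
# are thresholded to pure black and white. The threshold is the kind of
# parameter the interview says is adjusted per document type.

def binarize(pixels, threshold=128):
    """Turn a 2D grayscale image into black/white only."""
    return [[255 if value >= threshold else 0 for value in row] for row in pixels]

# A tiny synthetic "scan": light background noise around darker ink strokes.
scan = [
    [250, 240, 60, 245],
    [235, 55, 40, 250],
    [245, 230, 70, 240],
]
print(binarize(scan))
```

Raising the threshold keeps fainter ink but also keeps more background noise, which is why the parameter has to be re-tuned for each document type.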

Is some of the preparation manual? How time-consuming is this process?

Considering the large number of documents to process, it would be impossible to do this purely manually. The analysis and evaluation of a document bundle, meaning a collection of documents, largely determines which automated processes will be applied to a group of documents. Over time, you start to learn the best way to process a particular document group. Ultimately, the goal is to create the best possible starting point for all following processes. These automated processes often run overnight, so we can continue working on the documents the next day.

What are the challenges?

One special aspect of our collection is that different types of documents were filed together. Specifically, these include various types of index cards, questionnaires and forms relating to concentration camp prisoners. We call this a mixed collection. This is why we need clustering as a further automated process. Clustering automatically sorts the form types into groups, which makes it possible to filter out specific document types, for instance. This is important because, before the pure text recognition, we determine how the program needs to read the material so that the information elements are correctly identified during the OCR. Clustering is therefore a kind of OCR for layouts and form types.
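The idea of sorting a mixed collection by layout can be sketched as follows. Here each scanned page is reduced to a tiny feature vector and assigned to the nearest known form prototype; the prototypes, feature choices, and labels are invented for illustration, and a real system would use richer layout features and a proper clustering algorithm such as k-means.

```python
# Hedged sketch of layout clustering: each page is described by two made-up
# features (aspect ratio and text density) and assigned to the closest of
# two hypothetical form prototypes.

import math

PROTOTYPES = {
    "index_card": (1.6, 0.10),     # wide card, sparse text
    "questionnaire": (0.7, 0.35),  # portrait page, dense form fields
}

def classify(aspect_ratio: float, text_density: float) -> str:
    """Assign a page to the closest layout prototype (Euclidean distance)."""
    return min(
        PROTOTYPES,
        key=lambda name: math.dist(PROTOTYPES[name], (aspect_ratio, text_density)),
    )

print(classify(1.5, 0.12))   # index_card
print(classify(0.75, 0.30))  # questionnaire
```

Once pages are grouped this way, each group can be routed to the reading strategy that fits its layout, which is exactly the filtering role the interview describes.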

OCR in action

On the left: the source document (a list) with a raw OCR result before any post-processing. The enlarged section below shows the line within the document, and the text block above it shows the OCR result. The prisoner number “91551” was recognized correctly despite fraying and blurring in the text block. The surname was recognized incorrectly: the D and the e were not printed cleanly, so the OCR got mixed up.

Do the Arolsen Archives use a special type of OCR?

A combination of many different methods is required for successful text recognition. You could call this combination – which is adapted to the peculiarities of our holdings and materials – the “special type” of OCR used by the Arolsen Archives. But it is more like a collection of the various methods and components that are available. Fundamentally, a variety of steps are necessary for successful text recognition: analysis of the material, form recognition (clustering/classification), preparation of the images, character recognition, data checking, error correction (automatic or manual) and transfer to the database. Optical character recognition is therefore only one component in the overall process.

What are the advantages of optical text recognition compared to manual processing?

In the best case, automatic indexing is faster and cheaper than manual indexing. In many use cases, the best practice involves a combination of automated and manual processes, such as clustering with IT tools, which separates a mixed collection into different form types that can then be further processed using the most promising method. In the worst case, the results require extensive post-editing, which makes the whole process inefficient compared to manual indexing. This is why a preliminary analysis of the starting material is just as important as an initial test run. We run a few documents through the selected technology and then check the error rate. We don’t apply a technology to larger numbers of documents until the initial test run is positive.
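The test-run decision described above amounts to a simple calculation: compare a sample of OCR output against manually verified ground truth and only roll the method out if the error rate is acceptable. The sketch below is illustrative; the sample data is invented, and the 5 percent threshold is borrowed from the error rates mentioned later in the interview.

```python
# Sketch of an OCR test-run check: measure the error rate on a small,
# manually verified sample before applying the technology at scale.

def error_rate(ocr_fields: list[str], ground_truth: list[str]) -> float:
    """Fraction of indexed fields the OCR got wrong in the test sample."""
    errors = sum(1 for got, want in zip(ocr_fields, ground_truth) if got != want)
    return errors / len(ground_truth)

sample_ocr = ["Hermann", "91551", "Dachau", "Meyer"]
sample_truth = ["Hermann", "91551", "Dachau", "Meier"]

rate = error_rate(sample_ocr, sample_truth)
print(f"error rate: {rate:.0%}")  # 25% on this tiny (invented) sample
print("roll out" if rate <= 0.05 else "needs more preparation")
```

A real evaluation would of course use a much larger sample and distinguish error types, since a wrong surname is far more damaging to searchability than a wrong punctuation mark.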

By comparison, our experts on the indexing team have an error rate of just 1 percent, while the AI-supported OCR system produces around 5 percent errors even when documents are well prepared. The indexing team then has to look at these documents again. And yet the system speeds up indexing so much that we have decided to let the OCR system capture 99 percent of the documents first. Smaller units that require specialist knowledge are still indexed manually by our in-house specialists.

What are the limits of the technologies being used?

There are no hard limits, because OCR technologies are constantly evolving. While handwriting was a problem three years ago and lists could not be read automatically until recently, this is now possible. We have found practicable solutions for both. On the one hand, we have trained our OCR engine on handwriting and expanded our system to include handwritten text recognition (HTR). On the other hand, together with the volunteer software developer Thomas Werkmeister, we have developed an application that helps us index lists. An automated data comparison helps us fill in gaps or replace illegible data using other documents. Completely new approaches, for example via chatbots, will also be based on a form of optical character recognition, i.e. on the OCR principle; current AI tools on the market are in any case based on it.
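The automated data comparison mentioned above can be pictured as merging records about the same person: gaps in one record are filled from another. The field names and data below are hypothetical; the real matching logic, which must first establish that two records refer to the same person, is far more involved.

```python
# Illustrative sketch of gap-filling by data comparison: missing fields in
# one record are completed from another record about the same person.

def fill_gaps(record: dict, other: dict) -> dict:
    """Fill missing (None) fields in `record` from a matching record."""
    return {key: (value if value is not None else other.get(key))
            for key, value in record.items()}

# Hypothetical example: a list entry with an illegible birth date, completed
# from an index card that carries the same prisoner number.
list_entry = {"name": "Hermann Meyer", "birth_date": None, "prisoner_no": "91551"}
card_entry = {"name": "Hermann Meyer", "birth_date": "1912-03-04", "prisoner_no": "91551"}

print(fill_gaps(list_entry, card_entry))
```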

AI applications from various providers (ChatGPT, Gemini, etc.) are currently developing at breathtaking speed and are setting out to surpass the “classic” methods by combining tools independently. The “price” for this is the comparatively poor traceability of how the AI arrived at its results, which can make troubleshooting systematic errors difficult. Here, too, we are gathering practical experience and examining the extent to which these tools can be used and implemented for mass data capture.

Digitization dossier

Further insights behind the scenes of our online archive