OCR 101: All you need to know
A walk-through of research, tools, and challenges in OCR
I love OCR (Optical Character Recognition). To me, it represents the real challenge in data science, and specifically in computer vision. It’s a real-world problem, it has many approaches, it involves computer vision, pipeline tweaking, and even some NLP. It also requires a lot of engineering. And it encapsulates many of the problems in data science: the lack of a robust benchmark, and over-emphasis on the complexity and “novelty” of approaches instead of real-world progress.
Two years ago, I published an article about OCR. Like most of my articles, it was intended to review the field (research and practice), shed light on what you can and can’t do, how and why, and provide practical examples.
Its essence was that back then, deep learning OCR was good, but not good enough. Nowadays, OCR is much better, but still not great.
However, considering the vibrancy of the deep learning field, it requires, in my opinion, an update, or perhaps even a full rewrite. So here it is. If you got here, you’re probably interested in OCR in some way. Either you are a student or a researcher who wants to study this field, or you have a business interest. Either way, this article should get you up to speed.
First, let’s set our concepts straight:
OCR — Optical character recognition. This is the common term, which mostly refers to the structured text on documents.
STR — Scene text recognition. Mostly refers to the more challenging text in the wild. For the sake of simplicity, we’ll refer to both as OCR.
As said, OCR embodies many of the achievements, but also the challenges, of deep learning and data science in general. On the one hand, it represents a tremendous advancement over what we had before, with impressive year-over-year improvement. However, OCR is still not solved.
There are still some very annoying failure cases, attributed to different reasons, most of which stem from the standard deep-learning root causes — lack of generalization, susceptibility to noise, and so on. Therefore, even though models can handle many cases (different fonts, orientations, angles, curves, backgrounds), some deviations just won’t work (as long as they weren’t introduced manually into the training set): unpopular fonts, symbols, backgrounds, and so on.
Additionally, a great and useful library has emerged — Easy OCR, which sets out to make the state-of-the-art OCR approaches accessible and easy to use in open source. As an extra treat, this library also tackles the multi-language problem in OCR (it currently includes ~80 languages, with more to come) and the speed of models (still in early stages). This library is not perfect, but it’s really a great solution to have. More on this later.
So without further ado, let’s examine the state of OCR.
As always, data science tasks boundaries are extended by research, while practice lags behind in innovation, but leads in robustness.
In my previous post, I reviewed three approaches:
The back-then popular approach of classic computer vision.
General deep learning approaches, detection and recognition, which were efficient and easy to use.
In this post, we may say that the specific deep learning approaches have ripened and become dominant, both in research and in practice.
In the previous post, we used a few examples that might look simple today: license plate recognition, captcha recognition, and so on. Today’s models are more potent, and we can discuss much harder tasks, such as:
Parsing commercial brochures
Digital media parsing
Street text detection
After standard object detection and segmentation methods were applied to OCR, approaches started to become more specific and tailored to text attributes:
Text is homogeneous — every subpart of text is still text.
Text may be detected at different levels — characters, words, sentences, paragraphs, etc.
Therefore, modern OCR approaches “isolate” specific text characteristics, and use a pipeline of different models to address them.
Here, we are going to focus on a specific setting: a pipeline of models in which, apart from the vision model (the feature extractor), there are a few more helpful components:
The first part of the pipeline is text detection — the most intuitive split. Clearly, if you are going to use different parts, detecting where the text is before recognizing the actual characters is a good idea. This part is trained separately from the others.
The second part of the pipeline is optional — the transformation layer. Its goal is to handle distorted text of all kinds and convert it into more “regular” settings (see the pipeline figure).
The third part is the vision feature extractor, which can be your favorite deep model (but good old ResNet works best of course).
The fourth part of the pipeline is an RNN, which is intended to model the sequential structure of text.
The fifth and last part is the CTC loss. Recent works replaced it with attention.
This pipeline, except for the detection part, is mostly trained end-to-end, to reduce complexity.
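To make the last stage of the pipeline concrete, here is a minimal sketch (my own illustration, not code from any of the papers) of the greedy decoding rule that CTC training implies: take the most likely symbol at every timestep, collapse consecutive repeats, and drop the blanks.

```python
def ctc_greedy_decode(stepwise_probs, charset, blank=0):
    """Greedy CTC decoding: argmax per timestep, collapse repeats, drop blanks."""
    best_path = [max(range(len(step)), key=step.__getitem__) for step in stepwise_probs]
    decoded, prev = [], None
    for idx in best_path:
        if idx != prev and idx != blank:
            decoded.append(charset[idx])
        prev = idx
    return "".join(decoded)

# Toy run: charset index 0 is the CTC blank ("-").
# The blank between the two 'o' timesteps is what lets CTC emit a double letter.
print(ctc_greedy_decode(
    [[0.1, 0.8, 0.1],   # 'o'
     [0.8, 0.1, 0.1],   # blank
     [0.1, 0.8, 0.1]],  # 'o'
    charset="-ox"))     # -> "oo"
```

The attention-based replacement mentioned above swaps this fixed alignment rule for a learned one, which is where the extra accuracy (and inference time) comes from.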
The pipeline “disease”
Having a pipeline of different components is a good idea, but it has some drawbacks. Each component has its own set of biases and hyperparameters, and each adds another level of complexity to the pipeline.
It is known that the basis of all good data science work is datasets, and in OCR they are crucial: results are critically influenced by the choice of train and test datasets. Over the years, the OCR task has honed in on around a dozen different datasets. However, most of them don’t include more than a few thousand annotated images, which doesn’t seem enough for scale-up. On the other hand, OCR is one of the easiest tasks on which to use synthetic data.
Let's see the prominent datasets available:
A few datasets exploit the massive coverage of Google Street View. These datasets can be divided by their focus on regular or irregular (distorted, angled, rounded) text.
SVHN — street view numbers, which we used for the example in the previous post.
SVT — street view text — text images from Google Street View.
ICDAR (2003, 2013, 2015, 2019) — datasets that were created for the ICDAR convention and competitions, each with a different emphasis. E.g., the 2019 dataset is called “arbitrary shaped text,” which means as irregular as it gets.
There are two popular synthetic datasets, which were used in most OCR works. Inconsistent use of them makes comparison between works challenging.
MJSynth — includes relatively simple compositions of words. The dataset itself includes ~9M images.
SynthText — a more elaborate mechanism, which applies segmentation and depth estimation to images in a first stage, and then “plants” text on the inferred surfaces. The dataset itself includes ~5.5M images.
DALL-E — a bit of a wild card, but the future of text image generation (and perhaps OCR) seems far more unsupervised.
The synthetic datasets also excel in their ability to generate different languages, even difficult ones, such as Chinese, Hebrew, and Arabic.
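To get a feel for how synthetic text data is produced, here is a heavily simplified, MJSynth-flavored sketch: render a word with a chosen font onto a flat background (SynthText goes much further, planting text on depth-estimated surfaces, and real generators also randomize fonts, colors, and noise). The function name and signature are mine; it only assumes Pillow is installed.

```python
def render_word(word: str, font_path: str, out_path: str, font_size: int = 32):
    """Render `word` onto a plain light background — a toy synthetic-OCR sample."""
    from PIL import Image, ImageDraw, ImageFont  # pip install pillow

    font = ImageFont.truetype(font_path, font_size)
    left, top, right, bottom = font.getbbox(word)  # tight box around the glyphs
    # Grayscale canvas with a 4-pixel margin, white background, black text.
    img = Image.new("L", (right - left + 8, bottom - top + 8), color=255)
    ImageDraw.Draw(img).text((4 - left, 4 - top), word, font=font, fill=0)
    img.save(out_path)

# e.g. render_word("OCR", "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", "ocr.png")
```

Since the ground-truth label is the string you rendered, annotation is free — which is exactly why synthetic data scales to millions of images where manual annotation stops at thousands.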
Before addressing specific research papers, we need to determine metrics for success. There is clearly more than one option.
First, let’s get text detection out of the way: it can use the standard metrics of object detection, such as mean average precision, and even standard precision and recall.
Now for the interesting part — the recognition. There are two main metrics: word level accuracy and character level accuracy. Specific tasks may use even higher levels of accuracy (e.g text chunk accuracy). The current state of the art methods present >80% accuracy on challenging datasets (we’ll discuss this later).
The character-level metric itself is usually wrapped in a “normalized edit distance,” which measures the ratio of similar characters between words.
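These metrics are easy to compute by hand. Below is a stdlib-only sketch of all three; `normalized_edit_similarity` is one common formulation of the normalized edit distance mentioned above (the names and the exact normalization are my choice — papers vary on this).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_edit_similarity(pred: str, truth: str) -> float:
    """1.0 for identical strings, down to 0.0 for completely different ones."""
    if not pred and not truth:
        return 1.0
    return 1.0 - edit_distance(pred, truth) / max(len(pred), len(truth))

def word_accuracy(preds, truths) -> float:
    """Fraction of words recognized exactly right."""
    return sum(p == t for p, t in zip(preds, truths)) / len(truths)

print(edit_distance("kitten", "sitting"))          # -> 3
print(normalized_edit_similarity("text", "test"))  # -> 0.75
```

Note how strict word accuracy is: a single wrong character zeroes out the whole word, which is why character-level numbers always look much better than word-level ones.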
In this post, we focus on best practices rather than on laying out ideas. For ideas, I suggest one of these two surveys, where you’ll find many methods that make it really hard to choose.
What Is Wrong With Scene Text Recognition?
This very interesting work has a somewhat unusual name, and the work itself is also outstanding. It’s a kind of proactive survey, which:
Defines unified train and test sets (after some optimization).
Tests and benchmarks best practices on the datasets.
Logically structures the methods and “helps” the reader to understand what to use.
So the key points from this paper are:
The best train datasets for OCR are the two synthetic ones — MJSynth and SynthText. Moreover, the important factor is not quantity but diversity (reducing the amount of data didn’t hurt the model’s performance that much, but removing one of the two datasets did).
Test datasets were ~5 real-world datasets.
The paper demonstrated a gradual improvement in results with each pipeline update. The most significant improvement, from 60% to 80% accuracy, came from upgrading the feature extractor from VGG to ResNet. The subsequent additions of an RNN and normalization pushed the model up to 83%. Updating CTC to attention added 1% accuracy but tripled the inference time.
In most of this post we discuss text recognition, but as you may recall, the first part of the pipeline is text detection. Reaching the current generation of text detection models was a bit tricky. Previously, text detection was intuitively treated as a sub-branch of object detection. However, object detection has some settings tuned for general objects such as cars and faces, and these required significant updates when applied to text detection.
The essence is that text is both homogeneous and characterized by locality: on the one hand, every part of a text is text by itself, while on the other hand, text sub-items should be unified into bigger items (e.g., characters into words). Therefore, segmentation-based methods are a better fit for text detection than standard object detection methods.
Our favorite text detection method, which is also integrated into Easy OCR, is called CRAFT — Character Region Awareness for Text Detection. This method applies a simple segmentation network and makes nice use of both real and synthetic images, with both character- and word-level annotations.
This model achieves ~80% H-mean (the harmonic mean of precision and recall) on most datasets, and also does a very good job of word separation, which makes life easier for the recognition models.
We’ve reached the practical stage. What should you use? We’ve mostly answered this question already (Easy OCR…), but let’s examine a few of the popular solutions.
One very important thing to notice is that while OCR research suffers from a lack of robustness, the field enjoys flourishing open-source software, which allows researchers and practitioners to build upon each other’s work. Previous open-source tools (e.g., Tesseract, see below) struggled with data collection and with development from scratch. Recent packages, such as Easy OCR, enjoy a rich set of building blocks, from data generation through all the pipeline models and more.
For a long time, Tesseract OCR was the leading open-source OCR tool (not counting occasional paper-related repositories). However, this tool was built as a classic computer vision tool and didn’t handle the transition to deep learning well.
OCR was among the earliest computer vision APIs offered by the big cloud providers — Google, Amazon, and Microsoft. These APIs don’t share any benchmarks of their abilities, so it becomes our responsibility to test them.
In some way, the Easy OCR package is the driver of this post. The ability to build an open-source, state-of-the-art tool from different building blocks is fascinating.
Here is how it works:
Data generation with MJ-Synth package.
CRAFT model (see above) for detection.
Training a tweaked pipeline based on “what is wrong” paper, (see above) for text recognition.
Multi-language: as said, OCR contains some NLP elements. Therefore, handling different languages involves differences, but also similarities we can benefit from: the CRAFT model (and probably other detection models) is multilingual. Recognition models are language-specific, but the training process is the same, with some tweaks (e.g., Hebrew and Arabic are written right-to-left rather than left-to-right).
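In practice, the whole pipeline hides behind a couple of calls. A minimal usage sketch — `Reader` and `readtext` are the library’s actual API, while the helper names and the confidence threshold are mine:

```python
def filter_confident(results, min_conf=0.3):
    """Keep only results above a confidence threshold.
    `results` are (bounding_box, text, confidence) triples, as readtext() returns."""
    return [(text, conf) for _bbox, text, conf in results if conf >= min_conf]

def read_image(path, langs=("en",), min_conf=0.3):
    """Run the full detection + recognition pipeline on one image."""
    import easyocr  # pip install easyocr; imported lazily since it is heavyweight
    reader = easyocr.Reader(list(langs))  # downloads detection + recognition models on first run
    return filter_confident(reader.readtext(path), min_conf)
```

Passing several language codes to `Reader` (e.g. `['en', 'fr']`) is how the multi-language support described above is exposed.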
The last piece of the puzzle, which makes Easy OCR the “go-to OCR tech” at this stage, is performance. As you can see below, its results are even better than those of the paid APIs.
One thing to improve in Easy OCR is the tweaking ability: although language selection is easy, changing models and retraining for different purposes is harder. In our next post, we’ll show how to do some of it.
What about run time?
It should not come as a surprise that OCR inference might be slow. The detection model is a standard deep learning model that runs in ~1 second per image on a GPU, while the recognition model runs again and again on each of the detections. An image with many items may take a few dozen seconds on a GPU, not to mention a CPU. What if you’d like to run OCR in a mobile or PC app, on weaker hardware?
Easy OCR has you covered: first, the library introduces some tricks to make inference faster (e.g., tighter shapes for the image slices fed to recognition). Additionally, being modular, it is possible (currently with some code tweaking) to integrate your own models, which can be smaller and faster.
So after discussing different packages and models, it’s time to witness the actual results. Go to this colab notebook to try Easy OCR vs Google OCR vs Tesseract. I’ve chosen two images:
One is the common OCR case — standard structured text from a document, and the second one is a challenging book cover collection: many fonts, backgrounds, orientations (not so many), and so on.
We’ll try three variants: Easy OCR, Google OCR API (which is considered best among big tech cloud APIs), and the good old Tesseract.
On this kind of text, the good old Tesseract and Google OCR performance is perfect. This makes sense, since Google OCR might be partly based on Tesseract.
Note that Google OCR has a special mode for this kind of text — DOCUMENT_TEXT_DETECTION — which should be used instead of the standard TEXT_DETECTION.
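For reference, here is roughly how that mode is invoked with the official Python client — a sketch assuming `google-cloud-vision` is installed and credentials are configured; the wrapper function name is mine:

```python
def google_document_ocr(path):
    """OCR a local image with Google's DOCUMENT_TEXT_DETECTION mode."""
    from google.cloud import vision  # pip install google-cloud-vision

    client = vision.ImageAnnotatorClient()  # reads GOOGLE_APPLICATION_CREDENTIALS
    with open(path, "rb") as f:
        image = vision.Image(content=f.read())
    # document_text_detection is the dense, structured-text mode;
    # text_detection is the sparse scene-text counterpart.
    response = client.document_text_detection(image=image)
    return response.full_text_annotation.text
```

The response also carries a page/block/paragraph/word hierarchy, which is useful when you need layout and not just the raw string.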
Easy OCR is around 95% accurate.
In general, Easy OCR results were the best. Specifically, the detection component caught around 80% of the items, including very challenging diagonal ones.
Google OCR was somewhat worse, at around 60%.
In recognition, they were roughly on par — about 70% at the character level, which made both not so good at the word or book level. It seems that Google OCR wasn’t 100% correct on even a single book, while Easy OCR got a few completely right.
One more thing I noticed is that while Easy OCR was better on characters, Google OCR was better on words — which makes me think it may be using a dictionary behind the scenes.
Summary and future
I hope you enjoyed this post.
We’ve seen that although a lot of work has been done, OCR suffers from the same generalization issues as deep learning in general, and will probably continue to suffer from them until some paradigm change.
Recent OpenAI models — CLIP and DALL-E — have shown some “unsupervised” OCR understanding.
In the next posts, we’ll roll up our sleeves and:
See how we can customize the Easy OCR framework for our purposes
Work on training a tailored OCR engine for our purposes.
What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis — Baek et al.
Character Region Awareness for Text Detection — Baek et al.
https://github.com/PaddlePaddle/PaddleOCR — This OCR package was recently trending on GitHub, but we did not have the chance to test it yet.
My previous post about OCR