Tag Archives: Optical Character Recognition

Extracting Structured Data from Templatic Documents

Templatic documents, such as receipts, bills, insurance quotes, and others, are extremely common and critical in a diverse range of business workflows. Currently, processing these documents is largely a manual effort, and automated systems that do exist are based on brittle and error-prone heuristics. Consider a document type like invoices, which can be laid out in thousands of different ways — invoices from different companies, or even different departments within the same company, may have slightly different formatting. However, there is a common understanding of the structured information that an invoice should contain, such as an invoice number, an invoice date, the amount due, the pay-by date, and the list of items for which the invoice was sent. A system that can automatically extract all this data has the potential to dramatically improve the efficiency of many business workflows by avoiding error-prone, manual work.

In “Representation Learning for Information Extraction from Form-like Documents”, accepted to ACL 2020, we present an approach to automatically extract structured data from templatic documents. In contrast to previous work on extraction from plain-text documents, we propose an approach that uses knowledge of target field types to identify candidate fields. These are then scored using a neural network that learns a dense representation of each candidate using the words in its neighborhood. Experiments on two corpora (invoices and receipts) show that we’re able to generalize well to unseen layouts.

Why Is This Hard?
The challenge in this information extraction problem arises because it straddles the natural language processing (NLP) and computer vision worlds. Unlike classic NLP tasks, such documents do not contain “natural language” as might be found in regular sentences and paragraphs, but instead resemble forms. Data is often presented in tables, but in addition many documents have multiple pages, frequently with a varying number of sections, and have a variety of layout and formatting clues to organize the information. An understanding of the two-dimensional layout of text on the page is key to understanding such documents. On the other hand, treating this purely as an image segmentation problem makes it difficult to take advantage of the semantics of the text.

Solution Overview
Our approach to this problem allows developers to train and deploy an extraction system for a given domain (like invoices) using two inputs — a target schema (i.e., a list of fields to extract and their corresponding types) and a small collection of documents labeled with the ground truth for use as a training set. Supported field types include basics, such as dates, integers, alphanumeric codes, currency amounts, phone-numbers, and URLs. We also take advantage of entity types commonly detected by the Google Knowledge Graph, such as addresses, names of companies, etc.

The input document is first run through an Optical Character Recognition (OCR) service to extract the text and layout information, which allows this to work with native digital documents, such as PDFs, and document images (e.g., scanned documents). We then run a candidate generator that identifies spans of text in the OCR output that might correspond to an instance of a given field. The candidate generator utilizes pre-existing libraries associated with each field type (date, number, phone-number, etc.), which avoids the need to write new code for each candidate generator. Each of these candidates is then scored using a trained neural network (the “scorer”, described below) to estimate the likelihood that it is indeed a value one might extract for that field. Finally, an assigner module matches the scored candidates to the target fields. By default, the assigner simply chooses the highest scoring candidate for the field, but additional domain-specific constraints can be incorporated, such as requiring that the invoice date field is chronologically before the payment date field.
The processing steps in the extraction system using a toy schema with two fields on an input invoice document. Blue boxes show the candidates for the invoice_date field and gold boxes for the amount_due field.
The scorer is a neural model that is trained as a binary classifier. It takes as input the target field from the schema along with the extraction candidate and produces a prediction score between 0 and 1. The target label for a candidate is determined by whether the candidate matches the ground truth for that document and field. The model learns how to represent each field and each candidate in a vector space in which the nearer a field and candidate are in the vector space, the more likely it is that the candidate is the true extraction value for that field and document.

Candidate Representation
A candidate is represented by the tokens in its neighborhood along with the relative position of the token on the page with respect to the centroid of the bounding box identified for the candidate. Using the invoice_date field as an example, phrases in the neighborhood like “Invoice Date’” or “Inv Date” might indicate to the scorer that this is a likely candidate, while phrases like “Delivery Date” would indicate that this is likely not the invoice_date. We do not include the value of the candidate in its representation in order to avoid overfitting to values that happen to be present in a small training data set — e.g., “2019” for the invoice date, if the training corpus happened to include only invoices from that year.
A small snippet of an invoice. The green box shows a candidate for the invoice_date field, and the red box is a token in the neighborhood along with the arrow representing the relative position. Each of the other tokens (‘number’, ‘date’, ‘page’, ‘of’, etc along with the other occurrences of ‘invoice’) are part of the neighborhood for the invoice candidate.
Model Architecture
The figure below shows the general structure of the network. In order to construct the candidate encoding (i), each token in the neighborhood is embedded using a word embedding table (a). The relative position of each neighbor (b) is embedded using two fully connected ReLU layers that capture fine-grained non-linearities. The text and position embeddings for each neighbor are concatenated to form a neighbor encoding (d). A self attention mechanism is used to incorporate the neighborhood context for each neighbor (e), which is combined into a neighborhood encoding (f) using max-pooling. The absolute position of the candidate on the page (g) is embedded in a manner similar to the positional embedding for a neighbor, and concatenated with the neighborhood encoding for the candidate encoding (i). The final scoring layer computes the cosine similarity between the field embedding (k) and the candidate encoding (i) and then rescales it to be between 0 and 1.

For training and validation, we used an internal dataset of invoices with a large variety of layouts. In order to test the ability of the model to generalize to unseen layouts, we used a test-set of invoices with layouts that were disjoint from the training and validation set. We report the F1 score of the extractions from this system on a few key fields below (higher is better):

Field F1 Score
amount_due 0.801
delivery_date 0.667
due_date 0.861
invoice_date 0.940
invoice_id 0.949
purchase_order 0.896
total_amount 0.858
total_tax_amount 0.839

As you can see from the table above, the model does well on most fields. However, there’s room for improvement for fields like delivery_date. Additional investigation revealed that this field was present in a very small subset of the examples in our training data. We expect that gathering additional training data will help us improve on it.

What’s next?
Google Cloud recently announced an invoice parsing service as part of the Document AI product. The service uses the methods described above, along with other recent research breakthroughs like BERT, to extract more than a dozen key fields from invoices. You can upload an invoice at the demo page and see this technology in action!

For a given document type we expect to be able to build an extraction system given a modest sized labeled corpus. There are several follow-ons we are currently pursuing, including the improvement of data efficiency and accurately handling nested and repeated fields, and fields for which it is difficult to define a good candidate generator.

This work was a collaboration between Google Research and several engineers in Google Cloud. I’d like to thank Navneet Potti, James Wendt, Marc Najork, Qi Zhao, and Ivan Kuznetsov in Google Research as well as Lauro Costa, Evan Huang, Will Lu, Lukas Rutishauser, Mu Wang, and Yang Xu on the Cloud AI team for their support. And finally, our research interns Bodhisattwa Majumder and Beliz Gunel for their tireless experimentation on dozens of ideas.

Source: Google AI Blog

Updating Google Maps with Deep Learning and Street View

Every day, Google Maps provides useful directions, real-time traffic information and information on businesses to millions of people. In order to provide the best experience for our users, this information has to constantly mirror an ever-changing world. While Street View cars collect millions of images daily, it is impossible to manually analyze more than 80 billion high resolution images collected to date in order to find new, or updated, information for Google Maps. One of the goals of the Google’s Ground Truth team is to enable the automatic extraction of information from our geo-located imagery to improve Google Maps.

In “Attention-based Extraction of Structured Information from Street View Imagery”, we describe our approach to accurately read street names out of very challenging Street View images in many countries, automatically, using a deep neural network. Our algorithm achieves 84.2% accuracy on the challenging French Street Name Signs (FSNS) dataset, significantly outperforming the previous state-of-the-art systems. Importantly, our system is easily extensible to extract other types of information out of Street View images as well, and now helps us automatically extract business names from store fronts. We are excited to announce that this model is now publicly available!
Example of street name from the FSNS dataset correctly transcribed by our system. Up to four views of the same sign are provided.
Text recognition in a natural environment is a challenging computer vision and machine learning problem. While traditional Optical Character Recognition (OCR) systems mainly focus on extracting text from scanned documents, text acquired from natural scenes is more challenging due to visual artifacts, such as distortion, occlusions, directional blur, cluttered background or different viewpoints. Our efforts to solve this research challenge first began in 2008, when we used neural networks to blur faces and license plates in Street View images to protect the privacy of our users. From this initial research, we realized that with enough labeled data, we could additionally use machine learning not only to protect the privacy of our users, but also to automatically improve Google Maps with relevant up-to-date information.

In 2014, Google’s Ground Truth team published a state-of-the-art method for reading street numbers on the Street View House Numbers (SVHN) dataset, implemented by then summer intern (now Googler) Ian Goodfellow. This work was not only of academic interest but was critical in making Google Maps more accurate. Today, over one-third of addresses globally have had their location improved thanks to this system. In some countries, such as Brazil, this algorithm has improved more than 90% of the addresses in Google Maps today, greatly improving the usability of our maps.

The next logical step was to extend these techniques to street names. To solve this problem, we created and released French Street Name Signs (FSNS), a large training dataset of more than 1 million street names. The FSNS dataset was a multi-year effort designed to allow anyone to improve their OCR models on a challenging and real use case. FSNS dataset is much larger and more challenging than SVHN in that accurate recognition of street signs may require combining information from many different images.
These are examples of challenging signs that are properly transcribed by our system by selecting or combining understanding across images. The second example is extremely challenging by itself, but the model learned a language model prior that enables it to remove ambiguity and correctly read the street name.
With this training set, Google intern Zbigniew Wojna spent the summer of 2016 developing a deep learning model architecture to automatically label new Street View imagery. One of the interesting strengths of our new model is that it can normalize the text to be consistent with our naming conventions, as well as ignore extraneous text, directly from the data itself.
Example of text normalization learned from data in Brazil. Here it changes “AV.” into “Avenida” and “Pres.” into “Presidente” which is what we desire.
In this example, the model is not confused from the fact that there is two street names, properly normalizes “Av” into “Avenue” as well as correctly ignores the number “1600”.
While this model is accurate, it did show a sequence error rate of 15.8%. However, after analyzing failure cases, we found that 48% of them were due to ground truth errors, highlighting the fact that this model is on par with the label quality (a full analysis our error rate can be found in our paper).

This new system, combined with the one extracting street numbers, allows us to create new addresses directly from imagery, where we previously didn’t know the name of the street, or the location of the addresses. Now, whenever a Street View car drives on a newly built road, our system can analyze the tens of thousands of images that would be captured, extract the street names and numbers, and properly create and locate the new addresses, automatically, on Google Maps.

But automatically creating addresses for Google Maps is not enough -- additionally we want to be able to provide navigation to businesses by name. In 2015, we published “Large Scale Business Discovery from Street View Imagery”, which proposed an approach to accurately detect business store-front signs in Street View images. However, once a store front is detected, one still needs to accurately extract its name for it to be useful -- the model must figure out which text is the business name, and which text is not relevant. We call this extracting “structured text” information out of imagery. It is not just text, it is text with semantic meaning attached to it.

Using different training data, the same model architecture that we used to read street names can also be used to accurately extract business names out of business facades. In this particular case, we are able to only extract the business name which enables us to verify if we already know about this business in Google Maps, allowing us to have more accurate and up-to-date business listings.
The system is correctly able to predict the business name ‘Zelina Pneus’, despite not receiving any data about the true location of the name in the image. Model is not confused by the tire brands that the sign indicates are available at the store.
Applying these large models across our more than 80 billion Street View images requires a lot of computing power. This is why the Ground Truth team was the first user of Google's TPUs, which were publicly announced earlier this year, to drastically reduce the computational cost of the inferences of our pipeline.

People rely on the accuracy of Google Maps in order to assist them. While keeping Google Maps up-to-date with the ever-changing landscape of cities, roads and businesses presents a technical challenge that is far from solved, it is the goal of the Ground Truth team to drive cutting-edge innovation in machine learning to create a better experience for over one billion Google Maps users.