Recent advances in natural language processing have largely built upon the power of unsupervised pre-training, which trains general purpose language representation models using a large amount of text, without human annotations or labels. These pre-trained models, such as BERT and RoBERTa, have been shown to memorize a surprising amount of world knowledge, such as “the birthplace of Francesco Bartolomeo Conti”, “the developer of JDK” and “the owner of Border TV”. While the ability to encode knowledge is especially important for certain natural language processing tasks such as question answering, information retrieval and text generation, these models memorize knowledge implicitly — i.e., world knowledge is captured in an abstract way in the model weights — making it difficult to determine what knowledge has been stored and where it is kept in the model. Furthermore, the storage space, and hence the accuracy of the model, is limited by the size of the network. To capture more world knowledge, the standard practice is to train ever-larger networks, which can be prohibitively slow or expensive.
Instead, what if there was a method for pre-training that could access knowledge explicitly, e.g., by referencing an additional large external text corpus, in order to achieve accurate results without increasing the model size or complexity? For example, a sentence found in an external document collection, "Francesco Bartolomeo Conti was born in Florence," could be referenced by the model to determine the birthplace of the musician, rather than relying on the model's opaque ability to access the knowledge stored in its own parameters. The ability to retrieve text containing explicit knowledge such as this would improve the efficiency of pre-training while enabling the model to perform well on knowledge-intensive tasks without using billions of parameters.
In “REALM: Retrieval-Augmented Language Model Pre-Training”, accepted at the 2020 International Conference on Machine Learning, we share a novel paradigm for language model pre-training, which augments a language representation model with a knowledge retriever, allowing REALM models to retrieve textual world knowledge explicitly from raw text documents, instead of memorizing all the knowledge in the model parameters. We have also open sourced the REALM codebase to demonstrate how one can train the retriever and the language representation jointly.
Background: Pre-training Language Representation Models
To understand how standard language representation models memorize world knowledge, one should first review how these models are pre-trained. Since the invention of BERT, the fill-in-the-blank task, called masked language modeling, has been widely used for pre-training language representation models. Given any text with certain words masked out, the task is to fill back the missing words. An example of this task looks like:
I am so thirsty. I need to __ water.
During pre-training, a model will go over a large number of examples and adjust the parameters in order to predict the missing words (
answer: drink, in the above example). Interestingly, the fill-in-the-blank task makes the model memorize certain facts about the world. For example, the knowledge of Einstein's birthplace is required to fill the missing word in the following example:
Einstein was a __-born scientist. (
However, because the world knowledge captured by the model is stored in the model weights, it is abstract, making it difficult to understand what information is stored.
Our Proposal: Retrieval-Augmented Language Representation Model Pre-training
In contrast to standard language representation models, REALM augments the language representation model with a knowledge retriever that first retrieves another piece of text from an external document collection as the supporting knowledge — in our experiments, we use the Wikipedia text corpus — and then feeds this supporting text as well as the original text into a language representation model.
The key intuition of REALM is that a retrieval system should improve the model's ability to fill in missing words. Therefore, a retrieval that provides more context for filling the missing words should be rewarded. If the retrieved information does not help the model make its predictions, it should be discouraged, making room for better retrievals.
How does one train a knowledge retriever, given that only unlabeled text is available during pre-training? It turns out that one can use the task of filling words to train the knowledge retriever indirectly, without any human annotations. Assume the input of the query is:
We paid twenty __ at the Buckingham Palace gift shop.
Filling the missing word (
answer:pounds) in this sentence without retrieval can be tricky, as the model would need to have implicitly stored knowledge of the country in which the Buckingham Palace is located and the associated currency, as well as make the connection between the two. It would be easier for the model to fill in the missing word if it was presented with a passage that explicitly connects some of the necessary knowledge, retrieved from an external corpus.
In this example, the retriever would be rewarded for retrieving the following sentence.
Buckingham Palace is the London residence of the British monarchy.
Since the retrieval step needs to add more context, there may be multiple retrieval targets that could be helpful in filling the missing word, for example, “
The official currency of the United Kingdom is the Pound.” The whole process is demonstrated in the next figure:
Computational Challenges for REALM
Scaling REALM pre-training such that models can retrieve knowledge from millions of documents is challenging. In REALM, the selection of the best document is formulated as maximum inner product search (MIPS). To perform retrieval, MIPS models need to first encode all of the documents in the collection, such that each document has a corresponding document vector. When an input arrives, it is encoded as a query vector. In MIPS, given a query, the document in the collection that has the maximum inner product value between its document vector and the query vector is retrieved, as shown in the following figure:
ScaNN package to conduct MIPS efficiently, which makes finding the maximum inner product value relatively cheap, given that the document vectors are pre-computed. However, if the model parameters were updated during training, it is typically necessary to re-encode the document vectors for the entire collection of documents. To address the computational challenges, we structure the retriever so that the computation performed for each document can be cached and asynchronously updated. We also found that updating document vectors every 500 training steps, instead of every step, is able to achieve good performance and make training tractable.
Applying REALM to Open-domain Question Answering
We evaluate the effectiveness of REALM by applying it to open-domain question answering (Open-QA), one of the most knowledge-intensive tasks in natural language processing. The goal of the task is to answer questions, such as “What is the angle of the equilateral triangle?”
In standard question answering tasks (e.g., SQuAD or Natural Questions), the supporting document is provided as part of input, so a model only needs to look up the answer in the given document. In Open-QA, there are no given documents, so that Open-QA models need to look up the knowledge by themselves — this makes Open-QA an excellent task to examine the effectiveness of REALM.
The following figure shows the results on the OpenQA version of Natural Question. We mainly compared our results with T5, another approach that trains models without annotated supporting documents. From the figure, one can clearly see that REALM pre-training generates very powerful Open-QA models, and even outperforms the much larger T5 (11B) model by almost 4 points, using only a fraction of the parameters (300M).
The release of REALM has helped drive interest in developing end-to-end retrieval-augmented models, including a recent retrieval-augmented generative model. We look forward to the possibility of extending this line of work in several ways, including 1) applying REALM-like methods to new applications that require knowledge-intensive reasoning and interpretable provenance (beyond Open-QA), and 2) exploring the benefits of retrieving other forms of knowledge, such as images, knowledge graph structures, or even text in other languages. We are also excited to see what the research community does with the open source REALM codebase!
This work has been a collaborative effort involving Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat and Ming-Wei Chang.