Every Google search you do is one of billions we receive that day. In less than half a second, our systems sort through hundreds of billions of web pages to try and find the most relevant and helpful results available.
Because the web and people’s information needs keep changing, we make a lot of improvements to our search algorithms to keep up. Thousands per year, in fact. And we’re always working on new ways to make our results more helpful whether it’s a new feature, or bringing new language understanding capabilities to Search.
The improvements we make go through an evaluation process designed so that people around the world continue to find Google useful for whatever they’re looking for. Here are some ways that insights and feedback from people around the world help make Search better.
Our research team at work
Changes that we make to Search are aimed at making it easier for people to find useful information, but depending on their interests, what language they speak, and where they are in the world, different people have different information needs. It’s our mission to make information universally accessible and useful, and we are committed to serving all of our users in pursuit of that goal.
This is why we have a research team whose job it is to talk to people all around the world to understand how Search can be more useful. We invite people to give us feedback on different iterations of our projects and we do field research to understand how people in different communities access information online.
For example, we’ve learned over the years about the unique needs and technical limitations that people in emerging markets have when accessing information online. So we developed Google Go, a lightweight search app that works well with less powerful phones and less reliable connections. On Google Go, we’ve also introduced uniquely helpful features, including one that lets you listen to web pages out loud, which is particularly useful for people learning a new language or who may be less comfortable with reading long text. Features like these would not be possible without insights from the people who will ultimately use them.
Search quality raters
A key part of our evaluation process is getting feedback from everyday users about whether our ranking systems and proposed improvements are working well. But what do we mean by “working well”? We publish publicly available rater guidelines that describe in great detail how our systems intend to surface great content. These guidelines are more than 160 pages long, but if we have to boil it down to just a phrase, we like to say that Search is designed to return relevant results from the most reliable sources available.
Our systems use signals from the web itself—like where words in your search appear on web pages, or how pages link to one another on the web—to understand what information is related to your query and whether it’s information that people tend to trust. But notions of relevance and trustworthiness are ultimately human judgments, so to measure whether our systems are in fact understanding these correctly, we need to gather insights from people.
To do this, we have a group of more than 10,000 people all over the world we call “search quality raters.” Raters help us measure how people are likely to experience our results. They provide ratings based on our guidelines and represent real users and their likely information needs, using their best judgment to represent their locale. These people study and are tested on our rater guidelines before they can begin to provide ratings.
How rating works
Here’s how a rater task works: we generate a sample of queries (say, a few hundred). A group of raters will be assigned this set of queries, and they’re shown two versions of results pages for those searches. One set of results is from the current version of Google, and the other set is from an improvement we’re considering.
Raters review every page listed in the results set and evaluate that page against the query, based on our rater guidelines. They evaluate whether those pages meet the information needs based on their understanding of what that query was seeking, and they consider things like how authoritative and trustworthy that source seems to be on the topic in the query. To evaluate things like expertise, authoritativeness, and trustworthiness—sometimes referred to as “E-A-T”—raters are asked to do reputational research on the sources.
Here’s what that looks like in practice: imagine the sample query is “carrot cake recipe.” The results set may include articles from recipe sites, food magazines, food brands and perhaps blogs. To determine if a webpage meets their information needs, a rater might consider how easy the cooking instructions are to understand, how helpful the recipe is in terms of visual instructions and imagery, and whether there are other useful features on the site, like a shopping list creator or calculator for recipe doubling.
To understand if the author has subject matter expertise, a rater would do some online research to see if the author has cooking credentials, has been profiled or referenced on other food websites, or has produced other great content that has garnered positive reviews or ratings on recipe sites. Basically, they do some digging to answer questions like: is this page trustworthy, and does it come from a site or author with a good reputation?
Ratings are not used directly for search ranking
Once raters have done this research, they then provide a quality rating for each page. It’s important to note that this rating does not directly impact how this page or site ranks in Search. Nobody is deciding that any given source is “authoritative” or “trustworthy.” In particular, pages are not assigned ratings as a way to determine how well to rank them. Indeed, that would be an impossible task and a poor signal for us to use. With hundreds of billions of pages that are constantly changing, there’s no way humans could evaluate every page on a recurring basis.
Instead, ratings are a data point that, when taken in aggregate, helps us measure how well our systems are working to deliver great content that’s aligned with how people—across the country and around the world—evaluate information.
Last year alone, we did more than 383,605 search quality tests and 62,937 side-by-side experiments with our search quality raters to measure the quality of our results and help us make more than 3,600 improvements to our search algorithms.
Our research and rater feedback isn’t the only feedback we use when making improvements. We also need to understand how a new feature will work when it’s actually available in Search and people are using it as they would in real life. To make sure we’re able to get these insights, we test how people interact with new features through live experiments.
They’re called “live” experiments because they’re actually available to a small proportion of randomly selected people using the current version of Search. To test a change, we will launch a feature to a small percentage of all queries we get, and we look at a number of different metrics to measure the impact.
Did people click or tap on the new feature? Did most people just scroll past it? Did it make the page load slower? These insights can help us understand quite a bit about whether a new feature or change is helpful and if people will actually use it.
In 2019, we ran more than 17,000 live traffic experiments to test out new features and improvements to Search. If you compare that with how many launches actually happened (around 3600, remember?), you can see that only the best and most useful improvements make it into Search.
While our search results will never be perfect, these research and evaluation processes have proven to be very effective over the past two decades. They allow us to make frequent improvements and ensure that the changes we make represent the needs of people around the world coming to Search for information.