Tag Archives: search results

When indexing goes wrong: how Google Search recovered from indexing issues & lessons learned since.

Most of the time, our search engine runs properly. Our teams work hard to prevent technical issues that could affect our users who are searching the web, or webmasters whose sites we index and serve to users. Similarly, the underlying systems that we use to power the search engine also run as intended most of the time. When small disruptions happen, they are largely not visible to anyone except our teams who ensure that our products are up and running. However, like all complex systems, sometimes larger outages can occur, which may lead to disruptions for both users and website creators.

In the last few months, such a situation occurred with our indexing systems, which had a ripple effect on some other parts of our infrastructure. While we worked as quickly as possible to remedy the situation, we apologize for the disruption, as our goal is to continuously provide high-quality products to our users and to the web ecosystem.

Since then, we took a closer, careful look into the situation. In the process, we learned a few lessons that we'd like to share with you today. In this blog post, we will go into more details about what happened, clarify how we plan to communicate better if such things happen in the future, and remind website owners of the channels they can use to communicate with us.

So, what happened a few months ago?

In April, we had several issues related to our index. The Search index is the database that holds the hundreds of billions of web pages that we crawled on the web and that we think could answer some of our users’ queries. When a user enters a query in the Google search engine, our ranking algorithms sort through those pages in our Search index to find the most relevant, useful results in a fraction of a second. Here is more information on what happened.

1. The indexing issue

To start it off, we temporarily lost part of the Search index.
Wait... What? What do you mean “lost part of the index?” Is that even possible?

Basically, when serving search results to users, to accelerate the speed of the service, the query of the user only “travels” as far as the closest of our data centers supporting the Google Search product, from which the Search Engine Results Page (SERP) is generated. So when there are modifications to the composition of the index (some pages added and removed, documents are merged, or other types of data modification), those modifications need to be reflected in all of those data centers. The consequence is that users all over the world are consistently served pages from the most recent version of the index.


Google owns and operates data centers (like the one pictured above) around the world, to keep our products running 24 hours a day, 7 days a week - source

Keeping the index unified across all those data centers is a non trivial task. For large user-facing services, we may deploy updates by starting in one data center and expand until all relevant data centers are updated. For sensitive pieces of infrastructure, we may extend a rollout over several days, interleaving them across instances in different geographic regions. source

So, as we pushed some planned changes to the Search index, on April 5th parts of the deployment system broke, on a Friday no-less! More specifically: as we were updating the index over some of our data centers, a small number of documents ended up being dropped from the index accidentally. Hence: “we lost part of the index.”

Luckily, our on-call engineers caught the issue pretty quickly, at the same time as we started picking up chatter on social media (thanks to everyone who notified us over that weekend!). As a result, we were able to start reverting the Search index to its previous stable state in all data centers only a few hours after the issue was uncovered (we keep back-ups of our indexes just in case such events happen).

We communicated on Sunday, April 7th that we were aware of the issue, and that things were starting to get back to normal. As data centers were progressively reverting back to a stable index, we continued updating on Twitter (on April 8th, on April 9th), until we were confident that all data centers were fully back to a complete version of the index on April 11th.

2. The Search Console issue

Search Console is the set of tools and reports any webmaster can use to access data about their website’s performance in Search. For example, it shows how many impressions and clicks a website gets in the organic search results every day, or information on what pages of a website are included and excluded from the Search index.

As a consequence of the Search index having the issues we described above, Search Console started to also show inconsistencies. This is because some of the data that surfaces in Search Console originates from the Search index itself:

  • the Index Coverage report depends on the Search index being consistent across data centers.
  • when we store a page in the Search index, we can annotate the entry with key signals about the page, like the fact that the page contains rich results markup for example. Therefore, an issue with the Search index can have an impact on the Rich Results reports in Search Console.

Basically, many Search Console individual reports read data from a dedicated database. That database is partially built by using information that comes from the Search index. As we had to revert back to a previous version of the Search index, we also had to pause the updating of the Search Console database. This resulted in plateau-ing data for some reports (and flakiness in others, like the URL inspection tool).


Index coverage report for indexed pages, which shows an example of the data freshness issues in Search Console in April 2019, with a longer time between 2 updates than what is usually observed.

Because the whole Search index issue took several days to roll back (see explanation above), we were delayed focusing on fixing the Search Console database until a few days later, only after the indexing issues were fixed. We communicated on April 15th - tweet - that the Search Console was having troubles and that we were working on fixing it, and we completed our fixes on April 28th (day on which the reports started gathering fresh data again, see graph above). We communicated on Twitter on April 30th that the issue was resolved- tweet.

3. Other issues unrelated to the main indexing bug

Google Search relies on a number of systems that work together. While some of those systems can be tightly linked to one another, in some cases different parts of the system experience unrelated problems around the same time.

In the present case for example, around the same time as the main indexing bug explained above, we also had brief problems gathering fresh Google News content. Additionally, while rendering pages, certain URLs started to redirect Googlebot to other unrelated pages. These issues were entirely unrelated to the indexing bug, and were quickly resolved (tweet 1 & tweet 2).

Our communication and how we intend on doing better

In addition to communicating on social media (as highlighted above) during those few weeks, we also gave webmasters more details in 2 other channels: Search Console, as well as the Search Console Help Center.

In the Search Console Help Center

We updated our “Data anomalies in Search Console” help page after the issue was fully identified. This page is used to communicate information about data disruptions to our Search Console service when the impact affects a large number of website owners.

In Search Console

Because we know that not all our users read social media or the external Help Center page, we also added annotations on Search Console reports, to notify users that the data might not be accurate (see image below). We added this information after the resolution of the bugs. Clicking on “see here for more details” sends users to the “Data Anomalies” page in the Help Center.


Index coverage report for indexed pages, which shows an example of the data annotations that we can include to notify users of specific issues.

Communications going forward

When things break at Google, we have a strong “postmortem” culture: creating a document to debrief on the breakage, and try to avoid it happening next time. The whole process is described in more detail at the Google Site Reliability Engineering website.

In the wake of the April indexing issues, we included in the postmortem how to better communicate with webmasters in case of large system failures. Our key decisions were:

  1. Explore ways to more quickly share information within Search Console itself about widespread bugs, and have that information serve as the main point of reference for webmasters to check, in case they are suspecting outages.
  2. More promptly post to the Search Console data anomalies page, when relevant (if the disturbance is going to be seen over the long term in Search Console data).
  3. Continue tweeting as quickly as we can about such issues to quickly reassure webmasters we’re aware and that the issue is on our end.

Those commitments should make potential future similar situations more transparent for webmasters as a whole.

Putting our resolutions into action: the “new URLs not indexed” case study

On May 22nd, we tested our new communications strategy, as we experienced another issue. Here’s what happened: while processing certain URLs, our duplicate management system ran out of memory after a planned infrastructure upgrade, which caused all incoming URLs to stop processing.

Here is a timeline of how we thought about communications, following the 3 points highlighted just above:

  1. We noticed the issue (around 5.30am California time, May 22nd)
    We tweeted about the ongoing issue (around 6.40am California time, May 22nd)
    We tweeted about the resolution (around 10pm California time, May 22nd)
  2. We evaluated updating the “Data Anomalies” page in the Help Center, but decided against it since we did not expect any long-term impact for the majority of webmasters' Search Console data in the long run.
  3. The confusion that this issue created for many confirmed our earlier conclusions that we need a way to signal more clearly in the Search Console itself that there might be a disruption to one of our systems which could impact webmasters. Such a solution might take longer to implement. We will communicate on this topic in the future, as we have more news.

Last week, we also had another indexing issue. As with May 22, we tweeted to let people know there was an issue, that we were working to fix it and when the issue was resolved.

How to debug and communicate with us

We hope that this post will bring more clarity to how our systems are complex and can sometimes break, and will also help you understand how we communicate about these matters. But while this post focuses on a widespread breakage of our systems, it’s important to keep in mind that most website indexing issues are caused by an individual website’s configuration, which can create difficulties for Google Search to index that website properly. For those cases, all webmasters can debug issues using Search Console and our Help center. After doing so, if you still think that an issue is not coming from your site or don’t know how to resolve it, come talk to us and our community, we always want to take feedback from our users. Here is how to signal an issue to us:

  • Check our Webmaster Community, sometimes other webmasters have highlighted an issue that also impacts your site.
  • In person! We love contact, come and talk to us at events. Calendar.
  • Within our products! The Search Console feedback tool is very useful to our teams.
  • Twitter and YouTube!

Helping publishers and users get more out of visual searches on Google Images with AMP

Google Images has made a series of changes to help people explore, learn and do more through visual search. An important element of visual search is the ability for users to scan many ideas before coming to a decision, whether it’s purchasing a product, learning more about a stylish room, or finding instructions for a DIY project. Often this involves loading many web pages, which can slow down a search considerably and prevent users from completing a task. 

As previewed at Google I/O, we’re launching a new AMP-powered feature in Google Images on the mobile web, Swipe to Visit, which makes it faster and easier for users to browse and visit web pages. After a Google Images user selects an image to view on a mobile device, they will get a preview of the website header, which can be easily swiped up to load the web page instantly. 

Swipe to Visit uses AMP's prerender capability to show a preview of the page displayed at the bottom of the screen. When a user swipes up on the preview, the web page is displayed instantly and the publisher receives a pageview. The speed and ease of this experience makes it more likely for users to visit a publisher's site, while still allowing users to continue their browsing session.

Publishers who support AMP don’t need to take any additional action for their sites to appear in Swipe to Visit on Google Images. Publishers who don’t support AMP can learn more about getting started with AMP here. In the coming weeks, publishers can also view their traffic data from AMP in Google Images in a Search Console’s performance report for Google Images in a new search area named “AMP on Image result”.

We look forward to continuing to support the Google Images ecosystem with features that help users and publishers alike.


Helping publishers and users get more out of visual searches on Google Images with AMP

Google Images has made a series of changes to help people explore, learn and do more through visual search. An important element of visual search is the ability for users to scan many ideas before coming to a decision, whether it’s purchasing a product, learning more about a stylish room, or finding instructions for a DIY project. Often this involves loading many web pages, which can slow down a search considerably and prevent users from completing a task. 

As previewed at Google I/O, we’re launching a new AMP-powered feature in Google Images on the mobile web, Swipe to Visit, which makes it faster and easier for users to browse and visit web pages. After a Google Images user selects an image to view on a mobile device, they will get a preview of the website header, which can be easily swiped up to load the web page instantly. 

Swipe to Visit uses AMP's prerender capability to show a preview of the page displayed at the bottom of the screen. When a user swipes up on the preview, the web page is displayed instantly and the publisher receives a pageview. The speed and ease of this experience makes it more likely for users to visit a publisher's site, while still allowing users to continue their browsing session.

Publishers who support AMP don’t need to take any additional action for their sites to appear in Swipe to Visit on Google Images. Publishers who don’t support AMP can learn more about getting started with AMP here. In the coming weeks, publishers can also view their traffic data from AMP in Google Images in a Search Console’s performance report for Google Images in a new search area named “AMP on Image result”.

We look forward to continuing to support the Google Images ecosystem with features that help users and publishers alike.


Instant-loading AMP pages from your own domain

Today we are rolling out support in Google Search’s AMP web results (also known as “blue links”) to link to signed exchanges, an emerging new feature of the web enabled by the IETF web packaging specification. Signed exchanges enable displaying the publisher’s domain when content is instantly loaded via Google Search. This is available in browsers that support the necessary web platform feature—as of the time of writing, Google Chrome—and availability will expand to include other browsers as they gain support (e.g. the upcoming version of Microsoft Edge).

Background on AMP’s instant loading

One of AMP's biggest user benefits has been the unique ability to instantly load AMP web pages that users click on in Google Search. Near-instant loading works by requesting content ahead of time, balancing the likelihood of a user clicking on a result with device and network constraints–and doing it in a privacy-sensitive way.

We believe that privacy-preserving instant loading web content is a transformative user experience, but in order to accomplish this, we had to make trade-offs; namely, the URLs displayed in browser address bars begin with google.com/amp, as a consequence of being shown in the Google AMP Viewer, rather than display the domain of the publisher. We heard both user and publisher feedback over this, and last year we identified a web platform innovation that provides a solution that shows the content’s original URL while still retaining AMP's instant loading.

Introducing signed exchanges

A signed exchange is a file format, defined in the web packaging specification, that allows the browser to trust a document as if it belongs to your origin. This allows you to use first-party cookies and storage to customize content and simplify analytics integration. Your page appears under your URL instead of the google.com/amp URL.

Google Search links to signed exchanges when the publisher, browser, and the Search experience context all support it. As a publisher, you will need to publish both the signed exchange version of the content in addition to the non-signed exchange version. Learn more about how Google Search supports signed exchange.

Getting started with signed exchanges

Many publishers have already begun to publish signed exchanges since the developer preview opened up last fall. To implement signed exchanges in your own serving infrastructure, follow the guide “Serve AMP using Signed Exchanges” available at amp.dev.

If you use a CDN provider, ask them if they can provide AMP signed exchanges. Cloudflare has recently announced that it is offering signed exchanges to all of its customers free of charge.

Check out our resources like the webmaster community or get in touch with members of the AMP Project with any questions. You can also provide feedback on the signed exchange specification.


Search Console reporting for your site’s Discover performance data

Discover is a popular way for users to stay up-to-date on all their favorite topics, even when they’re not searching. To provide publishers and sites visibility into their Discover traffic, we're adding a new report in Google Search Console to share relevant statistics and help answer questions such as:

  • How often is my site shown in users' Discover? How large is my traffic?
  • Which pieces of content perform well in Discover?
  • How does my content perform differently in Discover compared to traditional search results?

A quick reminder: What is Discover?

Discover is a feature within Google Search that helps users stay up-to-date on all their favorite topics, without needing a query. Users get to their Discover experience in the Google app, on the Google.com mobile homepage, and by swiping right from the homescreen on Pixel phones. It has grown significantly since launching in 2017 and now helps more than 800M monthly active users get inspired and explore new information by surfacing articles, videos, and other content on topics they care most about. Users have the ability to follow topics directly or let Google know if they’d like to see more or less of a specific topic. In addition, Discover isn’t limited to what’s new. It surfaces the best of the web regardless of publication date, from recipes and human interest stories, to fashion videos and more. Here is our guide on how you can optimize your site for Discover.

Discover in Search Console

The new Discover report is shown to websites that have accumulated meaningful visibility in Discover, with the data shown back to March 2019. We hope this report is helpful in thinking about how you might optimize your content strategy to help users discover engaging information-- both new and evergreen.

For questions or comments on the report, feel free to drop by our webmaster help forums, or contact us through our other channels.

How to discover & suggest Google-selected canonical URLs for your pages

Sometimes a web page can be reached by using more than one URL. In such cases, Google tries to determine the best URL to display in search and to use in other ways. We call this the “canonical URL.” There are ways site owners can help us better determine what should be the canonical URLs for their content.

If you suspect we’ve not selected the best canonical URL for your content, you can check by entering your page’s address into the URL Inspection tool within Search Console. It will show you the Google-selected canonical. If you believe there’s a better canonical that should be used, follow the steps on our duplicate URLs help page on how to suggest a preferred choice for consideration.

Please be aware that if you search using the site: or inurl: commands, you will be shown the domain you specified in those, even if these aren’t the Google-selected canonical. This happens because we’re fulfilling the exact request entered. Behind-the-scenes, we still use the Google-selected canonical, including for when people see pages without using the site: or inurl: commands.

We’ve also changed URL Inspection tool so that it will display any Google-selected canonical for a URL, not just those for properties you manage in Search Console. With this change, we’re also retiring the info: command. This was an alternative way of discovering canonicals. It was relatively underused, and URL Inspection tool provides a more comprehensive solution to help publishers with URLs.


Ways to succeed in Google News

With the New Year now underway, we'd like to offer some best practices and advice we hope will lead publishers to more success within Google News in 2019.

General advice

There is a lot of helpful information to consider within the Google News Publisher Help Center. Be sure to have read the material in this area, in particular the content and technical guidelines.

Headlines and dates


  • Present clear headlines: Google News looks at a variety of signals to determine the headline of an article, including within your HTML title tag and for the most prominent text on the page. Review our headline tips.
  • Provide accurate times and dates: Google News tries to determine the time and date to display for an article in a variety of ways. You can help ensure we get it right by using the following methods:
    • Show one clear date and time: As per our date guidelines, show a clear, visible date and time between the headline and the article text. Prevent other dates from appearing on the page whenever possible, such as for related stories.
    • Use structured data: Use the datePublished and dateModified schema and use the correct time zone designator for AMP or non-AMP pages
  • Avoid artificially freshening stories: If an article has been substantially changed, it can make sense to give it a fresh date and time. However, don't artificially freshen a story without adding significant information or some other compelling reason for the freshening. Also, do not create a very slightly updated story from one previously published, then delete the old story and redirect to the new one. That's against our article URLs guidelines.

Duplicate content

Google News seeks to reward independent, original journalistic content by giving credit to the originating publisher, as both users and publishers would prefer. This means we try not to allow duplicate content—which includes scraped, rewritten, or republished material—to perform better than the original content. In line with this, these are guidelines publishers should follow:

  • Block scraped content: Scraping commonly refers to taking material from another site, often on an automated basis. Sites that scrape content must block scraped content from Google News.
  • Block rewritten content: Rewriting refers to taking material from another site, then rewriting that material so that it is not identical. Sites that rewrite content in a way that provides no substantial or clear added value must block that rewritten content from Google News. This includes, but is not limited to, rewrites that make only very slight changes or those that make many word replacements but still keep the original article's overall meaning.
  • Block or consider canonical for republished content: Republishing refers to when a publisher has permission from another publisher or author to republish an original work, such as material from wire services or in partnership with other publications.
    Publishers that allow others to republish content can help ensure that their original versions perform better in Google News by asking those republishing to block or make use of canonical.
    Google News also encourages those that republish material to consider proactively blocking such content or making use of the canonical, so that we can better identify the original content and credit it appropriately.
  • Avoid duplicate content: If you operate a network of news sites that share content, the advice above about republishing is applicable to your network. Select what you consider to be the original article and consider blocking duplicates or making use of the canonical to point to the original.

Transparency


  • Be transparent: Visitors to your site want to trust and understand who publishes it and information about those who have written articles. That's why our content guidelines stress that content should have posts with clear bylines, information about authors, and contact information for the publication.
  • Don't be deceptive: Our content policies do not allow sites or accounts that impersonate any person or organization, or that misrepresent or conceal their ownership or primary purpose. We do not allow sites or accounts that engage in coordinated activity to mislead users. This includes, but isn't limited to, sites or accounts that misrepresent or conceal their country of origin or that direct content at users in another country under false premises.

More tips


  • Avoid taking part in link schemes: Don't participate in link schemes, which can include large-scale article marketing programs or selling links that pass PageRank. Review our page on link schemes for more information.
  • Use structured for rich presentation: Both those using AMP and non-AMP pages can make use of structured data to optimize your content for rich results or carousel-like presentations.
  • Protect your users and their data: Consider securing every page of your website with HTTPS to protect the integrity and confidentiality of the data users exchange on your site. You can find more useful tips in our best practices on how to implement HTTPS.

Here's to a great 2019!

We hope these tips help publishers succeed in Google News over the coming year. For those who have more questions about Google News, we are unable to do one-to-one support. However, we do monitor our Google News Publisher Forum—which has been newly-revamped—and try to provide guidance on questions that might help a number of publishers all at once. The forum is also a great resource where publishers share tips and advice with each other.

Introducing the Indexing API and structured data for livestreams

Over the past few years, it's become easier than ever to stream live videos online, from celebrity updates to special events. But it's not always easy for people to determine which videos are live and know when to tune in.
Today, we're introducing new tools to help more people discover your livestreams in Search and Assistant. With livestream structured data and the Indexing API, you can let Google know when your video is live, so it will be eligible to appear with a red "live" badge:

Add livestream structured data to your page

If your website streams live videos, use the livestream developer documentation to flag your video as a live broadcast and mark the start and end times. In addition, VideoObject structured data is required to tell Google that there's a video on your page.

Update Google quickly with the Indexing API

The Indexing API now supports pages with livestream structured data. We encourage you to call the Indexing API to request that your site is crawled in time for the livestream. We recommend calling the Indexing API when your livestream begins and ends, and if the structured data changes.
For more information, visit our developer documentation. If you have any questions, ask us in the Webmaster Help Forum. We look forward to seeing your live videos on Google!

Rich Results expands for Question & Answer pages

People come to Google seeking information about all kinds of questions.
Frequently, the information they're looking for is on sites where users ask and answer each other's questions. Popular social news sites, expert forums, and help and support message boards are all examples of this pattern.

A screenshot of an example search result for a page titled “Why do touchscreens sometimes register a touch when ...” with a preview of the top answers from the page.
In order to help users better identify which search results may give the best information about their question, we have developed a new rich result type for question and answer sites. Search results for eligible Q&A pages display a preview of the top answers. This new presentation helps site owners reach the right users for their content and helps users get the relevant information about their questions faster.
A screenshot of an example search result for a page titled “Why do touchscreens sometimes register a touch when ...” with a preview of the top answers from the page.

To be eligible for this feature, add Q&A structured data to your pages with Q&A content. Be sure to use the Structured Data Testing Tool to see if your page is eligible and to preview the appearance in search results. You can also check out Search Console to see aggregate stats and markup error examples. The Performance report also tells you which queries show your Q&A Rich Result in Search results, and how these change over time.
If you have any questions, ask us in the Webmaster Help Forum or reach out on Twitter!

Introducing the Indexing API for job posting URLs

Last June we launched a job search experience that has since connected tens of millions of job seekers around the world with relevant job opportunities from third party providers across the web. Timely indexing of new job content is critical because many jobs are filled relatively quickly. Removal of expired postings is important because nothing's worse than finding a great job only to discover it's no longer accepting applications.

Today we're releasing the Indexing API to address this problem. This API allows any site owner to directly notify Google when job posting pages are added or removed. This allows Google to schedule job postings for a fresh crawl, which can lead to higher quality user traffic and job applicant satisfaction. Currently, the Indexing API can only be used for job posting pages that include job posting structured data.

For websites with many short-lived pages like job postings, the Indexing API keeps job postings fresh in Search results because it allows updates to be pushed individually. This API can be integrated into your job posting flow, allowing high quality job postings to be searchable quickly after publication. In addition, you can check the last time Google received each kind of notification for a given URL.

Follow the Quickstart guide to see how the Indexing API works. If you have any questions, ask us in the Webmaster Help Forum. We look forward to hearing from you!