Improving Malicious Document Detection in Gmail with Deep Learning




Gmail protects your incoming mail against spam, phishing attempts, and malware. Our existing machine learning models are highly effective at doing this, and in conjunction with our other protections, they help block more than 99.9% of threats from reaching Gmail inboxes.

One of our key protections is our malware scanner that processes more than 300 billion attachments each week to block harmful content. 63% percent of the malicious documents we block differ from day to day. To stay ahead of this constantly evolving threat, we recently added a new generation of document scanners that rely on deep learning to improve our detection capabilities. We’re sharing the details of this technology and its early success this week at RSA 2020.

Since the new scanner launched at the end of 2019, we have increased our daily detection coverage of Office documents that contain malicious scripts by 10%. Our technology is especially helpful at detecting adversarial, bursty attacks. In these cases, our new scanner has improved our detection rate by 150%. Under the hood, our new scanner uses a distinct TensorFlow deep-learning model trained with TFX (TensorFlow Extended) and a custom document analyzer for each file type. The document analyzers are responsible for parsing the document, identifying common attack patterns, extracting macros, deobfuscating content, and performing feature extraction.


Strengthening our document detection capabilities is one of our key focus areas, as malicious documents represent 58% of the malware targeting Gmail users. We are still actively developing this technology, and right now, we only use it to scan Office documents.

Our new scanner runs in parallel with existing detection capabilities, all of which contribute to the final verdict of our decision engine to block a malicious document. Combining different scanners is one of the cornerstones of our defense-in-depth approach to help protect users and ensure our detection system is resilient to adversarial attacks.

We will continue to actively expand the use of artificial intelligence to protect our users’ inboxes, and to stay ahead of attacks.