Tag Archives: malware

Combating Potentially Harmful Applications with Machine Learning at Google: Datasets and Models

Posted by Mo Yu, Android Security & Privacy Team

In a previous blog post, we talked about using machine learning to combat Potentially Harmful Applications (PHAs). This blog post covers how Google uses machine learning techniques to detect and classify PHAs. We'll discuss the challenges in the PHA detection space, including the scale of data, the correct identification of PHA behaviors, and the evolution of PHA families. Next, we will introduce two of the datasets that make the training and implementation of machine learning models possible, such as app analysis data and Google Play data. Finally, we will present some of the approaches we use, including logistic regression and deep neural networks.

Using machine learning to scale

Detecting PHAs is challenging and requires a lot of resources. Our security experts need to understand how apps interact with the system and the user, analyze complex signals to find PHA behavior, and evolve their tactics to stay ahead of PHA authors. Every day, Google Play Protect (GPP) analyzes over half a million apps, which makes a lot of new data for our security experts to process.

Leveraging machine learning helps us detect PHAs faster and at a larger scale. We can detect more PHAs just by adding additional computing resources. In many cases, machine learning can find PHA signals in the training data without human intervention. Sometimes, those signals are different than signals found by security experts. Machine learning can take better advantage of this data, and discover hidden relationships between signals more effectively.

There are two major parts of Google Play Protect's machine learning protections: the data and the machine learning models.

Data sources

The quality and quantity of the data used to create a model are crucial to the success of the system. For the purpose of PHA detection and classification, our system mainly uses two anonymous data sources: data from analyzing apps and data from how users experience apps.

App data

Google Play Protect analyzes every app that it can find on the internet. We created a dataset by decomposing each app's APK and extracting PHA signals with deep analysis. We execute various processes on each app to find particular features and behaviors that are relevant to the PHA categories in scope (for example, SMS fraud, phishing, privilege escalation). Static analysis examines the different resources inside an APK file while dynamic analysis checks the behavior of the app when it's actually running. These two approaches complement each other. For example, dynamic analysis requires the execution of the app regardless of how obfuscated its code is (obfuscation hinders static analysis), and static analysis can help detect cloaking attempts in the code that may in practice bypass dynamic analysis-based detection. In the end, this analysis produces information about the app's characteristics, which serve as a fundamental data source for machine learning algorithms.

Google Play data

In addition to analyzing each app, we also try to understand how users perceive that app. User feedback (such as the number of installs, uninstalls, user ratings, and comments) collected from Google Play can help us identify problematic apps. Similarly, information about the developer (such as the certificates they use and their history of published apps) contribute valuable knowledge that can be used to identify PHAs. All these metrics are generated when developers submit a new app (or new version of an app) and by millions of Google Play users every day. This information helps us to understand the quality, behavior, and purpose of an app so that we can identify new PHA behaviors or identify similar apps.

In general, our data sources yield raw signals, which then need to be transformed into machine learning features for use by our algorithms. Some signals, such as the permissions that an app requests, have a clear semantic meaning and can be directly used. In other cases, we need to engineer our data to make new, more powerful features. For example, we can aggregate the ratings of all apps that a particular developer owns, so we can calculate a rating per developer and use it to validate future apps. We also employ several techniques to focus in on interesting data.To create compact representations for sparse data, we use embedding. To help streamline the data to make it more useful to models, we use feature selection. Depending on the target, feature selection helps us keep the most relevant signals and remove irrelevant ones.

By combining our different datasets and investing in feature engineering and feature selection, we improve the quality of the data that can be fed to various types of machine learning models.

Models

Building a good machine learning model is like building a skyscraper: quality materials are important, but a great design is also essential. Like the materials in a skyscraper, good datasets and features are important to machine learning, but a great algorithm is essential to identify PHA behaviors effectively and efficiently.

We train models to identify PHAs that belong to a specific category, such as SMS-fraud or phishing. Such categories are quite broad and contain a large number of samples given the number of PHA families that fit the definition. Alternatively, we also have models focusing on a much smaller scale, such as a family, which is composed of a group of apps that are part of the same PHA campaign and that share similar source code and behaviors. On the one hand, having a single model to tackle an entire PHA category may be attractive in terms of simplicity but precision may be an issue as the model will have to generalize the behaviors of a large number of PHAs believed to have something in common. On the other hand, developing multiple PHA models may require additional engineering efforts, but may result in better precision at the cost of reduced scope.

We use a variety of modeling techniques to modify our machine learning approach, including supervised and unsupervised ones.

One supervised technique we use is logistic regression, which has been widely adopted in the industry. These models have a simple structure and can be trained quickly. Logistic regression models can be analyzed to understand the importance of the different PHA and app features they are built with, allowing us to improve our feature engineering process. After a few cycles of training, evaluation, and improvement, we can launch the best models in production and monitor their performance.

For more complex cases, we employ deep learning. Compared to logistic regression, deep learning is good at capturing complicated interactions between different features and extracting hidden patterns. The millions of apps in Google Play provide a rich dataset, which is advantageous to deep learning.

In addition to our targeted feature engineering efforts, we experiment with many aspects of deep neural networks. For example, a deep neural network can have multiple layers and each layer has several neurons to process signals. We can experiment with the number of layers and neurons per layer to change model behaviors.

We also adopt unsupervised machine learning methods. Many PHAs use similar abuse techniques and tricks, so they look almost identical to each other. An unsupervised approach helps define clusters of apps that look or behave similarly, which allows us to mitigate and identify PHAs more effectively. We can automate the process of categorizing that type of app if we are confident in the model or can request help from a human expert to validate what the model found.

PHAs are constantly evolving, so our models need constant updating and monitoring. In production, models are fed with data from recent apps, which help them stay relevant. However, new abuse techniques and behaviors need to be continuously detected and fed into our machine learning models to be able to catch new PHAs and stay on top of recent trends. This is a continuous cycle of model creation and updating that also requires tuning to ensure that the precision and coverage of the system as a whole matches our detection goals.

Looking forward

As part of Google's AI-first strategy, our work leverages many machine learning resources across the company, such as tools and infrastructures developed by Google Brain and Google Research. In 2017, our machine learning models successfully detected 60.3% of PHAs identified by Google Play Protect, covering over 2 billion Android devices. We continue to research and invest in machine learning to scale and simplify the detection of PHAs in the Android ecosystem.

Acknowledgements

This work was developed in joint collaboration with Google Play Protect, Safe Browsing and Play Abuse teams with contributions from Andrew Ahn, Hrishikesh Aradhye, Daniel Bali, Hongji Bao, Yajie Hu, Arthur Kaiser, Elena Kovakina, Salvador Mandujano, Melinda Miller, Rahul Mishra, Damien Octeau, Sebastian Porst, Chuangang Ren, Monirul Sharif, Sri Somanchi, Sai Deep Tetali, Zhikun Wang, and Mo Yu.

Updates to the Google Safe Browsing’s Site Status Tool

(Cross-posted from the Google Security Blog)
Google Safe Browsing gives users tools to help protect themselves from web-based threats like malware, unwanted software, and social engineering. We are best known for our warnings, which users see when they attempt to navigate to dangerous sites or download dangerous files. We also provide other tools, like the Site Status Tool, where people can check the current safety status of a web page (without having to visit it).

We host this tool within Google’s Safe Browsing Transparency Report. As with other sections in Google’s Transparency Report, we make this data available to give the public more visibility into the security and health of the online ecosystem. Users of the Site Status Tool input a webpage (as a URL, website, or domain) into the tool, and the most recent results of the Safe Browsing analysis for that webpage are returned...plus references to troubleshooting help and educational materials.



We’ve just launched a new version of the Site Status Tool that provides simpler, clearer results and is better designed for the primary users of the page: people who are visiting the tool from a Safe Browsing warning they’ve received, or doing casual research on Google’s malware and phishing detection. The tool now features a cleaner UI, easier-to-interpret language, and more precise results. We’ve also moved some of the more technical data on associated ASes (autonomous systems) over to the malware dashboard section of the report.

 While the interface has been streamlined, additional diagnostic information is not gone: researchers who wish to find more details can drill-down elsewhere in Safe Browsing’s Transparency Report, while site-owners can find additional diagnostic information in Search Console. One of the goals of the Transparency Report is to shed light on complex policy and security issues, so, we hope the design adjustments will indeed provide our users with additional clarity.

#NoHacked: A year in review

We hope your year started out safe and secure!
We wanted to share with you a summary of our 2016 work as we continue our #NoHacked campaign. Let’s start with some trends on hacked sites from the past year.

State of Website Security in 2016

First off, some unfortunate news. We’ve seen an increase in the number of hacked sites by approximately 32% in 2016 compared to 2015. We don’t expect this trend to slow down. As hackers get more aggressive and more sites become outdated, hackers will continue to capitalize by infecting more sites.
On the bright side, 84% webmasters who do apply for reconsideration are successful in cleaning their sites. However, 61% of webmasters who were hacked never received a notification from Google that their site was infected because their sites weren't verified in Search Console. Remember to register for Search Console if you own or manage a site. It’s the primary channel that Google uses to communicate site health alerts.

More Help for Hacked Webmasters


We’ve been listening to your feedback to better understand how we can help webmasters with security issues. One of the top requests was easier to understand documentation about hacked sites. As a result we’ve been hard at work to make our documentation more useful.
First, we created new documentation to give webmasters more context when their site has been compromised. Here is a list of the new help documentation:
Next, we created clean up guides for sites affected by known hacks. We’ve noticed that sites often get affected in similar ways when hacked. By investigating the similarities, we were able to create clean up guides for specific known type of hack. Below is a short description of each of the guides we created:
Gibberish Hack: The gibberish hack automatically creates many pages with non-sensical sentences filled with keywords on the target site. Hackers do this so the hacked pages show up in Google Search. Then, when people try to visit these pages, they’ll be redirected to an unrelated page, like a porn site. Learn more on how to fix this type of hack.
Japanese Keywords Hack: The Japanese keywords hack typically creates new pages with Japanese text on the target site in randomly generated directory names. These pages are monetized using affiliate links to stores selling fake brand merchandise and then shown in Google search. Sometimes the accounts of the hackers get added in Search Console as site owners. Learn more on how to fix this type of hack.
Cloaked Keywords Hack: The cloaked keywords and link hack automatically creates many pages with non-sensical sentence, links, and images. These pages sometimes contain basic template elements from the original site, so at first glance, the pages might look like normal parts of the target site until you read the content. In this type of attack, hackers usually use cloaking techniques to hide the malicious content and make the injected page appear as part of the original site or a 404 error page. Learn more on how to fix this type of hack.

Prevention is Key


As always it’s best to take a preventative approach and secure your site rather than dealing with the aftermath. Remember a chain is only as strong as its weakest link. You can read more about how to identify vulnerabilities on your site in our hacked help guide. We also recommend staying up-to-date on releases and announcements from your Content Management System (CMS) providers and software/hardware vendors.

Looking Forward

Hacking behavior is constantly evolving, and research allows us to stay up to date on and combat the latest trends. You can learn about our latest research publications in the information security research site. Highlighted below are a few specific studies specific to website compromises:
If you have feedback or specific questions about compromised sites, the Webmaster Help Forums has an active group of Googlers and technical contributors that can address your questions and provide additional technical support.

Silence speaks louder than words when finding malware

Originally posted on Android Developer Blog

Posted by Megan Ruthven, Software Engineer

In Android Security, we're constantly working to better understand how to make Android devices operate more smoothly and securely. One security solution included on all devices with Google Play is Verify apps. Verify apps checks if there are Potentially Harmful Apps (PHAs) on your device. If a PHA is found, Verify apps warns the user and enables them to uninstall the app.

But, sometimes devices stop checking up with Verify apps. This may happen for a non-security related reason, like buying a new phone, or, it could mean something more concerning is going on. When a device stops checking up with Verify apps, it is considered Dead or Insecure (DOI). An app with a high enough percentage of DOI devices downloading it, is considered a DOI app. We use the DOI metric, along with the other security systems to help determine if an app is a PHA to protect Android users. Additionally, when we discover vulnerabilities, we patch Android devices with our security update system. This blog post explores the Android Security team's research to identify the security-related reasons that devices stop working and prevent it from happening in the future.
Flagging DOI Apps
To understand this problem more deeply, the Android Security team correlates app install attempts and DOI devices to find apps that harm the device in order to protect our users.
With these factors in mind, we then focus on 'retention'. A device is considered retained if it continues to perform periodic Verify apps security check ups after an app download. If it doesn't, it's considered potentially dead or insecure (DOI). An app's retention rate is the percentage of all retained devices that downloaded the app in one day. Because retention is a strong indicator of device health, we work to maximize the ecosystem's retention rate. Therefore, we use an app DOI scorer, which assumes that all apps should have a similar device retention rate. If an app's retention rate is a couple of standard deviations lower than average, the DOI scorer flags it. A common way to calculate the number of standard deviations from the average is called a Z-score. The equation for the Z-score is below.
  • N = Number of devices that downloaded the app.
  • x = Number of retained devices that downloaded the app.
  • p = Probability of a device downloading any app will be retained.

In this context, we call the Z-score of an app's retention rate a DOI score. The DOI score indicates an app has a statistically significant lower retention rate if the Z-score is much less than -3.7. This means that if the null hypothesis is true, there is much less than a 0.01% chance the magnitude of the Z-score being as high. In this case, the null hypothesis means the app accidentally correlated with lower retention rate independent of what the app does.
This allows for percolation of extreme apps (with low retention rate and high number of downloads) to the top of the DOI list. From there, we combine the DOI score with other information to determine whether to classify the app as a PHA. We then use Verify apps to remove existing installs of the app and prevent future installs of the app.
Difference between a regular and DOI app download on the same device.
Results in the wild
Among others, the DOI score flagged many apps in three well known malware families— Hummingbad, Ghost Push, and Gooligan. Although they behave differently, the DOI scorer flagged over 25,000 apps in these three families of malware because they can degrade the Android experience to such an extent that a non-negligible amount of users factory reset or abandon their devices. This approach provides us with another perspective to discover PHAs and block them before they gain popularity. Without the DOI scorer, many of these apps would have escaped the extra scrutiny of a manual review.
The DOI scorer and all of Android's anti-malware work is one of multiple layers protecting users and developers on Android. For an overview of Android's security and transparency efforts, check out our page.

Silence speaks louder than words when finding malware

Posted by Megan Ruthven, Software Engineer

In Android Security, we're constantly working to better understand how to make Android devices operate more smoothly and securely. One security solution included on all devices with Google Play is Verify apps. Verify apps checks if there are Potentially Harmful Apps (PHAs) on your device. If a PHA is found, Verify apps warns the user and enables them to uninstall the app.

But, sometimes devices stop checking up with Verify apps. This may happen for a non-security related reason, like buying a new phone, or, it could mean something more concerning is going on. When a device stops checking up with Verify apps, it is considered Dead or Insecure (DOI). An app with a high enough percentage of DOI devices downloading it, is considered a DOI app. We use the DOI metric, along with the other security systems to help determine if an app is a PHA to protect Android users. Additionally, when we discover vulnerabilities, we patch Android devices with our security update system.

This blog post explores the Android Security team's research to identify the security-related reasons that devices stop working and prevent it from happening in the future.
Flagging DOI Apps

To understand this problem more deeply, the Android Security team correlates app install attempts and DOI devices to find apps that harm the device in order to protect our users.
With these factors in mind, we then focus on 'retention'. A device is considered retained if it continues to perform periodic Verify apps security check ups after an app download. If it doesn't, it's considered potentially dead or insecure (DOI). An app's retention rate is the percentage of all retained devices that downloaded the app in one day. Because retention is a strong indicator of device health, we work to maximize the ecosystem's retention rate.

Therefore, we use an app DOI scorer, which assumes that all apps should have a similar device retention rate. If an app's retention rate is a couple of standard deviations lower than average, the DOI scorer flags it. A common way to calculate the number of standard deviations from the average is called a Z-score. The equation for the Z-score is below.

  • N = Number of devices that downloaded the app.
  • x = Number of retained devices that downloaded the app.
  • p = Probability of a device downloading any app will be retained.

In this context, we call the Z-score of an app's retention rate a DOI score. The DOI score indicates an app has a statistically significant lower retention rate if the Z-score is much less than -3.7. This means that if the null hypothesis is true, there is much less than a 0.01% chance the magnitude of the Z-score being as high. In this case, the null hypothesis means the app accidentally correlated with lower retention rate independent of what the app does.
This allows for percolation of extreme apps (with low retention rate and high number of downloads) to the top of the DOI list. From there, we combine the DOI score with other information to determine whether to classify the app as a PHA. We then use Verify apps to remove existing installs of the app and prevent future installs of the app.

Difference between a regular and DOI app download on the same device.


Results in the wild
Among others, the DOI score flagged many apps in three well known malware families— Hummingbad, Ghost Push, and Gooligan. Although they behave differently, the DOI scorer flagged over 25,000 apps in these three families of malware because they can degrade the Android experience to such an extent that a non-negligible amount of users factory reset or abandon their devices. This approach provides us with another perspective to discover PHAs and block them before they gain popularity. Without the DOI scorer, many of these apps would have escaped the extra scrutiny of a manual review.
The DOI scorer and all of Android's anti-malware work is one of multiple layers protecting users and developers on Android. For an overview of Android's security and transparency efforts, check out our page.


More Safe Browsing Help for Webmasters

(Crossposted from the Google Security Blog.)
For more than nine years, Safe Browsing has helped webmasters via Search Console with information about how to fix security issues with their sites. This includes relevant Help Center articles, example URLs to assist in diagnosing the presence of harmful content, and a process for webmasters to request reviews of their site after security issues are addressed. Over time, Safe Browsing has expanded its protection to cover additional threats to user safety such as Deceptive Sites and Unwanted Software.

To help webmasters be even more successful in resolving issues, we’re happy to announce that we’ve updated the information available in Search Console in the Security Issues report.


The updated information provides more specific explanations of six different security issues detected by Safe Browsing, including malware, deceptive pages, harmful downloads, and uncommon downloads. These explanations give webmasters more context and detail about what Safe Browsing found. We also offer tailored recommendations for each type of issue, including sample URLs that webmasters can check to identify the source of the issue, as well as specific remediation actions webmasters can take to resolve the issue.

We on the Safe Browsing team definitely recommend registering your site in Search Console even if it is not currently experiencing a security issue. We send notifications through Search Console so webmasters can address any issues that appear as quickly as possible.

Our goal is to help webmasters provide a safe and secure browsing experience for their users. We welcome any questions or feedback about the new features on the Google Webmaster Help Forum, where Top Contributors and Google employees are available to help.

For more information about Safe Browsing’s ongoing work to shine light on the state of web security and encourage safer web security practices, check out our summary of trends and findings on the Safe Browsing Transparency Report. If you’re interested in the tools Google provides for webmasters and developers dealing with hacked sites, this video provides a great overview.