Tag Archives: bigquery

Unified experience for Gmail logs in BigQuery, configure your existing Gmail logs to route to Workspace logs

What’s changing 

In the coming months, we will move the location of the existing Gmail logs in BigQuery to Google Workspace logs and reports in BigQuery. This change will only impact existing Google Workspace customers who have BigQuery Export enabled. New customers will be guided through setting up a Workspace BigQuery project for Gmail logs. 


While this change is happening, for a brief period of time, Gmail data will be exported to both the Gmail logs and the Workspace logs — existing Google Workspace customers who are impacted will receive an email notification with more information in the coming weeks. 

Gmail logs in BigQuery

Migrating to Unified Workspace logs in BigQuery


Who’s impacted 

Admins


Why it’s important 

This change will create a single space to access all of your Google Workspace audit events and eliminate the need to switch between various sources of truth. 


Once this merger is completed, you will not be able to use Gmail logs in BigQuery in the Admin console. Visit the Help Center for more information, including detailed step-by-step instructions on handling this change. 


Additional details

Temporary dual export period and BigQuery storage requirements 
Once Gmail logs are set to be exported to Workspace logs, this data will temporarily be exported to two destinations: 
  • Your previous, Gmail-only BigQuery project 
  • Your new or existing Workspace BigQuery project, which stores data for all Google Workspace applications 

This will also impact your BigQuery quota limits and how often they refresh, as well as the cost associated with exporting logs to BigQuery. To avoid impact on your quotas and duplicative data, we strongly recommend turning off Gmail log exports to your Gmail-only BigQuery projects.


Getting started 


Rollout pace 


Availability 

  • Available to Google Workspace Enterprise Essentials, Enterprise Standard, Enterprise Plus, Education Plus, Education Standard 
  • Not available to Google Workspace Essentials, Business Starter, Business Standard, Business Plus, Frontline, and Nonprofits, as well as legacy G Suite Basic and Business customers 

Resources 

Logica: organizing your data queries, making them universally reusable and fun

We present Logica, a novel open source Logic Programming language. A successor to Yedalog (a language developed at Google earlier), it is a Datalog-like logic programming language. Logica code compiles to SQL and runs on Google BigQuery (with experimental support for PostgreSQL and SQLite), but it is much more concise and supports the clean and reusable abstraction mechanisms that SQL lacks. It supports modules and imports, it can be used from an interactive Python notebook, and it even makes testing your queries natural and easy.

“Data is the new oil”, they say, and SQL is so far the lingua franca for working with data. When SQL (or “Structured English Query Language”, as it was first named) was invented in the 1970s, its authors might not have imagined the popularity that it would reach half a century later. Today, systems ranging from tiny smart watch applications to enterprise IT solutions, read and write their data using SQL. Even the browser that you are using to read this post now might have a working built-in SQL database in it.

Despite the widespread adoption, SQL is not flawless. Constructing statements from long chains of English words (which are often capitalized to keep the old-fashioned COBOL spirit of the 70s alive!) can be very verbose—a single query spanning hundreds of lines is a routine occurrence. The main flaw of SQL, however, lies in its very limited support for abstraction.

Good programming is about creating small, understandable, reusable pieces of logic that can be tested, given names, and organized into packages which can later be used to construct more useful pieces of logic. SQL resists this workflow. Although you can encapsulate certain repeated computations into views and functions, the syntax and support for these can vary among implementations, the notions of packages and imports are generally nonexistent, and higher-level constructions (e.g. passing a function to a function) are impossible.

This inherent resistance to decomposition of logic into bite-sized pieces is what leads to the contrived, lengthy queries, the copy-pasted chunks of code and, eventually, unmaintainable, unstructured (note the irony) SQL codebases. To make things worse, SQL code is rarely tested, because “testing SQL queries” sounds rather esoteric to most engineers, at best. Because of that, a number of alternative query languages and libraries have been developed. Of those, systems based on logic programming perhaps come the closest to addressing SQL’s limitations.

Logic programming languages solve the problems of SQL by using the syntax of mathematical propositional logic rather than natural English. The language of formal logic was designed by mathematicians specifically to make the expression of complex statements easier, and it suits this purpose much better than natural language. Logica extends classical Logic programming syntax further, most notably with aggregation, hence the name, which stands for

Logica = Logic + Aggregation.

Let us see how it all works. SQL operates with relations, which are sets of rows. In logic programming the analog of a relation is a predicate. While a predicate is a set of rows, we think of it as a logical condition, which describes the rows of a relation. Here is, for example, the definition of a simple predicate:

MagicNumber(x: 2);

MagicNumber(x: 3);

MagicNumber(x: 5);

The definition claims that the condition MagicNumber(x) must hold when X is precisely either 2, 3, or 5. That means, if we were to query this predicate (i.e. request all values of X that satisfy it), the output should be a “relation” with a single column X and rows 2, 3, and 5. The SQL equivalent would be:

SELECT 2 AS x

UNION ALL

SELECT 3 AS x

UNION ALL

SELECT 5 AS x;

Rather than listing the individual values, we could have defined the predicate by encoding a logical condition upon X as follows:

MagicNumber(x:) :-

  x in [2, 3, 5];

Now, here is where the magic starts. Firstly, any table in your database is itself already a predicate, so the following definition:

MagicComment(comment_text:) :-

 `comments`(user_id:, comment_text:),

 user_id == 5;

Defines a predicate MagicComment, which includes precisely those comment_text values that are present in the comments table in rows where user_id == 5. In SQL this would read:

SELECT comment_text FROM comments WHERE user_id = 5;

Observe what happens if we replace the condition “user_id == 5” in our predicate with MagicNumber(x: user_id):

MagicComment(comment_text:) :-

 `comments`(user_id:, comment_text:),

 MagicNumber(x: user_id);

Here, we are querying for comments of users whose ID is one of the “magic numbers” we just defined above. Note how easily we could reuse a previously defined piece of code without having to copy anything around. We could now even extract the MagicNumber to a common module and import it wherever it is needed:

import my_project.magic.MagicNumber;
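For intuition, the SQL that this version of MagicComment compiles to would be roughly equivalent to the following hand-written query (a sketch only; the SQL Logica actually generates is more mechanical):

SELECT comment_text
FROM comments
WHERE user_id IN (2, 3, 5);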

As a final example, let us mock the comments table, in a unittest of a query.

import my_project.magic.MagicComment;


MockComments(user_id: 1, comment_text: "Hello");

MockComments(user_id: 2, comment_text: "Logic");

MockComments(user_id: 3, comment_text: "Programming");


MagicCommentTest := MagicComment(`comments`: MockComments);

If we query the MagicComment predicate here, it will not try to read the comments table in the database. Instead, it will use the predicate we just defined, thus letting us verify its correctness by testing the output (it must include two rows “Logic” and “Programming”). Observe how natural and frictionless many of the good programming practices become with Logica, and compare that to what you would have to do to achieve the same using bare SQL.

There is much more to Logica, so make sure you give it a try—chances are, you will love it! Start with this tutorial to learn Logica. Even if you do not end up using it in your next project, learning a new powerful language may open your mind to new ideas and perspectives on data processing and computing in general.

The simple examples above are only a small sample of how much more concise Logica code can be than SQL for complex queries. In particular, we did not even touch the topic of aggregations in this article. For all of this, see the examples section of the Logica open source repository.

We also hope that some of the readers consider contributing to Logica development. That’s what open source is all about!

By Konstantin Tretyakov and Evgeny Skvortsov – Logica Open Source Project

Code that final mile: from big data analysis to slide presentation

Posted by Wesley Chun (@wescpy), Developer Advocate, Google Cloud

Google Cloud Platform (GCP) provides infrastructure, serverless products, and APIs that help you build, innovate, and scale. G Suite provides a collection of productivity tools, developer APIs, extensibility frameworks and low-code platforms that let you integrate with G Suite applications, data, and users. While each solution is compelling on its own, users can get more power and flexibility by leveraging both together.

In the latest episode of the G Suite Dev Show, I'll show you one example of how you can take advantage of powerful GCP tools right from G Suite applications. BigQuery, for example, can help you surface valuable insight from massive amounts of data. However, regardless of "the tech" you use, you still have to justify and present your findings to management, right? You've already completed the big data analysis part, so why not go that final mile and tap into G Suite for its strengths? In the sample app covered in the video, we show you how to go from big data analysis all the way to an "exec-ready" presentation.

The sample application is meant to give you an idea of what's possible. While the video walks through the code a bit more, let's give all of you a high-level overview here. Google Apps Script is a G Suite serverless development platform that provides straightforward access to G Suite APIs as well as some GCP tools such as BigQuery. The first part of our app, the runQuery() function, issues a query to BigQuery from Apps Script then connects to Google Sheets to store the results into a new Sheet (note we left out CONSTANT variable definitions for brevity):

function runQuery() {
  // make BigQuery request
  var request = {query: BQ_QUERY};
  var queryResults = BigQuery.Jobs.query(request, PROJECT_ID);
  var jobId = queryResults.jobReference.jobId;
  queryResults = BigQuery.Jobs.getQueryResults(PROJECT_ID, jobId);
  var rows = queryResults.rows;

  // put results into a 2D array
  var data = new Array(rows.length);
  for (var i = 0; i < rows.length; i++) {
    var cols = rows[i].f;
    data[i] = new Array(cols.length);
    for (var j = 0; j < cols.length; j++) {
      data[i][j] = cols[j].v;
    }
  }

  // put array data into new Sheet (use just the field names for the header row)
  var spreadsheet = SpreadsheetApp.create(QUERY_NAME);
  var sheet = spreadsheet.getActiveSheet();
  var headers = queryResults.schema.fields.map(function(field) {
    return field.name;
  });
  sheet.appendRow(headers); // header row
  sheet.getRange(START_ROW, START_COL,
      rows.length, headers.length).setValues(data);

  // return Sheet object for later use
  return spreadsheet;
}
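For context, BQ_QUERY is just the SQL string handed to BigQuery. A hypothetical example against a public dataset (not necessarily the query used in the video) might look like:

SELECT word, SUM(word_count) AS total_count
FROM [bigquery-public-data:samples.shakespeare]
GROUP BY word
ORDER BY total_count DESC
LIMIT 10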

The runQuery() function returns a handle to the new Google Sheet, which we can then pass on to the next component: using Google Sheets to generate a Chart from the BigQuery data. Again leaving out the CONSTANTs, we have the 2nd part of our app, the createColumnChart() function:

function createColumnChart(spreadsheet) {
  // create & put chart on 1st Sheet
  var sheet = spreadsheet.getSheets()[0];
  var chart = sheet.newChart()
      .setChartType(Charts.ChartType.COLUMN)
      .addRange(sheet.getRange(START_CELL + ':' + END_CELL))
      .setPosition(START_ROW, START_COL, OFFSET, OFFSET)
      .build();
  sheet.insertChart(chart);

  // return Chart object for later use
  return chart;
}

The chart is returned by createColumnChart() so we can use that plus the Sheets object to build the desired slide presentation from Apps Script with Google Slides in the 3rd part of our app, the createSlidePresentation() function:

function createSlidePresentation(spreadsheet, chart) {
  // create new deck & add title+subtitle
  var deck = SlidesApp.create(QUERY_NAME);
  var [title, subtitle] = deck.getSlides()[0].getPageElements();
  title.asShape().getText().setText(QUERY_NAME);
  subtitle.asShape().getText().setText('via GCP and G Suite APIs:\n' +
      'Google Apps Script, BigQuery, Sheets, Slides');

  // add new slide and insert empty table
  var tableSlide = deck.appendSlide(SlidesApp.PredefinedLayout.BLANK);
  var sheetValues = spreadsheet.getSheets()[0].getRange(
      START_CELL + ':' + END_CELL).getValues();
  var table = tableSlide.insertTable(sheetValues.length, sheetValues[0].length);

  // populate table with data in Sheets
  for (var i = 0; i < sheetValues.length; i++) {
    for (var j = 0; j < sheetValues[0].length; j++) {
      table.getCell(i, j).getText().setText(String(sheetValues[i][j]));
    }
  }

  // add new slide and add Sheets chart to it
  var chartSlide = deck.appendSlide(SlidesApp.PredefinedLayout.BLANK);
  chartSlide.insertSheetsChart(chart);

  // return Presentation object for later use
  return deck;
}

Finally, we need a driver application that calls all three one after another, the createBigQueryPresentation() function:

function createBigQueryPresentation() {
  var spreadsheet = runQuery();
  var chart = createColumnChart(spreadsheet);
  var deck = createSlidePresentation(spreadsheet, chart);
}

We left out some detail in the code above but hope this pseudocode helps kickstart your own project. Seeking a guided tutorial to building this app one step-at-a-time? Do our codelab at g.co/codelabs/bigquery-sheets-slides. Alternatively, go see all the code by hitting our GitHub repo at github.com/googlecodelabs/bigquery-sheets-slides. After executing the app successfully, you'll see the fruits of your big data analysis captured in a presentable way in a Google Slides deck:

This isn't the end of the story as this is just one example of how you can leverage both platforms from Google Cloud. In fact, this was one of two sample apps featured in our Cloud NEXT '18 session this summer exploring interoperability between GCP & G Suite which you can watch here:

Stay tuned as more examples are coming. We hope these videos plus the codelab inspire you to build on your own ideas.

Machine Learning in Google BigQuery



Google BigQuery allows interactive analysis of large datasets, making it easy for businesses to share meaningful insights and develop solutions based on customer analytics. However, many of the businesses that are using BigQuery aren’t using machine learning to help better understand the data they are generating. This is because data analysts, proficient in SQL, may not have the traditional data science background needed to apply machine learning techniques.

Today we’re announcing BigQuery ML, a capability inside BigQuery that allows data scientists and analysts to build and deploy machine learning models on massive structured or semi-structured datasets. BigQuery ML is a set of simple SQL language extensions which enables users to utilize popular ML capabilities, performing predictive analytics like forecasting sales and creating customer segmentations right at the source, where they already store their data. BigQuery ML additionally sets smart defaults automatically and takes care of data transformation, leading to a seamless and easy to use experience with great results.
When designing the BigQuery ML backend, the team was faced with a dilemma. Transferring large amounts of data from BigQuery servers to special-purpose servers running machine learning algorithms would be time-consuming and would incur an overhead in terms of security and privacy considerations. However, because the core components of gradient descent — an optimization method that is the workhorse of machine learning algorithms — can be implemented using common SQL operations*, we were able to repurpose the existing BigQuery SQL processing engine for BigQuery ML.

Since the BigQuery engine is designed to efficiently scan large datasets rather than randomly draw small samples from them, BigQuery ML is based on the standard (batch) variant of gradient descent rather than the stochastic version. And while stochastic gradient descent is far more common in today’s large-scale machine learning systems, the batch variant has numerous practical advantages.

For example, in-database machine learning systems based on stochastic gradient descent process examples one by one, and can perform poorly when the data is suboptimally ordered. But BigQuery data is often distributed on disk so as to optimize the performance of regular SQL queries, and continually redistributing the data to support stochastic machine learning algorithms would be computationally expensive. In contrast, batch gradient descent is insensitive to the ordering and partitioning of data on disk, thereby completely circumventing this problem. Also, batch methods can be combined with line search techniques from the classical optimization literature, leading to a learning algorithm that is more stable and requires less fine tuning. Using line search with stochastic methods is far trickier. Our implementation also includes support for regularization and preconditioning. For more details, please see our paper.

We hope that you’ll find BigQuery ML useful for many predictive analytics tasks. To try it, visit the BigQuery console and follow the user guide. Creating a model is as simple as:
CREATE MODEL dataset.model_name
OPTIONS(model_type='linear_reg', input_label_cols=['input_label'])
AS SELECT * FROM input_table;
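Once the model is trained, generating predictions is just as simple with the ML.PREDICT function. Here is a minimal sketch, assuming a hypothetical table dataset.new_data with the same feature columns as input_table:
SELECT *
FROM ML.PREDICT(MODEL dataset.model_name,
  (SELECT * FROM dataset.new_data));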
In the future, we plan to further integrate our gradient descent implementation with BigQuery infrastructure to realize more performance gains. We’re also going to explore other machine learning algorithms that can be easily and efficiently implemented for large-scale problems by leveraging the power of BigQuery.

Acknowledgements
BigQuery ML is the result of a large collaboration across many teams at Google. Key contributors and sponsors include Hossein Ahmadi, Corinna Cortes, Grzegorz Czajkowski, Mingge Deng, Amir Hormati, Abhishek Kashyap, Jing Jing Long, Dan McClary, Chris Meyers, Girishkumar Sabhnani, Vivek Sharma, Jordan Tigani, Chad Verbowski, Jiaxun Wu and Lisa Yin.


* For example, a gradient vector can be computed using the SUM and GROUP BY operators, and the weights of a model can be updated using an INNER JOIN.
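To make that concrete, here is a rough sketch, not BigQuery ML's actual internal query, of how the gradient for a one-feature linear regression could be computed in plain SQL. It assumes hypothetical tables dataset.examples (columns x, y) and dataset.weights (a single row holding the current weight w); with more features you would store one row per example/feature pair, JOIN in the current weights, and GROUP BY the feature key, as the footnote above describes.

-- gradient of mean squared error for the model y ~ w * x, in one scan over the examples
-- (dataset.weights is assumed to contain exactly one row: the current value of w)
SELECT SUM(2 * (w.w * e.x - e.y) * e.x) / COUNT(*) AS grad_w
FROM dataset.examples AS e
CROSS JOIN dataset.weights AS w;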

Source: Google AI Blog


Hangouts Chat alerts & notifications… with asynchronous messages

Posted by Wesley Chun (@wescpy), Developer Advocate, G Suite

While most chatbots respond to user requests in a synchronous way, there are scenarios when bots don't perform actions based on an explicit user request, such as for alerts or notifications. In today's DevByte video, I'm going to show you how to send messages asynchronously to rooms or direct messages (DMs) in Hangouts Chat, the team collaboration and communication tool in G Suite.

What comes to mind when you think of a bot in a chat room? Perhaps a user wants the last quarter's European sales numbers, or maybe, they want to look up local weather or the next movie showtime. Assuming there's a bot for whatever the request is, a user will either send a direct message (DM) to that bot or @mention the bot from within a chat room. The bot then fields the request (sent to it by the Hangouts Chat service), performs any necessary magic, and responds back to the user in that "space," the generic nomenclature for a room or DM.

Our previous DevByte video for the Hangouts Chat bot framework shows developers what bots and the framework are all about as well as how to build one of these types of bots, in both Python and JavaScript. However, recognize that these bots are responding synchronously to a user request. This doesn't suffice when users want to be notified when a long-running background job has completed, when a late bus or train will be arriving soon, or when one of their servers has just gone down. Recognize that such alerts can come from a bot but also perhaps a monitoring application. In the latest episode of the G Suite Dev Show, learn how to integrate this functionality in either type of application.

From the video, you can see that alerts and notifications are "out-of-band" messages, meaning they can come in at any time. The Hangouts Chat bot framework provides several ways to send asynchronous messages to a room or DM, generically referred to as a "space." The first is the HTTP-based REST API. The other way is using what are known as "incoming webhooks."

The REST API is used by bots to send messages into a space. Since a bot will never be a human user, a Google service account is required. Once you create a service account for your Hangouts Chat bot in the developers console, you can download its credentials needed to communicate with the API. Below is a short Python sample snippet that uses the API to send a message asynchronously to a space.

from apiclient import discovery
from httplib2 import Http
from oauth2client.service_account import ServiceAccountCredentials

# authorize with the bot's service account credentials
SCOPES = 'https://www.googleapis.com/auth/chat.bot'
creds = ServiceAccountCredentials.from_json_keyfile_name(
    'svc_acct.json', SCOPES)

# build the Hangouts Chat API service endpoint
CHAT = discovery.build('chat', 'v1', http=creds.authorize(Http()))

# post a text message asynchronously to the room or DM ("space")
room = 'spaces/<ROOM-or-DM>'
message = {'text': 'Hello world!'}
CHAT.spaces().messages().create(parent=room, body=message).execute()

The alternative to using the API with service accounts is the concept of incoming webhooks. Webhooks are a quick and easy way to send messages into any room or DM without configuring a full bot, i.e., monitoring apps. Webhooks also allow you to integrate your custom workflows, such as when a new customer is added to the corporate CRM (customer relationship management system), as well as others mentioned above. Below is a Python snippet that uses an incoming webhook to communicate into a space asynchronously.

import json
import requests

# POST the JSON message payload to the space's incoming webhook URL
URL = 'https://chat.googleapis.com/...&thread_key=T12345'
message = {'text': 'Hello world!'}
requests.post(URL, data=json.dumps(message))

Since incoming webhooks are merely endpoints you HTTP POST to, you can even use curl to send a message to a Hangouts Chat space from the command-line:

curl \
-X POST \
-H 'Content-Type: application/json' \
'https://chat.googleapis.com/...&thread_key=T12345' \
-d '{"text": "Hello!"}'

To get started, take a look at the Hangouts Chat developer documentation, especially the specific pages linked to above. We hope this video helps you take your bot development skills to the next level by showing you how to send messages to the Hangouts Chat service asynchronously.

Operation Rosehub

Twelve months ago, a team of 50 Google employees used GitHub to patch the “Apache Commons Collections Deserialization Vulnerability” (or the “Mad Gadget vulnerability” as we call it) in thousands of open source projects. We recently learned why our efforts were so important.

The San Francisco Municipal Transportation Agency had their software systems encrypted and shut down by an avaricious hacker. The hacker used that very same vulnerability, according to reports of the incident. He demanded a Bitcoin ransom from the government. He threatened to leak the private data he stole from San Francisco’s citizens if his ransom wasn’t paid. This was an attack on our most critical public infrastructure; infrastructure which underpins the economy of a major US city.

Mad Gadget is one of the most pernicious vulnerabilities we’ve seen. By merely existing on the Java classpath, seven “gadget” classes in Apache Commons Collections (versions 3.0, 3.1, 3.2, 3.2.1, and 4.0) make object deserialization for the entire JVM process Turing complete with an exec function. Since many business applications use object deserialization to send messages across the network, it would be like hiring a bank teller who was trained to hand over all the money in the vault if asked to do so politely, and then entrusting that teller with the key. The only thing that would keep a bank safe in such a circumstance is that most people wouldn’t consider asking such a question.

The announcement of Mad Gadget triggered a Cambrian explosion of enterprise security disclosures. Oracle, Cisco, Red Hat, Jenkins, VMware, IBM, Intel, Adobe, HP and SolarWinds all formally disclosed that they had been impacted by this issue.

But unlike big businesses, open source projects don’t have people on staff to read security advisories all day and instead rely on volunteers to keep them informed. It wasn’t until five months later that a Google employee noticed several prominent open source libraries had not yet heard the bad news. Those projects were still depending on vulnerable versions of Collections. So back in March 2016, she started sending pull requests to those projects updating their code. This was easy to do and usually only required a single line change. With the help of GitHub’s GUI, any individual can make such changes to anyone’s codebase in under a minute. Given how relatively easy the changes seemed, she recruited more colleagues at Google to help the cause. As more work was completed, it was apparent that the problem was bigger than we had initially realized.

For instance, when patching projects like the Spring Framework, it was clear we weren’t just patching Spring but also patching every project that depended on Spring. We were furthermore patching all the projects that depended on those projects and so forth. But even once those users upgraded, they could still be impacted by other dependencies introducing the vulnerable version of Collections. To make matters worse, build systems like Maven cannot be relied upon to evict old versions.

This was when we realized the particularly viral nature of Mad Gadget. We came to the conclusion that, in order to improve the health of the global software ecosystem, the old version of Collections should be removed from as many codebases as possible.

We used BigQuery to assess the damage. It allowed us to write a SQL query with regular expressions that searched all the public code on GitHub in a couple minutes.

SELECT pop, repo_name, path
FROM (
  SELECT F.id as id, repo_name, path
  FROM (
    SELECT id, repo_name, path
    FROM [bigquery-public-data:github_repos.files]
    WHERE path LIKE '%pom.xml') AS F
  JOIN (
    SELECT id
    FROM (
      SELECT id, content
      FROM (
        SELECT id, content
        FROM [bigquery-public-data:github_repos.contents]
        WHERE NOT binary)
      WHERE content CONTAINS 'commons-collections<')
    WHERE content CONTAINS '>3.2.1<') AS C
  ON F.id = C.id) AS V
JOIN (
  SELECT difference.new_sha1 AS id,
         COUNT(repo_name) WITHIN RECORD AS pop
  FROM FLATTEN([bigquery-public-data:github_repos.commits], difference.new_sha1)) AS P
ON V.id = P.id
ORDER BY pop DESC;


We were alarmed when we discovered 2,600 unique open source projects that still directly referenced insecure versions of Collections. Internally at Google, we have a tool called Rosie that allows developers to make large scale changes to codebases owned by hundreds of different teams. But no such tool existed for GitHub. So we recruited even more engineers from around Google to patch the world’s code the hard way.

Ultimately, security rests within the hands of each developer. However, we felt that the severity of the vulnerability and its presence in thousands of open source projects were extenuating circumstances. We recognized that the industry best practices had failed. Action was needed to keep the open source community safe. So rather than simply posting a security advisory asking everyone to address the vulnerability, we formed a task force to update their code for them. That initiative was called Operation Rosehub.

Operation Rosehub was organized from the bottom-up on company-wide mailing lists. Employees volunteered and patches were sent out in a matter of weeks. There was no mandate from management to do this—yet management was supportive. They were happy to see employees spontaneously self-organizing to put their 20% time to good use. Some of those managers even participated themselves.

Patches were sent to many projects, avoiding threats to public security for years to come. However, we were only able to patch open source projects on GitHub that directly referenced vulnerable versions of Collections. Perhaps if the SF Muni software systems had been open source, we would have been able to bring Mad Gadget to their attention too.

Going forward, we believe the best thing to do is to build awareness. We want to draw attention to the fact that the tools now exist for fixing software on a massive scale, and that it works best when that software is open.

In this case, the open source dataset on BigQuery allowed us to identify projects that still needed to be patched. When a vulnerability is discovered, any motivated team or individual who wants to help improve the security of our infrastructure can use these tools to do just that.

By Justine Tunney, Software Engineer on TensorFlow

We’d like to recognize the following people for their contributions to Operation Rosehub: Laetitia Baudoin, Chris Blume, Sven Blumenstein, James Bogosian, Phil Bordelon, Andrew Brampton, Joshua Bruning, Sergio Campamá, Kasey Carrothers, Martin Cochran, Ian Flanigan, Frank Fort, Joshua French, Christian Gils, Christian Gruber, Erik Haugen, Andrew Heiderscheit, David Kernan, Glenn Lewis, Roberto Lublinerman, Stefano Maggiolo, Remigiusz Modrzejewski, Kristian Monsen, Will Morrison, Bharadwaj Parthasarathy, Shawn Pearce, Sebastian Porst, Rodrigo Queiro, Parth Shukla, Max Sills, Josh Simmons, Stephan Somogyi, Benjamin Specht, Ben Stewart, Pascal Terjan, Justine Tunney, Daniel Van Derveer, Shannon VanWagner, and Jennifer Winer.

Live from the Firebase Dev Summit in Berlin: Firebase, six months after I/O

Posted by Francis Ma, Firebase Product Manager

Originally posted to the Firebase blog

Our goal with Firebase is to help developers build better apps and grow them into successful businesses. Six months ago at Google I/O, we took our well-loved backend-as-a-service (BaaS) and expanded it to 15 features to make it Google’s unified app development platform, available across iOS, Android, and the web.

We launched many new features at Google I/O, but our work didn’t stop there. Since then, we’ve learned a lot from you (750,000+ projects created on Firebase to date!) about how you’re using our platform and how we can improve it. Thanks to your feedback, today we’re launching a number of enhancements to Crash Reporting, Analytics, support for game developers and more. For more information on our announcements, tune in to the livestream video from Firebase Dev Summit in Berlin. They’re also listed here:

Improve App Quality to Deliver Better User Experiences

Firebase Crash Reporting comes out of Beta and adds a new feature that helps you diagnose and reproduce app crashes.

Often the hardest part about fixing an issue is reproducing it, so we’ve added rich context to each crash to make the process simple. Firebase Crash Reporting now shows Firebase Analytics event data in the logs for each crash. This gives you clarity into the state of your app leading up to an error. Things like which screens of your app were visited are automatically logged with no instrumentation code required. Crash logs will also display any custom events and parameters you explicitly log using Firebase Analytics. Firebase Crash Reporting works for both iOS and Android apps.

Glide, a popular live video messaging app, relies on Firebase Crash Reporting to ensure user quality and release agility. “No matter how much effort you put into testing, it will never be as thorough as millions of active users in different locations, experiencing a variety of network conditions and real life situations. Firebase allows us to rapidly gain trust in our new version during phased release, as well as accelerate the process of identifying core issues and providing quick solutions.” - Roi Ginat, Founder, Glide.

Firebase Test Lab for Android supports more devices and introduces a free tier.

We want to help you deliver high-quality experiences, so testing your app before it goes into the wild is incredibly important. Firebase Test Lab allows you to easily test your app on many physical and virtual devices in the cloud, without writing a single line of test code. Beginning today, developers on the Spark service tier (which is free!) can run five tests per day on physical devices and ten tests per day on virtual devices—with no credit card setup required. We’ve also heard that you want more device options, so we’ve added 11 new popular Android device models to Test Lab, available today.

Illustration of Firebase Crash Reporting

Make Faster Data Driven Decisions with Firebase Analytics

Firebase Analytics now offers live reporting, a new integration with Google “Data Studio”, and real-time exporting to BigQuery.

We know that your data is most actionable when you can see and process it as quickly as possible. Therefore, we’re announcing a number of features to help you maximize the potential of your analytics events:

  1. Real-time uploading of conversion events
  2. Real-time exporting to BigQuery
  3. DebugView for validation of your analytics instrumentation
  4. StreamView, which will offer a live, dynamic view of your analytics data as we receive it

To further enhance your targeting options, we’ve improved the connection between Firebase Analytics and other Firebase features, such as Dynamic Links and Remote Config. For example, you can now use Dynamic Links on your Facebook business page, and we can identify Facebook as a source in Firebase Analytics reporting. As well, you can now target Remote Config changes by User Properties, in addition to Audiences.

Build Better Games using Firebase

Firebase now has a Unity plugin!

Game developers are building great apps, and we want Firebase to work for you, too. We’ve built an entirely new plugin for Unity that supports Analytics, the Realtime Database, Authentication, Dynamic Links, Remote Config, Notifications and more. We've also expanded our C++ SDK with Realtime Database support.

Integrate Firebase Even Easier with Open-Sourced UI Library

FirebaseUI is updated to v1.0.

FirebaseUI is a library that provides common UI elements when building apps, and it’s a quick way to integrate with Firebase. FirebaseUI 1.0 includes a drop-in UI flow for Firebase Authentication, with common identity providers such as Google, Facebook, and Twitter. FirebaseUI 1.0 also added features such as client-side joins and intersections for the Realtime Database, plus integrations with Glide and SDWebImage that make downloading and displaying images from Firebase Storage a cinch. Follow our progress or contribute to our Android, iOS, and Web components on GitHub.

Learn More via Udacity and Join the Firebase Community

We want to provide the best tool for developers, but it’s also important that we give resources and training to help you get more out of the platform. As such, we’ve created a new Udacity course: Firebase in a Weekend! It’s an instructor-led video course to help all developers get up and running with Firebase on iOS and Android, in two days.

Finally, to help wrap your head around all our announcements, we’ve created a new demo app. This is an easy way to see how Analytics, Crash Reporting, Test Lab, Notifications, and Remote Config work in a live environment, without having to write a line of code.

Helping developers build better apps and successful businesses is at the core of Firebase. We work hard on it every day. We love hearing your feedback and ideas for new features and improvements—and we hope you can see from the length of this post that we take them to heart! Follow us on Twitter, join our Slack channel, participate in our Google Group, and let us know what you think. We’re excited to see what you’ll build next!

Using BigQuery and Firebase Analytics to understand your mobile app

Originally posted on Google Cloud Platform Blog

At Google I/O this May, Firebase announced a new suite of products to help developers build mobile apps. Firebase Analytics, a part of the new Firebase platform, is a tool that automatically captures data on how people are using your iOS and Android app, and lets you define your own custom app events. When the data's captured, it’s available through a dashboard in the Firebase console. One of my favorite cloud integrations with the new Firebase platform is the ability to export raw data from Firebase Analytics to Google BigQuery for custom analysis. This custom analysis is particularly useful for aggregating data from the iOS and Android versions of your app, and accessing custom parameters passed in your Firebase Analytics events. Let’s take a look at what you can do with this powerful combination.

How does the BigQuery export work?


After linking your Firebase project to BigQuery, Firebase automatically exports a new table to an associated BigQuery dataset every day. If you have both iOS and Android versions of your app, Firebase exports the data for each platform into a separate dataset. Each table contains the user activity and demographic data automatically captured by Firebase Analytics, along with any custom events you’re capturing in your app. Thus, after exporting one week’s worth of data for a cross-platform app, your BigQuery project would contain two datasets, each with seven tables:
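For the sample datasets used throughout this post, that means an android_dataset and an ios_dataset, each holding daily tables app_events_20160601 through app_events_20160607. If you want to confirm what a dataset contains, you can list its daily tables directly; here's a quick sketch using the legacy SQL __TABLES__ meta-table (substitute your own project and dataset):

SELECT table_id, row_count
FROM [firebase-analytics-sample-data:ios_dataset.__TABLES__]
ORDER BY table_id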


Diving into the data


The schema for every Firebase Analytics export table is the same, and we’ve created two datasets (one for iOS and one for Android) with sample user data for you to run the example queries below. The datasets are for a sample cross-platform iOS and Android gaming app. Each dataset contains seven tables: one week’s worth of analytics data.

The following query will return some basic user demographic and device data for one day of usage on the iOS version of our app:

SELECT
user_dim.app_info.app_instance_id,
user_dim.device_info.device_category,
user_dim.device_info.user_default_language,
user_dim.device_info.platform_version,
user_dim.device_info.device_model,
user_dim.geo_info.country,
user_dim.geo_info.city,
user_dim.app_info.app_version,
user_dim.app_info.app_store,
user_dim.app_info.app_platform
FROM
[firebase-analytics-sample-data:ios_dataset.app_events_20160601]

Since the schema for every BigQuery table exported from Firebase Analytics is the same, you can run any of the queries in this post on your own Firebase Analytics data by replacing the dataset and table names with the ones for your project.

The schema has user data and event data. All user data is automatically captured by Firebase Analytics, and the event data is populated by any custom events you add to your app. Let’s take a look at the specific records for both user and event data.

User data


The user records contain a unique app instance ID for each user (user_dim.app_info.app_instance_id in the schema), along with data on their location, device and app version. In the Firebase console, there are separate dashboards for the app’s Android and iOS analytics. With BigQuery, we can run a query to find out where our users are accessing our app around the world across both platforms. The query below makes use of BigQuery’s union feature, which lets you use a comma as a UNION ALL operator. Since a row is created in our table for each bundle of events a user triggers, we use EXACT_COUNT_DISTINCT to make sure each user is only counted once:
SELECT
user_dim.geo_info.country as country,
EXACT_COUNT_DISTINCT( user_dim.app_info.app_instance_id ) as users
FROM
[firebase-analytics-sample-data:android_dataset.app_events_20160601],
[firebase-analytics-sample-data:ios_dataset.app_events_20160601]
GROUP BY
country
ORDER BY
users DESC

User data also includes a user_properties record, which includes attributes you define to describe different segments of your user base, like language preference or geographic location. Firebase Analytics captures some user properties by default, and you can create up to 25 of your own.

A user’s language preference is one of the default user properties. To see which languages our users speak across platforms, we can run the following query:

SELECT
user_dim.user_properties.value.value.string_value as language_code,
EXACT_COUNT_DISTINCT(user_dim.app_info.app_instance_id) as users
FROM
[firebase-analytics-sample-data:android_dataset.app_events_20160601],
[firebase-analytics-sample-data:ios_dataset.app_events_20160601]
WHERE
user_dim.user_properties.key = "language"
GROUP BY
language_code
ORDER BY
users DESC

Event data


Firebase Analytics makes it easy to log custom events such as tracking item purchases or button clicks in your app. When you log an event, you pass an event name and up to 25 parameters to Firebase Analytics and it automatically tracks the number of times the event has occurred. The following query shows the number of times each event in our app has occurred on Android for a particular day:

SELECT 
event_dim.name,
COUNT(event_dim.name) as event_count
FROM
[firebase-analytics-sample-data:android_dataset.app_events_20160601]
GROUP BY
event_dim.name
ORDER BY
event_count DESC

If you have another type of value associated with an event (like item prices), you can pass it through as an optional value parameter and filter by this value in BigQuery. In our sample tables, there is a spend_virtual_currency event. We can write the following query to see how much virtual currency players spend at one time:

SELECT 
event_dim.params.value.int_value as virtual_currency_amt,
COUNT(*) as num_times_spent
FROM
[firebase-analytics-sample-data:android_dataset.app_events_20160601]
WHERE
event_dim.name = "spend_virtual_currency"
AND
event_dim.params.key = "value"
GROUP BY
1
ORDER BY
num_times_spent DESC

Building complex queries


What if we want to run a query across both platforms of our app over a specific date range? Since Firebase Analytics data is split into tables for each day, we can do this using BigQuery’s TABLE_DATE_RANGE function. This query returns a count of the cities users are coming from over a one week period:

SELECT
user_dim.geo_info.city,
COUNT(user_dim.geo_info.city) as city_count
FROM
TABLE_DATE_RANGE([firebase-analytics-sample-data:android_dataset.app_events_], DATE_ADD('2016-06-07', -7, 'DAY'), CURRENT_TIMESTAMP()),
TABLE_DATE_RANGE([firebase-analytics-sample-data:ios_dataset.app_events_], DATE_ADD('2016-06-07', -7, 'DAY'), CURRENT_TIMESTAMP())
GROUP BY
user_dim.geo_info.city
ORDER BY
city_count DESC

We can also write a query to compare mobile vs. tablet usage across platforms over a one week period:

SELECT
user_dim.app_info.app_platform as appPlatform,
user_dim.device_info.device_category as deviceType,
COUNT(user_dim.device_info.device_category) AS device_type_count
FROM
TABLE_DATE_RANGE([firebase-analytics-sample-data:android_dataset.app_events_], DATE_ADD('2016-06-07', -7, 'DAY'), CURRENT_TIMESTAMP()),
TABLE_DATE_RANGE([firebase-analytics-sample-data:ios_dataset.app_events_], DATE_ADD('2016-06-07', -7, 'DAY'), CURRENT_TIMESTAMP())
GROUP BY
1,2
ORDER BY
device_type_count DESC

Getting a bit more complex, we can write a query to generate a report of unique user events across platforms over the past week. Here we use PARTITION BY and EXACT_COUNT_DISTINCT to de-dupe our event report by users, making use of user properties and the user_dim.user_id field:

SELECT
  STRFTIME_UTC_USEC(eventTime, "%Y%m%d") as date,
  appPlatform,
  eventName,
  COUNT(*) totalEvents,
  EXACT_COUNT_DISTINCT(IF(userId IS NOT NULL, userId, fullVisitorid)) as users
FROM (
  SELECT
    fullVisitorid,
    openTimestamp,
    FORMAT_UTC_USEC(openTimestamp) firstOpenedTime,
    userIdSet,
    MAX(userIdSet) OVER(PARTITION BY fullVisitorid) userId,
    appPlatform,
    eventTimestamp,
    FORMAT_UTC_USEC(eventTimestamp) as eventTime,
    eventName
  FROM FLATTEN(
    (
      SELECT
        user_dim.app_info.app_instance_id as fullVisitorid,
        user_dim.first_open_timestamp_micros as openTimestamp,
        user_dim.user_properties.value.value.string_value,
        IF(user_dim.user_properties.key = 'user_id',
           user_dim.user_properties.value.value.string_value, null) as userIdSet,
        user_dim.app_info.app_platform as appPlatform,
        event_dim.timestamp_micros as eventTimestamp,
        event_dim.name AS eventName,
        event_dim.params.key,
        event_dim.params.value.string_value
      FROM
        TABLE_DATE_RANGE([firebase-analytics-sample-data:android_dataset.app_events_], DATE_ADD('2016-06-07', -7, 'DAY'), CURRENT_TIMESTAMP()),
        TABLE_DATE_RANGE([firebase-analytics-sample-data:ios_dataset.app_events_], DATE_ADD('2016-06-07', -7, 'DAY'), CURRENT_TIMESTAMP())
    ), user_dim.user_properties)
)
GROUP BY
  date, appPlatform, eventName

If you have data in Google Analytics for the same app, it’s also possible to export your Google Analytics data to BigQuery and do a JOIN with your Firebase Analytics BigQuery tables.
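Here is a very rough sketch of what such a join could look like in legacy SQL. It assumes, hypothetically, that both exports carry the same user identifier: a user_id user property on the Firebase side and a field we'll call userId on the Google Analytics side (check your own GA export schema for where your user ID actually lives), with the GA export stored in a made-up my-project:ga_dataset:

SELECT fb.userId AS userId,
  COUNT(fb.eventName) AS firebase_events,
  MAX(ga.sessions) AS ga_sessions
FROM (
  SELECT
    user_dim.user_properties.value.value.string_value AS userId,
    event_dim.name AS eventName
  FROM [firebase-analytics-sample-data:android_dataset.app_events_20160601]
  WHERE user_dim.user_properties.key = "user_id"
) AS fb
JOIN (
  SELECT userId, COUNT(*) AS sessions
  FROM [my-project:ga_dataset.ga_sessions_20160601]
  GROUP BY userId
) AS ga
ON fb.userId = ga.userId
GROUP BY userId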


Visualizing analytics data


Now that we’ve gathered new insights from our mobile app data using the raw BigQuery export, let’s visualize it using Google Data Studio. Data Studio can read directly from BigQuery tables, and we can even pass it a custom query like the ones above. Data Studio can generate many different types of charts depending on the structure of your data, including time series, bar charts, pie charts and geo maps.

For our first visualization, let’s create a bar chart to compare the device types from which users are accessing our app on each platform. We can paste the mobile vs. tablet query above directly into Data Studio to generate the following chart:
From this chart, it’s easy to see that iOS users are much more likely to access our game from a tablet. Getting a bit more complex, we can use the above event report query to create a bar chart comparing the number of events across platforms:
Check out this post for detailed instructions on connecting your BigQuery project to Data Studio.

What’s next?

If you’re new to Firebase, get started here. If you’re already building a mobile app on Firebase, check out this detailed guide on linking your Firebase project to BigQuery. For questions, take a look at the BigQuery reference docs and use the firebase-analytics and google-bigquery tags on Stack Overflow. And let me know if there are any particular topics you’d like me to cover in an upcoming post.

GitHub on BigQuery: Analyze all the code



Google, in collaboration with GitHub, is releasing an incredible new open dataset on Google BigQuery. So far you've been able to monitor and analyze GitHub's pulse since 2011 (thanks GitHub Archive project!) and today we're adding the perfect complement to this. What could you do if you had access to analyze all the open source software in the world, with just one SQL command?

The Google BigQuery Public Datasets program now offers a full snapshot of the content of more than 2.8 million open source GitHub repositories in BigQuery. Thanks to our new collaboration with GitHub, you'll have access to analyze the source code of almost 2 billion files with a simple (or complex) SQL query. This will open the doors to all kinds of new insights and advances that we're just beginning to envision.

For example, let's say you're the author of a popular open source library. Now you'll be able to find every open source project on GitHub that's using it. Even more, you'll be able to guide the future of your project by analyzing how it's being used, and improve your APIs based on what your users are actually doing with it.
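As a rough illustration, a query along these lines in legacy SQL might look like the following (the import string and file extension are placeholders for whatever identifies your library; at this scale you may also need legacy SQL's JOIN EACH and GROUP EACH BY variants):

SELECT F.repo_name, COUNT(*) AS matching_files
FROM [bigquery-public-data:github_repos.files] AS F
JOIN (
  SELECT id
  FROM [bigquery-public-data:github_repos.contents]
  WHERE NOT binary AND content CONTAINS 'import com.example.mylibrary'
) AS C
ON F.id = C.id
WHERE F.path LIKE '%.java'
GROUP BY F.repo_name
ORDER BY matching_files DESC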

On the security side, we've seen how the most popular open source projects benefit from having multiple eyes and hands working on them. This visibility helps projects get hardened and buggy code cleaned up. What if you could search for errors with similar patterns in every other open source project? Would you notify their authors and send them pull requests? Well, now you can. Some concepts to keep in mind while working with BigQuery and the GitHub contents dataset:
To learn more, read GitHub's announcement and try some sample queries. Share your queries and findings in our reddit.com/r/bigquery and Hacker News posts. The ideas are endless, and I'll start collecting tips and links to other articles on this post on Medium.

Stay curious!

Build a mobile gaming analytics platform

Popular mobile games can attract millions of players and generate terabytes of game-related data in a short burst of time. This places extraordinary pressure on the infrastructure powering these games and requires scalable data analytics services to provide timely, actionable insights in a cost-effective way.

To address these needs, a growing number of successful gaming companies use Google’s web-scale analytics services to create personalized experiences for their players. They use telemetry and smart instrumentation to gain insight into how players engage with the game and to answer questions like: At what game level are players stuck? What virtual goods did they buy? And what's the best way to tailor the game to appeal to both casual and hardcore players?


A new reference architecture describes how you can collect, archive and analyze vast amounts of gaming telemetry data using Google Cloud Platform’s data analytics products. The architecture demonstrates two patterns for analyzing mobile game events:

  • Batch processing: This pattern helps you process game logs and other large files in a fast, parallelized manner. For example, leading mobile gaming company DeNA moved to BigQuery from Hadoop to get faster query responses for their log file analytics pipeline. In this GDC Lightning Talk video they explain the speed benefits of Google’s analytics tools and how the team was able to process large gaming datasets without the need to manage any infrastructure.
  • Real-time processing: Use this pattern when you want to understand what's happening in the game right now. Cloud Pub/Sub and Cloud Dataflow provide a fully managed way to perform a number of data-processing tasks like data cleansing and fraud detection in real-time. For example, you can highlight a player with maximum hit-points outside the valid range. Real-time processing is also a great way to continuously update dashboards of key game metrics, like how many active users are currently logged in or which in-game items are most popular.

Some Cloud Dataflow features are especially useful in a mobile context since messages may be delayed from the source due to mobile Internet connection issues or batteries running out. Cloud Dataflow's built-in session windowing functionality and triggers aggregate events based on the actual time they occurred (event time) as opposed to the time they're processed so that you can still group events together by user session even if there's a delay from the source.

But why choose between one or the other pattern? A key benefit of this architecture is that you can write your data pipeline processing once and execute it in either batch or streaming mode without modifying your codebase. So if you start processing your logs in batch mode, you can easily move to real-time processing in the future. This is an advantage of the high-level Cloud Dataflow model that was released as open source by Google.



Cloud Dataflow loads the processed data into one or more BigQuery tables. BigQuery is built for very large scale, and allows you to run aggregation queries against petabyte-scale datasets with fast response times. This is great for interactive analysis and data exploration, like the example screenshot above, where a simple BigQuery SQL query dynamically creates a Daily Active Users (DAU) graph using Google Cloud Datalab.
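As an illustration only, a DAU query of that sort against a hypothetical game_analytics.game_events table (with an event timestamp and a player ID; not the schema from the reference architecture) might look like:

SELECT DATE(event_time) AS day,
  COUNT(DISTINCT player_id) AS daily_active_users
FROM [my-project:game_analytics.game_events]
GROUP BY day
ORDER BY day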


And what about player engagement and in-game dynamics? The BigQuery example above shows a bar chart of the ten toughest game bosses. It looks like boss10 killed players more than 75% of the time, much more than the next toughest. Perhaps it would make sense to lower the strength of this boss? Or maybe give the player some more powerful weapons? The choice is yours, but with this reference architecture you'll see the results of your changes straight away. Review the new reference architecture to jumpstart your data-driven quest to engage your players and make your games more successful, contact us, or sign up for a free trial of Google Cloud Platform to get started.

Further Reading and Additional Resources


- Posted by Oyvind Roti, Solutions Architect