
Introducing TensorFlow Recorder

When training computer vision machine learning models, data loading can often be a performance bottleneck, causing your GPU or TPU resources to be underutilized while waiting for data to be loaded into the model. Storing your dataset in the efficient TensorFlow Record (TFRecord) format is a great way to solve these problems, but creating TFRecords can unfortunately often require a great deal of complex code.

Last week we open sourced the TensorFlow Recorder project (also known as TFRecorder), which makes it possible for data scientists, data engineers, and AI/ML engineers to create image-based TFRecords with just a few lines of code. TFRecords are essential for building efficient TensorFlow ML pipelines, but until now they haven’t been easy to create. Before TFRecorder, creating TFRecords at scale meant writing a data pipeline that parsed your structured data, loaded images from storage, and serialized the results into the TFRecord format. TFRecorder lets you write TFRecords directly from a Pandas dataframe or CSV without writing any complicated code.

You can see an example of TFRecorder below, but first let’s talk about some of the specific advantages of TFRecords.

How TFRecords Can Help

The TFRecord file format stores your data in a set of files, each containing a sequence of protocol buffers serialized as binary records that can be read very efficiently. This helps reduce the data-loading bottleneck mentioned above.

Data loading performance can be further improved by implementing prefetching and parallel interleave along with the TFRecord format. Prefetching reduces the time of each training step by fetching the data for the next step while the model is still training on the current one. Parallel interleave lets you read from multiple TFRecord shards (pieces of a TFRecord file) and apply preprocessing to those interleaved data streams. This reduces the latency of reading a training batch and is especially helpful when reading data over the network.
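As a sketch of those two techniques with the tf.data API (the shard paths and record contents below are toy placeholders):

```python
import tensorflow as tf

# Write two tiny TFRecord shards so the input pipeline below has data to read.
paths = ["/tmp/shard-0.tfrecord", "/tmp/shard-1.tfrecord"]
for p in paths:
    with tf.io.TFRecordWriter(p) as w:
        for i in range(3):
            w.write(f"{p}:{i}".encode())  # toy records; real ones hold tf.train.Examples

files = tf.data.Dataset.from_tensor_slices(paths)
ds = (
    files.interleave(                      # read shards concurrently
        tf.data.TFRecordDataset,
        cycle_length=2,
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    .batch(2)
    .prefetch(tf.data.AUTOTUNE)            # load the next batch during the current step
)

records = [r for batch in ds for r in batch.numpy()]
print(len(records))  # 6
```

Letting `tf.data.AUTOTUNE` pick the parallelism means the runtime tunes the number of concurrent reads to the available hardware.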

Using TensorFlow Recorder

Creating a TFRecord using TFRecorder requires only a few lines of code. Here’s how it works:
import pandas as pd
import tfrecorder
df = pd.read_csv(...)
df.tensorflow.to_tfr(output_dir='gs://my/bucket')

TFRecorder currently expects data to be in the same format as Google AutoML Vision: a pandas dataframe or CSV with three columns, split, image_uri, and label, where:

  • split can take the values TRAIN, VALIDATION, or TEST.
  • image_uri specifies a local or Google Cloud Storage location for the image file.
  • label can be either a text label (which will be integerized) or an integer.
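For instance, a dataframe in this layout might look like the following (the bucket and file names are hypothetical):

```python
import pandas as pd

# Toy dataframe in the expected three-column layout.
df = pd.DataFrame({
    "split": ["TRAIN", "VALIDATION", "TEST"],
    "image_uri": [
        "gs://my-bucket/images/cat1.jpg",
        "gs://my-bucket/images/cat2.jpg",
        "gs://my-bucket/images/dog1.jpg",
    ],
    "label": ["cat", "cat", "dog"],
})
print(df.columns.tolist())  # ['split', 'image_uri', 'label']
```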
In the future, we hope to extend TensorFlow Recorder to work with data in any format.

While this example would work well to convert a few thousand images into TFRecords, it probably wouldn’t scale well if you have millions of images. To scale up to huge datasets, TensorFlow Recorder provides connectivity with Google Cloud Dataflow, which is a serverless Apache Beam pipeline runner. Scaling up to Dataflow requires only a little more configuration.

What’s next?

We’d love for you to try out TensorFlow Recorder. You can get it from GitHub or simply pip install tfrecorder. TensorFlow Recorder is very new, and we’d greatly appreciate your feedback, suggestions, and pull requests.

By Mike Bernico and Carlos Ezequiel, Google Cloud AI Engineers

Kpt: Packaging up your Kubernetes configuration with git and YAML since 2014

Kubernetes configuration manifests have become an industry standard for deploying both custom and off-the-shelf applications (as well as for infrastructure). Manifests are combined into bundles to create higher-level deployable systems as well as reusable blueprints (such as a product offering, off the shelf software, or customizable starting point for a new application).

However, most teams lack the expertise or desire to create bespoke bundles of configuration from scratch and instead either: 1) fork them from another bundle, or 2) use a packaging solution that generates manifests from code.

Teams quickly discover they need to customize, validate, audit, and re-publish their forked or generated bundles for their environment. Most packaging solutions to date are tightly coupled to a format written as code (e.g., templates or DSLs). This introduces a number of challenges when trying to extend, build on top of, or integrate them with other systems. For example, how does one update a forked template from upstream, or apply custom validation?

Packaging is the foundation of building reusable components, but it also incurs a productivity tax on the users of those components.

Today we’d like to introduce kpt, an OSS tool for Kubernetes packaging, which uses a standard format to bundle, publish, customize, update, and apply configuration manifests.

Kpt is built around an “as data” architecture: it bundles Kubernetes resource configuration, a format readable by both humans and machines. The ability for tools to read and write the package contents using standardized data structures enables powerful new capabilities:
  • Any existing directory in a Git repo with configuration files can be used as a kpt package.
  • Packages can be arbitrarily customized and later pull in updates from upstream by merging them.
  • Tools and automation can perform high-level operations by transforming and validating package data on behalf of users or systems.
  • Organizations can develop their own tools and automation which operate against the package data.
  • Existing tools and automation that work with resource configuration “just work” with kpt.
  • Existing solutions that generate configuration (e.g. from templates or DSLs) can emit kpt packages which enable the above capabilities for them.

Example workflow with kpt

Now that we’ve established the benefits of using kpt for managing your packages of Kubernetes config, let’s walk through how an enterprise might leverage kpt to package, share and use their best practices for Kubernetes across the organization.

First, a team within the organization may build and contribute to a repository of best practices for managing a certain type of application, for example a microservice (called “app”). As best practices are developed within an organization, downstream teams will want to consume and modify configuration blueprints based on them. These blueprints provide a blessed starting point which adheres to organization policies and conventions.

The downstream team gets its own copy of a package by downloading it to their local filesystem using kpt pkg get. This clones the git subdirectory, recording upstream metadata so that the package can be updated later.

They may decide to update the number of replicas to fit their scaling requirements or may need to alter part of the image field to be the image name for their app. They can directly modify the configuration using a text editor (as would be done before). Alternatively, the package may define setters, allowing fields to be set programmatically using kpt cfg set. Setters streamline workflows by providing user and automation friendly commands to perform common operations.
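As an illustration, a field exposed through a setter might look like this inside the package (a sketch following kpt’s setter comment convention; the resource fields and setter names are hypothetical):

```yaml
# deployment.yaml inside the downloaded package (values illustrative)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 3 # {"$kpt-set":"replicas"}
  template:
    spec:
      containers:
        - name: app
          image: gcr.io/example/app:v1.0 # {"$kpt-set":"image"}
```

A downstream team could then run kpt cfg set . replicas 5 to update the field without hand-editing the YAML.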

Once the modifications have been made to the local filesystem, the team will commit and push their package to an app repository owned by them. From there, a CI/CD pipeline will kick off and the deployment process will begin. As a final customization before the package is deployed to the cluster, the CI/CD pipeline will inject the digest of the image it just built into the image field (using kpt cfg set). When the image digest has been set, the CI/CD pipeline can send the manifests to the cluster using kpt live apply. Kpt live operates like kubectl apply, providing additional functionality to prune resources deleted from the configuration and block on rollout completion (reporting status of the rollout back to the user).

Now that we’ve walked through how you might use kpt in your organization, we’d love it if you’d try it out, read the docs, or contribute.

One more thing

There’s still a lot to the story we didn’t cover here. Expect to hear more from us about:
  • Using kpt with GitOps
  • Building custom logic with functions
  • Writing effective blueprints with kpt and kustomize
By Phillip Wittrock, Software Engineer and Vic Iglesias, Cloud Solutions Architect

Importing SA360 WebQuery reports to BigQuery


Search Ads 360 (SA360) is an enterprise-class search campaign management platform used by marketers to manage global ad campaigns across multiple engines. It offers powerful reporting capability through WebQuery reports, API, BigQuery and Datastudio connectors.

Effective ad campaign management requires multi-dimensional analysis of campaign data along with customers’ first-party data, building custom reports whose dimensions combine paid-search reports and business data.

Customers’ business data resides in a data warehouse designed for analysis, insights and reporting. To integrate ads data into the data warehouse, the usual approach is to load the campaign data into the warehouse. SA360 offers several options for retrieving paid-search data, each providing unique capabilities.

Comparing WebQuery, the BQ Connector, the Datastudio Connector, and the API:

  • Technical complexity: Low for WebQuery.
  • Ease of report customization: High for WebQuery.
  • Reporting details: Complete for WebQuery; Limited for the BQ Connector, since reports not supported by the API (Location targets, Remarketing targets, Audience reports) are not available.
  • Possible data warehouse: Any for WebQuery (the report is generic and needs to be loaded into the data warehouse using the DW’s custom loading methods); BigQuery ONLY for the BQ Connector; None for the Datastudio Connector; Any for the API.
Comparing these approaches in terms of the technical knowledge required and the supported data-warehousing solutions, the easiest is the WebQuery report: a marketer can build one by choosing the dimensions and metrics they want in the SA360 user interface.

The BigQuery data-transfer service is limited to importing data into BigQuery, and the Datastudio connector does not allow retrieving data.

WebQuery offers a simpler and more customizable method than the alternatives, and supports more kinds of data (unlike the BQ transfer service, which does not bring business data from SA360 to BigQuery). It was originally designed to give Microsoft Excel an updatable view of a report. In the era of cloud computing, what was missing was a tool to consume the report and make it available on an analytical platform or a cloud data warehouse like BigQuery.

Solution Approach

This tool bridges the gap of bringing SA360 data to a data warehouse in a generic fashion: the report is fetched from SA360 in XML format and converted into a CSV file using SAX parsers. The CSV file is then transferred to staging storage and finally ETLed into the data warehouse.
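A minimal sketch of that conversion step with a streaming SAX parser, assuming a toy report layout of row elements with one child element per column (the real SA360 report schema differs):

```python
import csv
import io
import xml.sax


class ReportHandler(xml.sax.ContentHandler):
    """Stream <row><col>value</col>...</row> records into CSV rows."""

    def __init__(self, out):
        super().__init__()
        self._writer = csv.writer(out)
        self._row = None
        self._field = None
        self._buf = []
        self._header_written = False

    def startElement(self, name, attrs):
        if name == "row":
            self._row = {}
        elif self._row is not None:
            self._field = name
            self._buf = []

    def characters(self, content):
        if self._field is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == self._field:
            self._row[self._field] = "".join(self._buf).strip()
            self._field = None
        elif name == "row":
            if not self._header_written:       # emit header from the first row
                self._writer.writerow(self._row.keys())
                self._header_written = True
            self._writer.writerow(self._row.values())
            self._row = None


report = """<report>
  <row><campaign>Spring Sale</campaign><clicks>120</clicks></row>
  <row><campaign>Brand</campaign><clicks>87</clicks></row>
</report>"""

out = io.StringIO()
xml.sax.parseString(report.encode(), ReportHandler(out))
print(out.getvalue())
```

Because SAX processes the document as a stream of events rather than loading it whole, the same approach handles arbitrarily large reports in constant memory.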

As a concrete example, we chose to showcase a solution with BigQuery as the destination (cloud) data warehouse, though the solution architecture is flexible for any other system.


The tool helps marketers bring advertising data closer to their analytical systems, helping them derive better insights. If you use BigQuery as your data warehouse, you can use this tool as-is. You can also adapt it by adding components for the analytical/data-warehousing systems you use and improve it for the larger community.

To get started, follow our step-by-step guide.
Notable features of the tool:
  • Modular authorization module
  • Handles arbitrarily large WebQuery reports
  • Batch mode to process multiple reports in a single call
  • Usable as part of an ETL workflow (Airflow compatible)
By Anant Damle, Solutions Architect and Meera Youn, Technical Partnership Lead

Audience Insights Series: A framework for success

This is our final post in a series exploring the value of audience insights in search marketing. Over the past few weeks, we heard from experts and leaders in the industry on the opportunity, predictions, and insights on the topic. With our final post today, we would like to explore the path to success when applying audience insights in your own campaigns.

Additional insights about your audience, such as location, time of day, and how they’ve engaged with you in the past, can help you better understand the intent of your audience so you can serve the most relevant message.

But more information can also mean more complexity. So to help you effectively navigate and leverage audience insights in your campaigns, we’ve developed a 3-step framework for success: Gather, Target, Engage. The infographic below captures the steps in more detail, along with case studies of advertisers who have applied them to their campaigns.

Click here to download the infographic
1. Gather insights that matter: This step is about identifying relevant signals to leverage in your campaigns, which is essential for developing insights on who the audience is, what context they are in, and what their interests may be. Here are some examples of the types of signals you can identify:

  • Who: The user’s relationship with you, including whether they have previously visited your site or made a purchase
  • What: Time, location, and device used
  • Interests: Interests in specific categories based on consumed content

2. Target based on discovered insights: The next step is to combine these signals and, based on them, create segments you can target. Below are examples of segments you could create if you were selling laptops:

  • “Close to store”, based on device and location signals
  • “Android users”, who may be more inclined to purchase a Chromebook
  • “Interested in bags”: a user who has bought a laptop through your website might now need a laptop case rather than a laptop
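To make the combination step concrete, here is a toy sketch of mapping raw signals to segments (the signal names, thresholds, and segment labels are invented for illustration):

```python
def segments(user):
    """Map raw signals to audience segments (illustrative rules only)."""
    segs = []
    # Location + device signals suggest the user is near a physical store.
    if user.get("distance_km", 99) < 2 and user.get("device") == "mobile":
        segs.append("close-to-store")
    # Platform signal: Android users may lean toward a Chromebook.
    if user.get("os") == "Android":
        segs.append("android-users")
    # Purchase-history signal: laptop buyers may now want accessories.
    if "laptop" in user.get("purchases", []):
        segs.append("interested-in-bags")
    return segs


print(segments({"device": "mobile", "distance_km": 1.2, "os": "Android",
                "purchases": ["laptop"]}))
# ['close-to-store', 'android-users', 'interested-in-bags']
```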

3. Engage your audience with a tailored ad: This final step is about delivering your audience a tailored message. Messaging can be optimized for each segment with A/B testing. By measuring results post-engagement, you can reassess if there are new signals to gather, ultimately coming back to the first step in the cycle.

Advertiser success with Audience-driven planning
Specsavers is a good example of an advertiser who applied this framework, matching their ad copy with location-specific segments. The strategy helped drive a 189% increase in their key metric, conversions. To find out more, you can see other case studies in the infographic or explore previous posts in our series. You can also hear about upcoming developments in your inbox by signing up for our newsletter.

    Audience Insights Series: Getting started

    Over the past couple of weeks, we heard from experts and leaders in the industry on the opportunity of applying audience insights to their search marketing efforts, and predictions on how this trend will impact the industry moving forward.

    This week, we’re diving deeper and sharing best practices and first steps advertisers can take to make the most of the audience opportunity.

    When we sat down with Ben Wood from iProspect, Khurram Hamid of GlaxoSmithKline and Steve Chester from the IAB UK, we heard a resounding message: Don’t hesitate, start today. The sooner advertisers start testing and experimenting with audience targeting in their search campaigns, the sooner they’ll tap into valuable insights to tailor their campaigns for their audience.

    As one expert said, “Today, because it’s nascent, it’s those brands right at the cutting edge that are really leaning into this... but in six months’ time, in 12 months’ time... this isn’t something you can’t be doing.”

    And as a first step, we heard how investing in a data management strategy is key. Watch the video for more:

    This will be our last post featuring perspectives from industry leaders, but the journey doesn’t stop here. With our next post in the series, we’ll explore specifics around how advertisers can approach planning their search marketing strategy with a focus on leveraging audience insights. Stay tuned!

    Audience Insights Series: What the future holds

    This is the second post in our series exploring the convergence of audience data and search marketing. In our last post, we heard from industry leaders on the opportunity and how audience data helps them deliver even more relevant and resonant messages.

    This week, we explore what the future holds. iProspect’s Ben Wood, Havas Media’s Paul Frampton and the IAB’s Steve Chester share perspectives on the continued convergence of audience data and search marketing, implications for digital marketing teams and how they work together, and how audience data in search will help bridge the gap between branding and direct response.

    Look for our next post in the series, where we will explore best practices for advertisers who are looking to embrace audience data as part of their search marketing efforts.

    Audience Insights Series: What is the opportunity?

    As part of our series exploring the value of audience signals in search marketing, we went behind the scenes at leading agencies and marketers and asked industry experts to share their views on what the opportunity is.

    Here’s what we heard: according to these industry leaders, audience insights enable advertisers to go beyond simple keywords and use other signals to inform their search marketing. They can make smarter bidding decisions, but more than that, they can improve the message they’re presenting to their audience, making their search ads even more relevant and compelling. And of course, as search strategies become sharper, ads perform better.

    To see the latest from the front lines, watch our video featuring Martin McNulty of Forward3D, Ben Wood from iProspect, Paul Frampton of Havas Media, Steve Chester from the Internet Advertising Bureau and Khurram Hamid from GlaxoSmithKline.

    Hope you enjoy the video above; we will continue the series next week with a post on our partners’ views on how audience signals may impact search marketing in the future.

    Plan your digital afterlife with Inactive Account Manager

    Not many of us like thinking about death — especially our own. But making plans for what happens after you’re gone is really important for the people you leave behind. So today, we’re launching a new feature that makes it easy to tell Google what you want done with your digital assets when you die or can no longer use your account.

    The feature is called Inactive Account Manager — not a great name, we know — and you’ll find it on your Google Account settings page. You can tell us what to do with your Gmail messages and data from several other Google services if your account becomes inactive for any reason.

    For example, you can choose to have your data deleted — after three, six, nine or 12 months of inactivity. Or you can select trusted contacts to receive data from some or all of the following services: +1s; Blogger; Contacts and Circles; Drive; Gmail; Google+ Profiles, Pages and Streams; Picasa Web Albums; Google Voice and YouTube. Before our systems take any action, we’ll first warn you by sending a text message to your cellphone and email to the secondary address you’ve provided.

    We hope that this new feature will enable you to plan your digital afterlife — in a way that protects your privacy and security — and make life easier for your loved ones after you’re gone.


    Posted by Andreas Tuerk, Product Manager

    Source: Data Liberation

    A perfect match: Blogger and Google+ Pages for Takeout

    You: A Blogger or Google+ Page owner who dreams of controlling their data.
    Us: A band of engineers who will stop at nothing to make your dreams come true.

    Meet us at https://www.google.com/takeout, and together we will export each of your blogs as an Atom XML file. Or, if you’ve enjoyed exporting data from your Google+ Stream and Google+ Circles through Takeout in the past, but are looking for something more, join us now and download HTML files with your posts and JSON files containing the circles for each Google+ Page you own. If you don’t want to rush into things, we can also just export a single blog or page of your choice. Either way, give us a try. Life will never be the same.

    Posted by Kári Ragnarsson, The Data Liberation Front

    Source: Data Liberation

    Be picky with your Takeout

    Starting today, you'll have a couple of new features to make it even easier to download your data.

    First, your original folder hierarchy is now maintained if you export files from Google Drive. Gone are the days of looking at the contents of your zip file and wondering which "secret_plans" file is which.

    Your folder hierarchy is preserved.
    Second, you can now pick a single resource within a service to download - for instance, a single Picasa album or top-level folder from Drive - instead of exporting every single file. To try it out, go to the "Choose services" tab and click on "Configure..." once you've added a service that supports this.
    Want to download only your nefarious plans and all of your pictures of cats? We've got you covered!
    These are just a few things that we've been working on lately. Stay tuned for lots of excitement in 2013!

    Posted by Nick Piepmeier, The Data Liberation Front

    Source: Data Liberation