Tag Archives: Google Cloud Platform

Kotlin Momentum for Android and Beyond

Posted by James Lau (@jmslau), Product Manager

Today marks the beginning of KotlinConf 2018 - the largest in-person gathering of the Kotlin community annually. 2018 has been a big year for Kotlin, as the language continues to gain adoption and earn the love of developers. In fact, 27% of the top 1000 Android apps on Google Play already use Kotlin. More importantly, Android developers are loving the language with over 97% satisfaction in our most recent survey. It's no surprise that Kotlin was voted as the #2 most-loved language in the 2018 StackOverflow survey.

Google supports Kotlin as a first-class programming language for Android development. In the past 12 months, we have delivered a number of important improvements to the Kotlin developer experience. This includes the Kotlin-friendly SDK, Android KTX, new Lint checks and various Kotlin support improvements in Android Studio. We have also launched Kotlin support in our official documentation, new flagship samples in Kotlin, a new Kotlin Bootcamp Udacity course, #31DaysOfKotlin and other deep dive content. We are committed to continuing to improve the Kotlin developer experience.

As the language continues to advance, more developers are discovering the benefits of Kotlin across the globe. Recently, we traveled to India and worked with local developers like Zomato to better understand how adopting Kotlin has benefited their Android development. Zomato is a leading restaurant search & discovery service that operates in 24 countries, with over 150 million monthly users. Kotlin helped Zomato reduce the number of lines of code in their app significantly, and it has also helped them find important defects in their app at compile time. You can watch their Kotlin adoption story in the video below.

Android Developer Story: Zomato uses Kotlin to write safer, more concise code.

Going beyond Android, we are happy to announce that the Google Cloud Platform team is launching a dedicated Kotlin portal today. This will help developers more easily find resources related to Kotlin on Google Cloud. We want to make it as easy as possible for you to use Kotlin, whether it's on mobile or in the Cloud.

Google Cloud Platform's Kotlin Homepage

Adopting a new language is a major decision for most companies, and you need to be confident that the language you choose will have a bright future. That's why Google has joined forces with JetBrains and established the Kotlin Foundation. The Foundation will ensure that Kotlin continues to advance rapidly, remain free and stay open. You can learn more about the Kotlin Foundation here.

It's an exciting time to be a Kotlin developer. If you haven't tried Kotlin yet, we encourage you to join this growing global community. You can get started by visiting kotlinlang.org or the Android Developer Kotlin page.

Kotlin Momentum for Android and Beyond

Posted by James Lau (@jmslau), Product Manager

Today marks the beginning of KotlinConf 2018 - the largest in-person gathering of the Kotlin community annually. 2018 has been a big year for Kotlin, as the language continues to gain adoption and earn the love of developers. In fact, 27% of the top 1000 Android apps on Google Play already use Kotlin. More importantly, Android developers are loving the language with over 97% satisfaction in our most recent survey. It's no surprise that Kotlin was voted as the #2 most-loved language in the 2018 StackOverflow survey.

Google supports Kotlin as a first-class programming language for Android development. In the past 12 months, we have delivered a number of important improvements to the Kotlin developer experience. This includes the Kotlin-friendly SDK, Android KTX, new Lint checks and various Kotlin support improvements in Android Studio. We have also launched Kotlin support in our official documentation, new flagship samples in Kotlin, a new Kotlin Bootcamp Udacity course, #31DaysOfKotlin and other deep dive content. We are committed to continuing to improve the Kotlin developer experience.

As the language continues to advance, more developers are discovering the benefits of Kotlin across the globe. Recently, we traveled to India and worked with local developers like Zomato to better understand how adopting Kotlin has benefited their Android development. Zomato is a leading restaurant search & discovery service that operates in 24 countries, with over 150 million monthly users. Kotlin helped Zomato reduce the number of lines of code in their app significantly, and it has also helped them find important defects in their app at compile time. You can watch their Kotlin adoption story in the video below.

Android Developer Story: Zomato uses Kotlin to write safer, more concise code.

Going beyond Android, we are happy to announce that the Google Cloud Platform team is launching a dedicated Kotlin portal today. This will help developers more easily find resources related to Kotlin on Google Cloud. We want to make it as easy as possible for you to use Kotlin, whether it's on mobile or in the Cloud.

Google Cloud Platform's Kotlin Homepage

Adopting a new language is a major decision for most companies, and you need to be confident that the language you choose will have a bright future. That's why Google has joined forces with JetBrains and established the Kotlin Foundation. The Foundation will ensure that Kotlin continues to advance rapidly, remain free and stay open. You can learn more about the Kotlin Foundation here.

It's an exciting time to be a Kotlin developer. If you haven't tried Kotlin yet, we encourage you to join this growing global community. You can get started by visiting kotlinlang.org or the Android Developer Kotlin page.

Code that final mile: from big data analysis to slide presentation

Posted by Wesley Chun (@wescpy), Developer Advocate, Google Cloud

Google Cloud Platform (GCP) provides infrastructure, serverless products, and APIs that help you build, innovate, and scale. G Suite provides a collection of productivity tools, developer APIs, extensibility frameworks and low-code platforms that let you integrate with G Suite applications, data, and users. While each solution is compelling on its own, users can get more power and flexibility by leveraging both together.

In the latest episode of the G Suite Dev Show, I'll show you one example of how you can take advantage of powerful GCP tools right from G Suite applications. BigQuery, for example, can help you surface valuable insight from massive amounts of data. However, regardless of "the tech" you use, you still have to justify and present your findings to management, right? You've already completed the big data analysis part, so why not go that final mile and tap into G Suite for its strengths? In the sample app covered in the video, we show you how to go from big data analysis all the way to an "exec-ready" presentation.

The sample application is meant to give you an idea of what's possible. While the video walks through the code a bit more, let's give all of you a high-level overview here. Google Apps Script is a G Suite serverless development platform that provides straightforward access to G Suite APIs as well as some GCP tools such as BigQuery. The first part of our app, the runQuery() function, issues a query to BigQuery from Apps Script then connects to Google Sheets to store the results into a new Sheet (note we left out CONSTANT variable definitions for brevity):

function runQuery() {
// make BigQuery request
var request = {query: BQ_QUERY};
var queryResults = BigQuery.Jobs.query(request, PROJECT_ID);
var jobId = queryResults.jobReference.jobId;
queryResults = BigQuery.Jobs.getQueryResults(PROJECT_ID, jobId);
var rows = queryResults.rows;

// put results into a 2D array
var data = new Array(rows.length);
for (var i = 0; i < rows.length; i++) {
var cols = rows[i].f;
data[i] = new Array(cols.length);
for (var j = 0; j < cols.length; j++) {
data[i][j] = cols[j].v;
}
}

// put array data into new Sheet
var spreadsheet = SpreadsheetApp.create(QUERY_NAME);
var sheet = spreadsheet.getActiveSheet();
var headers = queryResults.schema.fields;
sheet.appendRow(headers); // header row
sheet.getRange(START_ROW, START_COL,
rows.length, headers.length).setValues(data);

// return Sheet object for later use
return spreadsheet;
}

It returns a handle to the new Google Sheet which we can then pass on to the next component: using Google Sheets to generate a Chart from the BigQuery data. Again leaving out the CONSTANTs, we have the 2nd part of our app, the createColumnChart() function:

function createColumnChart(spreadsheet) {
// create & put chart on 1st Sheet
var sheet = spreadsheet.getSheets()[0];
var chart = sheet.newChart()
.setChartType(Charts.ChartType.COLUMN)
.addRange(sheet.getRange(START_CELL + ':' + END_CELL))
.setPosition(START_ROW, START_COL, OFFSET, OFFSET)
.build();
sheet.insertChart(chart);

// return Chart object for later use
return chart;
}

The chart is returned by createColumnChart() so we can use that plus the Sheets object to build the desired slide presentation from Apps Script with Google Slides in the 3rd part of our app, the createSlidePresentation() function:

function createSlidePresentation(spreadsheet, chart) {
// create new deck & add title+subtitle
var deck = SlidesApp.create(QUERY_NAME);
var [title, subtitle] = deck.getSlides()[0].getPageElements();
title.asShape().getText().setText(QUERY_NAME);
subtitle.asShape().getText().setText('via GCP and G Suite APIs:\n' +
'Google Apps Script, BigQuery, Sheets, Slides');

// add new slide and insert empty table
var tableSlide = deck.appendSlide(SlidesApp.PredefinedLayout.BLANK);
var sheetValues = spreadsheet.getSheets()[0].getRange(
START_CELL + ':' + END_CELL).getValues();
var table = tableSlide.insertTable(sheetValues.length, sheetValues[0].length);

// populate table with data in Sheets
for (var i = 0; i < sheetValues.length; i++) {
for (var j = 0; j < sheetValues[0].length; j++) {
table.getCell(i, j).getText().setText(String(sheetValues[i][j]));
}
}

// add new slide and add Sheets chart to it
var chartSlide = deck.appendSlide(SlidesApp.PredefinedLayout.BLANK);
chartSlide.insertSheetsChart(chart);

// return Presentation object for later use
return deck;
}

Finally, we need a driver application that calls all three one after another, the createColumnChart() function:

function createBigQueryPresentation() {
var spreadsheet = runQuery();
var chart = createColumnChart(spreadsheet);
var deck = createSlidePresentation(spreadsheet, chart);
}

We left out some detail in the code above but hope this pseudocode helps kickstart your own project. Seeking a guided tutorial to building this app one step-at-a-time? Do our codelab at g.co/codelabs/bigquery-sheets-slides. Alternatively, go see all the code by hitting our GitHub repo at github.com/googlecodelabs/bigquery-sheets-slides. After executing the app successfully, you'll see the fruits of your big data analysis captured in a presentable way in a Google Slides deck:

This isn't the end of the story as this is just one example of how you can leverage both platforms from Google Cloud. In fact, this was one of two sample apps featured in our Cloud NEXT '18 session this summer exploring interoperability between GCP & G Suite which you can watch here:

Stay tuned as more examples are coming. We hope these videos plus the codelab inspire you to build on your own ideas.

How we brought the latest version of Python to App Engine and Cloud Functions

At Cloud Next 2018, we added Python 3.7 support to Cloud Functions and now we’ve announced Python 3.7 support for the App Engine standard environment. These new runtimes allow you to write Python functions and apps using the latest version of Python and the rich ecosystem of packages available on Python Packaging Index (PyPI).

This new runtime marks a significant update to App Engine and was enabled by new open source software that we recently released: gVisor and FTL.

Python, straight from the source

Running Python 3.7 on App Engine and Cloud Functions required us to fundamentally rethink our infrastructure. Traditionally, meeting Google Cloud’s security requirements meant that we had to run a modified version of the Python interpreter. However, using a modified interpreter constrained some language features and only allowed us to support a limited set of whitelisted Python libraries.

Thanks to gVisor, a container sandbox that provides improved security and process isolation, we can now run the unmodified Python 3.7.0 interpreter. We’ve done extensive testing to make sure Python 3.7 is compatible with gVisor. As part of our compatibility testing, we run Python’s full suite of language tests, and tests for Python packages that are popular on PyPI. We’re committed to ensuring that everything you’ve come to know and love about Python is supported on our platform.

Seamless deployments

Most importantly, this change in our infrastructure makes it easier to take advantage of Python’s vast ecosystem. As a developer, you just add project dependencies to a requirements.txt file and deploy.

During deployment, FTL, a tool for building containers, fetches dependencies listed in your requirements.txt file and installs them alongside your app or function. FTL also includes a short-lived dependency cache, which speeds up repeated deployments if no changes are detected in your requirements.txt file. This is particularly useful if you find just need to re-deploy because you found a typo.

Keeping up with the Pythonistas

In making these changes, we also decided to expand the list of system packages that are included with each runtime’s Ubuntu 18.04 distribution. We think that will make life just a little bit easier for developers working with the latest release of Python.

Looking forward, we’re excited about how these changes will allow us to keep up with the Python community’s progress as they release new versions and libraries. Please let us know what you think and if you run into any challenges.

You can learn more about how to get started with it on App Engine and Cloud Functions in our documentation. We can’t wait to see what you build with Python 3.7.

By Stewart Reichling, Product Manager

We’ve moved! Come see our new home!


Ten years, three months and 30 days ago, we wrote our first post on this blog, and now, we’re writing our last at this particular web address. Today, it’s with great excitement that we present to you the Google Cloud blog, your home for all the latest GCP product news, how-to’s, perspectives and customer stories that you’re used to, all living happily on a shiny, new mobile-friendly platform.

We’re really excited about this change. Not only does the new blog look really nice, but it includes all the content from across the entire Google Cloud family—GCP, G Suite, Google Maps Platform and Chrome Enterprise—so you can see how they all fit together. And because data analysis and artificial intelligence are so central to everything people are building today, we’ve also folded our Big Data and Machine Learning blog into this new platform.

Besides collecting all Google Cloud blog content in one place, we think you’ll really benefit from the blog’s rich tagging capabilities. Now, you can view blog posts by platform, and also drill down to specific technology areas like Application Development, Networking or Open Source, so you can quickly find related content. There are also dedicated pages for partners, customers, trainings and certifications, and solutions and how-to’s, to name a few. And because we can also tag posts to multiple products and topics, you’ll be sure to find what you’re looking for.

Those are just the high-level changes. There are a whole lot of new features to use and explore, and we encourage you to browse the site and get familiar with it. What’s not new is our mission: to provide you with honest, technical content to show you how to build your business on GCP.

To date, we’ve migrated over two year’s worth of GCP blog posts to this new home, with more to come. Let us know if you find any broken links, typos, or just flat-out missing content. And of course, we’d love your feedback on our content, the design, or any features you’d like to see. Thanks for reading!

Last month today: July on GCP

The month of July saw our Google Cloud Next ‘18 conference come and go, and there was plenty of exciting news, updates and demos to share from the show. Here’s a look at some of the most-read blog posts from July.

What caught your attention this month: Creating the open cloud
  • One of the most-read posts this month covered the launch of our Cloud Services Platform, which allows you to build a true hybrid cloud infrastructure. Some of the key components of Cloud Services Platform include the managed Istio service mesh, Google Kubernetes Engine (GKE) On-Prem and GKE Policy Management, Cloud Build for fully managed CI/CD, and several serverless offerings (more on that below). Combined, these technologies can help you gain consistency, security, speed and flexibility of the cloud in your local data center, along with the freedom of workload portability to the environment of your choice.
  • Another popular read was a rundown of Google Cloud’s new serverless offerings. These include core serverless compute announcements such as new App Engine runtimes, Cloud Functions general availability and more. It also included serverless containers, so you can run serverless workloads in a fully managed container environment; GKE Serverless add-on to easily run serverless workloads on Kubernetes Engine; and Knative, the open-source project on which that add-on is built. There are even more features included in this post, too, like Cloud Build, Stackdriver monitoring and Cloud Firestore integration with GCP. 
Bringing detailed metrics and Kubernetes apps to the forefront
  • Another must-read post this month for many of you was Transparent SLIs: See Google Cloud the way your application experiences it, announcing the availability of detailed data insights on GCP services that your workloads use—helping you see like a Google site reliability engineer (SRE). These new service-level indicators (SLIs) go way beyond basic uptime and downtime to delve into response codes, latency and more. You can then separate out metrics by GCP service to see things like API version, location and protocol. The result is that you can filter and sort to get extremely fine-grained information on your software and the GCP services you use, which helps cut resolution times and improve the support experience. Transparent SLIs are available now through the Stackdriver monitoring console. Learn more here about the basics of using SLIs and other SRE tools to measure and manage availability.
  • It’s also now faster and easier to find production-ready commercial Kubernetes apps in the GCP Marketplace. These apps are prepackaged and configured to get up and running easily, whether on Kubernetes Engine or other Kubernetes clusters, and run the gamut from security, data analytics and developer tools to storage, machine learning and monitoring.
There was obviously a lot to talk about at the show, and you can get even more detail on what happened at Next ‘18 here.

Building the cloud back-end
  • For all of you developing cloud apps with Java, the availability of Jib was an exciting announcement last month. This open-source container image builder, available as Gradle and Maven plugins, cuts out several steps from the Docker build flow. Jib does all the work required to package your app into a container image—you don’t need to write a Dockerfile or even have Docker installed. You end up with faster builds and reproducible container images.
  • And on that topic, this best practices for building containers post was a hit, too, giving you tips that will set you up to run your environment more smoothly. The tips in this blog post cover graceful application shutdowns, how to simplify containers and how to choose and tag the container images you’ll use. 
It’s been a busy month at GCP, and we’re glad to share lots of new tools with you. Till next time, build away!

Repairing network hardware at scale with SRE principles



To support our Google Cloud Platform (GCP) customers, we run a complex global network that depends on multiple providers and a lot of hardware. Google network engineering uses a diverse set of vendor equipment to route user traffic from an internet service provider to one of our serving front ends inside a GCP data center. This equipment is proprietary and made by external networking vendors such as Arista, Cisco and Juniper. Each vendor has distinct operational methods, configurations and operational consoles.

With hundreds of distinct components utilized across our global network, we routinely deal with hardware failures—for example, a failed power supply, line card or control plane card. The complexity of today’s cloud networks means that there are a huge number of places where failure can occur. When we first began building and operating our own data centers, Google had a team of engineers, network engineers and site reliability engineers (SREs) who performed fault detection, mitigation and repair work on these devices, using manual processes guided by a ticket system. Google’s SRE principles are prescriptive, and aim to guide developers and operations teams toward better systems reliability. As with DevOps, avoiding toil—the manual tasks that can eat up too much time—is an essential goal.

We realized after becoming familiar with common hardware problems that any ticket type that we encountered repeatedly and that follows a predetermined sequence of steps can easily be automated. Our team created a list of playbooks over time that detailed steps of how to deal with each hardware failure scenario, taking into account relevant software and hardware bugs and typical steps to resolution. Each playbook is used when an alert is received. Given that we already knew in advance how to deal with each issue as it arose, it made sense to automate the work. Here’s how we did it.

Building the automation interface

“In the old way of doing things, we treat our servers like pets, for example, Bob the mail server. If Bob goes down, it’s all hands on deck. The CEO can’t get his email and it’s the end of the world. In the new way, servers are numbered, like cattle in a herd. For example, www001 to www100. When one server goes down, it’s taken out back, shot, and replaced on the line.”
- Randy Bias

The above quote describes a classic engineering scenario often applied within SRE: "Pets vs. cattle," which describes a way of looking at data center hardware as either individual components or a herd of them. The two categories of equipment can be described as follows:

Pet:
  • An individual device you work on. You're familiar with all of its particular failure modes. 
  • When it gets sick, you come to the rescue.

Cattle:
  • A fleet of devices with a common interface.
  • You manage the "herd" of devices as a group.
  • The common interface lets you perform the same basic operations on any device, regardless of its manufacturer.
Before we moved to automating network hardware failure resolution, we were stuck handling our networking equipment like pets, with an eye toward what made it unique, rather than as cattle, with an eye toward what made it a commodity. We needed to make it easier not to custom-manage all these networking devices. Our initial automation design aimed to turn our fleet into cattle by providing a common interface for interacting with networking equipment. Specifically, we used the underlying primitives to implement a higher-level interface for performing common operations—in this case, the basic operations of a line card in a network device, regardless of vendor: "Bring it online," "Take it offline" and "Check the status." We defined the following interface for a line card, using the Go programming language.


type Linecard interface {
  Online() error
  Offline() error 
  Status() error
}
The error qualifier in Go simply means that the function returns an error object if it fails. The underlying code implementing this interface for a Juniper line card varies significantly from implementation on the Cisco line card, but the caller of the function is insulated from the implementation. The upper level code imports the library, and when it operates on a line card, it can only perform one of those three actions we specified above.

We then realized that we could apply the same interface to many hardware components—for example, a fan. For certain vendors, the Online() and Offline() functions did nothing, because those vendors didn't support turning a fan off, so we just used the interface to check the status.
type Fan interface {
  Online() error
  Offline() error 
  Status() error
}
Building upon this line of thought, we realized that we could generalize this interface to define a common interface for all hardware components within a device.
type Component interface {
  Online() error
  Offline() error 
  Status() error
}
By structuring the code this way, anyone can add a device from a new vendor. Moreover, anyone can add any type of new component as a library. Once the library implements this common interface, it can be registered as a handler for that specific vendor and component.

Deciding what to automate

The system needed to interact with humans at various stages of the automation. To decide what to automate, we drew a flow chart of the normal human-based repair sequence and drew boxes around stages we believed we could replace with automation. We used the task of replacing a vendor control plane board as an example. Many of the steps have self-explanatory names, but these are definitions of some of the more complex ones:
  • Determine control plane: Find faulty control plane unit.
  • Determine state: Is it the master or the backup? 
  • Copy image to control plane: Copy the appropriate software image to the master control plane.
  • Offline control plane: Send the backup control plane offline.
  • Toggle mastership: Make the replaced control plane the new master.
Figure 1: Manual workflow for replacing a vendor control plane board
When we needed to carry out this workflow, a Google network engineer performed each step in Figure 1, with the exception of pulling out and replacing the failed control plane, which was performed by someone on-site at a data center location.

Once we had defined this task, we created an automated workflow. The goal of the new system was to provide a UI for our hardware engineers in a data center that allowed them to perform one of those operations at a specific time under specific conditions and with various automated safety checks, followed by an entire device audit at the end of the operation. Previously, a human had performed all of these steps, but now a human only needed to perform the step “hardware gets replaced” in Figure 2—the hardware replacement.
Figure 2: Automated workflow for replacing a vendor control plane board
Automation, before and after
Figure 3: High-level system view.
You can see in Figure 3 what the system looked like after automation. Before automating this workflow, there would have been a lot of manual work. When an alert initially came in, an engineer would have stopped traffic to the device, and offlined by hand the bad component. Our network operations center (NOC) team would then work with the vendor—for example, Juniper or Cisco— to get a replacement part on-site. Next, we would file a change request in our change management system, noting the date of the operation.

On the day of the operation:
  • The data center technician would click “start” on the change management system to begin the repair.
  • Our system picks up this change and is ready to begin the repair.
  • The technician clicks “start” on our UI.
  • An “offline” state machine starts proceeding through the various steps to take the component offline safely.
  • The UI notifies the user each step of the way.
  • Once the state machine has completed, it notifies the technician, who can safely replace the component.
  • Once the component is replaced and re-cabled, the technician returns to the UI and begins the “online” state machine, which safely returns the component into production.
When we reviewed our original automation design, we noticed there would be a lot of work involved in building the various systems needed to implement the automated workflow. To facilitate collaboration, we created ticket items for each component of the system, so multiple engineers could work on the project in parallel.

Automation lessons learned

We used an iterative approach in our planning and execution. We first focused on replacing the line card for one vendor, then moved on to multiple vendors and multiple components. Due to the modular design of the code base and the interacting systems, adding more modules and scaling the code horizontally was easy. 

For example, adding a new library that handled fan replacements meant simply creating the code to handle this and ensuring it implemented the above interface. Then it registered itself in the main function.

We had the option to extend or repurpose existing automation systems owned by our software management teams to meet our needs. We had to carefully consider whether to use those systems or build our own, potentially duplicating work if we chose the latter. Ultimately, we built our own automation because the other systems were understaffed. Trying to extend their tools would have disrupted other teams' project work and delayed our own project.

What worked well
Leveraging multiple engineers to automate our internal part of the workflow allowed us to take the project from design to implementation within a short period—about one year.

What didn’t
We haven't yet fully automated our hardware replacement workflow. Doing so involves troubleshooting hardware issues with vendors and persuading them that each individual failure merits a device or component replacement. We work around this gap in our automation by keeping spares on site for use with our repair automation, and handling the vendor workflow portion of the process separately and mostly manually through our NOC. We are currently working toward a fully automated vendor interaction with our vendor partners.

Measuring automation success
We can measure the hours our automation saves engineers using Google's production change logging service, which all internal tools use to record changes made to the production environment. The service logs changes made by tools manually invoked by engineers as well as tools that provide end-to-end automation without manual input. Thus we can compare how long each network repair action used to take when performed manually vs. the number of repair actions that are undertaken by today's fully automated system. These two data sets allow us to calculate the total time savings from automation. As shown in Figure 4, network hardware repair automation saves us hundreds of hours every month.

Tips for reducing toil through automation

While strategies for eliminating toil must be tailored to your individual environment and use cases, some approaches are universal. Based upon our own experience eliminating toil by automating network repair tasks, we recommend the following: 
  • Measure your toil.  
  • Tackle the biggest sources of toil first, and don't try to solve all problems at once.  
  • Carefully consider whether to enhance existing tools or build new ones. Even if you can partially repurpose another team's work, would creating a tool from scratch actually make more sense cost- or resource-wise? 
  • Take a design-driven approach. Iterate on the design, starting small and iterating quickly. Don't try to design the perfect approach from the start.  
  • Measure your time savings to determine your return on investment.
Automation has proved useful for our team of network site reliability engineers at GCP. Learn more about the practice of SRE and how you might apply its principles to your own network projects.

Istio reaches 1.0: ready for prod



Today, Google Cloud is proud to announce, together with our collaborators, that the Istio open-source project has reached the 1.0 milestone. This is a key step toward delivering the Cloud Services Platform that we discussed last week, helping you manage your services in a hybrid world where some of your infrastructure runs on VMs and some in Kubernetes, some services run in the cloud and some on-premises.

Istio: a service mesh

Istio is at its heart a service mesh—software that layers transparently onto an existing distributed application. It collects logs, traces and telemetry, and adds security and policy without embedding client libraries. Moreover, Istio is also a platform, complete with APIs that let you integrate with systems for logging, telemetry and policy.

Istio delivers a service-based view of the service interactions across the mesh. Whereas traditional monitoring gives you low-level metrics such as nodes’ CPU consumption, Istio measures the actual traffic between services: requests per second, error rates and latency. It also generates a dependency graph so you can see how services affect one another.

With Istio, your DevOps team gets the tools it needs to run distributed apps smoothly. Istio does canary rollouts, letting you smoke-test a new build to make sure it’s performing well before ramping up. It also offers fault-injection, retry logic and circuit breaking so DevOps teams can do more testing and change network behavior at runtime to keep applications up and running.

And finally, Istio adds security. It can be used to layer mTLS on every call, adding encryption-in-flight and giving you the ability to authorize every single call on your cluster and in your mesh.

Istio in action

Istio provides foundational capabilities for your infrastructure, freeing developers to work on code that is critical to your business. But there’s only one way to prove that Istio is ready for the enterprise: by running real workloads on it in production. Already, there are at least a dozen companies running Istio in production, including several on GCP. We worked with them through early hurdles, incorporated their feedback, and they’re reaping the benefits of Istio already. A great example is Auto Trader UK, which used Istio to help accelerate their move to containers and the public cloud.

Auto Trader UK is not only migrating from private cloud to public cloud, but also moving from virtual machines to Kubernetes. The level of control and visibility that Istio provides has enabled us to significantly de-risk this ambitious work, and in several cases has actually helped surface issues we were previously unaware of. We've been able to accelerate the delivery of capabilities such as mutual TLS, that previously would have taken significant engineering effort, allowing us to focus on our market differentiators.
- Karl Stoney, Delivery Infrastructure Lead, Auto Trader UK

A true joint effort

We first released Istio as open source last year, and what a year it’s been. Since that first 0.1 release, Istio has improved and matured significantly, with eight versions, 200+ contributors, and 4,000+ check-ins adding an ever growing set of functionality.

Getting to version 1.0 was truly a community-driven effort. IBM was a key collaborator and co-founder, and Lyft’s Envoy proxy is a key component of the project. Since then, the number of companies involved in Istio has skyrocketed, including Cisco, Red Hat, and VMware consolidating industry support with the goal of accelerating adoption and meeting the service mesh needs of their customers.

“The growth of Istio since its launch last year has been tremendous, and it’s quickly taking its place as the standard way to manage microservices in the cloud,” said Jason McGee, IBM Fellow and VP, IBM Cloud. “Our mission since Istio’s launch has been to enable everyone to succeed with microservices, especially in the enterprise. This is why we’ve focused the community around improving security and scale, and heavily leaned our contributions on what we’ve learned from building agile cloud architectures for companies of all sizes.”
- Jason McGee, IBM Fellow and VP, IBM Cloud 
"We see Istio's potential to be able to solve some of the most complex aspects of application development and deployment. It brings a control plane for service mesh, cluster orchestration, and network control that will support and enable developers to focus on the more important aspects of their application development. We are looking forward to leveraging Istio in Red Hat OpenShift to enable developers to deploy their applications in a more secure and efficient manner." 
- Brian 'Redbeard' Harrington, product manager, Istio, Red Hat
“VMware has been an integral part of the community developing Istio service mesh. We see great potential in Istio’s service-based approach to connectivity, security, and observability. We believe it will become an infrastructure cornerstone, spanning across vSphere and Kubernetes platforms and multiple private and public clouds, and helping our enterprise customers improve development efficiencies and deliver on their SLAs / SLOs in a secure manner. Istio’s application layer complements the network virtualization layer, and together allow enterprises to achieve defense in depth, improve performance and scalability, and speed time to application value.” 
- Pere Monclus, CTO Network and Security, VMware

We’re also thrilled with the number of companies writing adapters for Istio—from observability software from SolarWinds and Datadog, to deployment tools from Weaveworks and CodeFresh, to policy and security offerings from Aspenmesh and Octarine. While Istio is transparent to application developers, it provides a standard integration interface for anyone writing observability tools or policy engines.

Working and integrating with other open source projects in the community drives our success, as well. Integrations with SPIFFE, the Open Policy Agent and OpenTracing all improve the state of open source and the lives of developers.

Istio on GCP

While the open-source Istio project is a major undertaking, we’re also intent on making it especially easy to use on Google Cloud Platform. Last week at Google Cloud Next we announced the alpha release of Managed Istio: open-source Istio that’s automatically installed and upgraded on your Kubernetes Engine clusters as a part of the Cloud Services Platform. Managed Istio will help provide the visibility, security and control you need over services running in hybrid environments, and it integrates with other Google products like Stackdriver and Apigee.

Achieving 1.0 is just a first step, both for the project and for us at Google Cloud. We have ambitious plans for adding features and improving Istio’s usability with  the ultimate goal of delivering a complete set of tools to manage all of your services, so that you can focus on writing software and running a business.

To find out more about Istio and how to get started using it on GCP, please visit cloud.google.com/istio.

Access Google Cloud services, right from IntelliJ IDEA



Great news for IntelliJ users: You can now use Google Cloud services and APIs right from JetBrains’ integrated development environment (IDE). With the Cloud Tools for IntelliJ plugin, you can now discover APIs, consume them, and test against them locally, all without leaving your IDE.

The Cloud Tools plugin for IntelliJ streamlines the development process by integrating tasks into the IDE, such as enabling Google Cloud APIs, creating service accounts for local development, and adding the corresponding Java client libraries to your build.
Example: Using the Cloud Translation API with the Cloud Tools for IntelliJ plugin

Say you are interested in using the Cloud Translation API in our Java Maven-based project. If the Cloud Tools for IntelliJ plugin isn’t already configured, then first install it as described in this quickstart.

Clone the example Cloud Translation project, which allows you to translate some input text from English to French.

git clone https://github.com/GoogleCloudPlatform/java-docs-samples.git
Open the project, located under “java-docs-samples/translate”:

At this point, you might simply try to run the application by navigating to the main method and clicking the play button:

… and configuring the input arguments to translate some text from English to French by editing the newly created run configuration:

Run the program again, and this time you get the following error:

As you may have already guessed, you’re missing authentication rights to access the Cloud Translation API from your local machine. To overcome this, you’d normally have to go through the following steps:
  1. Enable the service on your Google Cloud Platform (GCP) project
  2. Create a new service account with the appropriate roles for accessing the service
  3. Update your local run configuration with the necessary environment variables to access the service
Thankfully, the Cloud Tools for IntelliJ plugin can help. In IntelliJ, navigate to the Cloud Tools menu item under “Tools > Google Cloud Tools > Add Cloud libraries …”:

Select the Cloud Translation API and your GCP project, and click “Add Cloud Libraries”:

In the confirmation window that appears, you can see that Cloud Tools for IntelliJ takes care of enabling the API and creating the service account for you:

Lastly, select the run configuration that you created earlier so that the plugin can inject the necessary environment variables for accessing the Cloud Translation service from your local machine:

Run the program again and your input text is successfully translated from English to French using the Cloud Translation service:

The Cloud Tools for IntelliJ plugin also assists with the following:
  • Adding Java client libraries to your Maven pom.xml if they are not already present
  • Writing a Bill of Materials (BOM) to your pom.xml to help avoid dependency version conflicts
  • Detecting and acting on potential misconfigurations, including a missing BOM, through pom.xml file inspections with quick-fixes
The Cloud Tools for IntelliJ plugin provides many more features to help optimize your development workflow including support for Google App Engine, Stackdriver Debugger, Cloud Repositories, and Cloud Storage. For more information and to leave feedback please visit the official documentation and GitHub pages:

Cloud Tools for IntelliJ:

Drilling down into Stackdriver Service Monitoring



If you’re responsible for application performance and availability, you know how hard it can be to see it through the eyes of your customers and end users. We think that’s really going to change with last week’s introduction of Stackdriver Service Monitoring, a new tool for monitoring how your customers perceive your applications, and that then lets you drill down to the underlying infrastructure when there’s a problem.

Most IT operations tools take a bottoms-up understanding of IT systems: they look at compute, storage, and networking metrics to infer the customer experience. Application performance management (APM) tools like tracing systems, debuggers, and profilers consider the application from the code level—but lose sight of the underlying infrastructure. Sometimes, a logs analytics solution can provide the glue between those two layers, but often with great effort and expense.

IT operators have been missing a cost-effective, easy-to-use, general-purpose tool to monitor the customer-facing behavior of their applications. It’s hard to know how end users experience your software and it’s difficult to measure services and applications in a standardized way. Ops staff risk burning out from all the spurious alerts. The result of all this is that mean-time-to-resolution (MTTR) is longer than necessary, and customer satisfaction is lower than desired. The situation is exacerbated with microservice architectures where the app itself is broken into many small pieces, which makes it hard to understand how all the pieces fit together and where to start investigating when there is a problem.

That all changes with the release of Stackdriver Service Monitoring. Service Monitoring takes advantage of service-aware, “opinionated” infrastructure so you can monitor how end users perceive your systems, letting you drill down to the infrastructure level when necessary. Initially, we are supporting this functionality for Google App Engine and for Istio service meshes running on Google Kubernetes Engine. We will expand to more platforms over time.

With Stackdriver Service Monitoring, you get the answers to the following questions:
  • What are your services? What functionality do those services expose to internal and external customers?
  • What are your promises and commitments regarding the availability and performance of those services, and are your services meeting them?
  • For microservices-based apps, what are the inter-service dependencies? How can you use that knowledge to double check new code rollouts and triage problems in the event of service degradation?
  • Can you look at all the monitoring signals for a service holistically to reduce MTTR?

Anatomy of Stackdriver Service Monitoring

Service Monitoring has three pieces: the service graph, Service Level Objectives (SLOs), and multi-signal service dashboards. Together, these give you an inventory of your services, visually display the dependencies between them, let you set and measure availability and performance promises, help you triage application problems to quickly find the root cause, and finally, help you debug broken services more quickly than ever before. Let’s look at each piece in turn.

The service graph: This is a service-specific view of your infrastructure. It starts out with a real-time top level display of all services in the Istio service mesh and the communication links between them. Selecting one service displays charts with error rates and latency metrics. Double-clicking on a service allows you to drill down into its underlying Kubernetes infrastructure, providing the long elusive connection between app behavior and infrastructure. There is also a time slider which allows you to see the graph at previous points in time. Using the service graph you can see your application architecture for reference purposes or to triage problems. You can explore metrics about service behavior, and determine whether an upstream service is causing problems to a downstream service. Finally, you can compare the service graph at different points in time to determine whether there was a significant architectural change right before a problem was reported. There is no quicker way to get started exploring and understanding complex multi-service applications.

SLOs: Internally at Google, our Site Reliability Engineering team (SRE) only alert themselves on customer-facing symptoms of problems, and not all potential causes. This better aligns them to customer interests, lowers their toil, frees them to do value-added reliability engineering, and increases job satisfaction. Stackdriver Service Monitoring lets you to set, monitor, and alert on SLOs. Because Istio and App Engine are instrumented in an opinionated way, we know exactly what the transaction counts, error counts, and latency distributions are between services. All you need to do is set your targets for availability and performance and we automatically generate the graphs for service level indicators (SLIs), compliance to your targets over time, and your remaining error budget. You can configure the maximum allowed drop rate for your error budget; if that rate is exceeded, we notify you and create an incident so that you can take action. To learn more about SLO concepts including error budget, we encourage you to read the SLO chapter of the SRE book.

Service Dashboard: At some point, you will need to dig deeper into a service’s signals. Maybe you received an SLO alert and there’s no obvious upstream cause. Maybe the service is implicated by the service graph as a possible cause for another service’s SLO alert. Maybe you have a customer complaint outside of an SLO alert that you need to investigate. Or, maybe you want to see how the rollout of a new version of code is going.

The service dashboard provides a single coherent display of all signals for a specific service, all of them scoped to the same timeframe with a single control, providing you the fastest possible way to get to the bottom of a problem with your service. Service monitoring lets you dig deep into the service’s behavior across all signals without having to bounce between different products, tools, or web pages for metrics, logs, and traces. The dashboard gives you a view of the SLOs in one tab, the service metrics (transaction rates, error rates, and latencies) in a second tab, and diagnostics (traces, error reports, and logs) in the third tab.

Once you’ve validated an error budget drop in the first tab and isolated anomalous traffic in the second tab, you can drill down further in the diagnostics tab. For performance issues, you can drill down into long tail traces, and from there easily get into Stackdriver Profiler if your app is instrumented for it. For availability issues you can drill down into logs and error reports, examine stack traces, and open the Stackdriver Debugger, if the app is instrumented for it.

Stackdriver Service Monitoring gives you a whole new way to view your application architecture, reason about its customer-facing behaviors, and get to the root of any problems that arise. It takes advantage of infrastructure software enhancements that Google has championed in the open source-world, and leverages the hard-won knowledge of our SRE teams. We think this will fundamentally transform the ops experience of cloud native and microservice development and operations teams. To learn more see the presentation and demo with Descartes Labs at GCP Next last week. We hope you will sign up to try it out and share your feedback.