
Introducing Skaffold: Easy and repeatable Kubernetes development



As companies on-board to Kubernetes, one of their goals is to provide developers with an iteration and deployment experience that closely mirrors production. To help companies achieve this goal, we recently announced Skaffold, a command line tool that facilitates continuous development for Kubernetes applications. With Skaffold, developers can iterate on application source code locally while having it continually updated and ready for validation or testing in their local or remote Kubernetes clusters. Having the development workflow automated saves time in development and increases the quality of the application through its journey to production.

Kubernetes provides operators with APIs and methodologies that increase their agility and facilitate reliable deployment of their software. It replaces bespoke deployment procedures with programmatic ways to achieve similar, if not more robust, results. Kubernetes’ functionality helps operations teams apply common best practices like infrastructure as code, unified logging and immutable infrastructure, as well as safer API-driven deployment strategies like canary and blue/green releases. Operators can now focus on the parts of infrastructure management that are most critical to their organizations, supporting high release velocity with a minimum of risk to their services.

But in some cases, developers are the last people in an organization to be introduced to Kubernetes, even as operations teams are well versed in the benefits of its deployment methodologies. Developers may have already taken steps to create reproducible packaging for their applications with Linux containers, like Docker. Docker allows them to produce repeatable runtime environments where they can define the dependencies and configuration of their applications in a simple way. This keeps development runtimes in sync across the team; however, it doesn’t introduce a common deployment and validation methodology. For that, developers will want to use the Kubernetes APIs and methodologies that are used in production to create a similar integration and manual testing environment.

Once developers have figured out how Kubernetes works, they need to actuate Kubernetes APIs to accomplish their tasks. In this process they'll need to:
  1. Find or deploy a Kubernetes cluster 
  2. Build and upload their Docker images to a registry that's enabled in their cluster 
  3. Use the reference documentation and examples to create their first Kubernetes manifest definitions 
  4. Use the kubectl CLI or Kubernetes Dashboard to deploy their application definitions 
  5. Repeat steps 2-4 until their feature, bug fix or changeset is complete 
  6. Check in their changes and run them through a CI process that includes:
    • Unit testing
    • Integration testing
    • Deployment to a test or staging environment

Steps 2 through 5 require developers to use many tools via multiple interfaces to update their applications. Most of these steps are undifferentiated for developers and can be automated, or at the very least guided by a set of tools that are tailored to a developer’s experience.

Enter Skaffold, which automates the workflow for building, pushing and deploying applications. Developers can start Skaffold in the background while they're developing their code, and have it continually update their application without any input or additional commands. It can also be used in an automated context such as a CI/CD pipeline to leverage the same workflow and tooling when moving applications to production.
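To make that loop concrete, here's a minimal sketch of what it can look like in practice. The skaffold.yaml below uses a recent schema (early alpha releases used slightly different field names), and the image name and manifest path are placeholders for your own:

skaffold.yaml
apiVersion: skaffold/v1
kind: Config
build:
  artifacts:
  - image: gcr.io/my-project/my-app    # placeholder image name
deploy:
  kubectl:
    manifests:
    - k8s/*.yaml                       # placeholder path to your Kubernetes manifests

With that file in place, `skaffold dev` watches your source tree and rebuilds, pushes and redeploys on every change, while `skaffold run` performs the same workflow once, which is handy in a CI/CD pipeline:

$ skaffold dev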

Skaffold features


Skaffold is an early phase open-source project that includes the following design considerations and capabilities:
  • No server-side components mean no overhead to your cluster. 
  • Detects changes in your source code and automatically builds, pushes and deploys. 
  • Image tag management. Stop worrying about updating the image tags in Kubernetes manifests to push out changes during development. 
  • Supports existing tooling and workflows. Build and deploy APIs make each implementation composable to support many different workflows. 
  • Support for multiple application components. Build and deploy only the pieces of your stack that have changed. 
  • Deploy continuously as you save files, or run one-off deployments using the same configuration.

Pluggability


Skaffold has a pluggable architecture that allows you to choose the tools in the developer workflow that work best for you.
Get started with Skaffold on Kubernetes Engine by following the Getting Started guide or use Minikube by following the instructions in the README. For discussion and feedback join the mailing list or open an issue on GitHub.

If you haven’t tried GCP and Kubernetes Engine before, you can quickly get started with our $300 free credits.



8 DevOps tools that smoothed our migration from AWS to GCP: Tamr



Editor’s note: If you recently migrated from one cloud provider to another—or are thinking about making the move—you understand the value of avoiding vendor lock-in by using third-party tools. Tamr, a data unification provider, recently made the switch from AWS to Google Cloud Platform, bringing with them a variety of DevOps tools to help with the migration and day-to-day operations. Check out their recommendations for everything from configuration management to storage to user management.

Here at Tamr, we recently migrated from AWS to Google Cloud Platform (GCP), for a wide variety of reasons, including more consistent compute performance, cheaper machines, preemptible machines and better committed usage stories, to name a few. The larger story of our migration itself is worth its own blog post, which will be coming in the future, but today, we’d like to walk through the tools that we used internally that allowed us to make the switch in the first place. Because of these tools, we migrated with no downtime and were able to re-use almost all of the automation/management code we’d developed internally over the past couple of years.

We attribute a big part of our success to having been a DevOps shop for the past few years. When we first built out our DevOps department, we knew that we needed to be as flexible as possible. From day one, we had a set of goals that would drive our decisions as a team, and which technologies we would use. Those goals have proved themselves as they have held up over time, and more recently allowed us to seamlessly migrate our platform from AWS to GCP and Google Compute Engine.

Here were our goals. Some you’ll recognize as common DevOps mantras, others were more specific to our organization:

  • Automate everything, and its corollary, "everything is code"
  • Treat servers as cattle, not pets
  • Scale our DevOps team sublinearly in relation to the number of servers and services we support 
  • Don’t be tied into one vendor/cloud ecosystem. Flexibility matters, as we also ship our entire stack and install it on-prem at our customers’ sites

Our first goal was well defined and simple. We wanted all operation tasks to be fully automated. Full stop. Though we would have to build our own tooling in some cases, for the most part there's a very rich set of open source tools out there that can solve 95% of our automation problems with very little effort. And by defining everything as code, we could easily review each change and version everything in git.

Treating servers as cattle, not pets is core to the DevOps philosophy. Server "pets" have names like postgres-master, and require you to maintain them by hand. That is, you run commands on them via a shell and upgrade settings and packages yourself. Instead, we wanted to focus on primitives like the number of cores and the amount of RAM that our services need to run. We also wanted to be able to kill any server in the cluster at any time without having to notify anyone. This makes maintenance much easier and more streamlined, as we can do rolling restarts of every server in our fleet. It also ties into our first goal of automating everything.

We also wanted to keep our DevOps team in check. We knew from the get-go that to be successful, we would be running our platform across large fleets of servers. Doing things by hand requires us to hire and train a large number of operators just to run through set runbooks. By automating everything and investing in tooling we can scale the number of systems we maintain without having to hire as many people.

Finally, we didn’t want to get tied into one single vendor cloud ecosystem, for both business reasons—we deploy our stack at customer sites—and because we didn’t want to be held hostage by any one cloud provider. To avoid getting locked into a cloud’s proprietary services, we would have to run most things ourselves on our own set of servers. While you may choose to use equivalent services from your cloud provider, we like the independence of this go-it-alone approach.

Our DevOps toolbox


1. Server/Configuration management: Ansible 

Picking a configuration management system should be the very first thing you do when building out your DevOps toolbox, because you’ll be using it on every server that you have. For configuration management, we chose to use Ansible; it’s one of the simpler tools to get started with, and you can use it on just about any Linux machine.

You can use Ansible in many different ways: as a scripting language, as a parallel ssh client, and as a traditional configuration management tool. We opted to use it as a configuration management tool and set up our code base following Ansible best practices. In addition to the best practices laid out in the documentation, we went one step further and made all of our Ansible code fully idempotent—that is, we expect to be able to run Ansible at any time, and as long as everything is already up-to-date, for it not to make any changes. We also try to make sure that any package upgrades in Ansible have the correct handlers to ensure a zero downtime deployment.
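As an illustrative sketch of this pattern (the package and service names are hypothetical), an idempotent task paired with a handler looks something like the following; re-running it makes no changes when the package is already present, and the service restarts only when something actually changed:

roles/my-service/tasks/main.yml
- name: Ensure the service package is installed
  apt:
    name: my-service          # hypothetical package name
    state: present
  notify: restart my-service

roles/my-service/handlers/main.yml
- name: restart my-service
  service:
    name: my-service          # hypothetical service name
    state: restarted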

We were able to use our entire Ansible code base in both the AWS and GCP environments without having to change any of our actual code. The only things that we needed to change were our dynamic inventory scripts, which are just Python scripts that Ansible executes to find the machines in your environment. Ansible playbooks allow you to use multiple of these dynamic inventory scripts simultaneously, allowing us to run Ansible across both clouds at once.
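For example, newer Ansible releases accept multiple -i flags, so a single run can target hosts discovered by both inventory scripts at once (the script and playbook names here are illustrative):

$ ansible-playbook -i inventory/ec2.py -i inventory/gce.py site.yml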

That said, Ansible might not be the right fit for everyone. It can be rather slow for some things and isn’t always ideal in an autoscaling environment, as it's a push-based system, not pull-based (like Puppet and Chef). Some alternatives to Ansible are the aforementioned Puppet and Chef, as well as Salt. They all solve the same general problem (automatic configuration of servers) but are optimized for specific use cases.

2. Infrastructure configuration: Terraform

When it comes to setting up infrastructure such as VPCs, DNS and load balancers, administrators sometimes set up cloud services by hand, then forget they are there, or how they configured them. (I’m guilty of this myself.) The story goes like this: we need a couple of machines to test an integration with a vendor. The vendor wants shell access to the machines to walk us through problems and requests an isolated environment. A month or two goes by and everything is running smoothly, and it’s time to set up a production environment based on the development environment. Do you remember what you did to set it up? What settings you customized? That is where infrastructure-as-code configuration tools can be a lifesaver.

Terraform allows you to codify the settings and infrastructure in your cloud environments using its domain-specific language (DSL). It handles everything for you (cloud integrations and the ordering of operations for creating resources) and allows you to provision resources across multiple cloud platforms. For example, in Terraform, you can create DNS records in Google DNS that reference a resource in AWS. This allows you to easily link resources across multiple environments and provision complex networking environments as code. Most cloud providers have a tool for managing resources as code: AWS has CloudFormation, Google has Cloud Deployment Manager, and OpenStack has Heat Orchestration Templates. Terraform effectively acts as a superset of all these tools and provides a universal format across all platforms.

3. Server imaging: Packer 

One of the basic building blocks of a cloud environment is a Virtual Machine (VM) image. In AWS, there’s a marketplace with AMI images for just about anything, but we often needed to install tools onto our servers beyond the basic services included in the AMI. For example, think Threatstack agents that monitor the activity on the server and scan packages on the server for CVEs. As a result, it was often easier to just build our own images. We also build custom images for our customers and need to share them into their various cloud accounts. These images need to be available to different regions, as do our own base images that we use internally as the basis for our VMs. Having a consistent way to build images independent of a specific cloud provider and a region is a huge benefit.

We use Packer, in conjunction with our Ansible code base, to build all of our images. Packer provides the framework to spin up a machine, run our Ansible code against it, and then save a snapshot of the machine into our account. Because Packer is integrated with configuration management tools, it allowed us to define everything in the AMIs as source code. This allows us to easily version images and have confidence that we know exactly what’s in our images. It made reproducing problems that customers had with our images trivial, and allowed us to easily generate changelogs for images.

The bigger benefit that we experienced was that when we switched to Compute Engine, we were able to reuse everything we had in AWS. All we needed to change was a couple of lines in Packer to tell it to use Compute Engine instead of AWS. We didn’t have to change anything in the base images that developers use day-to-day or the base images that we use in our compute clusters.

4. Containers: Docker

When we first started building out our infrastructure at Tamr, we knew that we wanted to use containers as I had used them at my previous company and seen how powerful and useful they can be at scale. Internally we have standardized on Docker as our primary container format. It allows us to build a single shippable artifact for a service that we can run on any Linux system. This gives us portability between Linux operating systems without significant effort. In fact, we’ve been able to Dockerize most of our system dependencies throughout the stack, to simplify bootstrapping from a vanilla Linux system.

5 and 6. Container and service orchestration: Mesos + Marathon

Containers in and of themselves don’t provide scale or high availability; Docker is just one piece of the puzzle. To fully leverage containers, you need something to manage them and provide management hooks. This is where a container orchestrator comes in: it allows you to link your containers together and use them to build up services in a consistent, fault-tolerant way.

For our stack we use Apache Mesos as the basis of our compute clusters. Mesos is essentially a distributed kernel for scheduling tasks on servers: it acts as a broker between frameworks requesting resources and the resources (CPU, memory, disk, GPUs) available on machines in the Mesos cluster. One of the most common frameworks for Mesos is Marathon, which ships as part of Mesosphere’s commercial DC/OS (Data Center Operating System) and serves as the main interface for launching tasks onto a Mesos cluster. Internally we deploy all of our services and dependencies on top of a custom Mesos cluster. We spent a fair amount of time building our own deployment/packaging tool on top of Marathon for shipping releases and handling deployments. (Down the road we hope to open source this tool, in addition to writing a few blog posts about it.)

The Mesos + Marathon approach for hosting services is so flexible that during our migration from AWS to GCP, we were able to span our primary cluster across both clouds. As a result, we were able to slowly switch services running on the cluster from one cloud to another using Marathon constraints. As we were switching over, we simply spun up more Compute Engine machines and then deprecated machines on the AWS side. After a couple of days, all of our services were running on Compute Engine machines, and off of AWS.

However, if we were building our infrastructure from scratch today, we would heavily consider building on top of Kubernetes rather than Mesos. Kubernetes has come a long way since we started building out our infrastructure, but it just wasn’t ready at the time. I highly recommend Google Kubernetes Engine as a starting point for organizations starting to dip their toes into the container orchestration waters. Even though it's a managed service, the fact that it's based on open-source Kubernetes minimizes the risk of cloud lock-in.

7. User management: JumpCloud 

One of the first problems we dealt with in our AWS environment was how to provide ssh access to our servers to our development team. Before we automated server provisioning, developers often created a new root key every time they spun up an instance. We soon consolidated to one shared key. Then we upgraded to running an internal LDAP instance. As the organization grew, managing that LDAP server became a pain—we were definitely treating it as a pet. So we went looking for a hosted LDAP/Active Directory offering, which led us to JumpCloud. After working with them, we ended up using their agent on our servers instead of an LDAP connector, even though they have a hosted LDAP endpoint that we do use for other things. The JumpCloud agent syncs with JumpCloud and provisions users, groups and ssh keys onto the server automatically for us. JumpCloud also provides a self-service portal for developers to update their ssh keys. This means that we now spend almost no time actually managing access to our servers; it’s all fully automated.

It’s worth noting that access to machines on Compute Engine is completely different than on AWS. With GCP, users can use the gcloud command line interface (CLI) to gain access to a machine. The CLI generates an ssh key, provisions it onto the server and creates a user account on the machine (for example: `gcloud compute --project "gce-project" ssh --zone "us-east1-b" "my-machine-name"`). In addition, users can upload their ssh key/user pairs in the console, and new machines will have those user accounts set up on launch. In other words, the problem of how to provide ssh access to developers that we ran into on AWS doesn’t exist on Compute Engine.

JumpCloud solved a specific problem with AWS, but provides a portable solution across both GCP and AWS. Using it with GCP works great; however, if you're 100% on GCP, you don’t need to rely on an additional external service such as JumpCloud to manage your users.

8. Storage: RexRay 

Given that we run a large number of services on top of a Mesos cluster, we needed a way to provide persistent storage to the Docker containers running there. Since we treat servers as cattle, not pets (we expect to be able to kill any one server at any time), using Mesos local persistent storage wasn’t an option for us. We ended up using RexRay as an interface for provisioning and mounting disks into containers. RexRay acts as the bridge on a server between disks and a remote storage provider. Its main interface is a Docker storage driver plugin that can make API calls to a wide variety of sources (AWS, GCP, EMC, Digital Ocean and many more) and mount the provisioned storage into a Docker container. In our case, we were using EBS volumes on AWS and persistent disks on Compute Engine. Because RexRay is implemented as a Docker plugin, the only thing we had to change between the environments was the config file with the Compute Engine vs. AWS settings. We didn’t have to change any of our upstream invocations for disk resources.
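As a rough illustration (the volume name, size and image are hypothetical, and the available options depend on how RexRay is configured), provisioning and mounting a persistent volume through the Docker plugin looks something like this:

$ docker volume create --driver rexray --opt size=100 --name pg-data
$ docker run -d --volume-driver rexray -v pg-data:/var/lib/postgresql/data postgres:9.6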


DevOps = Freedom 


From the viewpoint of our DevOps team, these tools enabled a smooth migration, without much manual effort. Most things only required updating a couple of config files to be able to talk to Compute Engine APIs. At the top layers in our stack that our developers use, we were able to switch to Compute Engine with no development workflow changes, and zero downtime. Going forward, we see being able to span across and between clouds at will as a competitive advantage, and this would not be possible without the investment we made into our tooling.

Love our list of essential DevOps tools? Hate it? Leave us a note in the comments—we’d love to hear from you. To learn more about Tamr and our data unification service, visit our website.

Automatic serverless deployments with Cloud Source Repositories and Container Builder



There are many reasons to automate your deployments: consistency, safety, and timeliness. These increase in value as your software becomes more critical to your business. In this post, I'll demonstrate how easy it is to start automating deployments with Google Cloud Platform (GCP) tools, and refer you to additional resources to help make your deployment process more robust.

Suppose you have a Google Cloud Functions, Firebase or Google App Engine application. Today, you probably deploy your function or app via gcloud commands from your local workstation. Let's look at a lightweight workflow that takes advantage of two Google Cloud products: Cloud Source Repositories and Cloud Container Builder.
This simple pipeline uses build triggers in Cloud Container Builder to deploy a function to Cloud Functions when source code is pushed to a "prod" branch.

The first step is to get your code under revision control. If you're already using a provider like GitHub or Bitbucket, it's trivial to mirror your code to a Cloud Source Repository. Cloud Source Repositories is offered at no charge for up to five project-users, so it's perfect for small teams.

Commands for the command-line are captured below, but you can find more detailed guides in the documentation.

Create and clone your repository:

$ gcloud source repos create my-function
Created [my-function].

$ gcloud source repos clone my-function
Cloning into 'my-function'...

Now, create a simple function (include a package.json if you have third-party dependencies):

index.js
exports.f = function(req, res) {
  res.send("hello, gcf!");
};

Then, create a Container Builder build definition:

deploy.yaml
steps:
- name: gcr.io/cloud-builders/gcloud
  args:
  - beta
  - functions
  - deploy
  - --trigger-http
  - --source=.
  - --entry-point=f
  - hello-gcf # Function name

This is equivalent to running the command:

gcloud beta functions deploy --trigger-http --source=. --entry-point=f hello-gcf

Before you start your first build, set up your project for Container Builder. First, enable two APIs: Container Builder API and Cloud Functions API. To allow Container Builder to deploy, you need to give it access to your project. The build process uses the credentials of a service account associated with those builds. The address for that service account is {numerical-project-id}@cloudbuild.gserviceaccount.com. You'll need to add an IAM role to that service account: Project Editor. If you use this process to deploy other resources, you might need to add other IAM roles.
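If you prefer the command line, the equivalent setup might look something like this (the project ID and numeric project ID are placeholders for your own):

$ gcloud services enable cloudbuild.googleapis.com cloudfunctions.googleapis.com

$ gcloud projects add-iam-policy-binding my-project \
    --member serviceAccount:123456789012@cloudbuild.gserviceaccount.com \
    --role roles/editor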

Now, test your deployment configuration and permissions by running:

gcloud container builds submit --config deploy.yaml .

Your function is now being deployed via Cloud Container Builder.

Creating a build trigger is easy: choose your repository, the trigger condition (in this case, pushing to the "prod" branch), and the build to run (in this case, the build specified in "deploy.yaml").
Now, update the "prod" branch, bring it up-to-date with "master", push it to Cloud Source Repositories, and your function will be deployed!

$ git checkout prod
$ git pull origin prod
$ git merge master
$ git push origin prod

If the deployment fails, it will show up as a failed build in the build history screen. Check the logs to investigate what went wrong. You can also configure email or other notifications using Pub/Sub and Cloud Functions.

This is a simplified deployment pipeline—just enough to demonstrate the power of deployment automation. At some point, you'll probably find that this process doesn't meet your needs. For example, you might want to get a manual approval before you update production. If that happens, check out Spinnaker, an open-source deployment automation system that can handle more complex workflows.

And that’s just the beginning! As you get further down the road toward automating your deployments, there are plenty of other tools and techniques for you to try.
We hope this gets you excited about automating your software deployments. Let us know what you think of this guide—we’d love to hear from you.

Introducing Agones: Open-source, multiplayer, dedicated game-server hosting built on Kubernetes



In the world of distributed systems, hosting and scaling dedicated game servers for online, multiplayer games presents some unique challenges. And while the game development industry has created a myriad of proprietary solutions, Kubernetes has emerged as the de facto open-source, common standard for building complex workloads and distributed systems across multiple clouds and bare metal servers. So today, we’re excited to announce Agones (Greek for "contest" or "gathering"), a new open-source project that uses Kubernetes to host and scale dedicated game servers.

Currently under development in collaboration with interactive gaming giant Ubisoft, Agones is designed as a batteries-included, open-source, dedicated game server hosting and scaling project built on top of Kubernetes, with the flexibility you need to tailor it to the needs of your multiplayer game.

The nature of dedicated game servers


It’s no surprise that game server scaling is usually done by proprietary software—most orchestration and scaling systems simply aren’t built for this kind of workload.

Many of the popular fast-paced online multiplayer games such as competitive FPSs, MMOs and MOBAs require a dedicated game server—a full simulation of the game world—for players to connect to as they play within it. This dedicated game server is usually hosted somewhere on the internet to facilitate synchronizing the state of the game between players, but also to be the arbiter of truth for each client playing the game, which also has the benefit of safeguarding against players cheating.

Dedicated game servers are stateful applications that retain the full game simulation in memory. But unlike other stateful applications, such as databases, they have a short lifetime. Rather than running for months or years, a dedicated game server runs for a few minutes or hours.

Dedicated game servers also need a direct connection to a running game server process’ hosting IP and port, rather than relying on load balancers. These fast-paced games are extremely sensitive to latency, which a load balancer only adds more of. Also, because all the players connected to a single game server share the in-memory game simulation state at the same time, it’s just easier to connect them to the same machine.

Here’s an example of a typical dedicated game server setup:


  1. Players connect to some kind of matchmaker service, which groups them (often by skill level) to play a match. 
  2. Once players are matched for a game session, the matchmaker tells a game server manager to provide a dedicated game server process on a cluster of machines.
  3. The game server manager creates a new instance of a dedicated game server process that runs on one of the machines in the cluster. 
  4. The game server manager determines the IP address and the port that the dedicated game server process is running on, and passes that back to the matchmaker service.
  5. The matchmaker service passes the IP and port back to the players’ clients.
  6. The players connect directly to the dedicated game server process and play the multiplayer game against one another. 

Building Agones on Kubernetes and open-source 

Agones replaces the bespoke cluster management and game server scaling solution we discussed above with a Kubernetes cluster that includes a custom Kubernetes Controller and matching GameServer Custom Resource Definitions.
With Agones, Kubernetes gets native abilities to create, run, manage and scale dedicated game server processes within Kubernetes clusters using standard Kubernetes tooling and APIs. This model also allows any matchmaker to interact directly with Agones via the Kubernetes API to provision a dedicated game server.

Building Agones on top of Kubernetes has lots of other advantages too: it allows you to run your game workloads wherever it makes the most sense, for example, on game developers’ machines via platforms like minikube, in-studio clusters for group development, on-premises machines and on hybrid-cloud or full-cloud environments, including Google Kubernetes Engine.

Kubernetes also simplifies operations. Multiplayer games are never just dedicated game servers—there are always supporting services, account management, inventory, marketplaces etc. Having Kubernetes as a single platform that can run both your supporting services as well as your dedicated game servers drastically reduces the required operational knowledge and complexity for the supporting development team.

Finally, the people behind Agones aren’t just one group building a game server platform in isolation. Agones, and the developers that use it, leverage the work of hundreds of Kubernetes contributors and the diverse ecosystem of tools that have been built around the Kubernetes platform.

As a founding contributor to the Agones project, Ubisoft brings deep knowledge and expertise in running top-tier, AAA multiplayer games for a global audience.
“Our goal is to continually find new ways to provide the highest-quality, most seamless services to our players so that they can focus on their games. Agones helps by providing us with the flexibility to run dedicated game servers in optimal datacenters, and by giving our teams more control over the resources they need. This collaboration makes it possible to combine Google Cloud’s expertise in deploying Kubernetes at scale with our deep knowledge of game development pipelines and technologies.”  
Carl Dionne, Development Director, Online Technology Group, Ubisoft. 


Getting started with Agones 


Since Agones is built with Kubernetes’ native extensions, you can use all the standard Kubernetes tooling to interact with it, including kubectl and the Kubernetes API.

Creating a GameServer 

Authoring a dedicated game server to be deployed on Kubernetes is similar to developing a more traditional Kubernetes workload. For example, the dedicated game server is simply built into a container image like so:

Dockerfile
FROM debian:stretch
RUN useradd -m server

COPY ./bin/game-server /home/server/game-server
RUN chown -R server /home/server && \
    chmod o+x /home/server/game-server

USER server
ENTRYPOINT ["/home/server/game-server"]

By installing Agones into Kubernetes, you can add a GameServer resource to Kubernetes, with all the configuration options that also exist for a Kubernetes Pod.

gameserver.yaml
apiVersion: "stable.agon.io/v1alpha1"
kind: GameServer
metadata:
  name: my-game-server
spec:
  containerPort: 7654
  # Pod template
  template:
    spec:
      containers:
      - name: my-game-server-container
        image: gcr.io/agones-images/my-game-server:0.1

You can then apply it through the kubectl command or through the Kubernetes API:

$ kubectl apply -f gameserver.yaml
gameserver "my-game-server" created

Agones manages starting the game server process defined in the yaml, assigning it a public port, and retrieving the IP and port so that players can connect to it. It also tracks the lifecycle and health of the configured GameServer through an SDK that's integrated into the game server process code.

You can query Kubernetes to get details about the GameServer, including its State, and the IP and port that player game clients can connect to, either through kubectl or the Kubernetes API:

$ kubectl describe gameserver my-game-server
Name:         my-game-server
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  stable.agones.dev/v1alpha1
Kind:         GameServer
Metadata:
  Cluster Name:
  Creation Timestamp:  2018-02-09T05:02:18Z
  Finalizers:
    stable.agones.dev
  Generation:        0
  Initializers:      
  Resource Version:  13422
  Self Link:         /apis/stable.agones.dev/v1alpha1/namespaces/default/gameservers/my-game-server
  UID:               6760e87c-0d56-11e8-8f17-0800273d63f2
Spec:
  Port Policy:     dynamic
  Container:       my-game-server-container
  Container Port:  7654
  Health:
    Failure Threshold:      3
    Initial Delay Seconds:  5
    Period Seconds:         5
  Host Port:                7884
  Protocol:                 UDP
  Template:
    Metadata:
      Creation Timestamp:  
    Spec:
      Containers:
        Image:  gcr.io/agones-images/my-game-server:0.1
        Name:   my-game-server-container
        Resources:
Status:
  Address:    192.168.99.100
  Node Name:  agones
  Port:       7884
  State:      Ready
Events:
  Type    Reason    Age   From                   Message
  ----    ------    ----  ----                   -------
  Normal  PortAllocation  3s    gameserver-controller  Port allocated
  Normal  Creating        3s    gameserver-controller  Pod my-game-server-q98sz created
  Normal  Starting        3s    gameserver-controller  Synced
  Normal  Ready           1s    gameserver-controller  Address and Port populated

What’s next for Agones


Agones is still in very early stages, but we’re very excited about its future! We’re already working on new features like game server Fleets, planning a v0.2 release and working on a roadmap that includes support for Windows, game server statistic collection and display, node autoscaling and more.

If you would like to try out a v0.1 alpha release of Agones, you can install it directly on a Kubernetes cluster such as GKE or minikube and take it for a spin. We have a great installation guide that will take you through getting set up!

And we would love your help! There are multiple ways to get involved.

Thanks to everyone who has been involved in the project so far across Google Cloud Platform and Ubisoft. We're very excited for the future of Agones!

From open source to sustainable success: the Kubernetes graduation story



Today is a special day: Kubernetes graduates from CNCF incubation, an important milestone in its maturity, and an even bigger milestone for all the organizations that have come to rely on it as a cornerstone of their IT operations.

Graduation is also an opportunity for us at Google to look back at the work accomplished over the years and say thank you and congratulations to a class that has accomplished so much. In this case, we would like to congratulate the Kubernetes project community that has worked with us sometimes as students, frequently as peers, and often as teachers. Kubernetes would not be where it is today without the attention and devotion so many have given it across the industry.

Congratulations are also due to the Cloud Native Computing Foundation (CNCF), which incubated our project and gave the community an independent home in which to grow and learn.

Forgive us as we walk down memory lane for a moment—Google released Kubernetes as open source code in June 2014 and it was an immediate success.

Actually, that’s not how it worked—not at all. We started with a Google technical team that had deep experience running globally-scalable applications on distributed systems. And a year and a half in, we were running into challenges of every sort. We were at risk of becoming a very successful disaster, wrestling with growing interest and contributions to an increasingly complex system. There were so many changes in every release that changes in one area would break another area, making it harder to get a release out the door on time.

Today, there are few open source projects the size of Kubernetes and almost none with Kubernetes’ level of sustained activity. There was only so much prior experience to rely on. To reach this day, it took a cross-industry group of technologists, program managers, product managers, documentation writers and advocates that were open to experimenting and maturing as a community. We’d like to share some of the best practices that helped us get through and overcome those growing pains, and get where we are today.

#1 Focus on the user


A successful project is a combination of great technology, product market fit and innovation that solves real problems. But ensuring adoption in the early stages of an unknown open source project is hard. To support adoption we set up a user rotation early on, and assigned Google engineers to answer user project questions on IRC, StackOverflow and Slack channels. The rotation grew user adoption directly and provided valuable feedback to engineers.

#2 Build a community


Collaborating with engineers from companies like Red Hat and CoreOS was critical for shaping Kubernetes to meet the needs of a wide variety of users. Additionally, it was important for the community to welcome independent, individual contributors as much as experienced contributors from big companies. Diverse voices sharing different use cases, offering unusual perspectives and providing a counterpoint in discussions were the only path to a project that would scale up to the needs of large enterprises and scale down to be digestible by students and hobbyists.

#3 Invest in sustainability


To scale the project in a healthy, sustainable way, Google Cloud funded large investments in test infrastructure, change-review automation, issue-management automation, documentation, project management, contributor experience, mentorship programs and project governance. We also worked closely with the CNCF to develop devstats, a tool for visualizing GitHub organization and repository statistics so that we can regularly review the project’s health.

#4 Enable an ecosystem


It was clear we had the ingredients for a technological revolution on our hands, but turning that into a ubiquitous platform required additional elements. It needed a clear vision of the future that understood user needs and recognized that not every use case was the same. It needed hooks for an ecosystem of contributors to build from, broader capabilities to handle new types of applications and standards to ensure consistent user experience and streamline adoption. It needed advocates and champions from across the industry.

Kubernetes is a successful platform in part due to early decisions by maintainers to ensure a modular architecture with well-defined interfaces. As a result, Kubernetes runs everywhere, from all major cloud platforms to card-sized ARM-based devices, and supports meaningful choices from the ecosystem, including container runtimes, network plugins, ingress controllers and monitoring systems, to name a few. In order to give users an efficient platform for more diverse workloads, we invested in support for stateful workloads, storage plugins, and hardware accelerators. Additionally, Kubernetes extension mechanisms such as API aggregation and Custom Resource Definitions unlock innovation in the ecosystem by enabling developers to take Kubernetes in new directions.

Last but not least, to ensure Kubernetes avoids the risk of fragmentation, Google worked with the CNCF and the Kubernetes community to initiate the Certified Kubernetes Conformance Program that aims to cement the portability and ubiquity of this platform.

Even with years of experience developing Borg and the collective effort of hundreds of Googlers, we couldn’t have done this alone. For all the help making Kubernetes what it is today, we must thank our many contributors, collaborators, leaders, users, advocates, dissenters and challengers—those who helped us turn open-source code into an open source project and an industry ecosystem.

Like a school graduation, this isn’t an end unto itself, but just the beginning. We look forward to the future where Kubernetes is even more critical thanks to all of you who have helped get it this far, and all of you who will help continue to mature it in the future.

For more information on the Kubernetes graduation, take a look at the CNCF announcement.

Learn to run Apache Spark natively on Google Kubernetes Engine with this tutorial



Apache Spark, the open-source cluster computing framework, is a popular choice for large-scale data processing and machine learning, particularly in industries like finance, media, healthcare and retail. Over the past year, Google Cloud has led efforts to natively integrate Apache Spark and Kubernetes. Starting as a small open-source initiative in December 2016, the project has grown and fostered an active community that maintains and supports this integration.

As of version 2.3, Apache Spark includes native Kubernetes support, allowing you to make direct use of multi-tenancy and sharing through Kubernetes Namespaces and Quotas, as well as administrative features such as Pluggable Authorization and Logging for your Spark workloads. This also opens up a range of hybrid cloud possibilities: you can now easily port your on-premises Spark-on-Kubernetes jobs to Kubernetes Engine. In addition, we recently released Hadoop/Spark GCP connectors for Apache Spark 2.3, allowing you to run Spark natively on Kubernetes Engine while leveraging Google data products such as Cloud Storage and BigQuery.
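For reference, a Spark 2.3 job is submitted directly against the Kubernetes API server. In the sketch below, the API server address and container image are placeholders for your own cluster and image:

$ bin/spark-submit \
    --master k8s://https://<kubernetes-api-server>:443 \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=3 \
    --conf spark.kubernetes.container.image=<your-spark-image> \
    local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar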

To help you get started, we put together a tutorial to learn how to run Spark on Kubernetes Engine. Here, Spark runs as a custom controller that creates Kubernetes resources in response to requests made by the Spark scheduler. This allows fine-grained management of Spark applications, improved elasticity and seamless integration with logging and monitoring on Kubernetes Engine.

This tutorial brings together some of the best data storage and processing services of Google. In addition to Cloud Storage and BigQuery, it shows you how to use Google Cloud Pub/Sub with Spark for streaming workloads. The tutorial details the Spark setup, including credentials and IAM to connect to Google’s services and provides runnable code to perform data transformations and aggregations on a public dataset derived from Github. This is a good approach to take if you're looking to write your own Spark applications and use Cloud Storage and BigQuery as data sources and sinks. For instance, you can store logs on Cloud Storage, and then use Spark on Kubernetes Engine to pre-process data, and use BigQuery to perform data analytics.

The tutorial is designed for flexibility: use it as a jumping-off point and customize Apache Spark for your use case. Alternatively, if you want a fully managed and supported Apache Spark service, we offer Cloud Dataproc on GCP.
We have lots of other plans for Apache Spark and Kubernetes: For one, we’re building support for interactive Spark on Kubernetes Engine. In addition, we’re also working on Spark Dynamic Resource Allocation for future releases of Apache Spark that you’ll be able to use in conjunction with Kubernetes Engine cluster autoscaling, helping you achieve greater efficiencies and elasticity for bursty periodic batch jobs in multi-workload Kubernetes Engine clusters. Until then, be sure to try out the new Spark tutorial on your Kubernetes Engine clusters!

Managing your Compute Engine instances just got easier



If you use Compute Engine, you probably spend a lot of time creating, cloning and managing VM instances. We recently added new management features that will make performing those tasks much easier.

More ways to create instances and use instance templates


With the recent updates to Compute Engine instance templates, you can now create instances from existing instance templates, and create instance templates based on existing VM instances. These features are available independently of Managed Instance Groups, giving you more power and flexibility in creating and managing your VM instances.

Imagine you're running a VM instance as part of your web-based application, and are moving from development to production. You can now configure your instance exactly the way you want it and then save your golden config as an instance template. You can then use the template to launch as many instances as you need, configured exactly the way you want. In addition, you can tweak VMs launched from an instance template using the override capability.

You can create instance templates using the Cloud Console, CLI or the API. Let’s look at how to create an instance template and instance from the console. Select a VM instance, click on the “Create instance” drop down button, and choose “From template.” Then select the template you would like to use to create the instance.
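The gcloud equivalents are just as simple; the instance, template and zone names below are placeholders:

# Capture an existing VM's configuration as an instance template:
$ gcloud compute instance-templates create my-golden-template \
    --source-instance my-configured-vm --source-instance-zone us-central1-a

# Launch a new VM from that template:
$ gcloud compute instances create my-new-vm --zone us-central1-a \
    --source-instance-template my-golden-template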

Create multiple disks when you launch a VM instance


Creating a multiple disk configuration for a VM instance also just got easier. Now you can create multiple persistent disks as part of the virtual machine instance creation workflow. Of course, you can still attach disks later to existing VM instances—that hasn’t changed.

This feature is designed to help you when you want to create data disks and/or application disks that are separate from your operating system disk. You can also use the ability to create multiple disks on launch for instances within a managed instance group by defining multiple disks in the instance template, which makes the MIG a scalable way to create a group of VMs that all have multiple disks.

To create additional disks in the Google Cloud SDK (gcloud CLI), use the --create-disk flag.
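For example (the names, sizes and zone are illustrative), you can create a VM with a separate data disk and log disk in a single command:

$ gcloud compute instances create my-db-vm --zone us-central1-a \
    --create-disk size=200GB,type=pd-ssd,name=my-db-data \
    --create-disk size=500GB,type=pd-standard,name=my-db-logs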

Create an image from a running VM instance


When creating an image of a VM instance for cloning, sharing or backup purposes, you may not want to disrupt the services running on that instance. Now you can create images from a disk that's attached to a running VM instance. From the Cloud Console, check the “Keep instance running” checkbox, or from the API, set the force-create flag to true.
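From the gcloud CLI, the same thing might look like this (the image, disk and zone names are placeholders, and depending on your SDK version the command may require the beta track):

$ gcloud compute images create my-app-image \
    --source-disk my-running-vm --source-disk-zone us-central1-a --force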


Protect your virtual machines from accidental deletion


Accidents happen from time to time, and sometimes that means you delete a VM instance and interrupt key services. You can now protect your VMs from accidental deletion by setting a simple flag. This is especially important for VM instances running critical workloads and applications such as SQL Server instances, shared file system nodes, license managers, etc.

You can enable (and disable) the flag using the Cloud Console, SDK or the API. The screenshot below shows how to enable it through the UI, as well as how to view the deletion protection status of your VM instances from the list view.
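With the gcloud CLI, the flag can be set at creation time or toggled later on an existing instance; the instance and zone names below are placeholders:

# Enable deletion protection when creating the VM:
$ gcloud compute instances create my-critical-vm --zone us-central1-a --deletion-protection

# Turn it off again later, for example before a planned teardown:
$ gcloud compute instances update my-critical-vm --zone us-central1-a --no-deletion-protection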

Conclusion


If you already use Compute Engine, you can start using these new features right away from the console, Google Cloud SDK or through APIs. If you aren’t yet using Compute Engine, be sure to sign up for a free trial to get $300 in free cloud credits. To learn more, please visit the instance template, instance creation, custom images and deletion protection product documentation pages.

96 vCPU Compute Engine instances are now generally available


Today we're happy to announce the general availability of Compute Engine machine types with 96 vCPUs and up to 624 GB of memory. Now you can take advantage of the performance improvements and increased core count provided by the new Intel Xeon Scalable Processors (Skylake). For applications that can scale vertically, you can leverage all 96 vCPUs to decrease the number of VMs needed to run your applications, while reducing your total cost of ownership (TCO).

You can launch these high-performance virtual machines (VMs) as three predefined machine types, and as custom machine types. You can also adjust your extended memory settings to create a machine with the exact amount of memory and vCPUs you need for your applications.
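For example (the instance names, zone and sizes are illustrative), you can launch either a predefined or a custom 96 vCPU machine from the gcloud CLI:

# Predefined 96 vCPU machine type on the Skylake platform:
$ gcloud compute instances create my-large-vm --zone us-central1-b \
    --machine-type n1-standard-96 --min-cpu-platform "Intel Skylake"

# Custom machine type (add --custom-extensions to exceed the default memory-per-vCPU ratio):
$ gcloud compute instances create my-custom-vm --zone us-central1-b \
    --custom-cpu 96 --custom-memory 624GB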

These new machine types are available in GCP regions globally. You can currently launch 96 vCPU VMs in us-central1, northamerica-northeast1, us-east1, us-west1, europe-west1, europe-west4, and asia-east1, asia-south1 and asia-southeast1. Stay up-to-date on additional regions by visiting our available regions and zones page.

Customers are doing exciting things with the new 96 vCPU machine types including running in-memory databases such as SAP HANA, media rendering and production, and satellite image analysis.
"When preparing petabytes of global satellite imagery to be calibrated, cleaned up, and "science-ready" for our machine learning models, we do a tremendous amount of image compression. By leveraging the additional compute resources available with 96 vCPU machine types, as well as Advanced Vector Extensions such as AVX-512 with Skylake, we have seen a 38% performance improvement in our compression and a 23% improvement in our imagery expansions. This really adds up when working with petabytes of satellite and aerial imagery." 
- Tim Kelton, Co-Founder, Descartes Labs
The 96 vCPU machine types enable you to take full advantage of the performance improvements available through the Intel Xeon Scalable Processor (Skylake), and the supported AVX-512 instruction set. Our partner Altair demonstrated how you can achieve up to 1.8X performance improvement using the new machine types for HPC workloads. We also worked with Intel to support your performance and scaling efforts by providing the Intel Performance libraries freely on Compute Engine. You can take advantage of these components across all machine types, but they're of particular interest for applications that can exploit the scale of 96 vCPU instances on Skylake-based servers.

The following chart shows an example of the performance improvements delivered by using the Intel Distribution for Python: scikit-learn on Compute Engine with 96 vCPUs.

Visit the GCP Console to create a new instance. To learn more, you can read the documentation for instructions on creating new virtual machines with the gcloud command line tool. 


At Google Cloud, we’re committed to helping customers access state-of-the-art compute infrastructure on GCP. Sign up for a free trial today and get $300 in free cloud credits to get started!

Get the most out of Google Kubernetes Engine with Priority and Preemption



Wouldn’t it be nice if you could ensure that your most important workloads always get the resources they need to run in a Kubernetes cluster? Now you can. Kubernetes 1.9 introduces an alpha feature called “priority and preemption” that allows you to assign priorities to your workloads, so that more important pods evict less important pods when the cluster is full.

Before priority and preemption, Kubernetes pods were scheduled purely on a first-come-first-served basis, and ran to completion (or forever, in the case of pods created by something like a Deployment or StatefulSet). This meant less important workloads could block more important, later-arriving, workloads from running—not the desired effect. Priority and preemption solves this problem.

Priority and preemption is valuable in a number of scenarios. For example, imagine you want to cap autoscaling to a maximum cluster size to control costs, or you have clusters that you can’t grow in real-time (e.g., because they are on-premises and you need to buy and install additional hardware). Or you have high-priority cloud workloads that need to scale up faster than the cluster autoscaler can add nodes. In short, priority and preemption lead to better resource utilization, lower costs and better service levels for critical applications.


Predictable cluster costs without sacrificing safety


In the past year, the Kubernetes community has made tremendous strides in system scalability and support for multi-tenancy. As a result, we see an increasing number of Kubernetes clusters that run both critical user-facing services (e.g., web servers, application servers, back-ends and other microservices in the direct serving path) and non-time-critical workloads (e.g., daily or weekly data analysis pipelines, one-off analytics jobs, developer experiments, etc.). Sharing a cluster in this way is very cost-effective because it allows the latter type of workload to partially or completely run in the “resource holes” that are unused by the former, but that you're still paying for. In fact, a study of Google’s internal workloads found that not sharing clusters between critical and non-critical workloads would increase costs by almost 60 percent. In the cloud, where node sizes are flexible and there's less resource fragmentation, we don’t expect such dramatic results from Kubernetes priority and preemption, but the general premise still holds.

The traditional approach to filling unused resources is to run less important workloads as BestEffort. But because the system does not explicitly reserve resources for BestEffort pods, they can be starved of CPU or killed if the node runs out of memory—even if they're only consuming modest amounts of resources.

A better alternative is to run all workloads as Burstable or Guaranteed, so that they receive a resource guarantee. That, however, leads to a tradeoff between predictable costs and safety against load spikes. For example, consider a user-facing service that experiences a traffic spike while the cluster is busy with non-time-critical analytics workloads. Without the priority and preemption capabilities, you might prioritize safety, by configuring the cluster autoscaler without an upper bound or with a very high upper bound. That way, it can handle the spike in load even while it’s busy with non-time-critical workloads. Alternately, you might pick predictability by configuring the cluster autoscaler with a tight bound, but that may prevent the service from scaling up sufficiently to handle unexpected load.

With the addition of priority and preemption, on the other hand, Kubernetes evicts pods from the non-time-critical workload when the cluster runs out of resources, allowing you to set an upper bound on cluster size without having to worry that the serving pipeline might not scale sufficiently to handle the traffic spike. Note that evicted pods receive a termination grace period before being killed, which is 30 seconds by default.

Even if you don’t care about the predictability vs. safety tradeoff, priority and preemption are still useful, because preemption evicts a pod faster than a cloud provider can usually provision a Kubernetes node. For example, imagine there's a load spike to a high-priority user-facing service, so the Horizontal Pod Autoscaler creates new pods to absorb the load. If there are low-priority workloads running in the cluster, the new, higher-priority pods can start running as soon as pod(s) from low-priority workloads are evicted; they don’t have to wait for the cluster autoscaler to create new nodes. The evicted low-priority pods start running again once the cluster autoscaler has added node(s) for them. (If you want to use priority and preemption this way, a good practice is to set a low termination grace period for your low-priority workloads, so the high-priority pods can start running quickly.)

Enabling priority and preemption on Kubernetes Engine


We recently made Kubernetes 1.9 available in Google Kubernetes Engine, and made priority and preemption available in alpha clusters. Here’s how to get started with this new feature:

  1. Create an alpha cluster—please note the cited limitations. 
  2. Follow the instructions to create at least two PriorityClasses in your Kubernetes cluster. 
  3. Create workloads (using Deployment, ReplicaSet, StatefulSet, Job, or whatever you like) with the priorityClassName field filled in, matching one of the PriorityClasses you created (a minimal sketch follows this list).
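Here's a minimal sketch of steps 2 and 3; the class names, priority values and container image are illustrative, not prescriptive:

priorities.yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
description: "User-facing services that must always be able to run."
---
apiVersion: scheduling.k8s.io/v1alpha1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000
description: "Non-time-critical batch and analytics workloads."

A workload then opts into a class by name in its pod spec (or a Deployment's pod template):

pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  priorityClassName: high-priority
  containers:
  - name: web
    image: gcr.io/my-project/frontend:v1   # illustrative image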

If you wish, you can also enable the cluster autoscaler and set a maximum cluster size. In that case your cluster will not grow above the configured maximum number of nodes, and higher-priority pods will evict lower-priority pods when the cluster reaches its maximum size and there are pending pods from the higher priority classes. If you don’t enable the cluster autoscaler, the priority and preemption behavior is the same, except that the cluster size is fixed.

Advanced technique: enforcing “filling the holes”


As we mentioned earlier, one of the motivations for priority and preemption is to allow non-time-critical workloads to “fill the resource holes” between important workloads on a node. To enforce this strictly, you can associate a workload with a PriorityClass whose priority is less than zero. Then the cluster autoscaler does not add the nodes necessary for that workload to run, even if the cluster is below the maximum size configured for the autoscaler.

Thus you can create three tiers of workloads of decreasing importance:

  • Workloads that can access the entire cluster up to the cluster autoscaler maximum size 
  • Workloads that can trigger autoscaling but that will be evicted if the cluster has reached the configured maximum size and higher-priority work needs to run
  • Workloads that will only “fill the cracks” in the resource usage of the higher-priority workloads, i.e., that will wait to run if they can’t fit into existing free resources.

And because PriorityClass maps to an integer, you can of course create many sub-tiers within these three categories.

Let us know what you think!


Priority and preemption are welcome additions in Kubernetes 1.9, making it easier for you to control your resource utilization, establish workload tiers and control costs. Priority and preemption is still an alpha feature. We’d love to know how you are using it, and any suggestions you might have for making it better. Please contact us at kubernetes-sig-scheduling@googlegroups.com.

To explore this new capability and other features of Kubernetes Engine, you can quickly get started using our 12-month free trial.

GPUs in Kubernetes Engine now available in beta



Last year we introduced our first GPU offering for Google Kubernetes Engine with the alpha launch of NVIDIA Tesla GPUs and received an amazing customer response. Today, GPUs in Kubernetes Engine are in beta and ready for broad use with the latest Kubernetes Engine release.

Using GPUs in Kubernetes Engine can turbocharge compute-intensive applications like machine learning (ML), image processing and financial modeling. By packaging your CUDA workloads into containers, you can benefit from the massive processing power of Kubernetes Engine’s GPUs whenever you need it, without having to manage hardware or even VMs.

With its best-in-class CPUs, GPUs, and now TPUs, Google Cloud provides the best choice, flexibility and performance for running ML workloads in the cloud. The ride-sharing pioneer Lyft, for instance, uses GPUs in Kubernetes Engine to accelerate training of its deep learning models.
"GKE clusters are ideal for deep learning workloads, with out-of-the box GPU integration, autoscaling clusters for our spiky training workloads, and integrated container logging and monitoring." 
— Luc Vincent, VP of Engineering at Lyft

Both the NVIDIA Tesla P100 and K80 GPUs are available as part of the beta—and V100s are on the way. Recently, we also introduced Preemptible GPUs as well as new lower prices to unlock new opportunities for you. Check out the latest prices for GPUs here.

Getting started with GPUs in Kubernetes Engine


Creating a cluster with GPUs in Kubernetes Engine is easy. From the Cloud Console, you can expand the machine type on the "Creating Kubernetes Cluster" page to select the types and the number of GPUs.
And if you want to add nodes with GPUs to your existing cluster, you can use the Node Pools and Cluster Autoscaler features. By using node pools with GPUs, your cluster can use GPUs whenever you need them. Autoscaler, meanwhile, can automatically create nodes with GPUs whenever pods requesting GPUs are scheduled, and scale down to zero when GPUs are no longer consumed by any active pods.

The following command creates a node pool with GPUs that can scale up to five nodes and down to zero nodes.

gcloud beta container node-pools create my-gpu-node-pool \
    --accelerator=type=nvidia-tesla-p100,count=1 \
    --cluster=my-existing-cluster --num-nodes 2 \
    --min-nodes 0 --max-nodes 5 --enable-autoscaling

Behind the scenes, Kubernetes Engine applies taint and toleration techniques to ensure only pods requesting GPUs will be scheduled on the nodes with GPUs, and prevent pods that don't require GPUs from running on them.
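A pod opts into those nodes simply by requesting GPUs as a resource limit. In the sketch below, the pod name and container image are placeholders for your own CUDA workload:

gpu-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-training-job
spec:
  containers:
  - name: trainer
    image: gcr.io/my-project/my-cuda-app:latest   # placeholder CUDA image
    resources:
      limits:
        nvidia.com/gpu: 1   # request one GPU; Kubernetes Engine handles the matching node taint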

While Kubernetes Engine does a lot of things behind the scenes for you, we also want you to understand how your GPU jobs are performing. Kubernetes Engine exposes metrics for containers using GPUs, such as how busy the GPUs are, how much memory is available, and how much memory is allocated. You can also visualize these metrics by using Stackdriver.

Figure 1: GPU duty cycle for three different jobs

For a more detailed explanation of Kubernetes Engine with GPUs, for example installing NVIDIA drivers and how to configure a pod to consume GPUs, check out the documentation.

Tackling new workloads with Kubernetes


In 2017, Kubernetes Engine core-hours grew 9X year over year, and the platform is gaining momentum as a premier deployment platform for ML workloads. We’re very excited about open source projects like Kubeflow that make it easy, fast and extensible to run ML stacks in Kubernetes. We hope that the combination of these open-source ML projects and GPUs in Kubernetes Engine will help you innovate in business, engineering and science.

Try it today


To get started using GPUs in Kubernetes Engine with our free trial of $300 in credits, you’ll need to upgrade your account and apply for GPU quota.

Thanks for the support and feedback in shaping our roadmap to better serve your needs. Keep the conversation going, and connect with us on the Kubernetes Engine Slack channel.