Tag Archives: Docker

An easier way to move your App Engine apps to Cloud Run

Posted by Wesley Chun (@wescpy), Developer Advocate, Google Cloud


An easier yet still optional migration

In the previous episode of the Serverless Migration Station video series, developers learned how to containerize their App Engine code for Cloud Run using Docker. While Docker has gained popularity over the past decade, not everyone has containers integrated into their daily development workflow; some prefer "containerless" solutions but know that containers can be beneficial. Well, today's video is just for you, showing how you can still get your apps onto Cloud Run even if you don't have much experience with Docker, containers, or Dockerfiles.

App Engine isn't going away: Google has expressed long-term support for legacy runtimes on the platform, so those who prefer source-based deployments can stay where they are, making this an optional migration. Moving to Cloud Run is for those who want to explicitly adopt containerization.

Migrating to Cloud Run with Cloud Buildpacks video

So how can apps be containerized without Docker? The answer is buildpacks, an open-source technology that makes it fast and easy for you to create secure, production-ready container images from source code, without a Dockerfile. Google Cloud Buildpacks adheres to the buildpacks open specification and allows users to create images that run on all GCP container platforms: Cloud Run (fully-managed), Anthos, and Google Kubernetes Engine (GKE). If you want to containerize your apps while staying focused on building your solutions and not how to create or maintain Dockerfiles, Cloud Buildpacks is for you.
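
If you want to try buildpacks outside of this migration, the open-source pack CLI can build a container image locally from source with Google's public builder image. A minimal sketch (the app name is yours to choose):

pack build my-app --builder gcr.io/buildpacks/builder:v1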

In the last video, we showed developers how to containerize a Python 2 Cloud NDB app as well as a Python 3 Cloud Datastore app. We targeted those specific implementations because Python 2 users are more likely to be using App Engine's ndb or Cloud NDB to connect with their app's Datastore while Python 3 developers are most likely using Cloud Datastore. Cloud Buildpacks do not support Python 2, so today we're targeting a slightly different audience: Python 2 developers who have migrated from App Engine ndb to Cloud NDB and who have ported their apps to modern Python 3 but now want to containerize them for Cloud Run.

Developers familiar with App Engine know that a default HTTP server is provided and started automatically; however, if special launch instructions are needed, users can add an entrypoint directive to their app.yaml files, as illustrated below. When those App Engine apps are containerized for Cloud Run, developers must bundle their own server and provide startup instructions, which is the purpose of the ENTRYPOINT directive in the Dockerfile, also shown below.

Starting your web server with App Engine (app.yaml) and Cloud Run with Docker (Dockerfile) or Buildpacks (Procfile)

In this migration, there is no Dockerfile. While Cloud Buildpacks does the heavy-lifting, determining how to package your app into a container, it still needs to be told how to start your service. This is exactly what a Procfile is for, represented by the last file in the image above. As specified, your web server will be launched in the same way as in app.yaml and the Dockerfile above; these config files are deliberately juxtaposed to expose their similarities.
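
As a sketch of what those three files might contain (the gunicorn command and the main:app module name are illustrative assumptions, not copied from the sample):

# app.yaml (App Engine): optional entrypoint directive
runtime: python38
entrypoint: gunicorn -b :$PORT main:app

# Dockerfile (Cloud Run with Docker): startup instructions via ENTRYPOINT
ENTRYPOINT gunicorn -b :$PORT main:app

# Procfile (Cloud Run with Buildpacks): the same launch command
web: gunicorn -b :$PORT main:app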

Other than this swapping of configuration files and the expected lack of a .dockerignore file, the Python 3 Cloud NDB app containerized for Cloud Run is nearly identical to the Python 3 Cloud NDB App Engine app we started with. Cloud Run's build-and-deploy command (gcloud run deploy) will use a Dockerfile if present but otherwise selects Cloud Buildpacks to build and deploy the container image. The user experience is the same, only without the time and challenges required to maintain and debug a Dockerfile.
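
For example (service name hypothetical; assuming a recent gcloud release), a single command builds with Buildpacks and deploys when no Dockerfile is present:

gcloud run deploy my-service --source .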

Get started now

If you're considering containerizing your App Engine apps without having to know much about containers or Docker, we recommend you try this migration on a sample app like ours before attempting it on yours. In addition to the video, a corresponding codelab is provided to lead you step-by-step through this exercise.

All migration modules, their videos (when available), codelab tutorials, and source code can be found in the migration repo. While our content initially focuses on Python users, we hope to one day cover other legacy runtimes as well, so stay tuned. Containerization may seem daunting, but the goal is for Cloud Buildpacks and migration resources like these to aid you in your quest to modernize your serverless apps!

Containerizing Google App Engine apps for Cloud Run

Posted by Wesley Chun (@wescpy), Developer Advocate, Google Cloud


An optional migration

Serverless Migration Station is a video mini-series from Serverless Expeditions focused on helping developers modernize their applications running on a serverless compute platform from Google Cloud. Previous episodes demonstrated how to migrate away from the older, legacy App Engine (standard environment) services to newer Google Cloud standalone equivalents like Cloud Datastore. Today's product crossover episode differs slightly from that by migrating away from App Engine altogether, containerizing those apps for Cloud Run.

There's little question the industry has been moving towards containerization as an application deployment mechanism over the past decade. However, Docker and the use of containers weren't available to early App Engine developers until its flexible environment became available years later. Fast forward to today, where developers have many more options from an increasingly open Google Cloud. Google has expressed long-term support for App Engine, and users do not need to containerize their apps, so this is an optional migration. It is primarily for those who have decided to add containerization to their application deployment strategy and want to explicitly migrate to Cloud Run.

If you're thinking about app containerization, the video covers some of the key reasons why you would consider it: you're not subject to traditional serverless restrictions like development language or use of binaries (flexibility); if your code, dependencies, and container build & deploy steps haven't changed, you can recreate the same image with confidence (reproducibility); your application can be deployed elsewhere or rolled back to a previous working image if necessary (reusability); and you have plenty more options on where to host your app (portability).

Migration and containerization

Legacy App Engine services are available through a set of proprietary, bundled APIs. As you can surmise, those services are not available on Cloud Run. So if you want to containerize your app for Cloud Run, it must be "ready to go," meaning it has migrated to either Google Cloud standalone equivalents or other third-party alternatives. For example, in a recent episode, we demonstrated how to migrate from App Engine ndb to Cloud NDB for Datastore access.

While we've recently begun to produce videos for such migrations, developers can already access code samples and codelab tutorials leading them through a variety of migrations. In today's video, we have both Python 2 and 3 sample apps that have divested from legacy services and are thus ready to containerize for Cloud Run. Python 2 App Engine apps accessing Datastore are most likely to be using Cloud NDB, whereas Python 3 users would be using Cloud Datastore, so those are the starting points for this migration.

Because we're "only" switching execution platforms, there are no changes at all to the application code itself. This entire migration is completely based on changing the apps' configurations from App Engine to Cloud Run. In particular, App Engine artifacts such as app.yaml, appengine_config.py, and the lib folder are not used in Cloud Run and will be removed. A Dockerfile will be implemented to build your container. Apps with more complex configurations in their app.yaml files will likely need an equivalent service.yaml file for Cloud Run — if so, you'll find this app.yaml to service.yaml conversion tool handy. Following best practices means there'll also be a .dockerignore file.
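
A minimal .dockerignore sketch (entries assumed for illustration) keeps the removed App Engine artifacts and local cruft out of the container build context:

# .dockerignore
*.pyc
__pycache__/
.git/
app.yaml
appengine_config.py
lib/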

App Engine and Cloud Functions are source-based, where Google Cloud automatically provides a default HTTP server like gunicorn. Cloud Run is a bit more "DIY" because users have to provide a container image, meaning they must bundle their own server. In this case, we'll pick gunicorn explicitly, adding it to the top of the existing requirements.txt required packages file(s), as you can see in the screenshot below. Also illustrated is the Dockerfile, where gunicorn is started to serve your app as the final step. The only differences for the Python 2 equivalent Dockerfile are that it: a) requires the Cloud NDB package (google-cloud-ndb) instead of Cloud Datastore, and b) starts with a Python 2 base image.

The Python 3 requirements.txt and Dockerfile
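
As a rough sketch of what the screenshot shows (the base image tag, full package list, and main:app module name are assumptions, not copied from the original):

# requirements.txt
gunicorn
flask
google-cloud-datastore

# Dockerfile
FROM python:3-slim
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
ENTRYPOINT gunicorn -b :$PORT main:app

Per the differences noted above, the Python 2 version would list google-cloud-ndb instead and start FROM a Python 2 base image.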

Next steps

To walk developers through migrations, we always "START" with a working app, then make the necessary updates that culminate in a working "FINISH" app. For this migration, the Python 2 sample app STARTs with the Module 2a code and FINISHes with the Module 4a code. Similarly, the Python 3 app STARTs with the Module 3b code and FINISHes with the Module 4b code. This way, if something goes wrong during your migration, you can always roll back to START or compare your solution with our FINISH. If you are considering this migration for your own applications, we recommend you try it on a sample app like ours first. In addition to the video, a corresponding codelab is provided to lead you step-by-step through this exercise.

All migration modules, their videos (when published), codelab tutorials, START and FINISH code, etc., can be found in the migration repo. We hope to one day cover other legacy runtimes like Java 8 as well, so stay tuned. We'll continue our journey from App Engine to Cloud Run in Module 5, but will do so without requiring explicit knowledge of containers, Docker, or Dockerfiles. Modernizing your development workflow to use containers and best practices like crafting a CI/CD pipeline isn't always straightforward; we hope content like this helps you progress in that direction!

Container Structure Tests: Unit Tests for Docker Images

Usage of containers in software applications is on the rise, and with their increasing usage in production comes a need for robust testing and validation. Containers provide great testing environments, but actually validating the structure of the containers themselves can be tricky. The Docker toolchain provides us with easy ways to interact with the container images themselves, but no real way of verifying their contents. What if we want to ensure a set of commands runs successfully inside of our container, or check that certain files are in the correct place with the correct contents, before shipping?

The Container Tools team at Google is happy to announce the release of the Container Structure Test framework. This framework provides a convenient and powerful way to verify the contents and structure of your containers. We’ve been using this framework at Google to test all of our team’s released containers for over a year now, and we’re excited to finally share it with the public.

The framework supports four types of tests:
  • Command Tests - to run a command inside your container image and verify the output or error it produces
  • File Existence Tests - to check the existence of a file in a specific location in the image’s filesystem
  • File Content Tests - to check the contents and metadata of a file in the filesystem
  • A unique Metadata Test - to verify configuration and metadata of the container itself
Users can specify test configurations through YAML or JSON. This provides a way to abstract away the test configuration from the implementation of the tests, eliminating the need for hacky bash scripting or other solutions to test container images.

Command Tests

The Command Tests give the user a way to specify a set of commands to run inside of a container, and verify that the output, error, and exit code were as expected. An example configuration looks like this:
globalEnvVars:
  - key: "VIRTUAL_ENV"
    value: "/env"
  - key: "PATH"
    value: "/env/bin:$PATH"

commandTests:

  # check that the python binary is in the correct location
  - name: "python installation"
    command: "which"
    args: ["python"]
    expectedOutput: ["/usr/bin/python\n"]

  # set up a virtualenv, and verify the correct python binary is run
  - name: "python in virtualenv"
    setup: [["virtualenv", "/env"]]
    command: "which"
    args: ["python"]
    expectedOutput: ["/env/bin/python\n"]

  # set up a virtualenv, install gunicorn, and verify the installation
  - name: "gunicorn flask"
    setup: [["virtualenv", "/env"],
            ["pip", "install", "gunicorn", "flask"]]
    command: "which"
    args: ["gunicorn"]
    expectedOutput: ["/env/bin/gunicorn"]

Regexes are used to match the expected output and error of each command (or the excluded output/error, if you want to make sure something didn't happen). Additionally, setup and teardown commands can be run with each individual test, and environment variables can be specified for each individual test or globally for the entire test run (as shown in the example).

File Tests

File Tests allow users to verify the contents of an image’s filesystem. We can check for existence of files, as well as examine the contents of individual files or directories. This can be particularly useful for ensuring that scripts, config files, or other runtime artifacts are in the correct places before shipping and running a container.

fileExistenceTests:

  # check that the apt-packages text file exists and has the correct permissions
  - name: 'apt-packages'
    path: '/resources/apt-packages.txt'
    shouldExist: true
    permissions: '-rw-rw-r--'

Expected permissions and file mode can be specified for each file path in the form of a standard Unix permission string. As with the Command Tests' "Excluded Output/Error," a boolean can be provided to these tests to tell the framework to make sure a file is not present in the filesystem.

Additionally, the File Content Tests verify the contents of files and directories in the filesystem. This can be useful for checking package or repository versions, or config file contents among other things. Following the pattern of the previous tests, regexes are used to specify the expected or excluded contents.

fileContentTests:

  # check that the default apt repository is set correctly
  - name: 'apt sources'
    path: '/etc/apt/sources.list'
    expectedContents: ['.*httpredir\.debian\.org/debian jessie main.*']

  # check that the retry policy is correctly specified
  - name: 'retry policy'
    path: '/etc/apt/apt.conf.d/apt-retry'
    expectedContents: ['Acquire::Retries 3;']

Metadata Test

Unlike the previous tests, which can each be specified any number of times, the Metadata Test is a singleton test that verifies a container's configuration. This is useful for making sure things specified in the Dockerfile (e.g. entrypoint, exposed ports, mounted volumes, etc.) are manifested correctly in a built container.

metadataTest:
  env:
    - key: "VIRTUAL_ENV"
      value: "/env"
  exposedPorts: ["8080", "2345"]
  volumes: ["/test"]
  entrypoint: []
  cmd: ["/bin/bash"]
  workdir: ["/app"]

Tiny Images

One interesting case we've put particular focus on supporting is "tiny images." We think keeping container sizes small is important, and sometimes the bare minimum in a container image might even exclude a shell. Users might be used to running something like:
`docker run -d image "cat /etc/apt/sources.list && grep -rn 'httpredir.debian.org'"`
… but this breaks without a working shell in a container. With the structure test framework, we convert images to in-memory filesystem representations, so no shell is needed to examine the contents of an image!

Dockerless Test Runs

At their core, Docker images are just bundles of tarballs. One of the major use cases for these tests is running in CI systems, and often we can't guarantee that we'll have access to a working Docker daemon in these environments. To address this, we created a tar-based test driver, which can handle the execution of all file-related tests through simple tar manipulation. Command tests are currently not supported in this mode, since running commands in a container requires a container runtime.

This means that using the tar driver, we can retrieve images from a remote registry, convert them into filesystems on disk, and verify file contents and metadata all without a working Docker daemon on the host! Our container-diff library is leveraged here to do all the image processing; see our previous blog post for more information.
structure-test -test.v -driver tar -image gcr.io/google-appengine/python:latest structure-test-examples/python/python_file_tests.yaml

Running in Bazel

Structure tests can also be run natively through Bazel, using the “container_test” rule. Bazel provides convenient build rules for building Docker images, so the structure tests can be run as part of a build to ensure any new built images are up to snuff before being released. Check out this example repo for a quick demonstration of how to incorporate these tests into a Bazel build.
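
A minimal BUILD file sketch (the rule is loaded from rules_docker's contrib package; target and file names here are hypothetical):

load("@io_bazel_rules_docker//contrib:test.bzl", "container_test")

container_test(
    name = "my_image_structure_test",
    image = ":my_image",
    configs = ["structure_tests.yaml"],
)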

We think this framework can be useful for anyone building and deploying their own containers in the wild, and hope that it can promote their usage everywhere through increasing the robustness of containers. For more detailed information on the test specifications, check out the documentation in our GitHub repository.

By Nick Kubala, Container Tools team

Introducing container-diff, a tool for quickly comparing container images

Originally posted by Nick Kubala, Colette Torres, and Abby Tisdale from the Container Tools team, on the Google Open Source Blog

The Google Container Tools team originally built container-diff, a new project to help uncover differences between container images, to aid our own development with containers. We think it can be useful for anyone building containerized software, so we're excited to release it as open source to the development community.

Containers and the Dockerfile format help make customization of an application's runtime environment more approachable and easier to understand. While this is a great advantage of using containers in software development, a major drawback is that it can be hard to visualize what changes in a container image will result from a change in the respective Dockerfile. This can lead to bloated images and make tracking down issues difficult.

Imagine a scenario where a developer is working on an application, built on a runtime image maintained by a third-party. During development someone releases a new version of that base image with updated system packages. The developer rebuilds their application and picks up the latest version of the base image, and suddenly their application stops working; it depended on a previous version of one of the installed system packages, but which one? What version was it on before? With no currently existing tool to easily determine what changed between the two base image versions, this totally stalls development until the developer can track down the package version incompatibility.

Introducing container-diff

container-diff helps users investigate image changes by computing semantic diffs between images. What this means is that container-diff figures out on a low-level what data changed, and then combines this with an understanding of package manager information to output this information in a format that's actually readable to users. The tool can find differences in system packages, language-level packages, and files in a container image.

Users can specify images in several formats - from the local Docker daemon (using the prefix `daemon://` on the image path), from a remote registry (using the prefix `remote://`), or from a .tar file in the format exported by the `docker save` command. You can also combine these formats to compute the diff between a local version of an image and a remote version. This can be useful when experimenting with new builds of an image that you might not be quite ready to push yet. container-diff supports image tarballs and the registry protocol natively, enabling it to run in environments without a Docker daemon.
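
For example (image names hypothetical), comparing a locally built image against the copy already pushed to a registry:

container-diff diff daemon://my-image:latest remote://gcr.io/my-project/my-image:latest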

Examples and Use Cases

Here is a basic Dockerfile that installs Python inside our Debian base image. Running container-diff on the base image and the new one with Python, users can see all the apt packages that were installed as dependencies of Python.

➜  debian_with_python cat Dockerfile
FROM gcr.io/google-appengine/debian8
RUN apt-get update && apt-get install -qq --force-yes python
➜  debian_with_python docker build -q -t debian_with_python .
sha256:be2cd1ae6695635c7041be252589b73d1539a858c33b2814a66fe8fa4b048655
➜  debian_with_python container-diff diff gcr.io/google-appengine/debian8:latest daemon://debian_with_python:latest

-----Apt-----
Packages found only in gcr.io/google-appengine/debian8:latest: None

Packages found only in debian_with_python:latest:
NAME                    VERSION              SIZE
-file                   1:5.22+15-2+deb8u3   76K
-libexpat1              2.1.0-6+deb8u4       386K
-libffi6                3.1-2+deb8u1         43K
-libmagic1              1:5.22+15-2+deb8u3   3.1M
-libpython-stdlib       2.7.9-1              54K
-libpython2.7-minimal   2.7.9-2+deb8u1       2.6M
-libpython2.7-stdlib    2.7.9-2+deb8u1       8.2M
-libsqlite3-0           3.8.7.1-1+deb8u2     877K
-mime-support           3.58                 146K
-python                 2.7.9-1              680K
-python-minimal         2.7.9-1              163K
-python2.7              2.7.9-2+deb8u1       360K
-python2.7-minimal      2.7.9-2+deb8u1       3.7M

Version differences: None

And below is a Dockerfile that inherits from our Python base runtime image, and then installs the mock and six packages inside of it. Running container-diff with the pip differ, users can see all the Python packages that have either been installed or changed as a result of this:

➜  python_upgrade cat Dockerfile
FROM gcr.io/google-appengine/python
RUN pip install -U mock six
➜  python_upgrade docker build -q -t python_upgrade .
sha256:7631573c1bf43727d7505709493151d3df8f98c843542ed7b299f159aec6f91f
➜  python_upgrade container-diff diff gcr.io/google-appengine/python:latest daemon://python_upgrade:latest --types=pip

-----Pip-----

Packages found only in gcr.io/google-appengine/python:latest: None

Packages found only in python_upgrade:latest:
NAME        VERSION   SIZE
-funcsigs   1.0.2     51.4K
-mock       2.0.0     531.2K
-pbr        3.1.1     471.1K

Version differences:
PACKAGE   IMAGE1 (gcr.io/google-appengine/python:latest)   IMAGE2 (python_upgrade:latest)
-six      1.8.0, 26.7K

This can be especially useful when it's unclear which packages might have been installed or changed incidentally as a result of dependency management of Python modules.

These are just a few examples. The tool currently has support for Python and Node.js packages installed via pip and npm, respectively, as well as comparison of image filesystems and Docker history. In the future, we'd like to see support added for additional runtime and language differs, including Java, Go, and Ruby. External contributions are welcome! For more information on contributing to container-diff, see this how-to guide.

Now that we've seen container-diff compare two images in action, it's easy to imagine how the tool may be integrated into larger workflows to aid in development:

  • Changelog generation: Given container-diff's capacity to facilitate investigation of filesystem and package modifications, it can do most of the heavy lifting in discerning changes for automatic changelog generation for new releases of an image.
  • Continuous integration: As part of a CI system, users can leverage container-diff to catch potentially breaking filesystem changes resulting from a Dockerfile change in their builds.

container-diff's default output mode is "human-readable," but it also supports output to JSON, allowing for easy automated parsing and processing by users.
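
For instance (assuming the CLI's JSON flag), the pip comparison above could be re-run for machine consumption:

container-diff diff gcr.io/google-appengine/python:latest daemon://python_upgrade:latest --types=pip --json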

Single Image Analysis

In addition to comparing two images, container-diff has the ability to analyze a single image on its own. This can enable users to get a quick glance at information about an image, such as its system and language-level package installations and filesystem contents.

Let's take a look at our Debian base image again. We can use the tool to easily view a list of all packages installed in the image, along with each one's installed version and size:

➜  Development container-diff analyze gcr.io/google-appengine/debian8:latest

-----Apt-----
Packages found in gcr.io/google-appengine/debian8:latest:
NAME                      VERSION           SIZE
-acl                      2.2.52-2          258K
-adduser                  3.113+nmu3        1M
-apt                      1.0.9.8.4         3.1M
-base-files               8+deb8u9          413K
-base-passwd              3.5.37            185K
-bash                     4.3-11+deb8u1     4.9M
-bsdutils                 1:2.25.2-6        181K
-ca-certificates          20141019+deb8u3   367K
-coreutils                8.23-4            13.9M
-dash                     0.5.7-4+b1        191K
-debconf                  1.5.56+deb8u1     614K
-debconf-i18n             1.5.56+deb8u1     1.1M
-debian-archive-keyring   2017.5~deb8u1     137K

We could use this to verify compatibility with an application we're building, or maybe sort the packages by size in another one of our images and see which ones are taking up the most space.

For more information about this tool as well as a breakdown with examples, uses, and inner workings of the tool, please take a look at documentation on our GitHub page. Happy diffing!

Special thanks to Colette Torres and Abby Tisdale, our software engineering interns who helped build the tool from the ground up.


Docker + Dataflow = happier workflows

When I first saw the Google Cloud Dataflow monitoring UI -- with its visual flow execution graph that updates as your job runs, and convenient links to the log messages -- the idea came to me. What if I could take that UI, and use it for something it was never built for? Could it be connected with open source projects aimed at promoting reproducible scientific analysis, like Common Workflow Language (CWL) or Workflow Definition Language (WDL)?
Screenshot of a Dockerflow workflow for DNA sequence analysis.

In scientific computing, it’s really common to submit jobs to a local high-performance computing (HPC) cluster. There are tools to do that in the cloud, like Elasticluster and Starcluster. They replicate the local way of doing things, which means they require a bunch of infrastructure setup and management that the university IT department would otherwise do. Even after you’re set up, you still have to ssh into the cluster to do anything. And then there are a million different choices for workflow managers, each unsatisfactory in its own special way.

By day, I’m a product manager. I hadn’t done any serious coding in a few years. But I figured it shouldn’t be that hard to create a proof-of-concept, just to show that the Apache Beam API that Dataflow implements can be used for running scientific workflows. Now, Dataflow was created for a different purpose, namely, to support scalable data-parallel processing, like transforming giant data sets, or computing summary statistics, or indexing web pages. To use Dataflow for scientific workflows would require wrapping up shell steps that launch VMs, run some code, and shuttle data back and forth from an object store. It should be easy, right?

It wasn’t so bad. Over the weekend, I downloaded the Dataflow SDK, ran the wordcount examples, and started modifying. I had a “Hello, world” proof-of-concept in a day.

To really run scientific workflows would require more, of course. Varying VM shapes, a way to pass parameters from one step to the next, graph definition, scattering and gathering, retries. So I shifted into prototyping mode.

I created a new GitHub project called Dockerflow. With Dockerflow, workflows can be defined in YAML files. They can also be written in pretty compact Java code. You can run a batch of workflows at once by providing a CSV file with one row per workflow to define the parameters.
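
For instance, a hypothetical batch file for the "Hello, World" example shown below would have one column per input parameter and one row per workflow run:

message
Hello from run 1
Hello from run 2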

Dataflow and Docker complement each other nicely:

  • Dataflow provides a fully managed service with a nice monitoring interface, retries, graph optimization, and other niceties.
  • Docker provides portability of the tools themselves, and there's a large library of packaged tools already available as Docker images.

While Dockerflow supports a simple YAML workflow definition, a similar approach could be taken to implement a runner for one of the open standards like CWL or WDL.

To get a sense of working with Dockerflow, here’s “Hello, World” written in YAML:

defn:
  name: HelloWorkflow
steps:
- defn:
    name: Hello
    inputParameters:
      name: message
      defaultValue: Hello, World!
    docker:
      imageName: ubuntu
      cmd: echo $message

And here’s the same example written in Java:

public class HelloWorkflow implements WorkflowDefn {
  @Override
  public Workflow createWorkflow(String[] args) throws IOException {
    Task hello =
        TaskBuilder.named("Hello").input("message", "Hello, World!").docker("ubuntu").script("echo $message").build();
    return TaskBuilder.named("HelloWorkflow").steps(hello).args(args).build();
  }
}

Dockerflow is just a prototype at this stage, though it can run real workflows and includes many nice features, like dry runs, resuming failed runs from mid-workflow, and, of course, the nice UI. It uses Cloud Dataflow in a way that was never intended -- to run scientific batch workflows rather than large-scale data-parallel workloads. I wish I’d written it in Python rather than Java. The Dataflow Python SDK wasn’t quite as mature when I started.

Which is all to say, it’s been a great 20% project, and the future really depends on whether it solves a problem people have, and if others are interested in improving on it. We welcome your contributions and comments! How do you run and monitor scientific workflows today?

By Jonathan Bingham, Google Genomics and Verily Life Sciences