11 months ago, Google and a number of partners donated a giant pile of code to the Apache Software Foundation, thus forming the incubating Beam project. The bulk of this code composed the Google Cloud Dataflow SDK: the libraries that developers used to write streaming and batch pipelines that ran on any supported execution engine. At the time, the main supported engine was Google’s Cloud Dataflow service (with support for Apache Spark and Apache Flink in development); as of today there are five officially supported runners. Though there were many motivations behind the creation of Apache Beam, the one at the heart of everything was a desire to build an open and thriving community and ecosystem around this powerful model for data processing that so many of us at Google spent years refining. But taking a project with over a decade of engineering momentum behind it from within a single company and opening it to the world is no small feat. That’s why I feel today’s announcement is so meaningful.
With that context in mind, let’s look at some statistics squirreled away in the graduation maturity model assessment:
- Out of the ~22 large modules in the codebase, at least 10 modules have been developed from scratch by the community, with little to no contribution from Google.
- Since September, no single organization has had more than ~50% of the unique contributors per month.
- The majority of new committers added during incubation came from outside Google.
And for good measure, here’s a quote from the Vice President of the Apache Incubator, lifted from the public Apache incubator general discussions list where Beam’s graduation was first proposed:
“In my day job as well as part of my work at Apache, I have been very impressed at the way that Google really understands how to work with open source communities like Apache. The Apache Beam project is a great example of this and is a great example of how to build a community.” -- Ted Dunning, Vice President of Apache Incubator
The point I’m trying to make here is this: while Google’s commitment to Apache Beam remains as strong as it always has been, everyone involved (both within Google and without) has done an excellent job of building an open source project that’s truly open in the best sense of the word.
This is what makes open source software amazing: people coming together to build great, practical systems for everyone to use because the work is exciting, useful and relevant. This is the core reason I was so excited about us creating Apache Beam in the first place, the reason I’m proud to have played some small part in that journey, and the reason I’m so grateful for all the work the community has invested in making the project a reality.
Naturally, graduation is only one milestone in the lifetime of the project, and we have many more ahead of us, but becoming a top-level project is an indication that Apache Beam now has a development community that is ready for prime time.
That means we’re ready to continue pushing forward the state of the art in stream and batch processing. We’re ready to bring the promise of portability to programmatic data processing, much in the way SQL has done for declarative data analysis. We’re ready to build the things that never would have gotten built had this project stayed confined within the walls of Google. And last but perhaps not least, we’re ready to recoup the vast quantities of text space previously consumed by the mandatory “(incubating)” moniker accompanying all of our initial mentions of Apache Beam!
But seriously, whatever your motivation, please consider joining us along the way. We have an exciting road ahead.
By Tyler Akidau, Apache Beam PMC and Staff Software Engineer at Google