Tag Archives: Open source

Introducing Data Transfer Project: an open source platform promoting universal data portability

In 2007, a small group of engineers in our Chicago office formed the Data Liberation Front, a team that believed consumers should have better tools to put their data where they want, when they want, and even move it to a different service. This idea, called “data portability,” gives people greater control of their information, and pushes us to develop great products because we know they can pack up and leave at any time.

In 2011, we launched Takeout, a new way for Google users to download or transfer a copy of the data they store or create in a variety of industry-standard formats. Since then, we've continued to invest in Takeout—we now call it Download Your Data—and today, our users can download a machine-readable copy of the data they have stored in 50+ Google products, with more on the way.

Now, we’re taking our commitment to portability a step further. In tandem with Microsoft, Twitter, and Facebook we’re announcing the Data Transfer Project, an open source initiative dedicated to developing tools that will enable consumers to transfer their data directly from one service to another, without needing to download and re-upload it. Download Your Data users can already do this; they can transfer their information directly to their Dropbox, Box, MS OneDrive, and Google Drive accounts today. With this project, the development of which we mentioned in our blog post about preparations for the GDPR, we’re looking forward to working with companies across the industry to bring this type of functionality to individuals across the web.

Our approach

The organizations involved with this project are developing tools that can convert any service's proprietary APIs to and from a small set of standardized data formats that can be used by anyone. This makes it possible to transfer data between any two providers using existing industry-standard infrastructure and authorization mechanisms, such as OAuth. So far, we have developed adapters for seven different service providers across five different types of consumer data; we think this demonstrates the viability of this approach to scale to a large number of use cases.

Consumers will benefit from improved flexibility and control over their data. They will be able to import their information into any participating service that offers compelling features—even brand new ones that could rely on powerful, cloud-based infrastructure rather than the consumers’ potentially limited bandwidth and capability to transfer files. Services will benefit as well, as they will be able to compete for users that can move their data more easily.

Protecting users’ data and keeping them in control

Data security and privacy are foundational to the design of the Data Transfer Project. Services must first agree to allow data transfer between them, and then they will require that individuals authenticate each account independently. All credentials and user data will be encrypted both in transit and at rest. The protocol uses a form of perfect forward secrecy where a new unique key is generated for each transfer. Additionally, the framework allows partners to support any authorization mechanism they choose. This enables partners to leverage their existing security infrastructure when authorizing accounts.

As it is an open source product, anyone can inspect the code to verify that data isn't being collected or used for profiling purposes. Tech savvy consumers are also free to download and run an instance of the framework themselves. Interested parties can learn more at the Data Transfer Project website, which explains the technical foundations behind the project and goes into greater detail on how it works.

How to get involved

It is very early days for the Data Transfer Project and we encourage the developer community to join us and help extend the platform to support many more data types, service providers, and hosting solutions.

The Data Transfer Project’s open source code can be found at datatransferproject.dev and you can learn more about Google’s approach to portability in our paper, where we describe our history with this topic and the values and principles that motivated us to invest in the Data Transfer Project. Our prototype already supports data transfer for several product verticals including: photos, mail, contacts, calendar, and tasks. These are enabled by existing, publicly available APIs from Google, Microsoft, Twitter, Flickr, Instagram, Remember the Milk, and Smugmug.

Data portability makes it easy for consumers to try new services and use the ones that they like best. We’re thrilled to help drive an initiative that incentivizes companies large and small to continue innovating across the internet. We’re just getting started and we’re looking forward to what comes next.

By Brian Willard, Software Engineer and Greg Fair, Product Manager

Accelerated Training and Inference with the Tensorflow Object Detection API



Last year we announced the TensorFlow Object Detection API, and since then we’ve released a number of new features, such as models learned via Neural Architecture Search, instance segmentation support and models trained on new datasets such as Open Images. We have been amazed at how it is being used – from finding scofflaws on the streets of NYC to diagnosing diseases on cassava plants in Tanzania.
Today, as part of Google’s commitment to democratizing computer vision, and using feedback from the research community on how to make this codebase even more useful, we’re excited to announce a number of additions to our API. Highlights of this release include:
  • Support for accelerated training of object detection models via Cloud TPUs
  • Improving the mobile deployment process by accelerating inference and making it easy to export a model to mobile with the TensorFlow Lite format
  • Several new model architecture definitions including:
Additionally, we are releasing pre-trained weights for each of the above models based on the COCO dataset.

Accelerated Training via Cloud TPUs
Users spend a great deal of time on optimizing hyperparameters and retraining object detection models, therefore having fast turnaround times on experiments is critical. The models released today belong to the single shot detector (SSD) class of architectures that are optimized for training on Cloud TPUs. For example, we can now train a ResNet-50 based RetinaNet model to achieve 35% mean Average Precision (mAP) on the COCO dataset in < 3.5 hrs.
Accelerated Inference via Quantization and TensorFlow Lite 
To better support low-latency requirements on mobile and embedded devices, the models we are providing are now natively compatible with TensorFlow Lite, which enables on-device machine learning inference with low latency and a small binary size. As part of this, we have implemented: (1) model quantization and (2) detection-specific operations natively in TensorFlow Lite. Our model quantization follows the strategy outlined in Jacob et al. (2018) and the whitepaper by Krishnamoorthi (2018) which applies quantization to both model weights and activations at training and inference time, yielding smaller models that run faster.
Quantized detection models are faster and smaller (e.g., a quantized 75% depth-reduced SSD Mobilenet model runs at >15 fps on a Pixel 2 CPU with a 4.2 Mb footprint) with minimal loss in detection accuracy compared to the full floating point model.
Try it Yourself with a New Tutorial!
To get started training your own model on Cloud TPUs, check out our new tutorial! This walkthrough will take you through the process of training a quantized pet face detector on Cloud TPU then exporting it to an Android phone for inference via TensorFlow Lite conversion.

We hope that these new additions will help make high-quality computer vision models accessible to anyone wishing to solve an object detection problem, and provide a more seamless user experience, from training a model with quantization to exporting to a TensorFlow Lite model ready for on-device deployment. We would like to thank everyone in the community who have contributed features and bug fixes. As always, contributions to the codebase are welcome, and please stay tuned for more updates!

Acknowledgements
This post reflects the work of the following group of core contributors: Derek Chow, Aakanksha Chowdhery, Jonathan Huang, Pengchong Jin, Zhichao Lu, Vivek Rathod, Ronny Votel and Xiangxin Zhu. We would also like to thank the following colleagues: Vasu Agrawal, Sourabh Bajaj, Chiachen Chou, Tom Jablin, Wenzhe Li, Tsung-Yi Lin, Hernan Moraldo, Kevin Murphy, Sara Robinson, Andrew Selle, Shashi Shekhar, Yash Sonthalia, Zak Stone, Pete Warden and Menglong Zhu.

Source: Google AI Blog


Googlers on the road: CLS and OSCON 2018

Next week a veritable who’s who of free and open source software luminaries, maintainers and developers will gather to celebrate the 20th annual OSCON and the 20th anniversary of the Open Source Definition. Naturally, the Google Open Source and Google Cloud teams will be there too!

Program chairs at OSCON 2017, left to right:
Rachel Roumeliotis, Kelsey Hightower, Scott Hanselman.
Photo used with permission from O'Reilly Media.
This year OSCON returns to Portland, Oregon and runs from July 16-19. As usual, it is preceded by the free-to-attend Community Leadership Summit on July 14-15.

If you’re curious about our outreach programs, our approach to open source, or any of the open source projects we’ve released, please find us! We’re eager to chat. You’ll find us and many other Googlers throughout the week on stage, in the expo hall, and at several special events that we’re running, including:
Here’s a rundown of the sessions we’re hosting this year:

Sunday, July 15th (Community Leadership Summit)

11:45am   Asking for time and/or money by Cat Allman

Monday, July 16th (Tutorials)

9:00am    Getting started with TensorFlow by Josh Gordon
1:30pm    Introduction to natural language processing with Python by Barbara Fusinska

Tuesday, July 17th (Tutorials)

9:00am    Istio Day opening remarks by Kelsey Hightower
9:00am    TensorFlow Day opening remarks by Edd Wilder-James
9:05am    Sailing to 1.0: Istio community update by April Nassi
9:05am    The state of TensorFlow by Sandeep Gupta
9:30am    Introduction to fairness in machine learning by Hallie Benjamin
9:55am    Farm to table: A TensorFlow story by Gunhan Gulsoy
11:00am  Hassle-free, scalable machine learning with Kubeflow by Barbara Fusinska
11:05am  Istio: Zero-trust communication security for production services by Samrat Ray, Tao Li, and Mak Ahmad
12:00pm  Project Magenta: Machine learning for music and art by Sherol Chen
1:35pm    Istio à la carte by Daniel Ciruli

Wednesday, July 18th (Sessions)

9:00am    Wednesday opening welcome by Kelsey Hightower
11:50am  Machine learning for continuous integration by Joseph Gregorio
1:45pm    Live-coding a beautiful, performant mobile app from scratch by Emily Fortuna and Matt Sullivan
2:35pm    Powering TensorFlow with big data using Apache Beam, Flink, and Spark by Holden Karau
5:25pm    Teaching the Next Generation to FLOSS by Josh Simmons

Thursday, July 19th (Sessions)

9:00am    Thursday opening welcome by Kelsey Hightower
9:40am    20 years later, open source is as important as ever by Sarah Novotny
11:50am  Google’s approach to distributed systems observability by Jaana B. Dogan
2:35pm    gRPC versus REST: Let the battle begin with Alex Borysov
5:05pm    Shenzhen Go: A visual Go environment for everybody, even professionals by Josh Deprez

We look forward to seeing you and the rest of the community there!

By Josh Simmons, Google Open Source

Why we believe in an open cloud



Open clouds matter more now than ever. While most companies today use a single public cloud provider in addition to their on-premises environment, research shows that most companies will likely adopt multiple public and private clouds in the coming years. In fact, according to a 2018 Rightscale study, 81-percent of enterprises with 1,000 or more employees have a multi-cloud strategy, and if you consider SaaS, most organizations are doing multi-cloud already.

Open clouds let customers freely choose which combination of services and providers will best meet their needs over time. Open clouds let customers orchestrate their infrastructure effectively across hybrid-cloud environments.

We believe in three principles for an open cloud:
  1. Open is about the power to pick up an app and move it—to and from on-premises, our cloud, or another cloud—at any time.
  2. Open-source software permits a richness of thought and continuous feedback loop with users.
  3. Open APIs preserve everyone’s ability to build on each other’s work.

1. Open is about the power to pick up an app and move it

An open cloud is grounded in a belief that being tied to a particular cloud shouldn’t get in the way of achieving your goals. An open cloud embraces the idea that the power to deliver your apps to different clouds while using a common development and operations approach will help you meet whatever your priority is at any given time—whether that’s making the most of skills shared widely across your teams or rapidly accelerating innovation. Open source is an enabler of open clouds because open source in the cloud preserves your control over where you deploy your IT investments. For example, customers are using Kubernetes to manage containers and TensorFlow to build machine learning models on-premises and on multiple clouds.

2. Open-source software permits a richness of thought and continuous feedback loop with users

Through the continuous feedback loop with users, open source software (OSS) results in better software, faster, and requires substantial time and investment on the part of the people and companies leading open source projects. Here are examples of Google’s commitment to OSS and the varying levels of work required:
  • OSS such as Android, has an open code base and development is the sole responsibility of one organization.
  • OSS with community-driven changes such as TensorFlow, involves coordination between many companies and individuals.
  • OSS with community-driven strategy, for example collaboration with the Linux Foundation and Kubernetes community, involves collaborative, decision-making and accepting consensus over control.
Open source is so important to Google that we call it out twice in our corporate philosophies, and we encourage employees, and in fact all developers, to engage with open source.

Using BigQuery to analyze GHarchive.org data, we found that in 2017, over 5,500 Googlers submitted code to nearly 26,000 repositories, created over 215,000 pull requests, and engaged with countless communities through almost 450,000 comments. A comparative analysis of Google’s contribution to open source provides a useful relative position of the leading companies in open source based on normalized data.

Googlers are active contributors to popular projects you may have heard of including Linux, LLVM, Samba, and Git.

Google regularly open-sources internal projects

Top Google-initiated projects include:

3. Open APIs preserve everyone’s ability to build on each other’s work

Open APIs preserve everyone’s ability to build on each other’s work, improving software iteratively and collaboratively. Open APIs empower companies and individual developers to change service providers at will. Peer-reviewed research shows that open APIs drive faster innovation across the industry and in any given ecosystem. Open APIs depend on the right to reuse established APIs by creating independent-yet-compatible implementations. Google is committed to supporting open APIs via our membership in the Open API Initiative, involvement in the Open API specification, support of gRPC, via Cloud Bigtable compatibility with the HBase API, Cloud Spanner and BigQuery compatibility with SQL:2011 (with extensions), and Cloud Storage compatibility with shared APIs.

Build an open cloud with us

If you believe in an open cloud like we do, we’d love your participation. You can help by contributing to and using open source libraries, and asking your infrastructure and cloud vendors what they’re doing to keep workloads free from lock-in. We believe open ecosystems grow the fastest and are more resilient and adaptable in the face of change. Like you, we’re in it for the long-term.



It’s worth noting that not all Google’s products will be open in every way at every stage of their life cycle. Openness is less of an absolute and more of a mindset when conducting business in general. You can, however, expect Google Cloud to continue investing in openness across our products over time, to contribute to open source projects, and to open source some of our internal projects.

If you believe open clouds are an important part of making this multi-cloud world a place in which everyone can thrive, we encourage you to check out our new open cloud website where we offer more detailed definitions and examples of the terms, concepts, and ideas we’ve discussed here: cloud.google.com/opencloud.

Contributing to the AMP Project

This is a guest post by Adam Silverstein who was recently recognized through the Google Open Source Peer Bonus Program for contributions to the AMP Project. We invited Adam to share about his work on our blog.

I started my web career building websites for small businesses on WordPress, so when I decided to begin contributing to open source, WordPress was a natural place to start.

Now I work at the digital agency 10up, where I am a part of our open source team. We build popular sites like FiveThirtyEight where having the best possible AMP experience is critical. However, bringing FiveThirtyEight’s AMP version up to parity with the site’s responsive mobile experience was challenging, in part because of advanced features that aren’t directly supported in AMP.

One of those unsupported features was MathML, a standard for displaying mathematical formulas on the web. To avoid a clumsy work around (amp-iframe) and improve our presentation of formulas, I proposed a native `amp-mathml` component which could display formulas inline. Contributing improvements “upstream” to open source projects – especially as we encounter friction in real-world projects – is a core value at 10up and important to the health of the web. I expected that I could leverage the same open source MathJax library we used on the responsive website for an AMP implementation. Contributing this component would strengthen my understanding of AMP’s internals while simultaneously improving a client site and enabling the open MathML standard on any AMP page. Win, win, win!

I started by opening an issue on Google’s amphtml repository, describing MathML and proposing a native `amp-mathml` component. Justin Ridgewell from the AMP team immediately responded to the issue and asked Ali Ghassemi to track it. I offered to help write the code and received an enthusiastic response, encouraging me and assuring me that the team would be available on GitHub and in Slack to answer any questions.

This warm welcome gave me the confidence to dive in, but ramp up was daunting. The build tools and coding standards were quite different from other projects I work on and setup required some editor reconfiguring and reflex retraining. Getting the unit test to run on my system required tracking down and installing some missing dependencies.

Fortunately, AMP’s project documentation is thorough, and Ali guided me through the implementation, pointing me to existing, similar samples in the project. I already knew how to use JavaScript to render formulas with MathJax – my challenge was building an AMP component that ran this code and displayed it inline.

After a few days of concerted effort, I built a proof of concept and opened a pull request. The real fun began as I refined the approach and wrote documentation with help from the team. The team’s active engagement helped the process move along rapidly. Amazingly, the pull request was merged one week later, and today amp-mathml is live in the wild. FiveThirtyEight is already using the new, native implementation.

From opening the issue all the way to the merge of my pull request, I was impressed by the support and encouragement I received. Ali and honeybadgerdontcare provided regular reviews and thorough suggestions on the pull request when I pushed iterations. Their engagement throughout the process made me and my work feel valued, and helped me stay motivated to continue working on the feature.

Adding MathML to AMP reminded me why I find so much joy and professional growth in contributing to open source projects. I have a better understanding of AMP from the inside out, and I was welcomed into the project’s community with wide open arms. I'm proud of my contribution, and ready to tackle new challenges after seeing its success!
 
By Adam Silverstein, AMP Project contributor

Google Summer of Code 2018 statistics part 2

Now that Google Summer of Code (GSoC) 2018 is underway and students are wrapping up their first month of coding, we wanted to bring you some more statistics on the 2018 program. Lots and lots of numbers follow:

Organizations

Students are working with 206 organizations (the most we’ve ever had!), 41 of which are participating in GSoC for the first time.

Student Registrations

25,873 students from 147 countries registered for the program, which is a 25.3% increase over the previous high for the program back in 2017. There are 9 new countries with students registering for the first time: Angola, Bahamas, Burundi, Cape Verde, Chad, Equatorial Guinea, Kosovo, Maldives, and Mali.

Project Proposals

5,199 students from 101 countries submitted a total of 7,209 project proposals. 70.5% of the students submitted 1 proposal, 18.1% submitted 2 proposals, and 11.4% submitted 3 proposals (the max allowed).

Gender Breakdown

11.63% of accepted students are women, a 0.25% increase from last year. We are always working toward making our programs and open source more inclusive, and we collaborate with organizations and communities that help us improve every year.

Universities

The 1,268 students accepted into the GSoC 2018 program hailed from 613 universities, of which 216 have students participating for the first time in GSoC.

Schools with the most accepted students for GSoC 2018:
University Country Students
Indian Institute of Technology, Roorkee India 35
International Institute of Information Technology - Hyderabad India 32
Birla Institute of Technology and Science, Pilani (BITS Pilani) India 23
Indian Institute of Technology, Kharagpur India 22
Birla Institute of Technology and Science Pilani, Goa campus / BITS-Pilani - K.K.Birla Goa Campus India 18
Indian Institute of Technology, Kanpur India 16
University of Moratuwa Sri Lanka 16
Indian Institute of Technology, Patna India 14
Amrita University India 13
Indian Institute of Technology, Mandi India 11
Indraprastha Institute of Information and Technology, New Dehli India 11
University of Buea Cameroon 11
BITS Pilani, Hyderabad Campus India 11
Another post with stats on our awesome GSoC mentors will be coming soon!

By Stephanie Taylor, Google Open Source

31st anniversary of the GIF: give your terminal some personality with Tenor GIF for CLI

Creating ASCII art for your terminal isn’t new. Displaying animating ASCII GIFs in the CLI (command line interface), however, hasn’t been possible, or at least, not easy to do -- until now.

Shortly after Tenor was acquired by Google, I had an idea.

Many developers configure static ASCII art to appear when opening a terminal, but I imagined that ASCII art could animate like a GIF, and could easily be created from any GIF on Tenor.

After some tinkering, GIF for CLI was born.

Just in time for the 31st anniversary of the GIF, GIF for CLI is available today on GitHub. GIF for CLI takes in a GIF, short video, or a query to the Tenor GIF API and converts it to animated ASCII art. This means each time you log on to your programming workstation, your GIF is there to greet you in ASCII form. Animation and color support are performed using ANSI escape sequences.

Rob Delaney as “Peter” from Deadpool 2, in ASCII GIF form. See the original GIF on Tenor here.
For the more technically-minded, here are the details:

When the command line program is run, it takes the chosen .gif file (file, url, or Tenor query) and uses ffmpeg to split the animated gif or short video into static jpg frames. Those jpg frames are then converted to ASCII art (these ASCII art frames are cached in $HOME/.cache/gif-for-cli). The program then prints one frame at a time to the console, clearing the console using ANSI escape sequences between each frame.

You could also use GIF for CLI to run gif-for-cli in your .bashrc or .profile to get an animated ASCII art image as your MOTD, or with Git hooks.

GIF for CLI integrates with the Tenor GIF API to source the GIFs. The Tenor API powers GIF search within many of today’s most popular messaging apps and social platforms on iOS and Android. 

Share screen captures of your ASCII GIFs with us on Twitter with #GIFforCLI. 

By Sean Hayes, Tenor

OpenCensus’s journey ahead: enhanced feature set

This is the second half of a series of blog posts about what’s coming next for OpenCensus. The OpenCensus Roadmap is composed of two pillars: increased language, framework, and platform coverage, and the addition of more powerful features.

In this blog post we’re going to discuss the second pillar: new functionality that makes OpenCensus more powerful. This includes dramatically improved sampling capabilities and new types of telemetry that we’re looking to capture.

More Power

Intelligent Sampling

In addition to expanding the list of languages and frameworks that OpenCensus supports out of the box, we’ll also be increasing the usefulness of existing functionality.

Services instrumented with OpenCensus currently randomly (at a configurable rate) sample new requests (without context, usually received directly from clients). While this does provide an effective view into application latency, developers are mostly interested in traces of particularly slow requests or requests that also capture a useful event, such as an exception.

We’re adding support for OpenCensus to make deferred sampling decisions - that is, to sample requests after they’ve propagated through several systems, while still preserving the full critical path of the trace. Though the feature is just starting development, we’re focusing on making sampling more intelligent - for example, by triggering traces based on accumulated latency, errors, and debugging events. Expect to hear more about this soon.

New Telemetry, Including Logs and Errors

As we mentioned in our last blog post, our ambition is for OpenCensus to become a ubiquitous observability framework, meaning that collecting traces and stats alone won’t be enough. Correlating traces and tags with logs and errors represents an obvious next step, and we’re currently working through what this might look like. Longer term, this list could grow to include profiles and other signals.

The topic of what signals will come next is worth of its own blog post, and you can expect us to start talking about this more in the coming months.

Server-provided Traces and Metrics

Distributed applications can obtain observability into their own performance by instrumenting themselves with OpenCensus, however visibility into the performance of external services or APIs that they call into is still limited. For example, imagine a web service that calls into Google Cloud Platform’s Cloud Bigtable service: the application developer would have visibility into their client side traces but would not be able to tell how much time Cloud BigTable took to respond vs time taken by network. We’re working on adding server side traces and metrics - essentially a way for service providers to summarize server side traces and metrics.

Cluster wide Z pages

Today, OpenCensus provides a stand-alone application called z-pages that includes an embedded web server and displays configuration parameters and trace information in real-time, as captured from any OpenCensus libraries running on the same host. By accessing a z-page, developers can configure sampling rate for the local instance, or view traces, tags, and stats as they’re being processed in real-time.

Longer-term, we wish to extend this functionality to enable cluster wide z-pages, which could provide the same functionality as the current z-pages experience, aggregated over all of the instances of a particular service. We’re still discussing different implementation options, and if we can tie this into other aggregation-related workstreams that we’re already pursuing.

Wrapping up

Does the strategy and roadmap above resonate with what you’d want to get from OpenCensus library? We’d love to hear your ideas and what you’d like to see prioritized.

As we mentioned in our last post, none of this is possible without the support and participation from the community. Check out our repo and start contributing. No contribution or idea is too small. Join other developers and users on the OpenCensus Gitter channel. We’d love to hear from you.

By Pritam Shah and Morgan McLean, Census team

Introducing Git protocol version 2


Today we announce Git protocol version 2, a major update of Git's wire protocol (how clones, fetches and pushes are communicated between clients and servers). This update removes one of the most inefficient parts of the Git protocol and fixes an extensibility bottleneck, unblocking the path to more wire protocol improvements in the future.

The protocol version 2 spec can be found here. The main improvements are:
    The main motivation for the new protocol was to enable server side filtering of references (branches and tags). Prior to protocol v2, servers responded to all fetch commands with an initial reference advertisement, listing all references in the repository. This complete listing is sent even when a client only cares about updating a single branch, e.g.: `git fetch origin master`. For repositories that contain 100s of thousands of references (the Chromium repository has over 500k branches and tags) the server could end up sending 10s of megabytes of data that get ignored. This typically dominates both time and bandwidth during a fetch, especially when you are updating a branch that's only a few commits behind the remote, or even when you are only checking if you are up-to-date, resulting in a no-op fetch.

    We recently rolled out support for protocol version 2 at Google and have seen a performance improvement of 3x for no-op fetches of a single branch on repositories containing 500k references. Protocol v2 has also enabled a reduction of 8x of the overhead bytes (non-packfile) sent from googlesource.com servers. A majority of this improvement is due to filtering references advertised by the server to the refs the client has expressed interest in.

    Getting over the hurdles

    The Git project has tried on a number of occasions over the years to either limit the initial ref advertisement or move to a new protocol altogether but continued to run into two problems: (1) the initial request is rigid and does not include a field that could be used to request that new servers modify their response without breaking compatibility with existing servers and (2) error handling is not well enough defined to allow safely using a new protocol that existing servers do not understand with a quick fallback to the old protocol. To migrate to a new protocol version, we needed to find a side channel which existing servers would ignore but could be used to safely communicate with newer servers.

    There are three main transports that are used to speak Git’s wire-protocol (git://, ssh://, and https://), and the side channel that we use to request v2 needs to communicate in such a way that an older server would ignore any additional data sent and not crash. The http transport was the easiest as we can simply include an additional http header in the request (“Git-Protocol: version=2”). The ssh transport is a bit more difficult as it requires sending an environment variable (“GIT_PROTOCOL=version=2”) to be set on the remote end. This is more challenging because it requires server administrators to configure sshd to accept the new environment variable on their server. The most difficult transport is the anonymous Git transport (git://).

    Initial requests made to a server using the anonymous Git transport are made in the form of a single packet-line which includes the requested service (git-upload-pack for fetches and git-receive-pack for pushes), and the repository followed by a NUL byte. Later virtualization support was added and a hostname parameter could be tacked on and  terminated by a NUL byte: `0033git-upload-pack /project.git\0host=myserver.com\0`. Ideally we’d be able to add a new parameter to be used to request v2 by adding it in the same manner as the hostname was added: `003dgit-upload-pack /project.git\0host=myserver.com\0version=2\0`. Unfortunately due to a bug introduced in 2006 we aren't able to place any extra arguments (separated by NULs) other than the host because otherwise the parsing of those arguments would enter an infinite loop. When this bug was fixed in 2009, a check was put in place to disallow extra arguments so that new clients wouldn't trigger this bug in older servers.

    Fortunately, that check doesn't notice if we send additional request arguments hidden behind a second NUL byte, which was pointed out back in 2009.  This allows requests structured like: `003egit-upload-pack /project.git\0host=myserver.com\0\0version=2\0`. By placing version information behind a second NUL byte we can skirt around both the infinite loop bug and the explicit disallowal of extra arguments besides hostname. Only newer servers will know to look for additional information hidden behind two NUL bytes and older servers won’t croak.

    Now, in every case, a client can issue a request to use v2, using a transport-specific side channel, and v2 servers can respond using the new protocol while older servers will ignore the side channel and just respond with a ref advertisement.

    Try it for yourself

    To try out protocol version 2 for yourself you'll need an up to date version of Git (support for v2 was recently merged to Git's master branch and is expected to be part of Git 2.18) and a v2 enabled server (repositories on googlesource.com and Cloud Source Repositories are v2 enabled). If you enable tracing and run the `ls-remote` command querying for a single branch, you can see the server sends a much smaller set of references when using protocol version 2:

    ```
    # Using the original wire protocol
    GIT_TRACE_PACKET=1 git -c protocol.version=0 ls-remote https://chromium.googlesource.com/chromium/src.git master

    # Using protocol version 2
    GIT_TRACE_PACKET=1 git -c protocol.version=2 ls-remote https://chromium.googlesource.com/chromium/src.git master
    ```

    By Brandon Williams, Git-core Team

    Google Summer of Code 2018 statistics part 1

    Since 2005, Google Summer of Code (GSoC) has been bringing new developers into the open source community every year. This year we accepted 1,264 students from 62 countries into the 2018 GSoC program to work with a record 206 open source organizations this summer.

    Students are currently participating in the Community Bonding phase of the program where they become familiar with the open source projects they will be working with. They also spend time learning the codebase and the community’s best practices so they can start their 12 week coding projects on May 14th.

    Each year we like to share program statistics about the GSoC program and the accepted students and mentors involved in the program. Here are a few stats:
    • 88.2% of the accepted students are participating in their first GSoC
    • 74.4% of the students are first time applicants

    Degrees

    • 76.18% of accepted students are undergraduates, 17.5% are masters students, and 6.3% are getting their PhDs.
    • 73% are Computer Science majors, 4.2% are mathematics majors, 17% are other engineering majors (electrical, mechanical, aerospace, etc.)
    • We have students in a variety of majors including neuroscience, linguistics, typography, and music technologies.

    Countries

    This year there are four students that are the first to be accepted into GSoC from their home countries of Kosovo (three students) and Senegal. A complete list of accepted students and their countries is below:
    CountryStudentsCountryStudentsCountryStudents
    Argentina5Hungary7Russian Federation35
    Australia10India605Senegal1
    Austria14Indonesia3Serbia1
    Bangladesh3Ireland1Singapore8
    Belarus3Israel2Slovak Republic2
    Belgium3Italy24South Africa1
    Brazil19Japan7South Korea2
    Bulgaria2Kosovo3Spain21
    Cameroon14Latvia1Sri Lanka41
    Canada31Lithuania5Sweden6
    China52Malaysia2Switzerland5
    Croatia3Mauritius1Taiwan3
    Czech Republic4Mexico4Trinidad and Tobago1
    Denmark1Morocco2Turkey8
    Ecuador4Nepal1Uganda1
    Egypt12Netherlands6Ukraine6
    Finland3Nigeria6United Kingdom28
    France22Pakistan5United States104
    Germany53Poland3Venezuela1
    Greece16Portugal10Vietnam4
    Hong Kong3Romania10Venezuela1
    There were a record number of students submitting proposals for the program this year -- 5,199 students from 101 countries.

    In our next GSoC statistics post we will delve deeper into the schools, gender breakdown, mentors, and registration numbers for the 2018 program.

    By Stephanie Taylor, Google Open Source