Tag Archives: open data

Introducing the Open Images Dataset

Originally posted on the Google Research Blog

In the last few years, advances in machine learning have enabled computer vision to progress rapidly, from systems that can automatically caption images to apps that can create natural language replies in response to shared photos. Much of this progress can be attributed to publicly available image datasets, such as ImageNet and COCO for supervised learning, and YFCC100M for unsupervised learning.

Today, we introduce Open Images, a dataset consisting of ~9 million URLs to images that have been annotated with labels spanning over 6000 categories. We tried to make the dataset as practical as possible: the labels cover more real-life entities than the 1000 ImageNet classes, there are enough images to train a deep neural network from scratch, and the images are listed as having a Creative Commons Attribution license*.

The image-level annotations have been populated automatically with a vision model similar to Google Cloud Vision API. For the validation set, we had human raters verify these automated labels to find and remove false positives. On average, each image has about 8 labels assigned. Here are some examples:
Annotated images from the Open Images dataset. Left: Ghost Arches by Kevin Krejci. Right: Some Silverware by J B. Both images used under CC BY 2.0 license.
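To get a feel for annotations like these, here is a minimal sketch (not from the original post) of tallying labels per image from a hypothetical CSV export with image_id, label and confidence columns; the file name and column names are assumptions, not the dataset's actual schema.

import csv
from collections import Counter

# Hypothetical annotations file with columns: image_id, label, confidence.
# The real Open Images release may use different file and column names.
labels_per_image = Counter()
with open("image_labels.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        if float(row["confidence"]) >= 0.5:  # keep reasonably confident labels
            labels_per_image[row["image_id"]] += 1

# Average number of labels per image (the post reports about 8).
if labels_per_image:
    print(sum(labels_per_image.values()) / len(labels_per_image))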
We have trained an Inception v3 model based on Open Images annotations alone, and the model is good enough to be used for fine-tuning applications as well as for other things, like DeepDream or artistic style transfer, which require a well-developed hierarchy of filters. We hope to improve the quality of the annotations in Open Images over the coming months, and with them the quality of the models that can be trained.
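For readers who want to try fine-tuning themselves, the sketch below shows the general shape of such a setup in Keras. It is only an illustration under stated assumptions: it loads stock ImageNet weights as a stand-in (the post's checkpoint was trained on Open Images, and its training pipeline is not described here), and NUM_CLASSES is a hypothetical number of target classes.

import tensorflow as tf

NUM_CLASSES = 20  # hypothetical number of classes for the downstream task

# Inception v3 backbone without its classification head. ImageNet weights are
# used here purely as a placeholder for a checkpoint trained on Open Images.
base = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet",
    input_shape=(299, 299, 3), pooling="avg")
base.trainable = False  # freeze the filter hierarchy; train only the new head

# Multi-label head: independent sigmoids, since an image can carry many labels.
head = tf.keras.layers.Dense(NUM_CLASSES, activation="sigmoid")(base.output)
model = tf.keras.Model(base.input, head)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])

# model.fit(train_dataset, epochs=...) would then train the new head on your data.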

The dataset is a product of a collaboration between Google, CMU and Cornell universities, and there are a number of research papers built on top of the Open Images dataset in the works. It is our hope that datasets like Open Images and the recently released YouTube-8M will be useful tools for the machine learning community.

By Ivan Krasin and Tom Duerig, Software Engineers

* While we tried to identify images that are licensed under a Creative Commons Attribution license, we make no representations or warranties regarding the license status of each image and you should verify the license for each image yourself.

Which languages convey the most information in the least space? Introducing the Unimorph dataset.

Several years ago a science journalist asked me which languages could pack the most information into a 140-character Tweet. Because Twitter defines a character roughly as a single Unicode code point, this turns out to be an easy question to answer. Chinese almost certainly rates as the most “compact” language from that point of view, because a single Chinese character represents a whole morpheme (in linguistic terminology, a minimal unit of meaning), whereas an English letter represents only part of a morpheme. The Chinese equivalent of I don’t eat meat, which in English takes 16 characters including spaces, is 我不吃肉, which takes just four.
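A quick way to check those counts, treating one Unicode code point as one character the way Twitter did at the time (a small sketch, not from the original article):

# Python's len() on a str counts Unicode code points, which approximates
# Twitter's character counting at the time.
english = "I don't eat meat"
chinese = "我不吃肉"
print(len(english))  # 16, spaces included
print(len(chinese))  # 4, one code point per character/morpheme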

But this question relates to a broader question that as a linguist I have often been asked: which languages are the most “efficient” at conveying information? Or, which languages can convey the same information in the smallest amount of space? Untethered by the idiosyncrasies of Twitter, this question becomes quite difficult to answer. What do you mean by “space”? Number of characters? Number of bytes? Number of syllables? Each of these has its own problems. And perhaps more crucially, what do you mean by “information”? The Shannon notion of information does not straightforwardly apply here.

A group of us at Google set out to answer this question, or at least to provide the form that an answer would have to take. We had the resources and experience needed to annotate data in multiple languages, and we were able to divert some of those resources to this task. The results were published in a paper presented at the 2014 International Conference on Language Resources and Evaluation in Reykjavík, Iceland.

We are now releasing the data on GitHub. The data consist of 85 sentences typical of the kinds of sentences generated by Google Now, translated into eight typologically diverse languages: English, French, Italian, German, Russian, Arabic, Korean and Chinese. These include both highly inflected and uninflected languages, and a range of morphological types, including inflectional and agglutinative. The data were annotated by one to three annotators, depending on the language, with morphological information, counts of the marked features and other information. The main data file is in HTML, color-coded by language, which makes it easy to browse but also easy to extract into other formats.
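As a sketch of that kind of extraction (not part of the release itself), the standard-library snippet below pulls the visible text out of the HTML file; the file name is hypothetical, and real use would key off the color-coding markup, whose exact form is not described here.

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text chunks of an HTML document."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

# Hypothetical file name; check the repository for the actual one.
with open("unimorph_sentences.html", encoding="utf-8") as f:
    extractor = TextExtractor()
    extractor.feed(f.read())

for chunk in extractor.chunks[:20]:
    print(chunk)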

Since the basic information conveyed by each sentence can be assumed to be the same across languages, the main focus of the research was on the additional information that each language marks, and cannot avoid marking. For example, the English sentence:

Use my location for the search results and other services.

has the French translation:

Utilisez ma position pour les résultats de recherche et d'autres services.

The verb ending -ez in the French example marks “addressee respect”, a bit of information that is missing from the English original. One could have used a different ending on the French verb, but that would not avoid marking this bit of information; it would simply be a choice to mark lack of respect, or familiarity with the addressee, instead.

In the paper we tried various ways of measuring the differing information content of the languages relative to various definitions of “space”. Considering all the factors together, we concluded that the languages that conveyed the most information in a given amount of space were highly inflected languages like Russian, with uninflected languages like Chinese actually being the “least efficient” at conveying information.
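As a toy illustration of why the definition of “space” matters (this is not a measurement from the paper), the snippet below counts the example sentences above in two of the units mentioned earlier, code points and UTF-8 bytes; syllable counts would need more linguistic machinery.

english = "Use my location for the search results and other services."
french = "Utilisez ma position pour les résultats de recherche et d'autres services."

for name, sentence in [("English", english), ("French", french)]:
    print(name,
          "code points:", len(sentence),
          "UTF-8 bytes:", len(sentence.encode("utf-8")))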

We don’t expect this to be the final answer, which is why we are releasing the data as open source in the hopes that others will find it useful and maybe can even extend it to more sentences or a wider variety of languages. Ultimately though, any answer to the question of which languages convey the most information in the smallest amount of space must seriously address what is meant by “information”, and must pay heed to the famous maxim by the Russian linguist Roman Jakobson (1959) that “languages differ essentially in what they must convey and not in what they may convey.”

By Richard Sproat, Research Scientist

GitHub on BigQuery: Analyze all the code

Google, in collaboration with GitHub, is releasing an incredible new open dataset on Google BigQuery. So far you've been able to monitor and analyze GitHub's pulse since 2011 (thanks GitHub Archive project!) and today we're adding the perfect complement to this. What could you do if you had access to analyze all the open source software in the world, with just one SQL command?

The Google BigQuery Public Datasets program now offers a full snapshot of the content of more than 2.8 million open source GitHub repositories in BigQuery. Thanks to our new collaboration with GitHub, you'll have access to analyze the source code of almost 2 billion files with a simple (or complex) SQL query. This will open the doors to all kinds of new insights and advances that we're just beginning to envision.

For example, let's say you're the author of a popular open source library. Now you'll be able to find every open source project on GitHub that's using it. Even more, you'll be able to guide the future of your project by analyzing how it's being used, and improve your APIs based on what your users are actually doing with it.
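As a hedged sketch of what that lookup might look like from Python, the query below scans the smaller sample tables of the public dataset for files that import a hypothetical library named mylibrary; the table and column names follow the github_repos dataset as documented at the time and should be verified in the BigQuery console, and the search string is a placeholder for your own package.

from google.cloud import bigquery

client = bigquery.Client()  # needs a Google Cloud project with BigQuery enabled

# "mylibrary" is a placeholder; table and column names should be double-checked
# against the public github_repos dataset before running at scale.
sql = r"""
SELECT f.repo_name, COUNT(*) AS matching_files
FROM `bigquery-public-data.github_repos.sample_contents` AS c
JOIN `bigquery-public-data.github_repos.sample_files` AS f
  ON c.id = f.id
WHERE c.content LIKE '%import mylibrary%'
GROUP BY f.repo_name
ORDER BY matching_files DESC
LIMIT 20
"""

for row in client.query(sql).result():
    print(row.repo_name, row.matching_files)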

On the security side, we've seen how the most popular open source projects benefit from having multiple eyes and hands working on them. This visibility helps projects get hardened and buggy code cleaned up. What if you could search for errors with similar patterns in every other open source project? Would you notify their authors and send them pull requests? Well, now you can.
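Here is one hedged example of that kind of pattern hunt, again only a sketch over the sample tables: it looks for calls to the unsafe C function gets() in files ending in .c; substitute whatever bug pattern you are actually chasing.

from google.cloud import bigquery

client = bigquery.Client()

# Sketch only: searches the sample tables for a classic unsafe-call pattern.
sql = r"""
SELECT f.repo_name, f.path
FROM `bigquery-public-data.github_repos.sample_contents` AS c
JOIN `bigquery-public-data.github_repos.sample_files` AS f
  ON c.id = f.id
WHERE f.path LIKE '%.c'
  AND REGEXP_CONTAINS(c.content, r'\bgets\s*\(')
LIMIT 20
"""

for row in client.query(sql).result():
    print(row.repo_name, row.path)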
To learn more, read GitHub's announcement and try some sample queries. Share your queries and findings in our reddit.com/r/bigquery and Hacker News posts. The ideas are endless, and I'll start collecting tips and links to other articles on this post on Medium.

Stay curious!

Student applications now open for Google Summer of Code!

Posted by Mary Radomile, Google Open Source team

Are you a university student looking to learn more about open source software development? Look no further than Google Summer of Code (GSoC) and spend your summer break working on an exciting open source project, learning how to write code.

For twelve years running, GSoC has given participants a chance to work on an open source software project entirely online. Students, who receive a stipend for their successful contributions, are paired with mentors who can help address technical questions and concerns throughout the program. Former GSoC participants have told us that the real-world experience they’ve gained during the program has not only sharpened their technical skills, but has also boosted their confidence, broadened their professional network and enhanced their resumes.

Students who are interested can submit proposals on the program site now through Friday, March 25 at 19:00 UTC. The first step is to review the 180 open source projects and find project ideas that appeal to you. Since spots are limited, we recommend a strong project proposal to help increase your chances of selection. Our Student Manual provides lots of helpful advice to get you started on choosing an organization and crafting a great application.

For ongoing information throughout the application period and beyond, see the Google Open Source Blog, join our Google Summer of Code discussion lists or join us on internet relay chat (IRC) at #gsoc on Freenode.

Good luck to all the open source coders out there, and remember to submit your proposals early — you only have until Friday, March 25 at 19:00 UTC to apply!