
Google Summer of Code 2016 statistics: celebrating our mentors

Our final statistics post of the year is dedicated to the incredible Google Summer of Code (GSoC) 2016 mentors. There were a total of 2,524 mentors, but today we'll look at the 1,500+ mentors who were assigned to an active project. Mentors are the lifeblood of our program. Without their hard work and dedication to the success of our students, there would be no GSoC. A merry band of volunteers, mentors work with students for more than 12 weeks, remotely and across multiple time zones, giving their time, expertise and guidance for an average of 7.45 hours a week on top of a regular full-time job. Today we’ll take a closer look at our 2016 team.

GSoC 2016 mentors reside all over the world and represent 66 countries.




Want to see the data? Here’s the breakdown of the countries our mentors come from.



We have many mentors who participate in GSoC year after year. In 2016, six mentors have participated since the program’s inception in 2005! GSoC “lifer” Bart Massey, who participated as a mentor for Portland State University and X.Org, had this to say about his time with GSoC:

“I'm not sure which is more astonishing, that I am 12 years older with GSoC or that GSoC is 12 years old with me. Some of the most fantastic, interesting, brilliant and hardworking folks on the planet have gotten together every year for 12 years to change the world: Google folks and open source leadership and skilled, special students. It's been great to get to be part of it all, both as Portland State University and during my time with X.Org...I hope I get to keep working with and hanging out with these people I love every year forever.” 

Awww, we love you too Bart!

There are also plenty of newbies to the program each year and 2016 is no exception. We’d like to welcome 528 (33%) new mentors to the GSoC family.

Some fun facts:
  • Average age: 32
  • Youngest: 14
  • Oldest: 78
  • Most common mentor first name: David
At the end of each program year, we invite two mentors from each participating organization to join us at the Mentor Summit, a three-day unconference at Google HQ in Northern California. There they enjoy a weekend with their peers to talk about all things open source-y (a technical term) and have some fun.

A huge thanks to each and every Google Summer of Code mentor. We salute you.

By Mary Radomile, Open Source Programs

More statistics from Google Summer of Code 2016

Google Summer of Code (GSoC) 2016 is officially at its halfway point. Mentors and students have just completed their midterm evaluations and it’s time for our second stats post. This time we take a closer look at our participating students.

First, we’d like to highlight the universities with the most student participants. Congratulations are due to the International Institute of Information Technology - Hyderabad for claiming the top spot for the third consecutive year!

| Country | School | 2016 Accepted Students | 2015 Accepted Students | 12-Year Total |
|---|---|---|---|---|
| India | International Institute of Information Technology - Hyderabad | 50 | 62 | 252 |
| Sri Lanka | University of Moratuwa | 29 | 44 | 320 |
| Romania | University POLITEHNICA of Bucharest | 24 | 14 | 155 |
| India | Birla Institute of Technology and Science Pilani, Goa Campus | 22 | 15 | 110 |
| India | Birla Institute of Technology and Science, Pilani Campus | 22 | 18 | 116 |
| India | Indian Institute of Technology, Bombay | 18 | 13 | 75 |
| India | Indian Institute of Technology, Kharagpur | 15 | 8 | 92 |
| India | Indian Institute of Technology, Roorkee | 15 | 8 | 57 |
| India | Indraprastha Institute of Information Technology | 15 | 7 | 27 |
| India | Amrita Institute Of Technology & Science, Amritapuri | 13 | 5 | 33 |
| India | Indian Institute of Technology, Guwahati | 13 | 5 | 38 |
| Cameroon | University of Buea | 12 | 10 | 26 |
| India | Delhi Technological University | 12 | 9 | 60 |
| India | Indian Institute of Technology BHU Varanasi | 12 | 12 | 37 |
| Germany | TU Munich | 11 | 7 | 45 |


Next, we are proud to announce that 2016 marks the largest number of female GSoC participants to date: 12% of accepted students are female, up 2.2% from 2015. This is good progress, but we are certain we can do better in the future to diversify our program. The Google Open Source team will continue our outreach to organizations such as Grace Hopper and Black Girls Code to increase this number even more in 2017. If you have any suggestions of organizations we should work with, please let us know in the comments.

Finally, each year we like to look at the majors of students. As expected, the most common area of study for our participants is Computer Science (approximately 78%), but this year we have a wide variety of fields represented, including Linguistics, Law, Music Technology and Psychology. The majority of our students this year are undergraduates (67%), followed by Masters students (23%) and then PhD students (9%).



Although reviewing GSoC statistics each year is great fun, we want to stress that being “first place” is not the point of the program. Our goal is to get more and more students involved in creating free and open source software. We hope Google Summer of Code encourages contributions to projects that have the potential to make a difference worldwide. Congratulations to the students from all over the globe and keep up the good work!

By Mary Radomile, Open Source Programs Office

Google Summer of Code 2016 statistics: Part one

We share statistics from Google Summer of Code (GSoC) every year — now that 2016 is chugging along, we’ve got some exciting numbers to share! 1,202 students from all over the globe are currently in the community bonding period, a time when participants learn more about the organization they will be contributing to before coding officially begins on May 23. This includes becoming familiar with community practices and processes, setting up a development environment, and contributing small (or large) patches and bug fixes.

We’ll start our statistics reporting this year with the total number of students participating from each country:

| Country | Accepted Students | Country | Accepted Students | Country | Accepted Students |
|---|---|---|---|---|---|
| Albania | 1 | Greece | 10 | Romania | 31 |
| Algeria | 1 | Guatemala | 1 | Russian Federation | 52 |
| Argentina | 3 | Hong Kong | 2 | Serbia | 2 |
| Armenia | 3 | Hungary | 7 | Singapore | 7 |
| Australia | 6 | India | 454 | Slovak Republic | 3 |
| Austria | 19 | Ireland | 3 | Slovenia | 4 |
| Belarus | 5 | Israel | 2 | South Africa | 2 |
| Belgium | 5 | Italy | 23 | South Korea | 6 |
| Bosnia-Herzegovina | 1 | Japan | 12 | Spain | 33 |
| Brazil | 21 | Kazakhstan | 2 | Sri Lanka | 54 |
| Bulgaria | 2 | Kenya | 3 | Sweden | 5 |
| Cambodia | 1 | Latvia | 3 | Switzerland | 2 |
| Cameroon | 1 | Lithuania | 1 | Taiwan | 7 |
| Canada | 23 | Luxembourg | 1 | Thailand | 1 |
| China | 34 | Macedonia | 1 | Turkey | 12 |
| Croatia | 2 | Mexico | 2 | Ukraine | 13 |
| Czech Republic | 6 | Netherlands | 9 | United Kingdom | 18 |
| Denmark | 2 | New Zealand | 2 | United States | 118 |
| Egypt | 10 | Pakistan | 4 | Uruguay | 1 |
| Estonia | 1 | Paraguay | 1 | Venezuela | 1 |
| Finland | 3 | Philippines | 2 | Vietnam | 4 |
| France | 19 | Poland | 28 |  |  |
| Germany | 66 | Portugal | 7 |  |  |


We’d like to welcome a new country to the GSoC family. 2016 brings us one student from Albania!

In our upcoming statistics posts, we will delve deeper into the numbers by looking at universities with the most accepted students, gender numbers, mentor countries and more. If you have additional statistics that you would like us to share, please leave a comment below and we will consider including them in an upcoming post.

By Mary Radomile, Open Source Programs

Google Summer of Code marches on!

Google Summer of Code 2016 (GSoC) is well underway and we’ve already seen some impressive numbers — all record highs!
  • 18,981 total registered students (up 36% from 2015)
  • 17.34% female registrants
  • 142 countries
  • 5,107 students submitting 7,543 project proposals

Student proposals are currently being reviewed by over 2,300 mentors and organization administrators from the 180 participating mentor organizations. We will announce accepted students on April 22, 2016 on the Open Source blog and on the program site.

Last week, members of the Google Open Source Programs team attended FOSSASIA in Singapore, Asia’s premier open technology event, to talk about GSoC and Google Code-in. There, we met dozens of former GSoC and GCI students and mentors who were excited to embark on another great year. To learn more about Google Summer of Code, please visit our program site.


By Stephanie Taylor, Open Source Programs

Google Code-in 2015: diving into the numbers



Google Code-in (GCI), our contest introducing 13-17 year olds to open source software development, wrapped up a few weeks ago with our largest contest to date: 980 students from 65 countries completed a record-breaking 4,776 tasks! Working with 14 open source organizations, students wrote code, created and edited documentation, designed UI elements and logos, conducted research, developed screencasts and videos teaching others about open source software, and helped find (and fix!) hundreds of bugs.

General statistics

  • 57% of students completed three or more tasks (earning themselves a sweet Google Code-in 2015 t-shirt)
  • 21% of students were female, up from 18% in 2014
  • This was the first Google Code-in for 810 students (83%)


Student age

Participating schools

Students from 550 schools competed in this year’s contest. Below are the top five participating schools.

| School Name | Number of student participants | Country |
|---|---|---|
| Dunman High School | 147 | Singapore |
| GSS PU College | 44 | India |
| Colegiul National Aurel Vlaicu | 31 | Romania |
| Sacred Heart Convent Senior Secondary School | 28 | India |
| Freehold High School | 10 | United States |

Countries

The table below displays the top ten countries with the most students completing at least one task.

| Country | Number of student participants |
|---|---|
| India | 246 |
| United States | 224 |
| Singapore | 164 |
| Romania | 65 |
| Canada | 24 |
| Taiwan | 22 |
| Poland | 19 |
| United Kingdom | 18 |
| Australia | 17 |
| Germany | 13 |


We are pleased to have 11 new countries participating in GCI this year: Albania, Armenia, Cameroon, Costa Rica (home to one of this year’s grand prize winners!), Cyprus, Georgia, Guatemala, Laos, Luxembourg, Qatar and Uganda.

In June we will welcome all 28 grand prize winners (along with a mentor from each participating organization) for a fun-filled trip to the Bay Area. The trip will include meeting with Google engineers to hear about new and exciting projects, a tour of the Google campus and a day of sightseeing around San Francisco.  

Stay tuned to our blog for more stats on Google Code-in, including wrap up posts from the mentor organizations. We are thrilled that Google Code-in was so popular this year. We hope to grow and expand this contest in the future to introduce even more passionate teens to the world of open source software development.

By Stephanie Taylor, Google Code-in Program Manager

The reusable holdout: Preserving validity in adaptive data analysis



Machine learning and statistical analysis play an important role at the forefront of scientific and technological progress. But with all data analysis, there is a danger that findings observed in a particular sample do not generalize to the underlying population from which the data were drawn. A popular XKCD cartoon illustrates that if you test sufficiently many different colors of jelly beans for correlation with acne, you will eventually find one color that correlates with acne at a p-value below the infamous 0.05 significance level.
Image credit: XKCD
Unfortunately, the problem of false discovery is even more delicate than the cartoon suggests. Correcting reported p-values for a fixed number of multiple tests is a fairly well understood topic in statistics. A simple approach is to multiply each p-value by the number of tests, but there are more sophisticated tools. However, almost all existing approaches to ensuring the validity of statistical inferences assume that the analyst performs a fixed procedure chosen before the data are examined. For example, “test all 20 flavors of jelly beans”. In practice, however, the analyst is informed by data exploration, as well as the results of previous analyses. How did the scientist choose to study acne and jelly beans in the first place? Often such choices are influenced by previous interactions with the same data. This adaptive behavior of the analyst leads to an increased risk of spurious discoveries that are neither prevented nor detected by standard approaches. Each adaptive choice the analyst makes multiplies the number of analyses that could follow; it is often difficult or impossible to describe and analyze the exact experimental setup ahead of time.
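To make the fixed-procedure case concrete, here is a small simulation of the jelly bean scenario with a simple Bonferroni correction. This is an illustrative sketch with made-up data, not anything from the paper: with 20 flavors and no real effect, the smallest raw p-value frequently dips below 0.05, while the corrected value rarely does.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_flavors, n_subjects = 20, 100

# Purely random data: no flavor has any real association with acne.
p_values = []
for _ in range(n_flavors):
    acne_eaters = rng.normal(size=n_subjects)    # acne scores of people eating this flavor
    acne_controls = rng.normal(size=n_subjects)  # acne scores of a control group
    _, p = stats.ttest_ind(acne_eaters, acne_controls)
    p_values.append(p)

print("smallest raw p-value:        ", min(p_values))
print("smallest Bonferroni p-value: ", min(1.0, min(p_values) * n_flavors))
```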

In The Reusable Holdout: Preserving Validity in Adaptive Data Analysis, a joint work with Cynthia Dwork (Microsoft Research), Vitaly Feldman (IBM Almaden Research Center), Toniann Pitassi (University of Toronto), Omer Reingold (Samsung Research America) and Aaron Roth (University of Pennsylvania), to appear in Science tomorrow, we present a new methodology for navigating the challenges of adaptivity. A central application of our general approach is the reusable holdout mechanism that allows the analyst to safely validate the results of many adaptively chosen analyses without the need to collect costly fresh data each time.

The curse of adaptivity

A beautiful example of how false discovery arises as a result of adaptivity is Freedman’s paradox. Suppose that we want to build a model that explains “systolic blood pressure” in terms of hundreds of variables quantifying the intake of various kinds of food. In order to reduce the number of variables and simplify our task, we first select some promising looking variables, for example, those that have a positive correlation with the response variable (systolic blood pressure). We then fit a linear regression model on the selected variables. To measure the goodness of our model fit, we crank out a standard F-test from our favorite statistics textbook and report the resulting p-value.
Inference after selection: We first select a subset of the variables based on a data-dependent criterion and then fit a linear model on the selected variables.
Freedman showed that the reported p-value is highly misleading: even if the data were completely random, with no correlation whatsoever between the response variable and the data points, we’d likely observe a significant p-value! The bias stems from the fact that we selected a subset of the variables adaptively based on the data, but we never accounted for this fact. There is a huge number of possible subsets of variables we could have selected from, and the mere fact that we chose one test over another by peeking at the data creates a selection bias that invalidates the assumptions underlying the F-test.
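A rough simulation of Freedman’s paradox shows the effect directly. This is a sketch with synthetic data (the selection threshold and dimensions are arbitrary choices of ours, not Freedman’s): even though the response is pure noise, selecting positively correlated variables first and then running the textbook F-test routinely yields a “significant” p-value.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, d = 100, 200  # 100 subjects, 200 pure-noise "food intake" variables

X = rng.normal(size=(n, d))
y = rng.normal(size=n)  # "systolic blood pressure", independent of X by construction

# Adaptive step: keep only the variables that look promising on this very sample.
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(d)])
selected = X[:, corr > 0.1]

# Fit a linear model on the selected variables and read off the overall F-test p-value.
model = sm.OLS(y, sm.add_constant(selected)).fit()
print(f"{selected.shape[1]} variables selected, F-test p-value = {model.f_pvalue:.4f}")
# The p-value is frequently below 0.05 even though there is no signal at all.
```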

Freedman’s paradox bears an important lesson. Significance levels of standard procedures do not capture the vast number of analyses one can choose to carry out or to omit. For this reason, adaptivity is one of the primary explanations of why research findings are frequently false, as was argued by Gelman and Loken, who aptly refer to adaptivity as the “garden of forking paths”.

Machine learning competitions and holdout sets

Adaptivity is not just an issue with p-values in the empirical sciences. It affects other domains of data science as well. Machine learning competitions are a perfect example. Competitions have become an extremely popular format for solving prediction and classification problems of all sorts.

Each team in the competition has full access to a publicly available training set which they use to build a predictive model for a certain task such as image classification. Competitors can repeatedly submit a model and see how the model performs on a fixed holdout data set not available to them. The central component of any competition is the public leaderboard which ranks all teams according to the prediction accuracy of their best model so far on the holdout. Every time a team makes a submission they observe the score of their model on the same holdout data. This methodology is inspired by the classic holdout method for validating the performance of a predictive model.
Ideally, the holdout score gives an accurate estimate of the true performance of the model on the underlying distribution from which the data were drawn. However, this is only the case when the model is independent of the holdout data! In contrast, in a competition the model generally incorporates previously observed feedback from the holdout set. Competitors work adaptively and iteratively with the feedback they receive. An improved score for one submission might convince the team to tweak their current approach, while a lower score might cause them to try out a different strategy. But the moment a team modifies their model based on a previously observed holdout score, they create a dependency between the model and the holdout data that invalidates the assumption of the classic holdout method. As a result, competitors may begin to overfit to the holdout data that supports the leaderboard. This means that their score on the public leaderboard continues to improve, while the true performance of the model does not. In fact, unreliable leaderboards are a widely observed phenomenon in machine learning competitions.
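The effect is easy to reproduce. The sketch below is entirely synthetic: the “submissions” are random guesses, a toy stand-in for real models. It repeatedly queries a fixed holdout, keeps the lucky guesses, and aggregates them by majority vote; the leaderboard score climbs well above 50% while the true accuracy on fresh data stays at chance.

```python
import numpy as np

rng = np.random.default_rng(2)
n_holdout, n_rounds = 2000, 500

# Binary holdout labels that carry no learnable signal at all.
labels = rng.integers(0, 2, size=n_holdout)

def holdout_accuracy(preds):
    return (preds == labels).mean()

kept = []
for _ in range(n_rounds):
    guess = rng.integers(0, 2, size=n_holdout)  # a "submission" that guesses at random
    if holdout_accuracy(guess) > 0.5:           # adaptive step: keep whatever looks good
        kept.append(guess)

# Combine the lucky guesses by majority vote, as an adaptive competitor might.
ensemble = (np.mean(kept, axis=0) > 0.5).astype(int)
print("leaderboard (holdout) accuracy:", holdout_accuracy(ensemble))  # well above 0.5
print("true accuracy on fresh data: 0.5 by construction")
```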

Reusable holdout sets

A standard proposal for coping with adaptivity is simply to discourage it. In the empirical sciences, this proposal is known as pre-registration and requires the researcher to specify the exact experimental setup ahead of time. While possible in some simple cases, it is in general too restrictive as it runs counter to today’s complex data analysis workflows.

Rather than limiting the analyst, our approach provides means of reliably verifying the results of an arbitrary adaptive data analysis. The key tool for doing so is what we call the reusable holdout method. As with the classic holdout method discussed above, the analyst is given unfettered access to the training data. What changes is that there is a new algorithm in charge of evaluating statistics on the holdout set. This algorithm ensures that the holdout set maintains the essential guarantees of fresh data over the course of many estimation steps.
The limit of the method is determined by the size of the holdout set - the number of times that the holdout set may be used grows roughly as the square of the number of collected data points in the holdout, as our theory shows.
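As a rough illustration, here is a minimal sketch in the spirit of the Thresholdout mechanism described in the paper. The parameter values are arbitrary and the budget accounting of the full algorithm is omitted; the point is only to show the interface: the analyst asks for a statistic, and a noised holdout answer is released only when the training estimate drifts away from it.

```python
import numpy as np

rng = np.random.default_rng(3)

def reusable_holdout_query(train_value, holdout_value, threshold=0.04, sigma=0.01):
    """Answer one query to the holdout (Thresholdout-style sketch).

    If the training estimate agrees with the holdout estimate up to a noisy
    threshold, the training estimate itself is returned; otherwise a noised
    holdout estimate is released. The full algorithm also tracks a budget of
    how many times the second branch may fire.
    """
    if abs(train_value - holdout_value) > threshold + rng.laplace(scale=sigma):
        return holdout_value + rng.laplace(scale=sigma)
    return train_value

# Example: validating a model's accuracy estimated on training vs. holdout data.
print(reusable_holdout_query(0.81, 0.74))  # large gap: typically a noisy holdout value
print(reusable_holdout_query(0.75, 0.74))  # small gap: typically the training value
```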

Armed with the reusable holdout, the analyst is free to explore the training data and verify tentative conclusions on the holdout set. It is now entirely safe to use any information provided by the holdout algorithm in the choice of new analyses to carry out, or the tweaking of existing models and parameters.

A general methodology

The reusable holdout is only one instance of a broader methodology that is, perhaps surprisingly, based on differential privacy—a notion of privacy preservation in data analysis. At its core, differential privacy is a notion of stability requiring that any single sample should not influence the outcome of the analysis significantly.
Example of a stable learning algorithm: Deletion of any single data point does not affect the accuracy of the classifier much.
A beautiful line of work in machine learning shows that various notions of stability imply generalization. That is, any sample estimate computed by a stable algorithm (such as the prediction accuracy of a model on a sample) must be close to what we would observe on fresh data.

What sets differential privacy apart from other stability notions is that it is preserved by adaptive composition. Combining multiple algorithms that each preserve differential privacy yields a new algorithm that also satisfies differential privacy, albeit at some quantitative loss in the stability guarantee. This is true even if the output of one algorithm influences the choice of the next. This strong adaptive composition property is what makes differential privacy an excellent stability notion for adaptive data analysis.
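As a toy illustration of composition, here is a sketch of the standard Laplace mechanism (the epsilon values and the queries are invented for this example, and this is not code from the paper): two differentially private releases on the same data remain differentially private even when the second query is chosen based on the first answer, with the privacy parameters adding up.

```python
import numpy as np

rng = np.random.default_rng(4)

def private_mean(data, epsilon, value_range=1.0):
    """Release the mean of values in [0, value_range] via the Laplace mechanism (epsilon-DP)."""
    sensitivity = value_range / len(data)  # changing one sample moves the mean by at most this much
    return data.mean() + rng.laplace(scale=sensitivity / epsilon)

data = rng.random(1000)                    # 1,000 values in [0, 1]
answer1 = private_mean(data, epsilon=0.5)  # first release: 0.5-DP

# The second query is chosen adaptively, based on the first noisy answer:
# the mean of the data clipped at answer1 (values still lie in [0, 1]).
answer2 = private_mean(np.clip(data, 0.0, answer1), epsilon=0.5)  # second release: 0.5-DP

# By adaptive composition, releasing both answers together is (0.5 + 0.5)-DP.
print(answer1, answer2)
```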

In a nutshell, the reusable holdout mechanism is simply this: access the holdout set only through a suitable differentially private algorithm. It is important to note, however, that the user does not need to understand differential privacy to use our method. The user interface of the reusable holdout is the same as that of the widely used classical method.

Reliable benchmarks

A closely related work with Avrim Blum dives deeper into the problem of maintaining a reliable leaderboard in machine learning competitions (see this blog post for more background). While the reusable holdout could directly be used for this purpose, it turns out that a variant of the reusable holdout, which we call the Ladder algorithm, provides even better accuracy.
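A minimal sketch of the Ladder idea follows; the step size, rounding, and decision rule here are illustrative simplifications of ours rather than the exact algorithm from that work. The public score only moves when a submission improves on the best previous score by a meaningful margin, which sharply limits how much leaderboard feedback a competitor can exploit.

```python
def ladder_update(holdout_loss, best_so_far, step=0.01):
    """One leaderboard update in the spirit of the Ladder algorithm (sketch).

    A new submission changes the public score only if it beats the previous
    best by at least `step`; otherwise the previous best score is re-released.
    """
    if holdout_loss < best_so_far - step:
        return round(holdout_loss / step) * step  # release a rounded, improved score
    return best_so_far

# Example: the second submission is a tiny (likely spurious) improvement and is ignored.
best = 1.0
best = ladder_update(0.312, best)  # -> 0.31
best = ladder_update(0.309, best)  # -> 0.31 (improvement below the step size)
print(best)
```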

This method is not just useful for machine learning competitions, since there are many problems that are roughly equivalent to that of maintaining an accurate leaderboard in a competition. Consider, for example, a performance benchmark that a company uses to test improvements to a system internally before deploying them in a production system. As the benchmark data set is used repeatedly and adaptively for tasks such as model selection, hyper-parameter search and testing, there is a danger that eventually the benchmark becomes unreliable.

Conclusion

Modern data analysis is inherently an adaptive process. Attempts to limit what data scientists will do in practice are ill-fated. Instead we should create tools that respect the usual workflow of data science while at the same time increasing the reliability of data-driven insights. It is our goal to continue exploring techniques that can help create more reliable validation techniques and benchmarks that track true performance more accurately than existing methods.