
Rethinking Attention with Performers

Transformer models have achieved state-of-the-art results across a diverse range of domains, including natural language, conversation, images, and even music. The core block of every Transformer architecture is the attention module, which computes similarity scores for all pairs of positions in an input sequence. This, however, scales poorly with the length of the input sequence, requiring quadratic computation time to produce all similarity scores, as well as quadratic memory to store the matrix of scores.
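To make the quadratic bottleneck concrete, here is a minimal NumPy sketch of standard softmax attention (shapes and values are illustrative only; real implementations add masking, multiple heads, and batching):

```python
import numpy as np

def standard_attention(Q, K, V):
    """Vanilla softmax attention: materializes the full L x L score matrix."""
    L, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                # (L, L): quadratic in L
    scores -= scores.max(axis=1, keepdims=True)  # subtract row max for stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)            # row-wise softmax
    return A @ V                                 # (L, d) output

rng = np.random.default_rng(0)
L, d = 256, 16
Q, K, V = rng.normal(size=(3, L, d))
out = standard_attention(Q, K, V)
assert out.shape == (L, d)                       # but memory/compute were O(L^2)
```

Doubling L quadruples both the score-matrix memory and the matmul FLOPs, which is exactly the scaling the Performer is designed to remove.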

For applications where long-range attention is needed, several fast and more space-efficient proxies have been proposed, such as memory-caching techniques, but by far the most common approach is sparse attention. Sparse attention reduces the computation time and memory requirements of the attention mechanism by computing a limited selection of similarity scores from a sequence rather than all possible pairs, resulting in a sparse matrix rather than a full one. These sparse entries may be manually proposed, found via optimization methods, learned, or even randomized, as demonstrated by methods such as Sparse Transformers, Longformers, Routing Transformers, Reformers, and Big Bird. Since sparse matrices can also be represented by graphs and edges, sparsification methods are also motivated by the graph neural network literature, with specific relationships to attention outlined in Graph Attention Networks. Such sparsity-based architectures usually require additional layers to implicitly produce a full attention mechanism.

Standard sparsification techniques. Left: Example of a sparsity pattern, where tokens attend only to other nearby tokens. Right: In Graph Attention Networks, tokens attend only to their neighbors in the graph, which should have higher relevance than other nodes. See Efficient Transformers: A Survey for a comprehensive categorization of various methods.

Unfortunately, sparse attention methods can still suffer from a number of limitations. (1) They require efficient sparse-matrix multiplication operations, which are not available on all accelerators; (2) they usually do not provide rigorous theoretical guarantees for their representation power; (3) they are optimized primarily for Transformer models and generative pre-training; and (4) they usually stack more attention layers to compensate for sparse representations, making them difficult to use with other pre-trained models, thus requiring retraining and significant energy consumption. In addition to these shortcomings, sparse attention mechanisms are often still not sufficient to address the full range of problems to which regular attention methods are applied, such as Pointer Networks. There are also some operations that cannot be sparsified, such as the commonly used softmax operation, which normalizes similarity scores in the attention mechanism and is used heavily in industry-scale recommender systems.

To resolve these issues, we introduce the Performer, a Transformer architecture with attention mechanisms that scale linearly, thus enabling faster training while allowing the model to process longer sequences, as required for certain image datasets such as ImageNet64 and text datasets such as PG-19. The Performer uses an efficient (linear) generalized attention framework, which allows a broad class of attention mechanisms based on different similarity measures (kernels). The framework is implemented by our novel Fast Attention Via Positive Orthogonal Random Features (FAVOR+) algorithm, which provides scalable low-variance and unbiased estimation of attention mechanisms that can be expressed by random feature map decompositions (in particular, regular softmax-attention). We obtain strong accuracy guarantees for this method while preserving linear space and time complexity, and the method can also be applied to standalone softmax operations.

Generalized Attention
In the original attention mechanism, the query and key inputs, corresponding respectively to rows and columns of a matrix, are multiplied together and passed through a softmax operation to form an attention matrix, which stores the similarity scores. Note that in this method, one cannot decompose the query-key product back into its original query and key components after passing it into the nonlinear softmax operation. However, it is possible to decompose the attention matrix back to a product of random nonlinear functions of the original queries and keys, otherwise known as random features, which allows one to encode the similarity information in a more efficient manner.

LHS: The standard attention matrix, which contains all similarity scores for every pair of entries, formed by a softmax operation on the query and keys, denoted by q and k. RHS: The standard attention matrix can be approximated via lower-rank randomized matrices Q′ and K′ with rows encoding potentially randomized nonlinear functions of the original queries/keys. For the regular softmax-attention, the transformation is very compact and involves an exponential function as well as random Gaussian projections.

Regular softmax-attention can be seen as a special case with these nonlinear functions defined by exponential functions and Gaussian projections. Note that we can also reason inversely, by implementing more general nonlinear functions first, implicitly defining other types of similarity measures, or kernels, on the query-key product. We frame this as generalized attention, based on earlier work in kernel methods. Although for most kernels, closed-form formulae do not exist, our mechanism can still be applied since it does not rely on them.
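As an illustrative sketch of generalized attention (not the paper's exact implementation), the code below builds a row-normalized attention matrix from an arbitrary feature map; the ReLU map and the small epsilon are assumptions made for the demo:

```python
import numpy as np

def kernel_attention_matrix(Q, K, feature_map):
    """Generalized attention: similarity k(q, k) = feature_map(q) . feature_map(k),
    with row normalization playing the role of softmax."""
    Qp, Kp = feature_map(Q), feature_map(K)
    A = Qp @ Kp.T                              # kernel similarity scores
    return A / A.sum(axis=1, keepdims=True)    # rows sum to 1, like softmax

# ReLU feature map (in the spirit of Performer-ReLU); epsilon keeps rows nonzero.
relu_map = lambda X: np.maximum(X, 0.0) + 1e-6

rng = np.random.default_rng(1)
Q, K = rng.normal(size=(2, 32, 8))
A = kernel_attention_matrix(Q, K, relu_map)
assert A.shape == (32, 32) and np.allclose(A.sum(axis=1), 1.0)
```

Swapping in a different `feature_map` implicitly defines a different kernel, which is the sense in which the framework generalizes softmax attention.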

To the best of our knowledge, we are the first to show that any attention matrix can be effectively approximated in downstream Transformer-applications using random features. The novel mechanism enabling this is the use of positive random features, i.e., positive-valued nonlinear functions of the original queries and keys, which prove to be crucial for avoiding instabilities during training and provide more accurate approximation of the regular softmax attention mechanism.
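The positive random features admit a compact sketch. Using the standard Gaussian identity exp(q·k) = E_w[exp(w·q − ||q||²/2) · exp(w·k − ||k||²/2)], the map below gives an unbiased, positive-valued estimate of the softmax kernel (we use i.i.d. rather than orthogonal projections here for brevity):

```python
import numpy as np

def positive_features(X, W):
    """phi(x)_j = exp(w_j . x - ||x||^2 / 2) / sqrt(m): positive random features
    whose inner products are unbiased estimates of exp(q . k)."""
    m = W.shape[0]
    return np.exp(X @ W.T - 0.5 * np.sum(X**2, axis=-1, keepdims=True)) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 4, 100_000                    # small d, many features for a tight estimate
W = rng.normal(size=(m, d))          # i.i.d. Gaussian projections (orthogonal in FAVOR+)
q = 0.5 * rng.normal(size=(1, d))
k = 0.5 * rng.normal(size=(1, d))

exact = float(np.exp(q @ k.T))                               # softmax kernel exp(q . k)
approx = float(positive_features(q, W) @ positive_features(k, W).T)
assert abs(approx - exact) / exact < 0.1                     # Monte Carlo estimate
```

Because every feature value is positive, the estimated kernel values can never go negative, which is one intuition for why this choice avoids the training instabilities mentioned above.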

Towards FAVOR+: Fast Attention via Matrix Associativity
The decomposition described above allows one to store the implicit attention matrix with linear, rather than quadratic, memory complexity. One can also obtain a linear time attention mechanism using this decomposition. While the original attention mechanism multiplies the stored attention matrix with the value input to obtain the final result, after decomposing the attention matrix, one can rearrange matrix multiplications to approximate the result of the regular attention mechanism, without explicitly constructing the quadratic-sized attention matrix. This ultimately leads to FAVOR+.

Left: Standard attention module computation, where the final desired result is computed by performing a matrix multiplication with the attention matrix A and value tensor V. Right: By decoupling matrices Q′ and K′ used in lower rank decomposition of A and conducting matrix multiplications in the order indicated by dashed-boxes, we obtain a linear attention mechanism, never explicitly constructing A or its approximation.
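The rearrangement in the figure is simply matrix associativity. A minimal NumPy sketch (random positive matrices stand in for actual query/key feature maps) confirms that computing K′ᵀV first reproduces the explicit-matrix result without ever forming the L × L matrix:

```python
import numpy as np

def linear_attention(Qp, Kp, V):
    """(Q' K'^T) V == Q' (K'^T V): the right-hand grouping is linear in L
    and never materializes the L x L implicit attention matrix."""
    kv = Kp.T @ V                          # (m, d) summary of keys and values
    norm = Qp @ Kp.sum(axis=0)             # row sums of the implicit matrix A
    return (Qp @ kv) / norm[:, None]

rng = np.random.default_rng(2)
L, m, d = 64, 16, 8
Qp, Kp = rng.uniform(0.1, 1.0, size=(2, L, m))   # stand-in positive features
V = rng.normal(size=(L, d))

fast = linear_attention(Qp, Kp, V)
A = Qp @ Kp.T                                    # quadratic route, for comparison
slow = (A / A.sum(axis=1, keepdims=True)) @ V
assert np.allclose(fast, slow)
```

The fast path costs O(Lmd) time and O(L(m + d)) memory, versus O(L²) for the explicit matrix.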

The above analysis is relevant for so-called bidirectional attention, i.e., non-causal attention where there is no notion of past and future. For unidirectional (causal) attention, where tokens do not attend to other tokens appearing later in the input sequence, we slightly modify the approach to use prefix-sum computations, which only store running totals of matrix computations rather than storing an explicit lower-triangular regular attention matrix.

Left: Standard unidirectional attention requires masking the attention matrix to obtain its lower-triangular part. Right: Unbiased approximation on the LHS can be obtained via a prefix-sum mechanism, where the prefix-sum of the outer-products of random feature maps for keys and value vectors is built on the fly and left-multiplied by query random feature vector to obtain the new row in the resulting matrix.
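The prefix-sum variant can be sketched similarly: a running sum of key-feature/value outer products replaces the masked lower-triangular matrix, so each query only sees earlier positions (again with stand-in positive features rather than real feature maps):

```python
import numpy as np

def causal_linear_attention(Qp, Kp, V):
    """Unidirectional attention via prefix sums: running totals of k' v^T outer
    products stand in for the explicit lower-triangular attention matrix."""
    L, m = Qp.shape
    d = V.shape[1]
    S = np.zeros((m, d))          # prefix sum of outer(k'_j, v_j) for j <= i
    z = np.zeros(m)               # prefix sum of k'_j, for row normalization
    out = np.empty((L, d))
    for i in range(L):
        S += np.outer(Kp[i], V[i])
        z += Kp[i]
        out[i] = (Qp[i] @ S) / (Qp[i] @ z)
    return out

rng = np.random.default_rng(3)
L, m, d = 32, 8, 4
Qp, Kp = rng.uniform(0.1, 1.0, size=(2, L, m))
V = rng.normal(size=(L, d))

fast = causal_linear_attention(Qp, Kp, V)
A = np.tril(Qp @ Kp.T)                               # masked explicit attention
slow = (A / A.sum(axis=1, keepdims=True)) @ V
assert np.allclose(fast, slow)
```

Only the running totals S and z are stored between steps, so memory stays linear in the sequence length.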

We first benchmark the space- and time-complexity of the Performer and show that the attention speedups and memory reductions are empirically nearly optimal, i.e., very close to simply not using an attention mechanism at all in the model.

Bidirectional timing for the regular Transformer model in log-log plot with time (T) and length (L). Lines end at the limit of GPU memory. The black line (X) denotes the maximum possible memory compression and speedups when using a “dummy” attention block, which essentially bypasses attention calculations and demonstrates the maximum possible efficiency of the model. The Performer model is nearly able to reach this optimal performance in the attention component.

We further show that the Performer, using our unbiased softmax approximation, is backwards compatible with pretrained Transformer models after a bit of fine-tuning, which could potentially lower energy costs by improving inference speed, without having to fully retrain pre-existing models.

Using the One Billion Word Benchmark (LM1B) dataset, we transferred the original pre-trained Transformer weights to the Performer model, which produces an initial non-zero 0.07 accuracy (dotted orange line). Once fine-tuned however, the Performer quickly recovers accuracy in a small fraction of the original number of gradient steps.

Example Application: Protein Modeling
Proteins are large molecules with complex 3D structures and specific functions that are essential to life. Like words, proteins are specified as linear sequences where each character is one of 20 amino acid building blocks. Applying Transformers to large unlabeled corpora of protein sequences (e.g. UniRef) yields models that can be used to make accurate predictions about the folded, functional macromolecule. Performer-ReLU (which uses ReLU-based attention, an instance of generalized attention that is different from softmax) performs strongly at modeling protein sequence data, while Performer-Softmax matches the performance of the Transformer, as predicted by our theoretical results.

Performance at modeling protein sequences. Train = Dashed, Validation = Solid, Unidirectional = (U), Bidirectional = (B). We use the 36-layer model parameters from ProGen (2019) for all runs, each using a 16×16 TPU-v2. Batch sizes were maximized for each run, given the corresponding compute constraints.

Below we visualize a protein Performer model, trained using the ReLU-based approximate attention mechanism. Using the Performer to estimate similarity between amino acids recovers similar structure to well-known substitution matrices obtained by analyzing evolutionary substitution patterns across carefully curated sequence alignments. More generally, we find local and global attention patterns consistent with Transformer models trained on protein data. The dense attention approximation of the Performer has the potential to capture global interactions across multiple protein sequences. As a proof of concept, we train models on long concatenated protein sequences, which overloads the memory of a regular Transformer model, but not the Performer due to its space efficiency.

Left: Amino acid similarity matrix estimated from attention weights. The model recognizes highly similar amino acid pairs such as (D,E) and (F,Y), despite only having access to protein sequences without prior information about biochemistry. Center: Attention matrices from 4 layers (rows) and 3 selected heads (columns) for the BPT1_BOVIN protein, showing local and global attention patterns.
Performance on sequences up to length 8192 obtained by concatenating individual protein sequences. To fit into TPU memory, the Transformer’s size (number of layers and embedding dimensions) was reduced.

Our work contributes to the recent efforts on non-sparsity-based methods and kernel-based interpretations of Transformers. Our method is interoperable with other techniques like reversible layers, and we have even integrated FAVOR+ with the Reformer's code. We provide the links for the paper, Performer code, and the Protein Language Modeling code. We believe that our research opens up a brand new way of thinking about attention, Transformer architectures, and even kernel methods.

This work was performed by the core Performer designers Krzysztof Choromanski (Google Brain Team, Tech and Research Lead), Valerii Likhosherstov (University of Cambridge) and Xingyou Song (Google Brain Team), with contributions from David Dohan, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. We give special thanks to the Applied Science Team for jointly leading the research effort on applying efficient Transformer architectures to protein sequence data.

We additionally wish to thank Joshua Meier, John Platt, and Tom Weingarten for many fruitful discussions on biological data and useful comments on this draft, along with Yi Tay and Mostafa Dehghani for discussions on comparing baselines. We further thank Nikita Kitaev and Wojciech Gajewski for multiple discussions on the Reformer, and Aurko Roy and Ashish Vaswani for multiple discussions on the Routing Transformer.

An ultramarathoner running so others can “rise”

Editor’s note: Passion Projects is a new Keyword series highlighting Googlers with unexpected interests outside the office.

If you had asked Zanele Hlatshwayo several years ago if she’d ever go on a run for 10 days straight, she’d probably laugh. But these days, that’s exactly what she’s training for—and she changed her mind about running for a deeply personal reason.

Zanele, an ad sales specialist based in Google’s Johannesburg office, turned to running to cope after a tragedy in her family. In 2010, her father committed suicide, and she needed to find a way to deal with her grief. She already went to the gym to work out, but one day she decided to check out the running track there, and that changed everything. Even though she used to hate running, the sport became a crucial outlet for her. Pushing through the pain of a long run taught her she could overcome anything.

“I got tired of feeling sorry for myself and crying and trying to make sense of the reason why he actually committed suicide,” she says. “Running became my sacred space, so to speak, a space where I could really clear my mind.” She started to run races with a few former colleagues, and she was hooked.

The self-described “adrenaline junkie” wasn’t content with just some 5Ks, though. She tried half marathons, then tried full marathons, and then entered the Comrades ultramarathon, which was a whopping 90 kilometers (55.9 miles). At that point, she was running to test her own limits, but wanted to do more. “There’s no point for me in running all these races and just running for medals,” she says. “I wanted to actually run for a purpose.”

Inspired by her father’s legacy, and also by a friend who was going through depression, Zanele decided to start a campaign called Rise 18 last year. In 2018, she ran 18 races to raise money for a suicide prevention help line, the only one of its kind in South Africa.

Zanele's Rise 18 promotion video

Zanele says there’s a major stigma around depression and suicide, not just in South Africa but around the world. “It’s really a state of emergency at the moment, because there aren’t enough resources to assist people who may be struggling,” she says. “And people are too scared to speak out because they don’t want to be made to feel as if there’s something wrong with them.”

The longest, and final, race of Rise 18 was 100 miles long, and took 26 hours to complete. She showed up to the race injured from her previous long-distance runs, and never stopped to sleep the entire race, but she was still determined to finish. For her, the race was mental, not just physical.

Zanele running the Washie 100 Miler ultramarathon.


“The sun rises while you’re still on the road, the sun sets while you’re still on the road, and that takes a lot of mental preparation,” she says. “For me, what really kept me going was the goal I had made to myself, and the commitment I made to myself. I don’t want somebody else to go through what my father did.”

She finished that race, and went above and beyond her campaign’s goal. Her initial aim was to raise R 180,000 ($12,716) to support the help line, but she exceeded that, ultimately raising R 210,000 ($14,575). When she donated the money to the charity, they told her that money would fund 11,000 calls to the hotline, which is entirely run by volunteers. “That’s 11,000 lives,” she says. “It’s truly, truly amazing.”


Now that Rise 18 has completed, Zanele is setting her sights on even bigger goals. She’s working on building an app to connect South Africans to therapists, and plans to raise funds for the project through her next set of races, which will include an Ironman triathlon. (You can find out more on her campaign page.)

But her biggest challenge is still ahead of her: a 10-day run from Johannesburg to Cape Town this May. She’ll travel with a group of 12, who will take day or night shifts on the road, for the Ocal Global Journey for Change. And through it all, she’ll have her larger mission in mind: The group is raising funds to help provide wheelchairs to children with disabilities.

“An experience like that makes you realize how powerful the human mind and the human body is. We’re able to take so much pain,” she says. “And for me, when I’m running, the pain I go through really signifies the pain people go through when they have challenges in their lives. That small pain I feel does not amount to the challenges those people have to face on a daily basis.”

If you or someone you know needs help, you can contact the National Suicide Prevention Lifeline in the U.S. at 1-800-273-TALK(8255), or, in South Africa, the South African Depression and Anxiety Group’s Suicide Crisis Line at 0800 567 567.

Registration now open for DevFest OnAir!

Posted by Erica Hanson, Program Manager in Developer Relations

We’re excited to announce the first official DevFest OnAir! DevFest OnAir is an online conference taking place on December 11th, 2018 featuring sessions from DevFest events around the globe. These community-led developer sessions are hosted by GDGs (Google Developer Groups) focused on community, networking and learning about Google technologies. With over 500 communities and DevFest events happening all over the world, DevFest OnAir brings this global experience online for the first time!

Exclusive content.

DevFest OnAir includes exclusive content from Google in addition to content from the DevFest community. Watch content from up to three tracks at any time:

  • Cloud
  • Mobile
  • Voice, Web & more

Sessions cover multiple products such as Android, Google Cloud Platform, Firebase, Google Assistant, Flutter, machine learning with TensorFlow, and mobile web.

Tailored to your time zone.

Anyone can join, no matter where you are. We’re hosting three broadcasts around the world, around the clock, so there’s a convenient time for you to tune in, whether you’re at home or at work.

Ask us a question live.

Our live Q&A forum will be open throughout the online event to spark conversation and get you the answers you need.


Join the fun with interactive trivia during DevFest OnAir, where you can receive something special!

Every participant who tunes in live on December 11th will receive one free month of learning on Qwiklabs.

Sign up now

Registration is free. Sign up here.

Learn more about DevFest 2018 here and find a DevFest event near you here.

GDGs are local groups of developers interested in Google products and APIs. Each GDG group can host a variety of technical activities for developers – from just a few people getting together to watch the latest Google Developers videos, to large gatherings with demos, tech talks, or hackathons. Learn more about GDG here.

Follow us on Twitter and YouTube.

How YouTube can help people develop their careers and grow their businesses

As new technologies change the way people do their jobs or run their businesses, YouTube can help them acquire new skills to take advantage of the opportunities ahead.

Video is much more than just a source of entertainment; it’s also a powerful medium for learning. YouTube has a wealth of resources to help people advance their careers, prepare for new jobs or grow their businesses. More than 500 million learning-related videos are viewed on the platform every day. These videos are made and shared by a highly motivated group of creators, such as Linda Raynier, whose videos teach job seekers how to nail an interview or write a resume that gets noticed; or Vanessa Van Edwards, who helps people master soft skills like how to use body language in an interview or communicate a great elevator pitch. Thanks to creators like Linda and Vanessa, people can learn new skills for free and engage with a YouTube community of experts for valuable support.

Finding out the facts

Together with brand consultancy Flamingo, we recently surveyed internet users to discover what they think of YouTube and how it helps them learn new skills.

In the ten European countries covered in the research, 64 percent of respondents felt that YouTube helps them learn new skills that enable personal or professional advancement, making it the highest-rated channel of those included in the survey. YouTube scores highly on this measure for both men (62 percent) and women (66 percent), and across all age groups, at least 50 percent of respondents agreed with the statement.


As part of the research, we ran interviews with people who note that YouTube is a key resource for learning and building their career. One respondent in Saudi Arabia observed: “YouTube makes me feel like I have a teacher—a teacher that’s available at any moment.” Likewise, a teenager from France said, “I decided I wanted to work in fashion thanks to YouTube. I learned how to apply makeup and spot fashion trends thanks to what I learned from YouTubers.”

No matter if you want to launch your business or find tips to get a new job, YouTube is a resource that’s always there to help you grow. What will you learn next?

A new public energy tool to reduce emissions

Renewable energy, and the transition to a low-carbon future more broadly, have long been priorities for Google. However, there is still a long way to go toward the future we envision.

Electricity generation from fossil fuels accounts for about 45 percent of global carbon emissions, yet useful and accessible information to guide the transition to clean energy is still needed. Now, with satellite data, cutting-edge science and powerful cloud computing technology like Google Earth Engine, we can achieve an unprecedented understanding of our changing environment and use it to guide wiser decision-making.

Today, the World Resources Institute (WRI) and Google, in partnership with leading global research institutions including Global Energy Observatory, KTH Royal Institute of Technology in Stockholm, and the University of Groningen, are releasing a global database of power plants. This database standardizes power sector information to encourage providers to adopt a common approach for reporting power plant features—like location, fuel type, and emissions—in the future.


Global database of power plants 

Drawing from over 700 publicly available data sources, this database compiles information to cover 80 percent of globally installed electrical capacity from 168 countries, and includes capacity, generation rates, fuel type, ownership and location. Making this kind of information open and accessible to researchers and scientists can help reduce carbon emissions and increase energy access. Power capacity and generation indicators can be used to develop a more granular understanding of the emissions created from the electricity we use, and to develop pathways to decarbonize electricity supply.

Information about power plants—such as location and size—can help researchers study emissions and air pollution at an international, national, and local scale. And, as a high-quality geospatial data source, it can also be used to augment remote sensing and enable machine learning analysis to discover a wide variety of important environmental insights. The data is now available in Earth Engine and WRI’s Resource Watch, where it can be easily combined with other data to create new insights.

Until recently it wasn’t possible to monitor the health of Earth’s critical resources in both a globally consistent and locally relevant manner. Making global data openly available for researchers is a core mission of the Earth Outreach team. By working closely with on-the-ground partners we can put this data into the hands of those who can take action. With the increased visibility into the power sector that this database provides, we see the potential to make the transition to a low-carbon future happen even faster.

Introducing Nearby: A new way to discover the things around you

The Play Store offers over one million apps – many of which are created to be used in specific locations or situations. The right app at the right moment lets you get more done. For example, at a store, you may want a barcode scanner to check prices and reviews for an item. Or when you’re at a museum, an audio tour would enhance the experience as you make your way around the exhibits.

But getting the right apps at the right time can be tough if you don’t already know about them. So we’re introducing a new Android feature called Nearby, which notifies you of things that can be helpful near you.


Select Google devices, including Google Cast and Android Wear watches, will also let you set them up simply by tapping a notification when you’re near them.


Earlier this year, we started experimenting with surfacing websites relevant to a place in Chrome through the Physical Web project. In addition to displaying relevant apps, Nearby will surface these websites directly from Android. To deploy your own beacons that work with Nearby, check out our developer blog post.


To use Nearby, just turn on Bluetooth and Location, and we’ll show you a notification if a nearby app or website is available. Once you’ve opted-in, tapping on a notification takes you straight into the intended experience. If you’re not interested, just swipe it away to give us a clear signal.

Nearby has started rolling out to users as part of the upcoming Google Play Services release and will work on Android 4.4 (KitKat) and above.