How it’s Made: Interacting with Gemini through multimodal prompting

Posted by Alexander Chen, Creative Director


Let’s try an experiment. We’ll show this picture to our multimodal model Gemini and ask it to describe what it sees:

A close up of a hand with five fingers spread out, palm facing down on a wooden surface.
Tell me what you see
Gemini: I see a person's right hand. The hand is open with the fingers spread apart.

Ok, let’s try this one:

A closeup of the same hand, now balled up in a fist.
Gemini: A person knocking on a wooden door.

How about this one?

A closeup of the same hand, with two fingers extended like that of a peace sign.
Gemini: I see a hand with two fingers extended, which is a common symbol for the number two.

But wait … what if we asked Gemini to reason about all of these images together?

A sequence of three frames depicting the same hand in different formations. The first frame shows the hand with all five fingers spread out, palm facing down. The second frame shows the hand balled up in a fist. The third frame shows the hand with two fingers extended, like that of a peace sign.
What do you think I'm doing? Hint: it's a game.
Gemini: You're playing rock, paper, scissors.

Nice. What else could we try? Let's see if Gemini can notice this pattern:

A sequence of six frames that alternate between a hand that is balled up in a fist, and a hand that has two fingers extended.
I’m playing rock paper scissors. Do you notice a pattern in my gameplay?
Gemini: You are alternating between rock and scissors.

Neat, it spotted the pattern! But does Gemini really know how this game works? Let’s try asking:

Hmm ... do you think it's a good strategy?
Gemini: No, it is not a good strategy. Your opponent will quickly learn your pattern and will be able to win every time. You should try to randomize your choices more.
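Good advice, Gemini! A fixed pattern really is trivially exploitable. As a toy illustration (our own hypothetical heuristic, not anything Gemini produced), here's an opponent that would win every round against the alternating strategy above:

// Counter-moves for rock, paper, scissors.
const counters = { rock: 'paper', paper: 'scissors', scissors: 'rock' };

// Hypothetical heuristic: if the last two moves differ, assume the player is
// alternating and will return to the earlier move; otherwise assume a repeat.
function predictNext(history) {
  const n = history.length;
  if (n >= 2 && history[n - 1] !== history[n - 2]) {
    return history[n - 2];
  }
  return history[n - 1];
}

const myMoves = ['rock', 'scissors', 'rock', 'scissors'];
const predicted = predictNext(myMoves); // 'rock'
console.log(counters[predicted]); // 'paper', which beats the predicted move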

Okay, for this last one, let’s put your multimodal reasoning capabilities to the test. Here’s your prompt:

This is a secret message. What does it say?

A six by eight grid of frames showing a hand either balled up in a fist or with all five fingers spread out, palm facing down. The series of photos spells 'Gemini' in ASCII binary.

Can you figure it out? We have gotten Gemini to solve it, but we won’t spoil it by putting the answer in this post. 🙂
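By the way, if you want to check your own answer in code, here’s a minimal sketch of the encoding, assuming an open hand is a 1 bit, a fist is a 0 bit, and each row of eight frames spells one ASCII character:

// Decode one eight-bit binary string per row into its ASCII character.
function decodeFrames(rows) {
  return rows.map(bits => String.fromCharCode(parseInt(bits, 2))).join('');
}

// Read the bits off each row of the grid yourself; we won't print them here 🙂
const rows = []; // e.g. six strings of eight '0'/'1' characters
console.log(decodeFrames(rows));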



What is multimodal prompting?

Everything we did just now is an example of “multimodal prompting.” We’re basically giving Gemini combinations of different modalities — image and text in this case — and having Gemini respond by predicting what might come next. Gemini’s ability to seamlessly combine these modes together enables new possibilities for what you can do.

In this post, we’ll explore some of the prompting approaches we used in our Hands on with Gemini demo video. We’ll soon be rolling out Gemini for people to try in Google AI Studio, our free, web-based developer tool where you’ll be able to try your own multimodal prompts with Gemini. We’re hoping this guide of starter prompts and ideas helps inspire you to start exploring your own ideas.
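Once Gemini is available there, a multimodal prompt like the ones above might look roughly like this with the @google/generative-ai Node SDK. Treat this as a sketch: the model name, image file, and API key handling are assumptions, so check the official docs for your exact setup.

// Send one image plus a text prompt to Gemini and print the text response.
const { GoogleGenerativeAI } = require('@google/generative-ai');
const fs = require('fs');

const genAI = new GoogleGenerativeAI(process.env.API_KEY);

async function run() {
  const model = genAI.getGenerativeModel({ model: 'gemini-pro-vision' });
  const image = {
    inlineData: {
      data: fs.readFileSync('hand.jpg').toString('base64'), // hypothetical image file
      mimeType: 'image/jpeg',
    },
  };
  const result = await model.generateContent(['Tell me what you see', image]);
  console.log(result.response.text());
}

run();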



🧮 Spatial reasoning and logic

Logic and puzzles are a fun way to test Gemini. Let’s try one that requires both left-to-right spatial reasoning and knowledge about our solar system:

Three sticky notes on a wooden surface, each featuring a hand-drawn sketch of a solar body. The left-most note depicts a drawing of the Sun. The middle note has a drawing of Saturn. The right-most note shows a drawing of Earth.
Is this the right order? Consider the distance from the sun and explain your reasoning.

Gemini: No, the correct order is Sun, Earth, Saturn. The Sun is the closest to the center of the solar system, followed by the Earth, and then Saturn.

Nice! Here’s another, inspired by pinewood derby challenges:

Two sticky notes depicting two different hand-drawn cars going down-hill. The car on the left is blocky and awkwardly shaped, the car on the right is slanted and smooth.
Which of these cars is more aerodynamic? The one on the left or the right? Explain why, using specific visual details.

Gemini: The car on the right is more aerodynamic. It has a lower profile and a more streamlined shape. The car on the left has a higher profile and a more boxy shape, which makes it less aerodynamic.



📷 Image sequences

Next, what if we showed Gemini a sequence of images? Let’s see if we can show a few still frames from a game of charades and have Gemini guess the movie:

Four frames from a video of a person leaning back and moving their arms wildly.
Guess what movie I'm acting out.
Gemini: The Matrix
Nice! But which part specifically? Look at my body movements.
Gemini: The part where Neo dodges bullets.



🪄 Magic tricks

Exploring image sequences really lets you dig into how Gemini reasons over time. Let’s try a classic magic trick.

An image of two hands with palms facing the camera, there is a coin in the right hand.
Explain what I'm doing.
Gemini: You are holding a coin in your right hand.
An image of two hands in motion, palms turning inward and away from the camera.
An image of two hands, palms facing down and away from the camera.
Where do you think the coin is?
Gemini: The coin is in your right hand.

Sure, that’s a logical answer given what it’s seen. But let’s pull off a sleight of hand as we reveal this next image to Gemini...

An image of two hands, the left palm is facing down and the right hand is tilted upwards revealing nothing under the right palm.
Look here, what happened?
Gemini: The coin is gone.

Gemini notices it’s gone. But can it summarize what may have happened?

Summarize what may have happened, step by step, starting with the first image.

Gemini: 
1. You started with a coin in your right hand.
2. You then made a motion with your hands, and the coin disappeared.
3. You then showed that the coin was gone.

Gemini is leveraging the entire context of this conversation – both text and image – to summarize its answer. Finally, let’s ask:

Can you explain how it disappeared?
Gemini: The coin was palmed.

Gemini makes a really solid guess! But we won’t reveal our secret 🙂.



🔀 Cup shuffling

Let’s try a trick that requires memory and logic: the classic ball and cup shuffling game. For this one, we’ll break it down into steps for Gemini. First, let’s establish the basics with these instructions:

Instructions: There are three cups in position 1 (left), position 2 (middle), position 3 (right).

Next, we’ll need to tell Gemini where we’re placing the ball:

Starting position: The ball is in position 3.
Current state: empty, empty, ball

Notice how we came up with a simple format for remembering where the ball is with “empty, empty, ball.” Next, let’s show Gemini two example turns:

Images:
Three frames depicting a person shuffling three green cups. The person switches positions of the middle and right-most cups. 
Swap: 2 and 3
Current state: empty, ball, empty

Images:
Three frames depicting a person shuffling three green cups. The person switches positions of the middle and left-most cups. 
Swap: 1 and 2
Current state: ball, empty, empty

In these two example turns, we’re both showing it what a swap looks like, and how to update the ball position. Now, let’s see if Gemini can generalize to a swap it hasn't seen:

Images:
Three frames depicting a person shuffling three green cups. The person switches positions of the left-most and right-most cups. 
Swap: 

Here's Gemini's response:

1 and 3
Current state: empty, empty, ball

Gemini got it! It looked at these images and correctly inferred that cups 1 and 3 are being swapped. And it reasoned correctly about how to update the ball position. Let’s ask:

Tell me where the ball is and summarize how it got there, in paragraph form.

Gemini: The ball is in cup 3. 

Starting with the ball in cup 3, the first swap was cup 2 and 3. This put the ball in cup 2. The second swap was cup 1 and 2. This put the ball in cup 1. The third swap was cup 1 and 3. This put the ball in cup 3.

Not only did Gemini get the answer right, it also accurately summarized the game history. Of course, it won’t always get this challenge right. Sometimes the fake-out move (where you swap two empty cups) seems to trip it up, but sometimes it gets that too. But simple prompts like this make it really fun to rapidly test Gemini. You can change the variables in your prompt, including the order of swaps, and see how it does.
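Incidentally, the bookkeeping we’re asking Gemini to do here is easy to verify in ordinary code, which makes checking its answers quick as you vary the swaps. A minimal sketch:

// Track which of the three positions holds the ball through a list of swaps.
function trackBall(startPosition, swaps) {
  const state = ['empty', 'empty', 'empty'];
  state[startPosition - 1] = 'ball';
  for (const [a, b] of swaps) {
    [state[a - 1], state[b - 1]] = [state[b - 1], state[a - 1]];
  }
  return state;
}

// The game above: ball starts in position 3, then swaps 2&3, 1&2, 1&3.
console.log(trackBall(3, [[2, 3], [1, 2], [1, 3]]));
// -> [ 'empty', 'empty', 'ball' ]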



🔨 Tool use

If you want to use Gemini in your own apps, you’ll want it to be able to connect to other tools. Let’s try a simple idea where Gemini needs to combine multimodality with tool use: drawing a picture to search for music.

A sketch of musical instruments and a palm tree in a notebook. The sketch features a guitar, an amplifier, a drum set, and a palm tree drawn in a loose and expressive style. The notebook sits on a wooden desk.
Step 1. Look at this drawing, describe everything in it, and come up with a fun, specific idea for a good musical soundtrack genre, mood, or soundtrack. 
Step 2. Provide a search query in the form: "Search [insert query]"

Gemini: 
The drawing is of a guitar, drums, and an amplifier. There is also a palm tree. The mood is fun, upbeat, and tropical. The genre is reggae. The soundtrack could be used for a movie or TV show set in a tropical location. 

Search "Reggae music for a tropical vacation"

Nice! Gemini both reasons about what it sees and then generates a search query you can parse to do a search. It’s as if Gemini is acting as a translator for you – but instead of translating between languages, it’s translating modalities – from drawing to music in this case. With multimodal prompting, you can use Gemini to invent your own entirely new translations between different inputs and outputs.
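On the app side, the fixed Search "..." format is what makes the reply machine-readable. Here's a minimal sketch of the parsing step (the YouTube URL in the comment is just one example of a search tool you might hand the query to):

// Pull the query out of a reply that follows the Search "..." convention.
function extractQuery(reply) {
  const match = reply.match(/Search\s+"([^"]+)"/);
  return match ? match[1] : null;
}

const reply = 'The genre is reggae. ...\n\nSearch "Reggae music for a tropical vacation"';
const query = extractQuery(reply);
if (query) {
  // e.g. `https://www.youtube.com/results?search_query=${encodeURIComponent(query)}`
  console.log(query);
}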



🕹️ Game creation

What if we tried using Gemini to quickly prototype a multimodal game? Here’s an idea: a geography guessing game where you have to point at a map to make your guess. Let’s start by prompting Gemini with the core idea:

Instructions: Let's play a game. Think of a country and give me a clue. The clue must be specific enough that there is only one correct country. I will try pointing at the country on a map.

Next, let’s give Gemini an example turn of gameplay, showing it how we want it to handle both incorrect and correct answers:

Country: This country is a huge island that is mostly covered in ice.
(Answer: Greenland) 

Guess:
An image of a world map, a hand icon with the index finger pointing at Canada
I am pointing at the country of Canada *incorrect*

Guess:
An image of a world map, a hand icon with the index finger pointing at Greenland
I am pointing at the country of Greenland *correct*

Let’s give it a go and prompt Gemini to generate a clue:

Clue:
This country is known for its unique wildlife, including kangaroos and koalas. (Answer: Australia)

Ok, that’s a good clue. Let’s test out whether pointing will work. Just for fun, let’s try pointing at the wrong place first:

Guess:
A person pointing at Brazil on a map of the world with their right index finger. The map includes blue oceans, and green continents with no country borders.
I am pointing at the country of Brazil *incorrect*

Great! Gemini looked at my image and figured out I’m pointing at Brazil, and correctly reasoned that’s wrong. Now let’s point at the right place on the map:

A person pointing at Australia on a map of the world with their right index finger. The map includes blue oceans, and green continents with no country borders.
I am pointing at the country of Australia *correct*

Nice! We’ve basically taught Gemini our game logic just by giving it an example. You'll also notice that it generalized from the illustrated hand in the examples.
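If you want to reproduce this, the whole setup is just the instruction, one worked example turn, and a final “Clue:” that hands control to Gemini. A rough sketch of how the pieces line up (the image variables are placeholders, not a real API payload):

// Placeholder strings stand in for the two example guess images; in a real
// request these would be inline image parts like the SDK sketch earlier.
const mapImage1 = '<image: hand pointing at Canada on a world map>';
const mapImage2 = '<image: hand pointing at Greenland on a world map>';

const gamePrompt = [
  "Instructions: Let's play a game. Think of a country and give me a clue. " +
    'The clue must be specific enough that there is only one correct country. ' +
    'I will try pointing at the country on a map.',
  'Country: This country is a huge island that is mostly covered in ice. (Answer: Greenland)',
  'Guess:',
  mapImage1,
  'I am pointing at the country of Canada *incorrect*',
  'Guess:',
  mapImage2,
  'I am pointing at the country of Greenland *correct*',
  'Clue:', // Gemini continues from here with a new clue
];

console.log(gamePrompt.join('\n'));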



⌨️ Coding

Of course, to bring your game idea to life, you’ll eventually have to write some executable code. Let’s see if Gemini can make a simple countdown timer for a game, but with a few fun twists:

Implement a simple timer in HTML/CSS/Javascript. Use a sans serif font and dark mode. Start it at 10 seconds and start counting down. When it reaches zero, replace the timer with a random emoji that is associated with excitement and motivation! Then go back to the timer at 10 seconds and start counting down again.

With just this single instruction, Gemini gives us a working timer that does what we asked for:

An animated gif of a countdown timer starting from 10. At the end of the countdown, a rocket emoji is shown, followed by a lightning bolt emoji and a confetti emoji.

My favorite part is scrolling through Gemini’s source code to find the array of motivational emojis it picked for me:

const emojis = ['🚀', '⚡️', '🎉', '🎊', '🥳', '🤩', '✨'];
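For reference, the countdown logic boils down to something like this sketch (ours, not Gemini’s exact output; it assumes a page with a dark-mode-styled <div id="timer">):

// Count down from 10; at zero, flash a random motivational emoji, then restart.
const emojis = ['🚀', '⚡️', '🎉', '🎊', '🥳', '🤩', '✨']; // the array above
const display = document.getElementById('timer');
let seconds = 10;

setInterval(() => {
  if (seconds > 0) {
    display.textContent = seconds--;
  } else {
    display.textContent = emojis[Math.floor(Math.random() * emojis.length)];
    seconds = 10;
  }
}, 1000);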



👀 A sneak peek

Throughout this post, we’ve been giving Gemini an input, and having Gemini make predictions for what might come next. That's basically what prompting is. And our inputs have been multimodal – image and text, combined.

But so far we've only shown Gemini responding in text. Maybe you’re wondering, can Gemini also respond with a combination of image and text? It can! This is a capability of Gemini called “interleaved text and image generation.” While this feature won’t be ready in the first version of Gemini for people to try, we hope to roll it out soon. Here’s a sneak peek of what’s possible.

Let’s see if we could use Gemini to provide everyday creative inspiration. And let’s try it in a domain that requires a bit of multimodal reasoning ... crochet! 🧶. Similar to our map game above, let's provide one example turn of interaction:

Give me three ideas for crochet creations I could make with these two colors:

Balls of blue and yellow yarn on a wooden table.

Colors: I see blue and yellow yarn.
Idea 1: How about a cute dog with one yellow ear? 
Image 1:

A blue and yellow crocheted dog head with ears placed on a wooden table.

Idea 2: Or how about a jellyfish with a few yellow tentacles?
Image 2:

A blue and yellow crocheted jellyfish with tentacles placed on a wooden table.

Idea 3: Or how about a blue cat with a yellow nose?
Image 3:

A blue and yellow crocheted cat with whiskers placed on a wooden table.

We’re essentially teaching Gemini about how we want each interaction to go: “I’ll take a photo of two balls of yarn, and I expect you (Gemini) to both come up with an idea for something I could make, and generate an image of it.”

Now, let’s show it a new pair of yarn colors it hasn't yet seen, and see if it can generalize:

Give me three ideas for crochet creations I could make with these two colors:

Balls of blue and pink yarn on a wooden table.

Gemini:
Colors: I see blue and pink yarn.
Idea 1: How about a pig with blue ears?
Image 1:

A blue and pink crocheted pig face with ears placed on a wooden table.

Idea 2: Or an octopus with pink and blue tentacles?
Image 2:

A blue and pink crocheted octopus with tentacles placed on a wooden table.

Idea 3: Or a bunny with a pink nose?
Image 3:

A blue and pink crocheted bunny placed on a wooden table.

Nice! Gemini correctly reasoned about the new colors (“I see blue and pink yarn”) and generated these ideas and the images in a single, interleaved output of text and image.

What Gemini did here is fundamentally different from today’s text-to-image models. It's not just passing an instruction to a separate text-to-image model. It sees the image of my actual yarn on my wooden table, truly doing multimodal reasoning about my text and image together.


What's Next?

We hope you found this a helpful starter guide to get a sense of what’s possible with Gemini. We’re very excited to roll it out to more people soon so you can explore your own ideas through prompting. Stay tuned!

A New Foundation for AI on Android

Posted by Dave Burke, VP of Engineering

Foundation Models learn from a diverse range of data sources to produce AI systems capable of adapting to a wide range of tasks, instead of being trained for a single narrow use case. Today, we announced Gemini, our most capable model yet. Gemini was designed for flexibility, so it can run on everything from data centers to mobile devices. It's been optimized for three different sizes: Ultra, Pro and Nano.

Gemini Nano, optimized for mobile

Gemini Nano, our most efficient model built for on-device tasks, runs directly on mobile silicon, opening support for a range of important use cases. Running on-device enables features where the data should not leave the device, such as suggesting replies to messages in an end-to-end encrypted messaging app. It also enables consistent experiences with deterministic latency, so features are always available even when there’s no network.

Gemini Nano is distilled down from the larger Gemini models and specifically optimized to run on mobile silicon accelerators. Gemini Nano enables powerful capabilities such as high quality text summarization, contextual smart replies, and advanced proofreading and grammar correction. For example, the enhanced language understanding of Gemini Nano enables the Pixel 8 Pro to concisely summarize content in the Recorder app, even when the phone’s network connection is offline.

Moving image of Gemini Nano being used in the Recorder app on a Pixel 8 Pro device
Pixel 8 Pro using Gemini Nano in the Recorder app to summarize meeting audio, even without a network connection.

Gemini Nano is starting to power Smart Reply in Gboard on Pixel 8 Pro, ready to be enabled in settings as a developer preview. Available now to try with WhatsApp and coming to more apps next year, the on-device AI model saves you time by suggesting high-quality responses with conversational awareness1.

Moving image of WhatsApp’s use of Smart Reply in Gboard using Gemini Nano on Pixel 8 Pro device
Smart Reply in Gboard within WhatsApp using Gemini Nano on Pixel 8 Pro.

Android AICore, a new system service for on-device foundation models

Android AICore is a new system service in Android 14 that provides easy access to Gemini Nano. AICore handles model management, runtimes, safety features and more, simplifying the work for you to incorporate AI into your apps.

AICore is private by design, following the example of Android’s Private Compute Core with isolation from the network via open-source APIs, providing transparency and auditability. As part of our efforts to build and deploy AI responsibly, we also built dedicated safety features to make it safer and more inclusive for everyone.

AICore architecture
AICore manages model, runtime and safety features.

AICore enables Low Rank Adaptation (LoRA) fine tuning with Gemini Nano. This powerful concept enables app developers to create small LoRA adapters based on their own training data. The LoRA adapter is loaded by AICore, resulting in a powerful large language model fine tuned for the app’s own use-cases.
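For background, LoRA keeps the base model’s weight matrices frozen and learns only a pair of small low-rank matrices whose product is added on top, which is why the adapters stay small relative to the model:

W' = W + BA, \quad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k},\ r \ll \min(d, k)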

AICore takes advantage of new ML hardware like the latest Google Tensor TPU and NPUs in flagship Qualcomm Technologies, Samsung S.LSI and MediaTek silicon. AICore and Gemini Nano are rolling out to Pixel 8 Pro, with more devices and silicon partners to be announced in the coming months.

Build with Gemini

We're excited to bring together state-of-the-art AI research with easy-to-use tools and APIs for Android developers to build with Gemini on-device. If you are interested in building apps using Gemini Nano and AICore, please sign up for our Early Access Program.


1 Available globally, but only when using the United States English keyboard language. Read more for details.

YouTube Music and YouTube Premium arrive in Kenya, Senegal and Ghana



We're thrilled to announce that YouTube Premium and YouTube Music will be available in Kenya, Senegal and Ghana starting December 5. Users in these countries will now have access to the YouTube Music app, as well as the YouTube Premium experience, which lets them enjoy their content on YouTube without interruptions.





YouTube Music


With the YouTube Music app, we've created a dedicated music streaming service offering more than 100 million official tracks, along with a vast catalog of live performances, music videos, remixes, podcasts, and rare recordings you won't find anywhere else. With all this musical content, we've designed a personalized music experience that adapts to your tastes and preferences.


This global music catalog is now at your fingertips with YouTube Music. Whether you want to listen to the latest tracks from popular artists like Sauti Sol, Burna Boy and Tems, or discover new indie talents like Karun and Xenia Manasseh, it's all there! You'll also find albums from music giants, like X&Y.


Here are some of the features you'll love in YouTube Music:

  • Smart search lets you find songs with just a few of the lyrics

  • The "Activity" bar gives you quick access to personalized playlists and mixes for every occasion

  • The "Explore" tab lets you discover the best new releases and popular tracks

  • The "Related" tab suggests music based on the track you're currently listening to

  • Synced lyrics let you follow along with the song you're listening to

  • High-quality music recommendations, thanks to Google's smart features


We know that every music experience is unique. That's why we built this app for YOU. Drawing on YouTube's vast ecosystem, YouTube Premium subscribers also get access to YouTube Music Premium, which lets them listen to music ad-free, offline, and in the background, anywhere and at any time.


This year, we launched a host of handy new features on YouTube Music to help you find your next favorite track, personalize your listening experience, and build a unique community of music fans. Above all, we want your music experience to be a pleasure. Learn more about these features on our global YouTube blog.





YouTube Premium


YouTube Premium is a paid subscription service designed for YouTube's biggest fans. With it, they get a new, smoother viewing experience for watching their favorite creators around the world, without interruption.


  • Ad-free viewing: Watch all your videos without ads.

  • Background play: With a YouTube Premium subscription, you can keep listening to a video's audio even when you leave the YouTube app.

  • Offline access: Download your favorite videos to watch whenever you want.

  • YouTube Music Premium: You automatically get the premium version of YouTube Music, which lets you listen to music offline and ad-free in the YouTube Music app.



- Posted by Addy Awofisayo, Head of Music, Sub-Saharan Africa, YouTube

Dev Channel Update for ChromeOS / ChromeOS Flex

The Dev channel is being updated to OS version: 15694.0.0, Browser version: 121.0.6154.0 for most ChromeOS devices.

If you find new issues, please let us know in one of the following ways:

  1. File a bug
  2. Visit our ChromeOS communities
    1. General: Chromebook Help Community
    2. Beta Specific: ChromeOS Beta Help Community
  3. Report an issue or send feedback on Chrome

Interested in switching channels? Find out how.

Matt Nelson,
Google ChromeOS

Chrome for Android Update

Hi, everyone! We've just released Chrome 120 (120.0.6099.43) for Android: it'll become available on Google Play over the next few days.

This release includes stability and performance improvements. You can see a full list of the changes in the Git log. If you find a new issue, please let us know by filing a bug.

Android releases contain the same security fixes as their corresponding Desktop release (Windows: 120.0.6099.62/.63; Mac & Linux: 120.0.6099.62) unless otherwise noted.


Harry Souders
Google Chrome

Google at EMNLP 2023

Google is proud to be a Diamond Sponsor of Empirical Methods in Natural Language Processing (EMNLP 2023), a premier annual conference, which is being held this week in Sentosa, Singapore. Google has a strong presence at this year’s conference with over 65 accepted papers and active involvement in 11 workshops and tutorials. Google is also happy to be a Major Sponsor for the Widening NLP workshop (WiNLP), which aims to highlight global representations of people, perspectives, and cultures in AI and ML. We look forward to sharing some of our extensive NLP research and expanding our partnership with the broader research community.

We hope you’ll visit the Google booth to chat with researchers who are actively pursuing the latest innovations in NLP, and check out some of the scheduled booth activities (e.g., demos and Q&A sessions listed below). Visit the @GoogleAI X (Twitter) and LinkedIn accounts to find out more about the Google booth activities at EMNLP 2023.

Take a look below to learn more about the Google research being presented at EMNLP 2023 (Google affiliations in bold).



Board & Organizing Committee

Sponsorship Chair: Shyam Upadhyay
Industry Track Chair: Imed Zitouni
Senior Program Committee: Roee Aharoni, Annie Louis, Vinodkumar Prabhakaran, Shruti Rijhwani, Brian Roark, Partha Talukdar


Google Research booth activities

This schedule is subject to change. Please visit the Google booth for more information.

Developing and Utilizing Evaluation Metrics for Machine Translation & Improving Multilingual NLP
Presenter: Isaac Caswell, Dan Deutch, Jan-Thorsten Peter, David Vilar Torres
Fri, Dec 8 | 10:30 AM - 11:00 AM SST

Differentiable Search Indexes & Generative Retrieval
Presenter: Sanket Vaibhav Mehta, Vinh Tran, Kai Hui, Ronak Pradeep*
Fri, Dec 8 | 3:30 PM - 4:00 PM SST

Retrieval and Generation in a single pass
Presenter: Palak Jain, Livio Baldini Soares
Sat, Dec 9 | 10:30 AM - 11:00 AM SST

Amplifying Adversarial Attacks
Presenter: Anu Sinha
Sat, Dec 9 | 12:30 PM - 1:45 PM SST

Automate prompt design: Universal Self-Adaptive Prompting (see blog post)
Presenter: Xingchen Qian*, Ruoxi Sun
Sat, Dec 9 | 3:30 PM - 4:00 PM SST


Papers

SynJax: Structured Probability Distributions for JAX
Miloš Stanojević, Laurent Sartran

Adapters: A Unified Library for Parameter-Efficient and Modular Transfer Learning
Clifton Poth, Hannah Sterz, Indraneil Paul, Sukannya Purkayastha, Leon Engländer, Timo Imhof, Ivan Vulić, Sebastian Ruder, Iryna Gurevych, Jonas Pfeiffer

DocumentNet: Bridging the Data Gap in Document Pre-training
Lijun Yu, Jin Miao, Xiaoyu Sun, Jiayi Chen, Alexander Hauptmann, Hanjun Dai, Wei Wei

AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-Powered Applications
Bhaktipriya Radharapu, Kevin Robinson, Lora Aroyo, Preethi Lahoti

CRoW: Benchmarking Commonsense Reasoning in Real-World Tasks
Mete Ismayilzada, Debjit Paul, Syrielle Montariol, Mor Geva, Antoine Bosselut

Large Language Models Can Self-Improve
Jiaxin Huang*, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, Jiawei Han

Dissecting Recall of Factual Associations in Auto-Regressive Language Models
Mor Geva, Jasmijn Bastings, Katja Filippova, Amir Globerson

Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks
Alon Jacovi, Avi Caciularu, Omer Goldman, Yoav Goldberg

Selective Labeling: How to Radically Lower Data-Labeling Costs for Document Extraction Models
Yichao Zhou, James Bradley Wendt, Navneet Potti, Jing Xie, Sandeep Tata

Measuring Attribution in Natural Language Generation Models
Hannah Rashkin, Vitaly Nikolaev, Matthew Lamm, Lora Aroyo, Michael Collins, Dipanjan Das, Slav Petrov, Gaurav Singh Tomar, Iulia Turc, David Reitter

Inverse Scaling Can Become U-Shaped
Jason Wei*, Najoung Kim, Yi Tay*, Quoc Le

INSTRUCTSCORE: Towards Explainable Text Generation Evaluation with Automatic Feedback
Wenda Xu, Danqing Wang, Liangming Pan, Zhenqiao Song, Markus Freitag, William Yang Wang, Lei Li

On the Robustness of Dialogue History Representation in Conversational Question Answering: A Comprehensive Study and a New Prompt-Based Method
Zorik Gekhman, Nadav Oved, Orgad Keller, Idan Szpektor, Roi Reichart

Investigating Efficiently Extending Transformers for Long-Input Summarization
Jason Phang*, Yao Zhao, Peter J Liu

DSI++: Updating Transformer Memory with New Documents
Sanket Vaibhav Mehta*, Jai Gupta, Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Jinfeng Rao, Marc Najork, Emma Strubell, Donald Metzler

MultiTurnCleanup: A Benchmark for Multi-Turn Spoken Conversational Transcript Cleanup
Hua Shen*, Vicky Zayats, Johann C Rocholl, Daniel David Walker, Dirk Padfield


Findings of EMNLP

Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs
Jiefeng Chen*, Jinsung Yoon, Sayna Ebrahimi, Sercan O Arik, Tomas Pfister, Somesh Jha

A Comprehensive Evaluation of Tool-Assisted Generation Strategies
Alon Jacovi*, Avi Caciularu, Jonathan Herzig, Roee Aharoni, Bernd Bohnet, Mor Geva

1-PAGER: One Pass Answer Generation and Evidence Retrieval
Palak Jain, Livio Baldini Soares, Tom Kwiatkowski

MaXM: Towards Multilingual Visual Question Answering
Soravit Changpinyo, Linting Xue, Michal Yarom, Ashish V. Thapliyal, Idan Szpektor, Julien Amelot, Xi Chen, Radu Soricut

SDOH-NLI: A Dataset for Inferring Social Determinants of Health from Clinical Notes
Adam D. Lelkes, Eric Loreaux*, Tal Schuster, Ming-Jun Chen, Alvin Rajkomar

Machine Reading Comprehension Using Case-based Reasoning
Dung Ngoc Thai, Dhruv Agarwal, Mudit Chaudhary, Wenlong Zhao, Rajarshi Das, Jay-Yoon Lee, Hannaneh Hajishirzi, Manzil Zaheer, Andrew McCallum

Cross-lingual Open-Retrieval Question Answering for African Languages
Odunayo Ogundepo, Tajuddeen Gwadabe, Clara E. Rivera, Jonathan H. Clark, Sebastian Ruder, David Ifeoluwa Adelani, Bonaventure F. P. Dossou, Abdou Aziz DIOP, Claytone Sikasote, Gilles HACHEME, Happy Buzaaba, Ignatius Ezeani, Rooweither Mabuya, Salomey Osei, Chris Chinenye Emezue, Albert Kahira, Shamsuddeen Hassan Muhammad, Akintunde Oladipo, Abraham Toluwase Owodunni, Atnafu Lambebo Tonja, Iyanuoluwa Shode, Akari Asai, Anuoluwapo Aremu, Ayodele Awokoya, Bernard Opoku, Chiamaka Ijeoma Chukwuneke, Christine Mwase, Clemencia Siro, Stephen Arthur, Tunde Oluwaseyi Ajayi, Verrah Akinyi Otiende, Andre Niyongabo Rubungo, Boyd Sinkala, Daniel Ajisafe, Emeka Felix Onwuegbuzia, Falalu Ibrahim Lawan, Ibrahim Said Ahmad, Jesujoba Oluwadara Alabi, CHINEDU EMMANUEL MBONU, Mofetoluwa Adeyemi, Mofya Phiri, Orevaoghene Ahia, Ruqayya Nasir Iro, Sonia Adhiambo

On Uncertainty Calibration and Selective Generation in Probabilistic Neural Summarization: A Benchmark Study
Polina Zablotskaia, Du Phan, Joshua Maynez, Shashi Narayan, Jie Ren, Jeremiah Zhe Liu

Epsilon Sampling Rocks: Investigating Sampling Strategies for Minimum Bayes Risk Decoding for Machine Translation
Markus Freitag, Behrooz Ghorbani*, Patrick Fernandes*

Sources of Hallucination by Large Language Models on Inference Tasks
Nick McKenna, Tianyi Li, Liang Cheng, Mohammad Javad Hosseini, Mark Johnson, Mark Steedman

Don’t Add, Don’t Miss: Effective Content Preserving Generation from Pre-selected Text Spans
Aviv Slobodkin, Avi Caciularu, Eran Hirsch, Ido Dagan

What Makes Chain-of-Thought Prompting Effective? A Counterfactual Study
Aman Madaan*, Katherine Hermann, Amir Yazdanbakhsh

Understanding HTML with Large Language Models
Izzeddin Gur, Ofir Nachum, Yingjie Miao, Mustafa Safdari, Austin Huang, Aakanksha Chowdhery, Sharan Narang, Noah Fiedel, Aleksandra Faust

Improving the Robustness of Summarization Models by Detecting and Removing Input Noise
Kundan Krishna*, Yao Zhao, Jie Ren, Balaji Lakshminarayanan, Jiaming Luo, Mohammad Saleh, Peter J. Liu

In-Context Learning Creates Task Vectors
Roee Hendel, Mor Geva, Amir Globerson

Pre-training Without Attention
Junxiong Wang, Jing Nathan Yan, Albert Gu, Alexander M Rush

MUX-PLMs: Data Multiplexing for High-Throughput Language Models
Vishvak Murahari, Ameet Deshpande, Carlos E Jimenez, Izhak Shafran, Mingqiu Wang, Yuan Cao, Karthik R Narasimhan

PaRaDe: Passage Ranking Using Demonstrations with LLMs
Andrew Drozdov*, Honglei Zhuang, Zhuyun Dai, Zhen Qin, Razieh Rahimi, Xuanhui Wang, Dana Alon, Mohit Iyyer, Andrew McCallum, Donald Metzler*, Kai Hui

Long-Form Speech Translation Through Segmentation with Finite-State Decoding Constraints on Large Language Models
Arya D. McCarthy, Hao Zhang, Shankar Kumar, Felix Stahlberg, Ke Wu

Unsupervised Opinion Summarization Using Approximate Geodesics
Somnath Basu Roy Chowdhury*, Nicholas Monath, Kumar Avinava Dubey, Amr Ahmed, Snigdha Chaturvedi

SQLPrompt: In-Context Text-to-SQL with Minimal Labeled Data
Ruoxi Sun, Sercan O. Arik, Rajarishi Sinha, Hootan Nakhost, Hanjun Dai, Pengcheng Yin, Tomas Pfister

Retrieval-Augmented Parsing for Complex Graphs by Exploiting Structure and Uncertainty
Zi Lin, Quan Yuan, Panupong Pasupat, Jeremiah Zhe Liu, Jingbo Shang

A Zero-Shot Language Agent for Computer Control with Structured Reflection
Tao Li, Gang Li, Zhiwei Deng, Bryan Wang*, Yang Li

Pragmatics in Language Grounding: Phenomena, Tasks, and Modeling Approaches
Daniel Fried, Nicholas Tomlin, Jennifer Hu, Roma Patel, Aida Nematzadeh

Improving Classifier Robustness Through Active Generation of Pairwise Counterfactuals
Ananth Balashankar, Xuezhi Wang, Yao Qin, Ben Packer, Nithum Thain, Jilin Chen, Ed H. Chi, Alex Beutel

mmT5: Modular Multilingual Pre-training Solves Source Language Hallucinations
Jonas Pfeiffer, Francesco Piccinno, Massimo Nicosia, Xinyi Wang, Machel Reid, Sebastian Ruder

Scaling Laws vs Model Architectures: How Does Inductive Bias Influence Scaling?
Yi Tay, Mostafa Dehghani, Samira Abnar, Hyung Won Chung, William Fedus, Jinfeng Rao, Sharan Narang, Vinh Q. Tran, Dani Yogatama, Donald Metzler

TaTA: A Multilingual Table-to-Text Dataset for African Languages
Sebastian Gehrmann, Sebastian Ruder, Vitaly Nikolaev, Jan A. Botha, Michael Chavinda, Ankur P Parikh, Clara E. Rivera

XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages
Sebastian Ruder, Jonathan H. Clark, Alexander Gutkin, Mihir Kale, Min Ma, Massimo Nicosia, Shruti Rijhwani, Parker Riley, Jean Michel Amath Sarr, Xinyi Wang, John Frederick Wieting, Nitish Gupta, Anna Katanova, Christo Kirov, Dana L Dickinson, Brian Roark, Bidisha Samanta, Connie Tao, David Ifeoluwa Adelani, Vera Axelrod, Isaac Rayburn Caswell, Colin Cherry, Dan Garrette, Reeve Ingle, Melvin Johnson, Dmitry Panteleev, Partha Talukdar

q2d: Turning Questions into Dialogs to Teach Models How to Search
Yonatan Bitton, Shlomi Cohen-Ganor, Ido Hakimi, Yoad Lewenberg, Roee Aharoni, Enav Weinreb

Emergence of Abstract State Representations in Embodied Sequence Modeling
Tian Yun*, Zilai Zeng, Kunal Handa, Ashish V Thapliyal, Bo Pang, Ellie Pavlick, Chen Sun

Evaluating and Modeling Attribution for Cross-Lingual Question Answering
Benjamin Muller*, John Wieting, Jonathan H. Clark, Tom Kwiatkowski, Sebastian Ruder, Livio Baldini Soares, Roee Aharoni, Jonathan Herzig, Xinyi Wang

Weakly-Supervised Learning of Visual Relations in Multimodal Pre-training
Emanuele Bugliarello, Aida Nematzadeh, Lisa Anne Hendricks

How Do Languages Influence Each Other? Studying Cross-Lingual Data Sharing During LM Fine-Tuning
Rochelle Choenni, Dan Garrette, Ekaterina Shutova

CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models
Benjamin Minixhofer, Jonas Pfeiffer, Ivan Vulić

IC3: Image Captioning by Committee Consensus
David Chan, Austin Myers, Sudheendra Vijayanarasimhan, David A Ross, John Canny

The Curious Case of Hallucinatory (Un)answerability: Finding Truths in the Hidden States of Over-Confident Large Language Models
Aviv Slobodkin, Omer Goldman, Avi Caciularu, Ido Dagan, Shauli Ravfogel

Evaluating Large Language Models on Controlled Generation Tasks
Jiao Sun, Yufei Tian, Wangchunshu Zhou, Nan Xu, Qian Hu, Rahul Gupta, John Wieting, Nanyun Peng, Xuezhe Ma

Ties Matter: Meta-Evaluating Modern Metrics with Pairwise Accuracy and Tie Calibration
Daniel Deutsch, George Foster, Markus Freitag

Transcending Scaling Laws with 0.1% Extra Compute
Yi Tay*, Jason Wei*, Hyung Won Chung*, Vinh Q. Tran, David R. So*, Siamak Shakeri, Xavier Garcia, Huaixiu Steven Zheng, Jinfeng Rao, Aakanksha Chowdhery, Denny Zhou, Donald Metzler, Slav Petrov, Neil Houlsby, Quoc V. Le, Mostafa Dehghani

Data Similarity is Not Enough to Explain Language Model Performance
Gregory Yauney*, Emily Reif, David Mimno

Self-Influence Guided Data Reweighting for Language Model Pre-training
Megh Thakkar*, Tolga Bolukbasi, Sriram Ganapathy, Shikhar Vashishth, Sarath Chandar, Partha Talukdar

ReTAG: Reasoning Aware Table to Analytic Text Generation
Deepanway Ghosal, Preksha Nema, Aravindan Raghuveer

GATITOS: Using a New Multilingual Lexicon for Low-Resource Machine Translation
Alex Jones*, Isaac Caswell, Ishank Saxena

Video-Helpful Multimodal Machine Translation
Yihang Li, Shuichiro Shimizu, Chenhui Chu, Sadao Kurohashi, Wei Li

Symbol Tuning Improves In-Context Learning in Language Models
Jerry Wei*, Le Hou, Andrew Kyle Lampinen, Xiangning Chen*, Da Huang, Yi Tay*, Xinyun Chen, Yifeng Lu, Denny Zhou, Tengyu Ma*, Quoc V Le

"Don't Take This Out of Context!" On the Need for Contextual Models and Evaluations for Stylistic Rewriting
Akhila Yerukola, Xuhui Zhou, Elizabeth Clark, Maarten Sap

QAmeleon: Multilingual QA with Only 5 Examples
Priyanka Agrawal, Chris Alberti, Fantine Huot, Joshua Maynez, Ji Ma, Sebastian Ruder, Kuzman Ganchev, Dipanjan Das, Mirella Lapata

Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision
Eugene Kharitonov, Damien Vincent, Zalán Borsos, Raphaël Marinier, Sertan Girgin, Olivier Pietquin, Matt Sharifi, Marco Tagliasacchi, Neil Zeghidour

AnyTOD: A Programmable Task-Oriented Dialog System
Jeffrey Zhao, Yuan Cao, Raghav Gupta, Harrison Lee, Abhinav Rastogi, Mingqiu Wang, Hagen Soltau, Izhak Shafran, Yonghui Wu

Selectively Answering Ambiguous Questions
Jeremy R. Cole, Michael JQ Zhang, Daniel Gillick, Julian Martin Eisenschlos, Bhuwan Dhingra, Jacob Eisenstein

PRESTO: A Multilingual Dataset for Parsing Realistic Task-Oriented Dialogs (see blog post)
Rahul Goel, Waleed Ammar, Aditya Gupta, Siddharth Vashishtha, Motoki Sano, Faiz Surani*, Max Chang, HyunJeong Choe, David Greene, Chuan He, Rattima Nitisaroj, Anna Trukhina, Shachi Paul, Pararth Shah, Rushin Shah, Zhou Yu

LM vs LM: Detecting Factual Errors via Cross Examination
Roi Cohen, May Hamri, Mor Geva, Amir Globerson

A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding
Andrea Burns*, Krishna Srinivasan, Joshua Ainslie, Geoff Brown, Bryan A. Plummer, Kate Saenko, Jianmo Ni, Mandy Guo

AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages
Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Abinew Ali Ayele, Nedjma Ousidhoum, David Ifeoluwa Adelani, Seid Muhie Yimam, Ibrahim Said Ahmad, Meriem Beloucif, Saif M. Mohammad, Sebastian Ruder, Oumaima Hourrane, Alipio Jorge, Pavel Brazdil, Felermino D. M. A. Ali, Davis David, Salomey Osei, Bello Shehu-Bello, Falalu Ibrahim Lawan, Tajuddeen Gwadabe, Samuel Rutunda, Tadesse Destaw Belay, Wendimu Baye Messelle, Hailu Beshada Balcha, Sisay Adugna Chala, Hagos Tesfahun Gebremichael, Bernard Opoku, Stephen Arthur

Optimizing Retrieval-Augmented Reader Models via Token Elimination
Moshe Berchansky, Peter Izsak, Avi Caciularu, Ido Dagan, Moshe Wasserblat

SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation
Elizabeth Clark, Shruti Rijhwani, Sebastian Gehrmann, Joshua Maynez, Roee Aharoni, Vitaly Nikolaev, Thibault Sellam, Aditya Siddhant, Dipanjan Das, Ankur P Parikh

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Joshua Ainslie, James Lee-Thorp, Michiel de Jong*, Yury Zemlyanskiy, Federico Lebron, Sumit Sanghai

CoLT5: Faster Long-Range Transformers with Conditional Computation
Joshua Ainslie, Tao Lei, Michiel de Jong, Santiago Ontanon, Siddhartha Brahma, Yury Zemlyanskiy, David Uthus, Mandy Guo, James Lee-Thorp, Yi Tay, Yun-Hsuan Sung, Sumit Sanghai

Improving Diversity of Demographic Representation in Large Language Models via Collective-Critiques and Self-Voting
Preethi Lahoti, Nicholas Blumm, Xiao Ma, Raghavendra Kotikalapudi, Sahitya Potluri, Qijun Tan, Hansa Srinivasan, Ben Packer, Ahmad Beirami, Alex Beutel, Jilin Chen

Universal Self-Adaptive Prompting (see blog post)
Xingchen Wan*, Ruoxi Sun, Hootan Nakhost, Hanjun Dai, Julian Martin Eisenschlos, Sercan O. Arik, Tomas Pfister

TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models
Zorik Gekhman, Jonathan Herzig, Roee Aharoni, Chen Elkind, Idan Szpektor

Hierarchical Pre-training on Multimodal Electronic Health Records
Xiaochen Wang, Junyu Luo, Jiaqi Wang, Ziyi Yin, Suhan Cui, Yuan Zhong, Yaqing Wang, Fenglong Ma

NAIL: Lexical Retrieval Indices with Efficient Non-Autoregressive Decoders
Livio Baldini Soares, Daniel Gillick, Jeremy R. Cole, Tom Kwiatkowski

How Does Generative Retrieval Scale to Millions of Passages?
Ronak Pradeep*, Kai Hui, Jai Gupta, Adam D. Lelkes, Honglei Zhuang, Jimmy Lin, Donald Metzler, Vinh Q. Tran

Make Every Example Count: On the Stability and Utility of Self-Influence for Learning from Noisy NLP Datasets
Irina Bejan*, Artem Sokolov, Katja Filippova


Workshops

The Seventh Widening NLP Workshop (WiNLP)
Major Sponsor
Organizers: Sunipa Dev
Panelist: Preethi Lahoti

The Sixth Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC)
Invited Speaker: Bernd Bohnet

The 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS)
Organizer: Geeticka Chauhan

Combined Workshop on Spatial Language Understanding and Grounded Communication for Robotics (SpLU-RoboNLP)
Invited Speaker: Andy Zeng

Natural Language Generation, Evaluation, and Metric (GEM)
Organizer: Elizabeth Clark

The First Arabic Natural Language Processing Conference (ArabicNLP)
Organizer: Imed Zitouni

The Big Picture: Crafting a Research Narrative (BigPicture)
Organizer: Nora Kassner, Sebastian Ruder

BlackboxNLP 2023: The 6th Workshop on Analysing and Interpreting Neural Networks for NLP
Organizer: Najoung Kim
Panelist: Neel Nanda

The SIGNLL Conference on Computational Natural Language Learning (CoNLL)
Co-Chair: David Reitter
Areas and ACs: Kyle Gorman (Speech and Phonology), Fei Liu (Natural Language Generation)

The Third Workshop on Multi-lingual Representation Learning (MRL)
Organizer: Omer Goldman, Sebastian Ruder
Invited Speaker: Orhan Firat


Tutorials

Creative Natural Language Generation
Organizer: Tuhin Chakrabarty*


* Work done while at Google



Chrome Stable for iOS Update

Hi everyone! We've just released Chrome Stable 120 (120.0.6099.101) for iOS; it'll become available on the App Store in the next few hours.

This release includes stability and performance improvements. You can see a full list of the changes in the Git log. If you find a new issue, please let us know by filing a bug.

Harry Souders
Google Chrome