Tag Archives: Generative AI

Gemma Family Expands with Models Tailored for Developers and Researchers

Posted by Tris Warkentin – Director, Product Management and Jane Fine - Senior Product Manager

In February we announced Gemma, our family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. The community's incredible response – including impressive fine-tuned variants, Kaggle notebooks, integration into tools and services, recipes for RAG using databases like MongoDB, and lots more – has been truly inspiring.

Today, we're excited to announce our first round of additions to the Gemma family, expanding the possibilities for ML developers to innovate responsibly: CodeGemma for code completion and generation tasks as well as instruction following, and RecurrentGemma, an efficiency-optimized architecture for research experimentation. Plus, we're sharing some updates to Gemma and our terms aimed at improvements based on invaluable feedback we've heard from the community and our partners.


Introducing the first two Gemma variants


CodeGemma: Code completion, generation, and chat for developers and businesses

Harnessing the foundation of our Gemma models, CodeGemma brings powerful yet lightweight coding capabilities to the community. CodeGemma models are available as a 7B pretrained variant that specializes in code completion and code generation tasks, a 7B instruction-tuned variant for code chat and instruction-following, and a 2B pretrained variant for fast code completion that fits on your local computer. CodeGemma models have several advantages:

  • Intelligent code completion and generation: Complete lines, functions, and even generate entire blocks of code – whether you're working locally or leveraging cloud resources. 
  • Enhanced accuracy: Trained on 500 billion tokens of primarily English language data from web documents, mathematics, and code, CodeGemma models generate code that's not only more syntactically correct but also semantically meaningful, helping reduce errors and debugging time. 
  • Multi-language proficiency: Your invaluable coding assistant for Python, JavaScript, Java, and other popular languages. 
  • Streamlined workflows: Integrate a CodeGemma model into your development environment to write less boilerplate, and focus on interesting and differentiated code that matters – faster.
image of streamlined workflows within an exisitng AI dev project with CodeGemma integrated
This table compares the performance of CodeGemma with other similar models on both single and multi-line code completion tasks. Learn more in the technical report.

Learn more about CodeGemma in our report or try it in this quickstart guide.


RecurrentGemma: Efficient, faster inference at higher batch sizes for researchers

RecurrentGemma is a technically distinct model that leverages recurrent neural networks and local attention to improve memory efficiency. While achieving similar benchmark score performance to the Gemma 2B model, RecurrentGemma's unique architecture results in several advantages:

  • Reduced memory usage: Lower memory requirements allow for the generation of longer samples on devices with limited memory, such as single GPUs or CPUs. 
  • Higher throughput: Because of its reduced memory usage, RecurrentGemma can perform inference at significantly higher batch sizes, thus generating substantially more tokens per second (especially when generating long sequences). 
  • Research innovation: RecurrentGemma showcases a non-transformer model that achieves high performance, highlighting advancements in deep learning research. 
graph showing maximum thoughput when sampling from a prompt of 2k tokens on TPUv5e
This chart reveals how RecurrentGemma maintains its sampling speed regardless of sequence length, while Transformer-based models like Gemma slow down as sequences get longer.

To understand the underlying technology, check out our paper. For practical exploration, try the notebook, which demonstrates how to finetune the model.


Built upon Gemma foundations, expanding capabilities

Guided by the same principles of the original Gemma models, the new model variants offer:

  • Open availability: Encourages innovation and collaboration with its availability to everyone and flexible terms of use. 
  • High-performance and efficient capabilities: Advances the capabilities of open models with code-specific domain expertise and optimized design for exceptionally fast completion and generation. 
  • Responsible design: Our commitment to responsible AI helps ensure the models deliver safe and reliable results. 
  • Flexibility for diverse software and hardware:  
    • Both CodeGemma and RecurrentGemma: Built with JAX and compatible with JAX, PyTorch, , Hugging Face Transformers, and Gemma.cpp. Enable local experimentation and cost-effective deployment across various hardware, including laptops, desktops, NVIDIA GPUs, and Google Cloud TPUs.  
    • CodeGemma: Additionally compatible with Keras, NVIDIA NeMo, TensorRT-LLM, Optimum-NVIDIA, MediaPipe, and availability on Vertex AI. 
    • RecurrentGemma: Support for all the aforementioned products will be available in the coming weeks.

Gemma 1.1 update

Alongside the new model variants, we're releasing Gemma 1.1, which includes performance improvements. Additionally, we've listened to developer feedback, fixed bugs, and updated our terms to provide more flexibility.


Get started today

These first Gemma model variants are available in various places worldwide, starting today on Kaggle, Hugging Face, and Vertex AI Model Garden. Here's how to get started:

We invite you to try the CodeGemma and RecurrentGemma models and share your feedback on Kaggle. Together, let's shape the future of AI-powered content creation and understanding.

Talk like a graph: Encoding graphs for large language models

Imagine all the things around you — your friends, tools in your kitchen, or even the parts of your bike. They are all connected in different ways. In computer science, the term graph is used to describe connections between objects. Graphs consist of nodes (the objects themselves) and edges (connections between two nodes, indicating a relationship between them). Graphs are everywhere now. The internet itself is a giant graph of websites linked together. Even the knowledge search engines use is organized in a graph-like way.

Furthermore, consider the remarkable advancements in artificial intelligence — such as chatbots that can write stories in seconds, and even software that can interpret medical reports. This exciting progress is largely thanks to large language models (LLMs). New LLM technology is constantly being developed for different uses.

Since graphs are everywhere and LLM technology is on the rise, in “Talk like a Graph: Encoding Graphs for Large Language Models”, presented at ICLR 2024, we present a way to teach powerful LLMs how to better reason with graph information. Graphs are a useful way to organize information, but LLMs are mostly trained on regular text. The objective is to test different techniques to see what works best and gain practical insights. Translating graphs into text that LLMs can understand is a remarkably complex task. The difficulty stems from the inherent complexity of graph structures with multiple nodes and the intricate web of edges that connect them. Our work studies how to take a graph and translate it into a format that an LLM can understand. We also design a benchmark called GraphQA to study different approaches on different graph reasoning problems and show how to phrase a graph-related problem in a way that enables the LLM to solve the graph problem. We show that LLM performance on graph reasoning tasks varies on three fundamental levels: 1) the graph encoding method, 2) the nature of the graph task itself, and 3) interestingly, the very structure of the graph considered. These findings give us clues on how to best represent graphs for LLMs. Picking the right method can make the LLM up to 60% better at graph tasks!

Pictured, the process of encoding a graph as text using two different approaches and feeding the text and a question about the graph to the LLM.

Graphs as text

To be able to systematically find out what is the best way to translate a graph to text, we first design a benchmark called GraphQA. Think of GraphQA as an exam designed to evaluate powerful LLMs on graph-specific problems. We want to see how well LLMs can understand and solve problems that involve graphs in different setups. To create a comprehensive and realistic exam for LLMs, we don’t just use one type of graph, we use a mix of graphs ensuring breadth in the number of connections. This is mainly because different graph types make solving such problems easier or harder. This way, GraphQA can help expose biases in how an LLM thinks about the graphs, and the whole exam gets closer to a realistic setup that LLMs might encounter in the real world.

Overview of our framework for reasoning with graphs using LLMs.

GraphQA focuses on simple tasks related to graphs, like checking if an edge exists, calculating the number of nodes or edges, finding nodes that are connected to a specific node, and checking for cycles in a graph. These tasks might seem basic, but they require understanding the relationships between nodes and edges. By covering different types of challenges, from identifying patterns to creating new connections, GraphQA helps models learn how to analyze graphs effectively. These basic tasks are crucial for more complex reasoning on graphs, like finding the shortest path between nodes, detecting communities, or identifying influential nodes. Additionally, GraphQA includes generating random graphs using various algorithms like Erdős-Rényi, scale-free networks, Barabasi-Albert model, and stochastic block model, as well as simpler graph structures like paths, complete graphs, and star graphs, providing a diverse set of data for training.

When working with graphs, we also need to find ways to ask graph-related questions that LLMs can understand. Prompting heuristics are different strategies for doing this. Let's break down the common ones:

  • Zero-shot: simply describe the task ("Is there a cycle in this graph?") and tell the LLM to go for it. No examples provided.
  • Few-shot: This is like giving the LLM a mini practice test before the real deal. We provide a few example graph questions and their correct answers.
  • Chain-of-Thought: Here, we show the LLM how to break down a problem step-by-step with examples. The goal is to teach it to generate its own "thought process" when faced with new graphs.
  • Zero-CoT: Similar to CoT, but instead of training examples, we give the LLM a simple prompt, like "Let's think step-by-step," to trigger its own problem-solving breakdown.
  • BAG (build a graph): This is specifically for graph tasks. We add the phrase "Let's build a graph..." to the description, helping the LLM focus on the graph structure.

We explored different ways to translate graphs into text that LLMs can work with. Our key questions were:

  • Node encoding: How do we represent individual nodes? Options tested include simple integers, common names (people, characters), and letters.
  • Edge encoding: How do we describe the relationships between nodes? Methods involved parenthesis notation, phrases like "are friends", and symbolic representations like arrows.

Various node and edge encodings were combined systematically. This led to functions like the ones in the following figure:

Examples of graph encoding functions used to encode graphs via text.

Analysis and results

We carried out three key experiments: one to test how LLMs handle graph tasks, and two to understand how the size of the LLM and different graph shapes affected performance. We run all our experiments on GraphQA.


How LLMs handle graph tasks

In this experiment, we tested how well pre-trained LLMs tackle graph problems like identifying connections, cycles, and node degrees. Here is what we learned:

  • LLMs struggle: On most of these basic tasks, LLMs did not do much better than a random guess.
  • Encoding matters significantly: How we represent the graph as text has a great effect on LLM performance. The "incident" encoding excelled for most of the tasks in general.

Our results are summarized in the following chart.

Comparison of various graph encoder functions based on their accuracy on different graph tasks. The main conclusion from this figure is that the graph encoding functions matter significantly.

Bigger is (usually) better

In this experiment, we wanted to see if the size of the LLM (in terms of the number of parameters) affects how well they can handle graph problems. For that, we tested the same graph tasks on the XXS, XS, S, and L sizes of PaLM 2. Here is a summary of our findings:

  • In general, bigger models did better on graph reasoning tasks. It seems like the extra parameters gave them space to learn more complex patterns.
  • Oddly, size didn't matter as much for the “edge existence” task (finding out if two nodes in a graph are connected).
  • Even the biggest LLM couldn't consistently beat a simple baseline solution on the cycle check problem (finding out if a graph contains a cycle or not). This shows LLMs still have room to improve with certain graph tasks.
Effect of model capacity on graph reasoning task for PaLM 2-XXS, XS, S, and L.

Do different graph shapes confuse LLMs

We wondered if the "shape" of a graph (how nodes are connected) influences how well LLMs can solve problems on it. Think of the following figure as different examples of graph shapes.

Samples of graphs generated with different graph generators from GraphQA. ER, BA, SBM, and SFN refers to Erdős–Rényi, Barabási–Albert, Stochastic Block Model, and Scale-Free Network respectively.

We found that graph structure has a big impact on LLM performance. For example, in a task asking if a cycle exists, LLMs did great on tightly interconnected graphs (cycles are common there) but struggled on path graphs (where cycles never happen). Interestingly, providing some mixed examples helped it adapt. For instance, for cycle check, we added some examples containing a cycle and some examples with no cycles as few-shot examples in our prompt. Similar patterns occurred with other tasks.

Comparing different graph generators on different graph tasks. The main observation here is that graph structure has a significant impact on the LLM’s performance. ER, BA, SBM, and SFN refers to Erdős–Rényi, Barabási–Albert, Stochastic Block Model, and Scale-Free Network respectively.

Conclusion

In short, we dug deep into how to best represent graphs as text so LLMs can understand them. We found three major factors that make a difference:

  • How to translate the graph to text: how we represent the graph as text significantly influences LLM performance. The incident encoding excelled for most of the tasks in general..
  • Task type: Certain types of graph questions tend to be harder for LLMs, even with a good translation from graph to text.
  • Graph structure: Surprisingly, the "shape" of the graph that on which we do inference (dense with connections, sparse, etc.) influences how well an LLM does.

This study revealed key insights about how to prepare graphs for LLMs. The right encoding techniques can significantly boost an LLM's accuracy on graph problems (ranging from around 5% to over 60% improvement). Our new benchmark, GraphQA, will help drive further research in this area.


Acknowledgements

We would like to express our gratitude to our co-author, Jonathan Halcrow, for his valuable contributions to this work. We express our sincere gratitude to Anton Tsitsulin, Dustin Zelle, Silvio Lattanzi, Vahab Mirrokni, and the entire graph mining team at Google Research, for their insightful comments, thorough proofreading, and constructive feedback which greatly enhanced the quality of our work. We would also like to extend special thanks to Tom Small for creating the animation used in this post.

Source: Google AI Blog


Building Open Models Responsibly in the Gemini Era

Google has long believed that open technology is not only good for our company, but good for the industry, consumers, and the world. We’ve released open-source projects like Android and Chromium that transformed access to mobile and web technologies, and have done the same in AI with Transformers, TensorFlow, and AlphaFold. The release of our Gemma family of open models is a next step in how we’re deepening our commitment to open technology alongside an industry-leading safe, responsible approach. At the same time, the rapidly evolving nature of AI raises important considerations for how to enable safety-aligned open models: an approach that supports broad innovation while promoting safe uses.

A benefit of open source is that once it is released, its license gives users full creative autonomy. This is a powerful guarantee of technology access for developers and end users. Another benefit is that open-source technology can be modified to fit the unique use case of the end user, without restriction.

In the hands of a malicious actor, however, the lack of restrictions can raise risks. Computing has been through similar cycles before, addressing issues such as protecting users of the open internet, handling cryptography, and addressing open-source software security. We now face this challenge with AI. Below we share the approach we took to openly releasing Gemma models, and the advancements in open model safety we hope to accelerate.


Providing access to Gemma open models

Today, Gemma models are being released as what the industry collectively has begun to refer to as “open models.” Open models feature free access to the model weights, but terms of use, redistribution, and variant ownership vary according to a model’s specific terms of use, which may not be based on an open-source license. The Gemma models’ terms of use make them freely available for individual developers, researchers, and commercial users for access and redistribution. Users are also free to create and publish model variants. In using Gemma models, developers agree to avoid harmful uses, reflecting our commitment to developing AI responsibly while increasing access to this technology.

We’re precise about the language we’re using to describe Gemma models because we’re proud to enable responsible AI access and innovation, and we’re equally proud supporters of open source. The definition of "Open Source" has been invaluable to computing and innovation because of requirements for redistribution and derived works, and against discrimination. These requirements enable cross-industry collaboration, individual innovation and entrepreneurship, and shared research to happen with exponential effects.

However, existing open-source concepts can’t always be directly applied to AI systems, which raises questions on how to use open-source licenses with AI. It’s important that we carry forward open principles that have made the sea-change we’re experiencing with AI possible while clarifying the concept of open-source AI and addressing concepts like derived work and author attribution.


Taking a comprehensive approach to releasing Gemma safely and responsibly

Licensing and terms of use are only one part of the evaluations, technical tools, and considered decision-making that went into aligning this release with our responsible AI Principles. Our approach involved:

  • Systematic internal review in accordance with our AI Principles: Consistent with our AI Principles, we release models only when we have determined the benefits are significant, and the risks of misuse are low or can be mitigated. We take that same approach to open models, incorporating a balance of the benefits of wider access to a particular model as well as the risks of misuse and how we can mitigate them. With Gemma, we considered the increased AI research and innovation by us and many others in the community, the access to AI technology the models could bring, and what access was needed to support these use cases.
  • A high evaluation bar: Gemma models underwent thorough evaluations, and were held to a higher bar for evaluating risk of abuse or harm than our proprietary models, given the more limited mitigations currently available for open models. These evaluations cover a broad range of responsible AI areas, including safety, fairness, privacy, societal risk, as well as capabilities such as chemical, biological, radiological, nuclear (CBRN) risks, cybersecurity, and autonomous replication. As described in our technical report, the Gemma models exhibit state-of-the-art safety performance in human side-by-side evaluations.
  • Responsibility tools for developers: As we release the Gemma models, we are also releasing a Responsible Generative AI Toolkit for developers, providing guidance and tools to help them create safer AI applications.

We continue to evolve our approach. As we build these frameworks further, we will proceed thoughtfully and incorporate what we learn into future model assessments. We will continue to explore the full range of access mechanisms, with benefits and risk mitigation in mind, including API-based access and staged releases.


Advancing open model safety together

Many of today’s AI safety tools are designed for systems where the design approach assumes restricted access and redistribution, as well as auxiliary controls like query filters. Similarly, much of the AI safety research for improving mitigations takes on the design assumptions of those systems. Just as we have created unique threat models and solutions for other open technology, we are developing safety and security tools appropriate for the differences of openly available AI.

As models become more and more capable, we are conducting research and investing in rigorous safety evaluation, testing, and mitigations for open models. We are also actively participating in conversations with policymakers and open-source community leaders on how the industry should approach this technology. This challenge is multifaceted, just like AI systems themselves. Model-sharing platforms like Hugging Face and Kaggle, where developers inspire each other with novel model iterations, play a critical role in efforts to develop open models safely; there is also a role for the cybersecurity community to contribute learnings and best practices.

Building those solutions requires access to open models, sharing innovations and improvements. We believe sharing the Gemma models will not just help increase access to AI technology, but also help the industry develop new approaches to safety and responsibility.

As developers adopt Gemma models and other safety-aligned open models, we look forward to working with the open-source community to develop more solutions for responsible approaches to AI in the open ecosystem. A global diversity of experiences, perspectives, and opportunities will help build safe and responsible AI that works for everyone.

By Anne Bertucio – Sr Program Manager, Open Source Programs Office; Helen King – Sr Director of Responsibility, Google DeepMind

AMIE: A research AI system for diagnostic medical reasoning and conversations

The physician-patient conversation is a cornerstone of medicine, in which skilled and intentional communication drives diagnosis, management, empathy and trust. AI systems capable of such diagnostic dialogues could increase availability, accessibility, quality and consistency of care by being useful conversational partners to clinicians and patients alike. But approximating clinicians’ considerable expertise is a significant challenge.

Recent progress in large language models (LLMs) outside the medical domain has shown that they can plan, reason, and use relevant context to hold rich conversations. However, there are many aspects of good diagnostic dialogue that are unique to the medical domain. An effective clinician takes a complete “clinical history” and asks intelligent questions that help to derive a differential diagnosis. They wield considerable skill to foster an effective relationship, provide information clearly, make joint and informed decisions with the patient, respond empathically to their emotions, and support them in the next steps of care. While LLMs can accurately perform tasks such as medical summarization or answering medical questions, there has been little work specifically aimed towards developing these kinds of conversational diagnostic capabilities.

Inspired by this challenge, we developed Articulate Medical Intelligence Explorer (AMIE), a research AI system based on a LLM and optimized for diagnostic reasoning and conversations. We trained and evaluated AMIE along many dimensions that reflect quality in real-world clinical consultations from the perspective of both clinicians and patients. To scale AMIE across a multitude of disease conditions, specialties and scenarios, we developed a novel self-play based simulated diagnostic dialogue environment with automated feedback mechanisms to enrich and accelerate its learning process. We also introduced an inference time chain-of-reasoning strategy to improve AMIE’s diagnostic accuracy and conversation quality. Finally, we tested AMIE prospectively in real examples of multi-turn dialogue by simulating consultations with trained actors.

AMIE was optimized for diagnostic conversations, asking questions that help to reduce its uncertainty and improve diagnostic accuracy, while also balancing this with other requirements of effective clinical communication, such as empathy, fostering a relationship, and providing information clearly.

Evaluation of conversational diagnostic AI

Besides developing and optimizing AI systems themselves for diagnostic conversations, how to assess such systems is also an open question. Inspired by accepted tools used to measure consultation quality and clinical communication skills in real-world settings, we constructed a pilot evaluation rubric to assess diagnostic conversations along axes pertaining to history-taking, diagnostic accuracy, clinical management, clinical communication skills, relationship fostering and empathy.

We then designed a randomized, double-blind crossover study of text-based consultations with validated patient actors interacting either with board-certified primary care physicians (PCPs) or the AI system optimized for diagnostic dialogue. We set up our consultations in the style of an objective structured clinical examination (OSCE), a practical assessment commonly used in the real world to examine clinicians’ skills and competencies in a standardized and objective way. In a typical OSCE, clinicians might rotate through multiple stations, each simulating a real-life clinical scenario where they perform tasks such as conducting a consultation with a standardized patient actor (trained carefully to emulate a patient with a particular condition). Consultations were performed using a synchronous text-chat tool, mimicking the interface familiar to most consumers using LLMs today.

AMIE is a research AI system based on LLMs for diagnostic reasoning and dialogue.

AMIE: an LLM-based conversational diagnostic research AI system

We trained AMIE on real-world datasets comprising medical reasoning, medical summarization and real-world clinical conversations.

It is feasible to train LLMs using real-world dialogues developed by passively collecting and transcribing in-person clinical visits, however, two substantial challenges limit their effectiveness in training LLMs for medical conversations. First, existing real-world data often fails to capture the vast range of medical conditions and scenarios, hindering the scalability and comprehensiveness. Second, the data derived from real-world dialogue transcripts tends to be noisy, containing ambiguous language (including slang, jargon, humor and sarcasm), interruptions, ungrammatical utterances, and implicit references.

To address these limitations, we designed a self-play based simulated learning environment with automated feedback mechanisms for diagnostic medical dialogue in a virtual care setting, enabling us to scale AMIE’s knowledge and capabilities across many medical conditions and contexts. We used this environment to iteratively fine-tune AMIE with an evolving set of simulated dialogues in addition to the static corpus of real-world data described.

This process consisted of two self-play loops: (1) an “inner” self-play loop, where AMIE leveraged in-context critic feedback to refine its behavior on simulated conversations with an AI patient simulator; and (2) an “outer” self-play loop where the set of refined simulated dialogues were incorporated into subsequent fine-tuning iterations. The resulting new version of AMIE could then participate in the inner loop again, creating a virtuous continuous learning cycle.

Further, we also employed an inference time chain-of-reasoning strategy which enabled AMIE to progressively refine its response conditioned on the current conversation to arrive at an informed and grounded reply.

AMIE uses a novel self-play based simulated dialogue learning environment to improve the quality of diagnostic dialogue across a multitude of disease conditions, specialities and patient contexts.

We tested performance in consultations with simulated patients (played by trained actors), compared to those performed by 20 real PCPs using the randomized approach described above. AMIE and PCPs were assessed from the perspectives of both specialist attending physicians and our simulated patients in a randomized, blinded crossover study that included 149 case scenarios from OSCE providers in Canada, the UK and India in a diverse range of specialties and diseases.

Notably, our study was not designed to emulate either traditional in-person OSCE evaluations or the ways clinicians usually use text, email, chat or telemedicine. Instead, our experiment mirrored the most common way consumers interact with LLMs today, a potentially scalable and familiar mechanism for AI systems to engage in remote diagnostic dialogue.

Overview of the randomized study design to perform a virtual remote OSCE with simulated patients via online multi-turn synchronous text chat.

Performance of AMIE

In this setting, we observed that AMIE performed simulated diagnostic conversations at least as well as PCPs when both were evaluated along multiple clinically-meaningful axes of consultation quality. AMIE had greater diagnostic accuracy and superior performance for 28 of 32 axes from the perspective of specialist physicians, and 24 of 26 axes from the perspective of patient actors.

AMIE outperformed PCPs on multiple evaluation axes for diagnostic dialogue in our evaluations.
Specialist-rated top-k diagnostic accuracy. AMIE and PCPs top-k differential diagnosis (DDx) accuracy are compared across 149 scenarios with respect to the ground truth diagnosis (a) and all diagnoses listed within the accepted differential diagnoses (b). Bootstrapping (n=10,000) confirms all top-k differences between AMIE and PCP DDx accuracy are significant with p <0.05 after false discovery rate (FDR) correction.
Diagnostic conversation and reasoning qualities as assessed by specialist physicians. On 28 out of 32 axes, AMIE outperformed PCPs while being comparable on the rest.

Limitations

Our research has several limitations and should be interpreted with appropriate caution. Firstly, our evaluation technique likely underestimates the real-world value of human conversations, as the clinicians in our study were limited to an unfamiliar text-chat interface, which permits large-scale LLM–patient interactions but is not representative of usual clinical practice. Secondly, any research of this type must be seen as only a first exploratory step on a long journey. Transitioning from a LLM research prototype that we evaluated in this study to a safe and robust tool that could be used by people and those who provide care for them will require significant additional research. There are many important limitations to be addressed, including experimental performance under real-world constraints and dedicated exploration of such important topics as health equity and fairness, privacy, robustness, and many more, to ensure the safety and reliability of the technology.


AMIE as an aid to clinicians

In a recently released preprint, we evaluated the ability of an earlier iteration of the AMIE system to generate a DDx alone or as an aid to clinicians. Twenty (20) generalist clinicians evaluated 303 challenging, real-world medical cases sourced from the New England Journal of Medicine (NEJM) ClinicoPathologic Conferences (CPCs). Each case report was read by two clinicians randomized to one of two assistive conditions: either assistance from search engines and standard medical resources, or AMIE assistance in addition to these tools. All clinicians provided a baseline, unassisted DDx prior to using the respective assistive tools.

Assisted randomized reader study setup to investigate the assistive effect of AMIE to clinicians in solving complex diagnostic case challenges from the New England Journal of Medicine.

AMIE exhibited standalone performance that exceeded that of unassisted clinicians (top-10 accuracy 59.1% vs. 33.6%, p= 0.04). Comparing the two assisted study arms, the top-10 accuracy was higher for clinicians assisted by AMIE, compared to clinicians without AMIE assistance (24.6%, p<0.01) and clinicians with search (5.45%, p=0.02). Further, clinicians assisted by AMIE arrived at more comprehensive differential lists than those without AMIE assistance.

In addition to strong standalone performance, using the AMIE system led to significant assistive effect and improvements in diagnostic accuracy of the clinicians in solving these complex case challenges.

It's worth noting that NEJM CPCs are not representative of everyday clinical practice. They are unusual case reports in only a few hundred individuals so offer limited scope for probing important issues like equity or fairness.


Bold and responsible research in healthcare — the art of the possible

Access to clinical expertise remains scarce around the world. While AI has shown great promise in specific clinical applications, engagement in the dynamic, conversational diagnostic journeys of clinical practice requires many capabilities not yet demonstrated by AI systems. Doctors wield not only knowledge and skill but a dedication to myriad principles, including safety and quality, communication, partnership and teamwork, trust, and professionalism. Realizing these attributes in AI systems is an inspiring challenge that should be approached responsibly and with care. AMIE is our exploration of the “art of the possible”, a research-only system for safely exploring a vision of the future where AI systems might be better aligned with attributes of the skilled clinicians entrusted with our care. It is early experimental-only work, not a product, and has several limitations that we believe merit rigorous and extensive further scientific studies in order to envision a future in which conversational, empathic and diagnostic AI systems might become safe, helpful and accessible.


Acknowledgements

The research described here is joint work across many teams at Google Research and Google Deepmind. We are grateful to all our co-authors - Tao Tu, Mike Schaekermann, Anil Palepu, Daniel McDuff, Jake Sunshine, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Sara Mahdavi, Karan Sighal, Shekoofeh Azizi, Nenad Tomasev, Yun Liu, Yong Cheng, Le Hou, Albert Webson, Jake Garrison, Yash Sharma, Anupam Pathak, Sushant Prakash, Philip Mansfield, Shwetak Patel, Bradley Green, Ewa Dominowska, Renee Wong, Juraj Gottweis, Dale Webster, Katherine Chou, Christopher Semturs, Joelle Barral, Greg Corrado and Yossi Matias. We also thank Sami Lachgar, Lauren Winer and John Guilyard for their support with narratives and the visuals. Finally, we are grateful to Michael Howell, James Maynika, Jeff Dean, Karen DeSalvo, Zoubin Gharahmani and Demis Hassabis for their support during the course of this project.


Source: Google AI Blog


AMIE: A research AI system for diagnostic medical reasoning and conversations

The physician-patient conversation is a cornerstone of medicine, in which skilled and intentional communication drives diagnosis, management, empathy and trust. AI systems capable of such diagnostic dialogues could increase availability, accessibility, quality and consistency of care by being useful conversational partners to clinicians and patients alike. But approximating clinicians’ considerable expertise is a significant challenge.

Recent progress in large language models (LLMs) outside the medical domain has shown that they can plan, reason, and use relevant context to hold rich conversations. However, there are many aspects of good diagnostic dialogue that are unique to the medical domain. An effective clinician takes a complete “clinical history” and asks intelligent questions that help to derive a differential diagnosis. They wield considerable skill to foster an effective relationship, provide information clearly, make joint and informed decisions with the patient, respond empathically to their emotions, and support them in the next steps of care. While LLMs can accurately perform tasks such as medical summarization or answering medical questions, there has been little work specifically aimed towards developing these kinds of conversational diagnostic capabilities.

Inspired by this challenge, we developed Articulate Medical Intelligence Explorer (AMIE), a research AI system based on a LLM and optimized for diagnostic reasoning and conversations. We trained and evaluated AMIE along many dimensions that reflect quality in real-world clinical consultations from the perspective of both clinicians and patients. To scale AMIE across a multitude of disease conditions, specialties and scenarios, we developed a novel self-play based simulated diagnostic dialogue environment with automated feedback mechanisms to enrich and accelerate its learning process. We also introduced an inference time chain-of-reasoning strategy to improve AMIE’s diagnostic accuracy and conversation quality. Finally, we tested AMIE prospectively in real examples of multi-turn dialogue by simulating consultations with trained actors.

AMIE was optimized for diagnostic conversations, asking questions that help to reduce its uncertainty and improve diagnostic accuracy, while also balancing this with other requirements of effective clinical communication, such as empathy, fostering a relationship, and providing information clearly.

Evaluation of conversational diagnostic AI

Besides developing and optimizing AI systems themselves for diagnostic conversations, how to assess such systems is also an open question. Inspired by accepted tools used to measure consultation quality and clinical communication skills in real-world settings, we constructed a pilot evaluation rubric to assess diagnostic conversations along axes pertaining to history-taking, diagnostic accuracy, clinical management, clinical communication skills, relationship fostering and empathy.

We then designed a randomized, double-blind crossover study of text-based consultations with validated patient actors interacting either with board-certified primary care physicians (PCPs) or the AI system optimized for diagnostic dialogue. We set up our consultations in the style of an objective structured clinical examination (OSCE), a practical assessment commonly used in the real world to examine clinicians’ skills and competencies in a standardized and objective way. In a typical OSCE, clinicians might rotate through multiple stations, each simulating a real-life clinical scenario where they perform tasks such as conducting a consultation with a standardized patient actor (trained carefully to emulate a patient with a particular condition). Consultations were performed using a synchronous text-chat tool, mimicking the interface familiar to most consumers using LLMs today.

AMIE is a research AI system based on LLMs for diagnostic reasoning and dialogue.

AMIE: an LLM-based conversational diagnostic research AI system

We trained AMIE on real-world datasets comprising medical reasoning, medical summarization and real-world clinical conversations.

It is feasible to train LLMs using real-world dialogues developed by passively collecting and transcribing in-person clinical visits, however, two substantial challenges limit their effectiveness in training LLMs for medical conversations. First, existing real-world data often fails to capture the vast range of medical conditions and scenarios, hindering the scalability and comprehensiveness. Second, the data derived from real-world dialogue transcripts tends to be noisy, containing ambiguous language (including slang, jargon, humor and sarcasm), interruptions, ungrammatical utterances, and implicit references.

To address these limitations, we designed a self-play based simulated learning environment with automated feedback mechanisms for diagnostic medical dialogue in a virtual care setting, enabling us to scale AMIE’s knowledge and capabilities across many medical conditions and contexts. We used this environment to iteratively fine-tune AMIE with an evolving set of simulated dialogues in addition to the static corpus of real-world data described.

This process consisted of two self-play loops: (1) an “inner” self-play loop, where AMIE leveraged in-context critic feedback to refine its behavior on simulated conversations with an AI patient simulator; and (2) an “outer” self-play loop where the set of refined simulated dialogues were incorporated into subsequent fine-tuning iterations. The resulting new version of AMIE could then participate in the inner loop again, creating a virtuous continuous learning cycle.

Further, we also employed an inference time chain-of-reasoning strategy which enabled AMIE to progressively refine its response conditioned on the current conversation to arrive at an informed and grounded reply.

AMIE uses a novel self-play based simulated dialogue learning environment to improve the quality of diagnostic dialogue across a multitude of disease conditions, specialities and patient contexts.

We tested performance in consultations with simulated patients (played by trained actors), compared to those performed by 20 real PCPs using the randomized approach described above. AMIE and PCPs were assessed from the perspectives of both specialist attending physicians and our simulated patients in a randomized, blinded crossover study that included 149 case scenarios from OSCE providers in Canada, the UK and India in a diverse range of specialties and diseases.

Notably, our study was not designed to emulate either traditional in-person OSCE evaluations or the ways clinicians usually use text, email, chat or telemedicine. Instead, our experiment mirrored the most common way consumers interact with LLMs today, a potentially scalable and familiar mechanism for AI systems to engage in remote diagnostic dialogue.

Overview of the randomized study design to perform a virtual remote OSCE with simulated patients via online multi-turn synchronous text chat.

Performance of AMIE

In this setting, we observed that AMIE performed simulated diagnostic conversations at least as well as PCPs when both were evaluated along multiple clinically-meaningful axes of consultation quality. AMIE had greater diagnostic accuracy and superior performance for 28 of 32 axes from the perspective of specialist physicians, and 24 of 26 axes from the perspective of patient actors.

AMIE outperformed PCPs on multiple evaluation axes for diagnostic dialogue in our evaluations.
Specialist-rated top-k diagnostic accuracy. AMIE and PCPs top-k differential diagnosis (DDx) accuracy are compared across 149 scenarios with respect to the ground truth diagnosis (a) and all diagnoses listed within the accepted differential diagnoses (b). Bootstrapping (n=10,000) confirms all top-k differences between AMIE and PCP DDx accuracy are significant with p <0.05 after false discovery rate (FDR) correction.
Diagnostic conversation and reasoning qualities as assessed by specialist physicians. On 28 out of 32 axes, AMIE outperformed PCPs while being comparable on the rest.

Limitations

Our research has several limitations and should be interpreted with appropriate caution. Firstly, our evaluation technique likely underestimates the real-world value of human conversations, as the clinicians in our study were limited to an unfamiliar text-chat interface, which permits large-scale LLM–patient interactions but is not representative of usual clinical practice. Secondly, any research of this type must be seen as only a first exploratory step on a long journey. Transitioning from a LLM research prototype that we evaluated in this study to a safe and robust tool that could be used by people and those who provide care for them will require significant additional research. There are many important limitations to be addressed, including experimental performance under real-world constraints and dedicated exploration of such important topics as health equity and fairness, privacy, robustness, and many more, to ensure the safety and reliability of the technology.


AMIE as an aid to clinicians

In a recently released preprint, we evaluated the ability of an earlier iteration of the AMIE system to generate a DDx alone or as an aid to clinicians. Twenty (20) generalist clinicians evaluated 303 challenging, real-world medical cases sourced from the New England Journal of Medicine (NEJM) ClinicoPathologic Conferences (CPCs). Each case report was read by two clinicians randomized to one of two assistive conditions: either assistance from search engines and standard medical resources, or AMIE assistance in addition to these tools. All clinicians provided a baseline, unassisted DDx prior to using the respective assistive tools.

Assisted randomized reader study setup to investigate the assistive effect of AMIE to clinicians in solving complex diagnostic case challenges from the New England Journal of Medicine.

AMIE exhibited standalone performance that exceeded that of unassisted clinicians (top-10 accuracy 59.1% vs. 33.6%, p= 0.04). Comparing the two assisted study arms, the top-10 accuracy was higher for clinicians assisted by AMIE, compared to clinicians without AMIE assistance (24.6%, p<0.01) and clinicians with search (5.45%, p=0.02). Further, clinicians assisted by AMIE arrived at more comprehensive differential lists than those without AMIE assistance.

In addition to strong standalone performance, using the AMIE system led to significant assistive effect and improvements in diagnostic accuracy of the clinicians in solving these complex case challenges.

It's worth noting that NEJM CPCs are not representative of everyday clinical practice. They are unusual case reports in only a few hundred individuals so offer limited scope for probing important issues like equity or fairness.


Bold and responsible research in healthcare — the art of the possible

Access to clinical expertise remains scarce around the world. While AI has shown great promise in specific clinical applications, engagement in the dynamic, conversational diagnostic journeys of clinical practice requires many capabilities not yet demonstrated by AI systems. Doctors wield not only knowledge and skill but a dedication to myriad principles, including safety and quality, communication, partnership and teamwork, trust, and professionalism. Realizing these attributes in AI systems is an inspiring challenge that should be approached responsibly and with care. AMIE is our exploration of the “art of the possible”, a research-only system for safely exploring a vision of the future where AI systems might be better aligned with attributes of the skilled clinicians entrusted with our care. It is early experimental-only work, not a product, and has several limitations that we believe merit rigorous and extensive further scientific studies in order to envision a future in which conversational, empathic and diagnostic AI systems might become safe, helpful and accessible.


Acknowledgements

The research described here is joint work across many teams at Google Research and Google Deepmind. We are grateful to all our co-authors - Tao Tu, Mike Schaekermann, Anil Palepu, Daniel McDuff, Jake Sunshine, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Sara Mahdavi, Karan Sighal, Shekoofeh Azizi, Nenad Tomasev, Yun Liu, Yong Cheng, Le Hou, Albert Webson, Jake Garrison, Yash Sharma, Anupam Pathak, Sushant Prakash, Philip Mansfield, Shwetak Patel, Bradley Green, Ewa Dominowska, Renee Wong, Juraj Gottweis, Dale Webster, Katherine Chou, Christopher Semturs, Joelle Barral, Greg Corrado and Yossi Matias. We also thank Sami Lachgar, Lauren Winer and John Guilyard for their support with narratives and the visuals. Finally, we are grateful to Michael Howell, James Maynika, Jeff Dean, Karen DeSalvo, Zoubin Gharahmani and Demis Hassabis for their support during the course of this project.

Source: Google AI Blog


Navigating AI Safety & Compliance: A guide for CTOs

Posted by Fergus Hurley – Co-Founder & GM, Checks, and Pedro Rodriguez – Head of Engineering, Checks

The rapid advances in generative artificial intelligence (GenAI) have brought about transformative opportunities across many industries. However, these advances have raised concerns about risks, such as privacy, misuse, bias, and unfairness. Responsible development and deployment is, therefore, a must.

AI applications are becoming more sophisticated, and developers are integrating them into critical systems. Therefore, the onus is on technology leaders, particularly CTOs and Heads of Engineering and AI – those responsible for leading the adoption of AI across their products and stacks – to ensure they use AI safely, ethically, and in compliance with relevant policies, regulations, and laws.

While comprehensive AI safety regulations are nascent, CTOs cannot wait for regulatory mandates before they act. Instead, they must adopt a forward-thinking approach to AI governance, incorporating safety and compliance considerations into the entire product development cycle.

This article is the first in a series to explore these challenges. To start, this article presents four key proposals for integrating AI safety and compliance practices into the product development lifecycle:


1.     Establish a robust AI governance framework

Formulate a comprehensive AI governance framework that clearly defines the organization’s principles, policies, and procedures for developing, deploying, and operating AI systems. This framework should establish clear roles, responsibilities, accountability mechanisms, and risk assessment protocols.

Examples of emerging frameworks include the US National Institute of Standards and Technologies’ AI Risk Management Framework, the OSTP Blueprint for an AI Bill of Rights, the EU AI Act, as well as Google’s Secure AI Framework (SAIF).

As your organization adopts an AI governance framework, it is crucial to consider the implications of relying on third-party foundation models. These considerations include the data from your app that the foundation model uses and your obligations based on the foundation model provider's terms of service.


2.     Embed AI safety principles into the design phase

Incorporate AI safety principles, such as Google’s responsible AI principles, into the design process from the outset.

AI safety principles involve identifying and mitigating potential risks and challenges early in the development cycle. For example, mitigate bias in training or model inferences and ensure explainability of models behavior. Use techniques such as adversarial training – red teaming testing of LLMs using prompts that look for unsafe outputs – to help ensure that AI models operate in a fair, unbiased, and robust manner.


3.     Implement continuous monitoring and auditing

Track the performance and behavior of AI systems in real time with continuous monitoring and auditing. The goal is to identify and address potential safety issues or anomalies before they escalate into larger problems.

Look for key metrics like model accuracy, fairness, and explainability, and establish a baseline for your app and its monitoring. Beyond traditional metrics, look for unexpected changes in user behavior and AI model drift using a tool such as Vertex AI Model Monitoring. Do this using data logging, anomaly detection, and human-in-the-loop mechanisms to ensure ongoing oversight.


4.     Foster a culture of transparency and explainability

Drive AI decision-making through a culture of transparency and explainability. Encourage this culture by defining clear documentation guidelines, metrics, and roles so that all the team members developing AI systems participate in the design, training, deployment, and operations.

Also, provide clear and accessible explanations to cross-functional stakeholders about how AI systems operate, their limitations, and the available rationale behind their decisions. This information fosters trust among users, regulators, and stakeholders.


Final word

As AI's role in core and critical systems grows, proper governance is essential for its success and that of the systems and organizations using AI. The four proposals in this article should be a good start in that direction.

However, this is a broad and complex domain, which is what this series of articles is about. So, look out for deeper dives into the tools, techniques, and processes you need to safely integrate AI into your development and the apps you create.

A New Foundation for AI on Android

Posted by Dave Burke, VP of Engineering

Foundation Models learn from a diverse range of data sources to produce AI systems capable of adapting to a wide range of tasks, instead of being trained for a single narrow use case. Today, we announced Gemini, our most capable model yet. Gemini was designed for flexibility, so it can run on everything from data centers to mobile devices. It's been optimized for three different sizes: Ultra, Pro and Nano.

Gemini Nano, optimized for mobile

Gemini Nano, our most efficient model built for on-device tasks, runs directly on mobile silicon, opening support for a range of important use cases. Running on-device enables features where the data should not leave the device, such as suggesting replies to messages in an end-to-end encrypted messaging app. It also enables consistent experiences with deterministic latency, so features are always available even when there’s no network.

Gemini Nano is distilled down from the larger Gemini models and specifically optimized to run on mobile silicon accelerators. Gemini Nano enables powerful capabilities such as high quality text summarization, contextual smart replies, and advanced proofreading and grammar correction. For example, the enhanced language understanding of Gemini Nano enables the Pixel 8 Pro to concisely summarize content in the Recorder app, even when the phone’s network connection is offline.

Moving image of Gemini Nano being used in the Recorder app on a Pixel 8 Pro device
Pixel 8 Pro using Gemini Nano in the Recorder app to summarize meeting audio, even without a network connection.

Gemini Nano is starting to power Smart Reply in Gboard on Pixel 8 Pro, ready to be enabled in settings as a developer preview. Available now to try with WhatsApp and coming to more apps next year, the on-device AI model saves you time by suggesting high-quality responses with conversational awareness1.

Moving image of WhatsApp’s use of Smart Reply in Gboard using Gemini Nano on Pixel 8 Pro device
Smart Reply in Gboard within WhatsApp using Gemini Nano on Pixel 8 Pro.

Android AICore, a new system service for on-device foundation models

Android AICore is a new system service in Android 14 that provides easy access to Gemini Nano. AICore handles model management, runtimes, safety features and more, simplifying the work for you to incorporate AI into your apps.

AICore is private by design, following the example of Android’s Private Compute Core with isolation from the network via open-source APIs, providing transparency and auditability. As part of our efforts to build and deploy AI responsibly, we also built dedicated safety features to make it safer and more inclusive for everyone.

AICore architechture
AICore manages model, runtime and safety features.

AICore enables Low Rank Adaptation (LoRA) fine tuning with Gemini Nano. This powerful concept enables app developers to create small LoRA adapters based on their own training data. The LoRA adapter is loaded by AICore, resulting in a powerful large language model fine tuned for the app’s own use-cases.

AICore takes advantage of new ML hardware like the latest Google Tensor TPU and NPUs in flagship Qualcomm Technologies, Samsung S.LSI and MediaTek silicon. AICore and Gemini Nano are rolling out to Pixel 8 Pro, with more devices and silicon partners to be announced in the coming months.

Build with Gemini

We're excited to bring together state-of-the-art AI research with easy-to-use tools and APIs for Android developers to build with Gemini on-device. If you are interested in building apps using Gemini Nano and AICore, please sign up for our Early Access Program.


1 Available globally, only using the United States English keyboard language. Read more for details.

How it’s Made: Interacting with Gemini through multimodal prompting

Posted by Alexander Chen, Creative Director


Let’s try an experiment. We’ll show this picture to our multimodal model Gemini and ask it to describe what it sees:

A close up of a hand with five fingers spread out, palm facing down on a wooden surface.
Tell me what you see
Gemini:  I see a person's right hand. The hand is open with the fingers spread apart.

Ok, let’s try this one:

A closeup of the same hand, now balled up in a fist.
Gemini: A person knocking on a wooden door.

How about this one?

A closeup of the same hand, with two fingers extended like that of a peace sign.
Gemini: I see a hand with two fingers extended, which is a common symbol for the number two.

But wait … what if we asked Gemini to reason about all of these images together?

A sequence of three frames depicting the same hand in different formations. The first frame shows the hand with all five fingers spread out, palm facing down. The second frame shows the hand balled up in a fist. The third frame shows the hand with two fingers extended, like that of a peace sign.
What do you think I'm doing? Hint: it's a game.
Gemini: You're playing rock, paper, scissors.

Nice. What else could we try? Let's see if Gemini can notice this pattern:

A sequence of six frames that alternate between a hand that is balled up in a fist, and a hand that has two fingers extended.
I’m playing rock paper scissors. Do you notice a pattern in my gameplay?
Gemini: You are alternating between rock and scissors.

Neat, it spotted the pattern! But does Gemini really know how this game works? Let’s try asking:

Hmm ... do you think it's a good strategy?
Gemini: No, it is not a good strategy. Your opponent will quickly learn your pattern and will be able to win every time. You should try to randomize your choices more.

Good advice, Gemini! Okay, for this last one, let’s put your multimodal reasoning capabilities to the test. Here’s your prompt:

This is a secret message. What does it say?

A six by eight grid of frames of a hands balled up in a fist or the hand with all five fingers spread out, palm facing down. The series of photos spells 'Gemini' in ASCII binary.

Can you figure it out? We have gotten Gemini to solve it, but we won’t spoil it by putting the answer in this post. 🙂



What is multimodal prompting?

Everything we did just now is an example of “multimodal prompting.” We’re basically giving Gemini combinations of different modalities — image and text in this case — and having Gemini respond by predicting what might come next. Gemini’s ability to seamlessly combine these modes together enables new possibilities for what you can do.

In this post, we’ll explore some of the prompting approaches we used in our Hands on with Gemini demo video. We’ll soon be rolling out Gemini for people to try in Google AI Studio, our free, web-based developer tool where you’ll be able to try your own multimodal prompts with Gemini. We’re hoping this guide of starter prompts and ideas helps inspire you to start exploring your own ideas.



🧮 Spatial reasoning and logic

Logic and puzzles are fun way to test Gemini. Let’s try one that requires both left-to-right spatial reasoning and knowledge about our solar system:

Three sticky notes on a wooden surface, each featuring a hand-drawn sketch of a solar-body. The left-most note depicts a drawing of the sun. The middle note has a drawing of saturn. The right-most note shows a drawing of earth.
Is this the right order? Consider the distance from the sun and explain your reasoning.

Gemini: No, the correct order is Sun, Earth, Saturn. The Sun is the closest to the center of the solar system, followed by the Earth, and then Saturn.

Nice! Here’s another, inspired by pinewood derby challenges:

Two sticky notes depicting two different hand-drawn cars going down-hill. The car on the left is blocky and awkwardly shaped, the car on the right is slanted and smooth.
Which of these cars is more aerodynamic? The one on the left or the right? Explain why, using specific visual details.

Gemini: The car on the right is more aerodynamic. It has a lower profile and a more streamlined shape. The car on the left has a higher profile and a more boxy shape, which makes it less aerodynamic.



📷 Image sequences

Next, what if we showed Gemini a sequence of images? Let’s see we can show a few still frames from a game of charades and have Gemini guess the movie:

Four frames from a video of a person leaning back and moving their arms wildly.
Guess what movie I'm acting out.
Gemini: The Matrix
Nice! But which part specifically? Look at my body movements.
Gemini: The part where Neo dodges bullets.



🪄 Magic tricks

Exploring image sequences really lets you dig into how Gemini reasons over time. Let’s try a classic magic trick.

An image of two hands with palms facing the camera, there is a coin in the right hand.
Explain what I'm doing.
Gemini: You are holding a coin in your right hand.
An image of two hands in motion, palms turning inward and away from the camera.
An image of two hands, palms facing down and away from the camera.
Where do you think the coin is?
Gemini: The coin is in your right hand.

Sure, that’s a logical answer given what it’s seen. But let’s pull a sleight of hand as we reveal this next image to Gemini...

An image of two hands, the left palm is facing down and the right hand is tilted upwards revealing nothing under the right palm.
Look here, what happened?
Gemini: The coin is gone.

Gemini notices it’s gone. But can it summarize what may have happened?

Summarize what may have happened, step by step, starting with the first image.

Gemini: 
1. You started with a coin in your right hand.
2. You then made a motion with your hands, and the coin disappeared.
3. You then showed that the coin was gone.

Gemini is leveraging the entire context of this conversation – both text and image – to summarize its answer. Finally, let’s ask:

Can you explain how it disappeared?
Gemini: The coin was palmed.

Gemini makes a really solid guess! But we won’t reveal our secret 🙂.



🔀 Cup shuffling

Let’s try a trick that requires memory and logic: the classic ball and cup shuffling game. For this one, we’ll break it down into steps for Gemini. First, let’s establish the basics with these instructions:

Instructions: There are three cups in position 1 (left), position 2 (middle), position 3 (right).

Next, we’ll need to tell Gemini where we’re placing the ball:

Starting position: The ball is in position 3.
Current state: empty, empty, ball

Notice how we came up with a simple format for remembering where the ball is with “empty, empty, ball.” Next, let’s show Gemini two example turns:

Images:
Three frames depicting a person shuffling three green cups. The person switches positions of the middle and right-most cups. 
Swap: 2 and 3
Current state: empty, ball, empty

Images:
Three frames depicting a person shuffling three green cups. The person switches positions of the middle and left-most cups. 
Swap: 1 and 2
Current state: ball, empty, empty

In these two example turns, we’re both showing it what a swap looks like, and how to update the ball position. Now, let’s see if Gemini can generalize to a swap it hasn't seen:

Images:
Three frames depicting a person shuffling three green cups. The person switches positions of the left-most and right-most cups. 
Swap: 

Here's Gemini's response:

1 and 3
Current state: empty, empty, ball

Gemini got it! It looked at these images and correctly inferred that cups 1 and 3 are being swapped. And it reasoned correctly about how to update the ball position. Let’s ask:

Tell me where the ball is and summarize how it got there, in paragraph form.

Gemini: The ball is in cup 3. 

Starting with the ball in cup 3, the first swap was cup 2 and 3. This put the ball in cup 2. The second swap was cup 1 and 2. This put the ball in cup 1. The third swap was cup 1 and 3. This put the ball in cup 3.

Not only did Gemini get the answer correctly, it accurately summarized the game history. Of course, it won’t always get this challenge right. Sometimes the fake out move (where you swap two empty cups) seems to trip it up, but sometimes it gets that too. But simple prompts like this make it really fun to rapidly test Gemini. You can change the variables in your prompt, including the order of swaps, and see how it does.



🔨 Tool use

If you want to use Gemini in your own apps, you’ll want it to be able to connect to other tools. Let’s try a simple idea where Gemini needs to combine multimodality with tool use: drawing a picture to search for music.

A musical instruments and palm tree sketch on a notebook. The sketch features a guitar, an amplifier, a drum set, and a palm tree drawn in a loose and expressive style. The notebook sits on a wooden desk.
Step 1. Look at this drawing, describe everything in it, and come up with a fun, specific idea for a good musical soundtrack genre, mood, or soundtrack. 
Step 2. Provide a search query in the form: "Search [insert query]"

Gemini: 
The drawing is of a guitar, drums, and an amplifier. There is also a palm tree. The mood is fun, upbeat, and tropical. The genre is reggae. The soundtrack could be used for a movie or TV show set in a tropical location. 

Search "Reggae music for a tropical vacation"

Nice! Gemini both reasons about what it sees and then generates a search query you can parse to do a search. It’s like Gemini is acting like a translator for you – but instead of translating between languages, it’s translating modalities – from drawing to music in this case. With multimodal prompting, you can use Gemini to invent your own entirely new translations between different inputs and outputs.



🕹️Game creation

What if we tried using Gemini to quickly prototype a multimodal game? Here’s an idea: a geography guessing game where you have to point at a map to make your guess. Let’s start by prompting Gemini with the core idea:

Instructions: Let's play a game. Think of a country and give me a clue. The clue must be specific enough that there is only one correct country. I will try pointing at the country on a map.

Next, let’s give Gemini an example turn of gameplay, showing it how we want it to handle both incorrect and correct answers:

Country: This country is huge island that is mostly covered in ice.
(Answer: Greenland) 

Guess:
An image of a world map, a hand icon with the index finger pointing at Canada
I am pointing at the country of Canada *incorrect*

Guess:
An image of a world map, a hand icon with the index finger pointing at Greenland
I am pointing at the country of Greenland *correct*

Let’s give it a go and prompt Gemini to generate a clue:

Clue:
This country is known for its unique wildlife, including kangaroos and koalas. (Answer: Australia)

Ok, that’s a good clue. Let’s test out whether pointing will work. Just for fun, let’s try pointing at the wrong place first:

Guess:
A person pointing at Brazil on a map of the world with their right index finger. The map includes blue oceans, and green continents with no country borders.
I am pointing at the country of Brazil *incorrect*

Great! Gemini looked at my image and figured out I’m pointing at Brazil, and correctly reasoned that’s wrong. Now let’s point at the right place on the map:

A person pointing at Brazil on a map of the world with their right index finger. The map includes blue oceans, and green continents with no country borders.
I am pointing at the country of Australia *correct*

Nice! We’ve basically taught Gemini our game logic just by giving it an example. You'll also notice that it generalized from the illustrated hand in the examples.



⌨️ Coding

Of course, to bring your game idea to life, you’ll eventually have to write some executable code. Let’s see if Gemini can make a simple countdown timer for a game, but with a few fun twists:

Implement a simple timer in HTML/CSS/Javascript. Use a sans serif font and dark mode. Start it at 10 seconds and start counting down. When it reaches zero, replace the timer with a random emoji that is associated with excitement and motivation! Then go back to the timer at 10 seconds and start counting down again.

With just this single instruction, Gemini gives us a working timer that does what we asked for:

An animated gif of a countdown timer starting from 10. At the end of the countdown, a rocket emoji is shown followed lightning bolt emoji and a confetti emoji.

My favorite part is scrolling through Gemini’s source code to find the array of motivational emojis it picked for me:

 const emojis  = ['🚀', '⚡️', '🎉', '🎊', '🥳', '🤩', '✨'];



👀 A sneak peek

Throughout this post, we’ve been giving Gemini an input, and having Gemini make predictions for what might come next. That's basically what prompting is. And our inputs have been multimodal – image and text, combined.

But so far we've only shown Gemini responding in text. Maybe you’re wondering, can Gemini also respond with a combination of image and text? It can! This is a capability of Gemini called “interleaved text and image generation.” While this feature won’t be ready in the first version of Gemini for people to try, we hope to roll it out soon. Here’s a sneak peek of what’s possible.

Let’s see if we could use Gemini to provide everyday creative inspiration. And let’s try it in a domain that requires a bit of multimodal reasoning ... knitting! 🧶. Similar to our map game above, let's provide one example turn of interaction:

Give me three ideas for crochet creations I could make with these two colors:

Balls of blue and yellow yarn on a wooden table.

Colors: I see blue and yellow yarn.
Idea 1: How about a cute dog with one yellow ear? 
Image 1:

A blue and yellow crocheted dog head with ears placed on a wooden table.

Idea 2: Or how about a jellyfish with a few yellow tentacles?
Image 2:

A blue and yellow crocheted octopus with tentacles placed on a wooden table.

Idea 3: Or how about a blue cat with a yellow nose?
Image 3:

A blue and yellow crocheted cat with whiskers placed on a wooden table.

We’re essentially teaching Gemini about how we want each interaction to go: “I’ll take a photo of two balls of yarn, and I expect you (Gemini) to both come up with an idea for something I could make, and generate an image of it.”

Now, let’s show it a new pair of yarn colors it hasn't yet seen, and see if it can generalize:

Give me three ideas for crochet creations I could make with these two colors:

Balls of blue and pink yarn on a wooden table.

Gemini:
Colors: I see blue and pink yarn.
Idea 1: How about a pig with blue ears?
Image 1:

A blue and pink crocheted pig face with ears placed on a wooden table.

Idea 2: Or an octopus with pink and blue tentacles?
Image 2:

A blue and yellow crocheted octopus with tentacles placed on a wooden table.

Idea 3: Or a bunny with a pink nose?
Image 3:

A blue and pink crocheted bunny placed on a wooden table.

Nice! Gemini correctly reasoned about the new colors (“I see blue and pink yarn”) and generated these ideas and the images in a single, interleaved output of text and image.

What Gemini did here is fundamentally different from today’s text-to-image models. It's not just passing an instruction to a separate text-to-image model. It sees the image of my actual yarn on my wooden table, truly doing multimodal reasoning about my text and image together.


What's Next?

We hope you found this a helpful starter guide to get a sense of what’s possible with Gemini. We’re very excited to roll it out to more people soon so you can explore your own ideas through prompting. Stay tuned!

Full-stack development in Project IDX

Posted by Kaushik Sathupadi, Prakhar Srivastav, and Kristin Bi – Software Engineers; Alex Geboff – Technical Writer

We launched Project IDX, our experimental, new browser-based development experience, to simplify the chaos of building full-stack apps and streamline the development process from (back)end to (front)end.

In our experience, most web applications are built with at-least two different layers: a frontend (UI) layer and a backend layer. When you think about the kind of app you’d build in a browser-based developer workspace, you might not immediately jump to full-stack apps with robust, fully functional backends. Developing a backend in a web-based environment can get clunky and costly very quickly. Between different authentication setups for development and production environments, secure communication between backend and frontend, and the complexity of setting up a fully self-contained (hermetic) testing environment, costs and inconveniences can add up.

We know a lot of you are excited to try IDX yourselves, but in the meantime, we wanted to share this post about full-stack development in Project IDX. We’ll untangle some of the complex situations you might hit as a developer building both your frontend and backend layers in a web-based workspace — developer authentication, frontend-backend communication, and hermetic testing — and how we’ve tried to make it all just a little bit easier. And of course we want to hear from you about what else we should build that would make full-stack development easier for you!


Streamlined app previews

First and foremost, we've streamlined the process of enabling your applications frontend communication with its backend services in the VM, making it effortless to preview your full-stack application in the browser.

IDX workspaces are built on Google Cloud Workstations and securely access connected services through Service Accounts. Each workspace’s unique service account supports seamless, authenticated preview environments for your applications frontend. So, when you use Project IDX, application previews are built directly into your workspace, and you don’t actually have to set up a different authentication path to preview your UI. Currently, IDX only supports web previews, but Android and iOS application previews are coming soon to IDX workspaces near you.

Additionally, if your setup necessitates communication with the backend API under development in IDX from outside the browser preview, we've established a few mechanisms to temporarily provide access to the ports hosting these API backends.


Simple front-to-backend communication

If you’re using a framework that serves both the backend and frontend layers from the same port, you can pass the $PORT flag to use a custom PORT environment variable in your workspace configuration file (powered by Nix and stored directly in your workspace). This is part of the basic setup flow in Project IDX, so you don’t have to do anything particularly special (outside of setting the variable in your config file). Here’s an example Nix-based configuration file:


{ pkgs, ... }: {

# NOTE: This is an excerpt of a complete Nix configuration example.

# Enable previews and customize configuration
idx.previews = {
  enable = true;
  previews = [
    {
      command = [
        "npm"
        "run"
        "start"
        "--"
        "--port"
        "$PORT"
        "--host"
        "0.0.0.0"
        "--disable-host-check"
      ];
      manager = "web";
      id = "web";
    }
  ];
};

However, if your backend server is running on a different port from your UI server, you’ll need to implement a different strategy. One method is to have the frontend proxy the backend, as you would with Vite's custom server options.

Another way to establish communication between ports is to set up your code so the javascript running on your UI can communicate with the backend server using AJAX requests.

Let’s start with some sample code that includes both a backend and a frontend. Here’s a backend server written in Express.js:


import express from "express";
import cors from "cors";


const app= express();
app.use(cors());

app.get("/", (req, res) => {
    res.send("Hello World");
});

app.listen(6000, () => {
    console.log("Server is running on port 6000");
})

The bolded line in the sample — app.use(cors()); — sets up the CORS headers. Setup might be different based on the language/framework of your choice, but your backend needs to return these headers whether you’re developing locally or on IDX.

When you run the server in the IDX terminal, the backend ports show up in the IDX panel. And every port that your server runs on is automatically mapped to a URL you can call.

Moving text showing the IDX terminal and panel

Now, let's write some client code to make an AJAX call to this server.


// This URL is copied from the side panel showing the backend ports view
const WORKSPACE_URL = "https://6000-monospace-ksat-web-prod-79679-1677177068249.cluster-lknrrkkitbcdsvoir6wqg4mwt6.cloudworkstations.dev/";

async function get(url) {
  const response = await fetch(url, {
    credentials: 'include',
  });
  console.log(response.text());
}

// Call the backend
get(WORKSPACE_URL);

We’ve also made sure that the fetch() call includes credentials. IDX URLs are authenticated, so we need to include credentials. This way, the AJAX call includes the cookies to authenticate against our servers.

If you’re using XMLHttpRequest instead of fetch, you can set the “withCredentials” property, like this:


const xhr = new XMLHttpRequest();
xhr.open("GET", WORKSPACE_URL, true);
xhr.withCredentials = true;
xhr.send(null);

Your code might differ from our samples based on the client library you use to make the AJAX calls. If it does, check the documentation for your specific client library on how to make a credentialed request. Just be sure to make a credentialed request.


Server-side testing without a login

In some cases you might want to access your application on Project IDX without logging into your Google account — or from an environment where you can’t log into your Google account. For example, if you want to access an API you're developing in IDX using either Postman or cURL from your personal laptops's command line. You can do this by using a temporary access token generated by Project IDX.

Once you have a server running in Project IDX, you can bring up the command menu to generate an access token. This access token is a short-lived token that temporarily allows you to access your workstation.

It’s extremely important to note that this access token provides access to your entire IDX workspace, including but not limited to your application in preview, so you shouldn’t share it with just anyone. We recommend that you only use it for testing.

Generate access token in Project IDX

When you run this command from IDX, your access token shows up in a dialog window. Copy the access token and use it to make a cURL request to a service running on your workstation, like this one:


$ export ACCESS_TOKEN=myaccesstoken
$ curl -H "Authorization: Bearer $ACCESS_TOKEN" https://6000-monospace-ksat-web-prod-79679-1677177068249.cluster-lknrrkkitbcdsvoir6wqg4mwt6.cloudworkstations.dev/
Hello world

And now you can run tests from an authenticated server environment!


Web-based, fully hermetic testing

As we’ve highlighted, you can test your application’s frontend and backend in a fully self-contained, authenticated, secure environment using IDX. You can also run local emulators in your web-based development environment to test your application’s backend services.

For example, you can run the Firebase Local Emulator Suite directly from your IDX workspace. To install the emulator suite, you’d run firebase init emulators from the IDX Terminal tab and follow the steps to configure which emulators you want on what ports.

ALT TEXT

Once you’ve installed them, you can configure and use them the same way you would in a local development environment from the IDX terminal.


Next Steps

As you can see, Project IDX can meet many of your full-stack development needs — from frontend to backend and every emulator in between.

If you're already using Project IDX, tag us on social with #projectidx to let us know how Project IDX has helped you with your full-stack development. Or to sign up for the waitlist, visit idx.dev.

People of AI: Season 2

Posted by Ashley Oldacre

If you are joining us for the first time, you can binge listen to our amazing 8 episodes from Season 1 wherever you get your podcasts.

We are back for another season of People of AI with a new lineup of incredible guests! I am so excited to introduce my new co-host Luiz Gustavo Martins as we meet inspiring people with interesting stories in the field of Artificial Intelligence.

Last season we focused on the incredible journeys that our guests took to get into the field of AI. Through our stories, we highlighted that no matter who you are, what your interests are, or what you work on, there is a place for anyone to get into this field. We also explored how much more accessible the technology has become over the years, as well as the importance of building AI-related products responsibly and ethically. It is easier than ever to use tools, platforms and services powered by machine learning to leverage the benefits of AI, and break down the barrier of entry.

For season 2, we will feature amazing conversations, focusing on Generative AI! Specifically, we will be discussing the explosive growth of Generative AI tools and the major technology shift that has happened in recent months. We will dive into various topics to explore areas where Generative AI can contribute tremendous value, as well as boost both productivity and economic growth. We will also continue to explore the personal paths and career development of this season’s guests as they share how their interest in technology was sparked, how they worked hard to get to where they are today, and explore what it is that they are currently working on.

Starting today, we will release one new episode of season 2 per week. Listen to the first episode on the People of AI site or wherever you get your podcasts. And stay tuned for later in the season when we premiere our first video podcasts as well!

  • Episode 1: meet your hosts, Ashley and Gus and learn about Generative AI, Bard and the big shift that has dramatically changed the industry. 
  • Episode 2: meet Sunita Verma, a long-time Googler, as she shares her personal journey from Engineering to CS, and into Google. As an early pioneer of AI and Google Ads, we will talk about the evolution of AI and how Generative AI will transform the way we work. 
  • Episode 3: meet Sayak Paul, a Google Developer Expert (GDE) as we explore what it means to be a GDE and how to leverage the power of your community through community contributions. 
  • Episode 4: meet Crispin Velez, the lead for Cloud’s Vertex AI as we dig into his experience in Cloud working with customers and partners on how to integrate and deploy AI. We also learn how he grew his AI developer community in LATAM from scratch. 
  • Episode 5: meet Joyce Shen, venture capital/private equity investor. She shares her fascinating career in AI and how she has worked with businesses to spot AI talent, incorporate AI technology into workflows and implement responsible AI into their products. 
  • Episode 6: meet Anne Simonds and Brian Gary, founders of Muse https://www.museml.com. Join us as we talk about their recent journeys into AI and their new company which uses the power of Generative AI to spark creativity. 
  • Episode 7: meet Tulsee Doshi, product lead for Google’s Responsible AI efforts as we discuss the development of Google-wide resources and best practices for developing more inclusive, diverse, and ethical algorithm driven products. 
  • Episode 8: meet Jeanine Banks, Vice President and General Manager of Google Developer X and Head of Developer Relations. Join us as we debunk AI and get down to what Generative AI really is, how it has changed over the past few months and will continue to change the developer landscape. 
  • Episode 9: meet Simon Tokumine, Director of Product Management at Google. We will talk about how AI has brought us into the era of task-orientated products and is fueling a new community of makers.

Listen now to the first episode of Season 2. We can’t wait to share the stories of these exceptional People of AI with you!

This podcast is sponsored by Google. Any remarks made by the speakers are their own and are not endorsed by Google.