Chrome Dev for Android Update

Hi everyone! We've just released Chrome Dev 122 (122.0.6250.2) for Android. It's now available on Google Play.

You can see a partial list of the changes in the Git log. For details on new features, check out the Chromium blog, and for details on web platform updates, check here.

If you find a new issue, please let us know by filing a bug.

Erhu Akpobaro
Google Chrome

Performance Max Guide Enhancements

Today, we are pleased to announce several guide enhancements to improve the experience of creating, managing and reporting on Performance Max campaigns with the Google Ads API.

Improving Performance Max integrations Blog Series

This article is part of a series that discusses new and upcoming features that you have been asking for. Keep an eye out for further updates and improvements on our developer blog, continue providing feedback on Performance Max integrations with the Google Ads API, and as always, contact our team if you need support.

Introducing ASPIRE for selective prediction in LLMs

In the fast-evolving landscape of artificial intelligence, large language models (LLMs) have revolutionized the way we interact with machines, pushing the boundaries of natural language understanding and generation to unprecedented heights. Yet, the leap into high-stakes decision-making applications remains a chasm too wide, primarily due to the inherent uncertainty of model predictions. Traditional LLMs generate responses autoregressively, token by token, yet they lack an intrinsic mechanism to assign a confidence score to these responses. Although one can derive a confidence score by aggregating the probabilities of individual tokens in the sequence (e.g., summing their log-probabilities), such scores typically fall short in reliably distinguishing between correct and incorrect answers. But what if LLMs could gauge their own confidence and only make predictions when they're sure?
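To make the aggregation idea concrete, here is a minimal sketch in plain NumPy with made-up token probabilities; in practice these would come from the model's softmax output at each decoding step:

```python
import numpy as np

# Hypothetical per-token probabilities for one generated answer.
token_probs = np.array([0.91, 0.78, 0.85, 0.60])

# Work in log space for numerical stability: the sequence likelihood is the
# product of per-token probabilities, i.e., the sum of their logs.
log_likelihood = np.log(token_probs).sum()

# Length normalization keeps longer answers from being penalized simply
# for having more tokens.
confidence = np.exp(log_likelihood / len(token_probs))

print(f"sequence likelihood: {np.exp(log_likelihood):.4f}")
print(f"length-normalized confidence: {confidence:.4f}")
```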

Selective prediction aims to do this by enabling LLMs to output an answer along with a selection score, which indicates the probability that the answer is correct. With selective prediction, one can better understand the reliability of LLMs deployed in a variety of applications. Prior research, such as semantic uncertainty and self-evaluation, has attempted to enable selective prediction in LLMs. A typical approach is to use heuristic prompts like “Is the proposed answer True or False?” to trigger self-evaluation in LLMs. However, this approach may not work well on challenging question answering (QA) tasks.
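For illustration, below is a hedged sketch of this kind of heuristic self-evaluation (often called P(True)) using Hugging Face Transformers and a small OPT checkpoint. The checkpoint, prompt wording, and single-token lookup are illustrative choices, not the exact setup evaluated in the research:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-125m")
lm = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

prompt = ("Question: Which vitamin helps regulate blood clotting?\n"
          "Proposed answer: Vitamin C\n"
          "Is the proposed answer True or False? The proposed answer is")

ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    next_token_logits = lm(ids).logits[0, -1]   # next-token distribution

probs = next_token_logits.softmax(dim=-1)
true_id = tok(" True", add_special_tokens=False).input_ids[0]
false_id = tok(" False", add_special_tokens=False).input_ids[0]
p_true, p_false = probs[true_id].item(), probs[false_id].item()

# The normalized probability of "True" serves as the selection score.
print(f"selection score: {p_true / (p_true + p_false):.3f}")
```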

The OPT-2.7B model incorrectly answers a question from the TriviaQA dataset, “Which vitamin helps regulate blood clotting?”, with “Vitamin C”. Without selective prediction, an LLM may output a wrong answer which, in this case, could lead users to take the wrong vitamin. With selective prediction, the LLM outputs an answer along with a selection score. If the selection score is low (e.g., 0.1), the LLM further outputs “I don’t know!” to warn users not to trust the answer, or to verify it using other sources.

In "Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs", presented at Findings of EMNLP 2023, we introduce ASPIRE — a novel framework meticulously designed to enhance the selective prediction capabilities of LLMs. ASPIRE fine-tunes LLMs on QA tasks via parameter-efficient fine-tuning, and trains them to evaluate whether their generated answers are correct. ASPIRE allows LLMs to output an answer along with a confidence score for that answer. Our experimental results demonstrate that ASPIRE significantly outperforms state-of-the-art selective prediction methods on a variety of QA datasets, such as the CoQA benchmark.


The mechanics of ASPIRE

Imagine teaching an LLM to not only answer questions but also evaluate those answers — akin to a student verifying their answers in the back of the textbook. That's the essence of ASPIRE, which involves three stages: (1) task-specific tuning, (2) answer sampling, and (3) self-evaluation learning.

Task-specific tuning: ASPIRE performs task-specific tuning to train adaptable parameters (θp) while freezing the LLM. Given a training dataset for a generative task, it fine-tunes the pre-trained LLM to improve its prediction performance. Towards this end, parameter-efficient tuning techniques (e.g., soft prompt tuning and LoRA) might be employed to adapt the pre-trained LLM on the task, given their effectiveness in obtaining strong generalization with small amounts of target task data. Specifically, the LLM parameters (θ) are frozen and adaptable parameters (θp) are added for fine-tuning. Only θp are updated to minimize the standard LLM training loss (e.g., cross-entropy). Such fine-tuning can improve selective prediction performance because it not only improves the prediction accuracy, but also enhances the likelihood of correct output sequences.
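The sketch below illustrates the general mechanics under simplifying assumptions: a tiny stand-in backbone replaces the pre-trained OPT model, and a squared-error loss stands in for the cross-entropy LM loss. Only the soft prompt (θp) receives gradients:

```python
import torch
import torch.nn as nn

class SoftPromptModel(nn.Module):
    def __init__(self, backbone: nn.Module, embed_dim: int, prompt_len: int = 20):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False                      # freeze theta
        # theta_p: the only trainable parameters.
        self.soft_prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        batch = input_embeds.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        # Prepend the learned soft prompt to every sequence in the batch.
        return self.backbone(torch.cat([prompt, input_embeds], dim=1))

# Toy stand-in for a frozen decoder; real code would wrap an OPT model.
backbone = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
model = SoftPromptModel(backbone, embed_dim=64)
optimizer = torch.optim.AdamW([model.soft_prompt], lr=1e-3)

embeds = torch.randn(2, 10, 64)      # stand-in for token embeddings
loss = model(embeds).pow(2).mean()   # stand-in for the cross-entropy LM loss
loss.backward()
optimizer.step()
```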

Answer sampling: After task-specific tuning, ASPIRE uses the LLM with the learned θp to generate different answers for each training question and create a dataset for self-evaluation learning. Since we aim for output sequences with high likelihood, we use beam search as the decoding algorithm, and we use the Rouge-L metric to determine whether a generated output sequence is correct.
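A minimal sketch of this labeling step, using the rouge-score package; the candidate answers and the 0.7 acceptance threshold are illustrative assumptions:

```python
from rouge_score import rouge_scorer  # pip install rouge-score

reference = "vitamin k"
candidates = ["vitamin k", "vitamin c", "the vitamin k"]  # e.g., beam search outputs

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
labeled = []
for answer in candidates:
    f1 = scorer.score(reference, answer)["rougeL"].fmeasure
    labeled.append((answer, int(f1 >= 0.7)))  # 1 = treated as correct

print(labeled)  # [('vitamin k', 1), ('vitamin c', 0), ('the vitamin k', 1)]
```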

Self-evaluation learning: After sampling high-likelihood outputs for each query, ASPIRE adds adaptable parameters (θs) and fine-tunes only θs to learn self-evaluation. Since the output sequence generation depends only on θ and θp, freezing θ and the learned θp avoids changing the prediction behavior of the LLM while it learns self-evaluation. We optimize θs so that the adapted LLM can distinguish between correct and incorrect answers on its own.
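The following is an illustrative stand-in for this stage: a second set of trainable parameters (θs) is optimized with a binary objective to separate correct from incorrect answers while everything else stays frozen. Note that the actual method trains θs through the LM itself; the separate read-out head and stand-in activations here are simplifications for the sketch:

```python
import torch
import torch.nn as nn

embed_dim, prompt_len, batch = 64, 10, 4
theta_s = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)
readout = nn.Linear(embed_dim, 1)    # simplified binary read-out head
optimizer = torch.optim.AdamW([theta_s, *readout.parameters()], lr=1e-3)

qa_embeds = torch.randn(batch, 16, embed_dim)   # frozen-model activations (stand-in)
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])     # from the Rouge-L labeling step

# Prepend the self-evaluation prompt, pool, and score each (query, answer) pair.
pooled = torch.cat([theta_s.expand(batch, -1, -1), qa_embeds], dim=1).mean(dim=1)
loss = nn.functional.binary_cross_entropy_with_logits(
    readout(pooled).squeeze(-1), labels)
loss.backward()
optimizer.step()
```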

The three stages of the ASPIRE framework.

In the proposed framework, θp and θs can be trained using any parameter-efficient tuning approach. In this work, we use soft prompt tuning, a simple yet effective mechanism for learning “soft prompts” to condition frozen language models to perform specific downstream tasks more effectively than traditional discrete text prompts. The driving force behind this approach lies in the recognition that if we can develop prompts that effectively stimulate self-evaluation, it should be possible to discover these prompts through soft prompt tuning in conjunction with targeted training objectives.

Implementation of the ASPIRE framework via soft prompt tuning. We first generate the answer to the question with the first soft prompt and then compute the learned self-evaluation score with the second soft prompt.

After training θp and θs, we obtain the prediction for the query via beam search decoding. We then define a selection score that combines the likelihood of the generated answer with the learned self-evaluation score (i.e., the likelihood of the prediction being correct for the query) to make selective predictions.
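As a hedged sketch, a selection score of this kind might be a weighted log-space combination of the two quantities; the weighting and exact functional form below are assumptions, not the paper's published formula:

```python
import numpy as np

def selection_score(answer_log_likelihood: float,
                    self_eval_prob: float,
                    alpha: float = 0.25) -> float:
    # Combine the answer's (length-normalized) log-likelihood with the
    # learned self-evaluation probability.
    return (1 - alpha) * answer_log_likelihood + alpha * np.log(self_eval_prob)

# A fairly likely answer that the learned self-evaluator distrusts ends up
# with a low selection score, so the model can abstain.
print(selection_score(answer_log_likelihood=-1.2, self_eval_prob=0.15))
```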


Results

To demonstrate ASPIRE’s efficacy, we evaluate it across three question-answering datasets — CoQA, TriviaQA, and SQuAD — using various open pre-trained transformer (OPT) models. By training θp with soft prompt tuning, we observed a substantial gain in the LLMs' accuracy. For example, the OPT-2.7B model adapted with ASPIRE demonstrated improved performance over the larger, pre-trained OPT-30B model on the CoQA and SQuAD datasets. These results suggest that with suitable adaptations, smaller LLMs might match or even surpass the accuracy of larger models in some scenarios.

When delving into the computation of selection scores with fixed model predictions, ASPIRE achieved a higher AUROC (the probability that a randomly chosen correct output sequence has a higher selection score than a randomly chosen incorrect output sequence) than baseline methods across all datasets. For example, on the CoQA benchmark, ASPIRE improves the AUROC from 51.3% to 80.3% relative to the baselines.
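For readers unfamiliar with the metric, here is a toy illustration of that AUROC interpretation using scikit-learn, with made-up labels and selection scores:

```python
from sklearn.metrics import roc_auc_score

# labels mark whether each generated answer was correct;
# scores are the corresponding selection scores.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.90, 0.20, 0.75, 0.60, 0.40, 0.35]

# Prints 1.0 here, because every correct answer outranks every incorrect one.
print(roc_auc_score(labels, scores))
```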

An intriguing pattern emerged from the TriviaQA evaluations. While the pre-trained OPT-30B model demonstrated higher baseline accuracy, its selective prediction performance did not improve significantly when traditional self-evaluation methods — Self-eval and P(True) — were applied. In contrast, the smaller OPT-2.7B model, when enhanced with ASPIRE, outperformed it in this respect. This discrepancy underscores a vital insight: larger LLMs using conventional self-evaluation techniques may be less effective at selective prediction than smaller, ASPIRE-enhanced models.

Our experimental journey with ASPIRE underscores a pivotal shift in the landscape of LLMs: The capacity of a language model is not the be-all and end-all of its performance. Instead, the effectiveness of models can be drastically improved through strategic adaptations, allowing for more precise, confident predictions even in smaller models. As a result, ASPIRE stands as a testament to the potential of LLMs that can judiciously ascertain their own certainty and decisively outperform larger counterparts in selective prediction tasks.


Conclusion

In conclusion, ASPIRE is not just another framework; it's a vision of a future where LLMs can be trusted partners in decision-making. By honing the selective prediction performance, we're inching closer to realizing the full potential of AI in critical applications.

Our research has opened new doors, and we invite the community to build upon this foundation. We're excited to see how ASPIRE will inspire the next generation of LLMs and beyond. To learn more about our findings, we encourage you to read our paper and join us in this thrilling journey towards creating a more reliable and self-aware AI.


Acknowledgments

We gratefully acknowledge the contributions of Sayna Ebrahimi, Sercan O Arik, Tomas Pfister, and Somesh Jha.

Source: Google AI Blog


HealthPulse AI Leverages MediaPipe to Increase Health Equity

A guest post by Rouella Mendonca, AI Product Lead and Matt Brown, Machine Learning Engineer at Audere

Please note that the information, uses, and applications expressed in the below post are solely those of our guest authors from Audere.


About HealthPulse AI and its application in the real world

Preventable and treatable diseases like HIV, COVID-19, and malaria infect ~12 million people per year globally, with a disproportionate number of cases impacting already underserved and under-resourced communities1. Communicable and non-communicable diseases impede human development through their negative impact on education, income, life expectancy, and other health indicators2. Lack of access to timely, accurate, and affordable diagnostics and care is a key contributor to high mortality rates.

Due to their low cost and relative ease of use, ~1 billion rapid diagnostic tests (RDTs) are used globally each year, and that number is growing. However, there are challenges with RDT use.

  • Where RDT data is reported, results are hard to trust due to inflated case counts, the absence of expected seasonal fluctuations in reported data, and non-adherence to treatment regimens.
  • RDTs are used in decentralized care settings by those with limited or no training, increasing the risk of misadministration and misinterpretation of test results.

HealthPulse AI, developed by the digital health non-profit Audere, leverages MediaPipe to address these issues by providing digital building blocks to increase trust in the world’s most widely used RDTs.

HealthPulse AI is a set of building blocks that can turn any digital solution into an RDT reader. These building blocks solve prominent global health problems by improving rapid diagnostic test accuracy, reducing misadministration of tests, and expanding the availability of testing for conditions including malaria, COVID, and HIV in decentralized care settings. With just a low-end smartphone, HealthPulse AI improves the accuracy of rapid diagnostic test results while automatically digitizing data for surveillance, program reporting, and test validation. It provides AI-facilitated digital capture and result interpretation; quality, accessible digital use instructions for provider and self-tests; and standards-based, real-time reporting of test results.

These capabilities are available to local implementers, global NGOs, governments, and private sector pharmacies via a web service for use with chatbots, apps or server implementations; a mobile SDK for offline use in any mobile application; or directly through native Android and iOS apps.

It enables innovative use cases such as quality-assured virtual care models that enable stigma-free, convenient HIV home testing with linkage to education, prevention, and treatment options.

HealthPulse AI Use Cases

HealthPulse AI can substantially democratize access to timely, quality care in the private sector (e.g., pharmacies), in the public sector (e.g., clinics), in community programs (e.g., community health workers), and in self-testing use cases. Using only an RDT image captured on a low-end smartphone, HealthPulse AI can power virtual care models by providing valuable decision support and quality control to clinicians, especially in cases where lines may be faint and hard to detect with the human eye. In the private sector, it can automate and scale incentive programs so auditors only need to review automated alerts based on test anomalies; today, these procedures require a human review of each incoming image and transaction. In community care programs, HealthPulse AI can be used as a training tool for health workers learning how to correctly administer and interpret tests. In the public sector, it can strengthen surveillance systems with real-time disease tracking and verification of results across all channels where care is delivered, enabling faster response and pandemic preparedness3.


HealthPulse AI algorithms

HealthPulse AI provides a library of AI algorithms for the top RDTs for malaria, HIV, and COVID. Each algorithm is a collection of Computer Vision (CV) models that are trained using machine learning (ML) algorithms. From an image of an RDT, our algorithms can:

  • Flag image quality issues common on low-end phones (blurriness, over/underexposure)
  • Detect the RDT type
  • Interpret the test result

Image Quality Assurance

When capturing an image of an RDT, it is important to ensure that the captured image is interpretable by both humans and AI in order to power the use cases described above. Image quality issues are common, particularly when images are captured with low-end phones, in settings with poor lighting, or simply by users with shaky hands. As such, HealthPulse AI provides image quality assurance (IQA) to identify adversarial image conditions. IQA returns the concerns it detects and can be used to ask users to retake the photo in real time. In telehealth use cases, for example, an uninterpretable image would otherwise force the client to retest once the RDT read window expires; with just-in-time flagging of quality concerns, that additional cost and treatment delay can be avoided. Examples of some adversarial images that IQA would flag are shown in Figure 1 below.

Figure 1: Images of malaria, HIV and COVID tests that are dark, blurry, too bright, and too small.
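As a rough illustration of what such checks can look like, here is a hedged OpenCV sketch of two common heuristics (blur via Laplacian variance and exposure via mean intensity); HealthPulse AI's production IQA calculators are not public, and the thresholds below are arbitrary placeholders:

```python
import cv2
import numpy as np

def quality_concerns(image_path: str, blur_threshold: float = 100.0) -> list:
    img = cv2.imread(image_path)                 # assumes the file exists
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    concerns = []

    # Variance of the Laplacian is a standard blurriness proxy: a sharp
    # image has strong edges and therefore a high variance.
    if cv2.Laplacian(gray, cv2.CV_64F).var() < blur_threshold:
        concerns.append("blurry")

    # Mean intensity is a crude over/underexposure check.
    mean_intensity = float(np.mean(gray))
    if mean_intensity < 50:
        concerns.append("too dark")
    elif mean_intensity > 205:
        concerns.append("too bright")

    return concerns  # an empty list means no retake is needed
```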

Classification

With just an image captured on a 5MP camera, typical of the low-end smartphones commonly used in Africa, Southeast Asia, and Latin America where a disproportionate disease burden exists, HealthPulse AI can identify a specific test (brand and disease), identify individual test lines, and provide an interpretation of the test. Our current library of AI algorithms supports many of the most commonly used WHO-prequalified RDTs for malaria, HIV, and COVID-19. Our AI is condition agnostic and can be easily extended to support any RDT for a range of communicable and non-communicable diseases (diabetes, influenza, tuberculosis, pregnancy, STIs, and more).

HealthPulse AI is able to detect the type of RDT in the image (for supported RDTs that the model was trained for), detect the presence of lines, and return a classification for the particular test (e.g. positive, negative, invalid, uninterpretable). See Figure 2.

Figure 2: Interpretation of a supported lateral flow rapid test.

How and why we use MediaPipe

Deploying HealthPulse AI in decentralized care settings with unstable infrastructure comes with a number of challenges. The first is a lack of reliable internet connectivity, which often requires our CV and ML algorithms to run locally on the device. Secondly, the phones available in these settings are often very old, lacking modern hardware (under 1 GB of RAM and comparably limited CPUs) and spanning different mobile platforms and OS versions (iOS, Android, Huawei; frequently versions so old they no longer receive OS updates). This necessitates a platform-agnostic, highly efficient inference engine. MediaPipe’s out-of-the-box multi-platform support for image-focused machine learning makes it efficient to meet these needs.

As a non-profit operating in cost-recovery mode, it was important that solutions:

  • have broad reach globally,
  • are low-lift to maintain, and
  • meet the needs of our target population for offline, low resource, performant use.

Without needing to write a lot of glue code, HealthPulse AI can support Android devices, iOS devices, and cloud deployments using the same library built on MediaPipe.

Our pipeline

MediaPipe’s graph definitions allow us to build and iterate our inference pipeline on the fly. After a user submits a picture, the pipeline determines the RDT type, and attempts to classify the test result by passing the detected result-window crop of the RDT image to our classifier.

For good human and AI interpretability, it is important to have good quality images. However, the input images to the pipeline have a high level of variability over which we have little to no control. Variability factors include (but are not limited to) varying image quality due to a range of smartphone camera features, megapixel counts, and physical defects; decentralized testing settings with differing and non-ideal lighting conditions; random orientations of the RDT cassettes; blurry, unfocused, or partial RDT images; and many other adversarial conditions that add challenges for the AI. As such, an important part of our solution is image quality assurance. Each image passes through a number of calculators geared towards highlighting quality concerns that may prevent the detector or classifier from doing its job accurately. The pipeline elevates these concerns to the host application so an end user can be asked, in real time, to retake the photo when necessary. Since RDT results are valid for only a limited time (a read window specified by the RDT manufacturer for how long after processing a result can be accurately read), IQA is essential to ensure timely care and save costs. A high-level flowchart of the pipeline is shown below in Figure 3.

Figure 3: HealthPulse AI pipeline
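In Python pseudocode, the flow described above might look like the following sketch; the stage functions are hypothetical stubs, since the production pipeline is a MediaPipe calculator graph rather than Python:

```python
def image_quality_checks(image):
    return []                      # stub: list of detected quality concerns

def detect_rdt_type(image):
    return "malaria_rdt_a"         # stub: supported RDT identifier, or None

def crop_result_window(image, rdt_type):
    return image                   # stub: crop of the RDT's result window

def classify(window, rdt_type):
    return "negative"              # stub: positive/negative/invalid/uninterpretable

def run_pipeline(image):
    concerns = image_quality_checks(image)
    if concerns:                   # surface concerns so the app can request a retake
        return {"status": "retake", "concerns": concerns}
    rdt_type = detect_rdt_type(image)
    if rdt_type is None:
        return {"status": "unrecognized"}
    window = crop_result_window(image, rdt_type)
    return {"status": "ok", "rdt": rdt_type, "result": classify(window, rdt_type)}

print(run_pipeline(object()))      # {'status': 'ok', 'rdt': ..., 'result': 'negative'}
```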

Summary

HealthPulse AI is designed to improve the quality and richness of testing programs and data in underserved communities that are disproportionately impacted by preventable communicable and non-communicable diseases.

Towards this mission, MediaPipe plays a critical role by providing a platform that allows Audere to quickly iterate and support new rapid diagnostic tests. This is imperative as new rapid tests come to market regularly, and test availability for community and home use can change frequently. This flexibility also lowers the overhead of maintaining the pipeline, which is crucial for cost-effective operations and, in turn, reduces the cost of use for the governments and organizations globally that provide services to the people who need them most.

HealthPulse AI offerings allow organizations and governments to benefit from new innovations in the diagnostics space with minimal overhead. This is an essential component of the primary health journey - to ensure that populations in under-resourced communities have access to timely, cost-effective, and efficacious care.


About Audere

Audere is a global digital health nonprofit developing AI-based solutions to address important problems in health delivery by providing innovative, scalable, interconnected tools to advance health equity in underserved communities worldwide. We operate at the unique intersection of global health and high tech, creating advanced, accessible software that revolutionizes the detection, prevention, and treatment of diseases — such as malaria, COVID-19, and HIV. Our diverse team of passionate, innovative minds combines human-centered design, smartphone technology, artificial intelligence (AI), open standards, and the best of cloud-based services to empower innovators globally to deliver healthcare in new ways in low- and middle-income settings. Audere operates primarily in Africa, with projects in Nigeria, Kenya, Côte d’Ivoire, Benin, Uganda, Zambia, South Africa, and Ethiopia.


1 WHO malaria fact sheets

How This Indie Game Studio Launched Their First Game on Google Play

Posted by Scarlett Asuncion – Product Marketing Manager

Indie game developers Geoffrey Mugford and Samuli Pietikainen first connected online through their shared passion for game design, before joining forces to create their own studio No Devs. Looking for ways to grow as a team, they entered the Quickplay Game Jam hosted by Latinx in Gaming in partnership with Google Play. The 6-week competition, open to anyone globally, challenged participants to generate a game idea around the theme of ‘tradition’. The duo became one of 4 winners to receive a share of $80,000 to bring their game jam concept to life and launch it on Google Play.

Their winning game idea, Pilkki, has just launched in early access. It offers players a captivating claymation ice fishing adventure set in a serene atmosphere that celebrates Finnish culture. Intrigued by the game’s origins and unique gameplay, we chatted with one-half of No Devs, Geoffrey. He shares how his multicultural heritage and Samuli’s Finnish background inspired their game design, the lessons they’ve learned so far and their studio’s future plans.

Geoffrey Mugford (left); Samuli Pietikainen (right)

Tell us about your journey as a team and why you entered the Quickplay Game Jam.

We started making games together in May 2022. We had talked about it for a year but hadn’t taken the plunge, so a game jam was the perfect way of kickstarting our creative partnership. Our first game jam was a success, so we decided to take it further and look for more game jam opportunities. As indie developers, balancing personal projects with financial stability is tough. Winning a prize in a game jam offers a chance to prototype an idea and potentially secure early funding for it. This game jam offered that opportunity whilst also promoting cultural diversity. Because of Samuli’s background, we were keen to make a game that embodies the Finnish mindset.

What inspired the creation of Pilkki and how did you shape the game to offer a unique cultural experience like ice fishing?

At first, we struggled with the game jam’s theme of 'tradition.' We were initially keen to make a traditional 'Day of the Dead'-inspired game, but after a couple of attempts realized it didn't resonate enough with us, so we shifted gears. Coming from a multicultural background, we thought about blending cultures rather than focusing on one. We considered creating new traditions using deck-builder or city-builder formats but found them too ambitious given the timeframe. We eventually turned our focus to Finland and its quirky traditions. Some of them, like eukonkanto (wife-carrying races) and tinanvalanta (tin melting in a sauna ladle), caught our attention, but we ultimately settled on ice fishing - a simple, unique, and very Finnish activity that could suit mobile gaming. The challenge was innovating on it - we reimagined it as a physics-driven puzzle game where the player controls the hook as a pendulum, and that's how Pilkki came to be.

Gameplay of Pilkki

Can you highlight some of the learnings and adjustments you made along the way?

We only had 6 weeks to make the game, and had already spent 2 of them brainstorming. When we settled on our game idea, we had to be very careful with scope and, sometimes, make quick decisions without the opportunity for play-testing. Some of these decisions ended up being super fun for the players - others, not so much. Luckily we had a clear division of responsibilities - I was on game design and programming, Samuli on art, audio and game feel - so we could work smoothly in parallel and meet milestones efficiently.

The win condition was a challenging aspect to figure out during development. We wanted a calm and reflective experience, true to its real-life analogue, so we avoided score systems and timers. With time running out to complete the game, we were unable to explore alternative options, and our game jam entry ended up being a race against time to catch as many fish as possible. After the game jam ended, we revisited this and turned towards a more tranquil atmosphere, where progression is driven by puzzles rather than scores.

How did the funding from the Quickplay Game Jam in partnership with Google Play contribute to the development of Pilkki beyond its initial prototype stage?

Pilkki is much larger in scope than anything we've attempted before. Without funding, we would have likely left it in its prototype stage without exploring the concept further. The Quickplay Game Jam allowed us to recognize the potential in the idea, and dedicate ourselves to turning it into the relaxing fishing experience it has become.

With the funding, we were able to dedicate 3 months full-time to the design and development of Pilkki. We were able to take a step back and really put some thought into how we would build a game that would continue growing post-release. On top of that, Samuli experimented with multiple styles and multi-media art - this is how he developed the beautiful claymation visuals that have become our unique selling point.

Are you excited about your future as a new indie game studio?

Yes, for sure! We love creating fun and innovative experiences for people, and we have both been dreaming about working on our own games full time. It's a long road ahead, but we're excited to keep the momentum. For now, we’re actively working on Pilkki and aiming to release a major game update in 2024. We're eager to see the reaction from our players.

Having our game on Google Play gives us access to new markets worldwide. We can't wait to see how the game grows and attracts new players, and how it introduces them to our quirky take on Finnish culture.

Chrome Dev for Desktop Update

The Dev channel has been updated to 122.0.6253.3 for Windows, Mac and Linux.

A partial list of changes is available in the Git log. Interested in switching release channels? Find out how. If you find a new issue, please let us know by filing a bug. The community help forum is also a great place to reach out for help or learn about common issues.

Prudhvi Bommana
Google Chrome