Dev Channel Update for ChromeOS/ChromeOS Flex

The Dev channel is being updated to 122.0.6226.0 (Platform version: 15739.0.0) for most ChromeOS devices. This build contains a number of bug fixes and security updates.

If you find new issues, please let us know in one of the following ways:

  1. File a bug
  2. Visit our ChromeOS communities
    1. General: Chromebook Help Community
    2. Beta Specific: ChromeOS Beta Help Community
  3. Report an issue or send feedback on Chrome

Interested in switching channels? Find out how.

Cole Brown,
Google ChromeOS

Can large language models identify and correct their mistakes?

LLMs are increasingly popular for reasoning tasks, such as multi-turn QA, task completion, code generation, or mathematics. Yet much like people, they do not always solve problems correctly on the first try, especially on tasks for which they were not trained. Therefore, for such systems to be most useful, they should be able to 1) identify where their reasoning went wrong and 2) backtrack to find another solution.

This has led to a surge in methods related to self-correction, where an LLM is used to identify problems in its own output and then produce improved results based on that feedback. Self-correction is generally thought of as a single process, but we decided to break it down into two components: mistake finding and output correction.

In “LLMs cannot find reasoning errors, but can correct them!”, we test state-of-the-art LLMs on mistake finding and output correction separately. We present BIG-Bench Mistake, an evaluation benchmark dataset for mistake identification, which we use to address the following questions:

  1. Can LLMs find logical mistakes in Chain-of-Thought (CoT) style reasoning?
  2. Can mistake-finding be used as a proxy for correctness?
  3. Knowing where the mistake is, can LLMs then be prompted to backtrack and arrive at the correct answer?
  4. Can mistake finding as a skill generalize to tasks the LLMs have never seen?

About our dataset

Mistake finding is an underexplored problem in natural language processing, with a particular lack of evaluation tasks in this domain. To best assess the ability of LLMs to find mistakes, evaluation tasks should exhibit mistakes that are unambiguous. To our knowledge, most current mistake-finding datasets do not go beyond the realm of mathematics for this reason.

To assess the ability of LLMs to reason about mistakes outside of the math domain, we produce a new dataset for use by the research community, called BIG-Bench Mistake. This dataset consists of Chain-of-Thought traces generated using PaLM 2 on five tasks in BIG-Bench. Each trace is annotated with the location of the first logical mistake.

To maximize the number of mistakes in our dataset, we sample 255 traces where the answer is incorrect (so we know there is definitely a mistake), and 45 traces where the answer is correct (so there may or may not be a mistake). We then ask human labelers to go through each trace and identify the first mistake step. Each trace has been annotated by at least three labelers, whose answers had inter-rater reliability levels of >0.98 (using Krippendorff’s α). The labeling was done manually for all tasks except Dyck Languages, which involves predicting the sequence of closing parentheses for a given input sequence; because correctness there can be checked mechanically, we labeled that task algorithmically.
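
To make the algorithmic labeling concrete, the sketch below shows one way to do it: a stack over the input determines the only valid sequence of closing brackets, so the first predicted closer that disagrees is the first mistake step. This is a minimal illustration rather than the dataset’s actual labeling code.

```python
# A minimal sketch of algorithmic mistake labeling for Dyck Languages.
# Illustrative only; not the exact code used to build the dataset.

PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}

def expected_closers(prefix):
    """Closing brackets still owed after reading the input prefix."""
    stack = []
    for ch in prefix:
        if ch in PAIRS:                  # an opener: remember its closer
            stack.append(PAIRS[ch])
        elif stack and ch == stack[-1]:  # a closer already in the input
            stack.pop()
    return list(reversed(stack))         # brackets close in reverse order

def first_mistake_step(prefix, predicted):
    """Index of the first wrong predicted closer, or None if all correct."""
    expected = expected_closers(prefix)
    for step, closer in enumerate(predicted):
        if step >= len(expected) or closer != expected[step]:
            return step
    return None

print(first_mistake_step("([{", ["}", ")", "]"]))  # -> 1 (should be "]")
```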

The logical errors made in this dataset are simple and unambiguous, providing a good benchmark for testing an LLM’s ability to find its own mistakes before moving on to harder, more ambiguous tasks.


Core questions about mistake identification


1. Can LLMs find logical mistakes in Chain-of-Thought style reasoning?

First, we want to find out whether LLMs can identify mistakes independently of their ability to correct them. We test GPT-series models with multiple prompting methods for their ability to locate mistakes (prompts here), under the assumption that their performance is generally representative of modern LLMs.

Generally, we found these state-of-the-art models perform poorly, with the best model achieving 52.9% accuracy overall. Hence, there is a need to improve LLMs’ ability in this area of reasoning.

In our experiments, we try three different prompting methods: direct (trace), direct (step), and CoT (step). In direct (trace), we provide the LLM with the whole trace and ask for the step containing the first mistake, or “no mistake” if there is none. In direct (step), we ask the LLM the same question for each individual step. In CoT (step), we prompt the LLM to also give its reasoning for whether each step is or is not a mistake.

A diagram showing the three prompting methods direct (trace), direct (step) and CoT (step).
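
For illustration, the templates below paraphrase the three setups. Treat them as hypothetical stand-ins: the exact prompts we used are linked above and worded differently.

```python
# Hypothetical paraphrases of the three prompting setups; the actual
# prompts used in the experiments are linked above.

DIRECT_TRACE = (
    "Here is a step-by-step solution:\n{trace}\n"
    "Give the number of the first incorrect step, "
    "or say 'no mistake' if every step is correct."
)

DIRECT_STEP = (  # asked once per step of the trace
    "Problem: {question}\nSteps so far:\n{steps_so_far}\n"
    "Is the last step correct? Answer 'yes' or 'no'."
)

COT_STEP = (  # also asked per step, but elicits reasoning first
    "Problem: {question}\nSteps so far:\n{steps_so_far}\n"
    "Explain whether the last step is a mistake, "
    "then answer 'yes' or 'no'."
)
```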

Our finding is in line with, and builds upon, prior results, but goes further in showing that LLMs struggle with even simple and unambiguous mistakes (for comparison, our human raters, who have no prior expertise, solve the problem with a high degree of agreement). We hypothesize that this is a big reason why LLMs are unable to self-correct reasoning errors. See the paper for the full results.


2. Can mistake-finding be used as a proxy for correctness of the answer?

When we are confronted with a problem whose answer we are unsure of, we can work through our solution step by step. If no error is found, we can assume we did the right thing.

While we hypothesized that this would work similarly for LLMs, we discovered that this is a poor strategy. On our dataset of 85% incorrect traces and 15% correct traces, using this method is not much better than the naïve strategy of always labeling traces as incorrect, which gives a weighted average F1 of 78.
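
The baseline number is easy to verify: always predicting “incorrect” scores F1 = 2·0.85/(0.85+1.0) ≈ 0.92 on the incorrect class and 0 on the correct class, and the class-weighted average of those is about 78. Here is a quick check, as a sketch using scikit-learn with our dataset’s class counts:

```python
# Verifying the naïve always-"incorrect" baseline on the 255/45 split.
from sklearn.metrics import f1_score

y_true = ["incorrect"] * 255 + ["correct"] * 45  # 85% / 15%
y_pred = ["incorrect"] * 300                     # always predict "incorrect"

score = f1_score(y_true, y_pred, average="weighted",
                 labels=["incorrect", "correct"], zero_division=0)
print(round(100 * score, 1))  # ~78.1
```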

A diagram showing how well mistake-finding with LLMs can be used as a proxy for correctness of the answer on each dataset.

3. Can LLMs backtrack knowing where the error is?

Since we’ve shown that LLMs perform poorly at finding reasoning errors in CoT traces, we want to know whether they can correct errors at all when they do know where the error is.

Note that knowing the mistake location is different from knowing the right answer: CoT traces can contain logical mistakes even if the final answer is correct, or vice versa. In most real-world situations, we won’t know what the right answer is, but we might be able to identify logical errors in intermediate steps.

We propose the following backtracking method (a minimal code sketch follows the list):

  1. Generate CoT traces as usual, at temperature = 0. (Temperature is a parameter that controls the randomness of generated responses, with higher values producing more diverse and creative outputs, usually at the expense of quality.)
  2. Identify the location of the first logical mistake (for example with a classifier, or here we just use labels from our dataset).
  3. Re-generate the mistake step at temperature = 1 and produce a set of eight outputs. Since the original output is known to lead to incorrect results, the goal is to find an alternative generation at this step that is significantly different from the original.
  4. From these eight outputs, select one that is different from the original mistake step. (We just use exact matching here, but in the future this can be something more sophisticated.)
  5. Using the new step, generate the rest of the trace as normal at temperature = 0.
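
Here is that sketch. It assumes a hypothetical generate(prompt, temperature) helper that returns a single model completion; it is illustrative rather than our exact experimental code.

```python
# A minimal sketch of backtracking; `generate` is a hypothetical helper
# that returns one model completion for a prompt at a given temperature.

def backtrack(question, steps, mistake_idx, generate, num_samples=8):
    """Re-generate a CoT trace from its first logical mistake onward."""
    prefix = steps[:mistake_idx]
    context = question + "\n" + "\n".join(prefix)

    # Step 3: sample eight alternative versions of the mistake step at T=1.
    candidates = [generate(context, temperature=1.0)
                  for _ in range(num_samples)]

    # Step 4: keep a candidate that differs from the original step
    # (exact-match filtering, as described above).
    different = [c for c in candidates if c != steps[mistake_idx]]
    if not different:
        return steps  # no sufficiently different alternative was found
    new_step = different[0]

    # Step 5: complete the rest of the trace greedily (T=0) from the new step.
    rest = generate(context + "\n" + new_step, temperature=0.0)
    return prefix + [new_step] + rest.split("\n")
```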

It’s a very simple method that does not require any additional prompt crafting and avoids having to re-generate the entire trace. We test it using the mistake location data from BIG-Bench Mistake, and we find that it can correct CoT errors.

Recent work showed that self-correction methods, like Reflexion and RCI, cause accuracy scores to deteriorate because more correct answers become incorrect than incorrect answers become correct. Our method, on the other hand, produces more gains (by correcting wrong answers) than losses (by changing right answers to wrong answers).

We also compare our method with a random baseline, where we randomly assume a step to be the mistake. Our results show that this random baseline does produce some gains, but fewer than backtracking with the correct mistake location, and with more losses.

A diagram showing the gains and losses in accuracy for our method as well as a random baseline on each dataset.

4. Can mistake finding generalize to tasks the LLMs have never seen?

To answer this question, we fine-tune a small model on four of the BIG-Bench tasks and test it on the fifth, held-out task. We do this for every task, producing five fine-tuned models in total. We then compare the results with zero-shot prompting of PaLM 2-L-Unicorn, a much larger model.
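
Schematically, the protocol is the leave-one-task-out loop below. The fine_tune and evaluate functions are placeholders rather than a real training API, and since only two of the five task names appear in this post, the others are stand-ins.

```python
# A schematic of the leave-one-task-out protocol; `fine_tune` and
# `evaluate` are placeholders, and three task names are stand-ins.

TASKS = ["dyck_languages", "logical_deduction",
         "task_3", "task_4", "task_5"]

def fine_tune(base_model, tasks):   # placeholder for real fine-tuning
    return (base_model, tuple(tasks))

def evaluate(model, task):          # placeholder for real evaluation
    return 0.0

for held_out in TASKS:
    train_tasks = [t for t in TASKS if t != held_out]
    model = fine_tune("small-reward-model", train_tasks)
    print(held_out, evaluate(model, held_out))
```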

Bar chart showing the accuracy improvement of the fine-tuned small model compared to zero-shot prompting with PaLM 2-L-Unicorn.

Our results show that the much smaller fine-tuned reward model generally performs better than zero-shot prompting a large model, even though the reward model has never seen data from the task in the test set. The only exception is logical deduction, where it performs on par with zero-shot prompting.

This is a very promising result as we can potentially just use a small fine-tuned reward model to perform backtracking and improve accuracy on any task, even if we don’t have the data for it. This smaller reward model is completely independent of the generator LLM, and can be updated and further fine-tuned for individual use cases.

An illustration showing how our backtracking method works.

Conclusion

In this work, we created an evaluation benchmark dataset that the wider academic community can use to evaluate future LLMs. We showed that LLMs currently struggle to find logical errors, but that when the mistake location is given, backtracking is an effective strategy that provides gains across tasks. Finally, a smaller reward model can be trained on general mistake-finding tasks and used to improve out-of-domain mistake finding, showing that mistake finding can generalize.


Acknowledgements

Thank you to Peter Chen, Tony Mak, Hassan Mansoor and Victor Cărbune for contributing ideas and helping with the experiments and data collection. We would also like to thank Sian Gooding and Vicky Zayats for their comments and suggestions on the paper.


Source: Google AI Blog


Chrome Dev for Android Update

Hi everyone! We've just released Chrome Dev 122 (122.0.6238.3) for Android. It's now available on Google Play.

You can see a partial list of the changes in the Git log. For details on new features, check out the Chromium blog, and for details on web platform updates, check here.

If you find a new issue, please let us know by filing a bug.

Erhu Akpobaro
Google Chrome

Chrome Dev for Desktop Update

The Dev channel has been updated to 122.0.6238.2 for Windows, Mac and Linux.

A partial list of changes is available in the Git log. Interested in switching release channels? Find out how. If you find a new issue, please let us know by filing a bug. The community help forum is also a great place to reach out for help or learn about common issues.

Prudhvi Bommana
Google Chrome

MiraclePtr: protecting users from use-after-free vulnerabilities on more platforms

Welcome back to our latest update on MiraclePtr, our project to protect against use-after-free vulnerabilities in Google Chrome. If you need a refresher, you can read our previous blog post detailing MiraclePtr and its objectives.

More platforms

We are thrilled to announce that since our last update, we have successfully enabled MiraclePtr for more platforms and processes:

  • In June 2022, we enabled MiraclePtr for the browser process on Windows and Android.
  • In September 2022, we expanded its coverage to include all processes except renderer processes.
  • In June 2023, we enabled MiraclePtr for ChromeOS, macOS, and Linux.

Furthermore, we have changed security guidelines to downgrade MiraclePtr-protected issues by one severity level!

Evaluating Security Impact

First, let’s focus on MiraclePtr’s security impact. Our analysis is based on two primary information sources: incoming vulnerability reports and crash reports from user devices. Let’s take a closer look at each of these sources and how they inform our understanding of MiraclePtr’s effectiveness.

Bug reports

Chrome vulnerability reports come from a variety of sources.

For the purposes of this analysis, we focus on vulnerabilities that affect platforms where MiraclePtr was enabled at the time the issues were reported. We also exclude bugs that occur inside a sandboxed renderer process. Since the initial launch of MiraclePtr in 2022, we have received 168 use-after-free reports matching our criteria.

What does the data tell us? MiraclePtr effectively mitigated 57% of these use-after-free vulnerabilities in privileged processes, exceeding our initial estimate of 50%. Reaching this level of effectiveness, however, required additional work. For instance, we not only rewrote class fields to use MiraclePtr, as discussed in the previous post, but also added MiraclePtr support for bound function arguments, such as Unretained pointers. These pointers have been a significant source of use-after-frees in Chrome, and the additional protection allowed us to mitigate 39 more issues.

Moreover, these vulnerability reports enable us to pinpoint areas needing improvement. We're actively working on adding support for select third-party libraries that have been a source of use-after-free bugs, as well as developing a more advanced rewriter tool that can handle transformations like converting std::vector<T*> into std::vector<raw_ptr<T>>. We've also made several smaller fixes, such as extending the lifetime of the task state object to cover several issues in the “this pointer” category.

Crash reports

Crash reports offer a different perspective on MiraclePtr's effectiveness. As explained in the previous blog post, when an allocation is quarantined, its contents are overwritten with a special bit pattern. If the allocation is used later, the pattern will often be interpreted as an invalid memory address, causing a crash when the process attempts to access memory at that address. Since the dereferenced address remains within a small, predictable memory range, we can distinguish MiraclePtr crashes from other crashes.
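
As a toy illustration of the idea, crash triage only needs to test whether the faulting address falls in the small window reachable by dereferencing pattern-filled memory. The constants below are invented for illustration; Chrome’s actual fill pattern and ranges differ.

```python
# A toy classifier for MiraclePtr-style crashes. The fill pattern and
# offset window are invented for illustration, not Chrome's real values.

QUARANTINE_PATTERN = 0xEFEFEFEFEFEFEFEF  # assumed quarantine fill pattern
FIELD_WINDOW = 0x1000                    # field accesses land near the base

def looks_like_quarantine_dereference(fault_address: int) -> bool:
    """True if the crash address is consistent with dereferencing a
    pointer value read from quarantined (pattern-filled) memory."""
    return QUARANTINE_PATTERN <= fault_address < QUARANTINE_PATTERN + FIELD_WINDOW
```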

Although this approach has its limitations — such as not being able to obtain stack traces from allocation and deallocation times like AddressSanitizer does — it has enabled us to detect and fix vulnerabilities. Last year, six critical severity vulnerabilities were identified in the default setup of Chrome Stable, the version most people use. Impressively, five of the six were discovered while investigating MiraclePtr crash reports! One particularly interesting example is CVE-2022-3038. The issue was discovered through MiraclePtr crash reports and fixed in Chrome 105. Several months later, Google's Threat Analysis Group discovered an exploit for that vulnerability used in the wild against clients of a different Chromium-based browser that hadn’t shipped the fix yet.

To further enhance our crash analysis capabilities, we've recently launched an experimental feature that allows us to collect additional information for MiraclePtr crashes, including stack traces. This effectively shortens the average crash report investigation time.

Performance

MiraclePtr gives us robust protection against use-after-free exploits, but it comes with a performance cost. We therefore ran experiments on each platform where we shipped MiraclePtr and used the results in our decision-making process.

The main cost of MiraclePtr is memory. Specifically, the memory usage of the browser process increased by 5.5-8% on desktop platforms and approximately 2% on Android. Yet, when examining holistic memory usage across all processes, the impact remains within a moderate 1-3% range, and only at the lower percentiles.

The main cause of the additional memory usage is the extra space needed to store the reference count. One might think that adding 4 bytes to each allocation wouldn’t be a big deal. However, Chrome makes many small allocations, so even a 4B overhead is not negligible. Moreover, PartitionAlloc uses pre-defined allocation bucket sizes, so this extra 4B pushes certain allocations (particularly power-of-2 sized ones) into a larger bucket, e.g. 4096B → 5120B.
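
The 4096B → 5120B example works out as follows; the bucket list here is only illustrative, since PartitionAlloc’s real bucket sizes are more fine-grained.

```python
# Illustrating the bucket effect: a 4-byte reference count turns a
# 4096-byte request into 4100 bytes, which no longer fits the 4096B
# bucket. Bucket sizes here are illustrative, not PartitionAlloc's list.

BUCKETS = [4096, 5120, 6144, 8192]

def bucket_for(size):
    return next(b for b in BUCKETS if b >= size)

print(bucket_for(4096))      # 4096: fits its bucket exactly
print(bucket_for(4096 + 4))  # 5120: the 4B ref count costs a whole KiB
```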

We also considered the performance cost. We verified that there were no regressions in the majority of our top-level performance metrics, including all of the page load metrics, like Largest Contentful Paint, First Contentful Paint, and Cumulative Layout Shift. We did find a few regressions, such as a 10% increase in the 99th percentile of the browser process main thread contention metric, a 1.5% regression in First Input Delay on ChromeOS, and a 1.5% regression in tab startup time on Android. The main thread contention metric estimates how often user input can be delayed; on Windows, for example, the change was from 1.6% to 1.7% at the 99th percentile only. These are all minor regressions: there has been zero change in daily active usage, and we do not anticipate them having any noticeable impact on users.

Conclusion

In summary, MiraclePtr has proven to be effective in mitigating use-after-free vulnerabilities and enhancing the overall security of the Chrome browser. While there are performance costs associated with the implementation of MiraclePtr, our analysis suggests that the security benefits far outweigh them. We are committed to continually refining and expanding the feature to cover more areas. For example, we are working to add coverage to third-party libraries used by the GPU process, and we plan to enable BRP in the renderer process. By sharing our findings and experiences, we hope to contribute to the broader conversation surrounding browser security and inspire further innovation in this crucial area.

Google Meet is now available on Logitech Android appliances

What’s changing 

Google Meet is now supported on Logitech’s Rally Bar and Rally Bar Mini Android-based appliances for collaboration rooms and spaces of just about any size. After initial setup, admins can easily enroll, manage, and monitor these devices using the Google Admin console. Google Meet is supported as a video conferencing provider on Logitech Android-based devices running CollabOS v1.11. The following Logitech Android devices now support Google Meet:
  • Logitech Rally Bar
  • Logitech Rally Bar Mini
  • Tap IP


Additional details

As part of this launch, we are also providing admins with a new capability to protect their room devices using a passcode. This ensures that only authorized users are able to access and change the room’s device settings. This feature is only available for Logitech Rally Bar and Rally Bar Mini in appliance mode, where Rally Bar’s built-in computer supports Google Meet without the need for an additional computer or a user’s laptop. Visit the Help Center to learn more about setting up Logitech devices as Meet Hardware and enrolling your devices.

Getting started

  • Admins: 
    • Logitech Rally Bar and Rally Bar Mini appliances will need to be updated to CollabOS 1.11 in order to select Google Meet as the conferencing partner application. 
    • Once the device is updated to CollabOS 1.11 and the conferencing partner is set to Google Meet, follow the on-device prompts to enroll the device onto the Google Meet hardware admin console. Visit the Help Center to learn more about setting up Logitech devices as Meet Hardware.
    • Google Meet on Logitech Android appliances requires Google Meet hardware licenses; to purchase licenses, please reach out to a Google Meet hardware reseller.

  • End users: No action required. Once a Logitech Rally Bar or Rally Bar Mini has been successfully enrolled, you can join Google Meet meetings as usual.

Rollout pace

  • This update is available as part of Logitech’s CollabOS 1.11 release. For more information, please reach out to your Logitech account team or reseller.

Availability

  • Available to Logitech Rally Bar and Rally Bar Mini customers. Support for additional Logitech devices will be added over time.
  • Available to all Google Workspace customers.


Resources



Easily share Google Drive files to Google Calendar meeting attendees

What’s changing

Since introducing the new sharing dialog for Google Drive, Docs, Sheets, Slides, and Forms in 2020, we’ve made several enhancements to make sharing effortless across Workspace. Today, we’re excited to announce the option to share any file with all meeting participants on a Google Calendar invite via the sharing dialog within a file. 


As a file owner or editor, go to the Share button in a file > type in the title of a calendar event > select the event > confirm that the correct list of meeting attendees is added > select the users’ access level > click Share.
Sharing “Weekly notes” to a meeting using the sharing dialog
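
The feature itself lives in the sharing dialog, but the same outcome can be approximated with the public Calendar and Drive APIs. The sketch below is hypothetical: credential setup is omitted, and the event ID, file ID, and reader role are assumptions.

```python
# A hypothetical sketch of the equivalent operation via the public
# Calendar and Drive APIs: read an event's attendee list and grant
# each attendee access to a file. IDs, role, and credentials are assumed.

from googleapiclient.discovery import build

def share_file_with_event_attendees(creds, event_id, file_id, role="reader"):
    calendar = build("calendar", "v3", credentials=creds)
    event = calendar.events().get(calendarId="primary",
                                  eventId=event_id).execute()
    attendees = [a["email"] for a in event.get("attendees", [])]

    drive = build("drive", "v3", credentials=creds)
    for email in attendees:
        drive.permissions().create(
            fileId=file_id,
            body={"type": "user", "role": role, "emailAddress": email},
            sendNotificationEmail=False,
        ).execute()
    return attendees
```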

If you’d like to link the file to the calendar invite, you can select “Attach to calendar event” before clicking Share. 
Attaching a file to a calendar event


Who’s impacted 

End users 


Why you’d use it 

We know sharing files is critical to building a collaborative environment. With this new feature, users can easily share files with meeting attendees before a meeting, ensuring everyone is prepared and able to collaborate on the same file. 


Additional details 

If you attach a file directly to a Calendar invite, you will see a pop-up asking if you'd like to share the file with the meeting attendees. 


Getting started 

  • Admins: There is no admin control for this feature. 
  • End users: To share a file to a calendar event, you must be the file owner or editor and a participant in the meeting that you’re sharing to on your calendar. Visit the Help Center to learn more about sharing files from Google Drive.

Rollout pace 


Availability 

  • Available to all Google Workspace customers and users with personal Google Accounts 

Resources