Croissant: a metadata format for ML-ready datasets

Machine learning (ML) practitioners looking to reuse existing datasets to train an ML model often spend a lot of time understanding the data, making sense of its organization, or figuring out what subset to use as features. So much time, in fact, that progress in the field of ML is hampered by a fundamental obstacle: the wide variety of data representations.

ML datasets cover a broad range of content types, from text and structured data to images, audio, and video. Even within datasets that cover the same types of content, every dataset has a unique ad hoc arrangement of files and data formats. This challenge reduces productivity throughout the entire ML development process, from finding the data to training the model. It also impedes development of badly needed tooling for working with datasets.

There are general purpose metadata formats for datasets such as schema.org and DCAT. However, these formats were designed for data discovery rather than for the specific needs of ML data, such as the ability to extract and combine data from structured and unstructured sources, to include metadata that would enable responsible use of the data, or to describe ML usage characteristics such as defining training, test and validation sets.

Today, we're introducing Croissant, a new metadata format for ML-ready datasets. Croissant was developed collaboratively by a community from industry and academia, as part of the MLCommons effort. The Croissant format doesn't change how the actual data is represented (e.g., image or text file formats) — it provides a standard way to describe and organize it. Croissant builds upon schema.org, the de facto standard for publishing structured data on the Web, which is already used by over 40M datasets. Croissant augments it with comprehensive layers for ML relevant metadata, data resources, data organization, and default ML semantics.

In addition, we are announcing support from major tools and repositories: Today, three widely used collections of ML datasets — Kaggle, Hugging Face, and OpenML — will begin supporting the Croissant format for the datasets they host; the Dataset Search tool lets users search for Croissant datasets across the Web; and popular ML frameworks, including TensorFlow, PyTorch, and JAX, can load Croissant datasets easily using the TensorFlow Datasets (TFDS) package.


Croissant

This 1.0 release of Croissant includes a complete specification of the format, a set of example datasets, an open source Python library to validate, consume and generate Croissant metadata, and an open source visual editor to load, inspect and create Croissant dataset descriptions in an intuitive way.

Supporting Responsible AI (RAI) was a key goal of the Croissant effort from the start. We are also releasing the first version of the Croissant RAI vocabulary extension, which augments Croissant with key properties needed to describe important RAI use cases such as data life cycle management, data labeling, participatory data, ML safety and fairness evaluation, explainability, and compliance.


Why a shared format for ML data?

The majority of ML work is actually data work. The training data is the “code” that determines the behavior of a model. Datasets can vary from a collection of text used to train a large language model (LLM) to a collection of driving scenarios (annotated videos) used to train a car’s collision avoidance system. However, the steps to develop an ML model typically follow the same iterative data-centric process: (1) find or collect data, (2) clean and refine the data, (3) train the model on the data, (4) test the model on more data, (5) discover the model does not work, (6) analyze the data to find out why, (7) repeat until a workable model is achieved. Many steps are made harder by the lack of a common format. This “data development burden” is especially heavy for resource-limited research and early-stage entrepreneurial efforts.

The goal of a format like Croissant is to make this entire process easier. For instance, the metadata can be leveraged by search engines and dataset repositories to make it easier to find the right dataset. The data resources and organization information make it easier to develop tools for cleaning, refining, and analyzing data. This information and the default ML semantics make it possible for ML frameworks to use the data to train and test models with a minimum of code. Together, these improvements substantially reduce the data development burden.

Additionally, dataset authors care about the discoverability and ease of use of their datasets. Adopting Croissant improves the value of their datasets, while only requiring a minimal effort, thanks to the available creation tools and support from ML data platforms.


What can Croissant do today?

The Croissant ecosystem: Users can search for Croissant datasets, download them from major repositories, and easily load them into their favorite ML frameworks. They can create, inspect and modify Croissant metadata using the Croissant editor.

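Today, users can find Croissant datasets at:

  • Dataset Search, which lets users search for Croissant datasets across the Web.
  • Kaggle, Hugging Face, and OpenML, which support the Croissant format for the datasets they host.

With a Croissant dataset, it is possible to:

  • Load the data into popular ML frameworks, including TensorFlow, PyTorch, and JAX, using the TensorFlow Datasets (TFDS) package.
  • Inspect and modify the metadata using the Croissant editor.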

To publish a Croissant dataset, users can:

  • Use the Croissant editor UI (available on GitHub) to generate a large portion of Croissant metadata automatically by analyzing the data the user provides, and to fill important metadata fields such as RAI properties.
  • Publish the Croissant information as part of their dataset Web page to make it discoverable and reusable.
  • Publish their data in one of the repositories that support Croissant, such as Kaggle, Hugging Face and OpenML, and automatically generate Croissant metadata.

Future direction

We are excited about Croissant's potential to help ML practitioners, but making this format truly useful requires the support of the community. We encourage dataset creators to consider providing Croissant metadata. We encourage platforms hosting datasets to provide Croissant files for download and embed Croissant metadata in dataset Web pages so that they can be made discoverable by dataset search engines. Tools that help users work with ML datasets, such as labeling or data analysis tools, should also consider supporting Croissant datasets. Together, we can reduce the data development burden and enable a richer ecosystem of ML research and development.

We encourage the community to join us in contributing to the effort.


Acknowledgements

Croissant was developed by the Dataset Search, Kaggle and TensorFlow Datasets teams from Google, as part of an MLCommons community working group, which also includes contributors from these organizations: Bayer, cTuning Foundation, DANS-KNAW, Dotphoton, Harvard, Hugging Face, King's College London, LIST, Meta, NASA, North Carolina State University, Open Data Institute, Open University of Catalonia, Sage Bionetworks, and TU Eindhoven.

Source: Google AI Blog


Designing your account deletion experience with users in mind

Posted by Tatiana van Maaren – Global T&S Partnerships Lead, Privacy & Security, May Smith - Product Manager, and Anita Issagholyan – Policy Lead

With millions of developers relying on our platform, Google Play is committed to keeping our ecosystem safe for everyone. That’s why, in addition to our ongoing investments in app privacy and security, we also continuously update our policies to respond to new challenges and user expectations.

For example, we recently introduced a new account deletion policy with required disclosures within the Data Safety section on the Play Store. Deleting an account should be as easy as creating one, so the new policy requires developers to provide information and web resources that help users to manage their data and understand an app's deletion practices.

To help you build trust and design a user-friendly experience that helps meet our policy requirements, consider these 5 best practices when implementing your account deletion solution.

1.     Make it seamless

Users prefer a simple and straightforward account deletion flow. Although users know that more steps may follow (such as authentication), navigating multiple screens before reaching the deletion page can be a significant barrier and create negative feelings for the user. Consider providing your account deletion option on an account settings page or placing a prominent button on the home screen. Design the flow with discoverability in mind by taking the user directly to the deletion process.

2.     Allow automatic deletion

Users feel that if they can create an account without talking to a customer service agent, they should be able to delete their account online, too. If automation is not on your roadmap just yet, consider a step-by-step deletion request form or a dedicated page to connect users with customer support.

3.     Offer guidance and explain potential implications

Users delete their accounts for various reasons, some of which may be better resolved another way. Early in your deletion flow, point your users toward a Help Center article that explains how your deletion process works in simple terms, including any potential consequences. For example, make it clear if your users will need to pause their payment method before deleting the account, or download any account data they want to keep. Helping your users understand the process in advance can prevent accidental deletions. For those who do change their minds, consider offering a way to recover their accounts within a reasonable timeframe.

Here’s an example of how Play Store developer Canva has designed its in-app deletion flow to explain the potential consequences of account deletion:

User journey on the Canva app (three panels)

“User data privacy has always been important for us. We’ve always been intentional about our approach in optimizing the Canva app so our users can have more transparency and control over their data. We’re welcoming these new requirements from the Play store as we know this new flow will elevate users’ trust in our app and benefit our business in the long term.” - Will Currie, Software Engineer, Canva

4.     Confirm account deletion

Sometimes users misunderstand whether the account itself or just data collected by the app was deleted in the deletion process. Users often think that the data your app stored in the cloud will automatically be deleted at the same time as account deletion. Since it may take time to remove account data from internal company systems or comply with data retention requirements in different regions, transparency about the process can help you maintain trust in your brand and make it more likely for users to return in the future.

Here’s how SYBO Games has designed their web deletion flow:

User journey on the SYBO Games web resource (four panels)

“We are always striving to ensure that our games provide a fun user experience, built on a solid data protection foundation. When we learned about the new account deletion update on Google Play, we thought this was a great step forward to ensure that the entire developer ecosystem optimizes for user safety. We encourage developers to carve out time to embrace these improvements and prioritize regular privacy check-ins.”  - Elizabeth Ann Schweitzer, Games Compliance Manager, SYBO Games

5.     Don’t forget user engagement

This is a great opportunity to connect with your users at a critical moment. Make sure users who have uninstalled your app can easily remove their accounts through a web resource without needing to reinstall the app. You can also invite them to complete a survey or provide feedback on their decision.

Protecting users' data is essential for building trust and loyalty. By updating the Data Safety section on Google Play and continuing to optimize user experience for account deletion, you can strengthen trust in your company while striving for the highest level of user data protection.


Thank you for your continued collaboration and feedback in developing this data transparency feature and in helping make Google Play safe for all.

Google Meet co-host support added for client-side encrypted meetings

What’s changing

Client-side encrypted meetings now support the co-hosts feature. This means that an organizer can plan and book client-side encrypted video meetings on behalf of other users and assign those users as co-hosts, allowing them to join and open the meetings independently of the organizer. Client-side encrypted meetings differ from point-to-point encrypted meetings in that they always require a host to join first. This task can now be delegated and shared between multiple users without the organizer ever joining the call.



Getting started


Rollout pace

Note: This feature is only available on the web as planning meetings with co-hosts can only be done on a computer.

Availability

  • Available to Google Workspace Enterprise Plus, Education Plus, and Education Standard customers

Resources


Set up dropdown chips more easily in Google Sheets

What’s changing

In 2022, we announced the ability to create dropdown chips in Google Sheets. This custom formatting smart canvas feature enables you to easily indicate statuses or various project milestones within your spreadsheet. 

Today, we’re building upon this by introducing preset dropdowns. Instead of manually creating values for your dropdowns, you can insert preset dropdown chips that are configured for common use cases like priority or review statuses. After inserting a preset dropdown, you can use the data validation sidebar to easily adjust the options or add styles. 


Getting started 


Rollout pace 


Availability 

  • Available to all Google Workspace customers, Google Workspace Individual subscribers, and users with personal Google Accounts

Resources

Deprecation of Full Path and Path Attribution reports in Bid Manager API

Starting May 1, 2024, requests to retrieve, create, or run Full Path and Path Attribution reports through the Bid Manager API will return an error. We deprecated both report types in February 2024. We announced this deprecation last November.

After deprecation, running a query using the ReportType FULL_PATH or PATH_ATTRIBUTION generates an empty report. Existing Query and Report resources of these types are still retrievable, and report files generated previously will still be available.

Starting on May 1, 2024, the ReportType values FULL_PATH and PATH_ATTRIBUTION and the pathQueryOptions field will sunset. As a result, requests to retrieve, create, or run reports of these types will return an error.

We’ve added these details to our change log. To avoid an interruption of service, we recommend that you stop creating, retrieving, or running any reports using these values before the applicable sunset date.
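One way to prepare for the sunset is to audit your own records of saved queries for the two affected report types. The sketch below is a minimal illustration in plain Kotlin, not Bid Manager API client code: SavedQuery is a hypothetical local record that you would fill from wherever you track your queries, and only the two ReportType strings come from this announcement.

```kotlin
// Report types named in this announcement as sunsetting on May 1, 2024.
val SUNSET_REPORT_TYPES = setOf("FULL_PATH", "PATH_ATTRIBUTION")

// Hypothetical local record of a saved query; this is not a Bid Manager API type.
data class SavedQuery(val queryId: String, val reportType: String)

// Returns the queries that will start failing once the sunset takes effect.
fun findAffectedQueries(queries: List<SavedQuery>): List<SavedQuery> =
    queries.filter { it.reportType in SUNSET_REPORT_TYPES }
```

Calling findAffectedQueries on your inventory gives the list of queries to stop running or to replace before the sunset date.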

If you have questions regarding these changes, please contact us using our support contact form.

Now, everything’s really up-to-date in Kansas City – 8 Gig available in all GFiber KC service areas


We’re launching 8 Gig in our largest and oldest market — the original Google Fiber metro — Kansas City. The city was recently designated as one of the country’s 31 Tech Hubs by the federal government, a move that recognized both the incredible assets and resources already in the area and the potential for innovation that that infrastructure represents. GFiber is proud to have been part of that growth and development over the past twelve years, helping to drive innovation and testing the limits of speed as our first test site for 20 Gig. 

KC was also one of the first GFiber cities to get our 5 Gig service. We’ve been steadily rolling out our new multigig products in Google Fiber cities across the country since early last year. To do this, we’re updating our network in each city to XGS PON to ensure that we can meet our customers’ needs as they grow over time. We’re upgrading our network holistically across all our markets. 

The monthly cost for 8 Gig includes symmetrical upload and download speeds of up to 8000 Mbps (wired), the GFiber Wi-Fi 6E Router (which allows for up to 1600 Mbps over Wi-Fi), and up to two Mesh Extenders for strong whole-home Wi-Fi coverage. As always, it also includes unlimited data, professional installation, and access to GFiber’s highly rated 24/7 customer service.

You may be wondering why it’s taken us about a year to bring 8 Gig to Kansas City when it was one of the original cities to get 5 Gig in 2023 while other cities have gotten 8 Gig sooner. KC is our largest market — so supporting 8 Gig speeds across our entire service area required us to touch infrastructure across every part of the Kansas City metro area. This is great news not just for our newest 8 Gig customers, but for all our KC customers (and in other cities too), because it means that GFiber’s network is set up to support rising demand for more speed for years to come.

Kansas City has been on the forefront of internet speed since the very beginning, and it’s still very much on the fast track — sign up for 8 Gig here!

Posted by Nick Saporito, Head of Product




Introducing a new Text-To-Speech engine on Wear OS

Posted by Ouiam Koubaa – Product Manager and Yingzhe Li – Software Engineer

Today, we’re excited to announce the release of a new Text-To-Speech (TTS) engine that is performant and reliable. Text-to-speech turns text into natural-sounding speech across more than 50 languages, powered by Google’s machine learning (ML) technology. The new text-to-speech engine on Wear OS uses decreased prosody ML models to bring faster synthesis on Wear OS devices.

Use cases for Wear OS’s text-to-speech include accessibility services, coaching cues for exercise apps, navigation cues, and reading incoming alerts aloud through the watch speaker or Bluetooth-connected headphones. The engine is meant for brief interactions, so it shouldn’t be used for reading aloud a long article or a long summary of a podcast.

How to use Wear OS’s TTS

Text-to-speech has long been supported on Android. Wear OS’s new TTS has been tuned to be performant and reliable on low-memory devices. All the Android APIs are still the same, so developers use the same process to integrate it into a Wear OS app. For example, TextToSpeech#speak can be used to speak specific text. The new engine is available on devices that run Wear OS 4 or higher.
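As a minimal sketch of a TextToSpeech#speak call, assuming the TextToSpeech instance has already reported SUCCESS in its OnInitListener: the helper name speakCue and the utterance ID format are illustrative choices, not part of the platform API.

```kotlin
import android.speech.tts.TextToSpeech

// Hypothetical helper: speaks a short cue and reports whether it was queued.
fun speakCue(tts: TextToSpeech, cue: String): Boolean {
    val result = tts.speak(
        cue,
        TextToSpeech.QUEUE_FLUSH,            // drop pending utterances so the cue plays promptly
        null,                                // no extra parameters for this request
        "cue-" + System.currentTimeMillis()  // utterance ID; the format here is arbitrary
    )
    // speak() only reports whether queuing succeeded, not whether playback finished.
    return result == TextToSpeech.SUCCESS
}
```

QUEUE_FLUSH suits the brief cues this engine is meant for; QUEUE_ADD would instead append the cue after anything already queued.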

When the user interacts with the Wear OS TTS for the first time following a device boot, the synthesis engine is ready in about 10 seconds. For special cases where developers want the watch to speak immediately after opening an app or launching an experience, the following code can be used to pre-warm the TTS engine before any synthesis requests come in.

import android.speech.tts.TextToSpeech
import android.util.Log
import java.io.File

// The declarations below belong inside an Activity or Service. The original
// snippet uses tts, TAG, and applicationContext without showing them, so the
// tts member declaration here is an assumption.
private lateinit var tts: TextToSpeech

private fun initTtsEngine() {
    // Callback invoked once the TextToSpeech connection is set up
    val callback = TextToSpeech.OnInitListener { status ->
        if (status == TextToSpeech.SUCCESS) {
            Log.i(TAG, "tts Client Initialized successfully")

            // Get the default TTS voice
            val defaultVoice = tts.voice
            if (defaultVoice == null) {
                Log.w(TAG, "defaultVoice == null")
                return@OnInitListener
            }

            // Set the TTS engine to use the default voice's locale
            tts.language = defaultVoice.locale

            try {
                // Create a temporary file to synthesize sample text
                val tempFile =
                        File.createTempFile("tmpsynthesize", null, applicationContext.cacheDir)

                // Synthesize sample text to our file
                tts.synthesizeToFile(
                        /* text= */ "1 2 3", // Some sample text
                        /* params= */ null, // No params necessary for a sample request
                        /* file= */ tempFile,
                        /* utteranceId= */ "sampletext"
                )

                // Mark the file for deletion when the process exits
                tempFile.deleteOnExit()
            } catch (e: Exception) {
                Log.e(TAG, "Unhandled exception: ", e)
            }
        }
    }

    tts = TextToSpeech(applicationContext, callback)
}

When you are done using TTS, you can release the engine by calling tts.shutdown() in your activity’s onDestroy() method. Also call it when closing an app that uses TTS.
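The teardown described above might look like the following sketch. CoachingActivity is a hypothetical activity name, and holding the engine in a nullable property is one possible design (an assumption here), chosen so that teardown is safe even if initialization never ran.

```kotlin
import android.app.Activity
import android.speech.tts.TextToSpeech

class CoachingActivity : Activity() {
    // Assigned elsewhere, for example in onCreate(); stays null if TTS was never set up.
    private var tts: TextToSpeech? = null

    override fun onDestroy() {
        tts?.stop()      // interrupt any utterance that is still playing
        tts?.shutdown()  // release the resources held by the engine connection
        tts = null
        super.onDestroy()
    }
}
```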

Languages and Locales

By default, Wear OS TTS includes 7 pre-loaded languages in the system image: English, Spanish, French, Italian, German, Japanese, and Mandarin Chinese. OEMs may choose to preload a different set of languages. You can check what languages are available by using TextToSpeech#getAvailableLanguages(). During watch setup, if the user selects a system language for which no voice file is pre-loaded, the watch automatically downloads the corresponding voice file the first time the user connects to Wi-Fi while charging their watch.
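Because the pre-loaded set can vary by device, an app may want to choose a locale from whatever the engine reports. The following is a small sketch of that selection logic in plain Kotlin. pickSupportedLocale is a hypothetical helper, and matching on the language code alone when there is no exact match is a simplifying assumption that may not suit every app.

```kotlin
import java.util.Locale

// Hypothetical helper: chooses the locale to speak in from the engine's available set.
fun pickSupportedLocale(preferred: Locale, available: Set<Locale>, fallback: Locale): Locale {
    // Prefer an exact match (same language and country).
    if (preferred in available) return preferred
    // Otherwise accept the first available locale that shares the language code.
    return available.firstOrNull { it.language == preferred.language } ?: fallback
}

// On a device, the available set would come from the engine, for example:
//   val available = tts.availableLanguages ?: emptySet()
```

Keeping the selection separate from the Android call makes the fallback rule easy to unit test without a device.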

There are limited cases where the speech output may differ from the user’s system language. For example, in a scenario where a safety app uses TTS to call emergency responders, developers might want to synthesize speech in the language of the locale the user is in, not in the language the user has their watch set to. To synthesize text in a different language from system settings, use TextToSpeech#setLanguage(java.util.Locale).
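Since the requested voice data may not be on the watch, it can help to check the value that setLanguage returns before speaking. A minimal sketch, assuming an already initialized TextToSpeech instance; trySetSpeechLocale is a hypothetical helper name.

```kotlin
import android.speech.tts.TextToSpeech
import java.util.Locale

// Hypothetical helper: returns true only if the engine accepted the requested locale.
fun trySetSpeechLocale(tts: TextToSpeech, locale: Locale): Boolean {
    return when (tts.setLanguage(locale)) {
        TextToSpeech.LANG_AVAILABLE,
        TextToSpeech.LANG_COUNTRY_AVAILABLE,
        TextToSpeech.LANG_COUNTRY_VAR_AVAILABLE -> true
        // LANG_MISSING_DATA or LANG_NOT_SUPPORTED: the voice cannot be used right now.
        else -> false
    }
}
```

When the helper returns false, an app could keep speaking in the current language instead of failing silently.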

Conclusion

Your Wear OS apps now have the power to talk, either directly from the watch’s speakers or through Bluetooth connected headphones. Learn more about using TTS.

We look forward to seeing how you use the text-to-speech engine to create more helpful and engaging experiences for your users on Wear OS!


Copyright 2023 Google LLC.
SPDX-License-Identifier: Apache-2.0