Tag Archives: Open source

Celebrating Google Dev Library’s Women Contributors in AI/ML

Posted by Swathi Dharshna Subbaraj, Google Dev Library

Women have made remarkable progress in advancing AI/ML technology through their contributions to open source projects. They have developed and maintained tools, algorithms, and frameworks that enable researchers, developers, and businesses to create and implement cutting edge AI/ML solutions.

To celebrate those achievements, Google Dev Library has featured outstanding contributions from developers worldwide. It has also provided an opportunity to showcase contributions from women developers who are working on AI/ML projects. Read on to learn about their projects and insights.

Contributors in Spotlight


Suzen Fylke

Suzen is a machine learning engineer with a passion for helping mission-driven and socially-minded companies leverage AI and data to drive impactful outcomes. With 3 years of experience at Twitter, Suzen developed platform tools that streamlined model development and deployment processes, allowing for faster iteration and improved efficiency. Sue recently shared her blog post titled "How to Visualize Custom TFX Artifacts With InteractiveContext" with Dev Library. Let's speak with Sue and learn more about her experience.

Headshot of Suzen Fylke, smiling

1.    Tell us more about your recent Dev Library submission on inspecting TFX artifacts with InteractiveContext. Why do you consider it invaluable for debugging TFX pipelines?
    One of my favorite things about TFX is being able to run pipeline steps individually and interactively inspect their results with InteractiveContext. I used to think you could only display standard artifacts with built-in visualizations, but, as it turns out, you can also use InteractiveContext with custom artifacts. Since I hadn't found any examples or documentation explaining how to display custom artifacts, I wrote a tutorial.


    2.    Can you walk me through your process for creating technical documentation for your projects to help other developers?   

    When I create technical documentation for work or open source projects, I do my best to follow the community's best practices and style guides and to center the reader. I think a lot about what readers can hope to learn or be able to do after reading the docs. I followed a similar approach when writing the tutorial I submitted.

    Most of my personal projects are active learning exercises. When I write about such projects, I focus much more on the process of building them than on the outcome. So, in addition to showing how they work, I describe what inspired me to create them, the challenges I encountered, and what's next for the project. I also include lots of links to resources I found helpful for understanding the tools and concepts I learned about.

3.    What advice do you have for other women interested in developing open source AI/ML projects, and how can they get started?

    I recommend contributing to communities you care about and projects you use and want to help improve. Create things using the project. Ask questions when documentation needs to be clarified. Report bugs when you encounter them. If you build something cool, demo it or write about it. If you find a problem you can fix, volunteer to do so. And if you get stuck or don't understand something, ask for help. I also recommend reading GitHub's "How to Contribute to Open Source" guide (https://opensource.guide/how-to-contribute/). My favorite takeaway is that open source projects are more than code and that there are many different ways to contribute based on your interests.

    4.    Your Dev Library author profile bio states that you’re exploring how to “make learning languages fun and approachable.” Can you walk me through that process? 
     
    This is aspirational and mainly a hobby right now. I love learning languages and learning how to learn languages. Languages are my "thing I can talk about for hours without getting bored." I don't actually have a process for this. Instead, I do a lot of exploring and experimenting and let my curiosity guide me. Sometimes this involves reading linguistics textbooks, trying different language-learning apps, contributing to projects like Common Voice, or learning how to use libraries like spaCy.

    5.    How do you see the field of open source AI/ML development evolving in the coming years, and how are you preparing for these changes?
    I see the continued development of tools and platforms aimed at democratizing machine learning. I hope this will enable people to meaningfully engage with the models and AI-powered products they use and better understand how they work. I also hope this will lead to more grassroots participatory research communities like Masakhane and encourage people without ML or software engineering backgrounds to create and contribute to open source projects.

Aqsa Kausar

Aqsa is a passionate machine learning engineer with a strong curiosity for technology and a desire to share ideas with others. She has practical experience in diverse projects, including footfall forecasting, cataract detection, augmented reality, object detection, and recommender systems. Aqsa shared her blog post titled "Callbacks in TensorFlow — Customize the Behavior of your training" with Dev Library. Let's speak with Aqsa and learn more about her experience.

Photo of Aqsa Kausar holding a microphone

1.    Being Pakistan’s first Google Developer Expert (GDE), how do you approach building inclusive and diverse communities around you?
      As a Google Developer Expert (GDE), my responsibility is to help improve the tech community through inclusive and diverse events, workshops, and mentorship. With support from Google, fellow GDEs, and Google Developer Groups, we aim to create accessible opportunities for everyone, regardless of their background or experience level. As a speaker, I share my knowledge in ML with diverse audiences and offer mentorship to underrepresented individuals in tech, including women, minorities, and individuals from different backgrounds. I provide guidance on educational and career opportunities and connect people with resources, catering to as many as I can through various means of communication.

      2.     How do you approach collaborating with other developers on open source AI/ML projects, and what are some best practices you follow to ensure success?

      In our GDE community, we have active open source contributors who collaborate in groups for tutorials, research papers, and more. Collaboration is encouraged, and Googlers sometimes lead open source projects with GDEs. When you express interest, developers are open to working together. To foster a positive culture, we emphasize value and respect, clear goals, manageable tasks, communication channels, open communication, constructive feedback, and celebrating milestones. Successful collaboration hinges on valuing each other's time and skills.

      3.    How do you balance the need for technical rigor with the need for usability and accessibility in your open source projects?

      Understanding your audience and their needs is crucial to strike the right balance between technical rigor and usability. Simplify technical concepts for non-technical audiences and focus on practical applications. In open source projects, you have more flexibility, but in workshops or training, choose tools and technologies suitable for your audience. For beginners, use simpler language and interactive demos. For intermediate or advanced audiences, go deeper into technical details with coding snippets and complex concepts.

4.    Why do you think it is important for technical writers to revise their content or projects regularly? Do you think it’s important that every tech writer or open source maintainer follow this best practice?

      Technology is ever-changing, so technical writers need to revise content regularly to ensure accuracy. Feedback from the audience can help make it accessible and relevant. However, contributors may not always have time to update their work due to busy schedules. Nevertheless, tech blogs and projects still provide a valuable kickstart for new developers, who can contribute with updates or follow-up blogs.

      5.    Can you tell me about a project you've worked on that you're particularly proud of, and what impact it has had on the open source community?

      I have been part of impactful initiatives such as Google Women Developer Academy, where I was a mentor for their pilot. The program helps women in tech improve their communication skills and prepares them for showcasing their talents, boosting their confidence. I also collaborated with fellow Google Developer Experts (GDEs) during the COVID-19 pandemic to create an open source course called "ML for Rookies," which simplifies machine learning concepts. Currently, I am working on a Cloud AI project supported by GCP and have started an open source "Cloud Playground" repo to make cloud-ai learning more accessible.

Margaret Maynard-Reid

Margaret, an ML Google Developer Expert (GDE) since 2018, is an ML research engineer who applies AI/ML to real world applications ranging from climate change to art and design. With expertise in deep learning, computer vision, TensorFlow, and on-device ML, she often writes and speaks at conferences. Margaret has shared multiple projects in topics like TensorFlow Lite with Dev Library. Let's speak with Margaret and learn more about her experience.

      Photo of Margaret Maynard-Reid, smiling

      1.    Can you share the Google technologies you work with?  
       
      Some of the Google technologies I work with are TensorFlow, TensorFlow Lite, Keras, Android, MediaPipe, and ML Kit. 

      2.    How do you approach collaborating with other developers on open source projects, and what are some best practices you follow to ensure a successful collaboration? 

      I’ve collaborated with Googlers, ML GDEs, students and professionals in tech. Consistent communication and observing best practices, such as code check-in and code reviews, are helpful to ensure a successful collaboration. 

      3.    What is your development process like for creating and maintaining open source AI/ML projects, and how do you prioritize which projects to work on? 

There is limited time, so prioritization is super important. I like to showcase new technologies or areas that developers, including myself, may have challenges with. Aside from code and tutorials, I also like to share my knowledge with sketchnotes and visual illustrations.

      4.    You have been sharing learning resources on TensorFlow Lite. What advice do you have for other women interested in developing open source projects, and how can they get started? 
       
      There are many ways to contribute to open source projects: provide feedback on documentation or product features; write a tutorial with sample code; help fix bugs or contribute to libraries etc. It’s best to start simple and easy first, and then progress to more challenging projects. 

      5.    How do you see the field of open source AI/ML development evolving in the coming years, and how are you preparing for these changes? 

      Open source is becoming increasingly important for AI/ML development, evident in the recent development of generative AI and on-device machine learning for example. There will be even more opportunities for open source projects. Keep contributing because open source projects are a great way to learn the latest while helping others.

      Are you actively contributing to the AI/ML community? Become a Google Dev Library Contributor!

      Google Dev Library is a platform for showcasing open source projects featuring Google technologies. Join our global community of developers to showcase your projects. Submit your content.

      Rust fact vs. fiction: 5 Insights from Google’s Rust journey in 2022

Having reached version 1.0 only in 2015, Rust is a relatively new language with a lot to offer. Developers eyeing the performance and safety guarantees that Rust provides have to wonder whether they can simply use Rust in place of what they've been using previously. What would happen if large companies tried to use it in their existing environments? How long would it take for developers to learn the language? Once they do, would they be productive?

In this post, we will analyze some data covering years of early adoption of Rust here at Google. At Google, we have been seeing increased Rust adoption, especially in our consumer applications and platforms. Drawing on the over 1,000 Google developers who have authored and committed Rust code as part of their work in 2022, we’ll address some rumors head-on, both confirming some issues that could be improved and sharing some enlightening discoveries we have made along the way.

We’d particularly like to thank one of our key training vendors, Ferrous Systems, for their support as we started our Rust adoption here at Google. We also want to highlight some new freely available self-service training materials called Comprehensive Rust 🦀 that we and the community have worked on over the last few quarters.

Rumor 1: Rust takes more than 6 months to learn – Debunked!

All survey participants are professional software developers (or work in a related field) employed at Google. While some of them had prior Rust experience (about 13%), most of them came to Rust from C/C++, Python, Java, Go, or Dart.

Based on our studies, more than two-thirds of respondents are confident contributing to a Rust codebase within two months or less of learning the language. Further, a third of respondents become as productive using Rust as they are in other languages within two months or less. Within four months, that number increases to over 50%. Anecdotally, these ramp-up numbers are in line with the time we’ve seen for developers to adopt other languages, both inside and outside of Google.

      Overall, we’ve seen no data to indicate that there is any productivity penalty for Rust relative to any other language these developers previously used at Google. This is supported by the students who take the Comprehensive Rust 🦀 class: the questions asked on the second and third day show that experienced software developers can become comfortable with Rust in a very short time.

      Pie graph depicting time until confident writing Rust. Still ramping up = 8.6% (orange), 2-3 weeks = 27% (blue), 1-2 months = 39.8% (red), 3-4 months = 15.6% (yellow), More than 4 months = 9% (green)

Rumor 2: The Rust compiler is not as fast as people would like – Confirmed!

      Slow build speeds were by far the #1 reported challenge that developers have when using Rust, with only a little more than 40% of respondents finding the speed acceptable.

There is already a fantastic community-wide effort to improve and track rustc performance. This effort is supported by both volunteers and several companies (including Google), and we’re delighted to see key developers working in this space; clearly, continuing and potentially growing support here would be beneficial.

Rumor 3: Unsafe code and interop are always the biggest challenges – Debunked!

      The top three challenging areas of Rust for current Google developers were:

Writing unsafe code and handling C/C++ interop were cited as things Google developers had encountered, but they were not top challenges. The top three areas are places where the Rust Language Design Team has been investing in flattening the learning curve as well as in continued evolution, and our internal survey results strongly agree with these as areas of investment.

Rumor 4: Rust has amazing compiler error messages – Confirmed!

Rust is commonly regarded as having some of the most helpful error messages in the compiler space, and that held up in this survey as well. Only 9% of respondents are not satisfied with the quality of diagnostic and debugging information in Rust. Feedback from Comprehensive Rust 🦀 participants shows the same: people are amazed by the compiler messages. At first this comes as a surprise, since people are used to ignoring long compiler errors, but once they get used to Rust's messages, they love them.

      The following are excerpts from an exercise some internal Googlers have been doing to practice Rust – solving Advent of Code 2021 in Rust.

      On Day 5 of the exercises, we need to perform a search for entries within a table. The error below not only detects that our pattern matching on the result was missing a case, but also makes a suggestion for a fix.

      Image of code snippet showing error detection message for pattern matching in Rust

      On Day 11, we need to check for whether an element is within the bounds of a grid. The Rust warning below detects that we have a redundant comparison due to the fact that the types are unsigned, and suggests code that could be removed.

      Image of code snippet showing error detection message for redundant comparison in Rust

      Rumor 5: Rust code is high quality – Confirmed!

      The respondents said that the quality of the Rust code is high — 77% of developers were satisfied with the quality of Rust code. In fact, when asked to compare whether they felt that Rust code was more correct than the code that they write in other languages, an overwhelming 85% of respondents are confident that their Rust code is correct.

      And, it’s not just correct—it’s also easy to review. More than half of respondents say that Rust code is incredibly easy to review. As an engineering manager, that result is in many ways at least as interesting to me as the code authoring results, since code reviewing is at least as large a part of the role of a professional software engineer as authoring.

      As both we at Google and others have noted, developer satisfaction and productivity are correlated with both code quality and how long it takes to get a code review. If Rust is not only better for writing quality code, but also better for getting that code landed, that’s a pretty compelling set of reasons beyond even performance and memory safety for companies to be evaluating and considering adopting it.

      Looking forward

While over a thousand developers is a good sample of engineers, we look forward to further adoption and a future survey that includes many more use cases. In addition, while many of the developers surveyed joined teams without Rust experience, this population does skew toward excited early adopters more than a broader survey would. Stay tuned over the next year for another update!

By Lars Bergstrom, PhD – Android Platform Programming Languages, and Kathy Brennan, PhD – Low-level Operating Systems Sr. User Experience Researcher

      Optimizing gVisor filesystems with Directfs

      gVisor is a sandboxing technology that provides a secure environment for running untrusted code. In our previous blog post, we discussed how gVisor performance improves with a root filesystem overlay. In this post, we'll dive into another filesystem optimization that was recently launched: directfs. It gives gVisor’s application kernel (the Sentry) secure direct access to the container filesystem, avoiding expensive round trips to the filesystem gofer.

      Origins of the Gofer

      gVisor is used internally at Google to run a variety of services and workloads. One of the challenges we faced while building gVisor was providing remote filesystem access securely to the sandbox. gVisor’s strict security model and defense in depth approach assumes that the sandbox may get compromised because it shares the same execution context as the untrusted application. Hence the sandbox cannot be given sensitive keys and credentials to access Google-internal remote filesystems.

To address this challenge, we added a trusted filesystem proxy called a "gofer". The gofer runs outside the sandbox and provides a secure interface for untrusted containers to access such remote filesystems. For architectural simplicity, gofers were used to serve local filesystems as well as remote ones.

      Gofer process intermediates filesystem operations

      Isolating the Container Filesystem in runsc

      When gVisor was open sourced as runsc, the same gofer model was copied over to maintain the same security guarantees. runsc was configured to start one gofer process per container which serves the container filesystem to the sandbox over a predetermined protocol (now LISAFS). However, a gofer adds a layer of indirection with significant overhead.

      This gofer model (built for remote filesystems) brings very few advantages for the runsc use-case, where all the filesystems served by the gofer (like rootfs and bind mounts) are mounted locally on the host. The gofer directly accesses them using filesystem syscalls.

Linux provides some security primitives to effectively isolate local filesystems. These include mount namespaces, pivot_root, and detached bind mounts1. Directfs is a new filesystem access mode that uses these primitives to expose the container filesystem to the sandbox in a secure manner. The sandbox’s view of the filesystem tree is limited to just the container filesystem. The sandbox process is not given access to anything mounted on the broader host filesystem. Even if the sandbox gets compromised, these mechanisms provide additional barriers to prevent broader system compromise.

      Directfs

      In directfs mode, the gofer still exists as a cooperative process outside the sandbox. As usual, the gofer enters a new mount namespace, sets up appropriate bind mounts to create the container filesystem in a new directory and then pivot_root(2)s into that directory. Similarly, the sandbox process enters new user and mount namespaces and then pivot_root(2)s into an empty directory to ensure it cannot access anything via path traversal. But instead of making RPCs to the gofer to access the container filesystem, the sandbox requests the gofer to provide file descriptors to all the mount points via SCM_RIGHTS messages. The sandbox then directly makes file-descriptor-relative syscalls (e.g. fstatat(2), openat(2), mkdirat(2), etc) to perform filesystem operations.

      Sandbox directly accesses container filesystem with directfs
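
gVisor itself is written in Go, but the underlying Linux mechanics are straightforward to demonstrate. Below is a minimal Python sketch (assuming Linux and Python 3.9+) of the same idea: a gofer-like parent passes a mount-point file descriptor to a sandbox-like child over SCM_RIGHTS, and the child then performs only fd-relative operations on it. The paths and directory layout are illustrative placeholders, not gVisor's actual implementation.

# Conceptual sketch (Python for brevity; gVisor itself is written in Go):
# a gofer-like parent hands a mount-point file descriptor to a sandbox-like
# child over a Unix socket using SCM_RIGHTS; the child then performs only
# fd-relative operations. Paths below are illustrative placeholders.
import os
import socket

def gofer(sock: socket.socket, mount_path: str) -> None:
    # Open the container mount point and hand its descriptor to the sandbox.
    mount_fd = os.open(mount_path, os.O_PATH | os.O_DIRECTORY)
    socket.send_fds(sock, [b"rootfs"], [mount_fd])  # SCM_RIGHTS message
    os.close(mount_fd)

def sandbox(sock: socket.socket) -> None:
    # Receive the mount-point descriptor; no host paths are ever used here.
    _, fds, _, _ = socket.recv_fds(sock, 1024, 1)
    mount_fd = fds[0]
    # fd-relative operations, analogous to mkdirat(2), openat(2), fstat(2).
    os.mkdir("logs", dir_fd=mount_fd)
    fd = os.open("etc/hostname", os.O_RDONLY | os.O_NOFOLLOW, dir_fd=mount_fd)
    print(os.fstat(fd).st_size)
    os.close(fd)
    os.close(mount_fd)

if __name__ == "__main__":
    gofer_sock, sandbox_sock = socket.socketpair()
    if os.fork() == 0:           # child: plays the role of the sandbox
        gofer_sock.close()
        sandbox(sandbox_sock)
        os._exit(0)
    sandbox_sock.close()         # parent: plays the role of the gofer
    gofer(gofer_sock, "/tmp/container-rootfs")
    os.wait()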

      Earlier when the gofer performed all filesystem operations, we could deny all these syscalls in the sandbox process using seccomp. But with directfs enabled, the sandbox process's seccomp filters need to allow the usage of these syscalls. Most notably, the sandbox can now make openat(2) syscalls (which allow path traversal), but with certain restrictions: O_NOFOLLOW is required, no access to procfs and no directory FDs from the host. We also had to give the sandbox the same privileges as the gofer (for example CAP_DAC_OVERRIDE and CAP_DAC_READ_SEARCH), so it can perform the same filesystem operations.

      It is noteworthy that only the trusted gofer provides FDs (of the container filesystem) to the sandbox. The sandbox cannot walk backwards (using ‘..’) or follow a malicious symlink to escape out of the container filesystem. In effect, we've decreased our dependence on the syscall filters to catch bad behavior, but correspondingly increased our dependence on Linux's filesystem isolation protections.

      Performance

Making RPCs to the gofer for every filesystem operation adds a lot of overhead to runsc, so avoiding gofer round trips significantly improves performance. Let's find out what this means for some of our benchmarks. We will run the benchmarks using our newly released systrap platform on bind mounts (as opposed to rootfs). This simulates more realistic use cases, because bind mounts are used extensively when configuring filesystems in containers. Bind mounts also do not have an overlay (unlike the rootfs mount), so all operations go through the goferfs / directfs mount.

      Let's first look at our stat micro-benchmark, which repeatedly calls stat(2) on a file.
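
As a rough illustration of what such a micro-benchmark measures (this is not gVisor's actual benchmark harness), one can time a tight loop of stat(2) calls and compare the rate inside and outside the sandbox:

# Minimal sketch of a stat micro-benchmark: time repeated stat(2) calls on a
# single file. The path is a placeholder; run inside and outside the sandbox
# to compare rates.
import os
import time

def bench_stat(path: str, iterations: int = 100_000) -> float:
    start = time.perf_counter()
    for _ in range(iterations):
        os.stat(path)                      # issues a stat(2)/fstatat(2) syscall
    return iterations / (time.perf_counter() - start)

print(f"{bench_stat('/etc/hostname'):,.0f} stat calls per second")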

Stat benchmark improvement with directfs

The stat(2) syscall is more than 2x faster! However, since this is not representative of real-world applications, we should not extrapolate these results. So let's look at some real-world benchmarks.

Real-world workload benchmark improvement with directfs

We see a 12% reduction in the absolute time to run these workloads and a 17% reduction in Ruby load time!

      Conclusion

      The gofer model in runsc was overly restrictive for accessing host files. We were able to leverage existing filesystem isolation mechanisms in Linux to bypass the gofer without compromising security. Directfs significantly improves performance for certain workloads. This is part of our ongoing efforts to improve gVisor performance. You can learn more about gVisor at gvisor.dev. You can also use gVisor in GKE with GKE Sandbox. Happy sandboxing!


      1Detached bind mounts can be created by first creating a bind mount using mount(MS_BIND) and then detaching it from the filesystem tree using umount(MNT_DETACH).


      By Ayush Ranjan, Software Engineer – Google

      OpenTitan RTL freeze

      We are excited to announce that the OpenTitan® coalition has successfully reached a key milestone—RTL freeze of its first engineering sample release candidate! A snapshot of our high quality, open source silicon root of trust hardware implementation has been released for synthesis, layout and fabrication. We expect engineering sample chips to be available for lab testing and evaluation by the end of 2023.

      This is a major achievement that represents the culmination of a multi-year investment and long-term, coordinated effort by the project’s active community of commercial and academic partners—including Google, G+D Mobile Security, ETH Zurich, Nuvoton, Winbond, Seagate, Western Digital, Rivos, and zeroRISC, plus a number of independent contributors. The OpenTitan project and its community are actively supported and maintained by lowRISC C.I.C., an independent non-profit.

      Hitting this milestone demonstrates that large-scale engineering efforts can be successful when many organizations with aligned interests collaborate on an open source project. It also matters because traditionally, computing ecosystems have had to depend heavily on proprietary hardware (silicon) and software solutions to provide foundational, or “root,” trust assurances to their users. OpenTitan fundamentally changes that paradigm for the better, delivering secure root of trust silicon technology which is open source, high quality, and publicly verifiable.

Our belief is that core security features, like the authenticity of the root of trust and the firmware it executes, should be safely commoditized rights guaranteed to the end user—not areas for differentiation. To that end, we have made available a high-quality, industrial strength, reusable ecosystem of OpenTitan blocks, top-levels, infrastructure, and software IP adaptable for many use cases, delivered under a permissive, no-cost license and with known-good provenance. OpenTitan's now-complete, standalone “Earl Grey” chip implementation, design verification, full-chip testing, and continuous integration (CI) infrastructure are all available on GitHub today.

Flowchart illustrating the silicon process and OpenTitan

      The silicon process and OpenTitan

      This release means the OpenTitan chip digital design is complete and has been verified to be of sufficiently high quality that a tapeout is expected to succeed. In other words, the logical design is judged to be of sufficient maturity to translate into a physical layout and create a physical chip. The initial manufacturing will be performed in a smaller batch, delivering engineering samples which allow post-silicon verification of the physical silicon, prior to creating production devices in large volume.


      Earl Grey: Discrete implementation of OpenTitan

      Design Verification

      Industrial quality implementation has been a core tenet of the OpenTitan project from the outset, both to ensure the design meets its goals—including security—and to ensure the first physical chips are successful. OpenTitan’s hardware development stages ensure all hardware blocks go through several gating design and verification reviews before final integration signoff. This verification has required development of comprehensive testbenches and test infrastructure, all part of the open source project. Both individual blocks and the top-level Earl Grey design have functional and code coverage above 90%—at or above the standards of typical closed-source designs—with 40k+ tests running nightly and reported publicly via the OpenTitan Design Verification Dashboard. Regressions are caught and resolved quickly, ensuring design quality is maintained over the long term.

      Software tooling

      OpenTitan has led the way in making open source silicon a reality, and doing so requires much more than just open source silicon RTL and Design Verification collateral. Successful chips require real software support to have broad industry impact and adoption. OpenTitan has created generalizable infrastructure for silicon projects (test frameworks, continuous integration infrastructure, per-block DIFs), host tools like opentitantool to support interactions with all OpenTitan instances, and formal releases (e.g. the ROM to guarantee important security functionality such as firmware verification and ownership transfer).

      Documentation

      A good design isn’t worth much if it’s hard to use. With this in mind, thorough and accurate documentation is a major component of the OpenTitan project too. This includes a Getting Started Guide, which is a ‘from scratch’ walkthrough on a Linux workstation, covering software and tooling installation, and hardware setup. It includes a playbook to run local simulations or even emulate the entire OpenTitan chip on an FPGA.

      Furthermore, OpenTitan actively maintains live dashboards of quality metrics for its entire IP ecosystem (e.g. regression testing and coverage reports). If you’re new to open source silicon development, there are comprehensive resources describing project standards for technical contribution that have been honed to effectively facilitate inter-organizational collaboration.

      Thriving open source community

      OpenTitan’s broad community has been critical to its success. As the following metrics show (baselined from the project’s public launch in 2019), the OpenTitan community is rapidly growing:

• More than eight times the number of commits at launch: from 2,500 to over 20,000
• 140 contributors to the code base
• 13k+ merged pull requests
• 1.5M+ LoC, including 500k LoC of HDL
• 1.8k GitHub stars

      Participating in OpenTitan

Reaching this key RTL freeze milestone is a major step towards transparency at the very foundation of the security stack: the silicon root of trust. The coordinated contributions of the OpenTitan project’s partners—enabled by lowRISC’s Silicon Commons™ approach to open source silicon development—are what enabled us to get here today.

      This is a watershed moment for the trustworthiness of systems we all rely on. The future of free and open, high quality silicon implementations is bright, and we expect to see many more devices including OpenTitan top-levels and ecosystem IP in the future!

      If you are interested in contributing to OpenTitan, visit the open source GitHub repository or reach out to the OpenTitan team.

      By Cyrus Stoller, Miguel Osorio, and Will Drewry, OpenTitan – Google

      Controlling Stable Diffusion with JAX, diffusers, and Cloud TPUs

      Diffusion models are state-of-the-art in generating photorealistic images from text. These models are hard to control through only text and generation parameters. To overcome this, the open source community developed ControlNet (GitHub), a neural network structure to control diffusion models by adding more conditions on top of the text prompts. These conditions include canny edge filters, segmentation maps, and pose keypoints. Thanks to the 🧨diffusers library, it is very easy to train, fine-tune or control diffusion models written in various frameworks, including JAX!
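
As a concrete sketch of what this looks like with 🧨diffusers and JAX, the snippet below prepares a Canny edge condition and samples from the Flax ControlNet pipeline. The checkpoint ids, the "flax" revision, and the prepare_text_inputs / prepare_image_inputs helpers follow the diffusers Flax ControlNet example as we recall it and may differ across library versions; the reference image path and prompt are placeholders.

# Sketch: build a Canny-edge condition image and run it through the Flax
# ControlNet pipeline in diffusers. Checkpoint ids, the "flax" revision, and
# the prepare_* helper names follow the diffusers Flax ControlNet example and
# may differ between library versions.
import cv2
import numpy as np
from PIL import Image
import jax
import jax.numpy as jnp
from flax.jax_utils import replicate
from flax.training.common_utils import shard
from diffusers import FlaxControlNetModel, FlaxStableDiffusionControlNetPipeline

# 1) The extra condition: a Canny edge map of a reference image (placeholder path).
image = np.array(Image.open("reference.png").convert("RGB"))
edges = cv2.Canny(image, 100, 200)[:, :, None]
canny_image = Image.fromarray(np.concatenate([edges, edges, edges], axis=2))

# 2) Load Stable Diffusion v1.5 plus a Canny ControlNet (Flax weights).
controlnet, controlnet_params = FlaxControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", dtype=jnp.float32)
pipe, params = FlaxStableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    revision="flax", dtype=jnp.float32)
params["controlnet"] = controlnet_params

# 3) Replicate parameters, shard inputs across devices, and sample.
n = jax.device_count()
prompt_ids = pipe.prepare_text_inputs(["a cozy Scandinavian living room"] * n)
cond_image = pipe.prepare_image_inputs([canny_image] * n)
rng = jax.random.split(jax.random.PRNGKey(0), n)
images = pipe(prompt_ids=shard(prompt_ids), image=shard(cond_image),
              params=replicate(params), prng_seed=rng,
              num_inference_steps=50, jit=True).images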

      At Hugging Face, we were particularly excited to see the open source machine learning (ML) community leverage these tools to explore fun and creative diffusion models. We joined forces with Google Cloud to host a community sprint where participants explored the capabilities of controlling Stable Diffusion by building various open source applications with JAX and Diffusers, using Google Cloud TPU v4 accelerators. In this three week sprint, participants teamed up, came up with various project ideas, trained ControlNet models, and built applications based on them. The sprint resulted in 26 projects, accessible via a leaderboard here. These demos use Stable Diffusion (v1.5 checkpoint) initialized with ControlNet models. We worked with Google Cloud to provide access to TPU v4-8 hardware with 3TB storage, as well as NVIDIA A10G GPUs to speed up the inference in these applications.

Below, we showcase a few projects that stood out from the sprint and that anyone can recreate as a demo themselves. When picking projects to highlight, we considered several factors:

      • How well-described are the models produced?
      • Are the models, datasets, and other artifacts fully open sourced?
      • Are the applications easy to use? Are they well described?

      The projects were voted on by a panel of experts and the top ten projects on the leaderboard won prizes.

      Control with SAM

      One team used the state-of-the-art Segment Anything Model (SAM) output as an additional condition to control the generated images. SAM produces zero-shot segmentation maps with fine details, which helps extract semantic information from images for control. You can see an example below and try the demo here.

      Screencap of the 'Control with SAM' project

      Fusing MediaPipe and ControlNet

      Another team used MediaPipe to extract hand landmarks to control Stable Diffusion. This application allows you to generate images based on your hand pose and prompt. You can also use a webcam to input an image. See an example below, and try it yourself here.

      Screencap of a project fusing MediaPipe and ControlNet

      Make-a-Video

      Top on the leaderboard is Make-a-Video, which generates video from a text prompt and a hint image. It is based on latent diffusion with temporal convolutions for video and attention. You can try the demo here.

      Screencap of the 'Make-a-Video' project

      Bootstrapping interior designs

      The project that won the sprint is ControlNet for interior design. The application can generate interior design based on a room image and prompt. It can also perform segmentation and generations, guided by image inpainting. See the application in inpainting mode below.

      Screencap of a project using ControlNet for interior design

      In addition to the projects above, many applications were built to enhance images, like this application to colorize grayscale images. You can check out the leaderboard to try all the projects.

      Learning more about diffusion models

To kick off the sprint, we organized a three-day series of talks by leading scientists and engineers from Google, Hugging Face, and the open source diffusion community. We'd recommend that anyone interested in learning more about diffusion models and generative AI take a look at the recorded sessions below!

      Tim Salimans (Google Research) speaking on Discrete Diffusion Models
      You can watch all the talks from the links below.

      You can check out the sprint homepage to learn more about the sprint.

      Acknowledgements

      We would like to thank Google Cloud for providing TPUs and storage to help make this great sprint happen, in particular Bertrand Rondepierre and Jonathan Caton for the hard work behind the scenes to get all of the Cloud TPUs allocated so participants had cutting-edge hardware to build on and an overall great experience. And also Andreas Steiner and Cristian Garcia for helping to answer questions in our Discord forum and for helping us make the training script example better. Their help is deeply appreciated.

      By Merve Noyan and Sayak Paul – Hugging Face


      Accelerate JAX models on Intel GPUs via PJRT

      We are excited to announce the first PJRT plugin implementation in Intel Extension for TensorFlow, which seamlessly runs JAX models on Intel® GPU. The PJRT API simplified the integration, which allowed the Intel GPU plugin to be developed separately and quickly integrated into JAX. This same PJRT implementation also enables initial Intel GPU support for TensorFlow and PyTorch models with XLA acceleration.

      Figure 1. Intel Data Center GPU Max Series

      With the shared vision that modular interfaces make integration easier and enable faster, independent development, Intel and Google collaborated in developing the TensorFlow PluggableDevice mechanism. This is the supported way to extend TensorFlow to new devices and allows hardware vendors to release separate plugin binaries. Intel has continued to work with Google to build modular interfaces for the XLA compiler and to develop the PJRT plugin to run JAX workloads on Intel GPUs.

      JAX

      JAX is an open source Python library designed for complex numerical computations on high-performance computing devices like GPUs and TPUs. It supports NumPy functions and provides automatic differentiation as well as a composable function transformation system to build and train neural networks.

      JAX uses XLA as its compilation and execution backend to optimize and parallelize computations, particularly on AI hardware accelerators. When a JAX program is executed, the Python code is transformed into OpenXLA’s StableHLO operations, which are then passed to PJRT for compilation and execution. Underneath, the StableHLO operations are compiled into machine code by the XLA compiler, which can then be executed on the target hardware accelerator.
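
This flow is easy to inspect with JAX's ahead-of-time APIs. The short sketch below lowers a jitted function, prints the StableHLO (MHLO in older JAX versions) module that is handed to PJRT, and then compiles and runs it on whatever backend is the default.

# Sketch of the JAX -> StableHLO -> PJRT flow using JAX's ahead-of-time APIs:
# lower a jitted function, inspect the module handed to PJRT, then compile
# and run it on the default backend.
import jax
import jax.numpy as jnp

def predict(w, x):
    return jnp.tanh(x @ w)

w = jnp.ones((4, 8), dtype=jnp.float32)
x = jnp.ones((2, 4), dtype=jnp.float32)

lowered = jax.jit(predict).lower(w, x)   # trace and lower to StableHLO
print(lowered.as_text()[:400])           # the module PJRT will compile

compiled = lowered.compile()             # PJRT compiles for the default device
print(compiled(w, x).shape)              # run the compiled executable: (2, 8)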

      PJRT

PJRT (used in conjunction with OpenXLA’s StableHLO) provides a hardware- and framework-independent interface for compilers and runtimes (recent announcement). The PJRT interface supports plugging in a new device backend. It provides a means for straightforward integration of JAX into Intel's systems and enables JAX workloads on Intel GPUs. Through PJRT integration with various AI frameworks, Intel’s GPU plugin can deliver hardware acceleration and oneAPI optimizations to a wider range of developers using Intel GPUs.

The PJRT API is a framework-independent API that allows upper-level AI frameworks to compile and execute numeric computation represented in StableHLO on an AI hardware accelerator. It has been integrated with popular AI frameworks, including JAX, TensorFlow (via TF-XLA), and PyTorch (via PyTorch-XLA), which enables hardware vendors to provide one plugin for their new AI hardware that all of these popular AI frameworks will support. It also provides low-level primitives that enable efficient interaction with upper-level AI frameworks, including zero-copy buffer donation and light-weight dependency management, allowing AI frameworks to best utilize hardware resources and achieve high-performance execution.

      Figure 2. PJRT simplifies the integration of oneAPI on Intel GPU into AI Frameworks

      PJRT Plugin for Intel GPU

      The Intel GPU plugin implements the PJRT API by compiling StableHLO and dispatching the executable to Intel GPUs. The compilation is based on XLA implementation, adding target-specific passes for Intel GPUs and leveraging oneAPI performance libraries for acceleration. The device execution is supported using SYCL runtime. The Intel GPU Plugin also implements device registration, enumeration, and SPMD execution mode.

PJRT’s high-level runtime abstraction allows the plugin to develop its own low-level device management modules and use the advanced runtime features provided by the new device. For example, the Intel GPU plugin adopted the out-of-order queue feature provided by the SYCL runtime. Compared to fitting the plugin implementation to a low-level runtime interface, such as the stream executor C API used in PluggableDevice, implementing the PJRT runtime interface is straightforward and efficient.

      It’s simple to get started using the Intel GPU plugin to run a JAX program, including JAX-based frameworks like Flax and T5X. Just build the plugin (example documentation) then set the environment variable and dependent library paths. JAX automatically looks for the plugin library and loads it into the current process.

      Below are example code snippets of running JAX on an Intel GPU.

$ export PJRT_NAMES_AND_LIBRARY_PATHS='xpu:Your_itex_library/libitex_xla_extension.so'
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:Your_Python_site-packages/jaxlib
$ python
>>> import numpy as np
>>> import jax
>>> jax.local_devices()  # PJRT Intel GPU plugin loaded
[IntelXpuDevice(id=0, process_index=0), IntelXpuDevice(id=1, process_index=0)]
>>> x = np.random.rand(2,2).astype(np.float32)
>>> y = np.random.rand(2,2).astype(np.float32)
>>> z = jax.numpy.add(x, y)  # Runs on Intel XPU
      This is the latest example of Intel AI tools and frameworks leveraging oneAPI software libraries to provide high performance on Intel GPU.

      Future Work

      This PJRT plugin for Intel GPUs has also been integrated into TensorFlow to run XLA supported ops in TensorFlow models. However, XLA has a smaller op set than TensorFlow. For many TensorFlow models in production, some parts of the model graph are executed with PJRT (XLA compatible) while other parts are executed with the classic TensorFlow runtime using TensorFlow OpKernel. This mixed execution model requires PJRT and TensorFlow OpKernel to work seamlessly with each other. The TensorFlow team has introduced the NextPluggableDevice API to enable this.

      When using NextPluggableDevice API, PJRT manages all critical hardware states (e.g. allocator, stream manager, driver, etc) and NextPluggableDevice API allows hardware vendors to build new TensorFlow OpKernels that can access those hardware states via PJRT. PJRT and NextPluggableDevice API enable interoperability between classic TensorFlow runtime and XLA, allowing the XLA subgraph to produce a PJRT buffer and feed to TensorFlow and vice versa.

      As a next step, Intel will continue working with Google to adopt the NextPluggableDevice API to implement non-XLA ops on Intel GPUs supporting all TensorFlow models.

      Written in collaboration with Jianhui Li, Zhoulong Jiang, and Yiqiang Li from Intel.

      By Jieying Luo, Chuanhao Zhuge, and Xiao Yu – Google

      Foundation models for reasoning on charts

      Visual language is the form of communication that relies on pictorial symbols outside of text to convey information. It is ubiquitous in our digital life in the form of iconography, infographics, tables, plots, and charts, extending to the real world in street signs, comic books, food labels, etc. For that reason, having computers better understand this type of media can help with scientific communication and discovery, accessibility, and data transparency.

      While computer vision models have made tremendous progress using learning-based solutions since the advent of ImageNet, the focus has been on natural images, where all sorts of tasks, such as classification, visual question answering (VQA), captioning, detection and segmentation, have been defined, studied and in some cases advanced to reach human performance. However, visual language has not garnered a similar level of attention, possibly because of the lack of large-scale training sets in this space. But over the last few years, new academic datasets have been created with the goal of evaluating question answering systems on visual language images, like PlotQA, InfographicsVQA, and ChartQA.

      Example from ChartQA. Answering the question requires reading the information and computing the sum and the difference.

Existing models built for these tasks relied on integrating optical character recognition (OCR) information and their coordinates into larger pipelines, but the process is error-prone, slow, and generalizes poorly. These methods prevailed because existing end-to-end computer vision models based on convolutional neural networks (CNNs) or transformers pre-trained on natural images could not be easily adapted to visual language. But such models are ill-prepared for the challenges of answering questions on charts, including reading the relative heights of bars or the angles of slices in pie charts, understanding axis scales, correctly mapping pictograms to their legend values by color, size, and texture, and finally performing numerical operations with the extracted numbers.

      In light of these challenges, we propose “MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering”. MatCha, which stands for math and charts, is a pixels-to-text foundation model (a pre-trained model with built-in inductive biases that can be fine-tuned for multiple applications) trained on two complementary tasks: (a) chart de-rendering and (b) math reasoning. In chart de-rendering, given a plot or chart, the image-to-text model is required to generate its underlying data table or the code used to render it. For math reasoning pre-training, we pick textual numerical reasoning datasets and render the input into images, which the image-to-text model needs to decode for answers. We also propose “DePlot: One-shot visual language reasoning by plot-to-table translation”, a model built on top of MatCha for one-shot reasoning on charts via translation to tables. With these methods we surpass the previous state of the art in ChartQA by more than 20% and match the best summarization systems that have 1000 times more parameters. Both papers will be presented at ACL2023.


      Chart de-rendering

      Plots and charts are usually generated by an underlying data table and a piece of code. The code defines the overall layout of the figure (e.g., type, direction, color/shape scheme) and the underlying data table establishes the actual numbers and their groupings. Both the data and code are sent to a compiler/rendering engine to create the final image. To understand a chart, one needs to discover the visual patterns in the image and effectively parse and group them to extract the key information. Reversing the plot rendering process demands all such capabilities and can thus serve as an ideal pre-training task.

      A chart created from a table in the Airbus A380 Wikipedia page using random plotting options. The pre-training task for MatCha consists of recovering the source table or the source code from the image.

      In practice, it is challenging to simultaneously obtain charts, their underlying data tables, and their rendering code. To collect sufficient pre-training data, we independently accumulate [chart, code] and [chart, table] pairs. For [chart, code], we crawl all GitHub IPython notebooks with appropriate licenses and extract blocks with figures. A figure and the code block right before it are saved as a [chart, code] pair. For [chart, table] pairs, we explored two sources. For the first source, synthetic data, we manually write code to convert web-crawled Wikipedia tables from the TaPas codebase to charts. We sampled from and combined several plotting options depending on the column types. In addition, we also add [chart, table] pairs generated in PlotQA to diversify the pre-training corpus. The second source is web-crawled [chart, table] pairs. We directly use the [chart, table] pairs crawled in the ChartQA training set, containing around 20k pairs in total from four websites: Statista, Pew, Our World in Data, and OECD.
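
The sketch below conveys the flavor of this synthetic data generation, though not the exact MatCha pipeline: a small data table is rendered with randomly sampled plotting options, and a linearized version of the table is kept as the target text. The table values and linearization format are illustrative.

# Sketch of producing one synthetic [chart, table] pre-training pair: render a
# small data table with randomly sampled plotting options and keep a
# linearized table as the target. Values and format are illustrative only.
import random
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

table = {"country": ["Italy", "Brazil", "Kenya", "Japan"],
         "value": [42.0, 57.5, 23.1, 61.8]}

kind = random.choice(["bar", "barh", "line", "pie"])            # random plot type
color = random.choice(["tab:blue", "tab:orange", "tab:green"])  # random color

fig, ax = plt.subplots(figsize=(4, 3))
if kind == "bar":
    ax.bar(table["country"], table["value"], color=color)
elif kind == "barh":
    ax.barh(table["country"], table["value"], color=color)
elif kind == "line":
    ax.plot(table["country"], table["value"], color=color, marker="o")
else:
    ax.pie(table["value"], labels=table["country"])
ax.set_title("value by country")
fig.savefig("chart.png", dpi=120)

# Target text the image-to-text model must learn to decode from chart.png.
target = "country | value\n" + "\n".join(
    f"{c} | {v}" for c, v in zip(table["country"], table["value"]))
print(target)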


      Math reasoning

      We incorporate numerical reasoning knowledge into MatCha by learning math reasoning skills from textual math datasets. We use two existing textual math reasoning datasets, MATH and DROP for pre-training. MATH is synthetically created, containing two million training examples per module (type) of questions. DROP is a reading-comprehension–style QA dataset where the input is a paragraph context and a question.

      To solve questions in DROP, the model needs to read the paragraph, extract relevant numbers and perform numerical computation. We found both datasets to be complementary. MATH contains a large number of questions across different categories, which helps us identify math operations needed to explicitly inject into the model. DROP’s reading-comprehension format resembles the typical QA format wherein models simultaneously perform information extraction and reasoning. In practice, we render inputs of both datasets into images. The model is trained to decode the answer.
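
As a rough illustration, with a made-up DROP-style example and arbitrary layout and fonts, rendering a textual QA instance into an image for pixels-to-text pre-training can look like this:

# Sketch of the math-reasoning pre-training setup with a made-up DROP-style
# example: render the textual (context, question) pair as an image and train
# the pixels-to-text model to decode the answer. Layout and fonts are arbitrary.
import textwrap
from PIL import Image, ImageDraw

context = ("The home team scored 14 points in the first quarter, "
           "21 in the second, and 7 in the third.")
question = "How many points did the home team score in the first three quarters?"
answer = "42"  # 14 + 21 + 7

canvas = Image.new("RGB", (640, 160), "white")
draw = ImageDraw.Draw(canvas)
y = 10
for line in (textwrap.wrap("context: " + context, width=80)
             + textwrap.wrap("question: " + question, width=80)):
    draw.text((10, y), line, fill="black")
    y += 18
canvas.save("drop_example.png")

# Training pair: the rendered image is the input, the answer is the target text.
print({"image": "drop_example.png", "target": answer})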

      To improve the math reasoning skills of MatCha we incorporate examples from MATH and DROP into the pre-training objective, by rendering the input text as images.

      End-to-end results

      We use a Pix2Struct model backbone, which is an image-to-text transformer tailored for website understanding, and pre-train it with the two tasks described above. We demonstrate the strengths of MatCha by fine-tuning it on several visual language tasks — tasks involving charts and plots for question answering and summarization where no access to the underlying table is possible. MatCha surpasses previous models’ performance by a large margin and also outperforms the previous state of the art, which assumes access to underlying tables.

      In the figure below, we first evaluate two baseline models that incorporate information from an OCR pipeline, which until recently was the standard approach for working with charts. The first is based on T5, the second on VisionTaPas. We also compare against PaLI-17B, which is a large (~1000 times larger than the other models) image plus text-to-text transformer trained on a diverse set of tasks but with limited capabilities for reading text and other forms of visual language. Finally, we report the Pix2Struct and MatCha model results.

Experimental results on two chart QA benchmarks, ChartQA & PlotQA (using relaxed accuracy), and a chart summarization benchmark, chart-to-text (using BLEU4). MatCha surpasses the state of the art by a large margin on QA compared to larger models, and matches these larger models on summarization.

      For QA datasets, we use the official relaxed accuracy metric that allows for small relative errors in numerical outputs. For chart-to-text summarization, we report BLEU scores. MatCha achieves noticeably improved results compared to baselines for question answering, and comparable results to PaLI in summarization, where large size and extensive long text/captioning generation pre-training are advantageous for this kind of long-form text generation.
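
For reference, relaxed accuracy treats a numeric prediction as correct if it falls within a small relative tolerance of the target (commonly 5%) and otherwise falls back to exact string matching; a minimal sketch:

# Minimal sketch of relaxed accuracy: numeric answers count as correct when
# within a small relative tolerance of the target (commonly 5%); everything
# else falls back to exact string match.
def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    try:
        pred, tgt = float(prediction), float(target)
    except ValueError:
        return prediction.strip().lower() == target.strip().lower()
    if tgt == 0.0:
        return pred == tgt
    return abs(pred - tgt) / abs(tgt) <= tolerance

def relaxed_accuracy(predictions, targets) -> float:
    return sum(relaxed_match(p, t) for p, t in zip(predictions, targets)) / len(targets)

print(relaxed_accuracy(["104", "blue"], ["100", "Blue"]))  # 1.0: 4% error + exact match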


      Derendering plus large language model chains

      While extremely performant for their number of parameters, particularly on extractive tasks, we observed that fine-tuned MatCha models could still struggle with end-to-end complex reasoning (e.g., mathematical operations involving large numbers or multiple steps). Thus, we also propose a two-step method to tackle this: 1) a model reads a chart, then outputs the underlying table, 2) a large language model (LLM) reads this output and then tries to answer the question solely based on the textual input.

      For the first model, we fine-tuned MatCha solely on the chart-to-table task, increasing the output sequence length to guarantee it could recover all or most of the information in the chart. DePlot is the resulting model. In the second stage, any LLM (such as FlanPaLM or Codex) can be used for the task, and we can rely on the standard methods to increase performance on LLMs, for example chain-of-thought and self-consistency. We also experimented with program-of-thoughts where the model produces executable Python code to offload complex computations.

      An illustration of the DePlot+LLM method. This is a real example using FlanPaLM and Codex. The blue boxes are input to the LLM and the red boxes contain the answer generated by the LLMs. We highlight some of the key reasoning steps in each answer.

      As shown in the example above, the DePlot model in combination with LLMs outperforms fine-tuned models by a significant margin, especially so in the human-sourced portion of ChartQA, where the questions are more natural but demand more difficult reasoning. Furthermore, DePlot+LLM can do so without access to any training data.

      We have released the new models and code at our GitHub repo, where you can try it out yourself in colab. Checkout the papers for MatCha and DePlot for more details on the experimental results. We hope that our results can benefit the research community and make the information in charts and plots more accessible to everyone.


      Acknowledgements

      This work was carried out by Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen and Yasemin Altun from our Language Team as part of Fangyu's internship project. Nigel Collier from Cambridge also was a collaborator. We would like to thank Joshua Howland, Alex Polozov, Shrestha Basu Mallick, Massimo Nicosia and William Cohen for their valuable comments and suggestions.

      Source: Google AI Blog


      Foundation models for reasoning on charts

      Visual language is a form of communication that relies on pictorial symbols beyond text to convey information. It is ubiquitous in our digital life in the form of iconography, infographics, tables, plots, and charts, extending to the real world in street signs, comic books, food labels, etc. For that reason, having computers better understand this type of media can help with scientific communication and discovery, accessibility, and data transparency.

      While computer vision models have made tremendous progress using learning-based solutions since the advent of ImageNet, the focus has been on natural images, where all sorts of tasks, such as classification, visual question answering (VQA), captioning, detection and segmentation, have been defined, studied and in some cases advanced to reach human performance. However, visual language has not garnered a similar level of attention, possibly because of the lack of large-scale training sets in this space. But over the last few years, new academic datasets have been created with the goal of evaluating question answering systems on visual language images, like PlotQA, InfographicsVQA, and ChartQA.

      Example from ChartQA. Answering the question requires reading the information and computing the sum and the difference.

      Models built for these tasks have relied on integrating optical character recognition (OCR) outputs and their coordinates into larger pipelines, but that process is error prone, slow, and generalizes poorly. These methods prevailed because end-to-end computer vision models based on convolutional neural networks (CNNs) or transformers pre-trained on natural images could not be easily adapted to visual language. Yet such models are ill-prepared for the challenges of answering questions on charts: reading the relative heights of bars or the angles of pie slices, understanding axis scales, correctly mapping pictograms to their legend values across colors, sizes and textures, and finally performing numerical operations on the extracted numbers.

      In light of these challenges, we propose “MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering”. MatCha, which stands for math and charts, is a pixels-to-text foundation model (a pre-trained model with built-in inductive biases that can be fine-tuned for multiple applications) trained on two complementary tasks: (a) chart de-rendering and (b) math reasoning. In chart de-rendering, given a plot or chart, the image-to-text model is required to generate its underlying data table or the code used to render it. For math reasoning pre-training, we pick textual numerical reasoning datasets and render the input into images, which the image-to-text model needs to decode for answers. We also propose “DePlot: One-shot visual language reasoning by plot-to-table translation”, a model built on top of MatCha for one-shot reasoning on charts via translation to tables. With these methods, we surpass the previous state of the art in ChartQA by more than 20% and match the best summarization systems that have 1000 times more parameters. Both papers will be presented at ACL 2023.


      Chart de-rendering

      Plots and charts are usually generated by an underlying data table and a piece of code. The code defines the overall layout of the figure (e.g., type, direction, color/shape scheme) and the underlying data table establishes the actual numbers and their groupings. Both the data and code are sent to a compiler/rendering engine to create the final image. To understand a chart, one needs to discover the visual patterns in the image and effectively parse and group them to extract the key information. Reversing the plot rendering process demands all such capabilities and can thus serve as an ideal pre-training task.

      A chart created from a table in the Airbus A380 Wikipedia page using random plotting options. The pre-training task for MatCha consists of recovering the source table or the source code from the image.

      In practice, it is challenging to simultaneously obtain charts, their underlying data tables, and their rendering code. To collect sufficient pre-training data, we independently accumulate [chart, code] and [chart, table] pairs. For [chart, code], we crawl all GitHub IPython notebooks with appropriate licenses and extract blocks with figures; a figure and the code block right before it are saved as a [chart, code] pair. For [chart, table] pairs, we explore two sources. The first is synthetic data: we manually write code to convert web-crawled Wikipedia tables from the TaPas codebase into charts, sampling from and combining several plotting options depending on the column types, and we also add [chart, table] pairs generated in PlotQA to diversify the pre-training corpus. The second source is web-crawled [chart, table] pairs: we directly use the pairs crawled in the ChartQA training set, containing around 20k pairs in total from four websites: Statista, Pew, Our World in Data, and OECD.
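
      To make the synthetic [chart, table] source concrete, below is a minimal sketch of how one such pre-training pair might be produced. It is an illustration rather than the actual pipeline: a small pandas table is rendered with randomly sampled plotting options, and the image is saved alongside a linearized version of the table (the table, the example values, and the linearization format are all illustrative assumptions).

          import random
          import pandas as pd
          import matplotlib.pyplot as plt

          # A small table standing in for a web-crawled Wikipedia table.
          table = pd.DataFrame({
              "Year": [2019, 2020, 2021, 2022],
              "Orders": [251, 318, 402, 456],
          })

          # Randomly sampled plotting options (chart type, color) control the layout.
          chart_type = random.choice(["bar", "line"])
          color = random.choice(["tab:blue", "tab:orange", "tab:green"])

          fig, ax = plt.subplots(figsize=(4, 3))
          if chart_type == "bar":
              ax.bar(table["Year"], table["Orders"], color=color)
          else:
              ax.plot(table["Year"], table["Orders"], color=color, marker="o")
          ax.set_xlabel("Year")
          ax.set_ylabel("Orders")
          fig.savefig("chart_0.png", dpi=150, bbox_inches="tight")

          # Linearize the table as the text target the image-to-text model must decode.
          header = " | ".join(table.columns)
          rows = [" | ".join(str(v) for v in row) for row in table.itertuples(index=False)]
          with open("chart_0.txt", "w") as f:
              f.write("\n".join([header] + rows))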


      Math reasoning

      We incorporate numerical reasoning knowledge into MatCha by learning math reasoning skills from textual math datasets. We use two existing textual math reasoning datasets, MATH and DROP for pre-training. MATH is synthetically created, containing two million training examples per module (type) of questions. DROP is a reading-comprehension–style QA dataset where the input is a paragraph context and a question.

      To solve questions in DROP, the model needs to read the paragraph, extract relevant numbers and perform numerical computation. We found both datasets to be complementary. MATH contains a large number of questions across different categories, which helps us identify math operations needed to explicitly inject into the model. DROP’s reading-comprehension format resembles the typical QA format wherein models simultaneously perform information extraction and reasoning. In practice, we render inputs of both datasets into images. The model is trained to decode the answer.

      To improve the math reasoning skills of MatCha we incorporate examples from MATH and DROP into the pre-training objective, by rendering the input text as images.
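
      As a concrete illustration of this rendering step, the sketch below draws a DROP-style context and question onto a white canvas with PIL; the resulting image is the model input and the numeric answer is the decoding target. The font, layout, and example text are assumptions for illustration only, not the preprocessing used in the paper.

          import textwrap
          from PIL import Image, ImageDraw, ImageFont

          def render_text_as_image(text, width=512, height=256, line_height=18):
              """Render a textual math example onto a white canvas (illustrative only)."""
              image = Image.new("RGB", (width, height), "white")
              draw = ImageDraw.Draw(image)
              font = ImageFont.load_default()  # a real pipeline would pick a readable font
              y = 8
              for line in textwrap.wrap(text, width=70):
                  draw.text((8, y), line, fill="black", font=font)
                  y += line_height
              return image

          # A DROP-style example: the rendered image is the input, the answer is the target.
          context = ("The home team scored 21 points in the first half and 13 in the "
                     "second half, while the visitors scored 17 points in total.")
          question = "How many more points did the home team score than the visitors?"
          answer = "17"  # (21 + 13) - 17

          img = render_text_as_image(f"{context} Question: {question}")
          img.save("math_example_0.png")
          # Pre-training pair: ("math_example_0.png", answer)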

      End-to-end results

      We use a Pix2Struct model backbone, which is an image-to-text transformer tailored for website understanding, and pre-train it with the two tasks described above. We demonstrate the strengths of MatCha by fine-tuning it on several visual language tasks — tasks involving charts and plots for question answering and summarization where no access to the underlying table is possible. MatCha surpasses previous models’ performance by a large margin and also outperforms the previous state of the art, which assumes access to underlying tables.

      In the figure below, we first evaluate two baseline models that incorporate information from an OCR pipeline, which until recently was the standard approach for working with charts. The first is based on T5, the second on VisionTaPas. We also compare against PaLI-17B, which is a large (~1000 times larger than the other models) image plus text-to-text transformer trained on a diverse set of tasks but with limited capabilities for reading text and other forms of visual language. Finally, we report the Pix2Struct and MatCha model results.

      Experimental results on two chart QA benchmarks, ChartQA & PlotQA (using relaxed accuracy), and a chart summarization benchmark, Chart-to-Text (using BLEU4). MatCha surpasses the state of the art by a large margin on QA compared to larger models, and matches these larger models on summarization.

      For QA datasets, we use the official relaxed accuracy metric that allows for small relative errors in numerical outputs. For chart-to-text summarization, we report BLEU scores. MatCha achieves noticeably improved results compared to baselines for question answering, and comparable results to PaLI in summarization, where large size and extensive long text/captioning generation pre-training are advantageous for this kind of long-form text generation.
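
      For reference, here is a minimal sketch of the relaxed accuracy metric: a numeric prediction counts as correct when it falls within a small relative tolerance of the gold answer (ChartQA uses 5%), with a fallback to exact string match for non-numeric answers. The official implementation may normalize strings differently.

          def relaxed_accuracy(prediction: str, target: str, tolerance: float = 0.05) -> bool:
              """Numeric answers may deviate by a small relative error; others must match exactly."""
              try:
                  pred, gold = float(prediction), float(target)
                  if gold == 0.0:
                      return pred == gold
                  return abs(pred - gold) / abs(gold) <= tolerance
              except ValueError:
                  # Non-numeric answers fall back to case-insensitive exact match.
                  return prediction.strip().lower() == target.strip().lower()

          assert relaxed_accuracy("10.3", "10")          # 3% relative error: accepted
          assert not relaxed_accuracy("11", "10")        # 10% relative error: rejected
          assert relaxed_accuracy("France", "france")    # non-numeric: exact match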


      Derendering plus large language model chains

      Although fine-tuned MatCha models are extremely performant for their number of parameters, particularly on extractive tasks, we observed that they can still struggle with end-to-end complex reasoning (e.g., mathematical operations involving large numbers or multiple steps). We therefore also propose a two-step method to tackle this: 1) a model reads a chart and outputs the underlying table; 2) a large language model (LLM) reads this output and answers the question based solely on the textual input.

      For the first model, we fine-tuned MatCha solely on the chart-to-table task, increasing the output sequence length to ensure it could recover all or most of the information in the chart; DePlot is the resulting model. In the second stage, any LLM (such as FlanPaLM or Codex) can be used for the task, and we can rely on standard methods for improving LLM performance, such as chain-of-thought prompting and self-consistency. We also experimented with program-of-thoughts prompting, where the model produces executable Python code to offload complex computations.
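
      A minimal sketch of this two-stage chain is shown below. It loads a DePlot checkpoint through the Pix2Struct classes available in the Hugging Face Transformers library, de-renders a chart image into a linearized table, and assembles a chain-of-thought prompt for an LLM; the prompt wording, the example question, and the final LLM call (indicated only in a comment) are assumptions, standing in for whichever model and prompting strategy you choose.

          from PIL import Image
          from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

          # Stage 1: de-render the chart into a linearized table with DePlot.
          processor = Pix2StructProcessor.from_pretrained("google/deplot")
          model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")

          image = Image.open("chart.png")
          inputs = processor(
              images=image,
              text="Generate underlying data table of the figure below:",
              return_tensors="pt",
          )
          table_ids = model.generate(**inputs, max_new_tokens=512)
          table = processor.decode(table_ids[0], skip_special_tokens=True)

          # Stage 2: hand the textual table to an LLM together with the question.
          question = "Which year saw the largest increase in orders?"
          prompt = (
              "Read the table below and answer the question. "
              "Think step by step before giving the final answer.\n\n"
              f"{table}\n\nQuestion: {question}\nAnswer:"
          )
          # Send `prompt` to an LLM of your choice (e.g., FlanPaLM or Codex); chain-of-thought
          # and self-consistency can be applied at this stage to improve the final answer.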

      An illustration of the DePlot+LLM method. This is a real example using FlanPaLM and Codex. The blue boxes are input to the LLM and the red boxes contain the answer generated by the LLMs. We highlight some of the key reasoning steps in each answer.

      As shown in the example above, the DePlot model in combination with LLMs outperforms fine-tuned models by a significant margin, especially so in the human-sourced portion of ChartQA, where the questions are more natural but demand more difficult reasoning. Furthermore, DePlot+LLM can do so without access to any training data.

      We have released the new models and code at our GitHub repo, where you can try them out yourself in Colab. Check out the papers for MatCha and DePlot for more details on the experimental results. We hope that our results can benefit the research community and make the information in charts and plots more accessible to everyone.


      Acknowledgements

      This work was carried out by Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen and Yasemin Altun from our Language Team as part of Fangyu's internship project. Nigel Collier from Cambridge was also a collaborator. We would like to thank Joshua Howland, Alex Polozov, Shrestha Basu Mallick, Massimo Nicosia and William Cohen for their valuable comments and suggestions.

      Source: Google AI Blog


      Google Research at I/O 2023

      Wednesday, May 10th was an exciting day for the Google Research community as we watched the results of months and years of our foundational and applied work get announced on the Google I/O stage. With the quick pace of announcements on stage, it can be difficult to convey the substantial effort and unique innovations that underlie the technologies we presented. So today, we’re excited to reveal more about the research efforts behind some of the many exciting announcements at this year's I/O.


      PaLM 2

      Our next-generation large language model (LLM), PaLM 2, is built on advances in compute-optimal scaling, scaled instruction fine-tuning and an improved dataset mixture. By fine-tuning and instruction-tuning the model for different purposes, we have been able to integrate state-of-the-art capabilities into over 25 Google products and features, where it is already helping to inform, assist and delight users. For example:

      • Bard is an early experiment that lets you collaborate with generative AI and helps to boost productivity, accelerate ideas and fuel curiosity. It builds on advances in deep learning efficiency and leverages reinforcement learning from human feedback to provide more relevant responses and increase the model’s ability to follow instructions. Bard is now available in 180 countries, where users can interact with it in English, Japanese and Korean, and thanks to the multilingual capabilities afforded by PaLM 2, support for 40 languages is coming soon.
      • With Search Generative Experience we’re taking more of the work out of searching, so you’ll be able to understand a topic faster, uncover new viewpoints and insights, and get things done more easily. As part of this experiment, you’ll see an AI-powered snapshot of key information to consider, with links to dig deeper.
      • MakerSuite is an easy-to-use prototyping environment for the PaLM API, powered by PaLM 2. In fact, internal user engagement with early prototypes of MakerSuite accelerated the development of our PaLM 2 model itself. MakerSuite grew out of research focused on prompting tools, or tools explicitly designed for customizing and controlling LLMs. This line of research includes PromptMaker (precursor to MakerSuite), and AI Chains and PromptChainer (one of the first research efforts demonstrating the utility of LLM chaining).
      • Project Tailwind also made use of early research prototypes of MakerSuite to develop features to help writers and researchers explore ideas and improve their prose; its AI-first notebook prototype used PaLM 2 to allow users to ask questions of the model grounded in documents they define.
      • Med-PaLM 2 is our state-of-the-art medical LLM, built on PaLM 2. Med-PaLM 2 achieved 86.5% performance on U.S. Medical Licensing Exam–style questions, illustrating its exciting potential for health. We’re now exploring multimodal capabilities to synthesize inputs like X-rays.
      • Codey is a version of PaLM 2 fine-tuned on source code to function as a developer assistant. It supports a broad range of Code AI features, including code completions, code explanation, bug fixing, source code migration, error explanations, and more. Codey is available through our trusted tester program via IDEs (Colab, Android Studio, Duet AI for Cloud, Firebase) and via a 3P-facing API.

      Perhaps even more exciting for developers, we have opened up the PaLM APIs & MakerSuite to provide the community opportunities to innovate using this groundbreaking technology.
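
      As a quick illustration, here is a minimal sketch of calling the PaLM API from Python. It assumes the google-generativeai client and an API key created in MakerSuite; the model name and parameters are illustrative rather than recommendations, so treat it as a starting point and consult the current documentation.

          import google.generativeai as palm

          # Assumes an API key created in MakerSuite.
          palm.configure(api_key="YOUR_API_KEY")

          completion = palm.generate_text(
              model="models/text-bison-001",   # illustrative PaLM 2 text model name
              prompt="Write a one-line docstring for a function that merges two sorted lists.",
              temperature=0.2,
              max_output_tokens=64,
          )
          print(completion.result)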

      PaLM 2 has advanced coding capabilities that enable it to find code errors and make suggestions in a number of different languages.

      Imagen

      Our Imagen family of image generation and editing models builds on advances in large Transformer-based language models and diffusion models. This family of models is being incorporated into multiple Google products, including:

      • Image generation in Google Slides and Android’s Generative AI wallpaper are powered by our text-to-image generation features.
      • Google Cloud’s Vertex AI enables image generation, image editing, image upscaling and fine-tuning to help enterprise customers meet their business needs.
      • I/O Flip, a digital take on a classic card game, features Google developer mascots on cards that were entirely AI generated. This game showcased a fine-tuning technique called DreamBooth for adapting pre-trained image generation models. Using just a handful of images as inputs for fine-tuning, it allows users to generate personalized images in minutes. With DreamBooth, users can synthesize a subject in diverse scenes, poses, views, and lighting conditions that don’t appear in the reference images (see the usage sketch below the caption).
        I/O Flip presents custom card decks designed using DreamBooth.
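
      To give a sense of how a DreamBooth-style model is used once fine-tuned, the sketch below loads a hypothetical fine-tuned Stable Diffusion checkpoint with the open source diffusers library and prompts it with the rare identifier token bound to the subject during fine-tuning. The checkpoint path and the "sks" token are placeholders; this illustrates the general technique with open source tools, not the I/O Flip pipeline itself.

          import torch
          from diffusers import StableDiffusionPipeline

          # Load a DreamBooth fine-tuned checkpoint (path is a placeholder).
          pipe = StableDiffusionPipeline.from_pretrained(
              "path/to/dreambooth-finetuned-model",
              torch_dtype=torch.float16,
          ).to("cuda")

          # "sks" is the rare identifier bound to the subject during fine-tuning; the
          # prompt places that subject in a scene absent from the reference images.
          prompt = "a photo of sks mascot surfing a wave at sunset, trading card style"
          image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
          image.save("custom_card.png")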

      Phenaki

      Phenaki, Google’s Transformer-based text-to-video generation model, was featured in the I/O pre-show. It can synthesize realistic videos from sequences of textual prompts by leveraging two main components: an encoder-decoder model that compresses videos to discrete embeddings and a transformer model that translates text embeddings to video tokens.


      ARCore and the Scene Semantic API

      Among the new ARCore features announced by the AR team at I/O, the Scene Semantic API can recognize pixel-wise semantics in an outdoor scene, helping users create custom AR experiences based on the features in the surrounding area. The API is powered by an outdoor semantic segmentation model that builds on our recent work around the DeepLab architecture and an egocentric outdoor scene understanding dataset. The latest ARCore release also includes an improved monocular depth model that provides higher accuracy in outdoor scenes.

      The Scene Semantic API uses a DeepLab-based semantic segmentation model to provide accurate pixel-wise labels in outdoor scenes.

      Chirp

      Chirp is Google's family of state-of-the-art Universal Speech Models trained on 12 million hours of speech to enable automatic speech recognition (ASR) for 100+ languages. The models can perform ASR on under-resourced languages, such as Amharic, Cebuano, and Assamese, in addition to widely spoken languages like English and Mandarin. Chirp is able to cover such a wide variety of languages by leveraging self-supervised learning on an unlabeled multilingual dataset followed by fine-tuning on a smaller set of labeled data. Chirp is now available in the Google Cloud Speech-to-Text API, allowing users to perform inference on the model through a simple interface. You can get started with Chirp here.


      MusicLM

      At I/O, we launched MusicLM, a text-to-music model that generates 20 seconds of music from a text prompt. You can try it yourself on AI Test Kitchen, or see it featured during the I/O preshow, where electronic musician and composer Dan Deacon used MusicLM in his performance.

      MusicLM, which consists of models powered by AudioLM and MuLAN, can make music (from text, humming, images or video) and musical accompaniments to singing. AudioLM generates high-quality audio with long-term consistency. It maps audio to a sequence of discrete tokens and casts audio generation as a language modeling task. To synthesize longer outputs efficiently, it uses a novel approach we developed called SoundStorm.


      Universal Translator dubbing

      Our dubbing efforts leverage dozens of ML technologies to translate the full expressive range of video content, making videos accessible to audiences across the world. These technologies have been used to dub videos across a variety of products and content types, including educational content, advertising campaigns, and creator content, with more to come. We use deep learning technology to achieve voice preservation and lip matching and to enable high-quality video translation. We've built this product to include human review for quality as well as safety checks to help prevent misuse, and we make it accessible only to authorized partners.


      AI for global societal good

      We are applying our AI technologies to solve some of the biggest global challenges, like mitigating climate change, adapting to a warming planet and improving human health and wellbeing. For example:

      • Traffic engineers use our Green Light recommendations to reduce stop-and-go traffic at intersections and improve the flow of traffic in cities from Bangalore to Rio de Janeiro and Hamburg. Green Light models each intersection, analyzing traffic patterns to develop recommendations that make traffic lights more efficient — for example, by better synchronizing timing between adjacent lights, or adjusting the “green time” for a given street and direction.
      • We’ve also expanded global coverage on the Flood Hub to 80 countries, as part of our efforts to predict riverine floods and alert people who are about to be impacted before disaster strikes. Our flood forecasting efforts rely on hydrological models informed by satellite observations, weather forecasts and in-situ measurements.

      Technologies for inclusive and fair ML applications

      With our continued investment in AI technologies, we are emphasizing responsible AI development with the goal of making our models and tools useful and impactful while also ensuring fairness, safety and alignment with our AI Principles. Some of these efforts were highlighted at I/O, including:

      • The release of the Monk Skin Tone Examples (MST-E) Dataset to help practitioners gain a deeper understanding of the MST scale and train human annotators for more consistent, inclusive, and meaningful skin tone annotations. The dataset builds on the open source release of the Monk Skin Tone (MST) Scale we launched last year to enable developers to build products that are more inclusive and that better represent their diverse users. You can read more about this and other developments on our website.
      • A new Kaggle competition (open until August 10th) in which the ML community is tasked with creating a model that can quickly and accurately identify American Sign Language (ASL) fingerspelling — where each letter of a word is spelled out in ASL rapidly using a single hand, rather than using the specific signs for entire words — and translate it into written text. Learn more about the fingerspelling Kaggle competition, which features a song from Sean Forbes, a deaf musician and rapper. We also showcased at I/O how the winning algorithm from the prior year's competition powers PopSign, an ASL learning app for parents of deaf or hard of hearing children created by Georgia Tech and the Rochester Institute of Technology (RIT).

      Building the future of AI together

      It’s inspiring to be part of a community of so many talented individuals who are leading the way in developing state-of-the-art technologies, responsible AI approaches and exciting user experiences. We are in the midst of a period of incredible and transformative change for AI. Stay tuned for more updates about the ways in which the Google Research community is boldly exploring the frontiers of these technologies and using them responsibly to benefit people’s lives around the world. We hope you're as excited as we are about the future of AI technologies and we invite you to engage with our teams through the references, sites and tools that we’ve highlighted here.


      Source: Google AI Blog