
Introducing the Workspace Policy API mutate endpoints for DLP

The Workspace Policy API provides a centralized, comprehensive view of your security settings, eliminating the need to navigate to numerous pages in the Admin console.

With our latest update, we are introducing mutate endpoints (Create, Update, Delete) alongside existing read-only capabilities (Get, List) for data loss prevention (DLP) rules and detectors. This allows super admins to programmatically manage and fully automate the entire lifecycle of their DLP policies, from initial creation to real-time activation and deactivation.

Note this is an API-only launch for capabilities currently supported in the Admin console.

About DLP

DLP lets Workspace admins control external file sharing to prevent sensitive information leaks. It scans files for violations, triggering incidents and protective actions like content blocking.

How DLP works:

  • Admins define rules for sensitive content across Drive, Gmail, Chat, and Chrome.
  • DLP scans content for DLP rule violations that trigger DLP incidents.
  • DLP enforces the rules you defined and violations trigger actions, such as alerts.
  • Admins are alerted for DLP rule violations.
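
The four steps above can be pictured with a toy model. The sketch below is plain Python and does not call the Workspace Policy API; the `Rule` class, the `scan` function, and the sample regular expression are hypothetical names chosen only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical stand-in for a DLP rule: a detector pattern plus an action."""
    name: str
    pattern: str   # regular expression acting as the detector
    action: str    # for example "alert" or "block"


def scan(content: str, rules: list[Rule]) -> list[dict]:
    """Return one incident per rule whose detector matches the content."""
    incidents = []
    for rule in rules:
        if re.search(rule.pattern, content):
            incidents.append({"rule": rule.name, "action": rule.action})
    return incidents


# Illustrative detector: 16 digits in groups of four, separated by spaces or hyphens.
rules = [Rule("card-number", r"\b(?:\d{4}[ -]?){3}\d{4}\b", "alert")]

incidents = scan("Invoice paid with 4111 1111 1111 1111 yesterday.", rules)
clean = scan("No sensitive data here.", rules)
```

In this toy model the first string produces one incident carrying the rule's action, while the second produces none.
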
Summary of capabilities supported by mutate endpoints for DLP

Getting started

  • Admins: You must be a super admin to use the Policy API. See our developer documentation to learn more about the Policy API. You can also use GAM, an open source tool for managing Workspace, which now supports the Policy API.
  • End users: This is an admin-only capability.

Rollout pace

Availability

  • Available to all Google Workspace customers and Workspace Individual subscribers

Resources

This entry was posted in Uncategorized on by .

Unlocking TPU performance: Deep kernel profiling with XProf


As machine learning workloads scale to unprecedented heights, developers are increasingly writing highly specialized Tensor Processing Unit (TPU) kernels using frameworks like Pallas, Mosaic, and Triton to maximize hardware performance.

However, customizing high-performance kernels has historically introduced a major engineering challenge: optimization blind spots. To legacy performance profilers, custom compilation paths appear as opaque execution paths. Developers are left with single, massive execution blocks in their trace captures, lacking granular visibility into what is actually occurring inside the chip's internal components. Did a vector processing instruction stall? Was matrix math idle due to data loading bottlenecks?

Traditional profiling relies heavily on compile-time static cost models to estimate kernel efficiency. While helpful for standard operations, these models cannot capture dynamic runtime realities like instruction execution stalls, memory subsystem congestion, or hardware scheduling conflicts.

To open this opaque execution path, we are excited to introduce the Kernel Profiling suite in XProf—a low-level hardware debugging suite engineered specifically for Pallas kernel authoring and optimization on Google TPUs. By combining static compilation tracking with dynamic, sub-microsecond hardware telemetry, XProf Kernel provides the deep transparency required to optimize high-scale ML workloads.

Deep visibility: HLO Graphs & MLIR Inspection

The first step in debugging any custom kernel is understanding how your high-level code is translated by the compiler. When compiling a JAX or PyTorch model, the compiler generates a High-Level Optimizer (HLO) graph. Previously, custom calls inside these graphs remained completely obscured.

XProf's updated Graph Viewer resolves this by exposing the internal compilation logic of these custom regions directly. To unlock this deep visibility, developers must pass the appropriate debug flags to the XLA compilation environment.
--xla_enable_custom_call_region_trace=true
--xla_xprof_register_llo_debug_info=true
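
One common way to pass XLA flags from Python is the `XLA_FLAGS` environment variable, set before JAX is imported. The sketch below assumes the two flags above are accepted through that variable; on some TPU setups flags are passed through `LIBTPU_INIT_ARGS` instead, so check the XProf documentation for your runtime. The helper name `with_flags` is hypothetical.

```python
import os


def with_flags(existing: str, new_flags: list[str]) -> str:
    """Append flags to a space-separated flag string, skipping duplicates."""
    flags = existing.split()
    for flag in new_flags:
        if flag not in flags:
            flags.append(flag)
    return " ".join(flags)


DEBUG_FLAGS = [
    "--xla_enable_custom_call_region_trace=true",
    "--xla_xprof_register_llo_debug_info=true",
]

# Assumption: XLA_FLAGS is the right variable for these flags.
# Must run before `import jax` so the compiler sees the flags at start-up.
os.environ["XLA_FLAGS"] = with_flags(os.environ.get("XLA_FLAGS", ""), DEBUG_FLAGS)
```

Appending to the existing value, rather than overwriting it, keeps any flags that were already set in the environment.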

Once these flags are active, any trace captured via XProf includes comprehensive compiler metadata. In the XProf Graph Viewer, clicking on a custom-call block reveals an interactive panel titled "Custom Call Text." This displays the raw, lowered MLIR (Multi-Level Intermediate Representation) code generated by the compiler.

Figure 1: XProf interface displaying an HLO graph, with a "Custom Call Text" panel to reveal raw MLIR code

By displaying the MLIR text side-by-side with high-level source-code representations, developers can immediately verify whether the compiler is correctly fusing operations and structuring memory tiles as intended.

Tracing Instrumented Low-Level Operations (LLO) Analysis

To provide cycle-level execution visibility, XProf exposes Low-Level Operations (LLO) bundle data directly inside the Trace Viewer. An LLO bundle represents the actual machine instructions issued to the TPU core's functional units during every clock cycle.

Through dynamic instrumentation, XProf inserts hardware markers exactly when an LLO bundle region executes. Within the Trace Viewer, this manifests as dedicated, time-aligned execution tracks representing the TPU bundle's slot utilization metrics from static analysis:

  • MXU (Matrix Multiply Unit): Tracks active, busy cycles of high-throughput matrix-multiplication pipelines.
  • Scalar and Vector ALUs: Displays the execution profile of mathematical operations, letting you spot pipeline imbalances.
  • Vector Fills, Loads, Spills, and Stores: Exposes HBM-to-register data movement, critical for identifying bandwidth-throttling bottlenecks.
  • XLU (Cross-Lane Unit): Monitors collective communications and data shuffling across physical TPU cores.
Figure 2: XProf Capture Profile trace viewer interface showing dynamic hardware execution tracks

Runtime Performance Counter Sampling

While static analysis effectively verifies instruction counts or vector store logic, it remains detached from the dynamic realities of runtime execution. To bridge this gap, XProf introduces fine-grained, periodic performance counter sampling—available starting with TPU v7 (Ironwood). This capability empowers developers to move beyond static estimation and measure precisely how hardware blocks are utilized in real-time, providing the empirical ground truth needed to identify whether compute units are truly active or stalled by memory subsystems.

Consider the optimization of a tiled matrix multiplication (Matmul) kernel. While a static trace might indicate a logically perfect sequence of operations, real-world performance often falters if the Matrix Multiply Unit (MXU) sits idle while awaiting data from High-Bandwidth Memory (HBM). To diagnose and resolve such bottlenecks, developers can utilize a structured three-step profiling workflow:

  1. Set up the Profiling Environment: Configure the TPU v7 (Ironwood) runtime by defining specific hardware counters—such as scalar issues or synchronization waits.
  2. Capture a Kernel Profile: Use the XProf request interface to capture fine-grained performance counters, which can then be visualized as a time-series within the Trace Viewer.
  3. Interpret the Data: Analyze the resulting counters to distinguish between a Memory-Bound Scenario (characterized by massive spikes in sync_wait) and an Optimized Scenario. For instance, implementing triple buffering to overlap memory loads with MXU compute can reduce runtime from 125.5µs to 88µs—a ~30% performance gain validated by a drastic reduction in synchronization events.
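
The arithmetic behind the quoted gain can be checked directly. This short sketch uses only the two runtimes from step 3; the function name is illustrative.

```python
def runtime_reduction(before_us: float, after_us: float) -> tuple[float, float]:
    """Return (fractional runtime reduction, speedup factor)."""
    reduction = (before_us - after_us) / before_us
    speedup = before_us / after_us
    return reduction, speedup


reduction, speedup = runtime_reduction(125.5, 88.0)
print(f"runtime reduction: {reduction:.1%}, speedup: {speedup:.2f}x")
```

A 29.9% shorter runtime corresponds to roughly a 1.43x speedup, so the two figures should not be used interchangeably.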

By shifting from static code inspection to empirical runtime telemetry, developers can validate optimization strategies against actual hardware behavior and confirm that cycles on the silicon are spent productively. For a hands-on example of these techniques, explore our Pallas Matmul w/ Perf Counters demo.

Figure 3: XProf timeline highlighting a comparison between a detailed "Runtime Perf Counter" section sampling at a 1-microsecond frequency and a "Static LLO Region" track below it

Visualizing the "Utilization Gap"

This dynamic tracking exposes the significant gap left by traditional static analysis tools. A static tool analyzes instructions linearly, completely ignoring time. It might flag an MXU instruction block as "100% Utilized."

In contrast, XProf plots actual hardware execution over time. You might discover that a long-running Scalar ALU operation is stalling the entire execution pipeline, leaving the powerful MXU completely idle. By visualizing these temporal idle gaps, developers can adjust data shapes, memory alignments, and instruction sequencing to maximize compute density.

STATIC ESTIMATION:
[========== Block Execution: MXU Flagged 100% Utilized ==========]

XPROF REAL-WORLD TIMELINE:
├─ [Scalar ALU (Active)] ─┼─ [MXU (Active)] ─┼── [MXU (Idle / Memory Stall)] ──┤
│ Stalling pipeline...     │ Compute phase     │ Starved; waiting for HBM Load    │
Figure 4: The UI shows the active TPU Core functional unit tracks (MXU, Scalar ALU, Vector ALU, and memory data pipelines) aligned side-by-side with the active framework Ops, exposing exact execution times and real-time idle cycles.
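
The gap can be quantified from a timeline. The sketch below assumes a hypothetical list of (unit, state, start, end) segments such as might be read off the trace; the durations are made up for illustration and are not measurements.

```python
from collections import defaultdict

# Hypothetical timeline segments: (unit, state, start_us, end_us).
segments = [
    ("scalar_alu", "active", 0.0, 30.0),
    ("mxu", "active", 30.0, 60.0),
    ("mxu", "idle", 60.0, 100.0),   # starved while waiting for an HBM load
]


def active_fraction(segments, block_start, block_end):
    """Fraction of the block's wall time each unit spends in the 'active' state."""
    active = defaultdict(float)
    for unit, state, start, end in segments:
        if state == "active":
            # Clip each segment to the block boundaries before summing.
            overlap = min(end, block_end) - max(start, block_start)
            active[unit] += max(overlap, 0.0)
    duration = block_end - block_start
    return {unit: total / duration for unit, total in active.items()}


utilization = active_fraction(segments, 0.0, 100.0)
```

With these illustrative numbers the MXU is active for 30% of the block, even though a static view would label the whole block as MXU work.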

Overall Utilization from Performance Counters

Navigating profiling metrics can be daunting. Relying on metrics calculated via compile-time cost models often misrepresents performance when applied to custom compilation paths. To solve this, XProf establishes a clear Hierarchy of Trust:

                  ┌───────────────────────────────┐
                  │     Absolute Ground Truth     │
                  │  (HBM, Hardware Registers,    │ (100% Trustworthy)
                  │       TPO Metrics, CSRs)      │
                  └───────────────┬───────────────┘
                                  ▼
                  ┌───────────────────────────────┐
                  │       Estimated Metrics       │
                  │   (Program Optimal FLOPs,     │ (Requires caution with
                  │      Goodput Efficiency)      │  custom compiling paths)
                  └───────────────────────────────┘
Figure 5: Hierarchy of Metrics
  1. The Absolute Ground Truth (100% Trustworthy): Metrics derived directly from physical hardware registers (HBM utilization, TPO metrics, unprivileged hardware stats). When profiling custom kernels, these represent physical reality and should be your primary optimization anchors.
  2. Estimated Metrics (Use with Caution): Metrics like "Compared to program optimal FLOPS" or "Goodput efficiency" rely on XLA cost models. Because custom compilation paths bypass standard passes, these metrics can be highly skewed or outright non-functional.

For the unvarnished truth, XProf exposes the Perf Counters View, providing direct, tabular access to over 16,000 raw hardware counters read straight from the TPU silicon.

A screenshot of the XProf Perf Counters tabular view, displaying a list of unprivileged hardware counters alongside their corresponding raw decimal and hexadecimal values
Figure 6: XProf Perf Counters Tabular View

Understanding Trace Tracks: The height of a trace track does not represent a normalized 0-100% value. It represents the maximum raw counter value observed in that interval. For example, if a counter increments by 100 cycles over a 500-nanosecond trace window (1,000 clock cycles on a 2.0 GHz core), it indicates 10% physical utilization of that unit.
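
The worked example above reduces to a single ratio. In this sketch the function name and parameters are illustrative, and the 2.0 GHz clock is the example value from the text, not a statement about any particular TPU generation.

```python
def counter_utilization(counter_delta: int, window_ns: float, clock_ghz: float) -> float:
    """Busy cycles divided by the cycles that elapsed in the sampling window."""
    elapsed_cycles = window_ns * clock_ghz  # ns * (cycles per ns)
    return counter_delta / elapsed_cycles


util = counter_utilization(counter_delta=100, window_ns=500, clock_ghz=2.0)
print(f"{util:.0%}")
```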

To configure and profile the runtime performance counters sampling method, please follow the instructions from <openxla.org/xprof/kernel-profiling.html>.

Advanced Sampling: Event-Triggered Profiling

Previously, dynamic capturing was limited to Periodic Sampling Mode—polling counters based on a host-level timer, which hit a physical resolution floor of 1 microsecond.

           CORE 0           CORE 1           CORE 2           CORE 3
      ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
      │  28 Counters │ │  28 Counters │ │  28 Counters │ │  28 Counters │
      └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
      └─────────────────────────────────────────────────────────────────┘
                            4 x 28 Sparse Matrix
Figure 7: Sparse Matrix Configuration

To capture lightning-fast hardware cycles, XProf now supports External Event-Triggered Mode. The dynamic sampler intercepts physical TPU trace instructions and boundary triggers (such as entering/exiting custom call scopes), allowing for sub-microsecond capture latency and precise attribution.

Developers can configure up to 28 hardware counters per core, distributed across up to four active SparseCores, creating a 4 x 28 profiling matrix that maximizes data variety while protecting workload performance.

Activating this is straightforward via standard JAX JIT profilers:

import jax

options = jax.profiler.ProfileOptions()

# Example request for externally triggered collection
options.advanced_configuration = {
    "tpu_enable_periodic_counter_sampling": True,
    "tpu_tc_perf_counter_sampling_options": (
        "is_external_trigger:true scaling:0 counter_size_bits:1 "
        "indices:10 indices:11 indices:56 indices:57 indices:58"
    ),
}

# For periodic sampling, please use interval_us instead of is_external_trigger.
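
The options string in the request above follows a `key:value` layout with one `indices:` entry per counter. The helper below is a hypothetical convenience, not part of JAX or XProf: it assembles such a string from a list of counter indices and enforces the 28-counter limit mentioned earlier. The field names are copied from the example; their exact semantics should be checked against the XProf documentation.

```python
MAX_COUNTERS_PER_CORE = 28  # limit quoted in the text


def sampling_options(indices, interval_us=None, scaling=0, counter_size_bits=1):
    """Build a sampling-options string in the layout used by the example above.

    With interval_us=None the request is externally triggered; otherwise it
    asks for periodic sampling at the given interval.
    """
    if not indices:
        raise ValueError("at least one counter index is required")
    if len(indices) > MAX_COUNTERS_PER_CORE:
        raise ValueError(f"at most {MAX_COUNTERS_PER_CORE} counters per core")
    if interval_us is None:
        trigger = "is_external_trigger:true"
    else:
        trigger = f"interval_us:{interval_us}"
    fields = [trigger, f"scaling:{scaling}", f"counter_size_bits:{counter_size_bits}"]
    fields += [f"indices:{i}" for i in indices]
    return " ".join(fields)


external = sampling_options([10, 11, 56, 57, 58])
periodic = sampling_options([10, 11], interval_us=1)
```

Building the string from a list keeps the counter selection in one place and makes it harder to exceed the per-core limit by accident.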

Getting Started

Ready to transition from guessing performance to measuring and optimizing the physical limits of your ML silicon? Explore these open-source resources to get started with XProf Kernel today:


Convert rubric files and images into Google Classroom rubrics with help from Gemini

Building on our October launch, Gemini in Google Classroom can now help educators more easily convert rubric files and images into Google Classroom rubrics, right within the assignment creation workflow. Educators can now upload more file types, such as .jpeg and .png files. For example, by uploading a photo of a physical rubric or using existing files, Gemini in Classroom can help educators quickly generate structured, interactive rubrics within the Classroom interface. They can then make edits to the converted rubric before saving it. This Gemini-powered automation reduces manual data entry and helps educators maintain consistent grading standards across their assignments.

With this launch, rubric conversion will be controlled by the Gemini in Classroom setting in the Admin console. If Gemini in Classroom is disabled for your organization, you’ll no longer be able to convert rubrics from documents or images.

This feature is only available in English for users over age 18.

Getting started

Rollout pace

Availability

  • Education: Education Fundamentals, Standard, and Plus

Resources


Request lightweight document alignment with approvals in Google Drive

Google Drive is introducing alignment approvals, a lightweight mechanism that allows teams to request and record document sign-offs without file changes resetting the approval flow. When a document is in a partially approved state, collaborators can continue making edits without resetting any recorded approver decisions.

Alignment approvals can be initiated via a new checkbox within the standard request dialog across web clients.


When checked, "Require all approvers to review the same content" resets pending approvals if the file content changes. This is the default behavior.


When unchecked, changes to the file content don't reset pending approvals.

This feature serves use cases where strict content locking is unnecessary, making it easier for teams to maintain momentum on fluid projects. While an approval remains pending, individual approvers retain the flexibility to manually reset their approved status back to pending if subsequent content edits no longer match their expectations.
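
The difference between the two checkbox states can be expressed as a small state model. This is an illustrative sketch in plain Python, not the Drive API; the class and method names are hypothetical, and it assumes that a reset returns every recorded decision to pending.

```python
from dataclasses import dataclass, field


@dataclass
class ApprovalRequest:
    """Toy model of an approval request on a single file."""
    approvers: list[str]
    require_same_content: bool = True  # checkbox state; checked is the default
    decisions: dict[str, str] = field(default_factory=dict)

    def __post_init__(self):
        self.decisions = {name: "pending" for name in self.approvers}

    def approve(self, name: str) -> None:
        self.decisions[name] = "approved"

    def reset(self, name: str) -> None:
        """An approver manually returns their own decision to pending."""
        self.decisions[name] = "pending"

    def on_content_change(self) -> None:
        """Strict mode resets recorded decisions; alignment mode keeps them."""
        if self.require_same_content:
            for name in self.decisions:
                self.decisions[name] = "pending"


strict = ApprovalRequest(["ana", "ben"])
strict.approve("ana")
strict.on_content_change()

aligned = ApprovalRequest(["ana", "ben"], require_same_content=False)
aligned.approve("ana")
aligned.on_content_change()
```

After the content change, the strict request has lost the recorded approval while the alignment request keeps it, and the approver can still call `reset` to withdraw it manually.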

This update follows a series of recent enhancements to the Drive approvals ecosystem, including programmatic approval management with the Drive API.

Getting started

  • Admins: Approval requests are enabled by default and can be disabled at the domain, OU, and group level. There is no admin setting that controls alignment approvals specifically; users can access them if they have access to the broader approval request feature. Visit the Help Center to learn more about managing Drive approvals.
  • End users: Alignment approvals will be off by default and can be enabled by the user. Visit the Help Center to learn more about getting approvals in Drive.

Rollout pace

Availability

  • Business: Business Standard and Plus
  • Enterprise: Enterprise Starter, Standard, and Plus
  • Education: Education Plus
  • Other Editions: Enterprise Essentials and Enterprise Essentials Plus; Nonprofits
  • Education Add-ons: Teaching and Learning

Resources


Datadog delivers millions of in-depth performance insights with ProfilingManager

Posted by Alice Yuan, Developer Relations Engineer at Google, Arti Arutiunov, Product Manager at Datadog and Nikita Ogorodnikov, Staff Software Engineer at Datadog



Performance regressions are notoriously hard to reproduce, which makes them a massive bottleneck for mobile developers. Although signals like ANR rates indicate which issues occur in production, pinpointing the specific line of code behind a performance issue has historically required exhaustive manual reproduction or speculative trial-and-error experimentation.

Datadog collaborated with Google to mitigate this frustration by integrating the ProfilingManager API (available on Android 15+ devices) into its Real User Monitoring (RUM) and Continuous Profiling platforms. This integration transforms the debugging workflow, allowing developers to move beyond surface-level symptoms and identify the why behind a performance bottleneck.

By leveraging this system-level API, Datadog now processes millions of production profiles weekly across the globe, according to Datadog internal data as of June 2026. It provides engineering teams with a new level of visibility into real-world performance, all while maintaining a low runtime overhead for production-scale performance monitoring.

The impact of ProfilingManager

ProfilingManager is a system service introduced in Android 15 that enables apps to programmatically collect performance data such as call stack samples, field traces and memory heap dumps directly from production environments. This capability shifts the engineering paradigm from reactive manual reproduction to proactive field analysis.

ProfilingManager is a highly performant solution for code-level insights. Of the solutions we evaluated, it has the lowest runtime overhead, gives deep visibility into Java, Kotlin, and C++ traces, and opens the door to gather memory profiles and system-level traces during critical moments like ANRs and out-of-memory (OOM) errors. - Yi Lu, Senior Engineer at Datadog


For example, a Google communications app used field traces to investigate why its cold start times were slower on newer, more powerful hardware. By diving into the field-collected traces and comparing traces across different device types, the engineer discovered a hidden scheduling issue: a background text-to-speech service was unnecessarily being prewarmed during app startup. The traces revealed that this background process was monopolizing the device's highest-performing big CPU core, forcing the app's main thread to sleep while the prewarm occurred.

Solving the Android code-level visibility challenge

Prior to the implementation of ProfilingManager, Datadog’s Real User Monitoring (RUM) focused on high-level application health and session-level telemetry to assess the user journey. Engineering teams could monitor Android performance signals like time to initial display, ANR rates, CPU load, and frozen frames. These insights extended to granular interactions, such as network latency, touch events, and main thread hangs. However, while this data effectively highlighted which performance bottlenecks were surfacing in the field, it provided no clear path to identifying the root cause of these failures.


We realized that across our profiling features, performance profiling on mobile applications remained a blind spot. Teams could see that an Android user experienced a slow screen render or an ANR, but lacked the same code-level visibility they relied on for their backend services. - Bryan Antigua, Senior Product Manager at Datadog


To address this, Datadog needed a profiling engine capable of capturing Android traces directly from devices in production with minimal performance impact. After evaluating alternative approaches, such as writing their own trace processor using Android Debug APIs, the team selected ProfilingManager because it was the most performant of the profiling options they evaluated and offloads the overhead of sampling decisions to the OS.

ProfilingManager supports a wide range of collection methods, including CPU traces, call stack sampling, and memory analysis through Java heap dumps and native heap profiles. It enables developers to profile production builds, upload trace files to external storage, and review them in the Perfetto trace analyzer UI. As a SaaS provider, Datadog uploads, visualizes, and analyzes these profiles collected via its SDK, providing a unified view of application health.

By centralizing high-fidelity telemetry within a unified observability API, ProfilingManager empowers Datadog and its clients to proactively monitor, investigate, and remediate complex Android performance regressions through key technical advantages:
  • Granular session diagnostics: ProfilingManager enhances debuggability by delivering direct OS-level trace data, overcoming the visibility and alignment challenges typical of custom logging with system services. To dive deeper, developers can download these traces from Datadog to investigate further in visualization tools like the Perfetto UI.
  • Automated telemetry triggers: By leveraging native system events to initiate trace recordings at key optimization points, Datadog reduces the need to build custom collection logic. While the initial rollout focuses on the APP_FULLY_DRAWN signal, there are already plans to expand this observability to include ANR, OOM, and COLD_START triggers.
  • Proactive trace snapshots: By interfacing directly with the system-level Perfetto service (traced), ProfilingManager utilizes a proactive background recording model designed to capture unpredictable issues. This ensures that developers receive a precise visualization of the events leading up to a performance anomaly, offering a level of insight that exceeds what is possible through manual instrumentation.
  • Bottleneck detection at scale: Datadog can synthesize telemetry from across its global customer base to uncover regressions that only emerge under unique hardware configurations and variable network environments.
  • System-enforced resource stability: The API leverages sampling trace collection to ensure performance and user experience impacts remain unnoticeable.
  • On-device data controls: ProfilingManager filters out irrelevant information from other processes on-device before the profile is delivered to the app. This minimizes file sizes and ensures that only data relevant to the app's processes is provided.

Processing millions of weekly profiles to optimize real-world apps

An example of Datadog's time to initial display measurement with stack sampling powered by ProfilingManager

Integrating a system-level profiling API into a global monitoring SDK required solving infrastructure challenges. Because ProfilingManager generates highly detailed performance traces, the Datadog engineering team had to build a pipeline capable of parsing and analyzing these profiles on the server side at scale. Beyond profile collection, Datadog also emphasizes the importance of balancing sampling frequency with collecting enough data to generate meaningful insights about your application. Datadog relies on ProfilingManager’s built-in rate limiting as a critical stability safeguard, preventing excessive telemetry requests from overburdening user devices.

The team has been profiling Datadog's own native Android application and a number of early adopters’ applications for months, gathering millions of profiles to ensure a fast, error-free launch experience and to refine their performance-detection algorithms. Today, the production integration seamlessly scales across a variety of Android devices.

Conclusion

By integrating Android’s ProfilingManager API, Datadog successfully closed the visibility gap between backend systems and mobile client applications for their customers. By processing millions of profiles weekly with negligible device overhead, Datadog equips Android developers with the code-level insights necessary to diagnose complex performance bugs instantly, helping developers build smoother applications and improve their app’s performance signals in the Play Store. To adopt the ProfilingManager API directly into your performance observability framework, check out our documentation.

In the future, Datadog aims to make Android profiling data a first-class input for coding agents to autonomously resolve performance bottlenecks, closing the feedback loop between detection and remediation. Datadog is working toward making Android profiling broadly accessible to developers.

To get started using the Datadog real user monitoring feature powered by ProfilingManager, visit Datadog Mobile Real User Monitoring.


Introducing the Google Colab CLI

Google has announced the Google Colab Command-Line Interface (CLI), a new tool that allows developers and AI agents to connect local terminals to remote Colab runtimes for frictionless execution. The lightweight CLI enables users to easily request high-powered GPUs, run local Python scripts remotely, and seamlessly retrieve artifact logs or models like fine-tuned Gemma 3 adapters. By integrating directly into standard terminal environments, the tool is highly programmable and ready to be used by AI agents such as Antigravity or Claude Code to manage complex machine learning pipelines.