
Videos have become an increasingly important part of our daily lives, spanning fields such as entertainment, education, and communication. Understanding the content of videos, however, is a challenging task as videos often contain multiple events occurring at different time scales. For example, a video of a musher hitching up dogs to a dog sled before they all race away involves a long event (the dogs pulling the sled) and a short event (the dogs being hitched to the sled). One way to spur research in video understanding is via the task of dense video captioning, which consists of temporally localizing and describing all events in a minutes-long video. This differs from single image captioning and from standard video captioning, which consists of describing a short video with a single sentence.
Dense video captioning systems have wide applications, such as making videos accessible to people with visual or auditory impairments, automatically generating chapters for videos, or improving the search of video moments in large databases. Current dense video captioning approaches, however, have several limitations — for example, they often contain highly specialized task-specific components, which make it challenging to integrate them into powerful foundation models. Furthermore, they are often trained exclusively on manually annotated datasets, which are very difficult to obtain and hence are not a scalable solution.
In this post, we introduce “Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning”, to appear at CVPR 2023. The Vid2Seq architecture augments a language model with special time tokens, allowing it to seamlessly predict event boundaries and textual descriptions in the same output sequence. In order to pre-train this unified model, we leverage unlabeled narrated videos by reformulating sentence boundaries of transcribed speech as pseudo-event boundaries, and using the transcribed speech sentences as pseudo-event captions. The resulting Vid2Seq model pre-trained on millions of narrated videos improves the state of the art on a variety of dense video captioning benchmarks including YouCook2, ViTT and ActivityNet Captions. Vid2Seq also generalizes well to the few-shot dense video captioning setting, the video paragraph captioning task, and the standard video captioning task. Finally, we have also released the code for Vid2Seq here.
Vid2Seq is a visual language model that predicts dense event captions together with their temporal grounding in a video by generating a single sequence of tokens.
Multimodal transformer architectures have improved the state of the art on a wide range of video tasks, such as action recognition. However, it is not straightforward to adapt such an architecture to the complex task of jointly localizing and captioning events in minutes-long videos.
To handle this, we augment a visual language model with special time tokens (like text tokens) that represent discretized timestamps in the video, similar to Pix2Seq in the spatial domain. Given visual inputs, the resulting Vid2Seq model can both take as input and generate sequences of text and time tokens. First, this enables the Vid2Seq model to understand the temporal information of the transcribed speech input, which is cast as a single sequence of tokens. Second, this allows Vid2Seq to jointly predict dense event captions and temporally ground them in the video while generating a single sequence of tokens.
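To make the mechanism concrete, here is a minimal sketch of how a continuous timestamp could be discretized into one of a fixed number of special time tokens appended to the text vocabulary. All names, the bin count, and the vocabulary size are illustrative assumptions, not the released Vid2Seq code.

```python
# Illustrative sketch only: discretize timestamps into special time tokens.
# The bin count and vocabulary size are assumptions, not Vid2Seq's actual values.

NUM_TIME_BINS = 100          # number of quantized time tokens added to the vocabulary
TEXT_VOCAB_SIZE = 32_000     # size of the base text vocabulary (assumed)

def timestamp_to_token(t_seconds: float, video_duration: float) -> int:
    """Map a timestamp to the id of a discretized time token."""
    t_rel = min(max(t_seconds / video_duration, 0.0), 1.0)        # normalize to [0, 1]
    bin_idx = min(int(t_rel * NUM_TIME_BINS), NUM_TIME_BINS - 1)  # quantize
    return TEXT_VOCAB_SIZE + bin_idx                              # time tokens follow text tokens

def event_to_sequence(start: float, end: float, caption_ids: list[int],
                      video_duration: float) -> list[int]:
    """Encode one event as [start time token, end time token, caption tokens]."""
    return [timestamp_to_token(start, video_duration),
            timestamp_to_token(end, video_duration)] + caption_ids
```

In this scheme, a full dense-captioning output is simply the concatenation of such per-event sequences, which is what lets a standard sequence-to-sequence decoder handle both localization and description.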
The Vid2Seq architecture includes a visual encoder and a text encoder, which encode the video frames and the transcribed speech input, respectively. The resulting encodings are then forwarded to a text decoder, which autoregressively predicts the output sequence of dense event captions together with their temporal localization in the video. The architecture is initialized with a powerful visual backbone and a strong language model.
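As a rough illustration of this encoder-decoder flow, the following PyTorch-style outline shows how the pieces could fit together; module names and shapes are placeholders rather than the actual Vid2Seq implementation.

```python
# Illustrative outline of the encoder-decoder flow described above.
import torch
import torch.nn as nn

class DenseCaptioningModel(nn.Module):
    def __init__(self, visual_encoder: nn.Module, text_encoder: nn.Module,
                 text_decoder: nn.Module):
        super().__init__()
        self.visual_encoder = visual_encoder  # encodes video frames (pretrained backbone)
        self.text_encoder = text_encoder      # encodes transcribed speech and time tokens
        self.text_decoder = text_decoder      # autoregressive decoder over text + time tokens

    def forward(self, video_frames: torch.Tensor, speech_tokens: torch.Tensor,
                target_tokens: torch.Tensor) -> torch.Tensor:
        visual_feats = self.visual_encoder(video_frames)           # (B, T_v, D)
        speech_feats = self.text_encoder(speech_tokens)            # (B, T_s, D)
        context = torch.cat([visual_feats, speech_feats], dim=1)   # joint multimodal context
        # The decoder attends to the context and predicts the single output
        # sequence of time tokens and caption tokens.
        return self.text_decoder(target_tokens, context)
```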
Due to the dense nature of the task, the manual collection of annotations for dense video captioning is particularly expensive. Hence we pre-train the Vid2Seq model using unlabeled narrated videos, which are easily available at scale. In particular, we use the YT-Temporal-1B dataset, which includes 18 million narrated videos covering a wide range of domains.
We use transcribed speech sentences and their corresponding timestamps as supervision, which are cast as a single sequence of tokens. We pre-train Vid2Seq with a generative objective that teaches the decoder to predict the transcribed speech sequence given visual inputs only, and a denoising objective that encourages multimodal learning by requiring the model to predict masked tokens given a noisy transcribed speech sequence and visual inputs. In particular, noise is added to the speech sequence by randomly masking out spans of tokens.
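As an illustration of the denoising side, the snippet below shows one simple way to mask out random spans of a token sequence; the mask ratio, span length, and sentinel id are assumptions for this sketch, not the exact corruption used in the paper.

```python
# Illustrative span masking for a denoising objective (parameters are assumptions).
import random

MASK_TOKEN_ID = 1  # sentinel id standing in for masked content (assumed)

def mask_spans(token_ids: list[int], mask_ratio: float = 0.25,
               mean_span_len: int = 5, seed: int = 0) -> list[int]:
    """Replace tokens inside randomly chosen spans with a mask sentinel."""
    rng = random.Random(seed)
    n_to_mask = int(len(token_ids) * mask_ratio)
    masked_positions: set[int] = set()
    while len(masked_positions) < n_to_mask:
        start = rng.randrange(len(token_ids))
        span = rng.randint(1, 2 * mean_span_len)
        masked_positions.update(range(start, min(start + span, len(token_ids))))
    return [MASK_TOKEN_ID if i in masked_positions else t
            for i, t in enumerate(token_ids)]
```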
Vid2Seq is pre-trained on unlabeled narrated videos with a generative objective (top) and a denoising objective (bottom).
The resulting pre-trained Vid2Seq model can be fine-tuned on downstream tasks with a simple maximum likelihood objective using teacher forcing (i.e., predicting the next token given previous ground-truth tokens). After fine-tuning, Vid2Seq notably improves the state of the art on three standard downstream dense video captioning benchmarks (ActivityNet Captions, YouCook2 and ViTT) and two video clip captioning benchmarks (MSR-VTT, MSVD). In our paper we provide additional ablation studies, qualitative results, as well as results in the few-shot settings and in the video paragraph captioning task.
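For readers who want to see what that fine-tuning objective looks like, here is a minimal sketch of a teacher-forced maximum-likelihood loss, assuming a decoder that returns per-token logits; the function and argument names are illustrative.

```python
# Minimal sketch of maximum-likelihood fine-tuning with teacher forcing.
import torch
import torch.nn.functional as F

def teacher_forcing_loss(logits: torch.Tensor, target_ids: torch.Tensor,
                         pad_id: int = 0) -> torch.Tensor:
    """Cross-entropy between next-token predictions and ground-truth tokens.

    logits:     (batch, seq_len, vocab_size), produced with ground-truth prefixes as input
    target_ids: (batch, seq_len), the ground-truth sequence of time and caption tokens
    """
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))  # position t predicts token t+1
    gold = target_ids[:, 1:].reshape(-1)
    return F.cross_entropy(pred, gold, ignore_index=pad_id)
```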
Comparison to state-of-the-art methods for dense video captioning (left) and for video clip captioning (right), on the CIDEr metric (higher is better).
We introduce Vid2Seq, a novel visual language model for dense video captioning that simply predicts all event boundaries and captions as a single sequence of tokens. Vid2Seq can be effectively pre-trained on unlabeled narrated videos at scale, and achieves state-of-the-art results on various downstream dense video captioning benchmarks. Learn more from the paper and grab the code here.
This research was conducted by Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic and Cordelia Schmid.
Hi everyone! We've just released Chrome Dev 113 (113.0.5651.0) for Android. It's now available on Google Play.
You can see a partial list of the changes in the Git log. For details on new features, check out the Chromium blog, and for details on web platform updates, check here.
If you find a new issue, please let us know by filing a bug.
Erhu Akpobaro
Google Chrome
The dev channel has been updated to 113.0.5653.0 for Windows, Linux and Mac.
The Beta channel is being updated to ChromeOS version: 15359.24.0 and Browser version: 112.0.5615.29 for most devices. This build contains a number of bug fixes and security updates.
If you find new issues, please let us know by filing a bug.
Interested in switching channels? Find out how.
Google ChromeOS.
Posted by Filipe Gracio, PhD - Customer Engineer
I keep forgetting what I have in the freezer. At first I used Google Sheets to keep track of it, but I wanted something that was easy to consult and update on my smartphone. So I turned to AppSheet! Here’s a tutorial to follow to make a similar tracking solution.
First I created a database that imported my data from the Sheet:
After selecting “Import from Sheets” and choosing the sheet I had been cumbersomely maintaining, I get a preview of the new database:
Then I can go back and create an App:
After I name it I can select the database I just created.
Then the App starts getting created, and I can start customizing it!
I decided I actually want to add more information to the App. For example, I want to categorize my items, so I need another column. I can edit the data and add a “Category” column.
After adding the extra column, this is the result:
That’s going to come in handy later for presentation and organization!
Now let's configure how the items are presented in the actual app. That’s in the UX section of the App builder. I want to select “Table”, group by “Category”, and then sort alphabetically by “Item”:
After tweaking a few more options in the UX “Brand” and “Format Rules” sections, this is how my app looks:
Now, I can see what I have in the freezer at all times. If I cook something and have a leftover, I can just add it by clicking the + button. After that, I just need to add in the info:
And of course, if I use something I can just tap on it to edit the amount (or delete it).
This small App is something I use every week now! It is much easier than my old method, plus I learned how to use AppSheet. And this was just a simple use case - one that only scratches the surface of AppSheet’s features. If you work for an organization that has information to share and organize, this technology could be useful for you.
Try it out for yourself: you can use the complete set of AppSheet features at no cost while building one or many app prototypes. You can also invite up to 10 test users at no cost to use your apps and share feedback.
Thank you to my colleague Florian Opitz, Customer Engineer - Google Workspace + Security, for his useful edits and suggestions.
Globalized technology has the potential to create large-scale societal impact, and having a grounded research approach rooted in existing international human and civil rights standards is a critical component to assuring responsible and ethical AI development and deployment. The Impact Lab team, part of Google’s Responsible AI Team, employs a range of interdisciplinary methodologies to ensure critical and rich analysis of the potential implications of technology development. The team’s mission is to examine socioeconomic and human rights impacts of AI, publish foundational research, and incubate novel mitigations enabling machine learning (ML) practitioners to advance global equity. We study and develop scalable, rigorous, and evidence-based solutions using data analysis, human rights, and participatory frameworks.
What makes the Impact Lab’s goals unique is the team’s multidisciplinary approach and its diversity of experience, spanning both applied and academic research. Our aim is to expand the epistemic lens of Responsible AI to center the voices of historically marginalized communities and to overcome the practice of ungrounded analysis of impacts by offering a research-based approach to understand how differing perspectives and experiences should impact the development of technology.
In response to the accelerating complexity of ML and the increased coupling between large-scale ML and people, our team critically examines traditional assumptions of how technology impacts society to deepen our understanding of this interplay. We collaborate with academic scholars in the areas of social science and philosophy of technology and publish foundational research focusing on how ML can be helpful and useful. We also offer research support to some of our organization’s most challenging efforts, including the 1,000 Languages Initiative and ongoing work in the testing and evaluation of language and generative models. Our work gives weight to Google's AI Principles.
To that end, we strive not only to reimagine existing frameworks for assessing the adverse impact of AI to answer ambitious research questions, but also to promote the importance of this work.
Our motivation for providing rigorous analytical tools and approaches is to ensure that socio-technical impact and fairness are well understood in relation to cultural and historical nuances. This is quite important, as it helps develop the incentive and ability to better understand communities who experience the greatest burden and demonstrates the value of rigorous and focused analysis. Our goals are to proactively partner with external thought leaders in this problem space, reframe our existing mental models when assessing potential harms and impacts, and avoid relying on unfounded assumptions and stereotypes in ML technologies. We collaborate with researchers at Stanford, University of California Berkeley, University of Edinburgh, Mozilla Foundation, University of Michigan, Naval Postgraduate School, Data & Society, EPFL, Australian National University, and McGill University.
We examine systemic social issues and generate useful artifacts for responsible AI development.
We also developed the Equitable AI Research Roundtable (EARR), a novel community-based research coalition created to establish ongoing partnerships with external nonprofit and research organization leaders who are equity experts in the fields of education, law, social justice, AI ethics, and economic development. These partnerships offer the opportunity to engage with multi-disciplinary experts on complex research questions related to how we center and understand equity using lessons from other domains. Our partners include PolicyLink; The Education Trust - West; Notley; Partnership on AI; Othering and Belonging Institute at UC Berkeley; The Michelson Institute for Intellectual Property, HBCU IP Futures Collaborative at Emory University; Center for Information Technology Research in the Interest of Society (CITRIS) at the Banatao Institute; and the Charles A. Dana Center at the University of Texas, Austin. The goals of the EARR program are to: (1) center knowledge about the experiences of historically marginalized or underrepresented groups, (2) qualitatively understand and identify potential approaches for studying social harms and their analogies within the context of technology, and (3) expand the lens of expertise and relevant knowledge as it relates to our work on responsible and safe approaches to AI development.
Through semi-structured workshops and discussions, EARR has provided critical perspectives and feedback on how to conceptualize equity and vulnerability as they relate to AI technology. We have partnered with EARR contributors on a range of topics, from generative AI and algorithmic decision making to transparency and explainability, with outputs ranging from adversarial queries to frameworks and case studies. Certainly the process of translating research insights across disciplines into technical solutions is not always easy, but this research has been a rewarding partnership. We present our initial evaluation of this engagement in this paper.
EARR: Components of the ML development life cycle in which multidisciplinary knowledge is key for mitigating human biases.
In partnership with our Civil and Human Rights Program, our research and analysis process is grounded in internationally recognized human rights frameworks and standards including the Universal Declaration of Human Rights and the UN Guiding Principles on Business and Human Rights. Utilizing civil and human rights frameworks as a starting point allows for a context-specific approach to research that takes into account how a technology will be deployed and its community impacts. Most importantly, a rights-based approach to research enables us to prioritize conceptual and applied methods that emphasize the importance of understanding the most vulnerable users and the most salient harms to better inform day-to-day decision making, product design and long-term strategies.
We seek to employ an approach to dataset curation, model development and evaluation that is rooted in equity and that avoids expeditious but potentially risky approaches, such as utilizing incomplete data or not considering the historical and sociocultural factors related to a dataset. Responsible data collection and analysis requires an additional level of careful consideration of the context in which the data are created. For example, one may see differences in outcomes across demographic variables that will be used to build models, and should question the structural and system-level factors at play, as some variables could ultimately be a reflection of historical, social and political factors. By using proxy data, such as race or ethnicity, gender, or zip code, we systematically merge the lived experiences of an entire group of diverse people and use them to train models that can recreate and maintain harmful and inaccurate character profiles of entire populations. Critical data analysis also requires a careful understanding that correlations or relationships between variables do not imply causation; the association we witness is often caused by multiple additional variables.
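To make the last point tangible, here is a tiny, self-contained simulation (not from the post) in which a hidden confounder drives two variables that never influence each other, yet they end up strongly correlated.

```python
# Toy illustration: a confounder z drives both x and y, so x and y are highly
# correlated even though neither causes the other.
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=10_000)                  # hidden confounder
x = z + rng.normal(scale=0.5, size=10_000)   # x depends only on z (plus noise)
y = z + rng.normal(scale=0.5, size=10_000)   # y depends only on z (plus noise)

print(np.corrcoef(x, y)[0, 1])               # ~0.8: strong correlation, no causation
```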
Building on this expanded and nuanced social understanding of data and dataset construction, we also approach the problem of anticipating or ameliorating the impact of ML models once they have been deployed for use in the real world. There are myriad ways in which the use of ML in various contexts — from education to health care — has exacerbated existing inequity because the developers and decision-making users of these systems lacked the relevant social understanding, historical context, and did not involve relevant stakeholders. This is a research challenge for the field of ML in general and one that is central to our team.
Our team also recognizes the saliency of understanding the socio-technical context globally. In line with Google’s mission to “organize the world’s information and make it universally accessible and useful”, our team is engaging in research partnerships globally. For example, we are collaborating with the Natural Language Processing team and the Human Centered team in the Makerere Artificial Intelligence Lab in Uganda to research cultural and language nuances as they relate to language model development.
We continue to address the impacts of ML models deployed in the real world by conducting further socio-technical research and engaging external experts who are also part of the communities that are historically and globally disenfranchised. The Impact Lab is excited to offer an approach that contributes to the development of solutions for applied problems through the utilization of social-science, evaluation, and human rights epistemologies.
We would like to thank each member of the Impact Lab team — Jamila Smith-Loud, Andrew Smart, Jalon Hall, Darlene Neal, Amber Ebinama, and Qazi Mamunur Rashid — for all the hard work they do to ensure that ML is more responsible to its users and society across communities and around the world.
Posted by Laura Cincera, Program Manager, Google Developer Student Clubs Europe
Mental health remains one of the most neglected areas of healthcare worldwide, with nearly 1 billion people currently living with a mental health condition that requires support. But what if there was a way to make mental health care more accessible and tailored to individual needs?
The Google Developer Student Clubs Solution Challenge aims to inspire and empower university students to tackle our most pressing challenges - like mental health. The Solution Challenge is an annual opportunity to turn visionary ideas into reality and make a real-world impact using the United Nations' 17 Sustainable Development Goals as a blueprint for action. Students from all over the world work together and apply their skills to create innovative solutions using Google technology, creativity and the power of community.
One of last year’s top Solution Challenge proposals, Xtrinsic, was a collaboration between two communities of student leaders - GDSC Freiburg in Germany and GDSC Kyiv in Ukraine. The team developed an innovative mental health research and therapy application that adapts to users' personal habits and needs, providing effective support at scale.
Using a wearable device and TensorFlow, Xtrinsic helps users manage their symptoms by providing customized behavioral suggestions based on their physiological signs. It acts as an intervention tool for mental health issues such as nightmares, panic attacks, and anxiety and adapts the user's environment to their specific needs - which is essential for effective interventions. For example, if the user experiences a panic attack, the app detects the physiological signs using a smartwatch and a machine learning model, and triggers appropriate action, such as playing relaxing sounds, changing the room light to blue, or starting a guided breathing exercise. The solution was built using several Google technologies, including Android, Assistant/Actions on Google, Firebase, Flutter, Google Cloud, TensorFlow, WearOS, DialogFlow, and Google Health Services.
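The post does not include the team's code, but the general pattern it describes (classify a window of wearable signals, then trigger a calming action) can be sketched roughly as follows. Everything here is an illustrative assumption: the model, the sensor features, the threshold, and the intervention hooks are placeholders, not the Xtrinsic implementation.

```python
# Illustrative sketch of an on-device intervention loop (not the Xtrinsic code).
import numpy as np
import tensorflow as tf

# Hypothetical classifier over 60-sample windows of
# [heart_rate, heart_rate_variability, movement] readings from a smartwatch.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(60, 3)),
    tf.keras.layers.Conv1D(16, 5, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of a panic episode
])

def start_guided_breathing() -> None:
    print("Starting guided breathing exercise")  # placeholder intervention

def play_relaxing_sounds() -> None:
    print("Playing relaxing sounds")             # placeholder intervention

def on_new_sensor_window(window: np.ndarray) -> None:
    """Run inference on the latest sensor window and react if a panic episode is likely."""
    prob = float(model.predict(window[np.newaxis, ...], verbose=0)[0, 0])
    if prob > 0.8:  # confidence threshold (illustrative)
        start_guided_breathing()
        play_relaxing_sounds()
```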
The team behind Xtrinsic is diverse. Alexander, Chikordili, Emma and Vandysh come from different backgrounds but share a passion for AI and how it can be leveraged to improve the lives of many. They all recognize the importance of raising awareness of mental health and creating a supportive culture that is free from stigma. Their personal experiences in conflict areas, such as Syria and Ukraine, inspired them to develop the application.
Xtrinsic was recognized as one of the Top 3 winning teams in the 2022 Google Solution Challenge for its innovative approach to mental health research and therapy. The team has since supported several other social impact initiatives - helping grow the network of entrepreneurs and community leaders in Europe and beyond.
If you feel inspired to make a positive change through technology, submit your project to Solution Challenge 2023 here. And if you’re passionate about technology and are ready to use your skills to help your local community, then consider becoming a Google Developer Student Clubs Lead!
We encourage all interested university students to apply here and submit their applications as soon as possible. The applications in Europe, India, North America and MENA are currently open.
Learn more about Google Developer Student Clubs here.