September 2021 update to Display & Video 360 API v1


Today we’re releasing an update to the Display & Video 360 API that includes the following features:
More detailed information about this update can be found in the Display & Video 360 API release notes. Before using these new features, make sure to update your client library to the latest version.


If you run into issues or need help with these new features or samples, please contact us using our support contact form.

Our Grace Hopper subsea cable has landed in the UK

Last year, we announced a new subsea cable — named Grace Hopper after the computer science pioneer — that will run between the United States, the United Kingdom and Spain. The cable will improve the resilience of the Google network that underpins our consumer and enterprise products. The 16-fibre pair Google-funded cable will connect New York (United States) to Bude (United Kingdom) and Bilbao (Spain).

Today, the Grace Hopper cable has landed in Bude, Cornwall. 

Many people around the world use Google products every day to stay in touch with friends and family, travel from point A to point B, find new customers or export products to new markets. As our first Google-funded cable to the U.K., Grace Hopper is part of our ongoing investment in the country, supporting users who rely on our products and customers using our tools to grow their business.

We know that technology is only becoming more important for the U.K. economy. The amount technology contributes to the U.K. economy has grown on average by 7% year on year since 2016. And U.K.-based venture capital investment is ranked third in the world, reaching a record high of $15 billion in 2020, despite the challenging conditions from the COVID-19 pandemic. What’s more, 10% of all current U.K. job vacancies are in tech roles, and the number of people employed in the tech sector has grown 40% in two years. With this in mind, improving the diversity and resilience of Google’s network is crucial to our ability to continue supporting one of the U.K.’s most vital sectors, as well as its long-term economic success.  

Grace Hopper represents a new generation of trans-Atlantic cable coming to the U.K. shores and is one of the first new cables to connect the U.S. and the U.K. since 2003. Moreover, with the ongoing pandemic fostering a new digital normal, Google-funded subsea cables allow us to plan and prepare for the future capacity needs of our customers, no matter where they are in the world. Grace Hopper will connect the U.K. to help meet the rapidly growing demand for high-bandwidth connectivity and services.

An image of the Google buoy landing on the sandy beach of Bude, Cornwall

Grace Hopper buoy landing on the beach in Bude, Cornwall

Alongside CurieDunantEquiano and Firmina, Grace Hopper is the latest cable to connect continents along the ocean floor with an additional layer of security beyond what’s available over the public internet. We’ve worked with established channels and experts for years to ensure that Grace Hopper will be able to achieve better reliability in global communications, and free flows of data.

Following a successful Bilbao landing earlier in September, Grace Hopper also marks our first ever Google-funded route to Spain, taking a unique path from our existing cables, such as Dunant, which connects the U.S. and France, and Havfrue, which links the U.S. and Denmark. The cable will use novel “fibre switching,” which allows us to better move traffic around outages for increased reliability. Once it is complete, Grace Hopper will carry traffic quickly and securely between the continents, increasing capacity and powering Google services like Meet, Gmail and Google Cloud.

Grace Hopper will use this new switching architecture to provide optimum levels of network flexibility and resilience to adjust to unforeseen failures or traffic patterns. The multi-directional switching architecture is a significant breakthrough for uncertain times, and will more tightly integrate the upcoming Google Cloud region in Madrid into our global infrastructure. 

With the landing of the Grace Hopper cable in Cornwall, we look forward to supporting the next great U.K. tech innovations.


Our new animated series brings data centers to life

If you rely on the internet to search for the answer to a burning question, access work documents or stream your favorite TV show, you may have wondered how you can get the content you want so easily and quickly. You can thank a data center for that. 

Which may make you wonder: What exactly is a data center, and what is its purpose?

Google’s Discovering Data Centers series of short animated videos has the answers. As host of this series, I invite you to join us and learn about these expansive, supercomputer-filled warehouses that we all rely on, yet may know little about.

A loop of an animated video showing a data center campus surrounded by trees, blue sky, power lines, and wind turbines. Three small bubbles appear over the data center with images in each: a computer server to represent storage, wires to represent the power supply, and a fan to represent the cooling infrastructure.

Each video in this series helps peel back the layers on what makes data centers so fascinating: design, technology, operations and sustainability. There are times you click Start on Google Maps, edit a Google Doc or watch a YouTube video on how to fix something. By watching this series, you’ll better understand how Google’s data centers get you and billions of other users like you to that content quickly, securely and sustainably. 

Discovering Data Centers will help you understand: 

  • How data centers play a critical role in organizing your and the world’s information.
  • Data center design and how data centers are built to be sustainable. 
  • Our core principles, which show you can depend on us to be available 24/7. 

As the second season of our series gets underway, upcoming topics include: 

  • How hundreds of machines at a data center store data.
  • How our network allows data to travel through and between data centers within seconds. 
  • How encryption of data works to help secure every packet of data stored in our data centers.

To watch this series and see how data centers benefit you, visit our website. Check back monthly for new episodes where I’ll continue to reveal all the layers that make a data center hum. 

Click through the images below to read episode descriptions and take a peek at the engineering marvels that are today’s data centers.


Stable Channel Update for Desktop

The Stable channel has been updated to 93.0.4577.82 for Windows, Mac and Linux which will roll out over the coming days/weeks


A full list of changes in this build is available in the log. Interested in switching release channels?  Find out how here. If you find a new issue, please let us know by filing a bug. The community help forum is also a great place to reach out for help or learn about common issues.

Security Fixes and Rewards

Note: Access to bug details and links may be kept restricted until a majority of users are updated with a fix. We will also retain restrictions if the bug exists in a third party library that other projects similarly depend on, but haven’t yet fixed.


This update includes 11 security fixes. Below, we highlight fixes that were contributed by external researchers. Please see the Chrome Security Page for more information.


[$7500][1237533] High CVE-2021-30625: Use after free in Selection API. Reported by Marcin Towalski of Cisco Talos on 2021-08-06

[$7500][1241036] High CVE-2021-30626: Out of bounds memory access in ANGLE. Reported by Jeonghoon Shin of Theori on 2021-08-18

[$5000][1245786] High CVE-2021-30627: Type Confusion in Blink layout. Reported by Aki Helin of OUSPG on 2021-09-01

[$TBD][1241123] High CVE-2021-30628: Stack buffer overflow in ANGLE. Reported by Jaehun Jeong(@n3sk) of Theori on 2021-08-18

[$TBD][1243646] High CVE-2021-30629: Use after free in Permissions. Reported by Weipeng Jiang (@Krace) from Codesafe Team of Legendsec at Qi'anxin Group on 2021-08-26

[$TBD][1244568] High CVE-2021-30630: Inappropriate implementation in Blink . Reported by SorryMybad (@S0rryMybad) of Kunlun Lab on 2021-08-30

[$TBD][1246932] High CVE-2021-30631: Type Confusion in Blink layout. Reported by Atte Kettunen of OUSPG on 2021-09-06

[$TBD][1247763] High CVE-2021-30632: Out of bounds write in V8. Reported by Anonymous on 2021-09-08

[$TBD][1247766] High CVE-2021-30633: Use after free in Indexed DB API. Reported by Anonymous on 2021-09-08


We would also like to thank all security researchers that worked with us during the development cycle to prevent security bugs from ever reaching the stable channel.


Google is aware that exploits for CVE-2021-30632 and CVE-2021-30633 exist in the wild.


As usual, our ongoing internal security work was responsible for a wide range of fixes:

[1249027] Various fixes from internal audits, fuzzing and other initiatives




Prudhvikumar Bommana
Google Chrome

Chrome for Android Update

Hi, everyone! We've just released Chrome 93 (93.0.4577.82) for Android: it'll become available on Google Play over the next few days.

This release includes stability and performance improvements. You can see a full list of the changes in the Git log. If you find a new issue, please let us know by filing a bug.

Ben Mason
Google Chrome

Dev Channel Update for Chrome OS

The Dev channel is being updated to 95.0.4635.0 (Platform version: 14209.0.0) for most Chrome OS devices.

This build contains a number of features, bug fixes and security updates, please find release notes here.

If you find new issues, please let us know by visiting our forum or filing a bug. Interested in switching channels Find out how. You can submit feedback using ‘Report an issue...’ in the Chrome menu (3 vertical dots in the upper right corner of the browser). 

Daniel Gagnon,
Google Chrome OS

Chrome for iOS Update

Hi, everyone! We've just released Chrome 93 (93.0.4577.78) for iOS: it'll become available on App Store in the next few hours.

This release includes stability and performance improvements. You can see a full list of the changes in the Git log. If you find a new issue, please let us know by filing a bug.

Harry Souders

Google Chrome

Music Conditioned 3D Dance Generation with AIST++

Dancing is a universal language found in nearly all cultures, and is an outlet many people use to express themselves on contemporary media platforms today. The ability to dance by composing movement patterns that align to music beats is a fundamental aspect of human behavior. However, dancing is a form of art that requires practice. In fact, professional training is often required to equip a dancer with a rich repertoire of dance motions needed to create expressive choreography. While this process is difficult for people, it is even more challenging for a machine learning (ML) model, because the task requires the ability to generate a continuous motion with high kinematic complexity, while capturing the non-linear relationship between the movements and the accompanying music.

In “AI Choreographer: Music-Conditioned 3D Dance Generation with AIST++”, presented at ICCV 2021, we propose a full-attention cross-modal Transformer (FACT) model can mimic and understand dance motions, and can even enhance a person’s ability to choreograph dance. Together with the model, we released a large-scale, multi-modal 3D dance motion dataset, AIST++, which contains 5.2 hours of 3D dance motion in 1408 sequences, covering 10 dance genres, each including multi-view videos with known camera poses. Through extensive user studies on AIST++, we find that the FACT model outperforms recent state-of-the-art methods, both qualitatively and quantitatively.

We present a novel full-attention cross-modal transformer (FACT) network that can generate realistic 3D dance motion (right) conditioned on music and a new 3D dance dataset, AIST++ (left).

We generate the proposed 3D motion dataset from the existing AIST Dance Database — a collection of videos of dance with musical accompaniment, but without any 3D information. AIST contains 10 dance genres: Old School (Break, Pop, Lock and Waack) and New School (Middle Hip-Hop, LA-style Hip-Hop, House, Krump, Street Jazz and Ballet Jazz). Although it contains multi-view videos of dancers, these cameras are not calibrated.

For our purposes, we recovered the camera calibration parameters and the 3D human motion in terms of parameters used by the widely used SMPL 3D model. The resulting database, AIST++, is a large-scale, 3D human dance motion dataset that contains a wide variety of 3D motion, paired with music. Each frame includes extensive annotations:

  • 9 views of camera intrinsic and extrinsic parameters;
  • 17 COCO-format human joint locations in both 2D and 3D;
  • 24 SMPL pose parameters along with the global scaling and translation.

The motions are equally distributed among all 10 dance genres, covering a wide variety of music tempos in beat per minute (BPM). Each genre of dance contains 85% basic movements and 15% advanced movements (longer choreographies freely designed by the dancers).

The AIST++ dataset also contains multi-view synchronized image data, making it useful for other research directions, such as 2D/3D pose estimation. To our knowledge, AIST++ is the largest 3D human dance dataset with 1408 sequences, 30 subjects and 10 dance genres, and with both basic and advanced choreographies.

An example of a 3D dance sequence in the AIST++ dataset. Left: Three views of the dance video from the AIST database. Right: Reconstructed 3D motion visualized in 3D mesh (top) and skeletons (bottom).

Because AIST is an instructional database, it records multiple dancers following the same choreography for different music with varying BPM, a common practice in dance. This posits a unique challenge in cross-modal sequence-to-sequence generation as the model needs to learn the one-to-many mapping between audio and motion. We carefully construct non-overlapping train and test subsets on AIST++ to ensure neither choreography nor music is shared across the subsets.

Full Attention Cross-Modal Transformer (FACT) Model
Using this data, we train the FACT model to generate 3D dance from music. The model begins by encoding seed motion and audio inputs using separate motion and audio transformers. The embeddings are then concatenated and sent to a cross-modal transformer, which learns the correspondence between both modalities and generates N future motion sequences. These sequences are then used to train the model in a self-supervised manner. All three transformers are jointly learned end-to-end. At test time, we apply this model in an autoregressive framework, where the predicted motion serves as the input to the next generation step. As a result, the FACT model is capable of generating long range dance motion frame-by-frame.

The FACT network takes in a music piece (Y) and a 2-second sequence of seed motion (X), then generates long-range future motions that correlate with the input music.

FACT involves three key design choices that are critical for producing realistic 3D dance motion from music.

  1. All of the transformers use a full-attention mask, which can be more expressive than typical causal models because internal tokens have access to all inputs.
  2. We train the model to predict N futures beyond the current input, instead of just the next motion. This encourages the network to pay more attention to the temporal context, and helps prevent the model from motion freezing or diverging after a few generation steps.
  3. We fuse the two embeddings (motion and audio) early and employ a deep 12-layer cross-modal transformer module, which is essential for training a model that actually pays attention to the input music.

Results
We evaluate the performance based on three metrics:

Motion Quality: We calculate the Frechet Inception Distance (FID) between the real dance motion sequences in the AIST++ test set and 40 model generated motion sequences, each with 1200 frames (20 secs). We denote the FID based on the geometric and kinetic features as FIDg and FIDk, respectively.

Generation Diversity: Similar to prior work, to evaluate the model’s ability to generate divers dance motions, we calculate the average Euclidean distance in the feature space across 40 generated motions on the AIST++ test set, again comparing geometric feature space (Distg) and in the kinetic feature space (Distk).

Four different dance choreographies (right) generated using different music, but the same two second seed motion (left). The genres of the conditioning music are: Break, Ballet Jazz, Krump and Middle Hip-hop. The seed motion comes from hip-hop dance.

Motion-Music Correlation: Because there is no well-designed metric to measure the correlation between input music (music beats) and generated 3D motion (kinematic beats), we propose a novel metric, called Beat Alignment Score (BeatAlign).

Kinetic velocity (blue curve) and kinematic beats (green dotted line) of the generated dance motion, as well as the music beats (orange dotted line). The kinematic beats are extracted by finding local minima from the kinetic velocity curve.

Quantitative Evaluation
We compare the performance of FACT on each of these metrics to that of other state-of-the-art methods.

Compared to three recent state-of-the-art methods (Li et al., Dancenet, and Dance Revolution), the FACT model generates motions that are more realistic, better correlated with input music, and more diversified when conditioned on different music. *Note that the Li et al. generated motions are discontinuous, making the average kinetic feature distance abnormally high.

We also perceptually evaluate the motion-music correlation with a user study in which each participant is asked to watch 10 videos showing one of our results and one random counterpart, and then select which dancer is more in sync with the music. The study consisted of 30 participants, ranging from professional dancers to people who rarely dance. Compared to each baseline, 81% prefered the FACT model output to that of Li et al., 71% prefered FACT to Dancenet, and 77% prefered it Dance Revolution. Interestingly, 75% of participants preferred the unpaired AIST++ dance motion to that generated by FACT, which is unsurprising since the original dance captures are highly expressive.

Qualitative Results
Compared with prior methods like DanceNet (left) and Li et. al. (middle), 3D dance generated using the FACT model (right) is more realistic and better correlated with input music.

More generated 3D dances using the FACT model.

Conclusion and Discussion
We present a model that can not only learn the audio-motion correspondence, but also can generate high quality 3D motion sequences conditioned on music. Because generating 3D movement from music is a nascent area of study, we hope our work will pave the way for future cross-modal audio to 3D motion generation. We are also releasing AIST++, the largest 3D human dance dataset to date. This proposed, multi-view, multi-genre, cross-modal 3D motion dataset can not only help research in the conditional 3D motion generation research but also human understanding research in general. We are releasing the code in our GitHub repository and the trained model here.

While our results show a promising direction in this problem of music conditioned 3D motion generation, there are more to be explored. First, our approach is kinematic-based and we do not reason about physical interactions between the dancer and the floor. Therefore the global translation can lead to artifacts, such as foot sliding and floating. Second, our model is currently deterministic. Exploring how to generate multiple realistic dances per music is an exciting direction.

Acknowledgements
We gratefully acknowledge the contribution of other co-authors, including Ruilong Li and David Ross. We thank Chen Sun, Austin Myers, Bryan Seybold and Abhijit Kundu for helpful discussions. We thank Emre Aksan and Jiaman Li for sharing their code. We also thank Kevin Murphy for the early attempts in this direction, as well as Peggy Chi and Pan Chen for the help on user study experiments.

Source: Google AI Blog


Drone control via gestures using MediaPipe Hands

A guest post by Neurons Lab

Please note that the information, uses, and applications expressed in the below post are solely those of our guest author, Neurons Lab, and not necessarily those of Google.

How the idea emerged

With the advancement of technology, drones have become not only smaller, but also have more compute. There are many examples of iPhone-sized quadcopters in the consumer drone market and the computing power to do live tracking while recording 4K video. However, the most important element has not changed much - the controller. It is still bulky and not intuitive for beginners to use. There is a smartphone with on-display control as an option; however, the control principle is still the same. 

That is how the idea for this project emerged: a more personalised approach to control the drone using gestures. ML Engineer Nikita Kiselov (me) together with consultation from my colleagues at Neurons Lab undertook this project. 

Demonstration of drone flight control via gestures using MediaPipe Hands

Figure 1: [GIF] Demonstration of drone flight control via gestures using MediaPipe Hands

Why use gesture recognition?

Gestures are the most natural way for people to express information in a non-verbal way.  Gesture control is an entire topic in computer science that aims to interpret human gestures using algorithms. Users can simply control devices or interact without physically touching them. Nowadays, such types of control can be found from smart TV to surgery robots, and UAVs are not the exception.

Although gesture control for drones have not been widely explored lately, the approach has some advantages:

  • No additional equipment needed.
  • More human-friendly controls.
  • All you need is a camera that is already on all drones.

With all these features, such a control method has many applications.

Flying action camera. In extreme sports, drones are a trendy video recording tool. However, they tend to have a very cumbersome control panel. The ability to use basic gestures to control the drone (while in action)  without reaching for the remote control would make it easier to use the drone as a selfie camera. And the ability to customise gestures would completely cover all the necessary actions.

This type of control as an alternative would be helpful in an industrial environment like, for example, construction conditions when there may be several drone operators (gesture can be used as a stop-signal in case of losing primary source of control).

The Emergencies and Rescue Services could use this system for mini-drones indoors or in hard-to-reach places where one of the hands is busy. Together with the obstacle avoidance system, this would make the drone fully autonomous, but still manageable when needed without additional equipment.

Another area of application is FPV (first-person view) drones. Here the camera on the headset could be used instead of one on the drone to recognise gestures. Because hand movement can be impressively precise, this type of control, together with hand position in space, can simplify the FPV drone control principles for new users. 

However, all these applications need a reliable and fast (really fast) recognition system. Existing gesture recognition systems can be fundamentally divided into two main categories: first - where special physical devices are used, such as smart gloves or other on-body sensors; second - visual recognition using various types of cameras. Most of those solutions need additional hardware or rely on classical computer vision techniques. Hence, that is the fast solution, but it's pretty hard to add custom gestures or even motion ones. The answer we found is MediaPipe Hands that was used for this project.

Overall project structure

To create the proof of concept for the stated idea, a Ryze Tello quadcopter was used as a UAV. This drone has an open Python SDK, which greatly simplified the development of the program. However, it also has technical limitations that do not allow it to run gesture recognition on the drone itself (yet). For this purpose a regular PC or Mac was used. The video stream from the drone and commands to the drone are transmitted via regular WiFi, so no additional equipment was needed. 

To make the program structure as plain as possible and add the opportunity for easily adding gestures, the program architecture is modular, with a control module and a gesture recognition module. 

Scheme that shows overall project structure and how videostream data from the drone is processed

Figure 2: Scheme that shows overall project structure and how videostream data from the drone is processed

The application is divided into two main parts: gesture recognition and drone controller. Those are independent instances that can be easily modified. For example, to add new gestures or change the movement speed of the drone.

Video stream is passed to the main program, which is a simple script with module initialisation, connections, and typical for the hardware while-true cycle. Frame for the videostream is passed to the gesture recognition module. After getting the ID of the recognised gesture, it is passed to the control module, where the command is sent to the UAV. Alternatively, the user can control a drone from the keyboard in a more classical manner.

So, you can see that the gesture recognition module is divided into keypoint detection and gesture classifier. Exactly the bunch of the MediaPipe key point detector along with the custom gesture classification model distinguishes this gesture recognition system from most others.

Gesture recognition with MediaPipe

Utilizing MediaPipe Hands is a winning strategy not only in terms of speed, but also in flexibility. MediaPipe already has a simple gesture recognition calculator that can be inserted into the pipeline. However, we needed a more powerful solution with the ability to quickly change the structure and behaviour of the recognizer. To do so and classify gestures, the custom neural network was created with 4 Fully-Connected layers and 1 Softmax layer for classification.

Figure 3: Scheme that shows the structure of classification neural network

Figure 3: Scheme that shows the structure of classification neural network

This simple structure gets a vector of 2D coordinates as an input and gives the ID of the classified gesture. 

Instead of using cumbersome segmentation models with a more algorithmic recognition process, a simple neural network can easily handle such tasks. Recognising gestures by keypoints, which is a simple vector with 21 points` coordinates, takes much less data and time. What is more critical, new gestures can be easily added because model retraining tasks take much less time than the algorithmic approach.

To train the classification model, dataset with keypoints` normalised coordinates and ID of a gesture was used. The numerical characteristic of the dataset was that:

  • 3 gestures with 300+ examples (basic gestures)
  • 5 gestures with 40 -150 examples 

All data is a vector of x, y coordinates that contain small tilt and different shapes of hand during data collection.

Figure 4: Confusion matrix and classification report for classification

Figure 4: Confusion matrix and classification report for classification

We can see from the classification report that the precision of the model on the test dataset (this is 30% of all data) demonstrated almost error-free for most classes, precision > 97% for any class. Due to the simple structure of the model, excellent accuracy can be obtained with a small number of examples for training each class. After conducting several experiments, it turned out that we just needed the dataset with less than 100 new examples for good recognition of new gestures. What is more important, we don’t need to retrain the model for each motion in different illumination because MediaPipe takes over all the detection work.

Figure 5: [GIF] Test that demonstrates how fast classification network can distinguish newly trained gestures using the information from MediaPipe hand detector

Figure 5: [GIF] Test that demonstrates how fast classification network can distinguish newly trained gestures using the information from MediaPipe hand detector

From gestures to movements

To control a drone, each gesture should represent a command for a drone. Well, the most excellent part about Tello is that it has a ready-made Python API to help us do that without explicitly controlling motors hardware. We just need to set each gesture ID to a command.

Figure 6: Command-gesture pairs representation

Figure 6: Command-gesture pairs representation

Each gesture sets the speed for one of the axes; that’s why the drone’s movement is smooth, without jitter. To remove unnecessary movements due to false detection, even with such a precise model, a special buffer was created, which is saving the last N gestures. This helps to remove glitches or inconsistent recognition.

The fundamental goal of this project is to demonstrate the superiority of the keypoint-based gesture recognition approach compared to classical methods. To demonstrate all the potential of this recognition model and its flexibility, there is an ability to create the dataset on the fly … on the drone`s flight! You can create your own combinations of gestures or rewrite an existing one without collecting massive datasets or manually setting a recognition algorithm. By pressing the button and ID key, the vector of detected points is instantly saved to the overall dataset. This new dataset can be used to retrain classification network to add new gestures for the detection. For now, there is a notebook that can be run on Google Colab or locally. Retraining the network-classifier takes about 1-2 minutes on a standard CPU instance. The new binary file of the model can be used instead of the old one. It is as simple as that. But for the future, there is a plan to do retraining right on the mobile device or even on the drone.

Figure 7: Notebook for model retraining in action

Figure 7: Notebook for model retraining in action

Summary 

This project is created to make a push in the area of the gesture-controlled drones. The novelty of the approach lies in the ability to add new gestures or change old ones quickly. This is made possible thanks to MediaPipe Hands. It works incredibly fast, reliably, and ready out of the box, making gesture recognition very fast and flexible to changes. Our Neuron Lab`s team is excited about the demonstrated results and going to try other incredible solutions that MediaPipe provides. 

We will also keep track of MediaPipe updates, especially about adding more flexibility in creating custom calculators for our own models and reducing barriers to entry when creating them. Since at the moment our classifier model is outside the graph, such improvements would make it possible to quickly implement a custom calculator with our model into reality.

Another highly anticipated feature is Flutter support (especially for iOS). In the original plans, the inference and visualisation were supposed to be on a smartphone with NPU\GPU utilisation, but at the moment support quality does not satisfy our requests. Flutter is a very powerful tool for rapid prototyping and concept checking. It allows us to throw and test an idea cross-platform without involving a dedicated mobile developer, so such support is highly demanded. 

Nevertheless, the development of this demo project continues with available functionality, and there are already several plans for the future. Like using the MediaPipe Holistic for face recognition and subsequent authorisation. The drone will be able to authorise the operator and give permission for gesture control. It also opens the way to personalisation. Since the classifier network is straightforward, each user will be able to customise gestures for themselves (simply by using another version of the classifier model). Depending on the authorised user, one or another saved model will be applied. Also in the plans to add the usage of Z-axis. For example, tilt the palm of your hand to control the speed of movement or height more precisely. We encourage developers to innovate responsibly in this area, and to consider responsible AI practices such as testing for unfair biases and designing with safety and privacy in mind.

We highly believe that this project will motivate even small teams to do projects in the field of ML computer vision for the UAV, and MediaPipe will help to cope with the limitations and difficulties on their way (such as scalability, cross-platform support and GPU inference).


If you want to contribute, have ideas or comments about this project, please reach out to [email protected], or visit the GitHub page of the project.

This blog post is curated by Igor Kibalchich, ML Research Product Manager at Google AI.

Drone control via gestures using MediaPipe Hands

A guest post by Neurons Lab

Please note that the information, uses, and applications expressed in the below post are solely those of our guest author, Neurons Lab, and not necessarily those of Google.

How the idea emerged

With the advancement of technology, drones have become not only smaller, but also have more compute. There are many examples of iPhone-sized quadcopters in the consumer drone market and the computing power to do live tracking while recording 4K video. However, the most important element has not changed much - the controller. It is still bulky and not intuitive for beginners to use. There is a smartphone with on-display control as an option; however, the control principle is still the same. 

That is how the idea for this project emerged: a more personalised approach to control the drone using gestures. ML Engineer Nikita Kiselov (me) together with consultation from my colleagues at Neurons Lab undertook this project. 

Demonstration of drone flight control via gestures using MediaPipe Hands

Figure 1: [GIF] Demonstration of drone flight control via gestures using MediaPipe Hands

Why use gesture recognition?

Gestures are the most natural way for people to express information in a non-verbal way.  Gesture control is an entire topic in computer science that aims to interpret human gestures using algorithms. Users can simply control devices or interact without physically touching them. Nowadays, such types of control can be found from smart TV to surgery robots, and UAVs are not the exception.

Although gesture control for drones have not been widely explored lately, the approach has some advantages:

  • No additional equipment needed.
  • More human-friendly controls.
  • All you need is a camera that is already on all drones.

With all these features, such a control method has many applications.

Flying action camera. In extreme sports, drones are a trendy video recording tool. However, they tend to have a very cumbersome control panel. The ability to use basic gestures to control the drone (while in action)  without reaching for the remote control would make it easier to use the drone as a selfie camera. And the ability to customise gestures would completely cover all the necessary actions.

This type of control as an alternative would be helpful in an industrial environment like, for example, construction conditions when there may be several drone operators (gesture can be used as a stop-signal in case of losing primary source of control).

The Emergencies and Rescue Services could use this system for mini-drones indoors or in hard-to-reach places where one of the hands is busy. Together with the obstacle avoidance system, this would make the drone fully autonomous, but still manageable when needed without additional equipment.

Another area of application is FPV (first-person view) drones. Here the camera on the headset could be used instead of one on the drone to recognise gestures. Because hand movement can be impressively precise, this type of control, together with hand position in space, can simplify the FPV drone control principles for new users. 

However, all these applications need a reliable and fast (really fast) recognition system. Existing gesture recognition systems can be fundamentally divided into two main categories: first - where special physical devices are used, such as smart gloves or other on-body sensors; second - visual recognition using various types of cameras. Most of those solutions need additional hardware or rely on classical computer vision techniques. Hence, that is the fast solution, but it's pretty hard to add custom gestures or even motion ones. The answer we found is MediaPipe Hands that was used for this project.

Overall project structure

To create the proof of concept for the stated idea, a Ryze Tello quadcopter was used as a UAV. This drone has an open Python SDK, which greatly simplified the development of the program. However, it also has technical limitations that do not allow it to run gesture recognition on the drone itself (yet). For this purpose a regular PC or Mac was used. The video stream from the drone and commands to the drone are transmitted via regular WiFi, so no additional equipment was needed. 

To make the program structure as plain as possible and add the opportunity for easily adding gestures, the program architecture is modular, with a control module and a gesture recognition module. 

Scheme that shows overall project structure and how videostream data from the drone is processed

Figure 2: Scheme that shows overall project structure and how videostream data from the drone is processed

The application is divided into two main parts: gesture recognition and drone controller. Those are independent instances that can be easily modified. For example, to add new gestures or change the movement speed of the drone.

Video stream is passed to the main program, which is a simple script with module initialisation, connections, and typical for the hardware while-true cycle. Frame for the videostream is passed to the gesture recognition module. After getting the ID of the recognised gesture, it is passed to the control module, where the command is sent to the UAV. Alternatively, the user can control a drone from the keyboard in a more classical manner.

So, you can see that the gesture recognition module is divided into keypoint detection and gesture classifier. Exactly the bunch of the MediaPipe key point detector along with the custom gesture classification model distinguishes this gesture recognition system from most others.

Gesture recognition with MediaPipe

Utilizing MediaPipe Hands is a winning strategy not only in terms of speed, but also in flexibility. MediaPipe already has a simple gesture recognition calculator that can be inserted into the pipeline. However, we needed a more powerful solution with the ability to quickly change the structure and behaviour of the recognizer. To do so and classify gestures, the custom neural network was created with 4 Fully-Connected layers and 1 Softmax layer for classification.

Figure 3: Scheme that shows the structure of classification neural network

Figure 3: Scheme that shows the structure of classification neural network

This simple structure gets a vector of 2D coordinates as an input and gives the ID of the classified gesture. 

Instead of using cumbersome segmentation models with a more algorithmic recognition process, a simple neural network can easily handle such tasks. Recognising gestures by keypoints, which is a simple vector with 21 points` coordinates, takes much less data and time. What is more critical, new gestures can be easily added because model retraining tasks take much less time than the algorithmic approach.

To train the classification model, dataset with keypoints` normalised coordinates and ID of a gesture was used. The numerical characteristic of the dataset was that:

  • 3 gestures with 300+ examples (basic gestures)
  • 5 gestures with 40 -150 examples 

All data is a vector of x, y coordinates that contain small tilt and different shapes of hand during data collection.

Figure 4: Confusion matrix and classification report for classification

Figure 4: Confusion matrix and classification report for classification

We can see from the classification report that the precision of the model on the test dataset (this is 30% of all data) demonstrated almost error-free for most classes, precision > 97% for any class. Due to the simple structure of the model, excellent accuracy can be obtained with a small number of examples for training each class. After conducting several experiments, it turned out that we just needed the dataset with less than 100 new examples for good recognition of new gestures. What is more important, we don’t need to retrain the model for each motion in different illumination because MediaPipe takes over all the detection work.

Figure 5: [GIF] Test that demonstrates how fast classification network can distinguish newly trained gestures using the information from MediaPipe hand detector

Figure 5: [GIF] Test that demonstrates how fast classification network can distinguish newly trained gestures using the information from MediaPipe hand detector

From gestures to movements

To control a drone, each gesture should represent a command for a drone. Well, the most excellent part about Tello is that it has a ready-made Python API to help us do that without explicitly controlling motors hardware. We just need to set each gesture ID to a command.

Figure 6: Command-gesture pairs representation

Figure 6: Command-gesture pairs representation

Each gesture sets the speed for one of the axes; that’s why the drone’s movement is smooth, without jitter. To remove unnecessary movements due to false detection, even with such a precise model, a special buffer was created, which is saving the last N gestures. This helps to remove glitches or inconsistent recognition.

The fundamental goal of this project is to demonstrate the superiority of the keypoint-based gesture recognition approach compared to classical methods. To demonstrate all the potential of this recognition model and its flexibility, there is an ability to create the dataset on the fly … on the drone`s flight! You can create your own combinations of gestures or rewrite an existing one without collecting massive datasets or manually setting a recognition algorithm. By pressing the button and ID key, the vector of detected points is instantly saved to the overall dataset. This new dataset can be used to retrain classification network to add new gestures for the detection. For now, there is a notebook that can be run on Google Colab or locally. Retraining the network-classifier takes about 1-2 minutes on a standard CPU instance. The new binary file of the model can be used instead of the old one. It is as simple as that. But for the future, there is a plan to do retraining right on the mobile device or even on the drone.

Figure 7: Notebook for model retraining in action

Figure 7: Notebook for model retraining in action

Summary 

This project is created to make a push in the area of the gesture-controlled drones. The novelty of the approach lies in the ability to add new gestures or change old ones quickly. This is made possible thanks to MediaPipe Hands. It works incredibly fast, reliably, and ready out of the box, making gesture recognition very fast and flexible to changes. Our Neuron Lab`s team is excited about the demonstrated results and going to try other incredible solutions that MediaPipe provides. 

We will also keep track of MediaPipe updates, especially about adding more flexibility in creating custom calculators for our own models and reducing barriers to entry when creating them. Since at the moment our classifier model is outside the graph, such improvements would make it possible to quickly implement a custom calculator with our model into reality.

Another highly anticipated feature is Flutter support (especially for iOS). In the original plans, the inference and visualisation were supposed to be on a smartphone with NPU\GPU utilisation, but at the moment support quality does not satisfy our requests. Flutter is a very powerful tool for rapid prototyping and concept checking. It allows us to throw and test an idea cross-platform without involving a dedicated mobile developer, so such support is highly demanded. 

Nevertheless, the development of this demo project continues with available functionality, and there are already several plans for the future. Like using the MediaPipe Holistic for face recognition and subsequent authorisation. The drone will be able to authorise the operator and give permission for gesture control. It also opens the way to personalisation. Since the classifier network is straightforward, each user will be able to customise gestures for themselves (simply by using another version of the classifier model). Depending on the authorised user, one or another saved model will be applied. Also in the plans to add the usage of Z-axis. For example, tilt the palm of your hand to control the speed of movement or height more precisely. We encourage developers to innovate responsibly in this area, and to consider responsible AI practices such as testing for unfair biases and designing with safety and privacy in mind.

We highly believe that this project will motivate even small teams to do projects in the field of ML computer vision for the UAV, and MediaPipe will help to cope with the limitations and difficulties on their way (such as scalability, cross-platform support and GPU inference).


If you want to contribute, have ideas or comments about this project, please reach out to [email protected], or visit the GitHub page of the project.

This blog post is curated by Igor Kibalchich, ML Research Product Manager at Google AI.