Please note that the information, uses, and applications expressed in the below post are solely those of our guest author, SignAll.
SignAll SDK: Sign language interface using MediaPipe is now available for developers
When Google published the first versions of its on-device hand tracking technology in MediaPipe, the work could serve as a basis for developers to build sign language recognition solutions into their own apps. Later updates to this hand tracking solution have further improved its accuracy where other technologies have fallen short (Figure 1).
Figure 1. Illustrating MediaPipe’s improvement over time: hand skeleton tracking output of an older version (2020.02.10) and the latest version (2020.12.16). This handshape is used in sign language frequently, but often missed because of the lack of representation in training datasets.
SignAll is a startup working on sign language translation technology. Its mission is to make sign language interpretation universally available, both for communication between Deaf and hearing parties and for interaction between Deaf individuals and computers. SignAll’s products, which are used nationwide in the US for both communication and education, employ a complex multi-camera setup and gloves with colored markers. While the complexity of sign languages goes far beyond handshapes (facial features, body, grammar, etc.), accurate tracking of the hands has been a huge obstacle in the first layer of processing: computer vision. MediaPipe unlocked the possibility of offering SignAll’s solutions not only glove-free, but also with a single camera. SignAll has just announced the availability of the first SDK of its kind, so developers can now enable sign language input in their apps.
The company recently published an interactive educational app in the App Store that lets the user practice signing with immediate feedback. The app also serves as a demonstration of what is possible with the SDK.
SignAll with MediaPipe Hands
Our system uses several layers for sign recognition, and each one works with increasingly abstract data. The low-level layer extracts crucial hand, body, and face data from 2D and 3D cameras. In our first implementation, this layer detected the colors of the gloves and created 3D hand data. Replacing this with MediaPipe Hands (supplemented by MediaPipe Pose and MediaPipe Face Mesh) has been a game changer for using our system without gloves or special lighting.
Figure 2. Demo of our SignAll SDK developed using MediaPipe
As mentioned earlier, we use multiple cameras with depth sensors, which are calibrated in the real world. This allows for a more accurate 3D world space than that of a local camera or tensor space, but requires hand landmark detection for each camera. The cameras differ from each other in position and orientation, so the hands are visible more often: one hand might cover the other from one camera’s view but not necessarily from the others.
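To make the multi-camera idea concrete, here is a minimal sketch and not SignAll's actual implementation. It assumes each camera's calibration is available as a camera-to-world rotation and translation, that every camera reports the same landmarks in its own camera space, and that a per-hand confidence is available. The names `CameraDetection`, `to_world`, and `fuse_hand` are hypothetical, and the confidence-weighted average is just one simple way to let a well-placed camera outvote an occluded one.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


@dataclass
class CameraDetection:
    """One camera's view of a hand (all fields are illustrative assumptions)."""
    rotation: Sequence[Sequence[float]]  # 3x3 camera-to-world rotation matrix
    translation: Point3                  # camera origin expressed in world space
    landmarks: List[Point3]              # hand landmarks in camera space
    confidence: float                    # detection confidence in [0, 1]


def to_world(det: CameraDetection, p: Point3) -> Point3:
    """Apply the rigid transform world = R * p + t."""
    return tuple(
        sum(det.rotation[r][c] * p[c] for c in range(3)) + det.translation[r]
        for r in range(3)
    )


def fuse_hand(detections: List[CameraDetection]) -> List[Point3]:
    """Confidence-weighted average of each landmark across all cameras."""
    total = sum(d.confidence for d in detections)
    if total <= 0:
        raise ValueError("no camera produced a usable detection")
    fused = []
    for i in range(len(detections[0].landmarks)):
        acc = [0.0, 0.0, 0.0]
        for d in detections:
            world = to_world(d, d.landmarks[i])
            for axis in range(3):
                acc[axis] += d.confidence * world[axis]
        fused.append(tuple(a / total for a in acc))
    return fused
```

A camera whose view of the hand is blocked can be given a confidence of zero, in which case its landmarks do not influence the fused result at all.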
Figure 3. Corrected 3D hand shape based on three cameras. Due to the unique orientation, the detection from the front camera is wrong, but the side camera can correct the result.
The next step is to filter and smooth the data to replicate the precise measurements offered by our colored glove markers. Although SignAll’s markers are different from the landmarks given by MediaPipe, we used our hand model to generate colored markers from landmarks. Therefore, the new mocap data is fully compatible with the previous one.
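The post does not say which filter is used for this step. As a stand-in, the sketch below applies plain exponential smoothing to every landmark coordinate over time. The function name `smooth_landmarks` and the parameter `alpha` are assumptions made for illustration only.

```python
from typing import Iterable, Iterator, List, Tuple

Point3 = Tuple[float, float, float]
Frame = List[Point3]  # one entry per landmark


def smooth_landmarks(frames: Iterable[Frame], alpha: float = 0.5) -> Iterator[Frame]:
    """Exponentially smooth each landmark coordinate over time.

    alpha close to 1 follows the raw detections closely; alpha close to 0
    smooths more strongly but adds lag.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    state = None
    for frame in frames:
        if state is None:
            # The first frame initializes the filter state unchanged.
            state = [tuple(p) for p in frame]
        else:
            state = [
                tuple(alpha * new + (1.0 - alpha) * old for new, old in zip(p, q))
                for p, q in zip(frame, state)
            ]
        yield list(state)
```

A fixed `alpha` trades jitter against lag for all motion speeds alike; a speed-adaptive filter would be a natural refinement for fast signing, but that goes beyond this sketch.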
Although we are focused mainly on the hands, we also integrated MediaPipe Pose and MediaPipe Face Mesh. The pose landmarks provide accurate hand position information, even when touching or close to each other.
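One way pose landmarks can help with hand position is in deciding which detected hand is the left one and which is the right one. The sketch below is an assumption about how such a step could look, not a description of SignAll's pipeline: it matches the wrist point of each detected hand to the pose wrists by minimizing the total distance. The function name `assign_hands` and its inputs are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point2 = Tuple[float, float]


def assign_hands(
    hand_wrists: List[Point2],
    pose_left_wrist: Point2,
    pose_right_wrist: Point2,
) -> Dict[str, int]:
    """Map 'left' / 'right' to an index into hand_wrists (at most two hands)."""
    if not hand_wrists:
        return {}
    if len(hand_wrists) > 2:
        raise ValueError("expected at most two detected hands")
    if len(hand_wrists) == 1:
        d_left = math.dist(hand_wrists[0], pose_left_wrist)
        d_right = math.dist(hand_wrists[0], pose_right_wrist)
        return {"left": 0} if d_left <= d_right else {"right": 0}
    h0, h1 = hand_wrists
    # Compare the two possible pairings and keep the cheaper one.
    straight = math.dist(h0, pose_left_wrist) + math.dist(h1, pose_right_wrist)
    crossed = math.dist(h0, pose_right_wrist) + math.dist(h1, pose_left_wrist)
    if straight <= crossed:
        return {"left": 0, "right": 1}
    return {"left": 1, "right": 0}
```

Comparing both pairings at once, rather than picking the nearest wrist for each hand independently, prevents both hands from being given the same label when they are close together.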
While the two versions of mocap are compatible, the nature of their artifacts is different: one comes from direct measurement of each marker, the other from markers simulated from a globally detected hand. Due to this discrepancy, we had to refine the parameters on higher levels. On the other hand, we could still use our huge sign database for the gloveless configuration. By replacing the low-level data and refining our higher-level data, we could test our system without gloves. Going gloveless can be a huge step toward making our sign recognition technology easy to use worldwide.
Figure 4. Demonstration of compatible mocap from different low-level trackings. The right side is without gloves; the left side is with gloves. This compatibility enables SignAll’s meticulously labeled dataset of 300,000+ sign language videos to be used for training recognition models based on different low-level data. Watch the full video
The SignAll system using MediaPipe framework
After integrating MediaPipe Hands into our system, we also wanted to take advantage of the customization and scaling opportunities provided by the MediaPipe framework on multiple platforms. This allowed us to not only prototype our research-stage methods in Python, but also deliver our end-user solutions for Windows, iOS, Android, and even the Web. Thanks to the similarities between our module graph system and the calculator graph of MediaPipe, our existing processing units can be reused in this new framework with minor modifications. That said, the extended platform set also comes with other challenges, like using only a single 2D camera in most cases instead of a calibrated multi-camera system.
The models, algorithms and techniques we have used were mostly developed to work on our mocap data interpreted in the 3D global world. The data extracted from a single-camera setup, of course, cannot be as detailed. That is why we had to make some adjustments to our implementations, fine-tune the algorithms and add some extra logic (e.g., dynamically adapting to the changes in space that result from hand-held camera use). Luckily, the MediaPipe framework enables us to implement the core processing units in C++, so we can still benefit from the runtime-optimized core solutions we previously developed.
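To illustrate what adapting to a moving hand-held camera can involve, the sketch below re-expresses 2D hand landmarks in a body-relative frame: the origin is the midpoint between the shoulders and the unit length is the shoulder width. This is an assumed normalization chosen for illustration, not the extra logic SignAll actually added, and `normalize_to_body` is a hypothetical name.

```python
import math
from typing import List, Tuple

Point2 = Tuple[float, float]


def normalize_to_body(
    hand: List[Point2],
    left_shoulder: Point2,
    right_shoulder: Point2,
) -> List[Point2]:
    """Express hand landmarks relative to the shoulder midpoint and width."""
    cx = (left_shoulder[0] + right_shoulder[0]) / 2.0
    cy = (left_shoulder[1] + right_shoulder[1]) / 2.0
    width = math.dist(left_shoulder, right_shoulder)
    if width == 0.0:
        raise ValueError("shoulders coincide; cannot derive a scale")
    return [((x - cx) / width, (y - cy) / width) for x, y in hand]
```

With this normalization, moving the camera closer or shifting it sideways changes the raw image coordinates but leaves the normalized ones the same. It does not compensate for camera rotation, which would need additional handling.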
Some higher-level models trained on 3D data also needed to be re-trained in order to perform better on the data originating from a single 2D source. The MediaPipe landmarks are defined by 3D coordinates, which makes it possible to reuse the existing training methods and concepts. On the other hand, the 2D information is more directly extracted and therefore more stable than the third coordinate, which was taken into consideration while designing the training modifications.
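The post does not describe how the lower stability of the third coordinate was taken into account. One simple possibility, shown here purely as an assumption, is to down-weight the depth term whenever two sets of landmarks are compared, for example in a training loss. The name `weighted_landmark_error` and the default `z_weight` value are hypothetical.

```python
from typing import List, Tuple

Point3 = Tuple[float, float, float]


def weighted_landmark_error(
    predicted: List[Point3],
    target: List[Point3],
    z_weight: float = 0.25,
) -> float:
    """Mean squared landmark error with a reduced weight on the z coordinate."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("landmark lists must be non-empty and of equal length")
    total = 0.0
    for (px, py, pz), (tx, ty, tz) in zip(predicted, target):
        # x and y count fully; depth differences are scaled down by z_weight.
        total += (px - tx) ** 2 + (py - ty) ** 2 + z_weight * (pz - tz) ** 2
    return total / len(predicted)
```

With `z_weight=0.25`, a depth error has to be twice as large as an in-plane error to contribute the same amount.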
Luckily, it is not necessary to make entirely new data recordings for this purpose. We can still use our huge video database, annotated in great detail. The preprocessed mocap data that we extract from our recordings and interpret in the 3D world can be used to simulate hand, skeleton, or face landmark detections in any virtual camera view.
Alongside the data from virtual camera views, we also use traditional 2D recordings in sufficient proportion to cover the unique noise characteristics of the landmark detections. Thanks to having most of these types of data in advance, we can focus on the most exciting part: trying the latest techniques and training new models.
The advancement made possible by MediaPipe enabled SignAll to change its business model. In addition to offering all-in-one products for sign language education and translation, SignAll is now starting to offer an SDK for developers. The capabilities of this SDK depend on the type of camera or cameras being used and the available computational capacity. The possible functions enabled by the SDK range from launching video calls by signing the contact’s name (watch a demo here) and adding addresses to navigation by signing (as a counterpart to speech input) to ordering food at a fast-food restaurant’s kiosk or drive-thru. With its mission to make sign language an alternative everywhere that voice can be used, SignAll is excited to see more and more apps implementing this feature.
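Simulating a detection in a virtual camera view comes down to projecting 3D world points into that camera's image. The sketch below uses a basic pinhole model and is an illustration under stated assumptions, not SignAll's tooling: the camera only rotates about the vertical axis (`yaw`), looks along +z when `yaw` is zero, and outputs coordinates relative to the image center. `VirtualCamera` and `project` are hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

Point3 = Tuple[float, float, float]


@dataclass
class VirtualCamera:
    """A simplified pinhole camera (illustrative parameters only)."""
    position: Point3     # camera center in world space
    yaw: float           # rotation about the vertical (y) axis, in radians
    focal_length: float  # in the same units as the output image coordinates


def project(camera: VirtualCamera, point: Point3) -> Optional[Tuple[float, float]]:
    """Project a world point into the camera image, or None if it is behind it."""
    dx = point[0] - camera.position[0]
    dy = point[1] - camera.position[1]
    dz = point[2] - camera.position[2]
    # Rotate the offset into the camera frame (inverse of the camera's yaw).
    cos_y, sin_y = math.cos(camera.yaw), math.sin(camera.yaw)
    x_cam = cos_y * dx - sin_y * dz
    z_cam = sin_y * dx + cos_y * dz
    if z_cam <= 0.0:
        return None
    # Perspective division: farther points land closer to the image center.
    return (camera.focal_length * x_cam / z_cam, camera.focal_length * dy / z_cam)
```

Projecting the same recorded 3D landmarks through several such cameras gives different 2D views of one signing sequence. These clean projections lack the noise of a real landmark detector, which has to be covered by other data.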
We are eager to try future updates of MediaPipe, which could bring us closer to our ultimate goal of having our solutions available for everyone on any device. The most awaited update is the ability to build custom MediaPipe graphs and add our own calculators for web-based solutions aided by the WebAssembly technology, so websites will be able to use a new level of accessibility features for Deaf visitors.