The Signs Are Obvious
This AI-powered sign language recognition system combines hand gestures, facial expressions, and body movements to improve translation accuracy.
Sign language is an important means of communication for the deaf and hard of hearing, offering a window to a world that would otherwise be largely inaccessible. The combination of hand movements, facial expressions, and body language in signing enables individuals to convey their ideas with subtlety and remarkable precision.
However, sign language is not universally understood, resulting in significant communication barriers for those who rely on it. Compounding this challenge is the existence of multiple sign languages worldwide, each with its own distinct characteristics, analogous to the diversity of spoken languages. A reliable translator would go a long way toward solving this problem, as it would remove the substantial burden of learning sign language.
Computer vision-based approaches show considerable promise on this front. With such an approach, pointing a smartphone camera at a person as they sign might be all it takes to see a translation. But existing algorithms tend to focus on only certain aspects of signing, such as hand movements. Since everything from body movements to facial expressions factors into the meaning a signer is trying to convey, these techniques are sometimes inaccurate. Furthermore, a signer's actions can be very subtle, which poses additional difficulties for current computer vision-based approaches.
A team led by researchers at Osaka Metropolitan University has recently made strides in overcoming these issues. They have developed a novel word-level sign language recognition (WSLR) method that uses a multi-stream neural network (MSNN) to integrate several sources of information. By capturing the full range of information a signer conveys, and analyzing it with an algorithm that can recognize fine details, they have demonstrated that translation accuracy can be significantly improved.
The researchers’ MSNN consists of three main streams: (1) a base stream that captures global upper-body movements through appearance and optical flow information, (2) a local image stream that magnifies and focuses on detailed features of the hands and face, and (3) a skeleton stream that analyzes the relative positions of the body and hands using a spatiotemporal graph convolutional network. By combining these streams, the method improves the recognition accuracy of fine-grained details in sign language gestures while minimizing the influence of background noise.
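To make the three-stream design more concrete, the sketch below shows how such a network could be assembled in PyTorch, with each stream producing its own class scores that are then fused. It is only an illustrative approximation: the tiny backbone encoders, feature dimensions, joint count, and the simple MLP standing in for the spatiotemporal graph convolutional network are assumptions for the sake of a runnable example, not the team's published implementation.

```python
import torch
import torch.nn as nn

class StreamEncoder(nn.Module):
    """Small 3D-CNN encoder used as a stand-in for each stream's backbone."""
    def __init__(self, in_channels: int, feat_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),   # collapse time and space
            nn.Flatten(),
            nn.Linear(32, feat_dim),
        )

    def forward(self, x):              # x: (batch, channels, frames, H, W)
        return self.net(x)

class MultiStreamWSLR(nn.Module):
    """Illustrative three-stream word-level sign recognizer with late fusion."""
    def __init__(self, num_classes: int, feat_dim: int = 256, num_joints: int = 27, num_frames: int = 32):
        super().__init__()
        # Base stream: global upper-body appearance (RGB) plus optical flow (2 channels).
        self.base_rgb = StreamEncoder(3, feat_dim)
        self.base_flow = StreamEncoder(2, feat_dim)
        # Local image stream: magnified crops of the hands and face.
        self.local = StreamEncoder(3, feat_dim)
        # Skeleton stream: a plain MLP over joint coordinates stands in here
        # for a spatiotemporal graph convolutional network (ST-GCN).
        self.skeleton = nn.Sequential(
            nn.Flatten(),
            nn.Linear(2 * num_joints * num_frames, feat_dim),
            nn.ReLU(inplace=True),
        )
        # One classifier head per stream; predictions are averaged (late fusion).
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in range(4)]
        )

    def forward(self, rgb, flow, crops, joints):
        feats = [
            self.base_rgb(rgb),
            self.base_flow(flow),
            self.local(crops),
            self.skeleton(joints),
        ]
        logits = [head(f) for head, f in zip(self.heads, feats)]
        return torch.stack(logits).mean(dim=0)   # fused class scores

# Dummy forward pass: batch of 2 clips, 32 frames, 112x112 pixels, 27 joints.
model = MultiStreamWSLR(num_classes=100)
rgb = torch.randn(2, 3, 32, 112, 112)
flow = torch.randn(2, 2, 32, 112, 112)
crops = torch.randn(2, 3, 32, 112, 112)
joints = torch.randn(2, 2, 27, 32)               # (x, y) per joint per frame
print(model(rgb, flow, crops, joints).shape)     # torch.Size([2, 100])
```

Fusing per-stream predictions at the end, rather than concatenating raw inputs, is what lets each stream specialize, the local stream on subtle hand and face cues, the skeleton stream on pose, while remaining robust to background noise.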
The proposed method was validated using two datasets for American Sign Language recognition: WLASL and MS-ASL. WLASL was utilized to test scalability due to its large class variety, while MS-ASL tested the system’s accuracy from diverse viewpoints. Preprocessing involved detecting signers’ bounding boxes using YOLOv3 or SSD, resizing, and applying data augmentation, including random cropping and horizontal flipping, to enhance model robustness.
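A minimal sketch of that kind of preprocessing pipeline is shown below. It assumes a generic `detect_signer` callable in place of YOLOv3 or SSD, and the crop and resize sizes are placeholder choices rather than the paper's exact settings.

```python
import random
import numpy as np
import cv2  # OpenCV, used here for resizing and flipping

def preprocess_clip(frames, detect_signer, out_size=224, crop_size=200, train=True):
    """Crop the signer out of each frame, resize, and (optionally) augment.

    frames:        list of HxWx3 uint8 images (one sign-word video clip)
    detect_signer: callable returning an (x, y, w, h) bounding box for the
                   signer; in the paper this role is filled by YOLOv3 or SSD
    """
    # Detect the signer on the first frame and reuse the box for the whole clip
    # (assumes the box stays inside the frame for this simple sketch).
    x, y, w, h = detect_signer(frames[0])

    # Decide the augmentation once per clip so every frame is transformed
    # identically: one random crop offset and one horizontal-flip decision.
    flip = train and random.random() < 0.5
    max_off = out_size - crop_size
    ox, oy = (random.randint(0, max_off), random.randint(0, max_off)) if train else (0, 0)

    processed = []
    for frame in frames:
        roi = frame[y:y + h, x:x + w]                    # crop to the signer
        roi = cv2.resize(roi, (out_size, out_size))      # normalize resolution
        if train:
            roi = roi[oy:oy + crop_size, ox:ox + crop_size]   # random crop
            roi = cv2.resize(roi, (out_size, out_size))
        if flip:
            roi = cv2.flip(roi, 1)                       # horizontal flip
        processed.append(roi.astype(np.float32) / 255.0) # scale to [0, 1]
    return np.stack(processed)                           # (frames, H, W, 3)
```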
Quantitative evaluations compared the proposed MSNN to two baselines and state-of-the-art methods. Results showed significant accuracy improvements when incorporating local image and skeleton streams, particularly for challenging signs with subtle gesture differences. For example, Top-1 accuracy on WLASL100 increased by 10.71 percent with the local stream and 5.18 percent with the skeleton stream.
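For context, Top-1 accuracy (and Top-k accuracy more generally) is simply the fraction of test clips for which the correct sign appears among the model's k highest-scoring predictions. The small NumPy illustration below uses toy numbers unrelated to the paper's results.

```python
import numpy as np

def top_k_accuracy(logits, labels, k=1):
    """Fraction of samples whose true label is among the k highest scores.

    logits: (N, num_classes) array of per-class scores
    labels: (N,) array of ground-truth class indices
    """
    top_k = np.argsort(logits, axis=1)[:, -k:]        # k best classes per sample
    hits = (top_k == labels[:, None]).any(axis=1)
    return hits.mean()

# Toy example: 3 clips, 4 candidate signs.
scores = np.array([[0.10, 0.70, 0.10, 0.10],
                   [0.30, 0.20, 0.40, 0.10],
                   [0.50, 0.30, 0.15, 0.05]])
labels = np.array([1, 2, 1])
print(top_k_accuracy(scores, labels, k=1))  # 0.666... (third clip is missed)
print(top_k_accuracy(scores, labels, k=2))  # 1.0
```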
The team plans to enhance their model's recognition accuracy in the future by extending their research to more realistic environments with diverse signers and complex backgrounds. They also aim to generalize their method to other sign languages, such as British, Japanese, and Indian sign languages, through additional experiments and modifications. Ultimately, their goal is to expand the framework to support continuous sign language recognition, providing valuable assistance to the hearing-impaired community.