Sensing the Mood
This two-stage transformer-based sentiment analysis pipeline captures the intricacies of multiple data types to accurately predict emotions.
Recent algorithmic advances have brought sentiment analysis to the forefront. Sentiment analysis uses natural language processing, text analysis, and computational linguistics to identify and extract subjective information from text, determining whether a piece of writing expresses, for example, a positive, negative, or neutral sentiment.
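To ground the basic idea, the short sketch below uses a pretrained transformer classifier from the Hugging Face transformers library to label a couple of example reviews. It is a minimal, text-only illustration and is unrelated to the multimodal system described later in this article.

```python
# Minimal text-only sentiment analysis example using a pretrained transformer
# classifier (illustrative only; not the researchers' multimodal pipeline).
from transformers import pipeline

# Loads a default English sentiment model on first use.
classifier = pipeline("sentiment-analysis")

reviews = [
    "The battery life on this phone is fantastic.",
    "The update made the app slower and harder to use.",
]

for review, result in zip(reviews, classifier(reviews)):
    # Each result contains a label (POSITIVE/NEGATIVE) and a confidence score.
    print(f"{result['label']:>8}  {result['score']:.2f}  {review}")
```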
Sentiment analysis has become an important tool for businesses, providing insight into customer opinions that sharpens decision-making and improves customer service and product development. Applications range from monitoring social media to gauge public opinion about brands or events, to analyzing customer reviews to understand a product's strengths and weaknesses. The technique is also valuable in political analysis, where it helps track public opinion on policies and candidates.
As in other areas of machine learning, many researchers are now turning their attention to multimodal models. By analyzing multiple types of data, algorithms generally learn to make better predictions. But in sentiment analysis, there is still much work to be done: present approaches commonly focus either on blending the data types together or on modeling the interactions between them.
The problem is that both of these approaches lead to information loss, which limits the accuracy of the resulting models. A team led by researchers at Anhui University set out to change this paradigm by developing a new AI framework for sentiment analysis. Their solution is a two-stage pipeline that captures information at multiple levels that would otherwise likely be lost. When compared with some of today’s best models, the new pipeline performed better, suggesting that this multi-stage algorithm may be a good option moving forward.
The pipeline involves several steps to process and analyze multimedia content to predict emotions. First, the pipeline extracts features from the text, audio, and video data. These features are then enhanced with additional contextual information, creating context-aware representations for each modality. In the initial fusion stage, these context-aware representations are combined: text features interact with audio and video features, allowing each type of data to adjust and complement the others. This interaction results in an integrated representation that merges text with adapted audio and video features.
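The article does not detail the exact layer definitions, but the first fusion stage can be pictured as bidirectional cross-attention: text queries attend over the audio and video features while audio and video attend back over the text. The PyTorch sketch below is only illustrative; the module names, dimensions, and the simple summation of the two text streams are assumptions, not the researchers' implementation.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """One direction of cross-modal attention: `query` features attend over
    `context` features (e.g., text attending over audio). Illustrative only."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query, context):
        attended, _ = self.attn(query, context, context)
        return self.norm(query + attended)  # residual connection

class BidirectionalInteraction(nn.Module):
    """Sketch of the first fusion stage: text interacts with audio and video
    in both directions, yielding a text-centered representation plus
    text-adapted audio and video features."""
    def __init__(self, dim=128):
        super().__init__()
        self.text_from_audio = CrossModalBlock(dim)
        self.text_from_video = CrossModalBlock(dim)
        self.audio_from_text = CrossModalBlock(dim)
        self.video_from_text = CrossModalBlock(dim)

    def forward(self, text, audio, video):
        # Text queries attend over audio and video, and vice versa.
        text_fused = self.text_from_audio(text, audio) + self.text_from_video(text, video)
        audio_adapted = self.audio_from_text(audio, text)
        video_adapted = self.video_from_text(video, text)
        return text_fused, audio_adapted, video_adapted

# Example with random features: batch of 2, sequence length 20, feature dim 128.
text, audio, video = (torch.randn(2, 20, 128) for _ in range(3))
text_fused, audio_adapted, video_adapted = BidirectionalInteraction()(text, audio, video)
print(text_fused.shape)  # torch.Size([2, 20, 128])
```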
In the second stage, the text-centered output of the first fusion is combined once more with the non-text features that were adapted during that initial fusion. This second fusion further refines the combined features before they are used for emotion prediction.
The core framework uses stacked transformers, which include bidirectional cross-modal transformers and a transformer encoder. The bidirectional interaction layer, responsible for the first fusion stage, allows for cross-modal interaction where text, audio, and video features influence each other. The refine layer performs the second-stage fusion, fine-tuning the interactions among the features.
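Likewise, the refine layer's internals are not spelled out here, but one way to picture the second-stage fusion is a plain transformer encoder applied to the concatenated stage-one outputs, followed by a small classification head. Every module, dimension, and name in the sketch below is again an illustrative assumption rather than the published architecture.

```python
import torch
import torch.nn as nn

class RefineAndPredict(nn.Module):
    """Sketch of the second fusion stage: a transformer encoder refines the
    concatenated stage-one outputs, then a linear head predicts the emotion."""
    def __init__(self, dim=128, num_emotions=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.refine = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_emotions)

    def forward(self, text_fused, audio_adapted, video_adapted):
        # Second fusion: concatenate the text-centered representation with the
        # audio and video features adapted during the first fusion.
        fused = torch.cat([text_fused, audio_adapted, video_adapted], dim=1)
        refined = self.refine(fused)
        # Mean-pool over the sequence and classify.
        return self.head(refined.mean(dim=1))

# Example with random stage-one outputs: batch 2, sequence length 20, dim 128.
stage_one_outputs = [torch.randn(2, 20, 128) for _ in range(3)]
logits = RefineAndPredict()(*stage_one_outputs)
print(logits.shape)  # torch.Size([2, 3])
```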
When tested against benchmark models on three open datasets, the new pipeline consistently performed as well as or better than the existing models. But the multi-stage approach does come with additional computational overhead. In the future, the team intends to explore the use of more advanced transformers to improve the algorithm's efficiency.