
Emotion Recognition using Cross-Attentional Audio-Visual Fusion

Expression of different emotions

SUMMARY

Multimodal analysis has recently drawn much interest in affective computing, since it can improve the overall accuracy of emotion recognition (ER) over isolated unimodal approaches. In this work, we explored the complementary fusion of the audio (A) and visual (V) modalities in order to extract robust multimodal feature representations. The objective was to address continuous emotion recognition, where we aim to estimate the wide range of human emotions on continuous scales of valence and arousal. Specifically, we introduced a cross-attentional fusion approach that extracts the salient features across the AV modalities, allowing for an accurate prediction of continuous values of valence and arousal. Our cross-attentional AV fusion model efficiently leverages the intermodal AV relationships: it computes cross-attention weights to focus on the more relevant features across the individual modalities, thereby combining contributive feature representations, which are then fed to prediction layers to estimate valence and arousal. This work has strong potential in real-world applications such as pain intensity and depression level estimation in health care, and driver fatigue detection in driver assistance systems.
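As a rough illustration of this pipeline, the sketch below wires hypothetical audio and visual backbones to a fusion module and a small regression head in PyTorch. The module names, tensor shapes and feature dimensions are assumptions made for illustration, not the exact architecture used in the paper [2].

```python
import torch
import torch.nn as nn

class AVFusionModel(nn.Module):
    """High-level skeleton of the AV fusion pipeline (illustrative only).

    The backbones, the fusion module and the feature dimensions are
    placeholders; the actual architectures are described in [2].
    """
    def __init__(self, audio_backbone, visual_backbone, fusion, fused_dim=256):
        super().__init__()
        self.audio_backbone = audio_backbone    # e.g., a CNN over spectrograms
        self.visual_backbone = visual_backbone  # e.g., a CNN over face crops
        self.fusion = fusion                    # cross-attentional fusion module
        self.head = nn.Linear(fused_dim, 2)     # prediction layer: valence, arousal

    def forward(self, spectrograms, face_clips):
        feat_a = self.audio_backbone(spectrograms)  # (batch, T, d_a) audio features
        feat_v = self.visual_backbone(face_clips)   # (batch, T, d_v) visual features
        fused = self.fusion(feat_a, feat_v)         # (batch, T, fused_dim) attended AV features
        return self.head(fused)                     # (batch, T, 2) valence and arousal
```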

Emotion Recognition: A Challenging Task

Automatic recognition and analysis of human emotions has drawn much attention over the past few decades, with a wide range of applications in fields such as health care (anger, fatigue, depression and pain assessment), robotics (human-machine interaction) and driver assistance (driver condition assessment). Emotion recognition (ER) is a challenging problem because the expressions linked to human emotions vary widely across individuals and cultures.

Figure 1: Diverse emotions across individuals and cultures [1]

The Valence-Arousal Space

Recently, real-world applications have brought about a shift in affective computing research from laboratory-controlled environments to more realistic natural settings. This shift has in turn led to the analysis of a wide range of subtle, continuous emotional states, as required by tasks such as pain intensity estimation and depression level estimation. Continuous ER is normally formulated as a dimensional ER problem, where complex human emotions are represented as points in a low-dimensional space. Figure 2 illustrates the two-dimensional space spanned by valence and arousal. Valence reflects the range of emotions along the dimension of pleasantness, from negative (sad) to positive (happy), whereas arousal spans the range of intensities from passive (sleepiness) to active (high excitement).

Figure 2: Valence-Arousal Space
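To make the regression target concrete, the snippet below represents an emotional state as a pair of continuous valence and arousal values. The [-1, 1] range and the example values are assumptions chosen for illustration; the actual scale depends on the dataset's annotation protocol.

```python
from dataclasses import dataclass

@dataclass
class EmotionalState:
    """A point in the valence-arousal space (values assumed to lie in [-1, 1])."""
    valence: float  # negative (sad) ... positive (happy)
    arousal: float  # passive (sleepiness) ... active (high excitement)

# Made-up illustrative points, not dataset annotations:
excitement = EmotionalState(valence=0.8, arousal=0.9)   # pleasant and highly active
boredom = EmotionalState(valence=-0.4, arousal=-0.7)    # mildly unpleasant and passive
```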

Using Multimodal Systems

Human emotions can be conveyed through various modalities, such as the face, voice, text and biosignals (electroencephalogram, electrocardiogram, etc.), each typically carrying distinct information. Among these, the vocal and facial modalities are the predominant contact-free channels in videos, and they carry complementary information. In this work, we investigated how to efficiently leverage the complementary nature of the AV relationships captured in videos to improve the performance of multimodal systems over unimodal ones. For instance, when the facial modality is unreliable due to pose, blur or low illumination, we can still leverage the audio modality to estimate the emotional state, and vice versa.

Figure 3: AV Fusion Model for Dimensional Emotion Recognition

Given a set of video sequences, we extracted the audio and visual streams separately: the visual stream was preprocessed to obtain cropped and aligned face images, and the audio stream was processed to obtain spectrograms of the corresponding video clips. These were then fed to the visual and audio backbones to extract the corresponding visual and audio features, which were in turn fed to the cross-attentional model. In the fusion (cross-attentional) model, we obtained attention weights for each modality based on a correlation measure across the audio and visual features; a higher correlation indicates that the corresponding audio and visual features are strongly related and carry relevant information. The final attended features were then obtained using these attention weights and fed to the prediction layer to estimate valence and arousal.
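The sketch below shows one way the correlation-based cross-attention described above could be written in PyTorch. The learnable joint projection, the softmax normalization and the feature dimensions are simplifying assumptions; the exact formulation is given in the paper [2].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionalFusion(nn.Module):
    """Simplified cross-attentional AV fusion (illustrative sketch).

    Expects audio features of shape (batch, T, d_a) and visual features
    of shape (batch, T, d_v) produced by the two backbones.
    """
    def __init__(self, d_a=128, d_v=128):
        super().__init__()
        # Learnable joint projection used to measure the correlation
        # between audio and visual features.
        self.joint = nn.Parameter(torch.empty(d_a, d_v))
        nn.init.xavier_uniform_(self.joint)

    def forward(self, feat_a, feat_v):
        # Cross-correlation between every pair of audio and visual time steps:
        # a higher value means the two modalities carry related, relevant information.
        corr = feat_a @ self.joint @ feat_v.transpose(1, 2)       # (batch, T, T)

        # Cross-attention weights derived from the correlation measure.
        attn_a = F.softmax(corr, dim=-1)                          # audio attending to visual
        attn_v = F.softmax(corr.transpose(1, 2), dim=-1)          # visual attending to audio

        # Attended features: each modality is re-weighted by the relevance
        # of the other modality's features.
        att_a = attn_a @ feat_v                                   # (batch, T, d_v)
        att_v = attn_v @ feat_a                                   # (batch, T, d_a)

        # Concatenated attended features, to be passed to the prediction
        # layers that regress valence and arousal.
        return torch.cat([att_a, att_v], dim=-1)                  # (batch, T, d_a + d_v)
```

With d_a = d_v = 128, this module outputs 256-dimensional fused features and could serve as the fusion component in the skeleton shown earlier; again, this is a sketch of the idea rather than the exact model reported in [2].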

Additional Information

For more information on that research, please read the following conference paper:

[2] R. G. Praveen, E. Granger and P. Cardinal, “Cross Attentional Audio-Visual Fusion for Dimensional Emotion Recognition,” 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), 2021, pp. 1-8.