Team RAS in the 10th ABAW Competition: Multimodal Valence and Arousal Estimation Approach

Official repository of Team RAS for the 10th ABAW Valence/Arousal Estimation Challenge. The repository describes a multimodal valence/arousal estimation approach that combines face, behavior, and audio modalities, and reports a Concordance Correlation Coefficient (CCC) of 0.658 on the development set of the Aff-Wild2 dataset.

Each top-level source folder is a standalone project focused on one part of the general pipeline (see Figure): visual dynamics, behavior modelling, audio modelling or multimodal fusion.

Repository structure

Although each folder is self-contained, the repository naturally forms a multimodal workflow:

Visual features can be extracted and modeled in src_visual_dynamic_model/.
Behavior descriptions / embeddings can be produced in src_behavior_description/.
A standalone predictor can be trained on those embeddings in src_behavior_model/.
Audio features and Chimera-based multimodal training live in src_audio_and_fusion/ (RAAV).
If you prefer a standalone precomputed-feature fusion setup, use src_fusion_model/ (DCMMOE).

`src_visual_dynamic_model/`

Dynamic visual valence/arousal model built on cached frame-level visual features.

This project is intended for temporal visual modeling. According to its local README, it includes:

dataset/index construction from Aff-Wild2-style annotation TXT files;
cached feature loading from .npz files;
a temporal model (VisualDynamicModel);
CCC-based loss and metrics;
best-checkpoint selection by validation VA score;
optional GRADA-based feature extraction from cropped faces;
export of PKL records and test TXT predictions.
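
For reference, the CCC-based loss and the VA score mentioned above can be sketched as follows. This is a minimal NumPy illustration, not the code from src_visual_dynamic_model/: the function names `ccc`, `ccc_loss`, and `va_score` are hypothetical, and the VA score is assumed to be the plain mean of the valence and arousal CCCs, as is usual in ABAW evaluations.

```python
import numpy as np


def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient between two 1-D sequences."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    # Population (biased) variance and covariance, used consistently.
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mean_t) * (y_pred - mean_p)).mean()
    denom = var_t + var_p + (mean_t - mean_p) ** 2
    # Two identical constant sequences give a zero denominator; return 0.0 then.
    return float(2.0 * cov / denom) if denom > 0 else 0.0


def ccc_loss(y_true, y_pred):
    """Loss that decreases as agreement improves (0 at perfect agreement)."""
    return 1.0 - ccc(y_true, y_pred)


def va_score(v_true, v_pred, a_true, a_pred):
    """Assumed VA score: mean of the valence CCC and the arousal CCC."""
    return 0.5 * (ccc(v_true, v_pred) + ccc(a_true, a_pred))
```

Unlike Pearson correlation, CCC also penalizes a shift in mean or scale, so a prediction that tracks the label perfectly but with a constant offset scores below 1.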

Typical workflow:

optionally extract / cache features;
train a temporal visual model;
export PKLs for downstream fusion or generate a test submission file.

`src_behavior_description/`

Behavior-description extraction with Qwen3-VL.

This folder currently contains a Qwen3-VL-based script that generates natural-language behavior descriptions and embedding packs from video segments. It turns short clips into:

textual affect descriptions;
multimodal pooled embeddings;
text-only pooled embeddings;
visual pooled embeddings;
serialized pickle artifacts for later reuse.

`src_behavior_model/`

Valence/arousal regression from precomputed behavior embeddings.

This project trains a model on top of cached behavior embeddings. Its local README describes it as a pipeline for:

loading precomputed segment-level embeddings (by default Qwen-based embeddings);
training a regression model for valence/arousal prediction;
evaluating on validation data;
saving the best checkpoint and training history.

The default entry point is:

python src_behavior_model/main.py

The main configuration lives in:

src_behavior_model/configs/text_va.toml

`src_audio_and_fusion/`

Audio modeling and Chimera-based multimodal fusion pipeline.

This folder contains the most complete end-to-end training pipeline in the repository. It covers:

audio extraction from videos;
sequence window generation;
filtering windows using speaking / mouth-openness heuristics;
audio model training and evaluation;
export of audio features / predictions;
Chimera ML-based multimodal fusion training and evaluation;
optional late fusion of several submission folders.

See the local README inside src_audio_and_fusion/ for the full step-by-step pipeline and expected data layout.
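
To illustrate the window generation and filtering steps listed above, here is a minimal sketch in plain Python. It is not the implementation from src_audio_and_fusion/: the function names, the per-frame `mouth_open` scores, and the `threshold` / `min_ratio` parameters are all assumptions made for this example.

```python
def make_windows(num_frames, window, stride):
    """Return (start, end) frame index pairs, with end exclusive."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if num_frames <= 0:
        return []
    last_start = max(num_frames - window, 0)
    starts = list(range(0, last_start + 1, stride))
    # Add one final window flush with the end so trailing frames are covered.
    if starts[-1] != last_start:
        starts.append(last_start)
    return [(s, min(s + window, num_frames)) for s in starts]


def keep_speaking(windows, mouth_open, threshold, min_ratio):
    """Keep windows where enough frames have a mouth-openness score above threshold."""
    kept = []
    for start, end in windows:
        frames = mouth_open[start:end]
        ratio = sum(1 for v in frames if v > threshold) / len(frames)
        if ratio >= min_ratio:
            kept.append((start, end))
    return kept
```

With 10 frames, a window of 4, and a stride of 3, `make_windows` yields three overlapping windows; `keep_speaking` then drops those in which the hypothetical mouth-openness score rarely exceeds the threshold.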

`src_fusion_model/`

Fusion model for precomputed multimodal PKL features.

This project performs feature-level multimodal fusion when each modality has already exported train.pkl, val.pkl, and test.pkl. The local README expects each PKL to contain records keyed by frame path, with embeddings, predictions, and labels.
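
The record layout described above might look roughly like the sketch below, which aligns several modalities on their shared frame paths and concatenates their embeddings. The key names `"embedding"`, `"prediction"`, and `"label"`, as well as both function names, are assumptions for illustration; the local README in src_fusion_model/ defines the actual format.

```python
import pickle


def load_records(fileobj):
    """Load one modality's records (assumed: dict keyed by frame path)."""
    records = pickle.load(fileobj)
    if not isinstance(records, dict):
        raise TypeError("expected a dict keyed by frame path")
    return records


def align_modalities(modalities):
    """Join modalities on shared frame paths.

    modalities: dict mapping modality name -> records dict.
    Returns a list of (frame_path, fused_embedding, label) tuples, where the
    fused embedding concatenates per-modality embeddings in sorted name order.
    """
    if not modalities:
        return []
    names = sorted(modalities)
    shared = set.intersection(*(set(modalities[n]) for n in names))
    rows = []
    for path in sorted(shared):
        fused = []
        for name in names:
            fused.extend(modalities[name][path]["embedding"])
        # Labels are assumed identical across modalities for the same frame.
        label = modalities[names[0]][path]["label"]
        rows.append((path, fused, label))
    return rows
```

Frames missing from any modality are dropped here, which is only one possible policy; a fusion model could instead mask or zero-fill the missing modality.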

This folder is useful when you already have several independent modality outputs and want to train a dedicated fusion architecture over them. It provides:

fusion-model training;
optional search / grid search over fusion settings;
test submission generation from a trained checkpoint.

This is the standalone multimodal fusion baseline outside the Chimera-based src_audio_and_fusion pipeline.

Citation

If you use this repository, please cite the paper:

Ryumina E., Markitantov M., Axyonov A., Ryumin D., Dolgushin M., Dresvyanskiy D., Karpov A. Team RAS in 10th ABAW Competition: Multimodal Valence and Arousal Estimation Approach // arXiv preprint arXiv:2603.13056, 2026.

or

@article{ryumina2026team,
  title={Team RAS in 10th ABAW Competition: Multimodal Valence and Arousal Estimation Approach},
  author={Ryumina, Elena and Markitantov, Maxim and Axyonov, Alexandr and Ryumin, Dmitry and Dolgushin, Mikhail and Dresvyanskiy, Denis and Karpov, Alexey},
  journal={arXiv preprint arXiv:2603.13056},
  year={2026},
  doi={10.48550/arXiv.2603.13056}
}
