
model: add image and video to LCO embed model#4384

Open
isaac-chung wants to merge 4 commits into main from lco-video

Conversation

Collaborator

@isaac-chung isaac-chung commented Apr 15, 2026

Extend the LCO-Embedding models to handle image and video, following the batch-call examples on the HF model pages.

I was able to verify the Omni-3B model against the MIEB-lite clustering (2 tasks) average scores. I will also run this on the MIEB-lite benchmark and a video task. Results: embeddings-benchmark/results#487

  • I have filled out the ModelMeta object to the extent possible
  • I have ensured that my model can be loaded using
    • mteb.get_model(model_name, revision) and
    • mteb.get_model_meta(model_name, revision)
  • I have tested the implementation works on a representative set of tasks.
  • The model is public, i.e., is available either as an API or the weights are publicly available to download
```shell
srun --partition=shared uv run --no-sync \
  mteb run -m LCO-Embedding/LCO-Embedding-Omni-3B \
  -b "MIEB(lite)" --output-folder /home/niklas/isaac/results
```

Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment


Looks reasonable, but it would be good to have @gowitheflow-1998 give it a review as well.

We should probably also add him as the contact in the model meta.

```diff
- audio_inputs, _, _ = process_mm_info(
+ audio_inputs, image_inputs, video_inputs = process_mm_info(
```
Member

@Samoed Samoed commented Apr 16, 2026


Using video with process_mm_info requires some hacks and can cause problems (@AdnanElAssadi56 ran into this for omni embed in #4388 (comment)). I looked through it, and it seems it doesn't do anything helpful in our case, so we can remove its usage: #4388 (comment)

@gowitheflow-1998
Contributor

Thanks! @isaac-chung @Samoed @KennethEnevoldsen, will test it out ASAP. Is there a branch I can test video tasks on?

@Samoed
Member

Samoed commented Apr 16, 2026

@gowitheflow-1998 Video tasks are integrated directly into main; we have integrated a few of them already. You can run Kinetics400VA or MSRVTTV2T.

@isaac-chung
Collaborator Author

@gowitheflow-1998 I just updated this branch. There should be a Kinetics400V task that uses only video.

@gowitheflow-1998
Contributor

Have checked! Indeed, the video part needed some hacks, as Qwen's process_mm_info from qwen_omni_utils takes either a video path or a list of image frames. I'd suggest the following implementation to stay compatible with the current video handling on mteb's end.

Take 2 frames every second (aligned with the default setting of Qwen-VL-style or Qwen-Omni-style models) and convert them to image frames. As most omni embedding models are based on Qwen2.5-Omni now (maybe soon 3 or 3.6), this block can be reused in a lot of other MVEB models, so that we can use Qwen's original process_mm_info without having to modify that part.

```python
# Video
video_row = video_list[i] if i < len(video_list) else None
if video_row is not None:
    total_frames = len(video_row)

    if total_frames > 0:
        target_fps = 2.0
        if hasattr(video_row, "metadata") and video_row.metadata.average_fps and video_row.metadata.average_fps > 0:
            duration_sec = total_frames / video_row.metadata.average_fps
        elif hasattr(video_row, "metadata") and video_row.metadata.duration_seconds:
            duration_sec = video_row.metadata.duration_seconds
        else:
            duration_sec = total_frames / 30.0  # fallback assumption if metadata is missing

        num_samples = max(1, int(duration_sec * target_fps))

        indices = torch.linspace(0, total_frames - 1, num_samples).long()

        pil_frames = []
        for idx in indices:
            frame_index = int(idx.item())
            frame_tensor = video_row[frame_index].data
            pil_frames.append(to_pil(frame_tensor))

        content.append({
            "type": "video",
            "video": pil_frames,
            # "max_pixels": 224 * 224  # optional
        })
```
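For reference, the uniform frame selection above (a linspace over the frame range, truncated to integer indices) reduces to a small computation that needs no torch at all. A stdlib-only sketch, with a hypothetical helper name:

```python
def sample_frame_indices(total_frames: int, duration_sec: float,
                         target_fps: float = 2.0) -> list[int]:
    """Uniformly spaced frame indices approximating `target_fps` sampling.

    Hypothetical stdlib-only equivalent of the
    torch.linspace(0, total_frames - 1, num_samples).long() call above.
    """
    num_samples = max(1, int(duration_sec * target_fps))
    if num_samples == 1 or total_frames == 1:
        return [0]
    # Integer arithmetic avoids float truncation edge cases at the endpoints.
    return [(i * (total_frames - 1)) // (num_samples - 1) for i in range(num_samples)]

# A 5-second clip decoded at 30 fps -> 10 indices spanning frames 0..149
print(sample_frame_indices(total_frames=150, duration_sec=5.0))
# [0, 16, 33, 49, 66, 82, 99, 115, 132, 149]
```

Because it uses exact integer division rather than floats, individual indices may occasionally differ by one frame from the float-based torch version, which is harmless for this sampling purpose.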

LCO-Embedding-Omni-3B results using this implementation:

{
  "dataset_revision": "4661603cee25c1fd370e5478a2953203cf37155b",
  "task_name": "MSRVTTV2T",
  "mteb_version": "2.12.19",
  "scores": {
    "test": [
      {
        "ndcg_at_1": 0.41069,
        "ndcg_at_3": 0.50421,
        "ndcg_at_5": 0.53485,
        "ndcg_at_10": 0.56797,
        "ndcg_at_20": 0.58558,
        "ndcg_at_100": 0.60971,
        "ndcg_at_1000": 0.61702,
        "map_at_1": 0.41069,
        "map_at_3": 0.4818,
        "map_at_5": 0.49863,
        "map_at_10": 0.5123,
        "map_at_20": 0.51717,
        "map_at_100": 0.52061,
        "map_at_1000": 0.52092,
        "recall_at_1": 0.41069,
        "recall_at_3": 0.56883,
        "recall_at_5": 0.64391,
        "recall_at_10": 0.7463,
        "recall_at_20": 0.8157,
        "recall_at_100": 0.94425,
        "recall_at_1000": 1.0,
        "accuracy": 0.41069,
        "precision_at_1": 0.41069,
        "precision_at_3": 0.18961,
        "precision_at_5": 0.12878,
        "precision_at_10": 0.07463,
        "precision_at_20": 0.04078,
        "precision_at_100": 0.00944,
        "precision_at_1000": 0.001,
        "mrr_at_1": 0.410694,
        "mrr_at_3": 0.481797,
        "mrr_at_5": 0.498635,
        "mrr_at_10": 0.512303,
        "mrr_at_20": 0.517168,
        "mrr_at_100": 0.520608,
        "mrr_at_1000": 0.520915,
        "nauc_ndcg_at_1_max": 0.400619,
        "nauc_ndcg_at_1_std": -0.199334,
        "nauc_ndcg_at_1_diff1": 0.59591,
        "nauc_ndcg_at_3_max": 0.422876,
        "nauc_ndcg_at_3_std": -0.208856,
        "nauc_ndcg_at_3_diff1": 0.54331,
        "nauc_ndcg_at_5_max": 0.421099,
        "nauc_ndcg_at_5_std": -0.208785,
        "nauc_ndcg_at_5_diff1": 0.541023,
        "nauc_ndcg_at_10_max": 0.433317,
        "nauc_ndcg_at_10_std": -0.198222,
        "nauc_ndcg_at_10_diff1": 0.53386,
        "nauc_ndcg_at_20_max": 0.431293,
        "nauc_ndcg_at_20_std": -0.193463,
        "nauc_ndcg_at_20_diff1": 0.535155,
        "nauc_ndcg_at_100_max": 0.431278,
        "nauc_ndcg_at_100_std": -0.187593,
        "nauc_ndcg_at_100_diff1": 0.540255,
        "nauc_ndcg_at_1000_max": 0.426029,
        "nauc_ndcg_at_1000_std": -0.197315,
        "nauc_ndcg_at_1000_diff1": 0.544727,
        "nauc_map_at_1_max": 0.400619,
        "nauc_map_at_1_std": -0.199334,
        "nauc_map_at_1_diff1": 0.59591,
        "nauc_map_at_3_max": 0.417249,
        "nauc_map_at_3_std": -0.205535,
        "nauc_map_at_3_diff1": 0.555626,
        "nauc_map_at_5_max": 0.415844,
        "nauc_map_at_5_std": -0.205665,
        "nauc_map_at_5_diff1": 0.554194,
        "nauc_map_at_10_max": 0.419981,
        "nauc_map_at_10_std": -0.20208,
        "nauc_map_at_10_diff1": 0.551677,
        "nauc_map_at_20_max": 0.419351,
        "nauc_map_at_20_std": -0.201274,
        "nauc_map_at_20_diff1": 0.552191,
        "nauc_map_at_100_max": 0.41963,
        "nauc_map_at_100_std": -0.200931,
        "nauc_map_at_100_diff1": 0.553015,
        "nauc_map_at_1000_max": 0.419481,
        "nauc_map_at_1000_std": -0.201207,
        "nauc_map_at_1000_diff1": 0.553171,
        "nauc_recall_at_1_max": 0.400619,
        "nauc_recall_at_1_std": -0.199334,
        "nauc_recall_at_1_diff1": 0.59591,
        "nauc_recall_at_3_max": 0.440159,
        "nauc_recall_at_3_std": -0.219456,
        "nauc_recall_at_3_diff1": 0.505916,
        "nauc_recall_at_5_max": 0.438982,
        "nauc_recall_at_5_std": -0.219375,
        "nauc_recall_at_5_diff1": 0.497458,
        "nauc_recall_at_10_max": 0.498547,
        "nauc_recall_at_10_std": -0.173885,
        "nauc_recall_at_10_diff1": 0.455886,
        "nauc_recall_at_20_max": 0.504695,
        "nauc_recall_at_20_std": -0.130327,
        "nauc_recall_at_20_diff1": 0.441107,
        "nauc_recall_at_100_max": 0.631076,
        "nauc_recall_at_100_std": 0.17858,
        "nauc_recall_at_100_diff1": 0.373815,
        "nauc_recall_at_1000_max": NaN,
        "nauc_recall_at_1000_std": NaN,
        "nauc_recall_at_1000_diff1": NaN,
        "nauc_precision_at_1_max": 0.400619,
        "nauc_precision_at_1_std": -0.199334,
        "nauc_precision_at_1_diff1": 0.59591,
        "nauc_precision_at_3_max": 0.440159,
        "nauc_precision_at_3_std": -0.219456,
        "nauc_precision_at_3_diff1": 0.505916,
        "nauc_precision_at_5_max": 0.438982,
        "nauc_precision_at_5_std": -0.219375,
        "nauc_precision_at_5_diff1": 0.497458,
        "nauc_precision_at_10_max": 0.498547,
        "nauc_precision_at_10_std": -0.173885,
        "nauc_precision_at_10_diff1": 0.455886,
        "nauc_precision_at_20_max": 0.504695,
        "nauc_precision_at_20_std": -0.130327,
        "nauc_precision_at_20_diff1": 0.441107,
        "nauc_precision_at_100_max": 0.631076,
        "nauc_precision_at_100_std": 0.17858,
        "nauc_precision_at_100_diff1": 0.373815,
        "nauc_precision_at_1000_max": NaN,
        "nauc_precision_at_1000_std": NaN,
        "nauc_precision_at_1000_diff1": NaN,
        "nauc_mrr_at_1_max": 0.400619,
        "nauc_mrr_at_1_std": -0.199334,
        "nauc_mrr_at_1_diff1": 0.59591,
        "nauc_mrr_at_3_max": 0.417249,
        "nauc_mrr_at_3_std": -0.205535,
        "nauc_mrr_at_3_diff1": 0.555626,
        "nauc_mrr_at_5_max": 0.415844,
        "nauc_mrr_at_5_std": -0.205665,
        "nauc_mrr_at_5_diff1": 0.554194,
        "nauc_mrr_at_10_max": 0.419981,
        "nauc_mrr_at_10_std": -0.20208,
        "nauc_mrr_at_10_diff1": 0.551677,
        "nauc_mrr_at_20_max": 0.419351,
        "nauc_mrr_at_20_std": -0.201274,
        "nauc_mrr_at_20_diff1": 0.552191,
        "nauc_mrr_at_100_max": 0.41963,
        "nauc_mrr_at_100_std": -0.200931,
        "nauc_mrr_at_100_diff1": 0.553015,
        "nauc_mrr_at_1000_max": 0.419481,
        "nauc_mrr_at_1000_std": -0.201207,
        "nauc_mrr_at_1000_diff1": 0.553171,
        "hit_rate_at_1": 0.41069,
        "hit_rate_at_3": 0.56883,
        "hit_rate_at_5": 0.64391,
        "hit_rate_at_10": 0.7463,
        "hit_rate_at_20": 0.8157,
        "hit_rate_at_100": 0.94425,
        "hit_rate_at_1000": 1.0,
        "main_score": 0.56797,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ]
      }
    ]
  },
  "evaluation_time": 1518.502744436264,
  "kg_co2_emissions": null,
  "date": 1776445562.486326
}
{
  "dataset_revision": "e5b93b6eae80b8c9e9c88a381baae84d29b34fd2",
  "task_name": "Kinetics400VA",
  "mteb_version": "2.12.19",
  "scores": {
    "test": [
      {
        "scores_per_experiment": [
          {
            "accuracy": 0.72816,
            "f1": 0.72272,
            "f1_weighted": 0.722717,
            "precision": 0.742297,
            "precision_weighted": 0.742329,
            "recall": 0.728181,
            "recall_weighted": 0.72816,
            "ap": null,
            "ap_weighted": null
          },
          {
            "accuracy": 0.724906,
            "f1": 0.719799,
            "f1_weighted": 0.719775,
            "precision": 0.736174,
            "precision_weighted": 0.736224,
            "recall": 0.724993,
            "recall_weighted": 0.724906,
            "ap": null,
            "ap_weighted": null
          },
          {
            "accuracy": 0.71965,
            "f1": 0.715303,
            "f1_weighted": 0.715288,
            "precision": 0.734947,
            "precision_weighted": 0.734983,
            "recall": 0.719715,
            "recall_weighted": 0.71965,
            "ap": null,
            "ap_weighted": null
          },
          {
            "accuracy": 0.729912,
            "f1": 0.726142,
            "f1_weighted": 0.726099,
            "precision": 0.745082,
            "precision_weighted": 0.745015,
            "recall": 0.729931,
            "recall_weighted": 0.729912,
            "ap": null,
            "ap_weighted": null
          },
          {
            "accuracy": 0.734168,
            "f1": 0.729161,
            "f1_weighted": 0.729174,
            "precision": 0.748121,
            "precision_weighted": 0.748267,
            "recall": 0.734299,
            "recall_weighted": 0.734168,
            "ap": null,
            "ap_weighted": null
          },
          {
            "accuracy": 0.722904,
            "f1": 0.7174,
            "f1_weighted": 0.717338,
            "precision": 0.736001,
            "precision_weighted": 0.735994,
            "recall": 0.723021,
            "recall_weighted": 0.722904,
            "ap": null,
            "ap_weighted": null
          },
          {
            "accuracy": 0.732916,
            "f1": 0.728341,
            "f1_weighted": 0.728391,
            "precision": 0.745523,
            "precision_weighted": 0.745646,
            "recall": 0.732938,
            "recall_weighted": 0.732916,
            "ap": null,
            "ap_weighted": null
          },
          {
            "accuracy": 0.729662,
            "f1": 0.722289,
            "f1_weighted": 0.722286,
            "precision": 0.736129,
            "precision_weighted": 0.736254,
            "recall": 0.729806,
            "recall_weighted": 0.729662,
            "ap": null,
            "ap_weighted": null
          },
          {
            "accuracy": 0.729662,
            "f1": 0.724061,
            "f1_weighted": 0.72401,
            "precision": 0.741733,
            "precision_weighted": 0.741795,
            "recall": 0.729833,
            "recall_weighted": 0.729662,
            "ap": null,
            "ap_weighted": null
          },
          {
            "accuracy": 0.729662,
            "f1": 0.726217,
            "f1_weighted": 0.726231,
            "precision": 0.743824,
            "precision_weighted": 0.74389,
            "recall": 0.729715,
            "recall_weighted": 0.729662,
            "ap": null,
            "ap_weighted": null
          }
        ],
        "accuracy": 0.72816,
        "f1": 0.723143,
        "f1_weighted": 0.723131,
        "precision": 0.740983,
        "precision_weighted": 0.74104,
        "recall": 0.728243,
        "recall_weighted": 0.72816,
        "ap": NaN,
        "ap_weighted": NaN,
        "main_score": 0.72816,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ]
      }
    ]
  },
  "evaluation_time": 76991.36311769485,
  "kg_co2_emissions": null,
  "date": 1776509406.896884
}

@Samoed
Member

Samoed commented Apr 18, 2026

Since Sentence Transformers support for this model was merged in https://huggingface.co/LCO-Embedding/LCO-Embedding-Omni-3B/discussions/1, we can probably use the multimodal wrapper:

```python
class SentenceTransformerMultimodalEncoderWrapper(SentenceTransformerEncoderWrapper):
```

@Samoed
Member

Samoed commented Apr 18, 2026

> Have checked! Indeed the video part needed some hacks, as qwen's process_mm_info from qwen_omni_utils takes either a video path or a list of image frames. I'd suggest the following implementation to be compatible with current video part on mteb's end.

During the omni embed nemotron integration, we investigated the processors and decided to remove process_mm_info entirely and use the processors directly: #4388 (comment)

> take 2 frames every second (align with default setting of Qwen-VL style or Qwen-Omni style models) and get image frames. As most omni embedding models are based on Qwen2.5-Omni now (maybe soon 3 or 3.6), this block can be re-used in a lot of other MVEB models, so that we can use Qwen's original process_mm_info without having to modify that part

This is done by the HF processors directly (https://github.com/huggingface/transformers/blob/a29df2d916e3b820aecd19d3b5a877abc523ba3c/src/transformers/models/qwen2_5_omni/processing_qwen2_5_omni.py#L162-L168), so we can probably remove the fps processing in the collator. cc @AdnanElAssadi56

@AdnanElAssadi56
Contributor

> > Have checked! Indeed the video part needed some hacks, as qwen's process_mm_info from qwen_omni_utils takes either a video path or a list of image frames. I'd suggest the following implementation to be compatible with current video part on mteb's end.
>
> During omni embed nemotron integration we investigated processors and decided to totally remove process_mm_info and use processors directly #4388 (comment)
>
> > take 2 frames every second (align with default setting of Qwen-VL style or Qwen-Omni style models) and get image frames. As most omni embedding models are based on Qwen2.5-Omni now (maybe soon 3 or 3.6), this block can be re-used in a lot of other MVEB models, so that we can use Qwen's original process_mm_info without having to modify that part
>
> This is done by HF processors directly https://github.com/huggingface/transformers/blob/a29df2d916e3b820aecd19d3b5a877abc523ba3c/src/transformers/models/qwen2_5_omni/processing_qwen2_5_omni.py#L162-L168 probably we can remove fps processing in collator cc @AdnanElAssadi56

We can keep the fps processing in the collator in case we encounter another type of processor, but yes, I agree, we should just pass to the processors directly. For ones like this that handle video decoders, we can pass the video decoders directly.

@Samoed Samoed linked an issue Apr 20, 2026 that may be closed by this pull request
@isaac-chung
Collaborator Author

Suggestion: since I've really only run image results on LCO embeddings, any objections to limiting this PR to the image change only? I can use a separate PR for video.



Development

Successfully merging this pull request may close these issues.

Support image in LCO

5 participants