
model: add image and video to LCO embed model#4384

Open
isaac-chung wants to merge 4 commits into main from lco-video

Conversation

Collaborator

@isaac-chung isaac-chung commented Apr 15, 2026

Extend the LCO-Embedding models to handle image and video, following the batch-call examples on the HF model pages.

I was able to verify the Omni-3B model against the MIEB-lite clustering (2 tasks) average scores. I will also run this on the MIEB-lite benchmark and a video task. Results: embeddings-benchmark/results#487

  • I have filled out the ModelMeta object to the extent possible
  • I have ensured that my model can be loaded using
    • mteb.get_model(model_name, revision) and
    • mteb.get_model_meta(model_name, revision)
  • I have tested the implementation works on a representative set of tasks.
  • The model is public, i.e., is available either as an API or the weights are publicly available to download
```shell
srun --partition=shared uv run --no-sync \
  mteb run -m LCO-Embedding/LCO-Embedding-Omni-3B \
  -b "MIEB(lite)" --output-folder /home/niklas/isaac/results
```

Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment


Looks reasonable, but it would be good to have @gowitheflow-1998 give it a review as well.

We should probably also add him as the contact in the model meta.

```diff
- audio_inputs, _, _ = process_mm_info(
+ audio_inputs, image_inputs, video_inputs = process_mm_info(
```
Member

@Samoed Samoed commented Apr 16, 2026


Using video with process_mm_info requires some hacks and can cause problems (@AdnanElAssadi56 ran into this for omni embed in #4388 (comment)). I looked through it, and it seems it doesn't do anything helpful in our case, so we can remove its usage: #4388 (comment)

@gowitheflow-1998
Contributor

Thanks! @isaac-chung @Samoed @KennethEnevoldsen, will test it out ASAP. Is there a branch I can test video tasks on?

@Samoed
Member

Samoed commented Apr 16, 2026

@gowitheflow-1998 Video tasks are integrated directly into main; we have integrated a few of them already. You can run Kinetics400VA or MSRVTTV2T.

@isaac-chung
Collaborator Author

@gowitheflow-1998 I just updated this branch. There should be a Kinetics400V task that uses only video.

@gowitheflow-1998
Contributor

Have checked! Indeed, the video part needed some hacks, as Qwen's process_mm_info from qwen_omni_utils takes either a video path or a list of image frames. I'd suggest the following implementation to stay compatible with the current video handling on mteb's end.

Take 2 frames every second (aligned with the default setting of Qwen-VL-style or Qwen-Omni-style models) and convert them to image frames. As most omni embedding models are based on Qwen2.5-Omni now (maybe soon 3 or 3.6), this block can be reused in a lot of other MVEB models, so that we can use Qwen's original process_mm_info without having to modify that part.

```python
# Video
video_row = video_list[i] if i < len(video_list) else None
if video_row is not None:
    total_frames = len(video_row)

    if total_frames > 0:
        target_fps = 2.0
        if hasattr(video_row, "metadata") and video_row.metadata.average_fps and video_row.metadata.average_fps > 0:
            duration_sec = total_frames / video_row.metadata.average_fps
        elif hasattr(video_row, "metadata") and video_row.metadata.duration_seconds:
            duration_sec = video_row.metadata.duration_seconds
        else:
            duration_sec = total_frames / 30.0  # fallback assumption if metadata is missing

        num_samples = max(1, int(duration_sec * target_fps))

        indices = torch.linspace(0, total_frames - 1, num_samples).long()

        pil_frames = []
        for idx in indices:
            frame_index = int(idx.item())
            frame_tensor = video_row[frame_index].data
            pil_frames.append(to_pil(frame_tensor))

        content.append({
            "type": "video",
            "video": pil_frames,
            # "max_pixels": 224 * 224  # optional
        })
```
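For reference, the uniform frame selection above (a linspace over the frame range, truncated to integer indices) reduces to a small computation that needs no torch at all. A stdlib-only sketch, with a hypothetical helper name:

```python
def sample_frame_indices(total_frames: int, duration_sec: float,
                         target_fps: float = 2.0) -> list[int]:
    """Uniformly spaced frame indices approximating `target_fps` sampling.

    Hypothetical stdlib-only equivalent of the
    torch.linspace(0, total_frames - 1, num_samples).long() call above.
    """
    num_samples = max(1, int(duration_sec * target_fps))
    if num_samples == 1 or total_frames == 1:
        return [0]
    # Integer arithmetic avoids float truncation edge cases at the endpoints.
    return [(i * (total_frames - 1)) // (num_samples - 1) for i in range(num_samples)]

# A 5-second clip decoded at 30 fps -> 10 indices spanning frames 0..149
print(sample_frame_indices(total_frames=150, duration_sec=5.0))
# [0, 16, 33, 49, 66, 82, 99, 115, 132, 149]
```

Because it uses exact integer division rather than floats, individual indices may occasionally differ by one frame from the float-based torch version, which is harmless for this sampling purpose.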

LCO-Embedding-Omni-3B results using this implementation:

{
  "dataset_revision": "4661603cee25c1fd370e5478a2953203cf37155b",
  "task_name": "MSRVTTV2T",
  "mteb_version": "2.12.19",
  "scores": {
    "test": [
      {
        "ndcg_at_1": 0.41069,
        "ndcg_at_3": 0.50421,
        "ndcg_at_5": 0.53485,
        "ndcg_at_10": 0.56797,
        "ndcg_at_20": 0.58558,
        "ndcg_at_100": 0.60971,
        "ndcg_at_1000": 0.61702,
        "map_at_1": 0.41069,
        "map_at_3": 0.4818,
        "map_at_5": 0.49863,
        "map_at_10": 0.5123,
        "map_at_20": 0.51717,
        "map_at_100": 0.52061,
        "map_at_1000": 0.52092,
        "recall_at_1": 0.41069,
        "recall_at_3": 0.56883,
        "recall_at_5": 0.64391,
        "recall_at_10": 0.7463,
        "recall_at_20": 0.8157,
        "recall_at_100": 0.94425,
        "recall_at_1000": 1.0,
        "accuracy": 0.41069,
        "precision_at_1": 0.41069,
        "precision_at_3": 0.18961,
        "precision_at_5": 0.12878,
        "precision_at_10": 0.07463,
        "precision_at_20": 0.04078,
        "precision_at_100": 0.00944,
        "precision_at_1000": 0.001,
        "mrr_at_1": 0.410694,
        "mrr_at_3": 0.481797,
        "mrr_at_5": 0.498635,
        "mrr_at_10": 0.512303,
        "mrr_at_20": 0.517168,
        "mrr_at_100": 0.520608,
        "mrr_at_1000": 0.520915,
        "nauc_ndcg_at_1_max": 0.400619,
        "nauc_ndcg_at_1_std": -0.199334,
        "nauc_ndcg_at_1_diff1": 0.59591,
        "nauc_ndcg_at_3_max": 0.422876,
        "nauc_ndcg_at_3_std": -0.208856,
        "nauc_ndcg_at_3_diff1": 0.54331,
        "nauc_ndcg_at_5_max": 0.421099,
        "nauc_ndcg_at_5_std": -0.208785,
        "nauc_ndcg_at_5_diff1": 0.541023,
        "nauc_ndcg_at_10_max": 0.433317,
        "nauc_ndcg_at_10_std": -0.198222,
        "nauc_ndcg_at_10_diff1": 0.53386,
        "nauc_ndcg_at_20_max": 0.431293,
        "nauc_ndcg_at_20_std": -0.193463,
        "nauc_ndcg_at_20_diff1": 0.535155,
        "nauc_ndcg_at_100_max": 0.431278,
        "nauc_ndcg_at_100_std": -0.187593,
        "nauc_ndcg_at_100_diff1": 0.540255,
        "nauc_ndcg_at_1000_max": 0.426029,
        "nauc_ndcg_at_1000_std": -0.197315,
        "nauc_ndcg_at_1000_diff1": 0.544727,
        "nauc_map_at_1_max": 0.400619,
        "nauc_map_at_1_std": -0.199334,
        "nauc_map_at_1_diff1": 0.59591,
        "nauc_map_at_3_max": 0.417249,
        "nauc_map_at_3_std": -0.205535,
        "nauc_map_at_3_diff1": 0.555626,
        "nauc_map_at_5_max": 0.415844,
        "nauc_map_at_5_std": -0.205665,
        "nauc_map_at_5_diff1": 0.554194,
        "nauc_map_at_10_max": 0.419981,
        "nauc_map_at_10_std": -0.20208,
        "nauc_map_at_10_diff1": 0.551677,
        "nauc_map_at_20_max": 0.419351,
        "nauc_map_at_20_std": -0.201274,
        "nauc_map_at_20_diff1": 0.552191,
        "nauc_map_at_100_max": 0.41963,
        "nauc_map_at_100_std": -0.200931,
        "nauc_map_at_100_diff1": 0.553015,
        "nauc_map_at_1000_max": 0.419481,
        "nauc_map_at_1000_std": -0.201207,
        "nauc_map_at_1000_diff1": 0.553171,
        "nauc_recall_at_1_max": 0.400619,
        "nauc_recall_at_1_std": -0.199334,
        "nauc_recall_at_1_diff1": 0.59591,
        "nauc_recall_at_3_max": 0.440159,
        "nauc_recall_at_3_std": -0.219456,
        "nauc_recall_at_3_diff1": 0.505916,
        "nauc_recall_at_5_max": 0.438982,
        "nauc_recall_at_5_std": -0.219375,
        "nauc_recall_at_5_diff1": 0.497458,
        "nauc_recall_at_10_max": 0.498547,
        "nauc_recall_at_10_std": -0.173885,
        "nauc_recall_at_10_diff1": 0.455886,
        "nauc_recall_at_20_max": 0.504695,
        "nauc_recall_at_20_std": -0.130327,
        "nauc_recall_at_20_diff1": 0.441107,
        "nauc_recall_at_100_max": 0.631076,
        "nauc_recall_at_100_std": 0.17858,
        "nauc_recall_at_100_diff1": 0.373815,
        "nauc_recall_at_1000_max": NaN,
        "nauc_recall_at_1000_std": NaN,
        "nauc_recall_at_1000_diff1": NaN,
        "nauc_precision_at_1_max": 0.400619,
        "nauc_precision_at_1_std": -0.199334,
        "nauc_precision_at_1_diff1": 0.59591,
        "nauc_precision_at_3_max": 0.440159,
        "nauc_precision_at_3_std": -0.219456,
        "nauc_precision_at_3_diff1": 0.505916,
        "nauc_precision_at_5_max": 0.438982,
        "nauc_precision_at_5_std": -0.219375,
        "nauc_precision_at_5_diff1": 0.497458,
        "nauc_precision_at_10_max": 0.498547,
        "nauc_precision_at_10_std": -0.173885,
        "nauc_precision_at_10_diff1": 0.455886,
        "nauc_precision_at_20_max": 0.504695,
        "nauc_precision_at_20_std": -0.130327,
        "nauc_precision_at_20_diff1": 0.441107,
        "nauc_precision_at_100_max": 0.631076,
        "nauc_precision_at_100_std": 0.17858,
        "nauc_precision_at_100_diff1": 0.373815,
        "nauc_precision_at_1000_max": NaN,
        "nauc_precision_at_1000_std": NaN,
        "nauc_precision_at_1000_diff1": NaN,
        "nauc_mrr_at_1_max": 0.400619,
        "nauc_mrr_at_1_std": -0.199334,
        "nauc_mrr_at_1_diff1": 0.59591,
        "nauc_mrr_at_3_max": 0.417249,
        "nauc_mrr_at_3_std": -0.205535,
        "nauc_mrr_at_3_diff1": 0.555626,
        "nauc_mrr_at_5_max": 0.415844,
        "nauc_mrr_at_5_std": -0.205665,
        "nauc_mrr_at_5_diff1": 0.554194,
        "nauc_mrr_at_10_max": 0.419981,
        "nauc_mrr_at_10_std": -0.20208,
        "nauc_mrr_at_10_diff1": 0.551677,
        "nauc_mrr_at_20_max": 0.419351,
        "nauc_mrr_at_20_std": -0.201274,
        "nauc_mrr_at_20_diff1": 0.552191,
        "nauc_mrr_at_100_max": 0.41963,
        "nauc_mrr_at_100_std": -0.200931,
        "nauc_mrr_at_100_diff1": 0.553015,
        "nauc_mrr_at_1000_max": 0.419481,
        "nauc_mrr_at_1000_std": -0.201207,
        "nauc_mrr_at_1000_diff1": 0.553171,
        "hit_rate_at_1": 0.41069,
        "hit_rate_at_3": 0.56883,
        "hit_rate_at_5": 0.64391,
        "hit_rate_at_10": 0.7463,
        "hit_rate_at_20": 0.8157,
        "hit_rate_at_100": 0.94425,
        "hit_rate_at_1000": 1.0,
        "main_score": 0.56797,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ]
      }
    ]
  },
  "evaluation_time": 1518.502744436264,
  "kg_co2_emissions": null,
  "date": 1776445562.486326
}
{
  "dataset_revision": "e5b93b6eae80b8c9e9c88a381baae84d29b34fd2",
  "task_name": "Kinetics400VA",
  "mteb_version": "2.12.19",
  "scores": {
    "test": [
      {
        "scores_per_experiment": [
          {
            "accuracy": 0.72816,
            "f1": 0.72272,
            "f1_weighted": 0.722717,
            "precision": 0.742297,
            "precision_weighted": 0.742329,
            "recall": 0.728181,
            "recall_weighted": 0.72816,
            "ap": null,
            "ap_weighted": null
          },
          {
            "accuracy": 0.724906,
            "f1": 0.719799,
            "f1_weighted": 0.719775,
            "precision": 0.736174,
            "precision_weighted": 0.736224,
            "recall": 0.724993,
            "recall_weighted": 0.724906,
            "ap": null,
            "ap_weighted": null
          },
          {
            "accuracy": 0.71965,
            "f1": 0.715303,
            "f1_weighted": 0.715288,
            "precision": 0.734947,
            "precision_weighted": 0.734983,
            "recall": 0.719715,
            "recall_weighted": 0.71965,
            "ap": null,
            "ap_weighted": null
          },
          {
            "accuracy": 0.729912,
            "f1": 0.726142,
            "f1_weighted": 0.726099,
            "precision": 0.745082,
            "precision_weighted": 0.745015,
            "recall": 0.729931,
            "recall_weighted": 0.729912,
            "ap": null,
            "ap_weighted": null
          },
          {
            "accuracy": 0.734168,
            "f1": 0.729161,
            "f1_weighted": 0.729174,
            "precision": 0.748121,
            "precision_weighted": 0.748267,
            "recall": 0.734299,
            "recall_weighted": 0.734168,
            "ap": null,
            "ap_weighted": null
          },
          {
            "accuracy": 0.722904,
            "f1": 0.7174,
            "f1_weighted": 0.717338,
            "precision": 0.736001,
            "precision_weighted": 0.735994,
            "recall": 0.723021,
            "recall_weighted": 0.722904,
            "ap": null,
            "ap_weighted": null
          },
          {
            "accuracy": 0.732916,
            "f1": 0.728341,
            "f1_weighted": 0.728391,
            "precision": 0.745523,
            "precision_weighted": 0.745646,
            "recall": 0.732938,
            "recall_weighted": 0.732916,
            "ap": null,
            "ap_weighted": null
          },
          {
            "accuracy": 0.729662,
            "f1": 0.722289,
            "f1_weighted": 0.722286,
            "precision": 0.736129,
            "precision_weighted": 0.736254,
            "recall": 0.729806,
            "recall_weighted": 0.729662,
            "ap": null,
            "ap_weighted": null
          },
          {
            "accuracy": 0.729662,
            "f1": 0.724061,
            "f1_weighted": 0.72401,
            "precision": 0.741733,
            "precision_weighted": 0.741795,
            "recall": 0.729833,
            "recall_weighted": 0.729662,
            "ap": null,
            "ap_weighted": null
          },
          {
            "accuracy": 0.729662,
            "f1": 0.726217,
            "f1_weighted": 0.726231,
            "precision": 0.743824,
            "precision_weighted": 0.74389,
            "recall": 0.729715,
            "recall_weighted": 0.729662,
            "ap": null,
            "ap_weighted": null
          }
        ],
        "accuracy": 0.72816,
        "f1": 0.723143,
        "f1_weighted": 0.723131,
        "precision": 0.740983,
        "precision_weighted": 0.74104,
        "recall": 0.728243,
        "recall_weighted": 0.72816,
        "ap": NaN,
        "ap_weighted": NaN,
        "main_score": 0.72816,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ]
      }
    ]
  },
  "evaluation_time": 76991.36311769485,
  "kg_co2_emissions": null,
  "date": 1776509406.896884
}

@Samoed
Member

Samoed commented Apr 18, 2026

Since Sentence Transformers support for this model was merged in https://huggingface.co/LCO-Embedding/LCO-Embedding-Omni-3B/discussions/1, we can probably use the multimodal wrapper:

```python
class SentenceTransformerMultimodalEncoderWrapper(SentenceTransformerEncoderWrapper):
```

@Samoed
Member

Samoed commented Apr 18, 2026

> Have checked! Indeed the video part needed some hacks, as qwen's process_mm_info from qwen_omni_utils takes either a video path or a list of image frames. I'd suggest the following implementation to be compatible with current video part on mteb's end.

During the omni embed nemotron integration, we investigated the processors and decided to remove process_mm_info entirely and use the processors directly: #4388 (comment)

> take 2 frames every second (align with default setting of Qwen-VL style or Qwen-Omni style models) and get image frames. As most omni embedding models are based on Qwen2.5-Omni now (maybe soon 3 or 3.6), this block can be re-used in a lot of other MVEB models, so that we can use Qwen's original process_mm_info without having to modify that part

This is done by the HF processors directly (https://github.com/huggingface/transformers/blob/a29df2d916e3b820aecd19d3b5a877abc523ba3c/src/transformers/models/qwen2_5_omni/processing_qwen2_5_omni.py#L162-L168), so we can probably remove the fps processing in the collator. cc @AdnanElAssadi56

@AdnanElAssadi56
Contributor

> > Have checked! Indeed the video part needed some hacks, as qwen's process_mm_info from qwen_omni_utils takes either a video path or a list of image frames. I'd suggest the following implementation to be compatible with current video part on mteb's end.
>
> During omni embed nemotron integration we investigated processors and decided to totally remove process_mm_info and use processors directly #4388 (comment)
>
> > take 2 frames every second (align with default setting of Qwen-VL style or Qwen-Omni style models) and get image frames. As most omni embedding models are based on Qwen2.5-Omni now (maybe soon 3 or 3.6), this block can be re-used in a lot of other MVEB models, so that we can use Qwen's original process_mm_info without having to modify that part
>
> This is done by HF processors directly https://github.com/huggingface/transformers/blob/a29df2d916e3b820aecd19d3b5a877abc523ba3c/src/transformers/models/qwen2_5_omni/processing_qwen2_5_omni.py#L162-L168 probably we can remove fps processing in collator cc @AdnanElAssadi56

We can keep the fps processing in the collator in case we encounter another type of processor, but yes, I agree, we should just pass to the processors directly. For ones like this that handle video decoders, we can pass the video decoders directly.

@Samoed Samoed linked an issue Apr 20, 2026 that may be closed by this pull request
@isaac-chung
Collaborator Author

Suggestion: since I've really only run image results on LCO embeddings, any objections to limiting this PR to the image change only? I can use a separate PR for video.



Development

Successfully merging this pull request may close these issues.

Support image in LCO

5 participants