model: add image and video to LCO embed model #4384
Conversation
KennethEnevoldsen
left a comment
Looks reasonable, but it would be good to have @gowitheflow-1998 give it a review as well.
We should probably also add him as the contact in the model meta.
```diff
- audio_inputs, _, _ = process_mm_info(
+ audio_inputs, image_inputs, video_inputs = process_mm_info(
```
Using video with process_mm_info requires some hacks and can cause problems (@AdnanElAssadi56 hit this for omni embed, #4388 (comment)). I looked through it, and it doesn't seem to do anything helpful in our case, so we can remove its usage: #4388 (comment)
Thanks! @isaac-chung @Samoed @KennethEnevoldsen Will test it out asap. Any branch I can test video tasks on?
@gowitheflow-1998 Video tasks are integrated directly into main. We have integrated a few of them already. You can run
@gowitheflow-1998 I just updated this branch. There should be
Have checked! Indeed the video part needed some hacks, as Qwen's processor takes 2 frames every second (aligned with the default setting of Qwen-VL-style or Qwen-Omni-style models) to get image frames. As most omni embedding models are based on Qwen2.5-Omni now (maybe soon 3 or 3.6), this block can be re-used in a lot of other models:

```python
# Video
video_row = video_list[i] if i < len(video_list) else None
if video_row is not None:
    total_frames = len(video_row)
    if total_frames > 0:
        target_fps = 2.0
        metadata = getattr(video_row, "metadata", None)
        if metadata is not None and getattr(metadata, "average_fps", None) and metadata.average_fps > 0:
            duration_sec = total_frames / metadata.average_fps
        elif metadata is not None and getattr(metadata, "duration_seconds", None):
            duration_sec = metadata.duration_seconds
        else:
            duration_sec = total_frames / 30.0  # Fallback assumption if metadata is missing
        num_samples = max(1, int(duration_sec * target_fps))
        indices = torch.linspace(0, total_frames - 1, num_samples).long()
        pil_frames = []
        for idx in indices:
            frame_index = int(idx.item())
            frame_tensor = video_row[frame_index].data
            pil_frames.append(to_pil(frame_tensor))
        content.append({
            "type": "video",
            "video": pil_frames,
            # "max_pixels": 224 * 224  # Optional
        })
```
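The sampling math above (estimate the duration, then pick evenly spaced frame indices at a target fps) can be sketched without torch or any video decoder. This is a minimal pure-Python equivalent with hypothetical helper names, not the code actually used in the PR:

```python
def estimate_duration(total_frames, average_fps=None, duration_seconds=None):
    """Mirror the metadata fallback chain: average_fps, then duration, then assume 30 fps."""
    if average_fps and average_fps > 0:
        return total_frames / average_fps
    if duration_seconds:
        return duration_seconds
    return total_frames / 30.0


def sample_frame_indices(total_frames, duration_sec, target_fps=2.0):
    """Evenly spaced indices; equivalent to torch.linspace(0, total_frames - 1, n).long()."""
    num_samples = max(1, int(duration_sec * target_fps))
    if num_samples == 1:
        return [0]
    return [int(i * (total_frames - 1) / (num_samples - 1)) for i in range(num_samples)]


# A 3-second clip at 30 fps (90 frames) sampled at 2 fps yields 6 frames:
duration = estimate_duration(90, average_fps=30.0)
print(sample_frame_indices(90, duration))  # [0, 17, 35, 53, 71, 89]
```

The truncation toward zero in `int(...)` matches `.long()` on the non-negative linspace values, so both versions select the same frames.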
Since this model was merged with sentence transformers in https://huggingface.co/LCO-Embedding/LCO-Embedding-Omni-3B/discussions/1, we can probably use the Multimodal wrapper.
During the omni embed nemotron integration we investigated processors and decided to totally remove
This is done by HF processors directly (https://github.com/huggingface/transformers/blob/a29df2d916e3b820aecd19d3b5a877abc523ba3c/src/transformers/models/qwen2_5_omni/processing_qwen2_5_omni.py#L162-L168), so we can probably remove fps processing in the collator. cc @AdnanElAssadi56
We can keep fps processing in the collator in case we encounter another type of processor, but yes, I agree, we should just pass to the processors directly. For ones like this that handle video decoders, we can pass the video decoders directly.
Suggestion: since I've really only run image results on LCO embeddings, any objections to limiting this PR to the image change only? I can use a separate PR for video.
Extend LCO-embedding models to handle image and video, following the batch call examples in the HF model pages.
I was able to verify the omni-3B model with the MIEB-lite clustering (2 tasks) average scores. Will also run this on the MIEB-lite benchmark and a video task. Results: embeddings-benchmark/results#487
`mteb.get_model(model_name, revision)` and `mteb.get_model_meta(model_name, revision)`