[MAEB] Wav2Clip Text Encoder #3781

Closed
AdnanElAssadi56 wants to merge 2 commits into embeddings-benchmark:maeb from AdnanElAssadi56:maeb-model-wav2clip_fix

@AdnanElAssadi56
Contributor

Related to #3545

@Samoed I've double-checked the paper (arXiv:2110.11499v2), and it seems to confirm that we can use the standard CLIP text encoder.

  1. The CLIP model is not tuned. The paper explicitly states in Section 2: "Throughout distillation, the original CLIP model weights are kept frozen".
  2. The authors note that "we get image and text modality for free".
  3. In their own evaluation (Section 3.2), they describe the process as extracting "CLIP text and Wav2CLIP audio embeddings".

Since the text encoder is identical to the standard CLIP encoder, I think we can safely get the text embeddings from the original CLIP model, and they will be mathematically aligned with the audio embeddings from wav2clip.
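
To illustrate what "aligned" means in practice, here is a minimal sketch of comparing an audio embedding against candidate text embeddings by cosine similarity, which only makes sense when both come from the same embedding space. The function names (`l2_normalize`, `cosine_similarity`, `rank_texts`) and the toy 3-dimensional vectors are illustrative assumptions; they are not part of this PR or of wav2clip.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_texts(audio_embedding, text_embeddings):
    """Return indices of text embeddings, most similar to the audio first."""
    scores = [cosine_similarity(audio_embedding, t) for t in text_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy stand-ins for real embeddings (assumed values, for illustration only).
audio = [1.0, 0.0, 0.0]
texts = [[0.0, 1.0, 0.0], [2.0, 0.1, 0.0], [-1.0, 0.0, 0.0]]
ranking = rank_texts(audio, texts)
```

In a real evaluation the audio vector would come from the Wav2CLIP audio encoder and the text vectors from the frozen CLIP text encoder; the ranking step itself would stay the same.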

- # text side (CLIP)
- self.clip = CLIPModel.from_pretrained(model_name, revision=revision).to(device)
+ # text side (CLIP) - we use the standard OpenAI CLIP model as mentioned in paper
+ clip_model_name = "openai/clip-vit-base-patch32"
Member

Can we process this without loading another model?

Contributor Author

If you mean this way:

self.clip = CLIPModel.from_pretrained(model_name, revision=revision).to(device)

Then no: there is no Hugging Face revision for the model, so this call was returning a 404 error during evaluation.
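
A minimal sketch of the workaround, assuming the CLIP checkpoint is pinned by its own Hub name rather than by the Wav2CLIP `model_name` and `revision`. The names `CLIP_TEXT_CHECKPOINT` and `load_clip_text_side` are hypothetical and not taken from the PR; only `CLIPModel.from_pretrained` and `CLIPTokenizer.from_pretrained` are real `transformers` calls.

```python
# Checkpoint name taken from the diff above; pinned separately from the
# Wav2CLIP model name, which has no matching Hugging Face revision.
CLIP_TEXT_CHECKPOINT = "openai/clip-vit-base-patch32"


def load_clip_text_side(device="cpu", checkpoint=CLIP_TEXT_CHECKPOINT):
    """Load the frozen CLIP model and tokenizer for the text side (sketch)."""
    # Imported lazily so that merely defining this helper does not require
    # transformers to be installed or any weights to be downloaded.
    from transformers import CLIPModel, CLIPTokenizer

    model = CLIPModel.from_pretrained(checkpoint).to(device)
    model.eval()  # inference only; the CLIP weights are never fine-tuned
    tokenizer = CLIPTokenizer.from_pretrained(checkpoint)
    return model, tokenizer
```

Calling `load_clip_text_side()` downloads the CLIP weights on first use, which is the extra model load raised in the question above.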

Member

@Samoed Samoed Dec 22, 2025

Comment thread mteb/models/model_implementations/wav2clip_model.py
@Samoed Samoed added the audio Audio extension label Dec 23, 2025
@github-actions
Contributor

github-actions Bot commented Jan 6, 2026

This pull request has been automatically marked as stale due to inactivity.

Labels

audio Audio extension stale

2 participants