Question about the Dataset construction #34

@Parul-Gupta

Description

Hi FastComposer team,
Kudos on this insightful and amazing work, and thanks for sharing the code with the community!

In the Dataset Construction part of the paper (Section 5.1), it is mentioned that:

Finally, we use a greedy matching algorithm to match noun phrases with image segments. We do this by considering the product of the image-text similarity score by the OpenCLIP model (CLIP-ViT-H-14-laion2B-s32B-b79K) and the label-text similarity score by the Sentence-Transformer model (stsb-mpnet-base-v2).

Could you please clarify this further? If I understand correctly, the OpenCLIP features of the image segments are matched with the Sentence-Transformer features of the noun phrases. Is that correct?
If so, how is an image segment given as input to the OpenCLIP model? Is the part of the image outside the segment masked out (e.g., set to zero/black pixels)?
It would be great if you could share the code for this process too.
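
For reference, here is a minimal sketch of how I currently imagine this matching step working. The two model names are taken from the paper; everything else (the zero-pixel masking, cosine similarities, and the greedy loop itself) is my own assumption, so please correct whatever differs from your actual pipeline:

```python
# Sketch of my understanding of the Section 5.1 matching step.
# Model names are from the paper; the masking scheme and greedy loop
# are assumptions on my part, not the authors' released code.
import numpy as np
import torch
from PIL import Image
import open_clip
from sentence_transformers import SentenceTransformer, util

device = "cuda" if torch.cuda.is_available() else "cpu"

clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k", device=device)
tokenizer = open_clip.get_tokenizer("ViT-H-14")
st_model = SentenceTransformer("stsb-mpnet-base-v2", device=device)


def greedy_match(image, seg_masks, seg_labels, noun_phrases):
    """Greedily assign noun phrases to image segments using the product
    of CLIP image-text similarity and label-text similarity.

    image:        RGB PIL.Image
    seg_masks:    list of boolean numpy arrays of shape (H, W)
    seg_labels:   list of strings (one class label per segment)
    noun_phrases: list of strings parsed from the caption
    """
    img = np.asarray(image)
    # Assumed masking scheme: zero out all pixels outside each segment.
    segments = [Image.fromarray(img * m[..., None]) for m in seg_masks]

    with torch.no_grad():
        seg_feats = clip_model.encode_image(
            torch.stack([preprocess(s) for s in segments]).to(device))
        phrase_feats = clip_model.encode_text(
            tokenizer(noun_phrases).to(device))
    seg_feats = seg_feats / seg_feats.norm(dim=-1, keepdim=True)
    phrase_feats = phrase_feats / phrase_feats.norm(dim=-1, keepdim=True)
    clip_sim = seg_feats @ phrase_feats.T              # (n_seg, n_phrase)

    label_sim = util.cos_sim(                          # (n_seg, n_phrase)
        st_model.encode(seg_labels, convert_to_tensor=True),
        st_model.encode(noun_phrases, convert_to_tensor=True))

    score = (clip_sim * label_sim).cpu().numpy()
    matches = []
    # Greedy loop: repeatedly take the best remaining (segment, phrase)
    # pair until every segment or every phrase has been used.
    while score.size and np.isfinite(score).any():
        i, j = np.unravel_index(np.argmax(score), score.shape)
        matches.append((i, j, float(score[i, j])))
        score[i, :] = -np.inf                          # segment consumed
        score[:, j] = -np.inf                          # phrase consumed
    return matches
```

Is this roughly the right structure, or does the actual implementation differ (e.g., cropping to the segment's bounding box instead of masking the full image)?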

Thanks a lot!
