Question about the Dataset construction #34

@Parul-Gupta

Description

Hi FastComposer team,
Kudos on this insightful and amazing work, and thanks for sharing the code with the community!

In the Dataset Construction part of the paper (Section 5.1), it is mentioned that:

Finally, we use a greedy matching algorithm to match noun phrases with image segments. We do this by considering the product of the image-text similarity score by the OpenCLIP model (CLIP-ViT-H-14-laion2B-s32B-b79K) and the label-text similarity score by the Sentence-Transformer model (stsb-mpnet-base-v2).

Could you please clarify this further? If I understand correctly, the OpenCLIP features of the image segments are matched with the Sentence-Transformer features of the noun phrases. Is that correct?
If so, how is an image segment given as input to the OpenCLIP model? Is the part of the image outside the segment masked out (e.g., set to zero/black pixels)?
It would be great if you could share the code for this process too.
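
For reference, here is a minimal sketch of how I currently imagine this matching step working. The two model names are taken from the paper; everything else (the zero-pixel masking, cosine similarities, and the greedy loop itself) is my own assumption, so please correct whatever differs from your actual pipeline:

```python
# Sketch of my understanding of the Section 5.1 matching step.
# Model names are from the paper; the masking scheme and greedy loop
# are assumptions on my part, not the authors' released code.
import numpy as np
import torch
from PIL import Image
import open_clip
from sentence_transformers import SentenceTransformer, util

device = "cuda" if torch.cuda.is_available() else "cpu"

clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k", device=device)
tokenizer = open_clip.get_tokenizer("ViT-H-14")
st_model = SentenceTransformer("stsb-mpnet-base-v2", device=device)


def greedy_match(image, seg_masks, seg_labels, noun_phrases):
    """Greedily assign noun phrases to image segments using the product
    of CLIP image-text similarity and label-text similarity.

    image:        RGB PIL.Image
    seg_masks:    list of boolean numpy arrays of shape (H, W)
    seg_labels:   list of strings (one class label per segment)
    noun_phrases: list of strings parsed from the caption
    """
    img = np.asarray(image)
    # Assumed masking scheme: zero out all pixels outside each segment.
    segments = [Image.fromarray(img * m[..., None]) for m in seg_masks]

    with torch.no_grad():
        seg_feats = clip_model.encode_image(
            torch.stack([preprocess(s) for s in segments]).to(device))
        phrase_feats = clip_model.encode_text(
            tokenizer(noun_phrases).to(device))
    seg_feats = seg_feats / seg_feats.norm(dim=-1, keepdim=True)
    phrase_feats = phrase_feats / phrase_feats.norm(dim=-1, keepdim=True)
    clip_sim = seg_feats @ phrase_feats.T              # (n_seg, n_phrase)

    label_sim = util.cos_sim(                          # (n_seg, n_phrase)
        st_model.encode(seg_labels, convert_to_tensor=True),
        st_model.encode(noun_phrases, convert_to_tensor=True))

    score = (clip_sim * label_sim).cpu().numpy()
    matches = []
    # Greedy loop: repeatedly take the best remaining (segment, phrase)
    # pair until every segment or every phrase has been used.
    while score.size and np.isfinite(score).any():
        i, j = np.unravel_index(np.argmax(score), score.shape)
        matches.append((i, j, float(score[i, j])))
        score[i, :] = -np.inf                          # segment consumed
        score[:, j] = -np.inf                          # phrase consumed
    return matches
```

Is this roughly the right structure, or does the actual implementation differ (e.g., cropping to the segment's bounding box instead of masking the full image)?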

Thanks a lot!
