The Flash agent performs significantly worse with GPT-4o-mini #93

@KaminariOS

I am not sure which model (GPT-4-turbo?) was used for the evaluation in the paper.

However, to reduce API cost, I used GPT-4o-mini for the FLASH agent, and it performs badly (11% on detection), while ReAct's performance (72% on detection) is close to the result in the paper.

For the simple problem k8s_target_port-misconfig-detection-1, it consumed 50000 tokens, reached max_steps, and never called submit.

I guess it is because of the additional context from the hindsight?
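
To make the guess above concrete, here is a toy calculation of how prompt tokens could add up if the full history is resent on every step and each step also appends some hindsight text. This is only a sketch: the function name, the per-step sizes, and the assumption about how the prompt is built are all hypothetical, and none of the numbers are measured from the FLASH agent.

```python
def cumulative_prompt_tokens(steps, base, per_step, extra_per_step=0):
    """Total prompt tokens sent over `steps` model calls.

    Assumes (hypothetically) that every call resends a fixed `base` prompt
    plus the whole history, and that each step grows the history by
    `per_step` tokens plus `extra_per_step` tokens of added context
    such as hindsight.
    """
    total = 0
    history = 0
    for _ in range(steps):
        total += base + history          # tokens sent on this call
        history += per_step + extra_per_step  # history grows after the call
    return total


# Illustrative sizes only, not measurements.
without_extra = cumulative_prompt_tokens(steps=10, base=1000, per_step=200)
with_extra = cumulative_prompt_tokens(steps=10, base=1000, per_step=200,
                                      extra_per_step=300)
print(without_extra, with_extra)
```

Under these assumptions the extra per-step context adds to every later call, so the total grows roughly quadratically with the number of steps. Logging the actual prompt token count per step would show whether this is what happens here.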
