I am not sure which model (GPT-4-turbo?) was used for evaluation in the paper.
However, to reduce API cost, I used GPT-4o-mini for the FLASH agent, and it performs poorly (11% on detection), while ReAct's performance (72% on detection) is close to the result reported in the paper.
Even on the simple problem k8s_target_port-misconfig-detection-1, it consumed 50,000 tokens, reached max_steps, and never called submit at the end.
Could this be because of the additional context added by the hindsight mechanism?
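In case it helps with diagnosing this, here is a minimal sketch of how I have been counting prompt tokens per step to check whether the accumulated hindsight/history context is what blows up the prompt. It is not tied to the actual FLASH implementation; the message-list structure and the `agent_prompt_history` name are assumptions for illustration.

```python
import tiktoken

# o200k_base is the tokenizer family used by the GPT-4o models
# (assumption: close enough for a rough per-step count).
enc = tiktoken.get_encoding("o200k_base")

def count_prompt_tokens(messages):
    """Rough token count for a list of chat messages.

    `messages` is assumed to be a list of {"role": ..., "content": ...}
    dicts, i.e. whatever the agent sends to the API at each step.
    """
    return sum(len(enc.encode(m["content"])) for m in messages)

# Hypothetical usage inside the agent loop: log the prompt size every step
# to see whether the hindsight/history context grows without bound.
# for step, messages in enumerate(agent_prompt_history):
#     print(f"step {step}: {count_prompt_tokens(messages)} prompt tokens")
```

If the per-step prompt size keeps growing, that would point to the hindsight context (rather than GPT-4o-mini itself) as the main driver of the token usage.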