I am not sure which model (GPT-4-turbo?) was used for evaluation in the paper.
However, to reduce API cost, I used GPT-4o-mini for the FLASH agent, and it performs poorly (11% on detection), while ReAct's performance (72% on detection) is close to the result reported in the paper.
Even on the simple problem k8s_target_port-misconfig-detection-1, it consumed 50,000 tokens, reached max_steps, and never called submit at the end.
Could this be because of the additional context added by the hindsight mechanism?
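In case it helps with diagnosing this, here is a minimal sketch of how I have been counting prompt tokens per step to check whether the accumulated hindsight/history context is what blows up the prompt. It is not tied to the actual FLASH implementation; the message-list structure and the `agent_prompt_history` name are assumptions for illustration.

```python
import tiktoken

# o200k_base is the tokenizer family used by the GPT-4o models
# (assumption: close enough for a rough per-step count).
enc = tiktoken.get_encoding("o200k_base")

def count_prompt_tokens(messages):
    """Rough token count for a list of chat messages.

    `messages` is assumed to be a list of {"role": ..., "content": ...}
    dicts, i.e. whatever the agent sends to the API at each step.
    """
    return sum(len(enc.encode(m["content"])) for m in messages)

# Hypothetical usage inside the agent loop: log the prompt size every step
# to see whether the hindsight/history context grows without bound.
# for step, messages in enumerate(agent_prompt_history):
#     print(f"step {step}: {count_prompt_tokens(messages)} prompt tokens")
```

If the per-step prompt size keeps growing, that would point to the hindsight context (rather than GPT-4o-mini itself) as the main driver of the token usage.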