* Create a benchmark dataset with several scenarios covering each of the documented difficulty levels. Expected outcome: the dataset meets the human inspection success metric mentioned in #928.