Hi,
I found that some answers with a higher overall_score have a lower helpfulness_score in the evol_instruct.jsonl dataset, whose principle is 100% helpfulness.
For example, the scores of the 9th sample in evol_instruct.jsonl are as follows:
| models | helpfulness | honesty | instruction following | truthfulness | overall score |
| --- | --- | --- | --- | --- | --- |
| gpt-3.5-turbo | 4 | 5 | 4 | 5 | 7 |
| llama-2-70b-chat | 4 | 4 | 5 | 5 | 7.5 |
| mpt-30b-chat | 3 | 4 | 3 | 5 | 6.5 |
| vicuna-33b | 5 | 4 | 4 | 5 | 6.5 |
The answer from vicuna-33b has the highest helpfulness score but the lowest overall score (tied with mpt-30b-chat).
My question is: should I pick the answer with the highest overall score or the one with the highest helpfulness score as the preferred answer, or should I use the mean of the four principle scores?
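To make the three options concrete, here is a small Python sketch that applies each of them to the scores in the table above. The dictionary layout, the key names and the `top_models` helper are my own assumptions for illustration; they are not the actual field names or structure used in evol_instruct.jsonl.

```python
# Scores copied from the table above. The key names are hypothetical,
# not the real field names in evol_instruct.jsonl.
scores = {
    "gpt-3.5-turbo":    {"helpfulness": 4, "honesty": 5, "instruction_following": 4, "truthfulness": 5, "overall": 7.0},
    "llama-2-70b-chat": {"helpfulness": 4, "honesty": 4, "instruction_following": 5, "truthfulness": 5, "overall": 7.5},
    "mpt-30b-chat":     {"helpfulness": 3, "honesty": 4, "instruction_following": 3, "truthfulness": 5, "overall": 6.5},
    "vicuna-33b":       {"helpfulness": 5, "honesty": 4, "instruction_following": 4, "truthfulness": 5, "overall": 6.5},
}

PRINCIPLES = ["helpfulness", "honesty", "instruction_following", "truthfulness"]


def top_models(all_scores, key_fn):
    """Return every model that reaches the maximum of key_fn (ties are kept)."""
    values = {model: key_fn(s) for model, s in all_scores.items()}
    best = max(values.values())
    return [model for model, value in values.items() if value == best]


def mean_of_principles(s):
    """Unweighted mean of the four fine-grained principle scores."""
    return sum(s[p] for p in PRINCIPLES) / len(PRINCIPLES)


by_overall = top_models(scores, lambda s: s["overall"])
by_helpfulness = top_models(scores, lambda s: s["helpfulness"])
by_mean = top_models(scores, mean_of_principles)

print("highest overall score:    ", by_overall)
print("highest helpfulness score:", by_helpfulness)
print("highest mean of the four: ", by_mean)
```

On this sample the three rules disagree: the overall score selects llama-2-70b-chat, the helpfulness score selects vicuna-33b, and the mean of the four principles is a three-way tie at 4.5 between gpt-3.5-turbo, llama-2-70b-chat and vicuna-33b, so the mean alone would also need a tie-breaking rule.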
Any suggestions would be appreciated, thanks.