The overall score does not match the principle scores (#11)

Hi,
I found that some answers with a higher overall_score have a lower helpfulness_score in the `evol_instruct.jsonl` dataset, whose principle is 100% helpfulness.

For example, the scores of the 9th sample in the `evol_instruct.jsonl` dataset are as follows:

| models           | helpfulness | honesty | instruction following | truthfulness | overall score |
| ---------------- | ---- | ------- | ----------- | ----- | ------------- |
| gpt-3.5-turbo    | 4    | 5       | 4           | 5     | 7             |
| llama-2-70b-chat | 4    | 4       | 5           | 5     | 7.5           |
| mpt-30b-chat     | 3    | 4       | 3           | 5     | 6.5           |
| vicuna-33b       | 5    | 4       | 4           | 5     | 6.5           |

The answer from vicuna-33b has the highest helpfulness score but the lowest overall score (tied with mpt-30b-chat).

My question is: should I pick the answer with the highest overall score or the one with the highest helpfulness score as the preferred answer, or should I use the mean of the four principle scores?
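
To make the three options concrete, here is a minimal sketch that applies each selection rule to the four rows from the table. The dictionary layout and key names (`instruction_following`, `overall`) are my own assumptions for illustration, not the actual schema of `evol_instruct.jsonl`.

```python
from statistics import mean

# Scores copied from the table in this issue. The dict layout and key names
# are illustrative assumptions, not the real schema of evol_instruct.jsonl.
scores = {
    "gpt-3.5-turbo": {"helpfulness": 4, "honesty": 5,
                      "instruction_following": 4, "truthfulness": 5,
                      "overall": 7.0},
    "llama-2-70b-chat": {"helpfulness": 4, "honesty": 4,
                         "instruction_following": 5, "truthfulness": 5,
                         "overall": 7.5},
    "mpt-30b-chat": {"helpfulness": 3, "honesty": 4,
                     "instruction_following": 3, "truthfulness": 5,
                     "overall": 6.5},
    "vicuna-33b": {"helpfulness": 5, "honesty": 4,
                   "instruction_following": 4, "truthfulness": 5,
                   "overall": 6.5},
}

PRINCIPLES = ("helpfulness", "honesty", "instruction_following", "truthfulness")


def top_models(key_fn):
    """Return every model tied for the highest value of key_fn."""
    best = max(key_fn(s) for s in scores.values())
    return [model for model, s in scores.items() if key_fn(s) == best]


# Rule 1: highest overall score.
by_overall = top_models(lambda s: s["overall"])
# Rule 2: highest helpfulness score.
by_helpfulness = top_models(lambda s: s["helpfulness"])
# Rule 3: highest mean over the four principle scores.
by_mean = top_models(lambda s: mean(s[p] for p in PRINCIPLES))

print("overall:", by_overall)
print("helpfulness:", by_helpfulness)
print("mean of principles:", by_mean)
```

On this sample the three rules disagree: the overall score picks llama-2-70b-chat, helpfulness picks vicuna-33b, and the mean of the four principles is a three-way tie between gpt-3.5-turbo, llama-2-70b-chat, and vicuna-33b (4.5 each).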

Any suggestions would be appreciated, thanks.


