Question: how to train ReAct-style multi-turn human-agent dialogue

As the title says: could anyone advise on how to train a model for ReAct-style multi-turn human-agent dialogue?

I want to train an Agent that can hold a continuous conversation. My plan for SFT is to construct multi-turn data like the following:
{
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
        {"role": "user", "content": "And what about Germany?"},
        {"role": "assistant", "content": "The capital of Germany is Berlin."}
    ]
}
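To make the SFT setup concrete, here is a minimal sketch of how I imagine one such sample being split into training targets. It assumes the loss is computed only on assistant turns, with everything before each turn used as context; that is my assumption, not something confirmed for this framework, and `assistant_targets` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

Message = Dict[str, str]

sample = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
        {"role": "user", "content": "And what about Germany?"},
        {"role": "assistant", "content": "The capital of Germany is Berlin."},
    ]
}


def assistant_targets(messages: List[Message]) -> List[Tuple[List[Message], str]]:
    """Return one (context, target) pair per assistant turn.

    Assumption: only assistant content is trained on; system and user
    turns appear in the context but contribute no loss.
    """
    pairs = []
    for i, msg in enumerate(messages):
        if msg["role"] == "assistant":
            pairs.append((messages[:i], msg["content"]))
    return pairs


pairs = assistant_targets(sample["messages"])
for context, target in pairs:
    print(len(context), "context messages ->", target)
```

In practice the same effect is usually obtained with a single tokenized sequence and a loss mask over the non-assistant tokens, rather than by materializing separate pairs.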

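My question is how the RL stage should be done. If another LLM is used to simulate the user, how should the rollout be performed, and how should the reward be computed?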
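To clarify what I mean by rollout, here is a rough sketch of the loop I have in mind. Everything in it is hypothetical: `toy_policy` stands in for the model being trained, `toy_user` stands in for the LLM that simulates the user, the `[END]` stop marker is my own convention, and `toy_reward` is just one possible trajectory-level reward (the fraction of turns answered correctly). It is not this framework's API.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]


def rollout(
    policy: Callable[[List[Message]], str],
    user_sim: Callable[[List[Message]], str],
    reward_fn: Callable[[List[Message]], float],
    first_user_msg: str,
    max_turns: int = 4,
    stop_marker: str = "[END]",
) -> Tuple[List[Message], float]:
    """Alternate policy and simulated-user turns, then score the whole dialogue."""
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": first_user_msg},
    ]
    for _ in range(max_turns):
        # The policy turn is the only part that would receive gradients.
        messages.append({"role": "assistant", "content": policy(messages)})
        # The simulated user is treated as part of the environment.
        user_msg = user_sim(messages)
        if stop_marker in user_msg:
            break
        messages.append({"role": "user", "content": user_msg})
    return messages, reward_fn(messages)


def toy_policy(messages: List[Message]) -> str:
    """Stand-in for the model under training."""
    question = messages[-1]["content"]
    if "France" in question:
        return "The capital of France is Paris."
    if "Germany" in question:
        return "The capital of Germany is Berlin."
    return "I am not sure."


def toy_user(messages: List[Message]) -> str:
    """Stand-in for the user-simulator LLM: one follow-up, then stop."""
    user_turns = sum(1 for m in messages if m["role"] == "user")
    return "And what about Germany?" if user_turns == 1 else "[END]"


def toy_reward(messages: List[Message]) -> float:
    """Trajectory-level reward: fraction of expected answers that were given."""
    expected = ["Paris", "Berlin"]
    replies = [m["content"] for m in messages if m["role"] == "assistant"]
    hits = sum(1 for want, got in zip(expected, replies) if want in got)
    return hits / len(expected)


trajectory, reward = rollout(
    toy_policy, toy_user, toy_reward, "What is the capital of France?"
)
print(len(trajectory), reward)
```

The open points for me are whether a single reward for the whole trajectory is enough or whether each assistant turn needs its own reward, and how the simulated-user tokens should be masked out of the policy loss.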

Question: how to train ReAct-style multi-turn human-agent dialogue #280
