This repository was archived by the owner on Aug 17, 2021. It is now read-only.

Progress for OSPP project 210370190 application #1

@peterchen96

Description


Hi, Mr. Tian Jun @findmyway. Sorry for the late response to your latest email. Here I'd like to report my recent progress and problems.

Recent progress

Recently, I implemented Neural Fictitious Self-Play (NFSP) based on ReinforcementLearning.jl. The main structure is as follows:

  • rl_agent (learns an approximate best response with DQN):
Agent(
    policy = QBasedPolicy(
        learner = DQNLearner,
        explorer = EpsilonGreedyExplorer,
    ),
    trajectory = CircularArraySARTTrajectory,
)
  • sl_agent (learns the average policy by supervised learning):
Agent(
    policy = QBasedPolicy(
        learner = AverageLearner,
        explorer = GreedyExplorer,
    ),
    trajectory = CircularArraySARTTrajectory,
)

where AverageLearner is an AbstractLearner whose structure I imitated from DQNLearner, rewriting its network-updating process.
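
To make the idea concrete, here is a minimal sketch of the kind of learner I mean, assuming the same RLCore conventions that DQNLearner relies on (NeuralNetworkApproximator, state, device, send_to_device / send_to_host, update!); the field names and defaults below are only illustrative, not the final implementation:

using Flux
using ReinforcementLearning

Base.@kwdef mutable struct AverageLearner{A} <: AbstractLearner
    approximator::A                  # e.g. a NeuralNetworkApproximator, possibly on the GPU
    min_reservoir_history::Int = 100 # illustrative: start learning after N samples
end

# Forward pass, same pipeline as DQNLearner: add a batch dimension,
# move to the learner's device, evaluate, and bring the result back.
function (learner::AverageLearner)(env)
    x = state(env)
    x = Flux.unsqueeze(x, ndims(x) + 1)
    x = send_to_device(device(learner.approximator), x)
    learner.approximator(x) |> vec |> send_to_host
end

# The rewritten update step: instead of a TD target, fit the stored
# (state, action) pairs with cross-entropy, i.e. supervised learning
# of the average policy.
function RLBase.update!(learner::AverageLearner, batch::NamedTuple)
    Q = learner.approximator
    s = send_to_device(device(Q), batch.state)
    a = batch.action
    gs = gradient(params(Q)) do
        logits = Q(s)
        Flux.logitcrossentropy(logits, Flux.onehotbatch(a, 1:size(logits, 1)))
    end
    update!(Q, gs)  # assumes the usual RLCore method that applies the optimizer to the gradients
end

The only real difference from DQNLearner is the loss; the trajectory, explorer, and device handling stay the same.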

The code runs successfully on the GPU, although its compute utilization is low. The results are shown in README.md.

Recent problems

  • use nash_conv for evaluation:

    In the paper, exploitability is used as the evaluation metric. I found nash_conv.jl in ./ReinforcementLearningZoo/src/algorithms/cfr/, and nash_conv is twice the exploitability in the kuhn_poker game.

    However, I get the following error when calling nash_conv:

    julia> RLZoo.nash_conv(my_policy, env)
    ERROR: AssertionError: The copy method doesn't seem to be implemented for the environment: KuhnPokerEnv(Symbol[], Symbol[])

    Also, how to pass my NFSP agents into nash_conv is a problem. Alternatively, if I rewrite nash_conv as the sum of best_response_value(a::NFSPAgent, env) over the players, how can I get the best-response policy? (A possible workaround for the copy error is sketched below, after this list.)

  • low utilization of CPU and GPU:

    I guess that frequent data transfer between the CPU and GPU is the main reason. For example, calling action = policy(env) runs the following code:

    env |>
    state |>                                   # get the state on the CPU
    x -> Flux.unsqueeze(x, ndims(x) + 1) |>    # add a batch dimension
    x -> send_to_device(device(learner), x) |> # copy the state to the GPU
    learner.approximator |>                    # forward pass on the GPU
    send_to_host |>                            # copy the result back to the CPU
    vec

    which triggers a CPU-to-GPU and a GPU-to-CPU copy at every environment step.

    Maybe I can keep a copy of the networks on the CPU when there is no need to update parameters? (A rough sketch of this idea also follows the list.)
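
For the nash_conv problem: the error says copy is not implemented for KuhnPokerEnv, and nash_conv needs to copy the environment so it can expand branches without mutating the original. One possible workaround is sketched below; the field names cards and actions are only guesses from the printed value KuhnPokerEnv(Symbol[], Symbol[]) and need to be checked against the actual struct in ReinforcementLearningEnvironments.jl:

# Hedged sketch: give KuhnPokerEnv a copy method so nash_conv can clone it.
# `cards` and `actions` are guessed field names, not confirmed API.
function Base.copy(env::KuhnPokerEnv)
    KuhnPokerEnv(copy(env.cards), copy(env.actions))
end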
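
For the utilization problem, the idea in the last sentence of that item could look roughly like the following: keep a CPU replica of the policy network that is used only for acting, and re-sync it from the GPU copy after parameter updates. All names here (CPUActor, sync!, action_values) are illustrative, not existing API:

using Flux

# A CPU replica used only for action selection, so that policy(env) does not
# move data to the GPU and back at every environment step.
mutable struct CPUActor{M}
    model::M   # CPU copy of the policy network
end

# Re-sync the replica after (a batch of) parameter updates on the GPU model.
function sync!(actor::CPUActor, gpu_model)
    Flux.loadparams!(actor.model, Flux.params(Flux.cpu(gpu_model)))
    actor
end

# Acting stays on the CPU: add a batch dimension, forward, flatten.
function action_values(actor::CPUActor, s)
    x = Flux.unsqueeze(s, ndims(s) + 1)
    vec(actor.model(x))
end

It would be constructed once, e.g. CPUActor(Flux.cpu(gpu_model)), and re-synced only every few updates; whether this actually improves throughput still needs profiling.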
