This repository was archived by the owner on Aug 17, 2021. It is now read-only.

Progress for OSPP project 210370190 application #1

@peterchen96

Description


Hi, Mr. Tian Jun @findmyway. Sorry for the late response to your latest email. Here I'd like to report my recent progress and problems.

Recent progress

Recently, I implemented Neural Fictitious Self-Play (NFSP) based on ReinforcementLearning.jl. The main structure is as follows:

  • rl_agent (learns an approximate best response with DQN):
Agent(
    policy = QBasedPolicy(
        learner = DQNLearner,
        explorer = EpsilonGreedyExplorer,
    ),
    trajectory = CircularArraySARTTrajectory,
)
  • sl_agent (learns the average policy by supervised learning):
Agent(
    policy = QBasedPolicy(
        learner = AverageLearner,
        explorer = GreedyExplorer,
    ),
    trajectory = CircularArraySARTTrajectory,
)

where AverageLearner is an AbstractLearner whose structure I imitated from DQNLearner, rewriting its network-updating process.
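
To make the idea concrete, here is a minimal sketch of the kind of learner I mean, assuming the same RLCore conventions that DQNLearner relies on (NeuralNetworkApproximator, state, device, send_to_device / send_to_host, update!); the field names and defaults below are only illustrative, not the final implementation:

using Flux
using ReinforcementLearning

Base.@kwdef mutable struct AverageLearner{A} <: AbstractLearner
    approximator::A                  # e.g. a NeuralNetworkApproximator, possibly on the GPU
    min_reservoir_history::Int = 100 # illustrative: start learning after N samples
end

# Forward pass, same pipeline as DQNLearner: add a batch dimension,
# move to the learner's device, evaluate, and bring the result back.
function (learner::AverageLearner)(env)
    x = state(env)
    x = Flux.unsqueeze(x, ndims(x) + 1)
    x = send_to_device(device(learner.approximator), x)
    learner.approximator(x) |> vec |> send_to_host
end

# The rewritten update step: instead of a TD target, fit the stored
# (state, action) pairs with cross-entropy, i.e. supervised learning
# of the average policy.
function RLBase.update!(learner::AverageLearner, batch::NamedTuple)
    Q = learner.approximator
    s = send_to_device(device(Q), batch.state)
    a = batch.action
    gs = gradient(params(Q)) do
        logits = Q(s)
        Flux.logitcrossentropy(logits, Flux.onehotbatch(a, 1:size(logits, 1)))
    end
    update!(Q, gs)  # assumes the usual RLCore method that applies the optimizer to the gradients
end

The only real difference from DQNLearner is the loss; the trajectory, explorer, and device handling stay the same.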

The code runs successfully on the GPU, although its compute utilization is low. The results are shown in README.md.

Recent problems

  • use nash_conv for evaluation:

    In the paper, exploitability is used as the evaluation metric. I found nash_conv.jl in ./ReinforcementLearningZoo/src/algorithms/cfr/, and nash_conv is twice the exploitability in the kuhn_poker game.

    However, I get the following error when calling nash_conv:

    julia> RLZoo.nash_conv(my_policy, env)
    ERROR: AssertionError: The copy method doesn't seem to be implemented for the environment: KuhnPokerEnv(Symbol[], Symbol[])

    Also, how to pass my NFSP agents into nash_conv is a problem. Alternatively, if I rewrite nash_conv as the sum of best_response_value(a::NFSPAgent, env) over the players, how can I get the best-response policy? (A possible workaround for the copy error is sketched below, after this list.)

  • low utilization of CPU and GPU:

    I guess that frequent data transfer between the CPU and GPU is the main reason. For example, calling action = policy(env) runs the following code:

    env |>
    state |>                                   # get the state on the CPU
    x -> Flux.unsqueeze(x, ndims(x) + 1) |>    # add a batch dimension
    x -> send_to_device(device(learner), x) |> # copy the state to the GPU
    learner.approximator |>                    # forward pass on the GPU
    send_to_host |>                            # copy the result back to the CPU
    vec

    which triggers a CPU-to-GPU and a GPU-to-CPU copy at every environment step.

    Maybe I can keep a copy of the networks on the CPU when there is no need to update parameters? (A rough sketch of this idea also follows the list.)
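
For the nash_conv problem: the error says copy is not implemented for KuhnPokerEnv, and nash_conv needs to copy the environment so it can expand branches without mutating the original. One possible workaround is sketched below; the field names cards and actions are only guesses from the printed value KuhnPokerEnv(Symbol[], Symbol[]) and need to be checked against the actual struct in ReinforcementLearningEnvironments.jl:

# Hedged sketch: give KuhnPokerEnv a copy method so nash_conv can clone it.
# `cards` and `actions` are guessed field names, not confirmed API.
function Base.copy(env::KuhnPokerEnv)
    KuhnPokerEnv(copy(env.cards), copy(env.actions))
end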
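
For the utilization problem, the idea in the last sentence of that item could look roughly like the following: keep a CPU replica of the policy network that is used only for acting, and re-sync it from the GPU copy after parameter updates. All names here (CPUActor, sync!, action_values) are illustrative, not existing API:

using Flux

# A CPU replica used only for action selection, so that policy(env) does not
# move data to the GPU and back at every environment step.
mutable struct CPUActor{M}
    model::M   # CPU copy of the policy network
end

# Re-sync the replica after (a batch of) parameter updates on the GPU model.
function sync!(actor::CPUActor, gpu_model)
    Flux.loadparams!(actor.model, Flux.params(Flux.cpu(gpu_model)))
    actor
end

# Acting stays on the CPU: add a batch dimension, forward, flatten.
function action_values(actor::CPUActor, s)
    x = Flux.unsqueeze(s, ndims(s) + 1)
    vec(actor.model(x))
end

It would be constructed once, e.g. CPUActor(Flux.cpu(gpu_model)), and re-synced only every few updates; whether this actually improves throughput still needs profiling.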
