Progress for OSPP project 210370190 application #1
Description
Hi, Mr. Tian Jun @findmyway. Sorry for the late response to your latest email. Here I want to report my recent progress and problems.
recent progress
Recently, I implemented Neural Fictitious Self-Play (NFSP) on top of ReinforcementLearning.jl. The main structure is as follows:
- rl_agent:

      Agent(
          policy = QBasedPolicy(
              learner = DQNLearner,
              explorer = EpsilonGreedyExplorer,
          ),
          trajectory = CircularArraySARTTrajectory,
      )

- sl_agent:

      Agent(
          policy = QBasedPolicy(
              learner = AverageLearner,
              explorer = GreedyExplorer,
          ),
          trajectory = CircularArraySARTTrajectory,
      )

where `AverageLearner` is an `AbstractLearner` whose structure I modeled on `DQNLearner`, with the network update step rewritten.
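For readers unfamiliar with NFSP, the sketch below illustrates the kind of supervised update an average-policy learner performs: it fits a policy to recorded (state, action) pairs by minimizing the negative log-likelihood. This is not the `AverageLearner` code from this project; the linear-softmax policy, the names `nll` and `sl_update!`, and the toy data are all assumptions made for illustration, and plain arrays are used instead of a Flux network.

```julia
# Numerically stable softmax over a vector of logits.
function softmax(z::AbstractVector)
    e = exp.(z .- maximum(z))
    return e ./ sum(e)
end

# Mean negative log-likelihood of the recorded actions under the
# linear-softmax policy with weights W (actions × features).
function nll(W, states, actions)
    total = 0.0
    for i in eachindex(actions)
        p = softmax(W * states[:, i])
        total -= log(p[actions[i]])
    end
    return total / length(actions)
end

# One gradient-descent step on the mean NLL. For a single sample the
# gradient with respect to W is (p - onehot(action)) * state'.
function sl_update!(W, states, actions; lr = 0.1)
    grad = zeros(size(W))
    for i in eachindex(actions)
        s = states[:, i]
        p = softmax(W * s)
        p[actions[i]] -= 1.0
        grad .+= p * s'
    end
    W .-= lr .* grad ./ length(actions)
    return W
end

# Hypothetical toy data: 2 features × 3 samples, 2 possible actions.
states = [1.0 0.0 1.0; 0.0 1.0 1.0]
actions = [1, 2, 1]
W = zeros(2, 2)

before = nll(W, states, actions)
for _ in 1:50
    sl_update!(W, states, actions)
end
after = nll(W, states, actions)
```

With a neural network the same loss would be minimized by the optimizer instead of the hand-written gradient; only the loss differs from the TD loss used by `DQNLearner`.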
The code runs on the GPU successfully, although it uses the hardware inefficiently. The results are shown in README.md.
recent problems
- Use `nash_conv` for evaluation:

  In the paper, they use `exploitability` as the evaluation metric, and I found the function `nash_conv.jl` in `./ReinforcementLearningZoo/src/algorithms/cfr/`, whose result is twice the `exploitability` in the kuhn_poker game.

  However, I get the following error when using `nash_conv`:

      julia> RLZoo.nash_conv(my_policy, env)
      ERROR: AssertionError: The copy method doesn't seem to be implemented for the environment: KuhnPokerEnv(Symbol[], Symbol[])

  Also, how to pass my `nfsp` into `nash_conv` is a big problem. Alternatively, if I rewrite `nash_conv` as the sum of `best_response_value(a::NFSPAgent, env)` over all players, how can I get the `best response policy`?
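To make the relation between NashConv and exploitability concrete, here is a minimal sketch on a two-player zero-sum matrix game (matching pennies), not on Kuhn poker and not using RLZoo's `nash_conv`. The names `matrix_nash_conv`, `br_value_p1` and `br_value_p2` are hypothetical. In a matrix game the best response to a fixed opponent strategy is simply the pure action with the highest expected payoff; in an extensive-form game such as Kuhn poker the same quantity would need a tree traversal.

```julia
# Expected payoff to player 1 under mixed strategies x (player 1) and
# y (player 2); A holds player 1's payoffs, player 2 receives the negative.
value(A, x, y) = x' * A * y

# Best-response values: the best pure action against the opponent's
# fixed mixed strategy.
br_value_p1(A, y) = maximum(A * y)
br_value_p2(A, x) = maximum(-(A' * x))

# NashConv: the sum over players of the gain from switching to a best
# response while the other player's strategy stays fixed.
function matrix_nash_conv(A, x, y)
    v = value(A, x, y)
    return (br_value_p1(A, y) - v) + (br_value_p2(A, x) - (-v))
end

# With two players, exploitability is NashConv divided by 2.
matrix_exploitability(A, x, y) = matrix_nash_conv(A, x, y) / 2

A = [1.0 -1.0; -1.0 1.0]   # matching pennies
uniform = [0.5, 0.5]       # the equilibrium strategy
biased = [0.8, 0.2]        # an exploitable strategy for player 1

nc_equilibrium = matrix_nash_conv(A, uniform, uniform)
nc_biased = matrix_nash_conv(A, biased, uniform)
expl_biased = matrix_exploitability(A, biased, uniform)
```

At the equilibrium neither player can gain by deviating, so NashConv is zero; against the biased strategy player 2 gains by always choosing the second action.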
- Low CPU and GPU utilization:

  I guess that frequent data transfer between the CPU and GPU is the main reason. For example, calling `action = policy(env)` runs the following code:

      env |>
      state |>
      x -> Flux.unsqueeze(x, ndims(x) + 1) |>
      x -> send_to_device(device(learner), x) |>
      learner.approximator |>
      send_to_host |>
      vec

  which causes a transfer between the CPU and GPU at every step.

  Maybe I can copy the networks to the CPU when the parameters do not need to be updated?
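The idea of acting from a CPU-side copy can be sketched as follows. This is only a pattern illustration under stated assumptions: plain arrays stand in for both the GPU-side network and its host copy, and `MirroredPolicy`, `sync!`, `act` and `update!` are hypothetical names, not part of ReinforcementLearning.jl or Flux. In a real setup the copy inside `sync!` would be the only device-to-host transfer.

```julia
# Plain arrays stand in for device arrays in this sketch.
mutable struct MirroredPolicy
    train_weights::Matrix{Float64}   # stands in for the GPU-side network
    host_weights::Matrix{Float64}    # CPU-side copy used for acting
    transfers::Int                   # number of device-to-host copies so far
end

MirroredPolicy(W::Matrix{Float64}) = MirroredPolicy(W, copy(W), 0)

# Refresh the host copy; called only after a parameter update.
function sync!(p::MirroredPolicy)
    copyto!(p.host_weights, p.train_weights)
    p.transfers += 1
    return p
end

# Greedy action computed entirely from the host copy, so acting
# triggers no transfer.
act(p::MirroredPolicy, state::Vector{Float64}) = argmax(p.host_weights * state)

# Stand-in for a gradient step on the training weights, followed by a sync.
function update!(p::MirroredPolicy, delta::Matrix{Float64})
    p.train_weights .+= delta
    return sync!(p)
end

policy = MirroredPolicy([1.0 0.0; 0.0 1.0])
state = [0.2, 0.8]

actions_before = [act(policy, state) for _ in 1:100]  # 100 steps, no transfer
update!(policy, [0.0 5.0; 0.0 0.0])                   # one update, one transfer
action_after = act(policy, state)
```

Whether this helps depends on how often the parameters change: if the network is updated at almost every step, the sync cost replaces the per-step transfer rather than removing it.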