|
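110 | 110 | [Analysis]: To obtain the optimal state value function $V$, two levels of optimality are taken: following the optimal policy $\pi^{*}$, and choosing the action that maximizes the state-action value function $Q$, i.e. $\max_{a\in A}$. |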
111 | 111 |
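As a compact restatement of the two levels of optimality described above (a sketch in this note's notation, where $Q^{\pi^{*}}$ is assumed to denote the state-action value function under the optimal policy):

$$
V^{*}(x)=\max_{a\in A}Q^{\pi^{*}}(x,a)
$$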
|
112 | 112 | ## 16.16 |
113 | | - |
114 | 113 | $$ |
115 | | -V^{\pi}(x)\leq V^{\pi{}'}(x) |
| 114 | +V^{\pi}(x)\leqslant V^{\pi{}'}(x) |
116 | 115 | $$ |
117 | | - |
118 | 116 | [Derivation]: |
119 | 117 | $$ |
120 | 118 | \begin{aligned} |
121 | | -V^{\pi}(x)&\leq Q^{\pi}(x,\pi{}'(x))\\ |
122 | | -&=\sum_{x{}'\in X}P_{x\rightarrow x{}'}^{\pi{}'(x)}(R_{x\rightarrow x{}'}^{\pi{}'(x)}+\gamma V^{\pi}(x{}'))\\ |
123 | | -&\leq \sum_{x{}'\in X}P_{x\rightarrow x{}'}^{\pi{}'(x)}(R_{x\rightarrow x{}'}^{\pi{}'(x)}+\gamma Q^{\pi}(x{}',\pi{}'(x{}')))\\ |
124 | | -&=\sum_{x{}'\in X}P_{x\rightarrow x{}'}^{\pi{}'(x)}(R_{x\rightarrow x{}'}^{\pi{}'(x)}+\gamma \sum_{x{}'\in X}P_{x{}'\rightarrow x{}'}^{\pi{}'(x{}')}(R_{x{}'\rightarrow x{}'}^{\pi{}'(x{}')}+\gamma V^{\pi}(x{}')))\\ |
125 | | -&=\sum_{x{}'\in X}P_{x\rightarrow x{}'}^{\pi{}'(x)}(R_{x\rightarrow x{}'}^{\pi{}'(x)}+\gamma V^{\pi{}'}(x{}'))\\ |
126 | | -&=V^{\pi{}'}(x) |
| 119 | +V^{\pi}(x) & \leqslant Q^{\pi}\left(x, \pi^{\prime}(x)\right) \\ |
| 120 | +&=\sum_{x^{\prime} \in X} P_{x \rightarrow x^{\prime}}^{\pi^{\prime}(x)}\left(R_{x \rightarrow x^{\prime}}^{\pi^{\prime}(x)}+\gamma V^{\pi}\left(x^{\prime}\right)\right) \\ |
| 121 | +& \leqslant \sum_{x^{\prime} \in X} P_{x \rightarrow x^{\prime}}^{\pi^{\prime}(x)}\left(R_{x \rightarrow x^{\prime}}^{\pi^{\prime}(x)}+\gamma Q^{\pi}\left(x^{\prime}, \pi^{\prime}\left(x^{\prime}\right)\right)\right) \\ |
| 122 | +&= \sum_{x^{\prime} \in X} P_{x \rightarrow x^{\prime}}^{\pi^{\prime}(x)}\left(R_{x \rightarrow x^{\prime}}^{\pi^{\prime}(x)}+ |
| 123 | +\sum_{x^{\prime \prime} \in X} P_{x^{\prime} \rightarrow x^{\prime \prime}}^{\pi^{\prime}(x^{\prime})}\left(\gamma R_{x^{\prime} \rightarrow x^{\prime \prime}}^{\pi^{\prime}(x^{\prime})}+ |
| 124 | +\gamma^2 V^{\pi}\left(x^{\prime \prime}\right)\right)\right)\\ |
| 125 | +& \leqslant \sum_{x^{\prime} \in X} P_{x \rightarrow x^{\prime}}^{\pi^{\prime}(x)}\left(R_{x \rightarrow x^{\prime}}^{\pi^{\prime}(x)}+ \sum_{x^{\prime \prime} \in X} P_{x^{\prime} \rightarrow x^{\prime \prime}}^{\pi^{\prime}(x^{\prime})} \left( \gamma R_{x^{\prime} \rightarrow x^{\prime \prime}}^{\pi^{\prime}(x^{\prime})} + |
| 126 | +\gamma^2 Q^{\pi}\left(x^{\prime \prime}, \pi^{\prime }\left(x^{\prime \prime}\right)\right)\right)\right) \\ |
| 127 | +&\leqslant \cdots \\ |
| 128 | +&\leqslant \sum_{x^{\prime} \in X} P_{x \rightarrow x^{\prime}}^{\pi^{\prime}(x)}\left(R_{x \rightarrow x^{\prime}}^{\pi^{\prime}(x)}+\sum_{x^{\prime \prime} \in X} P_{x^{\prime} \rightarrow x^{\prime \prime}}^{\pi^{\prime}(x^{\prime})}\left(\gamma R_{x^{\prime} \rightarrow x^{\prime \prime}}^{\pi^{\prime}(x^{\prime})}+\sum_{x^{\prime \prime \prime} \in X} P_{x^{\prime \prime} \rightarrow x^{\prime \prime \prime}}^{\pi^{\prime}(x^{\prime \prime})} \left(\gamma^2 R_{x^{\prime \prime} \rightarrow x^{\prime \prime \prime}}^{\pi^{\prime}(x^{\prime \prime})}+\cdots \right)\right)\right) \\ |
| 129 | +&= V^{\pi'}(x) |
127 | 130 | \end{aligned} |
128 | 131 | $$ |
129 | 132 | where we used the action-change condition |
130 | 133 | $$ |
131 | | -Q^{\pi}(x,\pi{}'(x))\geq V^{\pi}(x) |
| 134 | +Q^{\pi}(x,\pi{}'(x))\geqslant V^{\pi}(x) |
132 | 135 | $$ |
133 | 136 | and the definition of the state-action value function |
134 | 137 | $$ |
135 | 138 | Q^{\pi}(x{}',\pi{}'(x{}'))=\sum_{x{}''\in X}P_{x{}'\rightarrow x{}''}^{\pi{}'(x{}')}(R_{x{}'\rightarrow x{}''}^{\pi{}'(x{}')}+\gamma V^{\pi}(x{}'')) |
136 | 139 | $$ |
137 | 140 | Hence, the optimal value function at the current state is |
138 | | - |
139 | 141 | $$ |
140 | | -V^{\ast}(x)=V^{\pi{}'}(x)\geq V^{\pi}(x) |
| 142 | +V^{\ast}(x)=V^{\pi{}'}(x)\geqslant V^{\pi}(x) |
141 | 143 | $$ |
142 | 144 |
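The inequality $V^{\pi}(x)\leqslant V^{\pi'}(x)$ derived in 16.16 can be checked numerically. The following is a minimal sketch, not part of the original note: it assumes a small randomly generated MDP, and the names `P`, `R`, `evaluate`, and `q_values` are hypothetical helpers introduced only for illustration. Here $\pi'$ is chosen greedily with respect to $Q^{\pi}$, which is one way to satisfy the action-change condition $Q^{\pi}(x,\pi'(x))\geqslant V^{\pi}(x)$.

```python
import numpy as np

def evaluate(P, R, policy, gamma):
    """Exact policy evaluation: solve V = r_pi + gamma * P_pi @ V."""
    n = P.shape[0]
    idx = np.arange(n)
    P_pi = P[idx, policy]                       # (n, n) transitions under the policy
    r_pi = (P_pi * R[idx, policy]).sum(axis=1)  # expected one-step reward per state
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def q_values(P, R, V, gamma):
    """Q[x, a] = sum over x' of P[x, a, x'] * (R[x, a, x'] + gamma * V[x'])."""
    return (P * (R + gamma * V[None, None, :])).sum(axis=2)

# Hypothetical random MDP (assumption: 5 states, 3 actions, gamma = 0.9).
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)               # normalize rows into probabilities
R = rng.normal(size=(n_states, n_actions, n_states))

pi = rng.integers(n_actions, size=n_states)     # arbitrary starting policy
V_pi = evaluate(P, R, pi, gamma)
Q_pi = q_values(P, R, V_pi, gamma)

pi_new = Q_pi.argmax(axis=1)                    # greedy pi' satisfies the condition
V_new = evaluate(P, R, pi_new, gamma)

# Small tolerance absorbs floating-point error in the linear solves.
improved = bool(np.all(V_new >= V_pi - 1e-9))
print("V^pi' >= V^pi in every state:", improved)
```

Exact evaluation via a linear solve is used here instead of iterating the expansion, so the comparison reflects the fully unrolled sum in the last line of the derivation.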
|
143 | | - |
144 | | - |
145 | 145 | ## 16.31 |
146 | 146 |
|
147 | 147 | $$ |
|