0x00 Preface
本文着眼于深度学习中的收敛性分析,作为第三部分,主要介绍了深度学习中随机梯度下降的收敛性证明1。相较于上一篇 (Part2) 的梯度下降,随机梯度下降的收敛性证明更加复杂,因为随机梯度下降的梯度是随机的,因此需要引入一些随机变量和假设,一些关于随机梯度下降的定义已经在 Part1 中进行了说明。
copy 一下先前的定义在下面:
Gradient Unbiased Estimate Assumption
$$
\mathbb{E}[g(w;\xi)] = \nabla f(w)
$$
Gradient Bounded Variance Assumption
$$
\mathrm {Var}(\nabla f(w;\xi)) \leq \sigma^2
$$
和上面的计算可以得到
$$
\mathbb{E}_{\xi}[||g(w; \xi)||^2] \leq ||\nabla f(w)||^{2} + \frac{\sigma^2}{b}
$$
0x01 Smooth and Strongly Convex with mini-batch SGD
defination
假设目标函数 $f$ 是 $R^{d}$ 上的$\mu$-强凸函数,并且 $L$-光滑,同时满足梯度无偏估计和梯度方差有界假设,当步长 $\eta \leq \frac{1}{L}$ 且起始点为 $w_0$ 时,经过 $t$ 步 mini-batch SGD 迭代,$\mathbb{E}[f(w_{t})]$ 被 bounded 如下:$$ \mathbb{E}[f(w_{t})] - f(w^{*}) - \frac{\eta L \sigma^{2}}{2\mu b} \leq (1- \eta \mu)^{t}\left (\mathbb{E}[f(w_{0})] - f(w^{*}) - \frac{\eta L \sigma^{2}}{2\mu b} \right ) $$
Proof:
由光滑性:
$$
\begin{align}
f(w_{t+1}) - f(w_{t})&\leq \nabla f(w_{t})^{\top}(w_{t+1} - w_{t}) + \frac{L}{2}||w_{t+1} - w_{t}||^{2} \\
&\leq -\eta \nabla f(w_{t})^{\top}g(w_{t}; \xi_{t}) + \frac{L}{2}||-\eta g(w_{t}; \xi_{t})||^{2} \\
&\leq -\eta \nabla f(w_{t})^{\top}g(w_{t}; \xi_{t}) + \frac{L\eta^{2}}{2} ||g(w_{t}; \xi_{t})||^{2} \\
\end{align}
$$
对左右求期望得:
$$
\begin{align}
\mathbb{E}[f(w_{t+1}) - f(w_{t})] &\leq -\eta \mathbb{E}[\nabla f(w_{t})^{\top}g(w_{t}; \xi_{t})] + \frac{L\eta^{2}}{2} \mathbb{E}[||g(w_{t}; \xi_{t})||^{2}] \\
\mathbb{E}[f(w_{t+1})] - f(w_{t}) &\leq -\eta ||\nabla f(w_{t})||^{2} + \frac{L\eta^{2}}{2}||\nabla f(w_{t})||^{2} + \frac{L\eta^{2} \sigma^{2}}{2b} \\
&\leq (\eta-\frac{L\eta^{2}}{2}) (-||\nabla f(w_{t})||^{2}) + \frac{L\eta^{2} \sigma^{2}}{2b}
\end{align}
$$
这里和 GD 就比较像了,取 $\eta<\frac{1}{L}$,在利用强凸性则有:
$$
\begin{align}
\mathbb{E}[f(w_{t+1})] - f(w_{t}) &\leq \frac{\eta}{2}(-||\nabla f(w_{t})||^{2}) + \frac{L\eta^{2} \sigma^{2}}{2b} \\
&\leq \frac{\eta}{2}(-2\mu(f(w_{t})-f^{*})) + \frac{L\eta^{2} \sigma^{2}}{2b} \\
&\leq -\eta\mu(\mathbb{E}[f(w_{t})]-f^{*}) + \frac{L\eta^{2} \sigma^{2}}{2b} \\
\end{align}
$$
左右都 $+f^{}-f^{}$,得到
$$
\mathbb{E}[f(w_{t+1})] - f^{*} \leq (1-\eta\mu)(\mathbb{E}[f(w_{t})]-f^{*}) + \frac{L\eta^{2} \sigma^{2}}{2b}
$$
接下来就需要配方了,我们还是假设一个未知数 $x$,使得
$$
\mathbb{E}[f(w_{t+1})] - f^{*} + x \leq (1-\eta\mu)(\mathbb{E}[f(w_{t})]-f^{*} + x)
$$
也就是
$$
(1-\eta\mu)x - x = \frac{L\eta^{2} \sigma^{2}}{2b} \\
x = - \frac{L\eta \sigma^{2}}{2\mu b}
$$
即
$$
\mathbb{E}[f(w_{t+1})] - f^{*} - \frac{L\eta \sigma^{2}}{2\mu b} \leq (1-\eta\mu)(\mathbb{E}[f(w_{t})]-f^{*} - \frac{L\eta \sigma^{2}}{2\mu b})
$$
得到
$$
\mathbb{E}[f(w_{t})] - f(w^{*}) - \frac{\eta L \sigma^{2}}{2\mu b} \leq (1- \eta \mu)^{t}\left (\mathbb{E}[f(w_{0})] - f(w^{*}) - \frac{\eta L \sigma^{2}}{2\mu b} \right )
$$
证毕
0x02 Non-convex with mini-batch SGD
defination
假设目标函数 $f$ 是 $R^{d}$ 上的 $L$-光滑函数,并且满足梯度无偏估计和梯度方差有界假设,当步长 $\eta \leq \frac{1}{L}$ 且起始点为 $w_0$ 时,经过 $t$ 步 mini-batch SGD 迭代,$f$ 梯度平方的平均期望被 bounded 如下:$$ \mathbb{E}[\frac{1}{t}\sum_{i=1}^{t} ||\nabla f(w_{i})||^{2}] \leq \frac{L \sigma^{2}}{b} + \frac{2(f(w_{0})-f(w_{\inf}))}{t\eta} $$
Proof:
直接从上面的 $L$-光滑性得到的结论,和 $\eta < \frac{1}{L}$ 开始
$$
\mathbb{E}[f(w_{t+1})] - f(w_{t}) \leq -\eta ||\nabla f(w_{t})||^{2} + \frac{L\eta^{2} \sigma^{2}}{2b}
$$
对两边都求和,在除以 $t$,得到
$$
\begin{align}
\frac{1}{t}\sum_{i=1}^{t} \mathbb{E}[f(w_{i+1})] - f(w_{i}) &\leq -\frac{\eta }{2t}\sum_{i=1}^{t} ||\nabla f(w_{i})||^{2} + \frac{L\eta^{2} \sigma^{2}}{2b} \\
\frac{1}{t} \sum_{i=1}^{t} ||\nabla f(w_{i})||^{2} &\leq -\frac{2\mathbb{E}[f(w_{t+1})-f(w_{0})]}{\eta t} + \frac{L\eta \sigma^{2}}{2b} \\
&\leq \frac{2(f(w_{0})-f(w_{\inf}))}{\eta t} + \frac{L\eta \sigma^{2}}{2b}
\end{align}
$$
证毕
Reference
《Optimization Algorithms for Distributed Machine Learning》,Gauri Joshi ↩︎