0x00 Preface

This series looks at convergence analysis in deep learning. As Part 3, this post presents the convergence proofs for stochastic gradient descent (SGD)[1]. Compared with the gradient descent analysis in the previous post (Part2), the SGD proofs are more involved because the stochastic gradient is random, so additional random variables and assumptions are needed; the relevant definitions for SGD were already introduced in Part1.

The earlier definitions are restated below:

Gradient Unbiased Estimate Assumption

$$ \mathbb{E}[g(w;\xi)] = \nabla f(w) $$

Gradient Bounded Variance Assumption

$$ \mathrm {Var}(\nabla f(w;\xi)) \leq \sigma^2 $$

Combining these two assumptions with mini-batch averaging over $b$ samples gives

$$ \mathbb{E}_{\xi}[||g(w; \xi)||^2] \leq ||\nabla f(w)||^{2} + \frac{\sigma^2}{b} $$
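
For completeness, this is just the mean–variance decomposition of the mini-batch gradient, under the assumption (as set up in Part1) that $g(w;\xi)$ averages $b$ i.i.d. per-sample gradients $\nabla f(w;\xi)$, each with variance at most $\sigma^{2}$, so that $\mathrm{Var}(g(w;\xi)) \leq \frac{\sigma^{2}}{b}$:

$$ \mathbb{E}_{\xi}[||g(w; \xi)||^{2}] = ||\mathbb{E}_{\xi}[g(w; \xi)]||^{2} + \mathrm{Var}(g(w; \xi)) = ||\nabla f(w)||^{2} + \mathrm{Var}(g(w; \xi)) \leq ||\nabla f(w)||^{2} + \frac{\sigma^{2}}{b} $$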

0x01 Smooth and Strongly Convex with mini-batch SGD

Theorem

Assume the objective $f$ is $\mu$-strongly convex and $L$-smooth on $\mathbb{R}^{d}$, and that the gradient unbiased-estimate and bounded-variance assumptions hold. With step size $\eta \leq \frac{1}{L}$ and starting point $w_{0}$, after $t$ mini-batch SGD iterations, $\mathbb{E}[f(w_{t})]$ is bounded as follows (with $w^{*}$ the minimizer of $f$):

$$ \mathbb{E}[f(w_{t})] - f(w^{*}) - \frac{\eta L \sigma^{2}}{2\mu b} \leq (1- \eta \mu)^{t}\left (\mathbb{E}[f(w_{0})] - f(w^{*}) - \frac{\eta L \sigma^{2}}{2\mu b} \right ) $$

Proof:
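
Throughout the proof the update is the mini-batch SGD step defined in Part1, restated here in the form consistent with the $\frac{\sigma^{2}}{b}$ variance bound above:

$$ w_{t+1} = w_{t} - \eta\, g(w_{t}; \xi_{t}), \qquad g(w_{t}; \xi_{t}) = \frac{1}{b}\sum_{j=1}^{b} \nabla f(w_{t}; \xi_{t,j}) $$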

By $L$-smoothness, substituting the update $w_{t+1} - w_{t} = -\eta\, g(w_{t}; \xi_{t})$:

$$ \begin{align} f(w_{t+1}) - f(w_{t})&\leq \nabla f(w_{t})^{\top}(w_{t+1} - w_{t}) + \frac{L}{2}||w_{t+1} - w_{t}||^{2} \\ &= -\eta \nabla f(w_{t})^{\top}g(w_{t}; \xi_{t}) + \frac{L}{2}||-\eta g(w_{t}; \xi_{t})||^{2} \\ &= -\eta \nabla f(w_{t})^{\top}g(w_{t}; \xi_{t}) + \frac{L\eta^{2}}{2} ||g(w_{t}; \xi_{t})||^{2} \end{align} $$

Taking expectation with respect to $\xi_{t}$ (conditioned on $w_{t}$) on both sides:

$$ \begin{align} \mathbb{E}_{\xi_{t}}[f(w_{t+1})] - f(w_{t}) &\leq -\eta\, \mathbb{E}_{\xi_{t}}[\nabla f(w_{t})^{\top}g(w_{t}; \xi_{t})] + \frac{L\eta^{2}}{2}\, \mathbb{E}_{\xi_{t}}[||g(w_{t}; \xi_{t})||^{2}] \\ &\leq -\eta ||\nabla f(w_{t})||^{2} + \frac{L\eta^{2}}{2}||\nabla f(w_{t})||^{2} + \frac{L\eta^{2} \sigma^{2}}{2b} \\ &= -\left(\eta-\frac{L\eta^{2}}{2}\right) ||\nabla f(w_{t})||^{2} + \frac{L\eta^{2} \sigma^{2}}{2b} \end{align} $$
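
To spell out the second inequality: unbiasedness handles the cross term (the expectation is conditional on $w_{t}$), and the second-moment bound from the preface handles the quadratic term:

$$ \mathbb{E}_{\xi_{t}}[\nabla f(w_{t})^{\top}g(w_{t}; \xi_{t})] = \nabla f(w_{t})^{\top}\,\mathbb{E}_{\xi_{t}}[g(w_{t}; \xi_{t})] = ||\nabla f(w_{t})||^{2}, \qquad \mathbb{E}_{\xi_{t}}[||g(w_{t}; \xi_{t})||^{2}] \leq ||\nabla f(w_{t})||^{2} + \frac{\sigma^{2}}{b} $$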

From here the argument parallels the GD case: with $\eta \leq \frac{1}{L}$ we have $\eta - \frac{L\eta^{2}}{2} \geq \frac{\eta}{2}$, and then applying strong convexity gives:

$$ \begin{align} \mathbb{E}_{\xi_{t}}[f(w_{t+1})] - f(w_{t}) &\leq -\frac{\eta}{2}||\nabla f(w_{t})||^{2} + \frac{L\eta^{2} \sigma^{2}}{2b} \\ &\leq -\frac{\eta}{2}\cdot 2\mu\,(f(w_{t})-f^{*}) + \frac{L\eta^{2} \sigma^{2}}{2b} \\ &= -\eta\mu\,(f(w_{t})-f^{*}) + \frac{L\eta^{2} \sigma^{2}}{2b} \end{align} $$
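
The second step uses the Polyak–Łojasiewicz inequality $||\nabla f(w)||^{2} \geq 2\mu(f(w)-f^{*})$, a standard consequence of $\mu$-strong convexity: minimizing both sides of the lower bound $f(v) \geq f(w) + \nabla f(w)^{\top}(v-w) + \frac{\mu}{2}||v-w||^{2}$ over $v$ gives

$$ f^{*} \geq f(w) - \frac{1}{2\mu}||\nabla f(w)||^{2} \quad\Longleftrightarrow\quad ||\nabla f(w)||^{2} \geq 2\mu\,(f(w) - f^{*}) $$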

Now take the total expectation over all sampled mini-batches and add $\mathbb{E}[f(w_{t})] - f^{*}$ to both sides, which gives

$$ \mathbb{E}[f(w_{t+1})] - f^{*} \leq (1-\eta\mu)(\mathbb{E}[f(w_{t})]-f^{*}) + \frac{L\eta^{2} \sigma^{2}}{2b} $$

Next we turn this into a clean geometric recursion: as before, introduce an unknown constant $x$ such that

$$ \mathbb{E}[f(w_{t+1})] - f^{*} + x \leq (1-\eta\mu)(\mathbb{E}[f(w_{t})]-f^{*} + x) $$

that is,

$$ \begin{align} (1-\eta\mu)x - x &= \frac{L\eta^{2} \sigma^{2}}{2b} \\ x &= - \frac{L\eta \sigma^{2}}{2\mu b} \end{align} $$

Substituting this $x$ back into the recursion gives

$$ \mathbb{E}[f(w_{t+1})] - f^{*} - \frac{L\eta \sigma^{2}}{2\mu b} \leq (1-\eta\mu)(\mathbb{E}[f(w_{t})]-f^{*} - \frac{L\eta \sigma^{2}}{2\mu b}) $$

Unrolling this recursion for $t$ steps from the starting point $w_{0}$ yields

$$ \mathbb{E}[f(w_{t})] - f(w^{*}) - \frac{\eta L \sigma^{2}}{2\mu b} \leq (1- \eta \mu)^{t}\left (\mathbb{E}[f(w_{0})] - f(w^{*}) - \frac{\eta L \sigma^{2}}{2\mu b} \right ) $$

This completes the proof.
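
As a quick numerical sanity check (not part of the proof), here is a minimal NumPy sketch of mini-batch SGD on a toy least-squares problem, which is $L$-smooth and $\mu$-strongly convex. The dimensions, batch size, and number of steps are arbitrary illustrative choices, and the per-sample variance $\sigma^{2}$ is estimated empirically at $w^{*}$ as a stand-in for the assumed uniform bound; the final optimality gap should settle around or below the noise floor $\frac{\eta L \sigma^{2}}{2\mu b}$ from the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite-sum least-squares: f(w) = (1/n) * sum_i 0.5 * (a_i^T w - y_i)^2
n, d, b = 1000, 10, 16               # samples, dimension, mini-batch size (illustrative)
A = rng.normal(size=(n, d))
y = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

H = A.T @ A / n                      # Hessian of f
L = np.linalg.eigvalsh(H).max()      # smoothness constant
mu = np.linalg.eigvalsh(H).min()     # strong-convexity constant
w_star = np.linalg.solve(H, A.T @ y / n)

def f(w):
    return 0.5 * np.mean((A @ w - y) ** 2)

def minibatch_grad(w):
    idx = rng.integers(0, n, size=b)             # sample b indices uniformly
    return A[idx].T @ (A[idx] @ w - y[idx]) / b  # unbiased estimate of grad f(w)

eta = 1.0 / L
w = np.zeros(d)
for _ in range(5000):
    w -= eta * minibatch_grad(w)

# Empirical per-sample gradient variance at w_star, used as a stand-in for sigma^2
G = A * (A @ w_star - y)[:, None]                # row i = per-sample gradient at w_star
sigma2 = np.mean(np.sum((G - G.mean(axis=0)) ** 2, axis=1))

print(f"final gap  : {f(w) - f(w_star):.3e}")
print(f"noise floor: {eta * L * sigma2 / (2 * mu * b):.3e}")
```

Shrinking $\eta$ or increasing $b$ lowers the observed plateau, matching the $\frac{\eta L \sigma^{2}}{2\mu b}$ term in the bound.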

0x02 Non-convex with mini-batch SGD

Theorem

Assume the objective $f$ is $L$-smooth on $\mathbb{R}^{d}$ and that the gradient unbiased-estimate and bounded-variance assumptions hold. With step size $\eta \leq \frac{1}{L}$ and starting point $w_{0}$, after $t$ mini-batch SGD iterations the average expected squared gradient norm of $f$ is bounded as follows, where $f_{\inf} := \inf_{w} f(w)$:

$$ \mathbb{E}\left[\frac{1}{t}\sum_{i=0}^{t-1} ||\nabla f(w_{i})||^{2}\right] \leq \frac{L \eta \sigma^{2}}{b} + \frac{2(f(w_{0})-f_{\inf})}{t\eta} $$

Proof:

Start directly from the bound obtained above via $L$-smoothness and $\eta \leq \frac{1}{L}$ (so that $\eta - \frac{L\eta^{2}}{2} \geq \frac{\eta}{2}$):

$$ \mathbb{E}_{\xi_{t}}[f(w_{t+1})] - f(w_{t}) \leq -\frac{\eta}{2} ||\nabla f(w_{t})||^{2} + \frac{L\eta^{2} \sigma^{2}}{2b} $$

Taking the total expectation, summing both sides over $i = 0, \dots, t-1$ (the left-hand side telescopes), dividing by $t$, and rearranging gives

$$ \begin{align} \frac{1}{t}\sum_{i=0}^{t-1} \left(\mathbb{E}[f(w_{i+1})] - \mathbb{E}[f(w_{i})]\right) &\leq -\frac{\eta}{2t}\sum_{i=0}^{t-1} \mathbb{E}[||\nabla f(w_{i})||^{2}] + \frac{L\eta^{2} \sigma^{2}}{2b} \\ \frac{1}{t} \sum_{i=0}^{t-1} \mathbb{E}[||\nabla f(w_{i})||^{2}] &\leq -\frac{2\,\mathbb{E}[f(w_{t})-f(w_{0})]}{\eta t} + \frac{L\eta \sigma^{2}}{b} \\ &\leq \frac{2(f(w_{0})-f_{\inf})}{\eta t} + \frac{L\eta \sigma^{2}}{b} \end{align} $$

This completes the proof.
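
Analogously, here is a minimal NumPy sketch (again purely illustrative, with arbitrary problem and hyperparameter choices) that runs mini-batch SGD on a small smooth non-convex objective and compares the running average of $||\nabla f(w_{i})||^{2}$ over a single trajectory against the bound above. A single run only approximates the expectation, and the constants $L$, $\sigma^{2}$, $f_{\inf}$ used below are conservative hand-derived bounds for this toy objective.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy smooth non-convex finite-sum objective: f(w) = (1/n) * sum_i sin(a_i^T w).
# It is L-smooth with L <= max_i ||a_i||^2 and bounded below by f_inf = -1.
n, d, b, T = 500, 5, 8, 5000
A = rng.normal(size=(n, d))

row_norms_sq = np.sum(A ** 2, axis=1)
L = row_norms_sq.max()        # conservative smoothness constant
sigma2 = row_norms_sq.max()   # conservative per-sample gradient variance bound
f_inf = -1.0                  # global lower bound on f

def f(w):
    return np.mean(np.sin(A @ w))

def full_grad(w):
    return A.T @ np.cos(A @ w) / n

def minibatch_grad(w):
    idx = rng.integers(0, n, size=b)
    return A[idx].T @ np.cos(A[idx] @ w) / b

eta = 1.0 / L
w = np.zeros(d)
grad_sq = []
for _ in range(T):
    grad_sq.append(np.sum(full_grad(w) ** 2))   # ||grad f(w_i)||^2 along the path
    w -= eta * minibatch_grad(w)

bound = 2 * (f(np.zeros(d)) - f_inf) / (T * eta) + L * eta * sigma2 / b
print(f"average ||grad f||^2 over the run: {np.mean(grad_sq):.3e}")
print(f"theorem bound                    : {bound:.3e}")
```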

References


  1. Gauri Joshi, *Optimization Algorithms for Distributed Machine Learning*.