When I reviewed the implementation of the Adam optimizer in TensorFlow yesterday, I noticed that its code differs from the formulas I saw in the Adam paper. TensorFlow’s formulas for Adam are:

But the algorithm in the paper is:

I then quickly found this note in the documentation for tf.AdamOptimizer:

Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the “epsilon” referred to here is “epsilon hat” in the paper.

And this time I did find this ‘Algo 2’ in the paper:

But how does ‘Algo 1’ transform into ‘Algo 2’? Let me try to derive it, starting from ‘Algo 1’:

$latex \theta_t \gets \theta_{t-1} - \frac{\alpha \cdot \hat{m_t}}{(\sqrt{\hat{v_t}} + \epsilon)} &s=4 $

$latex \implies \theta_t \gets \theta_{t-1} - \alpha \cdot \frac{m_t}{1 - \beta_1^t} \cdot \frac{1}{(\sqrt{\hat{v_t}} + \epsilon)} \quad \text{(put } \hat{m_t} \text{ in)} &s=4 $

$latex \implies \theta_t \gets \theta_{t-1} - \alpha \cdot \frac{m_t}{1 - \beta_1^t} \cdot \frac{\sqrt{1-\beta_2^t}}{\sqrt{v_t}} \quad \text{(put } \hat{v_t} \text{ in and ignore } \epsilon \text{)} &s=4 $

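$latex \implies \theta_t \gets \theta_{t-1} - \alpha_t \cdot \frac{m_t}{\sqrt{v_t} + \hat{\epsilon}} \quad \text{(add new } \hat{\epsilon} \text{ to avoid zero-divide)} &s=4 $

Here $latex \alpha_t = \alpha \cdot \frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t} $ folds both bias-correction factors into the step size.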
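To convince myself the derivation holds, here is a small numerical sketch in plain Python. The function names `adam_step_algo1` and `adam_step_algo2` and all the input values are my own illustrative choices, not anything taken from TensorFlow or the paper. It compares one update step under both formulations. One extra observation, which follows from keeping $latex \epsilon $ through the algebra instead of ignoring it: the two forms should agree exactly (up to floating-point rounding) if $latex \hat{\epsilon} = \epsilon \cdot \sqrt{1-\beta_2^t} $, and only approximately if the same numeric value is used for both.

```python
import math


def adam_step_algo1(theta, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One parameter update in the 'Algo 1' form: bias-correct m and v, then step."""
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - alpha * m_hat / (math.sqrt(v_hat) + eps)


def adam_step_algo2(theta, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps_hat=1e-8):
    """One parameter update in the 'Algo 2' form: bias corrections folded into alpha_t."""
    alpha_t = alpha * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    return theta - alpha_t * m / (math.sqrt(v) + eps_hat)


# Arbitrary illustrative values (assumptions, not from the paper).
theta, m, v, t = 0.5, 0.02, 0.0004, 10
beta2 = 0.999
eps = 1e-8

a1 = adam_step_algo1(theta, m, v, t, eps=eps)

# Same numeric epsilon in both forms: close, but not identical.
a2_same_eps = adam_step_algo2(theta, m, v, t, eps_hat=eps)

# Rescaled epsilon hat: matches 'Algo 1' up to rounding.
a2_scaled = adam_step_algo2(theta, m, v, t, eps_hat=eps * math.sqrt(1 - beta2 ** t))

print("difference with same epsilon:    ", abs(a1 - a2_same_eps))
print("difference with rescaled epsilon:", abs(a1 - a2_scaled))
```

In practice the gap between the two is tiny whenever $latex \epsilon $ is small relative to $latex \sqrt{v_t} $, which is presumably why the documentation only bothers to note that its “epsilon” is the paper’s “epsilon hat”.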