
According to the authors of this paper, to improve performance, they decided to drop the backward pass and use a first-order approximation.

I found a blog that discusses how to derive the math, but I got stuck along the way (please refer to the embedded image below):

  1. Why $\nabla_{\theta} \theta_{0}$ disappeared in the next line.
  2. How come $\nabla_{\theta_{i-1}} \theta_{i-1} = \mathbf{I}$ (which is an identity matrix)?

[Image from the blog: FOMAML derivation]

Update: I also found another math solution for this. To me it looks less intuitive, but there is no confusion with the disappearance of $\nabla_{\theta} \theta_{0}$ as in the first solution: first order MAML
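To make the difference concrete for myself, here is a tiny toy sketch (PyTorch, one scalar parameter, made-up quadratic losses; not the paper's code) comparing the full MAML meta-gradient with the first-order one that treats $\nabla_{\theta} \theta_{1}$ as the identity:

```python
import torch

# Toy sketch (my own, not the paper's code): one scalar parameter, one inner
# SGD step, made-up quadratic losses standing in for the train/val losses.
alpha = 0.1                                    # inner-loop learning rate
theta = torch.tensor(1.0, requires_grad=True)  # initial (meta) parameters

inner_loss = lambda w: (w - 2.0) ** 2          # "training" loss, arbitrary
outer_loss = lambda w: (w - 3.0) ** 2          # "validation" loss, arbitrary

# Full MAML: keep the graph through the inner update, so the meta-gradient
# includes the second-order factor d(theta_1)/d(theta) = 1 - alpha * L''.
g_in = torch.autograd.grad(inner_loss(theta), theta, create_graph=True)[0]
theta_1 = theta - alpha * g_in
maml_grad = torch.autograd.grad(outer_loss(theta_1), theta)[0]

# FOMAML: detach the adapted parameters, i.e. treat d(theta_1)/d(theta) as I,
# so the meta-gradient is just the outer-loss gradient evaluated at theta_1.
theta_1_fo = (theta.detach() - alpha * g_in.detach()).requires_grad_(True)
fomaml_grad = torch.autograd.grad(outer_loss(theta_1_fo), theta_1_fo)[0]

print(maml_grad.item(), fomaml_grad.item())    # ≈ -2.88 vs ≈ -3.6 here
```

In this toy case the full meta-gradient picks up the extra factor $1 - 2\alpha$ from the second derivative of the inner loss, while FOMAML simply returns the outer-loss gradient at the adapted parameters.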

  • Regarding your 1st question, what is $\theta$? In all other cases, $\theta$ has a subscript. – nbro Mar 09 '20 at 15:22
  • $\theta$ was in Algorithm 1, but the blog uses both $\theta$ and $\theta_{0}$. Maybe they both mean the "initial model parameters" and the notation is just not consistent. I'm working on understanding it too. – Long Mar 10 '20 at 09:40
  • @Long What is the rationale behind ignoring the 2nd-order derivative in FOMAML and regarding it as an identity matrix? – S.EB Dec 02 '22 at 03:50
  • @S.EB It is common in research to ignore higher-order derivatives to reduce computational overhead and simplify the implementation. See [this](https://lilianweng.github.io/posts/2018-11-30-meta-learning/): `The meta-optimization step above relies on second derivatives. To make the computation less expensive, a modified version of MAML omits second derivatives, resulting in a simplified and cheaper implementation, known as First-Order MAML (FOMAML)` – Long Dec 07 '22 at 04:11
  • @Long Thank you very much for your explanation. I have two questions: 1) Why does meta-learning use two steps of gradients, also called bi-level optimization? 2) Why is it said to involve "gradients of gradients", and what is the exact meaning of that? – S.EB Dec 12 '22 at 04:32
  • @S.EB I think these questions should go in a [new post](https://ai.stackexchange.com/questions/ask); the comment section will not be beneficial to others who have similar questions. I will try to answer them, and I think many people with better knowledge (than I have) would do so too. – Long Dec 19 '22 at 17:10

1 Answer


$\nabla_{\theta_{i-1}} \theta_{i-1} = \mathbf{I}$ in the same way that $\frac{d f}{dx} = 1$ for $f(x) = x$. Strictly speaking, $\mathbf{I}$ should be a vector of $1$s with the same dimensionality as $\theta_{i-1}$, but they are probably abusing notation here and putting such a vector on the diagonal of a matrix. Alternatively (actually, the most likely reason!), they are computing the partial derivative of $\theta_{i-1}^j$ with respect to $\theta_{i-1}^k$, for all $k$ and all $j$, which makes up an identity matrix.
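As a quick numerical sanity check (my own toy snippet, using PyTorch autograd rather than anything from the paper), the Jacobian of the identity map is exactly this identity matrix:

```python
import torch

# Jacobian of f(x) = x (clone() just gives autograd a distinct output node;
# mathematically it is still the identity map). Entry (j, k) is
# d theta^j / d theta^k, so the result is the identity matrix.
x = torch.randn(3)
jac = torch.autograd.functional.jacobian(lambda v: v.clone(), x)
print(jac)
# tensor([[1., 0., 0.],
#         [0., 1., 0.],
#         [0., 0., 1.]])
```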

Regarding your first question, $\nabla_{\theta} \theta_{0}$ probably becomes 1, but I am not familiar enough with the math of this paper to tell you why. Maybe it's because $\nabla_{\theta} \theta_{0}$ actually means $\nabla_{\theta_0} \theta_{0}$. I would need to dive into it.
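For what it's worth, the expansion usually given for this (e.g. in the Lilian Weng post linked in the comments; the notation below is assumed rather than copied from the paper) starts from inner SGD steps $\theta_i = \theta_{i-1} - \alpha \nabla_{\theta_{i-1}} \mathcal{L}^{(0)}(\theta_{i-1})$ with $\theta_0 = \theta$, and the chain rule then gives

$$
\nabla_{\theta} \mathcal{L}^{(1)}(\theta_k)
= \nabla_{\theta_k} \mathcal{L}^{(1)}(\theta_k) \prod_{i=1}^{k} \nabla_{\theta_{i-1}} \theta_i
= \nabla_{\theta_k} \mathcal{L}^{(1)}(\theta_k) \prod_{i=1}^{k} \left( \mathbf{I} - \alpha \nabla^2_{\theta_{i-1}} \mathcal{L}^{(0)}(\theta_{i-1}) \right).
$$

The $\nabla_{\theta} \theta_{0}$ factor is just the Jacobian of $\theta_0 = \theta$ with respect to itself, i.e. $\mathbf{I}$, which is why it can be dropped. FOMAML then also drops the Hessian terms, so every remaining factor becomes $\mathbf{I}$ and the meta-gradient reduces to $\nabla_{\theta_k} \mathcal{L}^{(1)}(\theta_k)$.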

  • We don't know what to differentiate it with, I guess. I think it should not be included in the chain rule. It is probably a random initialization. –  Mar 09 '20 at 15:57
  • @DuttaA Well, I guess that what you say is consistent with my explanation. If you look at that part of the formula, you see terms where the gradient with respect to the previous parameters is taken of the current parameters. However, initially, we only have $\theta_0$ (i.e. no previous parameters), so the derivative of $\theta_0$ with respect to itself is $1$. – nbro Mar 09 '20 at 16:00
  • I updated my question with a second solution. Please have a look and comment on it. It seems like in this case we don't need to worry about $\theta_{0}$ anymore. – Long Mar 10 '20 at 09:42