Hyper-gradient Descent

Adapt the execution of gradient descent so it also learns its own learning rate: Hyper-gradient Descent! Recall the Descent Direction Iteration update rule for learning rate \alpha:

\begin{equation} x^{(k+1)} = x^{(k)} - \alpha \nabla_{x} f\left(x^{(k)}\right) \end{equation}

What if we write:

\begin{equation} \pdv{f\left(x^{(k+1)}\right)}{\alpha} = \pdv{f\left(x^{(k+1)}\right)}{x^{(k+1)}} \pdv{x^{(k+1)}}{\alpha} \end{equation}

The left side is just f'(x^{(k+1)}) = \nabla_{x}f(x^{(k+1)}). For the right side, recall that \pdv{x^{(k+1)}}{\alpha} = \pdv{\alpha} \left(x^{(k)} - \alpha \nabla_{x} f(x^{(k)})\right), which evaluates to simply -\nabla_{x}f(x^{(k)}). Therefore:

\begin{align} \pdv{f\left(x^{(k+1)}\right)}{\alpha} &= \pdv{f\left(x^{(k+1)}\right)}{x^{(k+1)}} \pdv{x^{(k+1)}}{\alpha} \\ &= \nabla_{x}f(x^{(k+1)}) \cdot (-\nabla_{x}f(x^{(k)})) \end{align}

And so, to update our step size/learning rate, we can:

\begin{align} \alpha^{(k+1)} &= \alpha^{(k)} - \mu \left(\pdv{f\left(x^{(k+1)}\right)}{\alpha}\right) \\ &= \alpha^{(k)} - \mu \left[\nabla_{x}f(x^{(k+1)}) \cdot (-\nabla_{x}f(x^{(k)}))\right] \\ &= \alpha^{(k)} + \mu \left[\nabla_{x}f(x^{(k+1)}) \cdot (\nabla_{x}f(x^{(k)}))\right] \end{align}

Therefore, we update our weights with steps of size \alpha, and we also update the learning rate itself with respect to minimizing f. You will note that an optimal step size results in consecutive gradients being orthogonal; this rule reflects that: no update to \alpha happens when the gradient at iteration k+1 is orthogonal to the gradient at iteration k.
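To make this concrete, here is a minimal sketch of the update loop, assuming a NumPy setting; the quadratic test function, the initial \alpha, and the hyper-learning rate \mu = 10^{-4} are illustrative choices, not prescribed values.

```python
import numpy as np

def hypergradient_descent(grad, x, alpha=0.01, mu=1e-4, iters=100):
    """Gradient descent that also adapts its own learning rate alpha.

    After each step, alpha is updated with the hyper-gradient rule
    alpha <- alpha + mu * (grad f(x_new) . grad f(x_old)).
    """
    g_prev = grad(x)
    for _ in range(iters):
        x = x - alpha * g_prev                   # ordinary descent step with the current alpha
        g = grad(x)
        alpha = alpha + mu * np.dot(g, g_prev)   # hyper-gradient update of alpha
        g_prev = g
    return x, alpha

# Example: minimize f(x) = ||x||^2, whose gradient is 2x.
x_star, alpha_star = hypergradient_descent(lambda x: 2 * x, np.array([5.0, -3.0]))
```

Note that the only change from plain gradient descent is the one-line update of alpha using the dot product of successive gradients.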

Second-Order Methods

Newton's Method

See Newton's Method.

Secant Method

You can estimate the second derivative from the gradient to apply Newton's Method; this requires a finite-difference approximation:

\begin{equation} f''(x_{k}) \approx \frac{f'(x_{k}) - f'(x_{k-1})}{x_{k} - x_{k-1}} \end{equation}

Now, we can write:

\begin{equation} x_{k+1} = x_{k} - \frac{x_{k} - x_{k-1}}{f'(x_{k}) - f'(x_{k-1})} f'(x_{k}) \end{equation}

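As a sketch, the univariate version of this iteration can be written directly, assuming we can evaluate f'; the test function, starting points, and tolerance below are arbitrary choices for illustration.

```python
def secant_newton(fprime, x_prev, x_curr, tol=1e-8, max_iters=100):
    """Newton's method with the second derivative approximated by a secant.

    x_{k+1} = x_k - (x_k - x_{k-1}) / (f'(x_k) - f'(x_{k-1})) * f'(x_k)
    """
    g_prev = fprime(x_prev)
    for _ in range(max_iters):
        g_curr = fprime(x_curr)
        if abs(g_curr - g_prev) < 1e-15:   # avoid division by zero when gradients stop changing
            break
        x_next = x_curr - (x_curr - x_prev) / (g_curr - g_prev) * g_curr
        if abs(x_next - x_curr) < tol:
            return x_next
        x_prev, x_curr, g_prev = x_curr, x_next, g_curr
    return x_curr

# Example: minimize f(x) = x^4 - 3x^2, where f'(x) = 4x^3 - 6x.
x_min = secant_newton(lambda x: 4 * x**3 - 6 * x, 1.0, 2.0)
```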
How do we do this for the Hessian? Use one of:

Davidon-Fletcher-Powell (DFP)
Broyden-Fletcher-Goldfarb-Shanno (BFGS)
Limited-Memory BFGS (L-BFGS)

Due to the approximate nature of these updates, they may take more steps to converge.

Direct Methods

Cyclic Coordinate Search

Cycle through the coordinates and do a line search along each. For a function with n design variables in the design point: in each iteration, we freeze all dimensions except one, do a line search along that coordinate, then move on to the next coordinate, and repeat. (A minimal sketch appears at the end of this note.)

Broken
This may fail when there is a trough between two axes.

Accelerated Coordinate Search

After taking one cycle through all n directions, also take a step in the average direction of the previous n steps. This partially fixes the case where there's a trough in between.

Powell's Method

This is Accelerated Coordinate Search, but we forget the oldest search direction. Consider Cyclic Coordinate Search, which searches in steps of the basis vectors e^{(1)}, e^{(2)}, \dots, e^{(n)}. Accelerated Coordinate Search searches in e^{(1)}, \dots, e^{(n)}, d, where d is the average of your previous steps. Powell's Method goes one step further by dropping the first member of the list each cycle:

e^{(1)}, \dots, e^{(n)}, d^{(1)}
e^{(2)}, \dots, e^{(n)}, d^{(1)}, d^{(2)}
e^{(3)}, \dots, e^{(n)}, d^{(1)}, d^{(2)}, d^{(3)}

Failure
This may rapidly result in linearly dependent search directions due to so much averaging. We therefore perform a line search to obtain a new d, or just reset the list back to the independent basis vectors every so often.

Hooke-Jeeves Search

See Hooke-Jeeves Policy Search, but you don't check the Rollout Policy; you just go and evaluate the function. Big picture: check a set of local perturbations, move your center point to the lowest one, and perturb again. If none of the perturbations improve on the center point, shrink your perturbation size.

Generalized Pattern Search

Hooke-Jeeves Search uses 2n searches, one in each direction; in \mathbb{R}^{2}, for instance, this requires 4 lookups, because there are 2 basis vectors and we sample once in the positive and once in the negative direction of each. To save lookups, we create a positive spanning set instead (i.e., a spanning set for the space in which you are constrained to only use positive coefficients, which requires one extra vector). This creates a "triangle" of n+1 vectors, and we only check one point per spanning-set member.

Opportunistic Search
As soon as you find an improving point, just go there without checking the rest.

Dynamic Ordering
If a direction produced an improvement, check that direction first in the next iteration.

Nelder-Mead Simplex Method

Take a simplex to search with. Sort its vertices from lowest to highest objective function value, with the lowest point called x_{l}, x_{h} the highest, and x_{s} the second highest. Let \bar{x} be the mean of all points except x_{h}. You have four possible actions:

reflection: take your worst point and reflect it across \bar{x}, the mean of the remaining points
expansion: if the reflected point is the new best point, push it even further in the same direction
contraction: move the worst point closer to \bar{x}
shrink: move all points closer to the best point x_{l}
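Here is a minimal sketch of one Nelder-Mead iteration with the standard coefficients (reflect by 1, expand by 2, contract and shrink by 0.5); it collapses the usual inside/outside contraction cases into a single contraction toward \bar{x}, so treat it as an illustration rather than the full method.

```python
import numpy as np

def nelder_mead_step(f, simplex):
    """One iteration of the Nelder-Mead simplex method.

    `simplex` is an (n+1, n) array of vertices; returns the updated simplex.
    """
    # Sort vertices from best (lowest f) to worst (highest f).
    simplex = simplex[np.argsort([f(x) for x in simplex])]
    x_l, x_s, x_h = simplex[0], simplex[-2], simplex[-1]
    x_bar = simplex[:-1].mean(axis=0)              # mean of all points except the worst

    x_r = x_bar + (x_bar - x_h)                    # reflection of the worst point across x_bar
    if f(x_r) < f(x_l):
        x_e = x_bar + 2.0 * (x_bar - x_h)          # expansion: push further since reflection is the new best
        simplex[-1] = x_e if f(x_e) < f(x_r) else x_r
    elif f(x_r) < f(x_s):
        simplex[-1] = x_r                          # reflection is good enough; accept it
    else:
        x_c = x_bar + 0.5 * (x_h - x_bar)          # contraction: pull the worst point toward x_bar
        if f(x_c) < f(x_h):
            simplex[-1] = x_c
        else:
            simplex = x_l + 0.5 * (simplex - x_l)  # shrink: move all points toward the best point
    return simplex

# Example: a few iterations on f(x, y) = x^2 + y^2 starting from an arbitrary simplex.
f = lambda v: v[0] ** 2 + v[1] ** 2
simplex = np.array([[1.0, 1.0], [1.5, 1.0], [1.0, 1.5]])
for _ in range(50):
    simplex = nelder_mead_step(f, simplex)
```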

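Referring back to Cyclic Coordinate Search above, here is a minimal sketch that uses SciPy's minimize_scalar as the one-dimensional line search; any line-search routine would do, and the test function is an arbitrary illustrative choice.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def cyclic_coordinate_search(f, x, n_cycles=20):
    """Cyclic coordinate search: line-search along one coordinate at a time.

    All coordinates except the current one are frozen; a 1-D minimization
    picks the best value for that coordinate, then we move to the next one.
    """
    x = np.asarray(x, dtype=float).copy()
    for _ in range(n_cycles):
        for i in range(x.size):
            # 1-D slice of f along coordinate i, with the other coordinates frozen.
            def f_1d(t, i=i):
                x_trial = x.copy()
                x_trial[i] = t
                return f(x_trial)
            x[i] = minimize_scalar(f_1d).x   # line search along coordinate i
    return x

# Example: f(x, y) = (x - 1)^2 + 10 * (y + 2)^2, minimized at (1, -2).
x_min = cyclic_coordinate_search(lambda v: (v[0] - 1) ** 2 + 10 * (v[1] + 2) ** 2, [0.0, 0.0])
```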