The AdaBoost algorithm (a runnable sketch of this loop appears at the end of the section):

- initialize weights \alpha_{i} = \frac{1}{N}
- for t \in 1 \ldots T:
    - learn this round's classifier f_{t}\left(x\right) on data weights \alpha_{i}
    - compute the classifier coefficient \hat{w}_{t}
    - recompute the weights \alpha_{i}
    - normalize the weights: \alpha_{i} = \frac{\alpha_{i}}{\sum_{j=1}^{N} \alpha_{j}}
- the final model predicts via \hat{y} = \text{sign}\left(\sum_{t=1}^{T} \hat{w}_{t}f_{t}\left(x\right)\right) for a classifier with T rounds

This algorithm only handles binary classification, and it minimizes the exponential loss, which upper-bounds the 0/1 loss; gradient boosting will eventually allow a more general form.

convergence theorem

If at every round t we can find a weak learner with weighted error < 0.5, then AdaBoost's training error converges to zero. The training error of the classifier after training via AdaBoost (the left term) is bounded by:

\begin{equation} \frac{1}{N} \sum_{i=1}^{N} 1 \left\{F\left(x_{i}\right) \neq y_{i}\right\} \leq \frac{1}{N} \sum_{i=1}^{N} \exp \left(-y_{i}\text{score}\left(x_{i}\right)\right) = \prod_{t=1}^{T} \sum_{i=1}^{N} \alpha_{i,t} \exp \left(-\hat{w}_{t} y_{i} f_{t}\left(x_{i}\right)\right) \end{equation}

where score is the usual ensemble score: \text{score}\left(x\right) = \sum_{t} \hat{w}_{t}f_{t}\left(x\right) and F\left(x\right) = \text{sign}\left(\text{score}\left(x\right)\right).

Sketch: notice that y_{i}\,\text{score}\left(x_{i}\right) is a proxy for whether F\left(x_{i}\right) \neq y_{i} (the expression is negative exactly when they disagree). So e^{-y_{i}\text{score}\left(x_{i}\right)} is an exponential loss for the binary classification task: it grows exponentially when the score strongly disagrees with y_{i}. This gives an upper bound on the zero-one loss on the left. To keep the error bound tight, we should therefore minimize the rightmost term as a function of \hat{w}_{t} and \alpha_{i,t}. It turns out that doing so gives the optimal \hat{w}_{t} = \frac{1}{2} \ln \left(\frac{1-\varepsilon_{t}}{\varepsilon_{t}}\right), where \varepsilon_{t} is the weighted error of f_{t}.

classifier weighting

If f_{t}\left(x\right) is good, then set \hat{w}_{t} to be large; otherwise, set \hat{w}_{t} to be small. We can write:

\begin{equation} \hat{w}_{t} = \frac{1}{2} \ln \left( \frac{1- \text{weighted\_error}\left(f_{t}\right)}{\text{weighted\_error}\left(f_{t}\right)} \right) \end{equation}
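A quick numeric check of this formula (a minimal sketch; the weighted-error values are made up for illustration):

```python
import math

def classifier_weight(weighted_error: float) -> float:
    """AdaBoost classifier coefficient: 0.5 * ln((1 - eps) / eps)."""
    return 0.5 * math.log((1 - weighted_error) / weighted_error)

# Illustrative weighted errors (assumed values, not from the notes):
for eps in (0.05, 0.5, 0.95):
    print(f"weighted_error={eps:.2f} -> w_hat={classifier_weight(eps):+.3f}")
# weighted_error=0.05 -> w_hat=+1.472  (good classifier: large positive weight)
# weighted_error=0.50 -> w_hat=+0.000  (random classifier: weight ~ 0)
# weighted_error=0.95 -> w_hat=-1.472  (bad classifier: negative weight, i.e. inverted)
```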

As the numbers above suggest, this expression has the following properties: a great classifier is upweighted, a random classifier has weight tending to zero, and a terrible classifier is automatically inverted. Notice also that if we have 0 weighted error, \hat{w}_{t} \to \infty, which is nice.

| Classifier Quality | Weight |
| --- | --- |
| good | positive, large magnitude |
| random | zero |
| bad (inversely good) | negative, large magnitude |

sample weighting

\begin{equation} \alpha_{i} = \begin{cases} \alpha_{i} e^{-\hat{w}_{t}}, & \text{if correct} \\ \alpha_{i} e^{\hat{w}_{t}}, & \text{if mistake} \end{cases} \end{equation}

The intuition:

| Sample Correct? | Classifier Quality | Weight |
| --- | --- | --- |
| yes | good | decrease |
| yes | random | stays the same |
| yes | bad | increase |
| no | good | increase |
| no | random | stays the same |
| no | bad | decrease |

After every iteration, we normalize the weights to sum to 1: \alpha_{i} = \frac{\alpha_{i}}{\sum_{j=1}^{N} \alpha_{j}}.

How do we learn a classifier under data weights? For gradient descent, weight contribution i by \alpha_{i} in the gradient; for decision trees, weight the classification error on point i by \alpha_{i}.
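Putting the pieces together, here is a minimal sketch of the full loop, assuming scikit-learn decision stumps (DecisionTreeClassifier with max_depth=1) as the weak learner and labels in {-1, +1}; the names adaboost_fit / adaboost_predict are illustrative, not from the notes:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Train T rounds of AdaBoost. Assumes y is in {-1, +1}."""
    X, y = np.asarray(X), np.asarray(y)
    N = len(y)
    alpha = np.full(N, 1.0 / N)                 # initialize weights alpha_i = 1/N
    stumps, w_hats = [], []
    for t in range(T):
        # learn this round's classifier on the current data weights
        f_t = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=alpha)
        pred = f_t.predict(X)
        # weighted error of f_t (alpha already sums to 1, so no extra division)
        eps = np.clip(alpha[pred != y].sum(), 1e-10, 1 - 1e-10)
        # classifier coefficient w_hat_t = 0.5 * ln((1 - eps) / eps)
        w_hat = 0.5 * np.log((1 - eps) / eps)
        # reweight samples: down-weight correct points, up-weight mistakes
        alpha = alpha * np.exp(-w_hat * y * pred)
        # normalize weights so they sum to 1
        alpha /= alpha.sum()
        stumps.append(f_t)
        w_hats.append(w_hat)
    return stumps, w_hats

def adaboost_predict(X, stumps, w_hats):
    """Predict via sign(sum_t w_hat_t * f_t(x))."""
    score = sum(w * f.predict(X) for w, f in zip(w_hats, stumps))
    return np.sign(score)
```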

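Usage on a toy dataset, continuing the sketch above (make_classification and the relabeling to {-1, +1} are assumptions for the example):

```python
from sklearn.datasets import make_classification

X, y01 = make_classification(n_samples=200, n_features=5, random_state=0)
y = 2 * y01 - 1                                  # relabel {0, 1} -> {-1, +1}
stumps, w_hats = adaboost_fit(X, y, T=25)
train_acc = (adaboost_predict(X, stumps, w_hats) == y).mean()
print(f"training accuracy after 25 rounds: {train_acc:.3f}")
```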