Machine Intelligence

<!-- gdoc-inlined -->


The bias-variance tradeoff is one instantiation of Occam’s razor.

Point 1: Bias-variance tradeoff vs. the variance of a probability distribution

Variance in the bias-variance tradeoff refers to the idea that when you’re searching over models, some models have more flexibility. When fit to a dataset, more flexible models tend to overfit, because they find a decision boundary that is overly accommodating to particular datapoints. There are many ways to overfit, and variance is an abstraction over all of them.

  1. Valuing fit over smoothness.
  2. Valuing a single datapoint in a region of sparse data over the influence of datapoints farther away that you could interpolate or extrapolate from (looking too hard at particular datapoints).
  3. Arbitrarily overweighting one representation of the features over other valuable ones; an incomplete search over the set of feature representations.
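As a hedged sketch of failure mode 2, the toy example below compares a 1-nearest-neighbor regressor, which memorizes each noisy training target exactly, against a smoother 5-neighbor average. The signal `f`, the noise level, and the neighbor counts are all illustrative assumptions, not from any particular source:

```python
import random

random.seed(0)

def f(x):
    # true underlying signal
    return x * x

# noisy training data on a grid of unique x values
xs = [i / 10 for i in range(20)]
ys = [f(x) + random.gauss(0, 0.5) for x in xs]

def knn_predict(x, k):
    # average of the k nearest training targets: k=1 is maximally
    # flexible (trusts one datapoint), larger k is smoother (more bias)
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))[:k]
    return sum(ys[i] for i in nearest) / k

def mse_vs_signal(points, k):
    # error against the noise-free signal, on held-out x values
    return sum((knn_predict(x, k) - f(x)) ** 2 for x in points) / len(points)

# 1-NN reproduces the noisy training targets exactly
train_mse_1 = sum((knn_predict(x, 1) - ys[i]) ** 2
                  for i, x in enumerate(xs)) / len(xs)
print(train_mse_1)  # 0.0: a perfect fit to the noise

test_xs = [i / 10 + 0.05 for i in range(19)]
print(mse_vs_signal(test_xs, 1), mse_vs_signal(test_xs, 5))
```

The flexible model’s zero training error is exactly the symptom the list above describes: it is looking too much at particular datapoints.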

Related: Decomposition over Regularization https://docs.google.com/document/d/1tCoaZEzERE3XP_4SzJWJhQ17bnY7vfUGYPGnCCEO54I/edit?usp=sharing

Levels of Abstraction, Abstracting Over an Incomplete Subset https://docs.google.com/document/d/18FvL9mlKTDlxQVXju1v8vOV63U73-6f9WVGh9-d8ScE/edit?usp=sharing

Treating variance in the bias-variance tradeoff as a concept, there are many ways we could instantiate it.

  1. The standard way: watching your model overfit. This approximation of variance is the difference between the training error and the validation error. (Bias affects your training and validation error roughly equally.)
  2. Bootstrap sampling variants of the dataset
    1. Split between in and out of bag examples
    2. Train on the training sample, test on the testing sample
    3. Variance is the ordinary (distribution) variance of your predictions on a given datapoint (assuming regression). You can compute the average variance across datapoints for your model’s variance.
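A minimal sketch of the bootstrap estimate in step 3, assuming an ordinary least-squares line as the model; the dataset, seed, and number of resamples are arbitrary choices for illustration:

```python
import random

random.seed(1)

def f(x):
    return 2 * x + 1

xs = [i / 10 for i in range(30)]
ys = [f(x) + random.gauss(0, 1) for x in xs]
data = list(zip(xs, ys))

def fit_line(sample):
    # ordinary least squares for y = a*x + b on (x, y) pairs
    n = len(sample)
    mx = sum(x for x, _ in sample) / n
    my = sum(y for _, y in sample) / n
    sxx = sum((x - mx) ** 2 for x, _ in sample)
    sxy = sum((x - mx) * (y - my) for x, y in sample)
    a = sxy / sxx
    return a, my - a * mx

def bootstrap_variance(x0, n_boot=200):
    # refit the model on bootstrap resamples of the dataset and take
    # the ordinary (distribution) variance of its predictions at x0
    preds = []
    for _ in range(n_boot):
        sample = [random.choice(data) for _ in range(len(data))]
        a, b = fit_line(sample)
        preds.append(a * x0 + b)
    m = sum(preds) / len(preds)
    return sum((p - m) ** 2 for p in preds) / len(preds)

print(bootstrap_variance(1.5))
```

Averaging `bootstrap_variance` over a grid of datapoints gives the model-level variance described in step 3.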

Another approximation is the number of hypotheses that can be learned by a model (say, the number of features in a dataset for a decision stump, and all of their interactions for a singly branched tree). Across different representations of a hypothesis space (parameters, freedom over those parameters, number of parameters, rules, freedom of rules), these are different approximations of variance. But a wide hypothesis space merely tends to cause high variance; it is not variance itself.
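The hypothesis-counting approximation can be made concrete by brute-force enumeration. This sketch (the feature count and tree shape are assumptions for illustration) counts the distinct boolean functions expressible by decision stumps versus trees one level deeper, over three binary features:

```python
from itertools import product

d = 3
inputs = list(product([0, 1], repeat=d))  # all 2^d binary feature vectors

def stump_fn(feat, lo, hi):
    # a stump: test one feature, output one label per branch;
    # represent the hypothesis as its full truth table
    return tuple(hi if x[feat] else lo for x in inputs)

stumps = {stump_fn(f, lo, hi)
          for f in range(d) for lo in (0, 1) for hi in (0, 1)}
# 8 distinct stumps: two constants, plus x_f and not-x_f per feature

def tree_fn(root, left, right):
    # a singly branched tree: a root split whose children are stumps
    return tuple(r if x[root] else l
                 for x, l, r in zip(inputs, left, right))

trees = {tree_fn(root, left, right)
         for root in range(d) for left in stumps for right in stumps}

print(len(stumps), len(trees))
```

Every stump is also a tree (pick identical children), so the deeper model’s hypothesis space strictly contains the stump’s: a wider space, hence a tendency toward higher variance.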

Say that your model’s predictions at a datapoint are Cauchy distributed. Would you say that, since its variance is undefined, the model is not subject to the bias-variance tradeoff?
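The question can be made tangible: the sample variance of Cauchy draws never settles down as the sample grows, because the population variance does not exist. A small sketch (the inverse-CDF sampler and seed are illustrative assumptions):

```python
import math
import random

random.seed(2)

def cauchy():
    # standard Cauchy via the inverse CDF: tan(pi * (U - 1/2))
    return math.tan(math.pi * (random.random() - 0.5))

def sample_variance(n):
    xs = [cauchy() for _ in range(n)]
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / n

# unlike a Gaussian, the estimate does not converge as n grows:
# the population variance is undefined
for n in (100, 10_000):
    print(n, sample_variance(n))
```

Yet such a model still overfits or underfits in the ordinary sense, which is the point: the concept of variance in the tradeoff is not exhausted by the distributional formalization.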

Variance as the variance of a distribution.

Variance as complex hypothesis classes leading to overfitting.

Just because a concept is formalizable doesn’t mean that the concept is its formalization. There’s something like the map-territory distinction here, but between a higher-level map and a lower-level map. We need a clean way to distinguish between concepts and their formalizations. Would you say that ‘attention’ in deep learning is attention? Of course not. Attention is so much bigger than that.


Source: Original Google Doc
