Interesting Facts in Machine Learning
Category: Machine Intelligence
Linear Regression
- You can get better generalization with a stochastic solver
- The fastest stable solution is usually via QR factorization rather than computing the inverse or pseudoinverse - linear regression is nearly unique among ML algorithms in this respect
- Scaling can still matter, but for the optimizer - the model is convex, so you eventually reach the same solution either way
- Linear generalization w/ quality feature engineering is stronger than almost every other form of generalization for unstructured data (trees + networks overfit)
- Among the best at not overfitting the data - a near-optimal algorithm in low signal-to-noise environments
- One of the few major supervised algorithms with a closed-form solution
- Every relationship between your feature and the label should be as close to linear as possible
- You can use a Box-Cox transform to automatically get close to linear
- Convex
- Lasso is not invariant to rescaling
- From a Bayesian perspective, an L1 penalty corresponds to a Laplace prior on the coefficients and an L2 penalty to a Gaussian prior
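The QR route above can be sketched with numpy (synthetic data; all names here are illustrative): factor X = QR, then solve the triangular system R·β = Qᵀy instead of forming an inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=100)

# QR route: factor X = QR, then solve the triangular system R beta = Q^T y.
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

# Pseudoinverse route for comparison - same least-squares solution,
# but generally slower and less numerically stable.
beta_pinv = np.linalg.pinv(X) @ y
```

Both routes recover the same least-squares coefficients when X has full column rank; the QR version never forms XᵀX, which is where the conditioning of the normal equations goes bad.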
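A numpy-only sketch of the Box-Cox idea: pick the exponent that maximizes the profile log-likelihood under a normality assumption (in practice `scipy.stats.boxcox` does this for you; the data and grid here are illustrative).

```python
import numpy as np

def boxcox(x, lam):
    # Box-Cox transform for positive x: (x^lam - 1) / lam, with log(x) at lam = 0.
    return np.log(x) if lam == 0 else (x**lam - 1) / lam

def boxcox_loglik(x, lam):
    # Profile log-likelihood of lam, assuming the transformed data is Gaussian.
    z = boxcox(x, lam)
    return -0.5 * len(x) * np.log(np.var(z)) + (lam - 1) * np.sum(np.log(x))

# Log-normal (right-skewed) data: the chosen exponent should be near 0,
# i.e. close to a plain log transform.
rng = np.random.default_rng(0)
x = np.exp(rng.normal(size=500))

grid = np.linspace(-2.0, 2.0, 401)
lam_hat = max(grid, key=lambda lam: boxcox_loglik(x, lam))
```

The grid search is a sketch of the standard maximum-likelihood fit; note Box-Cox requires strictly positive inputs.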
Logistic Regression
- Two major ways to do multinomial classification:
- Softmax Loss
- One vs. All with binary (logistic) function
- Naming:
- “Logistic” regression, after the sigmoid (logistic) function
- “Softmax” regression, after the softmax function
- No closed form solution, despite convexity
- Many, many optimizers:
- Newton / Newton-CG
- BFGS
- L-BFGS
- IRLS
- Trust Region Conjugate Gradient
- Gradient Descent
- GD + Line Search
- Stochastic Average Gradient
- Bayesian treatment is difficult (no convenient conjugate prior)
- Discriminative (Learns P(Y|X), rather than first the joint P(Y, X) and then conditioning on X (the generative approach))
- Without regularization, the weights can grow arbitrarily large (e.g. on linearly separable data), damaging generalization. Penalties matter more here than in the regression setting.
- You can get better generalization with a stochastic solver [https://arxiv.org/pdf/1708.05070.pdf]
- Scaling can still matter, but for the optimizer - the model is convex, so you eventually reach the same solution either way
- Linear generalization is stronger than almost every other form of generalization for unstructured data (trees + networks overfit)
- Every relationship between your feature and the label should be as close to linear as possible
- You can use a Box-Cox transform to automatically get close to linear
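The point about weights blowing up without regularization can be seen on a tiny separable example (1-D toy data, plain gradient descent, no intercept; everything here is illustrative):

```python
import numpy as np

def fit_logistic_1d(x, y, l2=0.0, steps=5000, lr=0.1):
    # Gradient descent on the (optionally L2-penalized) logistic loss,
    # with a single weight and no intercept.
    w = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-w * x))
        w -= lr * (np.mean((p - y) * x) + l2 * w)
    return w

# Linearly separable 1-D data.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

# Unregularized: the weight keeps growing (it would diverge as steps -> inf,
# since scaling w up always lowers the loss on separable data).
# L2-regularized: the weight settles at a finite value.
w_unreg = fit_logistic_1d(x, y)
w_ridge = fit_logistic_1d(x, y, l2=0.1)
```

Running longer only makes the gap larger: the unregularized weight grows roughly logarithmically forever, while the penalized one has a finite optimum.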
Decision Trees
- Captures structure that looks like discontinuities or thresholds in a feature
- This is close to quantized structure!!
- This is a great example of model blending: if leaving time and the countdown time of each stoplight along the way are your input features, using both a decision tree and an MLP captures the quantized structure (hitting a stoplight causes large jumps in arrival time) and the continuous structure (the smooth relationship between leaving time and arrival time)
- Captures discrete structure in continuous and discrete features
- Fails to capture continuous structure
- Extremely poor generalization out of domain - at best, it predicts the value of the most extreme example seen in training
- Works over missing data
- One approach is to stop when you hit a missing value and predict the majority class among the training examples remaining at that node
- Learns hierarchy of feature interactions, top down
- Question - decision trees are learned top-down. How can we do supervised learning bottom up? Hierarchical clustering w/ supervision?
- Recursively chooses the split that leads to the greatest variance gain (for regression) or information gain through entropy or gini impurity (for classification).
- Insensitive to monotone transformations of features (only cares how the distribution of labels varies across split points)
- Greedy algorithm
- Can be seen as a hierarchical mixture of experts (train expert models on subsets of the data)
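The greedy split criterion can be sketched for the regression case (toy data; the exhaustive threshold scan is the standard approach, and the names are illustrative):

```python
import numpy as np

def best_split(x, y):
    # Scan every candidate threshold and keep the one that minimizes the
    # weighted variance of the two children - i.e. maximizes the variance
    # reduction ("variance gain") relative to the parent node.
    best_t, best_score = None, np.inf
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        score = len(left) * np.var(left) + len(right) * np.var(right)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

# Step-shaped data: the best split lands exactly at the jump (x = 4).
x = np.arange(10.0)
y = np.where(x <= 4, 1.0, 5.0)
```

For classification, swap the variance score for the entropy or Gini impurity of the child label distributions; recursing on each child gives the top-down tree. Note the scan only looks at how labels redistribute across split points, which is why monotone feature transforms change nothing.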
Neural Networks
- Learns compositional (bottom-up) hierarchical structure
- Model complexity overcomes the curse of dimensionality
- Combinatorial in depth and in width
- Requires high signal-to-noise ratio
- ‘Just’ adaptive basis function regression
- Optimizer improved by an exponentially weighted moving average of the gradient (momentum) and by adapting the learning rate
- Covariate Shift
- Close-to-linear model leads to failure to generalize, ex. adversarial examples
- Dimensionality of the representation (number of channels) typically increases with the depth of a convnet, while spatial resolution decreases
- Softmax leads to extreme solutions
- Non-convex optimization surface is dominated by saddle points.
- Convnets are built on:
- Parameter sharing, which leads to translation equivariance
- Locality (sparse connectivity)
- Composition
- Convnets are not equivariant to scale or rotation.
- Many machine learning libraries implement cross-correlation but call it convolution
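The convolution / cross-correlation point in numpy terms (the signal and kernel are arbitrary examples): the two operations differ only by a flip of the kernel.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
k = np.array([1.0, 0.0, -1.0])

# True convolution flips the kernel before sliding it; cross-correlation
# slides it as-is. Most deep learning "convolution" layers compute the latter.
conv = np.convolve(x, k, mode="valid")
corr = np.correlate(x, k, mode="valid")

# Convolving with the flipped kernel recovers the cross-correlation.
flipped = np.convolve(x, k[::-1], mode="valid")
```

Since the layer's kernel is learned anyway, the distinction is harmless in practice - the network just learns the flipped filter.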
Optimization
- Stochastic gradient descent optimizes the validation / test error directly (when each datapoint is touched only once), while batch gradient descent optimizes the training-set error (and so can overfit).
- https://arxiv.org/abs/1509.01240
- http://papers.nips.cc/paper/6015-learning-with-incremental-iterative-regularization
- Improved by an exponentially weighted moving average of the gradient (momentum) and by adapting the learning rate
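A minimal sketch of the momentum idea - an exponentially weighted moving average of past gradients smoothing the update direction - on a toy quadratic (the objective and constants are illustrative):

```python
import numpy as np

def gd_momentum(grad, w0, lr=0.1, beta=0.9, steps=200):
    # Keep an exponentially weighted moving average v of past gradients
    # and step along it instead of the raw gradient.
    w, v = np.array(w0, dtype=float), np.zeros_like(w0, dtype=float)
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad(w)
        w = w - lr * v
    return w

# Toy quadratic f(w) = 0.5 * ||w - target||^2, whose gradient is w - target.
target = np.array([3.0, -2.0])
w_hat = gd_momentum(lambda w: w - target, np.zeros(2))
```

Adaptive-learning-rate methods additionally keep a second moving average of squared gradients and divide the step by its square root, which is the other half of the bullet above.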
Source: Original Google Doc