Interesting Facts in Machine Learning
Category: Machine Intelligence
Linear Regression
- You can get better generalization with a stochastic solver
- The fastest stable solution is usually via QR factorization rather than computing the inverse or pseudoinverse - linear regression is nearly unique among ML algorithms in this respect
- Scaling can still matter, but for the optimizer - the model is convex, so you eventually reach the same solution either way
- Linear generalization w/ quality feature engineering is stronger than almost every other form of generalization for unstructured data (trees + networks overfit)
- Among the best at not overfitting the data - a near-optimal algorithm in low signal-to-noise environments
- One of the few major supervised algorithms with a closed-form solution
- Every relationship between your feature and the label should be as close to linear as possible
- You can use a Box-Cox transform to automatically get close to linear
- Convex
- Lasso is not invariant to rescaling
- From a Bayesian perspective, an L1 penalty corresponds to a Laplace prior on the coefficients and an L2 penalty to a Gaussian prior
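The QR route above can be sketched with numpy (synthetic data; all names here are illustrative): factor X = QR, then solve the triangular system R·β = Qᵀy instead of forming an inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=100)

# QR route: factor X = QR, then solve the triangular system R beta = Q^T y.
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

# Pseudoinverse route for comparison - same least-squares solution,
# but generally slower and less numerically stable.
beta_pinv = np.linalg.pinv(X) @ y
```

Both routes recover the same least-squares coefficients when X has full column rank; the QR version never forms XᵀX, which is where the conditioning of the normal equations goes bad.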
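A numpy-only sketch of the Box-Cox idea: pick the exponent that maximizes the profile log-likelihood under a normality assumption (in practice `scipy.stats.boxcox` does this for you; the data and grid here are illustrative).

```python
import numpy as np

def boxcox(x, lam):
    # Box-Cox transform for positive x: (x^lam - 1) / lam, with log(x) at lam = 0.
    return np.log(x) if lam == 0 else (x**lam - 1) / lam

def boxcox_loglik(x, lam):
    # Profile log-likelihood of lam, assuming the transformed data is Gaussian.
    z = boxcox(x, lam)
    return -0.5 * len(x) * np.log(np.var(z)) + (lam - 1) * np.sum(np.log(x))

# Log-normal (right-skewed) data: the chosen exponent should be near 0,
# i.e. close to a plain log transform.
rng = np.random.default_rng(0)
x = np.exp(rng.normal(size=500))

grid = np.linspace(-2.0, 2.0, 401)
lam_hat = max(grid, key=lambda lam: boxcox_loglik(x, lam))
```

The grid search is a sketch of the standard maximum-likelihood fit; note Box-Cox requires strictly positive inputs.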
Logistic Regression
- Two major ways to do multinomial classification:
- Softmax Loss
- One vs. All with binary (logistic) function
- Naming:
- “Logistic” regression, after the sigmoid (logistic) function
- “Softmax” regression, after the softmax function
- No closed form solution, despite convexity
- Many, many optimizers:
- Newton / Newton-CG
- BFGS
- L-BFGS
- IRLS
- Trust Region Conjugate Gradient
- Gradient Descent
- GD + Line Search
- Stochastic Average Gradient
- Bayesian treatment is difficult (no convenient conjugate prior)
- Discriminative (Learns P(Y|X), rather than first the joint P(Y, X) and then conditioning on X (the generative approach))
- Without regularization, the weights can grow arbitrarily large (e.g. on linearly separable data), damaging generalization. Penalties matter more here than in the regression setting.
- You can get better generalization with a stochastic solver [https://arxiv.org/pdf/1708.05070.pdf]
- Scaling can still matter, but for the optimizer - the model is convex, so you eventually reach the same solution either way
- Linear generalization is stronger than almost every other form of generalization for unstructured data (trees + networks overfit)
- Every relationship between your feature and the label should be as close to linear as possible
- You can use a Box-Cox transform to automatically get close to linear
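The point about weights blowing up without regularization can be seen on a tiny separable example (1-D toy data, plain gradient descent, no intercept; everything here is illustrative):

```python
import numpy as np

def fit_logistic_1d(x, y, l2=0.0, steps=5000, lr=0.1):
    # Gradient descent on the (optionally L2-penalized) logistic loss,
    # with a single weight and no intercept.
    w = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-w * x))
        w -= lr * (np.mean((p - y) * x) + l2 * w)
    return w

# Linearly separable 1-D data.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

# Unregularized: the weight keeps growing (it would diverge as steps -> inf,
# since scaling w up always lowers the loss on separable data).
# L2-regularized: the weight settles at a finite value.
w_unreg = fit_logistic_1d(x, y)
w_ridge = fit_logistic_1d(x, y, l2=0.1)
```

Running longer only makes the gap larger: the unregularized weight grows roughly logarithmically forever, while the penalized one has a finite optimum.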
Decision Trees
- Captures structure that looks like discontinuities or thresholds in a feature
- This is close to quantized structure!!
- This is a great example of model blending: if leaving time and the countdown time of each stoplight along the way are your input features, using both a decision tree and an MLP captures the quantized structure (hitting a stoplight causes large jumps in arrival time) and the continuous structure (the smooth relationship between leaving time and arrival time)
- Captures discrete structure in continuous and discrete features
- Fails to capture continuous structure
- Extremely poor generalization out of domain - at best, it predicts the value of the most extreme example seen in training
- Works over missing data
- One approach is to stop when you hit a missing value and predict the majority class among the training examples remaining at that node
- Learns hierarchy of feature interactions, top down
- Question - decision trees are learned top-down. How can we do supervised learning bottom up? Hierarchical clustering w/ supervision?
- Recursively chooses the split that leads to the greatest variance gain (for regression) or information gain through entropy or gini impurity (for classification).
- Insensitive to monotone transformations of features (only cares how the distribution of labels varies across split points)
- Greedy algorithm
- Can be seen as a hierarchical mixture of experts (train expert models on subsets of the data)
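The greedy split criterion can be sketched for the regression case (toy data; the exhaustive threshold scan is the standard approach, and the names are illustrative):

```python
import numpy as np

def best_split(x, y):
    # Scan every candidate threshold and keep the one that minimizes the
    # weighted variance of the two children - i.e. maximizes the variance
    # reduction ("variance gain") relative to the parent node.
    best_t, best_score = None, np.inf
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        score = len(left) * np.var(left) + len(right) * np.var(right)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

# Step-shaped data: the best split lands exactly at the jump (x = 4).
x = np.arange(10.0)
y = np.where(x <= 4, 1.0, 5.0)
```

For classification, swap the variance score for the entropy or Gini impurity of the child label distributions; recursing on each child gives the top-down tree. Note the scan only looks at how labels redistribute across split points, which is why monotone feature transforms change nothing.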
Neural Networks
- Learns compositional (bottom-up) hierarchical structure
- Model complexity overcomes the curse of dimensionality
- Combinatorial in depth and in width
- Requires high signal-to-noise ratio
- ‘Just’ adaptive basis function regression
- Optimizer improved by an exponentially weighted moving average of the gradient (momentum) and by adapting the learning rate
- Covariate Shift
- Close-to-linear model leads to failure to generalize, ex. adversarial examples
- Dimensionality of the representation (number of channels) typically increases with the depth of a convnet, while spatial resolution decreases
- Softmax leads to extreme solutions
- Non-convex optimization surface is dominated by saddle points.
- Convnets are built on:
- Parameter sharing, which leads to translation equivariance
- Locality (sparse connectivity)
- Composition
- Convnets are not equivariant to scale or rotation.
- Many machine learning libraries implement cross-correlation but call it convolution
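The convolution / cross-correlation point in numpy terms (the signal and kernel are arbitrary examples): the two operations differ only by a flip of the kernel.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
k = np.array([1.0, 0.0, -1.0])

# True convolution flips the kernel before sliding it; cross-correlation
# slides it as-is. Most deep learning "convolution" layers compute the latter.
conv = np.convolve(x, k, mode="valid")
corr = np.correlate(x, k, mode="valid")

# Convolving with the flipped kernel recovers the cross-correlation.
flipped = np.convolve(x, k[::-1], mode="valid")
```

Since the layer's kernel is learned anyway, the distinction is harmless in practice - the network just learns the flipped filter.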
Optimization
- Stochastic gradient descent optimizes the validation / test error directly (when each datapoint is touched only once), while batch gradient descent optimizes the training-set error (and so can overfit).
- https://arxiv.org/abs/1509.01240
- http://papers.nips.cc/paper/6015-learning-with-incremental-iterative-regularization
- Improved by an exponentially weighted moving average of the gradient (momentum) and by adapting the learning rate
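A minimal sketch of the momentum idea - an exponentially weighted moving average of past gradients smoothing the update direction - on a toy quadratic (the objective and constants are illustrative):

```python
import numpy as np

def gd_momentum(grad, w0, lr=0.1, beta=0.9, steps=200):
    # Keep an exponentially weighted moving average v of past gradients
    # and step along it instead of the raw gradient.
    w, v = np.array(w0, dtype=float), np.zeros_like(w0, dtype=float)
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad(w)
        w = w - lr * v
    return w

# Toy quadratic f(w) = 0.5 * ||w - target||^2, whose gradient is w - target.
target = np.array([3.0, -2.0])
w_hat = gd_momentum(lambda w: w - target, np.zeros(2))
```

Adaptive-learning-rate methods additionally keep a second moving average of squared gradients and divide the step by its square root, which is the other half of the bullet above.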
Source: Original Google Doc