Murphy
Category: Technical
<!-- gdoc-inlined -->
Models to Learn
- Beta-binomial Model
- Dirichlet-multinomial Model
- Quadratic Discriminant Analysis
- Hierarchical Bayes
- Bayesian Linear Regression
- Bayesian Logistic Regression
- Bayesian Information Criterion
- Generalized Linear Models
- Independent Component Analysis
- Automatic Relevance Determination
- Sparse Coding
- CART
- MARS
- Boosting
- State Space Models
- Kalman Filter
- Spectral Clustering
- Structure Learning
- Latent Dirichlet Allocation
- Restricted Boltzmann Machines
- Deep Belief Networks
Optimizers to Learn
- Inferring the parameters of a multivariate normal
- Expectation Maximization
- The mean field method (Ising model)
- Variational Bayes
- Loopy belief propagation
- Expectation Propagation
- Rejection Sampling
- Importance Sampling
- Particle Filtering
Other
- Bayesian Decision Theory
- Parametric vs. Non-parametric models
- Empirical Risk Minimization
- Pathologies of Frequentist Statistics
- Distributions and their Value
- Degenerate pdf
- Empirical Distribution
- Laplace distribution
- Student’s t distribution
- Gamma distribution
- Beta distribution
- Pareto distribution
- Kernels and their Value
- RBF
- Mercer
- Linear
- Matern
- Kernel Trick
Probability Theory
Fundamentals in Probability Theory
- Discrete Random Variable following a Probability Mass Function
- Probability of a union of two events
- Joint Probability
- Joint Distribution
- Product Rule
- Sum Rule
- Conditional Probability
- Bayes Rule
- Generative Classifier
- Discriminative Classifier
- Independence
- Conditional Independence
- Continuous Random Variable following a Probability Density Function
- Cumulative Distribution Function
- Quantiles
- Median
- Quartiles
- Tail Area Probabilities
- Mean / Expected Value
- Variance / Standard Deviation
Distributions (Would be great to have examples for all, could even add this content to structure of information)
- Continuous Distributions
- Gaussian
- Mean / Precision (inverse variance) parameterized
- Student t distribution
- Mean / Variance / Degrees of Freedom parameterized
- Cauchy / Lorentz distribution when v = 1
- Like a Gaussian, but less sensitive to outliers.
- Approaches a gaussian as v grows past ~5
- Laplace
- Parameterized by ‘location’ and ‘scale’
- Pointed distribution with linear tails (both concave and convex)
- Gamma
- Parameterized by ‘shape’ (a) and ‘rate’ (b)
- Exponential Distribution is gamma where shape = 1
- Erlang distribution is gamma where shape is an integer.
- Chi-squared is gamma where shape = v/2 and rate = ½
- Sum of squared gaussians
- Beta
- Parameterized by a and b, extremely flexible.
- Pareto
- Parameterized by m and k; x must be greater than some constant m, but not too much greater (where 'too much greater' is controlled by k).
- Models heavy tailed distributions
- Joint Continuous Distributions
- Covariance
- Correlation
- Multivariate Gaussian
- Multivariate Student t
- Dirichlet Distribution
- Discrete Distributions
- Binomial
- Counts the number of successes among n identical binary trials, parameterized by n and the success probability
- Bernoulli (Binomial where n = 1)
- Multinomial
- Counts how many times each of several different outcomes occurs over a number of trials
- Multinoulli
- Poisson
- Lambda parameterized, where lambda both determines the center and spread of the distribution
- Used as a model for rare events like traffic accidents, radioactive decay
Things whose value I don’t understand
- The empirical distribution (empirical measure following the Dirac measure)
- Degenerate PDF (Dirac delta function)
Topics that feel tacked on:
- Central limit theorem
- Linear and general transformations
- Monte Carlo Approximation
- Information Theory
- Entropy
- KL Divergence
- Mutual Information
Generative Models for Discrete Data
The whole framework:
Probability that the label is a particular class is proportional to the probability of the datapoint given that class times the base probability of that class: P(y = c | x, t) ∝ P(x | y = c, t) * P(y = c | t)
The ‘key’ is specifying the form for the probability of the ‘class conditional density’ (the probability of the datapoint given that the class is what it is, along with the parameters). The ‘class conditional density’ specifies what kind of data we expect to see in each class.
- Bayesian Concept Learning
- Posterior Predictive Distribution - output probability over hypotheses
- Discovering the data generating distribution from the data as induction
- Likelihood
- Prior
- Posterior (Likelihood * Prior), normalized
- MAP Estimate - The mode of the posterior, the hypothesis where the distribution has peaked.
- Beta-Binomial Model
- Likelihood
- Prior
- Conjugate Prior
- Bernoulli’s conjugate prior is the beta distribution.
- Prior parameters as hyper parameters
- Posterior
- Posterior mean and mode
- Posterior Variance
- Posterior predictive distribution
- Overfitting, black swan paradox
- Predicting the outcome of multiple future trials
- The Dirichlet-Multinomial Model
- Likelihood
- Prior
- Posterior
- Posterior predictive
- Naive Bayes
- Assumed Independence (conditional independence)
- Model Fitting
- MLE for Naive Bayes Classifier
- Bayesian Naive Bayes
- Prediction / Inference
- Log-sum-exp trick
- Feature selection with mutual information
Thoughts
- Priors aren’t subjective, they’re based on experienced data. Often they’re the result of structural transfer - take simplicity as an inductive bias that you fit into your prior, as something that has worked many times in the past. The only reason priors look subjective is because we don’t do metalearning or lifelong learning.
Probability Theory
Introduction
There are frequentist and bayesian approaches to interpreting probability theory. The frequentist interpretation is in terms of long-run frequencies over multiple identical events. The bayesian interpretation is a measure of the uncertainty around a belief. There are many situations we care about modeling where the interpretation involving identical repeated events does not make sense, and so the bayesian perspective will be used in this book. The mathematics turns out to be similar either way.
Discrete Random Variables
The probability of an event is given by p(A), which is between 0 and 1 and is larger as the event becomes more likely. We can create a probability distribution over events p(X = x) which is a probability mass function in the case that the events are discrete.
A state space captures the possible events (states) and a pmf assigns those states each a probability. An indicator function returns 1 or 0 when a condition is true or false.
Unions and Joint Probability
The probability of a union is P(A U B) = P(A) + P(B) - P(A intersect B). (Inclusion exclusion)
Joint probabilities follow the product rule: P(A intersect B) = P(A | B) * P(B). The marginal probability of A sums over every value of B: P(A) = SUM over b in B { P(A | B = b) * P(B = b) }, where each term contributes the probability of A for one particular value of B. This is the sum rule, or the rule of total probability.
If you want to figure out the probability of the intersection of a large set of events, you can apply the product rule repeatedly to get the chain rule of probability, where each new event's probability is conditioned on the events 'before it' in the set: P(X1, ..., Xn) = P(X1) * P(X2 | X1) * ... * P(Xn | X1, ..., Xn-1).
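A minimal sketch of the product and sum rules on a toy joint table (the numbers are made up for illustration, not from the text):

```python
# Toy joint distribution over two binary events A and B.
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def marginal_B(b):
    """Sum rule: p(B=b) = sum over a of p(A=a, B=b)."""
    return sum(p for (a, bb), p in joint.items() if bb == b)

def conditional_A_given_B(a, b):
    """Conditional probability: p(A=a | B=b) = p(A=a, B=b) / p(B=b)."""
    return joint[(a, b)] / marginal_B(b)

# Product rule check: p(A, B) == p(A | B) * p(B) for every cell.
for (a, b), p in joint.items():
    assert abs(conditional_A_given_B(a, b) * marginal_B(b) - p) < 1e-12
```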
Conditional Probability and Bayes Rule
The probability of an event conditional on another event is the probability of their intersection divided by the probability of the event being conditioned on: P(A|B) = P(A intersect B) / P(B). Adding the product rule to this gives us Bayes rule, where P(A|B) = P(B|A) * P(A) / P(B), with P(B) = SUM over a in A { P(B | A = a) * P(A = a) }.
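A sketch of Bayes rule on the classic diagnostic-test setup (all the numbers here are hypothetical, chosen only to illustrate the mechanics):

```python
# Hypothetical numbers: A = "has disease", B = "test is positive".
p_disease = 0.01            # prior p(A)
p_pos_given_disease = 0.9   # likelihood p(B | A)
p_pos_given_healthy = 0.05  # p(B | not A)

# Denominator via the rule of total probability:
# p(B) = p(B | A) p(A) + p(B | not A) p(not A)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes rule: p(A | B) = p(B | A) * p(A) / p(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
```

Despite the 90% sensitive test, the posterior is only about 15%, because the prior is so low.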
Unconditional and Conditional Independence
Independence means that the probability of the intersection of the events is equal to the product of those events’ probability. That means that when you condition one on the other, you get an equivalent probability mass as when you don’t condition.
We can introduce conditional independence, where two events may not be independent of one another directly, but when you bring in the influence of another event the dependence goes away. So conditioning on that event in both leads to equal probabilities.
Continuous Random Variables
We can define probabilities over a range by looking at the probability that a random variable is less than a given value in a continuous range. This probability will increase monotonically as we increase the value we’re comparing to in the range - this function is referred to as a cumulative distribution function. We differentiate this function to get a probability density function, which we can integrate over (within bounds) to get the probability of a random variable falling within the region in question.
Quantiles
We can invert the cumulative distribution function which will let us map proportions of the space (between 0 and 1) to values of the distribution. This inverse cumulative distribution function can give us the median, first and third quartile values, and split the distribution into ranges. We can also use it to compute the probability in the tails of the distribution.
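The inverse CDF described above is available directly in the Python standard library, which makes for a quick sketch (standard normal chosen as an example):

```python
from statistics import NormalDist

# Standard normal; inv_cdf is the inverse CDF (quantile function).
dist = NormalDist(mu=0.0, sigma=1.0)

median = dist.inv_cdf(0.5)   # 0.0 for a symmetric distribution
q1 = dist.inv_cdf(0.25)      # first quartile
q3 = dist.inv_cdf(0.75)      # third quartile

# Tail-area probability: mass outside the central 95% interval.
lo, hi = dist.inv_cdf(0.025), dist.inv_cdf(0.975)
tail_mass = dist.cdf(lo) + (1 - dist.cdf(hi))
```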
Mean and Variance
The mean / expected value in a discrete context is just the sum of the probability of each value multiplied with the value itself. In a continuous context, it’s the integral over the value multiplied by the probability over every point in the distribution.
The variance is the expected squared deviation from the mean: averaged over all datapoints in a discrete context, and an integral over the squared deviation in a continuous context. Equivalently, var(X) = E[X²] − (E[X])²: square first and average over those values, then subtract the square of the mean.
The standard deviation is the square root of the variance, and is usefully in the same units as the original data.
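Both routes to the variance can be checked against each other on a small discrete pmf (the numbers are illustrative):

```python
# Discrete pmf over a few values.
pmf = {1: 0.2, 2: 0.5, 3: 0.3}

# Mean: sum of each value weighted by its probability.
mean = sum(x * p for x, p in pmf.items())

# Definition: E[(X - mean)^2]
var_def = sum(p * (x - mean) ** 2 for x, p in pmf.items())

# Shortcut: E[X^2] - (E[X])^2
var_shortcut = sum(p * x * x for x, p in pmf.items()) - mean ** 2

# Standard deviation: square root of the variance, same units as the data.
std = var_def ** 0.5
```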
Discrete Distributions
Binomial and Bernoulli
The binomial has some number n of events that either succeed or fail (binary). Its pmf is (n choose k) * p^k * (1-p)^(n-k), using the binomial coefficient (n choose k), the number of ways to choose k items from n. When n = 1, we have the bernoulli.
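The binomial pmf is short enough to write directly (a sketch, using the stdlib `comb` for the binomial coefficient):

```python
from math import comb

def binomial_pmf(k, n, p):
    """p(X = k) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def bernoulli_pmf(k, p):
    """Bernoulli is just the n = 1 case of the binomial."""
    return binomial_pmf(k, 1, p)
```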
Multinomial and Multinoulli
The multinomial distribution models a sequence of draws from a distribution that has multiple possible outcomes. The case where there’s a single draw is referred to as the multinoulli.
The pmf involves n choose not just k, but x1 through xk. This 'multinomial coefficient' is multiplied by the product of each category's probability raised to the number of times that category shows up.
Poisson Distribution
This distribution's pmf is lambda^x * e^(-lambda) / x! - for large x the factorial in the denominator dominates, and the probability gets close to 0. Used to model rare events, or events that drop off sharply in probability.
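The Poisson pmf as a sketch, directly from the formula above:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """p(X = k) = exp(-lambda) * lambda^k / k!"""
    return exp(-lam) * lam ** k / factorial(k)

# lambda sets both the center and the spread: mean = variance = lambda.
```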
Empirical Distribution
This distribution is a function of the dirac measure, binary where the value is 1 if a datapoint is in a given set and 0 if it is not. It can be thought of as a histogram, whose height is determined by the occurrence of data points, and whose value is 0 if there are no points in the data set with a given value.
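A sketch of the empirical distribution as a normalized histogram over a made-up sample:

```python
from collections import Counter

def empirical_pmf(data):
    """Probability of each value = fraction of datapoints taking that value."""
    counts = Counter(data)
    n = len(data)
    return {x: c / n for x, c in counts.items()}

# Illustrative sample; values never seen get probability 0 (absent from the dict).
pmf = empirical_pmf([1, 1, 2, 3, 3, 3])
```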
Continuous Distributions
Normal Distribution
The normal distribution's pdf is given by the exponential of the negative squared difference between the data and the mean, divided by twice the variance, with a 1/sqrt(2*pi*sigma^2) normalization term.
The mean and variance are the major parameters fit from the data. It's possible for the pdf to have density greater than one if the variance is extremely small.
We like the Normal distribution because the central limit theorem tells us that sums (or means) of samples from arbitrary distributions will be approximately normal, and so it's useful for modeling noise and other second order data. It also makes minimal assumptions about the data while still having a mean and standard deviation, and so when made as an assumption it imposes the least amount of arbitrary structure on our data. (How does this square with the no free lunch theorem?)
Degenerate pdf
The Dirac delta distribution comes out of taking a gaussian and having its variance approach zero - a spike at a particular point.
The gaussian is not robust to outliers - any data points more than 2-3 sigma from its mean will distort the distribution strongly in their direction.
A fix for this comes from the Student t distribution (which, in the case where the degrees of freedom = 1, is called the Cauchy or Lorentz distribution). The Student t's degrees-of-freedom parameter leads to a pdf that is fairly robust to outliers: it will still cleanly model the main portion of the data while v (degrees of freedom) < 5. With higher degrees of freedom, it loses its outlier-robustness property and starts to approximate a Gaussian.
Laplace Distribution
Double sided exponential distribution. Robust to outliers, much like the Student t distribution is robust to outliers. Similar structure, where the pdf has a normalization term that involves dividing by the scale, and the exponential has the (absolute) deviation of the data from the mean in the numerator and the scale in the denominator.
Gamma Distribution
It’s super annoying that T isn’t explained here. I’m also getting a ton of abstract knowledge without any real data.
The gamma distribution is normalized by the gamma function, which is an integral over a variable raised to the 'shape' argument (minus one), times a decaying exponential. The distribution itself has that 'shape' parameter and a 'rate' parameter.
The distribution is quite flexible, and other distributions (exponential, erlang distribution, chi squared distribution) are just special cases of the gamma.
The Beta Distribution
Similar structure to the Gamma distribution, where there's a beta function in the denominator. There's no exponential in the function! It looks somewhat similar to a binomial distribution, but the exponents are a function of a or of b. The two arguments, a and b, lead to huge flexibility in the shape of the distribution. It can have spikes near 0 and 1 (it's only defined between 0 and 1) if a and b are both less than one, and will be unimodal in different ways if a and b are greater than one.
Pareto Distribution
The Pareto distribution can represent long tails, or heavy tails. Power law distribution. Good for modeling things like the usage of words in the english language, or income distribution in the US.
Not the same! Two parameters, k and m, where x must be greater than some constant m, but not too much greater - k controls what 'too much greater' means. As k goes to infinity, the distribution approaches a dirac delta spiked at m.
Joint Probability Distributions
It’s important to model how variables interact, which is specified by joint probability distributions over the variables. If everything is discrete we can represent the joint distribution with a multidimensional array over the data. But the number of parameters we need to model joint distributions can be very high, and so we will often make simplifying assumptions like limiting the distribution to a particular functional form.
Covariance and Correlation
Covariance is a measure over two variables, where we multiply their deviations from their means for each value and then take the expectation over that product of deviations across their ranges.
We can construct a covariance matrix over a dataset by looking at the covariance between each pair of features in the dataset. Much like a correlation matrix, we get a symmetric matrix.
The correlation is the normalized covariance, where we take the covariance and divide it by the product of the two standard deviations (this is, imo, a pretty terrible definition for intuition). The reality is that correlation measures the linear relationship between the variables. We can construct a correlation matrix by measuring the correlation between each pair of features in a dataset.
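A sketch of both definitions over paired samples, which also shows the "uncorrelated but not independent" point below (the data vectors are made up):

```python
def covariance(xs, ys):
    """E[(X - E[X]) * (Y - E[Y])] over paired samples (population form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Covariance normalized by the product of the two standard deviations."""
    sx = covariance(xs, xs) ** 0.5
    sy = covariance(ys, ys) ** 0.5
    return covariance(xs, ys) / (sx * sy)

# Perfectly linear relationship -> correlation 1.
linear_corr = correlation([1, 2, 3, 4], [2, 4, 6, 8])

# y = x^2: a strong nonlinear dependence, yet the covariance is exactly 0.
nonlinear_cov = covariance([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
```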
Mutual information is a more canonical way of measuring whether there is a relationship between variables.
If variables are independent, their covariance will be 0 and they will be uncorrelated from one another. But if they’re uncorrelated, it doesn’t mean that they’re independent - there are many nonlinear relationships that could exist between them.
Multivariate Gaussian
This is an extension of the gaussian where the mean is replaced with a vector of means that is the length D of the number of gaussians, and the variance is replaced with the covariance matrix. Often we’ll use the inverse covariance (precision matrix or concentration matrix) instead.
Multivariate Student t
Introduces sigma, a scale matrix (not quite the covariance), as well as gamma functions in the numerator and denominator. Still has degrees of freedom and fatter tails than a gaussian, which become closer to gaussian as the degrees of freedom approach infinity.
Dirchlet Distribution
The Dirichlet distribution is the multivariate generalization of the beta distribution, defined over a fairly unique support called the probability simplex, whose coordinates collectively sum to 1. The pdf still includes the beta function in the denominator, now a function of alpha (unclear how alpha relates to a and b; alpha is a vector with many values), with a product over the dimensions of x_k to the alpha_k power minus 1, where K is the number of dimensions.
The generalization of a and b to alpha looks like it puts a-b into alpha0 and has independent values for the other dimensions. If all alpha are less than 1, we get spikes at all corners of the probability simplex. Unimodal behavior when all greater than 1.
Transformation of Random Variables
Linear Transformations
If we have y, a linear function of a random variable x (y = aᵀx + b), we can derive its first two moments (mean, covariance). The mean follows from linearity of expectation: E[y] = aᵀE[x] + b.
The covariance is entirely a function of a and x's covariance (b cancels out, since it shifts the expected value the same way it shifts each individual value). In the matrix case y = Ax + b, the solution is cov(y) = A Sigma Aᵀ.
General Transformations
We can push a transformation through a discrete probability mass function by summing the mass of every x that maps to a given y. But we can't do the same substitution with a continuous density. Instead, we take the cumulative distribution function of y and differentiate it to get the corresponding probability density function; differentiating introduces a derivative term relating x to y, so that a change in x contributes a proportional change in y. This is referred to as change of variables.
Multivariate Change of Variables
We extend to the multivariate case by considering a function that takes a vector and returns another vector, and looking to its Jacobian instead of its gradient. The Jacobian is a matrix of partial derivatives, with the partial derivative of y1 with respect to x1 in the upper left, and partial derivative of yi with respect to xi in general.
We can use the jacobian to define a probability density function over the transformed variables using the jacobian, and the determinant of the jacobian.
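As a compact summary of the scalar and multivariate change-of-variables formulas described above (f is the transformation y = f(x)):

```latex
p_y(y) = p_x\big(f^{-1}(y)\big)\,\left|\frac{d}{dy}f^{-1}(y)\right|,
\qquad
p_{\mathbf{y}}(\mathbf{y}) = p_{\mathbf{x}}(\mathbf{x})\,
\left|\det\frac{\partial \mathbf{x}}{\partial \mathbf{y}}\right|
```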
Central Limit Theorem
The central limit theorem is about a sum of N independent random variables with probability density functions (that are not necessarily Gaussian). The claim is that the distribution of the sum (suitably normalized, e.g. scaled by 1/sqrt(N) around its mean) approaches a Gaussian as N increases.
Fascinating statement about metamodeling here, perhaps. That a model of models can have fairly strong assumptions built into it.
Monte Carlo Approximation
A monte carlo approximation lets us approximate properties of a distribution by sampling from it. Then we evaluate the quantity of interest on the sample: for the expected value, summing the values of the sample and dividing by the sample size; for the variance, taking the average squared deviation from the sample mean.
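A sketch of a Monte Carlo estimate for the mean and variance of a standard normal (the sample size and seed are arbitrary choices for illustration):

```python
import random

random.seed(0)  # fixed seed so the estimate is reproducible

# Draw samples from a standard normal (true mean 0, true variance 1).
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]

n = len(samples)
mc_mean = sum(samples) / n
mc_second_moment = sum(x * x for x in samples) / n
mc_variance = mc_second_moment - mc_mean ** 2
```

The estimates converge toward the true values as the sample count grows, at a rate of roughly 1/sqrt(N).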
Murphy’s always using indicator functions to embed conditional logic into equations. I always feel like it’s unfair somehow, and I’m not sure how to operate over the conditional. Same frustration for the cross entropy gradient. Often represented as an indicator function.
Information Theory
Entropy
Entropy is a measure of how dispersed a distribution is. It's concerned with the amount of information that's available, which leads to a deterministic distribution (where it's the same value every time) having zero entropy and a uniform distribution (where there are many values, each with equal probability) having maximal entropy.
KL Divergence
The Kullback–Leibler (KL) divergence is a measure of the difference between two distributions (not a true distance, since it's asymmetric). It looks at the probability of a given value in distribution p, and multiplies that by the log of the ratio of that value's probability in p vs. in q for the second distribution.
Splitting the log of that ratio into a subtraction gives us the (negative) entropy of the original distribution p and a cross entropy between distributions p and q. That cross entropy is the probability of each value under the original distribution times the log probability of that value in the other distribution.
KL divergence can be thought of as modeling the extra number of bits required to encode data from a source distribution while using a different distribution as the code book. The entropy is the expected number of bits needed if we use the true model.
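The three quantities and the identity KL(p || q) = H(p, q) − H(p) can be sketched directly (distributions are passed as probability lists over the same support):

```python
from math import log2

def entropy(p):
    """H(p) = -sum_x p(x) log2 p(x); terms with p(x) = 0 contribute nothing."""
    return -sum(px * log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x): expected bits using q's code book."""
    return -sum(px * log2(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """KL(p || q) = H(p, q) - H(p): extra bits paid for using the wrong model."""
    return cross_entropy(p, q) - entropy(p)
```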
Mutual Information
Mutual information represents the difference (KL divergence) between the joint distribution (p(X, Y)) and the marginal distribution (p(X) * p(Y)).
Evaluated at a particular point, we get the pointwise mutual information. Mutual information is the expected value over the pointwise mutual information.
There’s information about how much you learn from a given bayesian update in pointwise mutual information.
Oh, mutual information makes perfect sense - it’s the ratio between the joint distribution and the product of marginals (which would be the same if the variables were independent).
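That ratio view translates directly into code over a discrete joint table (joint distributions here are made-up examples):

```python
from math import log2

def mutual_information(joint):
    """I(X; Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) * p(y)) )."""
    # Marginals via the sum rule.
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    # Expected pointwise mutual information.
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Independent variables: joint equals product of marginals, so I(X; Y) = 0.
indep = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

# Perfectly dependent binary variables: knowing X pins down Y -> 1 bit.
dependent = {(0, 0): 0.5, (1, 1): 0.5}
```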
Mutual Information for Continuous Variables
Computing the mutual information of continuous variables generally requires discretizing or quantizing the random variables into bins.
The approach to generalizing mutual information is to bin the continuous variable at different grades (different bin sizes and locations), compute the mutual information with the discrete formula, and then return the maximal information coefficient found throughout.
MIC and mutual information generally let us discover linear and non-linear relations between variables. Every time I think to look at the correlation, I need to stop myself and look to mutual information and the maximal information coefficient.
I feel like due to using binning this coefficient will have dimensionality problems...
Generative models for discrete data
It’s nice, in a way, that bayesian modeling is about finding a suitable distributional assumption for the class conditional density. There are structural assumptions right from the get-go. It’s surprising that there aren’t meta-learning algorithms for selecting the appropriate class-conditional density.
Aw man, they frame concept learning as binary classification! All concepts are either including a thing or not! They rederived the notion of truth that Tarski came to!
These different notions of similarity are priors that humans bring to data, bring to an understanding of concepts. ‘Kinds of similarity’ may be the definition of structure that I’m looking for, to be honest!!
Hypothesis choice is also super interesting. And in a way, it's the same problem as inductive biases - an inductive bias is a choice of hypotheses to try to fit. It's the same foundational problem as in science.
If data is sampled uniformly from a concept in your hypothesis space, the probability will strongly favor the smallest hypotheses consistent with all of the data. This is a formalization of Occam's razor, where the smallest hypothesis that's consistent with the data is most likely. And that's a potential contrarian truth!!! That it's the 'smallest hypothesis', not the 'simplest hypothesis'! Is 'all even' simpler than 'all powers of two'? It is! But it's less likely. This could be a deep breakthrough in thinking about the trade off between simplicity and complexity. But I may be generalizing from one example, I need to think more.
The prior as a mechanism for bringing prior information to bear on a situation is fairly powerful. It’s a form of transfer, and it allows for inference with small amounts of data.
The bayesian 'maximum a posteriori' (MAP) estimate converges to the maximum likelihood estimate as the data gets large. And so you could think of maximum likelihood learning as bayesian learning without a prior - it uses only the likelihood. Should verify this.
That said, the bayesian approach does give you a nice distribution over possible hypotheses.
Maximum likelihood starts with overfitting, and transitions to more even predictions as it sees more data. The bayesian approach starts broad and tightens as it sees more data.
Bayes model averaging says that your predictions should take the weight that you put on each hypothesis and average their probabilities.
Abstractive summarization:
There are two simple generative models for making predictions of a discrete outcome. The beta-binomial model for binary outputs and the dirichlet-multinomial model for many outputs. Both involve conjugate priors.
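The beta-binomial conjugate update is compact enough to sketch end to end (the hyperparameters and counts below are illustrative, not from the text):

```python
# Conjugate update: prior Beta(a, b), observe N1 successes and N0 failures,
# posterior is Beta(a + N1, b + N0).
a, b = 2.0, 2.0    # prior hyperparameters (pseudo-counts)
N1, N0 = 7, 3      # observed successes and failures

post_a, post_b = a + N1, b + N0

# Posterior mean and mode (mode requires post_a, post_b > 1).
post_mean = post_a / (post_a + post_b)
post_mode = (post_a - 1) / (post_a + post_b - 2)

# Posterior predictive probability of success on the next trial equals the
# posterior mean - nonzero even for outcomes never observed, which is how
# the model 'avoids' zero probability on black-swan events.
p_next = post_mean
```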
Lots of thoughts on how to think about a prior / hypothesis set, and how they can be overwhelmed by data through the likelihood to create updates in the posterior.
Lots of arguments for the use of bayesian algorithms -
- Avoiding the overfitting of Maximum Likelihood Estimation through a prior
- ‘Avoiding’ of zero probabilities on black swan (unseen) events
- (In practice, your prior will be zero over hypotheses you didn’t think of)
- Continuous updates for online learning
Naive bayes as an easily implemented model, framing the inputs as gaussian or binomial and the output as binary (in the simple case).
Mutual information for feature selection; the log-sum-exp trick for dealing with underflow when computing the likelihood of high dimensional vectors; Lagrange multipliers for constraining outputs in the dirichlet-multinomial model to sum to 1; and, for dealing with interactions between features, a model more complex than the binary independence model that uses the Dirichlet Compound Multinomial density as the class conditional density.
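The log-sum-exp trick mentioned above fits in a few lines - the key move is factoring out the maximum before exponentiating (the log-likelihood values below are made up to show the underflow):

```python
from math import exp, log

def log_sum_exp(log_values):
    """Compute log(sum_i exp(v_i)) stably by factoring out the max:
    log sum exp(v_i) = m + log sum exp(v_i - m), with m = max(v_i)."""
    m = max(log_values)
    return m + log(sum(exp(v - m) for v in log_values))

# Naively exponentiating very negative log-likelihoods underflows to 0.0,
# making the normalizer useless; the shifted version stays finite.
log_likelihoods = [-1000.0, -1001.0, -1002.0]
```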
Gaussian Models
Geometric interpretation of the multivariate gaussian: The contours of equal probability lie along an ellipse. The eigenvectors determine the orientation of the ellipse, and the eigenvalues determine how elongated it is.
Mahalanobis distance corresponds to euclidean distance in a transformed coordinate system, shifted by mu and rotated by U.
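A sketch of the Mahalanobis distance in 2-D, with the 2x2 matrix inverse written out by hand to stay dependency-free (the covariance values in the tests are illustrative):

```python
from math import sqrt

def mahalanobis_2d(x, mu, sigma):
    """sqrt( (x - mu)^T Sigma^{-1} (x - mu) ) for a 2x2 covariance Sigma."""
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    (a, b), (c, d) = sigma
    det = a * d - b * c
    # Explicit inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    qx = inv[0][0] * dx + inv[0][1] * dy
    qy = inv[1][0] * dx + inv[1][1] * dy
    return sqrt(dx * qx + dy * qy)
```

With the identity covariance this reduces to plain euclidean distance; a large variance along one axis shrinks distances along that axis.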
Maximum likelihood for a multivariate gaussian is just the empirical means and covariance from the data.
Ah! The gaussian distribution has maximum entropy! (Subject to having a specified mean and covariance), that feels pretty fundamental.
You get gaussian discriminant analysis when a multivariate gaussian defines the class conditional density of a generative classifier. Gaussian discriminant analysis collapses to naive bayes when the covariance matrix is diagonal.
In effect it’s a nearest centroids classifier, where the centroids are the means and the distance metric is the Mahalanobis distance.
You get Linear Discriminant Analysis when the covariance matrices are tied, shared across classes.
The Boltzmann distribution in physics has the same form as the softmax, and is where terminology like 'temperature' comes from: temperature trades off concentrating on the most probable states (low temperature) against visiting states uniformly (high temperature). This also shows up in generating diversity in language modeling.
- Empirical Distribution
- The empirical distribution just makes the probability of a particular set be the fraction of datapoints that are in that set. Histograms are a visualization of the empirical distribution if you see each datapoint as having a bin (for the discrete empirical distribution).
- Probit Regression
- Inferring the parameters of a multivariate normal
- There are two major parameters, mu and sigma. Mu represents the vector of means of the MVN; sigma represents the covariance matrix. This is typically done in three parts: first computing the mean, then the covariance, then the joint distribution.
- You can use a gaussian likelihood, a conjugate prior (which in this case is a gaussian), and derive a gaussian posterior for mu.
- For the covariance, the corresponding conjugate prior is an inverse Wishart distribution.
- Expectation Maximization
- EM is an iterative algorithm which alternates between inferring the missing data values in the parameters (E step) and optimizing the parameters given the ‘filled in’ data (M step).
- The mean field method (Ising model)
- Variational Bayes
- Loopy belief propagation
- Expectation Propagation
- Rejection Sampling
- Importance Sampling
- Particle Filtering
Source: Original Google Doc