Gaussian Discriminant Analysis

High-level idea: 1) fit a multivariate Gaussian density to the positive examples and to the negative examples separately; 2) for a new sample, check whether its probability under 1) or under 2) is greater.

requirements: fit a parameter to additional information, multivariate Gaussian

See multivariate Gaussian density. If it helps, here you go:

\begin{equation} p\left(x\right) = \frac{1}{\left(2\pi\right)^{\frac{d}{2}}|\Sigma|^{\frac{1}{2}}} \exp \left(-\frac{1}{2} \left(x-\mu\right)^{T} \Sigma^{-1} \left(x - \mu\right)\right) \end{equation}

where d is the dimension of x.
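The density above can be evaluated directly; here is a minimal numpy sketch (gaussian_density is an illustrative helper name, not a library function):

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Evaluate the multivariate Gaussian density at x.

    x, mu: (d,) arrays; sigma: (d, d) positive-definite covariance.
    """
    d = x.shape[0]
    diff = x - mu
    # normalizing constant: (2*pi)^(d/2) * |Sigma|^(1/2)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm
```

As a sanity check, in one dimension with mu = 0 and sigma = 1 this recovers the standard normal density at 0, i.e. 1/sqrt(2*pi).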

making predictions

Suppose p\left(y=1\right) = \phi and p\left(y=0\right) = 1-\phi. This means that:

\begin{equation} p\left(y\right) = \phi^{y} \left(1-\phi\right)^{1-y} \end{equation}

Now, using the multivariate Gaussian density, we can write:

\begin{equation} p\left(x\mid y = 0\right) = \frac{1}{\left(2\pi\right)^{\frac{d}{2}}|\Sigma|^{\frac{1}{2}}} \exp \left(-\frac{1}{2} \left(x-\mu_{0}\right)^{T} \Sigma^{-1} \left(x - \mu_{0}\right)\right) \end{equation}
\begin{equation} p\left(x\mid y = 1\right) = \frac{1}{\left(2\pi\right)^{\frac{d}{2}}|\Sigma|^{\frac{1}{2}}} \exp \left(-\frac{1}{2} \left(x-\mu_{1}\right)^{T} \Sigma^{-1} \left(x - \mu_{1}\right)\right) \end{equation}

Finally, to predict p\left(\cdot \mid x\right), we apply Bayes' rule using the p\left(y\right) and p\left(x\mid y\right) above. In particular:

\begin{align} y &= \arg\max_{y} p\left(y|x\right) \\ &= \arg\max_{y} \frac{p\left(x|y\right)p\left(y\right)}{p\left(x\right)} \\ &= \arg\max_{y} p\left(x|y\right)p\left(y\right) \end{align}
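The argmax above amounts to comparing the two joint densities; a minimal numpy sketch (predict is an illustrative name; the shared Gaussian normalizing constant is dropped since it doesn't affect the argmax):

```python
import numpy as np

def predict(x, phi, mu0, mu1, sigma):
    """Return the class y maximizing p(x|y)p(y), comparing in log space."""
    def log_joint(mu, prior):
        diff = x - mu
        # log p(x|y) up to a constant shared by both classes, plus log p(y)
        return -0.5 * diff @ np.linalg.solve(sigma, diff) + np.log(prior)
    return int(log_joint(mu1, phi) > log_joint(mu0, 1 - phi))
```

For instance, with class means on opposite sides of the origin and an identity covariance, a point near one mean is assigned to that class.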

applying Bayes' rule to make predictions yields the sigmoid function

Fun fact: applying Bayes' rule to make predictions will just hand you the sigmoid function; i.e. the predicted p\left(y=1\mid x\right) is just going to be a sigmoid. That is, assuming:

\begin{equation} \begin{cases} x|y=0 \sim \mathcal{N}\left(\mu_{0}, \Sigma\right) \\ x|y=1 \sim \mathcal{N}\left(\mu_{1}, \Sigma\right) \\ y \sim \text{Ber}\left(\phi\right) \end{cases} \end{equation}

implies that

\begin{equation} p\left(y=1|x\right) \end{equation}

is logistic. In other words, GDA makes stronger assumptions than logistic regression. double fun fact:

\begin{equation} \begin{cases} x|y=0 \sim \text{ExFam}\left(\eta_{0}\right) \\ x|y=1 \sim \text{ExFam}\left(\eta_{1}\right) \\ y \sim \text{Ber}\left(\phi\right) \end{cases} \end{equation}

will also yield a logistic posterior.

So why are we doing any of this?! If you have a large dataset with lots of noise, logistic regression will do better. If you know your data really is Gaussian, GDA will fit well with fewer samples (i.e. it is good for data-constrained regimes).

fitting

Our goal is to solve for \phi, \mu_{0}, \mu_{1}, \Sigma such that the predictions above maximize the joint likelihood of our data:

\begin{equation} \mathcal{L}\left(\phi, \mu_{0}, \mu_{1}, \Sigma\right) = \prod_{i=1}^{n} p\left(x^{(i)}, y^{(i)}; \phi, \mu_{0}, \mu_{1}, \Sigma\right) \end{equation}

in particular:

\begin{equation} \mathcal{L}\left(\phi, \mu_{0}, \mu_{1}, \Sigma\right) = \prod_{i=1}^{n} p\left(x^{(i)}|y^{(i)}\right) p\left(y^{(i)}\right) \end{equation}

And finally, applying the log trick, we can find the parameters \phi, \mu_{0}, \mu_{1}, \Sigma that achieve:

\begin{equation} \max_{\phi, \mu_{0}, \mu_{1}, \Sigma} \sum_{i=1}^{n} \log \left[p\left(x^{(i)}|y^{(i)}\right) p\left(y^{(i)}\right)\right] \end{equation}

If you do the derivative thing (set each partial derivative to zero, let it go a bunch of brrr) and solve, we obtain:

\begin{equation} \phi = \frac{\sum_{i=1}^{n}y^{(i)}}{n} = \frac{\sum_{i=1}^{n} 1\left\{y^{(i)}=1\right\}}{n} \end{equation}

and the means are just the sample means within each class:

\begin{equation} \mu_{0} = \frac{\sum_{i=1}^{n} 1 \left\{y^{(i)}=0\right\} x^{(i)}}{\sum_{i=1}^{n}1 \left\{y^{(i)}=0\right\}} \end{equation}
\begin{equation} \mu_{1} = \frac{\sum_{i=1}^{n} 1 \left\{y^{(i)}=1\right\} x^{(i)}}{\sum_{i=1}^{n}1 \left\{y^{(i)}=1\right\}} \end{equation}

and the covariance is a function of these means:

\begin{equation} \Sigma = \frac{1}{n} \sum_{i=1}^{n} \left(x^{(i)}-\mu_{y^{(i)}}\right) \left(x^{(i)}- \mu_{y^{(i)}}\right)^{T} \end{equation}
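The four closed-form estimates above fit in a few lines; a minimal numpy sketch (fit_gda is an illustrative name, assuming y holds 0/1 labels):

```python
import numpy as np

def fit_gda(X, y):
    """Closed-form MLE for GDA with a shared covariance.

    X: (n, d) samples; y: (n,) array of 0/1 labels.
    """
    n = X.shape[0]
    phi = y.mean()                      # fraction of positive labels
    mu0 = X[y == 0].mean(axis=0)        # mean of class-0 samples
    mu1 = X[y == 1].mean(axis=0)        # mean of class-1 samples
    # center each sample by its own class mean, then average outer products
    mus = np.where(y[:, None] == 1, mu1, mu0)
    diff = X - mus
    sigma = diff.T @ diff / n
    return phi, mu0, mu1, sigma
```

Note that sigma pools the residuals from both classes, matching the single shared covariance in the formula above.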

“Why do we have a single covariance for all classes?” With a shared covariance matrix, the quadratic term x^{T}\Sigma^{-1}x cancels in the log-ratio of the two class densities, so the decision boundary is linear in x. Giving each class its own covariance would instead result in a quadratic, non-linear boundary.
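We can sanity-check that linearity numerically: with a shared \Sigma, the log-odds equals \theta^{T}x + \theta_{0} with \theta = \Sigma^{-1}\left(\mu_{1}-\mu_{0}\right). A sketch with made-up parameters (log_odds, theta, theta0 are illustrative names):

```python
import numpy as np

# made-up GDA parameters for the check
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
phi = 0.3

def log_odds(x):
    """log[p(x|y=1)p(y=1)] - log[p(x|y=0)p(y=0)]; shared constants cancel."""
    def log_joint(mu, prior):
        diff = x - mu
        return -0.5 * diff @ np.linalg.solve(sigma, diff) + np.log(prior)
    return log_joint(mu1, phi) - log_joint(mu0, 1 - phi)

# closed form: the quadratic terms cancel, leaving a linear function of x
theta = np.linalg.solve(sigma, mu1 - mu0)
theta0 = log_odds(np.zeros(2))
for x in [np.array([1.0, -1.0]), np.array([3.0, 2.0])]:
    assert np.isclose(log_odds(x), theta @ x + theta0)
```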
