stochastic gradient descent: see stochastic gradient descent
Word2Vec: see word2vec

Or, we can even use a simpler approach: window-based co-occurrence counts.

GloVe
goal: we want to capture linear meaning components in a word vector space
insight: ratios of co-occurrence probabilities encode linear meaning components
Therefore, GloVe vectors come from a log-bilinear model:
w_i · w_j = log P(i | j)

such that vector differences capture the ratios of co-occurrence probabilities:

w_x · (w_a − w_b) = log [ P(x | a) / P(x | b) ]
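To make the ratio idea concrete, here is a small sketch of window-based co-occurrence counting on a toy two-sentence corpus; the corpus, the window size, and the probe words are all illustrative assumptions, not part of the notes.

```python
# Toy sketch of window-based co-occurrence counting (the "simpler approach"
# above). Corpus and window size are made up for illustration.
from collections import defaultdict

corpus = [
    ["ice", "solid", "water"],
    ["steam", "gas", "water"],
]
window = 2  # hypothetical window radius

# counts[w][x] = number of times x appears within the window around w
counts = defaultdict(lambda: defaultdict(int))
for sent in corpus:
    for i, center in enumerate(sent):
        lo, hi = max(0, i - window), min(len(sent), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[center][sent[j]] += 1

def p(x, w):
    """Co-occurrence probability P(x | w) = count(w, x) / total count of w."""
    total = sum(counts[w].values())
    return counts[w][x] / total if total else 0.0

# GloVe's key quantity is the ratio P(x|a) / P(x|b); in a trained GloVe
# space, w_x · (w_a − w_b) approximates the log of this ratio.
ratio = p("water", "ice") / p("water", "steam")
```

Here "water" co-occurs equally with "ice" and "steam", so the ratio is 1 and its log is 0, i.e. "water" contributes no meaning component separating the two.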
Evaluating an NLP System
Intrinsic: evaluate on a specific intermediate subtask; fast to compute; helps you understand the component
Extrinsic: evaluate on a real task, e.g. by attempting to replace an older system with the new one; may be expensive to compute

Word Sense Ambiguity
Each word may have multiple different meanings, and each separate word sense should live in a different place in the vector space. However, polysemous words have related senses, so we usually average:

v = a_1 v_1 + a_2 v_2 + a_3 v_3
where a_{j} is the relative frequency of the j-th sense of the word, and v_1 … v_{3} are the vectors for the separate word senses.

sparse coding
If each sense is relatively common, then at high enough dimensions, sparse coding allows you to recover the component sense vectors from the averaged word vector, because of the general sparsity of the vector space.

Word-Vector NER
Create a window of ±n words around each word to classify; feed the concatenated embeddings of the entire window into a neural classifier, with a target saying whether the center word is a person, other entity, no label, etc. The simplest such classifiers use softmax alone, which, without other activations, gives a linear decision boundary. With a neural classifier, we can add enough nonlinearity in the middle layers to make the overall decision boundary nonlinear, but the final output layer is still a linear (softmax) classifier.
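The window-based classifier above can be sketched in a few lines of numpy; all shapes, the tanh hidden layer, and the random toy embeddings are assumptions for illustration, not the exact architecture from the notes.

```python
# Minimal sketch of a window-based neural NER classifier: concatenate the
# embeddings of the center word and its ±n neighbors, pass them through a
# nonlinear middle layer, and finish with a linear softmax layer.
import numpy as np

rng = np.random.default_rng(0)
d, n, n_classes = 4, 2, 3            # embedding dim, window radius, label count
window_len = 2 * n + 1

# Toy embedding table for a 10-word vocabulary (random for illustration)
E = rng.normal(size=(10, d))

# Parameters: the hidden layer supplies the nonlinearity;
# the output layer is still linear.
W1 = rng.normal(size=(window_len * d, 8))
b1 = np.zeros(8)
W2 = rng.normal(size=(8, n_classes))
b2 = np.zeros(n_classes)

def classify_window(word_ids):
    """Return class probabilities for the center word of a window."""
    x = E[word_ids].reshape(-1)          # concatenated window embeddings
    h = np.tanh(x @ W1 + b1)             # nonlinear middle layer
    logits = h @ W2 + b2                 # final layer is linear
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

probs = classify_window([3, 1, 4, 1, 5])  # ids of the 5-word window
```

Without the tanh layer this reduces to plain softmax regression, whose decision boundary between classes is linear; the hidden layer is what lets the classifier carve a nonlinear boundary in the embedding space.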