A brief timeline:

1990: static word embeddings
2003: neural language models
2008: multi-task learning
2015: attention
2017: transformer
2018: trainable contextual word embeddings + large-scale pretraining
2019: prompt engineering

Motivating Attention

Given a sequence of embeddings x_1, x_2, …, x_n, the goal of attention is to produce, for each x_{i}, a new embedding a_{i} based on its dot-product similarity with all the words that precede it. Let's define:

\begin{equation} score(x_{i}, x_{j}) = x_{i} \cdot x_{j} \end{equation}

This lets us write a_{i} as a weighted sum:

\begin{equation} a_{i} = \sum_{j \leq i} \alpha_{i,j} x_{j} \end{equation}

where:

\begin{equation} \alpha_{i,j} = \mathrm{softmax}_{j}\left(score(x_{i}, x_{j})\right) = \frac{\exp\left(score(x_{i}, x_{j})\right)}{\sum_{k \leq i} \exp\left(score(x_{i}, x_{k})\right)} \end{equation}
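The score, softmax, and weighted-sum equations above can be sketched directly in numpy; the function and variable names here are illustrative, not from any particular library:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x):
    """Causal dot-product attention over a sequence of embeddings.

    x: array of shape (n, d) holding x_1 ... x_n.
    Returns a of shape (n, d) with a_i = sum_{j <= i} alpha_{i,j} * x_j.
    """
    n, d = x.shape
    a = np.zeros_like(x)
    for i in range(n):
        # score(x_i, x_j) = x_i . x_j for every j <= i
        scores = x[: i + 1] @ x[i]       # shape (i + 1,)
        alpha = softmax(scores)          # normalize over j <= i
        a[i] = alpha @ x[: i + 1]        # weighted sum of the x_j
    return a
```

Note that position 1 can only attend to itself, so a_1 = x_1; later positions mix in earlier words according to their similarity weights.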

The resulting a_{i} is the output of our attention.

Attention

Building on the above, we call the input embeddings x_{j} the values, and we create separate embeddings, called keys, against which we measure similarity. The word we want the new embedding for (i.e. x_{i} from above) is called the query.
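A minimal sketch of the query/key/value split, assuming the standard formulation in which queries, keys, and values come from learned linear projections of the input; the matrices W_q, W_k, W_v below are random stand-ins for learned parameters and are not part of the derivation above:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 4  # embedding dimension (illustrative)

# Learned projection matrices in a real model; random stand-ins here.
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def qkv_attention(x):
    """Causal attention where similarity is measured between queries
    and keys, and the weighted sum is taken over values."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    a = np.zeros_like(v)
    for i in range(len(x)):
        scores = k[: i + 1] @ q[i]   # score(q_i, k_j) for j <= i
        alpha = softmax(scores)      # normalize over j <= i
        a[i] = alpha @ v[: i + 1]    # weighted sum of the values
    return a
```

When W_q = W_k = W_v = I, this reduces exactly to the simple formulation above, where queries, keys, and values are all the raw x embeddings.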
