Transformer

Transformer Architecture
https://arxiv.org/pdf/1706.03762.pdf
https://charon.me/posts/pytorch/pytorch_seq2seq_6/

Self-Attention

  • Input: $X = \{ x_1, x_2, \cdots, x_n \} \in \mathbb{R}^{n \times d}$
  • Query: $q_i = W_q x_i \in \mathbb{R}^{d_k}$
  • Key: $k_i = W_k x_i \in \mathbb{R}^{d_k}$
  • Value: $v_i = W_v x_i \in \mathbb{R}^{d_v}$
  • Attention: $a_{ij} = \text{softmax}\left(\frac{q_i^T k_j}{\sqrt{d_k}}\right) = \frac{\exp\left(\frac{q_i^T k_j}{\sqrt{d_k}}\right)}{\sum_{j=1}^n \exp\left(\frac{q_i^T k_j}{\sqrt{d_k}}\right)}$
  • Output: $y_i = \sum_{j=1}^n a_{ij} v_j$
  • Positional Encoding: $p_i = \left\{ \sin\left(\frac{i}{10000^{2 \cdot 1/d}}\right), \cos\left(\frac{i}{10000^{2 \cdot 1/d}}\right), \sin\left(\frac{i}{10000^{2 \cdot 2/d}}\right), \cos\left(\frac{i}{10000^{2 \cdot 2/d}}\right), \cdots \right\} \in \mathbb{R}^{d}$ (a code sketch of attention and positional encoding follows this list)
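
A minimal PyTorch sketch of the formulas above: single-head scaled dot-product self-attention plus sinusoidal positional encoding. The module layout and dimension names (`d_model`, `d_k`, `d_v`) are illustrative assumptions, not a specific library API; the encoding indexes frequencies from 0, as in the original paper.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, d_model: int, d_k: int, d_v: int):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_k, bias=False)  # q_i = W_q x_i
        self.W_k = nn.Linear(d_model, d_k, bias=False)  # k_i = W_k x_i
        self.W_v = nn.Linear(d_model, d_v, bias=False)  # v_i = W_v x_i
        self.d_k = d_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d_model)
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        # scores_ij = q_i^T k_j / sqrt(d_k)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        a = F.softmax(scores, dim=-1)  # a_ij: each row sums to 1
        return a @ v                   # y_i = sum_j a_ij v_j

def positional_encoding(n: int, d: int) -> torch.Tensor:
    # p_i interleaves sin/cos at frequencies 1 / 10000^(2k/d)
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)   # (n, 1)
    k = torch.arange(0, d, 2, dtype=torch.float32)            # even dims
    div = torch.exp(-math.log(10000.0) * k / d)
    pe = torch.zeros(n, d)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# usage: add positions to the inputs, then attend
x = torch.randn(2, 5, 16) + positional_encoding(5, 16)
y = SelfAttention(d_model=16, d_k=8, d_v=16)(x)
print(y.shape)  # torch.Size([2, 5, 16])
```

Note that the attention weights depend only on dot products between queries and keys; without the added positional encoding, permuting the input sequence would permute the outputs identically.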

...