Multi-Head Attention
In the Transformer architecture, multi-head attention runs several attention mechanisms in parallel. The representation of each element in the sequence is split into equal-sized "heads", the attention mechanism is run on each head independently, and the per-head results are concatenated and projected back to the model dimension. Multi-head attention allows for more nuanced context embeddings, since each element can distribute its attention across different elements in each head; a sketch of the computation follows below.
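
The sketch below is a minimal, self-contained illustration of these steps in NumPy, not a reference implementation. The names (multi_head_attention, W_q, W_k, W_v, W_o, num_heads, d_model) and the use of a single square projection matrix per role are assumptions chosen for brevity.

```python
# A minimal sketch of multi-head attention, assuming learned projection
# matrices W_q, W_k, W_v, W_o of shape (d_model, d_model) and an input
# X of shape (seq_len, d_model). Names and shapes are illustrative.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads  # each head gets an equal-sized slice

    # Project the inputs, then split each projection into heads:
    # (seq_len, d_model) -> (num_heads, seq_len, d_head)
    def split_heads(M):
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q = split_heads(X @ W_q)
    K = split_heads(X @ W_k)
    V = split_heads(X @ W_v)

    # Scaled dot-product attention, run independently on every head
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)                    # attention weights per head
    heads = weights @ V                                   # (heads, seq, d_head)

    # Aggregate: concatenate the heads and apply the output projection
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

# Usage: 4 heads over a toy sequence of 5 tokens with d_model = 8
rng = np.random.default_rng(0)
d_model, seq_len, num_heads = 8, 5, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)  # (5, 8): one d_model-sized context embedding per element
```

Because the attention weights are computed separately for every head, each head can assign its weight to a different subset of the sequence, which is what gives each element's aggregated embedding its more nuanced view of the context.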