
ML - Transformer

aspe 2021. 12. 14. 06:02

Problems of Attention-Based Models

Because of the forget gate in the underlying RNN, the long-term dependency problem still remains.

Self-attention within the source sentence itself and within the target sentence itself is ignored.

RNNs are hard to parallelize.

Transformer

It has self-attention in both the encoder and the decoder.

Multi-Head Attention

If the softmax is applied to the whole Query, Key, Value space at once, distinct features become less prominent.

So the softmax is applied several times over smaller ranges, one per head, as in the sketch below.
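A minimal NumPy sketch of this idea. The shapes, the head count, and the omission of the learned projection matrices (W_Q, W_K, W_V, W_O) are simplifying assumptions for illustration, not details from the lecture:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, num_heads):
    # Q, K, V: (seq_len, d_model); split d_model into num_heads smaller sub-spaces
    seq_len, d_model = Q.shape
    d_k = d_model // num_heads
    heads = []
    for h in range(num_heads):
        s = slice(h * d_k, (h + 1) * d_k)
        q, k, v = Q[:, s], K[:, s], V[:, s]
        # scaled dot-product attention: softmax over one small range (one head)
        scores = q @ k.T / np.sqrt(d_k)
        heads.append(softmax(scores) @ v)
    # concatenate the head outputs back to (seq_len, d_model)
    return np.concatenate(heads, axis=-1)

x = np.random.randn(5, 64)                         # toy input: 5 tokens, d_model = 64
out = multi_head_attention(x, x, x, num_heads=8)   # self-attention: Q = K = V = x

In the real Transformer each head also has its own learned linear projections before the attention and a final output projection after the concatenation; this sketch only shows the "several smaller softmaxes" idea.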

Layer Normalization & Residual Connection

Normalize each layer's activations per example, across the feature dimension (not across the batch). -> Avoids Internal Covariate Shift (ICS)

Residual connection -> Easy to see how the input was changed (the layer only learns the residual), ensemble effect
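A small sketch of how a sub-layer combines the residual connection with layer normalization (post-norm order, as in the original Transformer paper; the shapes and the toy sub-layer are illustrative assumptions):

import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize across the feature dimension of each token, not across the batch
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def sublayer(x, f):
    # residual connection: f only has to learn how to change x
    return layer_norm(x + f(x))

x = np.random.randn(5, 64)
out = sublayer(x, lambda h: h @ np.random.randn(64, 64))  # toy sub-layer f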

Positional Encoding

The multi-head attention network cannot naturally make use of the position of the words in the input sequence.

Because of this, the same words in a different order would produce the same result.

So the Transformer adds a positional encoding to the input embeddings.
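A sketch of the sinusoidal positional encoding from the original Transformer paper (the sequence length and model dimension below are toy values):

import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

embeddings = np.random.randn(5, 64)          # toy token embeddings
x = embeddings + positional_encoding(5, 64)  # position information injected into the input

Because the encoding is added to the embeddings, the same words in different positions now produce different inputs, so word order is no longer ignored.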

All the photos are from Professor Hak-soo Kim's lecture at Konkuk University's Department of Computer Engineering.