that they do not distinguish between different orderings of a sequence; e.g., “The box is on the chair” and “The chair is on the box” would be treated as the same sequence and produce the same output. A notion of order can be added by creating a positional embedding that encodes the position of each token in the sequence. There are two primary kinds of positional embeddings: fixed and learned (Zakka, 2023). Fixed positional embeddings encode position with a fixed function, typically a sinusoidal one, and are not updated during training, which makes them useful for handling sequences much longer than those seen during training. Learned positional embeddings instead use a trainable embedding layer to learn positional encodings up to a fixed maximum length. Because our model generates captions whose maximum length is shorter than the training sequences, extrapolating to longer sequences is not a concern, so we chose a learned positional embedding layer. This positional embedding is added to the token embedding, augmenting the semantic information captured by the token embedding with sequential information.

Each block of the decoder consists of a self-attention layer, a cross-attention layer, and a feed-forward network. Although omitted from the diagram in Figure 7 for simplicity, there are also residual connections after the self-attention and cross-attention layers, as well as normalization layers after each of the three main components. For each block, the sequential embedding is the input to the self-attention layer. In a self-attention layer, the input serves as the query, key, and value, which has the practical effect of allowing tokens in the sequence to “attend” to other tokens in the same sequence and update the sequential embedding with richer contextual information about relationships and dependencies between different parts of the sequence.
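As a rough illustration of this structure, the sketch below shows how the pieces fit together in PyTorch. The class names, hyperparameter values, ReLU activation, and optional attention mask are illustrative assumptions rather than details of our actual implementation; the block follows the residual and normalization placement described above, with a learned positional embedding added to the token embedding.

```python
import torch
import torch.nn as nn


class CaptionDecoderBlock(nn.Module):
    """Illustrative decoder block: self-attention, cross-attention, and a
    feed-forward network, with residual connections after the two attention
    layers and a normalization layer after each of the three components."""

    def __init__(self, embed_dim: int, num_heads: int, ffn_dim: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ffn_dim),
            nn.ReLU(),
            nn.Linear(ffn_dim, embed_dim),
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.norm3 = nn.LayerNorm(embed_dim)

    def forward(self, x, encoder_out, attn_mask=None):
        # Self-attention: the same sequence is used as query, key, and value,
        # so caption tokens attend to other tokens in the caption.
        attn_out, _ = self.self_attn(x, x, x, attn_mask=attn_mask)
        x = self.norm1(x + attn_out)          # residual connection + norm
        # Cross-attention: caption tokens attend to the encoder output
        # (e.g., the image features).
        attn_out, _ = self.cross_attn(x, encoder_out, encoder_out)
        x = self.norm2(x + attn_out)          # residual connection + norm
        return self.norm3(self.ffn(x))        # feed-forward + norm


class CaptionDecoder(nn.Module):
    """Token embedding plus a learned positional embedding, followed by a
    stack of decoder blocks. Sizes here are placeholders."""

    def __init__(self, vocab_size, max_len, embed_dim=256, num_heads=8,
                 ffn_dim=1024, num_blocks=2):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        self.pos_embed = nn.Embedding(max_len, embed_dim)  # learned positions
        self.blocks = nn.ModuleList(
            [CaptionDecoderBlock(embed_dim, num_heads, ffn_dim)
             for _ in range(num_blocks)]
        )

    def forward(self, tokens, encoder_out, attn_mask=None):
        positions = torch.arange(tokens.size(1), device=tokens.device)
        # The positional embedding is added to the token embedding so the
        # semantic representation also carries sequence-order information.
        x = self.token_embed(tokens) + self.pos_embed(positions)
        for block in self.blocks:
            x = block(x, encoder_out, attn_mask=attn_mask)
        return x
```

Here `encoder_out` stands in for the sequence of features produced by the encoder; in a full model, a causal mask would typically be passed as `attn_mask` during training so tokens cannot attend to later positions.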