Transformers - Natural Language Processing
- Originally introduced by Google for translating languages
- Currently trained to take in text as input and predict the next word (token) as output
- Predicts a probability distribution over the next word > samples a word from that distribution and appends it > repeats the process to predict the following word (see the sketch below)
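A minimal sketch of this generation loop, assuming a toy vocabulary and a stand-in function `toy_next_token_distribution` in place of a real transformer forward pass:

```python
# Sketch of the predict -> sample -> append -> repeat loop described above.
# The "model" is a random stand-in, not a trained transformer.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat", "."]

def toy_next_token_distribution(tokens):
    # Stand-in for a transformer forward pass: returns a probability
    # distribution over the vocabulary for the next token.
    logits = rng.normal(size=len(vocab))
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

tokens = ["the", "cat"]
for _ in range(5):
    probs = toy_next_token_distribution(tokens)
    next_token = rng.choice(vocab, p=probs)  # sample a token from the distribution
    tokens.append(next_token)                # append it and repeat

print(" ".join(tokens))
```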
- MLP (Multi-layer Perceptron) / Feed Forward Layer
- All vectors go through the same operation in parallel; this layer is thought to store facts
- It is like asking each vector a long list of questions and updating the vector based on the answers
- Steps
- Multiply the vector by a big matrix, the up-projection (its values are model parameters learned from data)
- Add a bias vector
- Apply ReLU to turn all negative values into 0; each resulting entry acts like a neuron, active when positive
- Multiply again by the down-projection matrix (plus another bias) to map back to the original dimension (a minimal sketch of these steps follows this list)
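A minimal sketch of the MLP steps above, using toy dimensions instead of GPT-3's 12,288 -> 49,152 -> 12,288; the weight matrices here are random stand-ins for parameters learned from data:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32                      # GPT-3 uses 12,288 and 49,152

W_up = rng.normal(size=(d_hidden, d_model))    # up-projection matrix
b_up = rng.normal(size=d_hidden)               # bias vector
W_down = rng.normal(size=(d_model, d_hidden))  # down-projection matrix
b_down = rng.normal(size=d_model)

def feed_forward(x):
    h = W_up @ x + b_up          # up-project and add bias
    h = np.maximum(h, 0.0)       # ReLU: negative values -> 0 ("neurons" active when positive)
    return W_down @ h + b_down   # down-project back to the original dimension

x = rng.normal(size=d_model)     # one token's embedding vector
print(x + feed_forward(x))       # in the full model the result is added back to the vector
```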
- Repeating the Attention & MLP steps multiple times
- GPT-3 => Total weights ≈ 175 B, organized into just under 28 K matrices (tallied in the sketch after the Unembedding block below)
- Embedding => 12288 x 50257 = 617 M weights
- Input broken into little chunks called Tokens
- A token can be a piece of a word, a punctuation mark, a space, or a special character
- Embedding => Turns tokens into Vectors
- Each vector encodes information about that token and its position in the context
- Direction of the vector can have a semantic meaning
- Words with similar meanings correspond to vectors whose coordinates are close to each other in the space
- Initially each vector has no relation to its surroundings
- Embedding Matrix => the first matrix; starts with random values that are then learned from data
- Columns representing predefined vocabulary of 50257 tokens
- Rows representing 12288 dimensions
- Context Size => No. of vectors the network can process at a time
- GPT3 has a context size of 2048
- Adjust the embeddings to add contextual meaning to each token by absorbing meaning from its surroundings and changing its direction in the space (a minimal sketch of the initial token-to-vector lookup follows)
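A minimal sketch of the embedding step: each token picks out one column of the embedding matrix. Toy vocabulary and dimensions stand in for GPT-3's 50,257 tokens and 12,288 dimensions, and the matrix values are random stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["the", "cat", "sat", "on", "the", "mat"]          # toy tokenized input
token_to_id = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

d_model, vocab_size = 8, len(token_to_id)                   # GPT-3: 12,288 x 50,257
W_embed = rng.normal(size=(d_model, vocab_size))            # embedding matrix

# Each token is mapped to one column; at this point the vectors carry no
# information about their surroundings.
embeddings = np.stack([W_embed[:, token_to_id[t]] for t in tokens], axis=1)
print(embeddings.shape)   # (d_model, number of tokens in the context)
```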
- Self-Attention head => GPT-3 uses 96 attention heads inside each block, across 96 layers
- Attention pattern
- Dot product of each Key-Query pair gives a score for how relevant each token is to updating the meaning of every other token
- Its size is equal to the square of the Context size
- Larger positive value means more related
- During training, the model also predicts the next token after every initial subsequence of the passage at once
- So that later tokens do not influence the predictions for earlier tokens
- Masking => Force all values below the diagonal to 0 by setting them to -∞ before applying Softmax
- Use the Softmax function to normalize each column, turning the scores into weights (see the sketch below)
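A minimal sketch of the attention pattern, using the Key and Query matrices described in the items below. Toy dimensions stand in for GPT-3's 12,288-dim embeddings and 128-dim keys/queries, and the weights are random stand-ins; the convention follows these notes (rows index keys, columns index queries, each column normalized):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_tokens = 8, 4, 5

E = rng.normal(size=(d_model, n_tokens))   # one embedding per column
W_K = rng.normal(size=(d_head, d_model))   # Key matrix
W_Q = rng.normal(size=(d_head, d_model))   # Query matrix

K = W_K @ E                                # a key for each token
Q = W_Q @ E                                # a query for each token

# Dot product of every key-query pair, scaled by the square root of the
# key dimension (standard scaling, not covered in these notes).
scores = K.T @ Q / np.sqrt(d_head)

# Masking: entries below the diagonal (later token influencing an earlier one)
# are set to -inf so they become 0 after softmax.
mask = np.tril(np.ones((n_tokens, n_tokens)), k=-1).astype(bool)
scores[mask] = -np.inf

# Softmax along each column turns the scores into weights that sum to 1.
pattern = np.exp(scores - scores.max(axis=0, keepdims=True))
pattern /= pattern.sum(axis=0, keepdims=True)
print(np.round(pattern, 2))
```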
- Key => 12,288 x 128 = 1.57 M weights per head => 151 M per block => ≈14.5 B total across all 96 layers
- Key Vector => 128-dimensional
- The Key matrix is multiplied by each embedding to generate its key
- It is like an answer to the Query => when a key aligns with a query in the space, the key's embedding attends to the query's embedding
- The values of the matrix depend on the type of context the head is trying to identify
- Query => 12,288 x 128 = 1.57 M weights per head => 151 M per block => ≈14.5 B total across all 96 layers
- Query Vector => 128-dimensional
- The Query matrix is multiplied by each embedding to generate its query
- Value & Output => 128 x 12,288 + 12,288 x 128 ≈ 3.1 M weights per head => 302 M per block => ≈29 B total across all 96 layers
- Multiply the Value matrix by the embedding of each word along the rows, giving a value vector for each
- Each value vector is multiplied by its respective weight in that word's column
- These rescaled value vectors in the column are added together to get the change that is added to the original embedding
- Factoring the value map this way constrains it to a low-rank transformation (see the sketch below)
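A minimal sketch of the value-and-output step, continuing the conventions of the previous sketch: the low-rank value map produces a value vector per token, each column of the attention pattern weights those vectors, and the weighted sum is the change added to that column's embedding. Dimensions are toy values, the weights are random stand-ins, and a uniform matrix stands in for the real masked, column-normalized pattern:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_tokens = 8, 4, 5

E = rng.normal(size=(d_model, n_tokens))                  # embeddings, one per column
pattern = np.full((n_tokens, n_tokens), 1.0 / n_tokens)   # stand-in attention pattern
# (in the real model this comes from the masked, column-normalized key-query scores)

W_value_down = rng.normal(size=(d_head, d_model))  # 128 x 12,288 in GPT-3
W_value_up = rng.normal(size=(d_model, d_head))    # 12,288 x 128 in GPT-3 (the "output" part)

V = W_value_up @ (W_value_down @ E)   # low-rank value map applied to every embedding
delta = V @ pattern                   # each column: weighted sum of the value vectors
E_updated = E + delta                 # change added to the original embeddings
print(E_updated.shape)
```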
- Up-projection => 49,152 x 12,288 x 96 ≈ 58 B
- Down-projection => 12,288 x 49,152 x 96 ≈ 58 B
- Unembedding => 50257 x 12288 = 617 M weights
- The meaning of the passage is baked into the last vector of the sequence, which is used to generate a probability distribution over all possible next tokens
- Unembedding matrix => Maps the last vector to a list of values, one per token in the 50 K vocabulary (the logits)
- Rows representing predefined vocabulary of 50257 tokens
- Columns representing 12288 dimensions
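A quick tally of the weight counts listed in these notes (biases excluded), reproducing GPT-3's roughly 175 B total from the figures above:

```python
d_model, d_head, d_hidden = 12_288, 128, 49_152
n_heads, n_layers, vocab = 96, 96, 50_257

embedding   = d_model * vocab                              # ~617 M
key         = d_model * d_head * n_heads * n_layers        # ~14.5 B
query       = d_model * d_head * n_heads * n_layers        # ~14.5 B
value_out   = 2 * d_model * d_head * n_heads * n_layers    # ~29 B
up_proj     = d_hidden * d_model * n_layers                # ~58 B
down_proj   = d_model * d_hidden * n_layers                # ~58 B
unembedding = vocab * d_model                              # ~617 M

total = embedding + key + query + value_out + up_proj + down_proj + unembedding
print(f"{total:,}")   # 175,181,291,520
```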
- Softmax function => Used to normalize the logits into a probability distribution
- Takes inputs called logits and gives outputs called probabilities
- All values will be between 0 & 1 and adds up to 1
- Take the exponential of each number, then divide each by the sum of all the exponentials
- Temperature => Divides the logits before the exponential is taken, i.e. exp(logit / T)
- When T is larger, more weight is given to the lower values, making the distribution more uniform (see the sketch below)
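A minimal sketch of the unembedding and softmax-with-temperature steps: the last vector is mapped to one logit per vocabulary token, and softmax with a temperature in the denominator turns the logits into a probability distribution. Sizes are toy values and the unembedding matrix is a random stand-in for GPT-3's 50,257 x 12,288 learned matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 10                          # GPT-3: 12,288 and 50,257

W_unembed = rng.normal(size=(vocab_size, d_model))   # unembedding matrix
last_vector = rng.normal(size=d_model)               # final vector of the sequence

logits = W_unembed @ last_vector                     # one score per possible next token

def softmax_with_temperature(logits, T=1.0):
    z = (logits - logits.max()) / T    # subtracting the max is for numerical stability
    exp = np.exp(z)
    return exp / exp.sum()             # values between 0 and 1 that add up to 1

for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(T, np.round(probs, 2))       # larger T -> flatter, more uniform distribution
```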
- The sampled next token is appended to the passage and all the steps are repeated