Transformers - Natural Language Processing

  • Originally introduced by Google for language translation
    • Models like GPT are trained to take in text and predict the next word as output
  • Predicts in the form of a probability distribution > samples a new word from that distribution and appends it > repeats the process to predict the next word
    • MLP (Multi-layer Perceptron) / Feed Forward Layer
      • All vectors go through the same operation in parallel; this is where the model stores facts
      • It is like asking each vector a long list of questions and updating it based on the answers
      • Steps
        • Multiply the vector by a big matrix whose values are model parameters learned from data, the Up-projection
        • Add a bias vector
        • Apply ReLU to turn all negative values into 0; the resulting entries act like neurons, active when positive
        • Multiply again by the Down-projection matrix to map back to the original dimension
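A minimal sketch of these four MLP steps in NumPy, using toy dimensions and random stand-in matrices (GPT-3's real sizes are 12,288 and 49,152, and in the real model the parameters are learned, not random):

```python
import numpy as np

# Toy dimensions so the sketch runs instantly; GPT-3 uses d_model = 12288, d_hidden = 49152.
d_model, d_hidden = 8, 32

rng = np.random.default_rng(0)
W_up = rng.normal(size=(d_hidden, d_model))    # up-projection matrix (learned parameters in a real model)
b_up = rng.normal(size=d_hidden)               # bias vector
W_down = rng.normal(size=(d_model, d_hidden))  # down-projection matrix

def mlp(vector):
    hidden = W_up @ vector + b_up       # 1-2: up-project ("ask a long list of questions") and add the bias
    hidden = np.maximum(hidden, 0.0)    # 3: ReLU - negatives become 0, positive entries are the active "neurons"
    return W_down @ hidden              # 4: down-project back to the embedding dimension

# Every token's vector goes through the same function, in parallel;
# in the full model the result is added back onto the input vector.
x = rng.normal(size=d_model)
print(mlp(x).shape)   # (8,)
```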
    • Repeating the Attention & MLP steps multiple times
  • GPT-3 => Total weights ≈ 175 B, organized into just under 28 K matrices
    • Embedding => 12288 x 50257 = 617 M weights
      • Input broken into little chunks called Tokens
        • A token can be a piece of a word, punctuation, whitespace, or a special character
      • Embedding => Turns tokens into Vectors
        • Each vector encodes information about the meaning of that token and its position in the context
          • The direction of the vector can carry semantic meaning
          • Words with similar meanings map to vectors that sit close to each other in the space
          • Initially, each vector carries no information about its surrounding tokens
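A tiny illustration of "similar meanings sit close together", comparing the directions of hand-picked 3-dimensional toy vectors (real embeddings are 12,288-dimensional and learned; the words and numbers here are arbitrary examples):

```python
import numpy as np

def cosine_similarity(a, b):
    # Close to 1 when the vectors point in a similar direction; near 0 when unrelated.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hand-picked toy vectors, purely for illustration.
cat   = np.array([0.9, 0.1, 0.0])
dog   = np.array([0.8, 0.2, 0.1])
piano = np.array([0.0, 0.1, 0.9])

print(cosine_similarity(cat, dog))    # high: similar meanings, similar directions
print(cosine_similarity(cat, piano))  # low: unrelated meanings, different directions
```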
        • Embedding Matrix => the first matrix; it starts from random values that are then learned during training (see the lookup sketch below)
          • Columns representing predefined vocabulary of 50257 tokens
          • Rows representing 12288 dimensions
      • Context Size => No. of vectors the network can process at a time
        • GPT3 has a context size of 2048
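A sketch of the embedding lookup and the context-size limit, with shrunk sizes and made-up token IDs (GPT-3's real sizes are a 50,257-token vocabulary, 12,288 dimensions, and a context size of 2,048; the real matrix is learned, not random):

```python
import numpy as np

# Shrunk sizes so this runs instantly; GPT-3: d_model = 12288, vocab_size = 50257, context_size = 2048.
d_model, vocab_size, context_size = 16, 1000, 8

rng = np.random.default_rng(0)
W_E = rng.normal(size=(d_model, vocab_size))   # embedding matrix: one column per token in the vocabulary

token_ids = [17, 402, 93, 402]                 # made-up token IDs for a short passage
assert len(token_ids) <= context_size          # the network only processes context_size vectors at a time

embeddings = W_E[:, token_ids]                 # each token ID selects one column -> shape (d_model, n_tokens)
print(embeddings.shape)                        # (16, 4)
# A real model also folds each token's position into these vectors.
```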
      • Adjust the embeddings to add contextual meaning to each token by absorbing meaning from its surroundings and changing its direction in the space
      • Self-Attention head => GPT-3 uses 96 attention heads inside each block and has 96 layers
        • Attention pattern
          • The dot product of each Key-Query pair gives a score for how relevant each token is to updating the meaning of every other token
          • Its size is equal to the square of the Context size
          • Larger positive value means more related
          • During training the model also predicts the next token after every shorter subsequence of the passage
            • So the later tokens must not influence the earlier tokens
            • Masking => Turn all the values below the diagonal to 0 by setting them to -∞ before applying Softmax
          • Use the Softmax function to normalize each value along the column, turning the scores into weights
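A sketch of the masking and column-wise softmax on a small score matrix (random numbers stand in for the real key-query dot products); rows correspond to keys and columns to queries, so the entries below the diagonal are the ones where a later token would influence an earlier one:

```python
import numpy as np

n_tokens = 5
rng = np.random.default_rng(0)
scores = rng.normal(size=(n_tokens, n_tokens))   # stand-in for the key . query dot products
                                                 # rows: keys, columns: queries

# Masking: send everything below the diagonal to -inf before the softmax,
# so those entries become exactly 0 afterwards.
mask = np.tril(np.ones((n_tokens, n_tokens), dtype=bool), k=-1)
scores[mask] = -np.inf

# Column-wise softmax turns each column of scores into weights that sum to 1.
exp_scores = np.exp(scores - scores.max(axis=0, keepdims=True))
attention_pattern = exp_scores / exp_scores.sum(axis=0, keepdims=True)

print(attention_pattern.round(2))       # masked entries are 0
print(attention_pattern.sum(axis=0))    # [1. 1. 1. 1. 1.]
```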
    • Key => 12288 x 128 ≈ 1.57 M weights per head => ≈ 151 M per block => ≈ 14.5 B across all layers
      • Key Vector => 128-dimensional
        • The Key matrix is multiplied by each embedding to generate its key vector
        • It is like an answer to the query => when a key aligns with a query in the space, the key's embedding attends to the query's embedding
        • The values of the matrix depend on the type of context the head is trying to identify
    • Query => 12288 x 128 ≈ 1.57 M weights per head => ≈ 151 M per block => ≈ 14.5 B across all layers
      • Query Vector => 128 dimensional
        • The Query matrix is multiplied by each embedding to generate its query vector
    • Value & Output => 128 x 12288 + 12288 x 128 ≈ 3.1 M weights per head => ≈ 302 M per block => ≈ 29 B across all layers
      • Multiply the Value matrix by the embedding of each word labeling the rows, giving one value vector per word
      • The value vectors are multiplied by their respective attention weights in that word's column
      • These rescaled value vectors in the column are added together to get the change that is added to the original embedding
      • Factoring the map into a Value-down and an Output-up matrix constrains the overall value map to a low-rank transformation
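A sketch of one full attention head tying the Query, Key, Value, and Output matrices together, with toy sizes and random stand-in matrices (GPT-3 uses 12,288-dimensional embeddings and a 128-dimensional key/query/value space; the square-root scaling of the scores is standard practice, though not mentioned in these notes):

```python
import numpy as np

# Toy sizes; GPT-3: d_model = 12288, d_head = 128.
d_model, d_head, n_tokens = 12, 4, 5
rng = np.random.default_rng(0)

E = rng.normal(size=(d_model, n_tokens))   # one embedding column per token

W_Q = rng.normal(size=(d_head, d_model))   # Query matrix
W_K = rng.normal(size=(d_head, d_model))   # Key matrix
W_V = rng.normal(size=(d_head, d_model))   # Value (down) matrix
W_O = rng.normal(size=(d_model, d_head))   # Output (up) matrix
# W_O @ W_V is a d_model x d_model map of rank at most d_head: the low-rank constraint.

Q = W_Q @ E                                # one query vector per token
K = W_K @ E                                # one key vector per token
scores = K.T @ Q / np.sqrt(d_head)         # key . query dot products (rows: keys, columns: queries)

mask = np.tril(np.ones((n_tokens, n_tokens), dtype=bool), k=-1)
scores[mask] = -np.inf                     # later tokens must not influence earlier ones

exp_s = np.exp(scores - scores.max(axis=0, keepdims=True))
pattern = exp_s / exp_s.sum(axis=0, keepdims=True)   # column-wise softmax -> attention pattern

V = W_V @ E                                # one value vector per token
delta = W_O @ (V @ pattern)                # weighted sums of value vectors, mapped back up to d_model
E_updated = E + delta                      # the change is added onto the original embeddings
print(E_updated.shape)                     # (12, 5)
```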
    • Up-projection => 49,152 x 12288 x 96 = 58 B
      • Multi-layer Perceptron
    • Down-projection => 12288 x 49,152 x 96 = 58 B
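A quick tally of the weight counts listed above, plus the unembedding matrix described just below; the grouping follows these notes, and the total lands at roughly 175 billion:

```python
# GPT-3 weight tally, following the groupings in these notes.
d_model, d_head, d_hidden = 12288, 128, 49152
vocab, heads, layers = 50257, 96, 96

counts = {
    "embedding":       d_model * vocab,                        # ~617 M
    "key":             d_model * d_head * heads * layers,      # ~14.5 B
    "query":           d_model * d_head * heads * layers,      # ~14.5 B
    "value + output":  2 * d_model * d_head * heads * layers,  # ~29 B
    "up-projection":   d_hidden * d_model * layers,            # ~58 B
    "down-projection": d_model * d_hidden * layers,            # ~58 B
    "unembedding":     vocab * d_model,                        # ~617 M
}

for name, n in counts.items():
    print(f"{name:16s} {n:>16,}")
print(f"{'total':16s} {sum(counts.values()):>16,}")            # ~175 billion
```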
    • Unembedding => 50257 x 12288 = 617 M weights
      • The meaning of the passage is distilled into the last vector of the sequence, which is used to generate a probability distribution over all possible next tokens
        • Unembedding matrix => Maps the last vector to a list of values, one for each of the 50257 tokens
          • Rows representing predefined vocabulary of 50257 tokens
          • Columns representing 12288 dimensions
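A sketch of the unembedding step with shrunk sizes and a random stand-in matrix (GPT-3's unembedding matrix is 50,257 x 12,288 and learned): the last vector is mapped to one value (logit) per token in the vocabulary:

```python
import numpy as np

d_model, vocab_size = 16, 1000                 # shrunk; GPT-3 uses 12288 and 50257
rng = np.random.default_rng(0)

W_U = rng.normal(size=(vocab_size, d_model))   # unembedding matrix: one row per vocabulary token
last_vector = rng.normal(size=d_model)         # stand-in for the final vector of the sequence

logits = W_U @ last_vector                     # one logit per possible next token
print(logits.shape)                            # (1000,)
```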
        • Softmax function => Used to normalize it into a Probability distribution
          • Takes inputs called logits and gives outputs called probabilities
          • All values will be between 0 & 1 and add up to 1
          • Take the exponential of each number, then divide each by the sum of all the exponentials
          • Temperature (T) => The logits are divided by T before taking the exponential
            • When T is larger, more weight is given to the lower values, making the distribution more uniform
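A sketch of softmax with temperature on a few made-up logits, showing how a larger T flattens the distribution:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Divide by T, exponentiate, then divide by the sum so the outputs
    # lie between 0 and 1 and add up to 1.
    scaled = np.asarray(logits, dtype=float) / temperature
    exps = np.exp(scaled - scaled.max())    # subtracting the max avoids overflow without changing the result
    return exps / exps.sum()

logits = [2.0, 1.0, 0.1, -1.0]              # made-up logits for four candidate tokens

print(softmax(logits, temperature=1.0))     # most of the mass on the highest logit
print(softmax(logits, temperature=5.0))     # larger T: lower values get more weight, closer to uniform
```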
      • The sampled token is appended to the passage and all the steps are repeated (see the sketch below)
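A sketch of the overall next-token loop described in these notes, with a random stand-in for the model and made-up token IDs; a real implementation would run the embedding, attention, and MLP layers inside `transformer`:

```python
import numpy as np

vocab_size, context_size = 1000, 2048          # shrunk vocabulary; GPT-3's is 50257
rng = np.random.default_rng(0)

def transformer(token_ids):
    # Stand-in for the full model (embedding -> attention & MLP blocks -> unembedding):
    # returns one logit per vocabulary token for the next token.
    return rng.normal(size=vocab_size)

def softmax(logits, temperature=1.0):
    exps = np.exp((logits - logits.max()) / temperature)
    return exps / exps.sum()

tokens = [17, 402, 93]                         # made-up token IDs for the prompt
for _ in range(10):                            # generate 10 more tokens
    logits = transformer(tokens[-context_size:])   # only the last context_size tokens fit
    probs = softmax(logits, temperature=0.8)
    next_token = rng.choice(vocab_size, p=probs)   # sample from the probability distribution
    tokens.append(int(next_token))                 # append the new token and repeat

print(tokens)
```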