What is an N-gram Language Model

Over the past few months, I’ve been refreshing myself on some of the concepts of Machine Learning and AI, and I’ve decided to write a few posts on those, which primarily serve as notes to myself (which is why I call my blog “Notebook”), and hopefully this will help someone else as well. This one is on the N-gram Language Model.

Some background on Language Models

How does ChatGPT produce human-like responses? There are a few parts to it, with language models being one of them. A language model (LM) is a probabilistic model that assigns probabilities to sequences of words or tokens in a given language. The goal is to capture the structure and patterns of the language in order to predict the likelihood of a particular sequence of words.

Let’s assume we have a vocabulary V and a sequence of words or tokens w_1, w_2, ..., w_n drawn from it, where n is the length of the sequence. The LM assigns a probability p to every possible sequence or ordering of words belonging to the vocabulary V. The probability of a sequence can be expressed as p(w_1, w_2, ..., w_n).
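In practice, this joint probability is broken down with the chain rule, so that each word is conditioned on all the words before it:

p(w_1, w_2, ..., w_n) = p(w_1) · p(w_2 | w_1) · p(w_3 | w_1, w_2) · ... · p(w_n | w_1, ..., w_{n-1})

The n-gram models described below make this tractable by truncating that history to a fixed-size window.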

An Example

Assume we have a vocabulary of words or tokens V = {chase, the, tiger, deer}, and after training the LM might have the following probabilities p assigned to some five-word sequences built from it.

  • p(chase, the, tiger, the, deer) = 0.0001
  • p(the, chase, tiger, the, deer) = 0.003
  • p(chase, the, deer, the, tiger) = 0.0022
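To make this concrete in code, here is a tiny, hypothetical lookup-table “model” that simply hard-codes those three probabilities; the name toy_lm and its structure are mine, purely for illustration:

# A toy "language model": an explicit table mapping word sequences to probabilities
# (the numbers are the hypothetical values from the bullets above)
toy_lm = {
    ("chase", "the", "tiger", "the", "deer"): 0.0001,
    ("the", "chase", "tiger", "the", "deer"): 0.003,
    ("chase", "the", "deer", "the", "tiger"): 0.0022,
}

# The most probable ordering under this toy model
best_sequence = max(toy_lm, key=toy_lm.get)
print(" ".join(best_sequence))  # -> "the chase tiger the deer"

A real LM obviously can’t enumerate every possible sequence like this, which is exactly why we need a way to factor and approximate these probabilities.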

N-gram Language Model

N-gram models are a type of probabilistic LM used in natural language processing and computational linguistics. These models are based on the idea that the probability of a word depends only on the previous n-1 words in the sequence. The term “n-gram” refers to a consecutive sequence of n items.

For example, consider the following sentence: I love programming languages.

  • Unigram (1-gram): “I,” “love,” “programming,” “languages”
  • Bigram (2-gram): “I love,” “love programming,” “programming languages”
  • Trigram (3-gram): “I love programming,” “love programming languages”
  • 4-gram: “I love programming languages”
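Putting this together, an n-gram model approximates the chain-rule conditional p(w_i | w_1, ..., w_{i-1}) using only the previous n-1 words, and the standard way to estimate it is from relative counts in the training data:

p(w_i | w_{i-n+1}, ..., w_{i-1}) ≈ count(w_{i-n+1}, ..., w_{i-1}, w_i) / count(w_{i-n+1}, ..., w_{i-1})

For a trigram model and the sentence above, p(languages | love, programming) would be estimated as the number of times “love programming languages” appears in the training corpus divided by the number of times “love programming” appears.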

N-gram models are simple and computationally efficient, making them suitable for various natural language processing tasks. However, they have a few limitations, including the inability to capture long-range dependencies in language and the exponential increase in the number of possible word sequences (n-grams) as the value of ‘n’ increases (sparsity problem).
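To see how quickly sparsity bites: with a vocabulary of |V| words there are |V|^n possible n-grams, so even a modest vocabulary of 10,000 words admits 10,000^3 = 10^12 distinct trigrams, far more than any realistic training corpus contains, which is why most counts end up being zero.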

The basic algorithm is as follows:

  1. Tokenization: Split the input text into individual words or tokens.
  2. N-gram generation: Create n-grams by forming sequences of n consecutive words from the tokenized text.
  3. Frequency counting: Count the occurrences of each n-gram in the training data.
  4. Probability estimation: Calculate the conditional probability of each word given its previous n-1 words using the frequency counts.
  5. Smoothing (optional): Apply smoothing techniques to handle unseen n-grams and avoid zero probabilities.
  6. Text generation: Start with an initial seed of n-1 words, predict the next word based on probabilities, and iteratively generate the next words to form a sequence.
  7. Repeat generation: Continue generating words until the desired length or a stopping condition is reached.

Sample Code in Python for a Basic N-gram Model

import random

class BasicNGramLanguageModel:
    def __init__(self, n):
        self.n = n
        self.ngrams = {}                           # maps each n-gram tuple to its count
        self.start_tokens = ['<start>'] * (n - 1)  # padding so early words get a full context

    def train(self, data_set):
        for sentence in data_set:
            # Pad with n-1 start tokens so training contexts match those used at generation time
            tokens = self.start_tokens + sentence.split() + ['<end>']
            for i in range(len(tokens) - self.n + 1):
                ngram = tuple(tokens[i:i + self.n])
                self.ngrams[ngram] = self.ngrams.get(ngram, 0) + 1

    def generate_text(self, seed_text, length=10):
        seed_tokens = seed_text.split()
        # Left-pad the seed only if it is shorter than the context size (n-1)
        padding = ['<start>'] * max(self.n - 1 - len(seed_tokens), 0)
        generated_text = padding + seed_tokens
        current_ngram = tuple(generated_text[-(self.n - 1):])

        for _ in range(length):
            # Collect every observed continuation of the current context, with its count
            candidates = [(ngram[-1], count) for ngram, count in self.ngrams.items()
                          if ngram[:-1] == current_ngram]
            if not candidates:
                break  # unseen context: nothing to sample from
            words, counts = zip(*candidates)
            # Sample the next word in proportion to how often it followed this context in training
            next_word = random.choices(words, weights=counts, k=1)[0]
            if next_word == '<end>':
                break
            generated_text.append(next_word)
            current_ngram = tuple(generated_text[-(self.n - 1):])

        # Drop the start-token padding but keep the seed words in the output
        return ' '.join(token for token in generated_text if token != '<start>')

# Sample Training Data
training_data = [
    "This is a simple example.",
    "The example demonstrates an n-gram language model.",
    "N-grams are used in natural language processing.",
    "I love programming languages",
    "Python is a great programming languages",
    "Learning new programming languages and software development concepts is awesome"
]

ngram_size = 3 # Change n-gram order here

# Example usage with seed text
model = BasicNGramLanguageModel(ngram_size)
model.train(training_data)

test_seed_text = "Learning"  # You can change the seed text here
test_generated_text = model.generate_text(test_seed_text, length=1)
print("Seed text:", test_seed_text)
print("Generated text:", test_generated_text)

Large Language Models (LLMs)

Unlike simple n-gram models that count fixed-length word sequences, LLMs embed tokens into high-dimensional vectors, allowing them to capture nuanced meanings and analogies even for words or phrases they have never seen together. Modern LLMs, such as transformer-based models like GPT-3, utilise deep learning techniques to capture intricate patterns and dependencies in the data.

Let’s discuss the Transformer Architecture in a later post.
