Over the past few months, I’ve been refreshing myself on some core concepts of Machine Learning and AI, and I’ve decided to write a few posts about them. They primarily serve as notes to myself (which is why I call my blog “Notebook”), but hopefully they will help someone else as well. This one is on the N-gram Language Model.
Some background on Language Models
How does ChatGPT produce human-like responses? There are a few parts to it, and the language model is one of them. A language model (LM) is a probabilistic model that assigns probabilities to sequences of words or tokens in a given language. The goal is to capture the structure and patterns of the language so that the likelihood of a particular sequence of words can be predicted.
Let’s assume we have a vocabulary $V$ and a sequence of words or tokens denoted as $w_1, w_2, \ldots, w_n$, where $n$ is the length of the sequence. The LM assigns a probability to every possible sequence (or ordering) of words drawn from the vocabulary. Using the chain rule of probability, the probability of a sequence can be expressed as follows:

$$P(w_1, w_2, \ldots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})$$
An Example
Assume we have a small vocabulary of words or tokens; after training, the LM might assign probabilities to sequences over that vocabulary, such as the ones sketched below.
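As a made-up illustration (the tiny vocabulary of “I”, “love”, “NLP” and every number below are assumptions of mine, not the output of a real model), a trained LM would assign a much higher probability to a fluent ordering of the tokens than to an unnatural one:

# Purely illustrative, hand-picked probabilities for sequences over a toy vocabulary
sequence_probabilities = {
    ("I", "love", "NLP"): 0.012,    # natural word order: relatively high probability
    ("love", "I", "NLP"): 0.0004,   # awkward word order: much lower probability
    ("NLP", "love", "I"): 0.0001,   # implausible word order: lowest probability
}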
N-gram Language Model
N-gram models are a type of probabilistic LM used in natural language processing and computational linguistics. These models are based on the idea that the probability of a word depends only on the previous n−1 words in the sequence (the Markov assumption). The term “n-gram” refers to a contiguous sequence of n items, typically words or tokens.
For example, consider the following sentence: I love programming languages.
- Unigram (1-gram): “I”, “love”, “programming”, “languages”
- Bigram (2-gram): “I love”, “love programming”, “programming languages”
- Trigram (3-gram): “I love programming”, “love programming languages”
- 4-gram: “I love programming languages”
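As a minimal sketch of how such n-grams can be produced with plain Python (the helper name extract_ngrams is my own, not from any library):

def extract_ngrams(text, n):
    """Return all n-grams (as tuples of words) from a whitespace-tokenized text."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(extract_ngrams("I love programming languages", 2))
# [('I', 'love'), ('love', 'programming'), ('programming', 'languages')]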
N-gram models are simple and computationally efficient, making them suitable for various natural language processing tasks. However, they have a few limitations, including the inability to capture long-range dependencies in language and the exponential increase in the number of possible word sequences (n-grams) as the value of ‘n’ increases (sparsity problem).
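To get a feel for the scale of the sparsity problem: with a vocabulary of $V$ words there are $V^n$ possible n-grams, so a made-up but realistic $V = 50{,}000$ and $n = 3$ already gives $50{,}000^3 = 1.25 \times 10^{14}$ possible trigrams, far more than any training corpus can cover, which is why most n-grams are never observed.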
The basic algorithm is as follows:
- Tokenization: Split the input text into individual words or tokens.
- N-gram generation: Create n-grams by forming sequences of consecutive words from the tokenized text.
- Frequency counting: Count the occurrences of each n-gram in the training data.
- Probability estimation: Calculate the conditional probability of each word given its previous words using the frequency counts (a short sketch of this step, together with smoothing, follows this list).
- Smoothing (optional): Apply smoothing techniques to handle unseen n-grams and avoid zero probabilities.
- Text generation: Start with an initial seed of words, predict the next word based on probabilities, and iteratively generate the next words to form a sequence.
- Repeat generation: Continue generating words until the desired length or a stopping condition is reached.
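As a rough sketch of the probability-estimation and smoothing steps (the function name, the argument names, and the choice of add-one/Laplace smoothing are my own assumptions for illustration, not part of the model class shown below), the conditional probability of a word given its context can be estimated from frequency counts like this:

def estimate_probability(ngram_counts, context_counts, context, word, vocab_size):
    """Add-one (Laplace) smoothed estimate of P(word | context)."""
    ngram = context + (word,)
    # Adding 1 to the count (and vocab_size to the denominator) keeps
    # unseen n-grams from getting a probability of exactly zero.
    return (ngram_counts.get(ngram, 0) + 1) / (context_counts.get(context, 0) + vocab_size)

# e.g. estimate_probability(trigram_counts, bigram_counts, ("I", "love"), "programming", vocab_size=10_000)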
Sample Code in Python for a Basic N-gram Model
import random

class BasicNGramLanguageModel:
    def __init__(self, n):
        self.n = n
        self.ngrams = {}  # maps each observed n-gram (a tuple of tokens) to its count
        self.start_tokens = ['<start>'] * (n - 1)

    def train(self, data_set):
        for sentence in data_set:
            # Pad with n-1 start tokens so the first real word also has a full context
            tokens = self.start_tokens + sentence.split() + ['<end>']
            for i in range(len(tokens) - self.n + 1):
                ngram = tuple(tokens[i:i + self.n])
                self.ngrams[ngram] = self.ngrams.get(ngram, 0) + 1

    def generate_text(self, seed_text, length=10):
        seed_tokens = seed_text.split()
        # Left-pad the seed so there are always at least n-1 tokens of context
        pad_length = max(self.n - 1 - len(seed_tokens), 0)
        generated_text = ['<start>'] * pad_length + seed_tokens
        for _ in range(length):
            context = tuple(generated_text[-(self.n - 1):])
            # Candidate next words are the final tokens of all n-grams that share this context
            candidates = [(ngram[-1], count) for ngram, count in self.ngrams.items()
                          if ngram[:-1] == context]
            if not candidates:
                break  # unseen context: stop generating
            words, counts = zip(*candidates)
            # Sample the next word in proportion to its observed frequency
            next_word = random.choices(words, weights=counts, k=1)[0]
            if next_word == '<end>':
                break
            generated_text.append(next_word)
        return ' '.join(generated_text[pad_length:])
# Sample Training Data
training_data = [
"This is a simple example.",
"The example demonstrates an n-gram language model.",
"N-grams are used in natural language processing.",
"I love programming languages",
"Python is a great programming languages",
"Learning new programming languages and software development concepts is awesome"
]
ngram_size = 3 # Change n-gram order here
# Example usage with seed text
model = BasicNGramLanguageModel(ngram_size)
model.train(training_data)
test_seed_text = "Learning" # You can change the seed text here
test_generated_text = model.generate_text(test_seed_text, length=1)
print("Seed text:", test_seed_text)
print("Generated text:", test_generated_text)
Large Language Models (LLMs)
Unlike simple n-gram models that count fixed-length word sequences, LLMs embed tokens into high-dimensional vectors, allowing them to capture nuanced meanings and analogies even for words or phrases they have never seen together. Modern LLMs, such as transformer-based models like GPT-3, utilise deep learning techniques to capture intricate patterns and dependencies in the data.
Let’s discuss the Transformer Architecture in a later post.