Let's build an Artificial Peter Drury using AI

In this blog post, we will embark on a fascinating journey into the world of Artificial Intelligence (AI) and Natural Language Processing (NLP) to create an "Artificial Peter Drury Commentator". If you're a fan of football commentary and the iconic style of Peter Drury, this project will be especially intriguing.

Step 1: Setting Up the Environment

To get started, make sure you have Python and PyTorch installed. With both in place, begin by importing the modules we'll use throughout this project:

import torch
import torch.nn as nn
from torch.nn import functional as F

Step 2: Loading the Dataset

Our model needs training data to learn from. For this project, we'll use a dataset of Peter Drury's commentaries. Save the dataset as 'drury.txt' in your working directory and load it as follows:

# load the data, Drury dataset
with open('drury.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)

Hyperparameters

batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu' # use GPU if available - for faster training!
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0


torch.manual_seed(1337)
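
A few quantities follow directly from these settings, and it can help to work them out before training. The sketch below copies the values from the list above; the names tokens_per_batch and tokens_sampled are introduced here purely for illustration, and head_size is computed the same way later inside Block.

```python
# Values copied from the hyperparameter list above.
batch_size = 16
block_size = 32
max_iters = 5000
n_embd = 64
n_head = 4

# Each attention head works on an equal slice of the embedding dimension.
head_size = n_embd // n_head

# Character positions (input/target pairs) contained in a single batch.
tokens_per_batch = batch_size * block_size

# Positions sampled over the whole run; windows are drawn at random,
# so the same stretch of text is visited many times.
tokens_sampled = tokens_per_batch * max_iters

print(head_size, tokens_per_batch, tokens_sampled)
```

Note that n_embd must be divisible by n_head for the heads to concatenate back to n_embd channels.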

Step 3: Data Preprocessing - Tokenization

Tokenization converts text into integer IDs. Here we use character-level tokenization, where each unique character gets its own ID. There are several methods for tokenization:

character-level tokenization
word-level tokenization
sub-word level tokenization

You can also use a pre-trained tokenizer such as SentencePiece or tiktoken.

# Tokenize the text - convert each character to a unique integer ID - here we are using character-level tokenization
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
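
To see what the mapping does, here is a minimal round trip on a tiny stand-in corpus. The sample string is made up for illustration; the real vocabulary comes from drury.txt and will contain different characters and IDs.

```python
# A tiny stand-in corpus (hypothetical; the real text is read from drury.txt).
sample_text = "goal! what a goal!"

# Same construction as above: sorted unique characters define the vocabulary.
chars = sorted(set(sample_text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    """Map a string to a list of integer IDs."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map a list of integer IDs back to a string."""
    return "".join(itos[i] for i in ids)

ids = encode("goal")
print(chars)        # the vocabulary, in sorted order
print(ids)          # integer IDs for "goal"
print(decode(ids))  # back to the original string
```

Because the vocabulary is built only from the training text, encoding a character that never appears in it raises a KeyError.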

Split data into train & eval sets
We split the data into a training set and a validation set. The training set is used to train the model, and the validation set is used to evaluate it. We use the first 90% of the data for training and the remaining 10% for validation.

We split the dataset because we do not want the model to simply memorize it; we want it to generalize to unseen text. Without a held-out set we would have no way to tell memorization from generalization, since the training loss keeps improving in both cases.

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # 90/10 train/test split
train_data = data[:n]
val_data = data[n:]
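
The split is a plain slice, so the validation set is the tail of the text rather than a random sample. A small sketch with a stand-in list of token IDs (the list and the small block_size are made up for illustration):

```python
# Stand-in for the encoded dataset: 40 consecutive token IDs.
data = list(range(40))
block_size = 3  # small demo value, not the tutorial's 32

n = int(0.9 * len(data))  # index where the validation part begins
train_data = data[:n]
val_data = data[n:]

print(len(train_data), len(val_data))
print(val_data)

# Each part must be longer than block_size, otherwise there is no valid
# start position from which to cut a full window plus its shifted target.
assert len(train_data) > block_size
assert len(val_data) > block_size
```

With a very small dataset it is worth checking that the validation part is still longer than block_size before training.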

Data Loading

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad() # disable gradient tracking during evaluation, which saves memory and compute
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out
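
The key detail in get_batch is that the targets are the inputs shifted one position to the right, so a single window of block_size characters yields block_size training examples. The sketch below uses plain Python lists and a fixed start offset instead of a random one, purely to make the output easy to read.

```python
# Stand-in for encoded token IDs (hypothetical values).
data = list(range(10, 20))
block_size = 4
i = 2  # fixed start offset; get_batch draws this at random

x = data[i:i + block_size]          # inputs
y = data[i + 1:i + block_size + 1]  # targets, shifted one step to the right

# Every prefix of x is a context, and the matching entry of y is its target.
for t in range(block_size):
    context = x[:t + 1]
    target = y[t]
    print(f"context {context} -> target {target}")
```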

Attention Mechanism
Our model needs to take the preceding context into account when predicting the next character. For this we use self-attention, the core mechanism of the Transformer architecture. Read more in the paper "Attention Is All You Need".

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,hs), where hs = head_size
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities"), scaled by 1/sqrt(head_size)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        # causal mask: each position may only attend to itself and earlier positions
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a small MLP: linear layer, ReLU, linear layer, then dropout """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x
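
The causal mask in Head is what stops a character from looking at characters that come after it. The sketch below isolates that step, assuming PyTorch is installed. To keep the result easy to read it sets all raw scores to zero, which is an artificial choice; in the real model they come from the query-key product.

```python
import torch
import torch.nn.functional as F

T = 4  # demo sequence length (hypothetical small value)

# Pretend every query-key score is equal so only the mask shapes the result.
scores = torch.zeros(T, T)

# Lower-triangular matrix of ones, built the same way as the 'tril' buffer.
tril = torch.tril(torch.ones(T, T))

# Future positions get -inf, which softmax turns into a weight of exactly 0.
masked = scores.masked_fill(tril == 0, float('-inf'))
weights = F.softmax(masked, dim=-1)

print(weights)
```

Each row sums to 1 and has zeros above the diagonal, so row t is an average over positions 0 to t only.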

Step 4: Building the Model

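Now, let's build the language model using PyTorch. The class keeps the name BigramLanguageModel, but because it stacks the Transformer blocks defined above, it is really a small GPT-style decoder that attends to up to block_size previous characters. This model will learn to generate text in a similar style to Peter Drury's commentary.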

class BigramLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        # token and position embeddings are summed to form the input to the Transformer blocks
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
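
A single pass through the generate loop can be reproduced on its own to check the shapes. This is a minimal sketch assuming PyTorch is installed; the logits are made-up numbers standing in for the model's output at the last time step.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Made-up scores for a 3-character vocabulary, batch size 1: shape (B, C).
logits = torch.tensor([[2.0, 0.0, -2.0]])

# Turn scores into probabilities, then draw one index per batch row.
probs = F.softmax(logits, dim=-1)                   # (B, C)
idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)

# Append the sampled index to the running sequence, as generate does.
idx = torch.zeros((1, 1), dtype=torch.long)         # (B, T) with T = 1
idx = torch.cat((idx, idx_next), dim=1)             # (B, T + 1)

print(probs)
print(idx.shape)
```

Because the next character is sampled rather than chosen by argmax, running generation twice without a fixed seed gives different text.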

Optimization
We shall use the AdamW optimizer from PyTorch


model = BigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
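
The optimizer's update cycle (clear gradients, backpropagate, step) can be seen in isolation on a one-parameter problem. This is a toy sketch, not the tutorial's model: the parameter w, the quadratic loss, and the learning rate of 0.1 are all made up for illustration.

```python
import torch

torch.manual_seed(0)

# A single trainable number; the "loss" is its square, minimized at 0.
w = torch.nn.Parameter(torch.tensor([5.0]))
optimizer = torch.optim.AdamW([w], lr=0.1)

first_loss = None
for step in range(100):
    loss = (w ** 2).sum()
    if step == 0:
        first_loss = loss.item()
    optimizer.zero_grad(set_to_none=True)  # drop gradients from the last step
    loss.backward()                        # compute d(loss)/dw
    optimizer.step()                       # update w in place

final_loss = (w ** 2).sum().item()
print(first_loss, final_loss)
```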

Step 5: Training the Model

The hyperparameters and the data loading function are already defined above, so all that remains is the training loop.

for step in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if step % eval_interval == 0 or step == max_iters - 1:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
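
When reading the printed losses, a useful reference point is the loss of a model that assigns equal probability to every character. F.cross_entropy uses the natural logarithm, so that value is ln(vocab_size). A minimal sketch with a made-up vocabulary size (substitute len(chars) from your own dataset):

```python
import math

vocab_size = 65  # hypothetical; the real value is len(chars) for drury.txt

# Cross-entropy of a uniform guess: -ln(1 / vocab_size) = ln(vocab_size).
uniform_prob = 1.0 / vocab_size
baseline = -math.log(uniform_prob)

print(f"uniform-guess loss for {vocab_size} characters: {baseline:.4f}")
```

A freshly initialized model should start somewhere in the neighborhood of this value, and a training loss well below it indicates the model has picked up structure in the text. A validation loss that rises while the training loss keeps falling is the usual sign of overfitting.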

Step 6: Generating Drury-like Commentaries

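Once the model is trained, you can use it to generate commentaries in Peter Drury's style. Provide a starting context, and the model will continue it one character at a time. Here the context is a single token with ID 0, the first character in the sorted vocabulary.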

# Generating commentaries
context = torch.zeros((1, 1), dtype=torch.long, device=device)
generated_commentary = decode(m.generate(context, max_new_tokens=2000)[0].tolist())
print(generated_commentary)
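
To start generation from a text prompt instead of a single zero token, the prompt has to be encoded into a tensor of shape (1, T). The sketch below shows only that encoding step, assuming PyTorch is installed; the sample text and prompt are made up, and in the tutorial you would use the encode function built from drury.txt.

```python
import torch

# Stand-in vocabulary (hypothetical; the real one is built from drury.txt).
sample_text = "what a goal!"
chars = sorted(set(sample_text))
stoi = {ch: i for i, ch in enumerate(chars)}

prompt = "goal"

# Wrap the ID list in an outer list to add the batch dimension: shape (1, T).
context = torch.tensor([[stoi[c] for c in prompt]], dtype=torch.long)

print(context)
print(context.shape)
```

In the tutorial's setting, a tensor built this way (and moved to device) could be passed to m.generate in place of the zero context. Every character in the prompt must appear in the training text, and only the last block_size characters of a long prompt are used, since generate crops the context.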

Conclusion

In this blog post, we've explored the process of building an "Artificial Peter Drury Commentator" using PyTorch. We've covered setting up the environment, loading and preprocessing the dataset, building the model, training it, and generating commentaries. The final result is a model capable of emulating Peter Drury's iconic commentary style.

I trained the model on just 20KB of data, and the model performance is promising. Let's see what happens when the training data increases. We can also improve the model by using another tokenization technique, such as sub-word tokenization, because character-level tokenization forces the model to learn spelling one character at a time and lets a 32-character context cover only a handful of words.

Feel free to experiment with different datasets and fine-tune the model to create your own AI commentator for various sports or events. The possibilities with AI and NLP are endless, and this project is just a glimpse of what you can achieve.

You can find all the code and resources on GitHub.

Stay tuned for more exciting AI projects and technological advancements in the future!