#36 MyGPT: Building and Deploying a Large Language Model (LLM) with Python
Now that we've gone through the theory and some optimizations in the previous parts of this blog series, it's time to put everything together and build a complete Large Language Model (LLM) from scratch. If you haven't already, I encourage you to check out the earlier posts in this series to understand the foundational concepts and techniques:
- Part 1: Introduction to Transformers and LLMs: This post introduces the basics of Transformers, language models, and how they form the backbone of modern NLP tasks.
- Part 2: Building and Training a Transformer Model: Here, we covered how to set up a Transformer model, train it using PyTorch, and understand the training pipeline.
- Part 3: Optimizations and Real-World Applications: In this part, we explored advanced optimization techniques, regularization, and how to apply Transformer models to real-world tasks.
In this final part, we'll focus primarily on code, applying everything we've covered so far in a hands-on, practical implementation. You'll be able to build a working LLM, train it, generate text, and fine-tune it based on the principles discussed earlier. Let's dive right into it! We'll build, train, save, and use our LLM, ensuring it's fully operational. The model will be based on a Transformer architecture, trained on the Tiny Shakespeare dataset for simplicity, but you can extend it to larger datasets once you have the basic framework.
1. Setting Up the Transformer Model
Let's first set up the Transformer model. We'll build a class that uses the components we discussed earlier (embedding layer, positional encoding, self-attention, and feedforward layers).
```python
import math

import torch
import torch.nn as nn

class TransformerModel(nn.Module):
    def __init__(self, vocab_size, embed_size, num_heads, num_layers, dropout=0.2):
        super(TransformerModel, self).__init__()
        # Embedding for input tokens
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # Positional encoding to track token positions
        self.positional_encoding = PositionalEncoding(embed_size, dropout)
        # Transformer layers: stacking multiple decoder layers
        # (batch_first=True so inputs are (batch, seq_len, embed_size), matching our DataLoader)
        decoder_layer = nn.TransformerDecoderLayer(embed_size, num_heads, dim_feedforward=2048,
                                                   dropout=dropout, batch_first=True)
        self.transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers)
        # Linear layer to project output to the vocab size
        self.fc_out = nn.Linear(embed_size, vocab_size)

    def forward(self, src, tgt):
        # Embed the source and target sequences, then add positional encodings
        embed_src = self.positional_encoding(self.embedding(src))
        embed_tgt = self.positional_encoding(self.embedding(tgt))
        # Causal mask so each position can only attend to earlier positions
        seq_len = tgt.size(1)
        causal_mask = torch.triu(torch.full((seq_len, seq_len), float('-inf'), device=tgt.device),
                                 diagonal=1)
        # Pass through the transformer decoder; the embedded source acts as the memory.
        # Since src and tgt are the same sequence here, the same mask is applied to both.
        output = self.transformer_decoder(embed_tgt, embed_src,
                                          tgt_mask=causal_mask, memory_mask=causal_mask)
        # Project to vocabulary size (logits)
        return self.fc_out(output)

class PositionalEncoding(nn.Module):
    def __init__(self, embed_size, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        # Create a matrix of position encodings, shaped (1, max_len, embed_size) for batch-first inputs
        pe = torch.zeros(max_len, embed_size)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, embed_size, 2).float() * (-math.log(10000.0) / embed_size))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # Add positional encoding to the embeddings (x has shape (batch, seq_len, embed_size))
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)
```
This sets up the Transformer model with a basic architecture. Now, let's move on to the next step: loading and tokenizing the data.
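Before wiring up real data, it can be useful to sanity-check the class with a quick shape test. The sketch below uses placeholder values (a made-up vocabulary size of 65 and small hyperparameters, not the training configuration we'll use later):

```python
# Quick shape check with placeholder hyperparameters (illustrative only)
vocab_size_demo = 65  # assumed vocab size, just for this check
model_demo = TransformerModel(vocab_size_demo, embed_size=128, num_heads=4, num_layers=2)

dummy_batch = torch.randint(0, vocab_size_demo, (8, 100))  # (batch, seq_len) of token ids
logits = model_demo(dummy_batch, dummy_batch)
print(logits.shape)  # expected: torch.Size([8, 100, 65]) -> (batch, seq_len, vocab_size)
```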
2. Data Preprocessing and Tokenization
We'll work with the Tiny Shakespeare dataset, which is a compact text file containing Shakespeare's works. We need to tokenize the data, convert it to numerical representations, and split it into input-output sequences for training. You can download the Tiny Shakespeare dataset as a plain text file from this GitHub repository.
```python
import torch
from torch.utils.data import Dataset, DataLoader

# Load Tiny Shakespeare dataset
def load_data(file_path):
    with open(file_path, 'r') as f:
        text = f.read()
    return text

# Tokenizer: convert characters to indices and vice versa
def create_tokenizer(text):
    chars = sorted(list(set(text)))
    vocab_size = len(chars)
    char_to_int = {ch: i for i, ch in enumerate(chars)}
    int_to_char = {i: ch for i, ch in enumerate(chars)}
    return char_to_int, int_to_char, vocab_size

# Custom dataset to create input/target pairs
class ShakespeareDataset(Dataset):
    def __init__(self, text, char_to_int, seq_length=100):
        self.text = text
        self.char_to_int = char_to_int
        self.seq_length = seq_length

    def __len__(self):
        return len(self.text) - self.seq_length

    def __getitem__(self, idx):
        input_seq = [self.char_to_int[ch] for ch in self.text[idx:idx + self.seq_length]]
        target_seq = [self.char_to_int[ch] for ch in self.text[idx + 1:idx + self.seq_length + 1]]
        return torch.tensor(input_seq, dtype=torch.long), torch.tensor(target_seq, dtype=torch.long)

# Load the data and create tokenizer
file_path = 'tiny_shakespeare.txt'
text_data = load_data(file_path)
char_to_int, int_to_char, vocab_size = create_tokenizer(text_data)

# Create Dataset and DataLoader
seq_length = 100
dataset = ShakespeareDataset(text_data, char_to_int, seq_length)
dataloader = DataLoader(dataset, batch_size=64, shuffle=True)
```
Here's what's happening:
- We load the Tiny Shakespeare dataset from a file and tokenize the characters into numerical indices.
- We create a custom ShakespeareDataset class that splits the text into sequences of 100 characters and shifts them to create input and target pairs.
- We use DataLoader to batch the data, making it easier to feed the model during training. (A quick tokenizer round-trip check follows this list.)
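To make the character-level tokenizer concrete, here is a small round-trip check using the `char_to_int` and `int_to_char` mappings built above (the `sample` string is just an example):

```python
# Round-trip a short string through the character tokenizer
sample = "To be, or not to be"
encoded = [char_to_int[ch] for ch in sample]        # list of integer token ids
decoded = ''.join(int_to_char[i] for i in encoded)  # back to text
print(encoded[:10])        # first ten token ids
print(decoded == sample)   # True: encoding then decoding recovers the original text
```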
3. Training the Transformer Model
Now, we'll define the training loop, where the model learns to predict the next character in the sequence. We'll use cross-entropy loss to measure how well the model's predictions match the actual target sequences.
```python
import torch.optim as optim

# Initialize model, optimizer, and loss function
embed_size = 512
num_heads = 8
num_layers = 6
dropout = 0.2

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = TransformerModel(vocab_size, embed_size, num_heads, num_layers, dropout).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Training loop
num_epochs = 50
for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0
    for input_seq, target_seq in dataloader:
        input_seq, target_seq = input_seq.to(device), target_seq.to(device)
        optimizer.zero_grad()
        # Forward pass: predict the next character at every position
        output = model(input_seq, input_seq)
        loss = criterion(output.view(-1, vocab_size), target_seq.view(-1))
        # Backward pass and optimization
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {epoch_loss/len(dataloader):.4f}")
```
Explanation:
- Model Initialization: We initialize our TransformerModel with `embed_size`, `num_heads`, and `num_layers`. The optimizer is Adam, and the loss function is CrossEntropyLoss.
- Training Loop: For each epoch, we:
- Perform a forward pass through the model.
- Calculate the loss between the predicted and target sequences.
- Perform a backward pass to update the weights.
- Print the average loss for the epoch. (A sketch for tracking a held-out validation loss follows this list.)
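If you want a rough quality signal beyond the training loss, you can hold out a slice of the text and report validation loss and its exponential, the perplexity. This is a minimal sketch, assuming you build a `val_dataloader` the same way as `dataloader` but from a slice of text the model never trains on:

```python
import math

def evaluate(model, val_dataloader, criterion, vocab_size, device):
    # Average cross-entropy on a held-out split, plus perplexity = exp(loss)
    model.eval()
    total_loss = 0.0
    with torch.no_grad():
        for input_seq, target_seq in val_dataloader:
            input_seq, target_seq = input_seq.to(device), target_seq.to(device)
            output = model(input_seq, input_seq)
            total_loss += criterion(output.view(-1, vocab_size), target_seq.view(-1)).item()
    avg_loss = total_loss / len(val_dataloader)
    print(f"Validation loss: {avg_loss:.4f}, perplexity: {math.exp(avg_loss):.2f}")
    return avg_loss
```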
4. Generating Text from the Model
Once the model is trained, we can generate text by feeding it a starting sequence and having it predict the next characters one at a time.
```python
def generate_text(model, start_sequence, char_to_int, int_to_char, length=100):
    model.eval()
    device = next(model.parameters()).device
    input_seq = torch.tensor([char_to_int[ch] for ch in start_sequence],
                             dtype=torch.long).unsqueeze(0).to(device)
    generated_text = start_sequence
    for _ in range(length):
        with torch.no_grad():
            output = model(input_seq, input_seq)
        # Greedy decoding: take the most likely next character
        next_char_idx = torch.argmax(output[0, -1]).item()
        next_char = int_to_char[next_char_idx]
        generated_text += next_char
        # Append the new character and feed the extended sequence back in
        input_seq = torch.cat([input_seq,
                               torch.tensor([[next_char_idx]], dtype=torch.long, device=device)], dim=1)
    return generated_text

# Generate text
start_sequence = "O Romeo"
generated_text = generate_text(model, start_sequence, char_to_int, int_to_char, length=500)
print(generated_text)
```
Explanation:
- Text Generation: We provide a starting sequence (e.g., "O Romeo") and generate characters one by one using the trained model.
- Greedy Decoding: We use `torch.argmax` to pick the character with the highest probability at each step and append it to the generated sequence. (A sampling-based alternative is sketched after this list.)
- Loop: The process is repeated until the desired text length is reached.
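Greedy decoding tends to produce repetitive text. A common alternative is to sample from the model's output distribution, optionally scaled by a temperature. Here is a minimal sketch of that variant; the `generate_text_sampled` name and `temperature` parameter are illustrative, not part of the code above:

```python
import torch.nn.functional as F

def generate_text_sampled(model, start_sequence, char_to_int, int_to_char, length=100, temperature=0.8):
    model.eval()
    device = next(model.parameters()).device
    input_seq = torch.tensor([char_to_int[ch] for ch in start_sequence],
                             dtype=torch.long).unsqueeze(0).to(device)
    generated_text = start_sequence
    for _ in range(length):
        with torch.no_grad():
            output = model(input_seq, input_seq)
        # Scale logits by temperature and sample instead of taking the argmax
        probs = F.softmax(output[0, -1] / temperature, dim=-1)
        next_char_idx = torch.multinomial(probs, num_samples=1).item()
        generated_text += int_to_char[next_char_idx]
        input_seq = torch.cat([input_seq,
                               torch.tensor([[next_char_idx]], dtype=torch.long, device=device)], dim=1)
    return generated_text
```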
5. Saving and Loading the Model
To avoid retraining the model every time, we'll save the model's weights and load them for future use.
```python
# Save the trained model
torch.save(model.state_dict(), 'transformer_model.pth')

# Load the saved model (the model must be constructed with the same hyperparameters first)
model.load_state_dict(torch.load('transformer_model.pth', map_location=device))
model.eval()
```
Explanation:
- Saving: We use `torch.save()` to save the model's state dictionary (weights) to a file.
- Loading: When needed, we load the weights with `torch.load()`, copy them into an already-constructed model with `load_state_dict()`, and call `model.eval()` to set the model to evaluation mode (which disables dropout). A sketch that also bundles the tokenizer mappings into the checkpoint follows below.
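Note that the state dictionary only stores weights; to rebuild the model elsewhere you also need the tokenizer mappings and the hyperparameters. One way to keep everything together is a single checkpoint dictionary; the `mygpt_checkpoint.pth` filename and the key names below are just illustrative choices:

```python
# Bundle weights, tokenizer mappings, and hyperparameters into one checkpoint file
checkpoint = {
    'model_state_dict': model.state_dict(),
    'char_to_int': char_to_int,
    'int_to_char': int_to_char,
    'config': {'vocab_size': vocab_size, 'embed_size': embed_size,
               'num_heads': num_heads, 'num_layers': num_layers, 'dropout': dropout},
}
torch.save(checkpoint, 'mygpt_checkpoint.pth')

# Later: rebuild the model and tokenizer from the checkpoint
checkpoint = torch.load('mygpt_checkpoint.pth', map_location=device)
cfg = checkpoint['config']
model = TransformerModel(cfg['vocab_size'], cfg['embed_size'], cfg['num_heads'],
                         cfg['num_layers'], cfg['dropout']).to(device)
model.load_state_dict(checkpoint['model_state_dict'])
char_to_int, int_to_char = checkpoint['char_to_int'], checkpoint['int_to_char']
model.eval()
```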
Putting It All Together
Here's a summary of what we've done:
- Defined the Transformer model with embeddings, positional encodings, self-attention, and feedforward layers.
- Tokenized the data and prepared it for training using DataLoader.
- Trained the model using cross-entropy loss and Adam optimizer.
- Generated text by feeding a starting sequence and having the model predict the next characters.
- Saved and loaded the trained model for future use.
With this complete pipeline, you now have a fully functioning LLM. While we used the Tiny Shakespeare dataset for simplicity, this framework can be applied to much larger datasets with minimal adjustments.
Conclusion
Congratulations! You've now successfully built a Large Language Model using the Transformer architecture from scratch. You learned how to train the model, generate text, and save/load it for future use. This framework can be extended for real-world applications like chatbots, text generation systems, and more sophisticated NLP tasks.
Next steps? You can scale this model to larger datasets, experiment with different architectures, or fine-tune a pretrained model like GPT for even more powerful performance.