A minimal, educational implementation of a GPT-style decoder-only transformer language model built from scratch using PyTorch.
- Decoder-only Transformer Architecture (similar to GPT-2/3)
- Multi-Head Causal Self-Attention with KV-Cache
- Pre-LayerNorm architecture for stable training
- BPE Tokenization using tiktoken (GPT-2 tokenizer)
- Simple CLI for training and inference
- Clean, modular code for learning and experimentation
Input Tokens → Token Embedding + Positional Embedding → Dropout
↓
┌───────────────────────────────┐
│ Transformer Decoder Block │ × N layers
│ ├── LayerNorm │
│ ├── Multi-Head Attention │
│ ├── Residual Connection │
│ ├── LayerNorm │
│ ├── Feed-Forward Network │
│ └── Residual Connection │
└───────────────────────────────┘
↓
Final LayerNorm → LM Head → Output Logits
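The data flow in the diagram can be sketched in a few lines of PyTorch. This is a minimal illustration, not the project's own code: it swaps in PyTorch's built-in `nn.TransformerEncoderLayer` (with `norm_first=True` and a causal mask) for the hand-written decoder block. The learned positional embedding, the GELU activation, the class name `TinyDecoderSketch`, and the vocabulary size of 1000 are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class TinyDecoderSketch(nn.Module):
    """Hypothetical stand-in that mirrors the diagram, not the repo's model class."""

    def __init__(self, vocab_size=1000, max_seq_len=256, hidden=128,
                 heads=4, ffn=512, layers=4, dropout=0.1):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)   # token embedding
        self.pos_emb = nn.Embedding(max_seq_len, hidden)  # learned positions (assumed)
        self.drop = nn.Dropout(dropout)
        # Built-in pre-LayerNorm layers stand in for the decoder blocks.
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(
                hidden, heads, dim_feedforward=ffn, dropout=dropout,
                activation="gelu", batch_first=True, norm_first=True,
            )
            for _ in range(layers)
        ])
        self.ln_f = nn.LayerNorm(hidden)                  # final LayerNorm
        self.lm_head = nn.Linear(hidden, vocab_size, bias=False)

    def forward(self, idx):
        _, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.drop(self.tok_emb(idx) + self.pos_emb(pos))
        # True above the diagonal marks future positions that are blocked.
        mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=idx.device), diagonal=1
        )
        for block in self.blocks:
            x = block(x, src_mask=mask)
        return self.lm_head(self.ln_f(x))                 # (batch, T, vocab_size)


model = TinyDecoderSketch().eval()
logits = model(torch.randint(0, 1000, (2, 10)))
```

The defaults follow the Tiny row of the model size table (4 layers, 4 heads, hidden 128, FFN 512); the real model would use the GPT-2 vocabulary size from tiktoken rather than 1000.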
# Clone or navigate to the project
cd TinyLLM
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
# venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Prepare your training data as a .txt file, then run:
python trainllm.py your_data.txt
Training Options:
python trainllm.py data.txt \
--model-size small \
--epochs 5 \
--batch-size 8 \
--learning-rate 3e-4 \
--max-seq-len 256 \
--output-dir ./checkpoints
python tinyllm.py start
Chat Options:
python tinyllm.py start \
--model ./checkpoints/final \
--temperature 0.8 \
--max-tokens 150
python tinyllm.py generate "Once upon a time" --max-tokens 100
TinyLLM/
├── src/
│ ├── model/
│ │ ├── config.py # Model configuration
│ │ ├── embeddings.py # Token & position embeddings
│ │ ├── attention.py # Multi-head causal attention
│ │ ├── feedforward.py # Feed-forward network
│ │ ├── transformer_block.py # Decoder block
│ │ └── tinyllm.py # Main model class
│ ├── tokenizer/
│ │ └── tokenizer.py # BPE tokenizer (tiktoken)
│ ├── data/
│ │ └── dataset.py # Dataset classes
│ ├── training/
│ │ └── trainer.py # Training loop
│ └── inference/
│ └── generate.py # Text generation
├── trainllm.py # Training CLI
├── tinyllm.py # Chat CLI
├── requirements.txt
└── README.md
| Size | Parameters | Layers | Heads | Hidden | FFN |
|---|---|---|---|---|---|
| Tiny | ~1M | 4 | 4 | 128 | 512 |
| Small | ~10M | 6 | 8 | 256 | 1024 |
| Medium | ~45M | 8 | 8 | 512 | 2048 |
python trainllm.py <data_file> [OPTIONS]
Arguments:
data_file Path to training text file
Options:
-o, --output-dir Output directory (default: ./checkpoints)
-m, --model-size Model size: tiny, small, medium (default: small)
-e, --epochs Number of epochs (default: 3)
-b, --batch-size Batch size (default: 8)
-lr, --learning-rate Learning rate (default: 3e-4)
--max-seq-len Maximum sequence length (default: 256)
--device Device: auto, cuda, mps, cpu (default: auto)
python tinyllm.py start [OPTIONS]
Options:
-m, --model Path to model checkpoint
-t, --temperature Sampling temperature (default: 0.8)
--top-k Top-K sampling (default: 50)
--top-p Top-P sampling (default: 0.9)
--max-tokens Max tokens per response (default: 150)
Chat Commands:
/quit, /exit Exit chat
/clear Clear conversation history
/temp <value> Set temperature
/help Show help
# Create a sample text file
echo "Hello, I am TinyLLM. I am a small language model.
I can generate text based on patterns I learned during training.
Ask me anything and I will try my best to respond!" > sample.txt
# Train the model
python trainllm.py sample.txt --epochs 10 --model-size tiny
$ python tinyllm.py start
🧑 You: Hello, who are you?
🤖 TinyLLM: I am TinyLLM, a small language model trained to have conversations.
🧑 You: What can you do?
🤖 TinyLLM: I can generate text, answer questions, and have conversations!
🧑 You: /quit
Goodbye! 👋
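The `--temperature`, `--top-k`, and `--top-p` chat options correspond to standard sampling controls. The sketch below shows one common way to combine them for a single decoding step. The function name `sample_next_token` and the order in which the filters are applied are assumptions for illustration; the project's `generate.py` may do this differently.

```python
import torch


def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.9):
    """Sample one token id from a 1-D logits vector of shape (vocab_size,)."""
    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    logits = logits / temperature

    # Top-K: keep only the k highest-scoring tokens.
    if top_k is not None and top_k > 0:
        k = min(top_k, logits.size(-1))
        kth_value = torch.topk(logits, k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))

    probs = torch.softmax(logits, dim=-1)

    # Top-P (nucleus): keep the smallest set of tokens whose mass reaches top_p.
    if top_p is not None and top_p < 1.0:
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        # Drop a token when the mass before it already reaches top_p,
        # so the most likely token is always kept.
        drop = (cumulative - sorted_probs) >= top_p
        sorted_probs = sorted_probs.masked_fill(drop, 0.0)
        probs = torch.zeros_like(probs).scatter(0, sorted_idx, sorted_probs)
        probs = probs / probs.sum()

    return torch.multinomial(probs, num_samples=1).item()


torch.manual_seed(0)
example_logits = torch.tensor([0.0, 5.0, 1.0])
greedy_like = sample_next_token(example_logits, top_k=1)
```

With `top_k=1` only the highest-scoring token survives the filter, so sampling becomes deterministic. Larger `top_k` and `top_p` values admit more candidates and give more varied text.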
Causal self-attention: each position can attend only to itself and earlier tokens, never to future ones. This is enforced with a lower-triangular attention mask.
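A lower-triangular mask can be applied to raw attention scores as in the following sketch. The function name `causal_attention_weights` is hypothetical, and the all-zero scores are only there to make the effect of the mask easy to see.

```python
import torch


def causal_attention_weights(scores: torch.Tensor) -> torch.Tensor:
    """Mask future positions in (..., T, T) attention scores, then softmax."""
    T = scores.size(-1)
    # True on and below the diagonal: position i may attend to j <= i.
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=scores.device))
    # Future positions get -inf, so softmax assigns them zero weight.
    masked = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(masked, dim=-1)


# With equal scores, row i spreads its weight evenly over positions 0..i.
weights = causal_attention_weights(torch.zeros(1, 4, 4))
```

Every row still sums to 1, but all entries above the diagonal are exactly zero.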
Pre-LayerNorm: layer normalization is applied before each sub-layer (attention, FFN) rather than after it, which tends to make training more stable.
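A pre-LayerNorm block might look like the sketch below, where each sub-layer sees a normalized input and its output is added back to the un-normalized residual stream. The class name `PreLNBlock`, the use of `nn.MultiheadAttention`, and the GELU activation are assumptions for illustration; the project implements its own attention and feed-forward modules.

```python
import torch
import torch.nn as nn


class PreLNBlock(nn.Module):
    """Hypothetical pre-LayerNorm decoder block (not the repo's implementation)."""

    def __init__(self, hidden=128, heads=4, ffn=512, dropout=0.0):
        super().__init__()
        self.ln1 = nn.LayerNorm(hidden)
        self.attn = nn.MultiheadAttention(hidden, heads, dropout=dropout,
                                          batch_first=True)
        self.ln2 = nn.LayerNorm(hidden)
        self.ffn = nn.Sequential(
            nn.Linear(hidden, ffn), nn.GELU(), nn.Linear(ffn, hidden)
        )

    def forward(self, x):
        T = x.size(1)
        # For nn.MultiheadAttention, True marks positions that are NOT attended to.
        causal = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln1(x)                       # normalize BEFORE attention
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + attn_out                      # residual connection
        x = x + self.ffn(self.ln2(x))         # normalize BEFORE the FFN, then residual
        return x


block = PreLNBlock().eval()
x = torch.randn(2, 5, 128)
y = block(x)
```

In a post-LayerNorm design the normalization would instead wrap the sum, as in `x = ln(x + sublayer(x))`.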
Weight tying: the token embedding matrix is shared with the output projection (LM head), which reduces the parameter count.
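In PyTorch, sharing the embedding matrix with the LM head is usually a one-line assignment, as in this sketch. The sizes and variable names are illustrative assumptions, not the project's configuration.

```python
import torch
import torch.nn as nn

vocab_size, hidden = 1000, 128  # illustrative sizes only

tok_emb = nn.Embedding(vocab_size, hidden)            # weight: (vocab_size, hidden)
lm_head = nn.Linear(hidden, vocab_size, bias=False)   # weight: (vocab_size, hidden)

# Point the LM head at the embedding's Parameter so both use one tensor.
lm_head.weight = tok_emb.weight

model = nn.ModuleDict({"tok_emb": tok_emb, "lm_head": lm_head})
# parameters() yields a shared Parameter only once.
n_params = sum(p.numel() for p in model.parameters())

# Round trip: token ids -> hidden vectors -> logits over the vocabulary.
logits = lm_head(tok_emb(torch.tensor([[1, 2, 3]])))
```

The two weight shapes match because `nn.Linear` stores its weight as `(out_features, in_features)`, so no transpose is needed.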
KV-cache: during generation, the keys and values computed for previous tokens are cached, so each new token reuses them instead of recomputing them.
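The caching idea can be sketched for a single attention head as follows. The function name `attend_with_cache` and the tensor layout are assumptions for illustration; the project's multi-head implementation will differ in detail.

```python
import math
import torch


def attend_with_cache(q, k_new, v_new, cache=None):
    """One decoding step of single-head attention with a key-value cache.

    q, k_new, v_new: (B, 1, D) projections for the newest token only.
    cache: optional (K, V) tuple, each of shape (B, T_past, D).
    """
    if cache is not None:
        # Append the new key/value to those already computed.
        k = torch.cat([cache[0], k_new], dim=1)
        v = torch.cat([cache[1], v_new], dim=1)
    else:
        k, v = k_new, v_new
    # The newest token is last in the sequence, so no mask is needed here.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (B, 1, T)
    out = torch.softmax(scores, dim=-1) @ v                   # (B, 1, D)
    return out, (k, v)


# Decode five positions one at a time, reusing cached keys and values.
torch.manual_seed(0)
B, T, D = 2, 5, 8
q, k, v = torch.randn(B, T, D), torch.randn(B, T, D), torch.randn(B, T, D)
cache, outputs = None, []
for t in range(T):
    out, cache = attend_with_cache(q[:, t:t+1], k[:, t:t+1], v[:, t:t+1], cache)
    outputs.append(out)
cached = torch.cat(outputs, dim=1)  # (B, T, D)
```

Step by step, this produces the same outputs as running full causal attention over the whole sequence, while each step projects only the newest token.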
- Start small: Use `--model-size tiny` for quick experiments
- More data is better: LLMs need lots of text to learn patterns
- Adjust learning rate: Lower for larger models (1e-4 to 5e-4)
- Use gradient accumulation: `--gradient-accumulation 4` to simulate larger batches
- Monitor loss: Training loss should decrease steadily
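Gradient accumulation, mentioned in the tips above, can be sketched as follows. How the project's trainer implements `--gradient-accumulation` is not shown here, so the toy linear model, the AdamW optimizer, and the random data are all assumptions made to keep the example self-contained.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)                       # toy stand-in for the language model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.MSELoss()
accum_steps = 4                               # mirrors --gradient-accumulation 4

# Eight micro-batches of size 8 behave like two batches of size 32.
batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(8)]

optimizer_steps = 0
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(batches, start=1):
    # Scale the loss so the accumulated gradient is an average, not a sum.
    loss = loss_fn(model(inputs), targets) / accum_steps
    loss.backward()                           # gradients add up in .grad
    if step % accum_steps == 0:
        optimizer.step()                      # one update per accum_steps micro-batches
        optimizer.zero_grad()
        optimizer_steps += 1
```

Memory use stays at the micro-batch level while the effective batch size is `batch_size * accum_steps`.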
MIT License - Feel free to use, modify, and learn from this code!
Built for learning and experimentation. Happy coding! 🚀