A hands-on journey into LLMs, tokenization, RAG pipelines, PDF vector processing, and API development.
Welcome to my Generative AI learning repository! This project documents my step-by-step progress in understanding the core foundations of GenAI, including tokenization, embeddings, vector stores, retrieval-augmented generation (RAG), and backend integration using FastAPI.
Includes Python scripts I wrote while learning:
- How tokenizers work
- Creating custom vocab files
- Training BPE tokenizers
- Understanding merges, vocab length, and token distribution

Files: `vocab_train.py`, `vocab_path.py`, `vocab_length.py`, `my_bpe-merges.txt`, `my_bpe-vocab.json`, `words.txt`
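The core idea behind BPE training — repeatedly merging the most frequent adjacent symbol pair — can be sketched in plain Python. This is a toy illustration with a made-up corpus, not the repo's actual `vocab_train.py`:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(pair, words):
    """Fuse every occurrence of the pair into a single symbol.

    Naive string replace is fine for this toy; a real BPE implementation
    matches whole symbols to avoid merging across symbol boundaries.
    """
    spaced, joined = " ".join(pair), "".join(pair)
    return {word.replace(spaced, joined): freq for word, freq in words.items()}

# Toy corpus: each word is space-separated characters plus an end-of-word marker.
corpus = {"l o w </w>": 5, "l o w e r </w>": 2,
          "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(3):
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge_pair(pair, corpus)

print(merges)  # learned merge rules, in order
```

The `merges` list is exactly what ends up in a file like `my_bpe-merges.txt`: an ordered list of merge rules, while the vocab file maps each resulting symbol to an id.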
Hands-on implementation of:
- PDF text extraction
- Converting documents into embeddings
- Vector indexing and similarity search
- Building a simple RAG pipeline for answering queries

Folder: `Rag Model`
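The retrieval half of the pipeline boils down to: embed the chunks, embed the query, and return the most similar chunk. Here is a toy, stdlib-only sketch — a real pipeline would extract text with a PDF reader (e.g. pypdf), use a learned embedding model, and index with a vector store like FAISS; the bag-of-words "embedding" below is purely illustrative:

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Chunks that would normally come from PDF extraction + chunking.
chunks = [
    "Tokenizers split text into subword units.",
    "FAISS indexes vectors for fast similarity search.",
    "FastAPI serves the retrieval endpoint over HTTP.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

query = embed("how does similarity search over vectors work")
best = max(index, key=lambda item: cosine(query, item[1]))
print(best[0])
```

The retrieved chunk is then stuffed into the LLM prompt as context — that handoff is what makes it retrieval-*augmented* generation.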
I built API endpoints to:
- Generate responses
- Interact with the RAG pipeline
- Test HTTP GET & POST requests
- Understand API structure

Folder: `API`
A beginner-level FastAPI project to understand:
- CRUD operations
- Path & query parameters
- Async routes

Folder: `fastapi-todo-main`
- 🧠 Generative AI Foundations
- 🔤 Tokenization (LLM Tokenizer Concepts)
- 📚 Embeddings & Vector Databases
- 🔍 RAG Workflow (Retriever + Generator)
- ⚙️ FastAPI for AI Backend Integration
- 📝 Working with PDF Text & Chunking
- 🐍 Python for AI & NLP Workflows
This repo contains my personal learning journey as I explore the core building blocks behind modern LLM systems and GenAI applications. I am documenting everything from basics to advanced workflows.
✔️ Tokenizer Training
✔️ PDF Vector Processing
✔️ Basic RAG
🔜 Integrating a full Q&A chatbot
🔜 Adding a frontend for the GenAI demo
🔜 Deploying the FastAPI backend
If you have ideas to improve the code or learning roadmap, feel free to open an issue or PR!
If you like this repo, consider giving it a star ⭐ — it helps me stay motivated to learn more every day!