AI Training with Hugging Face

This project provides environments and configurations for training AI models with Hugging Face Transformers across multiple platforms.

📋 Overview

This repository includes:

📚 Detailed documentation of the training environments
🐳 Docker configs for local development and cloud deployment
☸️ Kubernetes manifests for distributed training
☁️ Cloud provider configs (AWS, GCP, Azure)
📝 Example training scripts

🚀 Quick Start

Local Development with Docker

# Clone repository
git clone <your-repo-url>
cd ai-training

# Setup environment
cd docker
cp .env.example .env
# Edit .env and add your tokens

# Start training environment
docker-compose up -d training

# Access Jupyter Notebook
# Check logs for token: docker-compose logs training
# Open: http://localhost:8888

# Run training
docker-compose run --rm training python /workspace/scripts/train.py

Kubernetes Deployment

# Set up namespace and resources
cd k8s
kubectl apply -f 01-namespace.yaml
kubectl apply -f 03-persistent-volumes.yaml
kubectl apply -f 04-secrets.yaml  # Edit first!

# Run single GPU training
kubectl apply -f 02-gpu-job.yaml

# Monitor
kubectl logs -f job/huggingface-training-job -n ml-training

Cloud Training

AWS SageMaker

cd cloud/aws
python sagemaker-training.py

GCP Vertex AI

cd cloud/gcp
python vertex-ai-training.py

Azure ML

cd cloud/azure
python azure-ml-training.py

📁 Project Structure

ai-training/
├── docs/                           # Documentation
│   └── training-environments.md    # Comprehensive environment guide
│
├── docker/                         # Docker configurations
│   ├── Dockerfile.training         # Training image
│   ├── Dockerfile.inference        # Inference image
│   ├── docker-compose.yml          # Multi-service setup
│   ├── .env.example                # Environment variables template
│   └── README.md                   # Docker setup guide
│
├── k8s/                            # Kubernetes manifests
│   ├── 01-namespace.yaml           # Namespace
│   ├── 02-gpu-job.yaml             # Single GPU job
│   ├── 03-persistent-volumes.yaml  # Storage
│   ├── 04-secrets.yaml             # Secrets
│   ├── 05-distributed-training-pytorch.yaml  # Distributed training
│   └── README.md                   # K8s setup guide
│
├── cloud/                          # Cloud provider configs
│   ├── aws/
│   │   ├── sagemaker-training.py
│   │   ├── sagemaker-distributed.py
│   │   └── terraform/              # AWS infrastructure
│   ├── gcp/
│   │   └── vertex-ai-training.py
│   ├── azure/
│   │   └── azure-ml-training.py
│   └── README.md                   # Cloud comparison guide
│
├── scripts/                        # Training scripts (to be added)
│   ├── train.py
│   ├── train_distributed.py
│   └── evaluate.py
│
└── README.md                       # This file

🎯 Training Environments

1. Local Development

Docker Desktop: Development, testing
Conda/venv: Quick experiments

2. Kubernetes

Local K8s: Minikube, Kind, K3s
Managed K8s: GKE, EKS, AKS
Tools: Kubeflow, KServe, Argo Workflows

3. Cloud Platforms

AWS SageMaker: Native Hugging Face integration
GCP Vertex AI: TPU support
Azure ML: MLflow integration
Lambda Labs: Cost-effective GPU training

4. Specialized Platforms

Google Colab Pro: Learning, prototyping
Paperspace Gradient: Individual researchers
Hugging Face Spaces: Quick demos

See docs/training-environments.md for details and a comparison.

🛠️ Setup Requirements

Local (Docker)

# Docker & Docker Compose
docker --version  # >= 20.10
docker-compose --version  # >= 2.0

# NVIDIA Docker (for GPU)
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

Kubernetes

# kubectl
kubectl version

# NVIDIA GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator -n gpu-operator-resources --create-namespace

# Kubeflow Training Operator
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"

Cloud SDKs

# AWS
pip install awscli boto3 sagemaker

# GCP
pip install google-cloud-aiplatform

# Azure
pip install azure-ai-ml azure-identity

📊 Cost Comparison

Platform	1x V100 ($/hr)	8x V100 ($/hr)	Spot Discount
AWS	$3.06	$24.48	70%
GCP	$2.48	$19.84	60-91%
Azure	$3.06	$24.48	60-80%
Lambda Labs	$0.50	$4.40	N/A (already cheap)

Recommendation: Use spot/preemptible instances to cut costs by 60-90%.
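
To make the arithmetic behind the table concrete, here is a minimal sketch that estimates the cost of a single-GPU job. The prices are copied from the table above; the `estimate_cost` helper and the 10-hour job length are hypothetical and only serve as an illustration. Real spot prices fluctuate, so treat the result as a rough estimate.

```python
# On-demand 1x V100 hourly prices copied from the table above (USD).
ON_DEMAND_V100_PER_HOUR = {
    "AWS": 3.06,
    "GCP": 2.48,
    "Azure": 3.06,
    "Lambda Labs": 0.50,
}


def estimate_cost(platform: str, hours: float, spot_discount: float = 0.0) -> float:
    """Return the estimated cost in USD for a single-GPU job.

    spot_discount is a fraction in [0, 1), for example 0.7 for a 70% discount.
    """
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in [0, 1)")
    hourly = ON_DEMAND_V100_PER_HOUR[platform]
    return round(hourly * hours * (1.0 - spot_discount), 2)


if __name__ == "__main__":
    # Hypothetical 10-hour fine-tuning run on AWS.
    print("on-demand:", estimate_cost("AWS", 10))
    print("spot (70% off):", estimate_cost("AWS", 10, spot_discount=0.70))
```

The same function can be used to compare platforms before choosing where to run a long job.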

🎓 Use Cases

Startup/Small Team

Dev: Docker + Local GPUs
Training: Lambda Labs or Paperspace
Production: Managed K8s + Kubeflow

Enterprise

Dev: Docker + Dev clusters
Training: AWS SageMaker or self-managed K8s
Production: Kubernetes + MLOps platform

Research/Academic

Dev: Conda environments
Training: Google Colab Pro or university clusters
Sharing: Hugging Face Spaces

Individual Developer

Dev: Docker or Conda
Training: Colab Pro or Lambda Labs
Deployment: Hugging Face Spaces

📖 Documentation

Training Environments Guide - Comprehensive comparison
Docker Setup - Local development guide
Kubernetes Setup - K8s deployment guide
Cloud Providers - AWS, GCP, Azure guides

🔑 Environment Variables

Create a .env file in the docker/ directory:

# Hugging Face
HF_TOKEN=hf_your_token_here

# Weights & Biases
WANDB_API_KEY=your_wandb_key_here

# Training config
MODEL_NAME=bert-base-uncased
DATASET_NAME=imdb
BATCH_SIZE=16
EPOCHS=3
LEARNING_RATE=2e-5
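
As an illustration of how a training script could consume these variables, the sketch below reads them into a typed config object. It assumes the variables are passed through to the container (for example via the compose file); the `TrainingConfig` and `load_config` names are hypothetical and not taken from the repository's scripts.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    model_name: str
    dataset_name: str
    batch_size: int
    epochs: int
    learning_rate: float


def load_config(env=None) -> TrainingConfig:
    """Build a TrainingConfig from environment variables.

    Falls back to the example values shown above when a variable is unset.
    """
    env = os.environ if env is None else env
    return TrainingConfig(
        model_name=env.get("MODEL_NAME", "bert-base-uncased"),
        dataset_name=env.get("DATASET_NAME", "imdb"),
        batch_size=int(env.get("BATCH_SIZE", "16")),
        epochs=int(env.get("EPOCHS", "3")),
        learning_rate=float(env.get("LEARNING_RATE", "2e-5")),
    )


if __name__ == "__main__":
    print(load_config())
```

Converting the values to `int` and `float` up front means a malformed entry such as `BATCH_SIZE=abc` fails at startup rather than midway through training.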

🚀 Example Training Workflows

1. Fine-tune BERT on IMDB (Local)

cd docker
docker-compose run --rm training python /workspace/scripts/train.py \
  --model_name_or_path bert-base-uncased \
  --dataset_name imdb \
  --num_train_epochs 3 \
  --per_device_train_batch_size 16 \
  --output_dir /output/bert-imdb
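
The project structure lists scripts/train.py as still to be added, so the following is only a sketch of how such a script might parse the flags used in the command above. The flag names come from the command; the defaults and the `build_parser` helper are assumptions. `num_train_epochs` is parsed as a float to mirror the type used by `transformers.TrainingArguments`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Create a parser for the flags shown in the docker-compose command above."""
    parser = argparse.ArgumentParser(description="Fine-tune a Hugging Face model (sketch)")
    parser.add_argument("--model_name_or_path", default="bert-base-uncased")
    parser.add_argument("--dataset_name", default="imdb")
    parser.add_argument("--num_train_epochs", type=float, default=3.0)
    parser.add_argument("--per_device_train_batch_size", type=int, default=16)
    parser.add_argument("--output_dir", required=True)
    return parser


if __name__ == "__main__":
    # Parse a fixed argument list here so the sketch runs without a command line.
    args = build_parser().parse_args(["--output_dir", "/output/bert-imdb"])
    print(vars(args))
```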

2. Distributed GPT-2 Training (K8s)

cd k8s
kubectl apply -f 05-distributed-training-pytorch.yaml
kubectl logs -n ml-training -l pytorch-job-name=huggingface-distributed-training -f

3. Cloud Training with Spot Instances (AWS)

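# cloud/aws/sagemaker-training.py (excerpt; entry_point, role, and framework versions omitted)
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    instance_type='ml.p3.2xlarge',  # 1x V100
    use_spot_instances=True,        # request managed spot capacity
    max_wait=90000,                 # seconds allowed for waiting plus training; must be >= max_run
)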
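
With managed spot training, `max_wait` covers both the time spent waiting for capacity and the training time, so it has to be at least as large as `max_run`. The sketch below checks that relationship before a job is submitted. The `check_spot_settings` helper is hypothetical, and the 24-hour default for `max_run` reflects my understanding of the SageMaker SDK default, so verify it against the SDK version you use.

```python
from typing import List, Optional

# Assumed SageMaker default training time limit in seconds (24 hours).
DEFAULT_MAX_RUN = 24 * 60 * 60


def check_spot_settings(
    use_spot_instances: bool,
    max_wait: Optional[int] = None,
    max_run: int = DEFAULT_MAX_RUN,
) -> List[str]:
    """Return a list of problems; an empty list means the settings look consistent."""
    problems = []
    if use_spot_instances:
        if max_wait is None:
            problems.append("max_wait must be set when use_spot_instances=True")
        elif max_wait < max_run:
            problems.append(f"max_wait ({max_wait}s) must be >= max_run ({max_run}s)")
    elif max_wait is not None:
        problems.append("max_wait only applies when use_spot_instances=True")
    return problems


if __name__ == "__main__":
    # A 90000s wait budget against the assumed 24h default run limit.
    print(check_spot_settings(True, max_wait=90000))
```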

🎯 Best Practices

Always use version control for code, configs, and experiments
Implement checkpointing for long-running jobs
Monitor GPU utilization - target 80%+
Use mixed precision (FP16/BF16) for faster training
Cache models and datasets to avoid re-downloads
Use experiment tracking (W&B, MLflow, TensorBoard)
Start with the smallest viable instance and scale up
Use spot/preemptible instances whenever possible
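
Checkpointing matters most on spot/preemptible instances, where a job can be interrupted at any time. The Hugging Face Trainer writes checkpoints into directories named `checkpoint-<step>` inside the output directory, and the sketch below locates the newest one so a restarted job can resume from it. The `find_latest_checkpoint` helper is a hypothetical stdlib-only illustration; `transformers.trainer_utils.get_last_checkpoint` offers similar functionality if the library is installed.

```python
import re
from pathlib import Path
from typing import Optional

# Matches directory names such as "checkpoint-500".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir) -> Optional[Path]:
    """Return the checkpoint directory with the highest step number, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


if __name__ == "__main__":
    # Prints None unless the directory exists and contains checkpoints.
    print(find_latest_checkpoint("/output/bert-imdb"))
```

The returned path could then be passed to `Trainer.train(resume_from_checkpoint=...)`; comparing step numbers numerically avoids the trap where `checkpoint-90` sorts after `checkpoint-1000` as a string.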

🔧 Troubleshooting

Docker GPU not detected

# Verify NVIDIA Docker
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# Restart Docker
sudo systemctl restart docker

K8s Pod pending

# Check events
kubectl get events -n ml-training --sort-by='.lastTimestamp'

# Describe pod
kubectl describe pod <pod-name> -n ml-training

Out of Memory

Reduce batch size
Enable gradient accumulation
Use FP16 mixed precision
Increase shared memory (for example Docker's --shm-size) if DataLoader workers run out of it
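
Reducing the batch size and enabling gradient accumulation work together: the effective batch size is the per-device batch size times the number of devices times the accumulation steps. The sketch below computes how many accumulation steps are needed to keep a target effective batch size after shrinking the per-device batch. The `accumulation_steps` helper is hypothetical; the result would typically be passed as `gradient_accumulation_steps` in `transformers.TrainingArguments`.

```python
import math


def accumulation_steps(target_batch_size: int, per_device_batch_size: int, num_devices: int = 1) -> int:
    """Return the gradient accumulation steps needed to reach at least target_batch_size."""
    if min(target_batch_size, per_device_batch_size, num_devices) < 1:
        raise ValueError("all arguments must be positive integers")
    per_step = per_device_batch_size * num_devices
    return max(1, math.ceil(target_batch_size / per_step))


if __name__ == "__main__":
    # Batch size 16 no longer fits in memory, so drop to 4 per device on one GPU.
    steps = accumulation_steps(target_batch_size=16, per_device_batch_size=4)
    print("gradient_accumulation_steps =", steps)
```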

🤝 Contributing

Contributions welcome! Please:

Fork the repository
Create feature branch
Make changes
Submit pull request

📄 License

See LICENSE file for details.

🔗 Resources

📞 Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Documentation: /docs directory

Happy Training! 🚀