anhkhoa289/ai-training

AI Training with Hugging Face

This project provides environments and configurations for training AI models with Hugging Face Transformers across multiple platforms.

📋 Overview

This repository includes:

  • 📚 Detailed documentation on the training environments
  • 🐳 Docker configs for local development and cloud deployment
  • ☸️ Kubernetes manifests for distributed training
  • ☁️ Cloud provider configs (AWS, GCP, Azure)
  • 📝 Example training scripts

🚀 Quick Start

Local Development with Docker

# Clone repository
git clone <your-repo-url>
cd ai-training

# Setup environment
cd docker
cp .env.example .env
# Edit .env and fill in your tokens

# Start training environment
docker-compose up -d training

# Access Jupyter Notebook
# Check logs for token: docker-compose logs training
# Open: http://localhost:8888

# Run training
docker-compose run --rm training python /workspace/scripts/train.py

Kubernetes Deployment

# Set up the namespace and resources
cd k8s
kubectl apply -f 01-namespace.yaml
kubectl apply -f 03-persistent-volumes.yaml
kubectl apply -f 04-secrets.yaml  # Edit first!

# Run single GPU training
kubectl apply -f 02-gpu-job.yaml

# Monitor
kubectl logs -f job/huggingface-training-job -n ml-training
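
Streaming logs shows progress, but a script usually needs a definite state for the job. The sketch below classifies a Job object from the JSON that `kubectl get job huggingface-training-job -n ml-training -o json` returns, using the standard `status.conditions` and `status.active` fields. The function name `job_state` and the sample payload are hypothetical and not part of this repository.

```python
import json


def job_state(job: dict) -> str:
    """Classify a Kubernetes Job object as succeeded, failed, running, or pending."""
    status = job.get("status", {})
    # Terminal conditions are the most reliable signal once they are set.
    for cond in status.get("conditions", []):
        if cond.get("status") == "True":
            if cond.get("type") == "Complete":
                return "succeeded"
            if cond.get("type") == "Failed":
                return "failed"
    # No terminal condition yet: at least one active pod means the job is running.
    if status.get("active", 0) > 0:
        return "running"
    return "pending"


# Hypothetical sample, shaped like `kubectl get job ... -o json` output.
sample = json.loads("""
{"metadata": {"name": "huggingface-training-job"},
 "status": {"active": 1, "startTime": "2024-01-01T00:00:00Z"}}
""")
print(job_state(sample))
```

A wrapper could call `kubectl` through `subprocess`, pass the parsed JSON to `job_state`, and poll until the result is `succeeded` or `failed`.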

Cloud Training

AWS SageMaker

cd cloud/aws
python sagemaker-training.py

GCP Vertex AI

cd cloud/gcp
python vertex-ai-training.py

Azure ML

cd cloud/azure
python azure-ml-training.py

📁 Project Structure

ai-training/
├── docs/                           # Documentation
│   └── training-environments.md    # Comprehensive environment guide
│
├── docker/                         # Docker configurations
│   ├── Dockerfile.training         # Training image
│   ├── Dockerfile.inference        # Inference image
│   ├── docker-compose.yml          # Multi-service setup
│   ├── .env.example                # Environment variables template
│   └── README.md                   # Docker setup guide
│
├── k8s/                            # Kubernetes manifests
│   ├── 01-namespace.yaml           # Namespace
│   ├── 02-gpu-job.yaml             # Single GPU job
│   ├── 03-persistent-volumes.yaml  # Storage
│   ├── 04-secrets.yaml             # Secrets
│   ├── 05-distributed-training-pytorch.yaml  # Distributed training
│   └── README.md                   # K8s setup guide
│
├── cloud/                          # Cloud provider configs
│   ├── aws/
│   │   ├── sagemaker-training.py
│   │   ├── sagemaker-distributed.py
│   │   └── terraform/              # AWS infrastructure
│   ├── gcp/
│   │   └── vertex-ai-training.py
│   ├── azure/
│   │   └── azure-ml-training.py
│   └── README.md                   # Cloud comparison guide
│
├── scripts/                        # Training scripts (to be added)
│   ├── train.py
│   ├── train_distributed.py
│   └── evaluate.py
│
└── README.md                       # This file

🎯 Training Environments

1. Local Development

  • Docker Desktop: Development, testing
  • Conda/venv: Quick experiments

2. Kubernetes

  • Local K8s: Minikube, Kind, K3s
  • Managed K8s: GKE, EKS, AKS
  • Tools: Kubeflow, KServe, Argo Workflows

3. Cloud Platforms

  • AWS SageMaker: Native Hugging Face integration
  • GCP Vertex AI: TPU support
  • Azure ML: MLflow integration
  • Lambda Labs: Cost-effective GPU training

4. Specialized Platforms

  • Google Colab Pro: Learning, prototyping
  • Paperspace Gradient: Individual researchers
  • Hugging Face Spaces: Quick demos

See docs/training-environments.md for details and a comparison.

🛠️ Setup Requirements

Local (Docker)

# Docker & Docker Compose
docker --version  # >= 20.10
docker-compose --version  # >= 2.0

# NVIDIA Docker (for GPU)
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

Kubernetes

# kubectl
kubectl version

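# NVIDIA GPU Operator (add the NVIDIA Helm repo first; create the namespace if missing)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator -n gpu-operator-resources --create-namespace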

# Kubeflow Training Operator
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"

Cloud SDKs

# AWS
pip install awscli boto3 sagemaker

# GCP
pip install google-cloud-aiplatform

# Azure
pip install azure-ai-ml azure-identity

📊 Cost Comparison

| Platform | 1x V100 ($/hr) | 8x V100 ($/hr) | Spot Discount |
|---|---|---|---|
| AWS | $3.06 | $24.48 | 70% |
| GCP | $2.48 | $19.84 | 60-91% |
| Azure | $3.06 | $24.48 | 60-80% |
| Lambda Labs | $0.50 | $4.40 | N/A (already cheap) |

Recommendation: Use spot/preemptible instances to cut costs by roughly 60-90%.
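
To see what the discount means for a concrete job, here is a small arithmetic sketch using the 8x V100 on-demand rates from the table and the low end of each discount range. The 10-hour job length and the helper name `spot_cost` are illustrative assumptions, and actual spot prices vary by region and over time.

```python
def spot_cost(on_demand_per_hour: float, discount: float, hours: float) -> float:
    """Estimated cost in dollars of a job billed at a discounted spot rate."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be a fraction in [0, 1)")
    return round(on_demand_per_hour * (1.0 - discount) * hours, 2)


# On-demand 8x V100 rates ($/hr) taken from the table above.
ON_DEMAND_8X_V100 = {"AWS": 24.48, "GCP": 19.84, "Azure": 24.48}
# Low end of each discount range in the table (a conservative assumption).
CONSERVATIVE_DISCOUNT = {"AWS": 0.70, "GCP": 0.60, "Azure": 0.60}

JOB_HOURS = 10  # hypothetical job length

for platform, rate in ON_DEMAND_8X_V100.items():
    on_demand = round(rate * JOB_HOURS, 2)
    spot = spot_cost(rate, CONSERVATIVE_DISCOUNT[platform], JOB_HOURS)
    print(f"{platform}: on-demand ${on_demand:.2f}, spot about ${spot:.2f}")
```

The estimate ignores time lost to interruptions, so a job without checkpointing may cost more than this figure suggests.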

🎓 Use Cases

Startup/Small Team

Dev: Docker + Local GPUs
Training: Lambda Labs or Paperspace
Production: Managed K8s + Kubeflow

Enterprise

Dev: Docker + Dev clusters
Training: AWS SageMaker or self-managed K8s
Production: Kubernetes + MLOps platform

Research/Academic

Dev: Conda environments
Training: Google Colab Pro or university clusters
Sharing: Hugging Face Spaces

Individual Developer

Dev: Docker or Conda
Training: Colab Pro or Lambda Labs
Deployment: Hugging Face Spaces

📖 Documentation

🔑 Environment Variables

Create a .env file in the docker/ directory:

# Hugging Face
HF_TOKEN=hf_your_token_here

# Weights & Biases
WANDB_API_KEY=your_wandb_key_here

# Training config
MODEL_NAME=bert-base-uncased
DATASET_NAME=imdb
BATCH_SIZE=16
EPOCHS=3
LEARNING_RATE=2e-5
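
This README does not show how scripts/train.py consumes these variables. As one possible approach, the sketch below reads the training settings from the environment into a typed config object, falling back to the values listed above. The `TrainingConfig` class and `load_config` function are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TrainingConfig:
    model_name: str
    dataset_name: str
    batch_size: int
    epochs: int
    learning_rate: float


def load_config(env: Optional[Mapping[str, str]] = None) -> TrainingConfig:
    """Build a config from environment variables, using the .env defaults as fallbacks."""
    env = os.environ if env is None else env
    return TrainingConfig(
        model_name=env.get("MODEL_NAME", "bert-base-uncased"),
        dataset_name=env.get("DATASET_NAME", "imdb"),
        # Environment values are always strings, so convert numeric fields explicitly.
        batch_size=int(env.get("BATCH_SIZE", "16")),
        epochs=int(env.get("EPOCHS", "3")),
        learning_rate=float(env.get("LEARNING_RATE", "2e-5")),
    )


# Override one value and keep the rest at their defaults.
config = load_config({"BATCH_SIZE": "32"})
print(config)
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test. Secrets such as HF_TOKEN are deliberately left out of the printed config.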

🚀 Example Training Workflows

1. Fine-tune BERT on IMDB (Local)

cd docker
docker-compose run --rm training python /workspace/scripts/train.py \
  --model_name_or_path bert-base-uncased \
  --dataset_name imdb \
  --num_train_epochs 3 \
  --per_device_train_batch_size 16 \
  --output_dir /output/bert-imdb

2. Distributed GPT-2 Training (K8s)

cd k8s
kubectl apply -f 05-distributed-training-pytorch.yaml
kubectl logs -n ml-training -l pytorch-job-name=huggingface-distributed-training -f

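3. Cloud Training with Spot Instances (AWS)

# cloud/aws/sagemaker-training.py (excerpt; entry_point, role, and framework versions omitted)
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    instance_type='ml.p3.2xlarge',
    use_spot_instances=True,  # request managed spot capacity
    max_wait=90000,           # seconds, including time waiting for spot capacity; must be >= max_run
)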
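
Spot instances can be reclaimed mid-run, so a training script needs a way to resume. The sketch below finds the newest checkpoint in an output directory, assuming the `checkpoint-<step>` directory naming that the Hugging Face `Trainer` uses when saving. The helper name `latest_checkpoint` is hypothetical, and the repository's own scripts may handle resumption differently.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir) -> Optional[Path]:
    """Return the checkpoint directory with the highest step number, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best = None
    for child in root.iterdir():
        match = CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match:
            step = int(match.group(1))
            # Compare numerically: as strings, "checkpoint-90" sorts after "checkpoint-1000".
            if best is None or step > best[0]:
                best = (step, child)
    return best[1] if best else None


# Demo with throwaway directories standing in for real checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 1000, 90):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    print(latest_checkpoint(tmp).name)
```

The returned path can be passed to `trainer.train(resume_from_checkpoint=...)`. When it is `None`, training starts from scratch.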

🎯 Best Practices

  1. Always use version control for code, configs, and experiments
  2. Implement checkpointing for long-running jobs
  3. Monitor GPU utilization and aim for 80% or higher
  4. Use mixed precision (FP16/BF16) for faster training
  5. Cache models and datasets to avoid re-downloads
  6. Use experiment tracking (W&B, MLflow, TensorBoard)
  7. Start with the smallest viable instance and scale up
  8. Use spot/preemptible instances whenever possible
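
For the GPU utilization target, one lightweight check is to parse the output of `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`, which prints one utilization percentage per GPU. The sketch below flags GPUs under the target. The function names and the sample output are hypothetical, and a single reading is only a snapshot, so averaging several samples would be more reliable.

```python
from typing import List, Optional


def parse_gpu_utilization(csv_text: str) -> List[Optional[int]]:
    """Parse one utilization percentage per line; unreadable values become None."""
    readings: List[Optional[int]] = []
    for line in csv_text.splitlines():
        field = line.strip()
        if not field:
            continue
        # Some devices report "[N/A]" instead of a number.
        readings.append(int(field) if field.isdigit() else None)
    return readings


def underutilized_gpus(csv_text: str, target_percent: int = 80) -> List[int]:
    """Return the indices of GPUs whose utilization is below the target."""
    return [
        index
        for index, value in enumerate(parse_gpu_utilization(csv_text))
        if value is not None and value < target_percent
    ]


# Hypothetical output for a three-GPU machine.
sample_output = "93\n41\n[N/A]\n"
print(underutilized_gpus(sample_output))
```

Persistently low utilization often points to a data-loading bottleneck rather than the model itself, so dataloader workers and batch size are reasonable things to check first.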

🔧 Troubleshooting

Docker GPU not detected

# Verify NVIDIA Docker
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# Restart Docker
sudo systemctl restart docker

K8s Pod pending

# Check events
kubectl get events -n ml-training --sort-by='.lastTimestamp'

# Describe pod
kubectl describe pod <pod-name> -n ml-training

Out of Memory

  • Reduce batch size
  • Enable gradient accumulation
  • Use FP16 mixed precision
  • Increase shared memory
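
Reducing the batch size and enabling gradient accumulation work together: a smaller per-device batch fits in memory, while accumulating gradients over several steps keeps the effective batch size unchanged. A minimal sketch of the arithmetic follows. The helper name is hypothetical, and the result corresponds to the `gradient_accumulation_steps` setting in Hugging Face `TrainingArguments`.

```python
import math


def accumulation_steps(target_effective_batch: int,
                       per_device_batch: int,
                       num_devices: int = 1) -> int:
    """Number of accumulation steps needed to reach the target effective batch size."""
    if per_device_batch <= 0 or num_devices <= 0:
        raise ValueError("batch size and device count must be positive")
    samples_per_step = per_device_batch * num_devices
    # Round up so the effective batch is never smaller than the target.
    return max(1, math.ceil(target_effective_batch / samples_per_step))


# Keep an effective batch of 16 while fitting only 4 samples per GPU.
steps = accumulation_steps(target_effective_batch=16, per_device_batch=4)
print(f"per_device_train_batch_size=4, gradient_accumulation_steps={steps}")
```

Accumulation trades memory for time: each optimizer update now takes several forward and backward passes.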

🤝 Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create feature branch
  3. Make changes
  4. Submit pull request

📄 License

See LICENSE file for details.

🔗 Resources

📞 Support

  • Issues: GitHub Issues
  • Discussions: GitHub Discussions
  • Documentation: /docs directory

Happy Training! 🚀
