This project provides environments and configurations for training AI models with Hugging Face Transformers across multiple platforms.
This repository includes:
- 📚 Detailed documentation of the training environments
- 🐳 Docker configs cho local development và cloud deployment
- ☸️ Kubernetes manifests cho distributed training
- ☁️ Cloud provider configs (AWS, GCP, Azure)
- 📝 Example training scripts
# Clone repository
git clone <your-repo-url>
cd ai-training
# Setup environment
cd docker
cp .env.example .env
# Edit .env with your tokens
# Start training environment
docker-compose up -d training
# Access Jupyter Notebook
# Check logs for token: docker-compose logs training
# Open: http://localhost:8888
# Run training
docker-compose run --rm training python /workspace/scripts/train.py

# Set up namespace and resources
cd k8s
kubectl apply -f 01-namespace.yaml
kubectl apply -f 03-persistent-volumes.yaml
kubectl apply -f 04-secrets.yaml # Edit first!
# Run single GPU training
kubectl apply -f 02-gpu-job.yaml
# Monitor
kubectl logs -f job/huggingface-training-job -n ml-training

cd cloud/aws
python sagemaker-training.py

cd cloud/gcp
python vertex-ai-training.py

cd cloud/azure
python azure-ml-training.py

ai-training/
├── docs/ # Documentation
│ └── training-environments.md # Comprehensive environment guide
│
├── docker/ # Docker configurations
│ ├── Dockerfile.training # Training image
│ ├── Dockerfile.inference # Inference image
│ ├── docker-compose.yml # Multi-service setup
│ ├── .env.example # Environment variables template
│ └── README.md # Docker setup guide
│
├── k8s/ # Kubernetes manifests
│ ├── 01-namespace.yaml # Namespace
│ ├── 02-gpu-job.yaml # Single GPU job
│ ├── 03-persistent-volumes.yaml # Storage
│ ├── 04-secrets.yaml # Secrets
│ ├── 05-distributed-training-pytorch.yaml # Distributed training
│ └── README.md # K8s setup guide
│
├── cloud/ # Cloud provider configs
│ ├── aws/
│ │ ├── sagemaker-training.py
│ │ ├── sagemaker-distributed.py
│ │ └── terraform/ # AWS infrastructure
│ ├── gcp/
│ │ └── vertex-ai-training.py
│ ├── azure/
│ │ └── azure-ml-training.py
│ └── README.md # Cloud comparison guide
│
├── scripts/ # Training scripts (to be added)
│ ├── train.py
│ ├── train_distributed.py
│ └── evaluate.py
│
└── README.md # This file
- Docker Desktop: Development, testing
- Conda/venv: Quick experiments
- Local K8s: Minikube, Kind, K3s
- Managed K8s: GKE, EKS, AKS
- Tools: Kubeflow, KServe, Argo Workflows
- AWS SageMaker: Native Hugging Face integration
- GCP Vertex AI: TPU support
- Azure ML: MLflow integration
- Lambda Labs: Cost-effective GPU training
- Google Colab Pro: Learning, prototyping
- Paperspace Gradient: Individual researchers
- Hugging Face Spaces: Quick demos
See docs/training-environments.md for details and comparisons.
# Docker & Docker Compose
docker --version # >= 20.10
docker-compose --version # >= 2.0
# NVIDIA Docker (for GPU)
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# kubectl
kubectl version
# NVIDIA GPU Operator
helm install gpu-operator nvidia/gpu-operator -n gpu-operator-resources
# Kubeflow Training Operator
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"

# AWS
pip install awscli boto3 sagemaker
# GCP
pip install google-cloud-aiplatform
# Azure
pip install azure-ai-ml azure-identity

| Platform | 1x V100 ($/hr) | 8x V100 ($/hr) | Spot Discount |
|---|---|---|---|
| AWS | $3.06 | $24.48 | 70% |
| GCP | $2.48 | $19.84 | 60-91% |
| Azure | $3.06 | $24.48 | 60-80% |
| Lambda Labs | $0.50 | $4.40 | N/A (already cheap) |
Recommendation: Use spot/preemptible instances to cut costs by roughly 60-90%.
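To make the savings concrete, here is a minimal sketch of the arithmetic behind the table. The on-demand prices and the 70% AWS discount are copied from the table; the 10-hour job length and the `spot_cost` helper are illustrative assumptions, and real spot prices fluctuate.

```python
# On-demand hourly prices for 1x V100, copied from the table above.
ON_DEMAND_PER_HOUR = {"AWS": 3.06, "GCP": 2.48, "Azure": 3.06}


def spot_cost(on_demand_per_hour, discount, hours):
    """Estimate the cost of `hours` of training at a spot discount in [0, 1)."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in the range [0, 1)")
    return round(on_demand_per_hour * (1.0 - discount) * hours, 2)


# Hypothetical 10-hour job on AWS, using the 70% discount listed in the table.
hours = 10
on_demand_total = round(ON_DEMAND_PER_HOUR["AWS"] * hours, 2)
spot_total = spot_cost(ON_DEMAND_PER_HOUR["AWS"], 0.70, hours)
print(f"on-demand: ${on_demand_total:.2f}, spot: ${spot_total:.2f}")
```

Spot instances can be reclaimed at any time, so this estimate only holds if the job checkpoints and resumes without repeating much work.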
Dev: Docker + Local GPUs
Training: Lambda Labs or Paperspace
Production: Managed K8s + Kubeflow
Dev: Docker + Dev clusters
Training: AWS SageMaker or self-managed K8s
Production: Kubernetes + MLOps platform
Dev: Conda environments
Training: Google Colab Pro or university clusters
Sharing: Hugging Face Spaces
Dev: Docker or Conda
Training: Colab Pro or Lambda Labs
Deployment: Hugging Face Spaces
- Training Environments Guide - Comprehensive comparison
- Docker Setup - Local development guide
- Kubernetes Setup - K8s deployment guide
- Cloud Providers - AWS, GCP, Azure guides
Create a .env file in the docker/ directory:
# Hugging Face
HF_TOKEN=hf_your_token_here
# Weights & Biases
WANDB_API_KEY=your_wandb_key_here
# Training config
MODEL_NAME=bert-base-uncased
DATASET_NAME=imdb
BATCH_SIZE=16
EPOCHS=3
LEARNING_RATE=2e-5

cd docker
docker-compose run --rm training python /workspace/scripts/train.py \
--model_name_or_path bert-base-uncased \
--dataset_name imdb \
--num_train_epochs 3 \
--per_device_train_batch_size 16 \
--output_dir /output/bert-imdb

cd k8s
kubectl apply -f 05-distributed-training-pytorch.yaml
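kubectl logs -n ml-training -l pytorch-job-name=huggingface-distributed-training -f

# cloud/aws/sagemaker-training.py (excerpt)
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    # Other required arguments (entry point, role, framework versions) are not shown in this excerpt.
    instance_type='ml.p3.2xlarge',
    use_spot_instances=True,  # run on managed spot capacity
    max_wait=90000,  # seconds; total time to wait for spot capacity and training
)

- Always use version control for code, configs, and experiments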
- Implement checkpointing for long-running jobs
- Monitor GPU utilization; target 80% or higher
- Use mixed precision (FP16/BF16) for faster training
- Cache models and datasets to avoid re-downloads
- Use experiment tracking (W&B, MLflow, TensorBoard)
- Start with the smallest viable instance and scale up
- Use spot/preemptible instances where possible
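Several of the practices above (checkpointing, mixed precision, experiment tracking) correspond to parameters of `transformers.TrainingArguments`. The sketch below builds such a parameter set as a plain dictionary, reading `BATCH_SIZE`, `EPOCHS`, and `LEARNING_RATE` from the environment as in the `.env` template. The `build_training_kwargs` helper, the `OUTPUT_DIR` variable, and the specific checkpoint values are assumptions for illustration, not what the repository's `train.py` necessarily does.

```python
import os


def build_training_kwargs(env=None):
    """Build keyword arguments intended for transformers.TrainingArguments."""
    env = os.environ if env is None else env
    return {
        # OUTPUT_DIR is a hypothetical variable; it is not part of .env.example.
        "output_dir": env.get("OUTPUT_DIR", "/output/run"),
        "per_device_train_batch_size": int(env.get("BATCH_SIZE", "16")),
        "num_train_epochs": float(env.get("EPOCHS", "3")),
        "learning_rate": float(env.get("LEARNING_RATE", "2e-5")),
        "fp16": True,              # mixed precision; requires a CUDA GPU
        "save_strategy": "steps",  # checkpoint periodically for long-running jobs
        "save_steps": 500,         # illustrative interval
        "save_total_limit": 2,     # keep only the two newest checkpoints
        "report_to": ["wandb"],    # experiment tracking; needs WANDB_API_KEY
    }


kwargs = build_training_kwargs({"BATCH_SIZE": "32"})
print(kwargs["per_device_train_batch_size"], kwargs["learning_rate"])

# Intended use inside a training script (requires the transformers package):
#   from transformers import TrainingArguments
#   args = TrainingArguments(**kwargs)
```

Keeping the settings in a dictionary makes them easy to log alongside each experiment before they are passed to the trainer.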
# Verify NVIDIA Docker
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
# Restart Docker
sudo systemctl restart docker

# Check events
kubectl get events -n ml-training --sort-by='.lastTimestamp'
# Describe pod
kubectl describe pod <pod-name> -n ml-training

- Reduce batch size
- Enable gradient accumulation
- Use FP16 mixed precision
- Increase shared memory
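Reducing the batch size and enabling gradient accumulation work together: a smaller per-device batch lowers memory use, and accumulating gradients over several steps keeps the effective batch size unchanged. A minimal sketch of that arithmetic follows; the helper names are hypothetical, and the result is the value you would pass as `gradient_accumulation_steps` in `TrainingArguments`.

```python
import math


def accumulation_steps(target_batch_size, per_device_batch_size, num_devices=1):
    """Smallest number of accumulation steps that reaches the target batch size."""
    if min(target_batch_size, per_device_batch_size, num_devices) < 1:
        raise ValueError("all arguments must be positive integers")
    samples_per_step = per_device_batch_size * num_devices
    return max(1, math.ceil(target_batch_size / samples_per_step))


def effective_batch_size(per_device_batch_size, steps, num_devices=1):
    """Number of samples contributing to each optimizer update."""
    return per_device_batch_size * steps * num_devices


# Example: a batch size of 16 runs out of memory, so drop to 4 per device
# and accumulate gradients to keep 16 samples per optimizer update.
steps = accumulation_steps(target_batch_size=16, per_device_batch_size=4)
print(steps, effective_batch_size(4, steps))
```

Accumulation trades memory for time: each optimizer update now takes several forward and backward passes.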
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
See LICENSE file for details.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: /docs directory
Happy Training! 🚀