
Pretrain-Distilling Monolingual Models

SLM Distill Banner


We are distilling multilingual models into monolingual ones, starting with Hindi. The distillation happens during pre-training, so the teacher acts as a guide throughout pre-training, which lets us reach higher compression ratios.
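To make the setup concrete, here is a minimal sketch of the kind of loss this implies (the exact loss we use and the hyperparameter values may differ; the temperature and alpha below are illustrative): the student minimizes the usual masked-language-modeling cross-entropy plus a KL term toward the teacher's temperature-softened predictions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Combine the hard MLM loss with a soft KL loss against the teacher.

    student_logits / teacher_logits: (batch, seq_len, vocab_size)
    labels: (batch, seq_len), with -100 at unmasked positions.
    """
    # Hard loss: standard masked-language-modeling cross-entropy.
    mlm_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    # Soft loss: KL divergence between temperature-scaled distributions.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean")
    kl_loss = kl_loss * temperature ** 2  # standard gradient-scale correction

    return alpha * kl_loss + (1 - alpha) * mlm_loss
```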

So far we have pretrain-distilled a Hindi model from XLM-RoBERTa-Base at a compression ratio of around 8x, and we are working to push that to 20x, so that a monolingual model performs as well as XLM-RoBERTa for low- and moderate-resource languages.

🔗 Model: kkkamur07/hindi-xlm-roberta-33M (with detailed model card)
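The released checkpoint loads like any Hugging Face masked LM; a quick usage sketch (assuming the checkpoint follows the standard XLM-R layout with a `<mask>` token):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_id = "kkkamur07/hindi-xlm-roberta-33M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Fill-mask on a Hindi sentence.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill("भारत की राजधानी <mask> है।"))
```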

We currently get a perplexity of around 18 on Hindi, while XLM-RoBERTa-Base gets around 5; we will close this gap and set up a more robust evaluation pipeline in the future. One surprising finding: the model's perplexity on English is around 50, while XLM-RoBERTa-Base's is around 2, so the student is essentially forgetting English.
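For context, masked-LM perplexity of this kind can be estimated by masking a held-out sample and exponentiating the mean loss over masked tokens; a rough sketch (the example texts and masking setup here are placeholders, not our actual evaluation pipeline):

```python
import math
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

# Placeholder evaluation texts; in practice this would be a held-out Hindi/English split.
texts = ["यह एक उदाहरण वाक्य है।", "भारत एक विशाल देश है।"]

model_id = "kkkamur07/hindi-xlm-roberta-33M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id).eval()

# Randomly mask 15% of tokens, as in standard MLM evaluation.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
encodings = [tokenizer(t, truncation=True, max_length=128) for t in texts]
batch = collator(encodings)

with torch.no_grad():
    loss = model(**batch).loss  # mean cross-entropy over masked positions

print("masked-LM perplexity:", math.exp(loss.item()))
```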

We have so far trained it on 100M tokens of Hindi.

Future improvements:

  • A more robust evaluation pipeline
  • Reduce the tokenizer vocabulary from 250,004 tokens to Hindi-only tokens (see the sketch after this list)
  • Add FlashAttention and RoPE for more fine-grained control over the student's architecture
  • Adaptive temperature and alpha scaling, with some regularization in the loss, so the model generalizes better
  • More varied and diverse data, and more of it
  • Cache the tokenized data, since tokenization currently takes a lot of time
  • Study the teacher model in more detail
  • More control over the hyperparameters
  • A CUDA kernel to accelerate training
  • Future research direction: how much data bias does the student inherit from the teacher, and how can it be mitigated?
  • Robust logging to MLflow and other trackers
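As a sketch of the vocabulary-reduction item above (illustrative only: the corpus below is a placeholder, and remapping the tokenizer and the tied LM head to the new id space is left out):

```python
from collections import Counter
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "kkkamur07/hindi-xlm-roberta-33M"  # or xlm-roberta-base
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Placeholder Hindi corpus; in practice this would be the full pre-training set.
hindi_corpus = ["यह एक उदाहरण वाक्य है।", "भारत एक विशाल देश है।"]

# Count which token ids the corpus actually uses.
counts = Counter()
for text in hindi_corpus:
    counts.update(tokenizer(text)["input_ids"])

# Keep special tokens plus every id seen in the corpus (a frequency cutoff could go here).
keep_ids = sorted(set(tokenizer.all_special_ids) | set(counts))
old_to_new = {old: new for new, old in enumerate(keep_ids)}

# Slice the embedding matrix down to the kept rows; the remaining work is
# remapping the tokenizer and the tied LM head to the new id space.
old_embeddings = model.get_input_embeddings().weight.data
trimmed_embeddings = old_embeddings[keep_ids].clone()

print(f"kept {len(keep_ids)} of {old_embeddings.size(0)} embedding rows")
```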

🎯 End Goal

Our end goal is to develop high-quality, efficient models for all 22 Indic languages as specified by the Constitution of India. Throughout this process, we aim to identify what works best for low-resource language modeling.

Philosophy

Start simple, work our way up.

We believe in:

  1. Iterative improvement: Start with basic distillation, then add complexity
  2. Empirical validation: Every improvement must be measured and validated
  3. Open research: Share findings, models, and code with the community
  4. Practical deployment: Focus on models that can run on edge devices

🔬 Research in Progress 🔬
Building efficient monolingual models for Indic languages, one language at a time.

🤗 Model · 💻 GitHub · 📧 Contact

In addition to this README summary, we have also put together a short motivation, summary, and outline for the project in this Google document.
