We are currently distilling multilingual models into monolingual ones, starting with Hindi. The distillation happens during pre-training, so the teacher acts as a guide throughout pre-training, which lets us reach higher compression ratios.
Our current progress: we have pre-trained a distilled Hindi model from XLM-RoBERTa-Base at a compression ratio of around 8x, and we are pushing that towards 20x, aiming for a monolingual model that performs as well as XLM-RoBERTa on low- and moderate-resource languages.
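A minimal sketch of the kind of pre-training distillation objective described above, assuming a PyTorch setup where `student_logits` and `teacher_logits` are `[batch, seq, vocab]` MLM predictions and `labels` holds masked-token targets (`-100` elsewhere); `alpha` and `temperature` are illustrative hyperparameters, not the exact values used in this project.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha: float = 0.5, temperature: float = 2.0):
    # Hard-label MLM cross-entropy on the masked positions only.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    # Soft-label KL divergence against the teacher's temperature-scaled
    # distribution (computed over all positions here, for simplicity).
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * ce + (1.0 - alpha) * kl
```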
🔗 Model: kkkamur07/hindi-xlm-roberta-33M (with a detailed model card)
We are currently getting a perplexity of around 18 on Hindi, while XLM-RoBERTa-Base gets around 5; we plan to close this gap and set up a more robust evaluation pipeline. One surprising finding: the distilled model's perplexity on English is around 50, while XLM-RoBERTa-Base gets around 2 on the same text, so the student is essentially forgetting English.
We have so far trained it on 100M tokens of Hindi.
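For reference, a rough sketch of how such perplexity numbers can be measured for an MLM-style model: mask tokens with the standard collator and exponentiate the mean masked-token loss (a pseudo-perplexity). The model name is the published checkpoint; the sample text and masking setup are placeholders, not necessarily the exact evaluation protocol used here.

```python
import math
import torch
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling)

model_name = "kkkamur07/hindi-xlm-roberta-33M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).eval()
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

texts = ["भारत एक विशाल देश है।"]  # replace with a held-out Hindi evaluation set
encodings = [tokenizer(t, truncation=True, max_length=128) for t in texts]
batch = collator(encodings)  # pads and randomly masks 15% of tokens

with torch.no_grad():
    loss = model(input_ids=batch["input_ids"],
                 attention_mask=batch["attention_mask"],
                 labels=batch["labels"]).loss
print("pseudo-perplexity ≈", math.exp(loss.item()))
```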
- Robust evaluation pipeline
- Reduce the tokenizer vocabulary from 250,004 entries to Hindi-only tokens
- Add FlashAttention and RoPE for more fine-grained control over the student's architecture
- Adaptive temperature and alpha scaling, plus some regularization in the loss so the model generalizes well (see the schedule sketch after this list)
- More varied and diverse data, and more of it
- Cache the tokenized data, since tokenization takes a lot of time (see the caching sketch after this list)
- Study the teacher model in detail
- More control over the hyperparameters
- CUDA kernel to accelerate training
- Future research direction: how much data bias the student inherits from the teacher, and how to mitigate it
- Robust logging to MLflow and other trackers
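One possible shape for the adaptive temperature / alpha idea above: anneal the temperature down and shift weight from the soft (teacher) loss to the hard MLM loss as training progresses. The schedule form and endpoints are illustrative assumptions, not settled choices.

```python
def kd_schedule(step: int, total_steps: int,
                t_start: float = 4.0, t_end: float = 1.0,
                alpha_start: float = 0.3, alpha_end: float = 0.7):
    # Linearly interpolate temperature and alpha over training.
    frac = min(step / max(total_steps, 1), 1.0)
    temperature = t_start + (t_end - t_start) * frac
    alpha = alpha_start + (alpha_end - alpha_start) * frac
    return temperature, alpha
```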
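And a sketch of caching the tokenized corpus once so training runs do not re-tokenize, assuming a Hugging Face `datasets` pipeline; the dataset file and output path are placeholders.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

raw = load_dataset("text", data_files={"train": "hindi_corpus.txt"})
tokenized = raw.map(tokenize, batched=True, remove_columns=["text"], num_proc=4)
tokenized.save_to_disk("cache/hindi_tokenized")  # reload later with load_from_disk()
```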
Our end goal is to develop high-quality, efficient models for all 22 Indic languages as specified by the Constitution of India. Throughout this process, we aim to identify what works best for low-resource language modeling.
Start simple, work our way up.
We believe in:
- Iterative improvement: Start with basic distillation, then add complexity
- Empirical validation: Every improvement must be measured and validated
- Open research: Share findings, models, and code with the community
- Practical deployment: Focus on models that can run on edge devices
Building efficient monolingual models for Indic languages, one language at a time.
🤗 Model • 💻 GitHub • 📧 Contact
In addition to this README summary, we have also put together a short motivation, summary, and outline for the project in this Google Doc.
