Streamlining the Compression of Indic Language Models with NyunZero
Background
In the rapidly evolving landscape of AI, large language models (LLMs) play a pivotal role in understanding and generating human-like text. OpenHathi, built on the Llama 2 7B architecture, stands out as a powerful Indic language model, and leveraging its capabilities can significantly enhance natural language processing in a wide range of applications. In this article, we explore the seamless compression of OpenHathi with AWQ quantization and TensorRT-LLM engine conversion, made possible through NyunZero.
We will use the Samvaad dataset to draw calibration samples for quantization. Choosing an appropriate calibration dataset is very important, especially for non-English and multilingual models, and NyunZero's data-aware quantization support makes this seamless.
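As a minimal sketch of what calibration sampling involves, the snippet below draws a fixed-size, reproducible subset of texts from a corpus. The function name and signature are illustrative assumptions, not NyunZero's actual API:

```python
import random

def sample_calibration_texts(texts, n_samples=128, seed=0):
    """Draw a reproducible subset of texts for activation calibration.

    `texts` is any list of strings (e.g. formatted Samvaad conversations).
    NOTE: this is an illustrative sketch, not NyunZero's internal sampler.
    """
    rng = random.Random(seed)
    n = min(n_samples, len(texts))
    return rng.sample(texts, n)

# Toy corpus standing in for formatted Samvaad conversations.
corpus = [f"conversation-{i}" for i in range(1000)]
calib = sample_calibration_texts(corpus, n_samples=128)
print(len(calib))  # 128
```

Fixing the seed keeps the calibration subset stable across runs, so quantization results are reproducible.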
Using NyunZero
NyunZero works directly inside the user's infrastructure across most cloud providers, including GCP, AWS, and Azure. Users can also connect their local machines over a simple SSH connection, which ensures that the user's data remains private. The following steps outline the complete process:
- Connecting the infrastructure: see Nyun Docs - Connect Infrastructure.
- Importing the model and dataset: we use the open-source model sarvamai/OpenHathi-7B-Hi-v0.1-Base from Hugging Face as the base model and nyunai/samvaad-hi-v1-chat-format as the calibration dataset. The calibration dataset was formatted as specified in this prayog (experiment) notebook, in which we take the sarvamai/samvaad-hi-v1 dataset (100k high-quality conversations in English, Hindi, and Hinglish, curated exclusively with an Indic context) and format it into tulu- and yahma-like chat formats.
- Starting a compression job with Nyun Kompress: the following video walks through importing a model and a dataset on the platform and starting a compression job.
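The chat-formatting step above can be sketched as a small conversion function. The message schema and the `<|user|>`/`<|assistant|>` tags below are illustrative assumptions about a tulu-style format, not the documented schema of the Samvaad dataset:

```python
def to_tulu_format(messages):
    """Render a conversation into a tulu-style chat string.

    `messages` is assumed to be a list of {"role", "content"} dicts;
    both the schema and the tags are illustrative, not the dataset's
    documented format.
    """
    tag = {"user": "<|user|>", "assistant": "<|assistant|>"}
    parts = [f"{tag[m['role']]}\n{m['content']}" for m in messages]
    return "\n".join(parts)

# A toy Hindi conversation in the assumed schema.
conversation = [
    {"role": "user", "content": "नमस्ते, आप कैसे हैं?"},
    {"role": "assistant", "content": "मैं ठीक हूँ, धन्यवाद!"},
]
print(to_tulu_format(conversation))
```

Applying a function like this over every conversation yields a flat list of strings ready for calibration sampling.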
Output Structure
By default, the TensorRT-LLM engine method for LLMs performs an efficient quantization over the model and then compiles the model into engines for speedups. The output from the platform has the following folder structure:
Folder structure inside the workspace:

- `tllm_checkpoint*`: contains the intermediary checkpoints for the input model
- `*.engine`: the corresponding engine file for the model
- `run.py`: a script with an example inference on the generated engines
You can skip the quantization step by passing the following parameter to the job:

    # parameters to avoid quantization
    to_quantize: False
Example screenshot (Notice the Method Hyperparameters field)
Evaluation
We report the WikiText perplexity and downstream task performance of the pre-trained and compressed models below. With a few easy steps, NyunZero speeds up the model by roughly 6.3x with a minimal increase in perplexity.
| Metric | Baseline Model | Compressed (Quant + Engine) |
|---|---|---|
| Throughput (tokens / s) | 30.90 | 194.86 |
| Weight memory (GB) | 14.5 | 4.15 |
| Perplexity (WikiText) | 6.742 | 6.97 |
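As a quick sanity check, the headline ratios can be recomputed directly from the table above:

```python
# Numbers taken from the results table above.
baseline_tps, compressed_tps = 30.90, 194.86   # tokens / s
baseline_gb, compressed_gb = 14.5, 4.15        # weight memory in GB
baseline_ppl, compressed_ppl = 6.742, 6.97     # WikiText perplexity

speedup = compressed_tps / baseline_tps                        # ~6.31x throughput
memory_reduction = baseline_gb / compressed_gb                 # ~3.49x smaller weights
ppl_increase = (compressed_ppl - baseline_ppl) / baseline_ppl  # ~3.4% relative

print(f"{speedup:.2f}x faster, {memory_reduction:.2f}x smaller, "
      f"{ppl_increase:.1%} perplexity increase")
# → 6.31x faster, 3.49x smaller, 3.4% perplexity increase
```

So the compressed model trades a roughly 3.4% relative rise in perplexity for a ~6.3x throughput gain and ~3.5x smaller weights.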
The compressed weights are openly available at nyunai/OpenHathi-7B-Hi-v0.1-Base-AWQ-samvaad-hi-v1-chat-format.
We also report downstream task performance on Indic tasks and general English tasks.
| Task | Baseline (0-Shot) | Compressed (0-Shot) | Baseline (5-Shot) | Compressed (5-Shot) |
|---|---|---|---|---|
| IndicSentiment | 59.01 | 56.21 | 96.69 | 91.88 |
| IndicCopa | 55.67 | 53.89 | 59.46 | 53.22 |
| IndicXNLI | 33.33 | 33.55 | 42.53 | 37.98 |
| IndicXParaphrase | 59.34 | 53.94 | 50.00 | 48.10 |
| BoolQ | 54.25 | 60.36 | 63.48 | 61.98 |
| ARC Easy | 57.02 | 52.10 | 61.91 | 57.99 |
| ARC Easy Hindi (Translated) | 35.56 | 28.99 | 39.98 | 35.98 |
| ARC Challenge | 40.27 | 37.37 | 45.90 | 41.46 |
| ARC Challenge Hindi (Translated) | 29.94 | 26.10 | 32.08 | 28.58 |
| Winogrande | 49.48 | 49.09 | - | - |