Streamlining the Compression of Indic Language Models with NyunZero
Background
In the rapidly evolving landscape of AI, large language models (LLMs) play a pivotal role in understanding and generating human-like text. OpenHathi, built on the Llama 2 7B architecture, stands out as a powerful Indic language model, and leveraging its capabilities can significantly enhance natural language processing in a wide range of applications. In this article, we explore the seamless compression of OpenHathi with AWQ quantization and TensorRT-LLM engine conversion, made possible through NyunZero.
We will use the Samvaad dataset to draw calibration samples for quantization. Choosing an appropriate calibration dataset is very important, especially for non-English and multilingual models, and NyunZero's data-aware quantization support makes this seamless.
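As a minimal sketch of what calibration sampling involves, the snippet below draws a fixed-size, reproducible subset of texts from a corpus. The function name and signature are illustrative assumptions, not NyunZero's actual API:

```python
import random

def sample_calibration_texts(texts, n_samples=128, seed=0):
    """Draw a reproducible subset of texts for activation calibration.

    `texts` is any list of strings (e.g. formatted Samvaad conversations).
    NOTE: this is an illustrative sketch, not NyunZero's internal sampler.
    """
    rng = random.Random(seed)
    n = min(n_samples, len(texts))
    return rng.sample(texts, n)

# Toy corpus standing in for formatted Samvaad conversations.
corpus = [f"conversation-{i}" for i in range(1000)]
calib = sample_calibration_texts(corpus, n_samples=128)
print(len(calib))  # 128
```

Fixing the seed keeps the calibration subset stable across runs, so quantization results are reproducible.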
Using NyunZero
NyunZero works directly inside the user's infrastructure across most cloud providers, including GCP, AWS, and Azure. Users can also connect their local machines over a simple SSH connection, which ensures that the user's data remains private. The following steps outline the complete process:
- Connecting the infrastructure: see Nyun Docs - Connect Infrastructure.
- Importing the model and dataset: we use the open-source model sarvamai/OpenHathi-7B-Hi-v0.1-Base from Hugging Face as the base model and nyunai/samvaad-hi-v1-chat-format as the calibration dataset. The calibration dataset was formatted as specified in this prayog (experiment) notebook, in which we take the sarvamai/samvaad-hi-v1 dataset (100k high-quality conversations in English, Hindi, and Hinglish, curated exclusively with an Indic context) and format it into tulu- and yahma-like chat formats.
- Starting a compression job with Nyun Kompress: the following video walks through importing a model and a dataset on the platform and starting a compression job.
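The chat-formatting step above can be sketched as a small conversion function. The message schema and the `<|user|>`/`<|assistant|>` tags below are illustrative assumptions about a tulu-style format, not the documented schema of the Samvaad dataset:

```python
def to_tulu_format(messages):
    """Render a conversation into a tulu-style chat string.

    `messages` is assumed to be a list of {"role", "content"} dicts;
    both the schema and the tags are illustrative, not the dataset's
    documented format.
    """
    tag = {"user": "<|user|>", "assistant": "<|assistant|>"}
    parts = [f"{tag[m['role']]}\n{m['content']}" for m in messages]
    return "\n".join(parts)

# A toy Hindi conversation in the assumed schema.
conversation = [
    {"role": "user", "content": "नमस्ते, आप कैसे हैं?"},
    {"role": "assistant", "content": "मैं ठीक हूँ, धन्यवाद!"},
]
print(to_tulu_format(conversation))
```

Applying a function like this over every conversation yields a flat list of strings ready for calibration sampling.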
Output Structure
By default, the TensorRT-LLM engine method for LLMs performs an efficient quantization over the model and then compiles the model into engines for speedups. The output from the platform has the following folder structure:
Folder structure inside the workspace:

- `tllm_checkpoint*`: contains the intermediary checkpoints for the input model
- `*.engine`: the corresponding engine file for the model
- `run.py`: a script with an example inference on the generated engines
You can skip the quantization step by passing the following parameter to the job:

    # parameters to avoid quantization
    to_quantize: False
Example screenshot (Notice the Method Hyperparameters field)
Evaluation
We report the WikiText perplexity and downstream task performance of the pre-trained and compressed models below. With a few easy steps, NyunZero speeds up the model by roughly 6.3x with a minimal increase in perplexity.
| Metric | Baseline Model | Compressed (Quant + Engine) |
|---|---|---|
| Throughput (tokens / s) | 30.90 | 194.86 |
| Weight memory (GB) | 14.5 | 4.15 |
| Perplexity (WikiText) | 6.742 | 6.97 |
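As a quick sanity check, the headline ratios can be recomputed directly from the table above:

```python
# Numbers taken from the results table above.
baseline_tps, compressed_tps = 30.90, 194.86   # tokens / s
baseline_gb, compressed_gb = 14.5, 4.15        # weight memory in GB
baseline_ppl, compressed_ppl = 6.742, 6.97     # WikiText perplexity

speedup = compressed_tps / baseline_tps                        # ~6.31x throughput
memory_reduction = baseline_gb / compressed_gb                 # ~3.49x smaller weights
ppl_increase = (compressed_ppl - baseline_ppl) / baseline_ppl  # ~3.4% relative

print(f"{speedup:.2f}x faster, {memory_reduction:.2f}x smaller, "
      f"{ppl_increase:.1%} perplexity increase")
# → 6.31x faster, 3.49x smaller, 3.4% perplexity increase
```

So the compressed model trades a roughly 3.4% relative rise in perplexity for a ~6.3x throughput gain and ~3.5x smaller weights.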
The compressed weights are openly available at nyunai/OpenHathi-7B-Hi-v0.1-Base-AWQ-samvaad-hi-v1-chat-format.
We also report downstream task performance on Indic tasks and general English tasks.
| Task | Baseline (0-Shot) | Compressed (0-Shot) | Baseline (5-Shot) | Compressed (5-Shot) |
|---|---|---|---|---|
| IndicSentiment | 59.01 | 56.21 | 96.69 | 91.88 |
| IndicCopa | 55.67 | 53.89 | 59.46 | 53.22 |
| IndicXNLI | 33.33 | 33.55 | 42.53 | 37.98 |
| IndicXParaphrase | 59.34 | 53.94 | 50.00 | 48.10 |
| BoolQ | 54.25 | 60.36 | 63.48 | 61.98 |
| ARC Easy | 57.02 | 52.10 | 61.91 | 57.99 |
| ARC Easy Hindi (Translated) | 35.56 | 28.99 | 39.98 | 35.98 |
| ARC Challenge | 40.27 | 37.37 | 45.90 | 41.46 |
| ARC Challenge Hindi (Translated) | 29.94 | 26.10 | 32.08 | 28.58 |
| Winogrande | 49.48 | 49.09 | - | - |