Nyuntam Text Generation
Table of Contents
- Overview
- LLM Structured Pruning
  - Fluctuation-based Adaptive Structured Pruning (FLAP)
- LLM Quantization
  - W4A16 Activation-aware Weight Quantization (AWQ)
  - W4A8KV4 Quattuor-octo-Quattuor (QoQ)
  - W2A16 Additive Quantization of Language Models (AQLM)
- LLM Engine
  - TensorRT
  - ExLlama
  - MLCLLM
Overview
Nyuntam Text Generation is a comprehensive suite of tools and algorithms designed to optimize and accelerate the inference of large language models (LLMs) for text generation tasks. The suite encompasses techniques such as structured pruning, quantization, and engine optimization, all aimed at enhancing the efficiency of LLMs. These tools work with popular open LLMs such as LLaMA-2, Mistral, and Gemma (see the model support grid below), and the resulting models can be deployed across a wide range of platforms, including CPUs, GPUs, and edge devices.
LLM Structured Pruning
Fluctuation-based Adaptive Structured Pruning (FLAP)
FLAP is a structured pruning framework that improves the efficiency of Large Language Models (LLMs) by reducing storage requirements and improving inference speed. The framework outputs a `model.safetensors` file, which can be loaded directly using `load_and_replace_weights`.
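For reference, the pruned checkpoint is a standard safetensors file, so it can also be inspected with the safetensors library; the sketch below assumes a hypothetical file path and only shows this generic route, since `load_and_replace_weights` is Nyuntam's own helper.

```python
# Minimal sketch: inspecting the pruned checkpoint written by FLAP.
# The path is hypothetical; load_and_replace_weights (Nyuntam's helper) is the
# intended loading route, this only demonstrates the generic safetensors one.
from safetensors.torch import load_file

state_dict = load_file("model.safetensors")       # pruned weights
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))              # pruned layers show reduced dimensions
```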
Parameters
| Parameter | Values | Description | Default Value |
|---|---|---|---|
| pruning_ratio | Float | Pruning ratio | 0.5 |
| metrics | "WIFV" | Importance metric: "WIFV" (Weighted Importance Feature Value) | "WIFV" |
| structure | "AL-AM" | Pruning structure: "AL-AM" (Adaptive across both Layers and Modules) | "AL-MM" |
| remove_heads | Int | Number of heads to remove | -1 |
| nsamples | Int | Number of samples for evaluation | 2048 |
| start_pruning_layer_idx | Int | Decoder layer index to start pruning from | 22 |
LLM Quantization
W4A16 Activation-aware Weight Quantization (AWQ)
AWQ is a 4-bit, weight-only quantization method designed for LLMs. It uses GEMM (General Matrix Multiply) kernels by default. The process generates `*.safetensors` and `config.json` files that can be loaded directly by transformers' `AutoModelForCausalLM.from_pretrained()` or AutoAWQ's `AutoAWQForCausalLM.from_quantized()` for quantized models.
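As a minimal loading sketch, assuming the quantized files were written to a hypothetical `./awq-model` directory and that autoawq is installed alongside transformers:

```python
# Minimal sketch: loading the AWQ output via transformers (autoawq must be installed).
from transformers import AutoModelForCausalLM, AutoTokenizer

quant_path = "./awq-model"  # hypothetical output directory
tokenizer = AutoTokenizer.from_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
# Alternatively: from awq import AutoAWQForCausalLM; AutoAWQForCausalLM.from_quantized(quant_path)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0], skip_special_tokens=True))
```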
Parameters
| Parameter | Values | Description | Default Value |
|---|---|---|---|
| zero_point | bool | Whether to use zero point | True |
| q_group_size | Int | Quantization group size | 128 |
| w_bit | Int | Weight bitwidth (only 4-bit is supported) | 4 |
| version | "GEMM", "GEMV" | Version of AutoAWQ. One of GEMM or GEMV | "GEMM" |
W4A8KV4 Quattuor-octo-Quattuor (QoQ)
QoQ employs a 4-bit weight, 8-bit activation, and 4-bit KV cache configuration. The algorithm includes a progressive quantization strategy to reduce dequantization overhead and a SmoothAttention mechanism to mitigate accuracy loss from 4-bit KV quantization.
Parameters
| Parameter | Values | Description | Default Value |
|---|---|---|---|
| save_model | True | Whether to save the model | True |
| keep_scales | True | Whether to keep scales during quantization | True |
| loads_with_qserve | False | Whether the model loads with QServe | False |
| quant_type | "gchn", "g128", "awq", "gptq", "sq_dynamic", "sq_static" | Quantization type: "gchn" = QoQ algorithm (channelwise), "g128" = QoQ algorithm (groupwise), "awq" = AWQ algorithm, "gptq" = GPTQ algorithm, "sq_dynamic" = SmoothQuant algorithm (dynamic), "sq_static" = SmoothQuant algorithm (static) | "gchn" |
| eval.tasks | arc_challenge:25 | Evaluation tasks | arc_challenge:25 |
| eval.max_seq_length | 4096 | Maximum sequence length for evaluation | 4096 |
| eval.evaluator | "lm_eval" | Evaluator used for evaluation | "lm_eval" |
Other nested config parameters can be updated as shown here. Find all the default configs for `quant_type` here.
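For illustration only (this is not Nyuntam's exact config schema), the dotted parameters in the table above correspond to a nested override along these lines:

```python
# Illustrative sketch of a nested QoQ override; dotted keys such as eval.tasks
# map to nested fields. Values mirror the table above.
qoq_overrides = {
    "quant_type": "g128",            # switch from channelwise to groupwise QoQ
    "eval": {
        "tasks": "arc_challenge:25",
        "max_seq_length": 4096,
        "evaluator": "lm_eval",
    },
}
```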
W2A16 Additive Quantization of Language Models (AQLM)
AQLM introduces learned additive quantization, tailored to each transformer block, and jointly optimizes codebook parameters across blocks. It stands out for being Pareto-optimal in accuracy vs. model size for models compressed to less than 3 bits per parameter. AQLM also offers practical, fast implementations for GPU and CPU, making it suitable for deploying LLMs on end-user devices.
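As an illustrative sketch, an AQLM-compressed checkpoint can typically be run through transformers once the aqlm package is installed; the output path below is hypothetical:

```python
# Minimal sketch: running an AQLM-compressed checkpoint via transformers
# (assumes `pip install aqlm[gpu]` and a hypothetical ./aqlm-model directory).
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "./aqlm-model"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype="auto", device_map="auto")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0], skip_special_tokens=True))
```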
Parameters
| Parameter | Values | Description | Default Value |
|---|---|---|---|
| save_intermediate_results | bool | Whether to save intermediate results | true |
| dtype | string | Data type for quantization | "float16" |
| **Calibration Config** | | | |
| attn_implementation | null or string | Attention implementation | null |
| beam_size | int | Beam size for calibration | 1 |
| codebook_value_nbits | int | Number of bits for codebook values | 16 |
| codebook_value_num_groups | int | Number of groups for codebook values | 1 |
| dtype | string | Data type for calibration | "float16" |
| finetune_adam_beta1 | float | Adam beta1 for finetuning | 0.9 |
| finetune_adam_beta2 | float | Adam beta2 for finetuning | 0.999 |
| finetune_batch_size | int | Batch size for finetuning | 16 |
| finetune_early_stop | int | Early stopping criterion for finetuning | 3 |
| finetune_keep_best | bool | Whether to keep the best model during finetuning | true |
| finetune_lr | float | Learning rate for finetuning | 0.0001 |
| finetune_max_epochs | int | Maximum number of epochs for finetuning | 25 |
| in_group_size | int | Input group size | 8 |
| init_max_iter | int | Maximum iterations for initialization | 100 |
| init_max_points_per_centroid | null or int | Maximum points per centroid for initialization | null |
| local_batch_size | int | Local batch size | 1 |
| lr | float | Learning rate | 0.0001 |
| max_epochs | int | Maximum number of epochs | 100 |
| mix_compression | bool | Whether to use mixed compression | false |
| model_seqlen | int | Model sequence length | 4096 |
| nbits_per_codebook | int | Number of bits per codebook | 16 |
| new_eval | bool | Whether to use new evaluation | false |
| no_quant | bool | Whether to disable quantization | false |
| nsamples | int | Number of samples | 2048 |
| num_codebooks | int | Number of codebooks | 1 |
| offload_activations | bool | Whether to offload activations | true |
| on_save | null or function | Function to call on save | null |
| out_group_size | int | Output group size | 1 |
| print_frequency | int | Frequency of printing | 10 |
| relative_mse_tolerance | float | Relative MSE tolerance | 0.01 |
| resume | bool | Whether to resume training | false |
| scale_nbits | int | Number of bits for scaling | 0 |
| seed | int | Random seed | 0 |
| skip_out_loss | bool | Whether to skip output loss | false |
| steps_per_epoch | int | Steps per epoch | 100 |
| true_sequential | bool | Whether to use true sequential processing | false |
| trust_remote_code | bool | Whether to trust remote code | true |
| use_checkpointing | bool | Whether to use checkpointing | false |
| use_faiss | bool | Whether to use Faiss | false |
| use_fast_tokenizer | bool | Whether to use fast tokenizer | false |
| val_size | int | Validation size | 256 |
| wandb | bool | Whether to use Weights & Biases | false |
| **Finetune Config** | | | |
| adam_beta1 | float | Adam beta1 for finetuning | 0.9 |
| adam_beta2 | float | Adam beta2 for finetuning | 0.95 |
| amp_dtype | string | AMP data type | float32 |
| amsgrad | bool | Whether to use AMSGrad | false |
| attn_implementation | null or string | Attention implementation for finetuning | null |
| base_model | string | Base model name | base_model |
| batch_size | int | Batch size | 1 |
| beam_size | int | Beam size | 1 |
| block_type | string | Block type | LlamaDecoderLayer |
| code_adam_16bit | bool | Whether to use 16-bit Adam for codes | false |
| code_beta1 | float | Beta1 for code optimization | 0.0 |
| code_beta2 | float | Beta2 for code optimization | 0.95 |
| code_dtype | string | Data type for codes | uint16 |
| code_lr | float | Learning rate for codes | 0.001 |
| code_selection_temperature | float | Temperature for code selection | 0 |
| code_trust_ratio | float | Trust ratio for codes | 0.01 |
| debias | bool | Whether to debias | true |
| delta_decay | float | Delta decay | 0 |
| download_num_workers | null or int | Number of workers for downloading | null |
| eval_datasets | list | Evaluation datasets | ["wikitext2", "c4"] |
| eval_every_steps | int | Evaluate every n steps | 1 |
| force_code_update | bool | Whether to force code update | false |
| gradient_checkpointing | bool | Whether to use gradient checkpointing | true |
| keep_best_model | bool | Whether to keep the best model | false |
| lamb | bool | Whether to use LAMB optimizer | true |
| limit_parallel_inits | int | Limit on parallel initializations | 1 |
| load_dtype | string | Data type for loading | float32 |
| lr | float | Learning rate | 0.0001 |
| master_dtype | string | Master data type | float32 |
| max_code_change_per_step | float | Maximum code change per step | 0.01 |
| max_epochs | int | Maximum number of epochs | 10 |
| microbatch_size | int | Microbatch size | 1 |
| minimize_sync | bool | Whether to minimize synchronization | false |
| model_seqlen | int | Model sequence length | 4096 |
| monkeypatch_old_pickle | bool | Whether to monkeypatch old pickle | false |
| num_workers | int | Number of workers | 8 |
| overwrite_cache | bool | Whether to overwrite cache | false |
| preprocessing_chunk_length | null or int | Preprocessing chunk length | null |
| preprocessing_keep_in_memory | bool | Whether to keep preprocessing in memory | false |
| preprocessing_num_workers | int | Number of preprocessing workers | 24 |
| print_every_steps | int | Print every n steps | 1 |
| save_every_steps | int | Save every n steps | 1 |
| seed | int | Random seed | 1337 |
| straight_through_buffer_dtype | string | Straight-through buffer data type | float32 |
| trust_remote_code | bool | Whether to trust remote code | true |
| update_codebooks_and_scales | bool | Whether to update codebooks and scales | true |
| update_codes | bool | Whether to update codes | true |
| update_non_quantized_parameters | bool | Whether to update non-quantized parameters | true |
| use_fast_tokenizer | bool | Whether to use fast tokenizer | false |
| use_fsdp_amp | bool | Whether to use FSDP AMP | false |
| verbose_optimizer | bool | Whether to use verbose optimizer | true |
| wandb | bool | Whether to use Weights & Biases | false |
| wrap_separately | list | Layers to wrap separately | [] |
| **Conversion Config** | | | |
| attn_implementation | null or string | Attention implementation for conversion | null |
| code_dtype | string | Data type for codes | int32 |
| load_dtype | string | Data type for loading | auto |
| trust_remote_code | bool | Whether to trust remote code for conversion | true |
LLM Engine
TensorRT
The LLM Engine TensorRT path optimizes LLMs for inference by building TensorRT engines with state-of-the-art optimizations for efficient inference performance. The output is a `.engine` model file that can be loaded directly by NVIDIA Triton Inference Server.
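As a hedged example, once the generated `.engine` is being served by a Triton instance, readiness can be checked with the Triton HTTP client; the model name below is hypothetical:

```python
# Minimal sketch: checking a Triton deployment of the built engine
# (assumes `pip install tritonclient[http]` and a local Triton server).
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print("server ready:", client.is_server_ready())
print("model ready:", client.is_model_ready("llama_trt_engine"))  # hypothetical model name
```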
Parameters
| Parameter | Values | Description | Default Value |
|---|---|---|---|
| to_quantize | bool | Whether to quantize the model before building the engine | True |
| quant_method | "fp8", "int4_awq", "smoothquant", "int8" | Quantization format | "int4_awq" |
| smoothquant | float | (if quant_method = "smoothquant") The SmoothQuant α value, which controls quantization difficulty migration between activations and weights | 0.5 |
| calib_size | Int | Calibration size | 32 |
| dtype | "float16" | The data type of the model | "float16" |
Model/Quantization Support Grid
| Model | fp8 | int4_awq | smoothquant | int8 |
|---|---|---|---|---|
| LLaMA | ✓ | ✓ | ✓ | ✓ |
| LLaMA-2 | ✓ | ✓ | ✓ | ✓ |
| Vicuna | ✓ | ✓ | ✓ | ✓ |
| Mixtral | ✓ | ✓ | - | ✓ |
| Mistral-7B | ✓ | ✓ | - | ✓ |
| Gemma | ✓ | ✓ | - | ✓ |
ExLlama
LLM Engine ExLlama uses the EXL2 quantization format, which provides flexibility in how weights are stored. This implementation generates engine files and a script for fast inference on the given model: the output includes `.safetensors` and `config.json` model files, along with a `run.sh` script for test inference using ExLlamaV2.
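As a rough sketch (the class and generator names follow recent ExLlamaV2 releases and may differ by version; the model directory is hypothetical), the generated EXL2 files can be loaded like this:

```python
# Rough sketch: loading the EXL2 output with ExLlamaV2; exact APIs vary by release,
# and the generated run.sh remains the authoritative example for this output.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("./exl2-model")   # hypothetical directory with .safetensors + config.json
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
print(generator.generate(prompt="Once upon a time,", max_new_tokens=32))
```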
Parameters
| Parameter | Values | Description | Default Value |
|---|---|---|---|
| bits | Float (>= 2, <= 8) | Target bits per weight | 4.125 |
| shard_size | Int | Maximum shard size (in MB) when saving the model | 8192 |
| rope_scale | Float | RoPE scaling factor (related to RoPE (NTK) parameters) | 1 |
| rope_alpha | Float | RoPE alpha value (related to RoPE (NTK) parameters) | 1 |
| head_bits | Int | Target bits per weight for the head layer | 6 |
MLCLLM
The LLM Engine MLCLLM offers compiler accelerations and runtime optimizations for native deployment across various platforms, including edge devices. The output consists of `params-*.bin` files and compiled library files that can be used directly by MLC Chat, along with a `run.py` script showing sample usage.
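As an assumption-laden sketch (exact entry points vary between MLC LLM releases, and the generated `run.py` is the authoritative example for this specific output), the compiled artifacts might be served through the mlc_llm Python API:

```python
# Rough sketch: serving the compiled output with mlc_llm's OpenAI-style engine.
# The directory is hypothetical and some releases also require the compiled
# model library to be passed explicitly.
from mlc_llm import MLCEngine

engine = MLCEngine("./mlc-model")  # hypothetical directory with params-*.bin and compiled files
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    stream=False,
)
print(response.choices[0].message.content)
engine.terminate()
```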
Parameters
| Parameter | Values | Description | Default Value |
|---|---|---|---|
| quantize | bool | Indicates whether quantization is applied to the model | True |
| quant_method | "q4f16_0", "q4f16_autoawq" | Method used for quantization | "q4f16_autoawq" |
| conv_template | "llama-2" | Conversation template | None |
| llvm_triple | null | LLVM target triple | None |