Algorithms
Overview
Kompress currently supports the following tasks and their respective algorithms:
- Image Classification
- Object Detection
- LLM
Algorithms
Vision Compressions
CPU Post Training Quantization - Torch
Native CPU quantization. 8-bit quantization by default. Outputs a `.pt` model file which can be directly loaded by `torch.load`.
Parameter | Values | Description | Default Value |
---|---|---|---|
insize | Int | Input Shape For Vision Tasks (Currently only A X A Shapes supported) | 32 |
BATCH_SIZE | Int | Batch Size | 1 |
TRAINING | bool | Enables Finetuning before PTQ | True |
VALIDATE | bool | Enables Validation during Optional Finetuning (When TRAINING=True) | True |
VALIDATION_INTERVAL | Int | Defines Epoch Intervals for Validation during Finetuning (When TRAINING, VALIDATE = True) | 1 |
CRITERION | "CrossEntropyLoss", "MSE Loss", others | Defines Loss functions for finetuning/validation (When TRAINING = True) | "CrossEntropyLoss" |
LEARNING RATE | Float | Defines Learning Rate for Finetuning (When TRAINING = True) | 0.001 |
FINETUNE_EPOCHS | Int | Defines the number of Epochs for Finetuning (When TRAINING = True) | 1 |
OPTIMIZER | "Adam", "SGD", others | Defines Optimizer for Finetuning. (When TRAINING = True) | Adam |
PRETRAINED | bool | Indicates whether to load ImageNet Weights in case custom model is not provided. | False |
choice | "static","weight" or "fusion" | Indicates the Kind of PTQ to be performed. | "static" |
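To make "8-bit quantization" concrete, here is a minimal pure-Python sketch of affine (scale and zero-point) quantization to unsigned 8-bit integers. It illustrates the general technique only and is not Kompress's or PyTorch's implementation; the function names and sample weights are hypothetical.

```python
def affine_qparams(values, qmin=0, qmax=255):
    """Compute a scale and zero point that map the value range onto [qmin, qmax]."""
    lo = min(min(values), 0.0)  # the range includes 0 so that 0.0 is exactly representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    return scale, max(qmin, min(qmax, zero_point))


def quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Map floats to integers in [qmin, qmax]."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(qvalues, scale, zero_point):
    """Map integers back to approximate floats."""
    return [(q - zero_point) * scale for q in qvalues]


weights = [-0.5, -0.1, 0.0, 0.7, 1.5]  # made-up example weights
scale, zp = affine_qparams(weights)
q = quantize(weights, scale, zp)
restored = dequantize(q, scale, zp)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored as one byte, and the round-trip error stays within half a quantization step (`scale / 2`) for values that are not clipped.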
CPU Post Training Quantization - OpenVino
Neural network inference optimization in OpenVINO Runtime with minimal accuracy drop. Outputs `.xml` and `.bin` model files which can be directly loaded by `openvino.core.read_model`.
Parameter | Values | Description | Default Value |
---|---|---|---|
insize | Int | Input Shape For Vision Tasks (Currently only A X A Shapes supported) | 32 |
BATCH_SIZE | Int | Batch Size | 1 |
TRAINING | bool | Enables Finetuning before PTQ | True |
VALIDATE | bool | Enables Validation during Optional Finetuning (When TRAINING = True) | True |
VALIDATION_INTERVAL | Int | Defines Epoch Intervals for Validation during Finetuning (When TRAINING, VALIDATE = True) | 1 |
CRITERION | "CrossEntropyLoss", "MSE Loss", others | Defines Loss functions for finetuning/validation (When TRAINING = True) | CrossEntropyLoss |
LEARNING RATE | Float | Defines Learning Rate for Finetuning (When TRAINING = True) | 0.001 |
FINETUNE_EPOCHS | Int | Defines the number of Epochs for Finetuning (When TRAINING = True) | 1 |
OPTIMIZER | "Adam", "SGD", others | Defines Optimizer for Finetuning. (When TRAINING = True) | Adam |
PRETRAINED | bool | Indicates whether to load ImageNet Weights in case custom model is not provided. | False |
TRANSFORMER | bool | Indicates whether the uploaded model uses a transformer-based architecture (Only For Classification) | True |
CPU Post Training Quantization - ONNX
ONNX 8-bit CPU Post Training Quantization for PyTorch models. Outputs `.onnx` model files which can be directly loaded by `onnx.load`.
Parameter | Values | Description | Default Value |
---|---|---|---|
insize | Int | Input Shape For Vision Tasks (Currently only A X A Shapes supported) | 32 |
BATCH_SIZE | Int | Batch Size for dataloader | 1 |
TRAINING | bool | Enables Finetuning before PTQ | True |
VALIDATE | bool | Enables Validation during Optional Finetuning (When TRAINING = True) | True |
VALIDATION_INTERVAL | Int | Defines Epoch Intervals for Validation during Finetuning. (When TRAINING, VALIDATE = True) | 1 |
CRITERION | "CrossEntropyLoss", "MSE Loss", others | Defines Loss functions for finetuning/validation (When TRAINING = True) | CrossEntropyLoss |
LEARNING RATE | Float | Defines Learning Rate for Finetuning (When TRAINING = True) | 0.001 |
FINETUNE_EPOCHS | Int | Defines the number of Epochs for Finetuning (When TRAINING = True) | 1 |
OPTIMIZER | "Adam", "SGD", others | Defines Optimizer for Finetuning. (When TRAINING = True) | Adam |
PRETRAINED | bool | Indicates whether to load ImageNet Weights in case custom model is not provided. | False |
quant_format | QuantFormat.QDQ, QuantFormat.QOperator | Indicates the ONNX quantization representation format | QuantFormat.QDQ |
per_channel | bool | Indicates usage of "Per Channel" quantization that improves accuracy of models with large weight range | False |
activation_type | QuantType.QInt8, QuantType.QUInt8, QuantType.QFLOAT8E4M3FN, QuantType.QInt16, QuantType.QUInt16 | Indicates the expected data type of activations post quantization | QuantType.QInt8 |
weight_type | QuantType.QInt8, QuantType.QUInt8, QuantType.QFLOAT8E4M3FN, QuantType.QInt16, QuantType.QUInt16 | Indicates the expected data type of weights post quantization | QuantType.QInt8 |
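The effect of `per_channel` on models with a large weight range can be seen in a small pure-Python sketch. It compares one shared scale (per-tensor) with one scale per output channel, using symmetric int8 quantization. This is an illustration of the idea, not ONNX Runtime's implementation; the helper names and weights are made up.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude onto the int8 limit."""
    return max(abs(v) for v in values) / qmax or 1.0


def roundtrip_error(values, scale, qmax=127):
    """Largest absolute error after quantizing to int8 and dequantizing."""
    worst = 0.0
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        worst = max(worst, abs(v - q * scale))
    return worst


# Two output channels with very different weight ranges (made-up numbers).
channels = [
    [0.010, -0.020, 0.015, -0.005],  # small-range channel
    [4.0, -8.0, 6.0, -2.0],          # large-range channel
]

# Per-tensor: one scale shared by every channel, dominated by the large channel.
shared = symmetric_scale([w for ch in channels for w in ch])
per_tensor_error = roundtrip_error(channels[0], shared)

# Per-channel: the small channel gets its own, much finer scale.
own = symmetric_scale(channels[0])
per_channel_error = roundtrip_error(channels[0], own)
```

With the shared scale, every weight in the small-range channel rounds to zero, so the channel is effectively erased. With its own scale, the error drops to at most half a quantization step.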
CPU Quantization Aware Training - Torch
Native CPU quantization. 8-bit quantization by default. Outputs a `.pt` model file which can be directly loaded by `torch.load`.
Parameter | Values | Description | Default Value |
---|---|---|---|
insize | Int | Input Shape For Vision Tasks (Currently only A X A Shapes supported) | 32 |
BATCH_SIZE | Int | Batch Size | 1 |
TRAINING | bool | Enables Finetuning before PTQ | True |
VALIDATE | bool | Enables Validation during Optional Finetuning (When TRAINING=True) | True |
VALIDATION_INTERVAL | Int | Defines Epoch Intervals for Validation during Finetuning (When TRAINING, VALIDATE = True) | 1 |
CRITERION | "CrossEntropyLoss", "MSE Loss", others | Defines Loss functions for finetuning/validation (When TRAINING = True) | "CrossEntropyLoss" |
LEARNING RATE | Float | Defines Learning Rate for Finetuning (When TRAINING = True) | 0.001 |
FINETUNE_EPOCHS | Int | Defines the number of Epochs for Finetuning (When TRAINING = True) | 1 |
OPTIMIZER | "Adam", "SGD", others | Defines Optimizer for Finetuning. (When TRAINING = True) | Adam |
PRETRAINED | bool | Indicates whether to load ImageNet Weights in case custom model is not provided. | False |
CPU Quantization Aware Training - OpenVino
Neural network inference optimization in OpenVINO Runtime with minimal accuracy drop. Outputs `.xml` and `.bin` model files which can be directly loaded by `openvino.core.read_model`.
Parameter | Values | Description | Default Value |
---|---|---|---|
insize | Int | Input Shape For Vision Tasks (Currently only A X A Shapes supported) | 32 |
BATCH_SIZE | Int | Batch Size | 1 |
TRAINING | bool | Enables Finetuning before PTQ | True |
VALIDATE | bool | Enables Validation during Optional Finetuning (When TRAINING = True) | True |
VALIDATION_INTERVAL | Int | Defines Epoch Intervals for Validation during Finetuning (When TRAINING, VALIDATE = True) | 1 |
CRITERION | "CrossEntropyLoss", "MSE Loss", others | Defines Loss functions for finetuning/validation (When TRAINING = True) | CrossEntropyLoss |
LEARNING RATE | Float | Defines Learning Rate for Finetuning (When TRAINING = True) | 0.001 |
FINETUNE_EPOCHS | Int | Defines the number of Epochs for Finetuning (When TRAINING = True) | 1 |
OPTIMIZER | "Adam", "SGD", others | Defines Optimizer for Finetuning. (When TRAINING = True) | Adam |
PRETRAINED | bool | Indicates whether to load ImageNet Weights in case custom model is not provided. | False |
TRANSFORMER | bool | Indicates whether the uploaded model uses a transformer-based architecture | True |
GPU Post Training Quantization - TensorRT
8-bit quantization executable on GPU via the TensorRT runtime. Outputs an `.engine` model file which can be directly loaded by `tensorrt.Runtime`.
Parameter | Values | Description | Default Value |
---|---|---|---|
insize | Int | Input Shape For Vision Tasks (Currently only A X A Shapes supported) | 32 |
BATCH_SIZE | Int | Batch Size for dataloader | 1 |
TRAINING | bool | Enables Finetuning before PTQ | True |
VALIDATE | bool | Enables Validation during Optional Finetuning (When TRAINING = True) | True |
VALIDATION_INTERVAL | Int | Defines Epoch Intervals for Validation during Finetuning. (When TRAINING, VALIDATE = True) | 1 |
CRITERION | "CrossEntropyLoss", "MSE Loss", others | Defines Loss functions for finetuning/validation (When TRAINING = True) | CrossEntropyLoss |
LEARNING RATE | Float | Defines Learning Rate for Finetuning (When TRAINING = True) | 0.001 |
FINETUNE_EPOCHS | Int | Defines the number of Epochs for Finetuning (When TRAINING = True) | 1 |
OPTIMIZER | "Adam", "SGD", others | Defines Optimizer for Finetuning. (When TRAINING = True) | Adam |
PRETRAINED | bool | Indicates whether to load ImageNet Weights in case custom model is not provided. | False |
Knowledge Distillation
Simple distillation training strategy that adds an additional loss between teacher and student predictions. Outputs a `.pt` model file which can be directly loaded by `torch.load`.
Parameter | Values | Description | Default Value |
---|---|---|---|
insize | Int | A single integer representing the input image size for teacher network and student network | 32 |
BATCH_SIZE | Int | Batch Size for dataloader | 1 |
TRAINING | bool | Whether to finetune teacher model before distillation. | True |
VALIDATE | bool | Enables Validation during Optional Finetuning (When TRAINING = True) | True |
VALIDATION_INTERVAL | Int | Defines Epoch Intervals for Validation during Finetuning. (When TRAINING, VALIDATE = True) | 1 |
CRITERION | "CrossEntropyLoss", "MSE Loss", others | Defines Loss functions for finetuning/validation (When TRAINING = True) | CrossEntropyLoss |
LEARNING RATE | Float | Defines Learning Rate for Finetuning (When TRAINING = True) | 0.001 |
FINETUNE_EPOCHS | Int | Defines the number of Epochs for Finetuning (When TRAINING = True) | 1 |
OPTIMIZER | "Adam", "SGD", others | Defines Optimizer for Finetuning. (When TRAINING = True) | Adam |
TEACHER_MODEL | String | Model Name of the Teacher Model. (Required both when using a built-in teacher and when a custom teacher is uploaded) | vgg16 |
CUSTOM_TEACHER_PATH | String | Relative Path for Teacher checkpoint from User Data folder | None |
METHOD | "pkd","cwd","pkd_yolo" | Distillation Algorithm to use to distill models. Needed for MMDetection (pkd,cwd) and MMYolo(pkd_yolo) distillation. Not needed for classification. | pkd |
EPOCHS | Int | Indicates Number of Training Epochs for Distillation | 20 |
LR | Float | Indicates Learning Rate for distillation process. | 0.01 |
LAMBDA | Float | Adjusts the balance between cross entropy and KLDiv (Classification Only) | 0.5 |
TEMPERATURE | Int | Indicates Temperature for softmax (Classification Only) | 20 |
SEED | Int | Sets the seed for random number generation (Classification Only) | 43 |
WEIGHT_DECAY | Float | Sets the amount of Weight Decay during Distillation (Classification Only) | 0.0005 |
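The roles of `LAMBDA` and `TEMPERATURE` in classification distillation can be sketched in pure Python as follows. The weighting used here, `(1 - LAMBDA)` on cross entropy and `LAMBDA * T^2` on the KL term, is a common convention and an assumption; the exact formula Kompress uses is not stated in this document. Function names and logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, lam=0.5, temperature=20):
    # Hard-label term: cross entropy of the student against the true class.
    ce = -math.log(softmax(student_logits)[label])
    # Soft-label term: KL(teacher || student) on temperature-softened outputs.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student))
    # Assumed weighting: LAMBDA on the soft term, (1 - LAMBDA) on the hard term,
    # with the customary T^2 factor that keeps gradient magnitudes comparable.
    return (1 - lam) * ce + lam * temperature ** 2 * kl


student = [2.0, 0.5, -1.0]  # made-up logits
loss_same = distillation_loss(student, [2.0, 0.5, -1.0], label=0)
loss_diff = distillation_loss(student, [-1.0, 0.5, 2.0], label=0)
```

When the student already matches the teacher, the KL term vanishes and only the weighted cross entropy remains. A disagreeing teacher increases the loss.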
Structured Pruning (Image Classification)
Prunes existing parameters to increase efficiency. MMDetection and MMSegmentation models are currently supported through MMRazor pruning algorithms. Outputs a `.pt` model file which can be directly loaded by `torch.load`.
Parameter | Values | Description | Default Value |
---|---|---|---|
insize | Int | Input Shape For Vision Tasks (Currently only A X A Shapes supported) | 32 |
BATCH_SIZE | Int | Batch Size for dataloader | 1 |
TRAINING | bool | Whether to finetune model after pruning. | True |
VALIDATE | bool | Enables Validation during Optional Finetuning (When TRAINING = True) | True |
VALIDATION_INTERVAL | Int | Defines Epoch Intervals for Validation during Finetuning. (When TRAINING, VALIDATE = True) | 1 |
CRITERION | "CrossEntropyLoss", "MSE Loss", others | Defines Loss functions for finetuning/validation (When TRAINING = True) | CrossEntropyLoss |
LEARNING RATE | Float | Defines Learning Rate for Finetuning (When TRAINING = True) | 0.001 |
FINETUNE_EPOCHS | Int | Defines the number of Epochs for Finetuning (When TRAINING = True) | 1 |
OPTIMIZER | "Adam", "SGD", others | Defines Optimizer for Finetuning. (When TRAINING = True) | Adam |
PRETRAINED | bool | Indicates whether to load ImageNet Weights in case custom model is not provided. | False |
The below parameters are specifically for the pruning of classification models.
Parameter | Values | Description | Default Value |
---|---|---|---|
PRUNER_NAME | MetaPruner, GroupNormPruner, BNSPruner | Pruning Algorithm to be utilized for pruning classification models. | GroupNormPruner |
GROUP_IMPORTANCE | GroupNormImportance, GroupTaylorImportance | Logic for identifying the importance of parameters to prune. | GroupNormImportance |
TARGET_PRUNE_RATE | Int | Parameter Reduction Rate, defines how many parameters are removed | Integer Value |
BOTTLENECK | bool | When Pruning Transformer based Architectures, whether to prune only intermediate layers (bottleneck) or perform uniform pruning. | False |
PRUNE_NUM_HEADS | bool | Whether to Prune number of Heads (For Transformer based Architectures) | True |
Structured Pruning (Object Detection)
The below parameters are specifically for the pruning of object detection models.
Parameter | Values | Description | Default Value |
---|---|---|---|
INTERVAL | Int | Epoch Interval between every pruning operation | 10 |
NORM_TYPE | "act", "flops" | Type of pruning operation. "act" focuses on reducing parameters with minimal changes to activations. "flops" focuses on improving number of flops. | "act" |
LR_RATIO | Float | Ratio by which the learning rate is decreased. | 0.1 |
TARGET_FLOP_RATIO | Float | The target flop ratio to prune your model. (also used for "act"). | 0.5 |
EPOCHS | Int | Number of epochs to perform training (possibly a multiple of Interval). | 20 |
LLM
LLM Structured Pruning
LLM Structured Pruning is a structured pruning framework for Large Language Models (LLMs) that improves efficiency by reducing storage and enhancing inference speed. Outputs `model.safetensors`, directly loadable with the `from_pretrained()` method of transformers model classes (for example `AutoModelForCausalLM.from_pretrained()`).
Parameter | Values | Description | Default Value |
---|---|---|---|
pruning_ratio | Float | Pruning ratio | 0.2 |
metrics | "IFV", "WIFV", "WIFN" | Importance metric: "IFV" (Importance Feature Value), "WIFV" (Weighted Importance Feature Value), "WIFN" (Weighted Importance Feature Norm) | "WIFV" |
structure | "UL-UM", "UL-MM", "AL-MM", "AL-AM" | Pruning structure: "UL-UM" (Uniform across Layers, Uniform across Modules), "UL-MM" (Uniform across Layers, Manual ratio for Modules), "AL-MM" (Adaptive across Layers, Manual for Modules), "AL-AM" (Adaptive across both Layers and Modules) | "AL-MM" |
remove_heads | Int | Number of heads to remove | 8 |
nsamples | Int | Number of samples for evaluation | 2048 |
LLM Quantization
A 4-bit weight-only quantization method designed for Language Model (LM) applications. Utilizes GEMM (General Matrix Multiply) as the default operation. Generates `*.safetensors` and `config.json` files that can be directly loaded by transformers' `AutoModelForCausalLM.from_pretrained()` or AutoAWQ's `AutoAWQForCausalLM.from_quantized()` for quantized models.
Parameter | Values | Description | Default Value |
---|---|---|---|
zero_point | bool | Whether to use zero point. | True |
q_group_size | Int | Quantization group size | 128 |
w_bit | Int | Weight bitwidth (only 4 bit is supported) | 4 |
version | "GEMM", "GEMV" | Version of AutoAWQ. One of GEMM or GEMV. | "GEMM" |
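How `w_bit`, `q_group_size`, and `zero_point` interact can be shown with a pure-Python sketch of group-wise weight quantization. This is a simplified illustration of the general scheme, not AutoAWQ's code: it omits AWQ's activation-aware scaling and the packed GEMM/GEMV layouts, and all function names are hypothetical.

```python
def quantize_group(weights, w_bit=4, zero_point=True):
    """Quantize one group of weights to w_bit integers; return (ints, scale, zero)."""
    if zero_point:
        qmin, qmax = 0, 2 ** w_bit - 1  # asymmetric: 0..15 for 4 bits
        lo, hi = min(weights), max(weights)
        scale = (hi - lo) / (qmax - qmin) or 1.0
        zero = max(qmin, min(qmax, round(-lo / scale)))
    else:
        qmax = 2 ** (w_bit - 1) - 1  # symmetric: -7..7 for 4 bits
        qmin = -qmax
        scale = max(abs(w) for w in weights) / qmax or 1.0
        zero = 0
    ints = [max(qmin, min(qmax, round(w / scale) + zero)) for w in weights]
    return ints, scale, zero


def quantize_row(row, q_group_size=128, w_bit=4, zero_point=True):
    """Split a weight row into groups, each with its own scale and zero point."""
    groups = [row[i:i + q_group_size] for i in range(0, len(row), q_group_size)]
    return [quantize_group(g, w_bit, zero_point) for g in groups]


def dequantize_row(quantized):
    """Rebuild approximate float weights from the per-group integers."""
    return [(q - zero) * scale for ints, scale, zero in quantized for q in ints]


# Toy row of 8 weights with a group size of 4 (the documented default is 128).
row = [-0.5, 0.0, 0.32, 1.0, -2.0, 0.0, 1.3, 4.0]
packed = quantize_row(row, q_group_size=4)
restored = dequantize_row(packed)
```

Both groups end up with the same 4-bit integers but different scales. A smaller `q_group_size` therefore tracks local weight ranges more closely, at the cost of storing more scales and zero points.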
LLM Engine TensorRT
Optimizes LLMs for inference and builds TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently. Outputs `.engine` model files which can be directly loaded by NVIDIA Triton Inference Server.
Parameter | Values | Description | Default Value |
---|---|---|---|
to_quantize | bool | Whether to quantize the model before building the engine. | True |
quant_method | "fp8", "int4_awq", "smoothquant", "int8" | Quantization format | "int4_awq" |
smoothquant | Float | SmoothQuant α value, used when quant_method = "smoothquant" (controls how quantization difficulty is migrated between activations and weights) | 0.5 |
calib_size | Int | Calibration size | 32 |
dtype | "float16" | dtype of the model | "float16" |
- Model/Quantization Support Grid:
Model | fp8 | int4_awq | smoothquant | int8 |
---|---|---|---|---|
LLaMA | ✓ | ✓ | ✓ | ✓ |
LLaMA-2 | ✓ | ✓ | ✓ | ✓ |
Vicuna | ✓ | ✓ | ✓ | ✓ |
Mixtral | ✓ | ✓ | - | ✓ |
Mistral-7B | ✓ | ✓ | - | ✓ |
Gemma | ✓ | ✓ | - | ✓ |
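The grid above can be encoded as a small lookup to check a `quant_method` choice before starting a build. This is a hypothetical helper written for illustration, not part of Kompress; cells marked "-" are treated as unsupported.

```python
# Support grid transcribed from the table above; "-" cells are treated as unsupported.
SUPPORT_GRID = {
    "LLaMA": {"fp8", "int4_awq", "smoothquant", "int8"},
    "LLaMA-2": {"fp8", "int4_awq", "smoothquant", "int8"},
    "Vicuna": {"fp8", "int4_awq", "smoothquant", "int8"},
    "Mixtral": {"fp8", "int4_awq", "int8"},
    "Mistral-7B": {"fp8", "int4_awq", "int8"},
    "Gemma": {"fp8", "int4_awq", "int8"},
}


def is_supported(model_family, quant_method):
    """Return True if the grid marks quant_method as supported for model_family."""
    if model_family not in SUPPORT_GRID:
        raise ValueError(f"unknown model family: {model_family!r}")
    return quant_method in SUPPORT_GRID[model_family]


def supported_methods(model_family):
    """List the supported methods for a family in alphabetical order."""
    return sorted(SUPPORT_GRID[model_family])
```

For example, `is_supported("Mixtral", "smoothquant")` returns `False`, while the same call for `"LLaMA-2"` returns `True`.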
LLM Engine ExLlama
Uses EXL2, a new quantization format that brings a lot of flexibility to how weights are stored. This implementation generates the engine files and a script required to produce fast inferences on the provided model. Outputs `.safetensors` and `config.json` model files along with a `run.sh` that loads and runs a test inference with ExLlamaV2.
Parameter | Values | Description | Default Value |
---|---|---|---|
bits | Float >= 2 , <= 8 | Target bits per weight | 4.125 |
shard_size | Int | Max shard size in MB while saving model | 8192 |
rope_scale | Float | RoPE scaling factor (related to RoPE (NTK) parameters for calibration) | 1 |
rope_alpha | Float | RoPE alpha value (related to RoPE (NTK) parameters for calibration) | 1 |
head_bits | Int | Target bits per weight (for head layer) | 6 |
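As a rough guide to what `bits`, `head_bits`, and `shard_size` imply for output size, the arithmetic can be sketched as below. This is a back-of-the-envelope estimate under stated assumptions: it ignores per-group metadata and unquantized tensors, treats 1 MB as 1024 * 1024 bytes, and uses hypothetical weight counts.

```python
import math


def estimate_size_mb(num_weights, bits, head_weights=0, head_bits=6):
    """Approximate stored size in MB from target bits per weight."""
    total_bits = num_weights * bits + head_weights * head_bits
    return total_bits / 8 / (1024 * 1024)


def estimate_shards(size_mb, shard_size=8192):
    """Number of output shards if each shard holds at most shard_size MB."""
    return max(1, math.ceil(size_mb / shard_size))


body = 6_500_000_000  # hypothetical number of non-head weights
head = 131_000_000    # hypothetical number of head-layer weights
size_mb = estimate_size_mb(body, bits=4.125, head_weights=head, head_bits=6)
shards = estimate_shards(size_mb)
```

Under these assumptions the model fits in a single shard at the default `shard_size`. Raising `bits` toward 8 roughly doubles the estimate.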
LLM Engine MLCLLM
Compiler accelerations and runtime optimizations for native deployment across platforms and edge devices. Outputs `params-*.bin` files and compiled files directly usable by MLC Chat. Also produces a `run.py` for sample usage.
Parameter | Values | Description | Default Value |
---|---|---|---|
quantize | bool | Indicates whether quantization is applied to the model | True |
quant_method | "q4f16_0", "q4f16_autoawq" | Method used for quantization | "q4f16_autoawq" |
conv_template | "llama-2" | Conversation templates | None |
llvm_triple | null | LLVM triple | None |
Low-rank Decomposition
Coming soon