Algorithms

Overview

Kompress currently supports the following tasks and their respective algorithms:

Image Classification -
Object Detection -

LLM -

Algorithms

Vision Compression

CPU Post Training Quantization - Torch

Native PyTorch CPU quantization, 8-bit by default. Outputs a .pt model file that can be loaded directly with torch.load.

Parameter	Values	Description	Default Value
insize	Int	Input Shape For Vision Tasks (Currently only A X A Shapes supported)	32
BATCH_SIZE	Int	Batch Size	1
TRAINING	bool	Enables Finetuning before PTQ	True
VALIDATE	bool	Enables Validation during Optional Finetuning (When TRAINING=True)	True
VALIDATION_INTERVAL	Int	Defines Epoch Intervals for Validation during Finetuning (When TRAINING, VALIDATE = True)	1
CRITERION	"CrossEntropyLoss", "MSE Loss", others	Defines Loss functions for finetuning/validation (When TRAINING = True)	"CrossEntropyLoss"
LEARNING RATE	Float	Defines Learning Rate for Finetuning (When TRAINING=True)	0.001
FINETUNE_EPOCHS	Int	Defines the number of Epochs for Finetuning (When TRAINING=True)	1
OPTIMIZER	"Adam", "SGD", others	Defines Optimizer for Finetuning (When TRAINING = True)	Adam
PRETRAINED	bool	Indicates whether to load ImageNet Weights in case custom model is not provided.	False
choice	"static","weight" or "fusion"	Indicates the Kind of PTQ to be performed.	"static"
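The "static" choice calibrates activation ranges ahead of time. As a rough illustration of that arithmetic (not Kompress's implementation), the plain-Python sketch below tracks a min/max range over a few calibration batches and maps values onto an asymmetric 8-bit grid. The helper names (calibrate, qparams, quantize, dequantize) are hypothetical, and min/max calibration onto uint8 is an assumption; PyTorch performs the equivalent steps with its own observers in torch.ao.quantization.

```python
# Hypothetical helpers illustrating static (calibrated) 8-bit affine quantization.
# Assumes asymmetric uint8 and min/max calibration; not Kompress's actual code.


def calibrate(batches):
    """Track the global min/max over calibration batches, always including 0.0."""
    lo, hi = 0.0, 0.0
    for batch in batches:
        lo = min(lo, min(batch))
        hi = max(hi, max(batch))
    return lo, hi


def qparams(lo, hi, qmin=0, qmax=255):
    """Scale and zero point so that [lo, hi] maps onto [qmin, qmax]."""
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Round to the integer grid and clamp to the representable range."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(codes, scale, zero_point):
    """Map integer codes back to approximate float values."""
    return [(q - zero_point) * scale for q in codes]


calibration_batches = [[-1.0, 0.5, 2.0], [0.0, 3.0, -0.5]]
lo, hi = calibrate(calibration_batches)
scale, zero_point = qparams(lo, hi)
codes = quantize([-1.0, 0.0, 3.0], scale, zero_point)
print(codes)  # [0, 64, 255]
```

Including 0.0 in the calibrated range keeps the zero point inside [0, 255], so an exact zero (for example, padding) survives quantization unchanged.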

CPU Post Training Quantization - OpenVino

Neural network inference optimization for the OpenVINO Runtime with minimal accuracy drop. Outputs .xml and .bin model files that can be loaded directly with openvino.Core().read_model.

Parameter	Values	Description	Default Value
insize	Int	Input Shape For Vision Tasks (Currently only A X A Shapes supported)	32
BATCH_SIZE	Int	Batch Size	1
TRAINING	bool	Enables Finetuning before PTQ	True
VALIDATE	bool	Enables Validation during Optional Finetuning (When TRAINING = True)	True
VALIDATION_INTERVAL	Int	Defines Epoch Intervals for Validation during Finetuning (When TRAINING, VALIDATE = True)	1
CRITERION	"CrossEntropyLoss", "MSE Loss", others	Defines Loss functions for finetuning/validation (When TRAINING = True)	CrossEntropyLoss
LEARNING RATE	Float	Defines Learning Rate for Finetuning (When TRAINING = True)	0.001
FINETUNE_EPOCHS	Int	Defines the number of Epochs for Finetuning (When TRAINING = True)	1
OPTIMIZER	"Adam", "SGD", others	Defines Optimizer for Finetuning (When TRAINING = True)	Adam
PRETRAINED	bool	Indicates whether to load ImageNet Weights in case custom model is not provided.	False
TRANSFORMER	bool	Indicates whether the uploaded model has a transformer-based architecture (Only For Classification)	True

CPU Post Training Quantization - ONNX

ONNX 8-bit CPU post-training quantization for PyTorch models. Outputs .onnx model files that can be loaded directly with onnx.load.

Parameter	Values	Description	Default Value
insize	Int	Input Shape For Vision Tasks (Currently only A X A Shapes supported)	32
BATCH_SIZE	Int	Batch Size for dataloader	1
TRAINING	bool	Enables Finetuning before PTQ	True
VALIDATE	bool	Enables Validation during Optional Finetuning (When TRAINING = True)	True
VALIDATION_INTERVAL	Int	Defines Epoch Intervals for Validation during Finetuning. (When TRAINING, VALIDATE = True)	1
CRITERION	"CrossEntropyLoss", "MSE Loss", others	Defines Loss functions for finetuning/validation (When TRAINING = True)	CrossEntropyLoss
LEARNING RATE	Float	Defines Learning Rate for Finetuning (When TRAINING = True)	0.001
FINETUNE_EPOCHS	Int	Defines the number of Epochs for Finetuning (When TRAINING = True)	1
OPTIMIZER	"Adam", "SGD", others	Defines Optimizer for Finetuning (When TRAINING = True)	Adam
PRETRAINED	bool	Indicates whether to load ImageNet Weights in case custom model is not provided.	False
quant_format	QuantFormat.QDQ, QuantFormat.QOperator	Indicates the ONNX quantization representation format	QuantFormat.QDQ
per_channel	bool	Whether to use per-channel quantization, which can improve accuracy for models whose weights span a large range	False
activation_type	QuantType.QInt8, QuantType.QUInt8, QuantType.QFLOAT8E4M3FN, QuantType.QInt16, QuantType.QUInt16	Indicates the expected data type of activations post quantization	QuantType.QInt8
weight_type	QuantType.QInt8, QuantType.QUInt8, QuantType.QFLOAT8E4M3FN, QuantType.QInt16, QuantType.QUInt16	Indicates the expected data type of weights post quantization	QuantType.QInt8
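The per_channel flag trades one scale for the whole weight tensor against one scale per output channel. The sketch below, which assumes symmetric int8 weights and uses hypothetical helper names, shows why that helps when channels have very different ranges: with a shared scale, the small-range channel rounds almost entirely to zero. It does not model ONNX Runtime's other options such as quant_format.

```python
# Compares per-tensor and per-channel symmetric int8 weight quantization.
# Plain-Python illustration with hypothetical helpers; not ONNX Runtime code.


def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` to +/-qmax."""
    amax = max(abs(v) for v in values)
    return amax / qmax if amax > 0 else 1.0


def mean_abs_error(row, scale, qmax=127):
    """Mean absolute error after quantizing `row` to int8 and back."""
    total = 0.0
    for v in row:
        q = max(-qmax, min(qmax, round(v / scale)))
        total += abs(v - q * scale)
    return total / len(row)


# Two output channels (rows) with very different weight ranges.
weights = [
    [0.01, -0.02, 0.015, -0.005],
    [4.0, -3.5, 2.0, -1.0],
]

# Per-tensor: a single scale shared by all channels.
tensor_scale = symmetric_scale([v for row in weights for v in row])
per_tensor = [mean_abs_error(row, tensor_scale) for row in weights]

# Per-channel: one scale per output channel.
per_channel = [mean_abs_error(row, symmetric_scale(row)) for row in weights]

print(per_tensor)
print(per_channel)
```

Here the large-range channel sets the shared scale, so switching to per-channel scales only changes the error of the small-range channel.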

CPU Quantization Aware Training - Torch

Native PyTorch CPU quantization-aware training, 8-bit by default. Outputs a .pt model file that can be loaded directly with torch.load.

Parameter	Values	Description	Default Value
insize	Int	Input Shape For Vision Tasks (Currently only A X A Shapes supported)	32
BATCH_SIZE	Int	Batch Size	1
TRAINING	bool	Enables Finetuning before quantization	True
VALIDATE	bool	Enables Validation during Optional Finetuning (When TRAINING=True)	True
VALIDATION_INTERVAL	Int	Defines Epoch Intervals for Validation during Finetuning (When TRAINING, VALIDATE = True)	1
CRITERION	"CrossEntropyLoss", "MSE Loss", others	Defines Loss functions for finetuning/validation (When TRAINING = True)	"CrossEntropyLoss"
LEARNING RATE	Float	Defines Learning Rate for Finetuning (When TRAINING=True)	0.001
FINETUNE_EPOCHS	Int	Defines the number of Epochs for Finetuning (When TRAINING=True)	1
OPTIMIZER	"Adam", "SGD", others	Defines Optimizer for Finetuning (When TRAINING = True)	Adam
PRETRAINED	bool	Indicates whether to load ImageNet Weights in case custom model is not provided.	False
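Quantization-aware training fine-tunes the model with fake-quantization nodes in the forward pass, so the weights adapt to int8 rounding before conversion. The scalar sketch below shows one such node and its straight-through-estimator gradient. It is an illustration with hypothetical helper names, not Kompress's code; PyTorch provides torch.ao.quantization.FakeQuantize for this purpose.

```python
# Scalar sketch of a fake-quantization node as used in quantization-aware
# training (QAT). Hypothetical helper names; not Kompress's implementation.


def fake_quantize(x, scale, zero_point=0, qmin=-128, qmax=127):
    """Forward pass: quantize, clamp to the int8 range, then dequantize.

    The result is still a float, but it only takes values on the int8 grid,
    so training sees the rounding and clipping error of the final model.
    """
    q = max(qmin, min(qmax, round(x / scale) + zero_point))
    return (q - zero_point) * scale


def fake_quantize_grad(x, upstream, scale, zero_point=0, qmin=-128, qmax=127):
    """Backward pass with the straight-through estimator (STE).

    Rounding has zero gradient almost everywhere, so the STE treats it as the
    identity inside the representable range and blocks the gradient outside it.
    """
    q = round(x / scale) + zero_point
    return upstream if qmin <= q <= qmax else 0.0


print(fake_quantize(0.123, scale=0.1))   # snapped to the grid point 0.1
print(fake_quantize(100.0, scale=0.1))   # clipped to 127 * 0.1
print(fake_quantize_grad(0.123, 1.0, scale=0.1))  # 1.0 (inside range)
print(fake_quantize_grad(100.0, 1.0, scale=0.1))  # 0.0 (clipped)
```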

CPU Quantization Aware Training - OpenVino

Neural network inference optimization for the OpenVINO Runtime with minimal accuracy drop. Outputs .xml and .bin model files that can be loaded directly with openvino.Core().read_model.

Parameter	Values	Description	Default Value
insize	Int	Input Shape For Vision Tasks (Currently only A X A Shapes supported)	32
BATCH_SIZE	Int	Batch Size	1
TRAINING	bool	Enables Finetuning before quantization	True
VALIDATE	bool	Enables Validation during Optional Finetuning (When TRAINING = True)	True
VALIDATION_INTERVAL	Int	Defines Epoch Intervals for Validation during Finetuning (When TRAINING, VALIDATE = True)	1
CRITERION	"CrossEntropyLoss", "MSE Loss", others	Defines Loss functions for finetuning/validation (When TRAINING = True)	CrossEntropyLoss
LEARNING RATE	Float	Defines Learning Rate for Finetuning (When TRAINING = True)	0.001
FINETUNE_EPOCHS	Int	Defines the number of Epochs for Finetuning (When TRAINING = True)	1
OPTIMIZER	"Adam", "SGD", others	Defines Optimizer for Finetuning (When TRAINING = True)	Adam
PRETRAINED	bool	Indicates whether to load ImageNet Weights in case custom model is not provided.	False
TRANSFORMER	bool	Indicates whether the uploaded model has a transformer-based architecture	True

GPU Post Training Quantization - TensorRT

8-bit quantization executed on the GPU via the TensorRT runtime. Outputs a .engine model file that can be loaded directly with tensorrt.Runtime.

Parameter	Values	Description	Default Value
insize	Int	Input Shape For Vision Tasks (Currently only A X A Shapes supported)	32
BATCH_SIZE	Int	Batch Size for dataloader	1
TRAINING	bool	Enables Finetuning before PTQ	True
VALIDATE	bool	Enables Validation during Optional Finetuning (When TRAINING = True)	True
VALIDATION_INTERVAL	Int	Defines Epoch Intervals for Validation during Finetuning. (When TRAINING, VALIDATE = True)	1
CRITERION	"CrossEntropyLoss", "MSE Loss", others	Defines Loss functions for finetuning/validation (When TRAINING = True)	CrossEntropyLoss
LEARNING RATE	Float	Defines Learning Rate for Finetuning (When TRAINING = True)	0.001
FINETUNE_EPOCHS	Int	Defines the number of Epochs for Finetuning (When TRAINING = True)	1
OPTIMIZER	"Adam", "SGD", others	Defines Optimizer for Finetuning (When TRAINING = True)	Adam
PRETRAINED	bool	Indicates whether to load ImageNet Weights in case custom model is not provided.	False

Knowledge Distillation

A simple distillation training strategy that adds an extra loss term between teacher and student predictions. Outputs a .pt model file that can be loaded directly with torch.load.

Parameter	Values	Description	Default Value
insize	Int	A single integer representing the input image size for teacher network and student network	32
BATCH_SIZE	Int	Batch Size for dataloader	1
TRAINING	bool	Whether to finetune teacher model before distillation.	True
VALIDATE	bool	Enables Validation during Optional Finetuning (When TRAINING = True)	True
VALIDATION_INTERVAL	Int	Defines Epoch Intervals for Validation during Finetuning. (When TRAINING, VALIDATE = True)	1
CRITERION	"CrossEntropyLoss", "MSE Loss", others	Defines Loss functions for finetuning/validation (When TRAINING = True)	CrossEntropyLoss
LEARNING RATE	Float	Defines Learning Rate for Finetuning (When TRAINING = True)	0.001
FINETUNE_EPOCHS	Int	Defines the number of Epochs for Finetuning (When TRAINING = True)	1
OPTIMIZER	"Adam", "SGD", others	Defines Optimizer for Finetuning (When TRAINING = True)	Adam
TEACHER_MODEL	String	Model name of the teacher model (required both for built-in teachers and when a custom teacher is uploaded)	vgg16
CUSTOM_TEACHER_PATH	String	Relative Path for Teacher checkpoint from User Data folder	None
METHOD	"pkd", "cwd", "pkd_yolo"	Distillation algorithm used to distill models. Needed for MMDetection (pkd, cwd) and MMYOLO (pkd_yolo) distillation; not needed for classification.	pkd
EPOCHS	Int	Indicates Number of Training Epochs for Distillation	20
LR	Float	Indicates Learning Rate for distillation process.	0.01
LAMBDA	Float	Adjusts the balance between the cross-entropy and KL-divergence losses (Classification Only)	0.5
TEMPERATURE	Int	Indicates Temperature for softmax (Classification Only)	20
SEED	Int	Sets the seed for random number generation (Classification Only)	43
WEIGHT_DECAY	Float	Sets the amount of Weight Decay during Distillation (Classification Only)	0.0005
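For classification, LAMBDA and TEMPERATURE point to the standard Hinton-style loss: a hard-label cross entropy mixed with a KL divergence between temperature-softened teacher and student outputs. The sketch below assumes that LAMBDA weights the KL term and 1 - LAMBDA weights the cross entropy; which term LAMBDA multiplies in Kompress is an assumption, and the helper names are hypothetical.

```python
import math

# Hinton-style knowledge-distillation loss for one classification example.
# Assumption: LAMBDA weights the soft (KL) term and 1 - LAMBDA weights the
# hard-label cross entropy; Kompress's exact weighting may differ.


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, lam=0.5, temperature=20.0):
    # Hard-label cross entropy on the student's ordinary (T = 1) predictions.
    ce = -math.log(softmax(student_logits)[label])
    # KL(teacher || student) between temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # The T**2 factor keeps the soft-target gradients on a comparable scale
    # as the temperature changes.
    return (1.0 - lam) * ce + lam * temperature ** 2 * kl


student = [2.0, 1.0, 0.1]
print(kd_loss(student, student, label=0))          # KL term is 0 here
print(kd_loss(student, [0.1, 1.0, 2.0], label=0))  # teacher disagrees: larger loss
```

The defaults lam=0.5 and temperature=20.0 mirror the LAMBDA and TEMPERATURE defaults in the table above.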

Structured Pruning (Image Classification)

Removes existing parameters to increase efficiency. MMDetection and MMSegmentation models are currently supported through MMRazor pruning algorithms. Outputs a .pt model file that can be loaded directly with torch.load.

Parameter	Values	Description	Default Value
insize	Int	Input Shape For Vision Tasks (Currently only A X A Shapes supported)	32
BATCH_SIZE	Int	Batch Size for dataloader	1
TRAINING	bool	Whether to finetune model after pruning.	True
VALIDATE	bool	Enables Validation during Optional Finetuning (When TRAINING = True)	True
VALIDATION_INTERVAL	Int	Defines Epoch Intervals for Validation during Finetuning. (When TRAINING, VALIDATE = True)	1
CRITERION	"CrossEntropyLoss", "MSE Loss", others	Defines Loss functions for finetuning/validation (When TRAINING = True)	CrossEntropyLoss
LEARNING RATE	Float	Defines Learning Rate for Finetuning (When TRAINING = True)	0.001
FINETUNE_EPOCHS	Int	Defines the number of Epochs for Finetuning (When TRAINING = True)	1
OPTIMIZER	"Adam", "SGD", others	Defines Optimizer for Finetuning (When TRAINING = True)	Adam
PRETRAINED	bool	Indicates whether to load ImageNet Weights in case custom model is not provided.	False

	"MetaPruner", "GroupNormPruner", "BNSPruner"	Pruning algorithm used to prune classification models.	GroupNormPruner
GROUP_IMPORTANCE	"GroupNormImportance", "GroupTaylorImportance"	Logic used to identify the importance of parameters to prune.	GroupNormImportance
	bool	When pruning transformer-based architectures, whether to prune only the intermediate (bottleneck) layers or to perform uniform pruning.	False
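GROUP_IMPORTANCE decides which channels are removed. As an illustration of the default GroupNormImportance idea, the sketch below scores each channel by the L2 norm of all parameters coupled to it and drops the lowest-scoring fraction. The grouping, the helper names, and the pruning_ratio argument are hypothetical; a full pruner would also have to trace dependencies between layers to build these groups, which is not shown.

```python
import math

# Illustrative group-norm importance for structured channel pruning.
# A "group" holds every tensor slice that must be removed together with a
# channel (e.g. a conv filter and its BatchNorm scale). Hypothetical helpers.


def group_norm_importance(group):
    """L2 norm of each channel's parameters, pooled across the whole group."""
    n_channels = len(group[0])
    return [
        math.sqrt(sum(v * v for tensor in group for v in tensor[c]))
        for c in range(n_channels)
    ]


def channels_to_keep(scores, pruning_ratio):
    """Indices that survive after dropping the lowest-scoring fraction."""
    n_prune = int(len(scores) * pruning_ratio)
    ranked = sorted(range(len(scores)), key=lambda c: scores[c])
    return sorted(ranked[n_prune:])


# Four channels: conv filter weights and the matching BatchNorm scales.
conv_filters = [[0.5, -0.4], [0.01, 0.02], [1.2, 0.3], [0.05, -0.03]]
bn_scales = [[0.9], [0.1], [1.1], [0.2]]

scores = group_norm_importance([conv_filters, bn_scales])
keep = channels_to_keep(scores, pruning_ratio=0.5)
print(keep)  # [0, 2]
```

Pooling the norm across the whole group means a channel is kept only if its coupled parameters are collectively large, not just its conv filter.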

Structured Pruning (Object Detection)

The parameters below apply specifically to pruning object detection models.

Parameter	Values	Description	Default Value
INTERVAL	Int	Epoch Interval between every pruning operation	10
NORM_TYPE	"act", "flops"	Type of pruning operation. "act" focuses on reducing parameters with minimal change to activations; "flops" focuses on reducing FLOPs.	"act"
LR_RATIO	Float	Ratio by which the learning rate is decreased.	0.1
TARGET_FLOP_RATIO	Float	Target FLOP ratio for the pruned model (also used for "act").	0.5
EPOCHS	Int	Number of training epochs (preferably a multiple of INTERVAL).	20

LLM

LLM Structured Pruning

LLM Structured Pruning is a novel structured pruning framework for large language models (LLMs) that improves efficiency by reducing storage and increasing inference speed. Outputs model.safetensors, which can be loaded directly with transformers' from_pretrained() methods (e.g. AutoModelForCausalLM.from_pretrained()).

Parameter	Values	Description	Default Value
pruning_ratio	Float	Pruning ratio	0.2
metrics	"IFV", "WIFV", "WIFN"	Importance metric: "WIFN" (Weighted Input Feature Norm), "IFV" (Input Feature Variance), "WIFV" (Weighted Input Feature Variance)	"WIFV"
structure	"UL-UM", "UL-MM", "AL-MM", "AL-AM"	Pruning structure: "UL-UM" (Uniform across Layers, Uniform across Modules), "UL-MM" (Uniform across Layers, Manual ratio for Modules), "AL-MM" (Adaptive across Layers, Manual for Modules), "AL-AM" (Adaptive across both Layers and Modules)	"AL-MM"
remove_heads	Int	Number of heads to remove	8
nsamples	Int	Number of calibration samples	2048
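The metrics option scores input channels before they are pruned. The sketch below assumes that WIFV multiplies each input channel's sample variance over the calibration data by the squared L2 norm of the weight column that channel feeds, so channels that barely fluctuate score low even when their weights are large. The helper name is hypothetical and the formula is a stated assumption, not Kompress's code.

```python
# Sketch of a fluctuation-based importance score for the input channels of a
# linear layer. Assumption: WIFV = sample variance of each input channel over
# the calibration samples, weighted by the squared L2 norm of the weight
# column that channel feeds. Hypothetical helper; not Kompress's code.


def wifv_scores(activations, weight):
    """activations: [n_samples][n_in]; weight: [n_out][n_in]."""
    n = len(activations)
    scores = []
    for j in range(len(weight[0])):
        column = [sample[j] for sample in activations]
        mean = sum(column) / n
        variance = sum((x - mean) ** 2 for x in column) / (n - 1)
        col_norm_sq = sum(row[j] ** 2 for row in weight)
        scores.append(variance * col_norm_sq)
    return scores


activations = [[1.0, 5.0, 0.0], [3.0, 5.0, 0.2], [2.0, 5.0, -0.2]]
weight = [[0.5, 2.0, 1.0], [0.5, 2.0, 1.0]]
scores = wifv_scores(activations, weight)
print(scores)  # channel 1 scores 0.0: it never fluctuates
```

Channel 1 has the largest weights but a constant input, so it contributes only a fixed offset to the output and ranks first for removal under this score.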

LLM Quantization

A 4-bit weight-only quantization method for language models, using the GEMM (general matrix multiply) kernel version by default. Generates *.safetensors and config.json files that can be loaded directly with transformers' AutoModelForCausalLM.from_pretrained() or AutoAWQ's AutoAWQForCausalLM.from_quantized().

Parameter	Values	Description	Default Value
zero_point	bool	Whether to use zero point.	True
q_group_size	Int	Quantization group size	128
w_bit	Int	Weight bitwidth (only 4 bit is supported)	4
version	"GEMM", "GEMV"	AutoAWQ kernel version: GEMM or GEMV.	"GEMM"
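w_bit, q_group_size, and zero_point describe the integer grid the weights end up on. The sketch below quantizes a weight row group by group with a per-group scale and zero point, using a tiny group size so two groups are visible. It deliberately omits AWQ's activation-aware scaling step that precedes rounding, and the helper names are hypothetical.

```python
# Group-wise asymmetric (zero-point) weight quantization, illustrating what
# w_bit, q_group_size and zero_point control. AWQ's activation-aware scale
# search, which runs before this step, is not shown. Hypothetical helpers.


def quantize_weights(weights, q_group_size=128, w_bit=4):
    """Quantize a flat weight row group by group; returns codes and (scale, zero) per group."""
    qmax = 2 ** w_bit - 1
    codes, params = [], []
    for start in range(0, len(weights), q_group_size):
        group = weights[start:start + q_group_size]
        # Include 0.0 in the range so the zero point lands inside [0, qmax].
        lo, hi = min(min(group), 0.0), max(max(group), 0.0)
        scale = (hi - lo) / qmax if hi > lo else 1.0
        zero = round(-lo / scale)
        codes.extend(max(0, min(qmax, round(w / scale) + zero)) for w in group)
        params.append((scale, zero))
    return codes, params


def dequantize_weights(codes, params, q_group_size=128):
    out = []
    for i, q in enumerate(codes):
        scale, zero = params[i // q_group_size]
        out.append((q - zero) * scale)
    return out


# Tiny example with q_group_size=4 so that two groups are visible.
weights = [0.1, -0.2, 0.3, 0.05, 5.0, -4.0, 2.0, 1.0]
codes, params = quantize_weights(weights, q_group_size=4)
restored = dequantize_weights(codes, params, q_group_size=4)
print(codes)
print(params)
```

Because each group gets its own scale, the large values in the second group do not coarsen the grid used for the small values in the first.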

LLM Engine TensorRT

Optimizes LLMs for inference and builds TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently. Outputs .engine model files which can be directly loaded by NVIDIA Triton Inference Server.

Parameter	Values	Description	Default Value
to_quantize	bool	Whether to quantize the model before building the engine.	True
quant_method	"fp8", "int4_awq", "smoothquant", "int8"	Quantization format	"int4_awq"
smoothquant	Float	SmoothQuant α value (if quant_method = "smoothquant"); controls how much quantization difficulty is migrated between activations and weights	0.5
calib_size	Int	Calibration size	32
dtype	"float16"	dtype of the model	"float16"

Model/Quantization Support Grid:

Model	fp8	int4_awq	smoothquant	int8
LLaMA	✓	✓	✓	✓
LLaMA-2	✓	✓	✓	✓
Vicuna	✓	✓	✓	✓
Mixtral	✓	✓	-	✓
Mistral-7B	✓	✓	-	✓
Gemma	✓	✓	-	✓
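The smoothquant α value sets how much outlier magnitude moves from activations into weights. The sketch below applies the SmoothQuant per-channel factors s_j = max|X_j|^α / max|W_j|^(1-α) to a tiny linear layer and checks that the output is unchanged. The matrices and helper names are made up for illustration.

```python
# SmoothQuant-style smoothing for one linear layer, Y = X @ W.
# Per input channel j: s_j = max|X_j| ** alpha / max|W_j| ** (1 - alpha).
# Dividing X by s and multiplying W by s leaves Y unchanged while moving
# outlier magnitude from activations into weights. Illustrative only.


def smoothing_factors(act_absmax, weight_absmax, alpha=0.5):
    return [a ** alpha / w ** (1.0 - alpha) for a, w in zip(act_absmax, weight_absmax)]


def matmul(x, w):
    """x: [n][k] activations; w: [k][m] weights, one row per input channel."""
    return [[sum(x[i][k] * w[k][j] for k in range(len(w))) for j in range(len(w[0]))]
            for i in range(len(x))]


x = [[10.0, 0.1], [-8.0, 0.2]]   # input channel 0 carries an activation outlier
w = [[0.01, 0.02], [0.5, -0.4]]
act_absmax = [max(abs(row[k]) for row in x) for k in range(2)]
weight_absmax = [max(abs(v) for v in w[k]) for k in range(2)]
s = smoothing_factors(act_absmax, weight_absmax, alpha=0.5)

x_smooth = [[row[k] / s[k] for k in range(2)] for row in x]
w_smooth = [[w[k][j] * s[k] for j in range(2)] for k in range(2)]

print(matmul(x, w))
print(matmul(x_smooth, w_smooth))  # same values up to rounding
```

With α = 0.5, each channel's activation maximum and weight maximum become equal after smoothing, so neither side carries the outlier alone.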

LLM Engine ExLlama

EXL2 is a newer quantization format that allows mixing quantization levels within a model to reach any average bitrate between 2 and 8 bits per weight, which gives a lot of flexibility in how weights are stored. This implementation generates the engine files and a script for fast inference on the provided model. Outputs .safetensors and config.json model files, along with run.sh, which loads the model and runs a test inference with ExLlamaV2.

Parameter	Values	Description	Default Value
bits	Float, 2 to 8	Target average bits per weight	4.125
shard_size	Int	Max shard size in MB while saving model	8192
rope_scale	Float	RoPE scaling factor (related to RoPE (NTK) parameters for calibration)	1
rope_alpha	Float	RoPE alpha value (related to RoPE (NTK) parameters for calibration)	1
head_bits	Int	Target bits per weight (for head layer)	6
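Because bits is an average over the model, the storage footprint and the number of shards can be estimated with simple arithmetic. The sketch below assumes the body weights use bits, the head layer uses head_bits, shard_size is in MiB (2**20 bytes), and everything else (embeddings, norms, metadata) is ignored; the helper names and parameter counts are made up.

```python
import math

# Back-of-the-envelope size estimate for an EXL2 model. Assumptions: the
# main weights average `bits` per weight, the head layer uses `head_bits`,
# shard_size is in MiB (2**20 bytes), and embeddings, norms and metadata
# are ignored. Hypothetical helpers; parameter counts are made up.


def exl2_size_mb(body_params, head_params, bits=4.125, head_bits=6):
    total_bits = body_params * bits + head_params * head_bits
    return total_bits / 8 / 2 ** 20


def shard_count(size_mb, shard_size=8192):
    return math.ceil(size_mb / shard_size)


size = exl2_size_mb(body_params=6_500_000_000, head_params=131_072_000)
print(round(size), "MB in", shard_count(size), "shard(s)")
```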

LLM Engine MLCLLM

Compiler accelerations and runtime optimizations for native deployment across platforms and edge devices. Outputs params-*.bin files and compiled files directly usable by MLC Chat. Also produces a run.py for sample usage.

Parameter	Values	Description	Default Value
quantize	bool	Indicates whether quantization is applied to the model	True
quant_method	"q4f16_0", "q4f16_autoawq"	Method used for quantization	"q4f16_autoawq"
conv_template	"llama-2"	Conversation template	None
llvm_triple	null	LLVM target triple used for compilation	None

Low-rank Decomposition

Coming soon