Compression of LSTM models for Retail Edge deployments

Deploying AI models in retail environments raises practical issues. These environments include store-level systems, edge devices, and budget-conscious setups, especially at small to mid-sized retail companies. Major use cases such as inventory management, demand forecasting, and shelf optimization all require models that are small, fast, and accurate.
That’s exactly what we’re going to work on here. In this article, I will walk you through three compression techniques step by step. We will start by building a baseline LSTM, measure its size and accuracy, and then apply the compression methods one at a time to see how each changes the model. Finally, we’ll wrap everything up with a side-by-side comparison.
So, without further ado, let’s dive in.
The Problem: Retail AI at the Edge
With everything now moving to the edge, retail is also turning to store-level mobile apps, devices, and IoT sensors that can run models and make predictions locally rather than constantly calling cloud APIs.
A predictive model running on a store device or mobile application, such as a shelf sensor or scanner, faces constraints such as limited memory, limited battery, and the need for low-latency predictions without a network round trip.
Even with cloud deployment, a smaller model reduces cost, especially if you run thousands of predictions every day across a huge product catalog. A 4KB model costs far less to serve than a 66KB one.
Beyond cost, inference speed also affects real-time decisions. Fast model forecasting benefits inventory optimization and restocking alerts.
Measurement Setup
For this experiment, I used the Kaggle Store Item Demand Forecasting dataset at the store level. It covers 5 years of daily sales across 10 stores and 50 items, with weekly seasonality, trends, and noise.
From it, I sampled 5 stores and 10 items to create 50 distinct time series. Each store-item combination generates its own sequence, yielding roughly 72,000 training samples in total. The model forecasts the next day’s sales from the past 14 days of sales history, a common setup for demand forecasting.
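To make the windowing concrete, here is a minimal sketch of that step, assuming the dataset’s standard `date`, `store`, `item`, and `sales` columns and its `train.csv` file name (the helper below is illustrative, not the exact preprocessing code):

```python
import numpy as np
import pandas as pd

SEQ_LEN = 14  # days of history per input window

def make_windows(series, seq_len=SEQ_LEN):
    """Slide a window over one store-item series: X = past 14 days, y = next day."""
    X, y = [], []
    for i in range(len(series) - seq_len):
        X.append(series[i:i + seq_len])
        y.append(series[i + seq_len])
    return np.asarray(X)[..., np.newaxis], np.asarray(y)

# One sequence per store-item combination (here: store 1, item 1)
df = pd.read_csv("train.csv", parse_dates=["date"])
sales = df[(df.store == 1) & (df.item == 1)].sort_values("date")["sales"].values
X, y = make_windows(sales)  # X shape: (n_samples, 14, 1)
```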
Each experiment was run 3 times and the results averaged, so the numbers can be considered reliable.
| Parameter | Details |
|---|---|
| Dataset | Kaggle Store Item Demand Forecasting Dataset |
| Sample | 5 stores × 10 items = 50 time series |
| Training samples | ~72,000 samples in total |
| Sequence length | Last 14 days of sales |
| Task | One-step-ahead daily sales forecast |
| Metric | Mean Absolute Percentage Error (MAPE) |
| Runs | 3 per model, averaged |
Step 1: Building a Baseline LSTM
Before compressing anything, we need a point of reference. Our baseline is a standard LSTM with 64 hidden units trained on the dataset described above.
Baseline Code:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

def build_lstm(units, seq_length):
    """Build an LSTM forecaster with the specified number of hidden units."""
    model = Sequential([
        LSTM(units, activation='tanh', input_shape=(seq_length, 1)),
        Dropout(0.2),
        Dense(1)
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Baseline: 64 hidden units
baseline_model = build_lstm(64, seq_length=14)
```

Baseline Performance:
| Approach | Model | Size (KB) | MAPE (%) | MAPE Std (%) |
|---|---|---|---|---|
| Baseline | LSTM-64 | 66.25 | 15.92 | ±0.10 |
This is our reference point: the LSTM-64 model is 66.25KB in size with a MAPE of 15.92%. All the compression results below will be measured against these numbers.
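As a reference for how such numbers can be reproduced, here is a minimal sketch (illustrative helpers, not the exact benchmarking code): size is estimated as the float32 parameter count times 4 bytes, and MAPE is computed directly from predictions.

```python
import numpy as np

def model_size_kb(model):
    """Approximate model size: float32 parameter count x 4 bytes."""
    return model.count_params() * 4 / 1024

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

print(f"{model_size_kb(baseline_model):.2f} KB")  # 66.25 for LSTM-64
```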
Step 2: Compression Technique 1 – Structural Downsizing
With this approach, we shrink the model by simply using fewer hidden units. Instead of a 64-unit LSTM, we train a 32-unit and a 16-unit model from scratch and see how they perform. This is the easiest of the three techniques.
Code:
```python
# Using the same build_lstm function from the baseline
# Compare: 64 units (66KB) vs 32 units vs 16 units
model_32 = build_lstm(32, seq_length=14)
model_16 = build_lstm(16, seq_length=14)
```

Results:
| Approach | Model | Size (KB) | MAPE (%) | MAPE Std (%) |
|---|---|---|---|---|
| Baseline | LSTM-64 | 66.25 | 15.92 | ±0.10 |
| Structural | LSTM-32 | 17.13 | 16.22 | ±0.09 |
| Structural | LSTM-16 | 4.57 | 16.74 | ±0.46 |
Analysis: The LSTM-16 model is 14.5x smaller than the 64-unit model (4.57KB vs 66.25KB), while its MAPE increases by only 0.82 percentage points. For most commercial applications this difference is negligible. The LSTM-32 model offers a middle ground: 3.9x compression with an accuracy loss of only 0.3 points.
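These sizes can be sanity-checked from the parameter counts. A Keras LSTM layer holds 4 × units × (units + input_dim + 1) weights (four gates, each with a kernel, a recurrent kernel, and a bias), and the Dense head adds units + 1 more:

```python
def lstm_model_params(units, input_dim=1):
    """Parameter count for LSTM(units) followed by Dense(1)."""
    lstm_params = 4 * units * (units + input_dim + 1)
    dense_params = units + 1
    return lstm_params + dense_params

for units in (64, 32, 16):
    n = lstm_model_params(units)
    print(f"LSTM-{units}: {n} params = {n * 4 / 1024:.2f} KB at float32")
# LSTM-64: 16961 params = 66.25 KB
# LSTM-32: 4385 params = 17.13 KB
# LSTM-16: 1169 params = 4.57 KB
```

The computed sizes match the benchmark table exactly, so the structural compression ratios follow directly from this formula.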
Step 3: Compression Technique 2 – Magnitude Pruning
Pruning removes low-impact weights from a trained model. The key idea is that many weights in a neural network contribute little and can be set to zero. After pruning, the model is fine-tuned to restore accuracy.
Code:
```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.optimizers import Adam

def apply_magnitude_pruning(model, target_sparsity=0.5):
    """Apply per-layer magnitude pruning; biases are skipped."""
    masks = []
    for layer in model.layers:
        weights = layer.get_weights()
        layer_masks = []
        new_weights = []
        for w in weights:
            if w.ndim == 1:  # Bias - don't prune
                layer_masks.append(None)
                new_weights.append(w)
            else:  # Kernel - zero out the smallest weights per layer
                threshold = np.percentile(np.abs(w), target_sparsity * 100)
                mask = (np.abs(w) >= threshold).astype(np.float32)
                layer_masks.append(mask)
                new_weights.append(w * mask)
        masks.append(layer_masks)
        layer.set_weights(new_weights)
    return masks

class MaintainSparsity(tf.keras.callbacks.Callback):
    """Re-apply the pruning masks after each batch so pruned weights stay zero."""
    def __init__(self, masks):
        super().__init__()
        self.masks = masks

    def on_train_batch_end(self, batch, logs=None):
        for layer, layer_masks in zip(self.model.layers, self.masks):
            weights = layer.get_weights()
            for i, mask in enumerate(layer_masks):
                if mask is not None:
                    weights[i] = weights[i] * mask
            layer.set_weights(weights)

# 'model' is the trained baseline LSTM from Step 1
# After pruning, fine-tune with a lower learning rate
masks = apply_magnitude_pruning(model, target_sparsity=0.5)
model.compile(optimizer=Adam(learning_rate=0.0001), loss="mse")
model.fit(X_train, y_train, epochs=50, callbacks=[MaintainSparsity(masks)])
```

Results:
| Approach | Model | Size (KB) | MAPE (%) | MAPE Std (%) |
|---|---|---|---|---|
| Baseline | LSTM-64 | 66.25 | 15.92 | ±0.10 |
| Pruning | Pruned 30% | 11.99 | 16.04 | ±0.09 |
| Pruning | Pruned 50% | 8.56 | 16.20 | ±0.08 |
| Pruning | Pruned 70% | 5.14 | 16.84 | ±0.16 |
Analysis: With magnitude pruning at 50% sparsity, the model size dropped to 8.56KB with only 0.28 percentage points of accuracy loss versus the baseline. Even at 70% pruning, MAPE stayed below 17%.
The key findings for pruning LSTMs were: use per-layer thresholds instead of a single global threshold, skip bias weights (prune only kernels), and fine-tune at a lower learning rate. Without these precautions, LSTM performance can degrade sharply because the recurrent weights are strongly interdependent.
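To confirm that fine-tuning actually preserved the target sparsity, a quick check like this can be run after training (an illustrative helper, not part of the benchmark code):

```python
def report_kernel_sparsity(model):
    """Print the fraction of kernel (non-bias) weights that are exactly zero."""
    total = zeros = 0
    for layer in model.layers:
        for w in layer.get_weights():
            if w.ndim > 1:  # kernels only; biases were never pruned
                total += w.size
                zeros += int(np.sum(w == 0))
    print(f"Kernel sparsity: {zeros / total:.1%}")

report_kernel_sparsity(model)  # expect ~50% after 50% pruning
```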
Step 4: Compression Technique 3 — INT8 Quantization
Quantization converts the 32-bit floating-point weights to 8-bit integers after training, which shrinks per-weight storage by 4x without losing much accuracy.
Code:
```python
def simulate_int8_quantization(model):
    """Simulate INT8 quantization on model weights (quantize, then dequantize)."""
    for layer in model.layers:
        weights = layer.get_weights()
        quantized = []
        for w in weights:
            w_min, w_max = w.min(), w.max()
            if w_max - w_min > 1e-10:
                # Quantize to the INT8 range [0, 255]
                scale = (w_max - w_min) / 255.0
                zero_point = np.round(-w_min / scale)
                w_int8 = np.round(w / scale + zero_point).clip(0, 255)
                # Dequantize to measure the accuracy impact
                w_quant = (w_int8 - zero_point) * scale
            else:
                w_quant = w
            quantized.append(w_quant.astype(np.float32))
        layer.set_weights(quantized)
```

For production use, TensorFlow Lite's built-in conversion is recommended:
```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
```

Results:
| Approach | Model | Size (KB) | MAPE (%) | MAPE Std (%) |
|---|---|---|---|---|
| Baseline | LSTM-64 | 66.25 | 15.92 | ±0.10 |
| Quantization | INT8 | 4.28 | 16.21 | ±0.22 |
Analysis: INT8 quantization reduced the model size from 66.25KB to 4.28KB (15.5x compression) with only a 0.29 percentage-point increase in MAPE. That makes it far smaller than, yet about as accurate as, the unpruned LSTM-32 model. If your deployment platform supports INT8 inference, this is the best of the three techniques.
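One caveat: `Optimize.DEFAULT` on its own performs dynamic-range quantization. For fully calibrated INT8 conversion, TFLite also needs a representative dataset. A sketch, assuming the `X_train` windows from earlier (and noting that INT8 op coverage for LSTM kernels varies across TFLite versions):

```python
def representative_data_gen():
    # Yield a few hundred real input windows so the converter can calibrate ranges
    for i in range(200):
        yield [X_train[i:i + 1].astype(np.float32)]

converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_int8_model = converter.convert()
```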
Putting It All Together: A Side-by-Side Comparison
Here’s how each technique compares to the LSTM-64 baseline:
| Strategy | Compression | Accuracy Impact |
|---|---|---|
| LSTM-32 | 3.9x | +0.30% MAPE |
| LSTM-16 | 14.5x | +0.82% MAPE |
| Pruned 30% | 5.5x | +0.12% MAPE |
| Pruned 50% | 7.7x | +0.28% MAPE |
| Pruned 70% | 12.9x | +0.92% MAPE |
| INT8 Quantization | 15.5x | +0.29% MAPE |
Full benchmark results for all strategies:
| Approach | Model | Size (KB) | MAPE (%) | MAPE Std (%) |
|---|---|---|---|---|
| Baseline | LSTM-64 | 66.25 | 15.92 | ±0.10 |
| Structural | LSTM-32 | 17.13 | 16.22 | ±0.09 |
| Structural | LSTM-16 | 4.57 | 16.74 | ±0.46 |
| Pruning | Pruned 30% | 11.99 | 16.04 | ±0.09 |
| Pruning | Pruned 50% | 8.56 | 16.20 | ±0.08 |
| Pruning | Pruned 70% | 5.14 | 16.84 | ±0.16 |
| Quantization | INT8 | 4.28 | 16.21 | ±0.22 |
Each of the above methods comes with its own trade-offs. Structural downsizing reduces model size but requires retraining from scratch. Pruning preserves the architecture but sparsifies its connections. Quantization is fast to apply but needs a runtime that supports it.
Choosing the Right Strategy
Choose structural downsizing if:
- You are starting from scratch and can afford to train
- Simplicity matters more than maximum compression
Choose pruning if:
- You already have a trained model that you want to compress
- You need fine-grained control over the size-accuracy trade-off
Choose quantization if:
- You need high compression with minimal loss of accuracy
- Your target deployment platform is INT8-optimized (for example, mobile or edge devices)
- You want a quick solution without retraining from scratch
Choose a hybrid strategy if:
- You need maximum compression (edge deployment, IoT)
- You can invest time in a multi-stage compression pipeline (see the sketch after this list)
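Here is a sketch of one possible hybrid pipeline, reusing `build_lstm`, `apply_magnitude_pruning`, and the `MaintainSparsity` callback from the earlier steps (the epoch counts are illustrative, not benchmarked):

```python
# 1. Structural downsizing: train a smaller model from scratch
small_model = build_lstm(32, seq_length=14)
small_model.fit(X_train, y_train, epochs=50)

# 2. Magnitude pruning, then fine-tune at a low learning rate
masks = apply_magnitude_pruning(small_model, target_sparsity=0.5)
small_model.compile(optimizer=Adam(learning_rate=0.0001), loss="mse")
small_model.fit(X_train, y_train, epochs=20,
                callbacks=[MaintainSparsity(masks)])

# 3. INT8 quantization via TFLite for deployment
converter = tf.lite.TFLiteConverter.from_keras_model(small_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
```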
Points to Remember in Production
Model compression is only one part of the puzzle. In production systems there are other factors to consider, as given below.
- A smaller, recently retrained model often beats a bigger, stale one. Build retraining into your pipeline, as sales patterns change with seasons, trends, promotions, and so on.
- Benchmarks from a development machine do not transfer directly to the production device. In particular, quantized models can behave differently on different platforms.
- Monitoring is essential in production, since compression can cause subtle accuracy degradation. Put the necessary alerting and paging in place (a minimal drift check is sketched after this list).
- Always consider overall system cost: a 4KB model that requires a special runtime can end up costing more than a standard 17KB model that runs everywhere.
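For instance, a minimal drift check (a hypothetical helper, reusing the `mape` function from the measurement sketch earlier) could compare live error against the benchmarked figure:

```python
def check_drift(y_true, y_pred, benchmark_mape=16.21, tolerance=1.0):
    """Alert if live MAPE exceeds the benchmarked MAPE by more than `tolerance` points."""
    live_mape = mape(y_true, y_pred)
    if live_mape > benchmark_mape + tolerance:
        print(f"ALERT: live MAPE {live_mape:.2f}% vs benchmark {benchmark_mape:.2f}%")
        return True
    return False
```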
Conclusion
To conclude, all three compression techniques can deliver significant size reduction while maintaining reasonable accuracy.
Structural downsizing is the easiest of the three: LSTM-16 delivers 14.5x compression with less than 1 percentage point of accuracy loss.
Pruning provides more control. Done properly (per-layer thresholds, skipping biases, fine-tuning at a low learning rate), 70% pruning achieves 12.9x compression.
INT8 quantization achieves the best overall tradeoff: 15.5x compression with only a 0.29-point increase in MAPE.
The best strategy depends on your constraints. If you need a simple solution, start with structural downsizing. If you need a high compression rate with minimal loss of accuracy, go with quantization. Choose pruning when you need tighter control over the compression-accuracy tradeoff.
For edge deployments on in-store devices, tablets, shelf sensors, or scanners, model size (4KB vs 66KB) can determine whether your AI runs locally on the device or requires a persistent cloud connection.