
How to Build a Matryoshka-Optimized Sentence Embedding Model for Ultra-Fast Retrieval with 64-Dimension Truncation

In this tutorial, we fine-tune a Sentence-Transformers embedding model with Matryoshka Representation Learning (MRL) so that the first dimensions of each vector carry the most useful semantic signal. We train with MatryoshkaLoss on triplet data and verify the key promise of MRL by measuring retrieval quality after truncating the embeddings to 64, 128, and 256 dimensions. Finally, we save the tuned model and show how to load it with a small truncate_dim setting for fast, memory-efficient vector search.

!pip -q install -U sentence-transformers datasets accelerate


import math
import random
import numpy as np
import torch


from datasets import load_dataset
from torch.utils.data import DataLoader


from sentence_transformers import SentenceTransformer, InputExample
from sentence_transformers import losses
from sentence_transformers.util import cos_sim




def set_seed(seed=42):
   random.seed(seed)
   np.random.seed(seed)
   torch.manual_seed(seed)
   torch.cuda.manual_seed_all(seed)


set_seed(42)

We install the required libraries and import the modules needed for training and evaluation. We set a deterministic seed so that sampling and training behavior stays the same across runs, and we make sure the PyTorch and CUDA RNGs are seeded consistently when a GPU is available.

@torch.no_grad()
def retrieval_metrics_mrr_recall_at_k(
   model,
   queries,
   corpus,
   qrels,
   dims_list=(64, 128, 256, None),
   k=10,
   batch_size=64,
):
   device = "cuda" if torch.cuda.is_available() else "cpu"
   model.to(device)


   qids = list(queries.keys())
   docids = list(corpus.keys())


   q_texts = [queries[qid] for qid in qids]
   d_texts = [corpus[did] for did in docids]


   q_emb = model.encode(q_texts, batch_size=batch_size, convert_to_tensor=True, normalize_embeddings=True)
   d_emb = model.encode(d_texts, batch_size=batch_size, convert_to_tensor=True, normalize_embeddings=True)


   results = {}


   for dim in dims_list:
       if dim is None:
           qe = q_emb
           de = d_emb
           dim_name = "full"
       else:
           qe = q_emb[:, :dim]
           de = d_emb[:, :dim]
           dim_name = str(dim)
           qe = torch.nn.functional.normalize(qe, p=2, dim=1)
           de = torch.nn.functional.normalize(de, p=2, dim=1)


       sims = cos_sim(qe, de)


       mrr_total = 0.0
       recall_total = 0.0


       for i, qid in enumerate(qids):
           rel = qrels.get(qid, set())
           if not rel:
               continue


           topk = torch.topk(sims[i], k=min(k, sims.shape[1]), largest=True).indices.tolist()
           topk_docids = [docids[j] for j in topk]


           recall_total += 1.0 if any(d in rel for d in topk_docids) else 0.0


           rr = 0.0
           for rank, d in enumerate(topk_docids, start=1):
               if d in rel:
                   rr = 1.0 / rank
                   break
           mrr_total += rr


       denom = max(1, len(qids))
       results[dim_name] = {f"MRR@{k}": mrr_total / denom, f"Recall@{k}": recall_total / denom}


   return results




def pretty_print(results, title):
   print("n" + "=" * 80)
   print(title)
   print("=" * 80)
   for dim, metrics in results.items():
       print(f"dim={dim:>4} | " + " | ".join([f"{k}={v:.4f}" for k, v in metrics.items()]))

We define a simple retrieval evaluator that encodes the queries and documents, computes cosine similarity, and reports MRR@10 and Recall@10. We re-normalize the embeddings after truncation so that the shortened prefixes remain comparable in cosine space. We also add a small printing helper to make before/after comparisons easier to read.
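To see why the re-normalization step matters, here is a tiny standalone sketch (not part of the tutorial pipeline) showing that the 64-dimension prefix of a unit-normalized vector is no longer unit length, so cosine scores would be skewed if we skipped re-normalizing:

import torch
import torch.nn.functional as F

# Hypothetical full-dimensional embedding, normalized to unit length.
vec = F.normalize(torch.randn(768), p=2, dim=0)
prefix = vec[:64]

print(prefix.norm().item())                            # < 1.0: the truncated prefix lost unit norm
print(F.normalize(prefix, p=2, dim=0).norm().item())   # back to 1.0 after re-normalization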

DATASET_ID = "sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1"
SUBSET = "triplet-hard"
SPLIT = "train"


TRAIN_SAMPLES = 4000
EVAL_QUERIES = 300


stream = load_dataset(DATASET_ID, SUBSET, split=SPLIT, streaming=True)


train_examples = []
eval_queries = {}
eval_corpus = {}
eval_qrels = {}


doc_id_counter = 0
qid_counter = 0


for row in stream:
   q = (row.get("query") or "").strip()
   pos = (row.get("positive") or "").strip()
   neg = (row.get("negative") or "").strip()


   if not q or not pos or not neg:
       continue


   train_examples.append(InputExample(texts=[q, pos, neg]))


   if len(eval_queries) < EVAL_QUERIES:
       qid = f"q{qid_counter}"
       qid_counter += 1


       pos_id = f"d{doc_id_counter}"; doc_id_counter += 1
       neg_id = f"d{doc_id_counter}"; doc_id_counter += 1


       eval_queries[qid] = q
       eval_corpus[pos_id] = pos
       eval_corpus[neg_id] = neg
       eval_qrels[qid] = {pos_id}


   if len(train_examples) >= TRAIN_SAMPLES and len(eval_queries) >= EVAL_QUERIES:
       break


print(len(train_examples), len(eval_queries), len(eval_corpus))

We stream the mined MS MARCO dataset and build both a training set of (query, positive, negative) triplets and a small IR evaluation set. For each evaluation query we map it to its relevant (positive) passage and add the hard negative as a distractor so that retrieval is meaningful. We stop early to keep the run Colab-friendly while still sampling enough data to show the effect of truncation.
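As a quick sanity check on the structures we just built (assuming the streaming loop above has run), the snippet below prints one evaluation query along with its relevant passage from the qrels mapping:

# Inspect one query and its relevant passage(s).
sample_qid = next(iter(eval_queries))
print("query:", eval_queries[sample_qid])
for did in eval_qrels[sample_qid]:
    print("relevant doc:", eval_corpus[did][:120])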

MODEL_ID = "BAAI/bge-base-en-v1.5"


device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer(MODEL_ID, device=device)
full_dim = model.get_sentence_embedding_dimension()


baseline = retrieval_metrics_mrr_recall_at_k(
   model,
   queries=eval_queries,
   corpus=eval_corpus,
   qrels=eval_qrels,
   dims_list=(64, 128, 256, None),
   k=10,
)
pretty_print(baseline, "BEFORE")

We load a strong base embedding model and record its full embedding dimensionality. We run a baseline evaluation at 64, 128, 256, and full dimensions to see how truncation behaves before any training, and we print the results so that we can later check whether MRL improves the quality of the early dimensions.
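The memory savings behind truncation are easy to estimate. As a rough back-of-the-envelope sketch (assuming raw float32 vectors and ignoring any index overhead), here is the approximate storage needed for one million vectors at each dimensionality we evaluate:

# Approximate raw storage for 1M float32 vectors at each dimensionality.
n_docs = 1_000_000
for dim in (64, 128, 256, full_dim):
    corpus_bytes = n_docs * dim * 4  # 4 bytes per float32 value
    print(f"dim={dim:>4}: ~{corpus_bytes / 1024**3:.2f} GB")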

batch_size = 16
epochs = 1
warmup_steps = 100


train_loader = DataLoader(train_examples, batch_size=batch_size, shuffle=True, drop_last=True)


base_loss = losses.MultipleNegativesRankingLoss(model=model)


mrl_dims = [full_dim, 512, 256, 128, 64] if full_dim >= 768 else [full_dim, 256, 128, 64]
mrl_loss = losses.MatryoshkaLoss(
   model=model,
   loss=base_loss,
   matryoshka_dims=mrl_dims
)


model.fit(
   train_objectives=[(train_loader, mrl_loss)],
   epochs=epochs,
   warmup_steps=warmup_steps,
   show_progress_bar=True,
)


after = retrieval_metrics_mrr_recall_at_k(
   model,
   queries=eval_queries,
   corpus=eval_corpus,
   qrels=eval_qrels,
   dims_list=(64, 128, 256, None),
   k=10,
)
pretty_print(after, "AFTER")


out_dir = "mrl-msmarco-demo"
model.save(out_dir)


m64 = SentenceTransformer(out_dir, truncate_dim=64)
emb = m64.encode(
   ["what is the liberal arts?", "liberal arts covers humanities and sciences"],
   normalize_embeddings=True
)
print(emb.shape)

We create MultipleNegativesRankingLoss and wrap it with MatryoshkaLoss, passing the list of target prefix dimensions. We fine-tune the model on the triplets, then re-run the same truncation benchmark to measure the improvement. Finally, we save the model and reload it with truncate_dim=64 to confirm the practical low-dimensional retrieval workflow.
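To make the truncate_dim workflow concrete, here is a minimal search example with the reloaded 64-dimensional model; the documents and query are made up for illustration and are not part of the evaluation set:

# Tiny in-memory search with the 64-dim model (toy data, illustrative only).
docs = [
    "liberal arts covers humanities and sciences",
    "gradient descent minimizes a loss function",
    "the eiffel tower is in paris",
]
doc_emb = m64.encode(docs, convert_to_tensor=True, normalize_embeddings=True)
query_emb = m64.encode(["what is the liberal arts?"], convert_to_tensor=True, normalize_embeddings=True)

scores = cos_sim(query_emb, doc_emb)[0]
best = torch.topk(scores, k=2).indices.tolist()
print([docs[i] for i in best])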

In conclusion, we trained a Matryoshka-optimized embedding model that maintains robust retrieval performance even when its vectors are truncated to a small prefix, such as 64 dimensions. We validated the result by comparing baseline and post-training retrieval metrics across several truncation sizes and the full embedding. With the saved model and the truncate_dim loading pattern, we now have a clean workflow for building small, fast vector indexes while keeping the option to re-embed at full dimensionality when needed.

