Technology & AI

Microsoft AI Releases Harrier-OSS-v1: A New Family of Multilingual Embedding Models That Achieves SOTA on MTEB Multilingual v2

Microsoft announced the release of the Harrier-OSS-v1 family, three multilingual text embedding models designed to provide high-quality semantic representations across a wide range of languages. The release includes three scales: a 270M-parameter model, a 0.6B model, and a 27B model.

Harrier-OSS-v1 models have achieved state-of-the-art (SOTA) results on MTEB (Massive Text Embedding Benchmark) Multilingual v2. For AI professionals, this release marks an important milestone in open-source retrieval technology, offering a scalable range of models that apply modern LLM architectures to embedding tasks.

Architecture and Foundation

The Harrier-OSS-v1 family is a departure from the bidirectional encoder architectures (such as BERT) that have dominated the embedding landscape for years. Instead, these models use decoder-only architectures similar to those found in modern Large Language Models (LLMs).

The use of decoder-only backbones represents a change in how context is processed. In a causal (decoder-only) model, each token can only attend to the tokens that come before it. To derive a single vector representing the entire input, Harrier uses last-token pooling: the hidden state of the final token in the sequence serves as the aggregate representation of the text, which is then L2-normalized to ensure the vector has constant magnitude.
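The pooling and normalization steps above can be sketched in a few lines. This is a minimal illustration with random stand-in values; the shapes and variable names are assumptions for demonstration, not taken from the Harrier release.

```python
import numpy as np

# Assume `hidden_states` is the decoder's final-layer output for one
# sequence, with shape (seq_len, hidden_dim). Random values stand in
# for real model activations here.
rng = np.random.default_rng(0)
seq_len, hidden_dim = 8, 640
hidden_states = rng.standard_normal((seq_len, hidden_dim))

# 1. Last-token pooling: the hidden state of the final token stands in
#    for the whole sequence (valid under causal attention, since the
#    last token has attended to everything before it).
embedding = hidden_states[-1]

# 2. L2 normalization: scale the vector to unit length so every
#    embedding has constant magnitude.
embedding = embedding / np.linalg.norm(embedding)

print(embedding.shape)  # (640,)
```

Because the outputs are unit vectors, downstream similarity search can use a plain dot product as cosine similarity.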

Technical Details

Harrier-OSS-v1 models are characterized by their varying embedding sizes and their consistent support for long-context input. The following table provides a breakdown of the technical specifications:

Model                 Parameters  Embedding Dimension  Context Window
Harrier-OSS-v1 270M   270M        640                  32,768 tokens
Harrier-OSS-v1 0.6B   0.6B        1,024                32,768 tokens
Harrier-OSS-v1 27B    27B         5,376                32,768 tokens

The 32,768-token (32k) context window across all three sizes is a key feature for Retrieval-Augmented Generation (RAG) systems. Most traditional embedding models are limited to 512 or 1,024 tokens. The extended window allows AI developers to embed very large documents or code files without aggressive chunking, which often leads to a loss of semantic coherence.
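To make the difference concrete, here is a toy calculation of how many chunks a long document would need under a traditional window versus a 32k window. The document length is a hypothetical example.

```python
import math

def num_chunks(doc_tokens: int, window: int) -> int:
    """Number of pieces a document must be split into for a given window."""
    return math.ceil(doc_tokens / window)

doc_tokens = 20_000  # e.g., a long technical report (hypothetical)

print(num_chunks(doc_tokens, 512))     # 40 chunks under a 512-token limit
print(num_chunks(doc_tokens, 32_768))  # 1 chunk under a 32k window
```

One embedding per document also means one vector to index and search, rather than forty fragments whose scores must be merged.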

Implementation: Instruction-Based Embedding

One of the most important operational details for AI developers is that Harrier-OSS-v1 is an instruction-tuned model family. To achieve optimal performance, the model requires task-specific instructions to be provided at query time.

The implementation follows a specific pattern:

  • Query side: Every query should be prefixed with a one-sentence task description that states its purpose (e.g., retrieving semantically similar text or finding a translation).
  • Document side: Documents must be encoded without any instructions.

An example query format would look like this:

"Instruct: Retrieve semantically similar textnQuery: [User input text]"

This instruction-based approach allows the model to adjust its vector space dynamically based on the task, improving retrieval accuracy across different domains such as web search or bitext mining.
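The asymmetric formatting described above can be sketched as two small helpers. The exact prompt string follows the example in the article; treat it as illustrative and consult the model card for the canonical templates.

```python
def format_query(task: str, text: str) -> str:
    # Queries get a one-sentence task instruction prepended.
    return f"Instruct: {task}\nQuery: {text}"

def format_document(text: str) -> str:
    # Documents are encoded as-is, with no instruction prefix.
    return text

q = format_query("Retrieve semantically similar text",
                 "how to cache dns lookups")
d = format_document("DNS caching stores resolved records locally.")

print(q)  # Instruct: Retrieve semantically similar text\nQuery: ...
```

Keeping documents instruction-free means the document index never has to be rebuilt when the query-side task changes.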

Training and Distillation

The development of the Harrier-OSS-v1 family involved a multi-stage training process. Although the 27B model offers the highest parameter count and embedding dimension (5,376), the Microsoft team applied specific techniques to boost the performance of the smaller variants.

The 270M and 0.6B models are trained using knowledge distillation from larger embedding models. Knowledge distillation is a technique in which a 'student' model is trained to replicate the output distribution or feature representations of a higher-performing 'teacher' model. This allows the smaller Harrier models to achieve higher embedding quality than would be expected from their parameter counts, making them efficient choices for applications where memory or latency is a constraint.
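A minimal sketch of embedding-space distillation follows: the student is penalized for producing embeddings that differ from the teacher's. The matched dimensions, random stand-in vectors, and plain MSE objective are assumptions for illustration, not Microsoft's published recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, dim = 4, 640

# Stand-ins for teacher and student embeddings of the same batch of texts.
teacher_emb = rng.standard_normal((batch, dim))
student_emb = rng.standard_normal((batch, dim))

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so the loss compares directions."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

t = l2_normalize(teacher_emb)
s = l2_normalize(student_emb)

# Distillation loss: mean squared error between student and teacher
# embeddings. Minimizing this (via gradient descent on the student's
# weights, omitted here) pulls the student toward the teacher's space.
loss = float(np.mean((s - t) ** 2))
print(loss >= 0.0)  # True
```

In practice such a loss is typically combined with contrastive retrieval objectives rather than used alone.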

Performance on MTEB Multilingual v2

MTEB Multilingual v2 is a comprehensive benchmark that tests models across a wide range of tasks, including:

  • Classification: Assigning a label to a piece of text.
  • Clustering: Grouping similar documents.
  • Pair Classification: Determining whether two sentences are related (e.g., paraphrases or entailment).
  • Retrieval: Finding the most relevant document for a given query.
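The retrieval task in this list reduces to nearest-neighbor search over embeddings. Here is a toy example with random stand-in vectors: because the embeddings are L2-normalized, a single matrix-vector product yields cosine similarities, and the highest score wins.

```python
import numpy as np

rng = np.random.default_rng(1)

# Five fake "document" embeddings, L2-normalized (stand-ins for model output).
doc_embs = rng.standard_normal((5, 64))
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)

# A query embedding constructed to lie near document 2.
query = doc_embs[2] + 0.05 * rng.standard_normal(64)
query /= np.linalg.norm(query)

# With unit vectors, dot product == cosine similarity.
scores = doc_embs @ query
best = int(np.argmax(scores))
print(best)  # 2
```

Production systems replace the brute-force matrix product with an approximate nearest-neighbor index, but the scoring rule is the same.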

By achieving SOTA results on this benchmark at release, the Harrier family demonstrates strong capability in cross-lingual retrieval. This is especially important for global applications where a system may need to process queries and documents in different languages within the same vector space.

Key Takeaways

  1. Multilingual SOTA: The family includes three models (270M, 0.6B, and 27B) that achieved state-of-the-art results on the MTEB Multilingual v2 benchmark as of their release date.
  2. Decoder-Only Foundation: Moving away from BERT-style encoders, these models use decoder-only architectures with last-token pooling and L2 normalization.
  3. 32k Extended Context: All models support a context window of 32,768 tokens, which allows the representation of long-form documents or codebases without the semantic loss associated with aggressive chunking.
  4. Instruction-Dependent Retrieval: Best performance requires query-side instructions (a one-sentence task description prepended to the input), while documents must be encoded without instructions.
  5. Quality through Distillation: The smaller 270M (640-dim) and 0.6B (1,024-dim) models are trained using knowledge distillation from larger embedding models to improve their semantic representation quality relative to their parameter counts.


