Microsoft AI Introduces MAI-Transcribe-1.5: 2.4% WER on Artificial Analysis, Best-in-Class FLEURS Accuracy, and Up to 5x Faster Long-Form Audio Transcription

Last week Microsoft AI announced MAI-Transcribe-1.5. It is the second iteration of the company's in-house speech-to-text family. The model targets accuracy across 43 languages and a range of accents and acoustic environments. Microsoft is already putting it to work in production transcription.

What is MAI-Transcribe-1.5

MAI-Transcribe-1.5 is an automatic speech recognition (ASR) model. It takes audio as input and returns text. Microsoft built it in-house rather than on a third-party foundation. The model handles 43 languages in a single system and is optimized for varied pronunciations, dialects, and real-world acoustic environments.

Microsoft is rolling it out in Copilot, Teams, GitHub, and Dynamics 365 Contact Center. It is also available in Foundry, Microsoft's model platform.

The Case for Accuracy

Accuracy here is measured by word error rate (WER). A lower WER means fewer errors per word transcribed. Microsoft reports the best WER across all 43 languages on FLEURS, a widely used multilingual speech recognition benchmark.
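For readers who have not computed WER before, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The function name `word_error_rate` is illustrative, and the lowercasing and whitespace splitting are simplifying assumptions. Benchmarks such as FLEURS and Artificial Analysis apply their own text normalization, so this will not reproduce their published figures.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = word-level edit distance / number of reference words."""
    # Assumption: simple lowercase + whitespace tokenization; real benchmarks
    # normalize punctuation, numerals, and casing in their own ways.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of four reference words.
print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```

On this definition, a 2.4% WER corresponds to roughly 24 word errors per 1,000 reference words.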

On the Artificial Analysis leaderboard, the model posts a WER of 2.4%. That places it third on that public leaderboard. So the picture is mixed: Microsoft claims first place on FLEURS, while the model ranks third on Artificial Analysis.

Language expansion is another part of the accuracy story. Coverage increased from 25 to 43 languages. Microsoft says the 18 new languages were added without compromising accuracy. Ten of them are South Asian, including Bengali, Tamil, and Telugu. Eight are European, such as Ukrainian, Greek, and Catalan.

Speed

MAI-Transcribe-1.5 leads on combined accuracy and speed on the Artificial Analysis leaderboard. It runs up to 5x faster than models with comparable accuracy. The effect is greatest for long audio files. The model can transcribe an hour of audio in less than 15 seconds.

Microsoft cites speedups of up to 5x over Gemini 3.1, Scribe v2, and GPT-4o-Transcribe on long-form audio. Against the previous MAI-Transcribe-1, the Azure model card lists up to 5.7x faster long-form transcription. For batch pipelines that process large archives, that latency gap adds up quickly.
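To put those figures in perspective, the arithmetic below is a rough sketch, not a benchmark. It converts "an hour of audio in less than 15 seconds" into a real-time factor and applies it to a hypothetical 10,000-hour archive. The archive size, the function names, and the assumption of strictly sequential processing with no upload or queueing overhead are all illustrative. The 5x figure is Microsoft's best-case claim.

```python
def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def wall_clock_hours(archive_hours: float, speed_factor: float) -> float:
    """Sequential processing time for an archive at a given real-time factor."""
    return archive_hours / speed_factor


# One hour of audio in 15 seconds (the article says "less than", so this is a floor).
speed = real_time_factor(60 * 60, 15)
print(speed)  # 240.0

ARCHIVE_HOURS = 10_000  # hypothetical archive size, for illustration only
fast = wall_clock_hours(ARCHIVE_HOURS, speed)
slow = wall_clock_hours(ARCHIVE_HOURS, speed / 5)  # a model 5x slower
print(f"{fast:.1f} h vs {slow:.1f} h")  # 41.7 h vs 208.3 h
```

Under these assumptions, the gap between the two models is several days of sequential compute on a single archive.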

Keyword (Entity) Biasing: A Feature Worth Understanding

General-purpose transcription models often fail on domain-specific vocabulary. This includes names of people, product names, medical terms, and internal acronyms. Those words tend to matter a lot to business users.

MAI-Transcribe-1.5 adds keyword biasing, also called entity biasing. You provide a list of domain-specific keywords. The Azure model card supports up to 200 keywords. The model biases its predictions toward those terms. Notably, it does not force a match. It uses the surrounding context to decide when the bias should apply. Microsoft reports a 30% WER reduction on FLEURS when biasing is used.

A short example shows the effect. Without biasing, three names are transcribed as “Sean,” “Oif,” and “Societal.” With a list of names supplied, the model correctly returns “Shaun,” “Aoife,” and “Xochitl.” This matters for meetings, healthcare, and call centers with niche vocabulary.
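The article does not describe the request format for passing keywords, so the sketch below stops short of any API call. It only shows one plausible way to prepare a keyword list that respects the 200-keyword limit from the Azure model card. The helper name `prepare_keywords` and its cleaning rules (whitespace trimming, case-insensitive de-duplication, truncation in input order) are assumptions for illustration, not documented behavior of the service.

```python
MAX_KEYWORDS = 200  # limit stated on the Azure model card, per the article


def prepare_keywords(candidates: list[str], limit: int = MAX_KEYWORDS) -> list[str]:
    """Trim, de-duplicate case-insensitively, and cap a keyword list, keeping order."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for term in candidates:
        term = " ".join(term.split())  # collapse stray whitespace
        key = term.casefold()
        if term and key not in seen:
            seen.add(key)
            cleaned.append(term)
    # Hypothetical policy: keep the first `limit` terms, so callers should
    # order the list by importance before passing it in.
    return cleaned[:limit]


keywords = prepare_keywords(["Shaun", " Aoife ", "Xochitl", "shaun", ""])
print(keywords)  # ['Shaun', 'Aoife', 'Xochitl']
```

Because the cap is small, it probably pays to build the list per customer, meeting, or call queue rather than sending one global glossary.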

Use Cases

The Azure model card lists concrete production scenarios. Each maps to a common engineering task:

  • Video captioning for media and content platforms.
  • Accessibility tools that depend on accurate captions.
  • Meeting transcription in Teams-style collaboration tools.
  • Call analytics for contact centers and support teams.
  • Content creation workflows that require a quick draft transcript.
  • Voice agents that convert speech to text before reasoning over it.

Automatic language detection helps when the input language is unknown. The model detects spoken language without manual setup.

MAI-Transcribe-1.5 vs MAI-Transcribe-1

The table below compares the two generations using only the facts stated above.

| Attribute | MAI-Transcribe-1 | MAI-Transcribe-1.5 |
|---|---|---|
| Supported languages | 25 | 43 |
| Keyword/entity biasing | Not listed | Up to 200 keywords |
| Long-form transcription speed | Baseline | Up to 5.7x faster |
| Artificial Analysis WER | Not specified | 2.4% (ranked #3) |
| FLEURS position (per Microsoft) | State of the art | Best in class across 43 languages |
| Automatic language detection | Not specified | Yes |
| Lifecycle | Early release | Generally Available (GA) |
| Input / Output | Audio / Text | Audio / Text |

Strengths and Limitations

Strengths:

  • Coverage of 43 languages from a single model, up from 25.
  • Keyword/entity biasing produces up to a 30% WER reduction on FLEURS.
  • Transcribes an hour of audio in less than 15 seconds.
  • Available now in Azure AI Foundry.
  • Robust to noisy, real-world audio, according to Microsoft.

Limitations:

  • There is no diarization yet, so speaker labels are not available.
  • There is no native streaming API, so real-time usage is limited.
  • Several of the accuracy, speed, and cost claims are first-party figures from Microsoft.
  • It is ranked third in Artificial Analysis, behind two competitors.
