Salesforce AI Research Releases VoiceAgentRAG: A Dual-Agent Memory Router That Cuts Voice RAG Retrieval Latency by 316x

In the world of voice AI, the difference between a helpful assistant and a frustrating interaction is measured in milliseconds. While text-based Retrieval-Augmented Generation (RAG) systems can afford a few seconds of 'thinking' time, voice agents must respond within a roughly 200 ms budget to keep conversation flowing naturally. A typical vector database query adds 50-300 ms of network latency, which can eat up the entire budget before the LLM even begins to generate a response.
The Salesforce AI Research team has released VoiceAgentRAG, an open-source dual-agent architecture designed to remove this retrieval bottleneck by taking retrieval latency off the response-generation critical path.

Dual-Agent Architecture: Fast Talker vs. Slow Thinker
VoiceAgentRAG works as a memory router that coordinates two agents running concurrently over an asynchronous event bus:
- Fast Talker (Foreground Agent): This agent handles the latency-critical path. For every user query, it first checks a local, in-memory semantic cache. If the required context is present, the lookup takes about 0.35 ms. On a cache miss, it falls back to the remote vector database and immediately caches the results for future turns.
- Slow Thinker (Background Agent): Running as a background task, this agent continuously monitors the conversation stream. It uses a sliding window of the last six turns to predict 3-5 likely upcoming topics, then prefetches the corresponding document chunks from the remote vector store into the local cache before the user asks their next question.
To increase retrieval accuracy, the Slow Thinker is prompted to generate document-style descriptions rather than questions. This ensures that the resulting embeddings closely match the actual prose found in the knowledge base.
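The division of labor between the two agents can be sketched with asyncio. This is a hedged illustration of the pattern, not the repository's code: the cache, the retrieval function, and the topic predictor below are hypothetical stand-ins.

```python
import asyncio

# Hypothetical stand-in for the in-memory semantic cache.
CACHE = {}  # key -> prefetched document chunks

async def remote_retrieve(query):
    """Simulated remote vector-store call (~110 ms of network latency)."""
    await asyncio.sleep(0.11)
    return f"docs-for:{query}"

async def fast_talker(query):
    """Foreground agent: sub-millisecond cache check, remote fallback on miss."""
    if query in CACHE:
        return CACHE[query]                  # in-memory hit, ~0.35 ms
    result = await remote_retrieve(query)    # miss: pay the network cost once...
    CACHE[query] = result                    # ...then cache for future turns
    return result

async def slow_thinker(history, window=6):
    """Background agent: predict upcoming topics from recent turns, prefetch them."""
    recent = history[-window:]
    predicted = {f"topic:{turn}" for turn in recent}   # placeholder prediction
    for topic in predicted:
        if topic not in CACHE:
            CACHE[topic] = await remote_retrieve(topic)

async def handle_turn(query, history):
    # Fast Talker answers now; Slow Thinker prefetches concurrently.
    prefetch = asyncio.create_task(slow_thinker(history + [query]))
    answer = await fast_talker(query)
    await prefetch   # in practice this completes during the pause between turns
    return answer
```

In the real system the prediction step is an LLM call and the event bus carries typed events between the agents; the sketch only shows the routing shape.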
Core Technology: Semantic Caching
The efficiency of the system hinges on a purpose-built semantic cache implemented with an in-memory FAISS IndexFlatIP (inner product) index.
- Indexing by Document Embedding: Unlike a naive cache that indexes by query embedding, VoiceAgentRAG indexes entries by the embeddings of the documents themselves. This lets the cache perform a proper semantic search over its contents, keeping lookups consistent even when the user's phrasing differs from the system's predictions.
- Threshold Management: Because query-to-document cosine similarity is systematically lower than query-to-query similarity, the system uses a default threshold tuned to balance precision and recall.
- Maintenance: The cache filters near-duplicates using a 0.95 cosine-similarity limit and applies a Least Recently Used (LRU) eviction policy with a 300-second Time-to-Live (TTL).
- Priority Retrieval: When the Fast Talker misses the cache, a PriorityRetrieval event triggers the Slow Thinker to perform an immediate retrieval with an expanded top-k (2x the default) to quickly populate the cache in the new topic area.
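The caching scheme above can be sketched as a small class. This is a minimal illustration rather than the repository's implementation: plain NumPy inner products stand in for the FAISS IndexFlatIP index, the hit threshold of 0.55 is an assumed value (the article only says the default is tuned lower than query-to-query similarity), and the 0.95 dedup limit and 300 s TTL mirror the numbers quoted above.

```python
import time
from collections import OrderedDict

import numpy as np

class SemanticCache:
    """Toy semantic cache keyed by *document* embeddings (inner product search)."""

    def __init__(self, hit_threshold=0.55, dedup_threshold=0.95,
                 ttl_s=300.0, capacity=128):
        self.hit_threshold = hit_threshold      # assumed query-to-document cutoff
        self.dedup_threshold = dedup_threshold  # 0.95 near-duplicate limit
        self.ttl_s = ttl_s                      # 300-second TTL
        self.capacity = capacity
        self.entries = OrderedDict()            # doc_id -> (embedding, text, inserted_at)

    @staticmethod
    def _normalize(v):
        v = np.asarray(v, dtype=np.float32)
        return v / (np.linalg.norm(v) + 1e-12)

    def add(self, doc_id, embedding, text):
        emb = self._normalize(embedding)
        # Skip near-duplicates (cosine similarity >= 0.95 against cached docs).
        for other_emb, _, _ in self.entries.values():
            if float(emb @ other_emb) >= self.dedup_threshold:
                return False
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)    # LRU eviction
        self.entries[doc_id] = (emb, text, time.monotonic())
        return True

    def lookup(self, query_embedding):
        """Return the best cached document above the hit threshold, or None."""
        q = self._normalize(query_embedding)
        now = time.monotonic()
        # Drop expired entries first.
        for doc_id in [d for d, (_, _, t) in self.entries.items()
                       if now - t > self.ttl_s]:
            del self.entries[doc_id]
        best_id, best_sim = None, self.hit_threshold
        for doc_id, (emb, _, _) in self.entries.items():
            sim = float(q @ emb)
            if sim >= best_sim:
                best_id, best_sim = doc_id, sim
        if best_id is None:
            return None                         # miss -> fall back to remote store
        self.entries.move_to_end(best_id)       # refresh LRU recency
        return self.entries[best_id][1]
```

Keying by document embedding means the same brute-force inner-product scan serves both cache lookups and dedup checks; FAISS performs the equivalent search over pre-normalized vectors at scale.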
Benchmarks and Performance
The research team tested the system using Qdrant Cloud as the remote vector database, across 200 queries spanning 10 conversation scenarios.
| Metric | Result |
| --- | --- |
| Overall cache hit rate | 75% (79% in warm conversations) |
| Retrieval speedup (cache hit) | 316x |
| Total retrieval time saved | 16.5 seconds over 200 turns |
The architecture is most effective in thematically consistent conversations with sustained topics. For example, 'Feature Comparison' (S8) achieved a 95% hit rate. In contrast, performance drops in highly volatile scenarios: the lowest was 'Existing customer development' (S9) at a 45% hit rate, while 'Mixed rapid-fire' (S10) retained 55%.
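The headline numbers are mutually consistent, as a quick back-of-envelope check shows (the ~110 ms remote latency is inferred here from the 0.35 ms local lookup and the 316x speedup; it is not stated directly in the article):

```python
local_ms = 0.35                      # in-memory cache lookup
speedup = 316                        # reported cache-hit speedup
remote_ms = local_ms * speedup       # implied remote retrieval: ~110.6 ms

turns = 200
hit_rate = 0.75                      # overall cache hit rate
hits = turns * hit_rate              # 150 retrievals served from cache
saved_s = hits * (remote_ms - local_ms) / 1000
print(f"time saved: {saved_s:.1f} s")   # ~16.5 s, matching the reported figure
```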


Integration and Support
The VoiceAgentRAG repository is designed for extensive interoperability across the AI stack:
- LLM providers: Supports OpenAI, Anthropic, Gemini/Vertex AI, and Ollama. The default model in the paper's tests was GPT-4o-mini.
- Embeddings: The paper's experiments used OpenAI's text-embedding-3-small (1536 dimensions), and the repository supports both OpenAI and Ollama embeddings.
- STT/TTS: Supports Whisper (local or OpenAI) for speech-to-text, and Edge TTS or OpenAI for text-to-speech.
- Vector stores: Built-in support for FAISS and Qdrant.
Key Takeaways
- Dual-Agent Architecture: The system solves RAG’s latency bottleneck by using a front-end ‘Fast Talker’ for sub-millisecond cache checks and a back-end ‘Slow Thinker’ for predictive prefetching.
- Significant Speedup: Achieves a 316x retrieval speedup on cache hits, which is critical for staying within the 200 ms voice-response budget.
- High Cache Hit Rate: Across varied scenarios, the system maintains an overall cache hit rate of 75%, reaching as high as 95% on topically consistent queries such as feature comparison.
- Document-Embedding Index: To stay accurate regardless of the user's phrasing, the semantic cache indexes entries by document embedding rather than by predicted-query embedding.
- Predictive Prefetching: The background agent uses a sliding window of the last six conversation turns to predict likely next topics and fills the cache during the pauses between conversational exchanges.
Check out the Paper and the Repo for further details.



