Technology & AI

Building a Content Pruning Pipeline for Long-Term Agents

In this article, you’ll learn how to use a context pruning pipeline for long-running AI agents, enabling them to efficiently manage conversational memory with semantic similarity.

Topics we will cover include:

  • Why unlimited conversation history is a problem for agents built on top of large language models, and what a context pruning strategy looks like.
  • How to use sentence modifier embedding models to calculate semantic similarity between current modifier and archived conversation.
  • How it combines a context window cut from the most recent turn, the top-K relative past turns statistically related, and the current information.

Building a Content Pruning Pipeline for Long-Term Agents

Introduction

Modern AI agents are built on large-scale linguistic models (LLMs) that are designed to operate continuously. As a result, their chat history continues to grow forever. Passing such history as a window into the context of LLM is an ideal solution to the prohibitive token costs, latency issues, and the depravity of the thinking end.

Building a context pruning pipeline can address this problem by managing memory for the latest conversation. This article outlines the basic principles of implementing a core pruning pipeline for long-running agents.

We use a completely accessible and free local solution based on open source embedding models rather than paid APIs, but you can replace them with paid APIs if you want a more efficient solution.

A Proposed Memory Strategy

Classical memory strategies for agents rely on a sliding window that forgets old information as it is left behind, including potentially sensitive information. Going beyond that approach, it’s possible to build a curated, smart pipeline that gives the LLM exactly what it needs as a core.

In short, context can be determined up to the following aspects:

  • I current informationcontaining a user request or query.
  • I the latest turni.e. the exchange of the previous input response, which is the key to maintaining the continuity of the conversation.
  • I top matches are statistically compatiblecalculated based on the similarity score. These are past curves closely related to current information, returned by vector embedding.

Everything in the conversation history outside the scope of these three elements is discarded in the active context, saving computing and memory.

Simulation-Based Implementation

Our example implementation mimics the use of the technique mentioned above, creating a context pruning window step by step. Sentence transformer models are used to simulate a long-running pipeline around the history of humorous conversation.

We start by making necessary imports:

Next, we load and run the previously trained embedding model – effectively all-MiniLM-L6-v2 from the sentence_transformers the library. This model is trained to convert raw text into embedding vectors that capture semantic features. We also create a simple, simulated agent history that contains user-agent interactions (in real cases, this will be downloaded from the database):

The basic concept of the core pruning pipeline is as follows. It is integrated into a prune_context() a function that receives current information, a full interaction history, and a number of past curves that are statistically related to find, k:

The code above is pretty self-explanatory. It divides logic into the basic case – when the history of the conversation is still very short, when the whole history is passed as context – and the general case, where the real semantic pruning pipeline takes place in several steps: embedding the previous curves, calculating the cosine similarity and the current fast embedding, sorting from the highest to the least similarity, and selecting the top-K. The current information, the most recent turn, and the previous top-K semantically similar turns are finally merged into the pruned core.

The following example shows how to get the context of new information when the user returns to aspects related to the efficiency of the shipping line:

The resulting content window generated by our pruning strategy is shown below:

Note that we used the default value ke.g top_k=2. The last opportunity, which is always included in our defined pipeline, contains a pair of messages:

So why does one additional user agent connection appear before this opportunity, rather than two? The reason is that the top-k strategy does not work at the full turn level (ie, two messages), but at the individual message level. In this case, the two messages returned based on similarity happen to form two parts of the same interaction, but it is equally possible that the two most important messages are both user messages, agent messages, or non-consecutive parts of the conversation history.

Wrapping up

This article demonstrated how to use a context pruning pipeline – based on the conversational history of the simulated agent – that relies on semantic similarity to select relevant parts of the conversation as context for the current information. This is an important technique for long-running agents, which helps reduce memory usage and computational costs while improving overall efficiency.

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button