Building a Context Pruning Pipeline for Long-Running Agents

In this article, you’ll learn how to build a context pruning pipeline for long-running AI agents, enabling them to manage conversational memory efficiently using semantic similarity.
Topics we will cover include:
- Why unbounded conversation history is a problem for agents built on top of large language models, and what a context pruning strategy looks like.
- How to use sentence transformer embedding models to compute the semantic similarity between the current prompt and archived conversation turns.
- How to assemble a pruned context window from the most recent turn, the top-K semantically related past turns, and the current prompt.
Introduction
Modern AI agents built on large language models (LLMs) are designed to operate continuously. As a result, their conversation history keeps growing. Passing the entire history into the LLM’s context window is far from an ideal solution: it leads to prohibitive token costs, latency issues, and degraded reasoning quality.
Building a context pruning pipeline can address this problem by managing conversational memory. This article outlines the basic principles of implementing a context pruning pipeline for long-running agents.
We use a fully accessible, free local solution based on open-source embedding models rather than paid APIs, but you can swap in a paid API if you need a more powerful solution.
A Proposed Memory Strategy
Classical memory strategies for agents rely on a sliding window that forgets old information as it falls out of the window, including potentially important information. Going beyond that approach, it’s possible to build a curated, smarter pipeline that gives the LLM exactly what it needs as context.
In short, the context can be narrowed down to the following elements:
- The current prompt, containing a user request or query.
- The most recent turn, i.e. the previous prompt-response exchange, which is key to maintaining conversational continuity.
- The top-K semantic matches, computed from a similarity score. These are past turns closely related to the current prompt, retrieved via vector embeddings.
Everything in the conversation history outside these three elements is left out of the active context, saving compute and memory.
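For contrast with the three-part strategy above, the classical sliding window can be sketched in a few lines. This is a minimal illustration, not part of the pipeline built later; the function name `sliding_window` and the `window_size` parameter are assumptions made for this example.

```python
def sliding_window(history, current_prompt, window_size=2):
    """Keep only the last `window_size` messages plus the current prompt."""
    # Guard against window_size == 0, since history[-0:] would return everything
    recent = history[-window_size:] if window_size > 0 else []
    return recent + [{"role": "user", "content": current_prompt}]


history = [
    {"role": "user", "content": "My name is Alice and I work in logistics."},
    {"role": "agent", "content": "Nice to meet you, Alice."},
    {"role": "user", "content": "What's the weather like today?"},
    {"role": "agent", "content": "It's sunny and 75 degrees."},
]

context = sliding_window(history, "What did I say my job was?")
for msg in context:
    print(f"{msg['role'].upper()}: {msg['content']}")
```

With a window of two messages, only the weather exchange survives, so the message that actually answers the question (Alice's job) has already been dropped. That is the gap the semantic matches are meant to close.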
Simulation-Based Implementation
Our example implementation simulates the technique described above, building a pruned context window step by step. A sentence transformer model is used to simulate a long-running pipeline over a mock conversation history.
We start with the necessary imports:
    import numpy as np
    from sentence_transformers import SentenceTransformer
    from scipy.spatial.distance import cosine
Next, we load the pretrained embedding model, specifically all-MiniLM-L6-v2 from the sentence_transformers library. This model is trained to convert raw text into embedding vectors that capture semantic features. We also create a simple, simulated agent history containing user-agent interactions (in real-world cases, this would be loaded from a database):
    # Load a lightweight open-source embedding model
    model = SentenceTransformer('all-MiniLM-L6-v2')

    # 1. Simulated agent history (typically loaded from a database)
    chat_history = [
        {"role": "user", "content": "My name is Alice and I work in logistics."},
        {"role": "agent", "content": "Nice to meet you, Alice. How can I help with logistics?"},
        {"role": "user", "content": "What's the weather like today?"},
        {"role": "agent", "content": "It's sunny and 75 degrees."},
        {"role": "user", "content": "I need help calculating route efficiency for my fleet."},
        {"role": "agent", "content": "Route efficiency involves analyzing distance, traffic, and load weight."},
        {"role": "user", "content": "Thanks, that makes sense."},
        {"role": "agent", "content": "You're welcome! Let me know if you need anything else."}
    ]
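Before moving on, it may help to see what "semantic similarity" between two embedding vectors means numerically. The sketch below computes cosine similarity in plain Python on toy three-dimensional vectors; the vectors are hypothetical values chosen for illustration, not real model output. Note that SciPy's `cosine` function returns the cosine distance, which is 1 minus this similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" (hypothetical values, not produced by any model)
prompt_vec = [1.0, 0.0, 1.0]
related_vec = [2.0, 0.0, 2.0]    # same direction as the prompt vector
unrelated_vec = [0.0, 1.0, 0.0]  # orthogonal to the prompt vector

print(cosine_similarity(prompt_vec, related_vec))
print(cosine_similarity(prompt_vec, unrelated_vec))
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0. Real sentence embeddings have hundreds of dimensions, but the comparison works the same way.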
The core logic of the context pruning pipeline follows. It is wrapped in a prune_context() function that receives the current prompt, the full interaction history, and the number of semantically related past turns to retrieve, top_k:
    def prune_context(current_prompt, history, top_k=2):
        # If the conversation history is too short, we just return it in full
        if len(history) <= 2:
            return history + [{"role": "user", "content": current_prompt}]

        # 1. Always keep the most recent turn (last user/agent pair)
        recent_turn = history[-2:]

        # The rest of the history is eligible for semantic pruning
        archived_turns = history[:-2]

        # 2. Embed the current prompt
        prompt_emb = model.encode(current_prompt)

        # 3. Embed the archived turns and compute their similarity to the prompt
        scored_turns = []
        for turn in archived_turns:
            turn_emb = model.encode(turn["content"])
            # We want similarity, so we subtract the cosine distance from 1
            similarity = 1 - cosine(prompt_emb, turn_emb)
            scored_turns.append((similarity, turn))

        # 4. Sort by highest similarity and keep the top-K turns
        scored_turns.sort(key=lambda x: x[0], reverse=True)
        top_semantic_turns = [turn for score, turn in scored_turns[:top_k]]

        # Sort semantic turns chronologically (optional but recommended for LLMs)
        top_semantic_turns.sort(key=lambda x: archived_turns.index(x))

        # 5. Assemble the final pruned context
        pruned_context = top_semantic_turns + recent_turn + [{"role": "user", "content": current_prompt}]

        return pruned_context
The code above is fairly self-explanatory. It splits the logic into a base case, in which the conversation history is still very short and the whole history is passed as context, and the general case, where the actual semantic pruning takes place in several steps: embedding the archived turns, computing the cosine similarity between each of them and the current prompt embedding, sorting from highest to lowest similarity, selecting the top-K, and restoring chronological order. The top-K semantically similar past turns, the most recent turn, and the current prompt are finally merged into the pruned context.
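One thing worth noting is that prune_context() encodes every archived message again on each call. A possible refinement, sketched here as an assumption rather than as part of the pipeline above, is to memoize embeddings by message text. The `EmbeddingCache` class and the `toy_encode` stand-in are hypothetical names introduced for this example; in the article's setup, `encode_fn` would be `model.encode`.

```python
class EmbeddingCache:
    """Memoize embeddings by message text so each message is encoded once."""

    def __init__(self, encode_fn):
        self._encode_fn = encode_fn  # e.g. model.encode in the article's setup
        self._store = {}
        self.misses = 0              # counts actual calls to encode_fn

    def get(self, text):
        if text not in self._store:
            self._store[text] = self._encode_fn(text)
            self.misses += 1
        return self._store[text]


# Stand-in encoder for illustration only (not a real embedding model)
def toy_encode(text):
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(toy_encode)
history = ["My name is Alice.", "What's the weather like today?"]

for _ in range(3):  # three simulated calls to the pruning step
    vectors = [cache.get(text) for text in history]

print(cache.misses)
```

Although the history is embedded three times in the loop, the encoder only runs once per distinct message. In a real deployment the embeddings would more likely be stored alongside the messages in the database, but the idea is the same.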
The following example shows how to obtain the context for a new prompt when the user returns to a topic related to fleet route efficiency:
    # Simulation execution
    current_request = "Can we go back to the fleet calculations?"
    optimized_context = prune_context(current_request, chat_history)

    # Print the result
    print("--- PRUNED CONTEXT WINDOW ---")
    for msg in optimized_context:
        print(f"{msg['role'].upper()}: {msg['content']}")
The resulting context window generated by our pruning strategy is shown below:
    --- PRUNED CONTEXT WINDOW ---
    USER: I need help calculating route efficiency for my fleet.
    AGENT: Route efficiency involves analyzing distance, traffic, and load weight.
    USER: Thanks, that makes sense.
    AGENT: You're welcome! Let me know if you need anything else.
    USER: Can we go back to the fleet calculations?
Note that we used the default value for k, i.e. top_k=2. The most recent turn, which is always included in the pipeline we defined, consists of a pair of messages:
    USER: Thanks, that makes sense.
    AGENT: You're welcome! Let me know if you need anything else.
So why does only one additional user-agent exchange appear before this turn, rather than two? The reason is that the top-K strategy does not operate at the full-turn level (i.e., two messages), but at the individual-message level. In this case, the two messages retrieved by similarity happen to form the two halves of the same exchange, but it is equally possible for the two most relevant messages to both be user messages, both be agent messages, or be non-consecutive parts of the conversation history.
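If you would rather have top-K count whole exchanges, one option is to group messages into user/agent pairs before scoring. The sketch below is a hypothetical variant, not the method used above: `group_into_turns`, `prune_by_turn`, and the `word_overlap` scorer are names invented for illustration, and the code assumes the history strictly alternates user and agent messages. In the article's setup, `score_fn` would compute 1 minus the cosine distance between two encoded texts.

```python
def group_into_turns(history):
    """Pair consecutive messages into (user, agent) turns."""
    return [history[i:i + 2] for i in range(0, len(history), 2)]


def prune_by_turn(current_prompt, history, score_fn, top_k=1):
    """Turn-level variant: top_k counts whole exchanges, not single messages."""
    prompt_msg = {"role": "user", "content": current_prompt}
    turns = group_into_turns(history)
    if len(turns) <= 1:
        return history + [prompt_msg]

    recent_turn, archived = turns[-1], turns[:-1]

    # Score each archived turn by its best-matching message
    scored = [
        (max(score_fn(current_prompt, m["content"]) for m in turn), idx)
        for idx, turn in enumerate(archived)
    ]
    # Highest score first; keep top_k, then restore chronological order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    keep = sorted(idx for _, idx in scored[:top_k])

    pruned = [m for idx in keep for m in archived[idx]]
    return pruned + recent_turn + [prompt_msg]


# Stand-in scorer for illustration: fraction of shared lowercase words
def word_overlap(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)


history = [
    {"role": "user", "content": "I need help calculating route efficiency for my fleet."},
    {"role": "agent", "content": "Route efficiency involves analyzing distance, traffic, and load weight."},
    {"role": "user", "content": "What's the weather like today?"},
    {"role": "agent", "content": "It's sunny and 75 degrees."},
    {"role": "user", "content": "Thanks, that makes sense."},
    {"role": "agent", "content": "You're welcome!"},
]

prompt = "Can we go back to route efficiency for my fleet?"
result = prune_by_turn(prompt, history, word_overlap, top_k=1)
for msg in result:
    print(f"{msg['role'].upper()}: {msg['content']}")
```

Tracking turns by index also sidesteps the list lookup used to restore chronological order, which can misplace a message when the same text appears twice in the history.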
Wrapping Up
This article demonstrated how to build a context pruning pipeline, illustrated on a simulated agent conversation history, that relies on semantic similarity to select the relevant parts of the conversation as context for the current prompt. This is an important technique for long-running agents, helping reduce memory usage and computational cost while improving overall efficiency.



