
How to Build an Efficient Reasoning Agent by Dynamically Pruning Multiple Reasoning Paths Without Losing Accuracy

In this tutorial, we build a reasoning-path pruning framework that generates multiple reasoning paths in parallel and dynamically prunes them using consensus signals and early stopping. We focus on improving reasoning efficiency by cutting unnecessary token usage while preserving answer quality, showing that agreement-weighted similarity graphs can serve as an effective proxy for reasoning quality. We design the entire pipeline around an instruction-tuned model and incremental sampling to simulate how an agent can decide when it has thought enough. Check out the FULL CODES here.

!pip -q install -U transformers accelerate bitsandbytes networkx scikit-learn


import re, time, random, math
import numpy as np
import torch
import networkx as nx
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


SEED = 7
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)


MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"


tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
   MODEL_NAME,
   device_map="auto",
   torch_dtype=torch.float16,
   load_in_4bit=True
)
model.eval()


SYSTEM = "You are a careful problem solver. Keep reasoning brief and output a final numeric answer."
FINAL_RE = re.compile(r"Final:\s*([-\d]+(?:\.\d+)?)")

We set up the Colab environment and install the required libraries. We load a lightweight instruction-tuned language model with 4-bit quantization to ensure stable performance on limited GPU resources. We also define the global configuration, the random seeds, and the answer-extraction regex used throughout the tutorial. Check out the FULL CODES here.
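As a quick optional sanity check (a minimal sketch, assuming the model and tokenizer loaded above), we can run one short greedy completion to confirm the pipeline responds before building the multi-path machinery:

# Optional sanity check: one short greedy completion to confirm the model responds.
# Assumes `tokenizer`, `model`, and `SYSTEM` from the setup cell above.
probe = tokenizer(f"{SYSTEM}\n\nProblem: What is 2+3?\nReasoning: (brief)\nFinal: ", return_tensors="pt").to(model.device)
probe_out = model.generate(**probe, max_new_tokens=16, do_sample=False, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(probe_out[0][probe["input_ids"].shape[1]:], skip_special_tokens=True))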

def make_prompt(q):
   return (
       f"{SYSTEM}\n\n"
       f"Problem: {q}\n"
       f"Reasoning: (brief)\n"
       f"Final: "
   )


def parse_final_number(text):
   m = FINAL_RE.search(text)
   if m:
       return m.group(1).strip()
   nums = re.findall(r"[-]?\d+(?:\.\d+)?", text)
   return nums[-1] if nums else None


def is_correct(pred, gold):
   if pred is None:
       return 0
   try:
       return int(abs(float(pred) - float(gold)) < 1e-9)
   except:
       return int(str(pred).strip() == str(gold).strip())


def tok_len(text):
   return len(tokenizer.encode(text))

We define the helper functions that build prompts in the required format, extract the final numeric answer from each completion, and check correctness against the ground truth. We standardize how responses are parsed so that different reasoning paths can be compared consistently. We also introduce a token-counting utility that lets us measure the cost of thinking. Check out the FULL CODES here.
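For instance (a small illustration with hypothetical strings, using the helpers defined above), the parser prefers an explicit "Final:" line and otherwise falls back to the last number in the text:

# Hypothetical example strings to illustrate the helpers above.
print(parse_final_number("3 notebooks cost $12, so one is 12/3.\nFinal: 4"))  # -> "4"
print(parse_final_number("The answer should be 29."))                         # -> "29" (fallback: last number)
print(is_correct("4.0", "4"))   # -> 1, numeric comparison tolerates formatting differences
print(tok_len("Final: 42"))     # token count of a short string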

@torch.no_grad()
def generate_paths(question, n, max_new_tokens=64, temperature=0.7, top_p=0.9):
   prompt = make_prompt(question)
   inputs = tokenizer(prompt, return_tensors="pt").to(model.device)


   gen_cfg = GenerationConfig(
       do_sample=True,
       temperature=temperature,
       top_p=top_p,
       max_new_tokens=max_new_tokens,
       pad_token_id=tokenizer.eos_token_id,
       eos_token_id=tokenizer.eos_token_id,
       num_return_sequences=n
   )


   out = model.generate(**inputs, generation_config=gen_cfg)
   prompt_tok = inputs["input_ids"].shape[1]


   paths = []
   for i in range(out.shape[0]):
       seq = out[i]
       gen_ids = seq[prompt_tok:]
       completion = tokenizer.decode(gen_ids, skip_special_tokens=True)
       paths.append({
           "prompt_tokens": int(prompt_tok),
           "gen_tokens": int(gen_ids.shape[0]),
           "completion": completion
       })
   return paths

We implement batched multi-sample generation that produces several reasoning paths in a single model call. We slice off the prompt tokens so that each path keeps only its generated continuation. We store token usage and completions in a structured format to support the downstream pruning decisions. Check out the FULL CODES here.
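As a quick illustration (assuming the model loaded above and the question shown here, which is only an example), a small call shows the per-path structure we rely on later:

# Sample two short reasoning paths for one question and inspect their structure.
sample_paths = generate_paths("What is 12*3?", n=2, max_new_tokens=32)
for p in sample_paths:
   print(p["gen_tokens"], "tokens ->", p["completion"][:80].replace("\n", " "))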

def consensus_strength(completions, sim_threshold=0.22):
   if len(completions) <= 1:
       return [0.0] * len(completions)


   vec = TfidfVectorizer(ngram_range=(1,2), max_features=2500)
   X = vec.fit_transform(completions)
   S = cosine_similarity(X)


   G = nx.Graph()
   n = len(completions)
   G.add_nodes_from(range(n))


   for i in range(n):
       for j in range(i+1, n):
           w = float(S[i, j])
           if w >= sim_threshold:
               G.add_edge(i, j, weight=w)


   strength = [0.0] * n
   for u, v, d in G.edges(data=True):
       w = float(d.get("weight", 0.0))
       strength[u] += w
       strength[v] += w


   return strength

We build a lightweight consensus signal using a similarity graph over the generated reasoning paths. We compute pairwise TF-IDF cosine similarities and convert them into a per-path agreement strength based on the weighted edges of the graph. This lets us estimate agreement between reasoning paths without expensive extra model calls. Check out the FULL CODES here.
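A toy example (with hypothetical completions) makes the behavior concrete: paths that phrase the same reasoning similarly accumulate edge weight, while an unrelated outlier stays near zero:

# Toy completions: two agreeing paths and one outlier.
toy = [
   "12 / 3 = 4, so each notebook costs 4. Final: 4",
   "Each notebook is 12 divided by 3, which is 4. Final: 4",
   "The store is open on weekends. Final: 7",
]
print(consensus_strength(toy))  # typically higher strength for the first two paths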

def pick_final_answer(paths):
   answers = [parse_final_number(p["completion"]) for p in paths]
   strengths = consensus_strength([p["completion"] for p in paths])


   groups = {}
   for i, a in enumerate(answers):
       if a is None:
           continue
       groups.setdefault(a, {"idx": [], "strength": 0.0, "tokens": 0})
       groups[a]["idx"].append(i)
       groups[a]["strength"] += strengths[i]
       groups[a]["tokens"] += paths[i]["gen_tokens"]


   if not groups:
       return None, {"answers": answers, "strengths": strengths}


   ranked = sorted(
       groups.items(),
       key=lambda kv: (len(kv[1]["idx"]), kv[1]["strength"], -kv[1]["tokens"]),
       reverse=True
   )


   best_answer = ranked[0][0]
   best_indices = ranked[0][1]["idx"]
   best_i = sorted(best_indices, key=lambda i: (paths[i]["gen_tokens"], -strengths[i]))[0]


   return best_answer, {"answers": answers, "strengths": strengths, "best_i": best_i}


def pruned_agent_answer(
   question,
   batch_size=2,
   k_max=10,
   max_new_tokens=64,
   temperature=0.7,
   top_p=0.9,
   stop_min_samples=4,
   stop_ratio=0.67,
   stop_margin=2
):
   paths = []
   prompt_tokens_once = tok_len(make_prompt(question))
   total_gen_tokens = 0


   while len(paths) < k_max:
       n = min(batch_size, k_max - len(paths))
       new_paths = generate_paths(
           question,
           n=n,
           max_new_tokens=max_new_tokens,
           temperature=temperature,
           top_p=top_p
       )
       paths.extend(new_paths)
       total_gen_tokens += sum(p["gen_tokens"] for p in new_paths)


       if len(paths) >= stop_min_samples:
           answers = [parse_final_number(p["completion"]) for p in paths]
           counts = {}
           for a in answers:
               if a is None:
                   continue
               counts[a] = counts.get(a, 0) + 1
           if counts:
               sorted_counts = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
               top_a, top_c = sorted_counts[0]
               second_c = sorted_counts[1][1] if len(sorted_counts) > 1 else 0
               if top_c >= math.ceil(stop_ratio * len(paths)) and (top_c - second_c) >= stop_margin:
                   final, dbg = pick_final_answer(paths)
                   return {
                       "final": final,
                       "paths": paths,
                       "early_stopped_at": len(paths),
                       "tokens_total": int(prompt_tokens_once * len(paths) + total_gen_tokens),
                       "debug": dbg
                   }


   final, dbg = pick_final_answer(paths)
   return {
       "final": final,
       "paths": paths,
       "early_stopped_at": None,
       "tokens_total": int(prompt_tokens_once * len(paths) + total_gen_tokens),
       "debug": dbg
   }

We implement the core pruning logic that groups reasoning paths by their final answers and ranks the groups using vote count, consensus strength, and token efficiency. We add incremental sampling with an early stop that halts generation once sufficient agreement emerges. We then choose the final answer that balances agreement strength against minimal token consumption. Check out the FULL CODES here.
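Before the full evaluation, a single-question run (using the functions defined above; the question is just an example) shows the shape of the result, including whether the loop stopped early and the total token budget it consumed:

# Run the pruned agent on one question and inspect the outcome.
res = pruned_agent_answer("What is 144 divided by 12?", max_new_tokens=56)
print("final answer:", res["final"])
print("paths sampled:", len(res["paths"]), "| early stop at:", res["early_stopped_at"])
print("total tokens:", res["tokens_total"])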

def baseline_answer(question, k=10, max_new_tokens=64):
   paths = generate_paths(question, n=k, max_new_tokens=max_new_tokens)
   prompt_tokens_once = tok_len(make_prompt(question))
   total_gen_tokens = sum(p["gen_tokens"] for p in paths)


   answers = [parse_final_number(p["completion"]) for p in paths]
   counts = {}
   for a in answers:
       if a is None:
           continue
       counts[a] = counts.get(a, 0) + 1
   final = max(counts.items(), key=lambda kv: kv[1])[0] if counts else None


   return {
       "final": final,
       "paths": paths,
       "tokens_total": int(prompt_tokens_once * k + total_gen_tokens)
   }


DATA = [
   {"q": "If a store sells 3 notebooks for $12, how much does 1 notebook cost?", "a": "4"},
   {"q": "What is 17*6?", "a": "102"},
   {"q": "A rectangle has length 9 and width 4. What is its area?", "a": "36"},
   {"q": "If you buy 5 apples at $2 each, how much do you pay?", "a": "10"},
   {"q": "What is 144 divided by 12?", "a": "12"},
   {"q": "If x=8, what is 3x+5?", "a": "29"},
   {"q": "A jar has 30 candies. You eat 7. How many remain?", "a": "23"},
   {"q": "If a train travels 60 km in 1.5 hours, what is its average speed (km/h)?", "a": "40"},
   {"q": "Compute: (25 - 9) * 3", "a": "48"},
   {"q": "What is the next number in the pattern: 2, 4, 8, 16, ?", "a": "32"},
]


base_acc, base_tok = [], []
prun_acc, prun_tok = [], []


for item in DATA:
   b = baseline_answer(item["q"], k=8, max_new_tokens=56)
   base_acc.append(is_correct(b["final"], item["a"]))
   base_tok.append(b["tokens_total"])


   p = pruned_agent_answer(item["q"], max_new_tokens=56)
   prun_acc.append(is_correct(p["final"], item["a"]))
   prun_tok.append(p["tokens_total"])


print("Baseline accuracy:", float(np.mean(base_acc)))
print("Baseline avg tokens:", float(np.mean(base_tok)))
print("Pruned accuracy:", float(np.mean(prun_acc)))
print("Pruned avg tokens:", float(np.mean(prun_tok)))

We compare the pruned agent against a fixed-budget self-consistency baseline. We examine both accuracy and token usage to measure the efficiency gains from pruning. We conclude by reporting aggregate metrics showing that adaptive pruning preserves accuracy while reducing computational cost.
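To make the efficiency gain explicit, we can also report the relative token savings of the pruned agent over the baseline (a small optional addition to the evaluation above):

# Relative token savings of the pruned agent versus the fixed-k baseline.
savings = 1.0 - float(np.mean(prun_tok)) / float(np.mean(base_tok))
print(f"Token savings from pruning: {savings:.1%}")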

In conclusion, we show that agentic pruning can significantly reduce effective token usage without sacrificing accuracy by stopping reasoning once sufficient consensus emerges. We show that combining self-consistency, similarity-based consensus graphs, and early-stopping heuristics provides a practical and scalable approach to efficient reasoning in agent systems. This framework can serve as a basis for more advanced agent behaviors, such as mid-generation pruning, budget-aware reasoning, and adaptive control of reasoning depth in real-world AI agents.



