LangSmith vs. Langfuse vs. Arize Compared

Your AI agent works well in testing. Then you ship it, and something breaks. A tool call loops forever. The retrieval step returns garbage. Costs climb. And you have absolutely no idea why.

That’s the agent observability problem. And if you build with LLMs, you need to solve it before production, not after. This post breaks down three of the most commonly used observability tools: LangSmith, Langfuse, and Arize. We’ll set up each one, trace the same agent, and compare what you’re really getting.

What is Agent Observability?

Traditional application monitoring tracks requests, errors, and latency, but that’s not enough for agents.

An agent may call multiple tools in sequence, and each LLM step has its own prompt, token consumption, latency, and potential failure point. A single failed retrieval or tool call can lead to an incorrect final response.

Agent observability captures the full execution graph: every step and decision, LLM inputs and outputs, tool calls with their arguments and results, token usage, latency, and evaluation scores. Without this visibility, debugging agent behavior is guesswork.
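To make the idea of an execution graph concrete, here is a minimal sketch of what a trace could look like as a data structure. The Span class, its field names, and the example values are assumptions made for illustration; none of the three tools necessarily stores traces this way.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Span:
    """One step in an agent run: an LLM call, a tool call, or the run itself."""

    name: str
    kind: str  # hypothetical labels: "agent", "llm", or "tool"
    inputs: Dict[str, object] = field(default_factory=dict)
    outputs: Dict[str, object] = field(default_factory=dict)
    tokens: int = 0
    latency_ms: float = 0.0
    error: str = ""
    children: List["Span"] = field(default_factory=list)

    def total_tokens(self) -> int:
        """Sum token usage over this span and all of its descendants."""
        return self.tokens + sum(c.total_tokens() for c in self.children)

    def failed_steps(self) -> List[str]:
        """Return the name of every span in the subtree that recorded an error."""
        failed = [self.name] if self.error else []
        for child in self.children:
            failed.extend(child.failed_steps())
        return failed


# A made-up run: two LLM steps and one failing tool call.
run = Span(
    name="agent_run",
    kind="agent",
    inputs={"input": "What is the status of order ORD-999?"},
    latency_ms=1850.0,
    children=[
        Span(name="plan", kind="llm", tokens=210, latency_ms=640.0),
        Span(
            name="get_order_status",
            kind="tool",
            inputs={"order_id": "ORD-999"},
            latency_ms=12.0,
            error="order not found",
        ),
        Span(name="answer", kind="llm", tokens=180, latency_ms=710.0),
    ],
)

print(run.total_tokens())  # 390
print(run.failed_steps())  # ['get_order_status']
```

With a tree like this, questions such as "which step failed?" or "how many tokens did this run consume?" become a simple traversal instead of guesswork.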

Setting Up the Test Agent

We will use a simple LangChain agent so that the three tools are easy to compare. The agent receives a query from the user, retrieves the appropriate context, and responds using one or more tools to provide an answer.

First, install the required libraries and create the test agent.
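The base agent and every demo in this post read API keys from environment variables, and a missing key otherwise surfaces as a bare KeyError. A small preflight check can make that failure easier to read. This is only a sketch: missing_env_vars and example_env are hypothetical names, not part of any of the libraries used here.

```python
import os
from typing import Iterable, List, Mapping


def missing_env_vars(
    required: Iterable[str],
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Hypothetical environment used only for illustration.
example_env = {"OPENAI_API_KEY": "sk-test", "LANGCHAIN_API_KEY": ""}

missing = missing_env_vars(
    ["OPENAI_API_KEY", "LANGCHAIN_API_KEY", "LANGFUSE_PUBLIC_KEY"],
    env=example_env,
)
print(missing)  # ['LANGCHAIN_API_KEY', 'LANGFUSE_PUBLIC_KEY']
```

In a real script you would call missing_env_vars([...]) without the env argument, after load_dotenv(), and exit with a clear message if the list is not empty.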

Let’s look at the base agent with two tools (search_docs and get_order_status). This will serve as our baseline for comparing the three observability tools.

"""
Base agent used across all three observability demos.

Swap the OPENAI_API_KEY env var or call build_agent() from any demo file.
"""

import os

from dotenv import load_dotenv
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain.tools import tool
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI

load_dotenv()


@tool
def search_docs(query: str) -> str:
    """Search internal docs for relevant information."""
    # Simulated retrieval — swap with your actual vector store
    docs = {
        "refund": (
            "Refunds are processed within 5-7 business days. "
            "Items must be returned within 30 days."
        ),
        "shipping": (
            "Standard shipping takes 3-5 business days. "
            "Express is 1-2 days."
        ),
        "account": (
            "You can reset your password via the login page. "
            "Contact support for account issues."
        ),
    }

    for keyword, content in docs.items():
        if keyword in query.lower():
            return content

    return f"Found general docs related to: {query}"


@tool
def get_order_status(order_id: str) -> str:
    """Look up the status of an order by ID."""
    # Simulated order lookup
    statuses = {
        "ORD-001": "Shipped — expected delivery 2026-05-30",
        "ORD-002": "Processing — not yet shipped",
        "ORD-003": "Delivered on 2026-05-25",
    }

    return statuses.get(
        order_id,
        f"Order {order_id} not found in the system.",
    )


def build_agent() -> AgentExecutor:
    llm = ChatOpenAI(
        model="gpt-4o",
        temperature=0,
        api_key=os.environ["OPENAI_API_KEY"],
    )

    tools = [search_docs, get_order_status]

    prompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "You are a helpful customer support assistant. "
                "Use tools when needed.",
            ),
            ("user", "{input}"),
            MessagesPlaceholder(variable_name="agent_scratchpad"),
        ]
    )

    agent = create_openai_tools_agent(llm, tools, prompt)

    return AgentExecutor(
        agent=agent,
        tools=tools,
        verbose=False,
    )


TEST_QUESTIONS = [
    "What are the refund policies?",
    "What is the status of order ORD-002?",
    "How long does shipping take?",
]


if __name__ == "__main__":
    executor = build_agent()

    for question in TEST_QUESTIONS:
        print(f"\nQ: {question}")

        result = executor.invoke({"input": question})

        print(f"A: {result['output']}")

This creates a baseline agent that can be reused with each tool. The first tool we will examine is LangSmith.

LangSmith: Native LangChain Tracing

LangSmith is built by the LangChain team. If you already use LangChain, integration is fast and easy.

"""
LangSmith observability demo.

Setup:

pip install langsmith

Set LANGCHAIN_API_KEY in your .env file.

How it works:

LangSmith hooks into LangChain's callback system via env vars, so no code
changes are needed beyond the two os.environ lines below.
"""

import os

from dotenv import load_dotenv

from agent_base import TEST_QUESTIONS, build_agent

load_dotenv()

# Enable LangSmith tracing. These two vars are all you need.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "agent-observability-demo"

# LANGCHAIN_API_KEY must be set in your .env or environment.


def run_with_metadata(
    executor,
    question: str,
    user_id: str = "demo-user",
):
    """Run the agent and attach per-run metadata via config."""
    return executor.invoke(
        {"input": question},
        config={
            "metadata": {
                "user_id": user_id,
                "source": "langsmith_demo",
            },
            # Optional: tag runs for filtering in the dashboard.
            "tags": ["observability-blog", "demo"],
        },
    )


def main():
    print("=== LangSmith Demo ===")
    print("Traces will appear in your LangSmith project dashboard.")
    print(f"Project: {os.environ['LANGCHAIN_PROJECT']}\n")

    executor = build_agent()

    for question in TEST_QUESTIONS:
        print(f"Q: {question}")

        result = run_with_metadata(executor, question)

        print(f"A: {result['output']}\n")

    print("Done. Open LangSmith to inspect the full trace tree for each run.")


if __name__ == "__main__":
    main()

LangSmith hooks into the LangChain callback system automatically. No decorators or wrappers are needed, and every run shows up in your project dashboard.
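To see why an environment variable can be enough to switch tracing on, here is a toy sketch of env-gated tracing. It is not LangChain's or LangSmith's implementation: TOY_TRACING, TOY_PROJECT, RECORDED_RUNS, and invoke are hypothetical names chosen for illustration.

```python
import os
from typing import Callable, Dict, List

# In-memory stand-in for a tracing backend (hypothetical, for illustration).
RECORDED_RUNS: List[Dict[str, str]] = []


def tracing_enabled() -> bool:
    """Check the environment on every call, so no call site needs to change."""
    return os.environ.get("TOY_TRACING", "").lower() == "true"


def invoke(step: Callable[[str], str], text: str) -> str:
    """Run a step and record it only when tracing is switched on."""
    output = step(text)
    if tracing_enabled():
        RECORDED_RUNS.append(
            {
                "project": os.environ.get("TOY_PROJECT", "default"),
                "input": text,
                "output": output,
            }
        )
    return output


def shout(text: str) -> str:
    return text.upper()


# Tracing is off: the call runs but nothing is recorded.
os.environ.pop("TOY_TRACING", None)
invoke(shout, "not traced")

# Tracing is on: the same call site now produces a record.
os.environ["TOY_TRACING"] = "true"
os.environ["TOY_PROJECT"] = "agent-observability-demo"
invoke(shout, "traced")

print(len(RECORDED_RUNS))  # 1
```

The point is that the framework owns the hook, so the calling code stays identical whether tracing is on or off.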

What you will see on the dashboard:

LangSmith’s trace view shows the full agent execution tree, from the initial call through tool usage and LLM responses to the final output. Each node includes inputs, outputs, and latency.

You can tag runs, add metadata, filter by result, save runs as datasets, and run evaluations. This is useful when iterating on a prompt or a retrieval strategy.

The prompt playground is another strong feature. You can open any trace, edit the prompt inline, and rerun it to debug LLM failures.

LangSmith’s limitations show up at scale. The free tier has caps, and integration takes more effort if you don’t use LangChain, although OpenTelemetry is supported.

Langfuse: Open Source and Framework-Agnostic

Langfuse is the open-source option here. You can host it on your own server or use their cloud service. It also works with any framework, including LangChain, LlamaIndex, and the raw OpenAI API.

# Read this Doc-string for installing the dependencies and their setup 
"""
Langfuse observability demo.

Setup:

pip install "langfuse<3"  (this demo uses the v2 Python SDK: langfuse.callback, lf.score)

Set LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY in your .env file.

LANGFUSE_HOST defaults to the Langfuse cloud endpoint; override it for self-hosted.

Key differences from LangSmith:

- Callback handler is passed per-invoke for more explicit control.
- Native session grouping for multi-turn conversations.
- You can score any trace after the fact via the Langfuse client.
"""

import os

from dotenv import load_dotenv
from langfuse import Langfuse
from langfuse.callback import CallbackHandler

from agent_base import TEST_QUESTIONS, build_agent

load_dotenv()


def build_handler(
    session_id: str,
    user_id: str = "demo-user",
) -> CallbackHandler:
    return CallbackHandler(
        public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
        secret_key=os.environ["LANGFUSE_SECRET_KEY"],
        host=os.getenv("LANGFUSE_HOST"),  # unset -> the SDK falls back to its default host
        session_id=session_id,
        user_id=user_id,
        metadata={"source": "langfuse_demo"},
        tags=["observability-blog", "demo"],
    )


def score_trace(
    trace_id: str,
    score: float,
    comment: str = "",
):
    """Add a correctness score to a trace after reviewing the output."""
    lf = Langfuse(
        public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
        secret_key=os.environ["LANGFUSE_SECRET_KEY"],
        host=os.getenv("LANGFUSE_HOST"),  # unset -> the SDK falls back to its default host
    )

    lf.score(
        trace_id=trace_id,
        name="correctness",
        value=score,
        comment=comment,
    )

    lf.flush()

    print(f"Scored trace {trace_id}: {score}")


def run_single_session(
    executor,
    session_id: str,
):
    """Run all test questions in a single session so they're linked in the UI."""
    handler = build_handler(session_id=session_id)
    trace_ids = []

    for question in TEST_QUESTIONS:
        print(f"Q: {question}")

        result = executor.invoke(
            {"input": question},
            config={"callbacks": [handler]},
        )

        print(f"A: {result['output']}\n")

        # handler.get_trace_id() returns the trace ID for the last run.
        trace_ids.append(handler.get_trace_id())

    # Flush ensures traces are sent before the process exits.
    # This is critical in batch jobs.
    handler.flush()

    return trace_ids


def main():
    print("=== Langfuse Demo ===")
    print(f"Dashboard: {os.getenv('LANGFUSE_HOST', 'Langfuse Cloud (default host)')}\n")

    executor = build_agent()
    session_id = "demo-session-001"

    trace_ids = run_single_session(executor, session_id)

    # Example: programmatically score the first trace.
    if trace_ids and trace_ids[0]:
        print("\nScoring first trace as an example:")
        score_trace(trace_ids[0], score=0.9, comment="Answer was accurate")

    print(f"\nDone. Find all runs under session '{session_id}' in your Langfuse dashboard.")


if __name__ == "__main__":
    main()

You pass the callback handler explicitly on each call. That is more verbose than LangSmith, but it offers more flexibility, since you can provide user IDs, session IDs, and custom metadata per request.
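The per-call pattern can be sketched with a toy handler that stamps every event with request context. This is an illustration of the idea only, not the Langfuse SDK: ToyHandler, on_run, and invoke are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence


@dataclass
class ToyHandler:
    """Hypothetical handler that stamps every event with request context."""

    session_id: str
    user_id: str
    events: List[Dict[str, str]] = field(default_factory=list)

    def on_run(self, text: str, output: str) -> None:
        self.events.append(
            {
                "session_id": self.session_id,
                "user_id": self.user_id,
                "input": text,
                "output": output,
            }
        )


def invoke(
    step: Callable[[str], str],
    text: str,
    callbacks: Sequence[ToyHandler] = (),
) -> str:
    """Run a step and notify only the handlers passed for this call."""
    output = step(text)
    for handler in callbacks:
        handler.on_run(text, output)
    return output


def echo(text: str) -> str:
    return f"echo: {text}"


alice = ToyHandler(session_id="s-1", user_id="alice")
bob = ToyHandler(session_id="s-2", user_id="bob")

invoke(echo, "refund policy?", callbacks=[alice])
invoke(echo, "order status?", callbacks=[bob])
invoke(echo, "untracked call")  # no handler passed, so nothing is recorded

print(len(alice.events), len(bob.events))  # 1 1
```

Because the handler travels with the call, two users served by the same process never share trace context, at the cost of one extra argument per invocation.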

Evaluation Workflow

Langfuse has a really good evaluation workflow: you can attach scores to a trace after it has completed.

from langfuse import Langfuse

lf = Langfuse()

# Score a specific trace by ID.
lf.score(
    trace_id="trace-abc123",
    name="correctness",
    value=0.9,
    comment="Answer was accurate but slightly verbose",
)

This works well alongside human review of responses by your team, allowing you to build aggregated evaluation metrics over time.
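As a rough picture of what "aggregated over time" means, the sketch below averages scores per day. The Score tuple, the daily_mean helper, and all values are made up for illustration; Langfuse computes its own aggregates in the dashboard, and this is not its API.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, NamedTuple


class Score(NamedTuple):
    trace_id: str
    name: str
    value: float
    day: str  # ISO date on which the score was recorded


def daily_mean(scores: List[Score], name: str) -> Dict[str, float]:
    """Average the scores with the given name, grouped by day."""
    by_day: Dict[str, List[float]] = defaultdict(list)
    for score in scores:
        if score.name == name:
            by_day[score.day].append(score.value)
    return {day: round(mean(values), 3) for day, values in sorted(by_day.items())}


# Made-up scores for illustration only.
scores = [
    Score("t1", "correctness", 0.9, "2026-05-01"),
    Score("t2", "correctness", 0.7, "2026-05-01"),
    Score("t3", "correctness", 0.5, "2026-05-02"),
    Score("t3", "helpfulness", 1.0, "2026-05-02"),
]

result = daily_mean(scores, "correctness")
print(result)  # {'2026-05-01': 0.8, '2026-05-02': 0.5}
```

A drop like the one from 0.8 to 0.5 is the kind of signal that is easy to miss when scores are only inspected one trace at a time.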

Session grouping links related traces together, so multi-turn conversations are easy to follow. All traces from a user’s session are connected in the UI, allowing you to follow the entire conversation in one place.
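Conceptually, session grouping is a group-by on a shared session ID followed by a chronological sort. Here is a minimal sketch under that assumption; the Trace tuple and group_by_session are hypothetical names, and the data is invented.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Trace(NamedTuple):
    trace_id: str
    session_id: str
    started_at: float  # seconds since the start of the demo
    question: str


def group_by_session(traces: List[Trace]) -> Dict[str, List[Trace]]:
    """Bucket traces by session and order each bucket chronologically."""
    sessions: Dict[str, List[Trace]] = defaultdict(list)
    for trace in traces:
        sessions[trace.session_id].append(trace)
    return {
        session_id: sorted(items, key=lambda t: t.started_at)
        for session_id, items in sessions.items()
    }


# Traces arrive out of order and interleaved with another session.
traces = [
    Trace("t2", "demo-session-001", 4.0, "What is the status of order ORD-002?"),
    Trace("t9", "other-session", 1.0, "How do I reset my password?"),
    Trace("t1", "demo-session-001", 0.0, "What are the refund policies?"),
]

conversation = group_by_session(traces)["demo-session-001"]
print([t.trace_id for t in conversation])  # ['t1', 't2']
```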

Arize: Production-Grade ML Observability

Originally developed as a platform for monitoring traditional machine learning models, Arize can now monitor both models and LLM agents. Its original purpose, helping teams run models in production at scale, remains the same.

Using OpenInference

Arize uses the OpenInference standard as its tracing schema and OpenTelemetry for instrumentation. Configuring Arize is more involved than configuring the other two tools.

# Read this Doc-string for installing the dependencies and their setup 
"""
Arize observability demo.

Setup:

pip install arize-otel openinference-instrumentation-langchain

Set ARIZE_SPACE_ID and ARIZE_API_KEY in your .env file.

Key differences from the others:

- Uses OpenTelemetry under the hood, so it integrates with existing OTel stacks.
- Instrumentation is global like LangSmith, not per-invoke like Langfuse.
- Best-in-class production monitoring: drift detection, cohort analysis, alerting.
- Phoenix, arize-phoenix, is the free local sibling for development use.
"""

import os

from arize.otel import register
from dotenv import load_dotenv
from openinference.instrumentation.langchain import LangChainInstrumentor

from agent_base import TEST_QUESTIONS, build_agent

load_dotenv()


def setup_arize_tracing():
    """Register Arize as the OTel tracer provider and instrument LangChain globally."""
    tracer_provider = register(
        space_id=os.environ["ARIZE_SPACE_ID"],
        api_key=os.environ["ARIZE_API_KEY"],
        project_name="agent-observability-demo",
    )

    LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

    return tracer_provider


def run_with_attributes(
    executor,
    question: str,
    user_segment: str = "standard",
):
    """Run the agent and attach span attributes for cohort analysis in Arize."""
    from opentelemetry import trace

    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("agent_run") as span:
        span.set_attribute("user.segment", user_segment)
        span.set_attribute("query.text", question)
        span.set_attribute("demo.source", "arize_demo")

        result = executor.invoke({"input": question})

        span.set_attribute("response.text", result["output"])

        return result


def main():
    print("=== Arize Demo ===")
    print("Traces will appear in your Arize dashboard.")
    print("Project: agent-observability-demo\n")

    setup_arize_tracing()

    executor = build_agent()

    # Simulate two user segments to demonstrate cohort analysis in Arize.
    segments = ["premium", "standard", "standard"]

    for question, segment in zip(TEST_QUESTIONS, segments):
        print(f"Q: {question} [segment={segment}]")

        result = run_with_attributes(
            executor,
            question,
            user_segment=segment,
        )

        print(f"A: {result['output']}\n")

    print("Done. In Arize, use the cohort filter to compare premium vs standard responses.")
    print("Set up monitors on the Arize dashboard to alert on response quality drift.")


if __name__ == "__main__":
    main()

The instrumentation is global, like LangSmith’s, but it plugs into the broader OpenTelemetry ecosystem. Arize can therefore fit into your organization’s existing observability stack (e.g., Jaeger, Grafana) regardless of the framework you use.
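The reason a manually opened span such as agent_run ends up wrapping the automatically instrumented spans is context propagation: whichever span is active when a new one starts becomes its parent. The toy tracer below illustrates that mechanism with the standard library only. It is not the OpenTelemetry API; start_span, FINISHED, and instrumented_tool_call are hypothetical names.

```python
import contextvars
from contextlib import contextmanager
from typing import Dict, Iterator, List

# Holds the name of the span that is currently active, if any.
_current_span = contextvars.ContextVar("current_span", default=None)

# Finished spans, in the order they were closed.
FINISHED: List[Dict[str, object]] = []


@contextmanager
def start_span(name: str, **attributes: object) -> Iterator[Dict[str, object]]:
    """Open a span whose parent is whatever span is active right now."""
    span: Dict[str, object] = {
        "name": name,
        "parent": _current_span.get(),
        "attributes": dict(attributes),
    }
    token = _current_span.set(name)
    try:
        yield span
    finally:
        _current_span.reset(token)
        FINISHED.append(span)


def instrumented_tool_call() -> None:
    """Stands in for library code that opens its own span automatically."""
    with start_span("tool:get_order_status"):
        pass


# The outer span carries the cohort attribute; the inner one inherits the parent.
with start_span("agent_run", **{"user.segment": "premium"}):
    instrumented_tool_call()

for span in FINISHED:
    print(span["name"], "<-", span["parent"])
```

The inner span closes first and reports agent_run as its parent, which is why attributes set on the outer span can be used to slice everything that happened inside it.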

Which Tool Should You Choose?

To be completely honest, there is no single right tool for every use case; it all depends on where you are in the development cycle and what your team needs.

| Feature | LangSmith | Langfuse | Arize |
|---|---|---|---|
| Setup complexity | Minimal (2 env vars) | Low (callback handler) | Most boilerplate |
| Framework support | LangChain-native; others via OTel | Any framework | Any framework via OTel |
| Self-hosting | Limited | First-class (Docker Compose) | Phoenix only (local dev) |
| Trace visualization | Excellent tree view | Good, session-linked | Good, OTel-standard |
| Evaluation / scoring | Datasets + playground | Custom scores (per trace or session) | Rubric-based evals |
| Production monitoring | Basic | Basic | Drift, alerts, cohorts |
| Multi-turn / sessions | Thread-level | Native session grouping | Trace-level only |
| Open source | Proprietary | Fully open source | Phoenix is OSS; the platform is not |
| Free tier | Limited traces/month | Generous (self-host = unlimited) | Limited |
| Best for | LangChain dev & iteration | Data ownership + any framework | Production-scale monitoring |
  • Use LangSmith if you are building with LangChain and want the fastest setup for debugging and iteration.
  • Use Langfuse if you need self-hosting, strong data ownership, multi-framework support, or session-level tracking for chat agents.
  • Use Arize if your agent is moving to production and you need monitoring, drift detection, cohorts, and alerts.

Conclusion

Agent observability is one of those things you regret skipping after something goes wrong in production. Reconstructing agent runs after the fact, without any instrumentation, is like debugging a distributed system with print statements.

All three tools covered here are production-ready. Each has a free tier. And each one takes less than 30 minutes to integrate with a LangChain agent. There is no good reason to ship an agent you cannot see into.

Choose the right tool for your current stage. Add scoring early, even informally. And if your agent starts doing something weird at 2 a.m., you’ll be glad you did.

Riya Bansal

Data Science Trainee at Analytics Vidhya
I currently work as a Data Science Trainer at Analytics Vidhya, where I focus on building data-driven solutions and applying AI/ML techniques to solve real-world business problems. My work allows me to explore advanced analytics, machine learning, and AI applications that empower organizations to make smarter, evidence-based decisions.
With a strong foundation in computer science, software development, and data analysis, I am passionate about using AI to create impactful, innovative solutions that bridge the gap between technology and business.
