A Chinese version of this article is available.

How Did LLM Agents Become What They Look Like in 2026?

1) A Brief History of LLM Agents

Stage 1: Structured Output

An LLM is like a magic function: it receives an input and produces an output. That seems similar to what we do every day with code, but with one key difference—both the input and the output are in natural language.

However, when we need the LLM to behave deterministically, we must make it output a machine-readable format, i.e., structured output. With structured output, this magic function can be embedded into a normal programmatic function.
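To make this concrete, here is a minimal sketch, assuming an OpenAI-compatible Python client; the model name, prompt, and the extract_invoice function are illustrative, not part of any spec:

```python
# A minimal sketch of structured output: ask the model for JSON only,
# then parse it so the rest of the program sees plain data.
import json
from openai import OpenAI

client = OpenAI()

def extract_invoice(text: str) -> dict:
    """Wrap the 'magic function' inside an ordinary programmatic function."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any model that reliably emits JSON
        messages=[
            {"role": "system",
             "content": "Return ONLY a JSON object with keys: vendor, total, currency."},
            {"role": "user", "content": text},
        ],
        response_format={"type": "json_object"},  # request machine-readable output
    )
    return json.loads(resp.choices[0].message.content)
```

Everything outside that call now works with plain dictionaries instead of natural language.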

Stage 2: Tool Calling

Tool calling is essentially the first standardized wrapper around structured output, introduced by the OpenAI API (which became the de facto standard for LLM APIs).

It introduces the concept of a “tool” that takes JSON arguments. This is how the magic function integrates with normal functions. However, the API itself does not actually execute the tool—it only returns the arguments. In that sense, tool calling is still structured output, just with a more explicit interface.
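As a sketch (the get_weather tool and model name are assumptions for illustration), this is roughly what a tool call looks like with the OpenAI Python SDK; note that the API only hands back the tool name and its JSON arguments:

```python
# Tool calling sketch: the API *returns* the chosen tool and its arguments.
# Executing get_weather is still our own code's responsibility.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)

call = resp.choices[0].message.tool_calls[0]   # may be absent if no tool was chosen
args = json.loads(call.function.arguments)     # still just structured output
print(call.function.name, args)                # running the tool is up to us
```

That gap, who actually runs the tool, is exactly what the next stage tries to fill.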

Stage 3: MCP

So who actually runs the tool? MCP introduces the missing role: the tool runtime.

MCP defines a client–server architecture: the MCP client is basically a wrapper around tool calling, except that it additionally forwards the call to an MCP server. The MCP server is the component that actually executes the tool.
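For illustration, a minimal MCP server sketch using the official Python SDK might look like the following; the tool itself is a placeholder, and transport details are omitted:

```python
# A minimal MCP server sketch: the server is the "tool runtime" that
# actually executes the tool when an MCP client asks for it.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-tools")

@mcp.tool()
def word_count(text: str) -> int:
    """Count the words in a piece of text."""
    return len(text.split())

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default; a client connects and calls the tool
```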

Stage 4-1: Bash

In my opinion, MCP is an over-engineered approach that ultimately failed to become the universal solution.

Since the early days of gpt-3.5-turbo, LLMs have shown strong coding ability—not only in major languages like Python, but also in:

  • JSON (as discussed above: the foundation of structured output and tool usage)
  • Bash. Bash becomes a “meta tool” for agents: potentially the only tool an agent needs. With Bash, an agent can run curl, wget, gh (GitHub CLI), and more, which makes many MCP-style applications redundant (see the sketch below this list).
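A minimal sketch of that “one tool” idea, reusing the same tool-calling interface as above; the sandboxing and timeout policy are deliberately left simple here:

```python
# Bash as the single "meta tool": one tool definition, and the agent
# composes curl, wget, gh, etc. inside it on its own.
import subprocess

bash_tool = {
    "type": "function",
    "function": {
        "name": "bash",
        "description": "Run a Bash command and return stdout/stderr.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}

def run_bash(command: str, timeout: int = 60) -> str:
    """Execute the command the model asked for (ideally inside a sandbox)."""
    proc = subprocess.run(
        ["bash", "-lc", command],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.stdout + proc.stderr
```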

Stage 4-2: Filesystem

As we dive deeper into agent development, several issues emerge:

  • Some tools may produce outputs too large for the context window, or generate artifacts like images that cannot be directly returned to the LLM (assuming the LLM is not multi-modal).
  • To process these artifacts, we need additional tools, such as:
    1. using a multi-modal LLM to summarize the image, or
    2. using zip to compress it, or
    3. uploading it somewhere on the internet.
  • The problem is that these options can all be valid at the same time. That would require 3 separate tools like:
    • generate_img_and_summarize
    • generate_img_and_zip
    • generate_img_and_send
  • If we add audio generation, we suddenly need 3 more tools! As the combinations grow, the number of tools grows combinatorially.

The root cause is that intermediate artifacts (images, audio, long texts) cannot be returned directly to the LLM in a single step. To decouple tool combinations, we need a place to store intermediate artifacts so that the LLM can decide what to do with them in subsequent turns.

That is the filesystem.
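Here is a sketch of the decoupling, with hypothetical generate_image and render_image helpers: the tool writes its artifact to disk and returns only a path, so summarizing, zipping, or uploading becomes a separate decision in a later turn:

```python
# Filesystem as the buffer for intermediate artifacts: tools return a
# small, context-friendly handle (a path) instead of the artifact itself.
from pathlib import Path
import uuid

WORKDIR = Path("/tmp/agent-artifacts")

def generate_image(prompt: str) -> str:
    """Create an image and return its path, not its bytes."""
    WORKDIR.mkdir(parents=True, exist_ok=True)
    path = WORKDIR / f"{uuid.uuid4().hex}.png"
    path.write_bytes(render_image(prompt))  # placeholder for the real generator
    return str(path)                        # the agent decides the next step later

def render_image(prompt: str) -> bytes:
    # Stand-in so the sketch runs; a real implementation would call a model.
    return b"fake image bytes for " + prompt.encode()
```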

Stage 4-3: What’s the runtime for Bash and the filesystem?

The OS, of course.

2) Stage NEXT

Bash looks promising as the one meta tool to rule them all, but many tasks cannot be covered by Bash alone. A good example is the browser—the ultimate GUI program. GUIs were designed to bypass the TTY, because complex tasks can overwhelm a human’s context window.

Meanwhile, LLMs are evolving alongside agent design. With innovations such as:

  • separation of thinking and output,
  • RL to improve structured JSON output and coding,
  • larger context windows,

some engineering practices may become obsolete. For example:

  • ReAct became less necessary once structured output was standardized.
  • CoT prompts and CoT-style workflows may become less critical once thinking/output separation is integrated during model training.

3) Problems

Trade-off: Latency vs. Quality

When using coding agents, I always choose the SOTA model and can tolerate long waiting times for the best result. But in many cases, users expect responses within acceptable latency.

For a chatbot scenario, I believe TTFT (time to first token) should be at most 20–30 seconds. That is one reason non-agentic RAG still exists.

Reproducibility

Even if users can wait for higher quality, they still expect predictable wait times and consistent outputs. That is why workflows continue to matter.

Trade-off: Privacy/Cost vs. Quality

In many situations, users choose smaller open-weight models, or even edge-sized models, for privacy or cost reasons.

Multi-modal Agents

Beyond coding agents for programmers, many users want agents with additional capabilities such as browser-use, computer-use, or interacting with specific GUIs.

4) How About Agent-Skills?

Technically, agent-skills is simply a protocol for dynamic prompt injection (the agent decides which prompts it should load), assuming the agent is built on Stage 4 as described above. In practice, it is nothing more than a set of files.

However, compared with MCP, agent-skills seems more likely to become widely adopted because:

  • it is self-contained as a folder and easy to distribute;
  • it is simple and can be distributed without a runtime or dependencies, assuming the client agent already operates at Stage 4;
  • since it is designed to run on an OS, it can pack anything that works on an OS: prompt.md, tools.sh, helper.py, and even bin/executable, lib/library.so, Minecraft.zip (if your agent is smart enough to play the game), Tutorial: Beat Ender Dragon with HALF A Heart.mp4 (if your agent can), and so on.

In short, agent-skills is a promising protocol. A Stage 4 agent system can act as a client to use skills from other providers, or a system can provide skills for other agents. The idea of dynamic prompt injection can support good system design, but that design is independent of the agent-skills protocol itself.
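As a rough sketch (the folder layout and SKILL.md convention here are assumptions for illustration), a Stage 4 agent could implement dynamic prompt injection over a skills directory like this:

```python
# Dynamic prompt injection in the spirit of agent-skills: each skill is a
# folder with a SKILL.md; the agent first sees only short descriptions and
# loads a skill's full prompt (and bundled files) when it decides to.
from pathlib import Path

SKILLS_DIR = Path("./skills")

def list_skills() -> dict[str, str]:
    """Return {skill_name: first line of SKILL.md} for the system prompt."""
    skills = {}
    for skill_md in SKILLS_DIR.glob("*/SKILL.md"):
        first_line = skill_md.read_text().splitlines()[0]
        skills[skill_md.parent.name] = first_line
    return skills

def load_skill(name: str) -> str:
    """Inject the full skill prompt only when the agent chooses this skill."""
    return (SKILLS_DIR / name / "SKILL.md").read_text()
```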

What if an OS distro could do pacman -S someprogram someprogram-agent-skill?
