A survey posted to arXiv proposes a structured framework for categorizing and evaluating how large language model (LLM) agents organize their workflows, covering everything from fixed pre-deployment pipelines to systems that rewrite their own structure mid-task.
As LLM-based systems grow more complex — combining model calls, web search, code execution, memory, and verification steps — researchers and engineers have developed a wide variety of architectures for sequencing these components. To date, no common vocabulary or evaluation standard has unified the field. This paper, from researchers posting to arXiv's cs.AI section, attempts to fill that gap.
Static Templates Versus Graphs That Rewrite Themselves
The central organizing principle of the survey is when a workflow's structure gets decided. The authors define two broad camps: static methods, which lock in a reusable scaffold before the system ever runs, and dynamic methods, which select, generate, or revise the workflow for a specific task — either just before execution begins or while it is already running.
This distinction matters more than it might first appear. A static workflow is predictable and auditable; you can inspect it, test it, and deploy it with confidence about what will happen. A dynamic workflow can, in theory, adapt to the specific demands of an unusual task — but it introduces uncertainty about behavior, cost, and failure modes.
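The contrast can be made concrete with a small sketch. The following is purely illustrative — the survey defines the static/dynamic categories, not any API — and the rule-based planner here stands in for whatever mechanism (an LLM planner, a learned selector) a real dynamic system would use:

```python
# Static: the scaffold is fixed before deployment, so every task follows
# the same, inspectable path.
STATIC_PIPELINE = ["plan", "retrieve", "answer"]

def run_static(task: dict) -> list[str]:
    # Structure is decided at design time; the task cannot change it.
    return list(STATIC_PIPELINE)

# Dynamic: structure is selected per task just before execution.
# Trivial rules here stand in for an LLM-driven or learned planner.
def run_dynamic(task: dict) -> list[str]:
    steps = ["plan"]
    if task.get("needs_search"):
        steps.append("retrieve")
    if task.get("needs_code"):
        steps.append("execute")
    steps.append("answer")
    return steps
```

The static path is trivially auditable — it is a constant — while the dynamic path cannot be fully known until the task arrives, which is exactly the testing and trust trade-off at issue.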
The difference between a workflow that is fixed at design time and one that rewrites itself at runtime is not just architectural — it changes what you can test, audit, and trust.
The paper organizes prior work along three dimensions: when structure is determined, which part of the workflow is being optimized, and what evaluation signals guide that optimization. Those signals include task performance metrics, verifier outputs, human preferences, and feedback derived from execution traces.
A Vocabulary Problem Holding Back Progress
One of the survey's core arguments is that the field suffers from inconsistent terminology. Researchers often use the same words — 'agent', 'pipeline', 'workflow' — to mean different things, making it difficult to compare methods or reproduce results.
To address this, the authors introduce the term agentic computation graphs (ACGs) as a unifying concept. Under this framework, a workflow is a graph in which nodes represent components (an LLM call, a retrieval step, a code executor) and edges represent dependencies and information flows between them.
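A minimal sketch makes the ACG idea concrete. The class and field names below are assumptions for illustration — the survey offers the concept, not a reference implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A component in an agentic computation graph (ACG)."""
    name: str
    kind: str  # e.g. "llm_call", "retrieval", "code_exec"

@dataclass
class ACG:
    """Nodes are components; edges are dependencies / information flows."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, name: str, kind: str) -> None:
        self.nodes[name] = Node(name, kind)

    def add_edge(self, src: str, dst: str) -> None:
        assert src in self.nodes and dst in self.nodes
        self.edges.append((src, dst))

    def successors(self, name: str) -> list:
        # Which components consume this node's output?
        return [dst for src, dst in self.edges if src == name]

# A small retrieval-augmented QA workflow as an ACG
g = ACG()
g.add_node("retrieve", "retrieval")
g.add_node("draft", "llm_call")
g.add_node("verify", "code_exec")
g.add_edge("retrieve", "draft")
g.add_edge("draft", "verify")
```

Once a workflow is expressed this way, questions like "how many components does it have?" or "what feeds the verifier?" become ordinary graph queries.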
They further distinguish between three layers of abstraction: reusable workflow templates (the design), run-specific realized graphs (the structure instantiated for a given task), and execution traces (what actually happened at runtime). Keeping these three separate is, they argue, essential for understanding where optimization is actually occurring and what claims can fairly be made about a method's generality.
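A rough sketch of the three layers, with all names hypothetical — the survey defines the layers themselves, not any particular representation:

```python
# Layer 1 — reusable template: the design, shared across tasks.
TEMPLATE = {"steps": ["retrieve", "draft", "verify"]}

def realize(template: dict, task: dict) -> dict:
    """Layer 2 — realized graph: the template instantiated for one task."""
    steps = list(template["steps"])
    if task.get("needs_code"):
        steps.append("execute")  # task-specific structural change
    return {"task": task["id"], "steps": steps}

def run(realized: dict) -> list:
    """Layer 3 — execution trace: what actually happened at runtime."""
    return [(step, "ok") for step in realized["steps"]]

graph = realize(TEMPLATE, {"id": "q42", "needs_code": True})
trace = run(graph)
```

The separation clarifies claims: a method that edits `TEMPLATE` generalizes across tasks, one that edits the realized graph adapts per task, and one that learns from traces is optimizing against observed behavior.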
How Workflows Get Optimized — and Evaluated
The survey catalogues the range of signals researchers have used to guide workflow optimization. These include straightforward task metrics (did the system answer the question correctly?), signals from automated verifiers, human preference data, and patterns extracted from previous execution traces.
Each signal type comes with trade-offs. Task metrics are concrete but may not capture whether a workflow is efficient or robust. Verifier signals can be automated at scale but depend on the quality of the verifier. Trace-derived feedback is rich but requires storing and processing potentially large volumes of runtime data.
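In practice these signals are often combined into a single optimization objective. The weights and field names below are pure assumptions for illustration — the survey catalogues the signal types but prescribes no aggregation formula:

```python
def optimization_signal(run: dict, weights=(0.5, 0.3, 0.2)) -> float:
    """Hypothetical weighted blend of the signal types the survey lists."""
    w_task, w_verifier, w_pref = weights
    return (w_task * run["task_metric"]      # e.g. exact-match accuracy
            + w_verifier * run["verifier"]   # automated verifier score, [0, 1]
            + w_pref * run["preference"])    # human preference score, [0, 1]

score = optimization_signal(
    {"task_metric": 1.0, "verifier": 0.8, "preference": 0.6}
)
```

The trade-offs above show up directly in the weights: leaning on the verifier term scales cheaply but inherits the verifier's blind spots, while leaning on human preference is reliable but expensive to collect.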
The authors also introduce what they call a structure-aware evaluation perspective — the idea that assessing a workflow should go beyond whether it produced the right answer. Graph-level properties (how complex is the workflow?), execution cost, robustness to input variation, and how much the structure changes across different inputs are all proposed as complementary metrics.
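Two of these structure-aware metrics can be sketched over realized graphs represented as edge lists. The definitions here are illustrative proxies, not the survey's formal ones:

```python
def graph_complexity(edges: list) -> int:
    """Crude complexity proxy: node count plus edge count."""
    nodes = {n for edge in edges for n in edge}
    return len(nodes) + len(edges)

def structural_variation(graphs: list) -> float:
    """Instability across runs: share of edges NOT common to every
    realized graph (0 = identical structure on every input)."""
    edge_sets = [set(g) for g in graphs]
    common = set.intersection(*edge_sets)
    union = set.union(*edge_sets)
    return 1 - len(common) / len(union)

# Two realized graphs for the same workflow on different inputs
runs = [
    [("retrieve", "draft"), ("draft", "verify")],
    [("retrieve", "draft"), ("draft", "verify"), ("verify", "retry")],
]
variation = structural_variation(runs)
```

Reporting numbers like these alongside accuracy is exactly the shift the authors are proposing: the same accuracy score can hide very different complexity and stability profiles.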
This is a meaningful methodological point. A workflow that achieves 95% accuracy but does so with unpredictable structural variation across runs may be harder to deploy reliably than one achieving 90% accuracy with a stable, auditable graph.
Why This Matters for Building Reliable AI Systems
The practical implications extend beyond academic taxonomy. As enterprises and developers build more sophisticated AI agents — systems that browse the web, write and run code, query databases, and call external APIs in sequence — the question of how to design, optimize, and evaluate those pipelines becomes a core engineering challenge.
Current practice is fragmented. Some teams hand-craft static pipelines that work well for known tasks but break on edge cases. Others use dynamic approaches that adapt but are difficult to debug or audit. The lack of shared standards means that a method shown to work well in one paper is often difficult to compare fairly against a method from another.
By proposing a common framework, this survey aims to make such comparisons possible — and to give future researchers a clearer set of questions to answer when they introduce a new method: Is your optimization targeting the template, the realized graph, or the execution behavior? What signal guides it? And how does your evaluation account for cost and structural consistency, not just task accuracy?
What This Means
For anyone building or researching LLM-based agent systems, this survey provides a practical vocabulary and evaluation lens that could make workflow designs easier to compare, reproduce, and trust at deployment scale.