
Beyond Depth-First: Cardinal — Shopify’s Breadth-First GraphQL Execution Engine

By Varun Sharma · Published April 20, 2026 · 12 min read · Source: Level Up Coding

Most GraphQL optimization work focuses on the same problems: reducing N+1 queries, tuning dataloader batching, caching expensive fields. When Shopify dug into traces for their largest queries, they found the bottleneck wasn’t any of those things. It was the execution engine itself. Specifically, the CPU-bound overhead of running field resolvers at scale, not the I/O or the application logic inside them.

That finding led to a question that most GraphQL teams haven’t had to ask: what if the execution model is wrong for this workload? Not broken. GraphQL’s depth-first execution is sound and runs reliably in production at thousands of companies. But potentially mismatched to the specific shape of high-cardinality list queries at this scale.

The result is GraphQL Cardinal, a new breadth-first execution engine that Shopify built from scratch and has been running in production for over a year. This article walks through what they found, how the engine works, how it compares to other approaches the community has taken, and what’s available to Ruby engineers today.

What the Traces Were Showing

Shopify’s GraphQL layer supports deeply nested, high-cardinality queries: fetching 250 products with 250 variants each, for example. These are patterns most GraphQL APIs guard against; Shopify supports them deliberately.

When they dug into traces for these large queries, the bottleneck wasn’t where most engineers would look. The majority of request time wasn’t being spent on I/O, loading data from databases or external services. It was being spent on field resolution: the CPU-bound work of building the GraphQL response itself.

This matters because it points to a different category of problem. I/O bottlenecks are solved with better batching, caching, and query optimization. Execution engine bottlenecks require rethinking how the engine works.

Understanding Why

Conventional GraphQL engines, including graphql-ruby (which Shopify has used since 2015) and graphql-js (the official specification implementation), traverse the response tree depth-first. The engine picks the first product in a list, resolves all of its fields and nested variant subtrees, then moves to the next product and repeats. This is a sound, well-understood approach that works well for the vast majority of GraphQL queries.

The scaling characteristic they identified isn’t a flaw in depth-first design. It’s inherent to how it works at high cardinality. Three costs compound:

No amortization across subtrees. Each product’s subtree is processed independently, so the time to process 100 products is the time to process one, multiplied by 100. Shopify’s profiles showed this as a distinct “column” pattern: each column is one product’s subtree, and the columns line up sequentially. At 250 products with 250 variants, that’s 62,500 field executions of effectively the same logic.

Field-level overhead multiplied across every object. Each field execution carries non-zero platform overhead: authorization, instrumentation hooks, tracing, engine bookkeeping. Shopify measured an empty field-level tracing hook adding roughly 10% overhead across 1,000 fields. Small individually, but in depth-first execution this overhead runs once per field per object. A query with 5 fields across 1,000 objects runs 5,000 field executions. At 1ms overhead per field, that’s 5 seconds of CPU time spent purely on engine machinery.

Dataloader promise overhead at scale. Dataloaders solve GraphQL’s N+1 problem by pooling I/O across fields. They’re the right tool for reducing database calls. But in depth-first execution, the promise machinery runs once per field per object, building thousands of individual promises each carrying allocation costs and GC pressure. Shopify measured a graphql-batch workflow with no I/O running roughly 2.5x slower than equivalent non-lazy fields, purely from promise overhead.

None of these are problems at low cardinality. They become the dominant cost at scale.

The Hypothesis

The observation led to a question: what if field resolvers ran once per field across all objects simultaneously, rather than once per field per object?

The interface change this requires is straightforward. Instead of a resolver receiving one object and returning one result, it receives a set of objects and returns a mapped set of results. The resolver logic is the same. The execution model isn’t.
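In Ruby, the two interfaces might look like this. This is an illustrative sketch with made-up method and field names, not Cardinal’s actual API:

```ruby
# Hypothetical resolver shapes (illustrative names, not Cardinal's API).

# Depth-first interface: one object in, one result out.
def title(product)
  product[:title]
end

# Breadth-first interface: a set of objects in, a mapped set of results out.
def titles(products)
  products.map { |product| product[:title] }
end

products = [{ title: "Shirt" }, { title: "Hat" }]
title(products.first)  # => "Shirt"
titles(products)       # => ["Shirt", "Hat"]
```

The mapping logic inside the resolver is unchanged; only how often it is invoked differs.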

The team laid out the napkin math before writing a line of engine code. The assumptions: all GraphQL fields carry some non-zero overhead cost, rounded to 1ms for simplicity. Scenario: 5 fields, 1,000 objects.

+---------------+-------------------------+-----------------+---------------------+
|               | Field executions        | Overhead        | Dataloader promises |
+---------------+-------------------------+-----------------+---------------------+
| Depth-first   | 5,000 (depth x breadth) | ~5 seconds      | 5,000               |
+---------------+-------------------------+-----------------+---------------------+
| Breadth-first | 5 (depth only)          | ~5 milliseconds | 5                   |
+---------------+-------------------------+-----------------+---------------------+

The largest dimension, breadth (the number of objects), is removed as a multiplying factor from platform overhead entirely. The I/O savings from batching remain. The promise overhead doesn’t scale with list size.
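The table’s arithmetic can be checked in a few lines of Ruby. This is a toy model; the 1ms overhead figure is the article’s own rounded assumption:

```ruby
# Napkin math from the table above: 5 fields, 1,000 objects,
# ~1ms of platform overhead per field execution (rounded assumption).
fields      = 5
objects     = 1_000
overhead_ms = 1

depth_first_executions   = fields * objects # depth x breadth
breadth_first_executions = fields           # depth only

depth_first_overhead_ms   = depth_first_executions * overhead_ms   # ~5 seconds
breadth_first_overhead_ms = breadth_first_executions * overhead_ms # ~5 milliseconds
```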

The idea of batching resolvers by field isn’t new. Libraries like DataLoader and graphql-resolve-batch introduced batching to reduce redundant work, primarily around I/O and resolver execution. These approaches operate within the traditional depth-first execution model. The engine still traverses the response tree per object, while batching is layered on top, collecting work across resolver invocations and executing it in groups.

Cardinal takes a fundamentally different approach. It eliminates depth-first traversal entirely and executes fields breadth-first by default. Batching is not an optimization, it is the execution model.

Building the Engine

Cardinal was built as a standalone wrapper around GraphQL Ruby’s existing static primitives: schemas, ASTs, and type definitions. The core algorithm is available as an open-source proof of concept at graphql-breadth-exec. Execution works in three phases.

Tree building. Cardinal constructs an execution tree from the request’s static AST. The tree has two primitives: scopes (typed closures containing fields) and fields (with return types and child scopes). The tree is built eagerly for statically-resolvable positions. Abstract types are built lazily once the parent resolves. One design constraint: the tree can only be navigated upward, never downward.

Planning (lookbehind). Before execution, Cardinal runs a bottom-up planning pass. Each field can examine its ancestors and register preloads or planning notes. They call this “lookbehind” as a deliberate contrast to lookahead, because lookahead can’t make informed decisions about unresolved abstract types below it. Lookbehind works from concrete leaves upward.

Execution. Execution runs top-down. To make this concrete: imagine a query for 3 products, each with 2 variants. In depth-first execution, the engine resolves product 1 and both its variants completely, then product 2 and its variants, then product 3, resulting in 6 variant field executions across separate subtrees. In Cardinal’s breadth-first execution, the engine resolves all 3 products in one pass, then flat-maps all 6 variants into a single scope, and runs each variant field resolver exactly once across all 6 objects simultaneously. Result hashes are passed by reference and shaped in-place as execution descends. There is no separate response-building step.

This flat-set passing scales directly with list size. At 250 products with 250 variants, instead of 62,500 separate field executions per field, Cardinal holds a single execution scope containing all 62,500 variant objects and runs each variant field resolver exactly once across the entire set.
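The flat-mapping in the products-and-variants example above can be sketched in plain Ruby. The data and names here are illustrative, not Cardinal’s internals:

```ruby
# 3 products, 2 variants each: flat-map all variants into one scope,
# then run the field resolver once across the whole set.
products = [
  { id: 1, variants: [{ sku: "A1" }, { sku: "A2" }] },
  { id: 2, variants: [{ sku: "B1" }, { sku: "B2" }] },
  { id: 3, variants: [{ sku: "C1" }, { sku: "C2" }] },
]

resolver_calls = 0

# Breadth-first resolver: receives every variant in the scope at once.
sku_resolver = lambda do |variants|
  resolver_calls += 1
  variants.map { |v| v[:sku] }
end

all_variants = products.flat_map { |p| p[:variants] } # 6 objects, one scope
skus = sku_resolver.call(all_variants)

resolver_calls # 1 call instead of 6 in depth-first execution
```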

One notable implementation choice: Cardinal’s engine is driven by a queue rather than recursion. This eliminates the deep stack traces large GraphQL queries produce and directly reduces memory footprint. The core execution loop started as a single line of code.
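A queue-driven traversal of this kind can be sketched in a few lines, assuming a simplified scope shape (hypothetical; Cardinal’s real loop does more per step):

```ruby
# Minimal queue-driven traversal: child scopes are enqueued rather than
# recursed into, so the call stack stays flat regardless of query depth.
def run(queue)
  steps = 0
  while (scope = queue.shift)
    steps += 1
    queue.concat(scope.fetch(:children, [])) # enqueue children, no recursion
  end
  steps
end

# A root scope with two child scopes processes in 3 steps, flat stack.
run([{ children: [{ children: [] }, { children: [] }] }])
```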

On errors. Breadth execution changes error handling. Depth-first tracks error paths through subtrees precisely. Breadth has no subtree concept, so Cardinal runs to completion, inlines rescued errors into the response tree, then adds a depth traversal step at the end to locate error positions. This is a coarser model. Their reasoning: fewer than 1% of their API requests result in non-validation errors, so optimizing for the success path is the right tradeoff.

How Cardinal Compares to Grafast

Cardinal isn’t the only breadth-first GraphQL engine worth knowing about. Grafast, part of the Graphile project, takes a different approach to the same problem. Understanding the difference clarifies what each one actually is.

Grafast is a planning engine. Before executing a request, it walks the document breadth-first and calls plan resolvers for each field, assembling their requirements into a directed acyclic graph of steps. That plan is then optimized (deduplicating redundant work, fusing related steps, hoisting operations) and cached. Subsequent requests using the same GraphQL document skip planning entirely and jump straight to the optimized execution phase.

Cardinal is an execution engine. It builds a fresh execution tree per request and runs it breadth-first in real time. No upfront planning, no plan caching, no DAG optimization. Just a flat, queue-driven execution loop that processes fields once across the full object set.

The practical difference is in the migration floor. Grafast requires rewriting resolvers as “plan resolvers”, functions that describe what data is needed rather than fetching it directly. Traditional resolvers can be emulated for compatibility, but that emulation forfeits most of Grafast’s benefits. Cardinal’s interpreter bridge lets legacy resolvers run as-is inside the breadth-first engine, with teams migrating incrementally at their own pace.

Cardinal explicitly acknowledges Grafast. Its lookbehind planning pass is directly inspired by Grafast’s design. But the two systems are optimizing for different things. Grafast gives you a globally-optimized execution plan at the cost of a higher migration investment. Cardinal gives you breadth-first execution with a lower adoption barrier today.

Neither is strictly better. They represent different bets about where the complexity budget is best spent.

What Production Showed

The hypothesis held up. For CPU-bound benchmarks with 5,000 fields of flat JSON data, Cardinal ran roughly 15x faster and used 90% less memory than graphql-ruby. These are ceiling numbers, representing the maximum advantage at high repetition with no I/O involved. The gains are not uniform: at one list item, depth-first has a slight edge since there’s no breadth to amortize. At two items breadth starts pulling ahead, and by the time lists reach hundreds of items the advantage is significant.

In production testing with large product and variant payloads, Cardinal saved over 4 seconds at P50 for their largest test queries. Profiles confirmed the theory: comparable time on I/O and data staging, but dramatically reduced time on field execution and neighboring garbage collection.

The Migration Problem

Building Cardinal was the straightforward half of this project. Migrating a production monolith built on depth-first resolvers to a breadth-first engine is a different category of problem.

The interface change is fundamental. Every field resolver in their codebase was written to receive one object and return one result, across tens of thousands of implementations. You can’t swap that interface atomically.

Cardinal expects resolvers that receive a full set of objects and return a mapped set of results. Every existing resolver in their stack did the opposite: receive one object, return one result. That’s the mismatch the interpreter was built to bridge. When Cardinal encounters a legacy resolver, instead of passing it all objects at once, the interpreter loops through them individually, calling the legacy resolver once per object. This doesn’t make legacy resolvers faster; they still execute one at a time. But it lets the entire existing stack run on Cardinal without changing a single resolver on day one.
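The bridging idea can be sketched as a wrapper that adapts a one-object resolver to the set-in, set-out interface. Names are illustrative; this is not Cardinal’s actual bridge:

```ruby
# Wrap a one-object legacy resolver so it satisfies the breadth
# (set of objects in, mapped set of results out) interface.
def bridge_legacy(legacy_resolver)
  lambda do |objects|
    # Loop objects individually: no speedup, but no rewrite required.
    objects.map { |object| legacy_resolver.call(object) }
  end
end

legacy_title  = lambda { |product| product[:title] }
breadth_title = bridge_legacy(legacy_title)

breadth_title.call([{ title: "Shirt" }, { title: "Hat" }]) # => ["Shirt", "Hat"]
```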

The interpreter initially came with a memory tradeoff. The team addressed this, resulting in a 40% improvement in the interpreter’s memory efficiency. By rollout, the interpreter was slightly lighter and faster at list-heavy queries than the previous engine, without any resolver changes.

Migrating field-level tracers was a simpler win. In depth execution, tracers ran once per field per object. In breadth execution, they run once per field selection. This is dramatically cheaper, with field timings averaging a single breadth duration across resolved objects rather than capturing one per repetition.
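The tracer change can be illustrated with a small sketch: one timing per field selection, averaged across the resolved objects, instead of one recorded timing per object. Names are illustrative:

```ruby
# Breadth-style field tracer: times the single breadth resolver call
# and averages the duration across the objects it resolved.
def trace_breadth_field(objects)
  start   = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  results = yield(objects)
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  # One breadth duration, averaged, instead of one timing per repetition.
  [results, elapsed / objects.size]
end

doubled, avg_per_object = trace_breadth_field([1, 2, 3]) { |objs| objs.map { |o| o * 2 } }
```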

The ongoing work is the resolver migration itself. The team built a shadow verifier to confirm migrated breadth fields match their legacy counterparts, a benchmark suite for studying migrated query performance, and a library of AI-assisted skills to accelerate translation work. Every regression found so far has been a translation error. No cases yet where breadth execution is fundamentally worse for a correctly implemented resolver.

What’s Available for Ruby Engineers Today

For teams running graphql-ruby, GraphQL::Execution::Next is available now. It's experimental and not yet recommended for production, but tryable in development today.
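If the module follows graphql-ruby’s standard plugin convention, opting in might look something like this. This is an assumption on my part, not a confirmed API; check the project’s migration guide before relying on it:

```ruby
# Hypothetical: assumes GraphQL::Execution::Next is enabled via
# graphql-ruby's usual `use` plugin hook, and that Types::QueryType
# is your root query type. Verify against the migration guide.
class MySchema < GraphQL::Schema
  query Types::QueryType

  use GraphQL::Execution::Next # experimental breadth-first execution
end
```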

The new module introduces four resolver configurations that map directly to Cardinal’s model.

Legacy instance method resolvers are partially supported but will be deprecated. The migration guide is at graphql-ruby.org/execution/migration.

The gains for list-heavy queries mirror Shopify’s production numbers: 15x faster execution and 75% less memory are not unusual at high repetition. Flat queries with no lists will see little difference. The benefit scales with how much your responses repeat field patterns across objects.

An Open Question for the GraphQL Community

Shopify closes their post with a direct quote from the official GraphQL specification: conformance requirements can be fulfilled “in any way as long as the perceived result is equivalent.”

Breadth-first execution is spec-compliant. The spec never mandated depth-first traversal. The community converged on it. Depth-first is the practical choice for most GraphQL APIs today. The migration cost to breadth-first isn’t justified when list sizes are small and the gains are negligible. But the assumption that it’s the only execution model worth considering is harder to defend now.

The challenge for the broader community is practical. Most GraphQL tooling, documentation, and resolver patterns are built around the depth-first interface. A migration requires a resolver interface change that touches every field in a codebase. Their tooling (interpreter bridges, shadow verifiers, AI-assisted migration) represents significant investment in making that transition safe at scale.

Whether teams outside Shopify can make a similar transition without similar resources is an open question. What’s no longer open is whether the execution model is worth questioning.



