We were spending more on infrastructure than the year before, running dozens of instances of our web API. A reasonable response: consolidate to larger instances. Fewer servers, same capacity, lower cost.
We had no idea we’d just removed the safety net hiding an 8-year-old bug.
The Symptoms
The week after our infrastructure changes, things started breaking. Not constantly, but consistently:
- Intermittent slowness that users noticed immediately
- 500 errors that would mysteriously resolve on refresh
- Worker processes dying and auto-restarting across the cluster
Our monitoring caught it first. When New Relic alerts fire, incident calls spin up, and I end up on the call. The frustrating part? We couldn’t find a pattern. No correlation to recent deployments. No specific time of day. We tried proactive morning scaling—all that did was shift the problem window.
The randomness was maddening.
Getting Onto the Crime Scene
The first real clue came from IIS Manager. We noticed cryptic messages about worker processes being terminated unexpectedly. But by the time we’d see the alert, the instance was already unhealthy and being replaced.
We needed to catch an instance before it died.
After watching the cluster like hawks, we finally got onto a server while it was struggling but still alive. Our distinguished engineer ran Debug Diagnostic Tool to capture crash dumps.
The hypothesis: something was throwing a fatal exception that killed the entire worker process.
We were right—but finding what would take another week.
The Red Herrings
Debug Diagnostic revealed a graveyard of technical debt. Each finding was a real problem, but none was THE problem:
1. Entity Framework Queries From Hell
The ORM was generating queries that would make a DBA weep. Multi-second execution times, missing indexes, N+1 patterns everywhere.
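To make the N+1 shape concrete, here's a minimal sketch (the entity and property names are illustrative, not our actual schema) along with the eager-loading fix EF6 supports:

```csharp
// N+1 sketch: one query for the orders, then one lazy-load query per order
var orders = context.Orders.Where(o => o.Status == "Open").ToList();
foreach (var order in orders)
{
    Console.WriteLine(order.Customer.Name); // each access fires another SELECT
}

// Eager loading collapses it into a single joined query (EF6's Include)
var ordersWithCustomers = context.Orders
    .Include(o => o.Customer)               // using System.Data.Entity;
    .Where(o => o.Status == "Open")
    .ToList();
```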
2. Async/Await in ASPX Pages
We found async void event handlers and improperly awaited tasks in legacy Web Forms pages. This is a known anti-pattern that causes thread pool starvation and deadlocks.
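Here's a sketch of the difference (handler and helper names are illustrative): the async void form gives ASP.NET nothing to await, while RegisterAsyncTask ties the work into the page lifecycle.

```csharp
// Anti-pattern: async void. ASP.NET can't await the handler, and an exception
// thrown after the first await bypasses the page's normal error handling.
protected async void Page_Load(object sender, EventArgs e)
{
    var items = await LoadItemsAsync();   // LoadItemsAsync is illustrative
    BindGrid(items);                      // so is BindGrid
}

// Safer Web Forms pattern: register the task so the page lifecycle awaits it.
// Requires Async="true" in the @Page directive.
protected void Page_Load(object sender, EventArgs e)
{
    RegisterAsyncTask(new PageAsyncTask(async () =>
    {
        var items = await LoadItemsAsync();
        BindGrid(items);
    }));
}
```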
3. The AsyncHelper Anti-Pattern
Someone had written a “helpful” utility class that let you call async code synchronously:
```csharp
// DON'T DO THIS
public static class AsyncHelper
{
    public static T RunSync<T>(Func<Task<T>> func)
    {
        return Task.Run(func).GetAwaiter().GetResult();
    }
}
```
Blocking on Task.Run(...).GetResult() ties up one thread pool thread waiting for another to finish the work; under load that starves the thread pool and can deadlock. It was scattered throughout the codebase.
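The remedy was mostly mechanical: make the call path async instead of blocking on it. A minimal sketch, with a hypothetical repository call standing in for the real work:

```csharp
// Before: blocks one thread pool thread while another runs the async work
var user = AsyncHelper.RunSync(() => repository.GetUserAsync(id));

// After: await the task and let async flow up through the call chain
var user = await repository.GetUserAsync(id);
```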
4. AutoMapper Instantiation Per Request
Instead of configuring AutoMapper once at startup, the code was creating new MapperConfiguration instances on every request. This is expensive and unnecessary. (Related GitHub issue discussing similar problems.)
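The fix is to build the configuration once and reuse the mapper for the lifetime of the process. A minimal sketch, using hypothetical UserRecord/UserDto types:

```csharp
// Per request (what we kept finding): rebuilds the entire mapping configuration every time
var config = new MapperConfiguration(cfg => cfg.CreateMap<UserRecord, UserDto>());
var dto = config.CreateMapper().Map<UserDto>(record);

// Once at startup: the configuration and mapper are created a single time and shared
public static class Mapping
{
    public static readonly IMapper Mapper =
        new MapperConfiguration(cfg => cfg.CreateMap<UserRecord, UserDto>())
            .CreateMapper();
}
```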
We spent a week fixing these issues. Performance improved noticeably. But the crashes continued.
Something valuable happened though: we removed the noise.
Finding the Needle
With the noise cleared, one pattern emerged in the call stacks. Deep in the traces, I noticed references to an “Expression Builder” utility. The code had been in the system since at least 2016.
Here’s what BuildContainsExpression was doing:
```csharp
private static Expression<Func<TElement, bool>> BuildContainsExpression<TElement, TValue>(
    Expression<Func<TElement, TValue>> valueSelector,
    IEnumerable<TValue> values)
{
    if (valueSelector == null)
        throw new ArgumentNullException(nameof(valueSelector));
    if (values == null)
        throw new ArgumentNullException(nameof(values));

    var p = valueSelector.Parameters.Single();

    if (!values.Any())
        return e => false;

    // Build a chain of OR expressions
    var equals = values.Select(value =>
        (Expression)Expression.Equal(
            valueSelector.Body,
            Expression.Constant(value, typeof(TValue))));

    // HERE'S THE PROBLEM: Aggregate builds a deeply nested tree
    var body = equals.Aggregate((accumulate, equal) =>
        Expression.Or(accumulate, equal));

    return Expression.Lambda<Func<TElement, bool>>(body, p);
}
```
The intent was reasonable: build a LINQ expression that filters by a list of IDs. The implementation was a time bomb.
The Problem
Aggregate with Expression.Or builds a recursively nested expression tree. For 10 IDs, you get a small tree. For 2,000 IDs, you get a structure 2,000 levels deep.
The CLR stack has limits. When the expression tree gets deep enough and Entity Framework tries to compile it, you hit a StackOverflowException. And unlike most exceptions, you can’t catch a stack overflow—it terminates the entire process.
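You don't need Entity Framework to reproduce the failure mode. A minimal sketch (plain expression compilation; the iteration count is illustrative, since the exact threshold depends on available stack) shows the same depth problem:

```csharp
using System;
using System.Linq.Expressions;

// Build a left-leaning chain of Expression.Or nodes, the same shape Aggregate
// produces. Constructing the tree is cheap and iterative; compiling it walks
// the tree recursively, and a deep enough chain blows the stack.
var x = Expression.Parameter(typeof(int), "x");
Expression body = Expression.Equal(x, Expression.Constant(0));

for (var i = 1; i < 100_000; i++)   // illustrative count; threshold varies with stack size
{
    body = Expression.Or(body, Expression.Equal(x, Expression.Constant(i)));
}

var lambda = Expression.Lambda<Func<int, bool>>(body, x);
var compiled = lambda.Compile();    // deep enough: StackOverflowException, process dies
```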
The Generated SQL
The difference in output tells the story:
```sql
-- BuildContainsExpression with 3 IDs generates:
WHERE (CategoryId = @p0) OR (CategoryId = @p1) OR (CategoryId = @p2)

-- Standard .Contains() with 3 IDs generates:
WHERE CategoryId IN (@p0, @p1, @p2)
```
The OR chain doesn’t just risk stack overflow—it’s slower to compile, generates longer SQL strings, and can confuse the query optimizer.
Benchmarking the Difference
I wrote a LINQPad script to measure EF6 query compilation time:
```csharp
void Main()
{
    var testSizes = new[] { 10, 100, 500, 1000, 2000, 3610 };
    var results = new List<dynamic>();

    foreach (var size in testSizes)
    {
        var ids = Enumerable.Range(1, size).ToList();

        // OLD APPROACH: BuildContainsExpression (OR chain)
        // (the helper shown earlier, pasted into the script)
        var swOld = Stopwatch.StartNew();
        var oldQuery = Records
            .Where(BuildContainsExpression<Record, int>(
                e => e.CategoryId, ids));
        var oldSql = this.GetCommand(oldQuery).CommandText;
        swOld.Stop();

        // NEW APPROACH: .Contains() (IN clause)
        var swNew = Stopwatch.StartNew();
        var newQuery = Records
            .Where(e => ids.Contains(e.CategoryId));
        var newSql = this.GetCommand(newQuery).CommandText;
        swNew.Stop();

        results.Add(new
        {
            IdCount = size,
            OldCompileMs = swOld.ElapsedMilliseconds,
            NewCompileMs = swNew.ElapsedMilliseconds,
            Speedup = $"{(double)swOld.ElapsedMilliseconds / Math.Max(1, swNew.ElapsedMilliseconds):F1}x",
            OldSqlLength = oldSql.Length,
            NewSqlLength = newSql.Length
        });
    }

    results.Dump("EF Query Compilation Time Comparison");
}
```
The results were striking:
| ID Count | Old Compile (ms) | New Compile (ms) | Speedup | Old SQL Length | New SQL Length |
|---|---|---|---|---|---|
| 10 | 7 | 0 | 7x | 1,591 | 1,250 |
| 100 | 36 | 0 | 36x | 5,551 | 1,790 |
| 500 | 167 | 0 | 167x | 23,551 | 4,590 |
| 1,000 | 339 | 1 | 339x | 46,051 | 8,090 |
| 2,000 | 787 | 1 | 787x | 92,051 | 16,090 |
| 3,610 | 1,336 | 2 | 668x | 168,795 | 31,810 |
At 2,000 IDs, the old approach took 787ms just to compile the expression—before any SQL even executed. The new approach? 1 millisecond.
The Complexity Analysis
Looking at the benchmark data, you can see the problem mathematically:
BuildContainsExpression (OR chain):
- Time complexity: O(n). Each ID adds one more Expression.Or call during the Aggregate operation.
- Space complexity: O(n) stack space. The Aggregate builds a deeply nested binary tree. For 2,000 IDs, you get an expression tree 2,000 levels deep. When Entity Framework compiles this tree via recursive traversal, each level consumes stack space.
This is the killer. The default CLR thread stack is 1MB, and threads inside an IIS worker process typically get even less. A tree of depth n needs on the order of n stack frames to compile, so a large enough ID list will eventually exceed that limit.
.Contains() (IN clause):
- Time complexity: O(1). Creates a single MethodCallExpression regardless of collection size.
- Space complexity: O(1) stack space. Flat expression structure, no recursive depth.
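A quick way to see that flat shape (a sketch; Record is the same stand-in entity type used in the benchmark):

```csharp
var ids = Enumerable.Range(1, 2000).ToList();
Expression<Func<Record, bool>> predicate = e => ids.Contains(e.CategoryId);

Console.WriteLine(predicate.Body.NodeType); // Call: a single MethodCallExpression
Console.WriteLine(predicate.Body);          // something like value(...).ids.Contains(e.CategoryId)
```

The 2,000 values live inside the captured list, not in the expression tree, so the tree stays the same depth no matter how many IDs you pass.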
The benchmark confirms the linear relationship for the OR chain. Doubling the ID count roughly doubles the compilation time:
- 1,000 → 2,000 IDs: 339ms → 787ms (~2.3x)
- 2,000 → 3,610 IDs: 787ms → 1,336ms (~1.7x for 1.8x IDs)
The stack overflow isn’t about time—it’s about depth. Even if compilation were instant, a sufficiently deep expression tree would still overflow the stack during traversal.
The Aha Moment
While benchmarking, I pushed the ID count higher to see when things would break. Around 2,000 IDs, I started hitting stack overflow exceptions locally.
That exact code path existed in production. Customers with large integrations—linking thousands of records in “catch-all” relationships—could trigger this condition.
The bug had existed for 8 years. Why did it only become critical now?
Why Now?
Two factors collided:
1. Data Growth
Over the years, integrations had created increasingly large datasets. Customers could create relationships linking thousands of records to single users. We enforced no limits on these collections.
2. Infrastructure Change
Our consolidation to larger instances meant fewer servers with fewer worker processes. Previously, when a process died, the impact was distributed across many instances. Now, each crash took out a larger percentage of our capacity.
Horizontal scaling wasn’t solving the problem—it was hiding it.
The Fix
The fix was almost anticlimactic:
```csharp
// Before: BuildContainsExpression (recursive OR chain)
.Where(BuildContainsExpression<UserRecord, int>(
    e => e.CategoryId, ids))

// After: Standard LINQ (generates IN clause)
.Where(e => ids.Contains(e.CategoryId))
```
That’s it. Replace the “clever” helper with the vanilla LINQ approach that Entity Framework already handles correctly.
I deleted BuildContainsExpression entirely from the codebase. It had no reason to exist.
The Results
After deploying the fix and the performance improvements from our week of cleanup:
- 5x throughput improvement
- 62% reduction in peak infrastructure footprint
- Worker process crashes: Zero
The system that had been limping along, randomly failing, was now handling 5x the load on roughly a third of the hardware.
Lessons Learned
“Clever” Code is a Time Bomb
Someone wrote BuildContainsExpression because they thought they were being smart. Maybe they didn’t trust EF’s .Contains() implementation. Maybe they copied it from a Stack Overflow answer circa 2012. Whatever the reason, it was more complex than necessary and didn’t scale.
The best code is often the most boring code.
Horizontal Scaling Masks Problems
For 8 years, this bug existed. It probably caused crashes the entire time. But with enough instances, the impact was absorbed. Auto-scaling healed the wounds before anyone noticed.
When we optimized for cost by consolidating instances, we removed the safety net that was hiding our technical debt.
Clean Up the Noise First
We spent a week fixing “unrelated” issues before finding the real problem. That wasn’t wasted time. By removing async anti-patterns, AutoMapper inefficiencies, and query problems, we made the actual culprit visible.
Sometimes you have to clean up the noise before you can hear the signal.
The Business Case for Tech Debt
This isn't the first time technical debt has caused an incident like this, and each one directly affected how customers perceive the product. Technical debt isn't just an engineering concern; it's a business risk.
Many of the underlying issues exist in legacy ASPX pages that we’ve been asking to rewrite in modern React for years. We haven’t been given the time. Perhaps now we will be.
What I Still Don’t Know
The threshold for the stack overflow wasn’t perfectly consistent. In testing, 2,000 IDs would reliably crash. In production, it seemed to depend on server load—sometimes failing at lower counts. I suspect the existing stack depth from the request pipeline plays a role, but I haven’t fully characterized it.
I also don’t have a great answer for how to proactively monitor for this pattern. Expression tree depth isn’t something standard APM tools track. If you’ve solved this problem, I’d love to hear about it.
Closing Thoughts
Eight years. That’s how long this code waited to break. Not because it was good code that aged poorly, but because our infrastructure was robust enough to absorb its failures.
Sometimes the most dangerous bugs aren’t the ones that crash immediately. They’re the ones that wait.
If you’re dealing with similar legacy .NET challenges or want to discuss production debugging war stories, feel free to reach out. I’m always happy to compare notes.