The Correlation Conundrum: Why Current Approaches Fall Short and What Comes Next
Summarizing Javier Antich's vision for autonomous event correlation from AutoCon3
"Correlation sometimes looks like a very mysterious term," began Javier Antich, "and people talk about the correlation engine like it's a magic box. But correlation, if you ask people from the data world, they'll just tell you—that's a join, a join of two tables."
This demystification opened a provocative presentation that challenged current approaches and proposed a path toward true network autonomy through what he called "zero-day event correlation."
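To make the "it's just a join" framing concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, device names, and messages are hypothetical and are not taken from the talk; the point is only that pairing events on a shared field is an ordinary self-join.

```python
import sqlite3

# Hypothetical event table; the schema and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, device TEXT, message TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "router-a", "link down"),
        (2, "router-a", "BGP session lost"),
        (3, "switch-b", "fan failure"),
    ],
)

# Self-join: pair every two distinct events that share a device.
pairs = conn.execute(
    """
    SELECT e1.id, e2.id
    FROM events AS e1
    JOIN events AS e2
      ON e1.device = e2.device AND e1.id < e2.id
    ORDER BY e1.id, e2.id
    """
).fetchall()

print(pairs)  # [(1, 2)]
```

The join condition is where all the difficulty lives: every technique discussed below is, in effect, a different answer to what should go in the ON clause.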
The Operational Noise Problem
Every network operations team faces the same challenge: overwhelming alert volumes that bury critical signals in operational noise. Event correlation promises to aggregate related events into meaningful incidents, distinguishing root causes from symptoms.
But as Antich demonstrated, most correlation approaches fall short of this promise, sometimes making the problem worse rather than better.
The Evolution of Correlation Techniques
Time-Based Correlation: The simplest approach groups events occurring within specific time windows. "If two events happen within a specific time window, they are considered correlated," Antich explained. "Which, of course, is not necessarily true. There are a lot of things happening at the same time in a network that have nothing to do with each other."
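A minimal sketch of time-window grouping shows why this fails. The Event class, the five-second window, and the sample events are assumptions made for illustration; an unrelated login failure lands in the same group simply because it happened at about the same time.

```python
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: float  # seconds since some reference point
    device: str
    message: str


def correlate_by_time(events, window_seconds):
    """Start a new group whenever the gap to the previous event exceeds the window."""
    groups = []
    for event in sorted(events, key=lambda e: e.timestamp):
        if groups and event.timestamp - groups[-1][-1].timestamp <= window_seconds:
            groups[-1].append(event)
        else:
            groups.append([event])
    return groups


# Hypothetical events for illustration.
events = [
    Event(0.0, "router-a", "link down"),
    Event(2.0, "router-a", "BGP session lost"),
    Event(3.0, "switch-z", "user login failed"),  # unrelated, but close in time
    Event(120.0, "router-c", "fan failure"),
]

groups = correlate_by_time(events, window_seconds=5.0)
print([[e.message for e in g] for g in groups])
# [['link down', 'BGP session lost', 'user login failed'], ['fan failure']]
```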
Rule-Based Correlation: The industry standard for years, this approach uses "if-this-then-that" logic to connect events that share certain fields or characteristics. While more sophisticated than time-based approaches, rule-based systems suffer from fundamental limitations:
Rules only address frequent or known scenarios
New, unseen events fall through the cracks
Maintenance becomes unwieldy (some organizations have hundreds of thousands of rules)
The long tail of edge cases remains unaddressed
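The limitations listed above can be illustrated with a small sketch. The single rule, the matching logic, and the log messages are all hypothetical; real rule engines are far richer, but the failure mode is the same: an event with no matching rule is simply left uncorrelated.

```python
import re

# Hypothetical rules: (cause pattern, symptom pattern, incident label).
RULES = [
    (re.compile(r"link down"), re.compile(r"BGP session lost"), "link-failure"),
]


def correlate_by_rules(events):
    """events is a list of (device, message). Returns (incidents, unmatched indexes)."""
    incidents = []
    matched = set()
    for cause_re, symptom_re, label in RULES:
        for i, (dev_i, msg_i) in enumerate(events):
            if not cause_re.search(msg_i):
                continue
            for j, (dev_j, msg_j) in enumerate(events):
                # "If this, then that": same device, cause plus symptom.
                if i != j and dev_i == dev_j and symptom_re.search(msg_j):
                    incidents.append((label, i, j))
                    matched.update({i, j})
    unmatched = [k for k in range(len(events)) if k not in matched]
    return incidents, unmatched


# Illustrative messages, not real device output.
events = [
    ("router-a", "link down on port 1"),
    ("router-a", "BGP session lost to peer 10.0.0.2"),
    ("router-a", "unexpected parity error in forwarding chip"),  # no rule exists
]

incidents, unmatched = correlate_by_rules(events)
print(incidents)  # [('link-failure', 0, 1)]
print(unmatched)  # [2]
```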
"There's always going to be that case where there's this new syslog message or new event that no one has seen before, for which you didn't have a rule," Antich noted.
Similarity-Based Correlation: Recent approaches use machine learning to represent events as vectors and score how alike they are with measures such as cosine similarity. Despite the ML branding, Antich was skeptical: "The fact that two events are similar really has nothing to do with whether they are correlated. It just says they are similar."
Networks are complex, distributed systems where related events often show no surface-level similarity. A hardware failure might generate completely different event types across various network locations.
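The gap between similarity and correlation can be seen in a toy example. This sketch substitutes simple bag-of-words vectors for the learned embeddings a real system would use, and the messages are invented; the assumption that the second pair shares a root cause is purely for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two messages."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[token] * vb[token] for token in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


# Nearly identical wording, but on two unrelated devices.
similar_unrelated = cosine_similarity(
    "interface port-1 down on router-a",
    "interface port-1 down on router-z",
)

# Assumed (for illustration) to stem from one hardware fault, yet no shared words.
related_dissimilar = cosine_similarity(
    "fan tray 2 failure",
    "bgp neighbor 10.0.0.2 unreachable",
)

print(round(similar_unrelated, 2), related_dissimilar)  # 0.8 0.0
```

A similarity threshold would merge the first pair and split the second, which is the opposite of what an operator needs.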
Graph-Based Correlation: The most promising current approach uses network topology to determine event relationships. Events affecting entities that are topologically close are treated as correlated, based on how those entities connect in the network graph.
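As a rough sketch of the idea, topological closeness can be approximated by hop distance in an adjacency list. The topology, entity names, and the two-hop threshold are assumptions for illustration, and the sketch presumes each event has already been mapped to an entity.

```python
from collections import deque

# Hypothetical topology: each entity maps to its direct neighbors.
TOPOLOGY = {
    "router-a": {"router-a:port-1", "switch-b"},
    "router-a:port-1": {"router-a"},
    "switch-b": {"router-a", "server-1"},
    "server-1": {"switch-b"},
    "router-z": set(),
}


def hop_distance(graph, start, goal):
    """Breadth-first search: shortest hop count, or None if unreachable."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def topologically_related(graph, entity_a, entity_b, max_hops=2):
    """Treat two entities as related if they are within max_hops of each other."""
    distance = hop_distance(graph, entity_a, entity_b)
    return distance is not None and distance <= max_hops


print(topologically_related(TOPOLOGY, "router-a:port-1", "switch-b"))  # True (2 hops)
print(topologically_related(TOPOLOGY, "router-a:port-1", "server-1"))  # False (3 hops)
print(topologically_related(TOPOLOGY, "router-a", "router-z"))         # False (unreachable)
```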
"It's actually a quite effective method to derive correlation between events," Antich acknowledged, though it still requires rules to map unstructured events to network graph entities.
The Quality Problem
A critical insight emerged from Antich's analysis: correlation quality determines whether systems help or hurt operational teams. Poor correlation doesn't just fail to save time—it creates negative value by forcing engineers to untangle misleading connections before addressing actual problems.
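One common way to put a number on correlation quality, offered here as an assumption rather than the metric Antich used, is pairwise precision and recall: compare which pairs of events the system grouped together against which pairs truly belong together. Every wrongly grouped pair is a connection an engineer has to untangle.

```python
from itertools import combinations


def same_cluster_pairs(clusters):
    """All unordered event pairs that were placed in the same cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}


def pairwise_quality(predicted, truth):
    """Return (precision, recall) over same-cluster event pairs."""
    pred, true = same_cluster_pairs(predicted), same_cluster_pairs(truth)
    true_positives = len(pred & true)
    precision = true_positives / len(pred) if pred else 1.0
    recall = true_positives / len(true) if true else 1.0
    return precision, recall


# Hypothetical ground truth and two candidate correlation outputs.
truth = [{"e1", "e2", "e3"}, {"e4"}]
good = [{"e1", "e2", "e3"}, {"e4"}]
noisy = [{"e1", "e2", "e4"}, {"e3"}]

print(pairwise_quality(good, truth))  # (1.0, 1.0)
precision, recall = pairwise_quality(noisy, truth)
print(round(precision, 2), round(recall, 2))  # 0.33 0.33
```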
"If the correlation quality is low, it's not that it doesn't save time—it will make your team waste time," he emphasized. "It would be better to have no correlation system whatsoever because the results are misleading."
From Automation to Autonomy
Antich positioned event correlation within a broader industry evolution from automation to autonomy. While automation relies heavily on human-created rules, workflows, and templates, autonomy requires systems that can "automatically observe, adapt, and learn" without rule dependencies.
"You can only create rules for things that are more frequent, for the head of the curve," he explained. "But what if you want to start addressing the remaining use cases—that part you cannot address realistically with rules? Then you need to start thinking about autonomy."
This isn't about replacing automation but extending it: "This is not about replacing anything. This is about getting to a point where we can address certain use cases without dependency on the existence of rules."
Zero-Day Event Correlation
Antich's proposed solution is "semantic event correlation," which can handle never-before-seen events without human-created rules. The approach leverages technologies that "didn't exist two years ago or three years ago" to semantically interpret events with high precision.
The architecture creates dynamic event graphs where truly connected events form subgraphs representing incidents. In his lab demonstration, which simulated a hardware failure, the system:
Reduced 22 separate events to a single incident
Achieved zero false positives and zero false negatives
Identified related events across different network locations automatically
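The talk did not detail the implementation, but the idea of connected events forming subgraphs that represent incidents can be sketched with a union-find structure: events are nodes, an edge means two events were judged connected, and each connected component becomes one incident. The event IDs and edges below are hypothetical, and how edges are derived (the semantic interpretation step) is outside this sketch.

```python
def find(parent, x):
    """Find the representative of x, shortening the path along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def incidents_from_edges(event_ids, edges):
    """Group events into incidents: one incident per connected component."""
    parent = {e: e for e in event_ids}
    for a, b in edges:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a
    groups = {}
    for e in event_ids:
        groups.setdefault(find(parent, e), set()).add(e)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical events and "connected" judgments.
event_ids = ["e1", "e2", "e3", "e4", "e5"]
edges = [("e1", "e2"), ("e2", "e3"), ("e4", "e5")]

incidents = incidents_from_edges(event_ids, edges)
print(incidents)  # [['e1', 'e2', 'e3'], ['e4', 'e5']]
```

Because the graph is dynamic, a structure like this lets new events join an existing incident as soon as one connecting edge appears.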
The LLM Integration Opportunity
High-quality correlation enables powerful analysis through language models. But Antich emphasized a crucial caveat: "If you push bad correlation outcome into an LLM—garbage in, garbage out—you will get trash because the LLM will get confused."
Only with precise event clustering can language models provide meaningful incident summaries, root cause analysis, and remediation guidance.
The Feedback Loop Challenge
When asked about handling false positives in autonomous systems, Antich acknowledged the fundamental challenge: these systems make probabilistic decisions rather than deterministic rule-based ones.
"The algorithm will not know in real-time whether the correlation was successful or not," he admitted. "The best way to deal with this is with human feedback through the process."
Such feedback produces "after-the-fact rules" that fill gaps in the model's knowledge, yielding a hybrid approach that combines autonomous operation with human guidance.
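One possible shape for such a hybrid, sketched under assumptions (the class name, the 0.5 threshold, and the idea of keying feedback on pairs of event signatures are all hypothetical): human verdicts are stored as overrides and consulted before the model's probabilistic score.

```python
from dataclasses import dataclass, field


@dataclass
class FeedbackCorrelator:
    threshold: float = 0.5
    # Human verdicts keyed by an unordered pair of event signatures.
    overrides: dict = field(default_factory=dict)

    def record_feedback(self, sig_a, sig_b, correlated: bool):
        """Store an operator's after-the-fact verdict for this pair."""
        self.overrides[frozenset((sig_a, sig_b))] = correlated

    def decide(self, sig_a, sig_b, model_score: float) -> bool:
        """Human feedback wins; otherwise fall back to the model score."""
        key = frozenset((sig_a, sig_b))
        if key in self.overrides:
            return self.overrides[key]
        return model_score >= self.threshold


correlator = FeedbackCorrelator()
before = correlator.decide("fan failure", "bgp down", model_score=0.7)
correlator.record_feedback("fan failure", "bgp down", correlated=False)
after = correlator.decide("bgp down", "fan failure", model_score=0.7)
print(before, after)  # True False
```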
Implementation Considerations
Graph Construction: The quality of semantic correlation depends heavily on network graph completeness and accuracy. Sources might include network controllers, configuration parsing, or discovery mechanisms, with more entities and precise relationships improving correlation accuracy.
Dependency Mapping: Beyond simple relationships, mapping actual dependencies creates causal graphs enabling true root cause analysis rather than just correlation hypotheses.
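To illustrate how a dependency map moves from correlation toward root cause, here is a minimal sketch. The dependency data and entity names are hypothetical, and the heuristic (an affected entity is a root cause candidate if nothing it depends on is also affected) is one simple assumption, not the method from the talk.

```python
# Hypothetical dependency map: entity -> entities it depends on.
DEPENDS_ON = {
    "bgp-session-1": {"interface-1"},
    "ospf-adj-7": {"interface-1"},
    "interface-1": {"linecard-2"},
    "linecard-2": set(),
}


def upstream(entity, depends_on):
    """All direct and transitive dependencies of an entity."""
    seen, stack = set(), list(depends_on.get(entity, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, ()))
    return seen


def root_cause_candidates(affected, depends_on):
    """Affected entities with no affected dependency upstream of them."""
    affected = set(affected)
    return sorted(e for e in affected if not (upstream(e, depends_on) & affected))


affected = ["bgp-session-1", "ospf-adj-7", "interface-1", "linecard-2"]
candidates = root_cause_candidates(affected, DEPENDS_ON)
print(candidates)  # ['linecard-2']
```

With only relationships and no dependency direction, all four entities would look equally suspect; the direction of the edges is what singles out the line card.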
Technology Leverage: The approach relies on recent advances in semantic interpretation that can map events to graph entities without explicit parsing rules.
The Path Forward
Antich's vision extends beyond event correlation to broader network autonomy applications. The principles—reducing rule dependencies, handling zero-day scenarios, maintaining high quality through semantic understanding—apply to many automation challenges.
His chart of correlation quality illustrated the opportunity: while rule-based systems provide modest benefits for known scenarios, semantic approaches promise far larger gains by addressing the entire use case spectrum, including the long tail that rules never reach.
Why This Matters
Event correlation represents a microcosm of broader automation evolution challenges. As networks become more complex and dynamic, traditional rule-based approaches increasingly struggle with:
The explosion of possible failure modes and event combinations
The maintenance burden of extensive rule sets
The inability to handle novel scenarios gracefully
The quality degradation that makes tools counterproductive
Antich's semantic correlation approach suggests a path toward systems that can reason about network events without explicit programming for every scenario. Whether this specific approach succeeds or not, the principles—autonomous interpretation, graph-based reasoning, semantic understanding—likely represent necessary capabilities for future network operations.
As he concluded: "In the era of the autonomous network, we need to start embracing this type of technique that allows us this transition from automation to autonomy."
The question isn't whether we need better event correlation—it's whether we're ready for the fundamental shift from rule-based automation to semantic autonomy that such improvements require.
Watch the full presentation: Network Event Correlation: Stories to Keep You Awake