The Art of Network Surgery: Migrating 7,000 Services Without Breaking Anything

Summarizing Luke Gollan's EVPN migration journey from AutoCon3

"We obviously want to track the rollout," Luke Gollan explained as he described migrating Megaport's global network to EVPN. "We're not ready to go to the pub and celebrate yet."

His cautious optimism was well-founded. When you're managing 3,000 devices across 930 data centers in 26 countries—all serving customers who depend on your network 24/7—celebration comes only after successful execution at massive scale.

The Scale of Modern Network Automation

Megaport operates one of the world's largest software-defined networks, and their approach to automation reflects this reality. Their network is 100% automated, driven by APIs, with essentially no need for engineers to log into devices except for edge cases requiring specific output.

This isn't theoretical automation—it's operational automation that enables five network operations engineers to manage a global network. Luke's team provides all tooling for support teams: turning up MPLS links, draining traffic, traffic engineering, and service diagnostics.

The core product is VXC (Virtual Cross-Connect), which allows customers to connect data centers, virtual functions, and cloud providers. Under the hood, it's a point-to-point Layer 2 VPN, but the business decision to move this service to EVPN created an interesting automation challenge.

The Foundation: Source of Truth Architecture

Megaport's automation centers on a relational database that Luke calls their "source of truth"—though he qualifies it as "probably 99.9% accurate." Everything lives in this database: BGP connections, MPLS links, device interfaces, cross-connects, patching, and services.

Objects in the database map directly to software classes with setter/getter methods tied to database fields. Each class includes functional methods like full_config() for generating configurations and deploy_config() that knows whether to use API calls or NETCONF.
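To make the object-per-row idea concrete, here is a minimal sketch. Only the method names full_config() and deploy_config() come from the talk; the MplsLink class, its fields, the "transport" attribute, and the rendered config syntax are all hypothetical stand-ins, and a dataclass is used in place of hand-written setters and getters.

```python
from dataclasses import dataclass


@dataclass
class MplsLink:
    """Hypothetical object backed by one row of a hypothetical mpls_link table."""

    device: str
    interface: str
    peer_device: str
    transport: str = "netconf"  # assumed field: "netconf" or "api"

    def full_config(self) -> str:
        """Render this object's complete configuration from its fields.

        The syntax below is illustrative and not tied to any vendor.
        """
        return (
            f"interface {self.interface}\n"
            f" description link-to-{self.peer_device}\n"
            f" mpls enable\n"
        )

    def deploy_config(self) -> str:
        """Choose the delivery mechanism; a real version would push the config."""
        if self.transport == "netconf":
            return f"NETCONF edit-config to {self.device}"
        return f"API call to {self.device}"


link = MplsLink(device="core1", interface="et-0/0/1", peer_device="core2")
print(link.full_config())
print(link.deploy_config())
```

The point of the pattern is that callers never decide how a device is configured; they ask the object, and the object knows.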

The database is wrapped in a heavily validated API that serves as the front door for all data entry and retrieval. The validation is extensive—ensuring even distribution of diverse hardware (devices are colored "red" or "blue" for customer diversity requirements), managing VLANs to prevent overlap, and validating service capabilities across locations.
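Two of the checks mentioned above could look roughly like the sketch below. The function names, arguments, and error messages are assumptions made for illustration; the talk did not describe how the validation is actually implemented.

```python
def validate_vlan(vlans_in_use: set[int], requested: int) -> None:
    """Reject a VLAN that is out of range or already used on the port."""
    if not 1 <= requested <= 4094:
        raise ValueError(f"VLAN {requested} is outside 1-4094")
    if requested in vlans_in_use:
        raise ValueError(f"VLAN {requested} overlaps an existing service")


def validate_diversity(primary_zone: str, secondary_zone: str) -> None:
    """Require a diverse pair of services to land on different colors."""
    for zone in (primary_zone, secondary_zone):
        if zone not in ("red", "blue"):
            raise ValueError(f"unknown diversity zone: {zone}")
    if primary_zone == secondary_zone:
        raise ValueError("diverse services must not share a diversity zone")


validate_vlan({100, 200}, 300)      # passes silently
validate_diversity("red", "blue")   # passes silently
```

Raising at the API boundary keeps bad data out of the database, which matters when every downstream tool treats that database as the source of truth.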

The Orchestration Layer

Multiple applications run under their orchestration framework, handling different aspects of network management. Two "heavy hitters" handle configuration changes:

Deploy Agent: Dedicated to customer service configuration changes—processing new services, updates, and terminations based on database status fields.

Full Config Deployment: Continuously rebuilds the entire network from scratch, taking about two days to complete. This application ensures network consistency by detecting and correcting configuration drift.

The full config approach uses YANG models and NETCONF with XML containers. Each container (SNMP, BGP, IGP, MPLS) can be created, deleted, replaced, or merged using operation attributes. They use the "replace" operation extensively—essentially deleting everything in a container and applying fresh configuration in one atomic operation.
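As a rough illustration of the "replace" operation, the sketch below builds a NETCONF config payload with the standard library. The NETCONF base namespace and the operation attribute are defined by RFC 6241; the container name, the "urn:example:snmp" namespace, and the leaf are placeholders rather than a real YANG model.

```python
import xml.etree.ElementTree as ET

NC_NS = "urn:ietf:params:xml:ns:netconf:base:1.0"  # NETCONF base namespace
ET.register_namespace("nc", NC_NS)


def replace_container(name: str, namespace: str, leaves: dict[str, str]) -> str:
    """Build a <config> payload that replaces one container wholesale."""
    config = ET.Element(f"{{{NC_NS}}}config")
    container = ET.SubElement(config, f"{{{namespace}}}{name}")
    # operation="replace" tells the device to discard whatever is in this
    # container and install exactly the content supplied here.
    container.set(f"{{{NC_NS}}}operation", "replace")
    for leaf, value in leaves.items():
        ET.SubElement(container, f"{{{namespace}}}{leaf}").text = value
    return ET.tostring(config, encoding="unicode")


# Hypothetical container and namespace, for illustration only.
print(replace_container("snmp", "urn:example:snmp", {"location": "SYD1"}))
```

Because the device handles the delete and the re-create inside one edit, stale lines that drifted into the container disappear without the tool having to know they were there.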

Smart Timing and Risk Management

Luke's team implemented "device maintenance windows" to minimize risk during full config deployments:

  • Changes only between midnight and 6 AM local time

  • No deployments on Sundays

  • Only one diversity zone per night (reds one night, blues the next)

This prevents scenarios like crashing two core devices simultaneously in a location—the kind of failure that could create serious problems.
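The three rules can be expressed as a single gate that the deployment application checks before touching a device. This is a sketch under assumptions: the function names are invented, the timestamp is assumed to be already converted to the device's local time, and alternating zones by calendar-day parity is just one simple way to get "reds one night, blues the next".

```python
from datetime import datetime


def zone_for_night(local_now: datetime) -> str:
    """Alternate zones nightly using calendar-day parity (an assumption)."""
    return "red" if local_now.toordinal() % 2 == 0 else "blue"


def in_maintenance_window(local_now: datetime, device_zone: str) -> bool:
    """Return True only if all three rules allow a change right now."""
    if local_now.weekday() == 6:        # Monday is 0, so Sunday is 6
        return False
    if not 0 <= local_now.hour < 6:     # midnight to 6 AM local time
        return False
    return device_zone == zone_for_night(local_now)


monday_early = datetime(2025, 6, 2, 1, 30)
tonight = zone_for_night(monday_early)
print(tonight, in_maintenance_window(monday_early, tonight))
```

A device that fails the gate is simply skipped and picked up on a later pass, which is one reason a full rebuild takes days rather than hours.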

The EVPN Migration Challenge

When the business decided to migrate to EVPN, three key questions emerged:

  1. What existing automation code could they leverage?

  2. Could they achieve 100% automated migrations?

  3. Could they avoid service interruptions?

The rollout split into two phases:

  • Phase 1: Turn up IBGP mesh (7,000 IBGP sessions) and enable EVPN signaling

  • Phase 2: Support EVPN service-specific configurations (route policies, BGP communities)

Luke's tip: "It's probably a good idea to let your NOC know if you're turning up 7,000 IBGP sessions over two days. They will be grateful."

Tracking and Validation

Progress tracking proved crucial. Their full config deployment application logs every change, showing which devices received configuration, when changes were committed, and what specific configurations were applied.

The logging works by taking snapshots: collecting candidate configuration before changes (pre-change config), making modifications, collecting candidate configuration again (post-change config), diffing the two, then logging the output before commit.

Critically, they built in automatic rollback capability. If the application sees unexpected changes in the diff—like a port being shut down when only IBGP changes were expected—it aborts the commit, rolls back, and moves to the next device.
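The diff-and-abort guard might be sketched as follows. The keyword allow-list is a deliberately crude assumption standing in for whatever matching the real application uses, and the configuration snippets use made-up syntax.

```python
import difflib


def changed_lines(pre: str, post: str) -> list[str]:
    """Return only the added and removed lines between two config snapshots."""
    diff = difflib.ndiff(pre.splitlines(), post.splitlines())
    return [line for line in diff if line.startswith(("+ ", "- "))]


def safe_to_commit(pre: str, post: str, allowed_keywords: tuple[str, ...]) -> bool:
    """Allow the commit only if every changed line matches an expected keyword."""
    return all(
        any(keyword in line for keyword in allowed_keywords)
        for line in changed_lines(pre, post)
    )


pre = "router bgp 65000\ninterface et-0/0/1\n no shutdown"
expected = (
    "router bgp 65000\n neighbor 10.0.0.2 remote-as 65000\n"
    "interface et-0/0/1\n no shutdown"
)
surprise = expected.replace(" no shutdown", " shutdown")

print(safe_to_commit(pre, expected, ("neighbor",)))   # only IBGP lines changed
print(safe_to_commit(pre, surprise, ("neighbor",)))   # a port shutdown slipped in
```

When safe_to_commit returns False, the caller would discard the candidate configuration and move on to the next device, matching the abort-and-continue behavior described above.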

From Device-Based to Service-Based Migration

Initially, they planned to use "seamless integration" for device-by-device migration: a make-before-break approach that spins up EVPN services alongside the existing L2VPN, then prefers the new EVPN tunnel.

Lab testing revealed this wasn't truly seamless, with packet loss during transitions. So they shifted to service-based migration with specific criteria:

  • Simple process (likely human-triggered)

  • Ability to roll forward to EVPN or back to L2VPN

  • Customer self-service capability

Customer Empowerment

Rather than scheduling batch migrations through support teams, they added migration capability to their customer portal. Customers can trigger their own VXC migrations during convenient maintenance windows.

The implementation treats migration like a normal VXC update (speed changes, VLAN changes) by adding an EVPN boolean field to the database VPN table. When the API sees this field change from false to true, it triggers the deploy agent, which generates the EVPN configuration and deploys it to the network.

The process: customer clicks button → API validates → database field flips → deploy agent sees update → EVPN configuration deploys → brief downtime as A and B sides get configured.
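The flow above can be sketched with an in-memory record standing in for the database row. Everything here is hypothetical: the Vxc class, the "pending_update" status value, the capability check, and the function names are illustrations of the pattern, not Megaport's schema or code.

```python
from dataclasses import dataclass


@dataclass
class Vxc:
    """Hypothetical stand-in for one row of the VPN table."""

    service_id: int
    a_end: str
    b_end: str
    evpn: bool = False
    status: str = "deployed"


def request_migration(vxc: Vxc, evpn_capable: set[str]) -> None:
    """API side: validate, flip the EVPN field, and flag the row for deployment."""
    if vxc.evpn:
        raise ValueError("service is already on EVPN")
    for device in (vxc.a_end, vxc.b_end):
        if device not in evpn_capable:
            raise ValueError(f"{device} does not support EVPN yet")
    vxc.evpn = True
    vxc.status = "pending_update"


def deploy_agent_pass(services: list[Vxc]) -> list[tuple[str, str, int]]:
    """Agent side: configure both ends of every pending service."""
    actions = []
    for vxc in services:
        if vxc.status != "pending_update":
            continue
        kind = "evpn" if vxc.evpn else "l2vpn"
        actions.append((vxc.a_end, kind, vxc.service_id))
        actions.append((vxc.b_end, kind, vxc.service_id))
        vxc.status = "deployed"
    return actions


vxc = Vxc(service_id=42, a_end="syd1-pe1", b_end="mel1-pe1")
request_migration(vxc, evpn_capable={"syd1-pe1", "mel1-pe1"})
print(deploy_agent_pass([vxc]))
```

Because the agent derives the service type from the boolean, rolling back to L2VPN in this sketch is the same path in reverse: set the field to false and flag the row again.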

Hard-Earned Lessons

Communication is Critical: "People get frightened when you tell them you're turning up 7,000 IBGP sessions over two days. If you explain how you're doing it and the protection mechanisms, people are way more receptive."

Work with Operations Teams: "They're the eyes and ears of the network... always liaising with customers about concerns."

Rigorous Testing: "Worst bugs are usually the result of varying factors—version of code, amount of configuration. Test as close as possible to production, particularly heavy boxes."

Clear Rollback Plans: "Break it up into logical steps, make sure you can roll back each step. If something goes wrong, you'll have a really bad time determining what triggered the bug."

The Automation Reality

What makes Luke's presentation valuable is its honesty about operational automation at scale. This isn't automation for efficiency gains—it's automation as the only viable way to manage a network of this size and complexity.

Their approach demonstrates mature thinking about automation architecture: strong data foundations, careful validation, comprehensive testing, risk-aware deployment strategies, and customer empowerment through self-service capabilities.

Most importantly, they've built automation that operators trust to make major network changes autonomously—a level of confidence that comes only from rigorous engineering and extensive validation.

As Luke concluded, they're not quite ready to celebrate yet. There's always more to improve: state awareness, better failure detection, enhanced rollback capabilities. But they've proven that large-scale network transformations can be automated safely when you build the right foundations and respect the complexity of the challenge.

Sometimes the best automation stories aren't about the flashy features—they're about the methodical engineering that makes critical infrastructure changes feel routine.


Chris Grundemann

Executive advisor. Specializing in network infrastructure strategy and how to leverage your network to the greatest possible business advantage through technological and cultural transformation.

https://www.khadgaconsulting.com/