The Network Automation Epic: Terraform's Journey from Cloud to Campus

Aug 1

Summarizing Eduardo Pozo's infrastructure-as-code adventure from AutoCon3

"Some want to deliver a ring to a mountain, others just want to destroy a death star," began Eduardo Pozo with characteristic humor. "But I think we have a harder journey ahead—we want to automate our networks."

His epic tale of implementing Terraform for network automation across healthcare providers and Norwegian oil rigs provided both cautionary wisdom and hard-won insights about bringing infrastructure-as-code principles to enterprise networking.

Assembling the Fellowship

Pozo's first insight challenges the typical network engineer mindset: "When we as network engineers think about the network, we think about switches and routers. But in reality, for clients, the network is everything that allows the PC to connect to services."

This expanded view—encompassing firewalls, servers, authentication, 802.1X, certificates—requires assembling a team with diverse expertise. Full automation efforts need collaborative teams, not solo heroes.

Terraform vs. Ansible: The State Difference

For those unfamiliar with Terraform, Pozo provided a clear contrast with Ansible. Ansible executes tasks sequentially from top to bottom. Terraform is stateful: it records what it manages in a state file and, on every run, compares the desired configuration with that state and with the actual infrastructure.
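
The comparison Terraform performs on each run can be pictured with a small Python sketch. This is a simplified model, not Terraform's actual algorithm; the function name plan and the interface names are made up for illustration.

```python
from typing import Any


def plan(desired: dict[str, dict[str, Any]],
         state: set[str],
         actual: dict[str, dict[str, Any]]) -> dict[str, str]:
    """Simplified model of a plan: decide one action per resource name."""
    actions = {}
    for name in sorted(desired.keys() | state):
        if name not in state:
            actions[name] = "create"   # in the code, not yet managed
        elif name not in desired:
            actions[name] = "delete"   # managed, but removed from the code
        elif actual.get(name) != desired[name]:
            actions[name] = "update"   # drift: the device differs from the code
        else:
            actions[name] = "no-op"
    return actions


# Hypothetical data: someone edited the description of Ethernet1/1 by hand.
desired = {
    "Ethernet1/1": {"description": "uplink to core"},
    "Ethernet1/2": {"description": "server-42"},
}
state = {"Ethernet1/1", "Ethernet1/3"}
actual = {
    "Ethernet1/1": {"description": "changed by hand"},
    "Ethernet1/3": {"description": "old port"},
}
result = plan(desired, state, actual)
print(result)
```

In this model the hand-edited interface is planned for an update back to the coded value, the new interface is created, and the interface removed from the code is deleted.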

"If someone changes manually the description, on the next plan we're going to overwrite what they did manually on the devices," he explained. This state management is both Terraform's power and its complexity.

Unlike Ansible's linear execution, Terraform builds a dependency graph, so relationships between resources have to be expressed in the code, either through references between resources or with depends_on. You can't create a BGP neighbor in a VRF without creating the VRF first, and Terraform needs to understand these dependencies.
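
A dependency graph of this kind can be sketched with Python's standard graphlib module. The resource names below are hypothetical, and the sketch only illustrates ordering; it does not reproduce how Terraform walks its graph.

```python
from graphlib import TopologicalSorter

# Each key depends on the resources in its set (hypothetical names).
dependencies = {
    "vrf.tenant_a": set(),
    "bgp.process": set(),
    "bgp_neighbor.tenant_a_peer": {"vrf.tenant_a", "bgp.process"},
}

# Creation order: every resource comes after the things it depends on.
create_order = list(TopologicalSorter(dependencies).static_order())

# Destruction has to run the other way round.
destroy_order = list(reversed(create_order))

print(create_order)
print(destroy_order)
```

The neighbor always appears after both the VRF and the BGP process when creating, and before them when destroying.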

The Vendor Reality Check

The journey through vendor support proved sobering. Vendors advertise Terraform support ("they want to make money"), but reality differs significantly. By Pozo's estimate, 99% of vendors ship "half-baked providers":

Some resources can be created but not deleted
Others can be created but not modified
Many crash on first attempt
Issues remain open for years without resolution

One provider had 121 open issues. His team found cases where supposedly destroying a single BGP VRF would actually wipe out the entire BGP process due to poorly implemented YANG models.

Solutions included:

Getting involved in open-source provider development
Hybrid approaches (automate what works, use alternatives for what doesn't)
Forking providers to implement necessary fixes
Extensive testing of every resource
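
Testing every resource means walking each one through its full lifecycle and checking the result after every step. The sketch below shows the idea in Python against a hypothetical in-memory stand-in (FlakyProvider) whose delete silently does nothing. Real provider tests would exercise the provider itself; all names here are illustrative assumptions.

```python
class FlakyProvider:
    """Hypothetical stand-in for a provider that cannot delete."""

    def __init__(self):
        self.objects = {}

    def create(self, name, attrs):
        self.objects[name] = dict(attrs)

    def read(self, name):
        return self.objects.get(name)

    def update(self, name, attrs):
        self.objects[name].update(attrs)

    def delete(self, name):
        pass  # bug under test: the object is never removed


def lifecycle_failures(provider, name, attrs, changed):
    """Run create, update, delete and report which stages misbehave."""
    failures = []
    try:
        provider.create(name, attrs)
        if provider.read(name) != attrs:
            failures.append("create")
        provider.update(name, changed)
        if provider.read(name) != {**attrs, **changed}:
            failures.append("update")
        provider.delete(name)
        if provider.read(name) is not None:
            failures.append("delete")
    except Exception as exc:  # a crash is also a finding
        failures.append(f"crash: {exc!r}")
    return failures


failures = lifecycle_failures(
    FlakyProvider(), "vlan100", {"vlan_id": 100, "name": "users"}, {"name": "staff"}
)
print(failures)
```

Running a check like this per resource type gives a map of what can safely be automated and what needs a fork, a fix, or an alternative tool.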

The Sorcery of State Management

State management brings both power and peril. The state file becomes "our golden egg"—a complete mirror of infrastructure that could theoretically rebuild everything after a catastrophic failure.

But state also introduces dangerous commands:

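terraform destroy removes every resource tracked in the state, which can wipe the entire infrastructure
terraform apply syncs the infrastructure with the configuration, so it will delete resources that are still in state but no longer in the code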

For healthcare providers and oil rigs where "we don't want to lose billions or lose lives," they implemented manual approval gates. Engineers must review every plan before execution, looking for unexpected deletions or changes.
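
Part of that review can be assisted by a script that reads the machine-readable plan (the JSON produced by terraform show -json on a saved plan file) and flags anything involving a delete. The sketch below is an assumption about how such a gate might look, not the team's actual tooling; the resource addresses are invented, and only the resource_changes, address, and change.actions fields of the plan JSON are used.

```python
import json


def flag_deletions(plan_json: str) -> list[str]:
    """Return the planned changes whose actions include a delete."""
    plan = json.loads(plan_json)
    flagged = []
    for change in plan.get("resource_changes", []):
        actions = change["change"]["actions"]
        if "delete" in actions:  # covers plain deletes and replacements
            flagged.append(f'{change["address"]}: {"+".join(actions)}')
    return flagged


# Hypothetical, heavily trimmed plan document.
sample_plan = json.dumps({
    "resource_changes": [
        {"address": "example_vlan.users", "change": {"actions": ["create"]}},
        {"address": "example_bgp_vrf.tenant_a", "change": {"actions": ["delete"]}},
        {"address": "example_interface.eth1", "change": {"actions": ["delete", "create"]}},
        {"address": "example_vlan.voice", "change": {"actions": ["no-op"]}},
    ]
})

flagged = flag_deletions(sample_plan)
needs_human_approval = bool(flagged)
print(flagged, needs_human_approval)
```

A script like this does not replace the engineer's review; it only makes unexpected deletions harder to miss in a long plan.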

Scaling Challenges: Performance Realities

The scaling story reveals Terraform's cloud origins. Designed for loosely coupled cloud resources, it struggles with networking's strict operational dependencies and large object counts.

Time multiplication effect: If each API call takes half a second, and Terraform must read object status, push changes, then read again, that's three calls and roughly 1.5 to 2 seconds per object. For 5,000 objects handled one after another, pipelines run for hours just to create one VLAN.
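
The arithmetic can be checked with a few lines of Python. Taken literally, three calls at half a second each come to 1.5 seconds per object; the sketch also shows the rounded figure of 2 seconds. The function name and the parallelism parameter are illustrative, and the model assumes every object is touched on every run.

```python
def pipeline_seconds(objects: int, call_seconds: float = 0.5,
                     calls_per_object: int = 3, parallelism: int = 1) -> float:
    """Wall-clock estimate: each object costs read + push + read."""
    return objects * calls_per_object * call_seconds / parallelism


serial = pipeline_seconds(5000)             # 3 calls x 0.5 s per object
rounded = 5000 * 2.0                        # using 2 s per object instead
ten_wide = pipeline_seconds(5000, parallelism=10)

print(f"serial:   {serial / 3600:.1f} h")   # 2.1 h
print(f"rounded:  {rounded / 3600:.1f} h")  # 2.8 h
print(f"10-wide:  {ten_wide / 60:.1f} min") # 12.5 min
```

The last line assumes the target API tolerates ten concurrent operations, which matches Terraform's default -parallelism value but is often not true of network controllers.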

Solutions involved:

Distribution: Multiple pipelines for different controllers rather than one monolithic process
State division: Split states into overlay/underlay, tenants, Layer 2/Layer 3
Controller bypass: Go directly to devices when possible instead of through controllers
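
State division can be pictured as grouping objects under a key so that a change only has to refresh one small state. The sketch below assumes a key of fabric, tenant, and layer; the key and the object fields are hypothetical, and the right split depends on the design.

```python
from collections import defaultdict


def split_states(objects: list[dict]) -> dict[str, list[str]]:
    """Group object names into separate states keyed by fabric/tenant/layer."""
    states = defaultdict(list)
    for obj in objects:
        key = f'{obj["fabric"]}/{obj["tenant"]}/{obj["layer"]}'
        states[key].append(obj["name"])
    return dict(states)


# Hypothetical inventory.
objects = [
    {"name": "vlan100", "fabric": "dc1", "tenant": "tenant_a", "layer": "l2"},
    {"name": "vlan200", "fabric": "dc1", "tenant": "tenant_b", "layer": "l2"},
    {"name": "vrf_a", "fabric": "dc1", "tenant": "tenant_a", "layer": "l3"},
    {"name": "vlan101", "fabric": "dc1", "tenant": "tenant_a", "layer": "l2"},
]

states = split_states(objects)
print(states)
```

Adding a VLAN for tenant_a then only requires a plan against dc1/tenant_a/l2, which holds two objects here instead of four.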

The Final Boss: Performance Optimization

Even with distribution and state division, creating a single VLAN in a migrated 550-leaf data center with thousands of objects still took 32 minutes. The culprit? Terraform's internal functions.

When processing 10,000+ objects, Terraform would consume 100% CPU for 10 minutes before even contacting network devices. Investigation pointed to the cost of built-in functions such as distinct (deduplication), whose running time grows much faster than the input; Pozo described the complexity as exponential.
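
Why deduplication can become so expensive is easy to see in a sketch. The first function below scans the list of kept items for every new item, so the work grows with the square of the input; the second uses hashing and stays roughly linear. This is an illustration in Python, not Terraform's implementation, and both function names are made up.

```python
def distinct_by_scan(items):
    """Order-preserving dedup that scans the kept list for every item."""
    kept = []
    comparisons = 0
    for item in items:
        found = False
        for existing in kept:
            comparisons += 1
            if existing == item:
                found = True
                break
        if not found:
            kept.append(item)
    return kept, comparisons


def distinct_by_hash(items):
    """Order-preserving dedup using dict keys (one hash lookup per item)."""
    return list(dict.fromkeys(items))


n = 2000
unique_items = [f"vlan{i}" for i in range(n)]
kept, comparisons = distinct_by_scan(unique_items)

print(comparisons)  # n * (n - 1) / 2 = 1999000 comparisons for 2000 items
```

Doubling the input roughly quadruples the comparison count for the scanning version, which is how a list of 10,000+ objects can keep a CPU busy before any device is contacted.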

The solution: Move all data transformation outside Terraform. Instead of using Terraform's built-in functions for loops, deduplication, and transformations, they built a Python preprocessing layer that prepares data before sending to Terraform.

Result: Deployment time dropped from 56 minutes to 7 minutes for 6,000 objects.
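
A preprocessing layer of this kind might look like the sketch below: Python does the deduplication and reshaping, then emits a JSON variables file that Terraform only has to read. This is an assumption about the general shape, not the team's code; the field names, the variable names vlans and ports, and the file name mentioned afterwards are hypothetical.

```python
import json


def build_tfvars(raw_ports: list[dict]) -> dict:
    """Deduplicate and reshape raw port data into ready-to-use variables."""
    # Unique VLAN IDs, sorted so the output is stable between runs.
    vlans = sorted({p["vlan"] for p in raw_ports})
    # One entry per leaf:port key; duplicates collapse onto the same key.
    ports = {
        f'{p["leaf"]}:{p["port"]}': {
            "leaf": p["leaf"], "port": p["port"], "vlan": p["vlan"],
        }
        for p in raw_ports
    }
    return {"vlans": vlans, "ports": ports}


# Hypothetical source data with one duplicate row.
raw_ports = [
    {"leaf": "leaf-101", "port": "eth1/1", "vlan": 100},
    {"leaf": "leaf-101", "port": "eth1/2", "vlan": 100},
    {"leaf": "leaf-102", "port": "eth1/1", "vlan": 200},
    {"leaf": "leaf-101", "port": "eth1/1", "vlan": 100},
]

tfvars = build_tfvars(raw_ports)
tfvars_json = json.dumps(tfvars, indent=2, sort_keys=True)
print(tfvars_json)
```

Written to a file such as network.auto.tfvars.json, the output is loaded by Terraform automatically, and the keyed ports map can feed a for_each directly without any in-Terraform transformation.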

The Business Case Reality

Pozo offered a brutally honest assessment: "Anyone trying to go into network automation thinking they're going to directly save money will see the project as a loss."

The direct costs—salaries, consultancy, infrastructure—are visible and immediate. The savings are indirect but substantial:

Standards Enforcement: 100% compliance with design and configuration standards across all devices

Error Prevention: Automated deployment replaces error-prone manual work on complex configurations (the talk cited 32 clicks just to put one VLAN on one port in Cisco ACI)

Security and Auditing: All changes tracked, manual modifications detected and corrected

Knowledge Abstraction: Operations teams don't need vendor-specific expertise—they specify intent, automation handles implementation

Scale Achievement: Managing 60+ hospitals with limited operations staff became feasible

Key Lessons for Infrastructure-as-Code

Start Small, Think Big: Don't try to automate everything immediately. Focus on dynamic, frequently changing elements.

Expect Vendor Limitations: Budget time for provider testing, fixes, and workarounds.

Plan for State Management: Implement approval processes and state backup strategies from day one.

Design for Performance: Consider computational complexity when processing large datasets.

Build Teams, Not Tools: Network automation requires diverse expertise and collaborative approaches.

Measure Indirect Value: Focus on risk reduction, compliance, and operational capacity rather than just time savings.

The Modernization Imperative

Despite challenges, Pozo remained enthusiastic: "I prefer to write code than go to a thousand switches and update switch ports. It's much better, at least for me—much more fun."

His journey illustrates that successful network automation with Terraform requires understanding its strengths and limitations, building appropriate processes around state management, and accepting that the real value comes from operational consistency and risk reduction rather than immediate cost savings.

For organizations considering Terraform for network automation, Pozo's epic provides a realistic roadmap: expect challenges, plan for scale, build teams, and focus on the indirect business value that makes the journey worthwhile.

Sometimes the hardest journeys yield the most valuable destinations.

Watch the full presentation: An Epic Journey: Automating Networks with Terraform


Chris Grundemann

Executive advisor. Specializing in network infrastructure strategy and how to leverage your network to the greatest possible business advantage through technological and cultural transformation.

https://www.khadgaconsulting.com/