The Idempotency Audit: When Scripts Run Twice
Why 'check-then-act' logic is fragile, and how a script that ran twice broke production.
A key differentiator for senior engineers is the focus on idempotency. This report tells the story of a script that ran twice and broke production, highlighting why "check-then-act" logic is fragile compared to declarative state.
We replaced the automated scripts with declarative state enforcement, which prevents race conditions and duplicate resources.
- Goal: Automate VM provisioning in a brownfield environment.
- Constraint: Existing legacy network configuration must be preserved.
- Reality: The script added duplicate NICs and corrupted routing tables when retried.
Imperative scripts that check for existence often fail during race conditions or partial failures. Declarative engines enforce the end state regardless of the starting point.
Engineering standards used
- Idempotency is a code property. It's not the runner's job to be safe; the code must handle re-runs.
- Destructive actions need checks. Explicitly verify state before modifying or deleting resources.
- Distinguish change from no-op. Logs must clearly show when no action was taken vs. when a change occurred.
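The first and third standards can be illustrated with a minimal Python sketch. It is not the code from the incident: `Result`, `ensure_nic`, and the in-memory set standing in for a VM's attached NICs are all hypothetical names chosen for illustration. The operation is safe to repeat, and its return value says explicitly whether anything changed.

```python
from dataclasses import dataclass


@dataclass
class Result:
    changed: bool  # True only when state was actually modified
    message: str   # log line that makes the distinction explicit


def ensure_nic(attached: set[str], nic_name: str) -> Result:
    """Make sure nic_name is attached. Safe to call any number of times."""
    before = len(attached)
    attached.add(nic_name)  # set.add does nothing if the element is already present
    if len(attached) == before:
        return Result(False, f"ok: {nic_name} already attached, nothing to do")
    return Result(True, f"changed: attached {nic_name}")


# Hypothetical usage: the second call is a logged no-op, not a second attach.
nics: set[str] = set()
print(ensure_nic(nics, "eth0").message)  # changed: attached eth0
print(ensure_nic(nics, "eth0").message)  # ok: eth0 already attached, nothing to do
```

Keying the operation on the NIC name, rather than appending blindly, is what makes the re-run harmless here; the `changed` flag is what makes the log trustworthy.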
The “Smart” Script
The incident started with a well-intentioned script designed to provision VMs. It included logic to check if a VM already existed before attempting to create it:
    if (!exists(vm)) create(vm);
This logic works perfectly in isolation. However, in a distributed system, or even a slow one, the gap between the check and the act is a danger zone.
The Race Condition
During a deployment, the API response for the creation request timed out. The system, interpreting this as a failure, retried the script.
The first request had actually succeeded on the backend but failed to report back in time. The retried script checked for existence, but because of eventual consistency or simple timing, the new VM wasn’t yet visible in the query result.
The script proceeded to “create” the resources again. Since the VM ID was reused, it attached a second network interface to the existing VM instead of failing or updating it. This duplicate NIC grabbed a new IP via DHCP, creating a routing loop that took the application offline.
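The failure sequence above can be reproduced in a few lines of Python. This is a simplified simulation under stated assumptions, not the real provisioning API: `FakeCloud`, `settle`, and `provision` are hypothetical, reads are modelled as lagging behind writes until `settle()` is called, and reusing a VM ID is modelled as attaching one more NIC.

```python
class FakeCloud:
    """Hypothetical in-memory backend whose reads lag behind writes."""

    def __init__(self):
        self.nics = {}        # real state: vm_id -> list of NIC names
        self.visible = set()  # VM ids that existence queries currently return

    def exists(self, vm_id):
        return vm_id in self.visible

    def create(self, vm_id):
        # Reusing an existing vm_id attaches another NIC, mirroring the incident.
        nics = self.nics.setdefault(vm_id, [])
        nics.append(f"nic{len(nics)}")

    def settle(self):
        # Reads eventually catch up with writes.
        self.visible = set(self.nics)


def provision(cloud, vm_id):
    # The "smart" check-then-act logic from the original script.
    if not cloud.exists(vm_id):
        cloud.create(vm_id)


cloud = FakeCloud()
provision(cloud, "vm-1")   # first attempt: succeeds on the backend, response times out
provision(cloud, "vm-1")   # retry: the VM is not yet visible, so create runs again
cloud.settle()
print(cloud.nics["vm-1"])  # ['nic0', 'nic1']
```

Both calls pass the existence check because the check reads stale state, so the guard protects nothing in exactly the case where it was needed.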
The Fix: Declarative State
The solution wasn’t to write better checks. It was to stop hand-writing checks in the script entirely.
We moved the provisioning logic to a declarative tool (Terraform/Ansible). Instead of saying “create this,” we defined the end state: “This VM exists, and it has exactly one NIC.”
When the declarative engine runs, it queries the actual state of the resource and compares it with the definition. If it sees two NICs, it removes one to match the definition. If the VM already matches the definition, it does nothing. The outcome is always the same, no matter how many times you run it.
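The core idea can be sketched as a small reconcile loop in Python. This is an illustration of the principle only, not how Terraform or Ansible are implemented: `reconcile`, the dictionary shapes, and the NIC naming are assumptions made for the example. The desired state maps each VM ID to its NIC count, and the function converges whatever it finds toward that.

```python
def reconcile(actual: dict[str, list[str]], desired: dict[str, int]) -> list[str]:
    """Converge actual toward desired and return the actions taken."""
    actions = []
    for vm_id, nic_count in desired.items():
        if vm_id not in actual:
            actual[vm_id] = []
            actions.append(f"create {vm_id}")
        nics = actual[vm_id]
        while len(nics) > nic_count:   # too many NICs: remove the extras
            actions.append(f"detach {nics.pop()} from {vm_id}")
        while len(nics) < nic_count:   # too few NICs: add the missing ones
            nics.append(f"nic{len(nics)}")
            actions.append(f"attach {nics[-1]} to {vm_id}")
    return actions


desired = {"vm-1": 1}                  # "This VM exists, and it has exactly one NIC."
actual = {"vm-1": ["nic0", "nic1"]}    # the state the incident left behind

print(reconcile(actual, desired))      # ['detach nic1 from vm-1']
print(reconcile(actual, desired))      # []
```

The second run returns an empty action list, which is the property the imperative script lacked: the starting point does not matter, and a re-run is a visible no-op.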
If you can't run it twice safely, don't run it once automatically.
Idempotency is the foundation of automation that lets you sleep at night.