A story about scrapped items, shopfloor chaos, and the importance of a robust DevOps strategy.
When Production and Local Collided
It was early in my career as a junior developer, and I was working on a manufacturing execution system—a high-profile piece of software tightly integrated with shopfloor machines, controlling real-time operations in factories. It was exhilarating to be part of something so impactful.
At the time, it was standard practice to connect to the live production database through a local development environment for debugging. Strange? Definitely. But it was “the way things were.”
One day, I was troubleshooting a bug while connected to the production database. I solved the issue and didn’t think much more about it. The next day, I moved on to a new feature: an easy way to scrap items. I ran countless tests locally, tweaking the functionality and making sure it worked as expected.
Except… I forgot I was still connected to the production database.
Inadvertently, I scrapped about 100 items in a real factory in Sweden.
“Where Are Our Items?”
The next day, we got a call from our colleagues in Sweden. They had sent items for manufacturing, but the system showed they didn’t exist.
It didn’t take long to realize what had happened: I happened. Without realizing it, my local testing had hit the live production system, scrapping real items that were physically present on the shopfloor. I felt panic take over.
Thankfully, the issue was resolved fairly quickly by restoring the database to a previous state. But it was a moment that left a lasting impression on me—not just because of the mistake itself, but because it highlighted just how flawed the system was.
Was It My Fault?
Yes, of course… and no. Could I have been more cautious? Absolutely. But the truth is, a professional setup should not allow a developer to connect directly to production. The very fact that this was possible was a failure in itself.
Here’s the thing: mistakes happen. Humans are fallible. The role of a well-designed infrastructure is to minimize the risk of human error, especially in critical systems. Unfortunately, this setup lacked the safeguards needed to prevent a situation like this.
What Went Wrong (and How It Should Have Been Prevented)
Let’s break down the issues with this setup and how modern DevOps practices could have prevented it.
1. No Segregation Between Environments
Connecting a local development environment directly to production is asking for trouble. A proper setup should include:
- Separate Environments: Development, staging, and production environments should be completely isolated from each other.
- Staging Environment: This mirrors production closely, allowing developers to reproduce and debug issues safely.
Modern Example: Cloud-Based Environments
In cloud setups like AWS or Azure, you can spin up staging environments on demand using Infrastructure as Code (IaC) tools like Terraform. These environments can mimic production without touching real data.
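Even at the application level, a small guard makes this kind of accident much harder. Here’s a minimal Python sketch (the endpoint names and the APP_ENV/ALLOW_PROD variables are hypothetical) of configuration that defaults to dev and refuses to hand out the production endpoint without an explicit override:

```python
import os

# Hypothetical endpoints -- each environment gets its own, fully isolated database.
DB_ENDPOINTS = {
    "dev": "dev-db.internal.example.com",
    "staging": "staging-db.internal.example.com",
    "prod": "prod-db.internal.example.com",
}

def get_db_endpoint() -> str:
    """Resolve the database endpoint from an explicit APP_ENV variable.

    Defaults to 'dev', and refuses to return the production endpoint
    unless the caller also sets an explicit confirmation flag.
    """
    env = os.environ.get("APP_ENV", "dev")
    if env not in DB_ENDPOINTS:
        raise ValueError(f"Unknown environment: {env}")
    if env == "prod" and os.environ.get("ALLOW_PROD") != "1":
        raise RuntimeError("Refusing to connect to production from this context.")
    return DB_ENDPOINTS[env]
```

Had my local setup worked like this, “I forgot I was still connected to production” simply wouldn’t have been a state the application could end up in silently.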
2. Insufficient Access Controls
A junior developer should never have the ability to directly modify production data. Best practices include:
- Role-Based Access Control (RBAC): Access should be tightly controlled based on roles, with junior developers having read-only access (if any) to production.
- Just-in-Time Access: Temporary production access can be granted for debugging with strict approvals.
Modern Example: IAM
Identity and Access Management (IAM) allows you to define detailed permissions for each user. For example, a developer might have full access to staging but only temporary, limited access to production logs via tools like AWS Systems Manager.
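As a rough illustration, here’s how such a narrow, read-only policy might be created with boto3 (a sketch; the policy name and exact actions would depend on your setup):

```python
import json
import boto3

iam = boto3.client("iam")

# A deliberately narrow policy: developers may describe RDS resources and
# read logs, but get no data-modifying permissions on production.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "rds:Describe*",
                "logs:GetLogEvents",
                "logs:FilterLogEvents",
            ],
            "Resource": "*",
        }
    ],
}

iam.create_policy(
    PolicyName="DeveloperProdReadOnly",  # hypothetical name
    PolicyDocument=json.dumps(policy_document),
)
```

Attach a policy like this to the developer role, and the worst a stray local test can do in production is read a log file.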
3. Lack of Network Isolation
Local environments shouldn’t have direct access to production databases. A properly secured network setup includes:
- Virtual Private Clouds (VPCs): Production resources should reside in a private network inaccessible from the outside.
- Bastion Hosts: Any production access should go through a secured bastion host with strict logging and monitoring.
- Database Proxies: Use tools like AWS RDS Proxy to manage database access securely.
Modern Example: Virtual Networks and VPNs
Virtual Networks allow you to isolate production environments entirely while providing controlled access via VPNs or private endpoints.
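Using AWS as the example (an assumption; the same idea applies to Azure Virtual Networks), here’s a boto3 sketch in which the production database’s security group accepts traffic only from the bastion’s security group — the group names and the Postgres port are placeholders:

```python
import boto3

ec2 = boto3.client("ec2")

# Private network for production; nothing outside the VPC can reach it.
vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")
vpc_id = vpc["Vpc"]["VpcId"]

# Separate security groups for the database and the bastion host.
db_sg = ec2.create_security_group(
    GroupName="prod-db-sg",  # hypothetical names
    Description="Production database access",
    VpcId=vpc_id,
)
bastion_sg = ec2.create_security_group(
    GroupName="prod-bastion-sg",
    Description="Bastion host",
    VpcId=vpc_id,
)

# The database port accepts traffic only from the bastion's security group --
# a developer laptop simply has no network path to it.
ec2.authorize_security_group_ingress(
    GroupId=db_sg["GroupId"],
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 5432,
            "ToPort": 5432,
            "UserIdGroupPairs": [{"GroupId": bastion_sg["GroupId"]}],
        }
    ],
)
```

With this in place, my mistake would have failed at the network layer with a connection timeout instead of a hundred scrapped items.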
4. No Monitoring or Alerts
If production data is being modified unexpectedly, alarms should go off. Monitoring tools can provide real-time alerts and prevent damage from escalating.
- Audit Logs: All actions in production should be logged.
- Real-Time Alerts: Tools like AWS CloudWatch or Datadog can send alerts if unusual activity is detected.
- Immutable Backups: Frequent snapshots of production data ensure quick recovery without data loss.
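To make the alerting point concrete: a CloudWatch alarm on database write throughput could have flagged my accidental scrapping within minutes. A minimal boto3 sketch — the instance identifier, threshold, and SNS topic ARN below are all placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alert when write activity on the production database spikes past a
# baseline -- e.g. someone (or some rogue local client) mass-updating rows.
cloudwatch.put_metric_alarm(
    AlarmName="prod-db-unusual-write-activity",  # hypothetical name
    Namespace="AWS/RDS",
    MetricName="WriteIOPS",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "prod-db"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=1,
    Threshold=500.0,  # hypothetical baseline for this workload
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-north-1:123456789012:ops-alerts"],  # hypothetical SNS topic
)
```

Paired with immutable snapshots, an alarm like this turns a quiet, day-long mystery into a page within minutes and a clean restore shortly after.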
Lessons Learned
That experience taught me a lot—not just about my own mistakes, but about how critical proper DevOps practices are in protecting production systems. Here are the key takeaways:
- Build Guardrails, Not Blame: Mistakes are inevitable, especially with junior developers. The system should be designed to absorb these mistakes without catastrophic consequences.
- Invest in Staging Environments: Reproducing issues without touching production is a non-negotiable in modern software development.
- Emphasize DevOps Practices: Role-based access, network isolation, and robust monitoring are foundational to a secure and resilient system.
Final Thoughts
Breaking production is a rite of passage for many developers, but it shouldn’t have to be. By adopting modern DevOps practices, organizations can prevent these situations, protect their critical systems, and give their teams the confidence to experiment and grow without fear of causing real-world chaos.
For me, it was a hard lesson learned. But for you, it’s an opportunity to make sure your systems are built to withstand the inevitable oops moments. That’s why, while working as the architect of Flowpass, one of the first things I took care of was setting up separate dev, staging, and prod environments, with VPN access and IAM-based access control.