There is a belief in the space where containers meet DevOps that a containerized service crashing is no big deal: you just spin up a new one. It’s not a pervasive belief, but it is common enough that we’ve all heard or read it somewhere.
I think this view of the world is, at best, incomplete. There are easily verifiable cases where it just isn’t reality. A good example is the Cloudbleed bug, where ignoring errors and spinning up new instances to meet load demands was one of many things that combined to produce a sensational (if low-risk) failure.
Another good example is shopping carts and payment systems. All the small-business owners I know would balk at a shopping cart or payment-processing system that could crash mid-transaction, with the provider shrugging and saying, “We spun up another instance; have the customer try again.”
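To make the mid-transaction scenario concrete, here is a minimal, hypothetical Python sketch. The function names, the in-memory “ledger” and the retry loop are invented for illustration and don’t correspond to any real payment API; the point is that a crash between charging the card and recording the order leaves half-finished state that restarting the worker never repairs.

```python
# Hypothetical sketch: a payment worker that crashes between charging the
# card and recording the order. "Spinning up another instance" (retrying)
# does not undo the charge or complete the order.

charges = []   # money that has already moved
orders = []    # orders the business actually knows about

def charge_card(order_id, amount):
    charges.append((order_id, amount))

def record_order(order_id, amount):
    orders.append((order_id, amount))

def process_order(order_id, amount):
    charge_card(order_id, amount)          # step 1: customer is charged
    raise RuntimeError("worker crashed")   # bug strikes between the two steps
    record_order(order_id, amount)         # step 2: never reached

def restart_blindly(order_id, amount, attempts=3):
    # The "just spin up a new one" approach: each retry charges the card
    # again and still never records the order.
    for _ in range(attempts):
        try:
            process_order(order_id, amount)
            return
        except RuntimeError:
            continue

restart_blindly("order-42", 19.99)
print(len(charges), "charges,", len(orders), "orders recorded")  # 3 charges, 0 orders
```

Whether the real-world outcome is duplicate charges, lost orders or both depends on the system, but restarting alone resolves none of it; someone still has to dig into the crash.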
Are there scenarios where simply replacing crashed containers is acceptable? I would argue no. Crashes are software errors. The world has not changed; errors are still bad. I can invent “but if you …” scenarios as well as the next person, but the fact is, if your software is crashing, whether in a container, a VM, on hardware, in firmware or anywhere else, you did something wrong, and you can’t know the extent of the damage the crashes cause without digging into the problem. At which point, you may as well fix it.
I’ve written my share of bugs, like any developer. And I’ve fixed my share of bugs, like any developer. Ignoring them is a recipe for disaster: a seemingly benign bug can suddenly turn out to have major consequences.
Schedule time to fix issues. Rushing forward with new features and products is cool, but keeping the customer base happy and safe is far cooler when revenue time comes. If containers and DevOps are saving you the time they’re supposed to be, you have a little extra time to get in there and fix issues, so use it productively.