You have done the root cause analysis. So, what’s next?
Let’s discover why fixing the broken rule matters.
We had a production issue in a rules-based database system that lasted more than 36 hours. We conducted several investigations to identify the root cause of the defect. After we understood the root cause of the problem, we identified three options to fix it:
- Option 1: Change the data so that the rule is no longer triggered
- Option 2: Change the rule to handle that specific instance of the data
- Option 3: Change the rule to handle that case and future cases like it
Hundreds of users in different countries relied on the system for trade operations and reporting. The effort to implement the fix was about the same for each option.
After a brief discussion, we decided to implement Option 3.
Let’s consider the advantages and trade-offs of each of the three options.
Option 1. Change the data so that the rule is no longer triggered
This option was the least desirable:
- It offered a tactical workaround: a system user with “superpowers” (supervisory access rights) could fix the problem without consulting IT.
- Because this fix would not require escalation to IT, it could have been completed and approved within a few minutes in the production system.
- But the information in the system would be wrong, probably soon forgotten, and a potential source of future “surprises.”
- Faulty data influences faulty decisions.
Option 2. Change the rule to handle that specific instance of the data
This option was better than the first:
- It addressed the immediate need and would not corrupt the data.
- Changing the rule required IT involvement and would follow a proven process to test the impact of the change offline before applying it in production.
- But this option only went halfway because we expected the system to fail again when it encountered similar data.
Option 3. Change the rule to handle that case and future cases like it
This option was the best of the three:
- This option addressed the root cause of the incident and extended the life of the fix beyond that specific data occurrence.
- It had all the qualities of Option 2 and none of the flaws of Option 1.
- It addressed not only what caused the system failure but also why that data caused the system to fail.
- The underlying issue was a broken rule, not faulty data.
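To make the difference between the options concrete, here is a minimal Python sketch. Everything in it is hypothetical: the original incident's rule and data are not described here, so the `Trade` record, the unit-price rule, the trade ID `T-1001`, and the zero-quantity failure are assumptions chosen only to illustrate the pattern.

```python
from dataclasses import dataclass


@dataclass
class Trade:
    # Hypothetical record; the real system's schema is not known.
    trade_id: str
    notional: float
    quantity: int


def unit_price_broken(trade: Trade) -> float:
    # The "broken rule": fails with ZeroDivisionError when quantity is 0.
    return trade.notional / trade.quantity


def unit_price_option2(trade: Trade) -> float:
    # Option 2: special-case the one record that caused the incident.
    if trade.trade_id == "T-1001":
        return 0.0
    return trade.notional / trade.quantity


def unit_price_option3(trade: Trade) -> float:
    # Option 3: handle the whole class of data that breaks the rule.
    if trade.quantity == 0:
        return 0.0
    return trade.notional / trade.quantity


incident = Trade("T-1001", 5000.0, 0)      # the record behind the outage
similar = Trade("T-2002", 7500.0, 0)       # a later record of the same kind

# Option 1 would instead edit the data (for example, set quantity to 1),
# which silences the rule but leaves a wrong value in the system.
```

In this sketch, `unit_price_option2(incident)` succeeds, but `unit_price_option2(similar)` still raises `ZeroDivisionError`, which is the "system fails again on similar data" outcome. `unit_price_option3` handles both records because it tests the condition that breaks the rule rather than the identity of one record.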
When presented with options like these, always fix the broken rule.