Incident management (production issue) processes are created out of desperation. After a few crises, out-of-sync stats, and unhappy users, the engineering team decides that a process to handle incidents is a must.
The basic version usually goes like this:
- A production issue is reported
- A user files a ticket
- The ticket is assigned to the engineering team
- An owner is assigned
- The root cause (or temporary fix) is identified
- A fix is ready and deployed
- Users are happy again
Of course, most of the time it is not a real issue. The user may have been doing something wrong, the server or the databases may have had temporary downtime, or a deployment may have been in progress.
When it is a real issue, most of the time it is not worth working on a definitive fix. A definitive fix usually means scaling, redesigning, refactoring, or upgrading the application, which is outside the scope of an incident.
Update your incident management process and add one step at the beginning:
1) Is this really an issue?
And one step at the end:
2) Is this issue a design or scale problem?
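The two extra steps can be sketched as a small triage function. This is only an illustration, not a real ticketing integration: the `Ticket` fields, the `Outcome` names, and the `handle` function are all hypothetical, and the boolean flags stand in for whatever checks your team actually runs.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    # Hypothetical outcome labels, chosen for this sketch only.
    NOT_AN_ISSUE = "closed: not a real issue"
    FIXED = "closed: fix deployed"
    NEEDS_PROJECT = "mitigated: design or scale work goes to the backlog"


@dataclass
class Ticket:
    # Hypothetical fields; in practice these come from investigation.
    summary: str
    user_error: bool = False
    during_downtime: bool = False
    during_deployment: bool = False
    design_or_scale_problem: bool = False


def is_real_issue(ticket: Ticket) -> bool:
    """Step 1: rule out user error, temporary downtime, and deployments."""
    return not (
        ticket.user_error or ticket.during_downtime or ticket.during_deployment
    )


def handle(ticket: Ticket) -> Outcome:
    if not is_real_issue(ticket):
        return Outcome.NOT_AN_ISSUE
    # The usual steps (assign an owner, find the root cause or a
    # temporary fix) would happen here.
    # Step 2: a design or scale problem is outside the scope of the incident.
    if ticket.design_or_scale_problem:
        return Outcome.NEEDS_PROJECT
    return Outcome.FIXED
```

For example, `handle(Ticket("report is empty", during_deployment=True))` returns `Outcome.NOT_AN_ISSUE`, so nobody spends time hunting for a root cause that does not exist.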