Repair, Resolution, Recovery and Restoration – A recipe of confusion in Incident Management?
Repair, Resolution, Recovery and Restoration are the 4 R’s most often used during the Incident Management process. Although ITIL is very particular about terms and terminology, these four terms still cause considerable confusion whenever they are discussed.
There are states defined for the incident lifecycle:
These 4 R’s are typically used and defined in the last 2 states, and I have found that they are sometimes used interchangeably. Here is my attempt to add a bit more clarity to these definitions with a scenario.
Consider the scenario:
On a bright and sunny Monday morning, a relatively busy Service Desk gets a phone call from a technical user, Todd, stating that he is unable to work on the xyz database: it is taking a long time to respond and freezing intermittently. Todd also states that he has checked with other team members sitting beside him, who work on the same database and are experiencing the same issue.
The fairly courteous and energetic Service Desk agent Brenda responds to Todd’s call and tries to get more information about the issue while quickly searching for relevant knowledge articles. Luckily, Brenda finds a knowledge article stating that any issue related to the xyz database needs to be escalated to the L2 team straight away, with no troubleshooting required by the Service Desk.
Brenda creates an incident ticket and assigns it to the L2 team, gives the ticket number to Todd for reference and finishes the call with the fancy (or boring?) call closing script!
The incident ticket documentation by Brenda looks as below:
Here is where the real action starts. The incident ticket opened by Brenda comes to Mark, the L2 Engineer from the database team. After spending a few minutes on investigation and diagnosis, Mark realizes what is wrong with the xyz database, performs a few steps and documents them neatly in the ticket work notes.
The ticket documentation by Mark looks as below:
Now let’s consider the ITIL Definitions around these four terms:
Repair – The replacement or correction of a failed configuration item.
Resolution – Action taken to repair the root cause of an incident or problem, or to implement a workaround.
Recovery – Returning a configuration item or an IT service to a working state. Recovery of an IT service often includes recovering data to a known consistent state. After recovery, further steps may be needed before the IT service can be made available to the users (restoration).
Restoration – Taking action to return an IT service to the users after repair and recovery from an incident.
In the context of our scenario, “database is locked and screens are freezing” is the incident, and it is making the database service inaccessible to the users.
When we try to relate these terms to the above scenario, here is how it looks:
Resolution is the act of bouncing the database to clear the locks.
Questions to ponder:
Q. Is the database accessible again right after doing the above activities?
A. Not yet!
Q. Would doing just these activities make the database service available to users?
Q. Can you resolve the service?
A. Eh, what kind of question is that? We can resolve the incident, not the service!
Conclusion: These activities are essential to resolve the incident and are a precursor to making the database service available for users.
Recovery is the act of correcting (repairing) any inconsistencies left in the database after the bounce that would otherwise lead to further issues.
Question to ponder:
Q. What could go wrong if Mark doesn’t repair the database by correcting the inconsistencies?
A. This particular incident could be resolved, but the issue would recur right away until he repairs what is broken.
Conclusion: This activity is actually repairing the broken piece and recovering the working state of the database.
Restoration is the act of bringing the database service back online for the users.
Questions to ponder:
Q. Can users resume using the database after the “resolution, repair and recovery” described above?
Q. Can you restore an incident?
A. Not really, we can restore the services though!
Conclusion: The database service can be restored after the resolution and recovery from the incident.
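The ordering in this scenario can also be sketched in a few lines of Python. This is only an illustrative model, not anything ITIL prescribes: the class names `Incident`, `Service` and `ServiceState`, and the rule that recovery must follow resolution, are assumptions taken from the way the scenario above plays out. The point it shows is that `resolve` belongs to the incident while `recover` and `restore` belong to the service.

```python
from dataclasses import dataclass, field
from enum import Enum


class ServiceState(Enum):
    DEGRADED = "degraded"    # users are affected by the incident
    RECOVERED = "recovered"  # back in a working state, not yet handed to users
    RESTORED = "restored"    # available to users again


@dataclass
class Incident:
    summary: str
    resolved: bool = False
    work_notes: list = field(default_factory=list)

    def resolve(self, action: str) -> None:
        # Resolution applies to the incident, never to the service.
        self.work_notes.append(f"Resolution: {action}")
        self.resolved = True


@dataclass
class Service:
    name: str
    state: ServiceState = ServiceState.DEGRADED

    def recover(self, incident: Incident, repair: str) -> None:
        # Assumption from the scenario: recovery (with its repair) follows resolution.
        if not incident.resolved:
            raise RuntimeError("resolve the incident before recovering the service")
        incident.work_notes.append(f"Repair/Recovery: {repair}")
        self.state = ServiceState.RECOVERED

    def restore(self, incident: Incident) -> None:
        # Restoration hands the recovered service back to the users.
        if self.state is not ServiceState.RECOVERED:
            raise RuntimeError("recover the service before restoring it")
        incident.work_notes.append("Restoration: service returned to users")
        self.state = ServiceState.RESTORED


incident = Incident("xyz database locked, screens freezing")
db = Service("xyz database")

incident.resolve("bounced the database to clear the locks")
db.recover(incident, "corrected inconsistencies left after the bounce")
db.restore(incident)

for note in incident.work_notes:
    print(note)
```

Notice that there is no `Service.resolve` and no `Incident.restore` in this sketch; trying to call either would fail, which mirrors the two "odd" questions asked above.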
Here is what the “all-in-one” view looks like:
Another confusion of Resolution vs. Restoration in the context of Incident and Problem Management:
The confusion usually runs like this: “true resolution can only happen once we fix the root cause of the incidents permanently!”
Well, that’s Problem Resolution. In the above scenario, Mark has resolved the incident and restored the service for the users, but that doesn’t mean this incident cannot occur again.
Mark has just resolved the cause of “this particular incident” by correcting the inconsistencies in the tables SSS and QQQ. That’s not a permanent fix for the xyz database freezing issue. If Mark had identified the “root cause” of the problem behind this incident (that is, why it occurred in the first place, and why it may cause more such incidents in future) and fixed it permanently, that would be problem resolution, i.e. fixing the cause to prevent problems and the resulting incidents from happening (the objective of Problem Management).
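To make the difference concrete, here is a minimal sketch of a problem record that stays open even though the incident linked to it is already resolved. The `ProblemRecord` class and its fields are hypothetical names chosen for illustration; real ITSM tools model this in their own ways.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProblemRecord:
    description: str
    root_cause: Optional[str] = None      # unknown until root cause analysis is done
    permanent_fix_applied: bool = False
    linked_incidents: list = field(default_factory=list)

    def is_resolved(self) -> bool:
        # A problem counts as resolved only when the root cause is known AND fixed for good.
        return self.root_cause is not None and self.permanent_fix_applied


problem = ProblemRecord("xyz database freezes intermittently")
# Hypothetical label for the incident Mark resolved in the scenario.
problem.linked_incidents.append("xyz database incident (resolved, service restored)")

print(problem.is_resolved())  # False: the incident is resolved, the problem is not
```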
Remember, you can resolve the incident, not the service, and restore the service, not the incident!