Resolution vs. Restoration in Incident Management
Related terms can be tricky, especially when people use them interchangeably. Here is a scenario that helps in understanding two such terms – Resolution and Restoration. (If you are looking for the two more similar terms, Repair and Recovery, visit my article Repair, Resolution, Recovery and Restoration – A recipe of confusion in Incident Management.)
Fact in question: Under Incident Management, we restore the services while Problem Management focuses on the resolution.
Consider the scenario:
Thursday Morning, around 9.30 am
Users are calling the Service Desk to report that they are unable to access their network drives and are getting an error message.
Multiple users seem to be affected, as the Service Desk is getting calls reporting the same issue from multiple locations. The Service Desk does the initial diagnosis and figures out that it is not a network issue; rather, the server or storage seems to be down.
The Service Desk creates a priority 2 (P2) incident and assigns it to the Windows team (L2).
Same day, around 9.45 am
An L2 engineer grabs the ticket and starts to investigate the issue. He performs various troubleshooting steps, but with no success. After troubleshooting for almost 3 hours, the L2 engineer escalates the ticket to an L3 engineer for further assistance.
Same day, around 12.50 pm
An L3 engineer who was about to go for lunch gets this P2 incident and quickly looks into it. After reviewing all the troubleshooting steps taken so far, he decides to restart the server, so he raises an emergency change and secures all approvals in the ECAB. He then restarts the server and, once it comes back up, tries to access the network drives again. Luckily, he is able to access the drives! Now the engineer needs to do one more thing before he can go for lunch: restore the data to these network drives with the help of the Storage Team.
Same day, around 3.45 pm
The L3 engineer asks the Service Desk to confirm the resolution with the users. Meanwhile, the L3 engineer assigns the incident to the Storage Team to initiate the restore from backup to these drives so that users can access their data as well.
As expected, the Service Desk reports back that users can access the drives but cannot find their folders and files.
Luckily, the Storage Team has a daily backup scheduled, so they decide to restore the most recent backup from the mirror storage. After discussing the way forward in the ECAB, the storage engineer initiates the restore from the mirror set; the expected completion time is 2 hours and 40 minutes.
Same day, around 4.30 pm
The Service Desk updates the front-end messages and IVR, stating that the issue with the network drives is actively being worked on and that users are advised not to use the network drives until further notice, because any data they save in the meantime will be lost.
Meanwhile, the Incident Manager / Incident Coordinator driving the bridge call creates a problem record referencing this incident so that the root cause can be identified and a permanent fix can be implemented if viable.
Same day, around 6.30 pm
The data restore from the mirror set completes ahead of schedule, in 2 hours and 30 minutes, and the engineer tries to access the network drive to check that all the files and folders are available. The good news is that all the data is back again. The storage engineer quickly assigns the incident to the Service Desk for user confirmation and ticket closure.
The Service Desk is able to confirm with some of the users that they can access the network drives and see their files and folders too. Some of the users have already gone home for the evening, so the Service Desk decides to keep the ticket open, pending customer confirmation, before they close it out.
The Service Desk reports its findings back to the Windows and Storage Team engineers, and the bridge call ends.
Same day, around 6.40 pm
The L3 engineer from the Windows team finally goes out for lunch, wondering whether he should call it lunch or dinner!
Now, if you consider the entire scenario, the incident was resolved by applying the workaround of rebooting the server, which fixed the access issue. However, the service (network drives, in this case) was not restored to its last known condition for the users: the drives were accessible but not usable!
So in this scenario, the resolution took approximately 6.5 hours (Incident Identification at 9.30 am to Incident Resolution at 3.45 pm), whereas the service restoration took 9 hours (Incident Resolution Time, 6.5 hrs. + Recovery Time, i.e. the time taken to restore data from the backup storage, 2.5 hrs.).
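If you prefer to check the arithmetic from the clock times, here is a minimal sketch in Python. The three timestamps are taken from the scenario (9.30 am, 3.45 pm and 6.30 pm); the variable and function names are my own, purely for illustration. Computed strictly from the clock, the resolution comes to 6 hours 15 minutes (rounded to roughly 6.5 hours above) and the restoration to 9 hours. The gap between the two also includes the short handover to the Storage Team before the 2.5-hour data restore.

```python
from datetime import datetime


def at(clock: str) -> datetime:
    """Parse a 24-hour clock time; the date part is irrelevant here."""
    return datetime.strptime(clock, "%H:%M")


def hours_between(start: datetime, end: datetime) -> float:
    """Elapsed time between two timestamps, in hours."""
    return (end - start).total_seconds() / 3600


# Timestamps from the scenario (hypothetical variable names).
incident_reported = at("09:30")   # users start calling the Service Desk
incident_resolved = at("15:45")   # server rebooted, drives accessible again
service_restored = at("18:30")    # data restored, drives usable again

resolution_hours = hours_between(incident_reported, incident_resolved)
restoration_hours = hours_between(incident_reported, service_restored)
gap_hours = restoration_hours - resolution_hours

print(f"Resolution took  {resolution_hours:.2f} hrs")
print(f"Restoration took {restoration_hours:.2f} hrs")
print(f"Resolved but not yet restored for {gap_hours:.2f} hrs")
```

The last figure is the one users actually feel: the time during which the ticket could be called "resolved" while the service was still not usable.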
ITIL definitions in the context of the scenario above:
Resolution – Action taken to repair the root cause of an incident or problem, or to implement a workaround. As a workaround, the L3 engineer from the Windows team rebooted the server. Took approx. 6.5 hrs.
Recovery – Returning a configuration item or an IT service to a working state. Recovery of an IT service often includes recovering data to a known consistent state. After recovery, further steps may be needed before the IT service can be made available to the users (restoration). The storage engineer restored the data from the backup storage to recover all the lost files and folders. Took 2.5 hrs.
Restoration – Taking action to return an IT service to the users after repair and recovery from an incident. The server was rebooted, the data was restored from the backup storage, user confirmation was obtained, and then the service (network drives) was made available to the users. Took 9 hrs. (6.5 + 2.5)
BTW, if you are interested in knowing what happened after that, here it is:
The Problem Record was assigned to the Windows team and, after spending a decent amount of time on root cause analysis, the engineer found that there had been a brief power outage in the data center and that the secondary power source had not worked either. This caused some of the servers and storage boxes to go down. The power outage was caused by a fluctuation in the power supply due to a faulty switch.
A permanent fix for this problem was implemented by replacing the faulty switch at the data center, and the engineer has successfully prevented future incidents due to this cause.
Everybody lived happily ever after, at least until the next major incident!