Problem Management

Published date: April 15, 2024, Version: 1.0

Enterprise Problem Management (EPM) works closely with application and infrastructure support teams to manage the investigation and identification of the underlying causes of Critical Incidents, and their permanent removal from the production environment.

What We Do

The three main tasks of EPM are:

  • Eliminate recurring Incidents by identifying their root causes and initiating actions to remove errors from the infrastructure.
  • Achieve the highest levels of service availability by minimizing the impact of Incidents and Problems.
  • Proactively monitor Incident trends to prevent future Incidents and Problems.

EPM is also responsible for reviewing Changes whose implementation caused Critical Incidents, through Post Implementation Reviews (PIRs).

Our Goal

EPM seeks to minimize the adverse business impact of Incidents and Problems caused by underlying errors within the IT infrastructure, and to proactively prevent the recurrence of Incidents related to these errors.

To achieve this, EPM strives to identify the root cause of Incidents, document and communicate Known Errors, and initiate actions to improve or correct the situation.

Problem Management Framework

The Problem Management Framework defines a set of standard problem management processes, templates, and tools used to initiate, plan, execute, control, and close every Problem record. The framework streamlines communication and decision making, and gives each Problem record a consistent structure.

The image below identifies:

  • The four major components of each Problem record
  • The handoff that occurs between Incident Management and Problem Management
  • The template of the fishbone diagram we leverage in CTC, including the common fault categories defined

It also identifies strategies leveraged throughout the process, including fishbone diagrams and 5 Whys. Click on the links to learn more about these techniques.

Post Implementation Review (PIR) & Service Improvement Plans (SIP)

Post Implementation Review (PIR):

A Post Implementation Review follows every Change implemented in a production environment that results in a Critical Incident. The process provides a structured method for implementing resolutions to Critical Incidents that result from IT Changes. It also captures optimizations and lessons learned so that future Changes can reuse those solutions and improve on earlier iterations.

PIRs leverage the same techniques and requirements identified in the Problem Framework, including fishbone diagrams and 5 Whys, focusing on the Change and the reason for its failure, which is often process- or procedure-related.
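
The PIR trigger described above can be read as a simple rule: a Change needs a PIR when it was implemented in production and it resulted in at least one Critical Incident. The sketch below is only an illustration of that rule. The class names, field names, and the "production" label are hypothetical and do not correspond to any EPM tool or record schema.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical record shapes, for illustration only.
@dataclass
class Incident:
    incident_id: str
    critical: bool  # True when the Incident is classified as Critical


@dataclass
class Change:
    change_id: str
    environment: str  # assumed label, e.g. "production" or "test"
    caused_incidents: List[Incident] = field(default_factory=list)


def requires_pir(change: Change) -> bool:
    """Return True when the Change meets the PIR criteria described above."""
    in_production = change.environment == "production"
    caused_critical = any(i.critical for i in change.caused_incidents)
    return in_production and caused_critical


# Example: a production Change linked to one Critical Incident.
failed_change = Change("CHG-1", "production", [Incident("INC-1", critical=True)])
clean_change = Change("CHG-2", "production")
```

Here `requires_pir(failed_change)` evaluates to True and `requires_pir(clean_change)` to False; a Change outside production would not qualify under this reading even if it caused a Critical Incident.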

Service Improvement Plans (SIPs):

Service Improvement Plans are developed for services that have multiple underlying issues resulting in recurring outages. SIPs are not typically attributed to one root cause; instead, multiple solutions are identified.

SIPs leverage the same techniques and requirements identified in the Problem Framework, including fishbone diagrams and 5 Whys. They also take a wider view of the service, including larger action items such as service upgrades, replacements, and overall architecture reviews.

Role | Responsibility
Problem Manager | Owns the documentation and process. Accountable for the final outcome of a Problem.
Change Manager | Owns the Change process. Responsible for any changes required to the process.
IT | Owns the Change and service. Responsible for implementation and diagnosis of any issues identified. This includes IT users who don't own the service but have a task in a Change or Problem solution.
Incident Manager | Owns the Incident process. Responsible for ensuring the Incident is marked as caused by Change, and for handing off the Incident to the Problem Manager.
Service Desks | Informed of any updates regarding a Problem.
Business Stakeholders | Own business processes. Responsible for any changes required to those processes. This includes Change approvers.