An introduction to the MOF Problem Management SMF
Our previous blog articles in this series explain the role of the Microsoft Operations Framework (MOF) and its service management functions (SMFs), and introduce ITIL IQ®, which is the first step in implementing MOF within your business. Before you use this SMF, you may want to read the following ITIL IQ® guidance to learn more about the MOF IT service lifecycle and the Operate Phase:
- Blog Article 1: What’s your ITIL IQ®? Meet MOF
- Blog Article 2: The MOF Plan Phase
- Blog Article 7: The MOF Deliver Phase
- Blog Article 13: The MOF Operate Phase
The MOF IT service lifecycle encompasses all the activities and processes involved in managing an IT service: its conception, development, operation, maintenance, and ultimately its retirement. MOF organises these activities and processes into Service Management Functions (SMFs), which are grouped together in phases that mirror the IT service lifecycle. Each SMF is anchored within a lifecycle phase and contains a unique set of goals and outcomes supporting the objectives of that phase. An IT service’s readiness to move from one phase to the next is confirmed by management reviews, which ensure that goals are being achieved in an appropriate fashion and that IT’s goals are aligned with the goals of the organisation. The SMFs can be used as standalone sets of processes, but they are most effective when used together to ensure service delivery at the desired quality and risk levels.
Position of the Problem Management SMF Within the MOF IT Service Lifecycle
The Problem Management SMF belongs to the Operate Phase of the MOF IT service lifecycle. The following figure shows the place of the Problem Management SMF within the Operate Phase, as well as the location of the Operate Phase within the IT service lifecycle.
Figure 1. Position of the MOF Problem Management SMF within the IT service lifecycle
Why Use the MOF Problem Management SMF?
This SMF should be useful to anyone who is tasked with identifying underlying problems to prevent incidents before they occur. Typically, the focus of Problem Management is on complex problems that are beyond the scope of a request for incident resolution. Specifically, this SMF addresses how to:
- Document the problem.
- Filter the problem.
- Research the problem.
- Research the outcome.
Problem Management Service Management Function Overview
The Problem Management SMF provides guidance to help IT professionals resolve complex problems that may be beyond the scope of Incident Resolution requests, which are described in the Customer Service SMF. An incident is any event that is not part of the standard operation of a service and that causes, or may cause, an interruption to, or a reduction in, the quality of service. Problem Management involves:
- Recording incident, operations, and event data about a problem within an IT service or system.
- When justified, researching the problem to identify its root cause.
- Developing workarounds, reactive fixes, or proactive fixes for the problem.
Problem Management should begin at the start of a service’s lifecycle and should be applied to all aspects of IT, including application development, server building, desktop deployment, user training, and service operation. As more problems are discovered, recorded, researched, and resolved, IT will experience fewer failures. If Problem Management is performed during the period when a service is envisioned, planned, designed, built, and stabilised, the service will be deployed into productive use with fewer failures and higher customer satisfaction.
Problem Management SMF Role Types
The primary team accountability that applies to the Problem Management SMF is the Support Accountability. The role types within that accountability and their primary activities within this SMF are displayed in the following table.
Table 1. Support Accountability and Its Attendant Role Types
|Role Type||Responsibilities||Role in This SMF|
|Customer Service Representative||· Handles calls · Has first contact with user, registers call, categorises it, determines supportability, and dispatches call||· Helps the customer|
|Incident Resolver||· Diagnoses · Investigates · Resolves||· Watches for evidence of problems · Passes on incident information to Problem Manager|
|Incident Coordinator||· Responsible for incident from beginning to end · Owns quality control||· Watches for evidence of problems · Passes on incident information to Problem Manager|
|Problem Analyst||· Investigates and diagnoses||· Finds underlying root causes of the incidents|
|Problem Manager||· Identifies problems from the incident list||· Prevents future incidents|
|Customer Service Manager||· Accountable for goals of Support · Covers incidents and problems||· Oversight|
Goals of Problem Management
The primary goal of Problem Management is to reduce the occurrence of failures with IT services. Its secondary goals are to generate data and lessons that IT can use to provide feedback during the IT lifecycle and to help drive the development of more stable solutions.
Table 2. Outcomes and Measures of the Problem Management SMF Goals
|Outcomes||Measures|
|Problems affecting infrastructure and service are identified and assigned an owner.||The number of unassigned problems is reduced, and the number of problems assigned to an owner is increased.|
|Steps are identified and taken to reduce the impact of incidents and problems.||The number of incidents and problems that occur is reduced, and the impact of those that still occur is lessened.|
|Root cause is identified for problems, and activity is initiated to establish workarounds or permanent solutions to identified problems.||The number of workarounds and permanent solutions to identified problems is increased.|
|Trend analysis is used to predict future problems and enable prioritisation of problems.||More problems are resolved earlier or avoided entirely.|
The following table contains definitions of key terms found in this guide.
Table 3. Key Terms
|Term||Definition|
|Problem||A scenario describing symptoms that have occurred in an IT service or system that threatens its availability or reliability.|
|Error||A fault, bug, or behaviour issue in an IT service or system.|
|Known error||An error that has been observed and documented.|
|Root cause||The specific reason that most directly contributes to the occurrence of an error.|
|Known error database||A subsection of the knowledge base or overall configuration management system (CMS) that stores known errors and their associated root causes, workarounds, and fixes.|
Problem Management Process Flow
The Problem Management process flow consists of the following processes:
- Document the problem.
- Filter the problem.
- Research the problem.
- Research the outcome.
Figure 2. Problem management process flow
Process 1: Document the Problem
The first process in Problem Management is to thoroughly document the problem. This includes classifying and prioritising the problem.
Figure 3. Document the problem
Activities: Document the Problem
A problem is any scenario that threatens the reliability or availability of a service or system. Problems may arise from many sources and can be triggered by many events. For a problem to qualify for Problem Management, however, there must be value in documenting it, researching it, and attempting to locate and resolve its root cause. In addition, the value of removing the problem from the environment should be greater than the effort and cost of doing so.
Record keeping is critical to Problem Management. If data about the problem is lost, duplicated, or incorrectly recorded, Problem Management cannot function correctly. The success of the process depends on having good data to analyse and research.
Note: If a service or system is interrupted, this is considered an incident and not a problem. Be very careful not to get in the way of the service resumption activity of incident resolution, which is described in the Customer Service SMF.
The following table lists the activities involved in documenting a problem. These include:
- Creating a problem record.
- Classifying the problem.
- Prioritising the problem.
Table 4. Activities and Considerations for Documenting the Problem
|Activities||Considerations|
|Create a problem record||Key questions: · What are the symptoms of the problem? · Is there an existing problem record with the same symptoms? · Has a known error record already been created? Inputs: · Incident record · Events from System Center Operations Manager · Trends discovered from operational data Output: · A record that tracks the symptoms and scenario surrounding the observation of the problem Best practices: · Detailed data is important. A Problem Management tracking system should allow users to attach or link to logs, screenshots, dump files, and other diagnostic data. · Tight integration between the customer service tracking system and the Problem Management tracking system is very beneficial. Whenever a problem record is created as the result of a Help request, the two should be linked together for future analysis and review.|
|Classify the problem||Key Questions: · What are the available classifications? · Are there applicable sub-classifications? · What is the best-fit classification to select? Inputs: · Working knowledge of the environment · IT Service Catalogue Outputs: · Metadata added to the problem record to help associate it with other data previously collected · Data to help properly document information coming out of Problem Management to make it usable by the customer service process Best practices: · Classification should be a best-fit approach. · It is important to offer an “unknown” option. If the problem does not fit into an existing classification, a new classification might be required. Users of the Problem Management tracking system should not be forced to select an incorrect class, as this can skew the data.|
|Prioritise the problem||Key questions: · Is there a Help request from the customer service tracking system associated with this problem? · How many people does, or could, this problem affect? · How significant is, or would be, the impact to important business processes? · Is this an obvious one-time occurrence? · What is the criticality of the business process being affected? · What SLAs are in place? · What groups of people are affected: back-office support, external customers, or upper management? Inputs: · Data from the problem record · Related Help requests · Operational data and evidence · IT Service Catalogue Outputs: · A determination of the priority of this problem over others · A determination of whether this problem should be deferred to a later time Best practice: · It is important to prioritise consistently. Building a matrix into the Problem Management tracking system can help guide users to setting meaningful priorities.|
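The priority matrix mentioned in the best practice above could be sketched roughly as follows. This is only an illustration: the two axes (`impact` and `criticality`), the 1 to 5 priority scale, and the `DEFER_ABOVE` cut-off are assumptions made for the example, not values prescribed by MOF.

```python
from enum import IntEnum
from typing import Tuple


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical 3x3 matrix: (impact, criticality) -> priority, where 1 is worked on first.
PRIORITY_MATRIX = {
    (Level.HIGH, Level.HIGH): 1,
    (Level.HIGH, Level.MEDIUM): 2,
    (Level.HIGH, Level.LOW): 3,
    (Level.MEDIUM, Level.HIGH): 2,
    (Level.MEDIUM, Level.MEDIUM): 3,
    (Level.MEDIUM, Level.LOW): 4,
    (Level.LOW, Level.HIGH): 3,
    (Level.LOW, Level.MEDIUM): 4,
    (Level.LOW, Level.LOW): 5,
}

# Assumed cut-off: priorities numerically above this value are deferred.
DEFER_ABOVE = 4


def prioritise(impact: Level, criticality: Level) -> Tuple[int, bool]:
    """Look up the priority and report whether the problem should be deferred."""
    priority = PRIORITY_MATRIX[(impact, criticality)]
    return priority, priority > DEFER_ABOVE
```

Because every user looks up the same table, two people assessing the same problem should arrive at the same priority, which is the consistency the best practice asks for.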
Process 2: Filter the Problem
The next process is to filter the problem to determine if solving it should be pursued.
Figure 4. Filtering the problem
Activities: Filter the Problem
During this process, you filter the problem to decide whether to pursue solving it.
Table 5. Activities and Considerations for Filtering the Problem
|Activities||Considerations|
|Filter the problem||Key questions: · Has a problem record already been created for the problem? · What is the business justification for researching this problem? · How many hours will it take to reproduce the problem? · What is the payoff if a fix is found? Inputs: · Data from the problem record · Experience from past similar problems · IT Service Catalogue Output: · Determination to continue work on the problem or to close the record Best practices: · Turning down a challenge can be difficult for motivated and curious IT professionals. After all, technology is driven by a desire to understand how things work and how to fix them when they stop working. However, when it comes to managing the IT resources in a business, a proper balance must be maintained. There must be a justifiable reason for solving the problem. · The filtering activity demands an objective review of the problem. If the benefit of fixing it does not significantly outweigh the cost of researching and fixing the problem, then a fix probably should not be attempted. If more data is discovered later that provides more details, the problem can be revisited.|
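As a rough sketch, the filtering decision can be expressed as a simple cost-benefit check. The field names and the `MIN_BENEFIT_RATIO` threshold below are hypothetical; in particular, reading "significantly outweigh" as "at least twice the cost" is an assumption chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class FilterInput:
    duplicate_of_existing_record: bool  # a problem record already exists
    estimated_research_hours: float     # hours to reproduce and research
    hourly_cost: float                  # cost of one hour of staff time
    estimated_payoff: float             # value of a fix, in the same currency


# Assumed threshold: the payoff must be at least twice the research cost.
MIN_BENEFIT_RATIO = 2.0


def should_pursue(item: FilterInput) -> bool:
    """Return True to continue work on the problem, False to close the record."""
    if item.duplicate_of_existing_record:
        return False  # continue under the existing record instead
    cost = item.estimated_research_hours * item.hourly_cost
    if cost <= 0:
        return item.estimated_payoff > 0
    return item.estimated_payoff / cost >= MIN_BENEFIT_RATIO
```

A closed record is not lost. If more data is discovered later, the same check can be rerun with updated estimates.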
Process 3: Research the Problem
After you make the decision to solve the problem, the next process is to do the research necessary to find a fix or workaround.
Figure 5. Research the problem
Activities: Research the Problem
For effective and meaningful Problem Management, researching a problem must follow disciplines similar to those of the scientific method. This includes:
- Reproducing the problem in a test environment.
- Observing the symptoms of the problem and noting your observations.
- Performing root cause analysis.
- Developing a hypothesis and testing it.
- Repeating this process until the root cause has been determined.
At first glance, this may seem overly complicated. However, it is critical that Problem Management identifies the root cause of the problem and determines the exact steps to eliminate it. This process allows you to examine one variable at a time. This is important because introducing multiple variables at the same time can make it impossible to isolate the valuable ones. This could lead to ineffective or over-complicated fixes being deployed. The output of this process is the production of a known error record. However, since it can be difficult to pinpoint exactly when the data required to create the known error record will become apparent, you should ask the following questions during each activity in this process. If the answer to any of the questions is yes, it is time to create the known error record:
- Has any information been discovered that would aid others in resolving incidents or events matching this specific problem?
- Has a definitive root cause been identified?
- Have any actions been uncovered that would reduce the frequency or impact of the error?
- Can a date be projected for when the error will be resolved?
- Is there meaningful information available to share about the progress of resolving the error?
- Are there actions that Problem Management needs individuals to take to aid in the research efforts?
- Has a workaround been discovered?
- Has a fix been designed?
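The "any yes answer" rule for the questions above can be sketched as a checklist. The class and field names are hypothetical labels for the eight questions and are not part of MOF.

```python
from dataclasses import dataclass, fields


@dataclass
class ResearchCheckpoint:
    # One flag per question in the list above; all default to "no".
    aids_incident_resolution: bool = False
    root_cause_identified: bool = False
    mitigating_actions_found: bool = False
    resolution_date_projected: bool = False
    progress_worth_sharing: bool = False
    help_needed_from_others: bool = False
    workaround_discovered: bool = False
    fix_designed: bool = False


def should_create_known_error(checkpoint: ResearchCheckpoint) -> bool:
    """A single 'yes' is enough to create (or update) the known error record."""
    return any(getattr(checkpoint, f.name) for f in fields(checkpoint))
```

Running this check at the end of each research activity, rather than once at the end, matches the guidance to ask the questions during every activity in the process.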
The following table lists the activities involved in researching the problem. These include:
- Reproducing the problem.
- Observing the symptoms of the problem.
- Performing root cause analysis.
- Developing a hypothesis.
- Testing the hypothesis.
Table 6. Activities and Considerations for Researching the Problem
|Activities||Considerations|
|Reproduce the problem||Key questions: · Can the problem be reproduced at will? · What user context or security access is required to reproduce the problem? · Will special lab equipment be required or can this be reproduced on any system? Input: · Problem record Outputs: · An environment where the problem can be studied and observed · A new or updated known error record Best practices: · Production systems should not be used for Problem Management work if at all possible. In scenarios where the problem can only be reproduced in production, extreme care must be taken so that the act of observation does not affect the system. System monitoring and debugging tools can cause drops in performance. In some cases, the service might have to be taken offline to use these tools. Activities like this must be treated as changes and should be passed through the Change Management and Change Control processes. · It may be tempting to introduce small changes to systems and services, disguising them as “Break-Fix” activities. This should not be allowed. Change and problem processes should work hand-in-hand to drive stability and reliability into production services. Circumventing reviews and approval activities for changes can have negative impacts. · The steps discovered to reproduce the problem should be documented in full detail in the problem record. In the event that others get involved in working on the problem, this ensures that the steps are reproduced exactly the same way each time.|
|Observe the symptoms of the problem||Key questions: · What are the symptoms of the problem? · How can they be observed? · What tools are required to capture and record the occurrence of the problem? Inputs: · Problem record · Lessons learned during reproduction Outputs: · An understanding of the timing, triggers, and results of the problem · New or updated known error record|
|Perform root cause analysis||Key question: · What technique should be used for performing root cause analysis? To learn about some of the available techniques, see “Root Cause Analysis Techniques” in this article. Inputs: · Selected root cause analysis technique · Problem record Outputs: · Hypothesis to test · New or updated known error record|
|Develop a hypothesis||Key questions: · What actions might work around this problem? · What actions might fix this problem? · Could this problem be the result of another problem? · Have changes been made to the service or system recently that may have created the problem? Inputs: · Output from root cause analysis · Problem record Outputs: · Hypothesis to test · New or updated known error record Best practice: · Document, document, document! Effort documented today is effort avoided tomorrow. As the Problem Management process is repeated, it can become increasingly more efficient by using data created during previous efforts. This means that all hypotheses should be documented in the problem record. Information such as the reasoning behind the hypothesis, how to test it, and what the expected results are should be captured.|
|Test the hypothesis||Key questions: · What actions might work around this problem? · What actions might fix this problem? · Could this problem be the result of another problem? · Have changes been made to the service or system recently that may have created the problem? Inputs: · Hypothesis to test · Problem record Output: · New or updated known error record Best practices: · Keep a control system in place to compare results of the testing. This system should remain unmodified during the testing of any hypothesis. This enables the testers to determine if their actions have resolved the problem or if some other uncontrollable factor has been introduced. · Test only one hypothesis at a time, and test each hypothesis one step at a time. Introducing complex modifications can make it difficult to pinpoint the actual workaround or fix. · Document all of the results—both positive and negative outcomes. · If circumstances force these activities to take place on production systems, make sure that proper change procedures have been followed and that back-out plans are tested and in place.|
Root Cause Analysis Techniques
A difficult area of Problem Management for most organisations is analysing the root cause of a problem. Root cause analysis techniques are used to identify the conditions that initiate an undesired activity or state. Since problems are best solved by attempting to correct or eliminate their root causes, this is a critical part of resolving any problem. There are many techniques available for performing root cause analysis. Two of the most popular are:
- Fishbone diagrams.
- Fault tree analysis.
Fishbone Diagrams
Visual techniques are often used to help IT professionals determine the root cause of a problem. One useful tool for diagramming the analysis visually is the Ishikawa, or fishbone, diagram. The following figure illustrates a fishbone diagram.
Figure 6. Fishbone diagram
Fault Tree Analysis
Fault tree analysis is another visual technique used to assist with root cause analysis. It is a top-down approach to identifying all potential causes leading to a defect. In the final stage of diagnosis, the root cause is identified and the problem is moved from an unknown state to a known state. The following figure shows an example of fault tree analysis.
Figure 7. Example of fault tree analysis
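The top-down idea behind a fault tree can be illustrated with a small tree structure. This is a simplified sketch: it omits the AND/OR gates used in formal fault tree analysis, and the example failure and causes are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FaultNode:
    description: str
    children: List["FaultNode"] = field(default_factory=list)


def candidate_root_causes(node: FaultNode) -> List[str]:
    """Walk the tree top-down and collect the leaf events as candidate root causes."""
    if not node.children:
        return [node.description]
    causes: List[str] = []
    for child in node.children:
        causes.extend(candidate_root_causes(child))
    return causes


# Hypothetical example: the top event is the observed defect.
tree = FaultNode("Nightly report fails", [
    FaultNode("Database unavailable", [
        FaultNode("Disk full"),
        FaultNode("Service account locked"),
    ]),
    FaultNode("Scheduler did not trigger job"),
])

print(candidate_root_causes(tree))
```

Each leaf is then a hypothesis to test one at a time, as described in the research process.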
Process 4: Research the Outcome
In this process, you review the outcome of your research and determine whether a workaround or fix for the problem has been discovered.
Figure 8. Research the outcome
Activities: Research the Outcome
Once the first pass through the research process has been completed, it is time to look at the results and determine if a viable workaround or fix has been discovered. Because of the complex nature of IT and the intricacies of highly integrated systems, you might need to perform the research process a number of times in order to achieve a workaround or fix. If you repeat the research process, you must go through the filtering activity each time. This gives you the opportunity to re-evaluate the value of continuing the effort to resolve the problem. If the resolution to the problem becomes too difficult to find, it might be time to stop your attempts and focus on a more achievable goal. The following table lists the activities involved in researching the outcome. These include:
- Determining if a workaround or fix has been discovered.
- Determining if a proactive action is possible.
- Closing the problem record.
Table 7. Activities and Considerations for Researching the Outcome
|Activities||Considerations|
|Has a workaround or fix been discovered?||Key questions: · Has a workaround or a fix been discovered? · Are there any prerequisites to applying this knowledge? · What level of authentication is required to use this knowledge? Inputs: · Problem record · Research process Output: · Updated known error record with step-by-step instructions for the workaround or fix for this problem Best practices: · If the workaround or fix requires a change and is expected to be implemented many times, Change Control should be engaged to provide guidance on establishing a standard change. · Workarounds and fixes should be documented in great detail. The known error record should be updated to reflect that a workaround or fix is available, and the record should be displayed when searching for problems with similar symptoms. · Known error record usage should be tracked and reported. Counting the number of times a workaround is applied can provide justification for developing a fix. Tracking the number of times a fix is applied can provide justification for retrofitting existing systems with the fix. · If a known error record is seldom accessed, this may indicate that the problem had a low value and shouldn’t have been pursued, or that the record is poorly documented and is not being displayed in searches, or that the record is too difficult to follow. Use this information to improve your processes.|
|Is a proactive action possible?||Key questions: · Could this workaround or fix be applied proactively? · What tools are required to deploy it? · What is the benefit of proactive action for this error? Input: · Research process Output: · Change Management engaged to start a Request for Change (RFC) Best practices: · All Problem Management is reactive. You cannot research and resolve an error before it occurs. However, workarounds and fixes can be applied reactively or proactively. · After Problem Management has discovered a beneficial action to alleviate the impact of an error, it can either provide the knowledge for reactive use by entities such as the Service Desk, or deploy the action proactively. However, proactive work can sometimes require more planning, preparation, and effort than the reactive application of a workaround or fix. If the action is only beneficial to a small number of systems, it may be more economical to allow the Service Desk to address them one by one as the problem is encountered. If the action is beneficial to many systems or an entire department, the effort to prepare and execute a large-scale proactive deployment is justified. To learn more about resolving problems proactively, see “Proactive Analysis” in this guide.|
|Close the problem record||Key Questions: · Is there any value to be added by continuing work on this problem record? · Has all of the appropriate known error record information been updated? · Have all changes associated with this problem record been completed? Input: · Problem record Outputs: · Updated problem record · Validation that all meaningful efforts for this problem have been concluded Best practices: · Do not confuse problem records with known error records. Problem records track the work effort, actions, and decisions associated with working on a particular problem. When work on that problem has come to a close, so too should the problem record. The known error record lives on in the known error database. That record should be tracked and reviewed and possibly updated from time to time.|
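The usage tracking recommended for known error records could look something like the sketch below. The log format, the record identifiers, and the thresholds are all hypothetical, and the sketch counts how often a record is applied as a stand-in for how often it is accessed.

```python
from collections import Counter

# Hypothetical usage log: one (known_error_id, kind) entry per application.
usage_log = [
    ("KE-001", "workaround"),
    ("KE-001", "workaround"),
    ("KE-001", "workaround"),
    ("KE-002", "fix"),
    ("KE-002", "fix"),
    ("KE-003", "workaround"),
]

# All records in the (assumed) known error database.
all_records = ["KE-001", "KE-002", "KE-003", "KE-004"]


def usage_counts(log):
    """Count applications per (record, kind) pair."""
    return Counter(log)


def fix_candidates(log, min_workaround_uses=3):
    """Records whose workaround is applied often enough to justify developing a fix."""
    counts = Counter(rid for rid, kind in log if kind == "workaround")
    return sorted(rid for rid, n in counts.items() if n >= min_workaround_uses)


def seldom_used(log, record_ids, max_uses=0):
    """Records applied at most max_uses times; candidates for a quality review."""
    counts = Counter(rid for rid, _ in log)
    return [rid for rid in record_ids if counts[rid] <= max_uses]
```

A report built on counts like these gives Problem Management evidence for both decisions described in the table: when a workaround deserves a permanent fix, and when a record may be low value, poorly documented, or hard to find.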
Proactive Analysis
Proactive analysis activities are concerned with identifying and resolving problems before they occur, thereby minimising their adverse impact on the service and business as a whole. This can be accomplished by reviewing:
- The current problem record database.
- All escalated events stored in an incident tracking system.
- Corporate error report data.
- Knowledge base articles in an “unknown error” state.
Selecting which problems to attempt to resolve proactively should be based on a number of factors, including:
- The cost to the business.
- The customers affected.
- The volume, duration, and cost of the problems.
- The cost of implementation.
- The likelihood of success.
Using these factors, an algorithm can be created and used to calculate the business impact of support events. This can be a useful, cost-effective way to determine which problems to address.
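One possible shape for such an algorithm is sketched below. The formula is an assumption made for illustration, not a MOF-defined calculation: it estimates the yearly cost of a problem from its volume, duration, and the customers affected, weights that by the likelihood of success, and subtracts the cost of implementation. All field names and figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProblemCandidate:
    name: str
    occurrences_per_year: int      # volume
    avg_duration_hours: float      # duration
    customers_affected: int
    cost_per_customer_hour: float  # cost to the business
    implementation_cost: float
    likelihood_of_success: float   # between 0.0 and 1.0


def annual_cost(p: ProblemCandidate) -> float:
    """Estimated yearly cost to the business if the problem is left alone."""
    return (p.occurrences_per_year * p.avg_duration_hours
            * p.customers_affected * p.cost_per_customer_hour)


def expected_net_benefit(p: ProblemCandidate) -> float:
    """Expected saving from a proactive fix, minus what the fix costs to implement."""
    return p.likelihood_of_success * annual_cost(p) - p.implementation_cost


def rank(candidates):
    """Order candidates from most to least worthwhile."""
    return sorted(candidates, key=expected_net_benefit, reverse=True)


candidates = [
    ProblemCandidate("Printer driver crash", 5, 1.0, 10, 10.0, 1000.0, 0.9),
    ProblemCandidate("VPN drops at login", 20, 2.0, 50, 10.0, 4000.0, 0.5),
]

for p in rank(candidates):
    print(p.name, round(expected_net_benefit(p), 2))
```

A negative result suggests the problem is better handled reactively as it is encountered.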
Conclusion
The Problem Management SMF addresses how to identify underlying problems to prevent incidents before they occur, especially complex problems that are beyond the scope of a request for incident resolution. Specifically, this SMF addresses how to:
- Document the problem.
- Filter the problem.
- Research the problem.
- Research the outcome.
How can I implement MOF?
By now, we hope you are beginning to see the value that the Microsoft Operations Framework can bring to your business. The goals, outcomes and measures outlined above require many activities and considerations, which form part of our day-to-day activities at First Solution. In fact, we’re experts in MOF and have even developed a unique ITIL IQ® process that benchmarks a business’s current state, identifies its desired state and provides an action plan (called a Service Delivery Plan) that helps organisations of all sizes achieve their desired business outcomes. Most importantly, our unique ITIL IQ® process begins with a Proactive Services Maturity Review (PSMR), which produces a score (out of 100) that clearly communicates the current state of your business’s IT operational maturity. Armed with your ITIL IQ® score, a non-IT professional such as a finance or procurement professional can concisely present to the IT Executive Officer the business’s current state, desired state, and ITIL IQ® score, along with an action plan to improve that score and thereby ensure that IT’s goals are aligned with the goals of the business and that both are progressing together. Once the IT Executive Officer has bought into the MOF concept, we can help to develop an IT service strategy, IT service map, IT service portfolio and service level agreements.
How can I manage IT problems better?
Simply get in touch to arrange a free Proactive Services Maturity Review. One of our MOF experts will conduct an interview with the IT Manager or IT Executive Officer within your business and provide an ITIL IQ® score with which you can measure the performance of your IT function. Once you know your ITIL IQ® score, we can provide a Service Delivery Plan to help you improve it each month, and we will measure and report progress back to you during a Monthly Service Review. And there we have it: an ITIL-based solution that makes it simple to identify and measure the performance of your IT function. So, are you ready to manage IT problems better?
The Microsoft Operations Framework 4.0 is provided with permission from Microsoft Corporation.