An introduction to the MOF Service Monitoring and Control SMF
Our previous blog articles in this series explain the role of the Microsoft Operations Framework (MOF) and its service management functions (SMFs), and introduce ITIL IQ®, which is the first step in implementing MOF within your business. Before you use this SMF, you may want to read the following ITIL IQ® guidance to learn more about the MOF IT service lifecycle and the Operate Phase:
Blog Article 1: What’s your ITIL IQ®? Meet MOF
Blog Article 2: The MOF Plan Phase
Blog Article 7: The MOF Deliver Phase
Blog Article 13: The MOF Operate Phase
The MOF IT service lifecycle encompasses all the activities and processes involved in managing an IT service: its conception, development, operation, maintenance, and ultimately its retirement. MOF organises these activities and processes into Service Management Functions (SMFs), which are grouped together in phases that mirror the IT service lifecycle. Each SMF is anchored within a lifecycle phase and contains a unique set of goals and outcomes supporting the objectives of that phase. An IT service’s readiness to move from one phase to the next is confirmed by management reviews, which ensure that goals are being achieved in an appropriate fashion and that IT’s goals are aligned with the goals of the organisation.
The SMFs can be used as standalone sets of processes, but it is when SMFs are used together that they are most effective in ensuring service delivery at the desired quality and risk levels.
Position of the Service Monitoring and Control SMF Within the MOF IT Service Lifecycle
The Service Monitoring and Control SMF belongs to the Operate Phase of the MOF IT service lifecycle. The following figure shows the place of the Service Monitoring and Control SMF within the Operate Phase, as well as the location of the Operate Phase within the IT service lifecycle.
Figure 1. Position of the Service Monitoring and Control SMF within the IT service lifecycle
Why Use the Service Monitoring and Control SMF?
This SMF should be useful to anyone who is responsible for the real-time observation of, and alerting about, conditions in an IT production environment. Its purpose is to monitor the health of IT services, take remedial actions that minimise incidents and events, and provide trend data for optimising IT service performance.
It specifically addresses how to:
- Define service monitoring requirements.
- Implement a service.
- Conduct continuous monitoring.
Service Monitoring and Control Overview
Service Monitoring and Control (SMC) is the real-time observation of, and alerting about, health conditions (characteristics that indicate success or failure) in an IT environment. It helps to ensure that deployed services are operated, maintained, and supported in line with the service level agreement (SLA) targets agreed between the business and IT. For more information about SLAs and operations level agreements (OLAs), see the Business/IT Alignment SMF.
This SMF describes what is required to successfully implement SMC. The components of this process are:
- Establishing a service monitoring function.
- Understanding the nature of new and existing IT services.
- Understanding the requirements for successful service monitoring tools.
- Ensuring that all relevant information from service monitoring is acted upon by the appropriate people.
- Generating all the information required by other SMFs.
- Improving the quality of service information.
The importance of effective service monitoring cannot be overemphasised. If a service can’t be monitored, it can’t be measured, and if it can’t be measured, it can’t be managed.
Service Monitoring and Control Role Types
The primary team accountability that applies to the Service Monitoring and Control SMF is the Operations Accountability. The role types within that accountability and their primary activities within this SMF are displayed in the following table.
Table 1. Operations Accountability and Its Attendant Role Types
|Role Type||Responsibilities||Role in This SMF|
|Monitoring Manager||· Responsible for SMC SMF tasks
· Ensures that the right systems are monitored
· Facilitates effective monitoring mechanism
· Is expert on how to monitor, not what to monitor.
|· Monitors IT service health
· Helps define IT service to be monitored
· Helps prepare service component health model
|Scheduling Manager||· Plans schedule of individual activities within operations
· Owns timing decisions
· Plans operational work, including maintenance
|· Avoids conflicting work|
|Operations Manager||· Is accountable for Operations SMF and Service Monitoring and Control||· Oversees
· Drives definition of IT service to be monitored
· Drives preparation of service component health model
Goals of Service Monitoring and Control
The goals of service monitoring and control include the following:
- Observe the health of IT services.
- Take remedial actions that minimise the impact of service incidents and system events.
- Understand the infrastructure components responsible for the delivery of services.
- Provide data on component or service trends that can be used to optimise the performance of IT services.
Table 2. Outcomes and Measures of the Service Monitoring and Control SMF Goals
|Improved overall availability of services||Percent of time service is available|
|A reduction in the number of SLA and OLA breaches||Number of breaches to SLAs and OLAs|
|A reduction or prevention of service incidents through the use of proactive remedial action||Number of service incidents|
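The first measure in the table, the percentage of time a service is available, can be worked out from recorded outage windows. The following is a minimal sketch only: the function name, its parameters, and the idea of recording outages as start and end times are assumptions made for illustration, not part of MOF.

```python
from datetime import datetime


def availability_percent(period_start, period_end, outages):
    """Return the percentage of the reporting period the service was up.

    `outages` is a list of (start, end) datetime pairs. Overlapping
    outages are merged so that downtime is not counted twice.
    """
    total = (period_end - period_start).total_seconds()
    if total <= 0:
        raise ValueError("period_end must be after period_start")

    # Keep only outages that touch the period, clipped to its boundaries.
    clipped = sorted(
        (max(start, period_start), min(end, period_end))
        for start, end in outages
        if end > period_start and start < period_end
    )

    down = 0.0
    current_start = current_end = None
    for start, end in clipped:
        if current_end is None or start > current_end:
            # A new, separate outage begins: bank the previous one.
            if current_end is not None:
                down += (current_end - current_start).total_seconds()
            current_start, current_end = start, end
        else:
            # This outage overlaps the previous one: extend it.
            current_end = max(current_end, end)
    if current_end is not None:
        down += (current_end - current_start).total_seconds()

    return 100.0 * (total - down) / total
```

Calling `availability_percent(datetime(2024, 4, 1), datetime(2024, 5, 1), outages)` with a month's outage list gives a single figure that can be compared against the SLA target for the same period.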
The following table contains definitions of key terms found in this guide.
Table 3. Key Terms
|Aggregation||A function that makes it possible to treat a series of similar events as a single event.|
|Alert||A notification that an event requiring attention has occurred.|
|Configuration item (CI)||An IT component that is under configuration management control.|
|Correlation||A function that groups events together or defines an event’s relationship with other events that together represent an impact.|
|Event||An occurrence within the IT environment detected by a monitoring tool.|
|Health model||A definition of CI health categorised by availability, configuration, performance, or security.|
|IT Control||A specific activity performed by people or systems designed to ensure that business objectives are met.|
|Reporting||The collection, production, and distribution of information about IT services.|
|Resolution completion||The point in the control process where manual/automatic action has been taken and all recording and incident management have been completed.|
|Rules||A predetermined policy that describes the provider (the source of data), the criteria (used to identify a matching condition), and the response (the execution of an action).|
|Threshold/criteria||A configurable value above which something is true and below which it is not.|
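Several of these terms fit together naturally: a rule combines a provider, criteria (often a threshold), and a response, while aggregation collapses similar events into one. The sketch below illustrates that relationship under stated assumptions. The class names, field names, and the sample metric are hypothetical and do not come from MOF or from any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Event:
    source: str   # the provider that emitted the event (hypothetical field)
    ci: str       # configuration item the event relates to
    metric: str
    value: float


@dataclass
class Rule:
    provider: str                        # the source of data
    criteria: Callable[[Event], bool]    # identifies a matching condition
    response: Callable[[Event], str]     # the action taken on a match

    def evaluate(self, event: Event) -> Optional[str]:
        """Run the response if the event comes from this rule's provider
        and satisfies its criteria; otherwise return None."""
        if event.source == self.provider and self.criteria(event):
            return self.response(event)
        return None


def above_threshold(metric: str, threshold: float) -> Callable[[Event], bool]:
    """Build criteria that are true only above the configured value."""
    return lambda e: e.metric == metric and e.value > threshold


def aggregate(events):
    """Treat events with the same CI and metric as a single event,
    returning how many times each one occurred."""
    counts = {}
    for e in events:
        key = (e.ci, e.metric)
        counts[key] = counts.get(key, 0) + 1
    return counts
```

A rule built as `Rule("perf-counter", above_threshold("cpu_percent", 90.0), some_response)` would raise an alert for a reading of 95 but stay silent for a reading of exactly 90, matching the table's definition of a threshold as a value above which something is true.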
Service Monitoring and Control Management Flow
Figure 2. Service Monitoring and Control management flow
Process 1: Define Service Monitoring Requirements
Figure 3. Define service monitoring requirements
Activities: Define Service Monitoring Requirements
Before introducing a new service into the IT environment, the SMC team needs to determine what is required to monitor the health of the service. The SMC team works with those who will release the new service, and with those responsible for its ongoing operation after release to the production environment, to identify needs and dependencies, breaking the service down into steps to ensure accurate monitoring. This information is used to create a health model, which defines whether a system is healthy (that is, operating within normal conditions) or whether it has failed or degraded. This model becomes the basis for the system events and instrumentation on which monitoring and automated recovery are built.
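To make the idea of a health model more concrete, the sketch below represents each configuration item (CI) with a state for each of the four categories used in this guide (availability, configuration, performance, security) and rolls health up through dependencies. It is an illustrative assumption about how such a model could be represented. The three-level health scale, the class names, and the roll-up rule of taking the worst state are not prescribed by MOF.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Health(IntEnum):
    HEALTHY = 0
    DEGRADED = 1
    FAILED = 2


CATEGORIES = ("availability", "configuration", "performance", "security")


@dataclass
class ConfigurationItem:
    name: str
    # One health state per category; every category starts healthy.
    states: dict = field(
        default_factory=lambda: {c: Health.HEALTHY for c in CATEGORIES}
    )
    depends_on: list = field(default_factory=list)

    def own_health(self) -> Health:
        """Worst state across this CI's own categories."""
        return max(self.states.values())

    def rolled_up_health(self, _seen=None) -> Health:
        """Worst state of this CI and everything it depends on."""
        seen = _seen if _seen is not None else set()
        if self.name in seen:
            return Health.HEALTHY  # already counted; also guards against cycles
        seen.add(self.name)
        worst = self.own_health()
        for dep in self.depends_on:
            worst = max(worst, dep.rolled_up_health(seen))
        return worst
```

With a web server CI that depends on a database CI, marking the database's availability as failed leaves the web server's own health unchanged but makes its rolled-up health failed, which is the kind of end-to-end view the service definition calls for.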
This process includes the following activities:
- Define the IT service to be monitored.
- Prepare the service component health model.
- Review the reliability requirements.
The following table describes these activities in greater detail.
Table 4. Activities and Considerations for Defining Monitoring Requirements
|Define IT service to be monitored||Key questions:
· Is this a new service or an extension of an existing one?
· What does the service do?
· What are the service’s technology components and their dependencies?
· Who are the users?
· How important is the service to the business?
· How dependent is the business on this service?
· How is this service dependent on or related to other IT services?
· Are any service level requirements in place?
· Configuration description from the configuration management system (CMS).
· Functional requirements for the IT service
· Operations requirements
· Non-functional requirements for the IT service
· Operations plan
· Service Catalogue
· SLAs, OLAs, underpinning contracts (UCs): if nothing exists, use key questions from this process. For more information about these, see the Business/IT Alignment SMF.
· Forward Schedule of Change (FSC). See the Deploy SMF for more information.
· IT service descriptions:
· Technical (technologies and dependencies)
· Organisational (groups dependent on the service)
· Understand the service’s importance to the business.
· Document the service end to end to ensure that it is monitored as a whole, not just as a group of components.
· To maximise availability, document and understand the service’s dependencies on other services.
· List all stakeholders of a specific service.
· Create a set of basic key performance indicators (KPIs) for all IT services so that basic measurements and comparisons can be done among all IT services.
|Prepare service component health model||Key questions:
· Which configuration items (CIs) make up the service? How are they related?
· Should the system monitor for specific failure scenarios?
· For each failure event, is there also a way to determine when the failure has stopped or has been fixed?
· Which of these scenarios are related to availability? Configuration? Performance? Security?
· Is the CI dependent on other CIs with which it communicates?
· Which events have an impact on a CI’s availability (for example, a service stoppage)?
· Which events have an impact on a CI’s performance (for example, a CPU has insufficient capacity)?
· Which events have an impact on a CI’s configuration (for example, a service pack has not been installed)?
· Which events have an impact on a CI’s security (for example, access denied)?
· Does the severity of the event match the impact on the CI? How are events categorised?
· Can sub-components and dependencies be defined so that the failure explanation is more precise?
· Are there any mission-critical dependencies on other CIs, such as operating systems, hardware, network, or SAN?
· Is there a way to predefine whether or not the CI is healthy?
· Does the event message explain clearly what the problem is? Does it offer a solution?
· Can any events or scenarios cause event storms? Event storms are a high volume of events that are logged in a monitoring database and overload the database administrator console. How can the IT team avoid event storms?
· Has the SMC team created instrumentation guidelines for the application or infrastructure configuration?
· Does the service use clustering? Mirroring?
· Events grouped by health model definition
· Relationships to other CIs
· Alert and event definitions for all CIs
· Relationships to other CIs and how these affect each other
· A service model that defines all CIs for the application and their relationship to other CIs
· A complete health model describing each CI error description and troubleshooting hints for every type of CI alert
· A definition of availability for CI via a health model
· Reporting needs for IT services
· The monitoring team and the development team should agree on standards for such items as CI definition, the preferred way of instrumenting the application, format logging design, performance counters, synthetic transactions, and reporting.
· Develop the monitoring definition while the service itself is being developed. This way, the definition will be ready to implement when the service is released.
|Review reliability requirements||Key questions:
· Is service monitoring done internally or externally?
· Are team members trained in SMC for the new services?
· Is the IT service documented?
· What are the monitoring requirements from the Support group?
· What are the monitoring requirements from the Release group?
· Requirements, data, and KPIs from other IT functions, including availability, capacity, problem management, incident management, and service continuity. See the Reliability SMF for more information.
· IT organisational diagrams
· Current SMC job descriptions.
· SMC process document
· Organisational structure that supports the entire SMC process
· Required information from other SMFs in terms of reports and statistics should be understood and documented.
· Understand the relationships between SMC and other IT functions and processes.
Process 2: Implement New Service
Figure 4. Implement new service
Activities: Implement New Service
Successfully implementing a new service requires ensuring that it aligns with what is already in place. The first activity ensures that the new service meshes with existing IT processes and functions. The second concerns the service’s impact on the people within IT. The third brings the service in line with existing SMC tools.
Implementing a new service involves the following activities:
- Align new IT service to existing processes and functions.
- Align new IT service to existing IT organisation.
- Align the new IT service to existing SMC tools.
The following table describes these activities in greater detail.
Table 5. Activities and Considerations for Implementing Service
|Align new IT service to existing processes and functions||Key questions:
· Does the new service require any changes to existing process descriptions?
· Will the new IT service affect other SMFs (which will then require SMC process description changes)?
· Will the new service change existing or add new SLAs, OLAs, or UCs?
· Will the new service necessitate changes to existing escalation policies?
· What are the KPIs for the new IT service?
· How will the new service be monitored after it is in production? For example, can an SMC agent be installed to reflect the health model, or are there restrictions as to what can be installed locally? Is an agent needed?
· Is there a plan for how unforeseen errors can be incorporated quickly into the monitoring process after release?
· Existing process descriptions
· IT service description
· New or updated service requirements (SLAs, OLAs, UCs)
· KPIs for the new service
· Updated SMC process descriptions
· Updated SMC policies, procedures, and standard operations procedures
· Check all existing escalation routines for workflow changes. Ensure that there are documented escalation routines if they don’t already exist.
· Test the service several times before placing it into production to ensure that every part of the process description is in place and that the process maps correctly to the workflows.
· Document the service end-to-end to ensure that it is monitored as a whole, not just as a group of components.
|Align new IT service to existing IT organisation||Key questions:
· Which group(s) will do the monitoring?
· Who will be responsible for the new IT service?
· Is the existing organisational structure sufficient to handle the increased workload?
· What type of training will team members receive?
· Is all relevant documentation up to date?
· Have all service descriptions been updated, including organisational details such as contact information, hours of operation, and service windows?
· Updated SMC process description
· Existing organisational structure and organisation chart
· List of people involved in the SMC function and their job descriptions
· Updated organisational structure
· Updated service descriptions with contact information
· Training plan and training material
· Updated job descriptions
· Updated information related to escalation policies
· Ensure that proper training is given to the appropriate people. Document and understand all organisational dependencies.
· Make sure that all team members understand day to day SMC roles and responsibilities.
|Align new IT service to existing SMC tools||Key questions:
· Is monitoring currently done at the component level or at the IT business service level?
· What load will monitoring put on the servers?
· Should all technologies be monitored, or only a subset?
· Do the existing SMC tools have the capability to monitor all technologies and platforms (network, hardware, OS, middleware, application) according to SLA and monitoring requirements?
· Are there alternative solutions for the technologies and platforms that cannot be monitored?
· Does existing documentation describe the design and configuration for IT services?
· Does the existing service monitoring tool support industry standards (for example, Service Monitoring Language or Service Definition Model)?
· Will monitoring be agent based or agentless?
· How many technologies and solutions are used to monitor IT services?
· Is there a standard for documenting CIs and services?
· Is there a description of how the SMC systems are configured?
· Can synthetic transactions be defined to monitor end to end scenarios?
· Should we be monitoring any sub services that we do not control?
· Can the monitoring system handle the new IT service requirements (availability, performance load on monitoring system, reporting capabilities)?
· Are there any infrastructure constraints that will prevent monitoring (network access, server access, user access)?
· Is the IT system tuned according to SMC standards?
· Can the role responsible for the IT service be defined within SMC?
· Can any fixes be automated?
· Who has access to information about the service?
· Definition of IT service monitoring according to SLA requirements
· Requirements from other SMFs
· Alert and event definition according to CIs in the IT service
· A monitoring service model describing all CIs for the application and its relationship to other CIs
· A complete health model describing each CI with a list of sub components
· Availability defined and measured for every CI and service via a health model
· Incident categorisation, aggregation, and correlation guidelines for SMC tool
· Platform and applications requirements for monitoring system
· IT service reporting and availability requirements for the monitoring tool
· Alerting and health requirements for the monitoring tool
· Infrastructure requirements for monitoring the IT service
· Operational guidance from vendor, if applicable
· Monitoring requirements from operations plan
· IT service monitoring requirements defined in the monitoring tool
· IT service monitoring requirements defined for manual handling
· Service model (distributed application) defined in the monitoring tool
· Ability to generate reports according to SMF requirements (SLA, availability, capacity, KPI)
· User roles for the IT service defined in the SMC tool
· All CIs monitored by the SMC tool
· Knowledge defined for IT service alerts, such as a checklist that defines what actions can be taken to solve issues related to incoming alerts and events
· Views and tasks defined for the IT service
· Alerts and states tuned as required by the IT service
· SMC tool with automatic actions and manual tasks defined for IT service
· Tune the system so the alert or state showing at the SMC system is actionable, informational, and relevant.
· In the event of an alert, make sure that as much information and guidance as possible is available to the monitoring console user.
· Ensure that the console user sees only relevant alerts.
· Ensure that error descriptions and troubleshooting hints are available for every alert that can come from a CI.
Process 3: Continuous Monitoring
Figure 5. Continuous monitoring
Activities: Continuous Monitoring
The third process in SMC occurs after any monitoring tool being used is in place. When an event occurs, a notification is received, either by a dedicated SMC group or by a related group that has SMC responsibilities. After analysis, the event is either solved or escalated to a higher level for eventual solution.
This process involves the following activities:
- Receive notification.
- Analyse the event.
- Solve or escalate the event.
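The three activities above can be pictured as a simple decision flow. The following is a rough sketch under stated assumptions: the `Alert` fields, the severity scale, the sample knowledge base entry, and the decision to suppress alerts for CIs under planned change are all hypothetical choices made for illustration and are not defined by MOF.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    ci: str
    description: str
    severity: int  # hypothetical scale: 1 = low, 3 = critical


# Hypothetical knowledge base mapping a known problem to its documented fix.
KNOWN_FIXES = {"disk full": "purge temp files"}


def handle(alert, known_fixes=KNOWN_FIXES, maintenance_cis=frozenset()):
    """Receive an alert, analyse it, then solve or escalate it.

    Returns a (outcome, detail) pair where outcome is one of
    "suppressed", "solved", or "escalated".
    """
    # Analyse: is a planned change on this CI the likely cause?
    if alert.ci in maintenance_cis:
        return ("suppressed", "planned change in progress")

    # Analyse: is a known problem causing the event?
    fix = known_fixes.get(alert.description)
    if fix is not None:
        return ("solved", fix)

    # No known resolution: escalate as an incident.
    return ("escalated", f"incident raised at severity {alert.severity}")
```

In practice the knowledge base and the list of CIs under change would come from the Service Desk tool and the change schedule rather than from in-memory values as shown here.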
The following table describes these activities in greater detail.
Table 6. Activities and Considerations for Continuous Monitoring
|Receive notification||Key questions:
· Who should receive alerts?
· Do incoming alerts require 24/7 support and, if so, who should handle them?
· Is there a dedicated SMC group, or is monitoring handled by other departments, such as the Service Desk or Operations?
· Is there a need for correlating events? Correlating events allows for an end to end look at related events and makes troubleshooting easier.
· Have events historically been regarded as incidents, and has the incident management process handled the incident to analyse and resolve events/incidents?
· Is there a connector between the monitoring system and the Service Desk tools or will alerts be transferred manually?
· Do other departments or resources work on a given problem?
· Are automated solutions applied?
· Can alerts automatically be solved and closed?
· How are alerts communicated to groups (via pager, text message, monitoring console, e-mail)?
· IT services configured in the monitoring tool
· Role descriptions
· SMC policies and procedures
· Incident information
· Event information
· Alert information
· If something needs immediate attention, ensure that there is a way to prioritise it.
|Analyse event||Key questions:
· Who is primarily responsible for event analysis?
· Who is responsible for handling “noise” reduction—for clearing out events that aren’t real and that should be removed from view?
· Is a known problem causing the event?
· Is there clear, easily accessible information available about possible solutions?
· Is the event description understandable?
· Have there been other alerts about the same problem?
· Can certain manual tasks help solve the problem?
· Does any tool used by the Service Desk contain procedures for covering this incident?
· Are there any changes planned for the IT service or for CIs of the IT service?
· Is the event actionable? Is it valid?
· Can the alert be tuned? Alert tuning is the adjustment of a service monitoring tool for a lower level of alert noise to reduce the number of false alerts.
· Is the impact to the IT service clearly understood and communicated in the SMC tool?
· Information about event resolution
· Description of the event
· Open problems
· Open incidents
· Open changes
· Information from other teams
· Event is solved
· Event escalated as an incident and its severity raised, with possible transfer to another team
· Ensure that all alerts are understandable, relevant, and up to date.
|Resolve or escalate event||Key questions:
· Who has authority to escalate events?
· Who receives the escalated event?
· How can we ensure that the receiver takes ownership of the event? If the receiver can’t, is there an alternate individual or team to call upon?
· Which events should be subject to 24/7 escalation?
· Was the event resolved through the use of a knowledge base? Product knowledge? Other approaches?
· Should the alert threshold be tuned or updated?
· Updated knowledge about alerts
· Input for tuning the alerts
· Additional error description of the alert for further troubleshooting
· Description of previous activities (if problem is not solved)
· Escalated alerts
· Solved alerts
· Encourage each individual on the alert escalation chain to provide input and knowledge.
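Both alert tuning and the avoidance of event storms come down to limiting how many near-identical alerts reach the console. One possible approach, sketched below, is a sliding-window limit per CI and alert type. The class name, its parameters, and the choice of a sliding window are assumptions for illustration. MOF does not prescribe a specific technique, and real monitoring tools provide their own suppression settings.

```python
from collections import deque


class StormGuard:
    """Allow at most `max_events` alerts per key within `window_seconds`."""

    def __init__(self, max_events, window_seconds):
        self.max_events = max_events
        self.window = window_seconds
        self._times = {}  # key -> timestamps of recently allowed alerts

    def allow(self, key, timestamp):
        """Return True if the alert should be shown, False if suppressed.

        `key` identifies similar alerts, for example (ci, alert_type).
        `timestamp` is in seconds and is expected to be non-decreasing.
        """
        times = self._times.setdefault(key, deque())
        # Forget alerts that have fallen out of the window.
        while times and timestamp - times[0] >= self.window:
            times.popleft()
        if len(times) >= self.max_events:
            return False
        times.append(timestamp)
        return True
```

Suppressed alerts are simply dropped in this sketch. A production setup would more likely count them and attach the count to the alert that is shown, so that the console user can still see the scale of the problem.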
Process 4: Control and Reporting
Figure 6. Control and reporting
Activities: Control and Reporting
The fourth SMC process, Control and Reporting, involves generating information for the entire IT organisation and ensuring that ongoing monitoring is doing its intended job.
This process consists of the following activities:
- Produce reports and statistics.
- Conduct Operational Health management review (MR).
- Plan and execute service improvements.
The following table describes these activities in detail.
Table 7. Activities and Considerations for Control and Reporting
|Produce reports and statistics||Key questions:
· What kind of monitoring and control information has been requested?
· What critical success factors (CSFs) and KPIs need to be measured?
· Is the monitoring tool configured to produce the necessary reports and statistics?
· Are there any analysis measures in place?
· How are reports distributed?
· KPI requirements
· CSF requirements
· Develop a strong working knowledge about what kind of information is required by other parts of IT.
· Use automatically generated reports wherever possible to save time and labour.
|Conduct operational health management review||Key questions:
· In what meeting format are operational health management reviews conducted?
· Who is responsible for running the review?
· What should be on the review agenda?
· Are meetings conducted across IT services, or only with Operations?
· Who provides input for the review?
· Input to the review agenda from IT Service Manager and key Operations staff
· List of corrective actions and projects needed to improve the quality of delivered IT services, processes, and technologies (Service Desk and monitoring tools).
· Do not confuse the Operational Health management review with a service alignment management review meeting, where IT management would normally evaluate how well services have been delivered (and whether SLAs have been met). Operational Health management reviews focus on the effectiveness of Operations management. For more information about the Operational Health MR, see the Operate Overview. For more information about the Service Alignment MR, see the Plan Overview.
|Plan and execute service improvements||Key questions:
· Which areas need improvement?
· Who is responsible for improved service?
· Are improvements related to:
· People, roles, organisational responsibilities?
· Processes, procedures, documentation, policies, standards?
· Services, technology upgrades and improvements?
· Reports and statistics
· Feedback gathered at review
· Service improvement plans
· Follow up to ensure that findings from reviews and reports actually contribute to service improvements.
The Service Monitoring and Control SMF addresses how to conduct real time observation of and alerting about health conditions in an IT production environment for the purpose of monitoring the health of IT services, taking remedial actions to minimise incidents and events, and providing trending data for optimising IT service performance.
It specifically addresses how to:
- Define service monitoring requirements.
- Implement a service.
- Conduct continuous monitoring.
How can I implement MOF?
By now, you should be beginning to understand the value that the Microsoft Operations Framework can bring to your business. The goals, outcomes and measures outlined above require many activities and considerations, which form part of our day-to-day work at First Solution. In fact, we’re experts in MOF and have developed a unique ITIL IQ® process that benchmarks a business’s current state, identifies its desired state and provides an action plan (called a Service Delivery Plan) that helps organisations of all sizes achieve their desired business outcomes. Most importantly, our ITIL IQ® process begins with a Proactive Services Maturity Review (PSMR), which produces a score (out of 100) that clearly communicates the current state of your business’s IT operational maturity. Armed with your ITIL IQ® score, a non-IT professional such as a finance or procurement professional can concisely present to the IT Executive Officer the business’s current state, desired state, and ITIL IQ® score, along with an action plan to improve that score. This helps ensure that IT’s goals are aligned with the goals of the business and that both are progressing together. Once the IT Executive Officer has bought into the MOF concept, we can help to develop an IT service strategy, IT service map, IT service portfolio and service level agreements.
How can I monitor and control better IT services?
Simply get in touch to arrange a free Proactive Services Maturity Review and one of our MOF experts will conduct an interview with the IT Manager or IT Executive Officer within your business and provide an ITIL IQ® score with which you can measure the performance of your IT function. Once you know your ITIL IQ® score we can provide a Service Delivery Plan to help you improve it each month and measure and report progress back to you during a Monthly Service Review. And there we have it, an ITIL based solution to simply identify and measure the performance of your IT function. So, are you ready to monitor and control better IT services?
The Microsoft Operations Framework 4.0 is provided with permission from Microsoft Corporation.