Document control

Area: PM
Procedure status: FINALISED
Owner:
Approval status:
Approved version and date: v.41

Statement

Procedure describing how a problem is identified through trend analysis of incidents and how it can be registered as a known error.

Overview

This procedure explains how problems may be identified through trend analysis of incidents and how they can be registered as known errors.

The EGI Federated KEDB is hosted on the following page: Known Error Database (KEDB).

Each known error is registered as a Jira ticket in the Jira project EGIKEDB.

Known errors are infrequent, and their history can remain useful even after the underlying problem has been resolved; hence problems are not removed after permanent resolution but are instead marked as FIXED.

In the past, the KEDB only included problems affecting the middleware deployed in the EGI Infrastructure. It served as an easy point of reference where resource centre administrators and supporters could find explanations, workarounds, and solutions to recurring incidents caused by installation and configuration errors, bugs, etc. Services and problems not strictly related to technology were out of scope of the KEDB. For this reason an additional KEDB (the internal KEDB) was created, allowing EGI Foundation staff to report problems for services that are not properly represented in the federated KEDB. Access to the internal KEDB is restricted to EGI Foundation staff, and its purpose is to track problems that go beyond the technology aspect.

Definitions

Please refer to the EGI Glossary for the definitions of the terms used in this procedure.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Entities involved in the procedure

  • DMSU (2nd level of user support in GGUS)
  • the appointed SU in the EGI Helpdesk
  • EGI Operations

Triggers

  • a problem is identified by support teams of the EGI Helpdesk when an incident is resolved, but its root cause is not clearly identified
  • a problem is identified by support teams of the EGI Helpdesk when an incident is resolved, but the same incident reappears from time to time on the infrastructure (suggesting that the underlying problem still exists)
  • trend analysis of incidents performed by experts (internal experts, or even suppliers and customers) reveals that an underlying problem exists; this is done by the 2nd level of support (DMSU), which holds monthly meetings where recent new incidents are analysed to identify trends indicating underlying problems
  • automated detection by monitoring tools raises an incident that may reveal the need for a problem record
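The trend analysis mentioned in the triggers above can be sketched as a simple grouping of recent incidents by a common denominator. This is only a minimal illustration and not the tooling actually used by DMSU: the `Incident` record, its fields (`ticket_id`, `product`, `symptom`, `root_cause_known`), and the `min_occurrences` threshold are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    # Hypothetical record; the field names are assumptions, not GGUS fields.
    ticket_id: str
    product: str
    symptom: str
    root_cause_known: bool


def find_problem_candidates(incidents, min_occurrences=2):
    """Group incidents by (product, symptom) and return the groups that
    recur or that contain an incident whose root cause is unclear."""
    groups = defaultdict(list)
    for incident in incidents:
        groups[(incident.product, incident.symptom)].append(incident)

    candidates = {}
    for key, members in groups.items():
        recurring = len(members) >= min_occurrences
        unclear = any(not m.root_cause_known for m in members)
        if recurring or unclear:
            candidates[key] = [m.ticket_id for m in members]
    return candidates


# Placeholder data for illustration only.
recent = [
    Incident("T-1", "product-a", "job submission fails", True),
    Incident("T-2", "product-a", "job submission fails", True),
    Incident("T-3", "product-b", "login timeout", True),
    Incident("T-4", "product-c", "stale cache", False),
]
candidates = find_problem_candidates(recent)
```

Each returned group is only a candidate for discussion; whether it becomes a problem record remains a decision of the support teams.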

Steps for Middleware/Technology related problems

Step 1
Responsible: DMSU
Action: Analyse recent new incidents to find common denominators that identify problems, or resolved incidents whose root cause is unclear and deserves further investigation.
Notes: If, after investigation of the reported issue, no solution can be provided at this point, the ticket is reassigned either to the 3rd level SUs or to the most suitable SU, so that the experts can study the reported issue and investigate a solution.

Step 2a
Responsible: DMSU
Action: Send an analysis report to EGI Operations for evaluation and discussion. The report can be examined by email or discussed directly at the EGI Operations monthly meetings (Operations Meeting).

Step 2b
Responsible: EGI Operations
Action: Analyse the ISRM Reports (restricted) for any anomalies; evaluate any other notification about possible problems sent by email, ticket, or other channels.

Step 3
Responsible: EGI Operations / DMSU
Action: If a new problem is identified and not permanently resolved, create a corresponding Jira ticket to report it and track its evolution, providing values for the following fields:

  • Subject (a quick description of the problem)
  • Services affected (select a value in the Components field)
  • Middleware product
  • Entities impacted
  • Description
  • GGUS reference tickets (links to related incident tickets)
  • Workarounds (if any)

The Jira ticket is assigned to DMSU staff.

Not all of this information may be available when the entry is created in the KEDB, depending on the investigation stage of the problem.

Although a known error implies the availability of a workaround, a problem can be added to the Known Error Database even if no workaround is available, just for reference purposes.

Step 4
Responsible: appointed SU
Action: When a solution to a problem is available with a new release of a service / service component, trigger the change:

  • for a change affecting a service under the scope of the CHM process, the procedure CHM1 (Manage changes including emergency changes) is used.
  • for a change affecting other services (typically middleware technology), the procedure PROC25 in the EGI Wiki is used: a ticket is opened to inform the UMD team and start the release process for including the new version of the given middleware product.

Step 5
Responsible: EGI Operations / DMSU
Action: Review the KEDB at the EGI Operations monthly meetings (Operations Meeting).
Notes: Problems that have a solution are marked as SOLVED. Problems for which no solution will be provided are marked as UNSOLVED.
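As an illustration of step 3, the fields requested for a known error entry can be modelled as a small record, with a helper that reports which fields are still empty. This is a sketch under stated assumptions: the class `KnownErrorEntry`, its attribute names, and the helper `missing_fields` are hypothetical and do not correspond to the actual field identifiers of the EGIKEDB Jira project.

```python
from dataclasses import dataclass, field, fields


@dataclass
class KnownErrorEntry:
    # Attribute names mirror the fields listed in step 3; they are
    # illustrative and are not the identifiers used in the Jira project.
    subject: str = ""
    services_affected: list = field(default_factory=list)
    middleware_product: str = ""
    entities_impacted: str = ""
    description: str = ""
    ggus_reference_tickets: list = field(default_factory=list)
    workarounds: str = ""


def missing_fields(entry):
    """Return the names of the fields that are still empty."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name)]


# An entry created early in the investigation, with placeholder values.
entry = KnownErrorEntry(
    subject="Example problem (placeholder)",
    services_affected=["example-service"],
    description="Placeholder description of the underlying problem.",
    ggus_reference_tickets=["<incident ticket link>"],
)
still_missing = missing_fields(entry)
```

Empty fields are reported rather than rejected, reflecting that an entry may be registered before all information, including a workaround, is available.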

 

Steps for non-Middleware/non-technical related incidents and problems

The following steps track non-technical incidents and problems (for example coordination, management, or organisation issues).

Step 1
Responsible: EGI Foundation staff
Action: Open an issue (type incident) in the Jira project IMSKEDB, providing a description, selecting the affected service in the "components" field (where applicable, also fill in the "Service component" field), and setting the priority. The issue is assigned automatically to the service owner, unless a specific assignee is specified.
Notes: The incident owner is assigned.

Step 2
Responsible: Incident Owner
Action: Acknowledge the ticket by changing the status to "in progress" and start the investigation of the incident.

Step 3
Responsible: Incident Owner
Action: If a problem is identified, open a new issue in the Jira project IMSKEDB with type problem, providing a description, selecting a Component (where applicable, also fill in the "Service component" field), and setting the priority. The issue is assigned automatically to the service owner, unless a specific assignee is specified.

The incident ticket can be closed when the incident is resolved.

Step 4
Responsible: Problem Owner
Action: If a workaround is available, put a comment in the problem issue, fill in the information in the related field, and change the issue status to "Workaround".

Step 5
Responsible: Problem Owner
Action: When the problem is solved, put a comment in the problem ticket, add the relevant information in the "Solution" field, and change the issue status to "Solved".
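The status changes of a problem issue in steps 4 and 5 can be sketched as a small state machine. This is an illustration only: the initial status name "Open" and the assumption that "Solved" is a final status are not stated in the procedure, and the actual workflow configured in the IMSKEDB Jira project may differ.

```python
# Allowed status changes for a problem issue, following steps 4 and 5.
# "Open" is an assumed name for the initial status, and treating "Solved"
# as final is also an assumption; the procedure states neither.
ALLOWED_TRANSITIONS = {
    "Open": {"Workaround", "Solved"},
    "Workaround": {"Solved"},
    "Solved": set(),
}


def change_status(current, new):
    """Return the new status, or raise ValueError if the change is not allowed."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"cannot move from {current!r} to {new!r}")
    return new


status = "Open"
status = change_status(status, "Workaround")  # step 4: a workaround is available
status = change_status(status, "Solved")      # step 5: the problem is solved
```

A problem may also go directly from "Open" to "Solved" in this sketch, since a workaround is not always available before the solution.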