This procedure explains how problems may be identified through trend analysis of incidents and how problems can be registered as known errors.

The EGI Federated KEDB is hosted on the following page: Known Error Database (KEDB). Each known error is registered as a Jira ticket within the Jira project EGIKEDB. Known errors are not frequent, and their history can be useful even after the underlying problem has been resolved; hence problems are not removed after permanent resolution but are instead marked as FIXED.

In the past, the KEDB only included problems affecting the middleware deployed in the EGI Infrastructure. It served as an easy point of reference where resource centre administrators and supporters could find explanations, workarounds, and solutions to recurring incidents due to installation and configuration errors, bugs, etc. Services and problems not strictly related to technology were, however, out of scope of the KEDB. That is why we decided to create an additional KEDB (the internal KEDB) to take this aspect into account, allowing EGI Foundation staff to report problems for services that are not properly represented in the federated KEDB. Access to the internal KEDB is restricted to EGI Foundation staff, and its purpose is to track problems that go beyond the technology aspect.

Please refer to the EGI Glossary for the definitions of the terms used in this procedure. The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
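The trend analysis of incidents mentioned above can be illustrated with a minimal sketch. It is not the method used by the DMSU; it only shows one way of finding "common denominators" by counting incidents that share the same component and symptom. The `Incident` fields, the `problem_candidates` function, and the threshold of 3 are all assumptions made for this illustration.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Incident(NamedTuple):
    ticket_id: str
    component: str  # hypothetical field: affected service or middleware product
    symptom: str    # hypothetical field: normalised short description of the failure


def problem_candidates(
    incidents: Iterable[Incident], threshold: int = 3
) -> list[tuple[tuple[str, str], int]]:
    """Return (component, symptom) pairs seen in at least `threshold` incidents.

    The result is sorted by descending count, then by key, so the most
    frequent recurring pattern comes first.
    """
    counts = Counter((i.component, i.symptom) for i in incidents)
    return sorted(
        ((key, n) for key, n in counts.items() if n >= threshold),
        key=lambda item: (-item[1], item[0]),
    )


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    recent = [
        Incident("T1", "storage", "timeout"),
        Incident("T2", "storage", "timeout"),
        Incident("T3", "compute", "auth failure"),
        Incident("T4", "storage", "timeout"),
    ]
    for (component, symptom), count in problem_candidates(recent):
        print(f"{component}: '{symptom}' seen in {count} incidents")
```

A pattern returned by such a check is only a candidate: whether it points to an underlying problem still has to be decided by the experts reviewing the incidents.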
Overview
Definitions
Entities involved in the procedure
Triggers

- Trend analysis of incidents performed by experts (internal experts, or even suppliers and customers) reveals that an underlying problem exists. This is done by the 2nd level of support (DMSU), which holds monthly meetings where recent new incidents are analysed to identify trends indicating underlying problems.
- Automated detection by monitoring tools can raise an incident which may reveal the need for a problem record.

Steps for Middleware/Technology related problems

| Step# | Responsible | Action | Notes |
|---|---|---|---|
| 1 | DMSU | Analyse recent new incidents to find common denominators identifying problems, or resolved incidents where the root cause is not clear and deserves further investigation. | If, after investigating a reported issue, no solution can be provided at this point, the ticket is reassigned either to the 3rd level SUs or to the most suitable SU, so that the experts can study the reported issue and work on a solution. |
| 2a | DMSU | Send an analysis report to EGI Operations for evaluation and discussion. | The report can be examined by email or discussed directly at the EGI Operations monthly meetings (Operations Meeting). |
| 2b | EGI Operations | Analyse the ISRM Reports (restricted) for any anomalies; evaluate any other notification sent by email, ticket, or other channels about possible problems. | |
| 3 | EGI Operations / DMSU | If a new problem is identified and not permanently resolved, create a corresponding Jira ticket to report it and track its evolution, providing values for the ticket's fields. | The Jira ticket is assigned to DMSU staff. Not all the information may be available when the entry is created in the KEDB, depending on the investigation stage of the problem. Although a known error implies the availability of a workaround, a problem can be added to the Known Error Database even if no workaround is available, for reference purposes. |
| 4 | appointed SU | When a solution to a problem is available with a new release of a service / service component, trigger the change: use the procedure PROC25 in the EGI Wiki; inform the UMD team by opening a ticket to start the release process for including the new version of the given middleware product. | |
| 5 | EGI Operations / DMSU | Review the KEDB at the EGI Operations monthly meetings (Operations Meeting). | Problems which got a solution are marked as SOLVED. Problems for which no solution will be provided are marked as UNSOLVED. |

Steps for non-Middleware/non-technical related incidents and problems

The following steps are meant to track non-technical (e.g., coordination, management, organisation, etc.) incidents and problems.

| Step# | Responsible | Action | Notes |
|---|---|---|---|
| 1 | EGI Foundation staff | Open an issue (type incident) in the Jira project IMSKEDB, providing a description, selecting the affected service in the "components" field (if applicable, also fill in the "Service component" field), and setting the priority. | The issue is assigned automatically to the service owner, unless a specific assignee is specified. The incident owner is thereby assigned. |
| 2 | Incident Owner | Acknowledge the ticket by changing the status to "in progress", and start the investigation of the incident. | |
| 3 | Incident Owner | If a problem is identified, open a new issue of type problem in the Jira project IMSKEDB, providing a description, selecting a Component (if applicable, also fill in the "Service component" field), and setting the priority. | The issue is assigned automatically to the service owner, unless a specific assignee is specified. The incident ticket can be closed when the incident is resolved. |
| 4 | Problem Owner | If a workaround is available, add a comment to the problem issue, fill in the information in the related field, and change the issue status to "Workaround". | |
| 5 | Problem Owner | When the problem is solved, add a comment to the problem ticket, add the relevant information in the "Solution" field, and change the issue status to "Solved". | |
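As an illustration of the issue-opening steps above, the following sketch builds the JSON body that Jira's REST API (`POST /rest/api/2/issue`) expects when creating an issue. It is only a sketch under stated assumptions: the helper name `build_issue_payload`, the issue type names ("Incident", "Problem"), the priority name, and the example component are hypothetical, and a `summary` is included because Jira requires one even though the procedure only mentions a description. The "Service component" field is left out because it would be a custom field whose identifier is not given in this procedure.

```python
import json
from typing import Optional


def build_issue_payload(
    project_key: str,
    issue_type: str,
    summary: str,
    description: str,
    component: str,
    priority: str,
    assignee: Optional[str] = None,
) -> dict:
    """Build the request body for creating a Jira issue.

    Only standard Jira fields are used. No request is sent; the caller
    decides how to authenticate and submit the payload.
    """
    fields = {
        "project": {"key": project_key},
        "issuetype": {"name": issue_type},      # assumed type name
        "summary": summary,
        "description": description,
        "components": [{"name": component}],    # the affected service
        "priority": {"name": priority},         # assumed priority name
    }
    if assignee is not None:
        # Omit to keep the automatic assignment to the service owner.
        # "name" is used by Jira Server/Data Center; Jira Cloud expects
        # "accountId" instead.
        fields["assignee"] = {"name": assignee}
    return {"fields": fields}


if __name__ == "__main__":
    payload = build_issue_payload(
        project_key="IMSKEDB",
        issue_type="Problem",
        summary="Recurring delays in report delivery",  # invented example
        description="Identified while investigating a related incident.",
        component="Example Service",                     # invented example
        priority="Major",
    )
    print(json.dumps(payload, indent=2))
```

Leaving `assignee` unset mirrors the default described in the steps, where the issue goes to the service owner unless a specific assignee is given.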