Document control
Procedure reviews
The following table is updated after every review of this procedure.
Overview
This document describes how to handle justifications for poor monthly performance.
Links to all monthly statistics are published regularly on the Availability and reliability monthly statistics page.
Definitions
Please refer to the EGI Glossary for the definitions of the terms used in this procedure.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Process of handling RC Availability and Reliability
Availability alarms are raised on the ROD Dashboard (login required) and are intended as a warning to the NGI about the poor performance of a site within the last 30 days.
Entities involved in the procedure
Regional Operator on Duty (ROD): a team provided by each NGI, responsible for handling RC availability and reliability alarms through the Dashboard in the Operations Portal.
Operations: a team provided by the EGI Foundation, responsible for handling under-performing sites that were below the target for 3 consecutive months.
NGI manager: the person who suspends the under-performing site or provides the site's justification.
Triggers
An alarm is raised when the Availability metric has dropped below the threshold of 80% over the last 30-day period.
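The trigger condition can be illustrated with a minimal sketch. The `SiteAvailability` record, the site names, and the figures are hypothetical and only serve the example; the 80% threshold is the one stated above. The sketch does not reflect how the ROD Dashboard actually computes alarms.

```python
from dataclasses import dataclass

ALARM_THRESHOLD = 80.0  # percent, as stated in the Triggers section


@dataclass
class SiteAvailability:
    """Hypothetical record: a site's availability over the last 30 days."""
    site: str
    availability_30d: float  # percent, 0-100


def sites_raising_alarm(records, threshold=ALARM_THRESHOLD):
    """Return the names of sites whose 30-day availability is below the threshold."""
    return [r.site for r in records if r.availability_30d < threshold]


# Illustrative data only; not real site names or figures.
records = [
    SiteAvailability("SITE-A", 92.5),
    SiteAvailability("SITE-B", 79.9),
    SiteAvailability("SITE-C", 80.0),
]
print(sites_raising_alarm(records))  # ['SITE-B']
```

A site exactly at 80% does not raise an alarm in this sketch, because the condition is "dropped below" the threshold.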
Steps
Handling alarms:
Step# | Responsible | Action | Prerequisites, if any | |
---|---|---|---|---|
1 | ROD | Creates a ticket through the dashboard notifying the site administrator that the Availability metric has dropped below the threshold of 80% for the last 30-day period. The expiration date should be set no later than the same date of the next month minus 1 day. | ||
2 | ROD | Escalation of the ticket varies between NGIs. NGIs are free to decide whether to apply an escalation procedure or to treat availability tickets just as a notification for site administrators. | ||
3 | ROD |
|
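The expiration rule in step 1 (same date next month, minus 1 day) can be sketched as follows. The function name `ticket_expiration` is hypothetical. The procedure does not say what to do when the next month has no such date (for example, 31 January), so clamping to the last day of that month is an assumption of this sketch.

```python
import calendar
from datetime import date, timedelta


def ticket_expiration(created: date) -> date:
    """Latest expiration date: same date next month, minus 1 day."""
    year = created.year + (created.month // 12)
    month = created.month % 12 + 1
    # Assumption: if next month has no such day (e.g. 31 January),
    # clamp to that month's last day before subtracting one day.
    last_day = calendar.monthrange(year, month)[1]
    same_date_next_month = date(year, month, min(created.day, last_day))
    return same_date_next_month - timedelta(days=1)


print(ticket_expiration(date(2021, 6, 15)))   # 2021-07-14
print(ticket_expiration(date(2021, 12, 20)))  # 2022-01-19
print(ticket_expiration(date(2021, 1, 31)))   # 2021-02-27
```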
Handling of underperforming sites (below the target for 3 consecutive months):
Step# | Responsible | Action | Prerequisites, if any |
---|---|---|---|
1 | Operations | Creates a GGUS ticket for each under-performing site. See Ticket template. | |
2 | NGI operations manager | Within 10 working days, the NGI operations manager can suspend the site or ask that it not be suspended by providing an adequate explanation. | |
3 | Operations | Sends a direct email to the NGI and to the site contact email (in GOCDB) with a deadline of 2 days for comments. | |
4 | Operations |
|
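The selection criterion and the reply deadline used in this table can be sketched as below. Both function names are hypothetical. The target value of 80% is assumed from the alarm threshold, and the working-day calculation skips only Saturdays and Sundays, since the procedure does not say how public holidays are treated.

```python
from datetime import date, timedelta


def below_target_three_months(monthly_availability, target):
    """True if the last three monthly figures are all below the target.

    monthly_availability: percentages in chronological order.
    """
    last_three = monthly_availability[-3:]
    return len(last_three) == 3 and all(a < target for a in last_three)


def add_working_days(start: date, days: int) -> date:
    """Advance by `days` working days, skipping Saturdays and Sundays.

    Assumption: public holidays are ignored.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


TARGET = 80.0  # assumed target, percent

print(below_target_three_months([85.0, 70.2, 65.0, 78.9], TARGET))  # True
print(below_target_three_months([70.2, 65.0, 90.0], TARGET))        # False
# A ticket opened on Monday 5 July 2021 would be due on Monday 19 July 2021.
print(add_working_days(date(2021, 7, 5), 10))                       # 2021-07-19
```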
Recomputation procedure
In case of doubts about the validity of Availability/Reliability reports, an RC/NGI can request a recomputation according to the procedure defined in PROC10.
Ticket template to NGIs for under-performing RCs
$NGI - June 2021 - RP/RC OLA performance
Dear NGI/ROC,
the EGI RC OLA and RP OLA Report for June 2021 has been produced and is available at the following links:
- NGIs reports: http://argo.egi.eu/egi/report-ar/Critical/NGI?month=2021-06 (clicking on the NGI name displays the resource centres' A/R figures)
- RCs reports: http://argo.egi.eu/egi/report-ar/Critical/SITES?month=2021-06
According to the Service targets reports for Resource infrastructure Provider [1] and Resource Centre [2] OLAs, the following problems occurred:
============= RC Availability Reliability [2]==========
According to the recent availability/reliability report, the following sites performed below the Availability target threshold in 3 consecutive months (April, May, and June):
$SITE
$SITE
$SITE
* During the 10 working days after receiving this ticket, the NGI can suspend the site or ask that it not be suspended by providing an adequate explanation. If no answer is provided to this ticket, the NGI will be contacted by email; if no reply is provided to the email, the site will be suspended [6].
If the NGI intervenes and performance is still below targets 3 days after the intervention, the site will also be suspended.
If you think that the site should not be suspended, please provide a justification in this ticket within 10 working days. If the site performance rises above targets within 3 days of providing the explanation, the site will not be suspended. Otherwise EGI Operations may decide on suspension of the site.
============= Quality of Support [1]==========
According to the recent report, your NGI achieved insufficient Quality of Support performance:
less urgent (expected 5 working days):
urgent (expected 5 working days):
very urgent (expected 1 working day):
top priority (expected 1 working day):
In order to see the details (https://wiki.egi.eu/wiki/Service_Level_Target_-_Quality_of_Support ):
1) go on https://ggus.eu/?mode=report_view
2) choose "response times" or "tickets submitted" and the proper timeframe
3) select your NGI
4) group by Responsible unit and priority
5) click on the lines displayed to get the ticket details
Please investigate the reasons for the situation described above and provide an explanation for each case in this ticket within 10 days of receiving it.
If no explanation is provided, EGI Operations will be informed about the situation and may take further steps.
**********************
Links:
[1] https://documents.egi.eu/public/ShowDocument?docid=463 "Resource infrastructure Provider Operational Level Agreement"
[2] https://documents.egi.eu/public/ShowDocument?docid=31 "Resource Centre Operational Level Agreement"
[3] https://wiki.egi.eu/wiki/PROC01 "EGI Infrastructure Oversight escalation"
[4] https://confluence.egi.eu/display/EGIPP/PROC10+Recomputation+of+SAM+results+or+availability+reliability+statistics "Recomputation of SAM results or availability reliability statistics"
[5] https://wiki.egi.eu/wiki/MAN05 "top-BDII and site-BDII High Availability"
[6] https://confluence.egi.eu/display/EGIPP/PROC04+Quality+verification+of+monthly+availability+and+reliability+statistics "Quality verification of monthly availability and reliability statistics"
Best Regards,
EGI Operations
Known issues and recommendations to NGIs
- Newly certified sites will get inaccurate Availability/Reliability figures for the month they were certified and all months before that. The ARGO Computation Engine takes into account the Certification status of the site in GOCDB in order to decide whether metrics should be calculated for the site. Because the Certification status history is not currently available in the operations tools, until a solution is implemented NGIs should check whether they have sites affected by this issue and report it as an explanation. More information at [1] and [2].
- Recalculation - The calculations performed by ARGO always take into account the information system status and GOCDB information at the time the calculation is performed, and not that of a certain checkpoint in the past. As a result, any complete recalculation risks altering the results for sites that had correct numbers in the first place. Thus, until a solution is found, complete recalculations are avoided whenever possible, and errors are fixed on a per-site basis for those sites whose figures are lower than they should be.
- Weighted availability uses a per-site weight calculated by multiplying the number of logical CPUs a site publishes by the HEPSPEC value it publishes in the BDII. It is important that these numbers are correct: if the HEPSPEC value for a site is too high or too low (for example, because of a mistake), the overall NGI weighted availability will be affected.
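
The effect of a wrong HEPSPEC value described in the last item can be illustrated with a small sketch. The weight (logical CPUs multiplied by HEPSPEC) follows the text above. Aggregating site figures into an NGI figure as a weighted mean is an assumption of this sketch, not a statement of how ARGO computes it, and all names and numbers are hypothetical.

```python
def site_weight(logical_cpus: int, hepspec: float) -> float:
    """Weight of a site: published logical CPUs times published HEPSPEC."""
    return logical_cpus * hepspec


def ngi_weighted_availability(sites):
    """Weighted mean of site availabilities (assumed aggregation).

    sites: list of (availability_percent, logical_cpus, hepspec) tuples.
    """
    total_weight = sum(site_weight(c, h) for _, c, h in sites)
    if total_weight == 0:
        raise ValueError("total weight is zero")
    return sum(a * site_weight(c, h) for a, c, h in sites) / total_weight


# Illustrative figures only.
correct = [(95.0, 1000, 10.0), (60.0, 100, 10.0)]
mistyped = [(95.0, 1000, 10.0), (60.0, 100, 1000.0)]  # HEPSPEC 100x too high

print(round(ngi_weighted_availability(correct), 2))   # 91.82
print(round(ngi_weighted_availability(mistyped), 2))  # 63.18
```

In this example a single mistyped HEPSPEC value on a small, poorly performing site pulls the NGI figure from about 92% down to about 63%.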