Introduction
This page reports the Availability and Continuity (AvCo) plan for the EGI HTC service. It is the result of the risk assessment conducted for this service: a series of risks and threats has been identified and analysed, along with the corresponding countermeasures currently in place.
The EGI High-Throughput Compute (HTC) service is a federated service provided by EGI Federation members. Because a large number of Resource Centres (RCs) are federated within the service, this AvCo plan deviates from the standard SACM template: it addresses the AvCo requirements, performance and risk assessment only in part, from a top-level infrastructure perspective. Furthermore, no continuity/recovery test is needed at the moment, although specific tests may be planned in the future once more precise countermeasures are added.
The aim is to agree on a minimum set of standard measures that each HTC provider should implement in order to guarantee users a sufficient level of continuity of the HTC service. At the same time, this page gives users a number of guidelines on what to do when the HTC provider they use becomes unavailable.
Activity | Last | Next |
---|---|---|
Risk assessment | 2024-08 | 2025 Q4 |
Av/Co plan | 2024-09 | 2025 Q4 |
Availability requirements and performance
The general Service Level Targets are defined in the RC OLA:
- Monthly Availability / Reliability: 80% / 85%
Other availability requirements:
- The service is accessible via EGI Workload Manager
- The service is accessible through EGI Check-in or through VOMS, depending on how the VO serving a specific user community has been configured.
Service availability is regularly tested with the metrics org.nordugrid.ARC-CE-result, org.nordugrid.ARC-CE-IGTF and argo.certificate.validity-htcondorce:
- Status on ARGO UI
Performance reports on Availability and Reliability are produced by ARGO in near real time and are also collected in the Documentation Database on a monthly basis.
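The comparison of monthly figures against the OLA targets above can be sketched as follows. This is a minimal illustration, not ARGO's implementation: the `MonthlyStatus` input format and the function names are hypothetical, and the formulas assume the commonly used definitions in which Availability is uptime over all known time, while Reliability additionally excludes scheduled downtime from the denominator.

```python
from dataclasses import dataclass

# Monthly targets quoted from the RC OLA above (percent).
AVAILABILITY_TARGET = 80.0
RELIABILITY_TARGET = 85.0


@dataclass
class MonthlyStatus:
    """Hours spent in each state during one month (hypothetical input format)."""
    up: float
    down: float             # unscheduled downtime
    scheduled_down: float   # downtime announced in advance
    unknown: float = 0.0    # periods with no monitoring result

    @property
    def total(self) -> float:
        return self.up + self.down + self.scheduled_down + self.unknown


def availability(s: MonthlyStatus) -> float:
    """Assumed definition: uptime / (total - unknown), as a percentage."""
    known = s.total - s.unknown
    return 100.0 * s.up / known if known > 0 else 0.0


def reliability(s: MonthlyStatus) -> float:
    """Assumed definition: uptime / (total - scheduled downtime - unknown)."""
    expected = s.total - s.scheduled_down - s.unknown
    return 100.0 * s.up / expected if expected > 0 else 0.0


def meets_targets(s: MonthlyStatus) -> bool:
    """True when both monthly targets are reached."""
    return (availability(s) >= AVAILABILITY_TARGET
            and reliability(s) >= RELIABILITY_TARGET)


# Example: a 30-day month (720 h) with 48 h unscheduled and 72 h scheduled downtime.
month = MonthlyStatus(up=600, down=48, scheduled_down=72)
print(f"Availability: {availability(month):.2f}%")
print(f"Reliability:  {reliability(month):.2f}%")
print("Targets met:", meets_targets(month))
```

Under these assumed definitions, scheduled downtime lowers Availability but not Reliability, which is why the Reliability target can be set higher than the Availability target.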
Risk assessment and management
For more details, please see the High-Throughput Compute Risk assessment page. A summary of the assessment is reported here.
Risk analysis
Title | Risk description | Affected components of the service | Established measures | Risk level | Expected duration of downtime / time for recovery | Treatment - Protective/mitigation measures - recovery activities - controls |
---|---|---|---|---|---|---|
High-Throughput Compute Risk assessment | Service unavailable / loss of data due to hardware failure | All components | Provide users with more than one place to run their grid jobs (VO SLA OLAs) | (2) Low | 1 or more working days | The measures already in place are considered satisfactory and risk level is acceptable |
High-Throughput Compute Risk assessment | Service unavailable due to software failure | All components | - Provide users with more than one place to run their grid jobs (VO SLA OLAs) - Wait for fixes in middleware and run upgrade campaigns | (2) Low | Up to 1 working day | The measures already in place are considered satisfactory and risk level is acceptable |
High-Throughput Compute Risk assessment | Service unavailable / loss of data due to human error | All | - Users can use other Resource Centres - Use of documentation for sysadmins and service operators | (2) Low | Up to 1 working day | The measures already in place are considered satisfactory and risk level is acceptable |
High-Throughput Compute Risk assessment | Not enough people to maintain and operate the service | Batch system, Computing element | - Ensure minimal support coverage of services (rota) - Users can use alternative supporting sites | (1) Low | Up to 1 working day | The measures already in place are considered satisfactory and risk level is acceptable |
High-Throughput Compute Risk assessment | Major disruption in the data centre. | Batch system, Computing element | Users can access other supporting sites | (2) Low | 1 or more working days | The measures already in place are considered satisfactory and risk level is acceptable |
High-Throughput Compute Risk assessment | Major security incident. The system is compromised by external attackers and needs to be reinstalled and restored. | All | - Follow security advisories of used products/platforms | (4) Medium | Up to 1 working day | The measures already in place are considered satisfactory and risk level is acceptable |
High-Throughput Compute Risk assessment | (D)DoS attack. The service is unavailable because of a coordinated DDoS. | Batch system, Computing element | - Follow security advisories of used products/platforms - RCs are recommended to use the latest software and implement security measures at facility level - RCs to abide by EGI security policies | (4) Medium | 1 or more working days | The measures already in place are considered satisfactory and risk level is acceptable |
Outcome
The established measures and treatments mentioned above are recommendations for RC administrators and users, to be put in place to prevent the risks from occurring. Within a given SLA, user communities can agree additional, specific recovery measures with the RCs involved.
Additional information
- When something does not work as expected, users should create an incident ticket in the EGI Helpdesk to notify the given RC of the malfunction.
- The RC will provide details about the failure and an estimate of when the service will be restored.
- The services can be unavailable due to either planned or unplanned downtime:
  - Users are advised to subscribe to the downtime tool in the Operations Portal to be notified when a downtime affecting either a single RC or a whole VO is announced.
Availability and Continuity test
Not requested at the moment.