Introduction

This page reports on the Availability and Continuity plan for the EGI HTC service. It is the result of the risk assessment conducted for this service: a series of risks and threats has been identified and analysed, along with the corresponding countermeasures currently in place.

The EGI High-Throughput Compute (HTC) service is a federated service provided by EGI Federation members. Because of the high number of Resource Centres (RCs) federated within the service, this AvCo plan deviates from the standard SACM template: it covers only part of the AvCo requirements, performance and risk assessment, from a top-level infrastructure perspective. Furthermore, at the moment there is no need for a continuity/recovery test, although specific tests can be planned in the future once more precise countermeasures are added.

The aim is to agree on a minimum set of standard measures that each HTC provider should implement to guarantee users a sufficient level of continuity of the HTC service. The plan also gives users guidelines on what to do when the HTC provider they use becomes unavailable.

	Last	Next
Risk assessment	2024-08	2025 Q4
Av/Co plan	2024-09	2025 Q4

Availability requirements and performance

The general Service Level Targets are defined in the RCs OLA:

Monthly Availability / Reliability: 80% / 85%
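
As an illustration of how a monthly figure can be checked against these targets, the following is a minimal sketch. It assumes the commonly used definitions (availability = uptime / (total time - unknown time); reliability = uptime / (total time - unknown time - scheduled downtime)); the RC OLA and ARGO remain the authoritative sources. The `MonthlyHours` type, its field names and the sample numbers are hypothetical and do not come from ARGO.

```python
from dataclasses import dataclass

# Monthly targets quoted above, in percent.
AVAILABILITY_TARGET = 80.0
RELIABILITY_TARGET = 85.0


@dataclass
class MonthlyHours:
    """Hypothetical monthly status breakdown for one RC, in hours."""
    total: float      # hours in the month
    up: float         # hours in which the service passed its probes
    scheduled: float  # hours of scheduled (announced) downtime
    unknown: float    # hours with no monitoring result


def availability(m: MonthlyHours) -> float:
    # Assumed definition: uptime over the time with a known status.
    known = m.total - m.unknown
    return 100.0 * m.up / known if known > 0 else 0.0


def reliability(m: MonthlyHours) -> float:
    # Assumed definition: as availability, but scheduled downtime is excluded.
    known = m.total - m.unknown - m.scheduled
    return 100.0 * m.up / known if known > 0 else 0.0


def meets_targets(m: MonthlyHours) -> bool:
    return (availability(m) >= AVAILABILITY_TARGET
            and reliability(m) >= RELIABILITY_TARGET)


# Made-up example: a 30-day month with 48 hours of scheduled downtime.
month = MonthlyHours(total=720, up=600, scheduled=48, unknown=0)
print(f"availability={availability(month):.1f}% "
      f"reliability={reliability(month):.1f}%")
print("targets met:", meets_targets(month))
```

Under these assumptions scheduled downtime lowers availability but not reliability, which is why the reliability target can be set higher than the availability one.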

Other availability requirements:

The service is accessible via EGI Workload Manager.
The service is accessible through EGI Check-in or through VOMS, depending on how the VO serving a specific user community has been configured.

The service availability is regularly tested with the metrics org.nordugrid.ARC-CE-result, org.nordugrid.ARC-CE-IGTF, argo.certificate.validity-htcondorce

Status on ARGO UI

Performance reports in terms of Availability and Reliability are produced by ARGO in near real time, and they are also collected in the Documentation Database on a monthly basis.

Risk assessment and management

For more details, please see the High-Throughput Compute Risk assessment page. A summary of the assessment is reported here.

Risk analysis

Title	Risk description	Affected components of the service	Established measures	Risk level	Expected duration of downtime / time for recovery	Treatment - Protective/mitigation measures - recovery activities - controls
High-Throughput Compute Risk assessment	Service unavailable / loss of data due to hardware failure	All components	Provide users with more than one place to run their grid jobs (VO SLA OLAs)	(2) Low	1 or more working days	The measures already in place are considered satisfactory and the risk level is acceptable
High-Throughput Compute Risk assessment	Service unavailable due to software failure	All components	Provide users with more than one place to run their grid jobs (VO SLA OLAs); wait for fixes in middleware and run upgrade campaigns	(2) Low	Up to 1 working day	The measures already in place are considered satisfactory and the risk level is acceptable
High-Throughput Compute Risk assessment	Service unavailable / loss of data due to human error	All	Users can use other Resource Centres; use of documentation for sysadmins and service operators	(2) Low	Up to 1 working day	The measures already in place are considered satisfactory and the risk level is acceptable
High-Throughput Compute Risk assessment	Not enough people for maintaining and operating the service	Batch system, Computing element	Ensure minimal support coverage of services (rota); users can use alternative supporting sites	(1) Low	Up to 1 working day	The measures already in place are considered satisfactory and the risk level is acceptable
High-Throughput Compute Risk assessment	Major disruption in the data centre.	Batch system, Computing element	Users to access other supporting sites	(2) Low	1 or more working days	The measures already in place are considered satisfactory and the risk level is acceptable
High-Throughput Compute Risk assessment	Major security incident. The system is compromised by external attackers and needs to be reinstalled and restored.	All	- Follow security advisories of used products/platforms - Periodic backup of data - RCs are recommended to use the latest SW and implement security measures at facility level - RCs to abide by EGI security policies	(4) Medium	Up to 1 working day	The measures already in place are considered satisfactory and the risk level is acceptable
High-Throughput Compute Risk assessment	(D)DoS attack. The service is unavailable because of a coordinated DDoS.	Batch system, Computing element	- Follow security advisories of used products/platforms - RCs are recommended to use the latest SW and implement security measures at facility level - RCs to abide by EGI security policies	(4) Medium	1 or more working days	The measures already in place are considered satisfactory and the risk level is acceptable

Outcome

The established measures and treatments mentioned above are recommendations that RC administrators and users should put in place to reduce the likelihood and impact of each risk. In the context of a given SLA, user communities can agree on additional, specific recovery measures with the RCs involved.

Additional information

When something does not work as expected, users should create an incident ticket in the EGI Helpdesk to notify the given RC about the malfunction.
- The RC will provide details about the failure and an estimate of when the service will be recovered.
The services can be unavailable due to either planned or unplanned downtime.
- Users are advised to subscribe to the downtime tool in the Operations Portal to be notified when a downtime affecting either a single RC or a whole VO is announced.

Availability and Continuity test

Not requested at the moment.

Revision history

Version	Authors	Date	Comment
v.10	Catalin Condurache, Alessandro Paolini	03 Sep 2024	Plan finalised.
