Introduction

This page reports on the Availability and Continuity plan for the EGI HTC service. It is the result of the risk assessment conducted for this service: a series of risks and threats has been identified and analysed, along with the corresponding countermeasures currently in place.

The EGI High-Throughput Compute (HTC) service is a federated service provided by EGI Federation members. Because a large number of RCs are federated within the service, this AvCo plan deviates from the standard SACM template: it covers the AvCo requirements, performance and risk assessment only from a top-level, infrastructure-wide perspective. Furthermore, no continuity/recovery test is needed at the moment, although specific tests may be planned in the future once more precise countermeasures are added.

The aim is to agree on a minimum set of standard measures that each HTC provider should implement in order to guarantee users a sufficient level of continuity of the HTC service. At the same time, the plan gives users guidelines on what to do when the HTC provider they use becomes unavailable.



                   Last       Next
Risk assessment    2024-08    2025 Q4
Av/Co plan         2024-09    2025 Q4

Availability requirements and performance

The general Service Level Targets are defined in the RCs OLA:

  • Monthly Availability / Reliability: 80% / 85%
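As an illustration of how the targets above can be checked, the following is a minimal sketch in Python. It assumes the commonly used definitions in which availability is the fraction of known time the service was up, and reliability additionally excludes scheduled downtime from the denominator; the exact formulas applied by ARGO should be confirmed in its documentation. The `MonthlyStatus` structure, the function names and the sample figures are hypothetical and are not part of the OLA.

```python
from dataclasses import dataclass

# Monthly targets quoted from the RCs OLA, in percent.
AVAILABILITY_TARGET = 80.0
RELIABILITY_TARGET = 85.0


@dataclass
class MonthlyStatus:
    """Hours spent in each state during one month (hypothetical structure)."""
    up: float                  # service tested OK
    down: float                # unscheduled downtime
    scheduled_downtime: float  # downtime announced in advance
    unknown: float = 0.0       # no monitoring result; excluded from both ratios


def availability(s: MonthlyStatus) -> float:
    """Assumed definition: up time over all known time, in percent."""
    known = s.up + s.down + s.scheduled_downtime
    return 100.0 * s.up / known if known > 0 else 0.0


def reliability(s: MonthlyStatus) -> float:
    """Assumed definition: as availability, but scheduled downtime is excluded."""
    known_unscheduled = s.up + s.down
    return 100.0 * s.up / known_unscheduled if known_unscheduled > 0 else 0.0


def meets_targets(s: MonthlyStatus) -> bool:
    """True when both monthly targets are reached."""
    return (availability(s) >= AVAILABILITY_TARGET
            and reliability(s) >= RELIABILITY_TARGET)


# Illustrative figures for a 30-day month (720 hours); not real measurements.
sample = MonthlyStatus(up=600.0, down=60.0, scheduled_downtime=60.0)
print(f"availability: {availability(sample):.2f}%")
print(f"reliability:  {reliability(sample):.2f}%")
print("targets met: ", meets_targets(sample))
```

Under these assumptions, announcing a downtime in advance lowers availability but not reliability, which is why the reliability target is the higher of the two.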

Other availability requirements:

  • The service is accessible via EGI Workload Manager
  • The service is accessible through EGI Check-in or through VOMS, depending on how the VO serving a specific user community has been configured.

The service availability is regularly tested with the following metrics: org.nordugrid.ARC-CE-result, org.nordugrid.ARC-CE-IGTF and argo.certificate.validity-htcondorce.

The performance reports on Availability and Reliability are produced by ARGO in near real time, and they are also collected into the Documentation Database on a monthly basis.

Risk assessment and management

For more details, see the High-Throughput Compute Risk assessment page. A summary of the assessment is reported here.

Risk analysis

 


Outcome

The established measures and treatments mentioned above are recommendations to RC administrators and users; they should be put in place to reduce the likelihood of a risk occurring. In the context of a given SLA, user communities can agree on additional, specific recovery measures with the RCs involved.

Additional information

  • When something doesn't work as expected, users should create an incident ticket in the EGI Helpdesk to notify the relevant RC of the malfunction.
    • The RC will provide details about the failure and an estimate of when the service will be restored.
  • The services can be unavailable due to either planned or unplanned downtime:
    • users are advised to subscribe to the downtime tool in the Operations Portal to be notified when a downtime affecting either a single RC or a whole VO is announced.

Availability and Continuity test

Not requested at the moment.

Revision history

Version    Authors    Date    Comment
v.10                          plan finalised.