Introduction

This page presents the Availability and Continuity plan for the EGI FedCloud (based on the EGI Cloud Compute service). It is the result of the risk assessment conducted for this service: a series of risks and threats has been identified and analysed, along with the corresponding countermeasures currently in place. Each Resource Centre (RC) of the EGI FedCloud performed its own risk assessment, which shows where each individual RC stands with respect to the analysed risks. In addition, an overall risk assessment of the FedCloud service was performed, defining a series of countermeasures and treatments that ideally should be put in place to deal with the risks. The aim is to agree on a minimum set of standard measures that each provider should implement in order to guarantee users a sufficient level of continuity of the cloud service. The plan also gives users guidelines on what to do when the cloud provider they use becomes unavailable.

The first version of this Availability and Continuity plan contains only very basic indications for users; more precise information will be added in future updates. At the moment there is no need for a continuity/recovery test, although specific tests may be planned once more precise countermeasures are added.



| | Last | Next |
|---|---|---|
| Risk assessment | 2022/01/27 | n.d. |
| Av/Co plan | 2022/03/24 | n.d. |

Availability requirements and performance

The general Service Level Targets are defined in the RCs OLA:

  • Monthly Availability/Reliability: 80%/85%
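
As an illustration of how monthly figures can be compared with these targets, here is a minimal Python sketch. It assumes the commonly used definitions in which availability is uptime divided by the known time, and reliability additionally excludes scheduled downtime; the exact ARGO computation may differ. The `MonthlyHours` structure and the function names are hypothetical.

```python
from dataclasses import dataclass

# Monthly targets quoted above from the RCs OLA (percent).
AVAILABILITY_TARGET = 80.0
RELIABILITY_TARGET = 85.0


@dataclass
class MonthlyHours:
    """Hours in one month, split by monitoring status (hypothetical structure)."""
    total: float      # all hours in the month
    up: float         # hours in which the service passed the monitoring tests
    scheduled: float  # hours of declared (scheduled) downtime
    unknown: float    # hours with no monitoring result


def availability(h: MonthlyHours) -> float:
    """Assumed definition: uptime over the time with a known status."""
    known = h.total - h.unknown
    return 100.0 * h.up / known if known > 0 else 0.0


def reliability(h: MonthlyHours) -> float:
    """Assumed definition: as availability, but scheduled downtime is excluded."""
    expected = h.total - h.unknown - h.scheduled
    return 100.0 * h.up / expected if expected > 0 else 0.0


def meets_targets(h: MonthlyHours) -> bool:
    """True when both monthly targets are reached."""
    return (availability(h) >= AVAILABILITY_TARGET
            and reliability(h) >= RELIABILITY_TARGET)


# Example: a 30-day month with 600 hours up and 48 hours of scheduled downtime.
month = MonthlyHours(total=720, up=600, scheduled=48, unknown=0)
print(f"{availability(month):.2f} {reliability(month):.2f} {meets_targets(month)}")
```

Because scheduled downtime is removed from the denominator, reliability under these assumptions is never lower than availability for the same month.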

Other availability requirements:

  • The service is accessible through EGI Check-in
  • The service is accessible via CLI and/or webUI

The service availability is regularly tested by ARGO with the metric eu.egi.cloud.OpenStack-VM: https://argo.egi.eu/egi/Critical/metrics/eu.egi.cloud.OpenStack-VM

The performance reports in terms of Availability and Reliability are produced by ARGO in near real time and are also collected in the Documentation Database on a monthly basis.

Risk assessment and management

For more details, please refer to the Google spreadsheet. A summary of the assessment is reported here.

Risk analysis

| Risk id | Risk description | Affected components | Established measures | Risk level | Treatment (Protective, mitigation measures, recovery activities, controls) | Expected duration of downtime / time for recovery |
|---|---|---|---|---|---|---|
| 1 | Temporary unavailability of services | All | Periodic backup of data (create backup and/or snapshots of block storage volumes, use Data Transfer for object or grid storage), more often means smaller RPO; Keep service configuration and deployment in source control (Infrastructure as Code) | Low | Restore configuration from source control; Restore user data from last backup | up to 1 working day |
| 2 | Temporary unavailability of a Resource Centre | All | Periodic backup of data (see above) | Low | Switch to a different resource centre (deploy using configuration from source control, restore data from last backup), automated procedures mean smaller RTO | up to 1 working day |
| 3 | Data loss | Block Storage, Object Storage, Grid Storage | Periodic backup of data (see above); Configuration/deployment details in source control (see above); Build fault tolerant user solutions | Medium | Restore data from last backup | up to 1 working day |
| 4 | Complete loss of a Resource Centre | All | Periodic backup of data (see above); Configuration/deployment details in source control (see above); Build distributed user solutions | Medium | Switch to a different resource centre (see above) | 1 or more working days |
| 5 | Security incident | All | Follow security advisories of used products/platforms; Periodic backup of data (see above) | Medium | Restore configuration from source control; Restore user data from last backup | up to 1 working day |
| 6 | Denial of service | All | Periodic backup of data (see above); Configuration/deployment details in source control (see above); Build distributed user solutions | Low | Switch to a different resource centre (see above) | 1 or more working days |
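
The link between backup frequency and RPO noted for risk 1 can be made concrete with a small Python sketch. It only illustrates the arithmetic: with periodic backups, the worst-case data loss is one full backup interval. The function names and the example values are hypothetical and are not part of any EGI tool or agreed policy.

```python
from datetime import datetime, timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """A failure just before the next backup loses up to one full interval."""
    return backup_interval


def data_at_risk(last_backup: datetime, failure_time: datetime) -> timedelta:
    """Changes made after the last backup cannot be restored from it."""
    if failure_time < last_backup:
        raise ValueError("failure_time precedes last_backup")
    return failure_time - last_backup


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """The backup schedule satisfies an RPO if the interval does not exceed it."""
    return worst_case_data_loss(backup_interval) <= rpo


# Example with illustrative values: daily backups against a 6-hour RPO.
daily = timedelta(hours=24)
print(meets_rpo(daily, rpo=timedelta(hours=6)))               # daily backups are too sparse
print(meets_rpo(timedelta(hours=4), rpo=timedelta(hours=6)))  # 4-hourly backups fit
print(data_at_risk(datetime(2022, 3, 24, 2, 0), datetime(2022, 3, 24, 14, 30)))
```

The same reasoning applies to RTO: the more of the redeployment and restore procedure is automated, the shorter the recovery time that can realistically be promised.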

Outcome

The countermeasures and treatments mentioned above are recommendations to RC administrators and users; they should be put in place to prevent a risk from occurring or to limit its impact. Each RC defined its own specific countermeasures, and the idea is to converge on a common set of countermeasures provided by default. Then, in the context of a given SLA, the user communities can agree additional, specific recovery measures with the RCs involved.

Additional information

  • When something doesn't work as expected, users should create an incident ticket in the EGI Helpdesk to notify the given RC about the problems.
    • The RC will provide details about the failures and an estimate of when the service will be recovered.
  • The services can be unavailable due to either planned or unplanned downtime:
    • users are advised to subscribe to the downtime tool in the Operations Portal to be notified when a downtime affecting either single RCs or a whole VO is announced.
  • At the moment there is no general agreement on the backup and restore of VMs and user data:
    • the VOs can agree the details with the RCs involved in the given SLA.

Availability and Continuity test

Not requested at the moment.

Revision History

| Version | Authors | Date | Comments |
|---|---|---|---|
| v. 17 | Alessandro Paolini | 2022-03-24 | First version finalised. To agree with the cloud providers on a standard set of measures (such as frequency of backups, amount of data that can be restored, migration/cloning of VMs, automated re-deployment, etc.) |