Introduction

This page presents the Availability and Continuity plan for the EGI FedCloud (based on the EGI Cloud Compute service). It is the result of the risk assessment conducted for this service: a series of risks and threats has been identified and analysed, along with the corresponding countermeasures currently in place. In particular, each Resource Centre (RC) of the EGI FedCloud performed its own risk assessment, which makes it possible to evaluate the situation of every single RC with respect to the analysed risks. In addition, an overall risk assessment of the FedCloud service was performed, defining a series of countermeasures and treatments that ideally should be put in place to deal with the various risks. The aim is to agree on a minimum set of standard measures that each provider should implement in order to guarantee users a sufficient level of continuity of the cloud service. At the same time, the plan gives users a number of guidelines on what to do when the cloud provider they use becomes unavailable.

The first version of this Availability and Continuity plan contains only very basic indications for users; more precise information will be added in future updates. Moreover, at the moment there is no need for a continuity/recovery test, although specific tests may be planned once more precise countermeasures are added.

	Last	Next
Risk assessment	2022/01/27	n.d.
Av/Co plan	2022/03/24	n.d.

Availability requirements and performances

The general Service Level Targets are defined in the RC OLA:

Monthly Availability/Reliability: 80%/85%

Other availability requirements:

The service is accessible through EGI Check-in
The service is accessible via CLI and/or webUI

The service availability is regularly tested by ARGO with the metric eu.egi.cloud.OpenStack-VM: https://argo.egi.eu/egi/Critical/metrics/eu.egi.cloud.OpenStack-VM

The performance reports in terms of Availability and Reliability are produced by ARGO in near real time and are also collected in the Documentation Database on a monthly basis.
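As an illustration of how monthly figures can be checked against the targets above, here is a minimal sketch. It assumes the commonly used definitions in which periods with unknown monitoring status are excluded from both figures and scheduled downtime is additionally excluded from Reliability; the authoritative definitions are those of the OLA and of ARGO. The `MonthlyHours` structure and the sample numbers are hypothetical and are not taken from any ARGO report.

```python
from dataclasses import dataclass

# Targets quoted in this plan (Monthly Availability/Reliability: 80%/85%).
AVAILABILITY_TARGET = 80.0
RELIABILITY_TARGET = 85.0


@dataclass
class MonthlyHours:
    """Hypothetical monthly breakdown of monitoring results, in hours."""
    total: float               # hours in the reporting month
    up: float                  # hours in which the service passed the tests
    scheduled_downtime: float  # hours of announced (planned) downtime
    unknown: float             # hours with no monitoring result


def availability(m: MonthlyHours) -> float:
    # Assumed definition: uptime over the period with known status.
    return 100.0 * m.up / (m.total - m.unknown)


def reliability(m: MonthlyHours) -> float:
    # Assumed definition: like availability, but scheduled downtime
    # is also removed from the reference period.
    return 100.0 * m.up / (m.total - m.unknown - m.scheduled_downtime)


def meets_targets(m: MonthlyHours) -> bool:
    return (availability(m) >= AVAILABILITY_TARGET
            and reliability(m) >= RELIABILITY_TARGET)


# Illustrative 30-day month: 600 h up, 48 h planned downtime, 72 h unplanned.
sample = MonthlyHours(total=720, up=600, scheduled_downtime=48, unknown=0)

if __name__ == "__main__":
    print(f"Availability: {availability(sample):.2f}%")
    print(f"Reliability:  {reliability(sample):.2f}%")
    print("Targets met:", meets_targets(sample))
```

Under these assumptions, planned downtime lowers Availability but not Reliability, which is why the Reliability target can be set higher than the Availability target.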

Risk assessment and management

For more details, please refer to the Google spreadsheet. A summary of the assessment is reported here.

Risk analysis

Risk id	Risk description	Affected components	Established measures	Risk level	Treatment (Protective, mitigation measures, recovery activities, controls)	Expected duration of downtime / time for recovery
1	Temporary unavailability of services	All	- Periodic backup of data (create backups and/or snapshots of block storage volumes, use Data Transfer for object or grid storage); more frequent backups mean a smaller RPO - Keep service configuration and deployment in source control (Infrastructure as Code)	Low	- Restore configuration from source control - Restore user data from last backup	up to 1 working day
2	Temporary unavailability of a Resource Centre	All	- Periodic backup of data (see above)	Low	- Switch to a different resource centre (deploy using configuration from source control, restore data from last backup), automated procedures mean smaller RTO	up to 1 working day
3	Data loss	Block Storage Object Storage Grid Storage	- Periodic backup of data (see above) - Configuration/deployment details in source control (see above) - Build fault tolerant user solutions	Medium	- Restore data from last backup	up to 1 working day
4	Complete loss of a Resource Centre	All	- Periodic backup of data (see above) - Configuration/deployment details in source control (see above) - Build distributed user solutions	Medium	- Switch to a different resource centre (see above)	1 or more working days
5	Security incident	All	- Follow security advisories of used products/platforms - Periodic backup of data (see above)	Medium	- Restore configuration from source control - Restore user data from last backup	up to 1 working day
6	Denial of service	All	- Periodic backup of data (see above) - Configuration/deployment details in source control (see above) - Build distributed user solutions	Low	- Switch to a different resource centre (see above)	1 or more working days
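The relation between backup frequency and RPO mentioned for risk 1 can be made concrete with a small sketch. It assumes the usual reading of RPO as the maximum amount of recent data that may be lost, so that with periodic backups the worst case equals the interval between two backups. The function names, backup schedules and timestamps below are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def worst_case_rpo(backup_interval: timedelta) -> timedelta:
    # With periodic backups, a failure just before the next backup
    # loses everything written since the previous one.
    return backup_interval


def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    # Data written between the last backup and the failure cannot be restored.
    if failure < last_backup:
        raise ValueError("failure must not precede the last backup")
    return failure - last_backup


# Hypothetical schedules: daily versus hourly backups.
daily = worst_case_rpo(timedelta(days=1))
hourly = worst_case_rpo(timedelta(hours=1))

# Hypothetical incident: backup at 02:00, failure at 14:30 on the same day.
lost = data_loss_window(datetime(2022, 1, 27, 2, 0),
                        datetime(2022, 1, 27, 14, 30))

if __name__ == "__main__":
    print("Worst-case loss with daily backups: ", daily)
    print("Worst-case loss with hourly backups:", hourly)
    print("Loss window in the example incident:", lost)
```

The same reasoning applies to RTO: the more of the redeployment that is automated from source control, the shorter the time needed to bring the service back at the same or a different RC.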

Outcome

The countermeasures and treatments mentioned above are recommendations for RC administrators and users; they should be put in place to avoid or mitigate the risks. Each RC has defined its own specific countermeasures, and the idea is to converge on a common set of countermeasures provided by default. In the context of a given SLA, user communities can then agree on additional, specific recovery measures with the RCs involved.

Additional information

When something does not work as expected, users should create an incident ticket in the EGI Helpdesk to notify the RC concerned about the problem.
- The RC will provide details about the failure and an estimate of when the service will be recovered.
The services can be unavailable due to either planned or unplanned downtime:
- users are advised to subscribe to the downtime tool in the Operations Portal to be notified when a downtime affecting either single RCs or a whole VO is announced.
At the moment there is no general agreement on the backup and restore of VMs and user data:
- the VOs can agree on the details with any RCs involved in the given SLA.

Availability and Continuity test

Not requested at the moment.

Revision History

Version	Authors	Date	Comments
v. 17	Alessandro Paolini	2022-03-24	First version finalised. Still to be agreed with the cloud providers: a standard set of measures (such as frequency of backups, amount of data that can be restored, migration/cloning of VMs, automated re-deployment, etc.)
