List of risks

Risk no	Risk description	Threat
1	Service unavailable / loss of data due to hardware failure	Hardware failure
2	Service unavailable / loss of data due to software failure	Software errors (stuck/dead processes, hard disk full because of log files, ...)
3	Service unavailable / loss of data due to human error	Human error: staff not sufficiently aware of or trained in the service and procedures, lack of documentation, patching/upgrading procedures not properly followed, ...
4	Not enough people for maintaining and operating the service	Unavailability of key technical and support staff (holidays period, sickness, ...)
5	Major disruption in the data centre.	Fire, flood, failure or disruption of the power supply, natural disasters, environmental disaster, major events in the environment, ...
6	Major security incident. The system is compromised by external attackers and needs to be reinstalled and restored.	Software vulnerabilities, identity theft, unauthorised access
7	(D)DOS attack. The service is unavailable because of a coordinated DDOS.	Denial of service attack

Risks rating criteria

To evaluate the level of a risk, its likelihood and impact are assessed first. Both are integers between 1 and 4 (inclusive):

Rating	Likelihood	Impact
1	Unlikely to happen	Minimal impact
2	Happens less than once per year	Minor impact, local service disruption less than 1 week
3	Happens every few months / more than once per year	Serious disruption for multiple users, more than a week
4	Happens every 2-3 months or more frequently	Serious disruption to the ability to deliver service

Then the risk level is given by the product of Likelihood and Impact (from 1 to 16), and the risks are prioritised in the following way:

Low: 1 and 2
Medium: 3 and 4
High: 6, 8, and 9
Extreme: 12 and 16

Likelihood	Impact
	1 - Minimal impact	2 - Minor impact, local service disruption less than 1 week	3 - Serious disruption for multiple users, more than a week	4 - Serious disruption to the ability to deliver service
1 - Unlikely to happen	(1) Low	(2) Low	(3) Medium	(4) Medium
2 - Happens less than once per year	(2) Low	(4) Medium	(6) High	(8) High
3 - Happens every few months / more than once per year	(3) Medium	(6) High	(9) High	(12) Extreme
4 - Happens every 2-3 months or more frequently	(4) Medium	(8) High	(12) Extreme	(16) Extreme
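The rating rule above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `risk_level` and the threshold constants are assumptions made for the example, while the 1 to 4 ranges and the Low/Medium/High/Extreme bands are taken from the criteria above.

```python
from typing import Tuple

# Upper bound of each band, taken from the rating criteria:
# Low: 1-2, Medium: 3-4, High: 6-9, Extreme: 12-16.
# The names below are illustrative, not part of the assessment itself.
BANDS = ((2, "Low"), (4, "Medium"), (9, "High"), (16, "Extreme"))


def risk_level(likelihood: int, impact: int) -> Tuple[int, str]:
    """Return (score, label) where score = likelihood * impact."""
    for name, value in (("likelihood", likelihood), ("impact", impact)):
        if value not in (1, 2, 3, 4):
            raise ValueError(f"{name} must be an integer between 1 and 4, got {value!r}")
    score = likelihood * impact
    for upper_bound, label in BANDS:
        if score <= upper_bound:
            return score, label
    # Unreachable for valid inputs, since the maximum product is 16.
    raise AssertionError("score out of range")


if __name__ == "__main__":
    # Print the matrix row by row (likelihood) and column by column (impact).
    for likelihood in range(1, 5):
        row = ["({}) {}".format(*risk_level(likelihood, impact)) for impact in range(1, 5)]
        print("\t".join(row))
```

For example, a likelihood of 2 and an impact of 2 give a score of 4, which falls in the Medium band, matching the corresponding cell of the matrix.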

Risk no. 1

Risk description	Service unavailable / loss of data due to hardware failure
Affected components of the service	All components
Threats	Hardware failure
Consequences of risk occurrence	Users cannot submit to a given Resource Centre
Established measures	Provide users with more than one site where they can run their grid jobs (VO SLA OLAs)
Identified / remaining vulnerabilities
Likelihood	2 - It happens less than once per year
Impact	1 - Minimal impact
Risk level	(2) Low
Treatment - Protective/mitigation measures - recovery activities - controls	The measures already in place are considered satisfactory and risk level is acceptable
Expected duration of downtime/ time for recovering	1 or more working days

Risk no. 2

Risk description	Service unavailable due to software failure
Affected components of the service	All components
Threats	Software errors (stuck/dead processes, hard disk full because of log files, ...)
Consequences of risk occurrence	- The Resource Centre is not accessible - Multiple Resource Centres can be affected by a single software problem
Established measures	- Provide users with more than one site where they can run their grid jobs (VO SLA OLAs) - Wait for fixes in middleware and run upgrade campaigns
Identified / remaining vulnerabilities
Likelihood	2 - It happens less than once per year
Impact	1 - Minimal impact
Risk level	(2) Low
Treatment - Protective/mitigation measures - recovery activities - controls	The measures already in place are considered satisfactory and risk level is acceptable
Expected duration of downtime/ time for recovering	Up to 1 working day

Risk no. 3

Risk description	Service unavailable / loss of data due to human error
Affected components of the service	All
Threats	Human error: staff not sufficiently aware of or trained in the service and procedures, lack of documentation, patching/upgrading procedures not properly followed, ...
Consequences of risk occurrence	Users cannot submit to a given Resource Centre
Established measures	- Users can use other Resource Centres - Use of documentation for sysadmins and service operators
Identified / remaining vulnerabilities
Likelihood	2 - It happens less than once per year
Impact	1 - Minimal impact
Risk level	(2) Low
Treatment - Protective/mitigation measures - recovery activities - controls	The measures already in place are considered satisfactory and risk level is acceptable
Expected duration of downtime/ time for recovering	Up to 1 working day

Risk no. 4

Risk description	Not enough people for maintaining and operating the service
Affected components of the service	Batch system, Computing element
Threats	Unavailability of key technical and support staff (holidays period, sickness, ...)
Consequences of risk occurrence	- Potentially degraded services - Incidents occurring at a specific site are not resolved within the required period of time (as per OLA) - Tickets not responded to in time
Established measures	- Ensure minimal coverage support of services (rota) - Users can use alternative supporting sites
Identified / remaining vulnerabilities
Likelihood	1 - Unlikely to happen
Impact	1 - Minimal impact
Risk level	(1) Low
Treatment - Protective/mitigation measures - recovery activities - controls	The measures already in place are considered satisfactory and risk level is acceptable
Expected duration of downtime/ time for recovering	Up to 1 working day

Risk no. 5

Risk description	Major disruption in the data centre.
Affected components of the service	Batch system, Computing element
Threats	Fire, flood, failure or disruption of the power supply, natural disasters, environmental disaster, major events in the environment, ...
Consequences of risk occurrence	- Data created since last backup may be lost - Users unable to use services - User solutions unavailable until deployed elsewhere - Site becomes unusable
Established measures	Users to access other supporting sites
Identified / remaining vulnerabilities
Likelihood	1 - Unlikely to happen
Impact	2 - Minor impact, local service disruption less than 1 week
Risk level	(2) Low
Treatment - Protective/mitigation measures - recovery activities - controls	The measures already in place are considered satisfactory and risk level is acceptable
Expected duration of downtime/ time for recovering	1 or more working days

Risk no. 6

Risk description	Major security incident. The system is compromised by external attackers and needs to be reinstalled and restored.
Affected components of the service	All
Threats	Software vulnerabilities, identity theft, unauthorised access
Consequences of risk occurrence	- Potentially degraded services - Potential user data loss - Potential compromise of sensitive data - Resource Centre becomes unavailable because of ongoing investigations
Established measures	- Follow security advisories of used products/platforms - Periodic backup of data - RCs are recommended to use latest SW and implement security measures at facility level - RCs to abide by EGI security policies
Identified / remaining vulnerabilities	- Services and/or solutions could be compromised (backdoors)
Likelihood	2 - It happens less than once per year
Impact	2 - Minor impact, local service disruption less than 1 week
Risk level	(4) Medium
Treatment - Protective/mitigation measures - recovery activities - controls	The measures already in place are considered satisfactory and risk level is acceptable
Expected duration of downtime/ time for recovering	Up to 1 working day

Risk no. 7

Risk description	(D)DOS attack. The service is unavailable because of a coordinated DDOS.
Affected components of the service	Batch system, Computing element
Threats	Denial of service attack
Consequences of risk occurrence	- Users unable to use services or services degraded - User solutions unavailable until deployed elsewhere - One or more RCs become unavailable
Established measures	- Follow security advisories of used products/platforms - RCs are recommended to use latest SW and implement security measures at facility level - RCs to abide by EGI security policies
Identified / remaining vulnerabilities	If a community uses RCs from a specific country/region, it might not be able to use the service at all
Likelihood	2 - It happens less than once per year
Impact	2 - Minor impact, local service disruption less than 1 week
Risk level	(4) Medium
Treatment - Protective/mitigation measures - recovery activities - controls	The measures already in place are considered satisfactory and risk level is acceptable
Expected duration of downtime/ time for recovering	1 or more working days
