List of risks

Risk no | Risk description | Threat
1 | Service unavailable / loss of data due to hardware failure | Hardware failure
2 | Service unavailable / loss of data due to software failure | Software errors (stuck/dead processes, hard disk full because of log files, ...)
3 | Service unavailable / loss of data due to human error | Human error: staff not sufficiently aware of or trained in the service and its procedures, lack of documentation, patching/upgrading procedures not properly followed, ...
4 | Not enough people to maintain and operate the service | Unavailability of key technical and support staff (holiday periods, sickness, ...)
5 | Major disruption in the data centre. | Fire, flood, failure or disruption of the power supply, natural disasters, environmental disaster, major events in the environment, ...
6 | Major security incident. The system is compromised by external attackers and needs to be reinstalled and restored. | Software vulnerabilities, identity theft, unauthorised access
7 | (D)DOS attack. The service is unavailable because of a coordinated DDOS. | Denial of service attack


Risk rating criteria

To evaluate the level of a risk, its likelihood and impact are assessed first. Both are integers between 1 and 4 (inclusive):

Rating | Likelihood | Impact
1 | Unlikely to happen | Minimal impact
2 | Happens less than once per year | Minor impact, local service disruption less than 1 week
3 | Happens every few months / more than once per year | Serious disruption for multiple users, more than a week
4 | Happens every 2-3 months or more frequently | Serious disruption to the ability to deliver service

Then the risk level is given by the product of Likelihood and Impact (from 1 to 16), and the risks are prioritised in the following way:

  • Low: 1 and 2
  • Medium: 3 and 4
  • High: 6, 8, and 9
  • Extreme: 12 and 16
Likelihood \ Impact | 1 - Minimal impact | 2 - Minor impact, local service disruption less than 1 week | 3 - Serious disruption for multiple users, more than a week | 4 - Serious disruption to the ability to deliver service
1 - Unlikely to happen | (1) Low | (2) Low | (3) Medium | (4) Medium
2 - Happens less than once per year | (2) Low | (4) Medium | (6) High | (8) High
3 - Happens every few months / more than once per year | (3) Medium | (6) High | (9) High | (12) Extreme
4 - Happens every 2-3 months or more frequently | (4) Medium | (8) High | (12) Extreme | (16) Extreme
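
The rating rule described in this section (risk level = Likelihood x Impact, then mapped to Low / Medium / High / Extreme) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `risk_level` and the use of upper bounds for each band are assumptions made for the example, not part of the assessment procedure itself.

```python
def risk_level(likelihood: int, impact: int) -> tuple[int, str]:
    """Return (score, priority) for a likelihood/impact pair.

    Hypothetical helper: both inputs must be integers from 1 to 4,
    as required by the rating criteria.
    """
    for name, value in (("likelihood", likelihood), ("impact", impact)):
        if value not in (1, 2, 3, 4):
            raise ValueError(f"{name} must be between 1 and 4, got {value!r}")

    score = likelihood * impact  # possible products: 1, 2, 3, 4, 6, 8, 9, 12, 16

    # Bands taken from the prioritisation list:
    # Low: 1, 2; Medium: 3, 4; High: 6, 8, 9; Extreme: 12, 16
    if score <= 2:
        priority = "Low"
    elif score <= 4:
        priority = "Medium"
    elif score <= 9:
        priority = "High"
    else:
        priority = "Extreme"
    return score, priority
```

For instance, a risk with Likelihood 2 and Impact 2 gives `risk_level(2, 2) == (4, "Medium")`, which corresponds to the "(4) Medium" cell of the matrix.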


Risk no. 1

Risk description: Service unavailable / loss of data due to hardware failure
Affected components of the service: All components
Threats: Hardware failure
Consequences of risk occurrence

Users cannot submit to a given Resource Centre

Established measures

Provide users with more than one place to run their grid jobs (VO SLA OLAs)

Identified / remaining vulnerabilities
Likelihood

2 - It happens less than once per year

Impact

1 - Minimal impact

Risk level: (2) Low

Treatment - Protective/mitigation measures - recovery activities - controls

The measures already in place are considered satisfactory and risk level is acceptable

Expected duration of downtime/ time for recovering

1 or more working days

Risk no. 2

Risk description: Service unavailable due to software failure
Affected components of the service: All components
Threats: Software errors (stuck/dead processes, hard disk full because of log files, ...)
Consequences of risk occurrence

The Resource Centre is not accessible

Multiple Resource Centres can be affected by a single software problem

Established measures

Provide users with more than one place to run their grid jobs (VO SLA OLAs)

Wait for fixes in the middleware and run upgrade campaigns

Identified / remaining vulnerabilities
Likelihood

2 - It happens less than once per year

Impact

1 - Minimal impact

Risk level: (2) Low

Treatment - Protective/mitigation measures - recovery activities - controls

The measures already in place are considered satisfactory and risk level is acceptable

Expected duration of downtime/ time for recovering

Up to 1 working day


Risk no. 3

Risk description: Service unavailable / loss of data due to human error
Affected components of the service: All
Threats: Human error: staff not sufficiently aware of or trained in the service and its procedures, lack of documentation, patching/upgrading procedures not properly followed, ...
Consequences of risk occurrence

Users cannot submit to a given Resource Centre

Established measures

Users can use other Resource Centres

Use of documentation for sysadmins and service operators

Identified / remaining vulnerabilities
Likelihood

2 - It happens less than once per year

Impact

1 - Minimal impact

Risk level: (2) Low

Treatment - Protective/mitigation measures - recovery activities - controls

The measures already in place are considered satisfactory and risk level is acceptable

Expected duration of downtime/ time for recovering

Up to 1 working day

Risk no. 4

Risk description: Not enough people to maintain and operate the service
Affected components of the service: Batch system, Computing element
Threats: Unavailability of key technical and support staff (holiday periods, sickness, ...)
Consequences of risk occurrence

Potentially degraded services

Incidents occurring at a specific site are not resolved within the required period of time (as per OLA)

Tickets not responded to in time

Established measures

Ensure minimal support coverage of services (rota)

Users can use alternative supporting sites

Identified / remaining vulnerabilities
Likelihood

1 - Unlikely to happen

Impact

1 - Minimal impact

Risk level: (1) Low

Treatment - Protective/mitigation measures - recovery activities - controls

The measures already in place are considered satisfactory and risk level is acceptable

Expected duration of downtime/ time for recovering

Up to 1 working day


Risk no. 5

Risk description: Major disruption in the data centre.
Affected components of the service: Batch system, Computing element
Threats: Fire, flood, failure or disruption of the power supply, natural disasters, environmental disaster, major events in the environment, ...
Consequences of risk occurrence

- Data created since last backup may be lost
- Users unable to use services
- User solutions unavailable until deployed elsewhere
- Site becomes unusable

Established measures

Users can access other supporting sites

Identified / remaining vulnerabilities
Likelihood

1 - Unlikely to happen

Impact

2 - Minor impact, local service disruption less than 1 week

Risk level: (2) Low

Treatment - Protective/mitigation measures - recovery activities - controls

The measures already in place are considered satisfactory and risk level is acceptable

Expected duration of downtime/ time for recovering

1 or more working days


Risk no. 6

Risk description: Major security incident. The system is compromised by external attackers and needs to be reinstalled and restored.
Affected components of the service: All
Threats: Software vulnerabilities, identity theft, unauthorised access

Consequences of risk occurrence

- Potentially degraded services
- Potential user data loss
- Potential compromise of sensitive data
- Resource Centre becomes unavailable because of ongoing investigations

Established measures

- Follow security advisories of used products/platforms
- Periodic backup of data
- RCs are recommended to use the latest software and implement security measures at facility level
- RCs to abide by EGI security policies

Identified / remaining vulnerabilities: Services and/or solutions could be compromised (backdoors)
Likelihood

2 - It happens less than once per year

Impact

2 - Minor impact, local service disruption less than 1 week

Risk level: (4) Medium

Treatment - Protective/mitigation measures - recovery activities - controls

The measures already in place are considered satisfactory and risk level is acceptable

Expected duration of downtime/ time for recovering

Up to 1 working day


Risk no. 7

Risk description: (D)DOS attack. The service is unavailable because of a coordinated DDOS.
Affected components of the service: Batch system, Computing element
Threats: Denial of service attack
Consequences of risk occurrence

- Users unable to use services or services degraded
- User solutions unavailable until deployed elsewhere
- One or more RCs become unavailable

Established measures

- Follow security advisories of used products/platforms
- RCs are recommended to use the latest software and implement security measures at facility level
- RCs to abide by EGI security policies

Identified / remaining vulnerabilities

If a community uses RCs from a specific country/region, it might not be able to use the service at all

Likelihood

2 - It happens less than once per year

Impact

2 - Minor impact, local service disruption less than 1 week

Risk level: (4) Medium

Treatment - Protective/mitigation measures - recovery activities - controls

The measures already in place are considered satisfactory and risk level is acceptable

Expected duration of downtime/ time for recovering

1 or more working days