- Created by Alessandro Paolini, last modified by Catalin Condurache on 2024 Aug 20
List of risks
Risk no | Risk description | Threat |
---|---|---|
1 | Service unavailable / loss of data due to hardware failure | Hardware failure |
2 | Service unavailable / loss of data due to software failure | Software errors (stuck/dead processes, hard disk full because of log files, ...) |
3 | Service unavailable / loss of data due to human error | Human error: staff not sufficiently aware of or trained in the service and procedures, lack of documentation, patching/upgrading procedures not properly followed, ... |
4 | Not enough people for maintaining and operating the service | Unavailability of key technical and support staff (holidays period, sickness, ...) |
5 | Major disruption in the data centre. | Fire, flood, failure or disruption of the power supply, natural disasters, environmental disaster, major events in the environment, ... |
6 | Major security incident. The system is compromised by external attackers and needs to be reinstalled and restored. | Software vulnerabilities, identity theft, unauthorised access |
7 | (D)DOS attack. The service is unavailable because of a coordinated DDOS. | Denial of service attack |
Risks rating criteria
To evaluate the level of a risk, its likelihood and impact are assessed first. Both are integers between 1 and 4 (inclusive):
Rating | Likelihood | Impact |
---|---|---|
1 | Unlikely to happen | Minimal impact |
2 | Happens less than once per year | Minor impact, local service disruption less than 1 week |
3 | Happens every few months / more than once per year | Serious disruption for multiple users, more than a week |
4 | Happens every 2-3 months or more frequently | Serious disruption to the ability to deliver service |
Then the risk level is given by the product of Likelihood and Impact (from 1 to 16), and the risks are prioritised in the following way:
- Low: 1 and 2
- Medium: 3 and 4
- High: 6, 8, and 9
- Extreme: 12 and 16
Likelihood \ Impact | 1 - Minimal impact | 2 - Minor impact, local service disruption less than 1 week | 3 - Serious disruption for multiple users, more than a week | 4 - Serious disruption to the ability to deliver service |
---|---|---|---|---|
1 - Unlikely to happen | (1) Low | (2) Low | (3) Medium | (4) Medium |
2 - Happens less than once per year | (2) Low | (4) Medium | (6) High | (8) High |
3 - Happens every few months / more than once per year | (3) Medium | (6) High | (9) High | (12) Extreme |
4 - Happens every 2-3 months or more frequently | (4) Medium | (8) High | (12) Extreme | (16) Extreme |
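The rating rule above (risk level = Likelihood × Impact, then mapped to Low, Medium, High or Extreme) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `risk_level` and the `REGISTER` dictionary are hypothetical names introduced here, and the likelihood/impact pairs are the ones recorded in the per-risk tables of this page.

```python
def risk_level(likelihood: int, impact: int) -> tuple[int, str]:
    """Return (score, category) for ratings between 1 and 4 (inclusive)."""
    for name, value in (("likelihood", likelihood), ("impact", impact)):
        if value not in (1, 2, 3, 4):
            raise ValueError(f"{name} must be an integer from 1 to 4, got {value!r}")

    score = likelihood * impact  # possible products: 1, 2, 3, 4, 6, 8, 9, 12, 16

    # Thresholds follow the prioritisation list on this page:
    # Low: 1-2, Medium: 3-4, High: 6-9, Extreme: 12-16.
    if score <= 2:
        category = "Low"
    elif score <= 4:
        category = "Medium"
    elif score <= 9:
        category = "High"
    else:
        category = "Extreme"
    return score, category


# Hypothetical register: risk number -> (likelihood, impact),
# copied from the per-risk tables of this page.
REGISTER = {
    1: (2, 1),
    2: (2, 1),
    3: (2, 1),
    4: (1, 1),
    5: (1, 2),
    6: (2, 2),
    7: (2, 2),
}

for risk_no, (likelihood, impact) in REGISTER.items():
    score, category = risk_level(likelihood, impact)
    print(f"Risk no. {risk_no}: ({score}) {category}")
```

Run as-is, the loop prints one line per risk in the same "(score) category" form used in the per-risk tables, which makes it easy to cross-check the recorded risk levels whenever a likelihood or impact rating is revised.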
Risk no. 1
Risk description | Service unavailable / loss of data due to hardware failure |
---|---|
Affected components of the service | All components |
Threats | Hardware failure |
Consequences of risk occurrence | Users cannot submit to a given Resource Centre |
Established measures | Provide users with more than one place to run their grid jobs (VO SLA OLAs) |
Identified / remaining vulnerabilities | |
Likelihood | 2 - It happens less than once per year |
Impact | 1 - Minimal impact |
Risk level | (2) Low |
Treatment - Protective/mitigation measures - recovery activities - controls | The measures already in place are considered satisfactory and risk level is acceptable |
Expected duration of downtime/ time for recovering | 1 or more working days |
Risk no. 2
Risk description | Service unavailable due to software failure |
---|---|
Affected components of the service | All components |
Threats | Software errors (stuck/dead processes, hard disk full because of log files, ...) |
Consequences of risk occurrence | The Resource Centre is not accessible; multiple Resource Centres can be affected by a single software problem |
Established measures | Provide users with more than one place to run their grid jobs (VO SLA OLAs); wait for fixes in middleware and run upgrade campaigns |
Identified / remaining vulnerabilities | |
Likelihood | 2 - It happens less than once per year |
Impact | 1 - Minimal impact |
Risk level | (2) Low |
Treatment - Protective/mitigation measures - recovery activities - controls | The measures already in place are considered satisfactory and risk level is acceptable |
Expected duration of downtime/ time for recovering | Up to 1 working day |
Risk no. 3
Risk description | Service unavailable / loss of data due to human error |
---|---|
Affected components of the service | All |
Threats | Human error: staff not sufficiently aware of or trained in the service and procedures, lack of documentation, patching/upgrading procedures not properly followed, ... |
Consequences of risk occurrence | Users cannot submit to a given Resource Centre |
Established measures | Users can use other Resource Centres; use of documentation for sysadmins and service operators |
Identified / remaining vulnerabilities | |
Likelihood | 2 - It happens less than once per year |
Impact | 1 - Minimal impact |
Risk level | (2) Low |
Treatment - Protective/mitigation measures - recovery activities - controls | The measures already in place are considered satisfactory and risk level is acceptable |
Expected duration of downtime/ time for recovering | Up to 1 working day |
Risk no. 4
Risk description | Not enough people for maintaining and operating the service |
---|---|
Affected components of the service | Batch system, Computing element |
Threats | Unavailability of key technical and support staff (holidays period, sickness, ...) |
Consequences of risk occurrence | Potentially degraded services; incidents occurring at a specific site are not resolved within the required period of time (as per OLA); tickets not responded to in time |
Established measures | Ensure minimal support coverage of services (rota); users can use alternative supporting sites |
Identified / remaining vulnerabilities | |
Likelihood | 1 - Unlikely to happen |
Impact | 1 - Minimal impact |
Risk level | (1) Low |
Treatment - Protective/mitigation measures - recovery activities - controls | The measures already in place are considered satisfactory and risk level is acceptable |
Expected duration of downtime/ time for recovering | Up to 1 working day |
Risk no. 5
Risk description | Major disruption in the data centre. |
---|---|
Affected components of the service | Batch system, Computing element |
Threats | Fire, flood, failure or disruption of the power supply, natural disasters, environmental disaster, major events in the environment, ... |
Consequences of risk occurrence | - Data created since last backup may be lost |
Established measures | Users can access other supporting sites |
Identified / remaining vulnerabilities | |
Likelihood | 1 - Unlikely to happen |
Impact | 2 - Minor impact, local service disruption less than 1 week |
Risk level | (2) Low |
Treatment - Protective/mitigation measures - recovery activities - controls | The measures already in place are considered satisfactory and risk level is acceptable |
Expected duration of downtime/ time for recovering | 1 or more working days |
Risk no. 6
Risk description | Major security incident. The system is compromised by external attackers and needs to be reinstalled and restored. |
---|---|
Affected components of the service | All |
Threats | Software vulnerabilities, identity theft, unauthorised access |
Consequences of risk occurrence | - Potentially degraded services |
Established measures | - Follow security advisories of used products/platforms |
Identified / remaining vulnerabilities | - Services and/or solutions could be compromised (backdoors) |
Likelihood | 2 - It happens less than once per year |
Impact | 2 - Minor impact, local service disruption less than 1 week |
Risk level | (4) Medium |
Treatment - Protective/mitigation measures - recovery activities - controls | The measures already in place are considered satisfactory and risk level is acceptable |
Expected duration of downtime/ time for recovering | Up to 1 working day |
Risk no. 7
Risk description | (D)DOS attack. The service is unavailable because of a coordinated DDOS. |
---|---|
Affected components of the service | Batch system, Computing element |
Threats | Denial of service attack |
Consequences of risk occurrence | - Users unable to use services or services degraded |
Established measures | - Follow security advisories of used products/platforms - RCs are recommended to use the latest software and implement security measures at facility level - RCs to abide by EGI security policies |
Identified / remaining vulnerabilities | If a community uses RCs from a specific country/region, it might not be able to use the service at all |
Likelihood | 2 - It happens less than once per year |
Impact | 2 - Minor impact, local service disruption less than 1 week |
Risk level | (4) Medium |
Treatment - Protective/mitigation measures - recovery activities - controls | The measures already in place are considered satisfactory and risk level is acceptable |
Expected duration of downtime/ time for recovering | 1 or more working days |