Security challenges: what are they about?

An important stage of preparing ourselves for the possibility of a cybersecurity attack is to exercise our capabilities. This process – intended to support our sites and security teams in their preparations – allows us to identify issues with our procedures and areas where we can best focus our attention for future development. Challenges such as these can take many forms including table-tops, communication challenges, focused technical exercises and those that test all stages of our distributed incident response capability. It is the last of these that we focus on here.

Incident Response Procedures

EGI CSIRT Security incident response procedure

This exercise will also test the current SEC01 EGI CSIRT Security Incident Handling Procedure, in particular step 5, which covers the information collected for the coordinated incident response.

Please try to follow this procedure where possible, and note/report any problems with it.

Important

PLEASE REMEMBER THAT FOR THE CHALLENGE
THE PROCEDURE IS APPLIED WITH RESTRICTIONS
AS STATED IN THE GENERAL RULES (below).
For questions please contact: ssc(at)mailman.egi.eu

EGI Policies and Procedures

All the EGI Federation policies and procedures can be found at the EGI Policies and Procedures Home.

Security Service Challenges

A key element related to our distributed environment is the use of federated, shared identities. As such, it is vital that suspicious activities attributed to a DN or a proxy IdP (EGI Check-in) identity are reported to the Virtual Organisation (VO) or to one of the federated infrastructure CSIRTs (EGI CSIRT, OSG, WLCG Security, RC CSIRT). It is important that this information is immediately synchronized across the involved security teams, so that they can quickly understand the scope of the incident, respond effectively, minimise the impact on the VO's production, and prevent the incident from spreading through channels that the individual security teams can neither monitor nor control.

The goals of the security drills are:

to investigate whether sufficient information is available to be able to conduct an audit trace as part of an incident response, and to ensure that appropriate communication channels are available.
to assess the efficiency of controls applied by the involved security teams.
to evaluate the efficiency of the various incident response operations aiming at containment.
to trigger and improve the collaboration of the full incident response chain, involving security teams from the RCs, NGIs, EGI, OSG, VOs, Identity Providers (IdPs) and Certification Authorities (CAs).

SSC Schedule, estimated workload on the RC security teams

The SSC will run for one week. From earlier SSCs it is understood that most activities will happen in the first 2 days of the SSC, putting a high load on the coordinating security teams (EGI CSIRT, U.S. CMS security).

While the basic actions (communications, user suspension) should not take more than 2h in total, the optional forensics training (see below) will take an experienced admin another 2h, maybe more if you are a bit rusty on that topic.

The SSC Red-Team will deploy "malware" in the coming days, and increase the noise on the Worker Nodes (WNs) until an RC security team spots the suspicious activity and reports it to EGI CSIRT (abuse@egi.eu).

EGI CSIRT will then evaluate the incoming reports. If it has sufficient information to conclude that this is likely a grid incident, a broadcast with all aggregated information will be sent to all participants. The timestamp of this broadcast will be the reference time for the evaluation (see below for details).

General Rules

The participants contacted in this challenge are asked to follow the normal security incident response procedure, and react as if the incident was real, with the following two exceptions:

No sanctions must be applied against the Virtual Organization (VO) that was used to submit the job / start the VM.
Note: a Pilot WMS is used, no sanctions must be applied affecting the production of the VO.

All (mail) communications related to this exercise should have the following header:

* This is the security exercise SSC 23-03 run by EGI CSIRT https://csirt.egi.eu *
* Further information can be found in: https://go.egi.eu/security-challenges    *
* For any questions please contact: abuse@egi.eu                                *

For all "multi-destination" alerts:
- It should be addressed to the e-mail list which has been designated for the test: abuse(at)egi.eu
- Insert the originally intended "multi-destination" address(es) at the beginning of the body of your message.
- Make sure to have the string: [SSC-23.03] in the subject of the message.

The second rule is to avoid communications escaping the boundaries of the SSC.
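The mail rules above can be sketched in a few lines of Python. This is only an illustration, assuming the standard-library email package is used to compose the message; the function name build_ssc_mail, the "Originally intended recipients:" label and the example addresses are hypothetical and not part of the procedure.

```python
from email.message import EmailMessage

# Values taken from the rules above.
SSC_TAG = "[SSC-23.03]"
SSC_LIST = "abuse@egi.eu"
SSC_HEADER = (
    "* This is the security exercise SSC 23-03 run by EGI CSIRT https://csirt.egi.eu *\n"
    "* Further information can be found in: https://go.egi.eu/security-challenges    *\n"
    "* For any questions please contact: abuse@egi.eu                                *\n"
)


def build_ssc_mail(sender, intended_recipients, subject, body):
    """Compose an exercise mail that stays inside the SSC boundaries.

    The message is addressed only to the designated list; the addresses
    that would be used in a real incident are quoted at the top of the body.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = SSC_LIST
    # Make sure the exercise tag is present in the subject.
    msg["Subject"] = subject if SSC_TAG in subject else f"{SSC_TAG} {subject}"
    # Hypothetical label; the rules only require the addresses to come first.
    intended = "Originally intended recipients: " + ", ".join(intended_recipients)
    msg.set_content(f"{intended}\n\n{SSC_HEADER}\n{body}\n")
    return msg
```

Calling build_ssc_mail("admin@rc.example.org", ["site-admins@rc.example.org"], "Suspicious job", "Details follow.") returns a message that can be handed to whatever mail transport the site normally uses.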

Scenario: Stolen Credentials

A common problem in distributed environments is that user credentials get compromised resulting in illicit usage of resources.

This might happen as a result of brute-force attacks on weak passwords, lost/stolen hardware, phishing, or following an earlier incident where this data was harvested by the attacker. In addition, in the Cloud environment, we often see users choose insecure (default) configurations for services they install, or introduce other vulnerabilities, which are then quickly exploited by automated attacks that constantly target all systems connected to the internet.

Stolen or brute-forced (SSH) credentials in distributed environments carry the additional risk that such incidents can spread rapidly, affecting multiple Resource Centres (RCs) in multiple countries. Therefore proper access management is crucial in incident response. In the EGI infrastructure, access to resources is currently controlled mainly by X.509 certificates for HTC resources and by tokens for Cloud resources. The infrastructure is currently in transition towards replacing X.509 certificates with tokens in most services.

X.509 certificates

X.509 access management can happen on different levels. Each action has a certain delay until it takes effect, and a certain scope.

At Resource Center (RC) / service level: immediately bans the user at the RC/service. Running jobs may continue.
Suspend the DN at VOMS: latency of up to 1 week, since already issued VOMS proxies remain valid; no new proxies will be issued. The scope is VO-wide; the certificate could still be used within other VOs.
The CA revokes the certificate: takes effect when the new CRLs are loaded by the services, up to 48 hours, globally. The certificate will then not be accepted at any service. (Note: for a CA to revoke a certificate it requires, for example, proof that the user lost control over the private key. This can be provided by sending the CA a revocation request signed with the private key of the certificate you want to revoke.)

Since suspending at the RC service level is immediately effective, it is crucial that the RC security teams, as well as the VO security teams managing the access to their resources, are trained to suspend a reported malicious certificate DN on all of their systems, and to stop all running processes related to that DN.
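To make the different delays concrete, the following minimal sketch computes the latest time at which each kind of suspension can be assumed effective. The latencies are the upper bounds quoted above; the class and function names are illustrative only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SuspensionLevel:
    name: str
    max_latency: timedelta  # upper bound until the action takes effect
    scope: str


# Upper bounds as described in the text above.
LEVELS = [
    SuspensionLevel("RC/service ban", timedelta(0), "single RC or service"),
    SuspensionLevel("CA revocation", timedelta(hours=48), "global"),
    SuspensionLevel("VOMS suspension", timedelta(weeks=1), "VO-wide"),
]


def latest_effect_times(action_time):
    """Map each level to the latest time it is effective if triggered at action_time."""
    return {level.name: action_time + level.max_latency for level in LEVELS}
```

For an action triggered on 2023-03-06 at 10:00, the RC-level ban applies at once, the CA revocation by 2023-03-08 10:00 at the latest, and the VOMS suspension by 2023-03-13 10:00 at the latest, which is why the local ban is the first containment step.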

Tokens

On EGI Federated Cloud

EGI Federated Cloud, or FedCloud, is a federation of OpenStack providers relying on OpenID Connect and OAuth 2.0 to manage authentication and authorisation. All the cloud service providers are integrated with EGI Check-in using OIDC.

In the case of cloud Virtual Machines (VMs), sites should be able to trace back an IP/VM to the controlling user identifier.
At the same time the state of the VM in question should be preserved for later investigation and further access to it suspended.
Once a user is blocked at the Check-in level, they should no longer be able to obtain new OIDC access tokens, or to use existing access tokens to interact with the cloud API to start/stop/delete VMs. However, this will not revoke existing OpenStack tokens, nor prevent the user from accessing a running VM via a protocol decoupled from OIDC (like SSH). It may therefore require some manual intervention by the RC admins to make sure that the user in question can neither access the API to start/stop/delete VMs nor connect to them.
- OpenStack uses the access token to contact the Check-in OIDC introspection endpoint to fetch the claims; this would raise an error if the user account got blocked.
- The lifetime of an OIDC access token and of an OpenStack token is typically 1 hour each, but this is subject to configuration.
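As a rough illustration of the exposure window left by already issued OpenStack tokens, the sketch below computes until when such a token could still be valid after the user was blocked. It assumes the typical 1 hour lifetime mentioned above; the function names are hypothetical, and the calculation says nothing about SSH access to running VMs, which has to be handled separately.

```python
from datetime import datetime, timedelta

# Typical value according to the text; the real lifetime depends on site configuration.
DEFAULT_TOKEN_LIFETIME = timedelta(hours=1)


def latest_token_expiry(block_time, lifetime=DEFAULT_TOKEN_LIFETIME):
    """A token issued just before the block could remain valid until this time."""
    return block_time + lifetime


def manual_intervention_advised(now, block_time, lifetime=DEFAULT_TOKEN_LIFETIME):
    """True while a previously issued token might still be accepted by the cloud API."""
    return now < latest_token_expiry(block_time, lifetime)
```

With a block at 12:00 and the default lifetime, manual_intervention_advised stays true until 13:00, so RC admins would want to check for API activity of the identity during that hour.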

Security challenges: what is expected from the VO?

VOs using a VO-WMS like CRAB, CMS-Connect, Panda or DIRAC are in a unique position with regard to which identities (DNs) are allowed to submit jobs to the RCs. For RCs it is not easily detectable which identity is running a particular payload within a pilot job. In addition, this concept may circumvent RC local access policies.

To contain an incident the following information needs to be shared with the coordinating CSIRTs as soon as possible:

which identity (DN) is responsible for certain payloads.
which RCs ran payloads of the identity in question.

In addition, the identity in question should be suspended at the VO's job repository or other gateways to the compute infrastructure.

For proper coordination, all related activities should be reported to the coordinating CSIRT (EGI CSIRT).
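A VO that keeps payload records in its workload management system could answer both questions with a simple grouping step. The sketch below assumes each record is a dictionary with the hypothetical keys "dn" and "rc"; real VO-WMS databases will have their own schema and query interface.

```python
from collections import defaultdict


def rcs_by_identity(job_records):
    """Group payload records by submitting DN and list the RCs that ran them.

    job_records is an iterable of dicts with the assumed keys
    "dn" (submitting identity) and "rc" (Resource Centre name).
    """
    seen = defaultdict(set)
    for record in job_records:
        seen[record["dn"]].add(record["rc"])
    # Sorted lists make the result easy to paste into a report.
    return {dn: sorted(rcs) for dn, rcs in seen.items()}
```

The entry for the reported DN is then the list of RCs to pass on to the coordinating CSIRT.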

Security challenges: what is expected from Resource Centers (RCs, sites)?

Usage of the resources that violates the Acceptable Usage Policy is usually detected at the RCs. Suspicious payloads submitted to resources dedicated to a VO are very likely to have been submitted to other RCs as well and, in general, constitute a grid incident.

According to our Security Incident Response Policy you shall promptly report suspected security incidents that have a known or potential relationship to the e-infrastructure resources via the incident response channels defined by the e-Infrastructure. For the purpose of this SSC the incident response channel is abuse@egi.eu.

When you detect something suspicious, please work through the questions listed further below under "For an initial response and first directions try to find answers to the following questions".

Note

It is not expected that all RCs detect the "malicious" activities of this SSC, rather that any findings pointing to a possible incident (Indicators of Compromise) are sent to abuse@egi.eu. The coordinating CSIRT will then send the relevant information to all affected entities.

Information to be gathered at the RCs (sites)

What is the suspicious activity?
Who is running the suspicious payload (Certificate DN or EGI Check-in ID)?
To which VO does the DN or Check-in ID belong?
For computing jobs: which technology was used to submit the job, like:
- VO WMS (CRAB, CMS-Connect, etc.)
- Direct submissions to the RC via ARC-CE, HTCondor, etc.
- Other
For Virtual Machines: which technology was used to create the VM, like:
- OpenStack CLI
- Infrastructure Manager
- AppDB VMops dashboard
- Other
What is the originating IP used to send the job to the RC or start the VM?
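The questions above can be kept as a small checklist record, so that a report to abuse@egi.eu shows at a glance what is still unknown. This is a sketch only; the field names are illustrative and do not correspond to any official report format.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SiteFindings:
    """One field per question in the list above; None means not yet known."""
    suspicious_activity: Optional[str] = None
    identity: Optional[str] = None               # certificate DN or EGI Check-in ID
    vo: Optional[str] = None
    submission_technology: Optional[str] = None  # e.g. VO WMS, ARC-CE, OpenStack CLI
    origin_ip: Optional[str] = None


def missing_items(findings):
    """Return the names of the questions that have no answer yet."""
    return [f.name for f in fields(findings) if not getattr(findings, f.name)]
```

A partially filled record can still be reported early; missing_items then lists what to follow up on.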

Actions to be executed by the RCs (sites)

Any findings (see above) should be reported to abuse@egi.eu.

All sites should promptly respond to requests from the Security Teams (here from EGI CSIRT). These could be requests to:

check for reported IoCs, and report back.
suspend reported identities (DN, Check-in ID) from accessing the local resources. See Access control to compute and storage infrastructure.
stop "malicious" processes related to the reported identity (DN, Check-in ID)

Evaluation

In this challenge the following basic Incident Response activities will get evaluated:

Communication times: based on timestamps in the ticket-system, the response times will be measured. Note, only office hours are counted, time zones will be taken into account.
Containment:
- Time when the processes of the "attacker DN / Check-in ID" are stopped
- Time when the "attacker DN / Check-in ID" is suspended locally.

For an initial response and first directions try to find answers to the following questions

Network:

Are there any other suspicious connections open to/from a reported IP or jobs/VM running under a reported DN/Check-in ID?
- If so, to which IPs?
What are the DNs/Check-in IDs associated with the reported IP?

Containment:

If possible suspend the DN/Check-in ID locally
From where (IPs) were the jobs submitted or VMs started?
To which VO(s) is the user/certificate/Check-in ID affiliated?
Which grid-certificates (DN)/Check-in ID are involved in this test-incident?
- Example DN-1: "CN=John Doe, O=<SomeInstitute>,O=<Something>, ..."
- Example Check-in ID: "0123456789012345678901234567890123456789@egi.eu"

Which Identities are used to start suspicious VMs?
- Example Check-in ID: "0123456789012345678901234567890123456789@egi.eu"
Since when has the VM been running (FedCloud)?
- Example: YYYY:MM:DD hh:mm

Important

The sites should provide the security teams with this information as soon as possible, and at the latest within one working day. The time needed to pass this information to EGI-CSIRT by replying to the alarm mail will be measured and evaluated.

Please also visit our Forensics Howto wiki pages. If you want to contribute to this page, just send your input to irtf(at)mailman.egi.eu.

Scores, Evaluation - Report generation

Per site operations, target time, office hours

initial feedback: 4h
find malicious jobs/processes and stop them: 4h
ban problematic certificate: 4h
contain the malicious binary and send it to the incident-coordinator: 24h

These will be measured by the ssc-monitor, and the scores the sites get are calculated according to the formula stated on the wiki page. Times are relative to the timestamp in the alarm ticket sent to the site; the evaluation takes office hours (09:00 - 18:00, local time) into account.
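To estimate how a response time compares with the targets above, the office-hours rule can be sketched as follows. This is not the ssc-monitor implementation and does not reproduce the scoring formula from the wiki page; it only counts elapsed time inside the 09:00 - 18:00 window. Treating Monday to Friday as working days is an assumption of this sketch, as are the function and dictionary names.

```python
from datetime import datetime, time, timedelta

OFFICE_START = time(9, 0)
OFFICE_END = time(18, 0)

# Target times from the list above.
TARGETS = {
    "initial feedback": timedelta(hours=4),
    "stop malicious processes": timedelta(hours=4),
    "ban certificate": timedelta(hours=4),
    "send binary to coordinator": timedelta(hours=24),
}


def office_hours_between(start, end):
    """Office time elapsed between two naive local datetimes."""
    total = timedelta(0)
    day = start.date()
    while day <= end.date():
        if day.weekday() < 5:  # assumption: Monday to Friday are working days
            window_start = datetime.combine(day, OFFICE_START)
            window_end = datetime.combine(day, OFFICE_END)
            lo = max(start, window_start)
            hi = min(end, window_end)
            if hi > lo:
                total += hi - lo
        day += timedelta(days=1)
    return total


def within_target(operation, alarm_time, done_time):
    """True if the operation was completed within its office-hours target."""
    return office_hours_between(alarm_time, done_time) <= TARGETS[operation]
```

For example, an alarm on a Friday at 16:00 answered the following Monday at 10:30 counts as 3.5 office hours under these assumptions, which is within the 4h target for initial feedback.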

(Optional) SSC Forensics

The Forensics part of the SSC is managed via https://ssc.egi.eu/.

This is an optional activity of the Site Security Challenge (SSC).

The first forensics challenges will be available in the portal after the SSC announcement mails are sent. The portal will be closed after the SSC is finished. The SSC will run for a maximum of one week, the starting point being the "Alert Broadcast" sent by EGI to the participating entities.

By taking part in this game, you will be able to submit answers to additional questions.

Note

The game will focus on selected areas of digital forensics which could be solved with the help of the information in the Forensics Howto.

Then after the SSC you will have the possibility to opt-in for having your results added to the final report.

This exercise is organised by EGI CSIRT with the support of different collaborating organisations:

CMS
US CMS
EGI

The required forensics skills will increase through the investigations. The forensic information you should find will be clearly indicated. This is mainly meant for fun, so enjoy and try to solve as many riddles as you see fit. It should not take much longer than 2 hours.

Access details to the scoreboard will be distributed in the SSC announcement mail. Please make sure you can access it.

Contact information

In case of problems contact us: ssc@mailman.egi.eu

Post processing, clean up

As part of the incident handling, access authorisations may have been withdrawn from the DN/Check-in ID that was used to submit the job or create VMs. When the incident response procedure is complete, the test operator will explicitly request restoration of any such authorisations to their original state.

SSC Evaluation Form

De-briefing

When the challenge has been completed on a representative number of Sites, the test operator will ask for de-briefing input from the participating Sites. Material submitted will be used to edit a report. The report will be circulated to the contributors for comments before being presented to the EGI-CSIRT.

Communication Template Debriefing

Dear all,
Thank you for your contributions to the SSC-XX-YY.

This message is to inform you that the SSC-XX-YY is now over. You should receive the site report in the next few days.

As a clean-up step we would now ask the challenged sites to restore any credentials that may have been banned, in particular:
 <DN/CHECK-IN ID OF USED ATTACK ID>
 and
 <DN/CHECK-IN ID OF USED ID FOR TESTING PURPOSES>
