Document control

Area: EGI Federation Operations
Procedure status: FINAL
Owner: Matthew Viljoen
Approvers: Operations Management Board
Approval status: APPROVED
Approved version and date: v5,
Statement: A procedure describing the steps to decommission a Service operated by a Resource Centre in the EGI infrastructure.
Next procedure review: on demand

Procedure reviews

The following table is updated after every review of this procedure.

Date | Review by | Summary of results | Follow-up actions / Comments
 | Alessandro Paolini | copy from PROC12_Production_Service_Decommissioning in EGI Wiki |





Overview

This procedure defines the good practices between a Resource Centre (aka site) and its users when a production service is being decommissioned.

Definitions

Please refer to the EGI Glossary for the definitions of the terms used in this procedure.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

  • A stateful service is a service which contains persistent user data. Examples of stateful services are storage elements, LFCs or VOMS.
  • A stateless service is a service which does not retain persistent data, so no data needs to be migrated in case of a re-installation.

Entities involved in the procedure

  • Resource Centre Operations Manager: person who is responsible for initiating the decommissioning procedure by contacting the Resource Infrastructure Operations Manager.
  • Resource Infrastructure Operations Manager (aka NGI operations manager): person who is responsible for finding an agreement with the VO Manager about the migration of the service to another site, in case the service is a VO-specific service hosted by the site according to an agreement between the Resource Infrastructure Provider and the VO.
  • Virtual Organizations (VOs): data and other stateful objects of the supported VOs may be stored at the Resource Centre.
  • Virtual Organization (VO) managers: persons who are responsible for retrieving this data from the Resource Centre in due time. Tracking is done through their support unit in GGUS. If such a support unit is not available, the VOs should be contacted directly using the contact information available in the VO ID card.

Contact information

  • A list of EGI Operations Centres and Resource Centres with their respective contact information is available on GOCDB.
  • The list of VOs served by a specific Resource Centre can be retrieved from the BDII and VAPOR.
  • The VO managers and their contact information for a specific VO can be retrieved from the VO ID Cards on the Operations Portal.

Actions and responsibilities

Resource Centre Operations Manager

  1. The Resource Centre is responsible for decommissioning the service.
  2. The Resource Centre is responsible for updating the corresponding entries in the EGI configuration repository GOCDB.
  3. The Resource Centre Operations Manager is REQUIRED to provide the Resource Centre information needed to complete the decommissioning process, and is responsible for its accuracy and maintenance.

Resource Infrastructure Operations Manager

  1. A Resource Infrastructure Provider is REQUIRED to be responsible for all Resource Centres within its respective jurisdiction. For this reason the Resource Infrastructure Provider is responsible for ensuring that all its Resource Centres follow this procedure when decommissioning services.

VOs and VO managers

  1. Give the users the relevant information about the decommissioning (deadlines, involved resources, files, how to handle it).
  2. Follow up and support users in their file migration procedures until the deadline.
  3. Inform the Resource Centre about the status of the migration(s).

Steps

  • Actions tagged RC are the responsibility of the Resource Centre Operations Manager.
  • Actions tagged RP are the responsibility of the Resource Infrastructure Operations Manager.
  • Actions tagged OC are the responsibility of the Operations Centre.
# | Responsible | Action
1 | RC
  1. The Resource Centre Operations Manager opens a GGUS ticket, which will be used as the parent ticket to track the whole process. The ticket must remain in an open status until the service is removed from GOCDB. The ticket has to be assigned to the Resource Infrastructure Operations Centre (NGI).
  2. The Resource Centre Operations Manager contacts the Resource Provider regional staff, communicating the decommissioning plan for the service.
2 | RC
  1. The Resource Centre Operations Manager announces through the broadcast tool to the VO managers and users of all the VOs (except the OPS and DTEAM VOs) supported by the service under decommissioning that it is starting the decommissioning procedure:
    • Announce a detailed timeline for the decommissioning and state that the Resource Centre Manager will start a downtime of the service to prevent any further usage. The timeline must clearly list the deadlines for the VO Managers' actions.
    • The timeline is recorded in the parent ticket.
    • The broadcast link is recorded in the parent ticket.
    • The downtime should start no earlier than 15 days and no later than one month after the broadcast.
    • State that the aim is to remove the service in XX weeks (min 6 weeks for stateful services).
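The timing constraints in step 2 can be checked mechanically before the broadcast is sent. The following is a minimal sketch, not part of the procedure: the function name `check_timeline` and its parameters are hypothetical, "one month" is approximated as 30 days, and the 6-week minimum for stateful services is assumed to be counted from the broadcast date.

```python
from datetime import date, timedelta

# Assumptions (not stated in the procedure): "one month" is treated as
# 30 days, and the 6-week minimum is measured from the broadcast date.
MIN_DAYS_TO_DOWNTIME = 15
MAX_DAYS_TO_DOWNTIME = 30
MIN_WEEKS_STATEFUL = 6


def check_timeline(broadcast: date, downtime_start: date,
                   removal: date, stateful: bool) -> list:
    """Return a list of problems found in a proposed timeline (empty if none)."""
    problems = []
    if downtime_start < broadcast + timedelta(days=MIN_DAYS_TO_DOWNTIME):
        problems.append("downtime starts less than 15 days after the broadcast")
    if downtime_start > broadcast + timedelta(days=MAX_DAYS_TO_DOWNTIME):
        problems.append("downtime starts more than one month after the broadcast")
    if stateful and removal < broadcast + timedelta(weeks=MIN_WEEKS_STATEFUL):
        problems.append("stateful service removed less than 6 weeks after the broadcast")
    return problems
```

For example, `check_timeline(date(2024, 1, 1), date(2024, 1, 20), date(2024, 2, 20), stateful=True)` returns an empty list, meaning the proposed dates satisfy all three constraints under the stated assumptions.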
3 | RC
  1. [If the service is a CE or a workload management service] After the decommissioning is announced, the Resource Centre MAY disable VO job submission to prevent further VO activity, except for monitoring jobs.

    [If the service is a storage or data management service] After the decommissioning is announced, the Resource Centre MAY disable VO write access to prevent further VO activity, except for infrastructure VOs (if selective permissions are not possible, the service must remain write-enabled until the beginning of the downtime).

4 | VO (OC)
  1. [If the service is a storage element] Between the announcement of the decommissioning and the beginning of the downtime, the VO Manager SHOULD check the volume of data stored by the VO at the site. If it is large enough to require more than one month to be moved, the VO Manager can ask to reschedule the downtime period.
    • If no communications are sent to the Resource Centre by the first week of downtime, the schedule can be considered agreed by all VO Managers.
    • If there are multiple SEs being decommissioned together, the total amount of data to move could be bigger, and VOs may be informed about that.
    • Any rescheduling request MUST be supported by technical reasons (e.g. total amount of data to move / site maximum data transfer throughput).
  2. [If the service is a central service like VOMS or LFC for a given VO] The VO Manager, Resource Centre Operations Manager and Resource Infrastructure Operations Manager should discuss finding a new Resource Centre to host these services, taking into account any pre-existing agreement between the VO and the NGI. For international VOs, this discussion could be held at the EGI level, especially if a solution cannot be easily found within that Resource Infrastructure Provider.
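The technical justification mentioned in step 4 (total amount of data to move divided by the site's maximum transfer throughput) is a simple calculation. The sketch below illustrates it; the function names, the decimal unit convention (1 TB = 1,000,000 MB) and the 30-day default window are assumptions made for illustration, not values defined by this procedure.

```python
def estimated_transfer_days(data_tb: float, throughput_mb_s: float) -> float:
    """Days needed to move data_tb terabytes at a sustained throughput_mb_s MB/s."""
    if throughput_mb_s <= 0:
        raise ValueError("throughput must be positive")
    # Assumption: decimal units, 1 TB = 1,000,000 MB.
    seconds = (data_tb * 1_000_000) / throughput_mb_s
    return seconds / 86_400  # seconds per day


def needs_reschedule(data_tb: float, throughput_mb_s: float,
                     window_days: float = 30) -> bool:
    """True if the estimated transfer time exceeds the available window.

    Assumption: the "one month" in step 4 is approximated as 30 days.
    """
    return estimated_transfer_days(data_tb, throughput_mb_s) > window_days
```

Under these assumptions, 500 TB at a sustained 100 MB/s takes roughly 58 days, so `needs_reschedule(500, 100)` is true, while 100 TB at the same rate takes under 12 days and fits within the default window. Real transfers rarely sustain the maximum throughput, so the estimate is a lower bound.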
5 | RC
  1. According to the dates announced in the broadcast or differently agreed in step 4, the Resource Centre puts the service in downtime to prevent any further usage. This downtime shall last for the scheduled period or until phase 5 is over, whichever is the shorter.
    • The downtime must be recorded in the parent ticket.
6 | RC

If the service is a stateful service containing VO data:

  • Once the service is in downtime and closed for write access (if possible), the Resource Centre Operations Manager opens N child tickets of the procedure's parent ticket, one for each of the N VO managers of the N VOs the service supports.
  • The VOs are given up to the amount of time agreed in step 4 to retrieve their data from the service being decommissioned. During this period, the Resource Centre should make sure that the service works for the different VOs to allow them to migrate their data. The VO managers can specify any specific requirements in their child ticket. For instance:
    • Request in the child ticket from the Resource Centre Operations Manager the time limit needed to retrieve data.
    • (If the service is an SE) Request from VO central services admins the list of LFNs/DNs still having SURLs on SEs at that Resource Centre.
    • The VO Manager MUST communicate to the Resource Centre, if possible using the GGUS child ticket, when the data move is completed.
    • If the service's data cannot be migrated using the user interface (e.g. if access to a database dump is needed), the Resource Centre administrators should cooperate with the VO Managers.

If the service does not contain user/state persistent data (e.g. CE):

  • Once the service is in downtime, the interface can be closed in order to prevent users from starting new tasks on the service, while still allowing them to retrieve the output of the tasks submitted before the beginning of the downtime.
7 | RC
  1. At the end of the scheduled downtime period or when step 6 is completed and validated:
    • The service is set to "production=N" "monitored=N" in the GOCDB.
    • Once the service disappears from Nagios, it must be removed from the Resource Centre GIIS (e.g. Site-BDII).
    • The downtime is terminated.
    • All these actions must be recorded in the parent ticket.
  2. At this point the service is no longer listed in the top-BDIIs of EGI. If the hardware is shut down, the Resource Centre will need to address this, possibly informing the affected users that their data could be at risk.
9 | RC
  1. Logs are to be kept at the Resource Centre and remain available for the period required by the Security Traceability and Logging Policy (90 days) after the service has been removed from the Resource Centre GIIS and its public interfaces are no longer accessible to users. In case of inquiries related to security incidents, the period could be extended. Note: If the logs are saved elsewhere, the service's hardware can be disposed of.
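The retention period in step 9 translates into a simple date calculation. This is a minimal sketch; the function name `earliest_log_disposal` is hypothetical, and the 90-day figure is the one quoted in this procedure (the policy itself remains the authoritative source, and the period may be extended for security inquiries).

```python
from datetime import date, timedelta

LOG_RETENTION_DAYS = 90  # retention period quoted in this procedure


def earliest_log_disposal(removed_from_giis: date) -> date:
    """Earliest date on which logs may be disposed of, counted from the day
    the service was removed from the GIIS and made inaccessible to users."""
    return removed_from_giis + timedelta(days=LOG_RETENTION_DAYS)
```

For a service removed on 1 January 2024, `earliest_log_disposal(date(2024, 1, 1))` gives 31 March 2024.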
10 | RC
  1. Service is removed from GOCDB.
    • This action must be recorded in the parent ticket.
11 | OC
  1. Parent ticket is closed.