General information

Middleware


UMD


Operations

ARGO/SAM

  • Monitoring of webdav endpoints:
    • a new version of the probe takes into account that, when an Object store is the backend disk storage, the "ls" of a "directory" fails in the webdav check.
      • the "ls" check can be disabled by setting the appropriate information in GOCDB
      • released in production on Jun 5th.
  • Monitoring of xrootd endpoints
    • some endpoints are exposed outside the site in read-only mode
    • need to modify the xrootd probe to execute only "read" tests
    • the new service type "eu.egi.readonly.xrootd" was created for this purpose (see GGUS 160848)

FedCloud

Feedback from DMSU


New Known Error Database (KEDB)

The KEDB has been moved to Jira+Confluence: https://confluence.egi.eu/display/EGIKEDB/EGI+Federation+KEDB+Home

  • problems are tracked with Jira tickets to better follow their evolution
  • problems can be registered by DMSU staff and EGI Operations team


Issues with publishing the accounting records

Monthly Availability/Reliability

Under-performed sites in the past A/R reports with issues not yet fixed:


Under-performed sites after 3 consecutive months, under-performed NGIs, QoS violations (May 2023):

sites suspended: 

  • Jun 2nd: egee.irb.hr (NGI_HR)

Documentation

IPv6 readiness plans

Change in the APEL client configuration due to CERN top-BDII decommission

  • The top-BDII lcg-bdii.cern.ch will be turned off on June 19th, as announced in a broadcast circulated on May 23rd.
  • This endpoint is currently the default setting in the APEL client configuration file /etc/apel/client.cfg
    • If you are using this setting, we kindly ask you to change it: you can replace it with the endpoint lcg-bdii.egi.eu (or any other top-BDII provided by your NGI).
    • The variable to set in /etc/apel/client.cfg is the following:

      ldap_host = lcg-bdii.egi.eu
    • In this way, the APEL client will query that BDII to gather the benchmark information published by your CEs.
    • you can set any top-BDII of your preference

      Please note that it is also possible to set the benchmark information manually through one or more of the following variables:

      ## To manually set specs for all jobs (not just local ones), configure lines
      ## like the following named "manual_spec" followed by consecutive integers for
      ## however many batch systems are relevant. The value should be a unique name
      ## for the system, then the spec type ('HEPscore23', 'HEPSPEC' or 'Si2k') and the spec value.
      # manual_spec1 = grid10.uni.ac.uk:1234/grid10.uni.ac.uk-condor,HEPSPEC,10.0
      # manual_spec2 = grid22.uni.ac.uk:1234/grid22.uni.ac.uk-condor,HEPSPEC,15.0
      # manual_spec3 = grid35.uni.ac.uk:1234/grid35.uni.ac.uk-condor,HEPSPEC,15.0


    • Please apply this change by June 19th, thanks!
  • broadcast about the APEL configuration change sent on May 26th
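
A site could check its configuration for the decommissioned endpoint with a small script such as the one below. It is a minimal sketch, assuming that /etc/apel/client.cfg is a standard INI file with section headers (the [spec_updater] section name in the sample is an assumption) and that manual_spec values follow the "name,type,value" layout shown in the commented example above. The function name check_client_cfg is hypothetical.

```python
import configparser

DECOMMISSIONED = "lcg-bdii.cern.ch"  # endpoint being turned off (from the notes)
REPLACEMENT = "lcg-bdii.egi.eu"      # suggested replacement (from the notes)


def check_client_cfg(text: str) -> dict:
    """Report ldap_host and manual_spec settings found in client.cfg text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    report = {"ldap_hosts": [], "needs_change": False, "manual_specs": []}
    # Scan every section, so no particular section name is relied upon.
    for section in parser.sections():
        for key, value in parser.items(section):
            if key == "ldap_host":
                report["ldap_hosts"].append(value.strip())
                if value.strip() == DECOMMISSIONED:
                    report["needs_change"] = True
            elif key.startswith("manual_spec"):
                # Layout assumed from the example: system name, spec type, value.
                name, spec_type, spec_value = value.rsplit(",", 2)
                report["manual_specs"].append(
                    (name.strip(), spec_type.strip(), float(spec_value))
                )
    return report


if __name__ == "__main__":
    # The section name below is an assumption made for this sample.
    sample = "[spec_updater]\nldap_host = lcg-bdii.cern.ch\n"
    result = check_client_cfg(sample)
    if result["needs_change"]:
        print(f"ldap_host points to {DECOMMISSIONED}; consider {REPLACEMENT}")
```

To run it against a real file, read /etc/apel/client.cfg into a string and pass it to check_client_cfg; the script only reports and never modifies the file.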

Transition from X509 to federated identities (AARC profile token)

  • In Feb 2022 OSG fully moved to token-based AAI, abandoning X509 certificates
  • HTCondorCE: replacement of Grid Community Toolkit
    • The long-term support series (9.0.x) from the CHTC repositories will support X509/VOMS authentication until May 2023
    • Starting in 9.3.0 (released in October 2021), the HTCondor feature releases do NOT contain this support
    • EGI sites are recommended to stay with the long-term support series for the time being

Migration of the VOs from VOMS to Check-in

  • transition period where both X509 and tokens can be used
    • delays in updating the GRID elements to the latest version compliant with tokens
    • not all of the middleware products can be compliant with tokens at the same time
    • the same VO has to interact with elements supporting different authentication methods

Testing HTCondorCE and AARC Profile token

  • INFN-T1 did some tests with the AARC Profile token using its HTCondorCE endpoints
  • dteam VO registered in Check-in/Comanage:
    • Entitlements:
      • urn:mace:egi.eu:group:dteam:role=member#aai.egi.eu
      • urn:mace:egi.eu:group:dteam:role=vm_operator#aai.egi.eu
  • The HTCondorCE expects to find the scope claim in the token in order to authorise job submission
    • at that time Check-in did not release this claim; it does since the migration to Keycloak, which replaced MitreID
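
When debugging such tests it can help to look inside a token and at the structure of the entitlements. The sketch below decodes a JWT payload without verifying the signature (for inspection only, never for authorisation) and splits an entitlement of the form listed above into its parts. The helper names are hypothetical. The claim name "scope" comes from the notes; treating the entitlement as "namespace:group:...:role=...#authority" is an assumption based on the two examples above.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload of a JWT WITHOUT verifying it (inspection only)."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def has_scope_claim(payload: dict) -> bool:
    """True if the token carries a non-empty 'scope' claim."""
    return bool(str(payload.get("scope", "")).strip())


def parse_entitlement(value: str) -> dict:
    """Split an entitlement such as the dteam ones listed above."""
    body, _, authority = value.partition("#")
    parts = body.split(":")
    idx = parts.index("group")  # everything before 'group' is the namespace
    groups, role = [], None
    for part in parts[idx + 1:]:
        if part.startswith("role="):
            role = part[len("role="):]
        else:
            groups.append(part)
    return {
        "namespace": ":".join(parts[:idx]),
        "groups": groups,
        "role": role,
        "authority": authority,
    }


if __name__ == "__main__":
    print(parse_entitlement("urn:mace:egi.eu:group:dteam:role=member#aai.egi.eu"))
```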

WLCG Campaign

Hackathon events

  • 15th - 16th September: ARC/HTCondor CE Hackathon, organised by WLCG, with HTCondorCE and ARC-CE, mostly to investigate data staging issues (see GDB introduction)
    • agreed to enable the support of the several token profiles through plugins
      • same plugin for the several CEs 
      • plugins provided by the "creators" of the token profiles
    • CE teams to provide specifics to the AAI teams and to release a new CE version supporting the plugins

Plans for the coming months:

  • ARC-CE and HTCondorCE implemented a new API interface
  • The Check-in team released a plugin for the CEs allowing the Check-in/AARC token profile to work
    • according to the AARC guidelines, the claim authorising job submission is provided through a different attribute from the one used by the WLCG token
    • the plugin translates the attribute to be understandable to the CEs
  • The plugin is currently under testing before its release in UMD
    • involved DESY-HH, FZK-LCG2, INFN-BARI, INFN-T1, RECAS-NAPOLI
    • the plugin is tested with the HTCondor Feature Release, which introduced support for Check-in tokens
    • tests were successful
    • a few aspects concerning some HTCondor variables to be clarified
  • HTCondor now supports SSL for authentication and mapping of X509 certificates.
    • The SSL workaround does not allow the use of VOMS extensions for mapping, but it was mentioned that user mapping can be achieved using only the DN.
    • Waiting for the creation of documentation about this setting
  • Important for the sites: please get in contact with the VOs to verify their status about the transition to tokens:
    • if the VOs need a bit more time you can use the SSL settings to map the users' DNs...
    • ...but you need to know who these users are!
  • Then we can start the decommission procedure for the HTCondorCE long-term support series (9.0.x)
    • we might agree to postpone this decommission deadline considering the (relatively) low likelihood of security issues
    • this will make the transition to tokens less painful for VOs that are not ready yet
  • At the same time, the VOs using VOMS will be cloned to Check-in so that they are ready to use tokens when the first HTCondor (Feature Channel version) endpoints are in production.
  • Monitoring aspects to be clarified:
    • whether a new version of the probe using tokens is needed
    • how to deal with CEs using different authz systems during the migration phase

DPM Decommission and migration

  • DPM supported until June 2023
  • Sites are encouraged to start the migration to a different storage element since the process will take time
    • choosing the new storage solution depends on the expertise/experience of the sites and on the needs of the supported VOs 
  • See the slides presented by Petr Vokac at the EGI Conference 2022 about the migration tools to dCache
  • DPM provides a migration script to dCache (migration guide)
    • Transparent migration
      • Migrate just catalog (database) and keep files untouched
      • both SEs store files on a POSIX filesystem
  • Migration in three steps
    • verify the DPM data consistency
      • no downtime needed
      • the operation can last several days or some weeks
    • DPM dump and dCache import
      • downtime lasting about 1 day
  • In September 2022 tickets were opened to the sites to plan the migration and decommission:
    • tickets list (30 out of 57 were solved)
    • Please let us know your plans for DPM EOL. If you decide to use the dCache migration tools, the tickets will be used to support you with this storage migration method.
    • dCache migration should be done by June 2023.

Planned completion date    Count
By Feb 2023                1
By Q1 2023                 3
By May 2023                2
By June 2023               8
By Q2 2023                 4
undefined                  6

Chosen technology                 Count
Migration to dCache               27
Migration to EOS                  8
Migration to xroot/ceph           3
Migration to XrootD               1
Migration to Xcache               1
Migration to Dynafed              2
Not yet decided/no clear plan     6
Decommissioning SE or site        7
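
For a quick overview, the counts reported above can be totalled with a few lines of Python. The figures are copied from the two tables; the variable names are only illustrative, and the script does nothing beyond simple arithmetic on those figures.

```python
# Counts copied from the "Planned completion date" table above.
planned = {
    "By Feb 2023": 1,
    "By Q1 2023": 3,
    "By May 2023": 2,
    "By June 2023": 8,
    "By Q2 2023": 4,
    "undefined": 6,
}

# Counts copied from the "Chosen technology" table above.
technology = {
    "Migration to dCache": 27,
    "Migration to EOS": 8,
    "Migration to xroot/ceph": 3,
    "Migration to XrootD": 1,
    "Migration to Xcache": 1,
    "Migration to Dynafed": 2,
    "Not yet decided/no clear plan": 6,
    "Decommissioning SE or site": 7,
}

total_planned = sum(planned.values())
total_technology = sum(technology.values())
dcache_share = technology["Migration to dCache"] / total_technology

if __name__ == "__main__":
    print(f"entries with a planned date (incl. undefined): {total_planned}")
    print(f"entries with a technology choice: {total_technology}")
    print(f"share choosing dCache: {dcache_share:.0%}")
```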

  • Procedure to decommission unsupported software: PROC16 Decommissioning of unsupported software
    • In compliance with the EGI Service Operations Security Policy (1), unsupported software SHOULD be decommissioned before its End of Security Updates and Support, and MUST be retired no later than 1 month after its End of Security Updates and Support. After this date, if a critical vulnerability were to emerge in the software, EGI CSIRT can request the service to be turned off immediately.
    • (1) a Resource Centre Administrator SHOULD follow IT security best practices that include pro-actively applying software patches, updates or configuration changes related to security.
  • DPM End of Security Updates and Support: 30th June 2023
  • DPM decommissioning deadline: 31st July 2023
    • Failure to do so MAY ultimately lead to site suspension
  • Please note that after June 30th no support is going to be provided with the migration to dCache in case of issues.

New benchmark HEPscore23

The benchmark HEPscore23 is replacing the old HEP-SPEC06 (HS06).

Main points agreed:

  • On the Accounting Portal all of the metric units refer to HEPscore23 (since April 1st 2023)
  • Existing resources at the sites will not be re-benchmarked with HEPscore23 (unless the site has modern resources and would like to re-benchmark them in order to get higher consumption in the accounting reports)
  • New resources purchased by the site will be benchmarked with HEPscore23
  • This implies that two benchmarks will co-exist on the infrastructure for quite some time
  • Normalisation factor between HEPscore23 and HS06 is 1
  • We would like to follow the progress regarding the amount of resources benchmarked with HEPscore23
  • No need for reporting of measurements for two benchmarks in parallel for the same set of resources
    • This implies that each accounting record should contain one metric for a single benchmark, and the benchmark name has to be properly defined in the accounting record.
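
The points above can be illustrated with a small sketch in which each record carries exactly one benchmark name and value, usage is aggregated per benchmark, and the two totals can be summed directly because the normalisation factor is 1. The UsageRecord structure and its field names are hypothetical and do not reflect the actual APEL message format.

```python
from collections import defaultdict
from dataclasses import dataclass

# Normalisation factor between HEPscore23 and HS06, as agreed in the notes.
HS06_TO_HEPSCORE23 = 1.0


@dataclass
class UsageRecord:
    """Hypothetical record: one benchmark name and one value per record."""
    site: str
    benchmark: str          # e.g. "HEPscore23" or "HEPSPEC"
    benchmark_value: float  # per-core score in that benchmark
    core_hours: float       # raw core-hours consumed


def normalised_hours(rec: UsageRecord) -> float:
    """Raw core-hours scaled by the per-core benchmark score."""
    return rec.core_hours * rec.benchmark_value


def aggregate_by_benchmark(records) -> dict:
    """Sum normalised hours separately for each benchmark name."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec.benchmark] += normalised_hours(rec)
    return dict(totals)


def combined_total(records) -> float:
    """Total in HEPscore23 units; HS06 figures convert with factor 1."""
    total = 0.0
    for rec in records:
        factor = 1.0 if rec.benchmark == "HEPscore23" else HS06_TO_HEPSCORE23
        total += normalised_hours(rec) * factor
    return total


if __name__ == "__main__":
    records = [
        UsageRecord("SITE-A", "HEPSPEC", 10.0, 100.0),
        UsageRecord("SITE-B", "HEPscore23", 15.0, 10.0),
    ]
    print(aggregate_by_benchmark(records))
    print(combined_total(records))
```

Keeping the per-benchmark totals separate is what makes it possible to follow how much of the infrastructure has moved to the new benchmark over time.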

Recent activities:

  • Some tests were performed, in particular with sites sending normalised reports.
  • APEL client 1.9.2 has been released, adding basic HEPscore23 publishing using the existing message format
    • It needs to be added to UMD
  • APEL server release candidate in testing
    • Liaising with Portal on setting up testing with them
    • this new version allows the aggregation of the accounting records by benchmark, to monitor the move to the new benchmark over time
    • When the tests are successful, final release of APEL server update and of the Portal
  • Information for testing the publication of accounting records with the new benchmark:
  • A fix is expected in ARC-CE for the proper configuration of HEPscore23
  • Please contact us if you'd like to make tests with the new benchmark

HEPSCORE application:

April GDB:

June WLCG Operations Coordination meeting:

Monitoring of webdav and xrootd protocols/endpoints

AOB


Next meeting

July or August
