General information

GEANT TCS certificate service interruption

Middleware


UMD

Migration to EL9

Following PROC16 Decommissioning of unsupported software

Broadcast circulated in June.

Requested to enable the metric to detect CentOS7 endpoints:

The NGIs can open tickets against sites to track the migration

Operations

Accounting Portal

  • After an update introducing the new benchmark (involving both the Accounting Repository and the Portal), the Accounting Portal started to show doubled CPU consumption
    • when existing data for past months are received again, they are added to the values already stored instead of overwriting them
    • for the moment, the DB has been restored to a point where no duplication was present, and the reception of data has been temporarily disabled.
  • Fix to be implemented this afternoon
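
The duplication described above can be illustrated with a minimal sketch. It is not the Accounting Repository code: the record layout (site, month, CPU hours), the site name and both loader functions are hypothetical, chosen only to contrast additive loading with overwriting when the same monthly summary is received twice.

```python
# Hypothetical record layout: (site, month, cpu_hours).
# The store maps (site, month) -> cpu_hours.

def load_additive(store, records):
    """Faulty behaviour: re-sent summaries are added to the stored values."""
    for site, month, cpu_hours in records:
        key = (site, month)
        store[key] = store.get(key, 0.0) + cpu_hours


def load_overwrite(store, records):
    """Intended behaviour: a re-sent summary replaces the stored value."""
    for site, month, cpu_hours in records:
        store[(site, month)] = cpu_hours


# The same (made-up) monthly summary arrives twice.
batch = [("SITE-A", "2025-01", 1000.0)]

additive_store = {}
load_additive(additive_store, batch)
load_additive(additive_store, batch)

overwrite_store = {}
load_overwrite(overwrite_store, batch)
load_overwrite(overwrite_store, batch)

print(additive_store[("SITE-A", "2025-01")])   # doubled value
print(overwrite_store[("SITE-A", "2025-01")])  # unchanged value
```

Keying the store on the (site, month) pair makes republishing idempotent: sending a summary again leaves the total unchanged.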

ARGO/SAM

  • Waiting for the new version of the HTCondorCE probe
    • for the moment the endpoints are tested with the host certificate validity metric
  • Several sites with HTCondorCE are failing the tests:
    • They still have HTCondor 9 (on CentOS 7) which doesn't work correctly with the new HTCondor client (v23) on EL9
    • Those sites are requested to upgrade to HTCondor 23.0.x as soon as possible
  • Monitoring issue with ARC-CE 6.20.1 version
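
As a rough illustration of what a host certificate validity check involves, the sketch below classifies a certificate by the time left before its `notAfter` date. It is not the ARGO probe: the function names, the 30-day warning threshold and the status labels are assumptions made for the example. Only `ssl.cert_time_to_seconds` from the Python standard library is relied upon.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate notAfter string
    in the format 'Jan 15 00:00:00 2025 GMT'."""
    expiry_epoch = ssl.cert_time_to_seconds(not_after)
    return (expiry_epoch - now.timestamp()) / 86400.0


def classify(not_after: str, now: datetime, warn_days: float = 30.0) -> str:
    """Return a status label; the threshold is an assumed value."""
    remaining = days_until_expiry(not_after, now)
    if remaining < 0:
        return "CRITICAL"  # certificate already expired
    if remaining < warn_days:
        return "WARNING"   # certificate expires soon
    return "OK"


# Fixed reference time so the example is reproducible.
reference = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(classify("Dec 31 00:00:00 2024 GMT", reference))
print(classify("Jan 11 00:00:00 2025 GMT", reference))
print(classify("Jun 10 00:00:00 2025 GMT", reference))
```

In a live check, the `notAfter` string would come from the certificate presented by the endpoint, for example via `ssl.SSLSocket.getpeercert()`.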

FedCloud

Feedback from DMSU


New Known Error Database (KEDB)

The KEDB has been moved to Jira+Confluence: https://confluence.egi.eu/display/EGIKEDB/EGI+Federation+KEDB+Home

  • problems are tracked with Jira tickets to better follow their evolution
  • problems can be registered by DMSU staff and EGI Operations team

Monthly Availability/Reliability

Under-performed sites in the past A/R reports with issues not yet fixed:

  • NGI_CHINA: https://helpdesk.ggus.eu/#ticket/zoom/1596 
    • BEIJING-T1: host certificate validity metric is failing.
    • HK-LCG2: DNS issues with ARC-CE; SE certificate has expired. Problems with the national CA: they are in contact with another CA to get new host certificates for their services.
  • NGI_CH: https://helpdesk.ggus.eu/#ticket/zoom/1577
    • CSCS-LCG2: test job failures due to the REST interface and IGTF; IGTF fixed; the LDAP server was disabled on the CE, so the tests are failing; waiting for the new version of the probe.
  • NGI_CH: https://helpdesk.ggus.eu/#ticket/zoom/1578 
    • UNIBE-LHEP: the LDAP server was disabled on the CE so the tests are failing; waiting for the new version of the probe.
  • NGI_IE: https://helpdesk.ggus.eu/#ticket/zoom/2074 
    • WALTON-CLOUD: as a consequence of the cyber attack carried out at SETU in November, all internet traffic through their firewalls was disabled. They have to migrate the Walton Institute connections onto a new infrastructure; at that point the default routes should start to work again.
  • NGI_IT: https://helpdesk.ggus.eu/#ticket/zoom/1714 
    • INFN-BARI: job submission failures
    • INFN-GENOVA: SRM and job submission failures
  • NGI_IT: https://helpdesk.ggus.eu/#ticket/zoom/1710 
    • INFN-PISA: information on GOCDB about webdav to be fixed. 
  • NGI_IT:
    • INFN-MILANO-ATLASC: https://helpdesk.ggus.eu/#ticket/zoom/1696 
      • internal error in StoRM's webdav server that could not be sorted out; plans to phase out StoRM and migrate to dCache. New dCache server installed, but the webdav tests are failing because of an authentication failure; HTCondorCE has to be reinstalled with a newer version.
  • NGI_IT: https://helpdesk.ggus.eu/#ticket/zoom/1685 
    • INFN-ROMA1-CMS: Downtime for replacing the UPS; webdav failures
  • NGI_IT: https://helpdesk.ggus.eu/#ticket/zoom/1692 
    • INFN-LECCE: they need to make a plan for migrating to EL9.
    • INFN-NAPOLI-ATLAS: migration to Alma9 and HTCondor 23 is ongoing; tests are OK after some power supply issues
    • INFN-TORINO: they need to make a plan for migrating to EL9.
    • INFN-TRIESTE: they need to make a plan for migrating to EL9.
    • RECAS-NAPOLI: migration to EL9: expected to be completed by end of January 2025
  • NGI_IT: https://helpdesk.ggus.eu/#ticket/zoom/2114
    • EODC: integration with Check-in to be fixed
    • INFN-MIB: authentication issues, fixed.
  • NGI_RO: https://helpdesk.ggus.eu/#ticket/zoom/1961 
    • GRIDIFIN: the arc-ce-srm metric is constantly failing.
    • RO-03-UPB: jobs could not be submitted even if RTE was enabled; the priority in the queues has been fixed.
  • NGI_RO: https://helpdesk.ggus.eu/#ticket/zoom/1962 
    • RO-07-NIPNE: migration to AlmaLinux 9, issues with the UPS; new failures with the jobs. The test jobs cannot complete, but they are successful on the ARGO devel instance where ARC client v7 is used; the ARC-CE team has been involved.
  • NGI_UK: https://helpdesk.ggus.eu/#ticket/zoom/1813 
    • UKI-SOUTHGRID-BRIS-HEP: downtime for a major infrastructure overhaul; the migration to EL9 has been completed and new storage and batch systems commissioned. Working on the authentication settings of the HTCondorCE.

Under-performed sites after 3 consecutive months, under-performed NGIs, QoS violations (February 2024):

sites suspended:

  • BG05-SUGrid: under-performing

Publishing the services in the BDII

All the sites are asked to publish their computing and storage endpoints in the BDII in order to:

  • allow the collection of the information of the compute and storage capacity of the Infrastructure
  • allow the verification of the middleware version installed across the Infrastructure (for upgrade campaigns and security reasons mainly)

Configuring the Site-BDII and the infoprovider on the several endpoints

New benchmark HEPscore23

The benchmark HEPscore23 is replacing the old HEP-SPEC06

Recent activities:

  • APEL client 2.1.0 released and included in UMD 5
  • Accounting Repository and Portal have been upgraded to support the new version of the accounting records compliant with the new benchmark.
    • Accounting Portal: new features to filter the accounting records by benchmark deployed in production at the end of February.
  • When the issues mentioned earlier are solved, we will ask the sites to also use the new benchmark when sending the accounting records

HEPSCORE application:

WLCG Operations Coordination meeting (Oct 2024)

New helpdesk

AOB


Next meeting

March
