General information
GEANT TCS certificate service interruption
- As of 10th January 2025 it is no longer possible to request or renew GEANT TCS certificates
- see the broadcast sent on Nov 21st
- New solutions are under investigation, but finalising them will take time.
Middleware
UMD
- UMD5 released: https://repository.egi.eu/umd/distribution.html?id=5#5
- APEL 2.1.0, APEL SSM 3.4.1
- ARC 6.20.1
- BDII 6.0.3
- WN 5.1.0
- UI 7.0.0
- Dcache 9.2.25
- Gfal2 2.23.0
- Frontier-squid 5.9.2
- Voms 2.1.0, voms-api 3.3.3, voms-client-java 3.3.3, voms-client-cpp 2.1.0
- xroot 5.7.1
- htcondor-ce 23.0
- cvmfs 2.11.5
- config-egi 2.6.1
- egi-cvmfs 6.7.28
- Davix 0.8.7
Migration to EL9
Following PROC16 Decommissioning of unsupported software
Broadcast circulated in June.
Requested to enable the metric to detect CentOS7 endpoints:
- GGUS 167352
The NGIs can open tickets against sites to track the migration
Operations
Accounting Repository
The Pub/Sync system was taken offline because of a security issue. Accounting Repository operation is unaffected, but the Repository test is provided via the Pub/Sync hosts.
We receive weekly reports by email about the publication of the accounting records.
ARGO/SAM
- Waiting for the new version of the HTCondorCE probe
- for the moment the endpoints are tested with the host certificate validity metric
- Several sites with HTCondorCE are failing the tests:
- They still have HTCondor 9 (on CentOS 7) which doesn't work correctly with the new HTCondor client (v23) on EL9
- Those sites are requested to upgrade to HTCondor 23.0.x as soon as possible
- Monitoring issue with ARC-CE 6.20.1 version
- ARC-CE-srm status is missing because of some failures with ARC-CE-SRM-result metric (or jobs cannot complete their run)
- ARC-CE-result status is missing because "job not finished" with ARC-CE-submit metric
- the same endpoints are ok on the ARGO devel instance where ARC-CE client v7 is used
- not yet released in production because of some further fixes needed in combination with the new version of the probe
- asked the developers to investigate
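As an illustration of what a host certificate validity check looks at, here is a minimal Python sketch. It is not the ARGO probe: the function names are hypothetical, and `fetch_not_after` assumes an endpoint that speaks plain TLS and whose CA is in the local trust store (IGTF CAs usually have to be added explicitly). With the default context an already expired certificate makes the handshake fail with an exception rather than return a date.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate notAfter string.

    `not_after` uses the format returned by ssl.getpeercert(),
    for example 'Jan 11 00:00:00 2025 GMT'. Negative means expired.
    """
    expiry_epoch = ssl.cert_time_to_seconds(not_after)
    expiry = datetime.fromtimestamp(expiry_epoch, tz=timezone.utc)
    return (expiry - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Read notAfter from a TLS endpoint (hypothetical helper, needs network).

    Assumes the issuing CA is trusted by the default context; otherwise
    the handshake raises ssl.SSLCertVerificationError.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A site admin could call `days_until_expiry(fetch_not_after("ce.example.org"), datetime.now(timezone.utc))` (the hostname is a placeholder) and raise a warning when the result drops below a chosen threshold.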
FedCloud
- some sites were affected by failures between 2024-12-02 and 2024-12-05 due to the expired VA image in the ops VO image list
- Requested the recomputation of the 2024-12 A/R figures:
Feedback from DMSU
New Known Error Database (KEDB)
The KEDB has been moved to Jira+Confluence: https://confluence.egi.eu/display/EGIKEDB/EGI+Federation+KEDB+Home
- problems are tracked with Jira tickets to better follow-up their evolution
- problems can be registered by DMSU staff and EGI Operations team
Monthly Availability/Reliability
Under-performed sites in the past A/R reports with issues not yet fixed:
- AsiaPacific: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168531
- TW-FTT: a network issue prevented jobs from being submitted
- NGI_BG: https://ggus.eu/index.php?mode=ticket_info&ticket_id=169261
- BG05-SUGrid: migration to EL9 by mid-February, some manpower issues.
- NGI_CHINA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168876
- BEIJING-T1: host certificate validity metric is failing.
- HK-LCG2: DNS issues with ARC-CE; SE certificate is expired. Problems with the national CA: they are in contact with another CA to get new host certificates for their services.
- NGI_CH: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168015
- CSCS-LCG2: test jobs failures due to the REST interface and IGTF; IGTF fixed; the LDAP server was disabled on the CE so the tests are failing; waiting for the new version of the probe.
- NGI_CH: https://ggus.eu/index.php?mode=ticket_info&ticket_id=169270
- UNIBE-LHEP: the LDAP server was disabled on the CE so the tests are failing; waiting for the new version of the probe.
- NGI_DE: https://ggus.eu/index.php?mode=ticket_info&ticket_id=167470
- mainz: SRM overload due to the large amount of data transferred
- NGI_HR: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168880
- fedcloud.srce.hr: After reinstalling the OpenStack instance to Antelope we are unable to integrate EGI Checkin with the Horizon. We are actively working on a solution.
- NGI_IBERGRID: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168879
- CIEMAT-LCG2: the SRM endpoint wasn't configured after the upgrade of the dCache servers and it is not used by the supported VOs; there was an issue with a version of Java; now recurring DNS issues.
- NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=166697
- INFN-BARI: job submission failures
- INFN-GENOVA: SRM and job submission failures
- NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=165200
- INFN-PISA: information on GOCDB about webdav to be fixed.
- NGI_IT:
- INFN-MILANO-ATLASC: https://ggus.eu/index.php?mode=ticket_info&ticket_id=167467
- internal error in StoRM's WebDAV server that couldn't be sorted out; plans to phase out StoRM and migrate to dCache. New dCache server installed, but the WebDAV tests are failing because the information on the storage area is missing; HTCondorCE has to be reinstalled with a newer version.
- NGI_IT:
- INFN-CATANIA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168017
- failures with the host certificate validity check: the CE needs to be reinstalled with a newer version.
- INFN-ROMA1: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168018
- Downtime for replacing the UPS
- NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168529
- INFN-ROMA1-CMS: Downtime for replacing the UPS
- NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=169267
- INFN-LECCE:
- INFN-NAPOLI-ATLAS: migration to AlmaLinux 9 and HTCondor 23 is ongoing.
- INFN-TORINO:
- INFN-TRIESTE: they need to make a plan for migrating to EL9.
- RECAS-NAPOLI: migration to EL9: expected to be completed by end of January 2025
- NGI_PL: https://ggus.eu/index.php?mode=ticket_info&ticket_id=169262
- WUT: migration to EL9
- NGI_RO: https://ggus.eu/index.php?mode=ticket_info&ticket_id=169264
- GRIDIFIN: the arc-ce-srm metric is constantly failing.
- RO-03-UPB: jobs cannot be submitted even if RTE was enabled.
- NGI_RO: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168530
- RO-07-NIPNE: migration to AlmaLinux 9, issues with the UPS; new failures with the jobs. The test jobs cannot complete, but they are successful on the ARGO devel instance where ARC client v7 is used: involved the ARC-CE team.
- NGI_UK: https://ggus.eu/index.php?mode=ticket_info&ticket_id=166699
- UKI-SOUTHGRID-BRIS-HEP: downtime for a major infrastructure overhaul. The migration to EL9 has been completed and new storage and batch systems have been commissioned. Working on the authentication settings of the HTCondorCE.
- NGI_UK: https://ggus.eu/index.php?mode=ticket_info&ticket_id=169269
- UKI-SOUTHGRID-OX-HEP: there was a missing csh package issue on WNs; currently the test jobs cannot complete their run, and the status of some metrics is missing.
Under-performed sites after 3 consecutive months, under-performed NGIs, QoS violations (December 2024):
- NGI_DE: https://ggus.eu/index.php?mode=ticket_info&ticket_id=169538
- MPPMU: IGTF outdated, fixed.
- NGI_IBERGRID: https://ggus.eu/index.php?mode=ticket_info&ticket_id=169537
- CESGA: site to be decommissioned
- NGI_IE: https://ggus.eu/index.php?mode=ticket_info&ticket_id=169541
- WALTON-CLOUD:
- ROC_LA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=169539
- SAMPA: problems with querying the host certificate information, investigations ongoing.
- NGI_UK: https://ggus.eu/index.php?mode=ticket_info&ticket_id=169540
- UKI-LT2-Brunel: the IGTF certificates weren't properly updated.
sites suspended:
- INDIACMS-TIFR (AsiaPacific)
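For readers unfamiliar with how the monthly figures above are derived, the following Python sketch shows the commonly used definitions: availability counts uptime over the known period, and reliability additionally excludes scheduled downtime. The class and function names are hypothetical, and the 80% / 85% targets are stated here as assumptions about the site-level thresholds, not taken from these minutes.

```python
from dataclasses import dataclass

# Assumed monthly targets; check the applicable OLA before relying on them.
AVAILABILITY_TARGET = 0.80
RELIABILITY_TARGET = 0.85


@dataclass
class MonthlyStatus:
    up_hours: float
    unknown_hours: float
    scheduled_downtime_hours: float
    total_hours: float


def availability(s: MonthlyStatus) -> float:
    """Uptime divided by the period with a known status."""
    known = s.total_hours - s.unknown_hours
    return s.up_hours / known if known > 0 else 0.0


def reliability(s: MonthlyStatus) -> float:
    """Like availability, but scheduled downtime is also excluded."""
    denom = s.total_hours - s.unknown_hours - s.scheduled_downtime_hours
    return s.up_hours / denom if denom > 0 else 0.0


def is_underperforming(s: MonthlyStatus) -> bool:
    """True if either figure is below its assumed target."""
    return (availability(s) < AVAILABILITY_TARGET
            or reliability(s) < RELIABILITY_TARGET)
```

Under these definitions, a declared downtime (such as a UPS replacement) lowers availability but not reliability, which is why the two figures can diverge for the same site.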
Using YAIM to configure Site and Top BDII on EL9
- Maarten Litmaath created an rpm, glite-yaim-bdii, to make it easier to configure site and top BDII endpoints
- rpm added to the WLCG repository and to UMD5
- For more details, see the documentation: https://twiki.cern.ch/twiki/bin/view/LCG/BDIIconfigYAIMel9
IPv6 readiness plans
- please provide updates to the IPv6 assessment (ongoing) https://wiki.egi.eu/w/index.php?title=IPV6_Assessment
- any relevant information will be summarised at the OMB
VOMS upgrade campaign to EL9
- VOMS released on EL9:
- The sites can now upgrade their VOMS endpoints to either EL8 or EL9
- Packages available on the product team repository:
- EL9 package was also released in UMD5
- Optionally you could keep the current server to work as the database (not exposed to the outside), while you expose externally the new server with voms and voms-admin
- This should shorten the downtime when doing the switch
- Note: voms-admin was found to depend on Python 2, which makes installation on EL9 difficult (EL9 dropped support for Python 2)
- the VOMS team is working to fix this
- as an alternative, the sites can install voms-admin on EL8, where Python 2 is still supported
Currently there are 28 VOMS endpoints in production. We are also starting to decommission about 100 inactive VOs, so the number of VOMS endpoints could also decrease.
Tickets tracked here: 2024 VOMS upgrade campaign
StoRM upgrade campaign to EL9
- INFN is working to release StoRM on EL9
- StoRM WebDAV v1.4.2 (the latest released on CentOS 7) is also available for EL9 in their stable repository
- The other components will be ready soon
- 31 StoRM endpoints published in the BDII
New benchmark HEPscore23
The HEPscore23 benchmark is replacing the old HEP-SPEC06
Recent activities:
- APEL client 2.1.0 released and included in UMD 5
- Testing ongoing, with data sent from some sites to the accounting repository and published into the staging accounting portal
- Please contact us if you'd like to make tests with the new benchmark
- Information for testing the publication of accounting records with the new benchmark:
- Last week the Accounting Repository was upgraded to the new version supporting the new benchmark.
HEPSCORE application:
- link to the gitlab page: https://gitlab.cern.ch/hep-benchmarks/hep-score
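To show why the benchmark matters for accounting, here is a minimal Python sketch of how a usage record is typically normalised: raw wall time is scaled by the number of cores and by a per-core benchmark score, so that work done on different hardware is comparable. The function name and the example score are hypothetical; the exact fields used by the APEL client are not reproduced here.

```python
def normalised_hours(wall_seconds: float, cores: int, score_per_core: float) -> float:
    """Benchmark-normalised core-hours for one job.

    `score_per_core` is the site's benchmark result per core
    (HEP-SPEC06 historically, HEPscore23 going forward).
    """
    if wall_seconds < 0 or cores < 1 or score_per_core <= 0:
        raise ValueError("invalid accounting inputs")
    return wall_seconds / 3600.0 * cores * score_per_core


# Example: a 2-hour, 4-core job on hardware scoring 12.5 per core (made-up value).
job_hours = normalised_hours(7200, 4, 12.5)
```

Because the score is a plain multiplier, a record must state which benchmark produced it; mixing scores from the two benchmarks without labelling them would make the totals meaningless.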
WLCG Operations Coordination meeting (Oct 2024)
New helpdesk
- Pilot production instance was released in October
- The new GGUS implementation is based on Zammad
- You can login and explore the new look
- the supporter role that you have in the old GGUS will be assigned to you automatically a few days after your first login
- First Steps Guide for New GGUS Users (start here)
- Test emails to all support units will be sent
- The current GGUS implementation will be put in read-only mode on Feb 1st
- In January all the open tickets will be imported into the new helpdesk implementation
- a downtime will be required
- All SUs should use the new helpdesk by the end of January (you can already start)
AOB
Next meeting
February