General information
GEANT TCS certificate service interruption
- From the beginning of January 2025 it will no longer be possible to request or renew GEANT TCS certificates
- see the broadcast sent on Nov 21st
- Please renew all host/personal TCS certificates in the coming few weeks
- New solutions are under investigation, but finalising them will take time.
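To decide which certificates to renew first, the remaining validity can be computed from the expiry date. The following is a minimal sketch, not an official tool: it assumes the expiry string comes from `openssl x509 -enddate -noout -in <certificate file>`, which prints a line such as `notAfter=Jan  1 00:00:00 2025 GMT`. The function names and the 30-day threshold are illustrative assumptions.

```python
import ssl

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now: float) -> float:
    """Return the days left before a certificate expires.

    not_after: expiry date as printed by openssl, with or without
               the leading "notAfter=" (e.g. "Jan  1 00:00:00 2025 GMT").
    now:       reference time in seconds since the epoch (UTC).
    """
    prefix = "notAfter="
    if not_after.startswith(prefix):
        not_after = not_after[len(prefix):]
    expiry = ssl.cert_time_to_seconds(not_after.strip())
    return (expiry - now) / SECONDS_PER_DAY


def should_renew(not_after: str, now: float, threshold_days: float = 30) -> bool:
    """True if the certificate expires within threshold_days (assumed policy)."""
    return days_until_expiry(not_after, now) < threshold_days
```

For example, `should_renew(line, time.time())` can be called with the line printed by openssl for each host certificate.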
Middleware
UMD
- UMD5 released: https://repository.egi.eu/umd/distribution.html?id=5#5
- APEL 2.1.0, APEL SSM 3.4.1
- Arc 6.20.1
- BDII 6.0.3
- WN 5.1.0
- UI 7.0.0
- Dcache 9.2.25
- Gfal2 2.23.0
- Frontier-squid 5.9.2
- Voms 2.1.0, voms-api 3.3.3, voms-client-java 3.3.3, voms-client-cpp 2.1.0
- xroot 5.7.1
- htcondor-ce 23.0
- cvmfs 2.11.5
- config-egi 2.6.1
- egi-cvmfs 6.7.28
- Davix 0.8.7
Migration to EL9
Following PROC16 Decommissioning of unsupported software
Broadcast circulated in June.
Requested to enable the metric to detect CentOS7 endpoints:
- GGUS 167352
The NGIs can open tickets against sites to track the migration
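Sites can check locally which hosts still run CentOS 7 before a ticket is opened. The sketch below is only an illustration and is not the ARGO metric mentioned above: it assumes the contents of `/etc/os-release` are available as text, and the function names are hypothetical.

```python
def parse_os_release(text: str) -> dict:
    """Parse the KEY=VALUE lines of an os-release file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in single or double quotes.
        fields[key] = value.strip().strip("\"'")
    return fields


def is_centos7(text: str) -> bool:
    """True if the os-release text describes a CentOS 7 host."""
    fields = parse_os_release(text)
    major = fields.get("VERSION_ID", "").split(".")[0]
    return fields.get("ID") == "centos" and major == "7"
```

A typical call would be `is_centos7(open("/etc/os-release").read())` on each host to be checked.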
Operations
Accounting Repository
The Pub/Sync system was taken offline because of a security issue. The operation of the Accounting Repository is unaffected, but the Repository test is provided via the pub/sync hosts.
We receive weekly reports by email about the publication of the accounting records.
ARGO/SAM
- Waiting for the new version of the HTCondorCE probe
- for the moment the endpoints are tested with the host certificate validity metric
- Several sites with HTCondorCE are failing the tests:
- They still have HTCondor 9 (on CentOS 7) which doesn't work correctly with the new HTCondor client (v23) on EL9
- Those sites are requested to upgrade to HTCondor 23.0.x as soon as possible
- Monitoring issue with ARC-CE 6.20.1 version
- ARC-CE-srm status is missing because of some failures with ARC-CE-SRM-result metric (or jobs cannot complete their run)
- ARC-CE-result status is missing because "job not finished" with ARC-CE-submit metric
- the same endpoints are ok on the ARGO devel instance where ARC-CE client v7 is used
- not yet released in production because of some further fixes needed in combination with the new version of the probe
- asked the developers to investigate
FedCloud
The A/R numbers report sent on 5 December had to be recalculated. Some issues remain and are being fixed by Emir (ARGO).
Feedback from DMSU
From July 1st the second level support is provided by UKIM:
- the partner representing the Macedonian Academic Research Grid Initiative (MARGI) in the EGI Council, is now a full member of the EGI Federation
New Known Error Database (KEDB)
The KEDB has been moved to Jira+Confluence: https://confluence.egi.eu/display/EGIKEDB/EGI+Federation+KEDB+Home
- problems are tracked with Jira tickets to better follow their evolution
- problems can be registered by DMSU staff and EGI Operations team
Monthly Availability/Reliability
Under-performed sites in the past A/R reports with issues not yet fixed:
- AsiaPacific: https://ggus.eu/index.php?mode=ticket_info&ticket_id=167466
- INDIACMS-TIFR: downtime for several structural upgrades in the infrastructure, followed by CE and webdav failures; still to be decided whether to disable the SRM endpoint.
- AsiaPacific: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168531
- TW-FTT: jobs cannot be submitted
- NGI_CHINA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168876
- BEIJING-LCG2: host certificate validity metric is failing.
- HK-LCG2: DNS issues with ARC-CE; SE certificate is expired. Problems with the national CA: they are in contact with another CA to get new host certificates for their services.
- NGI_CH: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168015
- CSCS-LCG2: test jobs failures due to the REST interface and IGTF
- NGI_DE: https://ggus.eu/index.php?mode=ticket_info&ticket_id=167470
- mainz: SRM overload due to the large amount of data transferred
- NGI_IBERGRID: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168488
- BIFI: infrastructure redeployed; a new project for the monitoring VO has been set-up.
- CESGA-CLOUD: recovered
- NGI_IBERGRID: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168879
- CIEMAT-LCG2: the srm endpoint wasn't configured after the upgrade of the dcache servers and it is not used by the supported VOs; there was an issue with a version of Java; now investigating recurring failures with the host certificate validity metric.
- NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=166697
- INFN-BARI: job submission failures
- INFN-GENOVA: SRM and job submission failures
- NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=165200
- INFN-PISA: information on GOCDB about webdav to be fixed.
- NGI_IT:
- INFN-MILANO-ATLASC: https://ggus.eu/index.php?mode=ticket_info&ticket_id=167467
- internal error in StoRM's webdav server that couldn't be sorted out; plans to phase out StoRM and migrate to dCache. New dcache server installed, but the webdav tests are failing because the information on the storage area is missing; HTCondorCE has to be reinstalled with a newer version.
- NGI_IT:
- INFN-CATANIA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168017
- failures with the host certificate validity check: the CE needs to be reinstalled with a newer version.
- INFN-ROMA1: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168018
- Downtime for replacing the UPS
- NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168529
- INFN-ROMA1-CMS: Downtime for replacing the UPS
- NGI_HR: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168880
- fedcloud.srce.hr: After reinstalling the OpenStack instance with Antelope, we are unable to integrate EGI Check-in with Horizon. We are actively working on a solution.
- NGI_RO: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168530
- RO-07-NIPNE: migration to AlmaLinux 9, issues with the UPS; new failures with the jobs. The test jobs cannot complete, but they are successful on the ARGO devel instance where ARC client v7 is used; the ARC-CE team has been involved.
- NGI_UK: https://ggus.eu/index.php?mode=ticket_info&ticket_id=166699
- UKI-SOUTHGRID-BRIS-HEP: downtime for a major infrastructure overhaul; the migration to EL9 has been completed and new storage and batch systems have been commissioned. Working on the authentication settings of the HTCondorCE.
- NGI_UK: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168532
- UKI-LT2-QMUL: long downtime for data centre maintenance; fixed
- UKI-NORTHGRID-LIV-HEP: failures caused by the institute firewall
- UKI-SCOTGRID-ECDF: investigation on some changes that created issues; relocation of the machines in the data centre; fixed.
- UKI-SCOTGRID-GLASGOW: webdav failures which have been resolved.
Under-performed sites after 3 consecutive months, under-performed NGIs, QoS violations (November 2024):
- NGI_BG: https://ggus.eu/index.php?mode=ticket_info&ticket_id=169261
- BG05-SUGrid:
- NGI_CH: https://ggus.eu/index.php?mode=ticket_info&ticket_id=169270
- UNIBE-LHEP:
- NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=169267
- INFN-LECCE:
- INFN-NAPOLI-ATLAS: migration to Alma9 and HTCondor 23 is ongoing.
- INFN-TORINO:
- INFN-TRIESTE: they need to make a plan for migrating to EL9.
- RECAS-NAPOLI: migration to EL9: expected to be completed by end of January 2025
- NGI_PL: https://ggus.eu/index.php?mode=ticket_info&ticket_id=169262
- CYFRONET-CLOUD:
- WUT: migration to EL9
- NGI_RO: https://ggus.eu/index.php?mode=ticket_info&ticket_id=169264
- GRIDIFIN: the arc-ce-srm metric is constantly failing.
- RO-03-UPB: jobs cannot be submitted even if RTE was enabled.
- NGI_UK: https://ggus.eu/index.php?mode=ticket_info&ticket_id=169269
- UKI-LT2-RHUL: failures due to the old HTCondor version provided; migration to EL9 and upgrade of HTCondor planned within a few days.
- UKI-SOUTHGRID-BHAM-HEP: a firewall could be stopping the CE tests.
- UKI-SOUTHGRID-OX-HEP: there was a missing csh package issue on WNs; currently the test jobs cannot complete their run, and the status of some metrics is missing.
Sites suspended:
- GR-07-UOI-HEPLAB (NGI_GRNET)
Using YAIM to configure Site and Top BDII on EL9
- Maarten Litmaath created an rpm, glite-yaim-bdii, to help easily configure site and top BDII endpoints
- rpm added to the WLCG repository and to UMD5
- For more details, see the documentation: https://twiki.cern.ch/twiki/bin/view/LCG/BDIIconfigYAIMel9
IPv6 readiness plans
- please provide updates to the IPv6 assessment (ongoing) https://wiki.egi.eu/w/index.php?title=IPV6_Assessment
- relevant information, if any, will be summarised at the OMB
VOMS upgrade campaign to EL9
- VOMS released on EL9:
- The sites can now upgrade their VOMS endpoints to either EL8 or EL9
- Packages available on the product team repository:
- EL9 package was also released in UMD5
- Optionally you could keep the current server to work as the database (not exposed to the outside), while you expose externally the new server with voms and voms-admin
- This should shorten the downtime when doing the switch
- Note: voms-admin has a dependency on Python 2 that makes the installation on EL9 difficult (EL9 removed support for Python 2)
- the voms team is working to fix this
- as an alternative, the sites can install voms-admin on EL8, where Python 2 is still supported
Currently there are 28 VOMS endpoints in production. We are also starting to decommission about 100 inactive VOs, so the number of VOMS endpoints could also decrease.
Tickets tracked here: 2024 VOMS upgrade campaign
StoRM upgrade campaign to EL9
- INFN is working to release StoRM on EL9
- StoRM WebDAV v1.4.2 (the latest released on CentOS 7) is available also for el9 in their stable repository
- The other components will be ready soon
- 31 StoRM endpoints published in the BDII
- We can track the migration in 2024 StoRM upgrade campaign
New benchmark HEPscore23
The benchmark HEPscore23 is replacing the old Hep-SPEC06
Recent activities:
- APEL client 2.1.0 released and included in UMD 5
- Testing ongoing, with data sent from some sites to the accounting repository and published into the staging accounting portal
- Please contact us if you'd like to make tests with the new benchmark
- Information for testing the publication of accounting records with the new benchmark:
- Last week the Accounting Repository was upgraded to the new version supporting the new benchmark.
HEPSCORE application:
- link to the gitlab page: https://gitlab.cern.ch/hep-benchmarks/hep-score
WLCG Operations Coordination meeting (Oct 2024)
New helpdesk
- Pilot production instance was released in October
- The new GGUS implementation is based on Zammad
- You can log in and explore the new look
- the supporter role that you have in the old GGUS will be assigned to you automatically a few days after your first login
- First Steps Guide for New GGUS Users (start here)
- Test emails to all support units will be sent
- The current GGUS implementation will be put in read-only mode on Feb 1st
- In January all the open tickets will be imported into the new helpdesk implementation
- a downtime will be required
- All SUs should use the new helpdesk by the end of January (you can already start)
AOB
- Accounting data missing since October 2024 for IL-TAU-HEP and TECHNION-HEP:
- https://accounting.egi.eu/egi/ngi/NGI_IL/
- suggested to get in contact with the accounting supporters
Next meeting
January