General information
GEANT TCS certificate service interruption
- As of 10 January 2025, it is no longer possible to request or renew GEANT TCS certificates
- see the broadcast sent on Nov 21st
- New solutions are under investigation, but finalising them will take time.
Middleware
UMD
- UMD5 released: https://repository.egi.eu/umd/distribution.html?id=5#5
- APEL 2.1.0, APEL SSM 3.4.1
- ARC 6.20.1
- BDII 6.0.3
- WN 5.1.0
- UI 7.0.0
- dCache 9.2.25
- Gfal2 2.23.0
- Frontier-squid 5.9.2
- VOMS 2.1.0, voms-api 3.3.3, voms-client-java 3.3.3, voms-client-cpp 2.1.0
- xroot 5.7.1
- htcondor-ce 23.0
- cvmfs 2.11.5
- config-egi 2.6.1
- egi-cvmfs 6.7.28
- Davix 0.8.7
Migration to EL9
Following PROC16 (Decommissioning of unsupported software):
Broadcast circulated in June.
Requested enabling the metric that detects CentOS 7 endpoints:
- GGUS 167352
The NGIs can open tickets against sites to track the migration
Operations
Accounting Repository
The Pub/Sync system was taken offline because of a security issue. Accounting Repository operations are unaffected, but the Repository test, which is provided via the Pub/Sync hosts, is affected.
We receive weekly reports by email about the publication of the accounting records.
ARGO/SAM
- Waiting for the new version of the HTCondorCE probe
- for the moment the endpoints are tested with the host certificate validity metric
- Several sites with HTCondorCE are failing the tests:
- they still run HTCondor 9 (on CentOS 7), which doesn't work correctly with the new HTCondor client (v23) on EL9
- Those sites are requested to upgrade to HTCondor 23.0.x as soon as possible
- Monitoring issue with ARC-CE version 6.20.1
- the ARC-CE-srm status is missing because of failures of the ARC-CE-SRM-result metric (or because jobs cannot complete their run)
- the ARC-CE-result status is missing because the ARC-CE-submit metric reports "job not finished"
- the same endpoints are OK on the ARGO devel instance, where ARC-CE client v7 is used
- not yet released in production because further fixes are needed (GGUS 167050)
- asked the developers to investigate
FedCloud
- some sites were affected by failures between 2024-12-02 and 2024-12-05 due to an expired VA image in the ops VO image list
- Requested the recomputation of the 2024-12 A/R figures:
Feedback from DMSU
New Known Error Database (KEDB)
The KEDB has been moved to Jira+Confluence: https://confluence.egi.eu/display/EGIKEDB/EGI+Federation+KEDB+Home
- problems are tracked with Jira tickets to better follow up on their evolution
- problems can be registered by DMSU staff and EGI Operations team
Monthly Availability/Reliability
Under-performed sites in the past A/R reports with issues not yet fixed:
- NGI_BG: https://helpdesk.ggus.eu/#ticket/zoom/1569
- BG05-SUGrid: migration to EL9 by mid-February; some manpower issues.
- NGI_CHINA: https://helpdesk.ggus.eu/#ticket/zoom/1596
- BEIJING-T1: host certificate validity metric is failing.
- HK-LCG2: DNS issues with ARC-CE; the SE certificate has expired. Problems with the national CA: they are in contact with another CA to get new host certificates for their services.
- NGI_CH: https://helpdesk.ggus.eu/#ticket/zoom/1577
- CSCS-LCG2: test job failures due to the REST interface and IGTF; the IGTF issue is fixed; the LDAP server was disabled on the CE, so the tests are failing; waiting for the new version of the probe.
- NGI_CH: https://helpdesk.ggus.eu/#ticket/zoom/1578
- UNIBE-LHEP: the LDAP server was disabled on the CE, so the tests are failing; waiting for the new version of the probe.
- NGI_DE: https://helpdesk.ggus.eu/#ticket/zoom/1613
- mainz: SRM overload due to the large amount of data transferred, fixed; WebDAV tests failing because of a wrong storage path, fixed.
- NGI_IBERGRID: https://helpdesk.ggus.eu/#ticket/zoom/1662
- CIEMAT-LCG2: the SRM endpoint wasn't configured after the upgrade of the dCache servers, and it is not used by the supported VOs; there was an issue with a Java version; now there are recurring DNS issues.
- NGI_IE: https://helpdesk.ggus.eu/#ticket/zoom/2074
- WALTON-CLOUD:
- NGI_IT: https://helpdesk.ggus.eu/#ticket/zoom/1714
- INFN-BARI: job submission failures
- INFN-GENOVA: SRM and job submission failures
- NGI_IT: https://helpdesk.ggus.eu/#ticket/zoom/1710
- INFN-PISA: the WebDAV information in GOCDB needs to be fixed.
- NGI_IT:
- INFN-MILANO-ATLASC: https://helpdesk.ggus.eu/#ticket/zoom/1696
- internal error in StoRM's WebDAV server that couldn't be sorted out; plans to phase out StoRM and migrate to dCache. A new dCache server has been installed, but the WebDAV tests are failing because the information on the storage area is missing; HTCondorCE has to be reinstalled with a newer version.
- NGI_IT:
- INFN-CATANIA: https://helpdesk.ggus.eu/#ticket/zoom/1698
- failures with the host certificate validity check: the CE needs to be reinstalled with a newer version.
- INFN-ROMA1: https://helpdesk.ggus.eu/#ticket/zoom/1704
- Downtime for replacing the UPS; failures with CE, SRM, webdav.
- NGI_IT: https://helpdesk.ggus.eu/#ticket/zoom/1685
- INFN-ROMA1-CMS: Downtime for replacing the UPS; webdav failures
- NGI_IT: https://helpdesk.ggus.eu/#ticket/zoom/1692
- INFN-LECCE: they need to make a plan for migrating to EL9.
- INFN-NAPOLI-ATLAS: migration to AlmaLinux 9 and HTCondor 23 is ongoing; tests are OK after some power supply issues
- INFN-TORINO: they need to make a plan for migrating to EL9.
- INFN-TRIESTE: they need to make a plan for migrating to EL9.
- RECAS-NAPOLI: migration to EL9 expected to be completed by the end of January 2025
- NGI_RO: https://helpdesk.ggus.eu/#ticket/zoom/1961
- GRIDIFIN: the arc-ce-srm metric is constantly failing.
- RO-03-UPB: jobs could not be submitted even though the RTE was enabled; the queue priorities have been fixed.
- NGI_RO: https://helpdesk.ggus.eu/#ticket/zoom/1962
- RO-07-NIPNE: migration to AlmaLinux 9, issues with the UPS; new job failures: the test jobs cannot complete, but they succeed on the ARGO devel instance where ARC client v7 is used; the ARC-CE team has been involved.
- NGI_UK: https://helpdesk.ggus.eu/#ticket/zoom/1813
- UKI-SOUTHGRID-BRIS-HEP: downtime for a major infrastructure overhaul; the migration to EL9 has been completed and new storage and batch systems have been commissioned. Working on the authentication settings of the HTCondorCE.
- NGI_UK: https://helpdesk.ggus.eu/#ticket/zoom/2184
- UKI-SOUTHGRID-OX-HEP: there was an issue with a missing csh package on the WNs; currently the test jobs cannot complete their run, and the status of some metrics is missing.
- NGI_UK: https://ggus.eu/index.php?mode=ticket_info&ticket_id=169540
- UKI-LT2-Brunel: the IGTF certificates weren't properly updated.
- ROC_LA: https://helpdesk.ggus.eu/#ticket/zoom/1878
- SAMPA: problems with querying the host certificate information, investigations ongoing.
Under-performed sites after 3 consecutive months, under-performed NGIs, QoS violations (January 2024):
- NGI_IT: https://helpdesk.ggus.eu/#ticket/zoom/2114
- EODC:
- INFN-MIB: authentication issues, fixed.
sites suspended:
Publishing the services in the BDII
All the sites are asked to publish their computing and storage endpoints in the BDII in order to:
- allow the collection of information about the compute and storage capacity of the Infrastructure
- allow the verification of the middleware version installed across the Infrastructure (for upgrade campaigns and security reasons mainly)
Configuring the Site-BDII and the info-provider on the various endpoints
- Site-BDII: https://twiki.cern.ch/twiki/bin/view/LCG/BDIIconfigYAIMel9
- HTCondor-CE: https://htcondor.com/htcondor-ce/v24/configuration/optional-configuration/#enabling-bdii-integration
- ARC-CE: all the [infosys] blocks of arc.conf
- dCache: https://www.dcache.org/manuals/Book-10.2/config-info-provider.shtml
- EOS: https://eos-docs.web.cern.ch/diopside/manual/egi.html#info-provider
- StoRM: https://italiangrid.github.io/storm/documentation.html
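To illustrate the middleware-version check mentioned above, here is a minimal Python sketch. It assumes you already have the plain LDIF output of a query such as `ldapsearch -x -LLL -H ldap://<site-bdii-host>:2170 -b o=glue` against a Site-BDII (host name is a placeholder; base DN and port may differ per deployment), and lists each GLUE2 endpoint's URL, implementation name and version. The helper names `parse_ldif` and `endpoint_versions` and the sample record are hypothetical and not part of any EGI tool; base64-encoded values are left undecoded.

```python
"""Sketch: list GLUE2 endpoint implementation versions from BDII LDIF output.

Assumption: the input is plain LDIF as printed by `ldapsearch -LLL`.
The sample record below is made up for illustration.
"""
from collections import defaultdict

SAMPLE_LDIF = """dn: GLUE2EndpointID=ce01.example.org_htcondorce,GLUE2ServiceID=ce01.examp
 le.org_ComputingElement,GLUE2GroupID=resource,o=glue
objectClass: GLUE2Entity
objectClass: GLUE2Endpoint
GLUE2EndpointURL: https://ce01.example.org:9619
GLUE2EndpointImplementationName: HTCondorCE
GLUE2EndpointImplementationVersion: 23.0.18

dn: GLUE2ServiceID=ce01.example.org_ComputingElement,GLUE2GroupID=resource,o=glue
objectClass: GLUE2Service
"""


def parse_ldif(text):
    """Return a list of entries, each a dict: lower-cased attribute -> list of values."""
    # LDIF folds long lines: a line starting with one space continues the previous line.
    unfolded = []
    for line in text.splitlines():
        if line.startswith(" ") and unfolded:
            unfolded[-1] += line[1:]
        else:
            unfolded.append(line)

    entries = []
    current = defaultdict(list)
    for line in unfolded:
        if not line.strip():  # a blank line terminates the current entry
            if current:
                entries.append(dict(current))
                current = defaultdict(list)
            continue
        if line.startswith("#"):  # skip comment lines
            continue
        attr, sep, value = line.partition(":")
        if not sep:
            continue
        # "attr:: value" marks a base64 value; this sketch keeps it undecoded.
        if value.startswith(":"):
            value = value[1:]
        # LDAP attribute names are case-insensitive, so keys are lower-cased.
        current[attr.lower()].append(value.strip())
    if current:
        entries.append(dict(current))
    return entries


def endpoint_versions(ldif_text):
    """Return sorted (url, implementation name, version) tuples for GLUE2Endpoint entries."""
    rows = []
    for entry in parse_ldif(ldif_text):
        classes = {c.lower() for c in entry.get("objectclass", [])}
        if "glue2endpoint" not in classes:
            continue  # not an endpoint (e.g. a GLUE2Service entry)

        def first(attr):
            # First value of the attribute, or "unknown" if it is not published.
            return entry.get(attr.lower(), ["unknown"])[0]

        rows.append((
            first("GLUE2EndpointURL"),
            first("GLUE2EndpointImplementationName"),
            first("GLUE2EndpointImplementationVersion"),
        ))
    return sorted(rows)


if __name__ == "__main__":
    for url, name, version in endpoint_versions(SAMPLE_LDIF):
        print(url, name, version)
```

Only entries whose objectClass includes GLUE2Endpoint are reported, so the same function can be applied to the full dump of a Site-BDII; endpoints that do not publish a version show up as "unknown", which is itself a useful signal for the publication campaign.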
IPv6 readiness plans
- please provide updates to the IPv6 assessment (ongoing) https://wiki.egi.eu/w/index.php?title=IPV6_Assessment
- if relevant, the information will be summarised at the OMB
New benchmark HEPscore23
HEPscore23 is replacing the old HEP-SPEC06 benchmark.
Recent activities:
- APEL client 2.1.0 released and included in UMD 5
- Testing ongoing, with data sent from some sites to the accounting repository and published into the staging accounting portal
- Please contact us if you'd like to make tests with the new benchmark
- Information for testing the publication of accounting records with the new benchmark:
- In December the Accounting Repository was upgraded to the new version supporting the new benchmark.
- The APEL server with HEPscore functionality is planned to be deployed in early February.
- Accounting Portal: new features to filter the accounting records under test in the staging instance.
HEPSCORE application:
- link to the gitlab page: https://gitlab.cern.ch/hep-benchmarks/hep-score
WLCG Operations Coordination meeting (Oct 2024)
New helpdesk
- New system in production since Thu 30th Jan:
- The tickets still open have been imported into the new system
- Information for the access and about roles:
AOB
Next meeting
March