General information
GEANT TCS certificate service interruption
- As of 10th January 2025 it is not possible to request/renew GEANT TCS certificates any longer
- the broadcast sent on Nov 21st was removed by accident
- The next implementation of the service is almost ready for business, as presented here:
Middleware
UMD
- UMD5 released: https://repository.egi.eu/umd/distribution.html?id=5#5
- recent additions
- ARC 6.21.2
- dCache 9.2.31
- storm-webdav 1.4.2
- News from product teams:
- ARC 7 getting closer to release in Nordugrid channels
- StoRM:
- StoRM WebDAV v1.5.0 released for el9 (el8 and el7)
- Release notes: https://github.com/italiangrid/storm-webdav/releases/tag/v1.5.0
- Available on StoRM stable EL9 repo: https://repo.cloud.cnaf.infn.it/service/rest/repository/browse/storm-rpm-stable/redhat9/ (repofile)
- Legacy components (Backend + Frontend + Info Provider + Native Libs) still in the testing phase at T1
- Available on StoRM beta EL9 repo: https://repo.cloud.cnaf.infn.it/service/rest/repository/browse/storm-rpm-beta/redhat9/ (repofile)
- Still working on new StoRM documentation
- The next StoRM WebDAV release will be tested soon (breaking change: move to Java 17)
- VOMS-Admin: move to python3 not yet finalised
Migration to EL9
Following PROC16 Decommissioning of unsupported software
Broadcast circulated in June.
Requested to enable the metric to detect CentOS7 endpoints:
- GGUS 167352
The NGIs can open tickets against sites to track the migration
Operations
Accounting Portal
- After an update for introducing the new benchmark (involving both the Accounting Repository and the Portal), the Accounting Portal started to show a double CPU consumption
- when existing data for past months are received again, they are added to the stored values instead of overwriting them
- for the moment, the DB has been restored to a point when no duplication was present, and the receiving of data has been temporarily disabled.
- Fix to be implemented this afternoon
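The double counting described above comes down to the difference between adding re-sent records and replacing them. A minimal sketch, assuming a hypothetical in-memory store keyed by (site, month); the names and record shape are illustrative only and do not reflect the actual Accounting Repository or Portal code:

```python
from typing import Dict, Tuple

# Hypothetical key: (site name, month). Value: CPU hours for that month.
Key = Tuple[str, str]


def load_additive(store: Dict[Key, float], incoming: Dict[Key, float]) -> None:
    """Faulty behaviour: a re-sent month is added to the stored value."""
    for key, cpu_hours in incoming.items():
        store[key] = store.get(key, 0.0) + cpu_hours


def load_overwrite(store: Dict[Key, float], incoming: Dict[Key, float]) -> None:
    """Intended behaviour: a re-sent month replaces the stored value."""
    for key, cpu_hours in incoming.items():
        store[key] = cpu_hours


# The same month is received twice (for example after a republish).
resent = {("SITE-A", "2024-12"): 100.0}

faulty = {("SITE-A", "2024-12"): 100.0}
load_additive(faulty, resent)    # stored value doubles

fixed = {("SITE-A", "2024-12"): 100.0}
load_overwrite(fixed, resent)    # stored value is unchanged
```

With the overwrite variant, receiving the same month any number of times leaves the total unchanged, which is why republishing past data is safe only in that case.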
ARGO/SAM
- Waiting for the new version of the HTCondorCE probe
- for the moment the endpoints are tested with the host certificate validity metric
- Several sites with HTCondorCE are failing the tests:
- They still have HTCondor 9 (on CentOS 7) which doesn't work correctly with the new HTCondor client (v23) on EL9
- Those sites are requested to upgrade to HTCondor 23.0.x as soon as possible
- Monitoring issue with ARC-CE 6.20.1 version
- ARC-CE-srm status is missing because of some failures with ARC-CE-SRM-result metric (or jobs cannot complete their run)
- ARC-CE-result status is missing because "job not finished" with ARC-CE-submit metric
- the same endpoints are ok on the ARGO devel instance where ARC-CE client v7 is used
- not yet released in production because of some further fixes needed (https://helpdesk.ggus.eu/#ticket/zoom/1566)
- asked the developers to investigate
FedCloud
Feedback from DMSU
New Known Error Database (KEDB)
The KEDB has been moved to Jira+Confluence: https://confluence.egi.eu/display/EGIKEDB/EGI+Federation+KEDB+Home
- problems are tracked with Jira tickets to better follow up on their evolution
- problems can be registered by DMSU staff and EGI Operations team
Monthly Availability/Reliability
Under-performed sites in the past A/R reports with issues not yet fixed:
- NGI_CHINA: https://helpdesk.ggus.eu/#ticket/zoom/1596
- BEIJING-T1: host certificate validity metric is failing.
- HK-LCG2: DNS issues with ARC-CE; SE certificate is expired. Problems with the national CA: they are in contact with another CA to get new host certificates for their services.
- NGI_CH: https://helpdesk.ggus.eu/#ticket/zoom/1577
- CSCS-LCG2: test job failures due to the REST interface and IGTF; IGTF fixed; the LDAP server was disabled on the CE so the tests are failing; waiting for the new version of the probe.
- NGI_CH: https://helpdesk.ggus.eu/#ticket/zoom/1578
- UNIBE-LHEP: the LDAP server was disabled on the CE so the tests are failing; waiting for the new version of the probe.
- NGI_IE: https://helpdesk.ggus.eu/#ticket/zoom/2074
- WALTON-CLOUD: as a consequence of the cyber attack carried out at SETU in November, all internet traffic through their firewalls was disabled. They have to migrate the Walton Institute connections onto a new infrastructure; at that point the default routes should start to work again.
- NGI_IT: https://helpdesk.ggus.eu/#ticket/zoom/1714
- INFN-BARI: job submission failures
- INFN-GENOVA: SRM and job submission failures
- NGI_IT: https://helpdesk.ggus.eu/#ticket/zoom/1710
- INFN-PISA: information on GOCDB about webdav to be fixed.
- NGI_IT:
- INFN-MILANO-ATLASC: https://helpdesk.ggus.eu/#ticket/zoom/1696
- internal error in StoRM's webdav server that couldn't be sorted out; plans to phase out StoRM and migrate to dCache. New dCache server installed but the webdav tests are failing because of an authentication failure; HTCondorCE has to be reinstalled with a newer version.
- NGI_IT:
- INFN-CATANIA: https://helpdesk.ggus.eu/#ticket/zoom/1698
- failures with the host certificate validity check: the CE needs to be reinstalled with a newer version.
- INFN-ROMA1: https://helpdesk.ggus.eu/#ticket/zoom/1704
- Downtime for replacing the UPS; failures with CE, SRM, webdav.
- NGI_IT: https://helpdesk.ggus.eu/#ticket/zoom/1685
- INFN-ROMA1-CMS: Downtime for replacing the UPS; webdav failures
- NGI_IT: https://helpdesk.ggus.eu/#ticket/zoom/1692
- INFN-LECCE: they need to make a plan for migrating to EL9.
- INFN-NAPOLI-ATLAS: migration to Alma9 and HTCondor 23 is ongoing; tests are OK after some power supply issues
- INFN-TORINO: they need to make a plan for migrating to EL9.
- INFN-TRIESTE: they need to make a plan for migrating to EL9.
- RECAS-NAPOLI: migration to EL9: expected to be completed by end of January 2025
- NGI_IT: https://helpdesk.ggus.eu/#ticket/zoom/2114
- EODC: integration with Check-in to be fixed
- INFN-MIB: authentication issues, fixed.
- NGI_RO: https://helpdesk.ggus.eu/#ticket/zoom/1961
- GRIDIFIN: the arc-ce-srm metric is constantly failing.
- RO-03-UPB: jobs could not be submitted even if RTE was enabled; the priority in the queues has been fixed.
- NGI_RO: https://helpdesk.ggus.eu/#ticket/zoom/1962
- RO-07-NIPNE: migration to AlmaLinux 9, issues with the UPS; new failures with the jobs. The test jobs cannot complete, but they are successful on the ARGO devel instance where ARC client v7 is used: involved the ARC-CE team.
- NGI_UK: https://helpdesk.ggus.eu/#ticket/zoom/1813
- UKI-SOUTHGRID-BRIS-HEP: downtime for a major infrastructure overhaul; the migration to EL9 has been completed and new storage and batch systems commissioned. Working on the authentication settings of the HTCondorCE.
Under-performed sites after 3 consecutive months, under-performed NGIs, QoS violations (February 2024):
- ROC_CANADA: https://helpdesk.ggus.eu/#ticket/zoom/2521
- CA-WATERLOO-T2: major rework of power and cooling infrastructure started in December, re-installation of the services is going on.
- ROC_LA: https://helpdesk.ggus.eu/#ticket/zoom/2520
- ATLAND:
- EELA-UTFSM: IGTF certificates weren't properly updated on several machines, fixed.
- Russia: https://helpdesk.ggus.eu/#ticket/zoom/2522
- JINR-T1: IGTF not updated in time; at the moment they need to keep an obsolete CA for local reasons.
sites suspended:
- BG05-SUGrid: under-performing
Publishing the services in the BDII
All the sites are asked to publish their computing and storage endpoints in the BDII in order to:
- allow the collection of the information of the compute and storage capacity of the Infrastructure
- allow the verification of the middleware version installed across the Infrastructure (for upgrade campaigns and security reasons mainly)
Configuring the Site-BDII and the infoprovider on the various endpoints
- Site-BDII: https://twiki.cern.ch/twiki/bin/view/LCG/BDIIconfigYAIMel9
- HTCondor-CE: https://htcondor.com/htcondor-ce/v24/configuration/optional-configuration/#enabling-bdii-integration
- ARC-CE: all the infosys block of arc.conf
- dCache: https://www.dcache.org/manuals/Book-10.2/config-info-provider.shtml
- EOS: https://eos-docs.web.cern.ch/diopside/manual/egi.html#info-provider
- StoRM: https://italiangrid.github.io/storm/documentation.html
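Once endpoints are published, the middleware versions can be tallied from the GLUE2 records returned by a BDII query. A minimal sketch, assuming the query output has already been saved as LDIF text; the sample entries, host names and version strings below are invented for illustration, and the parser deliberately ignores LDIF line folding and base64-encoded values:

```python
from collections import Counter

# Invented sample data in LDIF form; real BDII output contains many more attributes.
SAMPLE_LDIF = """\
dn: GLUE2EndpointID=se01.example.org_webdav,GLUE2ServiceID=se01,o=glue
GLUE2EndpointImplementationName: dCache
GLUE2EndpointImplementationVersion: 9.2.31

dn: GLUE2EndpointID=se02.example.org_webdav,GLUE2ServiceID=se02,o=glue
GLUE2EndpointImplementationName: dCache
GLUE2EndpointImplementationVersion: 9.2.31

dn: GLUE2EndpointID=se03.example.org_webdav,GLUE2ServiceID=se03,o=glue
GLUE2EndpointImplementationName: StoRM
GLUE2EndpointImplementationVersion: 1.5.0
"""


def implementation_versions(ldif_text: str) -> Counter:
    """Count endpoints per (implementation name, version) pair."""
    counts: Counter = Counter()
    # LDIF entries are separated by blank lines.
    for entry in ldif_text.strip().split("\n\n"):
        attrs = {}
        for line in entry.splitlines():
            if ": " in line:
                name, value = line.split(": ", 1)
                attrs[name] = value
        impl = attrs.get("GLUE2EndpointImplementationName")
        version = attrs.get("GLUE2EndpointImplementationVersion")
        if impl and version:
            counts[(impl, version)] += 1
    return counts


versions = implementation_versions(SAMPLE_LDIF)
for (impl, version), count in sorted(versions.items()):
    print(f"{impl} {version}: {count} endpoint(s)")
```

Endpoints that do not publish both attributes are skipped, so a gap in this tally is itself a hint that a site's infoprovider needs attention.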
New benchmark HEPscore23
The benchmark HEPscore23 is replacing the old Hep-SPEC06
Recent activities:
- APEL client 2.1.0 released and included in UMD 5
- Accounting Repository and Portal have been upgraded to support the new version of the accounting records compliant with the new benchmark.
- Accounting Portal: new features to filter the accounting records by benchmark deployed in production at the end of February.
- When the issues mentioned earlier are solved, we will ask the sites to also use the new benchmark when sending the accounting records
HEPSCORE application:
- link to the gitlab page: https://gitlab.cern.ch/hep-benchmarks/hep-score
WLCG Operations Coordination meeting (Oct 2024)
New helpdesk
- New system in production since Thu 30th Jan:
- The tickets still open have been imported into the new system
- Helpdesk documentation to be updated: https://docs.egi.eu/internal/helpdesk/
- Already updated:
- Access and roles
- 'SITES' field and tickets to multiple sites
- Do you have a role to create Team/Alarm tickets? Please propose your changes to the associated pages
- Anyone in the federation can contribute to the documentation
- If you want to share your experience with operating a site (HTC, Storage, Cloud, etc.), just create a PR.
AOB
Next meeting
March