General information
Middleware
UMD
- UMD5 released: https://repository.egi.eu/umd/distribution.html?id=5#5
- Products to add to the next update:
- Storm webdav 1.6.0 (prod repo)
- Storm webdav 1.7.1 (testing repo)
- Arc-CE 7.0.0 (testing repo)
- Xroot 6.7.3
- dCache 9.35
- HTcondor 23.0.22 (production repo)
- HTcondor (24.0.X) (testing repo)
- News from product teams:
- ARC 7 released
- StoRM:
Released versions 1.5.0, 1.6.0 (Java 17 + Spring Security 5.8), 1.6.1, 1.7.0 (Spring Boot 3.x migration + Jetty 12) and 1.7.1 for StoRM WebDAV. They are all available on StoRM stable repository and, since v1.6.0, also on the related release section on GitHub (assets section):
The next iteration, 1.8.0, will include an option to deploy WebDAV behind an NGINX server that takes charge of file GET transfers (expected end of April).
Work is ongoing to align the documentation with el9 (the current documentation is old and still covers the CentOS 7 products).
Regarding el9 support for the legacy SRM products (Backend, Frontend, etc.), our tests have been productive and led us to fix some small bugs. Tests are still in progress, partly because the CI is not in a good state and we are running the test suites manually, but the end does not seem far off.
- VOMS:
Regarding the LSC parsing problem, fixes are in development for both the Java API and the C++ clients.
We should soon be able to release both of them. A SNAPSHOT version of the VOMS Java API (v3.3.4-SNAPSHOT) is already available on Maven Central.
- VOMS-Admin:
- The move to python3 is not yet finalised. Unfortunately, it is hard to find time for it, which is also one of the reasons behind the end-of-life decision.
Migration to EL9
Following PROC16 Decommissioning of unsupported software
Broadcast circulated in June.
Requested to enable the metric to detect CentOS7 endpoints:
- GGUS 167352
The NGIs can open tickets against sites to track the migration
Operations
Accounting Portal
- After an update introducing the new benchmark (involving both the Accounting Repository and the Portal), the Accounting Portal started to show doubled CPU consumption
- when data for past months are received again, they are added to the existing values instead of overwriting them
- for the moment, the DB has been restored to a point where no duplication was present, and data reception has been temporarily disabled.
- Fix to be implemented this afternoon
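The duplication described above can be illustrated with a minimal sketch. It is not the Accounting Portal code: the record key (site, month), the function names and the sample values are hypothetical, chosen only to contrast additive loading with overwrite-on-resend.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical record key: (site name, month). Value: CPU hours.
Key = Tuple[str, str]


def apply_additive(store: Dict[Key, float],
                   incoming: Iterable[Tuple[Key, float]]) -> None:
    """Problematic behaviour: a re-sent month is added to the stored value."""
    for key, cpu_hours in incoming:
        store[key] = store.get(key, 0.0) + cpu_hours


def apply_overwrite(store: Dict[Key, float],
                    incoming: Iterable[Tuple[Key, float]]) -> None:
    """Intended behaviour: a re-sent month replaces the stored value."""
    for key, cpu_hours in incoming:
        store[key] = cpu_hours


# Made-up summary record, received twice.
summary = [(("SITE-A", "2025-01"), 1000.0)]

buggy: Dict[Key, float] = {}
apply_additive(buggy, summary)
apply_additive(buggy, summary)  # same month received again

fixed: Dict[Key, float] = {}
apply_overwrite(fixed, summary)
apply_overwrite(fixed, summary)  # same month received again

print(buggy[("SITE-A", "2025-01")])  # 2000.0 (doubled)
print(fixed[("SITE-A", "2025-01")])  # 1000.0
```

With overwrite semantics, loading the same summary any number of times leaves the stored value unchanged, so re-sent months are harmless.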
ARGO/SAM
- Waiting for the new version of the HTCondorCE probe
- for the moment the endpoints are tested with the host certificate validity metric
- Several sites with HTCondorCE are failing the tests:
- They still have HTCondor 9 (on CentOS 7) which doesn't work correctly with the new HTCondor client (v23) on EL9
- Those sites are requested to upgrade to HTCondor 23.0.x as soon as possible
- Monitoring issue with ARC-CE 6.20.1 version
- ARC-CE-srm status is missing because of some failures with ARC-CE-SRM-result metric (or jobs cannot complete their run)
- ARC-CE-result status is missing because "job not finished" with ARC-CE-submit metric
- the same endpoints are ok on the ARGO devel instance where ARC-CE client v7 is used
- not yet released in production because of some further fixes needed (https://helpdesk.ggus.eu/#ticket/zoom/1566)
- asked the developers to investigate
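For the HTCondorCE item above, an NGI could flag endpoints that still need the upgrade with a simple version comparison. This is only a sketch: the endpoint names and version strings are made up, and the 23.0 threshold is taken from the upgrade request above.

```python
from typing import Dict, List, Tuple


def parse_version(text: str) -> Tuple[int, int]:
    """Return (major, minor) from a dotted version string such as '23.0.22'."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def needs_upgrade(version: str, minimum: Tuple[int, int] = (23, 0)) -> bool:
    """True when the installed version is older than the requested minimum."""
    return parse_version(version) < minimum


# Hypothetical inventory: endpoint name -> installed HTCondor version.
inventory: Dict[str, str] = {
    "ce01.example.org": "9.0.17",
    "ce02.example.org": "23.0.22",
    "ce03.example.org": "24.0.1",
}

to_upgrade: List[str] = sorted(
    host for host, version in inventory.items() if needs_upgrade(version)
)
print(to_upgrade)  # ['ce01.example.org']
```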
FedCloud
Feedback from DMSU
New Known Error Database (KEDB)
The KEDB has been moved to Jira+Confluence: https://confluence.egi.eu/display/EGIKEDB/EGI+Federation+KEDB+Home
- problems are tracked with Jira tickets to better follow-up their evolution
- problems can be registered by DMSU staff and EGI Operations team
Monthly Availability/Reliability
Under-performed sites in the past A/R reports with issues not yet fixed:
- NGI_CH: https://helpdesk.ggus.eu/#ticket/zoom/1577
- CSCS-LCG2: test jobs failures due to the REST interface and IGTF; IGTF fixed; the LDAP server was disabled on the CE so the tests are failing; waiting for the new version of the probe.
- NGI_CH: https://helpdesk.ggus.eu/#ticket/zoom/1578
- UNIBE-LHEP: the LDAP server was disabled on the CE so the tests are failing; waiting for the new version of the probe.
- NGI_IE: https://helpdesk.ggus.eu/#ticket/zoom/2074
- WALTON-CLOUD: as a consequence of the cyber attack carried out at SETU in November, all internet traffic through their firewalls was disabled. They have to migrate the Walton Institute connections onto a new infrastructure; at that point the default routes should start to work again.
- NGI_IT: https://helpdesk.ggus.eu/#ticket/zoom/1714
- INFN-BARI: job submission failures
- INFN-GENOVA: SRM and job submission failures
- NGI_IT: https://helpdesk.ggus.eu/#ticket/zoom/1710
- INFN-PISA: information on GOCDB about webdav to be fixed.
- NGI_IT:
- INFN-MILANO-ATLASC: https://helpdesk.ggus.eu/#ticket/zoom/1696
- internal error in StoRM's webdav server that couldn't be sorted out; plans to phase out StoRM and migrate to dCache. New dCache server installed but the webdav tests are failing because of an authentication failure; HTCondorCE has to be reinstalled with a newer version.
- NGI_IT:
- INFN-CATANIA: https://helpdesk.ggus.eu/#ticket/zoom/1698
- failures with the host certificate validity check: the CE needs to be reinstalled with a newer version.
- INFN-ROMA1: https://helpdesk.ggus.eu/#ticket/zoom/1704
- Downtime for replacing the UPS; failures with CE, SRM, webdav.
- NGI_IT: https://helpdesk.ggus.eu/#ticket/zoom/1685
- INFN-ROMA1-CMS: Downtime for replacing the UPS; webdav failures
- NGI_IT: https://helpdesk.ggus.eu/#ticket/zoom/1692
- INFN-LECCE: they are going to decommission the site.
- INFN-NAPOLI-ATLAS: migration to Alma9 and HTCondor 23 is ongoing; tests are OK after some power supply issues
- INFN-TORINO: they need to make a plan for migrating to EL9.
- INFN-TRIESTE: they need to make a plan for migrating to EL9.
- RECAS-NAPOLI: migration to EL9: expected to be completed by end of January 2025
- NGI_RO: https://helpdesk.ggus.eu/#ticket/zoom/1961
- GRIDIFIN: the arc-ce-srm metric was constantly failing; fixed by reconfiguring the RTE and setting the system crypto policy to DEFAULT:SHA1 with update-crypto-policies
- RO-03-UPB: jobs could not be submitted even though the RTE was enabled; the priority in the queues has been fixed.
- NGI_UK: https://helpdesk.ggus.eu/#ticket/zoom/1813
- UKI-SOUTHGRID-BRIS-HEP: downtime for a major infrastructure overhaul; the migration to EL9 has been completed and new storage and batch systems have been commissioned. Authentication settings of the HTCondorCE have been implemented; working on the storage.
- ROC_CANADA: https://helpdesk.ggus.eu/#ticket/zoom/2521
- CA-WATERLOO-T2: major rework of power and cooling infrastructure started in December, re-installation of the services is going on. Proposed the suspension.
Under-performed sites after 3 consecutive months, under-performed NGIs, QoS violations: (March 2024):
- NGI_CHINA: https://helpdesk.ggus.eu/#ticket/zoom/3057
- HK-LCG2:
- NGI_FRANCE: https://helpdesk.ggus.eu/#ticket/zoom/3058
- IN2P3-LAPP:
- NGI_IBERGRID:
- CETA-GRID: https://helpdesk.ggus.eu/#ticket/zoom/3056
- Connectivity issues that now should have been fixed.
- CIEMAT-LCG2: https://helpdesk.ggus.eu/#ticket/zoom/3055
- some failures with host certificate metric due to DNS resolution
- NGI_UK: https://helpdesk.ggus.eu/#ticket/zoom/3061
- UKI-NORTHGRID-SHEF-HEP: IGTF not updated
Sites suspended:
- INFN-LECCE (NGI_IT): decommissioning.
Publishing the services in the BDII
All the sites are asked to publish their computing and storage endpoints in the BDII in order to:
- allow the collection of the information of the compute and storage capacity of the Infrastructure
- allow the verification of the middleware version installed across the Infrastructure (for upgrade campaigns and security reasons mainly)
Configuring the Site-BDII and the infoprovider on the several endpoints
- Site-BDII: https://twiki.cern.ch/twiki/bin/view/LCG/BDIIconfigYAIMel9
- HTCondor-CE: https://htcondor.com/htcondor-ce/v24/configuration/optional-configuration/#enabling-bdii-integration
- ARC-CE: all the infosys block of arc.conf
- dCache: https://www.dcache.org/manuals/Book-10.2/config-info-provider.shtml
- EOS: https://eos-docs.web.cern.ch/diopside/manual/egi.html#info-provider
- StoRM: https://italiangrid.github.io/storm/documentation.html
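Once the info providers are configured, a site can check what its Site-BDII actually publishes. The sketch below only builds an `ldapsearch` command line and does not contact any server. The host name is hypothetical, and it assumes the usual BDII conventions (LDAP port 2170, GLUE 2 base `o=glue`), which should be checked against the local setup.

```python
import shlex
from typing import List


def build_bdii_query(host: str, port: int = 2170,
                     base: str = "o=glue") -> List[str]:
    """Build an ldapsearch command listing the GLUE 2 endpoints in a BDII."""
    ldap_filter = "(objectClass=GLUE2Endpoint)"
    # Attributes useful for checking interface and middleware version.
    attributes = [
        "GLUE2EndpointInterfaceName",
        "GLUE2EndpointImplementationName",
        "GLUE2EndpointImplementationVersion",
        "GLUE2EndpointURL",
    ]
    return [
        "ldapsearch", "-x", "-LLL",
        "-H", f"ldap://{host}:{port}",
        "-b", base,
        ldap_filter,
        *attributes,
    ]


# Hypothetical Site-BDII host name.
cmd = build_bdii_query("site-bdii.example.org")
print(shlex.join(cmd))  # shell-quoted command, ready to paste
```

The implementation name and version attributes are the ones relevant to the middleware verification mentioned above; an empty result for an endpoint suggests its info provider is not yet publishing.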
New benchmark HEPscore23
The HEPscore23 benchmark is replacing the old HEP-SPEC06
Recent activities:
- APEL client 2.1.0 released and included in UMD 5
- The Accounting Repository and Portal have been upgraded to support the new version of the accounting records, compliant with the new benchmark.
- Accounting Portal: new features to filter the accounting records by benchmark were deployed in production at the end of February.
- When the issues mentioned earlier are solved, we will ask the sites to also use the new benchmark when sending the accounting records
HEPSCORE application:
- link to the gitlab page: https://gitlab.cern.ch/hep-benchmarks/hep-score
WLCG Operations Coordination meeting (Oct 2024)
New helpdesk
- New system in production since Thu 30th Jan:
- The tickets that were still open have been imported into the new system
- Helpdesk documentation to be updated: https://docs.egi.eu/internal/helpdesk/
- Already updated:
- Access and roles
- 'SITES' field and tickets to multiple sites
- Do you have a role to create Team/Alarm tickets? Please propose your changes to the associated pages
- Anyone in the federation can contribute to the documentation
- If you want to share your experience with operating a site (HTC, Storage, Cloud, etc.), just create a PR.
AOB
Next meeting
May