General information
Middleware
UMD
- UMD5 released: https://repository.egi.eu/umd/distribution.html?id=5#5
- APEL 2.1.0, APEL SSM 3.4.1
- ARC 6.20.1
- BDII 6.0.3
- WN 5.1.0
- UI 7.0.0
- dCache 9.2.25
- gfal2 2.23.0
- Frontier-squid 5.9.2
- VOMS 2.1.0, voms-api 3.3.3, voms-client-java 3.3.3, voms-client-cpp 2.1.0
- XRootD 5.7.1
- htcondor-ce 23.0
- cvmfs 2.11.5
- config-egi 2.6.1
- egi-cvmfs 6.7.28
- Davix 0.8.7
Migration to EL9
The migration follows PROC16 (Decommissioning of unsupported software).
A broadcast was circulated in June.
A request was made to enable the metric that detects CentOS 7 endpoints:
- GGUS 167352
The NGIs can open tickets against sites to track the migration
Operations
Accounting Repository
The Pub/Sync system was taken offline because of a security issue. Accounting Repository operation is unaffected, but the Repository test is provided via the Pub/Sync hosts.
We receive weekly reports by email about the publication of the accounting records.
ARGO/SAM
- Waiting for the new version of the HTCondorCE probe
- for the moment the endpoints are tested with the host certificate validity metric
A gfal2 setting is causing failures against SRM/webdav endpoints
- Some SRM/webdav endpoints started to fail the tests when ARGO was migrated to EL9
- A new setting in the new gfal2 version, which requires requesting a token before the file transfer, is responsible for these failures.
- See details in the gfal2 docs.
- See the GGUS ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=168904
- Solution: set the following on ARGO:
# Attempt to retrieve SE-issued tokens
RETRIEVE_BEARER_TOKEN=false
- KEDB entry: EGIKEDB-20
- On the ARGO dev instance, the tests turned green on most of the endpoints after the setting was changed
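To check whether a given gfal2 configuration already carries the workaround above, a small script along these lines could be used. This is only a sketch: the section name `HTTP PLUGIN`, the sample configuration text and the helper name `bearer_token_retrieval_enabled` are assumptions made for illustration and are not taken from the ticket or the gfal2 documentation; only the `RETRIEVE_BEARER_TOKEN` key comes from the notes above.

```python
import configparser

# Sample configuration text, assumed to resemble the gfal2 HTTP plugin
# configuration file. The section name is an assumption.
SAMPLE_CONF = """\
[HTTP PLUGIN]
# Attempt to retrieve SE-issued tokens
RETRIEVE_BEARER_TOKEN=false
"""


def bearer_token_retrieval_enabled(conf_text, section="HTTP PLUGIN"):
    """Return True/False if RETRIEVE_BEARER_TOKEN is set, None if it is absent.

    Hypothetical helper: it only parses INI-style text and does not
    call gfal2 itself.
    """
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(conf_text)
    # fallback=None covers both a missing section and a missing option
    return parser.getboolean(section, "RETRIEVE_BEARER_TOKEN", fallback=None)


if __name__ == "__main__":
    state = bearer_token_retrieval_enabled(SAMPLE_CONF)
    print("RETRIEVE_BEARER_TOKEN:", state)
```

On a real host the text would be read from the actual configuration file instead of `SAMPLE_CONF`; a result of `None` means the option is not set explicitly and the library default applies.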
FedCloud
Feedback from DMSU
Since July 1st the second-level support has been provided by UKIM:
- UKIM, the partner representing the Macedonian Academic Research Grid Initiative (MARGI) in the EGI Council, is now a full member of the EGI Federation
New Known Error Database (KEDB)
The KEDB has been moved to Jira+Confluence: https://confluence.egi.eu/display/EGIKEDB/EGI+Federation+KEDB+Home
- problems are tracked with Jira tickets to better follow up their evolution
- problems can be registered by DMSU staff and EGI Operations team
Monthly Availability/Reliability
Under-performed sites in the past A/R reports with issues not yet fixed:
- AsiaPacific: https://ggus.eu/index.php?mode=ticket_info&ticket_id=167466
- INDIACMS-TIFR: downtime for several structural upgrades in the infrastructure.
- AsiaPacific: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168531
- TW-FTT: jobs cannot be submitted
- NGI_CH: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168015
- CSCS-LCG2: test jobs failures due to the REST interface and IGTF
- NGI_DE: https://ggus.eu/index.php?mode=ticket_info&ticket_id=167470
- mainz: SRM overload due to the large amount of data transferred
- NGI_GRNET: https://ggus.eu/index.php?mode=ticket_info&ticket_id=166696
- GR-07-UOI-HEPLAB: SURL information is missing
- NGI_IBERGRID: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168488
- BIFI: redeploying the infrastructure; the TCP tests are OK, the Keystone ones not yet.
- CESGA-CLOUD: recovered
- NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=166697
- INFN-BARI: job submission failures
- INFN-GENOVA: SRM and job submission failures
- NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=165200
- INFN-PISA: information on GOCDB about webdav to be fixed.
- NGI_IT:
- INFN-MILANO-ATLASC: https://ggus.eu/index.php?mode=ticket_info&ticket_id=167467
- internal error in StoRM's webdav server that could not be sorted out; plans to phase out StoRM and migrate to dCache.
- NGI_IT:
- INFN-CATANIA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168017
- failures with the host certificate validity check
- INFN-ROMA1: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168018
- Downtime for replacing the UPS
- NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168529
- INFN-ROMA1-CMS: Downtime for replacing the UPS
- NGI_RO: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168530
- RO-07-NIPNE: migration to AlmaLinux 9, issues with the UPS; new failures with the jobs
- NGI_UK: https://ggus.eu/index.php?mode=ticket_info&ticket_id=166699
- UKI-SOUTHGRID-BRIS-HEP: downtime for a major infrastructure overhaul. The migration to EL9 has been completed and new storage and batch systems have been commissioned. Work is ongoing on the authentication settings of the HTCondorCE.
- NGI_UK: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168532
- UKI-LT2-QMUL: long downtime for data centre maintenance; fixed
- UKI-NORTHGRID-LIV-HEP: failures caused by the institute firewall
- UKI-SCOTGRID-ECDF: investigation on some changes that created issues; relocation of the machines in the data centre; fixed.
- UKI-SCOTGRID-GLASGOW: webdav failures which have been resolved.
Under-performed sites after 3 consecutive months, under-performed NGIs, QoS violations (October 2024):
- NGI_CHINA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168876
- BEIJING-LCG2:
- HK-LCG2:
- NGI_DE: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168881
- FZK-LCG2: SRM failures; it was decided to decommission the SRM endpoint.
- GoeGrid: webdav failures due to a new gfal2 setting.
- wuppertalprod: webdav failures due to a new gfal2 setting.
- NGI_HR: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168880
- fedcloud.srce.hr: after reinstalling the OpenStack instance with the Antelope release, the site is unable to integrate EGI Check-in with Horizon. The site is actively working on a solution.
- NGI_IBERGRID: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168879
- CIEMAT-LCG2: the SRM endpoint was not configured after the upgrade of the dCache servers and it is not used by the supported VOs; it is to be decided whether the monitoring should be disabled.
- ROC_CANADA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=168878
- CA-VICTORIA-WESTGRID-T2: webdav failures due to a new gfal2 setting
sites suspended:
Using YAIM to configure Site and Top BDII on EL9
- Maarten Litmaath created an rpm, glite-yaim-bdii, to help easily configure site and top BDII endpoints
- rpm added to the WLCG repository, waiting for the inclusion in UMD5
- For more details, see the documentation: https://twiki.cern.ch/twiki/bin/view/LCG/BDIIconfigYAIMel9
IPv6 readiness plans
- please provide updates to the IPv6 assessment (ongoing) https://wiki.egi.eu/w/index.php?title=IPV6_Assessment
- any relevant information will be summarised at the OMB
VOMS upgrade campaign to EL9
- VOMS released on EL9:
- The sites can now upgrade their VOMS endpoints to either EL8 or EL9
- Packages available on the product team repository:
- EL9 package was also released in UMD5
- Optionally you could keep the current server to work as the database (not exposed to the outside), while you expose externally the new server with voms and voms-admin
- This should shorten the downtime when doing the switch
- Note: voms-admin depends on Python 2, which makes its installation on EL9 difficult (EL9 dropped support for Python 2)
- the VOMS team is working to fix this
- as an alternative, the sites can install voms-admin on EL8, where Python 2 is still supported
Currently there are 28 VOMS endpoints in production. We are also starting to decommission about 100 inactive VOs, so the number of VOMS endpoints could also decrease.
Tickets to be tracked here: 2024 VOMS upgrade campaign
StoRM upgrade campaign to EL9
- INFN is working to release StoRM on EL9
- StoRM WebDAV v1.4.2 (the latest released on CentOS 7) is also available for EL9 in their stable repository
- The other components will be ready soon
- 31 StoRM endpoints published in the BDII
- We can track the migration in 2024 StoRM upgrade campaign
New benchmark HEPscore23
The HEPscore23 benchmark is replacing the old HEP-SPEC06
Recent activities:
- APEL client 2.1.0 released and included in UMD 5
- Testing ongoing, with data sent from some sites to the accounting repository and published into the staging accounting portal
- Please contact us if you'd like to make tests with the new benchmark
- Information for testing the publication of accounting records with the new benchmark:
- the plan is to finalise the HEPscore23 deployment by the end of November
HEPSCORE application:
- link to the gitlab page: https://gitlab.cern.ch/hep-benchmarks/hep-score
WLCG Operations Coordination meeting (Oct 2024)
Verify configuration records
On a yearly basis, the information registered in GOCDB needs to be verified. NGIs and RCs have been asked to check it. In particular:
- NGI managers should review the people registered and the roles assigned to them, and in particular check the following information:
- ROD E-Mail
- Security E-Mail
NGI managers should also review the status of the "not certified" RCs, according to the RC Status Workflow.
- RC administrators should review the people registered and the roles assigned to them, and in particular check the following information:
- telephone numbers
- CSIRT E-Mail
RC administrators should also review the information related to the registered service endpoints.
The process should be completed by Oct 7th.
List of tickets in the GGUS search page
- 11 out of 31 tickets still open
New helpdesk
- Pilot production instance was released in October
- The new GGUS implementation is based on Zammad
- You can log in and explore the new look
- the supporter role that you have in the old GGUS will be assigned to you automatically a few days after the first login
- First Steps Guide for New GGUS Users (start here)
AOB
Next meeting
December