General information
EGI Core Services delivered in best effort way
- Broadcast circulated on July 6th
- Gap between the EGI-ACE project (ended in June 2023) and the EOSC procurement that is going to fund them (starting from Jan 2024)
- Delivery in maintenance mode, ensuring continuous operation and security system maintenance
- bug fixing and application of security patches
- no implementation of new features
- no major upgrades
- slower response time to tickets is expected
Middleware
UMD
- Repo under testing: https://egirepo.a.incd.pt/test-url.html
- Supported OS (Alma Linux 9)
- Production Branch (release)
[UMD-5-testing]
baseurl=
- Supported OS (Alma Linux 9)
- New Frontend under testing (no public URL available)
- Will allow UMD-4 and UMD-5 to use different infrastructures.
- Required changes at INCD / LIP infrastructure
- New Backend ready
- Missing Backend → FrontEnd automation
- New Functionalities:
- IPv6 and High Availability
New infrastructure in production during 2nd half of July.
Operations
ARGO/SAM
- Monitoring of xrootd endpoints
- some endpoints are exposed outside the site in read-only mode
- need to modify the xrootd probe to execute only "read" tests
- the new service type "eu.egi.readonly.xrootd" was created for this purpose (see GGUS 160848)
- New version of srm probe to be deployed (GGUS 162411)
- support for py3 only
- support for SRM+HTTPS
- updated default Top-BDII endpoint
ARGO Message Service - Problems with the Swiss CA
- GGUS ticket 162283 opened on 12th June
- The NGI_CH sites cannot publish their accounting records through the Message service since the release of IGTF 1.120
- Swiss CA based on DigitalTrust which has been discontinued
- the bundle introduced a new long-lived CRL for DigitalTrust
- this is an exceptional case to allow the business continuity of the Swiss sites since the CA stopped operating
- AMS instead looks at the CRL links provided in the clients' certificates; these links are no longer valid, so it rejects all the requests
- Proposed to install the long-lived CRL in the trust store, to be used as a locally provided CRL
- it is long-lived and would need to be retrieved only once
- In case of a security issue, the EGI CSIRT will immediately notify the provider to remove the CRL
- if it works, it would resolve the current issue for as long as the CA transition period is ongoing
- AMS team was initially refusing to apply this fix
- Worried about security implications
- Entering the best-effort phase from July 1st
- but the solution was proposed in June
- AMS team is currently assessing the effort that the proposed solution would require on their side
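The proposed fix amounts to preferring a CRL that is already in the trust store over the distribution links embedded in the client certificate. A minimal sketch of that lookup order follows. It assumes the OpenSSL-style `<hash>.r0` naming for CRL files; the function name and return values are hypothetical, and nothing here reflects how AMS is actually implemented.

```python
from pathlib import Path


def pick_crl_source(issuer_hash, trust_store, cdp_urls):
    """Prefer a CRL file already present in the trust store.

    Falls back to the CRL distribution points taken from the client
    certificate only when no local file is found.
    """
    # Assumption: CRLs are stored as <issuer hash>.r0 in the trust store.
    local_crl = Path(trust_store) / f"{issuer_hash}.r0"
    if local_crl.is_file():
        return ("local", str(local_crl))
    if cdp_urls:
        return ("remote", cdp_urls[0])
    return ("none", None)
```

With this order, a discontinued CA whose distribution links are dead can still be validated as long as its long-lived CRL stays installed; removing the file restores the previous behaviour.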
FedCloud
Feedback from DMSU
New Known Error Database (KEDB)
The KEDB has been moved to Jira+Confluence: https://confluence.egi.eu/display/EGIKEDB/EGI+Federation+KEDB+Home
- problems are tracked with Jira tickets to better follow up their evolution
- problems can be registered by DMSU staff and EGI Operations team
Monthly Availability/Reliability
Under-performed sites in the past A/R reports with issues not yet fixed:
- AsiaPacific: https://ggus.eu/index.php?mode=ticket_info&ticket_id=160323
- Australia-T2: CE failures
- TW-NCUHEP: several failures, fixed.
- AsiaPacific: https://ggus.eu/index.php?mode=ticket_info&ticket_id=160769
- Australia-ATLAS: Hardware issues, downtime longer than a month, site SUSPENDED
- INDIACMS-TIFR: problems with SRM settings; intermittent webdav failures due to overheating of some problematic disk nodes; downtime for DPM migration.
- NGI_GRNET: https://ggus.eu/index.php?mode=ticket_info&ticket_id=162186
- GRNET-OPENSTACK: quota issues
- NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=161498
- INFN-ROMA1: new CE failures
- INFN-ROMA1-CMS: problems due to the large number of jobs submitted to the site, fixed.
- INFN-ROMA3: problems configuring STORM and HTCondorCE, fixed.
- NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=162187
- INFN-COSENZA: the recent performance losses and outages were all caused by UPS system failures; the UPS batteries are going to be replaced.
Under-performed sites after 3 consecutive months, under-performed NGIs, QoS violations (June 2023):
- NGI_DE: https://ggus.eu/index.php?mode=ticket_info&ticket_id=162630
- UNI-SIEGEN-HEP
- NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=162628
- INFN-MIB: authentication failures on the SE
- ROC_Russia: https://ggus.eu/index.php?mode=ticket_info&ticket_id=162629
- RU-SARFTI: hardware problems, site suspended by the NGI for long downtime
sites suspended:
- Australia-ATLAS (AsiaPacific), RU-SARFTI (ROC_Russia)
Documentation
- MediaWiki in read-only mode
- content moved to different locations (confluence and https://docs.egi.eu/)
- confluence space hosting policies and procedures: EGI Policies and Procedures
- EGI Federation Operations
- Change Management, Release and Deployment Management, Incident and Service Request Management, Problem Management, Information Security Management
- Manuals, How-Tos, Troubleshooting, FAQs:
- a large amount of material needs to be reviewed and, where necessary, updated when moved to the new location
- location will be https://docs.egi.eu/providers/operations-manuals/
- Guidelines for providers to join EGI: https://docs.egi.eu/providers/joining/
- Tutorial on submitting HTC jobs: https://docs.egi.eu/users/tutorials/htc-job-submission/
IPv6 readiness plans
- please provide updates to the IPv6 assessment (ongoing) https://wiki.egi.eu/w/index.php?title=IPV6_Assessment
- relevant information, if any, will be summarised at the OMB
Change in the APEL client configuration due to CERN top-BDII decommission
- The top-BDII lcg-bdii.cern.ch was turned off in June, as announced in a broadcast circulated on May 23rd.
- Currently this endpoint is a default setting in the APEL Client configuration file /etc/apel/client.cfg
- If you are using this setting, we kindly ask you to change it: you can replace it with the endpoint lcg-bdii.egi.eu (or any other top-bdii which is provided by your NGI).
- So the variable to set in /etc/apel/client.cfg is the following
ldap_host = lcg-bdii.egi.eu
- In this way, the APEL client will query that BDII to gather the information about the benchmark published by your CEs.
- you can set any top-BDII of your preference
Please note that it is also possible to set the benchmark information manually through one or more of the following variables:
## To manually set specs for all jobs (not just local ones), configure lines
## like the following named "manual_spec" followed by consecutive integers for
## however many batch systems are relevant. The value should be a unique name
## for the system, then the spec type ('HEPscore23', 'HEPSPEC' or 'Si2k') and the spec value.
# manual_spec1 = grid10.uni.ac.uk:1234/grid10.uni.ac.uk-condor,HEPSPEC,10.0
# manual_spec2 = grid22.uni.ac.uk:1234/grid22.uni.ac.uk-condor,HEPSPEC,15.0
# manual_spec3 = grid35.uni.ac.uk:1234/grid35.uni.ac.uk-condor,HEPSPEC,15.0
- broadcast about the APEL configuration change sent on May 26th
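To check whether a client configuration still points to the decommissioned endpoint, something along these lines could be used. It is a sketch under assumptions: the section name `[spec_updater]` in the sample is assumed and may differ in your file (the code searches every section, so it does not rely on it), and the helper names are hypothetical. The `manual_spec` format follows the comment block above.

```python
import configparser

OLD_BDII = "lcg-bdii.cern.ch"


def find_ldap_hosts(cfg_text):
    """Return {section: ldap_host} for every section that sets ldap_host."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    return {
        name: parser[name]["ldap_host"].strip()
        for name in parser.sections()
        if parser.has_option(name, "ldap_host")
    }


def parse_manual_spec(value):
    """Split 'name,spec_type,spec_value' into a typed tuple."""
    name, spec_type, spec_value = (part.strip() for part in value.split(","))
    return name, spec_type, float(spec_value)


# Sample content only; the section name is an assumption.
SAMPLE = """
[spec_updater]
ldap_host = lcg-bdii.cern.ch
manual_spec1 = grid10.uni.ac.uk:1234/grid10.uni.ac.uk-condor,HEPSPEC,10.0
"""

for section, host in find_ldap_hosts(SAMPLE).items():
    if host == OLD_BDII:
        print(f"[{section}] still points to {OLD_BDII}; switch to lcg-bdii.egi.eu")
```

On a real host the text would be read from /etc/apel/client.cfg instead of the sample string.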
Transition from X509 to federated identities (AARC profile token)
- In Feb 2022 OSG fully moved to token-based AAI, abandoning X509 certificates
- HTCondorCE: replacement of Grid Community Toolkit
- The long-term support series (9.0.x) from the CHTC repositories will support X509/VOMS authentication until May 2023
- Starting in 9.3.0 (released in October 2021), the HTCondor feature releases do NOT contain this support
- EGI sites are recommended to stay with the long-term support series for the time being
Enabling SSL authentication on HTCondor 9 and 10
The HTCondor team set up an upgrade procedure to help sites and VOs with the migration from X509 personal certificates to tokens.
Essentially, an intermediate step was created in which plain SSL authentication can be used to authenticate a client's proxy, in addition to GSI or token authentication.
In summary, the steps are:
- update to HTCondor 9.0.19
- enable the SSL authz (with priority over GSI)
- map the users' DNs
- test the SSL authz successfully
- update to HTCondor 10.6.0 or later
Note that the last step uses the HTCondor Feature channel, since this is the one supporting the EGI Check-in plugin from 10.4.0.
- In this way the sites can accept clients’ proxies and tokens at the same time while waiting for the supported VOs moving completely to tokens.
Before starting with the official campaign, we are going to test the SSL authz with a few sites.
- the sites will get a ticket for starting the upgrade
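The order of the steps summarised above can be expressed as a small checklist helper. This is only a sketch: the version numbers are the ones quoted in the procedure, while the function names and the boolean flags are hypothetical and would have to be filled in by the site admin.

```python
SSL_STEP_VERSION = (9, 0, 19)   # LTS release named in the procedure
FINAL_VERSION = (10, 6, 0)      # feature release named in the last step


def parse_version(text):
    """Turn '9.0.19' into (9, 0, 19) so versions compare as tuples."""
    return tuple(int(part) for part in text.split("."))


def remaining_steps(installed, ssl_enabled=False, dns_mapped=False, ssl_tested=False):
    """List the steps still to do, in the order given in the summary."""
    version = parse_version(installed)
    steps = []
    if version < SSL_STEP_VERSION:
        steps.append("update to HTCondor 9.0.19")
    if not ssl_enabled:
        steps.append("enable SSL authentication (priority over GSI)")
    if not dns_mapped:
        steps.append("map the users' DNs")
    if not ssl_tested:
        steps.append("test SSL authentication")
    if version < FINAL_VERSION:
        steps.append("update to HTCondor 10.6.0 or later")
    return steps


for step in remaining_steps("9.0.17"):
    print("-", step)
```

The helper deliberately keeps the final update last: moving to the feature channel before SSL authentication is verified would leave proxy-based clients without a working method.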
The check-in-validator-plugin:
- ARC-CE and HTCondorCE implemented a new API interface
- The Check-in team released a plugin for the CEs allowing the Check-in/AARC token profile to work
- according to the AARC guidelines, the claim to authorise the job submission is provided through a different attribute than the one used by the WLCG token
- the plugin translates the attribute to be understandable to the CEs
- The plugin has been successfully tested and can be now included in UMD
- involved DESY-HH, FZK-LCG2, INFN-BARI, INFN-T1, RECAS-NAPOLI
- the plugin was tested with HTCondor Feature Release 10.4.0 which introduced the support to Check-in tokens
- the plugin can be installed with HTCondor 10.4.0 or later
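To illustrate the kind of translation the plugin performs, the sketch below converts an AARC-style entitlement into a group path of the form used in WLCG tokens. It is not the plugin's code: it assumes the entitlement follows the AARC-G002 syntax (`<namespace>:group:<group>[:<subgroup>][:role=<role>]#<authority>`), and the function name and sample VO name are hypothetical.

```python
import re

# Assumed AARC-G002 layout:
# <namespace>:group:<group>[:<subgroup>...][:role=<role>]#<authority>
ENTITLEMENT_RE = re.compile(
    r"^(?P<namespace>.+?):group:(?P<groups>[^#]+?)"
    r"(?::role=(?P<role>[^#:]+))?#(?P<authority>.+)$"
)


def entitlement_to_group_path(entitlement):
    """Map an entitlement to a '/vo/subgroup' style path, or None."""
    match = ENTITLEMENT_RE.match(entitlement)
    if match is None:
        return None
    # Group and subgroups are colon-separated in the entitlement.
    return "/" + "/".join(match.group("groups").split(":"))


print(entitlement_to_group_path(
    "urn:mace:egi.eu:group:vo.example.org:role=member#aai.egi.eu"
))
```

The role and the group authority are parsed but not used here; a real mapping would probably need them to decide which local account a job is mapped to.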
Important for the sites:
- please start collecting information from the VOs you support about the DNs that should be mapped on your endpoints
Important for the VOs:
- update the condor-client as well in coordination with the sites
Monitoring:
- enabling the SSL authentication on the HTCondorCE endpoints should allow the current probe to continue to work, especially during the migration phase
- of course the ARGO user needs to be mapped on each of the endpoints
- To be clarified with the developers whether the current version of the probe also works with tokens or a new version is needed
DPM Decommission and migration
- Support for DPM ended in June 2023
- CERN IT will provide minimal support for DPM until the EOL of CentOS 7, with very little effort:
- only critical issues will be looked into
- DPM provides a migration script to dCache (migration guide)
- In September 2022, tickets were opened to the sites to plan the migration and decommission:
- tickets list (34 out of 57 were solved, 1 unsolved)
- Migrations still pending
- By June 2023
- Australia-T2
- BEIJING-LCG2
- BG05-SUGrid (EOS)
- CYFRONET-LCG2 (EOS)
- GRIF (EOS)
- IN2P3-IRES (DPM read-only from July)
- INFN-FRASCATI (dCache)
- INFN-ROMA1 (dCache)
- IR-IPM-HEP (dCache)
- NCP-LCG2 (dCache)
- UKI-LT2-Brunel (XrootD/CEPHFS)
- UKI-NORTHGRID-LIV-HEP (dCache)
- UNIBE-LHEP (dCache)
- By July 2023
- INFN-COSENZA (dCache)
- TR-10-ULAKBIM (dCache): Completed. To fix an issue with the host certificate validity check.
- By Q3 2023
- UKI-SCOTGRID-DURHAM (XrootD/CEPHFS)
- By Q4 2023
- PSNC (EOS)
- By Q1 2024
- UKI-NORTHGRID-MAN-HEP (XrootD/CEPHFS)
- not clear/no reply
- ATLAND
- GR-07-UOI-HEPLAB
- Please note that after June 30th no support will be provided for the migration to dCache in case of issues.
New benchmark HEPscore23
The benchmark HEPscore23 is replacing the old HEP-SPEC06.
Recent activities:
- Some tests in particular with sites sending normalised reports were performed.
- APEL client 1.9.2 released that adds basic HEPscore23 publishing using existing message format
- It needs to be added to UMD
- APEL server release candidate in testing
- Liaising with Portal on setting up testing with them
- this new version allows the aggregation of the accounting records by benchmark, to monitor the move to the new benchmark over time
- Once the tests are successful, the APEL server update and the Portal will get their final release
- Information for testing the publication of accounting records with the new benchmark:
- A fix is expected in ARC-CE for the proper configuration of HEPscore23
- Please contact us if you'd like to make tests with the new benchmark
HEPSCORE application:
- link to the gitlab page: https://gitlab.cern.ch/hep-benchmarks/hep-score
April GDB:
June WLCG Operations Coordination meeting:
Monitoring of webdav and xrootd protocols/endpoints
- 93 tickets were created requesting to update the information for monitoring webdav and xrootd endpoints
- Extension Properties to set:
- webdav:
- Name: ARGO_WEBDAV_OPS_URL
- Value: webdav URL containing also the VO ops folder, for example: https://darkstorm.cnaf.infn.it:8443/webdav/ops or https://hepgrid11.ph.liv.ac.uk/dpm/ph.liv.ac.uk/home/ops/
- xrootd:
- Name: ARGO_XROOTD_OPS_URL
- Value: XRootD base SURL to test (the path where ops VO has write access, for example: root://eosatlas.cern.ch//eos/atlas/opstest/egi/, root://recas-se-01.cs.infn.it:1094/dpm/cs.infn.it/home/ops/, root://dcache-atlas-xrootd-ops.desy.de:2811/pnfs/desy.de/ops or similar)
- Reference: https://docs.egi.eu/internal/configuration-database/adding-service-endpoint/#webdav
- Link to the broadcast circulated in October 2022
- 74 tickets were solved (4 Unsolved)
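Before closing a ticket, the values entered for the two Extension Properties can be sanity-checked with a few lines of code. The sketch below is a heuristic only: the expected URL schemes are inferred from the examples above, the check that the path mentions "ops" is an assumption, and the function name is hypothetical.

```python
from urllib.parse import urlparse

# Schemes inferred from the example values in the notes (assumption).
EXPECTED_SCHEME = {
    "ARGO_WEBDAV_OPS_URL": "https",
    "ARGO_XROOTD_OPS_URL": "root",
}


def check_extension_property(name, value):
    """Return a list of problems found; an empty list means it looks fine."""
    if name not in EXPECTED_SCHEME:
        return [f"unknown property {name}"]
    problems = []
    url = urlparse(value)
    if url.scheme != EXPECTED_SCHEME[name]:
        problems.append(f"expected {EXPECTED_SCHEME[name]}:// URL")
    if not url.hostname:
        problems.append("missing host name")
    if "ops" not in url.path.lower():
        # Heuristic: the path should point to the ops VO area.
        problems.append("path does not mention the ops VO area")
    return problems


print(check_extension_property(
    "ARGO_WEBDAV_OPS_URL", "https://darkstorm.cnaf.infn.it:8443/webdav/ops"
))
```

A check like this only catches formatting mistakes; whether the ops VO can actually write to the path is still verified by the probe itself.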
AOB
Next meeting
September