General information
- EGI Conference 2023, 19 - 23 June, Poznan (Poland): https://www.egi.eu/event/egi2023/
- Agenda on WHOVA
Middleware
UMD
About the new Linux distribution
- CERN and FNAL decided to jointly support AlmaLinux: https://linux.web.cern.ch/#fermilabcern-recommendation-for-linux-distribution
- check also the presentation at the January GDB
- on 16 January the Linux Future Committee met to discuss the adoption of AlmaLinux: https://indico.cern.ch/event/1229565/
- presentation about AlmaLinux on 31 January: https://indico.cern.ch/event/1242311/
- middleware will be released for AlmaLinux 9 (no longer for CS9)
- CentOS7 is the recommended OS until UMD5 is released
- CentOS7 supported until Jun 2024
- recommended path is C7→AL9
- survey about Linux future circulated by WLCG to site-admins
- results presented during February GDB
About UMD (see also the presentation in March OMB meeting)
- collecting the products available on EL9 to prepare the first UMD5 release
New repository structure implemented in the testing infrastructure and operational:
Main repository: umd/5/release
Testing repository: umd/5/testing
Contrib repository: umd/5/contrib
New public key already available in the repository: https://nexusrepoegi.a.incd.pt/repository/umd/UMD-5-RPM-PGP-KEY
All priority tests were carried out and work as expected (the new bdii is in there for testing).
Small changes were made to the workflow; they were required to improve the automation process.
Much of the work on the back-end and on the Jenkins workflows is close to completion. Running manual scripts is already possible, and the new JSON structure is being finalised (JSON will replace the current XML for releases).
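As an illustration of the new layout, the following is a minimal sketch that renders yum/dnf repository stanzas for the three UMD-5 repositories. The repository paths and the key URL are taken from the notes above; the base URL, the stanza identifiers and the function name are hypothetical, and the real repositories may add further path components (for example OS or architecture directories) that are not described here.

```python
# Public key location as given in the notes.
UMD5_KEY_URL = "https://nexusrepoegi.a.incd.pt/repository/umd/UMD-5-RPM-PGP-KEY"

# Repository paths as listed in the notes.
UMD5_REPOS = {
    "release": "umd/5/release",
    "testing": "umd/5/testing",
    "contrib": "umd/5/contrib",
}


def render_repo_stanza(name, base_url, enabled=True):
    """Render one repository stanza (hypothetical naming scheme).

    base_url is an assumption supplied by the caller: the notes do not
    state the public base URL of the UMD-5 repositories.
    """
    path = UMD5_REPOS[name]
    lines = [
        f"[umd-5-{name}]",
        f"name=UMD 5 {name}",
        f"baseurl={base_url.rstrip('/')}/{path}",
        f"enabled={int(enabled)}",
        "gpgcheck=1",
        f"gpgkey={UMD5_KEY_URL}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder base URL, for illustration only.
    print(render_repo_stanza("testing", "https://repo.example.org/", enabled=False))
```

Keeping the testing and contrib stanzas disabled by default would mirror the usual practice of enabling only the main repository on production hosts, but that choice is left to the site.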
What needs to be done:
Implement a High Availability proxy based on NGINX in order to filter the traffic to the umd4 and umd5 repos.
Using the HA proxy will provide a certain degree of freedom and will improve security since, for example, it will allow us to block most of the bots that keep scanning the repositories.
Move the testing infrastructure to production.
Finalize the umd-5 rpm and upload it to the repositories so people can test it.
Operations
ARGO/SAM
- notifications enabled for the metrics included in the profile ARGO_MON_OPERATORS (instead of ARGO_MON_CRITICAL)
- changing the warning period of the host certificate validity metric
- https://ggus.eu/index.php?mode=ticket_info&ticket_id=161019
- currently it is 1 month before the certificate expiration
- most of the sites cannot fix the situation immediately
- proposing to have the WARNING status 2 weeks before the certificate expiration
- what about implementing an option to allow authorised people to acknowledge a notification (at least to stop the sending for a given amount of days)?
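The effect of shortening the warning period can be sketched as follows. This is not the code of the actual ARGO probe: the function name, the use of CRITICAL for an expired certificate and the approximation of "1 month" as 30 days are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def certificate_status(not_after, now, warning_period=timedelta(days=14)):
    """Classify a host certificate by its remaining validity.

    Returns "CRITICAL" once the certificate has expired (assumption),
    "WARNING" inside the warning period, and "OK" otherwise.
    """
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "CRITICAL"
    if remaining <= warning_period:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    expiry = datetime(2023, 6, 30, tzinfo=timezone.utc)
    now = datetime(2023, 6, 10, tzinfo=timezone.utc)  # 20 days before expiry
    # Current setting, with one month approximated as 30 days.
    print(certificate_status(expiry, now, timedelta(days=30)))
    # Proposed setting: 2 weeks.
    print(certificate_status(expiry, now, timedelta(days=14)))
```

With 20 days of validity left, the current setting already reports WARNING while the proposed 2-week period still reports OK, which is the reduction in early notifications that the proposal aims at.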
- Monitoring of xrootd endpoints
- some endpoints are exposed outside the site in read-only mode
- need to modify the xrootd probe to execute only "read" tests
- the new service type "eu.egi.readonly.xrootd" was created for this purpose (see GGUS 160848)
FedCloud
Feedback from DMSU
New Known Error Database (KEDB)
The KEDB has been moved to Jira+Confluence: https://confluence.egi.eu/display/EGIKEDB/EGI+Federation+KEDB+Home
- problems are tracked with Jira tickets to better follow up their evolution
- problems can be registered by DMSU staff and EGI Operations team
Issues with publishing the accounting records
- the Accounting Portal fails to retrieve the accounting records from the ARGO Message Service
- GGUS 161788
- HTC accounting data missing since end of April, GGUS 161869
- the last accounting records from grid and cloud sites were published on May 5th:
- https://argo.egi.eu/egi/OPERATORS/metrics/argo.APEL-Pub#
- WARNING - WARN [ last published 9 days ago: 2023-05-05 ]
For more info check URL: http://goc-accounting.grid-support.ac.uk:80/rss/ARNES_Pub.html
- WARNING - WARN [ last published 9 days ago: 2023-05-05 ]
- https://argo.egi.eu/egi/ALL/metrics/eu.egi.cloud.APEL-Pub
- FedCloud Accounting Freshness WARNING - Accounting data is older than 7 days. Last update occured 2023-05-05 12:30:12.
- http://goc-accounting.grid-support.ac.uk/cloudtest/cloudsites2.html
- a ticket was opened to APEL to investigate the issue:
- https://argo.egi.eu/egi/OPERATORS/metrics/argo.APEL-Pub#
Monthly Availability/Reliability
Under-performed sites in the past A/R reports with issues not yet fixed:
- AsiaPacific: https://ggus.eu/index.php?mode=ticket_info&ticket_id=160323
- Australia-T2: CE failures
- TW-NCUHEP: several failures, fixed.
- AsiaPacific: https://ggus.eu/index.php?mode=ticket_info&ticket_id=160769
- Australia-ATLAS: CE failures
- INDIACMS-TIFR: problems with SRM settings; intermittent webdav failures due to overheating of some problematic disk nodes.
- NGI_CH: https://ggus.eu/index.php?mode=ticket_info&ticket_id=161499
- T3_CH_PSI: problem with retrieving the SURL; additional failures due to the host certificate. Now the tests are ok.
- NGI_GRNET: https://ggus.eu/index.php?mode=ticket_info&ticket_id=158231
- GR-07-UOI-HEPLAB: SRM failures have been fixed
- NGI_HR: https://ggus.eu/index.php?mode=ticket_info&ticket_id=160330
- egee.irb.hr: certificate expired and other issues to resolve.
- NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=161498
- INFN-ROMA1:
- INFN-ROMA1-CMS: problems due to the large number of jobs submitted on the site, fixed.
- INFN-ROMA3: problems configuring STORM and HTCondorCE, fixed.
- NGI_UK: https://ggus.eu/index.php?mode=ticket_info&ticket_id=156742
- RAL-LCG2: authz issues with the webdav endpoint
- they provide an object store endpoint, and the commands executed by the webdav probe don't work (GGUS 157748).
- new version of the probe created (GGUS 159953) and the inclusion in UMD was requested (GGUS 160997)
- the "ls" check can be disabled by setting the proper information in GOCDB
Under-performed sites after 3 consecutive months, under-performed NGIs, QoS violations: (Apr 2023):
- NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=161837
- INFN-GENOVA: SE certificate was expired; outdated CA on the CE.
sites suspended:
Documentation
- MediaWiki in read-only mode
- content moved to different locations (confluence and https://docs.egi.eu/)
- confluence space hosting policies and procedures: EGI Policies and Procedures
- EGI Federation Operations
- Change Management, Release and Deployment Management, Incident and Service Request Management, Problem Management, Information Security Management
- Manuals, How-Tos, Troubleshooting, FAQs:
- a large amount of material needs to be reviewed and, where necessary, updated when moved to the new location
- location will be https://docs.egi.eu/providers/operations-manuals/
- Guidelines for providers to join EGI: https://docs.egi.eu/providers/joining/
- Tutorial on submitting HTC jobs: https://docs.egi.eu/users/tutorials/htc-job-submission/
IPv6 readiness plans
- please provide updates to the IPv6 assessment (ongoing) https://wiki.egi.eu/w/index.php?title=IPV6_Assessment
- any relevant information will be summarised at the OMB
Transition from X509 to federated identities (AARC profile token)
- In Feb 2022 OSG fully moved to token-based AAI, abandoning X509 certificates
- HTCondorCE: replacement of Grid Community Toolkit
- The long-term support series (9.0.x) from the CHTC repositories will support X509/VOMS authentication until May 2023
- Starting with 9.3.0 (released in October 2021), the HTCondor feature releases do NOT contain this support
- EGI sites are recommended to stay with the long-term support series for the time being
Migration of the VOs from VOMS to Check-in
- transition period where both X509 and tokens can be used
- delays in updating the GRID elements to the latest version compliant with tokens
- not all of the middleware products can be compliant with tokens at the same time
- the same VO has to interact with elements supporting different authentication methods
Testing HTCondorCE and AARC Profile token
- INFN-T1 did some tests with the AARC Profile token using its HTCondorCE endpoints
- Configuring authentication on HTCondorCE
- HTCondor CE token configuration tips (by INFN-T1)
- dteam VO registered in Check-in/Comanage:
- Entitlements:
- urn:mace:egi.eu:group:dteam:role=member#aai.egi.eu
- urn:mace:egi.eu:group:dteam:role=vm_operator#aai.egi.eu
- The HTCondorCE expects to find the scope claim in the token to authorise job submission
- at that time Check-in did not release this claim; it does since the migration to Keycloak, which replaced MitreID
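To make the structure of the entitlement values above explicit, here is a minimal sketch that splits them into their parts. It assumes the layout visible in the two dteam values (namespace, ":group:" followed by the group path, an optional ":role=" part, and the group authority after "#"); the type and function names are hypothetical and are not part of Check-in or HTCondorCE.

```python
from typing import NamedTuple, Optional


class Entitlement(NamedTuple):
    namespace: str
    group: str
    role: Optional[str]
    authority: Optional[str]


def parse_entitlement(value):
    """Split an entitlement URN into namespace, group, role and authority."""
    body, _, authority = value.partition("#")
    namespace, sep, rest = body.partition(":group:")
    if not sep or not rest:
        raise ValueError(f"not a group entitlement: {value!r}")
    group, _, role = rest.partition(":role=")
    return Entitlement(namespace, group, role or None, authority or None)


def has_role(entitlements, group, role):
    """Return True if any entitlement grants the given role in the group."""
    parsed = (parse_entitlement(e) for e in entitlements)
    return any(p.group == group and p.role == role for p in parsed)


if __name__ == "__main__":
    values = [
        "urn:mace:egi.eu:group:dteam:role=member#aai.egi.eu",
        "urn:mace:egi.eu:group:dteam:role=vm_operator#aai.egi.eu",
    ]
    for v in values:
        print(parse_entitlement(v))
    print(has_role(values, "dteam", "member"))
```

A check of this kind only shows how group membership is encoded; as noted above, the HTCondorCE itself looks at the scope claim, not at the entitlements.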
WLCG Campaign
- WLCG started the CE Token support campaign
- https://twiki.cern.ch/twiki/bin/view/LCG/CEtokenSupportCampaign
- sites should upgrade the CEs to the version supporting also tokens, and configure tokens
Hackathon events
- 15th - 16th September ARC/HTCondor CE Hackathon, organised by WLCG, with HTCondorCE and ARC-CE, mostly to investigate data staging issues (see GDB introduction)
- agreed to enable the support of the several token profiles through plugins
- same plugin for the several CEs
- plugins provided by the "creators" of the token profiles
- CE teams to provide specifics to the AAI teams and to release a new CE version supporting the plugins
Plans for the coming months:
- ARC-CE and HTCondorCE implemented a new API interface
- The Check-in team released a plugin for the CEs allowing the Check-in/AARC token profile to work
- according to the AARC guidelines, the claim to authorise the job submission is provided through a different attribute than the one used by the WLCG token
- the plugin translates the attribute to be understandable to the CEs
- The plugin is currently under testing before its release in UMD
- involved DESY-HH, FZK-LCG2, INFN-BARI, INFN-T1, RECAS-NAPOLI
- the plugin is tested with the HTCondor Feature Release which introduced the support for Check-in tokens
- tests were successful
- a few aspects concerning some HTCondor variables to be clarified
- Testing a particular SSL setting to allow the authz with X509/VOMS even if the CE supports only tokens
- a new fix will be released soon.
- Then we can start the decommission procedure for the HTCondorCE long-term support series (9.0.x)
- we might agree to postpone this decommission deadline considering the (relatively) low likelihood of security issues
- this will make the transition to tokens less painful for VOs that are not ready yet
- At the same time the VOs using VOMS will be cloned to Check-in in order to be ready to use the tokens when the first HTCondor (Feature Channel version) endpoints are in production.
- Monitoring aspects to be clarified:
- whether a new version of the probe using tokens is needed
- how to deal with CEs using different authz systems during the migration phase
DPM Decommission and migration
- DPM supported until June 2023
- Sites are encouraged to start the migration to a different storage element since the process will take time
- choosing the new storage solution depends on the expertise/experience of the sites and on the needs of the supported VOs
- See the slides presented by Petr Vokac at the EGI Conference 2022 about the migration tools to dCache
- DPM provides a migration script to dCache (migration guide)
- Transparent migration
- Migrate just catalog (database) and keep files untouched
- both SEs store files on a POSIX filesystem
- Migration in three steps
- verify the DPM data consistency
- no downtime needed
- the operation can last several days or some weeks
- DPM dump and dCache import
- downtime lasting about 1 day
- In September opened tickets to the sites to plan the migration and decommission:
- tickets list (23 out of 57 were solved)
- Please let us know your plans for the DPM EOL; if you decide to use the dCache migration tools, the tickets will be used to support you with this storage migration method.
- dCache migration should be done by June 2023.
- Last week we requested information from the sites whose completion date was undefined or planned by Feb 2023
Planned completion date | |
Beginning of 2023 | 1 |
By Feb 2023 | 1 |
By Q1 2023 | 3 |
By May 2023 | 3 |
By Q2 2023 | 12 |
undefined | 9 |
Chosen technology | |
Migration to dCache | 26 |
Migration to EOS | 8 |
Migration to xroot/ceph | 3 |
Migration to XrootD | 1 |
Migration to Xcache | 1 |
Migration to Dynafed | 2 |
Not yet decided/no clear plan | 6 |
Decommissioning SE or site | 7 |
New benchmark replacing HEP-SPEC06
The HEPscore benchmark is going to replace the old HEP-SPEC06 (HS06)
Main points agreed:
- Existing resources at the sites will not be re-benchmarked with HEPscore (unless the site has modern resources and would like to re-benchmark them in order to get higher consumption in the accounting reports)
- New resources purchased by the site will be benchmarked with HEPscore
- HEPscore23 will be normalized with HS06 with factor 1
- The switch to HEPscore23 in the accounting reports will happen on the 1st of April 2023 (when WLCG switches yearly pledges)
- This implies that two benchmarks will co-exist on the infrastructure for quite some time
- We would like to follow the progress regarding the amount of resources benchmarked with HEPscore23
- No need for reporting of measurements for two benchmarks in parallel for the same set of resources
- This implies that each accounting record should contain one metric for a single benchmark, and that the benchmark name has to be properly defined in the accounting record.
Plans for the coming months:
- Change of units: in the last week of March 2023 all references to HEP-SPEC06 in the Accounting Portal were replaced with HEPscore23
- Some tests in particular with sites sending normalised reports were performed.
- A new APEL version compliant with the new benchmark will be released within a few weeks
- new APEL client version by mid-May
- new APEL server before the end of May
- this new version allows the aggregation of the accounting records by benchmark to monitor the move to the new benchmark over time
- After the update of the APEL server, the aggregation by benchmark will be implemented also in the Accounting Portal
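The aggregation by benchmark mentioned above can be sketched as follows. This is only an illustration of the idea, not the APEL schema or implementation: the record fields, the benchmark labels and the function names are hypothetical. Because HEPscore23 is normalised to HS06 with factor 1, the normalised values of the two benchmarks can be summed directly to compute the share already accounted with the new benchmark.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    site: str
    benchmark: str           # one benchmark name per record, e.g. "HS06" or "HEPscore23"
    normalised_hours: float  # usage normalised with that benchmark


def aggregate_by_benchmark(records):
    """Sum the normalised usage separately for each benchmark name."""
    totals = defaultdict(float)
    for record in records:
        totals[record.benchmark] += record.normalised_hours
    return dict(totals)


def hepscore_share(records):
    """Fraction of the total normalised usage reported with HEPscore23."""
    totals = aggregate_by_benchmark(records)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return 0.0
    return totals.get("HEPscore23", 0.0) / grand_total


if __name__ == "__main__":
    # Made-up numbers and site names, for illustration only.
    records = [
        UsageRecord("SITE-A", "HS06", 300.0),
        UsageRecord("SITE-B", "HEPscore23", 100.0),
    ]
    print(aggregate_by_benchmark(records))
    print(hepscore_share(records))
```

Tracking this share over time is one way to follow the progress of the move to the new benchmark.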
HEPSCORE application:
- link to the gitlab page: https://gitlab.cern.ch/hep-benchmarks/hep-score
April GDB:
Monitoring of webdav and xrootd protocols/endpoints
- 93 tickets were created requesting to update the information for monitoring webdav and xrootd endpoints
- Extension Properties to set:
- webdav:
- Name: ARGO_WEBDAV_OPS_URL
- Value: webdav URL containing also the VO ops folder, for example: https://darkstorm.cnaf.infn.it:8443/webdav/ops or https://hepgrid11.ph.liv.ac.uk/dpm/ph.liv.ac.uk/home/ops/
- xrootd:
- Name: ARGO_XROOTD_OPS_URL
- Value: XRootD base SURL to test (the path where ops VO has write access, for example: root://eosatlas.cern.ch//eos/atlas/opstest/egi/, root://recas-se-01.cs.infn.it:1094/dpm/cs.infn.it/home/ops/, root://dcache-atlas-xrootd-ops.desy.de:2811/pnfs/desy.de/ops or similar)
- Reference: https://docs.egi.eu/internal/configuration-database/adding-service-endpoint/#webdav
- Link to the broadcast circulated last October
- 72 tickets were solved (3 Unsolved)
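Before saving the two extension properties, a site could run a quick plausibility check on the values. The sketch below is not part of GOCDB or of the ARGO probes; the function name and the rules are assumptions derived only from the example URLs above (https for webdav, root for xrootd, plus a host and a non-empty path).

```python
from urllib.parse import urlparse

# Expected URL schemes, inferred from the examples in the notes (assumption).
EXPECTED_SCHEMES = {
    "ARGO_WEBDAV_OPS_URL": {"https"},
    "ARGO_XROOTD_OPS_URL": {"root"},
}


def check_extension_property(name, value):
    """Return a list of problems; an empty list means the value looks plausible."""
    schemes = EXPECTED_SCHEMES.get(name)
    if schemes is None:
        return [f"unknown property name: {name}"]
    problems = []
    parsed = urlparse(value.strip())
    if parsed.scheme not in schemes:
        expected = ", ".join(sorted(schemes))
        problems.append(f"unexpected scheme {parsed.scheme!r}, expected: {expected}")
    if not parsed.hostname:
        problems.append("missing host name")
    if not parsed.path.strip("/"):
        problems.append("missing path to the ops VO area")
    return problems


if __name__ == "__main__":
    print(check_extension_property(
        "ARGO_WEBDAV_OPS_URL",
        "https://darkstorm.cnaf.infn.it:8443/webdav/ops"))
    print(check_extension_property(
        "ARGO_XROOTD_OPS_URL",
        "root://eosatlas.cern.ch//eos/atlas/opstest/egi/"))
```

The check is purely syntactic: it cannot tell whether the ops VO actually has access to the path, which is what the probes verify.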
AOB
Next meeting
June