General information
- EGI Conference 2023, 19 - 23 June, Poznan (Poland): https://www.egi.eu/event/egi2023/
- Agenda on WHOVA
Middleware
UMD
- Repo under testing: https://egirepo.a.incd.pt/test-url.html
- Supported OS (Alma Linux 9)
- Production Branch (release)
[UMD-5-testing]
baseurl=
- New Frontend under testing (no public URL available)
- Will allow UMD-4 and UMD-5 using different infrastructures.
- Required changes at INCD / LIP infrastructure
- New Backend ready
- Missing Backend → FrontEnd automation
- New Functionalities:
- IPv6 and High Availability
Operations
ARGO/SAM
- Monitoring of webdav endpoints:
- new version of the probe taking into account that with the Object store as the backend disk storage, the "ls" of a "directory" is failing for the webdav check.
- the "ls" check can be disabled by setting the appropriate information in GOCDB
- released in production on Jun 5th.
- Monitoring of xrootd endpoints
- some endpoints are exposed outside the site in read-only mode
- need to modify the xrootd probe to execute only "read" tests
- the new service type "eu.egi.readonly.xrootd" was created for this purpose (see GGUS 160848)
FedCloud
Feedback from DMSU
New Known Error Database (KEDB)
The KEDB has been moved to Jira+Confluence: https://confluence.egi.eu/display/EGIKEDB/EGI+Federation+KEDB+Home
- problems are tracked with Jira tickets to better follow their evolution
- problems can be registered by DMSU staff and EGI Operations team
Issues with publishing the accounting records
- Accounting Portal fails to retrieve the accounting records from the ARGO Message Service
- GGUS 161788
- HTC accounting data missing since end of April, GGUS 161869
- the last accounting records from grid and cloud sites were published on May 5th:
- https://argo.egi.eu/egi/OPERATORS/metrics/argo.APEL-Pub#
- WARNING - WARN [ last published 9 days ago: 2023-05-05 ]
For more info check URL: http://goc-accounting.grid-support.ac.uk:80/rss/ARNES_Pub.html
- https://argo.egi.eu/egi/ALL/metrics/eu.egi.cloud.APEL-Pub
- FedCloud Accounting Freshness WARNING - Accounting data is older than 7 days. Last update occured 2023-05-05 12:30:12.
- http://goc-accounting.grid-support.ac.uk/cloudtest/cloudsites2.html
- opened a ticket to APEL to investigate the issue:
- https://argo.egi.eu/egi/OPERATORS/metrics/argo.APEL-Pub#
- There was an issue in the Accounting Repository caused by a power blip on May 5th, fixed on May 16th
- The Accounting Portal resumed fetching the accounting records on May 23rd.
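The freshness warnings quoted above come down to comparing the date of the last published record with the current date. A minimal sketch of that comparison follows. It is not the actual ARGO probe code: the function name is hypothetical, the 7-day threshold is taken from the FedCloud message only, and the check date of 2023-05-14 is inferred from "9 days ago".

```python
from datetime import date


def freshness_status(last_published, today, warn_after_days=7):
    """Return (status, age_in_days) for the most recent accounting record."""
    age_days = (today - last_published).days
    # Assumption: records older than the threshold raise a WARNING.
    status = "WARNING" if age_days > warn_after_days else "OK"
    return status, age_days


if __name__ == "__main__":
    # Last publication date from the quoted message; the check date is inferred.
    status, age = freshness_status(date(2023, 5, 5), date(2023, 5, 14))
    print(f"{status} [ last published {age} days ago: 2023-05-05 ]")
```

A site could run a similar check locally against its own last publication date to notice publishing gaps before the central monitoring does.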
Monthly Availability/Reliability
Under-performed sites in the past A/R reports with issues not yet fixed:
- AsiaPacific: https://ggus.eu/index.php?mode=ticket_info&ticket_id=160323
- Australia-T2: CE failures
- TW-NCUHEP: several failures, fixed.
- AsiaPacific: https://ggus.eu/index.php?mode=ticket_info&ticket_id=160769
- Australia-ATLAS: CE failures
- INDIACMS-TIFR: problems with SRM settings; intermittent webdav failures due to overheating of some problematic disk nodes; downtime for DPM migration.
- NGI_GRNET: https://ggus.eu/index.php?mode=ticket_info&ticket_id=158231
- GR-07-UOI-HEPLAB: SRM failures have been fixed
- NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=161498
- INFN-ROMA1: new CE failures
- INFN-ROMA1-CMS: problems due to the large number of jobs submitted on the site, fixed.
- INFN-ROMA3: problems configuring STORM and HTCondorCE, fixed.
- NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=161837
- INFN-GENOVA: SE certificate had expired; outdated CA on the CE; some recurrent failures with webdav.
- NGI_UK: https://ggus.eu/index.php?mode=ticket_info&ticket_id=156742
- RAL-LCG2: authz issues with the webdav endpoint
- they provide an object store endpoint, and the commands executed by the webdav probe don't work (GGUS 157748).
- new version of the probe created (GGUS 159953) and included in UMD 4.17.2 on June 2nd (GGUS 160997)
- the "ls" check can be disabled by setting the appropriate information in GOCDB
- new version of the probe released in production on Jun 5th: the test is finally successful.
Under-performed sites after 3 consecutive months, under-performed NGIs, QoS violations (May 2023):
- NGI_GRNET: https://ggus.eu/index.php?mode=ticket_info&ticket_id=162186
- GRNET-OPENSTACK:
- NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=162187
- INFN-COSENZA:
- NGI_RO: https://ggus.eu/index.php?mode=ticket_info&ticket_id=162188
- NIHAM: IGTF outdated, SOLVED.
sites suspended:
- Jun 2nd: egee.irb.hr (NGI_HR)
Documentation
- MediaWiki in read-only mode
- content moved to different locations (confluence and https://docs.egi.eu/)
- confluence space hosting policies and procedures: EGI Policies and Procedures
- EGI Federation Operations
- Change Management, Release and Deployment Management, Incident and Service Request Management, Problem Management, Information Security Management
- Manuals, How-Tos, Troubleshooting, FAQs:
- a large amount of material needs to be reviewed and, where necessary, updated when moved to the new location
- location will be https://docs.egi.eu/providers/operations-manuals/
- Guidelines for providers to join EGI: https://docs.egi.eu/providers/joining/
- Tutorial on submitting HTC jobs: https://docs.egi.eu/users/tutorials/htc-job-submission/
IPv6 readiness plans
- please provide updates to the IPv6 assessment (ongoing) https://wiki.egi.eu/w/index.php?title=IPV6_Assessment
- any relevant information will be summarised at the OMB
Change in the APEL client configuration due to CERN top-BDII decommission
- The top-BDII lcg-bdii.cern.ch will be turned off on June 19th, as announced in a broadcast circulated on May 23rd.
- Currently this endpoint is a default setting in the APEL Client configuration file /etc/apel/client.cfg
- If you are using this setting, we kindly ask you to change it: you can replace it with the endpoint lcg-bdii.egi.eu (or any other top-bdii which is provided by your NGI).
- So the variable to set in /etc/apel/client.cfg is the following
ldap_host = lcg-bdii.egi.eu
- In this way, the apel client will query that BDII to gather the information about the benchmark published by your CEs.
- you can set any top-BDII of your preference
Please note that it is also possible to set the benchmark information manually by setting one or more of the following variables:
## To manually set specs for all jobs (not just local ones), configure lines
## like the following named "manual_spec" followed by consecutive integers for
## however many batch systems are relevant. The value should be a unique name
## for the system, then the spec type ('HEPscore23', 'HEPSPEC' or 'Si2k') and the spec value.
# manual_spec1 = grid10.uni.ac.uk:1234/grid10.uni.ac.uk-condor,HEPSPEC,10.0
# manual_spec2 = grid22.uni.ac.uk:1234/grid22.uni.ac.uk-condor,HEPSPEC,15.0
# manual_spec3 = grid35.uni.ac.uk:1234/grid35.uni.ac.uk-condor,HEPSPEC,15.0
- Please apply this change by June 19th, thanks!
- broadcast about the APEL configuration change sent on May 26th
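The ldap_host change described above can be checked and applied with a few lines of Python. This is a sketch only, assuming the setting appears as a plain "ldap_host = ..." line in /etc/apel/client.cfg. The function name is hypothetical and the sample text is an illustrative fragment, not a full APEL configuration. Commented-out lines are left untouched.

```python
import re

OLD_HOST = "lcg-bdii.cern.ch"
NEW_HOST = "lcg-bdii.egi.eu"  # or any other top-BDII provided by your NGI


def update_ldap_host(cfg_text, old=OLD_HOST, new=NEW_HOST):
    """Return (new_text, changed) with an active ldap_host line switched from old to new."""
    pattern = re.compile(
        r"^([ \t]*ldap_host[ \t]*=[ \t]*)" + re.escape(old) + r"[ \t]*$",
        re.MULTILINE,
    )
    # Keep the original "ldap_host = " prefix and replace only the host name.
    new_text, count = pattern.subn(lambda m: m.group(1) + new, cfg_text)
    return new_text, count > 0


if __name__ == "__main__":
    # Illustrative fragment; read the real file from /etc/apel/client.cfg instead.
    sample = "ldap_host = lcg-bdii.cern.ch\nldap_port = 2170\n"
    updated, changed = update_ldap_host(sample)
    print("changed:", changed)
    print(updated, end="")
```

The function works on text rather than on the file itself, so the result can be reviewed before it is written back.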
Transition from X509 to federated identities (AARC profile token)
- In Feb 2022 OSG fully moved to token-based AAI, abandoning X509 certificates
- HTCondorCE: replacement of Grid Community Toolkit
- The long-term support series (9.0.x) from the CHTC repositories will support X509/VOMS authentication until May 2023
- Starting in 9.3.0 (released in October 2021), the HTCondor feature releases do NOT contain this support
- EGI sites are recommended to stay with the long-term support series for the time being
Migration of the VOs from VOMS to Check-in
- transition period where both X509 and tokens can be used
- delays in updating the GRID elements to the latest version compliant with tokens
- not all of the middleware products can be compliant with tokens at the same time
- the same VO has to interact with elements supporting different authentication methods
Testing HTCondorCE and AARC Profile token
- INFN-T1 did some tests with the AARC Profile token using its HTCondorCE endpoints
- Configuring authentication on HTCondorCE
- HTCondor CE token configuration tips (by INFN-T1)
- dteam VO registered in Check-in/Comanage:
- Entitlements:
- urn:mace:egi.eu:group:dteam:role=member#aai.egi.eu
- urn:mace:egi.eu:group:dteam:role=vm_operator#aai.egi.eu
- The HTCondorCE expects to find the scope claim in the token to authorise job submission
- at that time Check-in did not release this claim; it does since the migration to Keycloak, which replaced MitreID
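The dteam entitlements listed above share one visible layout: a namespace, a group, an optional role, and the issuing authority after "#". A minimal sketch of how such values could be split into their parts follows. The regular expression is derived only from the two examples in these notes, and the function names are hypothetical.

```python
import re

# Pattern inferred from the examples above:
#   <namespace>:group:<group>[:role=<role>]#<authority>
ENTITLEMENT_RE = re.compile(
    r"^(?P<namespace>urn:[^#]+?):group:(?P<group>[^#]+?)"
    r"(?::role=(?P<role>[^:#]+))?#(?P<authority>.+)$"
)


def parse_entitlement(value):
    """Split an entitlement URN into namespace, group, role and authority."""
    match = ENTITLEMENT_RE.match(value)
    if match is None:
        raise ValueError(f"unrecognised entitlement: {value!r}")
    return match.groupdict()


def roles_in_group(entitlements, group):
    """Collect the roles that a list of entitlements grants in one group."""
    roles = set()
    for value in entitlements:
        parsed = parse_entitlement(value)
        if parsed["group"] == group and parsed["role"]:
            roles.add(parsed["role"])
    return roles


if __name__ == "__main__":
    dteam = [
        "urn:mace:egi.eu:group:dteam:role=member#aai.egi.eu",
        "urn:mace:egi.eu:group:dteam:role=vm_operator#aai.egi.eu",
    ]
    print(sorted(roles_in_group(dteam, "dteam")))
```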
WLCG Campaign
- WLCG started the CE Token support campaign
- https://twiki.cern.ch/twiki/bin/view/LCG/CEtokenSupportCampaign
- sites should upgrade the CEs to the version supporting also tokens, and configure tokens
Hackathon events
- 15th - 16th September ARC/HTCondor CE Hackathon, organised by WLCG, with HTCondorCE and ARC-CE, mostly to investigate data staging issues (see GDB introduction)
- agreed to enable the support of the several token profiles through plugins
- same plugin for the several CEs
- plugins provided by the "creators" of the token profiles
- CE teams to provide specifics to the AAI teams and to release a new CE version supporting the plugins
Plans for the coming months:
- ARC-CE and HTCondorCE implemented a new API interface
- The Check-in team released a plugin for the CEs allowing the Check-in/AARC token profile to work
- according to the AARC guidelines, the claim to authorise the job submission is provided through a different attribute than the one used by the WLCG token
- the plugin translates the attribute to be understandable to the CEs
- The plugin is currently under testing before its release in UMD
- involved DESY-HH, FZK-LCG2, INFN-BARI, INFN-T1, RECAS-NAPOLI
- the plugin is tested with the HTCondor Feature Release, which introduced support for Check-in tokens
- tests were successful
- a few aspects concerning some HTCondor variables to be clarified
- HTCondor now supports SSL for authentication and mapping of x509 certificates.
- The SSL workaround does not allow the use of VOMS extensions to map, but it was mentioned that user mapping can be achieved using only the DN.
- Waiting for the creation of documentation about this setting
- Important for the sites: please get in contact with the VOs to verify their status about the transition to tokens:
- if the VOs need a bit more time you can use the SSL settings to map the users' DNs...
- ...but you need to know who these users are!
- Then we can start the decommission procedure for the HTCondorCE long-term support series (9.0.x)
- we might agree to postpone this decommission deadline considering the (relatively) low likelihood of security issues
- this will make the transition to tokens less painful for VOs that are not ready yet
- At the same time the VOs using VOMS will be cloned to Check-in in order to be ready to use the tokens when the first HTCondor (Feature Channel version) endpoints are in production.
- Monitoring aspects to be clarified:
- if a new version of the probe using tokens is needed
- how to deal with CEs using different authz systems during the migration phase
DPM Decommission and migration
- DPM supported until June 2023
- Sites are encouraged to start the migration to a different storage element since the process will take time
- choosing the new storage solution depends on the expertise/experience of the sites and on the needs of the supported VOs
- See the slides presented by Petr Vokac at the EGI Conference 2022 about the migration tools to dCache
- DPM provides a migration script to dCache (migration guide)
- Transparent migration
- Migrate just catalog (database) and keep files untouched
- both SEs store files on a POSIX filesystem
- Migration in three steps
- verify the DPM data consistency
- no downtime needed
- the operation can last several days or some weeks
- DPM dump and dCache import
- downtime lasting about 1 day
- In September 2022 tickets were opened to the sites to plan the migration and decommission:
- tickets list (30 out of 57 were solved)
- Please let us know your plans for the DPM EOL; if you decide to use the dCache migration tools, the tickets will be used to support you with this storage migration method.
- dCache migration should be done by June 2023.
Planned completion date | |
By Feb 2023 | 1 |
By Q1 2023 | 3 |
By May 2023 | 2 |
By June 2023 | 8 |
By Q2 2023 | 4 |
undefined | 6 |
Chosen technology | |
Migration to dCache | 27 |
Migration to EOS | 8 |
Migration to xroot/ceph | 3 |
Migration to XrootD | 1 |
Migration to Xcache | 1 |
Migration to Dynafed | 2 |
Not yet decided/no clear plan | 6 |
Decommissioning SE or site | 7 |
- Procedure to decommission unsupported software: PROC16 Decommissioning of unsupported software
- In compliance with the EGI Service Operations Security Policy (1), unsupported software SHOULD be decommissioned before its End of Security Updates and Support, and MUST be retired no later than 1 month after its End of Security Updates and Support. After this date, if a critical vulnerability were to emerge in the software, EGI CSIRT can request the service to be turned off immediately.
- (1) a Resource Centre Administrator SHOULD follow IT security best practices that include pro-actively applying software patches, updates or configuration changes related to security.
- DPM end of security updates and Support: 30th June 2023
- DPM decommissioning deadline: 31st July 2023
- Failure to do so MAY ultimately lead to site suspension
- Please note that after June 30th no support is going to be provided with the migration to dCache in case of issues.
New benchmark HEPscore23
The benchmark HEPscore23 is replacing the old HEP-SPEC06
Main points agreed:
- On the Accounting Portal all of the metric units refer to HEPscore23 (since April 1st 2023)
- Existing resources at the sites will not be re-benchmarked with HEPscore23 (unless the site has modern resources and would like to re-benchmark them in order to get higher consumption in the accounting reports)
- New resources purchased by the site will be benchmarked with HEPscore23
- This implies that two benchmarks will co-exist on the infrastructure for quite some time
- Normalisation factor between HEPscore23 and HS06 is 1
- We would like to follow the progress of the amount of resources benchmarked with HEPscore23
- No need for reporting of measurements for two benchmarks in parallel for the same set of resources
- This implies that each accounting record should contain one metric for a single benchmark, and the benchmark name has to be properly defined in the accounting record.
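Because each record carries a single, named benchmark and the normalisation factor between HEPscore23 and HS06 is 1, totals per benchmark can be kept separate to follow the transition and still be summed directly. The sketch below illustrates that arithmetic. The record fields, site names and values are hypothetical and do not reflect the APEL message format.

```python
from collections import defaultdict

HS23_PER_HS06 = 1.0  # normalisation factor agreed above

# Hypothetical records: one benchmark name and one score per record.
RECORDS = [
    {"site": "SITE-A", "benchmark": "HEPSPEC", "score_per_core": 10.0, "core_hours": 100.0},
    {"site": "SITE-B", "benchmark": "HEPscore23", "score_per_core": 15.0, "core_hours": 40.0},
]


def normalised_work(record):
    """Core hours scaled by the benchmark score of the resource."""
    return record["score_per_core"] * record["core_hours"]


def aggregate_by_benchmark(records):
    """Sum normalised work separately for each benchmark name."""
    totals = defaultdict(float)
    for record in records:
        totals[record["benchmark"]] += normalised_work(record)
    return dict(totals)


def total_in_hepscore23(totals):
    """Combine both benchmarks into HEPscore23 units using the agreed factor."""
    return totals.get("HEPscore23", 0.0) + HS23_PER_HS06 * totals.get("HEPSPEC", 0.0)


if __name__ == "__main__":
    per_benchmark = aggregate_by_benchmark(RECORDS)
    print(per_benchmark)
    print(total_in_hepscore23(per_benchmark))
```

Keeping the per-benchmark totals alongside the combined figure is what lets the share of resources already benchmarked with HEPscore23 be tracked over time.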
Recent activities:
- Some tests were performed, in particular with sites sending normalised reports.
- APEL client 1.9.2 released that adds basic HEPscore23 publishing using existing message format
- It needs to be added to UMD
- APEL server release candidate in testing
- Liaising with Portal on setting up testing with them
- this new version allows the aggregation of the accounting records by benchmark to monitor the move to the new benchmark over time
- When the tests are successful, final release of APEL server update and of the Portal
- Information for testing the publication of accounting records with the new benchmark:
- Expected a fix in ARC-CE for the proper configuration of HEPscore23
- Please contact us if you'd like to make tests with the new benchmark
HEPSCORE application:
- link to the gitlab page: https://gitlab.cern.ch/hep-benchmarks/hep-score
April GDB:
June WLCG Operations Coordination meeting:
Monitoring of webdav and xrootd protocols/endpoints
- 93 tickets were created requesting to update the information for monitoring webdav and xrootd endpoints
- Extension Properties to set:
- webdav:
- Name: ARGO_WEBDAV_OPS_URL
- Value: webdav URL containing also the VO ops folder, for example: https://darkstorm.cnaf.infn.it:8443/webdav/ops or https://hepgrid11.ph.liv.ac.uk/dpm/ph.liv.ac.uk/home/ops/
- xrootd:
- Name: ARGO_XROOTD_OPS_URL
- Value: XRootD base SURL to test (the path where ops VO has write access, for example: root://eosatlas.cern.ch//eos/atlas/opstest/egi/, root://recas-se-01.cs.infn.it:1094/dpm/cs.infn.it/home/ops/, root://dcache-atlas-xrootd-ops.desy.de:2811/pnfs/desy.de/ops or similar)
- Reference: https://docs.egi.eu/internal/configuration-database/adding-service-endpoint/#webdav
- Link to the broadcast circulated last October
- 74 tickets were solved (3 Unsolved)
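Before entering the extension properties in GOCDB, the values can be given a quick sanity check. The sketch below is only an illustration: the rules (https scheme for webdav, root scheme for xrootd, a host and a non-empty path) are assumptions drawn from the example URLs above, not the validation performed by GOCDB or by the ARGO probes, and the function name is hypothetical.

```python
from urllib.parse import urlparse

# Assumption: schemes taken from the example values given in these notes.
EXPECTED_SCHEMES = {
    "ARGO_WEBDAV_OPS_URL": {"https"},
    "ARGO_XROOTD_OPS_URL": {"root"},
}


def check_extension_property(name, value):
    """Return a list of problems found in an extension property (empty if none)."""
    schemes = EXPECTED_SCHEMES.get(name)
    if schemes is None:
        return [f"unknown property name: {name}"]
    problems = []
    parsed = urlparse(value.strip())
    if parsed.scheme not in schemes:
        problems.append(f"scheme {parsed.scheme!r} not in {sorted(schemes)}")
    if not parsed.hostname:
        problems.append("missing host name")
    if not parsed.path.strip("/"):
        problems.append("missing path to the ops VO folder")
    return problems


if __name__ == "__main__":
    examples = {
        "ARGO_WEBDAV_OPS_URL": "https://darkstorm.cnaf.infn.it:8443/webdav/ops",
        "ARGO_XROOTD_OPS_URL": "root://eosatlas.cern.ch//eos/atlas/opstest/egi/",
    }
    for prop, url in examples.items():
        print(prop, check_extension_property(prop, url) or "looks fine")
```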
AOB
Next meeting
July or August