General information
Middleware
UMD
- CentOS8 discussion still ongoing
- migration of Software Provisioning infrastructure to IBERGRID still ongoing
- the migration has been finalised; a test release for UMD4 is being prepared and, if everything works, a new update will be released in April
- CMD new major release planned in parallel, for Ubuntu 20 and CentOS7
Preview repository
- released on 2021-04-09
- Preview 2.32.0 AppDB info (CentOS 7): APEL Client/Server 1.9.0, APEL-SSM 3.2.0, xrootd 5.1.1
Operations
ARGO/SAM
- Site-BDII metrics org.bdii.Entries and org.bdii.Freshness removed from ARGO_MON_CRITICAL profile
- the metrics are still kept in the ARGO_MON_OPERATORS profiles
- it is still an important service to support infrastructure oversight activities
- HTCondor-CE probes
- deployed on secmon and pakiti: GGUS 150006
- working on the probe for the host certificate validity check: GGUS 147386
- With HTCondor 8.9.12 installed (expected the week of Mar 15), you should be able to query remote HTCondor-CEs for their host certificate using the following:
$ python -c 'import htcondor; ad = htcondor.Collector("collector2.opensciencegrid.org:9619").locate(htcondor.DaemonTypes.Schedd, "hosted-ce10.opensciencegrid.org"); print htcondor.SecMan().ping(ad, "READ")["ServerPublicCert"]' | openssl x509 -noout -subject -enddate
subject= /CN=hosted-ce10.opensciencegrid.org
notAfter=Apr 26 12:26:42 2021 GMT
- CREAM-CE metrics removed from ARGO_MON, ARGO_MON_OPERATORS and ARGO_MON_CRITICAL (GGUS 149778)
- emi.cream.CREAMCE*
- eu.egi.CREAM*
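The host certificate validity check discussed above for HTCondor-CE boils down to reading the notAfter date printed by openssl and comparing it with the current time. The following is a minimal sketch of that step only, not the actual probe: the function name days_until_expiry and the reference date are illustrative assumptions, and the parsing assumes the "notAfter=Mon DD HH:MM:SS YYYY GMT" format shown in the sample output, with English month names.

```python
from datetime import datetime, timezone


def days_until_expiry(openssl_output, now):
    """Return the whole days between `now` and the notAfter date.

    `openssl_output` is the text printed by `openssl x509 -noout -enddate`.
    Hypothetical helper, not part of any existing probe.
    """
    for line in openssl_output.splitlines():
        line = line.strip()
        if line.startswith("notAfter="):
            value = line[len("notAfter="):]
            # openssl prints the date as e.g. "Apr 26 12:26:42 2021 GMT"
            not_after = datetime.strptime(value, "%b %d %H:%M:%S %Y GMT")
            not_after = not_after.replace(tzinfo=timezone.utc)
            return (not_after - now).days
    raise ValueError("no notAfter line found in openssl output")


# Sample output taken from the command shown in the notes.
sample = (
    "subject= /CN=hosted-ce10.opensciencegrid.org\n"
    "notAfter=Apr 26 12:26:42 2021 GMT\n"
)
# Fixed reference date (an assumption) so that the result is reproducible.
reference = datetime(2021, 4, 12, tzinfo=timezone.utc)
remaining = days_until_expiry(sample, reference)
print(remaining)
```

A real probe would map the remaining days to WARNING or CRITICAL using thresholds that are not specified in these notes.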
FedCloud
Feedback from DMSU
Monthly Availability/Reliability
- Under-performed sites in the past A/R reports with issues not yet fixed:
- NGI_DE: https://ggus.eu/index.php?mode=ticket_info&ticket_id=150816
- GoeGrid
- NGI_HR: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148518
- egee.irb.hr: major upgrade from CentOS 6 to CentOS 7; tests currently fail due to wrong information published by the SE.
- NGI_IT: https://ggus.eu/index.php?mode=ticket_info&ticket_id=150818
- INFN-PISA
- ROC_LA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148515
- ATLAND: ARC-CE misconfiguration: "ENV/PROXY runtime environment" wasn't enabled
- ROC_LA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148956
- CBPF: SRM failures due to information not properly published. Physical access to facilities restricted due to COVID measures; planned a DPM update.
- ROC_LA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=150817
- ICN-UNAM: replacing CREAM-CE
- ROC_LA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=149355
- SUPERCOMPUTO-UNAM: scheduled a downtime for upgrading the site.
- Under-performed sites after 3 consecutive months, under-performed NGIs, QoS violations (March 2021):
- AsiaPacific: https://ggus.eu/index.php?mode=ticket_info&ticket_id=151186
- Australia-T2: SRM failures due to missing information
- NGI_NL: https://ggus.eu/index.php?mode=ticket_info&ticket_id=151187
- BelGrid-UCL: SRM issues, improving...
- NGI_UA: https://ggus.eu/index.php?mode=ticket_info&ticket_id=151188
- UA_ICYB_ARC: the site didn't fully recover after an emergency shutdown
- NGI_UK: https://ggus.eu/index.php?mode=ticket_info&ticket_id=151189
- UKI-SOUTHGRID-SUSX: IGTF failures
- sites suspended:
IPv6 readiness plans
- please provide updates to the ongoing IPv6 assessment: https://wiki.egi.eu/w/index.php?title=IPV6_Assessment
- any relevant information will be summarised at the OMB
APEL migration from ActiveMQ to ARGO Message Service (AMS)
- Migration instructions: https://github.com/apel/ssm/blob/dev/migrating_to_ams.md
- ActiveMQ is going to be decommissioned at the end of May
- A new version of the APEL Client (1.9.0) is being released; it is compatible with the new AMS protocol when used to trigger the publication of the accounting records
- APEL SSM has worked fine with AMS since version 2.4.0
- The accounting component of ARC-CE still uses the STOMP protocol to send the message records
- The developers are working on a new version compatible with AMS
- some sites will be asked to test the new version when available
- Cloud accounting campaign:
- HTCondorCE and Storage accounting campaign:
- Most common issues:
- mismatch between the host certificate subject registered in GOCDB and the real DN
- SAN field missing / wrongly defined in the host certificate
- DNS entries not completely defined
- same host used to send different types of accounting records
- a new version of ARGO Message Service mitigates the problems related to DNS entries and the SAN field:
Prerequisites for using AMS
- A valid host certificate from an IGTF Accredited CA.
- A GOCDB 'Site' entry flagged as 'Production'.
- A GOCDB 'Service' entry of the correct service type flagged as 'Production'. The following service types are used:
- For Grid accounting use 'gLite-APEL'.
- For Cloud accounting use 'eu.egi.cloud.accounting'.
- For Storage accounting use 'eu.egi.storage.accounting'.
- The 'Host DN' listed in the GOCDB 'Service' entry must exactly match the certificate DN of the host used for accounting. Make sure there are no leading or trailing spaces in the 'Host DN' field.
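The 'Host DN' prerequisite above, together with the DN mismatch listed among the most common issues, can be checked mechanically before opening a ticket. Below is a minimal sketch assuming both DNs are already available as strings in the same notation; the function name host_dn_problems and the example DN are hypothetical and nothing here queries GOCDB.

```python
def host_dn_problems(gocdb_host_dn, certificate_dn):
    """Return a list of human-readable problems; an empty list means a match.

    Hypothetical helper: both arguments are plain strings that the caller
    has copied from GOCDB and from the host certificate respectively.
    """
    problems = []
    # Leading or trailing spaces in the GOCDB field break the exact match.
    if gocdb_host_dn != gocdb_host_dn.strip():
        problems.append("leading or trailing whitespace in the GOCDB 'Host DN' field")
    # Compare the trimmed values so a whitespace issue is reported only once.
    if gocdb_host_dn.strip() != certificate_dn.strip():
        problems.append("GOCDB 'Host DN' differs from the certificate DN")
    return problems


# Example DN invented for illustration only.
cert_dn = "/DC=org/DC=example/CN=apel.example.org"
print(host_dn_problems(cert_dn + " ", cert_dn))
print(host_dn_problems("/DC=org/DC=example/CN=other.example.org", cert_dn))
print(host_dn_problems(cert_dn, cert_dn))
```

The comparison is deliberately case-sensitive and does not convert between DN notations, since the prerequisite asks for an exact match.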
Feedback from NGI_FRANCE
On the Cloud infrastructure, several tickets have been opened to switch to the new messaging system. It would be nice to have the following RPMs made available from the CMD repository, and not only from UMD:
- apel-ssm-2.4.1-1.el7.noarch
- python-argo-ams-library-0.5.1-1.el7.noarch
In addition, many Cloud sites are now using OpenStack Stein or newer. These versions are provided with python-daemon = 2.2.3-1.el7, which conflicts with the requirement of apel-ssm (python-daemon < 2.2.0).
Feedback from URT:
- a new apel-ssm 3.0.0 version is available in the untested repository; it solves the dependency issue caused by the python-daemon <= 2.2.0 requirement.
- there is also the new python-argo-ams 0.54 library
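The python-daemon conflict reported by NGI_FRANCE can be illustrated by comparing the installed version with the upper bound required by apel-ssm. This is a simplified sketch under stated assumptions: it compares only the numeric upstream version and is not the RPM version comparison algorithm, and the helper names parse_version and satisfies_upper_bound are hypothetical.

```python
def parse_version(text):
    """Turn '2.2.3-1.el7' into (2, 2, 3), dropping the RPM release part."""
    upstream = text.split("-", 1)[0]
    return tuple(int(part) for part in upstream.split("."))


def satisfies_upper_bound(installed, bound, inclusive=False):
    """Check `installed < bound`, or `installed <= bound` if inclusive."""
    installed_v = parse_version(installed)
    bound_v = parse_version(bound)
    return installed_v <= bound_v if inclusive else installed_v < bound_v


# Values quoted in the notes: OpenStack Stein ships python-daemon 2.2.3-1.el7,
# while apel-ssm requires python-daemon < 2.2.0.
installed = "2.2.3-1.el7"
conflict = not satisfies_upper_bound(installed, "2.2.0")
print("conflict" if conflict else "ok")
```

With the quoted values the requirement is not met, which matches the conflict described in the feedback.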
ARC-CE probe failing due to UMD repositories being down
- The unavailability of UMD repository caused a failure with the ARC-CE IGTF probes (org.nordugrid.ARC-CE-result-ops)
Job terminated as Failed. - Failed in data staging: Failed checking source replica http://repository.egi.eu:80/sw/production/cas/1/current/meta/ca-policy-egi-core.list: Failed to obtain information about file: Failed to connect to repository.egi.eu(IPv4):80 - JID: gsiftp://alex4.nipne.ro:2811/jobs/yq0NDmskJcynuvw3Vp3UrRNqABFKDmABFKDm8hJKDmABFKDmxx7PPm
- Asked the ARC-CE developers to remove this dependency from the probe:
- On 2021-03-09 a re-computation was requested to exclude these failures from the A/R figures
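When reviewing probe results for such a re-computation, failures caused by the repository outage can be told apart from genuine site problems by looking at the error text. A rough sketch, assuming the failure message always contains the "Failed to connect to repository.egi.eu" fragment seen in the job output above; the function name and the second sample message are illustrative assumptions.

```python
import re

# Pattern based on the error text quoted in these notes.
REPO_FAILURE = re.compile(r"Failed to connect to repository\.egi\.eu\b")


def is_repository_failure(message):
    """Return True if the probe failure was caused by the unreachable repository."""
    return REPO_FAILURE.search(message) is not None


# Shortened copy of the job output quoted above.
quoted = (
    "Job terminated as Failed. - Failed in data staging: Failed checking source "
    "replica http://repository.egi.eu:80/sw/production/cas/1/current/meta/"
    "ca-policy-egi-core.list: Failed to obtain information about file: "
    "Failed to connect to repository.egi.eu(IPv4):80"
)
# Invented message standing in for an unrelated failure.
unrelated = "Job terminated as Failed. - Some other error"

print(is_repository_failure(quoted))
print(is_repository_failure(unrelated))
```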
CREAM-CE Decommission
- End of Security Updates and Support: 31st Dec 2020
- Original broadcast: https://operations-portal.egi.eu/broadcast/archive/2293
- Decommissioning deadline: 31st Jan 2021
- PROC16 Decommission of unsupported software
- Decommissioning start date: Oct 1st 2020
- a probe detecting CREAM-CE endpoints will be run, returning WARNING status
- GGUS ticket: https://ggus.eu/index.php?mode=ticket_info&ticket_id=148715
- eu.egi.sec.CREAMCE
- Nov 1st: probe returns CRITICAL status, alarms created on the ROD dashboard, ROD teams start to create tickets
- 1st Feb 2021: EGI Ops will start chasing the sites still providing CREAM-CE endpoints
- By this time service end-points which couldn't be upgraded should be put into downtime by site admin or ROD
- 1st March 2021: Sites still deploying unsupported service endpoints risk suspension, unless documented technical reasons prevent a Site Admin from updating these endpoints.
- Tickets opened: 49
- link to the list
- Please note that at least one CE endpoint should be associated to the APEL service type in order to monitor the publication of the accounting data, as explained here
- If the CE you are going to remove was also registered as APEL service type, do not forget to move the APEL service type to a different CE endpoint.
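The decommissioning timeline above can be written down as a small lookup that returns the phase in force on a given date. This is only a sketch of the schedule as stated in these notes: the year of the "Nov 1st" milestone is assumed to be 2020, and the names MILESTONES and current_phase are hypothetical.

```python
from datetime import date

# (start date, what applies from that date on), in chronological order.
MILESTONES = [
    (date(2020, 10, 1), "probe eu.egi.sec.CREAMCE returns WARNING"),
    # Assumption: "Nov 1st" in the notes refers to 2020.
    (date(2020, 11, 1), "probe returns CRITICAL, ROD teams create tickets"),
    (date(2021, 2, 1), "EGI Ops chase sites, endpoints should be in downtime"),
    (date(2021, 3, 1), "sites risk suspension"),
]


def current_phase(today):
    """Return the description of the latest milestone reached by `today`."""
    phase = "before decommissioning start"
    for start, description in MILESTONES:
        if today >= start:
            phase = description
    return phase


print(current_phase(date(2020, 10, 15)))
print(current_phase(date(2021, 4, 12)))
```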
VOMS upgrade to CentOS 7
- VOMS for CentOS 7 released Nov 23rd with UMD 4.12.13
- VOMS Admin 3.8.0, VOMS Server 2.0.15
- VOMS endpoints registered on GOCDB as production and monitored: 41
- Provided by 33 sites
- list of tickets opened: GGUS
- total: 31. Solved: 19.
- the VOMS servers need to be published in the BDII in order to easily collect the deployed version
AOB
Next meeting
12th Apr 2021