General information
- the Operations meeting will be on the 2nd Monday of the month
- the EGI Operations Meeting schedule for first half of 2016 is available on Indico: https://indico.egi.eu/indico/categoryDisplay.py?categId=32 and on the new summary page: https://wiki.egi.eu/wiki/Operations_Meeting
UMD/CMD
- UMD 3.14.2 RC ready
- problem with dependencies generated within EPEL: Package voms-clients is obsoleted by voms-clients-cpp, trying to install voms-clients-cpp-2.0.13-1.el6.x86_64 instead
- solution should be setting priorities so that UMD comes first (thanks Mattias)
- UMD 4 next release in preparation, release scheduled by June
- first update for SL6
- adding several products, see products in verification
- CMD
- RT setup: IT support to configure CMD together with UMD, discussion in progress
- Verification process
- starting with BDII info provider
- external infrastructure needed to perform the tests
- Staged-Rollout: TBD
Staged rollout updates
Preview repository
Released on 2016-05-17:
- Preview 1.2.0
- LCMAPS-plugins-vo-ca-ap 0.0.1-1
- STORM 1.11.11
- Preview 2.1.0
- NorduGrid ARC 15.03 update 6
- LCMAPS-plugins-vo-ca-ap 0.0.1-1
Generic information about Preview repository: https://wiki.egi.eu/wiki/Preview_Repository
Note: EGI provides the preview repository without any additional quality assurance process; the products are released as provided by the product teams. EGI recommends the use of the UMD repositories, which contain software verified through the quality assurance process of UMD.
Operations
Central monitoring
- this has been postponed due to technical issues in setting up the central instance
RFC proxy will be default
- moving to RFC proxy instead of legacy proxy
- in production for a while; everybody is using RFC
- we will ask the VOMS TP to make a small modification to the VOMS client, changing the default
EGI Operations Support activities stopped
- the Operations Support core activity was not re-bid in phase 2 of the EGI core activities
- all Operations Support activities have been moved to the EGI.eu Operations
- all the operational procedures involving Operations Support have been updated to point to EGI Operations. Please let us know if we missed any documents.
- The "Operations Support" support unit in GGUS has been decommissioned. Please use the "Operations" support unit instead from now on.
Monthly Availability/Reliability
A/R report on ARGO: http://argo.egi.eu/lavoisier/ngi_reports?accept=html
List of the underperforming RCs for (at least) 3 consecutive months:
- AfricaArabia https://ggus.eu/?mode=ticket_info&ticket_id=117094: main problems with the monitoring system, waiting for the release of the central one
- ASRT
- DZ-01-ARN (recovered)
- EG-ZC-T3: unresponsive for two months, must be suspended
- ZA-UJ
- AsiaPacific: (since February) https://ggus.eu/index.php?mode=ticket_info&ticket_id=121222
- IN-DAE-VECC-02 (miscellaneous issues)
- MY-UPM-BIRUNI-01
- NGI_DE: https://ggus.eu/?mode=ticket_info&ticket_id=121975
- UNI-SIEGEN-HEP
- NGI_HR: https://ggus.eu/index.php?mode=ticket_info&ticket_id=120573
- egee.fesb.hr: issue with the SE which affected the whole NGI; the situation has improved, and they are planning to decommission it during this year.
- NGI_IL: (since last month) https://ggus.eu/index.php?mode=ticket_info&ticket_id=121223
- IL_IUCC_IG: suspended on June 6th
- NGI_MARGI https://ggus.eu/index.php?mode=ticket_info&ticket_id=118465 no monitoring data since January
- NGI_MD: https://ggus.eu/index.php?mode=ticket_info&ticket_id=120578
- the only site MD-02-IMI was suspended in March for security reasons, asked for news
- NGI_NDGF: https://ggus.eu/index.php?mode=ticket_info&ticket_id=121985
- EENet problem with the probe
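The "underperforming for (at least) 3 consecutive months" criterion used for the list above can be sketched as follows. This is only an illustration: the 80% availability and 85% reliability targets are assumptions not stated in these minutes, the site names and figures are hypothetical, and treating a month as underperforming when either figure is below target is also an assumption.

```python
from typing import Dict, List, Tuple

# Assumed monthly targets in percent; these minutes do not state them.
AVAILABILITY_TARGET = 80.0
RELIABILITY_TARGET = 85.0

MonthlyAR = Tuple[float, float]  # (availability, reliability) for one month


def underperforming_streak(months: List[MonthlyAR]) -> int:
    """Count consecutive underperforming months, going back from the latest one."""
    streak = 0
    for availability, reliability in reversed(months):
        if availability < AVAILABILITY_TARGET or reliability < RELIABILITY_TARGET:
            streak += 1
        else:
            break
    return streak


def flag_sites(history: Dict[str, List[MonthlyAR]], min_months: int = 3) -> List[str]:
    """Return the sites whose current underperforming streak is at least min_months."""
    return sorted(
        site for site, months in history.items()
        if underperforming_streak(months) >= min_months
    )


# Hypothetical data, oldest month first.
history = {
    "SITE-A": [(95.0, 97.0), (70.0, 72.0), (60.0, 65.0), (50.0, 55.0)],
    "SITE-B": [(70.0, 75.0), (75.0, 80.0), (90.0, 95.0)],
    "SITE-C": [(99.0, 99.0), (98.0, 99.0), (97.0, 98.0)],
}
flagged = flag_sites(history)
print(flagged)
```

With this sample data only SITE-A is flagged, because SITE-B recovered in its most recent month.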
Decommissioning SL5
- Tracked on SL5_retirement wiki
- Sites still deploying unsupported service end-points risk suspension, unless documented technical reasons prevent a Site Admin from updating these end-points https://wiki.egi.eu/wiki/PROC16_Decommissioning_of_unsupported_software#Escalation_phase see step 7
- Status https://wiki.egi.eu/wiki/SL5_retirement#2016-06-13_Overall_status reported below.
- from this week on, EGI Operations can suspend sites that host SL5 services in production that are not in downtime
- tickets will be opened
Status and actions
NGIs argus server not properly configured
Some time ago (more than a year I think), EGI ran a campaign to have NGIs run an "NGI Argus" service. This campaign resulted in new services being added to goc-db for each NGI.
Unfortunately, as explained in the OMB in February, our monitoring is currently unable to check the deployment of these services:
- For 6 services, our monitoring cannot contact the NGI Argus
- For 18 services, our monitoring is not authorized to get the right information from the NGI Argus
- For 1 service, our monitoring indicates that the NGI Argus is not properly configured and does not pull the rules from argus.cern.ch
In the end, only 5 services are properly configured and monitored!
The changes are rather easy:
- If we can't contact them, the site needs to make sure that there is no firewall blocking 195.251.55.111 from accessing the argus 'pap' port
- If we are not authorized, the site needs to add the right ACE to their argus authorization
pap-admin add-ace 'CN=srv-111.afroditi.hellasgrid.gr,OU=afroditi.hellasgrid.gr,O=HellasGrid, C=GR' 'POLICY_READ_LOCAL|POLICY_READ_REMOTE|CONFIGURATION_READ'
- If the argus server is not properly configured (no rule pulled), the site has to follow http://wiki.nikhef.nl/grid/Argus_Global_Banning_Setup_Overview#NGI_Argus
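For the first case above (monitoring cannot contact the NGI Argus), a site admin may want a quick check that the PAP port accepts TCP connections at all. The sketch below is a minimal example using only the Python standard library; the host name in the usage comment is hypothetical, and port 8150 is assumed to be the PAP port (adjust it if the service is configured differently). A check run from inside the site does not prove that the firewall lets 195.251.55.111 through, so it only rules out the service not listening.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example usage (hypothetical host name, assumed PAP port):
# print(port_is_reachable("argus.example.org", 8150))
```

A `False` result does not distinguish between a firewall drop, a stopped service and a wrong host name, so the firewall rules still need to be inspected directly.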
The current status of the infrastructure can be found:
- In the secmon nagios (not sure you have access to this):
- On the security dashboard:
https://operations-portal.egi.eu/csiDashboard/ngi/any/tab/list/filter/monitoring/page/list?tsid=4
On the security dashboard, each NGI should have an "argus-ban" result:
- "Ok" means ok
- "Unknown" means that we can't contact them
- "High" means that we are not authorized
- "Critical" means that Argus is not pulling rules from argus.cern.ch
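The status values above pair up with the fixes listed earlier in this section. As a quick reference, that pairing can be written as a small lookup table; the function and variable names are illustrative only, and the dashboard itself exposes no such API as far as these minutes say.

```python
# Maps each "argus-ban" dashboard result to (meaning, suggested action),
# using only the explanations and fixes given in these minutes.
ARGUS_BAN_STATUS = {
    "Ok": ("NGI Argus is properly configured and monitored", "nothing to do"),
    "Unknown": (
        "monitoring cannot contact the NGI Argus",
        "make sure no firewall blocks 195.251.55.111 from the Argus PAP port",
    ),
    "High": (
        "monitoring is not authorized",
        "add the monitoring host's ACE with pap-admin add-ace",
    ),
    "Critical": (
        "Argus is not pulling rules from argus.cern.ch",
        "follow the NGI Argus setup guide on the Nikhef wiki",
    ),
}


def explain(status: str) -> str:
    """Return a one-line explanation for a dashboard status value."""
    if status not in ARGUS_BAN_STATUS:
        return f"{status}: not a documented argus-ban result"
    meaning, action = ARGUS_BAN_STATUS[status]
    return f"{status}: {meaning}; action: {action}"


for name in ARGUS_BAN_STATUS:
    print(explain(name))
```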
The parent ticket is https://ggus.eu/?mode=ticket_info&ticket_id=120770
2016_06_13 UPDATE pending tickets:
- NGI_MD https://ggus.eu/?mode=ticket_info&ticket_id=120746
- NGI_FI https://ggus.eu/?mode=ticket_info&ticket_id=120747
- NGI_MARGI https://ggus.eu/?mode=ticket_info&ticket_id=120765
FedCloud status
- only GoeGrid (NGI_DE) is not publishing images
- open tickets to sites where dteam is not working: MK-04-FINKICLOUD -> this can lead to suspension as per OLA!
- cloud profiles still under approval at OMB; an email will be circulated by EGI Operations for approval. If the profiles are approved, the new profile will be used for A/R from July 1st, and suspensions will start from August 1st
A/R Profile | March | April | May |
improvements | 2 | 6 | 5 |
unchanged | 11 | 7 | 5 |
worsening | 9 | 10 | 12 |
- CYFRONET-CLOUD (+100%): in the old profile it fails the accounting test
- GoeGRID (+80.7%): in the old profile it fails the cdmi test
- TR-FC1-ULAKBIM (+47.59%): it was failing the accounting test in the old profile
- HG-09-Okeanos-Cloud: https://ggus.eu/index.php?mode=ticket_info&ticket_id=122012 (SOLVED, updated the cert)
- failures with the probes:
- eu.egi.cloud.OCCI-Context-ops: CATEGORIES CRITICAL - SSL_connect returned=1 errno=0 state=error: certificate verify failed
- eu.egi.cloud.OCCI-VM-ops: CRITICAL - SSL connection with "https://okeanos-occi2.hellasgrid.gr:9000/" could not be established! SSL_connect
- MK-04-FINKICLOUD unreachable
- NCG-INGRID-PT (+26.74%): https://ggus.eu/index.php?mode=ticket_info&ticket_id=122013 (a new server is going to be put in production, decommissioning the old one)
- failures mainly with the cloud probes:
- eu.egi.cloud.OCCI-VM-ops (sometimes warning, sometimes critical): WARNING - "http://aurora.ncg.ingrid.pt:8787" failed to instantiate a COMPUTE instance in the given timeframe! Timeout: 300s
- eu.egi.cloud.OpenStack-VM-ops: Critical: could not fetch flavor ID, endpoint does not correctly exposes available flavors: 110 Connection timed out
- SCAI (-21.61%) https://ggus.eu/index.php?mode=ticket_info&ticket_id=122015 (CAs not completely updated)
- some repeated failures with the CA probes
- also eu.egi.cloud.OCCI-VM-ops CRITICAL - Unexpected response from https://fc.scai.fraunhofer.de:8787/! Net::HTTP::Post failed! HTTP Response status: [500] Internal Server Error : The server has either erred or is incapable of performing the requested operation.
- UPV-GRyCAP (-24.56%) https://ggus.eu/index.php?mode=ticket_info&ticket_id=122014 (SOLVED, CAs updated)
- it is still failing the eu.egi.OCCI-IGTF probe
- org.nagios.OCCI-TCP: 05-11-2016 17:56:27 Connection refused
AOB
Next meeting
- 11 Jul 2016 https://indico.egi.eu/indico/event/3003/
- new calendar available until end of 2016 https://indico.egi.eu/indico/category/32/