Hi again,

this may be related to the issue of failing spooling/staging: I monitored the behaviour when restarting the condor service unit. On our CEs, the schedd did not properly reconnect to the shadows. Since the restart took hardly any time, the schedd tried to reconnect to the shadows; however, it failed and gave up, dropping all shadows.
What makes me suspicious are messages in the ShadowLog [1] that point to a problem delegating the grid proxies.
However, the directory and the proxy file exist and are readable [2], so I am a bit lost. AFAIS there is no hurdle to reading from or writing to the FS, and the proxy itself is parseable with the local openssl version [3]. But the behaviour looks similar to my other case, where files are not spooled from the CE to the LRMS Schedd?
Cheers and thanks for any ideas,
  Thomas

[1]
03/08/21 14:01:09 (9334.0) (2001211): relisock_gsi_get (read from socket) failure
03/08/21 14:01:09 (9334.0) (2001211): ReliSock::put_x509_delegation(): delegation failed: Failed to receive delegation request
03/08/21 14:01:09 (9334.0) (2001211): DoUpload: SHADOW at 131.169.223.119 failed to send file(s) to <131.169.160.33:41404>: error sending /var/lib/condor-ce/spool/5212/0/cluster5212.proc0.subproc0/tmpL6wa9T
03/08/21 14:01:09 (9334.0) (2001211): File transfer failed (status=0).
03/08/21 14:01:05 (9247.0) (2001032): condor_write() failed: send() 13 bytes to <131.169.162.103:34015> returned -1, timeout=0, errno=104 Connection reset by peer.
03/08/21 14:01:05 (9247.0) (2001032): Buf::write(): condor_write() failed
03/08/21 14:01:05 (9247.0) (2001032): ReliSock::put_x509_delegation(): delegation failed: globus_gsi_proxy: Error with X.509 request structure: Couldn't convert X509_REQ struct from DER encoded to internal form OpenSSL Error: a_d2i_fp.c:247: in library: asn1 encoding routines, function ASN1_D2I_READ_BIO: not enough data
03/08/21 14:01:05 (9247.0) (2001032): DoUpload: SHADOW at 131.169.223.119 failed to send file(s) to <131.169.162.103:34015>: error sending /var/lib/condor-ce/spool/5380/5/cluster5380.proc5.subproc0/tmpL6wa9T
03/08/21 14:01:05 (9247.0) (2001032): File transfer failed (status=0).

[2]
> ls -all /var/lib/condor-ce/spool/5380/5/cluster5380.proc5.subproc0
total 80
drwx------ 2 belleprd000 belleprd  4096 Mar  8 08:14 .
drwxr-xr-x 4 condor      condor    4096 Mar  8 08:14 ..
-rw-r--r-- 1 belleprd000 belleprd   528 Mar  8 13:57 5380.5.log
-rwxr-xr-x 1 belleprd000 belleprd 55919 Mar  8 08:14 DIRAC_d1xT5x_pilotwrapper.py
-rw------- 1 belleprd000 belleprd 10362 Mar  8 08:14 tmpL6wa9T

> openssl x509 -in /var/lib/condor-ce/spool/5380/5/cluster5380.proc5.subproc0/tmpL6wa9T -noout -text
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 898008020 (0x358683d4)
    Signature Algorithm: sha256WithRSAEncryption
    ...
         b5:6c:b2:b6:c2:12:b6:82:2c:bc:1a:06:8f:b3:dc:b7:7f:16:
         34:46

[3]
globus-gsi-openssl-error-4.2-1.el7.x86_64
globus-openssl-module-5.2-1.el7.x86_64
openssl-1.0.2k-21.el7_9.x86_64
openssl-libs-1.0.2k-21.el7_9.x86_64
condor-8.9.11-1.el7.x86_64
condor-boinc-7.16.11-1.el7.x86_64
condor-classads-8.9.11-1.el7.x86_64
condor-externals-8.9.11-1.el7.x86_64
condor-procd-8.9.11-1.el7.x86_64
htcondor-ce-4.4.1-3.el7.noarch
htcondor-ce-apel-4.4.1-3.el7.noarch
htcondor-ce-bdii-4.4.1-3.el7.noarch
htcondor-ce-client-4.4.1-3.el7.noarch
htcondor-ce-condor-4.4.1-3.el7.noarch
htcondor-ce-view-4.4.1-3.el7.noarch
python2-condor-8.9.11-1.el7.x86_64
python3-condor-8.9.11-1.el7.x86_64
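
PS: In case someone wants to check their own ShadowLog for the same pattern, here is a minimal sketch of how the delegation failures in [1] could be grouped per job. It assumes the line layout seen in the excerpt above (timestamp, job ID in parentheses, PID in parentheses, message); the names LINE_RE and delegation_failures are just made up for this example and are not part of HTCondor.

```python
import re
from collections import defaultdict

# Assumed line layout, taken from the excerpt in [1]:
#   MM/DD/YY HH:MM:SS (cluster.proc) (pid): message
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"\((?P<job>\d+\.\d+)\) \((?P<pid>\d+)\): (?P<msg>.*)$"
)


def delegation_failures(lines):
    """Return {job_id: [(timestamp, message), ...]} for 'delegation failed' lines."""
    failures = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        # Lines that do not fit the assumed layout are skipped silently.
        if match and "delegation failed" in match.group("msg"):
            failures[match.group("job")].append(
                (match.group("ts"), match.group("msg"))
            )
    return dict(failures)


# Two lines copied from [1], used as sample input.
sample = [
    "03/08/21 14:01:09 (9334.0) (2001211): ReliSock::put_x509_delegation(): "
    "delegation failed: Failed to receive delegation request",
    "03/08/21 14:01:09 (9334.0) (2001211): File transfer failed (status=0).",
]

for job, hits in delegation_failures(sample).items():
    for ts, msg in hits:
        print(job, ts, msg)
```

To run it against a real log, the sample list can be replaced by an open file handle, since the function only iterates over lines.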