Hi all,

I am trying to understand several jobs that entered our CondorCE and went on hold when their submission to Condor failed, apparently because the Out & Err files were missing(?).
For example, CondorCE job 406446.0 was in principle routed to 422981.0 [1]. The CE job's spool directory exists [2].
However, as far as I can see, the resulting Condor job fails during submission when the stderr and stdout files are opened for reading [3].
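To check other held jobs the same way, a small Python sketch could compare the Out/Err names in the job ad with what actually exists in Iwd. It assumes the ad has been dumped in long form (for example with `condor_ce_q -l 406446.0`) into text like [1]. `parse_ad` and `missing_output_files` are hypothetical helper names, and the parser only handles simple `Attr = value` lines, not full ClassAd expressions.

```python
import os
import re

# Matches long-form ClassAd lines such as: Out = "406446.0.out"
_ATTR_RE = re.compile(r'^\s*(\w+)\s*=\s*(.*?)\s*$')


def parse_ad(text):
    """Parse simple 'Attr = value' lines into a dict, stripping surrounding double quotes."""
    ad = {}
    for line in text.splitlines():
        m = _ATTR_RE.match(line)
        if not m:
            continue  # skip blank lines and anything that is not an assignment
        key, value = m.group(1), m.group(2)
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        ad[key] = value
    return ad


def missing_output_files(ad, attrs=("Out", "Err")):
    """Return the paths named by Out/Err that do not exist on disk."""
    iwd = ad.get("Iwd", "")
    missing = []
    for attr in attrs:
        name = ad.get(attr)
        if not name:
            continue
        # Relative names are resolved against the job's initial working directory (Iwd)
        path = name if os.path.isabs(name) else os.path.join(iwd, name)
        if not os.path.exists(path):
            missing.append(path)
    return missing
```

For job 406446.0 this should report both 406446.0.out and 406446.0.err, matching the `ls` output in [2]; an Out/Err of /dev/null would never be flagged, since that path always exists.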
I am not sure whether the Out & Err files make much sense for a job at this stage (and whether one could, as a quick fix, replace them in a route with /dev/null or similar).
Cheers,
Thomas

[1]
ClusterId = 406446
UserLog = "406446.0.log"
GlobalJobId = "grid-htcondorce0.desy.de#406446.0#1614750664"
Environment = "HTCONDOR_JOBID=406446.0"
Iwd = "/var/lib/condor-ce/spool/6446/0/cluster406446.proc0.subproc0"
RoutedToJobId = "422981.0"
Out = "406446.0.out"
Err = "406446.0.err"

[2]
> ls /var/lib/condor-ce/spool/6446/0/cluster406446.proc0.subproc0
406446.0.log  DIRAC_nd5lYU_pilotwrapper.py  tmpBU9zHQ

[3]
03/04/21 13:25:04 (cid:107) Transferring files for jobs 406446.0
03/04/21 13:25:04 (cid:107) spoolJobFiles(): started worker process
03/04/21 13:25:04 The submitting job ad as the FileTransferObject sees it
...
xcount = 1
03/04/21 13:25:04 ReliSock::put_file_with_permissions(): Failed to stat file '/var/lib/condor-ce/spool/6446/0/cluster406446.proc0.subproc0/406446.0.err': No such file or directory (errno: 2, si_error: 1)
03/04/21 13:25:04 ReliSock::put_file_with_permissions(): Failed to stat file '/var/lib/condor-ce/spool/6446/0/cluster406446.proc0.subproc0/406446.0.out': No such file or directory (errno: 2, si_error: 1)
03/04/21 13:25:04 DoUpload: (Condor error code 13, subcode 2) SCHEDD at 131.169.223.129 failed to send file(s) to <202.13.206.84:37118>: error reading from /var/lib/condor-ce/spool/6446/0/cluster406446.proc0.subproc0/406446.0.err: (errno 2) No such file or directory; TOOL failed to receive file(s) from <131.169.223.129:9619>
03/04/21 13:25:04 (cid:107) generalJobFilesWorkerThread(): failed to transfer files for job 406446.0
03/04/21 13:25:04 condor_write(): Socket closed when trying to write 29 bytes to <202.13.206.84:37118>, fd is 21
03/04/21 13:25:04 Buf::write(): condor_write() failed
03/04/21 13:25:04 ERROR - Staging of job files failed!
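To find other jobs hit by the same problem, a rough Python sketch could scan the CE schedd log for the two messages seen in [3]. The regular expressions only match the wording in this excerpt, the default log path /var/log/condor-ce/SchedLog is an assumption about the CE layout, and the function names are hypothetical.

```python
import re

# Patterns copied from the wording of the schedd log excerpt in [3]
STAT_FAIL_RE = re.compile(r"Failed to stat file '([^']+)'")
JOB_FAIL_RE = re.compile(r"failed to transfer files for job (\d+\.\d+)")


def scan_schedlog(lines):
    """Return (paths that could not be stat'ed, job ids whose file staging failed),
    each de-duplicated and kept in order of first appearance."""
    missing, jobs = [], []
    for line in lines:
        for path in STAT_FAIL_RE.findall(line):
            if path not in missing:
                missing.append(path)
        for job in JOB_FAIL_RE.findall(line):
            if job not in jobs:
                jobs.append(job)
    return missing, jobs


def scan_schedlog_file(path="/var/log/condor-ce/SchedLog"):
    """Scan a log file on disk; the default path is an assumption, pass the real one if it differs."""
    with open(path, errors="replace") as fh:
        return scan_schedlog(fh)
```

The job ids it returns can then be checked one by one against their ads, e.g. with `condor_ce_q -l`.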