Dear all,

Further to the email below, we can’t reproduce this problem with jobs on
Condor that are not compiled Matlab executables. Our other jobs suspend and
resume fine. The problem seems to occur only when a compiled Matlab executable
is running on Condor and a user logs into the compute node and then back out
within a 10 minute time frame. Has anybody else had any problems like
this?

Shaun

From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Shaun J. O'Callaghan

Dear all,

I can successfully reproduce the strange
job suspension and termination problem by logging into a machine which is
currently executing a job and then logging out within the 15 minute suspension
interval. When the job is resumed it terminates with the return value
143. Can somebody please confirm whether this is a bug in Condor 6.8.0 or
not? Again, I’m not running Java universe jobs, as other people have
done in the past; this is a vanilla job executing a compiled Matlab executable
across Condor.

Regards,
Shaun

From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Shaun J. O'Callaghan

Dear All,

We’re experiencing some very strange intermittent
problems when dealing with large batches of jobs (14,000+) on Condor v 6.8.0
(Linux Central Manager/XP Compute nodes). Some jobs are terminating with a return value of
143. These are vanilla jobs and are actually compiled Matlab executables that
rely on the Matlab Component Runtime (MCR) which is present and in the paths of
all of the compute nodes. This is not a required library issue as only
120 or so jobs failed in a batch of 14,000+. I’ve pulled out the details that depict the
lifetime of one of the jobs that failed with this return value and listed the
details below.

I’ve read reports on the Condor list suggesting that this problem can occur
when a job is suspended and a user logs out of the machine. As mentioned, we’re
running 6.8.0 across the
pool. Has this problem been rectified in 6.8.2 or can anyone provide any
further information on this?

Kind Regards,
Shaun

Job details below

000 (014.1804.000) 11/08 08:34:57 Job submitted from host: <xxx.xxx.xxx.xxx:1058>
001 (014.1804.000) 11/09 15:30:12 Job executing on host: <xxx.xxx.xxx.xxx:2956>
006 (014.1804.000) 11/09 15:50:21 Image size of job updated: 97528
010 (014.1804.000) 11/09 16:08:57 Job was suspended.
    Number of processes actually suspended: 2
011 (014.1804.000) 11/09 16:18:47 Job was unsuspended.
010 (014.1804.000) 11/09 17:40:31 Job was suspended.
    Number of processes actually suspended: 2
011 (014.1804.000) 11/09 17:49:33 Job was unsuspended.
006 (014.1804.000) 11/09 17:49:41 Image size of job updated: 97536
005 (014.1804.000) 11/09 17:49:42 Job terminated.
    (1) Normal termination (return value 143)
        Usr 0 01:56:36, Sys 0 00:01:26 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
        Usr 0 01:56:36, Sys 0 00:01:26 - Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
    151 - Run Bytes Sent By Job
    3897398 - Run Bytes Received By Job
    151 - Total Bytes Sent By Job
    3897398 - Total Bytes Received By Job
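
With 14,000+ jobs in a batch, it may help to pick out the affected jobs from the user log automatically. The following is only a minimal sketch in Python. It assumes the log has one event header per line in the layout shown above (event code, job id in parentheses, date, time, text), with any detail lines following the header. The function names `summarize` and `suspect_jobs` are hypothetical, and the script reads the log text only; it does not use any Condor API.

```python
import re
from collections import defaultdict

# Event header layout assumed from the log excerpt above:
# "<3-digit code> (<cluster>.<proc>.<subproc>) <MM/DD> <HH:MM:SS> <text>"
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+\.\d+\.\d+)\) (\d\d/\d\d \d\d:\d\d:\d\d) (.*)$"
)
RETVAL_RE = re.compile(r"return value (\d+)")


def summarize(log_text):
    """Return {job_id: {"suspensions": int, "return_value": int or None}}."""
    jobs = defaultdict(lambda: {"suspensions": 0, "return_value": None})
    current = None          # job id of the most recent event header
    in_termination = False  # True while reading detail lines of a 005 event
    for raw in log_text.splitlines():
        line = raw.strip()
        header = EVENT_RE.match(line)
        if header:
            code, current = header.group(1), header.group(2)
            in_termination = (code == "005")  # 005 = job terminated
            info = jobs[current]              # make sure the job is recorded
            if code == "010":                 # 010 = job was suspended
                info["suspensions"] += 1
            continue
        if in_termination and current is not None:
            found = RETVAL_RE.search(line)
            if found:
                jobs[current]["return_value"] = int(found.group(1))
    return dict(jobs)


def suspect_jobs(log_text, return_value=143):
    """Job ids that were suspended at least once and ended with return_value."""
    return sorted(
        job_id
        for job_id, info in summarize(log_text).items()
        if info["suspensions"] > 0 and info["return_value"] == return_value
    )
```

Calling `suspect_jobs(open("batch.log").read())` (the file name is a placeholder) would list the jobs that were suspended at least once and then terminated with return value 143. Comparing that list against all jobs that returned 143 would show whether every failure was preceded by a suspension.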