Re: [Condor-users] log file indicates termination of job, but output file is empty !?!
- Date: Thu, 25 Nov 2010 13:45:49 +0800
- From: <Greg.Hitchen@xxxxxxxx>
- Subject: Re: [Condor-users] log file indicates termination of job, but output file is empty !?!
Hi Rob
Ah yes, the old -10737*** error. Try googling it (without the -).
It is a seemingly random Windows error (unrelated to Condor) that some
machines return one day but not the next. Unfortunately the affected
machines can snaffle a lot of jobs: each job fails quickly, so the machine
is immediately ready to accept another job, which also fails, and so on.
A remedy we have used is to include this in the submit file:
on_exit_remove = (ExitCode == 0)
This means that unless the ExitCode is zero (e.g. it is -10737***) the job
will NOT be removed from the queue; it will be requeued and will hopefully
execute properly on a different machine this time. Of course this happens
for any nonzero exit code, so you need to be careful if your code exits
nonzero for any other reason (e.g. file not found). I guess you could
check for exactly -1073741502 instead, maybe something like
on_exit_remove = (ExitCode =!= -1073741502)
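To see how the two expressions differ, here is a rough sketch in plain Python. It is not anything Condor itself runs: the function names are made up for illustration, and it only models jobs that return an integer exit code (jobs that exit by a signal are ignored).

```python
# Exit status quoted in this thread; treated here as the "retry" code.
RETRY_EXIT_CODE = -1073741502


def remove_if_zero(exit_code: int) -> bool:
    """Model of on_exit_remove = (ExitCode == 0)."""
    return exit_code == 0


def remove_unless_retry_code(exit_code: int) -> bool:
    """Model of on_exit_remove = (ExitCode =!= -1073741502), integer codes only."""
    return exit_code != RETRY_EXIT_CODE


def describe(remove: bool) -> str:
    """Turn the policy result into words."""
    return "leaves the queue" if remove else "is requeued"


for code in (0, 1, RETRY_EXIT_CODE):
    print(f"exit {code}: (== 0) {describe(remove_if_zero(code))}; "
          f"(=!= {RETRY_EXIT_CODE}) {describe(remove_unless_retry_code(code))}")
```

The difference shows up for an ordinary failure such as exit code 1: the first form keeps requeueing the job, while the second lets it leave the queue and only retries the one specific Windows error.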
Note that we have seen two similar but slightly different -10737*** type
error numbers, so a check for one exact value may miss the other.
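When searching for these numbers it can help to convert the signed exit status into the unsigned 32-bit hexadecimal form that Windows documentation normally uses. A minimal Python sketch (the helper name is made up for illustration):

```python
def to_ntstatus_hex(exit_code: int) -> str:
    """Return the unsigned 32-bit hexadecimal form of a signed exit code."""
    # Masking with 0xFFFFFFFF reinterprets the negative value as unsigned.
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


print(to_ntstatus_hex(-1073741502))  # 0xC0000142
```

As far as I know, 0xC0000142 is the value Microsoft lists as STATUS_DLL_INIT_FAILED, which fits a machine-side Windows problem rather than anything in Condor or the job itself.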
Cheers
Greg
>Thank you. I have tested again using the error entry in the submit file.
>I found that both the error and output files are empty when this problem
>re-occurred.
>The log file tells me that the job has terminated.
>
>This is what I think is relevant in the log files:
>(Notice the exit status "-1073741502"; what does that mean?)
>
>ShadowLog:11/25 12:55:13 Initializing a VANILLA shadow for job 322.1464
>ShadowLog:11/25 12:55:13 (322.1464) (19248): Request to run on slot1@34-5
><115.125.128.213:1045> was ACCEPTED