[Condor-users] Complete jobs occasionally do not leave run state
- Date: Wed, 23 Jun 2004 13:07:10 -0400
- From: Joseph Turian <turian@xxxxxxxxx>
- Subject: [Condor-users] Complete jobs occasionally do not leave run state
Dear Condor community,
I am using DAGMan to submit a batch of standard universe jobs to a
heterogeneous pool (SunOS + Linux) on a shared filesystem. My Condor
version is 6.6.0.
On rare occasions, which unfortunately I cannot reproduce, a job will
complete but stay in the Run state indefinitely.
I know that the job has completed because the following line appears in
the stderr log: "Terminating at example 1". The final three lines of
main() in the program's C++ code are:
    cerr << "Terminating at example " << example_count << "\n";
    cerr.flush();
    return 0;
The only explanation I can think of is that maybe there was a problem
during memory deallocation immediately prior to program termination.
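To make that guess concrete: in C++, destructors of objects local to a function run after its return statement has been evaluated, so a message written just before "return 0" can reach stderr before any cleanup starts. Below is a minimal sketch of that ordering. The Model class and the run_like_main() and trace() functions are hypothetical stand-ins, not the real program, and the sketch only shows the ordering; it does not show that deallocation is what actually hung.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for an object that owns a lot of memory.
struct Model {
    std::ostream& log;
    explicit Model(std::ostream& l) : log(l) {}
    // Cleanup happens here; a hang at this point would occur
    // after the "Terminating" line has already been written.
    ~Model() { log << "Model destroyed\n"; }
};

// Mirrors the last three lines of main() quoted above, but writes to
// a caller-supplied stream so the ordering can be inspected.
int run_like_main(std::ostream& log, int example_count) {
    Model model(log);
    log << "Terminating at example " << example_count << "\n";
    log.flush();
    return 0;  // model's destructor runs only after this statement
}

// Collects everything written during one run into a string.
std::string trace(int example_count) {
    std::ostringstream log;
    run_like_main(log, example_count);
    return log.str();
}
```

Calling trace(1) yields the "Terminating at example 1" line first and the destructor's line second. Objects with static storage duration are destroyed later still, after main() itself has returned.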
The most recent time I encountered this behavior, the pertinent job
line from condor_q was:
50813.0 turian 6/23 07:22 0+03:27:32 R 0 200.0 weak-hypothesis.$$
condor_q -long says:
50813.000: Request is being serviced
I grep'ed for 50813 in the logs; the only thing out of the ordinary is
in the DAGMan log:
007 (50813.000.000) 06/23 10:24:39 Shadow exception!
Failed to connect to schedd!
8396914 - Run Bytes Sent By Job
14539960 - Run Bytes Received By Job
...
001 (50813.000.000) 06/23 10:24:40 Job executing on host: <128.122.140.86:32773>
What does this error mean? How can I avoid this in the future?
N.B. I removed the job from the queue and set up a new DAGMan job to
start where it left off (i.e., assuming the job completed successfully,
since the output was intact), so I cannot query this particular errant
job. But if anyone can suggest what diagnostics I can perform the next
time this occurs, I'm all ears.
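One check that needs nothing beyond the log text itself is to flag jobs whose two most recent events are 007 (Shadow exception) followed by 001 (Job executing), as in the excerpt above. Here is a rough sketch, assuming every event header has the form "NNN (cluster.proc.subproc) date time text" seen in that excerpt. The function names are hypothetical and this is not a Condor tool. A job that was restarted and is simply still running would be flagged as well, so a hit is a prompt to look closer, not proof of a stuck job.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Header of one log event, e.g.
// "007 (50813.000.000) 06/23 10:24:39 Shadow exception!"
struct Event {
    int code;         // 7 for the line above
    std::string job;  // "50813.000.000"
};

// Returns true and fills `out` if `line` looks like an event header:
// three digits, a space, then a parenthesized job id.
bool parse_event_header(const std::string& line, Event& out) {
    if (line.size() < 6) return false;
    for (int i = 0; i < 3; ++i)
        if (line[i] < '0' || line[i] > '9') return false;
    if (line[3] != ' ' || line[4] != '(') return false;
    const std::string::size_type close = line.find(')', 5);
    if (close == std::string::npos) return false;
    out.code = std::stoi(line.substr(0, 3));
    out.job = line.substr(5, close - 5);
    return true;
}

// Returns the ids of jobs whose last two events are 007 then 001.
std::vector<std::string> restarted_after_shadow_exception(
        const std::string& log_text) {
    // job id -> (previous event code, latest event code)
    std::map<std::string, std::pair<int, int>> last_two;
    std::istringstream in(log_text);
    std::string line;
    while (std::getline(in, line)) {
        // Skip indentation; detail lines and "..." fail the header check.
        const std::string::size_type start = line.find_first_not_of(" \t");
        if (start == std::string::npos) continue;
        Event ev;
        if (!parse_event_header(line.substr(start), ev)) continue;
        auto it = last_two.find(ev.job);
        if (it == last_two.end())
            last_two[ev.job] = std::make_pair(-1, ev.code);
        else
            it->second = std::make_pair(it->second.second, ev.code);
    }
    std::vector<std::string> flagged;
    for (const auto& entry : last_two)
        if (entry.second.first == 7 && entry.second.second == 1)
            flagged.push_back(entry.first);
    return flagged;
}
```

Passing the full text of the log to restarted_after_shadow_exception() gives a list of job ids to inspect by hand.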
Thanks,
Joseph
--
http://www.cs.nyu.edu/~turian/