Re: [Condor-users] Problems with checkpointing.
- Date: Tue, 03 May 2005 10:09:22 -0500
- From: Alain Roy <roy@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Problems with checkpointing.
> Can you be more specific about the errors you are getting?
OK, I was waiting for more details from users... I'll attach a bunch of
stuff below, trying to show lifecycle of jobs, but here's a typical log
entry when a job dies... I know this job was condor_compiled on a RH9
box, I don't know where it initially ran, but here it dies on a RH9 box:
001 (12450.852.000) 04/27 17:08:09 Job executing on host:
<129.89.200.78:51017>
...
005 (12450.852.000) 04/27 17:08:14 Job terminated.
(0) Abnormal termination (signal 11)
Hmmm...
Another thing... the user whose logs I'm just checking into has told me
that his failing jobs were condor_compiled under 6.7.3, and have been
failing on 6.7.6. I haven't heard back from the user whose snippets are
listed earlier in the thread.
Would the jobs having been condor_compiled under 6.7.3 make a difference?
I don't think it should make any difference, unless we fixed a bug in the
standard universe implementation. That said, I'm not aware of any relevant
bug fixes. It wouldn't hurt to try condor_compiling with 6.7.6, but I don't
expect it will help much.
At this point, I would try two things:
1) Look in the StarterLog and StartLog on the execution computer at the
time the job failed to see if there are any obvious problems.
2) Do you get a core file back that can be looked at to see where the
program died? If the program had a segfault, there are a few possibilities:
a) The user's code is flawed and it crashes of its own accord.
b) The Condor library that is linked with the job has a bug that
caused the crash.
c) The user relies on something that isn't true in the standard
universe.
http://www.cs.wisc.edu/condor/manual/v6.7/1_4Current_Limitations.html
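Before working through these possibilities, it can help to confirm from the user log which jobs actually died from a signal. The following is a rough Python sketch, assuming the log has the same shape as the excerpt quoted earlier in this message: a three-digit event code, the job ID in parentheses, and an "Abnormal termination (signal N)" line after a 005 event. The function name and regular expressions are hypothetical helpers for illustration, not part of Condor.

```python
import re

# Assumed shape, taken from the excerpt quoted earlier in this message:
#   005 (12450.852.000) 04/27 17:08:14 Job terminated.
#   (0) Abnormal termination (signal 11)
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\)")
SIGNAL_RE = re.compile(r"Abnormal termination \(signal (\d+)\)")

SAMPLE = """001 (12450.852.000) 04/27 17:08:09 Job executing on host:
<129.89.200.78:51017>
...
005 (12450.852.000) 04/27 17:08:14 Job terminated.
(0) Abnormal termination (signal 11)
"""


def abnormal_terminations(log_text):
    """Return (job_id, signal_number) pairs for jobs killed by a signal."""
    results = []
    current_job = None  # job ID of the 005 event being read, if any
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        event = EVENT_RE.match(line)
        if event:
            # Only a 005 (job terminated) event can carry the signal line.
            if event.group(1) == "005":
                current_job = "{}.{}".format(event.group(2), event.group(3))
            else:
                current_job = None
            continue
        sig = SIGNAL_RE.search(line)
        if sig and current_job is not None:
            results.append((current_job, int(sig.group(1))))
            current_job = None
    return results


if __name__ == "__main__":
    print(abnormal_terminations(SAMPLE))
```

On Linux, signal 11 is SIGSEGV, which is why the list above treats this as a segfault.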
There may be a subtle problem in the user's code. Refer to point 9 in the
link above: "All files must be opened read-only or write-only. A file
opened for both reading and writing will cause trouble if a job must be
rolled back to an old checkpoint image. For compatibility reasons, a file
opened for both reading and writing will result in a warning but not an
error." For example, what if the following sequence of events occurs?
* Open file for reading and writing
* CHECKPOINT
* Read some data, write new data based on this
* EVICT
<on new machine>
* Restart at checkpoint, read new data, get confused by the data, crash.
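The sequence above can be imitated with a toy Python simulation. It is only a sketch under stated assumptions: the "checkpoint" is a copy of an in-memory dictionary, not Condor's real checkpoint image, and the names step, run_with_rollback, and demo are hypothetical. The point it illustrates is that rolling back the program does not roll back the file.

```python
import os
import tempfile


def step(fh, state):
    """Read a counter from the file, then write back counter + 1."""
    fh.seek(0)
    value = int(fh.read().strip())
    state["seen"].append(value)  # remember what the program read
    fh.seek(0)
    fh.write(str(value + 1))
    fh.truncate()
    fh.flush()


def run_with_rollback(path):
    # Open file for reading and writing.
    with open(path, "r+") as fh:
        state = {"seen": []}
        # CHECKPOINT: only the in-memory state is saved, not the file contents.
        checkpoint = {"seen": list(state["seen"])}
        # Read some data, write new data based on it.
        step(fh, state)
    # EVICT: progress since the checkpoint is lost, but the write has
    # already reached the file.
    # <on new machine> Restart at the checkpoint and redo the same step.
    with open(path, "r+") as fh:
        state = {"seen": list(checkpoint["seen"])}
        step(fh, state)
    return state["seen"]


def demo():
    """Return (values the restarted program read, final file contents)."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w") as fh:
            fh.write("10")
        seen = run_with_rollback(path)
        with open(path) as fh:
            final = int(fh.read())
        return seen, final
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

An uninterrupted run would read 10 and leave 11 in the file. With the rollback, the restarted program reads 11 (its own earlier output) and leaves 12. In a real job, the equivalent mismatch could be a record the program does not expect, which is one way to end up with a crash.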
I'm not saying that it's definitely a bug in the user code. It may well be
in Condor. I'm just saying that it might be tricky to track it down.
-alain