[Condor-users] Condor 7.0.5 never restarting a successfully checkpointed job
- Date: Thu, 12 Mar 2009 10:26:20 +0000
- From: Marcus Bannerman <m.bannerman@xxxxxxxxxxxxxxxxxxxxxxxxx>
- Subject: [Condor-users] Condor 7.0.5 never restarting a successfully checkpointed job
Hello,
I've only just started using Condor, but hopefully I've included enough
information to start the debugging.
I was hoping someone could help me debug a checkpointing problem on a
Rocks Clusters/Condor 7.0.5 install: checkpointed jobs are not
restarting.
I made a trivial C++ program,
#########################
#include <iostream>
#include <cmath>

int main()
{
  for (;;)
  {
    double sum(0);
    for (size_t i(0); i < 1000000; ++i)
      sum += std::sqrt(i);
    std::cout << "It turns out the sum is " << sum;
  }
  return 0;
}
#########################
I compiled it with "condor_compile g++ test.cpp -o test.bin" and
submitted it with
#########################
universe = standard
executable = /home/mjki2mb2/dynamo/test.bin
arguments =
log = condor.log
output = condor.out
error = condor.error
queue
#########################
The job runs fine and the output is as expected. If I then use
condor_vacate on the running job, the task checkpoints and stops, but
it never restarts. Running condor_q -better-analyze gives
#########################
015.000: Run analysis summary. Of 100 machines,
0 are rejected by your job's requirements
100 reject your job because of their own requirements
0 match but are serving users with a better priority in the pool
0 match but reject the job for unknown reasons
0 match but will not currently preempt their existing job
0 are available to run your job
Last successful match: Thu Mar 12 08:58:28 2009
Last failed match: Thu Mar 12 09:19:32 2009
Reason for last match failure: no match found
WARNING: Be advised: Request 15.0 did not match any resource's constraints
The following attributes are missing from the job ClassAd:
CheckpointPlatform
########################
Now the only problem I can find is that my job has
LastCheckpointPlatform = "LINUX INTEL 2.6.x normal 0x40000000"
but every node I have has
CheckpointPlatform = "LINUX INTEL 2.6.x normal 0x4001c000"
However, if I ssh to any node (I've tested every node using tentakel) and run
/opt/condor/libexec/condor_ckpt_probe --vdso-addr
I obtain
VDSO: 0x40000000
(I got this executable name from the condor_config; I thought it's what
you use to generate that hex address. The option is a guess.)
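To make the mismatch explicit, here is a minimal sketch that pulls the
trailing field out of the two platform strings quoted above and checks
whether that field is the only thing that differs. It is only an
illustration and not part of Condor: the helper names last_field_start,
last_field and differ_only_in_last_field are hypothetical, and it
assumes the last whitespace-separated token is the hex address.

```cpp
#include <string>

// Hypothetical helper (not part of Condor): index at which the last
// whitespace-separated token of s begins. Returns s.size() if s is blank.
std::string::size_type last_field_start(const std::string& s)
{
  const std::string::size_type end = s.find_last_not_of(" \t");
  if (end == std::string::npos)
    return s.size();
  const std::string::size_type sep = s.find_last_of(" \t", end);
  return (sep == std::string::npos) ? 0 : sep + 1;
}

// Hypothetical helper: the last token itself, assumed here to be the
// hex address field of a checkpoint platform string.
std::string last_field(const std::string& s)
{
  const std::string::size_type end = s.find_last_not_of(" \t");
  if (end == std::string::npos)
    return "";
  const std::string::size_type start = last_field_start(s);
  return s.substr(start, end - start + 1);
}

// True when a and b agree on everything before the last token but the
// last tokens themselves differ.
bool differ_only_in_last_field(const std::string& a, const std::string& b)
{
  return a.substr(0, last_field_start(a)) == b.substr(0, last_field_start(b))
      && last_field(a) != last_field(b);
}
```

Calling differ_only_in_last_field with the job's LastCheckpointPlatform
and a node's CheckpointPlatform from above returns true, so the two
strings agree on everything except the final hex field.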
I've checked all the condor logs and they're not much help, even with
D_ALL set for all daemons.
Please help. I've built this cluster and almost everything works fine,
but I can't get my head round the checkpoint error. Is there any
way I can force a regeneration of the checkpoint platform? Thanks to
Rocks Clusters every node is identical in setup, so could I just set
IsValidCheckpointPlatform = FALSE / TRUE? (I thought it would be TRUE,
but I think the current expression evaluates to FALSE when it's OK.)
Thanks in advance,
Marcus Bannerman