Re: [Condor-users] Condor 7.0.5 never restarting a successfully checkpointed job
- Date: Thu, 12 Mar 2009 13:39:54 +0000
- From: Marcus Bannerman <m.bannerman@xxxxxxxxxxxxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] Condor 7.0.5 never restarting a successfully checkpointed job
I've cured the symptoms, but I have no idea what the underlying problem was.
For future reference: I killed all the condor_startd daemons.
Once they restarted, the benchmarks reran and CheckpointPlatform was
regenerated on each compute node. The regenerated value now has the
correct vsystable page address at the end.
Marcus Bannerman
2009/3/12 Marcus Bannerman <m.bannerman@xxxxxxxxxxxxxxxxxxxxxxxxx>:
> Hello,
> I've only just started using Condor, but hopefully I've included enough
> information to start the debugging.
>
> I was hoping someone could help me debug a checkpoint problem on a
> Rocks clusters/Condor 7.0.5 install. Checkpointed jobs are not
> restarting.
>
> Making a trivial C++ program,
> #########################
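> #include <cmath>
> #include <cstddef>
> #include <iostream>
>
> int main()
> {
>   // Loop forever so the job keeps running until it is vacated.
>   for (;;)
>   {
>     double sum(0);
>     for (std::size_t i(0); i < 1000000; ++i)
>       sum += std::sqrt(static_cast<double>(i)); // explicit cast: sqrt of an integer can be ambiguous before C++11
>
>     std::cout << "It turns out the sum is " << sum;
>   }
>
>   return 0; // never reached
> }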
> #########################
>
> Compiling it with "condor_compile g++ test.cpp -o test.bin" and
> submitting it with
>
> #########################
> universe = standard
> executable = /home/mjki2mb2/dynamo/test.bin
> arguments =
> log = condor.log
> output = condor.out
> error = condor.error
>
> queue
> #########################
>
> runs the job fine; the output is as expected. If I then use
> condor_vacate on the running job, the task checkpoints and stops, but
> then never restarts. Running condor_q -better-analyze gives
> #########################
> 015.000: Run analysis summary. Of 100 machines,
> 0 are rejected by your job's requirements
> 100 reject your job because of their own requirements
> 0 match but are serving users with a better priority in the pool
> 0 match but reject the job for unknown reasons
> 0 match but will not currently preempt their existing job
> 0 are available to run your job
> Last successful match: Thu Mar 12 08:58:28 2009
> Last failed match: Thu Mar 12 09:19:32 2009
> Reason for last match failure: no match found
>
> WARNING: Be advised: Request 15.0 did not match any resource's constraints
>
>
> The following attributes are missing from the job ClassAd:
>
> CheckpointPlatform
> ########################
>
> Now the only problem I can find is that my job has
>
> LastCheckpointPlatform = "LINUX INTEL 2.6.x normal 0x40000000"
>
> but every node I have has
>
> CheckpointPlatform = "LINUX INTEL 2.6.x normal 0x4001c000"
>
> However, if I ssh to any node (I've tested every node using tentakel) and run
>
> /opt/condor/libexec/condor_ckpt_probe --vdso-addr
>
> I obtain
> VDSO: 0x40000000
>
> (I got this executable name from condor_config; I think it's what
> generates that hex address. The option is a guess.)
>
> I've checked all the Condor logs and they're not much help, even with
> D_ALL set for all daemons.
> Please help: I've built this cluster and almost everything works fine,
> but I can't get my head round the checkpoint error. Is there any
> way I can force a regeneration of the checkpoint platform? Thanks to
> Rocks clusters every node is identical in setup, so could I just set
>
> IsValidCheckpointPlatform = FALSE / TRUE
>
> (I thought it would be TRUE, but I think the current expression
> evaluates to false when it's OK)?
>
> Thanks in advance,
> Marcus Bannerman
>
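
For anyone comparing the two platform strings quoted above, here is a minimal Python sketch that splits them into whitespace-separated fields and reports which fields differ. The helper name differing_fields is hypothetical and not part of Condor. The sketch only compares strings; it assumes nothing about how Condor itself evaluates IsValidCheckpointPlatform.

```python
def differing_fields(job_platform, machine_platform):
    """Return (index, job_field, machine_field) for every field that differs.

    Hypothetical helper for illustration only; it is not a Condor API.
    """
    job_fields = job_platform.split()
    machine_fields = machine_platform.split()
    width = max(len(job_fields), len(machine_fields))
    # Pad the shorter list so a missing field shows up as a difference.
    job_fields += [None] * (width - len(job_fields))
    machine_fields += [None] * (width - len(machine_fields))
    return [
        (i, j, m)
        for i, (j, m) in enumerate(zip(job_fields, machine_fields))
        if j != m
    ]


# Values quoted in the original message.
job_last = "LINUX INTEL 2.6.x normal 0x40000000"
node_advertised = "LINUX INTEL 2.6.x normal 0x4001c000"

diffs = differing_fields(job_last, node_advertised)
for index, job_field, machine_field in diffs:
    print(f"field {index}: job has {job_field}, node has {machine_field}")
```

With the two values from the message, only the trailing hex address differs, which is consistent with the regenerated CheckpointPlatform fixing the match.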