Re: [Condor-users] Checkpointing failed on X86_64
- Date: Wed, 22 Nov 2006 04:57:47 GMT-6
- From: tannenba@xxxxxxxxxxx (Todd Tannenbaum)
- Subject: Re: [Condor-users] Checkpointing failed on X86_64
Previously Junjun Mao wrote:
> I compiled this simple program with condor_compile gcc -o count count.c
>
<snip>
>
> When I used condor_hold while the program was running I got this error
> in the log file:
>
> 001 (008.000.000) 11/17 19:13:25 Job executing on host: <10.10.20.90:42208>
> ...
> 004 (008.000.000) 11/17 19:15:20 Job was evicted.
> (0) Job was not checkpointed.
> Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
> Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
> 570 - Run Bytes Sent By Job
> 4754958 - Run Bytes Received By Job
>
> I looked for the manual
>
> http://www.cs.wisc.edu/condor/manual/v6.8/1_5Availability.html#se:Availability
>
> It appears condor_compile is not supported on my platform Fedora Core
> 4/Opteron. Is this the real reason?
>
I doubt it if you are running Condor v6.8.2, since that version
added 64-bit Linux checkpoint support.

I don't recall if condor_hold will force a checkpoint or not.
So I would retry your test using "condor_vacate" (or
condor_vacate_job) to checkpoint and leave the machine, or
"condor_checkpoint" (or condor_checkpoint_job) to checkpoint and
keep running.

Another thought: maybe the above happened because the job only
ran for less than 2 minutes. Condor will (purposefully) not
bother to checkpoint upon pre-emption unless more than X seconds
of forward progress was made. I don't recall off the top of my
head what X is, sorry, but it was short. 3 minutes perhaps?
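
To put a number on the run time, the two event timestamps in the quoted log can be subtracted. The following is only a rough Python sketch: the function names are illustrative and not part of Condor, the year 2006 is assumed from the message date because the log omits it, and the 180-second cutoff is just the "3 minutes perhaps" guess above, not a confirmed Condor setting.

```python
from datetime import datetime

# Assumption: the user log omits the year, so 2006 is taken from the message date.
ASSUMED_YEAR = 2006

def run_seconds(start: str, end: str) -> int:
    """Seconds between two 'MM/DD HH:MM:SS' stamps from the job's user log."""
    fmt = "%Y %m/%d %H:%M:%S"
    t0 = datetime.strptime(f"{ASSUMED_YEAR} {start}", fmt)
    t1 = datetime.strptime(f"{ASSUMED_YEAR} {end}", fmt)
    return int((t1 - t0).total_seconds())

def below_threshold(elapsed_s: int, threshold_s: int) -> bool:
    """Hypothetical check: True if the run is shorter than the cutoff X."""
    return elapsed_s < threshold_s

# Stamps from the "Job executing" (001) and "Job was evicted" (004) events.
elapsed = run_seconds("11/17 19:13:25", "11/17 19:15:20")
print(elapsed)                        # 115
print(below_threshold(elapsed, 180))  # True for the guessed X of 3 minutes
```

So the job ran for 115 seconds before eviction, which would fall under a 3-minute cutoff if that guess is right.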
Regards,
Todd
--
Posted via a Palm OS PDA (Handspring Visor Edge)