Re: [Condor-users] Checkpointing failed on X86_64
- Date: Wed, 22 Nov 2006 14:47:44 -0500
- From: Junjun Mao <jmao@xxxxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] Checkpointing failed on X86_64
You are right.
As suggested by others, I am able to checkpoint and stop a job with
condor_vacate_job jobid; condor_hold jobid
and resume the job with
condor_release jobid
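
The sequence above could be wrapped in a small shell helper. This is only a sketch: the function names (run, pause_job, resume_job) and the DRY_RUN switch are made up for illustration and are not part of Condor, and the job id 8.0 is just an example taken from the log quoted below. Only the three condor_* commands come from this thread.

```shell
#!/bin/sh
# Sketch only: checkpoint-and-pause, then resume, a job built with
# condor_compile. run, pause_job, resume_job and DRY_RUN are hypothetical
# names for this example; they are not Condor features.

# Run a command, or only print it when DRY_RUN=1 (handy without a pool).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Vacate first so the job is asked to checkpoint and leave the machine,
# then hold it so it is not matched again right away.
pause_job() {
    run condor_vacate_job "$1"
    run condor_hold "$1"
}

# Release the held job so it can be scheduled again.
resume_job() {
    run condor_release "$1"
}
```

With DRY_RUN=1 set, "pause_job 8.0" only prints the two commands it would run, which is a cheap way to check the order before trying it on a real job.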
Junjun
On Wednesday 22 November 2006 14:42, Todd Tannenbaum wrote:
> Previously Junjun Mao wrote:
> > I compiled this simple program with condor_compile gcc -o count count.c
>
> <snip>
>
> > When I used condor_hold while the program was running I got this error
> > in the log file:
> >
> > 001 (008.000.000) 11/17 19:13:25 Job executing on host:
> > <10.10.20.90:42208>
> > ...
> > 004 (008.000.000) 11/17 19:15:20 Job was evicted.
> > (0) Job was not checkpointed.
> > Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
> > Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
> > 570 - Run Bytes Sent By Job
> > 4754958 - Run Bytes Received By Job
> >
> > I looked for the manual
> > http://www.cs.wisc.edu/condor/manual/v6.8/1_5Availability.html#se
> > :Availability
> > It appears condor_compile is not supported on my platform Fedora Core
> > 4/Opteron. Is this the real reason?
>
> I doubt it if you are running Condor v6.8.2, since that version
> added 64bit Linux checkpoint support.
>
> I don't recall if condor_hold will force a checkpoint or not.
> So I would retry your test using "condor_vacate" (or
> condor_vacate_job) to checkpoint and leave the machine, or
> "condor_checkpoint" (or condor_checkpoint_job) to checkpoint and
> keep running.
>
> Another thought: maybe the above happened because the job only
> ran for less than 2 minutes. Condor will (purposefully) not
> bother to checkpoint upon pre-emption unless more than X seconds
> of forward progress was made. I don't recall off the top of my
> head what X is, sorry, but it was short. 3 minutes perhaps?
>
> Regards,
> Todd
--
Dr. Junjun Mao, Research Associate
Steinman Hall, #1M-11
Levich Institute at City College of CUNY
140th Street & Convent Avenue
New York, NY 10031
(212) 650-6845 (Phone)
(212) 650-6835 (fax)