Re: [Condor-users] Problems with checkpointing.
- Date: Fri, 29 Apr 2005 15:20:56 -0500 (CDT)
- From: Paul Armor <parmor@xxxxxxxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] Problems with checkpointing.
Hi,
here's a typical failure. This is from one user's log covering 1600
submitted jobs, of which 976 failed after restarting following a
checkpoint/eviction. I'm just starting to go through the other users'
logs. I'm not sure whether all jobs that checkpoint/evict fail after
restarting, but I don't believe they do.
Snippets from the user's log:
000 (12450.023.000) 04/25 16:36:10 Job submitted from host: <129.89.201.232:57084>
001 (12450.023.000) 04/25 16:40:32 Job executing on host: <129.89.200.36:32774>
006 (12450.023.000) 04/25 17:38:09 Image size of job updated: 52448
...
004 (12450.023.000) 04/25 17:38:10 Job was evicted.
(1) Job was checkpointed.
Usr 0 00:44:47, Sys 0 00:00:11 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
49975312 - Run Bytes Sent By Job
4201851 - Run Bytes Received By Job
...
001 (12450.023.000) 04/25 19:45:45 Job executing on host: <129.89.201.56:32803>
005 (12450.023.000) 04/25 19:45:49 Job terminated.
(0) Abnormal termination (signal 11)
(0) No core file
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:44:47, Sys 0 00:00:11 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
304 - Run Bytes Sent By Job
53707404 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
009 (12450.023.000) 04/25 19:45:49 Job was aborted by the user.
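For going through the remaining logs, something like the following might help pick out the same pattern: an eviction (event 004) that reports "Job was checkpointed", followed later by a termination (event 005) that reports "Abnormal termination" for the same job. This is only a minimal sketch that assumes the event layout shown in the snippets above; the function name `find_failed_restarts` and the regular expression are illustrative and not part of Condor.

```python
import re

# Event header as it appears in the snippets above, e.g.
# "004 (12450.023.000) 04/25 17:38:10 Job was evicted."
# (Assumed layout: 3-digit event code, job ID in parentheses, date, time, text.)
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+\.\d+\.\d+)\) (\d\d/\d\d \d\d:\d\d:\d\d) (.*)$"
)


def find_failed_restarts(lines):
    """Return {job_id: detail} for jobs that terminated abnormally
    after an earlier eviction in which a checkpoint was written."""
    checkpointed = set()  # job IDs seen with a checkpointed eviction
    failed = {}           # job ID -> abnormal-termination detail line
    current = None        # (event code, job ID) of the event being read

    for raw in lines:
        line = raw.strip()
        header = EVENT_RE.match(line)
        if header:
            # A new event starts; remember its code and job ID.
            current = (header.group(1), header.group(2))
            continue
        if current is None:
            continue  # detail line before any event header; skip it

        code, job = current
        if code == "004" and "Job was checkpointed" in line:
            checkpointed.add(job)
        elif (code == "005" and "Abnormal termination" in line
              and job in checkpointed):
            failed[job] = line

    return failed
```

Passing an open log file (or any iterable of lines) to `find_failed_restarts` returns the job IDs that match, each with its termination detail, for example the "signal 11" line seen above. Jobs that terminate abnormally without a prior checkpointed eviction are deliberately left out, so the result can be compared against the total number of evicted jobs.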
Thanks for your help!
Paul
On Fri, 29 Apr 2005, Paul Armor wrote:
> Hi Alan,
>
> > I would think that these would be identical RPMs, since we don't distribute
> > different binaries for RedHat 9, Fedora Core 1, or Fedora Core 3: We build
> > it on RedHat 9 and it just works on the Fedora Core 1-3. I know that the
> > download web page lists them separately--this is to make it clear what to
> > download. But they are identical.
>
> OK, I was feeling "superstitious" ;-)
>
> > I'm also a bit confused--you're installing the checkpoint server on all the
> > execution computers?
>
> Yes, I inherited the spec file and process, so... (P.S. we're installing
> the same RPM on all nodes, using same condor_config, using different
> condor_config.local)
>
> > Can you be more specific about the errors you are getting?
>
> OK, I was waiting for more details from users... I'll attach a bunch of
> stuff below, trying to show lifecycle of jobs, but here's a typical log
> entry when a job dies... I know this job was condor_compiled on a RH9
> box, I don't know where it initially ran, but here it dies on a RH9 box:
>
> 001 (12450.852.000) 04/27 17:08:09 Job executing on host: <129.89.200.78:51017>
> ...
> 005 (12450.852.000) 04/27 17:08:14 Job terminated.
> (0) Abnormal termination (signal 11)
> (0) No core file
> Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
> Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
> Usr 0 01:30:00, Sys 0 00:00:32 - Total Remote Usage
> Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
> 304 - Run Bytes Sent By Job
> 58917520 - Run Bytes Received By Job
> 0 - Total Bytes Sent By Job
> 0 - Total Bytes Received By Job
> ...
>
> > Yeah--these are the same binaries. Sorry for the confusion. :(
>
> No worries, I still would have probably become superstitious ;-)
>
> > I think we need to see some log files to better help you.
>
> Actually, what's the preferred method of overwhelming you with logs?
> Shall I throw them up so as to be http-able? Or would you prefer email?
>
> Cheers,
> Paul
>
>
>
--
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ UWM-LSC Group Systems Administrator parmor@xxxxxxxxxxxxxxxxxxxx +
+ Physics 462 +
+ U. of W. - Milwaukee +
+ PO Box 413 414-229-2677 +
+ Milwaukee, WI 53201 fax 414-229-5589 +
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++