Re: [Condor-users] Problems with checkpointing.
- Date: Fri, 29 Apr 2005 16:22:36 -0500 (CDT)
- From: Paul Armor <parmor@xxxxxxxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] Problems with checkpointing.
Another thing... the user whose logs I've just started checking has told
me that his failing jobs were condor_compile'd under 6.7.3 and have been
failing on 6.7.6. I haven't heard back from the user whose snippets are
listed earlier in the thread.
Could it make a difference that the jobs were condor_compile'd under 6.7.3?
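To count how many jobs follow the pattern in the quoted snippets (evicted
with a checkpoint, then killed by a signal on restart), a rough sketch
along these lines might help. It assumes the user log looks like the
excerpts in this thread: a three-digit event code, the job ID in
parentheses, a timestamp, and "..." between events. The function names are
made up for illustration and are not part of Condor.

```python
import re

# Event header as seen in the snippets, e.g.
# "004 (12450.023.000) 04/25 17:38:10 Job was evicted."
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+\.\d+\.\d+)\) (\d\d/\d\d \d\d:\d\d:\d\d) (.*)$"
)
SIGNAL_RE = re.compile(r"Abnormal termination \(signal (\d+)\)")


def parse_events(text):
    """Yield (code, job_id, timestamp, body_lines) for each event."""
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = EVENT_RE.match(line)
        if m:
            if current:
                yield current
            current = (m.group(1), m.group(2), m.group(3), [m.group(4)])
        elif current and line and line != "...":
            # Detail line belonging to the most recent event header.
            current[3].append(line)
    if current:
        yield current


def crashed_after_checkpoint(text):
    """Map job ID -> signal for jobs that were evicted with a checkpoint
    (event 004) and later terminated abnormally (event 005)."""
    checkpointed = set()
    crashed = {}
    for code, job, _ts, body in parse_events(text):
        joined = "\n".join(body)
        if code == "004" and "Job was checkpointed" in joined:
            checkpointed.add(job)
        elif code == "005" and job in checkpointed:
            m = SIGNAL_RE.search(joined)
            if m:
                crashed[job] = int(m.group(1))
    return crashed
```

Calling crashed_after_checkpoint() on the contents of a user log returns
one entry per affected job. A job whose log shows only the restart and the
crash, with no earlier eviction event, is not reported.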
Thanks!
Paul
On Fri, 29 Apr 2005, Paul Armor wrote:
> Hi,
> here's a typical failure. This is from one user's logs for 1600
> submitted jobs, of which 976 failed after restarting following a
> checkpoint/eviction. I'm just starting to go through the other user's
> logs. I'm not sure if all jobs that checkpoint/evict fail after
> restarting, but I don't believe they do.
>
> Snippets from the user's log:
>
> 000 (12450.023.000) 04/25 16:36:10 Job submitted from host: <129.89.201.232:57084>
> 001 (12450.023.000) 04/25 16:40:32 Job executing on host: <129.89.200.36:32774>
> 006 (12450.023.000) 04/25 17:38:09 Image size of job updated: 52448
> ...
> 004 (12450.023.000) 04/25 17:38:10 Job was evicted.
> (1) Job was checkpointed.
> Usr 0 00:44:47, Sys 0 00:00:11 - Run Remote Usage
> Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
> 49975312 - Run Bytes Sent By Job
> 4201851 - Run Bytes Received By Job
> ...
> 001 (12450.023.000) 04/25 19:45:45 Job executing on host: <129.89.201.56:32803>
> 005 (12450.023.000) 04/25 19:45:49 Job terminated.
> (0) Abnormal termination (signal 11)
> (0) No core file
> Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
> Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
> Usr 0 00:44:47, Sys 0 00:00:11 - Total Remote Usage
> Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
> 304 - Run Bytes Sent By Job
> 53707404 - Run Bytes Received By Job
> 0 - Total Bytes Sent By Job
> 0 - Total Bytes Received By Job
> ...
> 009 (12450.023.000) 04/25 19:45:49 Job was aborted by the user.
>
>
> Thanks for your help!
> Paul
>
>
>
> On Fri, 29 Apr 2005, Paul Armor wrote:
>
> > Hi Alan,
> >
> > > I would think that these would be identical RPMs, since we don't distribute
> > > different binaries for RedHat 9, Fedora Core 1, or Fedora Core 3: We build
> > > it on RedHat 9 and it just works on the Fedora Core 1-3. I know that the
> > > download web page lists them separately--this is to make it clear what to
> > > download. But they are identical.
> >
> > OK, I was feeling "superstitious" ;-)
> >
> > > I'm also a bit confused--you're installing the checkpoint server on all the
> > > execution computers?
> >
> > Yes, I inherited the spec file and process, so... (P.S. we're installing
> > the same RPM on all nodes, using same condor_config, using different
> > condor_config.local)
> >
> > > Can you be more specific about the errors you are getting?
> >
> > OK, I was waiting for more details from users... I'll attach a bunch of
> > stuff below, trying to show the lifecycle of jobs, but here's a typical
> > log entry when a job dies... I know this job was condor_compile'd on a
> > RH9 box. I don't know where it initially ran, but here it dies on a RH9 box:
> >
> > 001 (12450.852.000) 04/27 17:08:09 Job executing on host: <129.89.200.78:51017>
> > ...
> > 005 (12450.852.000) 04/27 17:08:14 Job terminated.
> > (0) Abnormal termination (signal 11)
> > (0) No core file
> > Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
> > Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
> > Usr 0 01:30:00, Sys 0 00:00:32 - Total Remote Usage
> > Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
> > 304 - Run Bytes Sent By Job
> > 58917520 - Run Bytes Received By Job
> > 0 - Total Bytes Sent By Job
> > 0 - Total Bytes Received By Job
> > ...
> >
> > > Yeah--these are the same binaries. Sorry for the confusion. :(
> >
> > No worries, I still would have probably become superstitious ;-)
> >
> > > I think we need to see some log files to better help you.
> >
> > Actually, what's the preferred method of overwhelming you with logs?
> > Shall I throw them up so as to be http-able? Or would you prefer email?
> >
> > Cheers,
> > Paul
> >
> >
> >
>
>
--
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ UWM-LSC Group Systems Administrator parmor@xxxxxxxxxxxxxxxxxxxx +
+ Physics 462 +
+ U. of W. - Milwaukee +
+ PO Box 413 414-229-2677 +
+ Milwaukee, WI 53201 fax 414-229-5589 +
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++