Re: [Condor-users] Condor checkpointing problems?
- Date: Mon, 22 Nov 2004 12:03:05 -0500
- From: Dan Christensen <jdc@xxxxxx>
- Subject: Re: [Condor-users] Condor checkpointing problems?
Will Andrews <andrewsw@xxxxxxxxxxxxxxxxxxxxx> writes:
> At Purdue University we've recently installed a new cluster
> running RedHat ES3. It's running Condor 6.6.5 alongside Debian
> clusters running the same version. About a week ago a user
> reported seeing "shadow exception" errors in the logs. The jobs
> were unable to checkpoint. At first we thought it applied to all
> the nodes, but now we've narrowed it down to only nodes in the
> new RedHat cluster.
I don't know if it's related, but we had problems with a node running
FC1. Upgrading seems to have fixed it. The funny thing is, the
problem depended on a peculiar combination of where the job was
compiled and where it was run. Here's a message I sent to
condor-users in June (message id <87brk0dc2k.fsf@xxxxxx>). I don't
recall getting any responses.
Dan
Dan Christensen wrote:
> Alain Roy <roy@xxxxxxxxxxx> writes:
>
> > Richard O'Shaughnessy wrote:
> >> We recently rebuilt our cluster using Fedora Core 2. But while job
> >> output seems to work (at least, I can see output on some
> >> jobs), checkpointing doesn't seem to be working correctly:
> >
> > I'm not surprised--Fedora Core 2 uses a newer Linux kernel version
> > (2.6) than we have worked with in Condor.
>
> On our Condor cluster we're having trouble with checkpointing on the
> one machine which runs Fedora Core 1 (1, not 2). That machine uses
> glibc-2.3.2-101.4 with kernel 2.4.22-1.2188.nptlsmp.
>
> The situation is a bit complicated. Our pool runs a mix of Linuxes.
> Several machines run Debian testing, several run various versions
> of RedHat 7.x and 8.0, and just the one above runs FC1.
>
> Almost everything seems to work fine, except that jobs compiled using
> condor_compile on the Debian machines or on the FC1 machine don't
> checkpoint when run on the FC1 machine. They checkpoint on the other
> RedHat machines and on the Debian machines. And if I compile my jobs
> on any of the other RedHat machines, they checkpoint everywhere.
>
> The Debian machines on which I run condor_compile have libc6 2.3.2.
> And I've tried gcc 3.2.3 and 3.3.3, and both have the same problem.
> I also tried gcc 2.95, and compilation failed.
>
> The RedHat machines (besides the FC1 machine) have libc6 2.2.5 and gcc
> "2.96".
>
> All the machines run Condor 6.6.3.
>
> Any thoughts? We don't see anything useful in the log files. What
> debugging options would give more information?
>
> Dan
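
For reference, the compile/run combinations reported in the quoted message can be written down as a small lookup. This is only a sketch of the observations as described, not of anything Condor itself does; the platform labels and function names are made up for illustration.

```python
# Hypothetical shorthand for the three machine groups in the quoted report:
# Debian testing, the single Fedora Core 1 machine, and the other RedHat boxes.
PLATFORMS = ("debian", "fc1", "redhat_other")


def checkpoints(compiled_on: str, run_on: str) -> bool:
    """Return True if, per the report, a job built with condor_compile on
    `compiled_on` checkpoints when it runs on `run_on`."""
    if compiled_on not in PLATFORMS or run_on not in PLATFORMS:
        raise ValueError("unknown platform label")
    # The only failures reported: built on Debian or FC1, then run on FC1.
    return not (run_on == "fc1" and compiled_on in ("debian", "fc1"))


def failing_pairs():
    """List every (compiled_on, run_on) pair that fails to checkpoint."""
    return [(c, r) for c in PLATFORMS for r in PLATFORMS if not checkpoints(c, r)]


if __name__ == "__main__":
    for compiled_on, run_on in failing_pairs():
        print(f"built on {compiled_on}, run on {run_on}: no checkpoint")
```

Laid out this way, every failure involves FC1 as the execute machine, while binaries built on the older RedHat machines checkpoint everywhere, including on FC1.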