Re: [Condor-users] Condor checkpointing problems?
- Date: Mon, 22 Nov 2004 11:48:03 -0600
- From: Erik Paulson <epaulson@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Condor checkpointing problems?
On Mon, Nov 22, 2004 at 12:03:05PM -0500, Dan Christensen wrote:
> Will Andrews <andrewsw@xxxxxxxxxxxxxxxxxxxxx> writes:
>
> > At Purdue University we've recently installed a new cluster
> > running RedHat ES3. It's running Condor 6.6.5 alongside Debian
> > clusters running the same version. About a week ago a user
> > reported seeing "shadow exception" errors in the logs. The jobs
> > were unable to checkpoint. At first we thought it applied to all
> > the nodes, but now we've narrowed it down to only nodes in the
> > new RedHat cluster.
>
> I don't know if it's related, but we had problems with a node running
> FC1. Upgrading seems to have fixed it. The funny thing is, the
> problem depended on a peculiar combination of where the job was
> compiled and where it was run. Here's a message I sent to
> condor-users in June (message id <87brk0dc2k.fsf@xxxxxx>). I don't
> recall getting any responses.
>
We had previously believed that we checkpointed OK on FC1 machines - we
now know that Condor does NOT checkpoint on FC1 machines without some
changes to the machine.
I _think_ all that you need to do is disable exec-shield:
  echo 0 > /proc/sys/kernel/exec-shield
This removes the address-space randomization that is giving us
some trouble.
We're working on dealing with the new address space layouts; disabling
exec-shield system-wide is not a long-term solution.
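
If you want to check a node before changing anything, something like
the sketch below may help. It only reads and writes the same /proc
file as the command above. The helper names (exec_shield_status,
exec_shield_disable) and the EXEC_SHIELD_FILE override are made up for
this illustration and are not part of Condor; the override is there
only so the functions can be tried against a scratch file. Writing to
the real file needs root, and the setting does not survive a reboot.

```shell
#!/bin/sh
# Sketch only: report and optionally disable exec-shield on one node.
# EXEC_SHIELD_FILE defaults to the /proc path from the message; the
# override is an assumption added so this can be tested safely.
EXEC_SHIELD_FILE="${EXEC_SHIELD_FILE:-/proc/sys/kernel/exec-shield}"

# Hypothetical helper: print the current value, or "absent" if this
# kernel does not expose the setting at all.
exec_shield_status() {
    if [ -r "$EXEC_SHIELD_FILE" ]; then
        cat "$EXEC_SHIELD_FILE"
    else
        echo "absent"
    fi
}

# Hypothetical helper: write 0 to the setting, as the one-line command
# does, but fail with a message if the file is missing or not writable.
exec_shield_disable() {
    if [ -w "$EXEC_SHIELD_FILE" ]; then
        echo 0 > "$EXEC_SHIELD_FILE"
    else
        echo "cannot write $EXEC_SHIELD_FILE (not root, or setting absent)" >&2
        return 1
    fi
}

echo "exec-shield on $(hostname): $(exec_shield_status)"
```

Run it as-is to see the current value on a node, and call
exec_shield_disable as root only on the machines where checkpointing
fails.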
-Erik
> Dan
>
> Dan Christensen wrote:
>
> > Alain Roy <roy@xxxxxxxxxxx> writes:
> >
> > > Richard O'Shaughnessy wrote:
> > >> We recently rebuilt our cluster using fedora core 2. But while job
> > >> output seems to work (at least, I can see output on some
> > >> jobs), checkpointing doesn't seem to be working correctly:
> > >
> > > I'm not surprised--Fedora Core 2 uses a newer Linux kernel version
> > > (2.6) than we have worked with in Condor.
> >
> > On our Condor cluster we're having trouble with checkpointing on the
> > one machine which runs Fedora Core 1 (1, not 2). That machine uses
> > glibc-2.3.2-101.4 with kernel 2.4.22-1.2188.nptlsmp.
> >
> > The situation is a bit complicated. Our pool runs a mix of Linuxes.
> > Several machines run Debian testing, several run various versions
> > of RedHat 7.x and 8.0, and just the one above runs FC1.
> >
> > Almost everything seems to work fine, except that jobs compiled using
> > condor_compile on the Debian machines or on the FC1 machine don't
> > checkpoint when run on the FC1 machine. They checkpoint on the other
> > RedHat machines and on the Debian machines. And if I compile my jobs
> > on any of the other RedHat machines, they checkpoint everywhere.
> >
> > The Debian machines on which I run condor_compile have libc6 2.3.2.
> > And I've tried gcc 3.2.3 and 3.3.3, and both have the same problem.
> > I also tried gcc 2.95, and compilation failed.
> >
> > The RedHat machines (besides the FC1 machine) have libc6 2.2.5 and gcc
> > "2.96".
> >
> > All the machines run Condor 6.6.3.
> >
> > Any thoughts? We don't see anything useful in the log files. What
> > debugging options would give more information?
> >
> > Dan