Re: [Condor-users] Problems on Mac OS X
- Date: Tue, 18 Jan 2005 09:29:40 -0800
- From: John Wheez <john@xxxxxxxxxx>
- Subject: Re: [Condor-users] Problems on Mac OS X
dautret wrote:
1. I use Condor on a cluster of G5s running Mac OS X. After a few hours,
condor_master crashes on the submit machine, which kills the jobs. This
kind of problem seems to occur with many applications on Mac OS X, but
no one on the Mac forums has had an idea how to solve it.
It occurs with Condor 6.6.5, 6.6.7 and 6.7.2…
So, has anyone used Condor on Mac OS X and seen these crashes?
I run Condor 6.7.3 on a PowerBook G4 with 1 GB of memory and Condor
seems very stable. It was also stable under 6.7.2. I have it running
jobs where a single node computes 400 processes in a row without a
crash. My jobs are also fairly memory intensive; they run Apple's Shake
program.
If you update to Condor 6.7.3, beware that there is a slight bug: you
will need to add this to your local config file, or the OS type in the
machine's ClassAd will be wrong:
OpSys="OSX"
STARTD_EXPRS=$(STARTD_EXPRS) OpSys
Here are some questions which might help debug the problem:
1) Do all programs cause crashes? Try to monitor the RAM usage on the
machines. Are they running out of memory, or is the CPU activity
suddenly stuck at 100%, which indicates some sort of crash? Perhaps the
program you are running has a bug in it, like a memory leak, which
eventually causes the Macs to crash. This was your error message:
Exception: EXC_BAD_ACCESS (0x0001) Codes: KERN_PROTECTION_FAILURE (0x0002) at
0x00000000
2) Do all your G5s crash consistently, or is it a specific set of
machines? Perhaps a few of your machines have bad memory chips. That
happened to me on a fairly new computer, and it caused the system to
crash many times.
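To make the RAM/CPU check in question 1 concrete, here is a rough Python sketch that samples one process with the standard `ps` command and flags steadily growing memory. The function names, the 60-second interval and the 10 MB growth threshold are my own assumptions, not anything Condor provides, and the leak test is only a crude heuristic.

```python
import os
import subprocess
import time


def parse_ps_line(line):
    """Parse one line of `ps -o rss= -o pcpu=` output into (rss_kb, cpu_percent)."""
    rss, cpu = line.split()
    return int(rss), float(cpu)


def sample(pid):
    """Return (rss_kb, cpu_percent) for pid, or None if the process is gone."""
    result = subprocess.run(
        ["ps", "-o", "rss=", "-o", "pcpu=", "-p", str(pid)],
        capture_output=True,
        text=True,
        # Force the C locale so the CPU figure uses a '.' decimal point.
        env={**os.environ, "LC_ALL": "C"},
    )
    out = result.stdout.strip()
    if result.returncode != 0 or not out:
        return None
    return parse_ps_line(out)


def looks_like_leak(rss_samples, min_growth_kb=10240):
    """Heuristic (assumed threshold): RSS never shrinks and grows by min_growth_kb."""
    if len(rss_samples) < 2:
        return False
    never_shrinks = all(b >= a for a, b in zip(rss_samples, rss_samples[1:]))
    return never_shrinks and rss_samples[-1] - rss_samples[0] >= min_growth_kb


def monitor(pid, interval_s=60, count=10):
    """Print RSS and CPU for pid every interval_s seconds; return the RSS samples."""
    rss_samples = []
    for _ in range(count):
        reading = sample(pid)
        if reading is None:
            print("process %d is no longer running" % pid)
            break
        rss_kb, cpu = reading
        rss_samples.append(rss_kb)
        print("rss=%d KB cpu=%.1f%%" % (rss_kb, cpu))
        time.sleep(interval_s)
    return rss_samples
```

Call `monitor(pid)` with the PID of the job (or of condor_master) on a machine that tends to crash, then pass the returned list to `looks_like_leak`. Memory that only ever climbs, or CPU stuck at 100%, would point at the program rather than the hardware.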
2. Sometimes only the manager machine crashes, but not the submit
machine… At that point Condor stops… but when I launch condor_master
again on the manager machine, the jobs restart, although I have a
vanilla configuration…!
The jobs should not restart from the first process in the job cluster.
I think they should continue from the process number just before the
crash. Is this what is happening, or do your jobs restart from the
first process in the cluster of jobs?