RE: [Condor-users] BLAST jobs go to 0% CPU; condor thinks they're running
- Date: Tue, 8 Mar 2005 13:36:28 -0600
- From: "Michael Rusch" <mcrusch@xxxxxxxxxxxxxxxxxxx>
- Subject: RE: [Condor-users] BLAST jobs go to 0% CPU; condor thinks they're running
I don't know what you mean by your question, "Are the jobs still alive when
the CPU drops to 0%?" The processes still exist, as I can see them using
Task Manager (I'm in Windows XP--no ps command), but they never get any CPU
time.
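On Windows, where ps is not available, the same check can be approximated with the built-in tasklist command instead of watching Task Manager by hand. What follows is only a minimal sketch, not something taken from the thread: it assumes tasklist is present (it ships with XP Professional), that its verbose CSV output puts the PID in the second column and the CPU time in the eighth as H:MM:SS, and that the job processes are named condor_exec.exe as described in the quoted message. The function names and the 60-second interval are made up for illustration.

```python
import csv
import io
import subprocess
import time

IMAGE_NAME = "condor_exec.exe"  # process name mentioned in the quoted message


def parse_cpu_times(tasklist_csv):
    """Map PID -> CPU seconds from `tasklist /V /FO CSV /NH` output."""
    times = {}
    for row in csv.reader(io.StringIO(tasklist_csv)):
        # Verbose rows have 9 columns; blank lines and the
        # "INFO: No tasks are running..." message have fewer.
        if len(row) < 8:
            continue
        pid = int(row[1])
        # Assumption: CPU time is formatted as H:MM:SS.
        hours, minutes, seconds = (int(part) for part in row[7].split(":"))
        times[pid] = hours * 3600 + minutes * 60 + seconds
    return times


def stalled_pids(before, after):
    """PIDs present in both samples whose CPU time did not advance."""
    return sorted(
        pid for pid in before if pid in after and after[pid] == before[pid]
    )


def sample():
    """Run tasklist once and return the parsed CPU times."""
    result = subprocess.run(
        ["tasklist", "/V", "/FO", "CSV", "/NH",
         "/FI", f"IMAGENAME eq {IMAGE_NAME}"],
        capture_output=True, text=True, check=True,
    )
    return parse_cpu_times(result.stdout)


if __name__ == "__main__":
    first = sample()
    time.sleep(60)  # arbitrary sampling interval
    print("No CPU progress:", stalled_pids(first, sample()))
```

A PID that shows up in both samples with an unchanged CPU time is alive but idle, which is the situation described here: the process exists, yet it is not getting any CPU.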
But, the good news is that it's working now. Why? I have no idea. After
having these problems, I switched a couple of the machines to the UWCS
default settings for starting, suspending, preempting jobs, etc. I screwed
up one of them pretty badly, which made startd crash constantly, so that
that node disappeared from the pool. After that, running the BLAST worked
fine. When I fixed the config script and the node came back, it still
worked fine. I had not modified the config on that node at all when it
wasn't working; it was the same as the rest. But after breaking it and
fixing it, it worked.
Go figure.
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-
> bounces@xxxxxxxxxxx] On Behalf Of Jaime Frey
> Sent: Monday, March 07, 2005 1:56 PM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] BLAST jobs go to 0% CPU; condor thinks
> they're running
> On Tue, 1 Mar 2005, Michael Rusch wrote:
> > I have a four machine condor pool (three are dual-processor, so there
> > are 7 virtual machines), with all machines running Windows XP.
> >
> > I have tried several times to submit a job cluster that has sixteen
> > individual jobs/processes. They're all BLAST searches, for those who are
> > familiar with BLAST. Each job uses two input files and a batch script
> > issues the two commands necessary (formatdb and blastall). There are a
> > total of four input files and the submit script queues one process for
> > every ordered pair of input files (for 4x4 = 16 jobs).
> >
> > Every time I've submitted the cluster it completes the first four jobs
> > (searching a single input file against each of the other ones), and it
> > runs the others for about a minute, after which the execute machine beeps
> > (it's the "Asterisk" sound), and then processes drop down to 0% CPU. They
> > do not drop down at the same time, but close to one another. Condor_q
> > reports that they are still running, but they are not. In one case, they
> > resumed for a brief period of time after several hours of not doing
> > anything. Nothing in the condor logs.
> >
> > If you run the jobs without condor, it works fine (though it takes
> > forever). Also, I noticed that for some reason the jobs when run through
> > condor use significantly more CPU than when you just run individually on
> > the local machine.
> Are the jobs still alive when the CPU drops to 0%? You can check by
> logging into the machines, running ps and looking for processes named
> condor_exec.exe. If you are programming savvy and know something about
> the BLAST code, you can attach to them with a debugger to see why they're
> stuck.
> +----------------------------------+---------------------------------+
> | Jaime Frey | Public Split on Whether |
> | jfrey@xxxxxxxxxxx | Bush Is a Divider |
> | | -- CNN Scrolling Banner |
> +----------------------------------+---------------------------------+