Re: [Condor-users] Condor 6.9.2 hung schedd



On Wed, Jun 13, 2007 at 11:36:44AM -0500, Dan Bradley wrote:
> 
> 
> Steffen Grunewald wrote:
> 
> >Question to Condor developers: where's the status of submitted jobs kept
> >over a restart of condor_schedd? It might be easier to make changes there...
> >  
> >
> $(SPOOL)/job_queue.log
> 
> It is fairly easy to understand the format and to make manual changes, 
> but be careful!

Hmmm, when would be the best time to make changes? Mine is about 9 MB
in size, and I'm worried that I'd miss some bits.

Certainly, it would be nicer if condor_schedd handled this situation more
gracefully. I'm thinking of a timeout: if condor_schedd doesn't get the lock
on the user log within a configurable time (one minute?), it would simply
write a notice to its own log and skip the user log output. Would this be
feasible?
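
To illustrate the idea, here is a minimal sketch in Python. It is not Condor
code: the function names, the logger name, the polling interval and the use
of flock() are all my own assumptions, and the real schedd may lock the user
log quite differently.

```python
import fcntl
import logging
import os
import time

# Hypothetical stand-in for the daemon's own log (not a Condor facility).
log = logging.getLogger("schedd_sketch")


def lock_with_timeout(fd, timeout_s, poll_s=0.1):
    """Poll for an exclusive flock() on fd.

    Returns True once the lock is held, or False if timeout_s
    seconds pass without getting it.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # LOCK_NB makes flock() fail immediately instead of blocking.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return True
        except BlockingIOError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_s)


def write_user_log_event(path, text, timeout_s=60.0):
    """Append text to the user log at path.

    If the lock cannot be taken within timeout_s, write a notice to
    the daemon log, skip the event, and return False.
    """
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        if not lock_with_timeout(fd, timeout_s):
            log.warning("could not lock %s within %.1f s; skipping event",
                        path, timeout_s)
            return False
        try:
            os.write(fd, text.encode())
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)
```

The point is only that the caller never blocks for longer than timeout_s; a
lost user log event is traded for a schedd that stays responsive.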

> >And why doesn't 'condor_restart -sub schedd' work in this case?
> >  
> Hmm.  It worked for me when I tried it, but I'm running a pre-release of 
> 6.9.3.  The usual problem people have is that their security 
> configuration doesn't allow condor_restart to operate from the machine 
> where they are running it, but the command-line tool does not know 
> whether the operation was rejected or not, so there is no visible 
> complaint to the user.  If you look in the schedd log, you will see a 
> message indicating that it rejected the command.

I didn't see any message in the log because the schedd was completely (!)
unresponsive. Yet the restart command returned without complaints.

Steffen