Hi, for the n-th time I found one of the schedds in my Condor pool dead. (The machine is the "pool master", but otherwise all submit machines, i.e. the cluster head nodes, are configured identically.)
[snip]
A look at the process table shows that the corresponding condor_schedd process is not owned by condor (as on all other submit machines) but by the user who submitted a job cluster before the problem showed up.
Practically the only time the schedd switches its effective UID to that of the submitting user is when it is writing to the user log.
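To make the mechanism concrete, here is a minimal sketch (not Condor's actual code, just an illustration of the euid dance a daemon does around a user-log write; the function name and arguments are made up). The switch only happens when the daemon runs as root:

```python
import os

def append_to_user_log(path, line, user_uid):
    """Append one event line to the user's log, temporarily taking on
    the submitting user's UID. Only root may change its euid, so as an
    unprivileged process we just write directly."""
    switched = False
    if os.geteuid() == 0:
        os.seteuid(user_uid)   # become the submitting user for the write
        switched = True
    try:
        with open(path, "a") as f:
            f.write(line + "\n")
    finally:
        if switched:
            os.seteuid(0)      # restore daemon privileges
```

If the write (or the file lock guarding it) hangs, the process is stuck while still wearing the user's UID, which matches what you saw in the process table.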
My guess: did the user in question submit job(s) with log = /some/path in the submit file, where "/some/path" sits on an NFS server? If so, welcome to the pathetic world of NFS file locking (especially on Linux). A quick workaround would be to have the user place the log files on a local disk, or not specify a log file at all.
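The failure mode is that a POSIX advisory lock on an NFS path goes through the lock daemon and can block forever. A hedged sketch of the pattern (again, illustrative names, not the schedd's real code): taking the lock non-blocking with a bounded retry would at least fail fast instead of wedging the daemon:

```python
import errno
import fcntl
import time

def locked_append(path, line, retries=5, delay=0.2):
    """Append under an exclusive POSIX advisory lock (fcntl).
    Non-blocking with retries, so a dead/slow NFS lockd produces an
    error after ~retries*delay seconds rather than an indefinite hang."""
    with open(path, "a") as f:
        for _ in range(retries):
            try:
                fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except OSError as e:
                if e.errno not in (errno.EACCES, errno.EAGAIN):
                    raise          # a real error, not lock contention
                time.sleep(delay)
        else:
            raise TimeoutError("could not lock " + path)
        try:
            f.write(line + "\n")
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)
```

A blocking fcntl.lockf() call on an NFS mount, by contrast, can sit in the kernel indefinitely, which is exactly the stuck-schedd symptom.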
We need to improve this at some point. Since most users only need locking across the processes on one machine (i.e. the schedd, shadow, gridmanager, and dagman all run on the same box), we could perhaps replace the file lock with a kernel-level mutex. What do folks think?
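One way to picture that proposal (a sketch of the idea, under the assumption that all writers on a host share an ancestor process, which holds for the schedd and the shadows it forks): a single machine-local mutex serializes the user-log writes with no file lock, so NFS never enters the picture:

```python
import multiprocessing

# One machine-local mutex shared by the daemon and everything it forks
# (schedd, shadows, ...). Replaces the per-write NFS file lock; it
# provides no cross-host exclusion, which is the stated trade-off.
user_log_mutex = multiprocessing.Lock()

def write_event(path, line):
    """Append an event line, serialized only against writers on this host."""
    with user_log_mutex:
        with open(path, "a") as f:
            f.write(line + "\n")
```

The obvious caveat is users who really do read or write the same log from several submit machines; they would still need file locking (or no shared log at all).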
Hope this helps,
Todd