[Condor-users] schedd changes owner to a regular user and results in queue crash
- Date: Mon, 11 Dec 2006 11:46:29 -0500
- From: Junjun Mao <jmao@xxxxxxxxxxxxxxxxx>
- Subject: [Condor-users] schedd changes owner to a regular user and results in queue crash
Hi all,
A serious problem just hit my cluster and shut Condor down entirely:
the schedd process is now owned by a regular user!
How could this happen?
[root@master1 y-61.1]# ps -ef | grep condor
pwang 26763 1 0 Nov18 ? 00:00:00 condor_shadow -f 886.0 <10.10.20.1:34661> -
pwang 26766 1 0 Nov18 ? 00:00:00 condor_shadow -f 886.2 <10.10.20.1:34661> -
pwang 26772 1 0 Nov18 ? 00:00:00 condor_shadow -f 886.1 <10.10.20.1:34661> -
pwang 29394 1 0 Nov18 ? 00:00:00 condor_shadow -f 886.4 <10.10.20.1:34661> -
condor 19319 1 0 Nov21 ? 00:34:54 /home2/condor/sbin/condor_master
condor 19320 19319 0 Nov21 ? 01:43:02 condor_collector -f
pwang 19393 19319 0 Dec09 ? 00:00:06 condor_schedd -f
condor 19401 19319 0 Dec09 ? 00:02:31 condor_negotiator -f
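A check like this could be scripted so that a wrongly owned daemon is
noticed early. The following is only a minimal sketch: the function name
find_misowned, the expected owner "condor", and the list of daemons to
check are assumptions for illustration, not anything provided by Condor.
condor_shadow is left out of the list because the shadows in the listing
above are owned by the user pwang.

```python
# Hypothetical helper: flag Condor daemons in `ps -ef` output that are
# not owned by the expected account. Names below are assumptions.
EXPECTED_OWNER = "condor"
DAEMONS = ("condor_master", "condor_collector",
           "condor_schedd", "condor_negotiator")


def find_misowned(ps_lines, expected_owner=EXPECTED_OWNER, daemons=DAEMONS):
    """Return (owner, pid, daemon) for each listed daemon with another owner."""
    misowned = []
    for line in ps_lines:
        fields = line.split()
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD [args...]
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # skip headers, prompts, and truncated lines
        owner, pid = fields[0], int(fields[1])
        # Strip any directory part, e.g. /home2/condor/sbin/condor_master
        command = fields[7].rsplit("/", 1)[-1]
        if command in daemons and owner != expected_owner:
            misowned.append((owner, pid, command))
    return misowned


# Sample input copied from the listing in this message.
SAMPLE = """\
pwang 26763 1 0 Nov18 ? 00:00:00 condor_shadow -f 886.0 <10.10.20.1:34661> -
condor 19319 1 0 Nov21 ? 00:34:54 /home2/condor/sbin/condor_master
condor 19320 19319 0 Nov21 ? 01:43:02 condor_collector -f
pwang 19393 19319 0 Dec09 ? 00:00:06 condor_schedd -f
condor 19401 19319 0 Dec09 ? 00:02:31 condor_negotiator -f
"""

if __name__ == "__main__":
    for owner, pid, daemon in find_misowned(SAMPLE.splitlines()):
        print(f"{daemon} (pid {pid}) is owned by {owner}")
```

On the sample lines it reports only the condor_schedd process with pid
19393. On a live machine the input lines would come from the output of
"ps -ef" instead of the SAMPLE string.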
Restarting the Condor daemons still leaves the schedd with the wrong
owner. However, Condor started correctly after I deleted job 886 from
the transaction log file job_queue.log.
Job 886 was terminated by "condor_rm" in job_queue.log.
The latest log starts at 2am on Dec 9. The SchedLog file reports shadow
exceptions:
ERROR: Shadow exited with job exception code!
As far as I remember, job 886 was in state X last week. It looks to me
as though "condor_rm" can affect the schedd. Is that true?
Junjun