Re: [Condor-users] Procd behaving badly in a multi-startd setup

Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

Date: Thu, 08 Sep 2011 10:09:44 -0500

Subject: Re: [Condor-users] Procd behaving badly in a multi-startd setup

On 9/7/11 4:41 PM, Ian Chesal wrote:

On Wednesday, 7 September, 2011 at 5:15 PM, Ian Chesal wrote:

I may have spoken too quickly on the multi-startd setup working. I thought my troubles were due to collisions on the starter log files, but after implementing the fix Todd suggested I'm still seeing some bad behaviour (but the fix for the log files worked brilliantly).

It appears that I can only start jobs under one startd or the other. Not both. The first startd to run jobs after a Condor restart is the *only* startd that will run jobs until Condor is restarted again.

For example: I submitted two clusters of jobs. One targeted the slots on the first startd; the other targeted the slots on the second startd. If I let the first cluster start on the S1 startd, then the second cluster would attempt to run on the S2 startd and fail, and vice versa.

The log output on failure is always the same:

09/07/11 17:07:43 slot1: Got activate_claim request from shadow (<192.168.1.85:3382>)

09/07/11 17:07:43 slot1: Remote job ID is 9.0

09/07/11 17:07:43 Result of "register_subfamily" operation from ProcD: ERROR: The given PID is not part of the family tree

09/07/11 17:07:43 Create_Process: error registering family for pid 1256

09/07/11 17:07:43 ERROR "error registering process family with procd" at line 7917 in file c:\condor\execute\dir_4228\userdir\src\condor_daemon_core.v6\daemon_core.cpp

09/07/11 17:07:43 slot1: Changing state and activity: Claimed/Idle -> Preempting/Killing

09/07/11 17:07:43 slot1: State change: No preempting claim, returning to owner

09/07/11 17:07:43 slot1: Changing state and activity: Preempting/Killing -> Owner/Idle

09/07/11 17:07:43 slot1: State change: IS_OWNER is false

09/07/11 17:07:43 slot1: Changing state: Owner -> Unclaimed

It looks like the procd doesn't like the idea of two startds on the machine. It apparently can't tell them apart, and it doesn't like the fact that the jobs being started on the second startd in this case don't have a PPID equal to the PID of the first startd.

I'm either missing something that's procd-specific in my startd config, or the procd isn't going to work here. I'll try disabling the procd, but having it there has helped with the scalability issues I'm trying to overcome, so if I can make this work with the procd in place I'd be a whole lot happier.

Going with:

USE_PROCD = False

gets both startds working, but sets me back: scalability seems to be limited to ~10-12 slots per startd without the procd on a Win2k8 box.

Hi Ian,

The problem is likely that both startds are creating their own procd, but the two procds are using the same named pipe for communication, so wires are getting crossed. You could configure PROCD_PIPE differently for the two startds, or you could configure the startds to share a single procd. One way to achieve the latter is this:

MASTER.USE_PROCD = TRUE

That causes the master to create a procd, which is then shared by all of its children. Depending on the answer to your puzzling performance problems, having a single procd may be better than two. Then again, it could be worse. It would be interesting to find out!
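For the other option mentioned above (a separate pipe per startd), a minimal sketch might look like the following. It rests on several assumptions: the knob is written here as PROCD_ADDRESS, which is the name the Condor manual documents for the procd's named pipe (the message calls it PROCD_PIPE); the pipe names are made up for illustration; and each fragment is assumed to live in configuration that only one of the two startds reads, since how the startds are separated is not shown in this thread.

```
# Hypothetical fragment read only by the first startd.
# The pipe name is illustrative, not taken from this thread.
USE_PROCD = True
PROCD_ADDRESS = \\.\pipe\procd_pipe_s1

# Hypothetical fragment read only by the second startd.
# A distinct pipe name keeps the two procds from answering each other's requests.
USE_PROCD = True
PROCD_ADDRESS = \\.\pipe\procd_pipe_s2
```

If both startds read the same configuration file, these settings would need to be scoped per daemon instead; the exact scoping syntax depends on how the second startd was defined, so it is left out here.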

--Dan
