
[Condor-users] procd_pipe.STARTD.watchdog: No such file or directory



Hi!

 

What does the following error message mean?

11/01 20:16:17 error opening watchdog pipe /var/run/condor/procd_pipe.STARTD.watchdog: No such file or directory (2)

StartLog on the execute node:

11/01 20:16:16 slot1: Request accepted.
11/01 20:16:16 slot1: Remote owner is o2f_sonlil@xxxxxxxxxxxxxxxxxxxx
11/01 20:16:16 slot1: State change: claiming protocol successful
11/01 20:16:16 slot1: Changing state: Matched -> Claimed
11/01 20:16:16 slot1: Started ClaimLease timer (17) w/ 1800 second lease duration
11/01 20:16:17 slot1: Got activate_claim request from shadow (<10.110.44.78:55118>)
11/01 20:16:17 slot1: Read request ad and starter from shadow.
11/01 20:16:17 Swap space: 917496
11/01 20:16:17 13367628 kbytes available for "/var/lib/condor/execute"
11/01 20:16:17 slot1: Total execute space: 13362508
11/01 20:16:17 13367628 kbytes available for "/var/lib/condor/execute"
11/01 20:16:17 slot2: Total execute space: 13362508
11/01 20:16:17 slot1: Remote job ID is 116.0
11/01 20:16:17 slot1: Remote global job ID is o2f-sth-lap-016.un.dr.dgcsystems.net#116.0#1288638373
11/01 20:16:17 slot1: JobLeaseDuration defined in job ClassAd: 1200
11/01 20:16:17 slot1: Resetting ClaimLease timer (17) with new duration
11/01 20:16:17 slot1: Sending Machine Ad to Starter
11/01 20:16:17 slot1: About to Create_Process "condor_starter -f -a slot1 o2f-sth-lap-014.un.dr.dgcsystems.net"
11/01 20:16:17 Create_Process: using fast clone() to create child process.
11/01 20:16:17 error opening watchdog pipe /var/run/condor/procd_pipe.STARTD.watchdog: No such file or directory (2)
11/01 20:16:17 ProcFamilyClient: error initializing LocalClient
11/01 20:16:17 ProcFamilyProxy: error initializing ProcFamilyClient
11/01 20:16:17 ERROR "ProcD has failed" at line 599 in file proc_family_proxy.cpp
11/01 20:16:17 CronMgr: 0 jobs alive
11/01 20:16:17 slot1: Canceled ClaimLease timer (17)
11/01 20:16:17 slot1: Changing state and activity: Claimed/Idle -> Preempting/Killing
11/01 20:16:17 Entered vacate_client <10.110.44.79:53584> o2f-sth-lap-014.un.dr.dgcsystems.net...
11/01 20:16:17 slot1: State change: No preempting claim, returning to owner
11/01 20:16:17 slot1: Changing state and activity: Preempting/Killing -> Owner/Idle
11/01 20:16:17 slot1: State change: IS_OWNER is false
11/01 20:16:17 slot1: Changing state: Owner -> Unclaimed
11/01 20:16:17 startd exiting because of fatal exception.
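To check the failure on the node itself, a small Python helper can pull the pipe path out of the StartLog and test whether the pipe and its directory exist. This is a sketch of my own, not a Condor tool: the function names are hypothetical, the regex assumes the exact message format in the log above, and the default log location /var/log/condor/StartLog is only a guess (pass the real path as the first argument).

```python
import os
import re
import sys
from typing import Iterable, Optional, Tuple

# Assumed message format, copied from the StartLog above:
# "<date> <time> error opening watchdog pipe <path>: <reason> (<errno>)"
WATCHDOG_RE = re.compile(r"error opening watchdog pipe (\S+): (.+) \((\d+)\)")


def find_watchdog_error(lines: Iterable[str]) -> Optional[Tuple[str, str, int]]:
    """Return (pipe_path, reason, errno) for the first watchdog-pipe error, or None."""
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            return match.group(1), match.group(2), int(match.group(3))
    return None


def describe_pipe(path: str) -> str:
    """Report whether the pipe's parent directory and the pipe itself exist now."""
    parent = os.path.dirname(path)
    if not os.path.isdir(parent):
        return f"directory {parent} is missing"
    if not os.path.exists(path):
        return f"{parent} exists but {os.path.basename(path)} does not"
    return f"{path} exists"


if __name__ == "__main__":
    # Hypothetical default location; override with argv[1].
    log_path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/condor/StartLog"
    try:
        with open(log_path, errors="replace") as log:
            hit = find_watchdog_error(log)
    except OSError as exc:
        print(f"cannot read {log_path}: {exc}")
    else:
        if hit is None:
            print("no watchdog pipe error found")
        else:
            pipe, reason, errno = hit
            print(f"{pipe}: {reason} (errno {errno}); {describe_pipe(pipe)}")
```

Run it on the execute node right after the startd exits, since the pipe may be created and removed at runtime.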

 

StarterLog on the execute node:

11/01 20:20:30 Reading from /proc/cpuinfo
11/01 20:20:30 Found: Physical-IDs:False; Core-IDs:False
11/01 20:20:30 Using processor count: 2 processors, 2 CPUs, 0 HTs
11/01 20:20:30 Reading condor configuration from '/etc/condor/condor_config'
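To see what the "Found:" line might be reacting to, the sketch below looks for `physical id` and `core id` entries in /proc/cpuinfo. It rests on an assumption I have not confirmed in the Condor source: that Physical-IDs and Core-IDs in the StarterLog simply mean "this field appears in /proc/cpuinfo". The function name is hypothetical, and the dictionary keys only mirror the log wording for easy comparison.

```python
from typing import Dict


def cpuinfo_id_fields(text: str) -> Dict[str, bool]:
    """Report whether /proc/cpuinfo-style text contains 'physical id' / 'core id' keys.

    Assumption: this is what the StarterLog's Physical-IDs/Core-IDs flags reflect.
    """
    keys = set()
    for line in text.splitlines():
        if ":" in line:
            # Keys are padded with tabs, e.g. "core id\t\t: 1".
            keys.add(line.split(":", 1)[0].strip())
    return {"Physical-IDs": "physical id" in keys, "Core-IDs": "core id" in keys}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(cpuinfo_id_fields(f.read()))
    except OSError as exc:
        print(f"cannot read /proc/cpuinfo: {exc}")
```

If both come back False here as well, the log line probably just describes the CPU layout the kernel exposes rather than an error.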

 

If I run condor_restart -startd on this execute node, I get:

Can't connect to local startd

Is the “Found: Physical-IDs:False; Core-IDs:False” line in the StarterLog a problem?

Regards,

Sónia

 

Sónia Liléo
O2 Strandvägen 5B 114 51 Stockholm
Tel: +46 8 559 310 37 Mobile: +46 73 752 95 74

www.o2.se