Hi,

I have a Condor pool with a RHEL4.0_x86_64 central manager and IBM, NT, RHEL5, Windows Vista, and HP machines as clients. Whenever I run a job on the IBM machine or on another machine, it is put into the idle state, and I get the following errors in my logs. Whenever my executable gives an error, the error is transferred back as output, but when I submit a job it stays in the idle state and never completes. I am new to Condor; can anyone please help? Here are the logs that were written.

condor_master's logs

SCHEDD.log:
4/18 19:38:52 (pid:5847) Shadow pid 11902 for job 93.0 exited with status 4
4/18 19:38:52 (pid:5847) ERROR: Shadow exited with job exception code!

Shadow.log:
4/16 21:10:54 DaemonCore: Command Socket at <10.20.3.180:37120>
4/16 21:10:54 Initializing a VANILLA shadow for job 43.0
4/16 21:10:54 (43.0) (3785): Request to run on <10.20.4.30:34497> was ACCEPTED
4/16 21:10:56 (43.0) (3785): ERROR "Error from starter on slot2@xxxxxxxxxxxxxxxxxxxxxx: Create_Process failed to register the job with the ProcD" at line 649 in file pseudo_ops.C
4/18 19:43:52 DaemonCore: Command Socket at <10.20.3.180:36642>
4/18 19:43:52 Initializing a VANILLA shadow for job 92.0
4/18 19:43:52 (92.0) (12020): Request to run on <10.20.4.30:34497> was ACCEPTED
4/18 19:43:53 (92.0) (12020): ERROR "Can no longer talk to condor_starter <10.20.4.30:34497>" at line 121 in file NTreceivers.C

Master.log:
4/18 09:53:06 DaemonCore: Command Socket at <10.20.3.180:32825>
4/18 09:53:06 passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
4/18 09:53:06 passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
4/18 09:53:06 Started DaemonCore process "/opt/condor-7.0.1/sbin/condor_collector", pid and pgroup = 5841
4/18 09:53:09 Started DaemonCore process "/opt/condor-7.0.1/sbin/condor_negotiator", pid and pgroup = 5846
4/18 09:53:09 Started DaemonCore process "/opt/condor-7.0.1/sbin/condor_schedd", pid and pgroup = 5847
4/18 09:53:09 Started DaemonCore process "/opt/condor-7.0.1/sbin/condor_startd", pid and pgroup = 5851
4/18 10:53:09 Preen pid is 7611
4/18 10:53:09 Child 7611 died, but not a daemon -- Ignored

Job log file:
007 (092.000.000) 04/18 19:53:52 Shadow exception!
Can no longer talk to condor_starter <10.20.4.30:34497>
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job

condor_client's logs

StartLog:
4/18 19:59:35 Create_Process: child failed becuase it failed to register itself with the ProcD
4/18 19:59:35 slot2: ERROR: exec_starter failed!
4/18 19:59:35 slot2: ERROR: exec_starter returned 0
4/18 19:59:35 slot2: Got activate_claim request from shadow (<10.20.3.180:36746>)
4/18 19:59:35 slot2: Remote job ID is 94.0
4/18 19:59:35 mkfifo of /tmp/condor-lock.pv300.928575252954946/procd_pipe.STARTD.442402.0 error: No such file or directory (2)
4/18 19:59:35 failed to initialize named pipe at /tmp/condor-lock.pv300.928575252954946/procd_pipe.STARTD.442402.0
4/18 19:59:35 LocalClient: error initializing NamedPipeReader
4/18 19:59:35 ProcFamilyClient: failed to start connection with ProcD
4/18 19:59:35 register_subfamily: ProcD communication error
4/18 19:59:35 Create_Process: error registering family for pid 372916
4/18 19:59:35 Create_Process: child failed becuase it failed to register itself with the ProcD
4/18 19:59:35 slot2: ERROR: exec_starter failed!
4/18 19:59:35 slot2: ERROR: exec_starter returned 0
4/18 19:59:35 slot2: Got activate_claim request from shadow (<10.20.3.180:36748>)
4/18 19:59:35 slot2: Remote job ID is 94.0
4/18 19:59:35 mkfifo of /tmp/condor-lock.pv300.928575252954946/procd_pipe.STARTD.442402.0 error: No such file or directory (2)
4/18 19:59:35 failed to initialize named pipe at /tmp/condor-lock.pv300.928575252954946/procd_pipe.STARTD.442402.0
4/18 19:59:35 LocalClient: error initializing NamedPipeReader
4/18 19:59:35 ProcFamilyClient: failed to start connection with ProcD
4/18 19:59:35 register_subfamily: ProcD communication error
4/18 19:59:35 Create_Process: error registering family for pid 372920
4/18 19:59:35 Create_Process: child failed becuase it failed to register itself with the ProcD
4/18 19:59:35 slot2: ERROR: exec_starter failed!
4/18 19:59:35 slot2: ERROR: exec_starter returned 0
4/18 19:59:35 slot2: State change: received RELEASE_CLAIM command
4/18 19:59:35 slot2: Changing state and activity: Claimed/Idle -> Preempting/Vacating
4/18 19:59:35 slot2: State change: No preempting claim, returning to owner
4/18 19:59:35 slot2: Changing state and activity: Preempting/Vacating -> Owner/Idle
4/18 19:59:35 slot2: State change: IS_OWNER is false
4/18 19:59:35 slot2: Changing state: Owner -> Unclaimed

Sched.log:
4/16 09:44:55 (pid:401526) DaemonCore: Command Socket at <10.20.4.30:34417>
4/16 09:44:55 (pid:401526) passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
4/16 09:44:55 (pid:401526) passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
4/16 09:44:55 (pid:401526) History file rotation is enabled.
4/16 09:44:55 (pid:401526) Maximum history file size is: 20971520 bytes
4/16 09:44:55 (pid:401526) Number of rotated history files is: 2

MasterLog:
4/16 09:44:55 DaemonCore: Command Socket at <10.20.4.30:34416>
4/16 09:44:55 passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
4/16 09:44:55 passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
4/16 09:44:55 Started DaemonCore process "/u1/pv/.condor-7.0.1/sbin/condor_schedd", pid and pgroup = 401526
4/16 09:44:55 Started DaemonCore process "/u1/pv/.condor-7.0.1/sbin/condor_startd", pid and pgroup = 356550
4/16 10:44:55 Preen pid is 430084

Thanks in advance,
Javed