Subject: [Condor-users] Problem with just one machine in a Windows pool
We have a pool of about 20 Windows machines, XP or Win7x64, Condor 7.6.1. One machine acts badly and I can't figure out why.
The offending machine is a Win7x64. The problem is, it accepts jobs, but immediately kills them. Also, it can't get permissions to delete the now-unneeded execute\dir_**** directories, so those accumulate. Eventually the c: drive free space decreases to below the submitted jobs' limits, and the machine quits accepting new jobs.
The model's error and output log files on the problem machine are empty... I don't think the model even started. If the problem were only directory permissions, how could the input files be written to the execute directory, and the log files created, though empty? Instead it seems like an execute permission problem.
Several other machines are set up "identically", that is, as much as we know they're identical, though clearly something is wrong on that one machine. I've reproduced some log files below...but I'm hoping someone had a similar problem and can point us to the fix we need to make.
StarterLog.slot1 (notice the immediate job end and error deleting the old execute dir):
08/20/11 16:05:29 Setting maximum accepts per cycle 4.
08/20/11 16:05:29 ******************************************************
08/20/11 16:05:29 ** condor_starter (CONDOR_STARTER) STARTING UP
08/20/11 16:05:29 ** C:\Condor\bin\condor_starter.exe
08/20/11 16:05:29 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
08/20/11 16:05:29 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
08/20/11 16:05:29 ** $CondorVersion: 7.6.1 May 31 2011 BuildID: 339001 $
08/20/11 16:05:29 ** $CondorPlatform: x86_winnt_5.1 $
08/20/11 16:05:29 ** PID = 3180
08/20/11 16:05:29 ** Log last touched 8/20 15:05:28
08/20/11 16:05:29 ******************************************************
08/20/11 16:05:29 Using config source: C:\condor\condor_config
08/20/11 16:05:29 Using local config sources:
08/20/11 16:05:29 C:/Condor/condor_config.local
08/20/11 16:05:29 DaemonCore: command socket at <136.200.32.170:63768>
08/20/11 16:05:29 DaemonCore: private command socket at <136.200.32.170:63768>
08/20/11 16:05:29 Setting maximum accepts per cycle 4.
08/20/11 16:05:29 GLEXEC_JOB not supported on this platform; ignoring
08/20/11 16:05:29 Setting resource limits not implemented!
08/20/11 16:05:29 Communicating with shadow <136.200.32.119:62991>
08/20/11 16:05:29 Submitting machine is "bdomo-002.ad.water.ca.gov"
08/20/11 16:05:29 setting the orig job name in starter
08/20/11 16:05:29 setting the orig job iwd in starter
08/20/11 16:05:46 File transfer completed successfully.
08/20/11 16:05:47 Job 1933.0 set to execute immediately
08/20/11 16:05:47 Starting a VANILLA universe job with ID: 1933.0
08/20/11 16:05:47 Tracking process family by login "condor-reuse-slot1"
08/20/11 16:05:47 IWD: C:\Condor\execute\dir_3180
08/20/11 16:05:47 Output file: C:\Condor\execute\dir_3180\dsm2-085.out
08/20/11 16:05:47 Error file: C:\Condor\execute\dir_3180\dsm2-085.err
08/20/11 16:05:47 Renice expr "10" evaluated to 10
08/20/11 16:05:47 About to exec C:\Condor\execute\dir_3180\condor_exec.bat hydro085.inp, qual_ec085.inp
08/20/11 16:05:47 Executable is a batch file, running: "C:\Windows\system32\cmd.exe" /Q /C "C:\Condor\execute\dir_3180\condor_exec.bat" hydro085.inp, qual_ec085.inp
08/20/11 16:05:47 Create_Process succeeded, pid=6184
08/20/11 16:05:47 Process exited, pid=6184, status=-1073741701
08/20/11 16:05:47 Got SIGQUIT. Performing fast shutdown.
08/20/11 16:05:47 ShutdownFast all jobs.
08/20/11 16:05:48 ERROR: C:\Condor\execute\dir_3180 still exists after trying to add Full control to ACLs for PRIV_ROOT
08/20/11 16:05:48 **** condor_starter (condor_STARTER) pid 3180 EXITING WITH STATUS 0
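
One thing that may help in reading the log: the "status=-1073741701" on the "Process exited" line looks like a Windows NTSTATUS code printed as a signed 32-bit integer. Here is a minimal Python sketch that converts it to the usual unsigned hex form. The function name and the one-entry lookup table are only illustrative (they are not part of Condor), and whether this status is the actual root cause on this machine is an assumption that would need checking.

```python
# Illustrative helper, not part of Condor: reinterpret a signed 32-bit
# process exit status as an unsigned NTSTATUS-style hex code.

# Small lookup of known codes; only the one relevant here is listed.
KNOWN_NTSTATUS = {
    0xC000007B: "STATUS_INVALID_IMAGE_FORMAT",
}


def to_ntstatus_hex(exit_status: int) -> str:
    """Return the exit status as an 8-digit unsigned hex string."""
    return f"0x{exit_status & 0xFFFFFFFF:08X}"


def describe_status(exit_status: int) -> str:
    """Return the hex code plus a symbolic name if it is in the lookup."""
    code = exit_status & 0xFFFFFFFF
    name = KNOWN_NTSTATUS.get(code, "unknown")
    return f"{to_ntstatus_hex(exit_status)} ({name})"


if __name__ == "__main__":
    # Value taken from the "Process exited" line in StarterLog.slot1.
    print(describe_status(-1073741701))
```

If the job really is exiting with STATUS_INVALID_IMAGE_FORMAT, that would point at a problem loading an executable or DLL (for example a 32-bit/64-bit mismatch) rather than at file permissions, but that is a guess from the status code alone.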