I am running HTCondor on a Debian machine as a compute node. The
Condor host (which is also a compute node) is Red Hat 7, running 8.6.4.
The submit node is a 64-bit Windows 7 machine, also running 8.6.4.
The job is a vanilla-universe job that transfers some input files and
returns some output files from a computation.
$CondorVersion: 8.6.4 Jun 21 2017 BuildID: 408625 $
$CondorPlatform: x86_64_Debian8 $
I have divided the Debian machine into 8 equal slots. The only
modifications made in condor_config.local are:

CONDOR_HOST = condor
ALLOW_WRITE = $(FULL_HOSTNAME), *.mydomainname.com
NUM_SLOTS = 8
DAEMON_LIST = MASTER, STARTD
When I submit a job to our "cluster" of 2 machines, the Debian
machine completes the jobs but the condor_starter processes
get stuck at 100% CPU seemingly spinning their wheels trying
to transfer the files back to the submit host.
The other Linux node completes its jobs and sends back the
computed data with no issues.
Examining the /var/lib/condor/execute directory shows the
execute directories, with the output all normal and complete.
Nothing in the Starter logs on the Debian node indicates
any issues.
07/13/17 15:25:33 (pid:34323) Communicating with shadow <192.168.0.211:9618?addrs=192.168.0.211-9618&noUDP&sock=9892_fcaa_3>
07/13/17 15:25:33 (pid:34323) Submitting machine is "bose"
07/13/17 15:25:33 (pid:34323) setting the orig job name in starter
07/13/17 15:25:33 (pid:34323) setting the orig job iwd in starter
07/13/17 15:25:33 (pid:34323) SLOT2_USER set, so running job as acu
07/13/17 15:25:33 (pid:34323) Chirp config summary: IO false, Updates false, Delayed updates true.
07/13/17 15:25:33 (pid:34323) Initialized IO Proxy.
07/13/17 15:25:33 (pid:34323) Done setting resource limits
07/13/17 15:25:33 (pid:34323) File transfer completed successfully.
07/13/17 15:25:33 (pid:34323) Job 84.1 set to execute immediately
07/13/17 15:25:33 (pid:34323) Starting a VANILLA universe job with ID: 84.1
07/13/17 15:25:33 (pid:34323) IWD: /var/lib/condor/execute/dir_34323
07/13/17 15:25:33 (pid:34323) Output file: /var/lib/condor/execute/dir_34323/_condor_stdout
07/13/17 15:25:33 (pid:34323) Error file: /var/lib/condor/execute/dir_34323/_condor_stderr
07/13/17 15:25:33 (pid:34323) Renice expr "0" evaluated to 0
07/13/17 15:25:33 (pid:34323) About to exec /var/lib/condor/execute/dir_34323/condor_exec.exe VA22_2017 BEM_Fluid 1 27012@licenseserver
07/13/17 15:25:33 (pid:34323) Running job as user acu
07/13/17 15:25:33 (pid:34323) Create_Process succeeded, pid=34340
07/13/17 15:25:33 (pid:34323) Cgroup controller for memory accounting is not available.
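For completeness, this is the quick scan I ran over the StarterLog to
make sure I was not missing any transfer errors. The keyword list is my
own guess, not an official set, and the log path is an assumption for my
slot layout:

```python
import re

# Crude patterns for an 8.6-era StarterLog; the keyword list is my guess,
# not an official set, so it can miss things or flag benign lines.
TRANSFER = re.compile(r"(?i)file transfer")
SUSPECT = re.compile(r"(?i)failed|timed? ?out|retry|unable")

def scan(lines):
    """Return (transfer_lines, suspect_lines) from an iterable of log lines."""
    lines = [l.rstrip("\n") for l in lines]
    transfers = [l for l in lines if TRANSFER.search(l)]
    suspects = [l for l in lines if SUSPECT.search(l)]
    return transfers, suspects

# Usage on the execute node (log path is an assumption for my slot layout):
# with open("/var/log/condor/StarterLog.slot2") as fh:
#     transfers, suspects = scan(fh)
#     print(len(transfers), "transfer lines;", len(suspects), "suspect lines")
```

On this node it only ever finds the single "File transfer completed
successfully." line from the input transfer, and nothing suspect.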
I am forced to restart condor on this node to stop the
condor_starter processes.
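If it would help with debugging, I can raise the starter's log
verbosity on the Debian node with an addition to condor_config.local
along these lines (my reading of the 8.6 debug knobs, not yet tried):

# Capture more detail around the output-transfer phase
STARTER_DEBUG = D_FULLDEBUG
MAX_STARTER_LOG = 10000000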