Hi Andrew,
This is a known problem with Debian - I have faced this recently. Please check out this recent thread: https://www-auth.cs.wisc.edu/lists/htcondor-users/2017-April/msg00059.shtml
Thanks.
-Samik
On 14-Jul-17 4:08 AM, Andrew Cunningham wrote:
I am running Condor on Debian as a compute node. The Condor host (which is also a compute node) is Red Hat 7, running 8.6.4. When I submit a job to our "cluster" of 2 machines, the Debian machine completes the jobs, but the condor_starter processes get stuck at 100% CPU, seemingly spinning their wheels trying to transfer the files back to the submit host. The only changes to condor_config.local are shown below.
The submit node is a Windows 7 machine running 8.6.4 64-bit.
The job is a vanilla job that sends some files and returns some files from a computation.
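For context, the submit file is shaped roughly like this (the executable and input file names are placeholders; the arguments match what appears in the StarterLog below):

universe = vanilla
executable = solver.exe
arguments = VA22_2017 BEM_Fluid 1 27012@licenseserver
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = model.dat, params.cfg
queue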
I have divided the Debian machine into 8 equal slots.
$CondorVersion: 8.6.4 Jun 21 2017 BuildID: 408625 $
$CondorPlatform: x86_64_Debian8 $
CONDOR_HOST=condor
ALLOW_WRITE = $(FULL_HOSTNAME),*.mydomainname.com
NUM_SLOTS = 8
DAEMON_LIST=MASTER STARTD
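(For anyone reproducing this: the slot layout can be double-checked on the node with condor_config_val and condor_status; the slot/host name below is a placeholder.)

condor_config_val NUM_SLOTS             # should print 8
condor_status -long slot1@debian-node   # dump the slot ClassAd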
The other Linux node completes its jobs and sends back the computed data with no issues.
Examining the /var/lib/condor/execute directory shows the execute directories, with all output files present and complete.
Nothing in the Starter logs on the Debian node indicates any issues.
07/13/17 15:25:33 (pid:34323) Communicating with shadow <192.168.0.211:9618?addrs=192.168.0.211-9618&noUDP&sock=9892_fcaa_3>
07/13/17 15:25:33 (pid:34323) Submitting machine is "bose"
07/13/17 15:25:33 (pid:34323) setting the orig job name in starter
07/13/17 15:25:33 (pid:34323) setting the orig job iwd in starter
07/13/17 15:25:33 (pid:34323) SLOT2_USER set, so running job as acu
07/13/17 15:25:33 (pid:34323) Chirp config summary: IO false, Updates false, Delayed updates true.
07/13/17 15:25:33 (pid:34323) Initialized IO Proxy.
07/13/17 15:25:33 (pid:34323) Done setting resource limits
07/13/17 15:25:33 (pid:34323) File transfer completed successfully.
07/13/17 15:25:33 (pid:34323) Job 84.1 set to execute immediately
07/13/17 15:25:33 (pid:34323) Starting a VANILLA universe job with ID: 84.1
07/13/17 15:25:33 (pid:34323) IWD: /var/lib/condor/execute/dir_34323
07/13/17 15:25:33 (pid:34323) Output file: /var/lib/condor/execute/dir_34323/_condor_stdout
07/13/17 15:25:33 (pid:34323) Error file: /var/lib/condor/execute/dir_34323/_condor_stderr
07/13/17 15:25:33 (pid:34323) Renice expr "0" evaluated to 0
07/13/17 15:25:33 (pid:34323) About to exec /var/lib/condor/execute/dir_34323/condor_exec.exe VA22_2017 BEM_Fluid 1 27012@licenseserver
07/13/17 15:25:33 (pid:34323) Running job as user acu
07/13/17 15:25:33 (pid:34323) Create_Process succeeded, pid=34340
07/13/17 15:25:33 (pid:34323) Cgroup controller for memory accounting is not available.
I am forced to restart condor on this node to stop the condor_starter processes.
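(In case it helps anyone hitting the same thing: before restarting, attaching strace to one of the spinning starters may reveal what it is looping on, and restarting just the startd rather than all of HTCondor should be enough to clear the stuck processes. The PID below is a placeholder.)

strace -f -p 34323              # attach to a spinning condor_starter; look for a tight syscall loop
condor_restart -daemon startd   # restart only the startd on this node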
Thanks for any advice.
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@cs.wisc.edu with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/