Hello, I am trying to run a single-node v8.6.1 of HTCondor on a debian VM. The condor system seems to start up, and the condor_status does show the node as available. However, once I submit a job, condor_starter goes to 100% CPU and stays there till I manually shut down the condor system. Upon further investigations, I noticed that the job is actually getting done, and the output is written in the temporary folder (see logs below), but for some reason condor_starter is not getting notification of this, or is getting busy in something else. Here are some relevant logs - please let me know if someone wants to see something else. Where should I start checking? Regards, -Samik ============== StarterLog =============== samik@condor-01:/var/log/condor$ cat StarterLog 04/10/17 09:31:03 (pid:1094) Failed to read results from '/usr/bin/docker -v': 'No such file or directory' (2) 04/10/17 09:31:03 (pid:1094) DockerAPI::detect() failed to detect the Docker version; assuming absent. 04/10/17 09:31:03 (pid:1094) '/usr/bin/singularity --version' did not exit successfully (code 32731); the first line of output was ''. 04/10/17 09:31:58 (pid:1126) ****************************************************** 04/10/17 09:31:58 (pid:1126) ** condor_starter (CONDOR_STARTER) STARTING UP 04/10/17 09:31:58 (pid:1126) ** /usr/sbin/condor_starter 04/10/17 09:31:58 (pid:1126) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1) 04/10/17 09:31:58 (pid:1126) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON 04/10/17 09:31:58 (pid:1126) ** $CondorVersion: 8.6.1 Mar 01 2017 BuildID: 398585 $ 04/10/17 09:31:58 (pid:1126) ** $CondorPlatform: x86_64_Debian8 $ 04/10/17 09:31:58 (pid:1126) ** PID = 1126 04/10/17 09:31:58 (pid:1126) ** Log last touched 4/10 09:31:03 04/10/17 09:31:58 (pid:1126) ****************************************************** .... 04/10/17 09:31:59 (pid:1126) Job 2.0 set to execute immediately 04/10/17 09:31:59 (pid:1126) Starting a VANILLA universe job with ID: 2.0 04/10/17 09:31:59 (pid:1126) IWD: /var/lib/condor/execute/dir_1126 04/10/17 09:31:59 (pid:1126) Output file: /var/lib/condor/execute/dir_1126/_condor_stdout 04/10/17 09:31:59 (pid:1126) Error file: /var/lib/condor/execute/dir_1126/_condor_stderr 04/10/17 09:31:59 (pid:1126) Renice expr "0" evaluated to 0 04/10/17 09:31:59 (pid:1126) About to exec /opt/solvers/lp_solve/lp_solve -rxli /opt/solvers/lp_solve/libxli_MathProg.so tsp.mod 04/10/17 09:31:59 (pid:1126) Running job as user samik 04/10/17 09:31:59 (pid:1126) Create_Process succeeded, pid=1133 04/10/17 09:31:59 (pid:1126) Cgroup controller for memory accounting is not available. EOF ## Within 2 mins, I have the solution of the job available in _condor_stdout in the temp folder below. samik@condor-01:/var/lib/condor/execute/dir_1126$ ls -al total 52 drwx------ 2 samik samik 4096 Apr 10 09:31 . drwxr-xr-x 3 condor condor 4096 Apr 10 09:31 .. -rwx------ 1 samik samik 48 Apr 10 09:31 .chirp.config -rw-r--r-- 1 samik samik 0 Apr 10 09:31 _condor_stderr -rw-r--r-- 1 samik samik 16469 Apr 10 09:33 _condor_stdout -rw-r--r-- 1 condor condor 3646 Apr 10 09:31 .job.ad -rw-r--r-- 1 condor condor 5311 Apr 10 09:31 .machine.ad -rw-r--r-- 1 samik samik 6665 Apr 10 09:31 tsp.mod ## And the executable process is now defunct samik@condor-01:~/condor$ ps -ef | grep lp_solve samik 1133  584 17 07:16 ? 00:00:14 [lp_solve] <defunct> samik 1249  556 0 07:17 pts/0 00:00:00 grep lp_solve ## However, now the condor_starter is at 100%, and will remain so until killed. root@condor-01:/etc/condor/config.d# top  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND  1126 condor 20 0 48720 9704 8584 R 99.9 1.6 8:39.67 condor_starter  1 root 20 0 28624 4816 3120 S 0.0 0.8 0:01.32 systemd  2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd ## Some suspicious lines in CollectorLog: samik@condor-01:/var/log/condor$ cat CollectorLog | more 04/10/17 09:30:57 Setting maximum file descriptors to 10240. 04/10/17 09:30:57 ****************************************************** 04/10/17 09:30:57 ** condor_collector (CONDOR_COLLECTOR) STARTING UP 04/10/17 09:30:57 ** /usr/sbin/condor_collector 04/10/17 09:30:57 ** SubsystemInfo: name=COLLECTOR type=COLLECTOR(3) class=DAEMON(1) 04/10/17 09:30:57 ** Configuration: subsystem:COLLECTOR local:<NONE> class:DAEMON 04/10/17 09:30:57 ** $CondorVersion: 8.6.1 Mar 01 2017 BuildID: 398585 $ 04/10/17 09:30:57 ** $CondorPlatform: x86_64_Debian8 $ 04/10/17 09:30:57 ** PID = 1083 04/10/17 09:30:57 ** Log last touched time unavailable (No such file or directory) 04/10/17 09:30:57 ****************************************************** 04/10/17 09:30:57 Using config source: /etc/condor/condor_config 04/10/17 09:30:57 Using local config sources: 04/10/17 09:30:57 /etc/condor/config.d/condor_config.local 04/10/17 09:30:57 /etc/condor/condor_config.local 04/10/17 09:30:57 config Macros = 62, Sorted = 62, StringBytes = 1663, TablesBytes = 2280 04/10/17 09:30:57 CLASSAD_CACHING is ENABLED 04/10/17 09:30:57 Daemon Log is logging: D_ALWAYS D_ERROR 04/10/17 09:30:57 SharedPortEndpoint: waiting for connections to named socket collector 04/10/17 09:30:57 DaemonCore: non-shared command socket at <10.128.0.3:24706> 04/10/17 09:30:57 Daemoncore: Listening at <0.0.0.0:24706> on TCP (ReliSock) and UDP (SafeSock). 04/10/17 09:30:57 DaemonCore: non-shared command socket at <[::1]:24706> 04/10/17 09:30:57 WARNING: Condor is running on a loopback address 04/10/17 09:30:57 of this machine, and may not visible to other hosts! 04/10/17 09:30:57 Daemoncore: Listening at <[::]:24706> on TCP (ReliSock) and UDP (SafeSock). 04/10/17 09:30:57 DaemonCore: command socket at <10.128.0.3:9618?addrs=10.128.0.3-9618+[--1]-9618&no UDP&sock=collector> 04/10/17 09:30:57 DaemonCore: private command socket at <10.128.0.3:9618?addrs=10.128.0.3-9618+[--1] -9618&noUDP&sock=collector> .... 04/10/17 09:30:58 Got QUERY_STARTD_PVT_ADS 04/10/17 09:30:58 Number of Active Workers 0 04/10/17 09:30:58 (Sending 0 ads in response to query) 04/10/17 09:30:58 Number of Active Workers 1 04/10/17 09:30:58 Query info: matched=0; skipped=0; query_time=0.003904; send_time=0.001173; type=MachinePrivate; requirements={true}; peer=<10.128.0.3:3535>; projection={} 04/10/17 09:30:58 (Sending 0 ads in response to query) 04/10/17 09:30:58 AccountingAd : Inserting ** "< <none> >" 04/10/17 09:30:58 DaemonCore: Can't receive command request from 10.128.0.3 (perhaps a timeout?) 04/10/17 09:30:58 Query info: matched=0; skipped=1; query_time=0.000652; send_time=0.004940; type=An y; requirements={( ( ( MyType == "Scheduler" ) || ( MyType == "Submitter" ) ) || ( ( MyType == "Mach ine" ) ) )}; peer=<10.128.0.3:42389>; projection={} ... |