[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[HTCondor-users] condor_starter hanging with 100% CPU



Hello,

I am trying to run a single-node v8.6.1 of HTCondor on a debian VM. The condor system seems to start up, and the condor_status does show the node as available. However, once I submit a job, condor_starter goes to 100% CPU and stays there till I manually shut down the condor system. Upon further investigations, I noticed that the job is actually getting done, and the output is written in the temporary folder (see logs below), but for some reason condor_starter is not getting notification of this, or is getting busy in something else.

Here are some relevant logs - please let me know if someone wants to see something else. Where should I start checking?

Regards,
-Samik


============== StarterLog ===============
samik@condor-01:/var/log/condor$ cat StarterLog
04/10/17 09:31:03 (pid:1094) Failed to read results from '/usr/bin/docker -v': 'No such file or directory' (2)
04/10/17 09:31:03 (pid:1094) DockerAPI::detect() failed to detect the Docker version; assuming absent.
04/10/17 09:31:03 (pid:1094) '/usr/bin/singularity --version' did not exit successfully (code 32731); the first line of output was ''.
04/10/17 09:31:58 (pid:1126) ******************************************************
04/10/17 09:31:58 (pid:1126) ** condor_starter (CONDOR_STARTER) STARTING UP
04/10/17 09:31:58 (pid:1126) ** /usr/sbin/condor_starter
04/10/17 09:31:58 (pid:1126) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
04/10/17 09:31:58 (pid:1126) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
04/10/17 09:31:58 (pid:1126) ** $CondorVersion: 8.6.1 Mar 01 2017 BuildID: 398585 $
04/10/17 09:31:58 (pid:1126) ** $CondorPlatform: x86_64_Debian8 $
04/10/17 09:31:58 (pid:1126) ** PID = 1126
04/10/17 09:31:58 (pid:1126) ** Log last touched 4/10 09:31:03
04/10/17 09:31:58 (pid:1126) ******************************************************
....
04/10/17 09:31:59 (pid:1126) Job 2.0 set to execute immediately
04/10/17 09:31:59 (pid:1126) Starting a VANILLA universe job with ID: 2.0
04/10/17 09:31:59 (pid:1126) IWD: /var/lib/condor/execute/dir_1126
04/10/17 09:31:59 (pid:1126) Output file: /var/lib/condor/execute/dir_1126/_condor_stdout
04/10/17 09:31:59 (pid:1126) Error file: /var/lib/condor/execute/dir_1126/_condor_stderr
04/10/17 09:31:59 (pid:1126) Renice expr "0" evaluated to 0
04/10/17 09:31:59 (pid:1126) About to exec /opt/solvers/lp_solve/lp_solve -rxli /opt/solvers/lp_solve/libxli_MathProg.so tsp.mod
04/10/17 09:31:59 (pid:1126) Running job as user samik
04/10/17 09:31:59 (pid:1126) Create_Process succeeded, pid=1133
04/10/17 09:31:59 (pid:1126) Cgroup controller for memory accounting is not available.
EOF

## Within 2 mins, I have the solution of the job available in _condor_stdout in the temp folder below.

samik@condor-01:/var/lib/condor/execute/dir_1126$ ls -al
total 52
drwx------ 2 samik samik 4096 Apr 10 09:31 .
drwxr-xr-x 3 condor condor 4096 Apr 10 09:31 ..
-rwx------ 1 samik samik 48 Apr 10 09:31 .chirp.config
-rw-r--r-- 1 samik samik 0 Apr 10 09:31 _condor_stderr
-rw-r--r-- 1 samik samik 16469 Apr 10 09:33 _condor_stdout
-rw-r--r-- 1 condor condor 3646 Apr 10 09:31 .job.ad
-rw-r--r-- 1 condor condor 5311 Apr 10 09:31 .machine.ad
-rw-r--r-- 1 samik samik 6665 Apr 10 09:31 tsp.mod

## And the executable process is now defunct

samik@condor-01:~/condor$ ps -ef | grep lp_solve
samikÂÂÂÂÂÂÂ 1133 ÂÂÂ 584 17 07:16 ?ÂÂÂÂÂÂÂ 00:00:14 [lp_solve] <defunct>
samikÂÂÂÂÂÂÂ 1249 ÂÂÂ 556Â 0 07:17 pts/0ÂÂÂ 00:00:00 grep lp_solve

## However, now the condor_starter is at 100%, and will remain so until killed.

root@condor-01:/etc/condor/config.d# top
ÂÂÂ PID USERÂÂÂÂÂ PRÂ NIÂÂÂ VIRTÂÂÂ RESÂÂÂ SHR S %CPU %MEMÂÂÂÂ TIME+ COMMAND
ÂÂ 1126 condorÂÂÂ 20ÂÂ 0ÂÂ 48720ÂÂ 9704ÂÂ 8584 R 99.9Â 1.6ÂÂ 8:39.67 condor_starter
ÂÂÂÂÂ 1 rootÂÂÂÂÂ 20ÂÂ 0ÂÂ 28624ÂÂ 4816ÂÂ 3120 SÂ 0.0Â 0.8ÂÂ 0:01.32 systemd
ÂÂÂÂÂ 2 rootÂÂÂÂÂ 20ÂÂ 0ÂÂÂÂÂÂ 0ÂÂÂÂÂ 0ÂÂÂÂÂ 0 SÂ 0.0Â 0.0ÂÂ 0:00.00 kthreadd

## Some suspicious lines in CollectorLog:

samik@condor-01:/var/log/condor$ cat CollectorLog | more
04/10/17 09:30:57 Setting maximum file descriptors to 10240.
04/10/17 09:30:57 ******************************************************
04/10/17 09:30:57 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
04/10/17 09:30:57 ** /usr/sbin/condor_collector
04/10/17 09:30:57 ** SubsystemInfo: name=COLLECTOR type=COLLECTOR(3) class=DAEMON(1)
04/10/17 09:30:57 ** Configuration: subsystem:COLLECTOR local:<NONE> class:DAEMON
04/10/17 09:30:57 ** $CondorVersion: 8.6.1 Mar 01 2017 BuildID: 398585 $
04/10/17 09:30:57 ** $CondorPlatform: x86_64_Debian8 $
04/10/17 09:30:57 ** PID = 1083
04/10/17 09:30:57 ** Log last touched time unavailable (No such file or directory)
04/10/17 09:30:57 ******************************************************
04/10/17 09:30:57 Using config source: /etc/condor/condor_config
04/10/17 09:30:57 Using local config sources:
04/10/17 09:30:57ÂÂÂ /etc/condor/config.d/condor_config.local
04/10/17 09:30:57ÂÂÂ /etc/condor/condor_config.local
04/10/17 09:30:57 config Macros = 62, Sorted = 62, StringBytes = 1663, TablesBytes = 2280
04/10/17 09:30:57 CLASSAD_CACHING is ENABLED
04/10/17 09:30:57 Daemon Log is logging: D_ALWAYS D_ERROR
04/10/17 09:30:57 SharedPortEndpoint: waiting for connections to named socket collector
04/10/17 09:30:57 DaemonCore: non-shared command socket at <10.128.0.3:24706>
04/10/17 09:30:57 Daemoncore: Listening at <0.0.0.0:24706> on TCP (ReliSock) and UDP (SafeSock).
04/10/17 09:30:57 DaemonCore: non-shared command socket at <[::1]:24706>
04/10/17 09:30:57 WARNING: Condor is running on a loopback address
04/10/17 09:30:57ÂÂÂÂÂÂÂÂÂ of this machine, and may not visible to other hosts!
04/10/17 09:30:57 Daemoncore: Listening at <[::]:24706> on TCP (ReliSock) and UDP (SafeSock).
04/10/17 09:30:57 DaemonCore: command socket at <10.128.0.3:9618?addrs=10.128.0.3-9618+[--1]-9618&no
UDP&sock=collector>
04/10/17 09:30:57 DaemonCore: private command socket at <10.128.0.3:9618?addrs=10.128.0.3-9618+[--1]
-9618&noUDP&sock=collector>
....
04/10/17 09:30:58 Got QUERY_STARTD_PVT_ADS
04/10/17 09:30:58 Number of Active Workers 0
04/10/17 09:30:58 (Sending 0 ads in response to query)
04/10/17 09:30:58 Number of Active Workers 1
04/10/17 09:30:58 Query info: matched=0; skipped=0; query_time=0.003904; send_time=0.001173; type=MachinePrivate; requirements={true}; peer=<10.128.0.3:3535>; projection={}
04/10/17 09:30:58 (Sending 0 ads in response to query)
04/10/17 09:30:58 AccountingAd : Inserting ** "< <none> >"
04/10/17 09:30:58 DaemonCore: Can't receive command request from 10.128.0.3 (perhaps a timeout?)
04/10/17 09:30:58 Query info: matched=0; skipped=1; query_time=0.000652; send_time=0.004940; type=An
y; requirements={( ( ( MyType == "Scheduler" ) || ( MyType == "Submitter" ) ) || ( ( MyType == "Mach
ine" ) ) )}; peer=<10.128.0.3:42389>; projection={}
...