[HTCondor-users] schedds not returning the jobs correctly
- Date: Fri, 22 Feb 2019 17:46:27 +0000
- From: Marco Mambelli <marcom@xxxxxxxx>
- Subject: [HTCondor-users] schedds not returning the jobs correctly
The schedds on a GlideinWMS factory do not seem to be working correctly:
- Jobs are running and queued, and they are visible via condor_status -schedd.
- condor_q -g returns nothing, not even "All queues are empty".
$ condor_q -g -xml
<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
</classads>
<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
</classads>
<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
</classads>
<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
</classads>
<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
</classads>
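To make the symptom above concrete, here is a minimal Python sketch that counts how many job ads each of the back-to-back XML documents contains. It assumes the `condor_q -g -xml` output has been captured into a string, and that each job ad would appear as a `<c>` element inside `<classads>` (an assumption about the ClassAd XML layout; the documents shown above contain no ads to confirm it). The function name and sample data are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

CLOSE_TAG = "</classads>"

def count_ads_per_document(output: str) -> List[int]:
    """Return the number of <c> ads in each concatenated XML document."""
    counts = []
    # Each document starts with its own XML declaration, so split on it.
    # Anything before the first declaration (e.g. debug lines) is dropped.
    for chunk in output.split("<?xml")[1:]:
        end = chunk.find(CLOSE_TAG)
        if end == -1:
            continue  # truncated document, nothing to parse
        # Cut at the closing tag so trailing log lines do not break parsing.
        document = "<?xml" + chunk[: end + len(CLOSE_TAG)]
        root = ET.fromstring(document)
        counts.append(len(root.findall("c")))
    return counts

# Hypothetical sample: five empty documents, as in the output above.
EMPTY_DOC = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE classads SYSTEM "classads.dtd">\n'
    "<classads>\n"
    "</classads>\n"
)
SAMPLE_OUTPUT = EMPTY_DOC * 5

if __name__ == "__main__":
    print(count_ads_per_document(SAMPLE_OUTPUT))  # prints [0, 0, 0, 0, 0]
```

Five documents with zero ads each would mean that every schedd answered the query but returned no jobs, which is different from the query failing outright.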
$ condor_q -g
$ condor_status -schedd
Name                      Machine                   RunningJobs  IdleJobs  HeldJobs
cmsgwms-factory.fnal.gov  cmsgwms-factory.fnal.gov         1174      1001         0
schedd_glideins2@myhost   myhost                           1635       831        22
schedd_glideins3@myhost   myhost                            285       812        22
schedd_glideins4@myhost   myhost                           1794       997         1
schedd_glideins5@myhost   myhost                           2007      1074         8
                          TotalRunningJobs  TotalIdleJobs  TotalHeldJobs
Total                                 6895           4715             53
[I replaced the hostname with "myhost" here; the real hostname was correct.]
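As a sanity check on the collector's view, the per-schedd rows can be summed and compared with the Total line. The following is a minimal Python sketch, assuming the `condor_status -schedd` output is available as text and that data rows are exactly the lines with five whitespace-separated fields whose last three are integers. The function name is hypothetical.

```python
def sum_schedd_rows(text):
    """Sum RunningJobs, IdleJobs and HeldJobs over the per-schedd rows."""
    running = idle = held = 0
    for line in text.splitlines():
        fields = line.split()
        # Data rows: Name, Machine, then three integer counters.
        # The header row and the Total row do not match this shape.
        if len(fields) == 5 and all(f.isdigit() for f in fields[2:]):
            running += int(fields[2])
            idle += int(fields[3])
            held += int(fields[4])
    return running, idle, held

# Rows copied from the condor_status -schedd output above.
STATUS_OUTPUT = """\
Name Machine RunningJobs IdleJobs HeldJobs
cmsgwms-factory.fnal.gov cmsgwms-factory.fnal.gov 1174 1001 0
schedd_glideins2@myhost myhost 1635 831 22
schedd_glideins3@myhost myhost 285 812 22
schedd_glideins4@myhost myhost 1794 997 1
schedd_glideins5@myhost myhost 2007 1074 8
TotalRunningJobs TotalIdleJobs TotalHeldJobs
Total 6895 4715 53
"""

if __name__ == "__main__":
    print(sum_schedd_rows(STATUS_OUTPUT))  # prints (6895, 4715, 53)
```

The sums match the Total line, so the schedd ads in the collector are internally consistent; the discrepancy is only between these ads and what condor_q returns.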
$ condor_q -version
$CondorVersion: 8.6.11 May 10 2018 BuildID: 440910 $
$CondorPlatform: x86_64_RedHat7 $
The schedd logs are all unusually flat: mostly "Number of Active Workers 0" lines (rarely with a nonzero count), plus one strange line:
"Can't find address for startd myhost"
There is no startd on the factory host; it is not in the daemon list.
02/22/19 11:20:18 (pid:41978) Number of Active Workers 0
02/22/19 11:20:19 (pid:41978) Number of Active Workers 0
02/22/19 11:20:19 (pid:41978) TransferQueueManager stats: active up=0/100 down=0/100; waiting up=0 down=0; wait time up=0s down=0s
02/22/19 11:20:19 (pid:41978) TransferQueueManager upload 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
02/22/19 11:20:19 (pid:41978) TransferQueueManager download 1m I/O load: 0 bytes/s 0.000 disk load 0.000 net load
02/22/19 11:20:19 (pid:41978) Started condor_gmanager for owner cmsglobal_1 pid=1423401
02/22/19 11:20:19 (pid:41978) Can't find address for startd myhost
02/22/19 11:20:20 (pid:41978) Number of Active Workers 0
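To quantify how flat a log like this is, the lines can be tallied by message type. This is a minimal Python sketch, assuming every log line has the `MM/DD/YY HH:MM:SS (pid:N) message` shape seen in the excerpt above. The function name and the category labels are hypothetical.

```python
import re
from collections import Counter

# Timestamp, pid, then the free-form message, as in the excerpt above.
LINE_RE = re.compile(r"^\d\d/\d\d/\d\d \d\d:\d\d:\d\d \(pid:\d+\) (.*)$")
WORKERS_RE = re.compile(r"^Number of Active Workers (\d+)$")

def summarize_schedd_log(lines):
    """Count worker-status lines, startd address errors, and everything else."""
    summary = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not look like log entries
        message = match.group(1)
        workers = WORKERS_RE.match(message)
        if workers:
            zero = int(workers.group(1)) == 0
            summary["workers_zero" if zero else "workers_nonzero"] += 1
        elif message.startswith("Can't find address for startd"):
            summary["startd_address_errors"] += 1
        else:
            summary["other"] += 1
    return summary

# A few lines copied from the excerpt above.
LOG_EXCERPT = """\
02/22/19 11:20:18 (pid:41978) Number of Active Workers 0
02/22/19 11:20:19 (pid:41978) Number of Active Workers 0
02/22/19 11:20:19 (pid:41978) Started condor_gmanager for owner cmsglobal_1 pid=1423401
02/22/19 11:20:19 (pid:41978) Can't find address for startd myhost
02/22/19 11:20:20 (pid:41978) Number of Active Workers 0
""".splitlines()

if __name__ == "__main__":
    for key, count in sorted(summarize_schedd_log(LOG_EXCERPT).items()):
        print(key, count)
```

Run over a full log file, the ratio of `workers_zero` to `workers_nonzero` gives a rough measure of how idle the schedd has been.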
Something is wrong, but I cannot figure out what.
condor_config_val seems to return the correct spool and address files, also when querying directly with -address.
D_FULLDEBUG is not much help:
$ _CONDOR_TOOL_DEBUG="D_FULLDEBUG" condor_q -debug -g
02/22/19 11:39:51 Result of reading /etc/issue: \S
02/22/19 11:39:51 Result of reading /etc/redhat-release: Scientific Linux release 7.5 (Nitrogen)
02/22/19 11:39:51 Using IDs: 16 processors, 8 CPUs, 8 HTs
02/22/19 11:39:51 Reading condor configuration from '/etc/condor/condor_config'
02/22/19 11:39:51 Enumerating interfaces: lo 127.0.0.1 up
02/22/19 11:39:51 Enumerating interfaces: eth2 131.225.X.X up
02/22/19 11:39:51 WARNING: Config source is empty: /etc/condor/config.d/90_condor_test
02/22/19 11:39:51 Will use TCP to update collector cmsgwms-factory.fnal.gov <131.225.X.X:9618>
02/22/19 11:39:51 Trying to query collector <131.225.X.X:9618>
02/22/19 11:39:51 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_9
02/22/19 11:39:51 Sent classad to schedd
02/22/19 11:39:51 Got classad from schedd.
02/22/19 11:39:51 Ad was last one from schedd.
02/22/19 11:39:51 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_5
02/22/19 11:39:51 Sent classad to schedd
02/22/19 11:39:51 Got classad from schedd.
02/22/19 11:39:51 Ad was last one from schedd.
02/22/19 11:39:51 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_6
02/22/19 11:39:51 Sent classad to schedd
02/22/19 11:39:51 Got classad from schedd.
02/22/19 11:39:51 Ad was last one from schedd.
02/22/19 11:39:51 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_7
02/22/19 11:39:51 Sent classad to schedd
02/22/19 11:39:52 Got classad from schedd.
02/22/19 11:39:52 Ad was last one from schedd.
02/22/19 11:39:52 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_8
02/22/19 11:39:52 Sent classad to schedd
02/22/19 11:39:52 Got classad from schedd.
02/22/19 11:39:52 Ad was last one from schedd.
$ _CONDOR_TOOL_DEBUG="D_FULLDEBUG" condor_q -debug -g -xml
02/22/19 11:42:45 Result of reading /etc/issue: \S
02/22/19 11:42:45 Result of reading /etc/redhat-release: Scientific Linux release 7.5 (Nitrogen)
02/22/19 11:42:45 Using IDs: 16 processors, 8 CPUs, 8 HTs
02/22/19 11:42:45 Reading condor configuration from '/etc/condor/condor_config'
02/22/19 11:42:45 Enumerating interfaces: lo 127.0.0.1 up
02/22/19 11:42:45 Enumerating interfaces: eth2 131.225.X.X up
02/22/19 11:42:45 WARNING: Config source is empty: /etc/condor/config.d/90_condor_test
02/22/19 11:42:45 Will use TCP to update collector cmsgwms-factory.fnal.gov <131.225.X.X:9618>
02/22/19 11:42:45 Trying to query collector <131.225.X.X:9618>
02/22/19 11:42:45 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_9
02/22/19 11:42:45 Sent classad to schedd
02/22/19 11:42:45 Got classad from schedd.
02/22/19 11:42:45 Ad was last one from schedd.
<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
</classads>
02/22/19 11:42:45 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_5
02/22/19 11:42:45 Sent classad to schedd
02/22/19 11:42:45 Got classad from schedd.
02/22/19 11:42:45 Ad was last one from schedd.
<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
</classads>
02/22/19 11:42:45 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_6
02/22/19 11:42:45 Sent classad to schedd
02/22/19 11:42:45 Got classad from schedd.
02/22/19 11:42:45 Ad was last one from schedd.
<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
</classads>
02/22/19 11:42:45 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_7
02/22/19 11:42:45 Sent classad to schedd
02/22/19 11:42:45 Got classad from schedd.
02/22/19 11:42:45 Ad was last one from schedd.
<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
</classads>
02/22/19 11:42:45 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_8
02/22/19 11:42:45 Sent classad to schedd
02/22/19 11:42:45 Got classad from schedd.
02/22/19 11:42:45 Ad was last one from schedd.
<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
</classads>
Krista also saw:
ERROR "Assertion ERROR on (!(pjr->flags & 0x0004))" at line 4319 in file /slots/16/dir_3109781/userdir/.tmpVhmMVH/BUILD/condor-8.6.11/src/condor_q.V6/queue.cpp
Any suggestions about what is wrong?
Thank you,
Marco