
[HTCondor-users] schedds not returning the jobs correctly



The schedds on a GlideinWMS factory do not seem to be working correctly:
- jobs are running and queued, and they are visible via condor_status -schedd
- condor_q -g returns nothing, not even "All queues are empty"


$ condor_q -g -xml
<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
</classads>
<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
</classads>
<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
</classads>
<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
</classads>
<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
</classads>
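
The output above is five back-to-back XML documents, one per schedd, each with an empty <classads> element. A minimal sketch for counting the job ads in such output, assuming each job ClassAd appears as a <c> child of <classads> (the usual ClassAd XML layout, not confirmed by the output here); the function name is hypothetical:

```python
import re
import xml.etree.ElementTree as ET


def count_ads_per_document(text):
    """Return the number of <c> (ClassAd) elements in each XML document in text."""
    # The tool prints one complete XML document per schedd, back to back,
    # so split on the XML declaration before parsing each piece.
    pieces = [p for p in re.split(r"<\?xml[^>]*\?>", text) if p.strip()]
    counts = []
    for piece in pieces:
        # Drop the DOCTYPE line; the external DTD is not needed for counting.
        body = re.sub(r"<!DOCTYPE[^>]*>", "", piece).strip()
        root = ET.fromstring(body)
        counts.append(len(root.findall("c")))
    return counts


# One empty document, repeated five times, as in the output above.
sample = """<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
</classads>
""" * 5

print(count_ads_per_document(sample))
```

A list of zeros, one per schedd, means every schedd was reached but none returned a job ad.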

$ condor_q -g

$ condor_status -schedd
Name                      Machine                   RunningJobs   IdleJobs   HeldJobs

cmsgwms-factory.fnal.gov  cmsgwms-factory.fnal.gov         1174       1001          0
schedd_glideins2@myhost   myhost                           1635        831         22
schedd_glideins3@myhost   myhost                            285        812         22
schedd_glideins4@myhost   myhost                           1794        997          1
schedd_glideins5@myhost   myhost                           2007       1074          8

                      TotalRunningJobs      TotalIdleJobs      TotalHeldJobs


               Total              6895               4715                 53


[I replaced the hostname with "myhost" here; the real hostname was correct]
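
As a sanity check on the condor_status -schedd listing, the per-schedd rows can be summed and compared with the Total line. This is only a sketch that parses the text layout shown above; the function name is hypothetical and the five-field row format is an assumption based on this output alone:

```python
def sum_schedd_columns(text):
    """Sum RunningJobs, IdleJobs and HeldJobs over the per-schedd rows."""
    totals = [0, 0, 0]
    for line in text.splitlines():
        fields = line.split()
        # A per-schedd row has five fields: name, machine, and three counts.
        # Header and Total lines fail this test and are skipped.
        if len(fields) == 5 and all(f.isdigit() for f in fields[2:]):
            for i, value in enumerate(fields[2:]):
                totals[i] += int(value)
    return tuple(totals)


listing = """\
Name                      Machine                   RunningJobs   IdleJobs   HeldJobs
cmsgwms-factory.fnal.gov  cmsgwms-factory.fnal.gov         1174       1001          0
schedd_glideins2@myhost   myhost                           1635        831         22
schedd_glideins3@myhost   myhost                            285        812         22
schedd_glideins4@myhost   myhost                           1794        997          1
schedd_glideins5@myhost   myhost                           2007       1074          8
               Total              6895               4715                 53
"""

print(sum_schedd_columns(listing))  # (6895, 4715, 53), matching the Total line
```
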
$ condor_q -version
$CondorVersion: 8.6.11 May 10 2018 BuildID: 440910 $
$CondorPlatform: x86_64_RedHat7 $

The schedd logs are all unusually quiet: mostly "Number of Active Workers 0" lines (rarely with a nonzero count), plus one strange line,
"Can't find address for startd myhost".
There is no startd on the factory host; it is not in the daemon list.

02/22/19 11:20:18 (pid:41978) Number of Active Workers 0
02/22/19 11:20:19 (pid:41978) Number of Active Workers 0
02/22/19 11:20:19 (pid:41978) TransferQueueManager stats: active up=0/100 down=0/100; waiting up=0 down=0; wait time up=0s down=0s
02/22/19 11:20:19 (pid:41978) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
02/22/19 11:20:19 (pid:41978) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
02/22/19 11:20:19 (pid:41978) Started condor_gmanager for owner cmsglobal_1 pid=1423401
02/22/19 11:20:19 (pid:41978) Can't find address for startd myhost
02/22/19 11:20:20 (pid:41978) Number of Active Workers 0


Something is wrong but I cannot understand what.

condor_config_val seems to return the correct spool and address files. I also tried querying directly with -address.
The D_FULLDEBUG output is not much help:

$ _CONDOR_TOOL_DEBUG="D_FULLDEBUG" condor_q -debug -g
02/22/19 11:39:51 Result of reading /etc/issue:  \S

02/22/19 11:39:51 Result of reading /etc/redhat-release:  Scientific Linux release 7.5 (Nitrogen)

02/22/19 11:39:51 Using IDs: 16 processors, 8 CPUs, 8 HTs
02/22/19 11:39:51 Reading condor configuration from '/etc/condor/condor_config'
02/22/19 11:39:51 Enumerating interfaces: lo 127.0.0.1 up
02/22/19 11:39:51 Enumerating interfaces: eth2 131.225.X.X up
02/22/19 11:39:51 WARNING: Config source is empty: /etc/condor/config.d/90_condor_test
02/22/19 11:39:51 Will use TCP to update collector cmsgwms-factory.fnal.gov <131.225.X.X:9618>
02/22/19 11:39:51 Trying to query collector <131.225.X.X:9618>
02/22/19 11:39:51 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_9
02/22/19 11:39:51 Sent classad to schedd
02/22/19 11:39:51 Got classad from schedd.
02/22/19 11:39:51 Ad was last one from schedd.
02/22/19 11:39:51 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_5
02/22/19 11:39:51 Sent classad to schedd
02/22/19 11:39:51 Got classad from schedd.
02/22/19 11:39:51 Ad was last one from schedd.
02/22/19 11:39:51 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_6
02/22/19 11:39:51 Sent classad to schedd
02/22/19 11:39:51 Got classad from schedd.
02/22/19 11:39:51 Ad was last one from schedd.
02/22/19 11:39:51 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_7
02/22/19 11:39:51 Sent classad to schedd
02/22/19 11:39:52 Got classad from schedd.
02/22/19 11:39:52 Ad was last one from schedd.
02/22/19 11:39:52 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_8
02/22/19 11:39:52 Sent classad to schedd
02/22/19 11:39:52 Got classad from schedd.
02/22/19 11:39:52 Ad was last one from schedd.
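
The debug trace above shows the same short exchange for every schedd. A rough sketch for tallying it, assuming that each "shared port id" line starts a new schedd conversation and that a single "Got classad" followed at once by "Ad was last one" means no job ads came back (that reading of the log is an assumption); the function name is hypothetical:

```python
import re


def ads_per_schedd(log_text):
    """Map each shared port id to the number of 'Got classad' lines after it."""
    counts = {}
    current = None
    for line in log_text.splitlines():
        match = re.search(r"for shared port id (\S+)", line)
        if match:
            # A new connection request starts a new tally.
            current = match.group(1)
            counts[current] = 0
        elif current is not None and "Got classad from schedd" in line:
            counts[current] += 1
    return counts


trace = """\
02/22/19 11:39:51 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_9
02/22/19 11:39:51 Sent classad to schedd
02/22/19 11:39:51 Got classad from schedd.
02/22/19 11:39:51 Ad was last one from schedd.
02/22/19 11:39:51 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_5
02/22/19 11:39:51 Sent classad to schedd
02/22/19 11:39:51 Got classad from schedd.
02/22/19 11:39:51 Ad was last one from schedd.
"""

print(ads_per_schedd(trace))
```

With the full trace, all five shared port ids map to 1, so every schedd answered and none sent more than a single ad.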

$ _CONDOR_TOOL_DEBUG="D_FULLDEBUG" condor_q -debug -g -xml
02/22/19 11:42:45 Result of reading /etc/issue:  \S

02/22/19 11:42:45 Result of reading /etc/redhat-release:  Scientific Linux release 7.5 (Nitrogen)

02/22/19 11:42:45 Using IDs: 16 processors, 8 CPUs, 8 HTs
02/22/19 11:42:45 Reading condor configuration from '/etc/condor/condor_config'
02/22/19 11:42:45 Enumerating interfaces: lo 127.0.0.1 up
02/22/19 11:42:45 Enumerating interfaces: eth2 131.225.X.X up
02/22/19 11:42:45 WARNING: Config source is empty: /etc/condor/config.d/90_condor_test
02/22/19 11:42:45 Will use TCP to update collector cmsgwms-factory.fnal.gov <131.225.X.X:9618>
02/22/19 11:42:45 Trying to query collector <131.225.X.X:9618>
02/22/19 11:42:45 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_9
02/22/19 11:42:45 Sent classad to schedd
02/22/19 11:42:45 Got classad from schedd.
02/22/19 11:42:45 Ad was last one from schedd.
<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
</classads>
02/22/19 11:42:45 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_5
02/22/19 11:42:45 Sent classad to schedd
02/22/19 11:42:45 Got classad from schedd.
02/22/19 11:42:45 Ad was last one from schedd.
<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
</classads>
02/22/19 11:42:45 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_6
02/22/19 11:42:45 Sent classad to schedd
02/22/19 11:42:45 Got classad from schedd.
02/22/19 11:42:45 Ad was last one from schedd.
<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
</classads>
02/22/19 11:42:45 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_7
02/22/19 11:42:45 Sent classad to schedd
02/22/19 11:42:45 Got classad from schedd.
02/22/19 11:42:45 Ad was last one from schedd.
<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
</classads>
02/22/19 11:42:45 SharedPortClient: sent connection request to schedd at <131.225.X.X:9615> for shared port id 41892_018c_8
02/22/19 11:42:45 Sent classad to schedd
02/22/19 11:42:45 Got classad from schedd.
02/22/19 11:42:45 Ad was last one from schedd.
<?xml version="1.0"?>
<!DOCTYPE classads SYSTEM "classads.dtd">
<classads>
</classads>

Krista also saw:
ERROR "Assertion ERROR on (!(pjr->flags & 0x0004))" at line 4319 in file /slots/16/dir_3109781/userdir/.tmpVhmMVH/BUILD/condor-8.6.11/src/condor_q.V6/queue.cpp


Any suggestions about what is wrong?
Thank you,
Marco