Mailing List Archives
[Condor-users] Negotiator crashing
It looks like my Negotiator keeps crashing. If I look in the
NegotiatorLog I see this:
6/22 14:09:08 ERROR "Assertion ERROR on
(resource_hash.insert( ResourceName, ResourceAd ) == 0)" at line 785
in file Accountant.cpp
I can restart it, but it dies again after 30 seconds.
Can anyone give me some pointers on how to troubleshoot this? AFAIK
nothing has changed that would explain this. I've got a lot of jobs in
the queue, and they seem to be running, but some of the ones I submit
just fail, and the log says it is unable to contact the negotiator.
Thanks!
--Peter
Here's more of the log.
6/22 14:13:33 ******************************************************
6/22 14:13:33 ** condor_negotiator (CONDOR_NEGOTIATOR) STARTING UP
6/22 14:13:33 ** /opt/osg-shared/se/app/site/condor-7.2.1/sbin/condor_negotiator
6/22 14:13:33 ** SubsystemInfo: name=NEGOTIATOR type=NEGOTIATOR(4)
class=DAEMON(1)
6/22 14:13:33 ** Configuration: subsystem:NEGOTIATOR local:<NONE>
class:DAEMON
6/22 14:13:33 ** $CondorVersion: 7.2.1 Feb 18 2009 BuildID: 133382 $
6/22 14:13:33 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
6/22 14:13:33 ** PID = 7737
6/22 14:13:33 ** Log last touched 6/22 14:09:08
6/22 14:13:33 ******************************************************
6/22 14:13:33 Using config source: /opt/osg-shared/se/app/site/condor/etc/condor_config
6/22 14:13:33 Using local config sources:
6/22 14:13:33 /opt/osg-local/condor/condor_config.local
6/22 14:13:33 DaemonCore: Command Socket at <10.0.10.39:36051>
6/22 14:13:33 About to rotate ClassAd log /opt/osg-local/condor/spool/Accountantnew.log
6/22 14:13:34 NEGOTIATOR_SOCKET_CACHE_SIZE = 16
6/22 14:13:34 PREEMPTION_REQUIREMENTS = ( (CurrentTime -
EnteredCurrentState) > (1 * (60 * 60)) && RemoteUserPrio >
SubmittorPrio * 1.2 ) || (MY.NiceUser == True)
6/22 14:13:34 ACCOUNTANT_HOST = None (local)
6/22 14:13:34 NEGOTIATOR_INTERVAL = 25 sec
6/22 14:13:34 NEGOTIATOR_TIMEOUT = 30 sec
6/22 14:13:34 MAX_TIME_PER_SUBMITTER = 31536000 sec
6/22 14:13:34 MAX_TIME_PER_PIESPIN = 31536000 sec
6/22 14:13:34 PREEMPTION_RANK = (RemoteUserPrio * 1000000) -
TARGET.ImageSize
6/22 14:13:34 NEGOTIATOR_PRE_JOB_RANK = RemoteOwner =?= UNDEFINED
6/22 14:13:34 NEGOTIATOR_POST_JOB_RANK = None
6/22 14:13:34 ---------- Started Negotiation Cycle ----------
6/22 14:13:34 Phase 1: Obtaining ads from collector ...
6/22 14:13:34 Getting all public ads ...
6/22 14:13:34 Sorting 176 ads ...
6/22 14:13:34 Can't evaluate STARTD_AD_REEVAL_EXPR
target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool,
treating as TRUE
6/22 14:13:34 Can't evaluate STARTD_AD_REEVAL_EXPR
target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool,
treating as TRUE
6/22 14:13:34 Can't evaluate STARTD_AD_REEVAL_EXPR
target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool,
treating as TRUE
6/22 14:13:34 Can't evaluate STARTD_AD_REEVAL_EXPR
target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool,
treating as TRUE
6/22 14:13:34 Can't evaluate STARTD_AD_REEVAL_EXPR
target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool,
treating as TRUE
6/22 14:13:34 Can't evaluate STARTD_AD_REEVAL_EXPR
target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool,
treating as TRUE
6/22 14:13:34 Can't evaluate STARTD_AD_REEVAL_EXPR
target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool,
treating as TRUE
6/22 14:13:34 Can't evaluate STARTD_AD_REEVAL_EXPR
target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool,
treating as TRUE
6/22 14:13:34 Can't evaluate STARTD_AD_REEVAL_EXPR
target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool,
treating as TRUE
6/22 14:13:34 Can't evaluate STARTD_AD_REEVAL_EXPR
target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool,
treating as TRUE
6/22 14:13:34 Can't evaluate STARTD_AD_REEVAL_EXPR
target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool,
treating as TRUE
6/22 14:13:34 Can't evaluate STARTD_AD_REEVAL_EXPR
target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool,
treating as TRUE
6/22 14:13:34 Can't evaluate STARTD_AD_REEVAL_EXPR
target.UpdateSequenceNumber > my.UpdateSequenceNumber as a bool,
treating as TRUE
6/22 14:13:34 Getting startd private ads ...
6/22 14:13:34 Got ads: 176 public and 123 private
6/22 14:13:34 Public ads include 7 submitter, 137 startd
6/22 14:13:34 Phase 2: Performing accounting ...
6/22 14:13:34 Could not lookup state --- assuming not claimed
6/22 14:13:34 Could not lookup state --- assuming not claimed
6/22 14:13:34 Could not lookup state --- assuming not claimed
6/22 14:13:34 Could not lookup state --- assuming not claimed
6/22 14:13:34 Could not lookup state --- assuming not claimed
6/22 14:13:34 Could not lookup state --- assuming not claimed
6/22 14:13:34 Could not lookup state --- assuming not claimed
6/22 14:13:34 Could not lookup state --- assuming not claimed
6/22 14:13:34 Could not lookup state --- assuming not claimed
6/22 14:13:34 Could not lookup state --- assuming not claimed
6/22 14:13:34 Could not lookup state --- assuming not claimed
6/22 14:13:34 Could not lookup state --- assuming not claimed
6/22 14:13:34 Could not lookup state --- assuming not claimed
6/22 14:13:34 Could not lookup state --- assuming not claimed
6/22 14:13:34 Phase 3: Sorting submitter ads by priority ...
6/22 14:13:34 Phase 4.1: Negotiating with schedds ...
6/22 14:13:34 Negotiating with nysgrid@xxxxxxxxxx at
<10.0.10.39:58621>
6/22 14:13:34 0 seconds so far
6/22 14:13:34 Request 296163.00000:
6/22 14:13:34 Rejected 296163.0 nysgrid@xxxxxxxxxx
<10.0.10.39:58621>: no match found
6/22 14:13:34 Got NO_MORE_JOBS; done negotiating
6/22 14:13:34 Negotiating with ijstokes@xxxxxxxxxx at
<10.0.10.39:58621>
6/22 14:13:34 0 seconds so far
6/22 14:13:34 Request 291916.00000:
6/22 14:13:34 Rejected 291916.0 ijstokes@xxxxxxxxxx
<10.0.10.39:58621>: no match found
6/22 14:13:34 Got NO_MORE_JOBS; done negotiating
6/22 14:13:34 Phase 4.2: Negotiating with schedds ...
6/22 14:13:34 Negotiating with nysgrid@xxxxxxxxxx at
<10.0.10.39:58621>
6/22 14:13:34 0 seconds so far
6/22 14:13:34 Request 296163.00000:
6/22 14:13:34 Rejected 296163.0 nysgrid@xxxxxxxxxx
<10.0.10.39:58621>: insufficient priority
6/22 14:13:34 Got NO_MORE_JOBS; done negotiating
6/22 14:13:34 Negotiating with ijstokes@xxxxxxxxxx at
<10.0.10.39:58621>
6/22 14:13:34 0 seconds so far
6/22 14:13:34 Phase 4.3: Negotiating with schedds ...
6/22 14:13:34 Negotiating with ijstokes@xxxxxxxxxx at
<10.0.10.39:58621>
6/22 14:13:34 0 seconds so far
6/22 14:13:34 Request 291916.00000:
6/22 14:13:34 Rejected 291916.0 ijstokes@xxxxxxxxxx
<10.0.10.39:58621>: no match found
6/22 14:13:34 Got NO_MORE_JOBS; done negotiating
6/22 14:13:34 ---------- Finished Negotiation Cycle ----------
6/22 14:13:59 ---------- Started Negotiation Cycle ----------
6/22 14:13:59 Phase 1: Obtaining ads from collector ...
6/22 14:13:59 Getting all public ads ...
6/22 14:13:59 Sorting 176 ads ...
6/22 14:13:59 Getting startd private ads ...
6/22 14:13:59 Got ads: 176 public and 123 private
6/22 14:13:59 Public ads include 7 submitter, 137 startd
6/22 14:13:59 Phase 2: Performing accounting ...
6/22 14:13:59 ERROR "Assertion ERROR on
(resource_hash.insert( ResourceName, ResourceAd ) == 0)" at line 785
in file Accountant.cpp
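
For anyone triaging a log like the one above, here is a minimal Python sketch that counts daemon startups and negotiation cycles and pulls out the ERROR entries, rejoining the lines that were wrapped in the paste. The function name summarize_negotiator_log and the SAMPLE text are illustrative assumptions, not part of Condor; the script only reads the timestamped text format shown in this message and says nothing about the cause of the assertion.

```python
import re

# A timestamped log line looks like "6/22 14:13:34 <message>".
TIMESTAMP = re.compile(r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) (.*)$")


def summarize_negotiator_log(text):
    """Return (startups, cycles_started, cycles_finished, errors).

    errors is a list of (timestamp, message) pairs. Lines without a
    leading timestamp are treated as wrapped continuations and joined
    onto the preceding entry.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = TIMESTAMP.match(line)
        if match:
            entries.append([match.group(1), match.group(2)])
        elif entries:
            entries[-1][1] += " " + line

    startups = sum("STARTING UP" in msg for _, msg in entries)
    started = sum("Started Negotiation Cycle" in msg for _, msg in entries)
    finished = sum("Finished Negotiation Cycle" in msg for _, msg in entries)
    errors = [(ts, msg) for ts, msg in entries if msg.startswith("ERROR")]
    return startups, started, finished, errors


# Hypothetical sample: a few lines copied from the log in this message.
SAMPLE = """
6/22 14:13:33 ** condor_negotiator (CONDOR_NEGOTIATOR) STARTING UP
6/22 14:13:34 ---------- Started Negotiation Cycle ----------
6/22 14:13:34 ---------- Finished Negotiation Cycle ----------
6/22 14:13:59 ---------- Started Negotiation Cycle ----------
6/22 14:13:59 Phase 2: Performing accounting ...
6/22 14:13:59 ERROR "Assertion ERROR on
(resource_hash.insert( ResourceName, ResourceAd ) == 0)" at line 785
in file Accountant.cpp
"""

if __name__ == "__main__":
    startups, started, finished, errors = summarize_negotiator_log(SAMPLE)
    print(f"startups={startups} started={started} finished={finished}")
    for ts, msg in errors:
        print(ts, msg)
```

A cycle that starts without a matching "Finished" line, followed by an ERROR entry, marks where the daemon died; in the excerpt above that is the second cycle, during Phase 2.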