Dear all,
    We have found a problem with the GridManager when dealing with
    certain jobs that our HTCondor setup forwards to remote (Grid)
    resources.
    
    
    It's not totally clear what happened, but we would argue that the
    GridManager behaviour could be improved. The summary is the
    following:

      In certain circumstances (infrequent, but not very rare, in our
      experience), a job causes the GridManager to hit an assertion
      failure, so it exits without processing the rest of its jobs,
      which may be perfectly healthy.
    
    We consider this a bug, or at least a problem, because a single
    problematic job prevents other jobs from proceeding and eventually
    causes all remote activity to halt. In our experience, recovering
    from this requires manual clean-up.
    We would rather have (if possible) those problematic jobs ignored
    (i.e., left on hold), and have the GridManager proceed with the
    other jobs.
    
    
    The assertion error looks like the following:

      ERROR "Assertion ERROR on (gahp != __null || gmState == 14 || gmState == 12)" at line 391 in file /slots/10/dir_3391896/userdir/.tmpa8mXuo/BUILD/condor-8.8.1/src/condor_gridmanager/condorjob.cpp
    
    Looking at the code (https://github.com/htcondor/htcondor/blob/V8_8_1-branch/src/condor_gridmanager/condorjob.cpp),
      I see that it happens within CondorJob::doEvaluateState: 
    
      ASSERT ( gahp != NULL || gmState == GM_HOLD || gmState == GM_DELETE );
    
    
    
    For those interested, let me give more details by illustrating
      this with the most recent example of this error:
    
      - A job is received by our HTCondor-CE (v3.2.1) and reaches our
        HTCondor batch (v8.8.1): CE ID 2302737, batch ID 2314592.

      - The job is routed to Universe 9 (Grid), with GridResource set
        to a remote HTCondor-CE resource, and starts running there.

      - Eventually, the remote job fails due to memory constraints.
        This is noticed in the local schedd log:
    
    
      
        07/27/19 09:39:57 (cid:12488566) Set Attribute for job 2302737.0, HoldReason = "Error from slot1_3@xxxxxxxxxxxx: Job has gone over memory limit of 2048 megabytes. Peak usage: 42532 megabytes."
      
    
    
      - This causes the local CE job to be removed, but the local batch
        job remains. However, the physical spool directory for the job
        files is the same for both jobs (at least in our configuration),
        and it gets removed when the CE job is deleted.

      - In the GridManager log, we see that the job went to GM_SUBMITTED
        and then moved to GM_CLEAR_REQUEST (instead of GM_HOLD, as
        happens in other cases), causing the assertion error:
    
    
      
        07/27/19 09:39:50 [1629941] (2314592.0) doEvaluateState called: gmState GM_SUBMITTED, remoteState 2
        [...]
        07/27/19 10:12:19 [1629941] Failed to get expiration time of proxy /var/lib/condor-ce/spool/2737/0/cluster2302737.proc0.subproc0/credential_CMSG-v1_0.main_411868: unable to read proxy file
        07/27/19 10:12:19 [1629941] Found job 2314592.0 --- inserting
        07/27/19 10:12:19 [1629941] (2314592.0) doEvaluateState called: gmState GM_CLEAR_REQUEST, remoteState -1
        07/27/19 10:12:19 [1629941] ERROR "Assertion ERROR on (gahp != __null || gmState == 14 || gmState == 12)" at line 391 in file /slots/10/dir_3391896/userdir/.tmpa8mXuo/BUILD/condor-8.8.1/src/condor_gridmanager/condorjob.cpp
      
    
    
      - One can also notice that the job spool directory (CE ID 2302737)
        is gone (see the 'Failed to get expiration time of proxy' line).

      - From that moment on, each time the GridManager wakes up, it hits
        the same assertion error and exits, failing to handle any of the
        other jobs.

      - This can be cleaned up with 'condor_rm -forcex'.
    
    
    
    So there are two issues here: one is why the job gets into that
    strange state in the first place; the second (more important, IMHO)
    is whether the GridManager should skip that problematic job but
    continue with the rest, instead of dying completely. A sketch of the
    kind of behaviour we have in mind follows.
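    To illustrate the idea (this is only a minimal standalone sketch
    with invented names such as evaluateState, Job and hasGahp, not the
    actual GridManager code or a proposed patch): where the current code
    asserts when a job has no GAHP handle and is not in GM_HOLD or
    GM_DELETE, the per-job evaluation could instead park that one job on
    hold and let the loop continue with the remaining jobs.

    // Minimal standalone sketch (hypothetical names, NOT the real
    // GridManager code): hold the job whose GAHP handle is missing,
    // instead of asserting and killing the whole process.
    #include <cstdio>
    #include <string>
    #include <vector>

    enum GmState { GM_SUBMITTED, GM_CLEAR_REQUEST, GM_HOLD, GM_DELETE };

    struct Job {
        std::string id;
        GmState gmState;
        bool hasGahp;            // stands in for "gahp != NULL"
        std::string holdReason;
    };

    // Per-job evaluation: returns false if the job had to be parked on hold.
    bool evaluateState(Job &job) {
        if (!job.hasGahp && job.gmState != GM_HOLD && job.gmState != GM_DELETE) {
            // Where the current code does
            //   ASSERT( gahp != NULL || gmState == GM_HOLD || gmState == GM_DELETE );
            // we would prefer to park only this job and keep going.
            job.gmState = GM_HOLD;
            job.holdReason = "GridManager: no GAHP handle available for this job";
            return false;
        }
        // ... normal state-machine handling would go here ...
        return true;
    }

    int main() {
        std::vector<Job> jobs = {
            {"2314592.0", GM_CLEAR_REQUEST, false, ""},  // the broken job
            {"2314593.0", GM_SUBMITTED,     true,  ""},  // a healthy job
        };
        for (Job &j : jobs) {
            if (!evaluateState(j)) {
                std::printf("job %s put on hold: %s\n", j.id.c_str(), j.holdReason.c_str());
                continue;  // the other jobs still get processed
            }
            std::printf("job %s processed normally\n", j.id.c_str());
        }
        return 0;
    }

    Whether the right recovery is a hold, or simply retrying on the next
    evaluation cycle, is of course up to the developers; the point is
    only that one broken job should not take down the processing of all
    the others.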
    We are aware that our configuration may be uncommon, but we would be
    grateful if this could be avoided somehow (a configuration change on
    our side, or perhaps a code fix from the developers).
    Any comments are welcome :-)
    Cheers,
      Antonio