
Re: [HTCondor-users] jobs with undefined JobStatus in limbo



ah - it worked! :)

I just tried as root on the schedd and was able to delete the job after
changing the ad [1]

Cheers and many thanks,
 Thomas

[1]
> condor_qedit   10738840.0 JobStatus=5
Set attribute "JobStatus" for 1 matching jobs.

> condor_q -l  10738840.0
JobStatus = 5

> condor_rm  10738840.0
Job 10738840.0 marked for removal
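
In case somebody has more than a handful of such jobs: the steps in [1] could be scripted. Below is a minimal sketch in Python, not something that was run on our schedd. It only parses `condor_q -l` style text and prints the commands to run; it does not call HTCondor itself. The helper names (`parse_ad`, `repair_commands`) are made up for illustration. The status table lists the standard JobStatus codes, where 5 is "Held", which is why condor_rm accepts the job after the edit.

```python
# Standard HTCondor JobStatus codes.
JOB_STATUS_NAMES = {
    1: "Idle",
    2: "Running",
    3: "Removed",
    4: "Completed",
    5: "Held",
    6: "Transferring Output",
    7: "Suspended",
}


def parse_ad(text):
    """Parse 'Name = value' lines of `condor_q -l` style output into a dict.

    Hypothetical helper: values are kept as raw strings, and lines without
    an '=' (such as '...') are skipped.
    """
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            ad[name.strip()] = value.strip()
    return ad


def repair_commands(job_id, ad, new_status=5):
    """Return the commands from [1] if JobStatus is missing or undefined.

    Hypothetical helper: it only builds command strings, so they can be
    reviewed before being run as the job's Owner or a queue super user.
    """
    if ad.get("JobStatus", "undefined").lower() != "undefined":
        return []  # the job has a usable status, leave it alone
    return [
        f"condor_qedit {job_id} JobStatus={new_status}",
        f"condor_rm {job_id}",
    ]


if __name__ == "__main__":
    sample = "JobStatus = undefined\nLastJobStatus = 1\n"
    for cmd in repair_commands("10738840.0", parse_ad(sample)):
        print(cmd)
```

Printing the commands instead of executing them is deliberate, since editing the job queue by hand is the kind of surgery one wants to double-check first.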


On 22/07/2020 16.18, John M Knoeller wrote:
> Hi Thomas,  
> 
> A reboot of the machine is not likely to help here; what is needed is some surgery on the job_queue.
> 
> You might try using condor_qedit to give the jobs a JobStatus attribute, so you can then remove them using condor_rm. 
> 
> condor_qedit 10738840.0 JobStatus=5
> 
> This command will need to be run by the job's Owner or by one of the QUEUE_SUPER_USERS. Running
> the command as root will probably work, depending on your config.
> 
> hope this helps
> -tj
> 
> -----Original Message-----
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Thomas Hartmann
> Sent: Wednesday, July 22, 2020 8:00 AM
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: [HTCondor-users] jobs with undefined JobStatus in limbo
> 
> Hi all,
> 
> we have a number of jobs where the JobStatus is undefined, as in [1].
> 
> These jobs were apparently submitted during a short window in which we had
> deployed a broken config (nothing 'major', merely a broken bracket). The
> jobs might be late materialization jobs, but not necessarily, AFAIS.
> 
> The thing is that we cannot remove or release them from their undefined
> limbo, as the schedd(?) seems not to know about them at that point. In the
> logs, only a job transform for the schedd mentions the job ID [2].
> A restart of the daemons has not affected these jobs.
> 
> Next step would be a reboot of the machine, but maybe somebody has an
> idea how to get rid of these jobs?
> 
> Cheers,
>   Thomas
> 
> [1]
>> condor_q -l  10738840.0
> ...
> JobStatus = undefined
> LastJobStatus = 1
> ...
> 
> [2]
> /var/log/condor/SchedLog:07/22/20 12:14:33 (pid:528138) job_transforms
> for 10738840.0: 12 considered, 10 applied
> (T01SysDefaultProject,T02JobDefaults,T03JobValues,T04JobEnhance,T05JobClasses,T07AccountingStatusHold,T08DefaultToOS,T10BirdResource,T11ShellEnvironment,T12JobHistory)
> 
> 
