Re: [Condor-users] duplicate jobIDs in the condor_history

Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

Date: Wed, 24 Nov 2010 16:39:48 -0500

From: Ian Chesal <ichesal@xxxxxxxxxxxxxxxxxx>

Subject: Re: [Condor-users] duplicate jobIDs in the condor_history

Hi all,

I see lots and lots of jobs running with duplicate jobIDs. At the time of writing, there are almost 700 of them:
[root@serv07 ~]# condor_history | awk '{ print $1 }' | sort | uniq -d | wc -l
684
and the number grows every hour, which makes it very hard to debug some of the issues we have here.
Is it a bug?
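The one-liner above only counts the duplicated IDs. To see which IDs repeat and how often, something like the following may help. It is a minimal sketch, assuming the job ID is the first whitespace-separated field of the condor_history output and has the form cluster.proc; the function name dup_job_ids is hypothetical.

```shell
# Hypothetical helper: read condor_history-style output on stdin and print
# every job ID that occurs more than once, prefixed by its occurrence count,
# most frequent first.
# Assumption: the job ID is the first field and looks like "cluster.proc";
# anything else in the first field (e.g. the "ID" header) is skipped.
dup_job_ids() {
    awk '$1 ~ /^[0-9]+\.[0-9]+$/ { print $1 }' \
        | sort \
        | uniq -c \
        | awk '$1 > 1 { print $1, $2 }' \
        | sort -k1,1nr -k2,2
}

# Usage (assumes condor_history is on PATH):
#   condor_history | dup_job_ids | head
```

IDs that appear exactly twice would be consistent with a single counter reset, while higher counts would suggest repeated resets or several writers to one history file.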

Not really. Condor doesn't guarantee that cluster IDs will be unique for a scheduler for all time. If you delete the $(SPOOL) directory, or even just the job_queue.log file, for a scheduler, its cluster IDs will be reset.

So the first question is:

Did you delete the $(SPOOL) directory for the scheduler, the contents of that directory, or the job_queue.log file? If so, you reset the cluster ID counter, and that's why you've got duplicates.

If you're certain you haven't wiped the job_queue.log file for the scheduler, is it possible you have multiple schedulers writing to the same history file? If so: that's bad. Each scheduler should have its own history file.
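One way to check both questions is to print where each scheduler's spool and history file actually live. This is a rough sketch, assuming condor_config_val is available on each submit host and that the SPOOL and HISTORY settings are the relevant ones for your setup; the function name show_schedd_paths is hypothetical.

```shell
# Hypothetical helper: print the SPOOL and HISTORY paths that the local
# Condor configuration resolves to.
# Run it on every submit host and compare the results: two schedulers that
# report the same HISTORY path (for example on a shared filesystem) would be
# writing to one history file.
show_schedd_paths() {
    for var in SPOOL HISTORY; do
        printf '%s=%s\n' "$var" "$(condor_config_val "$var")"
    done
}

# Usage (on each submit host):
#   show_schedd_paths
```

If the HISTORY paths differ per scheduler, the modification time of job_queue.log under the reported SPOOL directory is the next thing worth looking at.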

- Ian
