Hi all,
I see lots of jobs running with duplicate job IDs. At the
time of writing, there are almost 700 of them:
[root@serv07 ~]# condor_history | awk '{ print $1 }' | sort | uniq -d | wc -l
684
and the number is growing every hour, which is making it very hard
for us to debug some of the issues we have here.
Is it a bug?
Not really. Condor doesn't guarantee that cluster IDs will stay unique for a scheduler for all time. If you delete the $(SPOOL) directory, or even just the job_queue.log file, for a scheduler, its cluster IDs are reset.
So the first question is:
Did you delete the $(SPOOL) directory for the scheduler, the contents of that directory, or the job_queue.log file? If so, you reset the cluster ID counter, and that's why you've got duplicates.
If you're certain you haven't wiped the job_queue.log file for the scheduler, is it possible you have multiple schedulers writing to the same history file? If so: that's bad. Each scheduler should have its own history file.
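To tell the two cases apart, it may help to list which IDs repeat and how often, not just how many duplicates there are. Here is a rough sketch along the lines of the pipeline above. The function name dup_ids is made up for illustration, and it assumes the job ID is the first column and that the header line starts with "ID":

```shell
# dup_ids (hypothetical helper): read condor_history-style lines on stdin
# and print "<count> <id>" for every job ID that appears more than once.
# Assumes the ID is the first whitespace-separated column and that the
# header row has "ID" in that column.
dup_ids() {
    awk 'NF && $1 != "ID" { seen[$1]++ }
         END { for (id in seen) if (seen[id] > 1) print seen[id], id }' |
        sort -k2,2
}

# Usage, assuming condor_history is on the PATH:
#   condor_history | dup_ids
```

You can then look at one of the repeated IDs with condor_history -l <id> and compare the entries. Running condor_config_val HISTORY on each scheduler host shows which history file each one is configured to write.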
- Ian