Hi group,

First of all, my setup is:

HTCondor-CE + Batch
$CondorVersion: 8.8.5 Sep 04 2019 BuildID: 480168 PackageID: 8.8.5-1 $
$CondorPlatform: x86_64_RedHat7
25 worker nodes with 40 cores each (1000 cores in total)

--------

A lot of jobs are going into the held state with this message: "held job due to expired user proxy".
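
For reference, here is a minimal sketch of how the held jobs could be counted per hold reason, to confirm that the expired-proxy message accounts for most of them. It assumes that condor_ce_q accepts the same -allusers, -constraint and -af options as condor_q. The function names tally_hold_reasons and fetch_hold_reasons are hypothetical helpers, not part of HTCondor.

```python
import subprocess
from collections import Counter


def tally_hold_reasons(lines):
    """Count how often each hold reason appears, ignoring blank lines."""
    return Counter(line.strip() for line in lines if line.strip())


def fetch_hold_reasons():
    """Ask the CE schedd for the HoldReason of every held job.

    Assumption: condor_ce_q passes these options through to condor_q.
    JobStatus == 5 is the held state; -af prints one attribute value per job.
    """
    cmd = [
        "condor_ce_q", "-allusers",
        "-constraint", "JobStatus == 5",
        "-af", "HoldReason",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


if __name__ == "__main__":
    # Print the most frequent hold reasons first.
    for reason, count in tally_hold_reasons(fetch_hold_reasons()).most_common():
        print(f"{count:6d}  {reason}")
```

The counting is kept separate from the condor_ce_q call so that it can also be run on saved output when the CE itself is not answering queries.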
Sometimes I have more than 3k held jobs, and this makes my CE unavailable until I remove them from the queue:

condor_ce_q
-- Failed to fetch ads from: <10.10.0.10:17685> : <myce-fqdn>
SECMAN:2007:Failed to end classad message.

----------

Attached is a graph of my CE, built from the output of condor_ce_q and condor_status --total.

As you can see, it fails from time to time. Where should I start debugging?

Regards,
EJ
Attachment:
Screen Shot 2020-08-05 at 21.25.33.png
Description: PNG image