It sounds as if a requirement of the job or the machine is not fulfilled - have you tried
Also, 'condor_q -better-analyze <job-id> -reverse -machine <FQDN of a possible worker node>' can be enlightening in similar situations ...
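For anyone who wants to script this check, here is a minimal sketch that builds the forward and the reverse analysis command lines and, if condor_q is on the PATH, runs them. The function names, the job id and the host name are placeholders of mine, not anything from the original post; only the condor_q options mentioned above are used.

```python
import shutil
import subprocess


def build_analyze_commands(job_id, machine_fqdn):
    """Return the forward and reverse condor_q analysis command lines."""
    # Forward analysis: why does the job not match the machines in the pool?
    forward = ["condor_q", "-better-analyze", job_id]
    # Reverse analysis: why does this particular machine not accept the job?
    reverse = ["condor_q", "-better-analyze", job_id,
               "-reverse", "-machine", machine_fqdn]
    return forward, reverse


def run_analysis(job_id, machine_fqdn):
    """Run both commands and return their output as one text block."""
    if shutil.which("condor_q") is None:
        raise RuntimeError("condor_q not found on PATH")
    outputs = []
    for cmd in build_analyze_commands(job_id, machine_fqdn):
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        outputs.append("$ " + " ".join(cmd) + "\n" + result.stdout + result.stderr)
    return "\n".join(outputs)
```

Calling run_analysis("12.0", "labpc01.example.org") on the submit machine (both values are made up) would collect the two reports in one place, which makes it easier to paste them into a reply.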
Hi,
I wanted to email my issue here in hopes that someone might have an idea on how to solve our problem.
We are trying to get a working HTCondor solution in place with a Linux VM as the Central Manager and our lab machines, which run Windows 11. We have done this in the past, but with a very old version of HTCondor (before 9.0.x) and with the insecure authentication configuration.
We have downloaded and installed one of the latest HTCondor versions (23.9.6). Initially we had issues authenticating from our Windows machines to our Linux CM with IDTOKENS, so we turned to SSL as the primary authentication method and were then able to access the CM from our lab machines.
Our latest (and hopefully final) issue is getting jobs to run on available machines instead of going immediately to "Idle". Even though there is an eligible machine that could run the job, the job stays in the "Idle" status until it is forcibly removed with "condor_rm". This happened consistently for weeks while I was working on it at the end of 2024.
I finally came back to the problem this month and reinstalled Condor with a newer version (the CM is still running the same 23.9.6). When I submitted a job, it actually worked and ran to completion! But then it resumed its old habit of going idle and not running, even though it was the exact same job and no changes to any configuration were made. I haven't been able to reproduce a successful submission since.
I tried uninstalling and reinstalling Condor, thinking that might be the link, but it still didn't run any jobs. I only have one machine in the pool as of now (besides the CM), but when I enter the "condor_q -better-analyze" command it shows that this machine is eligible.
I have all the logs and config files that will show this. I am hoping someone can help us solve this problem. Any help or advice on what we can try next would be greatly appreciated. I am attaching logs and command output that I think are relevant to this issue. If you'd like any more documentation, please let me know and I'll be happy to generate it.
Thank you so much for taking the time to read this.
Best regards,
Patrick
Here is a list of the file attachments and their significance to the issue (these are all client-side logs):
--------------------------------------------------------
File: log_directory_of_idle_submission.png >>> Screenshot of the log directory where I can't get a job to run. No "StarterLog.slotx_x" or "startd_history" files are ever created.
File: log_directory_of_successful_submission.png >>> Screenshot of the log directory where it successfully ran a submitted job.
File: commands_showing_idle_job.txt >>> Commands I issued before and after the job was submitted. "condor_q -better-analyze" shows there is an eligible machine.
File: SchedLog_with_idle_jobs.txt >>> SchedLog showing all jobs going idle.
File: SchedLog_with_successful_job.txt >>> SchedLog showing a successful job running. Timestamp at 03/06/25 14:41 is when it was submitted.
File: MasterLog_with_idle_jobs.txt >>> MasterLog showing all jobs going idle.
File: MasterLog_with_successful_job.txt >>> MasterLog showing a successful job running. Timestamp at 03/06/25 14:41 is when it was submitted.
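The difference between the two log-directory screenshots (whether any "StarterLog.slotx_x" or "startd_history" files exist) can also be checked with a few lines of Python. This is only a sketch: the function name is made up, the file name patterns are taken from the names quoted above, and the log directory path has to be supplied by whoever runs it.

```python
from pathlib import Path


def execution_evidence(log_dir):
    """List the files in log_dir that indicate a job actually started."""
    log_path = Path(log_dir)
    return {
        # Per-slot starter logs, named like StarterLog.slotx_x above
        "starter_logs": sorted(p.name for p in log_path.glob("StarterLog.slot*")),
        # startd history file(s), named startd_history above
        "startd_history": sorted(p.name for p in log_path.glob("startd_history*")),
    }


def job_ever_started(log_dir):
    """Return True if any starter log or startd history file exists."""
    evidence = execution_evidence(log_dir)
    return bool(evidence["starter_logs"] or evidence["startd_history"])
```

If job_ever_started() returns False for the log directory, that matches the idle case in the first screenshot: no starter log or startd history file was ever written there.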
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
Join us in June at Throughput Computing 25: https://osg-htc.org/htc25
The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/