It sounds as if a requirement of the job or the machine is
not fulfilled - have you tried
Also, 'condor_q -better-analyze <job-id> -reverse -machine
<FQDN of a possible worker node>' can be
enlightening in similar situations ...
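As a minimal sketch of the reverse analysis suggested above, the following Python helper assembles that command line and, optionally, runs it. The function names, the job ID "12.0", and the host "worker01.example.org" are placeholders invented for illustration; only the condor_q options themselves come from the suggestion above.

```python
from __future__ import annotations

import shlex
import subprocess


def reverse_analyze_cmd(job_id: str, machine: str) -> list[str]:
    """Build the argument list for a machine-side (reverse) match analysis."""
    return ["condor_q", "-better-analyze", job_id, "-reverse", "-machine", machine]


def run_reverse_analyze(job_id: str, machine: str) -> str:
    """Run the command and return its combined output.

    Requires condor_q on PATH, so run it on a host with HTCondor installed.
    """
    result = subprocess.run(
        reverse_analyze_cmd(job_id, machine),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout + result.stderr


if __name__ == "__main__":
    # Placeholder values: substitute a real job ID and worker node FQDN.
    print(shlex.join(reverse_analyze_cmd("12.0", "worker01.example.org")))
```

Running the script only prints the command; calling run_reverse_analyze() executes it and returns whatever condor_q reports from the machine's point of view.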
Hi,
I wanted to email my issue here in hopes that someone
might have an idea on how to solve our problem.
We are trying to get a working HTCondor solution in place
with a Linux VM as the Central Manager and our lab machines,
which run Windows 11. We have done this in the past, but it
was with a very old version of HTCondor (before 9.0.x) and we
were using the insecure authentication settings.
We have downloaded and installed one of the latest
HTCondor software versions (23.9.6). Initially we had issues
trying to authenticate from our Windows machines to our
Linux CM with IDTOKENS, so we turned to SSL as the primary
authentication and were able to access the CM from our lab
machines.
Our latest (and hopefully final) issue is trying to get
jobs to run on available machines instead of going immediately
to "Idle". For some reason, even though there is an eligible
machine that can be used to run the job, the job gets placed in
the "Idle" status until it is force removed with "condor_rm".
This had been happening consistently for weeks while I was
working on it at the end of 2024.
I finally came back to the problem this month and
reinstalled Condor with a newer version (the CM is still
running the same 23.9.6). When I submitted a job it actually
worked and ran to completion! But then it resumed its old
habit of going idle and not running, even though it was the
exact same job and no changes to any configurations were made.
I haven't been able to recreate a successful submission again.
I tried uninstalling and reinstalling Condor, thinking that
might be the link, but it still didn't run any jobs. I only
have one machine (besides the CM) in the pool as of now, but
when I enter the "condor_q -better-analyze" command it shows
that that machine is eligible.
I have all the logs and config files that will show this. I am
hoping someone can help us solve this problem. Any help or
advice on what we can try next would be greatly appreciated. I
am attaching logs and command output that I think are relevant
for this issue. If you'd like any more documentation, please
let me know and I'll be happy to generate it.
Thank you so much for taking the time to read this.
Best regards,
Patrick
Here is a list of file attachments and their significance to
the issue (these are all client-side logs):
--------------------------------------------------------
File)) log_directory_of_idle_submission.png >>>
Screenshot of the log directory where I can't get a job to
run. No "StarterLog.slotx_x" or "startd_history" files are
ever created.
File)) log_directory_of_successful_submission.png >>>
Screenshot of the log directory where it successfully ran a
submitted job.
File)) commands_showing_idle_job.txt >>>
Commands I issued before and after the job was submitted.
"condor_q -better-analyze" shows there is an eligible
machine.
File)) SchedLog_with_idle_jobs.txt >>>
SchedLog showing all jobs going idle.
File)) SchedLog_with_successful_job.txt >>>
SchedLog showing a successful job running. Timestamp at
03/06/25 14:41 is when it was submitted.
File)) MasterLog_with_idle_jobs.txt >>>
MasterLog showing all jobs going idle.
File)) MasterLog_with_successful_job.txt >>>
MasterLog showing a successful job running. Timestamp at
03/06/25 14:41 is when it was submitted.
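
The difference between the two log directory screenshots (whether any "StarterLog.slotx_x" or "startd_history" files exist) can also be checked without a screenshot. The sketch below is only an illustration: the function name is made up, the file name patterns are taken from the descriptions above, and the log directory path has to be supplied by the caller (it is assumed to be whatever "condor_config_val LOG" reports on that machine).

```python
from pathlib import Path


def find_execute_side_logs(log_dir):
    """Return the sorted names of StarterLog* and startd_history* files in log_dir."""
    directory = Path(log_dir)
    # Patterns assumed from the file names mentioned in the attachment list.
    patterns = ("StarterLog*", "startd_history*")
    found = []
    for pattern in patterns:
        found.extend(p.name for p in directory.glob(pattern))
    return sorted(found)
```

An empty list corresponds to the idle case described above, where no starter log or startd history file was ever written.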