
Re: [HTCondor-users] Code getting empty hostname when executed from Condor on certain pool machines



One handy tool for investigating this puzzle is the -interactive option of condor_submit. Submitting your job with it gives you an ssh session with the exact environment under which HTCondor would run the job (environment variables, whether the job is placed in a container, etc.). You can run the application interactively there, along with various hostname-related commands, to find out why the machines behave differently.
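
As a rough sketch of the kind of hostname-related checks that could be run in that session, the script below prints several places a program might take the host and user name from. The script and its function name are illustrative assumptions, not part of HTCondor, and the third-party code may well use a different mechanism than any of these.

```python
#!/usr/bin/env python3
"""Print hostname- and user-related values visible to this process."""
import getpass
import os
import platform
import socket


def collect_host_info():
    """Return a dict of values a program might use as host or user name."""
    info = {
        "socket.gethostname": socket.gethostname(),
        "platform.node": platform.node(),
        # Environment variables are often missing or different under a batch system.
        "env HOSTNAME": os.environ.get("HOSTNAME"),
        "env USER": os.environ.get("USER"),
        "env LOGNAME": os.environ.get("LOGNAME"),
    }
    try:
        info["getpass.getuser"] = getpass.getuser()
    except Exception as exc:  # the lookup can fail when no user name is resolvable
        info["getpass.getuser"] = f"<error: {exc}>"
    return info


if __name__ == "__main__":
    for key, value in collect_host_info().items():
        print(f"{key} = {value!r}")
```

Running it once from a normal login shell and once inside the interactive job on the same machine, then comparing the two outputs, would show which of these values differ.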

 - Jaime

On Feb 9, 2026, at 2:34 PM, Vechinski, Douglas <douglas.vechinski@xxxxxxxxxx> wrote:

I'm having a weird issue when trying to run a particular code through Condor. The short version: when this code is submitted through Condor and runs on a particular machine, it appears not to get the hostname of the machine it is running on (and thus bombs), but when the code is run interactively on the same machine it works fine.
 
Here is more detailed information. We are in the process of standing up a new cluster in our move from RHEL 7 to RHEL 8. We have been using Condor 8.8 on RHEL 7 and are using Condor 25.4 on the new RHEL 8 system. Currently, this pool consists of 3 machines, which I'll refer to as srv0, srv1, and dsktop1. Srv0 is the collector and negotiator and also has schedd and startd running. Srv1 is just a worker, so only startd. Dsktop1 is a submitter and worker, so schedd and startd. This pool seems to be working for the most part in test runs of various codes we use. Srv0 and srv1 are on the same switch; dsktop1 is another hop or two away on the network.
 
So I started to try some tests with another code that we use (a third-party code that we are not able to modify). It appeared to start but then would almost immediately die complaining about an End-of-File. From what I've been able to gather, when this code starts, it creates a temporary file and writes the current date, the hostname of the machine it is running on, and the username running the code into this file. Then, for some reason, it turns around and reads these values back in, deletes the temporary file, and starts running the input. In the standard output produced before the code bombs, it prints the date, then prints the hostname, but the value shown is my username; it then tries to print the username and bombs at this point. The temporary file is still present and contains only 2 lines: a date and my username.
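
The symptom described above (username shown where the hostname should be, then End-of-File) is what one would expect if the hostname line is never written and the values are read back in order. The following is only a sketch of that reasoning, not the third-party code's actual logic; the function names and sample values ("alice", the date string) are hypothetical.

```python
import io


def write_header(date, hostname, username):
    """Mimic the observed temp file: an empty value produces no line at all."""
    return "".join(f"{value}\n" for value in (date, hostname, username) if value)


def read_header(text):
    """Read three values in order; raise EOFError when a line is missing."""
    stream = io.StringIO(text)
    values = {}
    for field in ("date", "hostname", "username"):
        line = stream.readline()
        if not line:
            raise EOFError(f"end of file while reading {field}; read so far: {values}")
        values[field] = line.rstrip("\n")
    return values


if __name__ == "__main__":
    # Normal case: all three values are present and read back correctly.
    print(read_header(write_header("2026-02-09", "dsktop1.widgets.net", "alice")))
    # Empty hostname: the username shifts into the hostname slot, then EOF.
    try:
        read_header(write_header("2026-02-09", "", "alice"))
    except EOFError as exc:
        print(exc)
```

With an empty hostname the file has only two lines, so the second read returns the username and the third read hits end of file, matching the two-line temporary file left behind.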
 
The Condor input file is being submitted from dsktop1, and the machine that Condor keeps trying to run on is dsktop1. I modified the Condor script with requirements = (Target.Machine == "srv0.widgets.net") and things run fine, as they do if I use srv1 as the target machine. Furthermore, if I run the script that I use for running this code interactively, it works on all 3 machines with no problem. This kickoff script runs hostname and uname -a to write to a file the name of the machine it is about to run on, and both return the correct machine name even when run through Condor. The code itself, however, appears to get no value when it is submitted through Condor and runs on the dsktop1 machine.
 
I tried another simpler test. I created a Condor input file similar to this:
 
executable = /bin/hostname
output = hostname.out
error = hostname.err
log = hostname.log
requirements = (Target.Machine == "dsktop1.widgets.net")
queue
 
When I submit this to Condor, the resulting hostname.out file is empty. However, if I change the Target.Machine to either srv0 or srv1, I appropriately get srv0.widgets.net and srv1.widgets.net, respectively.
 
There is another code we use whose output also includes the hostname of the machine it ran on. It runs fine on all 3 systems when executed through Condor and shows the name dsktop1.widgets.net when it runs on dsktop1. This code is a C++ code, while the code I am having trouble with is, I believe, a Fortran code.
 
We actually ran into this issue with this code on the RHEL 7 system (Condor 8.8) as well. A different user had this issue when trying to run the code. However, when I ran it, I did not, and I had jobs running on all the various machines in that pool.
