|
I'm having a weird issue when trying to run a particular code through Condor.

The short version: when this code is submitted through Condor and runs on one particular machine, it appears not to get the hostname of the machine it is running on (and thus bombs), but when the code is run interactively on the same machine it works fine.

Here is more detailed information. We are in the process of standing up a new cluster as part of our move from RHEL 7 to RHEL 8. We have been using Condor 8.8 on the RHEL 7 system and are using Condor 25.4 on the new RHEL 8 system. Currently,
this pool consists of 3 machines, which I'll refer to as srv0, srv1, and dsktop1. srv0 is the collector and negotiator and also has schedd and startd running. srv1 is just a worker, so it runs only startd. dsktop1 is a submitter and worker, so it runs schedd and startd. So far, this pool seems to be working for the most part in test runs of the various codes we use. srv0 and srv1 are on the same switch; dsktop1 is another hop or two away on the network.

I then started testing another code that we use (a third-party code that we are not able to modify). It appeared to start but would then almost immediately die, complaining about an End-of-File. From what
I've been able to gather, when this code starts, it appears to create a temporary file and write the current date, the hostname of the machine it is running on, and the username running the code into this file. For some reason it then turns around and attempts to read these values back in, deletes the temporary file, and starts running the input. The standard output produced before the code bombs shows the date, then the hostname (but the value printed appears to be my username), and then the code bombs when it tries to print the username. The temporary file is still present and contains only 2 lines: a date and my username.

The Condor submit file is being submitted from dsktop1, and the machine that Condor keeps trying to run the job on is dsktop1. I modified the submit file with requirements = (Target.Machine == "srv0.widgets.net") and things run
fine, as they do if I use srv1 as the target machine. Furthermore, if I run the script that I use for running this code interactively, it works on all 3 machines with no problem. This kickoff script runs hostname and uname -a to write to a file the name of the machine it is about to run on, and both return the correct machine name even when run through Condor. The code itself, however, appears to get no value when it is submitted through Condor and runs on the dsktop1 machine.

I tried another, simpler test. I created a Condor submit file similar to this:

    executable   = /bin/hostname
    output       = hostname.out
    error        = hostname.err
    log          = hostname.log
    requirements = (Target.Machine == "dsktop1.widgets.net")
    queue

When I submit this to Condor, the resulting hostname.out file is empty. However, if I change Target.Machine to either srv0 or srv1, I correctly get srv0.widgets.net and srv1.widgets.net, respectively.
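
To narrow this down further, a small diagnostic along the following lines could be submitted with the same requirements expression and compared against an interactive run on dsktop1. This is only a sketch and is not taken from the failing code: the script name identity_check.py and the function collect_identity are made up for illustration, and it assumes python3 is available on the execute machines. It reports the hostname and username through several independent mechanisms, since a batch job's environment can differ from that of an interactive shell.

```python
#!/usr/bin/env python3
"""identity_check.py (hypothetical name): report hostname and username
as seen through several independent mechanisms."""
import getpass
import os
import platform
import socket


def collect_identity():
    """Return a dict mapping each lookup mechanism to the value it reports."""
    info = {
        # Kernel-reported node name, obtained three different ways.
        "socket.gethostname": socket.gethostname(),
        "os.uname.nodename": os.uname().nodename,
        "platform.node": platform.node(),
        # Environment variables that may or may not be set inside a job.
        "env HOSTNAME": os.environ.get("HOSTNAME", "<unset>"),
        "env USER": os.environ.get("USER", "<unset>"),
        "env LOGNAME": os.environ.get("LOGNAME", "<unset>"),
        # Numeric uid the process is actually running as.
        "uid": str(os.getuid()),
    }
    try:
        # Checks environment variables first, then the password database.
        info["getpass.getuser"] = getpass.getuser()
    except Exception as exc:  # the lookup can fail if the uid has no passwd entry
        info["getpass.getuser"] = f"<error: {exc}>"
    return info


if __name__ == "__main__":
    for key, value in collect_identity().items():
        print(f"{key:20s} = {value!r}")
```

Pointing executable at this script in the submit file above (instead of /bin/hostname) and comparing the resulting output file with the output of an interactive run would show which lookup, if any, comes back empty or unset on dsktop1.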
There is another code we use and its output also includes the hostname of the machine it ran on. It runs fine on all 3 systems when executed through Condor and shows the name dsktop1.widgets.net when it runs on dsktop1.
That code is C++, while the code I am having trouble with is, I believe, Fortran. We actually ran into this same issue with this code on the RHEL 7 system (Condor 8.8): a different user had this problem when trying to run the code. However, when I ran it, I did not, and I had jobs running on all
the various machines that were in that pool. |