Mailing List Archives
Re: [HTCondor-users] Condor runs immediately go to "Held" state
- Date: Tue, 02 Jan 2018 18:23:44 +0000
- From: Michael Pelletier <Michael.V.Pelletier@xxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Condor runs immediately go to "Held" state
Hey Nate,
This looks like a situation where the exec nodes don't have access to the same filesystems as the submit node. Make sure that /some/path is mounted and readable on all of the exec nodes, and I think your jobs will fire up successfully.
My guess is that your exec nodes are on a separate subnet shared with the central manager system, and everyone else is on a different subnet?
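One quick way to confirm the missing-mount theory is to run a small check on each exec node, either by logging in or by submitting it as a tiny test job. The script should run as the same user the jobs run as. The sketch below is hypothetical and is not an HTCondor tool. It uses os.access as a rough stand-in for what the starter needs in order to open the file as standard input. The path list is a placeholder taken from the log.

```python
import os
import socket


def missing_inputs(paths):
    """Return the subset of paths that are not readable regular files on this host."""
    missing = []
    for path in paths:
        # A missing mount shows up here the same way as in the shadow
        # exception: the path simply does not exist (errno 2, ENOENT).
        if not (os.path.isfile(path) and os.access(path, os.R_OK)):
            missing.append(path)
    return missing


if __name__ == "__main__":
    # Placeholder list: substitute the real input files your jobs use.
    paths = ["/some/path/filename.inp"]
    host = socket.gethostname()
    bad = missing_inputs(paths)
    if bad:
        for p in bad:
            print(f"{host}: cannot read {p}")
    else:
        print(f"{host}: all inputs readable")
```

If the submit node reports the file as readable and every exec node reports it as missing, the filesystem simply isn't shared with the exec nodes.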
Also, make sure that initialdir isn't being changed in the submit description. If the user submits from one directory but the job is assigned a different initial directory that doesn't contain the input file, that could cause a problem. I'd expect that to be caught at submit time, though, rather than at execution time as this error was.
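As a rough illustration of why initialdir matters: a relative input path is interpreted with respect to initialdir, while an absolute path is not affected by it. The helper below is a hypothetical approximation of that rule using os.path.join. It is not HTCondor's own code, and the directory /home/nate/run1 is made up for the example.

```python
import os.path


def resolve_input(initialdir, input_path):
    """Approximate where a job's 'input' file is looked for, given its initialdir."""
    # os.path.join drops initialdir when input_path is already absolute,
    # so only relative input paths depend on initialdir.
    return os.path.normpath(os.path.join(initialdir, input_path))


# Relative input: depends on initialdir.
print(resolve_input("/home/nate/run1", "filename.inp"))
# Absolute input: initialdir makes no difference.
print(resolve_input("/home/nate/run1", "/some/path/filename.inp"))
```

Since the error message names an absolute path, the more likely explanation is that /some/path is missing on the exec node, not that the job landed in the wrong directory.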
The alternative to a shared filesystem is to enable HTCondor's file transfer in the job submit descriptions, so that input files are copied to the exec node and output files are copied back.
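A minimal sketch of what that looks like: the Python helper below writes a submit description that turns on file transfer. should_transfer_files, when_to_transfer_output, transfer_input_files, input, output, error, log and queue are standard submit commands. The helper name and the executable run_model.sh are hypothetical. With transfer enabled, HTCondor should copy the executable and the file named by input into the job's scratch directory. Any other input files would need to be listed in transfer_input_files.

```python
from pathlib import Path


def write_submit_file(dest, executable, input_file, log="job.log"):
    """Write a minimal HTCondor submit description that relies on file transfer."""
    lines = [
        f"executable = {executable}",
        # Read on the submit node and transferred to the exec node,
        # so it no longer needs to exist on a shared filesystem.
        f"input = {input_file}",
        "output = job.out",
        "error = job.err",
        f"log = {log}",
        "should_transfer_files = YES",
        # Copy job.out and job.err back to the submit side when the job exits.
        "when_to_transfer_output = ON_EXIT",
        # Extra inputs would be listed here, comma-separated (placeholder).
        "# transfer_input_files = other.dat",
        "queue",
    ]
    path = Path(dest)
    path.write_text("\n".join(lines) + "\n")
    return path.read_text()
```

For example, write_submit_file("job.sub", "run_model.sh", "/some/path/filename.inp") followed by condor_submit job.sub. The input file only has to be readable on the submit node, where it evidently already exists.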
-Michael Pelletier.
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Mobley, Nate (Millennium)
Sent: Tuesday, January 2, 2018 9:33 AM
To: 'htcondor-admin@xxxxxxxxxxx' <htcondor-admin@xxxxxxxxxxx>; 'htcondor-users@xxxxxxxxxxx' <htcondor-users@xxxxxxxxxxx>
Subject: [External] Re: [HTCondor-users] Condor runs immediately go to "Held" state
Please advise; I still need assistance with this, as my customer is under a deadline. Thank you. Below is a sample of the log file after trying a run:
"0 - Run Bytes Received By Job
...
007 (1863.018.000) 12/27 09:30:12 Shadow exception!
Error from slot19@xxxxxxxxxxxxxxxxx: Failed to open '/some/path/filename.inp' as standard input: No such file or directory (errno 2)"
Some context: I have one head node (this is what we log into to submit runs) that is running RHEL6, and 9 compute nodes.
Thanks for any assistance you can provide.