Make sure all of the machines are in the Unclaimed state in condor_status, not Owner. If they're in the Owner state, they won't accept jobs.
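The State column of condor_status shows this directly. A sketch of what to look for (the slot names and values below are illustrative, not taken from Cody's pool):

  $ condor_status
  Name          OpSys  Arch    State      Activity  LoadAv  Mem
  slot1@node1   OSX    X86_64  Unclaimed  Idle      0.000   2048
  slot1@node2   OSX    X86_64  Owner      Idle      0.450   2048

Here node1 would accept jobs, while node2 is reserved for its owner.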
Then, submit a new set of jobs and run condor_q -analyze <job id>, using the id of one of the idle jobs. That may provide some information about why the jobs aren't running on the other machines.
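A sketch of that invocation (the job id 42.0 is a placeholder for one of your idle jobs):

  $ condor_q -analyze 42.0

The output typically summarizes how many machines in the pool match the job's requirements, reject it because of their own requirements, or are busy serving other users.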
-- Jaime
Jaime, my submit file is:

  Executable = PQL
  Universe = vanilla
  Output = pql.out
  Log = pql.log
  Error = pql.err
  Arguments = -p params.in -t temps.in
  notification = Error
  notify_user = codytrey@xxxxxxxx
  should_transfer_files = YES
  Queue 20

I have it queue 20 jobs to see if that would force jobs onto other machines once the submit node had all its processors in use, but it just ran 4 at a time until the queue was complete.

Same results with:

  Executable = test.py
  Universe = vanilla
  Output = /Volumes/Scratch/test/test.out.$(Process)
  Log = /Volumes/Scratch/test/test.log
  Error = /Volumes/Scratch/test/test.err
  should_transfer_files = ALWAYS
  Queue 10

-Cody
On 2013-02-26 10:29, Jaime Frey wrote:
What does your submit file look like?
A common problem is that the machines don't have a shared filesystem, and HTCondor's file transfer option isn't being requested in the submit file. In this case, HTCondor will only run the jobs on the submit machine.
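A minimal sketch of a vanilla-universe submit file that does request file transfer (the executable and file names here are placeholders, not Cody's actual job):

  Executable              = my_job
  Universe                = vanilla
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT
  transfer_input_files    = input.dat
  Output                  = my_job.out
  Error                   = my_job.err
  Log                     = my_job.log
  Queue

With this in place, HTCondor can run the job on machines that don't share a filesystem with the submit node, shipping the executable and listed inputs to the execute machine and the outputs back.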
-- Jaime
I do see all of the machines in condor_status.
"codytrey@metis:~$ condor_config_val DAEMON_LIST MASTER, SCHEDD, STARTD"
This is from the submit machine; it is the same on an execute node I just tried. -Cody

On 2013-02-26 08:47, Cotton, Benjamin J wrote:
Cody,
The first question is: are you sure they're all in the same pool? To
check this, do they all show up in the output of condor_status?
My suspicion is that your submit/execute machine might be running its
own condor_collector and condor_negotiator processes. You can check this
with
condor_config_val DAEMON_LIST
If that's the case, then your execute-only nodes might be running their own as well, which would put them in separate pools.
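For comparison, a machine acting as its own central manager would typically list the collector and negotiator daemons, while an execute-only node would not (illustrative output, not from Cody's machines):

  $ condor_config_val DAEMON_LIST
  MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD

  $ condor_config_val DAEMON_LIST
  MASTER, STARTD

The first is a combined central-manager/submit/execute configuration; the second is an execute-only node pointed at a central manager elsewhere.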
Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project