74 cores unclaimed, 10 jobs submitted
codytrey@metis:~/condor$ condor_q -analyze
-- Submitter: metis.physics.tamu.edu : <128.194.151.193:49656> : metis.physics.tamu.edu
---
055.000: Request has not yet been considered by the matchmaker.
---
055.001: Request has not yet been considered by the matchmaker.
---
055.002: Request has not yet been considered by the matchmaker.
---
055.003: Request has not yet been considered by the matchmaker.
---
055.004: Request has not yet been considered by the matchmaker.
---
055.005: Request has not yet been considered by the matchmaker.
---
055.006: Request has not yet been considered by the matchmaker.
---
055.007: Request has not yet been considered by the matchmaker.
---
055.008: Request has not yet been considered by the matchmaker.
---
055.009: Request has not yet been considered by the matchmaker.
I'm waiting for them to be considered (how long should this take? Sometimes it seems very fast; other times it takes upwards of 10 minutes).
after some time, it changes to:
-- Submitter: metis.physics.tamu.edu : <128.194.151.193:49656> : metis.physics.tamu.edu
---
057.004: Request is being serviced
---
057.005: Request is being serviced
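I'm guessing the wait is tied to how often the negotiator runs a matchmaking cycle rather than to the submit itself; would checking something like this on the central manager confirm that? (Not sure these are the right knobs.)

# How often the negotiator starts a matchmaking cycle, in seconds.
condor_config_val NEGOTIATOR_INTERVAL
# Where the negotiator writes its log, to see when cycles actually ran.
condor_config_val NEGOTIATOR_LOG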
The Python script that it runs sleeps for some amount of time and then echoes the hostname; the output shows that they all run on the machine named metis.
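For reference, the test job does nothing more than something like this (a shell sketch of it, not the actual Python script):

#!/bin/sh
# Sleep for a while, then report which machine the job actually ran on.
sleep 60
hostname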
I was looking at the logs on the central manager; it seems that it often tries to communicate with itself over 127.0.0.1:49152 but fails. Could this be related to, or even the cause of, my problem?
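Would checking something like the following on the central manager help narrow that down? (These are just guesses at the relevant settings.)

# Which interface/hostname the daemons are configured to advertise.
condor_config_val NETWORK_INTERFACE
condor_config_val CONDOR_HOST
# Check that the collector advertises a routable address, not 127.0.0.1.
condor_status -collector -long | grep -i MyAddress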
I also just noticed that condor_status shows 20 cores matched and 16 still unclaimed. This leads me to think that the jobs are matched to run on other nodes, but the central manager is not able to hand them off.
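Would something like the following help pin that down? (57.0 here is just one of the idle job ids from above.)

# More verbose matchmaking analysis for one of the idle jobs.
condor_q -better-analyze 57.0
# List any slots that are sitting in the Matched state.
condor_status -constraint 'State == "Matched"'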
You've been very helpful thus far, I greatly appreciate it.
-Cody
On 2013-02-26 11:50, Jaime Frey wrote:
Make sure all of the machines are in the Unclaimed state in condor_status, and not Owner. If they're in the Owner state, they don't want to accept jobs.

Then submit a new set of jobs and run condor_q -analyze, using the id of one of the idle jobs. That may provide some information about why the jobs aren't running on the other machines.

-- Jaime
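If you want to list only the machines that are refusing work, something along these lines should do it (adjust as needed):

# Slots whose start policy currently keeps them in the Owner state.
condor_status -constraint 'State == "Owner"'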
On Feb 26, 2013, at 11:03 AM, Cody Belcher <codytrey@xxxxxxxxxxxxxxxx> wrote:
Jaime,
my submit files is:
Executable = PQL
Universe = vanilla
Output = pql.out
Log = pql.log
Error = pql.err
Arguments = -p params.in -t temps.in
notification = Error
notify_user = codytrey@xxxxxxxx
should_transfer_files = YES
Queue 20
I have it queue 20 jobs to see whether that would force jobs onto the other machines once the submit node had all of its processors in use, but it just ran 4 at a time until the cluster was complete.
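Would adding a requirement to keep the test jobs off the submit node be a reasonable next test? Something like (guessing at the syntax):

# Keep the test jobs off the submit machine entirely.
Requirements = (Machine != "metis.physics.tamu.edu")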
Same results with:
Executable = test.py
Universe = vanilla
Output = /Volumes/Scratch/test/test.out.$(Process)
Log = /Volumes/Scratch/test/test.log
Error = /Volumes/Scratch/test/test.err
should_transfer_files = ALWAYS
Queue 10
-Cody

On 2013-02-26 10:29, Jaime Frey wrote:
What does your submit file look like?

A common problem is that the machines don't have a shared filesystem, and HTCondor's file transfer option isn't being requested in the submit file. In this case, HTCondor will only run the jobs on the submit machine.

-- Jaime
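Requesting file transfer looks roughly like this in the submit file (a minimal sketch; adjust for your job's actual inputs):

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
# plus transfer_input_files = ... if the job reads any input files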
On Feb 26, 2013, at 9:09 AM, Cody Belcher <codytrey@xxxxxxxxxxxxxxxx> wrote:
I do see all of the machines in condor_status.
"codytrey@metis:~$ condor_config_val DAEMON_LIST
MASTER, SCHEDD, STARTD"
This is from the submit machine; it is the same on an execute node I just tried.

-Cody
On 2013-02-26 08:47, Cotton, Benjamin J wrote:
Cody,

The first question is: are you sure they're all in the same pool? To check this, do they all show up in the output of condor_status? My suspicion is that your submit/execute machine might be running its own condor_collector and condor_negotiator processes. You can check this with:

condor_config_val DAEMON_LIST

If that's the case, then your execute-only nodes might be as well.
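For comparison, the daemon lists usually look something like this, though the exact lists vary from pool to pool:

# Central manager (often also a submit/execute machine on small pools)
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD
# Execute-only node
DAEMON_LIST = MASTER, STARTD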
Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project