Re: [Condor-users] jobs stuck in queue
- Date: Fri, 19 Aug 2011 22:09:54 +0000
- From: "Koller, Garrett" <kollerg14@xxxxxxxxxxxx>
- Subject: Re: [Condor-users] jobs stuck in queue
Mr. Cannini,
I'm not yet familiar with running MPI jobs on Condor, but I think I've come across a similar situation. First, run 'condor_q -better-analyze' to find out whether the job's requirements are keeping it from being scheduled in the first place. If it says "not yet considered by matchmaker" or something similar, it usually means that the job is being run but hits an error shortly afterward, so it is continually put back in the queue.
Next, check the MatchLog. If it keeps reporting the same job as "Matched", the job was scheduled successfully but something is going wrong on the execute machine. Check which slot and which machine the job is assigned to, then go to that machine's log files and look for the StarterLog for that slot. The bottom of that log should tell you what error your program encountered that caused it to exit. Let me/us know if this doesn't help you diagnose and solve the problem.
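In case it helps, the steps above can be sketched roughly as the shell commands below. This is only a sketch under assumptions: the job ID 33.0 is taken from your condor_q output, "slot1" is a placeholder for whichever slot the job was matched to, the helper starter_log_for is a made-up name, and I am assuming the logs use the usual MatchLog and StarterLog.<slot> file names inside the directory reported by 'condor_config_val LOG'. Steps 3 and 4 need to be run on the central manager and on the execute machine respectively.

```shell
#!/bin/sh
# Placeholders (assumptions): replace with your own job ID and slot name.
JOB_ID="${JOB_ID:-33.0}"
SLOT="${SLOT:-slot1}"

# Hypothetical helper: build the path of a slot's StarterLog inside a
# Condor log directory, assuming the StarterLog.<slot> naming.
starter_log_for() {
    printf '%s/StarterLog.%s\n' "$1" "$2"
}

if command -v condor_q >/dev/null 2>&1; then
    # 1. On the submit machine: ask why the job is not being matched.
    condor_q -better-analyze "$JOB_ID"

    # 2. Find where this machine keeps its daemon logs.
    LOG_DIR=$(condor_config_val LOG)

    # 3. On the central manager: see whether the same job keeps being matched.
    #    The file may not exist if match logging is not enabled.
    if [ -f "$LOG_DIR/MatchLog" ]; then
        tail -n 20 "$LOG_DIR/MatchLog"
    fi

    # 4. On the execute machine: read the end of that slot's StarterLog.
    STARTER_LOG=$(starter_log_for "$LOG_DIR" "$SLOT")
    if [ -f "$STARTER_LOG" ]; then
        tail -n 50 "$STARTER_LOG"
    fi
else
    echo "condor_q not found; run this on a machine with Condor installed" >&2
fi
```

The script only reads logs and queue state, so it should be safe to run repeatedly while the job is idle.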
Best Regards,
- Garrett
condor.cs.wlu.edu
________________________________________
From: condor-users-bounces@xxxxxxxxxxx [condor-users-bounces@xxxxxxxxxxx] on behalf of Fabricio Cannini [fcannini@xxxxxxxxx]
Sent: Friday, August 19, 2011 4:36 PM
To: condor-users@xxxxxxxxxxx
Subject: [Condor-users] jobs stuck in queue
Hello
I have installed Condor 7.6.0 on a master plus two execute nodes, with the
following configuration:
*master :*
UID_DOMAIN = internal.domain
FILESYSTEM_DOMAIN = internal.domain
SEC_DEFAULT_NEGOTIATION = OPTIONAL
ALLOW_READ = $(FULL_HOSTNAME),@172.17.8.*
ALLOW_WRITE = $(FULL_HOSTNAME),@172.17.8.*
ALLOW_NEGOTIATOR = $(CONDOR_HOST)
ALLOW_CONFIG = $(CONDOR_HOST),$(FULL_HOSTNAME)
ENABLE_RUNTIME_CONFIG = True
ENABLE_PERSISTENT_CONFIG = True
PERSISTENT_CONFIG_DIR = /etc/condor/config.d
SETTABLE_ATTRS_CONFIG = *
USE_NFS = True
DEFAULT_DOMAIN_NAME = internal.domain
TRUST_UID_DOMAIN = True
DAEMON_LIST = MASTER, STARTD, SCHEDD, COLLECTOR, NEGOTIATOR
SOFT_UID_DOMAIN = TRUE
START = TRUE
*nodes:*
CONDOR_HOST = master
UID_DOMAIN = internal.domain
FILESYSTEM_DOMAIN = internal.domain
SEC_DEFAULT_NEGOTIATION = OPTIONAL
ALLOW_READ = $(CONDOR_HOST),172.17.8.*
ALLOW_WRITE = $(CONDOR_HOST),172.17.8.*
ALLOW_NEGOTIATOR = $(CONDOR_HOST)
ALLOW_CONFIG = $(CONDOR_HOST),$(FULL_HOSTNAME)
ENABLE_RUNTIME_CONFIG = True
ENABLE_PERSISTENT_CONFIG = True
PERSISTENT_CONFIG_DIR = /etc/condor/config.d
SETTABLE_ATTRS_CONFIG = *
USE_NFS = True
DEFAULT_DOMAIN_NAME = internal.domain
ALLOW_DAEMON = *@$(CONDOR_HOST)
SOFT_UID_DOMAIN = TRUE
START = TRUE
TRUST_UID_DOMAIN = TRUE
STARTD_EXPRS=$(STARTD_EXPRS), DedicatedScheduler, ParallelSchedulingGroup
SCHEDD_NAME = $(CONDOR_HOST)
When I submit a simple job like this:
###############################
Error = err-$(cluster).log
Output = out-$(cluster).log
Log = log-$(cluster).log
cmd = /bin/cat
arguments = /proc/cpuinfo
Queue
###############################
It runs fine. But a slightly more complicated job like this:
===============================
universe = parallel
Error = err-$(cluster).log
Output = out-$(cluster).log
Log = log-$(cluster).log
executable = /usr/bin/mpirun
arguments = -np 8 -host node-01,node-02 /home/user/hw
machine_count = 2
Queue
===============================
The job goes to idle state:
-- Submitter: master.internal.domain : <172.17.8.121:58829> : master.internal.domain
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
33.0 user 8/19 16:48 0+00:00:00 I 0 0.1 mpirun -np 8 -host
"/home/user/hw" is just a simple mpi hello world.
Any tips on what may (not) be going on are very, very, veeeeery welcome.
TIA