Re: [HTCondor-users] stuck queued jobs
- Date: Thu, 05 Nov 2015 14:56:56 -0600
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] stuck queued jobs
On 11/5/2015 1:59 PM, Ajao, Remi A. wrote:
Hello,
I'm having issues with various jobs being stuck in the queue for
several hours, even though there are more than enough available
servers to run them.
The only evidence of the job ID in the logs is in the schedd log on
the condor master, where I see a whole bunch of lines like this:
Inserting new attribute Scheduler into non-active cluster cid
condor_q -analyze jobID actually says the job is matched to a node,
which is not surprising because a specific host name is given as
part of the job description.
Quick thought --
On your stuck job, please do
condor_q jobID -af:r scheduler
You should see something like "DedicatedScheduler@xxxxxxxxxxxxxxxxxx"
Now do
condor_status blah.host.com -af:r rank
(where blah.host.com is the host you restricted your job to in the
requirements)
The output from condor_status should look something like
Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxx"
where the exact name in quotes (that starts with DedicatedScheduler)
needs to be the same as what you got above from condor_q.
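For example, on a working parallel universe pool the two outputs line
up like this (the job id and host names below are just placeholders):

$ condor_q 1234.0 -af:r Scheduler
"DedicatedScheduler@submit.example.com"

$ condor_status blah.host.com -af:r Rank
Scheduler =?= "DedicatedScheduler@submit.example.com"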
If it is not the same, see http://is.gd/Zo7lZF for how to set up nodes to
run parallel universe jobs.
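(As a quick sketch of the documented recipe: each dedicated execute
node needs condor_config entries along these lines, with the full
hostname of your submit machine's schedd substituted in --

DedicatedScheduler = "DedicatedScheduler@full.hostname.of.your.schedd"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
RANK = Scheduler =?= $(DedicatedScheduler)

-- followed by a condor_reconfig of the node. See the link above for
the full details.)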
If it is the same, please share with us the output from the above
commands along with the output from condor_version.
The bit about
Inserting new attribute Scheduler into non-active cluster cid
is harmless in the case of a parallel universe job and something we will
remove from the log in an upcoming release.
Hope the above helps,
Todd
There is no sign of any of these jobs in the negotiator log. Here's
an example of what the schedd log looks like regarding the message I
mentioned earlier:
http://pastebin.com/XKQSpDuV
Here's what the submit.txt file looks like -
http://pastebin.com/yx5JUJnY
Executable = g-blah.sh
Universe = parallel
Log = g-blah.sh.log
Error = err.$(cluster).$(process).$(node).txt
Output = out.$(cluster).$(process).$(node).txt
Stream_output = True
Stream_error = True

#+ParallelShutdownPolicy = "WAIT_FOR_ALL"
machine_count = 1
Environment = LOCKHOME=/home/condor/parallel_universe;CLUSTER_ID=sgoyal/vertest_4node_query_1/2015_11_05__22.14.06;SVNBRANCH=trunk;SVNREV=HEAD;RPMURL=http://10.10.10.16/kits/releases/7.1.2-10//7.1.2-10.x86_64.RHEL5.rpm;user=sgoyal;testlist=vertest_4node_query_1;LOCAL_CLUSTER_NNODES=4;TESTFILTERS=four;r_rpmurl=http://10.10.10.16/kits/releases/7.1.2-4/R_lang/R-lang-7.1.2-4.x86_64.RHEL5.rpm;r_analytics_rpmurl=;r_place_rpmurl=http://10.10.10.16/kits/releases/7.1.2-0/place/place-7.1.2-0.x86_64.RHEL5.rpm;r_pulse_rpmurl=http://10.10.10.16/kits/releases/7.1.2-0/pulse/pulse-7.1.2-0.x86_64.RHEL5.rpm;ignore_rpm_rev=true;SVNBRANCH=branches/7.1_DRAGLINE_SP2_HOTFIX;SVNREV=HEAD;JAVA_HOME_QA=/usr/lib/jvm/java-1.7.0-openjdk.x86_64;STORE_RESULTS=true;VETT_BATCHUPDATE=false;;TIMELIMIT=82100;EMAIL_SUCCESS=blah@xxxxxxxx;EMAIL_FAILURE=blah@xxxxxxxx;EMAIL_STATUS=blah@xxxxxxxx
arguments = cluster/query_regress_1
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
# need to explicitly make coresize big otherwise condor sets it to zero by default
coresize = 4000000000
periodic_remove = RemoteWallClockTime > 82800
+TimeLimit = 82800
Requirements = (sf_maintenance == FALSE) && (SF == 1) && (Memory > 1024) && (STAGE == 0) && (QA=?=UNDEFINED) && Machine == "blah.host.com"

Queue
I also enabled more verbose debugging via SCHEDD_DEBUG, but I'm not
seeing any other interesting data. Any help is much appreciated.
It's worth noting that I do have some other jobs that are running,
mostly vanilla universe; it's the parallel universe ones that seem
to be affected.
Thanks
--
Todd Tannenbaum <tannenba@xxxxxxxxxxx>
HTCondor Technical Lead, Center for High Throughput Computing
Department of Computer Sciences, University of Wisconsin-Madison
1210 W. Dayton St. Rm #4257, Madison, WI 53706-1685
Phone: (608) 263-7132