On Mon, Jul 12, 2010 at 1:12 PM, Dan Bradley <dan@xxxxxxxxxxxx> wrote:
> Gary,
>
> It may help to look in SchedLog to see what is happening to your
> condor_schedd.
>
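> For example (a sketch, assuming a stock config where SCHEDD_LOG is
> defined; condor_config_val will print wherever it actually points):
>
>   tail -f "$(condor_config_val SCHEDD_LOG)"
>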
> --Dan
>
> Gary Orser wrote:
>>
>> Trying to send this again ...
>>
>> On Fri, Jul 9, 2010 at 10:46 AM, Gary Orser <garyorser> wrote:
>>
>> Hi all,
>>
>> I just upgraded my cluster from Rocks 5.1 to 5.3.
>> This upgraded Condor from 7.2.? to 7.4.2.
>>
>> I've got everything running, but it won't stay up.
>> (I had the previous configuration running with Condor for years;
>> it has done millions of hours of compute.)
>>
>> I have a good repeatable test case.
>> (each job runs for a couple of minutes)
>>
>> [orser@bugserv1 tests]$ for i in `seq 1 100` ; do condor_submit
>> subs/ncbi++_blastp.sub ; done
>> Submitting job(s).
>> Logging submit event(s).
>> 1 job(s) submitted to cluster 24.
>> Submitting job(s).
>> Logging submit event(s).
>> 1 job(s) submitted to cluster 25.
>> Submitting job(s).
>> .
>> .
>> .
>> Submitting job(s).
>> Logging submit event(s).
>> 1 job(s) submitted to cluster 53.
>>
>> WARNING: File /home/orser/tests/results/ncbi++_blastp.sub.53.0.err
>> is not writable by condor.
>>
>> WARNING: File /home/orser/tests/results/ncbi++_blastp.sub.53.0.out
>> is not writable by condor.
>> Can't send RESCHEDULE command to condor scheduler
>> Submitting job(s)
>> ERROR: Failed to connect to local queue manager
>> CEDAR:6001:Failed to connect to <153.90.184.186:40026>
>> Submitting job(s)
>> ERROR: Failed to connect to local queue manager
>> CEDAR:6001:Failed to connect to <153.90.184.186:40026>
>> Submitting job(s)
>>
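>> (The "not writable" warnings may just be fallout from the schedd dying
>> mid-submit, but to rule out a plain permissions problem I can poke at
>> the results directory directly; a sketch, using the paths from the
>> submit file below:)
>>
>>   ls -ld /home/orser/tests/results
>>   touch /home/orser/tests/results/.wtest && rm /home/orser/tests/results/.wtest
>>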
>> [orser@bugserv1 tests]$ cat subs/ncbi++_blastp.sub
>> ####################################
>> ## run distributed blast ##
>> ## Condor submit description file ##
>> ####################################
>> getenv = True
>> universe = Vanilla
>> initialdir = /home/orser/tests
>> executable = /share/bio/ncbi-blast-2.2.22+/bin/blastn
>> input = /dev/null
>> output = results/ncbi++_blastp.sub.$(Cluster).$(Process).out
>> WhenToTransferOutput = ON_EXIT_OR_EVICT
>> error = results/ncbi++_blastp.sub.$(Cluster).$(Process).err
>> log = results/ncbi++_blastp.sub.$(Cluster).$(Process).log
>> notification = Error
>>
>> arguments = "-db /share/data/db/nt -query
>> /home/orser/tests/data/gdo0001.fas -culling_limit 20 -evalue 1E-5
>> -num_descriptions 10 -num_alignments 100 -parse_deflines -show_gis
>> -outfmt 5"
>>
>> queue
>>
>> [root@bugserv1 etc]# condor_q
>>
>> -- Failed to fetch ads from: <153.90.84.186:40026> : bugserv1.core.montana.edu
>> CEDAR:6001:Failed to connect to <153.90.184.186:40026>
>>
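>> When condor_q fails like this, a quick check of whether the schedd
>> process is even alive and listening (standard Linux tools; the port is
>> taken from the error above):
>>
>>   ps -C condor_schedd -o pid,etime,args
>>   netstat -tlnp | grep 40026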
>>
>> I can restart the head node with:
>> /etc/init.d/rocks-condor stop
>> rm -f /tmp/condor*/*
>> /etc/init.d/rocks-condor start
>>
>> and the jobs that got submitted do run.
>>
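>> To confirm the restart actually brought everything back, I check with
>> something like (output varies, of course):
>>
>>   condor_status -schedd
>>   condor_q
>>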
>> I have trawled through the archives, but haven't found anything
>> that might be useful.
>>
>> I've looked at the logs but haven't found any clues there.
>> I can provide them if that might be useful.
>>
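>> For what it's worth, this is roughly how I've been skimming them for
>> errors (a sketch; the LOG directory is resolved via condor_config_val):
>>
>>   for f in "$(condor_config_val LOG)"/*Log; do
>>     echo "== $f"; grep -iE 'error|fail' "$f" | tail -5
>>   done
>>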
>> The changes from a stock install are minor.
>> (I just brought the cluster up this week)
>>
>> [root@bugserv1 etc]# diff condor_config.local condor_config.local.08Jul09
>> 20c20
>> < LOCAL_DIR = /mnt/system/condor
>> ---
>> > LOCAL_DIR = /var/opt/condor
>> 27,29c27
>> < PREEMPT = True
>> < UWCS_PREEMPTION_REQUIREMENTS = ( $(StateTimer) > (8 * $(HOUR)) && \
>> <     RemoteUserPrio > SubmittorPrio * 1.2 ) || (MY.NiceUser == True)
>> ---
>> > PREEMPT = False
>>
>> Just a bigger volume and an 8-hour preemption quantum.
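>>
>> (To double-check what the running daemons actually picked up, the live
>> values can be queried; the -v flag also shows which file set them:)
>>
>>   condor_config_val -v LOCAL_DIR
>>   condor_config_val -v PREEMPT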
>>
>> Ideas?
>>
>> -- Cheers, Gary
>> Systems Manager, Bioinformatics
>> Montana State University
>>
>>
>>
>>
>> --
>> Cheers, Gary
>>
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/