Re: [HTCondor-users] mpi job stuck as idle
- Date: Mon, 22 Jan 2018 08:11:50 -0600
- From: Jason Patton <jpatton@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] mpi job stuck as idle
It looks like you ran:
condor_status -af:h Machine DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
What if you run:
condor_status -af:h Machine DedicatedScheduler
This will show the value of DedicatedScheduler (and Machine) for each
slot on each execute machine.
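As a rough illustration (the values below are hypothetical, assuming the
dedicated scheduler runs on rocks7.vbtestcluster.com), execute nodes that
have picked up the dedicated-resource configuration should report something
like:

Machine            DedicatedScheduler
compute-0-0.local  DedicatedScheduler@rocks7.vbtestcluster.com
compute-0-0.local  DedicatedScheduler@rocks7.vbtestcluster.com

If that column still shows "undefined", the execute nodes have not picked up
the DedicatedScheduler setting, and the dedicated scheduler will not match
your MPI job to those slots.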
Jason
On Mon, Jan 22, 2018 at 3:22 AM, mahmood n <nt_mahmood@xxxxxxxxx> wrote:
> Hi,
>
> Anybody help?
>
> I am stuck at this step. All I see on the web is about setting the
> hostname and policies. I have modified them. Don't know why it doesn't work.
>
>
>
> Regards,
>
> Mahmood
>
>
>
> From: Mahmood Naderan
> Sent: Friday, January 19, 2018 4:19 PM
> To: HTCondor-Users Mail List; Jason Patton
> Subject: Re: [HTCondor-users] mpi job stuck as idle
>
>
>
> Jason,
>
>
>
>>Assuming you are running a recent version of condor, "condor_q" will
>>not show jobs from all users, but "condor_status -schedd" will show
>>totals from all users. Does the output of "condor_q -all" show more
>>jobs?
>
>
>
> No. Please see below:
>
>
>
>
>
> [root@rocks7 examples]# condor_status -schedd
>
> Name                     Machine                  RunningJobs IdleJobs HeldJobs
> rocks7.vbtestcluster.com rocks7.vbtestcluster.com           0        2        0
>
>                TotalRunningJobs TotalIdleJobs TotalHeldJobs
>          Total                0             2             0
> [root@rocks7 examples]# condor_q -all
>
>
> -- Schedd: rocks7.vbtestcluster.com : <10.0.3.15:48687> @ 01/19/18 07:45:13
> OWNER   BATCH_NAME                    SUBMITTED   DONE  RUN  IDLE  TOTAL JOB_IDS
> mahmood CMD: /opt/openmpi/bin/mpirun  1/17 03:04     _    _     1      1 5.0
>
> 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
>
> I followed the steps described in the manual and uncommented the policy.
> The job is still in the idle state. Should I kill it and resubmit, or have I
> missed some configuration?
>
> [root@rocks7 examples]# cat condor_config.local.dedicated.resource
> ######################################################################
> ##
> ## condor_config.local.dedicated.resource
> ##
> ## This is the default local configuration file for any resources
> ## that are going to be configured as dedicated resources in your
> ## Condor pool. If you are going to use Condor's dedicated MPI
> ## scheduling, you must configure some of your machines as dedicated
> ## resources, using the settings in this file.
> ##
> ## PLEASE READ the discussion on "Configuring Condor for Dedicated
> ## Scheduling" in the "Setting up Condor for Special Environments"
> ## section of the Condor Manual for more details.
> ##
> ## You should copy this file to the appropriate location and
> ## customize it for your needs. The file is divided into three main
> ## parts: settings you MUST customize, settings regarding the policy
> ## of running jobs on your dedicated resources (you must select a
> ## policy and uncomment the corresponding expressions), and settings
> ## you should leave alone, but that must be present for dedicated
> ## scheduling to work. Settings that are defined here MUST BE
> ## DEFINED, since they have no default value.
> ##
> ######################################################################
>
>
> ######################################################################
> ######################################################################
> ## Settings you MUST customize!
> ######################################################################
> ######################################################################
>
> ## What is the name of the dedicated scheduler for this resource?
> ## You MUST fill in the correct full hostname where you're running
> ## the dedicated scheduler, and where users will submit their
> ## dedicated jobs. The "DedicatedScheduler@" part should not be
> ## changed, ONLY the hostname.
> DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx"
>
>
> ######################################################################
> ######################################################################
> ## Policy Settings (You MUST choose a policy and uncomment it)
> ######################################################################
> ######################################################################
>
> ## There are three basic options for the policy on dedicated
> ## resources:
> ## 1) Only run dedicated jobs
> ## 2) Always run jobs, but prefer dedicated ones
> ## 3) Always run dedicated jobs, but only allow non-dedicated jobs to
> ## run on an opportunistic basis.
> ## You MUST uncomment the set of policy expressions you want to use
> ## at your site.
>
> ##--------------------------------------------------------------------
> ## 1) Only run dedicated jobs
> ##--------------------------------------------------------------------
> #START = Scheduler =?= $(DedicatedScheduler)
> #SUSPEND = False
> #CONTINUE = True
> #PREEMPT = False
> #KILL = False
> #WANT_SUSPEND = False
> #WANT_VACATE = False
> #RANK = Scheduler =?= $(DedicatedScheduler)
>
> ##--------------------------------------------------------------------
> ## 2) Always run jobs, but prefer dedicated ones
> ##--------------------------------------------------------------------
> #START = True
> #SUSPEND = False
> #CONTINUE = True
> #PREEMPT = False
> #KILL = False
> #WANT_SUSPEND = False
> #WANT_VACATE = False
> #RANK = Scheduler =?= $(DedicatedScheduler)
>
> ##--------------------------------------------------------------------
> ## 3) Always run dedicated jobs, but only allow non-dedicated jobs to
> ## run on an opportunistic basis.
> ##--------------------------------------------------------------------
> ## Allowing both dedicated and opportunistic jobs on your resources
> ## requires that you have an opportunistic policy already defined.
> ## These are the only settings that need to be modified from your
> ## existing policy expressions to allow dedicated jobs to always run
> ## without suspending, or ever being preempted (either from activity
> ## on the machine, or other jobs in the system).
>
> SUSPEND = Scheduler =!= $(DedicatedScheduler) && ($(SUSPEND))
> PREEMPT = Scheduler =!= $(DedicatedScheduler) && ($(PREEMPT))
> RANK_FACTOR = 1000000
> RANK = (Scheduler =?= $(DedicatedScheduler) * $(RANK_FACTOR)) + $(RANK)
> START = (Scheduler =?= $(DedicatedScheduler)) || ($(START))
>
> ## Note: For everything to work, you MUST set RANK_FACTOR to be a
> ## larger value than the maximum value your existing rank expression
> ## could possibly evaluate to. RANK is just a floating point value,
> ## so there's no harm in having a value that's very large.
>
>
> ######################################################################
> ######################################################################
> ## Settings you should leave alone, but that must be defined
> ######################################################################
> ######################################################################
>
> ## Path to the special version of rsh that's required to spawn MPI
> ## jobs under Condor. WARNING: This is not a replacement for rsh,
> ## and does NOT work for interactive use. Do not use it directly!
> MPI_CONDOR_RSH_PATH = $(LIBEXEC)
>
> ## Path to OpenSSH server binary
> ## Condor uses this to establish a private SSH connection between execute
> ## machines. It is usually in /usr/sbin, but may be in /usr/local/sbin
> CONDOR_SSHD = /usr/sbin/sshd
>
> ## Path to OpenSSH keypair generator.
> ## Condor uses this to establish a private SSH connection between execute
> ## machines. It is usually in /usr/bin, but may be in /usr/local/bin
> CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
>
> ## This setting puts the DedicatedScheduler attribute, defined above,
> ## into your machine's classad. This way, the dedicated scheduler
> ## (and you) can identify which machines are configured as dedicated
> ## resources.
> ## Note: as of 8.4.1 this setting is automatic
> #STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
> [root@rocks7 examples]# rocks sync host condor rocks7
> [root@rocks7 examples]# condor_status -af:h Machine
> DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
> Error: Parse error of: DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
> [root@rocks7 examples]# condor_status -af:h Machine rocks7.vbtestcluster.com
> Machine rocks7.vbtestcluster.com
> compute-0-0.local undefined
> compute-0-0.local undefined
> [root@rocks7 examples]# condor_q
>
>
> -- Schedd: rocks7.vbtestcluster.com : <10.0.3.15:48687> @ 01/19/18 05:22:37
> OWNER   BATCH_NAME                    SUBMITTED   DONE  RUN  IDLE  TOTAL JOB_IDS
> mahmood CMD: /opt/openmpi/bin/mpirun  1/17 03:04     _    _     1      1 5.0
>
> 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
>
> [root@rocks7 examples]#
>
> Any thoughts?
>
> Regards,
> Mahmood
>