Re: [HTCondor-users] Doubts regarding submitting a job
- Date: Thu, 18 Sep 2014 16:19:09 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Doubts regarding submitting a job
On 9/18/2014 12:12 PM, Roshan Chaudhari wrote:
Hi,
I have some doubts regarding job submission:
I suggest reading the HTCondor Manual
(http://research.cs.wisc.edu/htcondor/manual/)
and/or taking a look at some of the tutorials on the web site, but here
are quick answers to your questions:
1. Is it possible to submit a job to a number of nodes or computers?
Yes.
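As a minimal sketch of what "yes" can look like in practice: a submit description file that ends in "queue N" places N jobs in the queue, and HTCondor matches each one to an available machine. The Python helper below only builds the text of such a file. The executable name, the file names, and the helper functions themselves are hypothetical, and condor_submit is assumed to be on the PATH if you actually call submit().

```python
import subprocess


def build_submit_description(executable, count, log_file="job.log"):
    """Return submit-file text that queues `count` copies of `executable`."""
    lines = [
        f"executable = {executable}",
        # $(Process) expands to 0, 1, ..., count-1, one value per queued job.
        "arguments  = $(Process)",
        "output     = out.$(Process).txt",
        "error      = err.$(Process).txt",
        f"log        = {log_file}",
        f"queue {count}",
    ]
    return "\n".join(lines) + "\n"


def submit(description, path="job.sub"):
    """Write the description to disk and hand it to condor_submit."""
    with open(path, "w") as f:
        f.write(description)
    subprocess.run(["condor_submit", path], check=True)


if __name__ == "__main__":
    # Only prints the file text; nothing is submitted here.
    print(build_submit_description("my_program", 10))
```

Each of the ten jobs gets its own output and error file, so results from different machines do not overwrite each other.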
2. How do I get informed when a job has finished?
Jobs that are either running or waiting to run are visible in the job
queue with the "condor_q" command-line tool or the corresponding APIs
(besides command-line tools, HTCondor has Python and SOAP-based
interfaces, amongst others). When a job completes, it leaves the
job queue and is no longer visible via condor_q; instead it is
visible via "condor_history", which shows a list of completed jobs.
Also, you can request upon job submission that HTCondor write events
into a specified "job event log" file. When the job completes, a job
completion event will be written to this file.
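To illustrate the event-log approach, here is a rough Python sketch that scans a job event log for termination events. It assumes the classic text log format, in which each event starts with a three-digit event number followed by the job ID in parentheses, and that event number 005 means "job terminated". The sample log text and the function name are made up for illustration.

```python
import re

# Assumed header shape of each event:
#   "NNN (cluster.proc.subproc) <timestamp> <message>"
EVENT_HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\)")
JOB_TERMINATED = "005"  # assumed event number for "Job terminated"


def finished_jobs(log_text):
    """Return the (cluster, proc) pairs that have a termination event."""
    done = []
    for line in log_text.splitlines():
        match = EVENT_HEADER.match(line)
        if match and match.group(1) == JOB_TERMINATED:
            done.append((int(match.group(2)), int(match.group(3))))
    return done


# Hand-written sample in the assumed format, not real HTCondor output.
SAMPLE_LOG = """\
000 (123.000.000) 09/18 16:00:01 Job submitted from host: <example>
...
001 (123.000.000) 09/18 16:00:30 Job executing on host: <example>
...
005 (123.000.000) 09/18 16:19:09 Job terminated.
...
"""

if __name__ == "__main__":
    print(finished_jobs(SAMPLE_LOG))
```

If you would rather not parse the file yourself, the condor_wait command-line tool watches a job event log and returns once the jobs in it have completed.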
3. How do I kill a job?
condor_rm will kill a job if running, and remove it from the queue.
condor_vacate will kill a job on a specified machine, and the job will
then get rescheduled to run again (perhaps someplace else).
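For scripting these two commands, a small Python sketch along the following lines may help. It only assembles the argument lists: condor_rm is given a job ID (a cluster number, or cluster.proc for a single job) and condor_vacate a machine name. The helper names and the host name are hypothetical, and run() assumes the HTCondor tools are installed.

```python
import subprocess


def rm_command(cluster, proc=None):
    """argv for removing a whole cluster, or one job within it."""
    job_id = str(cluster) if proc is None else f"{cluster}.{proc}"
    return ["condor_rm", job_id]


def vacate_command(hostname):
    """argv for asking the named machine to vacate the job it is running."""
    return ["condor_vacate", hostname]


def run(argv):
    """Run the command; requires the HTCondor tools to be installed."""
    subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Only prints the commands; nothing is removed or vacated here.
    print(rm_command(123, 0))
    print(vacate_command("node01.example.org"))
```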
4. What happens to a job if a user starts working on the computer? Does
it get a lower priority, or something else?
HTCondor is very configurable in this regard. You can tell HTCondor to
simply continue running the job at a low priority, or kill the job and
restart it from the beginning someplace else, or suspend the job (i.e.
stop using the CPU) and continue the job when the interactive user
leaves, or kill the job and resume running it from where it left off if
the job can be checkpointed.
5. What happens if a computer is offline? How long will the system wait to
declare a node "down"?
This is configurable, but with the current defaults, figure it could
take HTCondor about 20 minutes to notice if a machine "crashes". If you
shut the machine down cleanly, HTCondor notices right away.
6. How do I put a PC back in the queue if the system was down and you
turn it back on?
You don't need to do anything to put a PC back into a pool, and HTCondor
'notices' a machine (re)joining in about a minute. HTCondor is very
good at dealing with machines dynamically leaving and joining a cluster.
Hope the above helps,
Todd