Re: [Condor-users] Condor_q analyze question
- Date: Thu, 06 Nov 2008 10:26:44 -0600
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Condor_q analyze question
Brandon Leeds wrote:
Hi All,
Hi Brandon, some hopefully helpful comments below....
We are trying to understand why a job appears to be running and
accumulating cpu time in the condor_q output,
Note that if you just run "condor_q", the time you are seeing is
RUN_TIME, i.e. wall clock time. To see CPU time you need to pass the
"-cputime" flag to condor_q. The CPU time is then displayed instead of
wall clock time; note that Condor only updates CPU time periodically, so
you will not see it increment every second in condor_q.
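As a small sketch of the difference, the two views can be put side by side for one job. The function name show_job_times is hypothetical; only "condor_q" and its "-cputime" flag come from the explanation above.

```shell
# Hypothetical helper: print both views of one job.
show_job_times() {
    job_id="$1"
    condor_q "$job_id"            # default view: RUN_TIME is wall clock time
    condor_q -cputime "$job_id"   # same listing, with CPU time shown instead
}
```

For example, "show_job_times 527579.0". Since CPU time is only updated periodically, the second figure can lag behind the first.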
but are told by the end
user that his job is no longer accessing the files it should be along
the computation's typical pathway. In hopes of understanding if the
priority is so low that it is starving,
If his job is marked as running with condor_q, then there is not a low
priority starvation issue.
Some thoughts:
a) the job will still be displayed as "running" in condor_q even if it
is currently suspended at the execute node because the SUSPEND
expression in the config file evaluated to true. You can run
condor_status to see if the node running the job is in the suspended state.
Or, if the user specified a job log file (log=<some-file>) in the
submit description file, that log will also state if the job was suspended.
b) the job will still be displayed as "running" in condor_q when in fact
files are being staged (copied) onto or off of the execute node.
c) if the user is expecting to see output files "grow" as the job runs,
note there are many circumstances where that may not happen. For
instance, if the job is vanilla universe and file transfer is being used
(i.e. no shared file system), the job's files will only get updated when
the job completes or, optionally, when it is preempted. If the job is
standard universe, files may only get updated when the program does a
sync to disk - i.e. file I/O may be cached in RAM for long periods of time.
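For thought (a), here is a rough sketch of checking the job log for suspensions. It assumes the user set log=<some-file> in the submit description file, and that suspend and resume events show up as lines containing "Job was suspended" and "Job was unsuspended"; please verify that wording against your own log. The function name suspension_events is hypothetical.

```shell
# Hypothetical helper: print suspend/unsuspend events from a job log.
# Assumption: the events are recorded as lines containing
# "Job was suspended" or "Job was unsuspended".
suspension_events() {
    log_file="$1"
    grep -E 'Job was (un)?suspended' "$log_file"
}
```

If this prints nothing for the user's log, the job was probably never suspended, and (b) or (c) becomes the more likely explanation.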
he looked at using the analyze
flag to condor_q. Unfortunately we get this result:
$ condor_q -pool condor -name blaze -analyze 527579.0
Error: Collector has no record of schedd/submitter
This error is saying your pool does not have a schedd (submission point)
named "blaze". If "blaze" is a hostname, perhaps you need to give the
fully-qualified name? You can also run "condor_status -schedd" to see a
list of all possible values you can use with the "-name" option.
Or perhaps "blaze" is the login name of the user? In that case you
probably meant to use the "-submitter" option to condor_q instead of the
"-name" option.
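The suggestions above could be tried roughly as follows. The function names are hypothetical, and the pool name "condor" and job id 527579.0 are simply taken from your command; only the condor_status and condor_q options themselves are real.

```shell
# Hypothetical helpers; the function names are illustrative only.

# List the schedds the collector knows about. The names shown are the
# values that condor_q accepts after "-name".
list_schedds() {
    condor_status -pool "$1" -schedd
}

# Re-run the analysis using a schedd name taken from that list.
analyze_job() {
    condor_q -pool "$1" -name "$2" -analyze "$3"
}

# If "blaze" is a user login rather than a schedd, list that
# submitter's jobs instead.
jobs_for_submitter() {
    condor_q -pool "$1" -submitter "$2"
}
```

For example, run "list_schedds condor" first, then "analyze_job condor <name-from-the-list> 527579.0", or "jobs_for_submitter condor blaze" if blaze turns out to be a login.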
-Todd
--
Todd Tannenbaum University of Wisconsin-Madison
Condor Project Research Department of Computer Sciences
tannenba@xxxxxxxxxxx 1210 W. Dayton St. Rm #4257