[Condor-devel] follow-up SOAP questions
Hi Matt,
Thanks for the response - I know that was a long email to parse. I had
a few follow-up questions and thoughts based on what you said. I've
tried to trim out the non-relevant portions, though, to make it easier
to read.
>>
>> Ok, this might not be the most satisfying answer, but you should be
>> able to do as much or as little work as you desire in a transaction.
>> Generally, you'll probably want to do as little work as possible in
>> any one transaction. For instance, if you need to change the status
>> on all jobs in a cluster at the same time you'll need to do it all in
>> the same transaction. If you want to do multiple things to all the
>> jobs in a cluster you should be able to do it all in a single
>> transaction, but if you can do each operation separately that would
>> probably be best, e.g. if you have to do something to cluster A and
>> something else to cluster B do them in different transactions.
That's actually a fine answer. :) It's pretty much what I've been
doing, for the most part. Basically, I process the jobs at the cluster
level. When users hold/release/remove submissions, the operations are
performed inside a transaction per cluster. So if I'm holding three
clusters, I create a transaction, hold the jobs in the cluster, and
then close the transaction. Wash. Rinse. Repeat for the other clusters
and operations.
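For reference, here's roughly what that per-cluster loop looks like on
my end. This is just a sketch - the Schedd/Transaction wrapper method
names and signatures (createTransaction, begin, holdJob, commit) are my
assumptions about the birdbath API, so adjust to the real stubs:

    // Sketch: one transaction per cluster; hold everything in that
    // cluster, commit, then move on to the next cluster. Method names
    // are assumed, not verbatim birdbath calls; procsInCluster is a
    // hypothetical helper.
    for (int cluster : clustersToHold) {
        Transaction xact = schedd.createTransaction();
        xact.begin(30); // assumed: 30-second transaction lease
        for (int proc : procsInCluster(cluster)) {
            xact.holdJob(cluster, proc, "held at user request");
        }
        xact.commit();
    }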
Not completely related, but FYI, I also make sure to use constraints
to select only the relevant jobs when performing operations on them,
such as those that are running or idle when holding jobs in a cluster.
That way I don't release jobs that aren't held, hold jobs that are
already held, etc.
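Concretely, the constraint I pass when holding a cluster looks
something like this (42 is just a placeholder cluster id; JobStatus
1 = Idle, 2 = Running, 5 = Held):

    (ClusterId == 42) && (JobStatus == 1 || JobStatus == 2)

Jobs that are already held (JobStatus == 5) simply don't match, so the
hold never touches them.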
>> The code looks good. Does this happen consistently? With the
>> D_FULLDEBUG level enabled in your config file do you see anything
>> interesting in the ScheddLog?
Yes, it happens fairly consistently, especially with higher numbers of
jobs. If I have a cluster with fewer than 100 jobs, it's not so often,
but for larger clusters it's the expected error. It seems to happen
most with non-empty files. For example, the stdout and stderr files
will be empty and the UserLog will not be, and the UserLog will most
likely be the only one that fails with a "File doesn't exist" error.
At the moment I'm guessing that this might be caused by a file lock of
some kind.
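For anyone following along, the logging change was just this line in
the condor_config on the schedd machine, followed by a reconfig:

    SCHEDD_DEBUG = D_FULLDEBUG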
>> It looks like this is a bug in birdbath. The logic in a test is
>> reversed. It sends an error on success and success on error. This
>> will be fixed immediately.
Thanks. I've updated my code to catch, log, and otherwise ignore the
error (which currently indicates success) and to treat a successful
return as an error. I'll update it again once the bug is fixed. In the
meantime, I could also fix it locally if you let me know where it is
in the code.
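My interim workaround is basically just inverted error handling - a
sketch, with holdJob standing in for whichever call hits the reversed
test, and "log" standing in for our logger:

    try {
        xact.holdJob(cluster, proc, reason);
        // With the reversed test, a normal return currently means the
        // operation actually failed, so report it as an error.
        throw new RuntimeException("hold failed (no fault received)");
    } catch (RemoteException e) {
        // The fault currently signals success; log it and carry on.
        log.debug("holdJob faulted as expected (success): " + e.getMessage());
    }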
>>
>> Holding a job can fail if the job doesn't have a JobStatus attribute,
>> if the job is _already_ on hold, or if setting the JobStatus
>> attribute fails. All of these should be reported in the ScheddLog
>> (debug level D_ALWAYS).
>>
>> If the underlying state of a job changes while you are in a
>> transaction you should not notice it from within the transaction. So,
>> a job going on hold in another transaction should not make a hold
>> attempt in your transaction fail. I am not entirely sure this is what
>> actually happens though, especially if a job is put on hold or
>> removed from outside any transaction. I don't think you are
>> misunderstanding anything.
It's not likely that the jobs don't have a JobStatus attribute, as I'm
filtering based on JobStatus before I try to hold any of them, but
I'll try to verify that explicitly. I've increased the logging level,
and I'll see if I can find out whether it's the setting of the
attribute that's causing the problem. This happens much less
frequently than the file I/O issue above, so it may take a bit to get
more information for you.
>>
>> > a) Add in appropriate requirements for file transfer, disk size,
>> > memory, platform (had some trouble figuring out exactly what needs
>> > to be checked/set but I think I have the appropriate ones now)
>>
>> Yup.
>>
>> > b) Set NTDomain attribute
>>
>> If on Windows, yes.
Yes, FYI, I'm currently using a Linux server interfacing via SOAP with
a Windows scheduler, which parcels out the jobs to Windows startds.
I've also used a Linux scheduler, which is how I discovered the file
transfer requirement. :) There seem to be some others though, such as
Disk and Disk_RAW, which may need to be set, and I was hoping that I
wasn't missing anything important.
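In case it helps anyone else, the extra attributes I've ended up
setting on the job ad look roughly like this. The values are
placeholders for our site, and I'm genuinely unsure whether Disk and
Disk_RAW belong on the job side at all:

    ShouldTransferFiles = "YES"
    WhenToTransferOutput = "ON_EXIT"
    NTDomain = "OURDOMAIN"
    Requirements = (Arch == "INTEL") && (OpSys == "WINNT51") &&
                   (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize)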
>>
>> Yes, but I don't know what you mean by "safe" or why you would need
>> to change the names.
It seems that condor_submit changes the names of the output and error
parameters from what they are in the submit file to _std_out and
_std_error (or something similar) so that they can safely be created
on the execute node. Then, when transferring the files back from the
execute nodes, the scheduler, or condor_transfer_data, creates them
locally with the names specified in the original submit file. At the
moment, I just set them to something appropriate to our system, so
there's no need to rename them when they're retrieved.
>>
>> Yes, most types should be easy to determine. The only trick being
>> things like Rank/Requirements/Periodic*/On*/etc which look like
>> strings but should be expressions.
Well, I've had a bit of difficulty with it, but it may just have been
how I went about it. I'll put together a more comprehensive
description of my algorithms when I get a chance.
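That said, the rough heuristic I've been trying looks like this (a
sketch; the list of expression-valued names is just an assumption
based on your Rank/Requirements/Periodic*/On* hint):

    // Sketch: decide whether a submit-file value becomes a quoted
    // string or stays a bare expression in the job ad. The
    // attribute-name list is assumed, not authoritative.
    static boolean isExpressionAttr(String attr) {
        String a = attr.toLowerCase();
        return a.equals("rank") || a.equals("requirements")
            || a.startsWith("periodic") || a.startsWith("on");
    }

    static String toAdValue(String attr, String value) {
        if (isExpressionAttr(attr)) return value;                  // expression
        if (value.matches("(?i)true|false|yes|no")) return value;  // boolean
        if (value.matches("-?\\d+(\\.\\d+)?")) return value;       // number
        return "\"" + value.replace("\"", "\\\"") + "\"";          // string
    }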
>>
>> I am not an expert on ClassAds, but I don't think you want to
>> actually replace the reference with a value, instead let ClassAd
>> evaluation do this for you later.
From looking at the source code, I thought that a lot of the
replacement was done in the condor_submit binary. It seems that
Cluster and Process are definitely replaced, as well as any custom
attributes. Are these the only ones, or are there others that are also
replaced?
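Just so we're talking about the same thing: by "replacement" I mean
that a submit file line like

    output = run_$(Cluster).$(Process).out

ends up in the job ad (for, say, cluster 42, proc 0 - placeholder
numbers) as

    Out = "run_42.0.out"

with the macros already expanded, rather than as a ClassAd reference
to be evaluated later.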
>>
>> I think Yes/No are sometimes also possible boolean values.
Thanks for pointing that out.
>>
>> I believe condor_submit just knows the type based on the name of the
>> attribute. For instance, maybe Rank and Requirements are always the
>> only expressions. I don't think there are any specific guidelines.
Thanks. I'll take another look in the code and do some more testing.
>>
>> You should be able to hold multiple jobs in a single transaction, and
>> right now it seems that is what you have to do. We could add a hold/
>> release that takes a constraint instead of cluster+proc ids, if that
>> would be helpful, and I imagine it would be.
>>
>> As for the error message: that cannot easily be changed. The actual
>> reason for the failure is only logged, and is not immediately
>> accessible to the birdbath code. Bad, I know.
That would be helpful, but in the meantime, I don't have a problem
continuing to hold jobs one at a time in a cluster. I'll see what the
logs say, and get back to you.
>> This is going to depend. Generally, you can always try to read the
>> files and doing so will not affect the job, running or not. GetFile
>> reads from the job's spool directory on the Schedd's disk. However,
>> if a job is running there is no guarantee that its output will
>> actually be in the spool directory. It might still be on the execute
>> site waiting to be transferred to the spool when the job completes.
>> Now, some files (out/err?) can be streamed to the spool directory,
>> and you should be able to incrementally read them. When you reach an
>> EOF for those files there might actually be more data, just coming
>> later. I imagine you would have to ListSpool after hitting an EOF and
>> check to see if the streaming file has increased in size. Does that
>> make sense?
I think so, although in most of the cases that concerned me, the file
was on the scheduler - I verified this manually - and yet when I
called getFile, it failed with a "File does not exist" error. Also, it
seemed that in some cases, after spooling the files, they were removed
from the server. I wanted to make sure that wouldn't happen while a
job was still running, so that I could incrementally get a copy if a
user requested one.
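For the incremental read, I'm planning something along these lines. A
sketch only - the getFile/listSpool wrapper signatures are my
assumptions about the SOAP operations, and jobStillRunning and
spoolFileSize are hypothetical helpers:

    // Sketch: tail a streamed output file in the spool, re-checking
    // its size (via listSpool) after each EOF while the job runs.
    void tailSpoolFile(Schedd schedd, int cluster, int proc,
                       OutputStream sink) throws Exception {
        long offset = 0;
        while (jobStillRunning(schedd, cluster, proc)) {
            long size = spoolFileSize(schedd, cluster, proc, "out");
            while (offset < size) {
                int chunk = (int) Math.min(8192, size - offset);
                byte[] data = schedd.getFile(cluster, proc, "out",
                                             offset, chunk);
                sink.write(data);
                offset += data.length;
            }
            Thread.sleep(5000); // wait before polling listSpool again
        }
    }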
>>
>> If you want to run a job N times but not in parallel, you can use
>> different expressions on the job ad to rerun a job, but I'm guessing
>> that isn't what you are wanting to do.
Not really, but it's more a curiosity than a need of mine. I almost
never want to rerun the same job more than once, and if I do, it's
almost always just to make testing easier. Thanks for the info.
>> You should be able to just get the files, as long as the job is still
>> in the queue. Once a job is removed from the queue (CloseSpool +
>> RemoveJob) its spool directory is deleted and there is nothing to
>> retrieve.
That's what I thought. Thanks for confirming this.
>>
>> When removing jobs you should CloseSpool on them first. This will
>> properly set the LeaveJobInQueue attribute so that when you RemoveJob
>> it will not appear in the X (removed) state in the queue. You do not
>> need to retrieve files before removing a job, just CloseSpool on the
>> job first. If you RemoveJob and then later CloseSpool your job might
>> sit in the queue for a while until a timer fires and reevaluates the
>> LeaveJobInQueue attribute, at which time the job will leave the queue.
Thanks for confirming that. I'll try just closing the spool without
retrieving the files and see whether I get errors setting the
FilesRetrieved attribute. While I'd still like to retrieve the files,
it's much more important that I be able to remove the jobs without
issue.
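So my removal path will just become (a sketch; the wrapper method
names are assumptions about the birdbath stubs):

    // CloseSpool first so LeaveJobInQueue is set properly, then
    // remove; this order avoids the job lingering in the X state
    // until the timer fires.
    schedd.closeSpool(cluster, proc);
    schedd.removeJob(cluster, proc, "removed at user request");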
Thank you very much for your help. I really appreciate it. As always,
if I can do anything to help or if you need any more information, just
let me know.
Thanks,
Rob