
[Condor-devel] follow up SOAP questions




Hi Matt,

Thanks for the response - I know that was a long email to parse.  I had a few follow-up questions and thoughts based on what you said.  I've tried to trim out the non-relevant portions, though, to make it easier to read.

>>
>> Ok, this might not be the most satisfying answer, but you should be  
>> able to do as much or as little work as you desire in a transaction.  
>> Generally, you'll probably want to do as little work as possible in  
>> any one transaction. For instance, if you need to change the status  
>> on all jobs in a cluster at the same time you'll need to do it all in  
>> the same transaction. If you want to do multiple things to all the  
>> jobs in a cluster you should be able to do it all in a single  
>> transaction, but if you can do each operation separately that would  
>> probably be best, e.g. if you have to do something to cluster A and  
>> something else to cluster B do them in different transactions.

That's actually a fine answer. :)  It's pretty much what I've been doing, for the most part.  Basically, I process the jobs at the cluster level.  When users hold/release/remove submissions, the operations are performed inside one transaction per cluster.  So if I'm holding three clusters, I create a transaction, hold the jobs in the first cluster, and then close the transaction.  Wash. Rinse. Repeat for the other clusters and operations.
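
For reference, here's roughly what that looks like on my end (a minimal sketch in Python using the zeep SOAP library against the schedd's condorSchedd WSDL; the endpoint URL is made up, and the exact beginTransaction/holdJob parameter lists should be checked against your WSDL):

import zeep

# Hypothetical endpoint; point this at your schedd's actual WSDL location.
client = zeep.Client("http://schedd.example.com:8080/condorSchedd.wsdl")

def hold_cluster(cluster_id, job_ids, reason):
    # One transaction per cluster: begin, hold every job, commit.
    result = client.service.beginTransaction(60)  # 60-second lease (assumed)
    txn = result.transaction
    for job_id in job_ids:
        # holdJob(transaction, clusterId, jobId, reason,
        #         email_user, email_admin, system_hold) -- per my reading
        #         of the birdbath WSDL
        client.service.holdJob(txn, cluster_id, job_id, reason,
                               False, False, False)
    client.service.commitTransaction(txn)

# Wash. Rinse. Repeat for each cluster.
for cluster_id, job_ids in [(1234, [0, 1, 2]), (1235, [0])]:
    hold_cluster(cluster_id, job_ids, "held by user request")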

Not completely related, but FYI, I also make sure to use constraints to select only the relevant jobs when performing operations on them, such as those that are running or idle when holding jobs in a cluster.  That way I don't release jobs that aren't held, hold jobs that are already held, etc.
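
The constraints are just ClassAd expressions on JobStatus; for example (same client as in the sketch above, with the JobStatus values from the ClassAd attribute reference):

IDLE, RUNNING, HELD = 1, 2, 5  # JobStatus values from the ClassAd reference

def holdable_jobs(cluster_id):
    # Select only idle/running jobs in the cluster, so already-held jobs
    # are left alone.  Passing None for the transaction (to query outside
    # any transaction) is an assumption -- check your WSDL.
    constraint = ("ClusterId == %d && (JobStatus == %d || JobStatus == %d)"
                  % (cluster_id, IDLE, RUNNING))
    return client.service.getJobAds(None, constraint)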

>> The code looks good. Does this happen consistently? With the  
>> D_FULLDEBUG level enabled in your config file do you see anything  
>> interesting in the ScheddLog?
Yes, it happens fairly consistently, especially with higher numbers of jobs.  If I have a cluster with fewer than 100 jobs it's not so frequent, but for larger clusters it's practically the expected outcome.  It seems to happen most with non-empty files.  For example, the stdout and stderr files will be empty and the UserLog will not be, and the UserLog will most likely be the only one that fails with a "File doesn't exist" error.  At the moment I'm guessing that this might be caused by a file lock of some kind.

>> It looks like this is a bug in birdbath. The logic in a test is  
>> reversed. It sends an error on success and success on error. This  
>> will be fixed immediately.
Thanks.  I've updated my code to catch, log, and otherwise ignore the error it reports on success, and to treat a reported success as an error.  I'll update it again once the bug is fixed.  In the meantime, I could also fix it locally if you let me know where it is in the code.

>>
>> Holding a job can fail if the job doesn't have a JobStatus attribute,  
>> if the job is _already_ on hold, or if setting the JobStatus  
>> attribute fails. All of these should be reported in the ScheddLog  
>> (debug level D_ALWAYS).
>>
>> If the underlying state of a job changes while you are in a  
>> transaction you should not notice it from within the transaction. So,  
>> a job going on hold in another transaction should not make a hold  
>> attempt in your transaction fail. I am not entirely sure this is what  
>> actually happens though, especially if a job is put on hold or  
>> removed from outside any transaction. I don't think you are  
>> misunderstanding anything.

It's not likely that the jobs are missing the JobStatus attribute, since I filter on JobStatus before I try to hold any of them, but I'll explicitly verify that.  I've increased the logging level, and I'll see if I can find out whether it's the setting of the attribute that's causing the problem.  This happens much less frequently than the file I/O issue above, so it may take a bit to get more information for you.

>>
>> > a) Add in appropriate requirements for file transfer, disk size,  
>> > memory,
>> > platform (had some trouble figuring out exactly what needs to be
>> > checked/set but I think I have the appropriate ones now)
>>
>> Yup.
>>
>> > b) Set NTDomain attribute
>>
>> If on Windows, yes.
Yes, FYI, I'm currently using a Linux server interfacing via SOAP with a Windows scheduler, which parcels out the jobs to Windows startds.  I've also used a Linux scheduler, which is how I discovered the file transfer requirement. :)  There seem to be some other attributes, though, such as Disk and Disk_RAW, which may need to be set, and I was hoping I wasn't missing anything important.
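
For what it's worth, this is roughly the attribute set I'm putting in the job ad now (illustrative values only; the exact Requirements expression a pool needs may well differ, and I'm not certain this list is complete - hence the question):

# Sketch of the job-ad attributes discussed above; values are examples.
job_ad_attrs = {
    "Requirements": '(Arch == "INTEL") && (OpSys == "WINDOWS") && '
                    '(Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && '
                    '(HasFileTransfer)',
    "ShouldTransferFiles": '"YES"',
    "WhenToTransferOutput": '"ON_EXIT"',
    "NTDomain": '"EXAMPLEDOMAIN"',  # needed for the Windows startds
    "DiskUsage": "1",               # condor_submit normally estimates this
}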

>>
>> Yes, but I don't know what you mean by "safe" or why you would need  
>> to change the names.
It seems that condor_submit changes the names of the output and error parameters from what they are in the submit file to _std_out and _std_error (or something similar) so that they can safely be created on the execute node.  Then, when transferring the files back from the execute nodes, the scheduler, or condor_transfer_data, creates them locally under the names specified in the original submit file.  At the moment, I just set them to something appropriate for our system, so there's no need to rename them when they're retrieved.
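
In other words, rather than relying on any renaming, I just set stable per-job names directly in the ad, along these lines (names here are illustrative):

# Stable, per-job output names, so no rename is needed when the files
# come back from the execute node.
job_ad_attrs["Out"] = '"cluster1234.proc0.out"'
job_ad_attrs["Err"] = '"cluster1234.proc0.err"'
job_ad_attrs["UserLog"] = '"cluster1234.proc0.log"'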

>>
>> Yes, most types should be easy to determine. The only trick being  
>> things like Rank/Requirements/Periodic*/On*/etc which look like  
>> strings but should be expressions.
Well, I've had a bit of difficulty with it, but that may just be how I went about it.  I'll put together a more comprehensive description of my algorithms when I get a chance.

>>
>> I am not an expert on ClassAds, but I don't think you want to  
>> actually replace the reference with a value, instead let ClassAd  
>> evaluation do this for you later.
From looking at the source code, I thought a lot of the replacement was done in the condor_submit binary.  It seems that Cluster and Process are definitely replaced, as well as any custom attributes.  Are these the only ones, or are there others that are also replaced?
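
For context, my current substitution pass is essentially the following (a simplified sketch - the real condor_submit macro expansion is case-insensitive and more general, which is part of what I'm trying to pin down):

def expand_macros(value, cluster, proc, custom=None):
    # Replace $(Cluster), $(Process), and any custom macros with their
    # values.  Case-sensitive here for simplicity, unlike condor_submit.
    subs = {"Cluster": str(cluster), "Process": str(proc)}
    subs.update(custom or {})
    for name, replacement in subs.items():
        value = value.replace("$(%s)" % name, replacement)
    return value

print(expand_macros("out.$(Cluster).$(Process)", 1234, 0))  # out.1234.0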

>>
>> I think Yes/No are sometimes also possible boolean values.
Thanks for pointing that out.

>>
>> I believe condor_submit just knows the type based on the name of the  
>> attribute. For instance, maybe Rank and Requirements are always the  
>> only expressions. I don't think there are any specific guidelines.
Thanks.  I'll take another look at the code and do some more testing.
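
At the moment my type guessing looks something like this (the set of expression-typed attribute names is my own guess, which is exactly what I want to verify against the code):

EXPR_ATTRS = {"Rank", "Requirements", "PeriodicHold", "PeriodicRelease",
              "PeriodicRemove", "OnExitHold", "OnExitRemove"}  # assumed list

def guess_type(name, value):
    if name in EXPR_ATTRS:
        return "EXPRESSION"
    if value.upper() in ("TRUE", "FALSE", "YES", "NO"):  # Yes/No count too
        return "BOOLEAN"
    try:
        int(value)
        return "INTEGER"
    except ValueError:
        pass
    try:
        float(value)
        return "FLOAT"
    except ValueError:
        return "STRING"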

>>
>> You should be able to hold multiple jobs in a single transaction, and  
>> right now it seems that is what you have to do. We could add a hold/
>> release that takes a constraint instead of cluster+proc ids, if that  
>> would be helpful, and I imagine it would be.
>>
>> As for the error message. That cannot easily be changed. The actual  
>> reason for the failure is only logged, and is not immediately  
>> accessible to the birdbath code. Bad, I know.
That would be helpful, but in the meantime, I don't have a problem continuing to hold jobs one at a time in a cluster.  I'll see what the logs say, and get back to you.

>> This is going to depend. Generally, you can always try to read the  
>> files and doing so will not affect the job, running or not. GetFile  
>> reads from the job's spool directory on the Schedd's disk. However,  
>> if a job is running there is no guarantee that its output will  
>> actually be in the spool directory. It might still be on the execute  
>> site waiting to be transferred to the spool when the job completes.  
>> Now, some files (out/err?) can be streamed to the spool directory,  
>> and you should be able to incrementally read them. When you reach an  
>> EOF for those files there might actually be more data, just coming  
>> later. I imagine you would have to ListSpool after hitting an EOF and  
>> check to see if the streaming file has increased in size. Does that  
>> make sense?
I think so, although most of the cases that concerned me were ones where the file was on the scheduler - I verified this manually - and yet when I called getFile, it failed with a "File does not exist" error.  Also, it seemed that in some cases, after spooling the files, they were removed from the server.  I wanted to make sure that wouldn't happen while a job was still running, so that I could incrementally get a copy if a user requested one.
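
Based on your description, I'm planning to poll along these lines (same client as in the first sketch; the listSpool/getFile parameter lists are my best reading of the WSDL, and the result-field names are assumptions):

import time

def tail_spool_file(cluster_id, job_id, name, poll_seconds=5):
    # Incrementally read a streamed spool file, re-checking the spool
    # after each apparent EOF in case more data has arrived.
    offset = 0
    while True:
        listing = client.service.listSpool(None, cluster_id, job_id)
        info = next((f for f in listing if f.name == name), None)
        if info is None or info.size <= offset:
            time.sleep(poll_seconds)  # EOF for now; more may come later
            continue
        chunk = client.service.getFile(None, cluster_id, job_id,
                                       name, offset, info.size - offset)
        offset += len(chunk)
        yield chunk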

>>
>> If you want to run a job N times but not in parallel, you can use  
>> different expressions on the job ad to rerun a job, but I'm guessing  
>> that isn't what you are wanting to do.
Not quite, but it's not really a need of mine, more a curiosity.  I almost never want to rerun the same job more than once, and when I do, it's almost always just to make testing easier.  Thanks for the info.

>> You should be able to just get the files, as long as the job is still  
>> in the queue. Once a job is removed from the queue (CloseSpool +  
>> RemoveJob) its spool directory is deleted and there is nothing to  
>> retrieve.
That's what I thought.  Thanks for confirming this.

>>
>> When removing jobs you should CloseSpool on them first. This will  
>> properly set the LeaveJobInQueue attribute so that when you RemoveJob  
>> it will not appear in the X (removed) state in the queue. You do not  
>> need to retrieve files before removing a job, just CloseSpool on the  
>> job first. If you RemoveJob and then later CloseSpool your job might  
>> sit in the queue for a while until a timer fires and reevaluates the  
>> LeaveJobInQueue attribute, at which time the job will leave the queue.
Thanks for confirming that.  I'll try closing the spool without retrieving the files and see whether I get errors setting the FilesRetrieved attribute.  While I'd still like to retrieve the files, it's much more important that I be able to remove the jobs without issue.
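
So removal will look like this on my end (a sketch; parameter lists per my reading of the birdbath WSDL, same client as in the first sketch):

def remove_cluster_job(cluster_id, job_id, reason):
    # CloseSpool first so LeaveJobInQueue is set correctly, then RemoveJob,
    # so the job doesn't linger in the X (removed) state.
    result = client.service.beginTransaction(60)
    txn = result.transaction
    client.service.closeSpool(txn, cluster_id, job_id)
    client.service.removeJob(txn, cluster_id, job_id, reason, False)
    client.service.commitTransaction(txn)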


Thank you very much for your help.  I really appreciate it.  As always, if I can do anything to help or if you need any more information, just let me know.

Thanks,
Rob

