Re: [HTCondor-devel] Patch for drmaa-1.6.1


Date: Mon, 16 Jun 2014 16:11:56 -0500
From: Jaime Frey <jfrey@xxxxxxxxxxx>
Subject: Re: [HTCondor-devel] Patch for drmaa-1.6.1
On Jun 16, 2014, at 1:44 PM, Mikko Vainio <mikko.vainio@xxxxxx> wrote:

On 06/16/2014 08:38 PM, Jaime Frey wrote:
On Jun 13, 2014, at 7:12 AM, Mikko Vainio <mikko.vainio@xxxxxx> wrote:

Please find attached a patch with the changes I had to make to the file libDrmaa.c of the drmaa-1.6.1 C source code in order to get it to play nicely with drmaa-python 0.7.6 on 64-bit Windows 7.

A short summary of changes:
- An offset of 200 (STAT_NOR_BASE) is added to the status code of drmaa_wait() on normal job termination (see also file WISDOM), but that offset was not accounted for in functions drmaa_wtermsig and drmaa_wcoredump. These functions returned DRMAA_ERRNO_INVALID_ARGUMENT for a stat value of 200 (= normal termination, 0 + 200).
- The minimum accepted signal buffer size was 100, while drmaa-python uses a buffer size of 32 (I assumed DRMAA_SIGNAL_BUFFER as defined in drmaa.h:52 is the correct value).
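
For readers following the thread, the offset arithmetic in the first item can be modelled in a few lines. This is only a sketch in Python, not the code from libDrmaa.c and not the attached patch. The value STAT_NOR_BASE = 200 and the error-code names come from the message above; the helper names, the symbolic return values, the rule that any stat at or above the base means normal termination, and the "after patch" behaviour are assumptions made for illustration.

```python
# Illustrative model only; this is not the libDrmaa.c source.
STAT_NOR_BASE = 200  # offset added to the exit code on normal termination

# Symbolic stand-ins for the DRMAA return codes (placeholders for this sketch).
DRMAA_ERRNO_SUCCESS = "success"
DRMAA_ERRNO_INVALID_ARGUMENT = "invalid argument"


def encode_normal_exit(exit_code):
    """Hypothetical helper: the stat that drmaa_wait() reports for a normal exit."""
    return STAT_NOR_BASE + exit_code


def wcoredump_before_patch(stat):
    """Model of the reported behaviour: a normal-termination stat is rejected."""
    if stat >= STAT_NOR_BASE:  # assumption: values at or above the base mean normal exit
        return DRMAA_ERRNO_INVALID_ARGUMENT, None
    raise NotImplementedError("signal-related stat values are outside this sketch")


def wcoredump_after_patch(stat):
    """Assumed patched behaviour: the offset is recognised, no core dump is reported."""
    if stat >= STAT_NOR_BASE:
        return DRMAA_ERRNO_SUCCESS, 0
    raise NotImplementedError("signal-related stat values are outside this sketch")


stat = encode_normal_exit(0)  # exit code 0 + 200
print(stat, wcoredump_before_patch(stat), wcoredump_after_patch(stat))
```

Whether the second variant is the right behaviour is exactly the question discussed in the replies below.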


Could someone please confirm that these changes are correct?

The second change looks good.
But I don’t see the reason for the first change. As described in the man pages, drmaa_wtermsig() and drmaa_wcoredump() shouldn’t be called for a job that exited normally. They should only be called if the job exited via a signal (i.e. if drmaa_wifsignaled() set its first argument to non-zero). Returning DRMAA_ERRNO_INVALID_ARGUMENT for a normal termination status sounds like the right behavior to me.

If drmaa-python is expecting these functions to return success when called with a normal job termination status, that sounds like a bug in drmaa-python.
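
The call order described above (ask drmaa_wifsignaled() first, and only then call drmaa_wtermsig()) might look roughly like the following. It is a sketch under stated assumptions: FakeDrmaa is a hypothetical stand-in rather than a real binding, its methods only borrow the names of the C functions, and the stat value 9 is a made-up example of a "terminated by signal" status.

```python
class FakeDrmaa:
    """Hypothetical stand-in for a DRMAA library; not a real binding."""

    def __init__(self, signaled_stats):
        # Maps a stat value to the name of the signal that ended the job.
        self.signaled_stats = signaled_stats
        self.wtermsig_calls = 0

    def wifsignaled(self, stat):
        return stat in self.signaled_stats

    def wtermsig(self, stat):
        self.wtermsig_calls += 1
        if stat not in self.signaled_stats:
            # Mirrors the strict reading: reject stats of jobs that exited normally.
            raise ValueError("DRMAA_ERRNO_INVALID_ARGUMENT")
        return self.signaled_stats[stat]


def termination_signal(lib, stat):
    """Return the signal name, or None if the job was not ended by a signal."""
    if not lib.wifsignaled(stat):
        return None  # wtermsig() is never called for this stat
    return lib.wtermsig(stat)


lib = FakeDrmaa({9: "SIGKILL"})      # 9 is an arbitrary, made-up stat value
print(termination_signal(lib, 200))  # normal termination (0 + 200)
print(termination_signal(lib, 9))
```

With this guard in place, the strict return code is never triggered by a normal exit, which is why the caller does not need the library to accept such a stat.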

drmaa-python calls all the stat interpreter functions, around here:
https://github.com/drmaa-python/drmaa-python/blob/master/drmaa/session.py#L480
Apparently they only tested against SGE's implementation of the DRMAA bindings, which interprets the C interface description document (http://redmine.ogf.org/attachments/100/drmaav1-c-binding.pdf) differently. For drmaa_wcoredump(), the argument description in that document says: "stat – The status code of a finished job." Here, a stat value of 200 is the status code of a finished job. The return code description says: "DRMAA_ERRNO_INVALID_ARGUMENT – an argument value is invalid." In my opinion the argument value is valid.
The man page text seems to refer to what to fill in the core_dumped argument.

A workaround could be to use drmaa.Session().synchronize(...) instead of drmaa.Session().wait(...) in Python.

I don’t see a problem in the python code. In wait(), the return values of the DRMAA functions are ignored. When drmaa_wcoredump() and drmaa_wtermsig() are called with the stat of a normal job termination, the variables coredumped and term_signal will keep their initial values, since the DRMAA functions won’t modify them. The JobInfo object’s members terminatedSignal and hasCoreDump will then be an empty string and false, respectively. JobInfo’s member 'signaled' will be false, so terminatedSignal and hasCoreDump should be ignored.
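
To make that argument concrete, here is a small model of the pattern. It is a sketch, not the drmaa-python source: the two stub functions are hypothetical replacements for the C calls, the numeric error value is a placeholder, and the use of ctypes-style output arguments is an assumption. Only the variable names, the JobInfo member names, and the buffer size of 32 are taken from this thread.

```python
import ctypes

INVALID_ARGUMENT = 4  # placeholder error value for this sketch


def stub_wcoredump(core_dumped, stat):
    """Hypothetical stub: rejects the call and leaves the output argument untouched."""
    return INVALID_ARGUMENT


def stub_wtermsig(signal_buf, buf_len, stat):
    """Hypothetical stub: rejects the call and leaves the buffer untouched."""
    return INVALID_ARGUMENT


def interpret_normal_exit(stat):
    coredumped = ctypes.c_int(0)                   # initial value: no core dump
    term_signal = ctypes.create_string_buffer(32)  # initial value: empty string
    # The return values are deliberately ignored, as described above.
    stub_wcoredump(coredumped, stat)
    stub_wtermsig(term_signal, len(term_signal), stat)
    return {
        "hasCoreDump": bool(coredumped.value),
        "terminatedSignal": term_signal.value.decode(),
    }


print(interpret_normal_exit(200))
```

Because the stubs never write to their output arguments, the caller ends up with the initial values regardless of the error code that was returned.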

I do see a bug in our implementation of drmaa_wifexited(). It should set 'exited' to non-zero only if the job terminated normally. If the job terminated via a signal, then drmaa_wifexited() should set 'exited' to zero, and drmaa_wifsignaled() can be used to see whether the job terminated via a signal, or some other error occurred. The man page I was looking at earlier is badly worded and suggests otherwise.

Thanks and regards,
Jaime Frey
UW-Madison HTCondor Project
