
[Condor-users] permission problems




Hi,

I've set up Condor 6.8.5 on a couple of Debian Sarge machines as non-root, but I can't get it to run any job.

First, the docs (http://www.cs.wisc.edu/condor/manual/v6.8/3_6Security.html#SECTION004612100000000000000) say that if I mark condor_submit as set-GID, it will create group-writable files. It doesn't: the files it creates are not group-writable.

If I pre-create the files and make them group-writable, it still complains:

% condor_submit submit_file
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 33.

WARNING: File [....]/out.0 is not writable by condor.

WARNING: File [....]/err.0 is not writable by condor.


In log.0, this is repeated a few times:

022 (035.000.000) 05/31 16:09:32 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to vm2@[...] <[...]:34068>
...
023 (035.000.000) 05/31 16:09:32 Job reconnected to vm2@[...]
    startd address: <[...]:34068>
    starter address: <[...]:43170>

before it gives up:

007 (035.000.000) 05/31 16:09:32 Shadow exception!
Error from starter on vm2@[...]: Repeated attempts to transfer output failed for unknown reasons
        0  -  Run Bytes Sent By Job
        75948  -  Run Bytes Received By Job



In the ShadowLog, I see lines like these:

5/31 16:09:32 (35.0) (27919): Attempting to reconnect to starter <[...]:43170>
5/31 16:09:32 (35.0) (27919): Reconnect SUCCESS: connection re-established
5/31 16:09:32 (35.0) (27919): ReliSock::get_file_with_permissions(): Failed to chmod file '[...]/out.0': Operation not permitted (errno: 1)
5/31 16:09:32 (35.0) (27919): DoDownload: SHADOW at [...] failed to receive file [...]/out.0
5/31 16:09:32 (35.0) (27919): Can no longer talk to condor_starter <[...]:43170>


The directory containing out.0 is, of course, group-writable.
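
For reference, here is a minimal sketch of how the permissions involved could be inspected. It rests on one assumption about the failure above: that the chmod error is the usual POSIX rule, under which only a file's owner (or root) may change its mode, so group write access alone would not be enough. The helper name describe_access is made up for illustration and is not part of Condor.

```python
import os
import stat
import sys


def describe_access(path):
    """Report who owns `path` and whether this process could chmod it.

    Hypothetical helper for illustration only; not a Condor tool.
    """
    st = os.stat(path)
    euid = os.geteuid()
    return {
        "owner_uid": st.st_uid,
        "group_gid": st.st_gid,
        "group_writable": bool(st.st_mode & stat.S_IWGRP),
        # POSIX: chmod succeeds only for the file's owner or a privileged
        # process, so group write permission does not grant it.
        "can_chmod": euid == 0 or euid == st.st_uid,
    }


if __name__ == "__main__":
    # Print the report for every path given on the command line.
    for p in sys.argv[1:]:
        print(p, describe_access(p))
```

Run as the user the Condor daemons run under, on out.0 and on its directory, this would show whether a file can be group-writable while still not being chmod-able by that user.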

The file out.0 does get written to, repeatedly, but the job never leaves the queue, and the log files keep growing with the same error messages until condor_rm is called.

Does anyone have any idea what I'm doing wrong?


kurt.