Re: [HTCondor-users] Inconsistent execute dir permissions
- Date: Thu, 17 Mar 2016 11:10:33 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
On 3/17/2016 10:34 AM, John Hover wrote:
Hi all,
We're having an issue and I'm wondering if you can provide guidance.
Starting with a few questions just to eliminate some quick possibilities -
Do you just have one partitionable slot per startd, or do you also have
some additional static slots? Or more than one partitionable slot?
Do your execute machines have all the local accounts set up in /etc/passwd
for every possible dynamic slot, e.g. on a 32-way machine do you have
user accounts slot1 thru slot32, or perhaps some machines only have user
accounts slot1 thru slot8?
Similar to the above, do you specify at least as many SLOT1_X_USER
entries as your machine has CPU cores?
Are you having HTCondor run via glexec on your execute nodes?
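One quick way to answer the account question on a given node is a small script along these lines. This is only a sketch: the function name `missing_slot_accounts` and the `lookup` parameter are made up for illustration, and it assumes the accounts are named slot1..slotN with N taken from the core count. Note that `pwd.getpwnam` goes through the system's name service, so it will also find accounts that are not literally listed in /etc/passwd.

```python
import os
import pwd


def missing_slot_accounts(num_slots, prefix="slot", lookup=pwd.getpwnam):
    """Return the account names prefix1..prefixN that do not resolve.

    `lookup` is injectable so the check can be exercised without real accounts;
    by default it is pwd.getpwnam, which raises KeyError for unknown users.
    """
    missing = []
    for i in range(1, num_slots + 1):
        name = f"{prefix}{i}"
        try:
            lookup(name)
        except KeyError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumption: one potential dynamic slot (and one account) per CPU core.
    cores = os.cpu_count() or 1
    print(f"{cores} cores; missing accounts: {missing_slot_accounts(cores)}")
```

An empty list means every expected account resolves; any names printed are candidates for the "only slot1 thru slot8" situation described above.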
regards,
Todd
Setup is partitionable slots, as typical, with slot users:
DEDICATED_EXECUTE_ACCOUNT_REGEXP = slot.+
STARTER_ALLOW_RUNAS_OWNER = False
SLOT1_1_USER = slot1
SLOT1_2_USER = slot2
SLOT1_3_USER = slot3
SLOT1_4_USER = slot4
<etc>
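For reference, the SLOT1_N_USER lines above follow a regular pattern, so they can be generated rather than typed by hand. A minimal sketch, assuming the naming stays SLOT1_<n>_USER = slot<n> as shown; the helper name `slot_user_config` is hypothetical, and how many entries a node needs is up to the admin:

```python
def slot_user_config(num_slots, account_prefix="slot"):
    """Build SLOT1_<n>_USER config lines for n = 1..num_slots.

    Assumes each dynamic slot n maps to a local account named
    <account_prefix><n>, matching the pattern in the config above.
    """
    lines = [
        f"SLOT1_{n}_USER = {account_prefix}{n}"
        for n in range(1, num_slots + 1)
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Example: emit entries for a 4-core node.
    print(slot_user_config(4))
```

Comparing the number of generated entries against the core count is a cheap way to confirm there are at least as many SLOT1_N_USER entries as cores.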
But on the nodes, I see inconsistent execute directory ownership,
sometimes a mix of slot users and condor. Other times all owned by condor.
I'm seeing job errors that are consistent with failure to read in those
directories by the job running as the user.
[root@ip-10-153-131-168 ~]# ls -alh /home/condor/execute
total 56K
drwxr-xr-x. 6 condor condor 4.0K Mar 16 22:08 .
drwxr-xr-x. 3 condor condor 4.0K Mar 10 13:09 ..
drwx------. 7 condor condor 12K Mar 16 21:49 dir_1043940
drwx------. 7 condor condor 12K Mar 16 22:06 dir_1062269
drwx------. 7 slot4 slot4 12K Mar 16 22:08 dir_1064108
drwx------. 7 slot3 slot3 12K Mar 16 22:09 dir_1064289
[root@ip-10-121-2-98 ~]# ls -alh /home/condor/execute/
total 56K
drwxr-xr-x. 6 condor condor 4.0K Mar 16 22:00 .
drwxr-xr-x. 3 condor condor 4.0K Mar 10 13:09 ..
drwx------. 7 condor condor 12K Mar 16 21:56 dir_1466019
drwx------. 7 condor condor 12K Mar 16 22:02 dir_1467286
drwx------. 7 condor condor 12K Mar 16 22:02 dir_1467287
drwx------. 7 condor condor 12K Mar 16 22:02 dir_1467288
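To collect the same information across many nodes without eyeballing ls output, something like the following could be run on each one. It is a sketch under stated assumptions: the function names are invented for illustration, the default pattern is copied from DEDICATED_EXECUTE_ACCOUNT_REGEXP above, and treating that pattern as a full match against the owner name is a guess about how HTCondor applies it. The thread does not establish whether condor ownership is ever expected at some point in a job's lifecycle, so the output is a list of things to look at, not a list of errors.

```python
import os
import pwd
import re


def owner_name(path):
    """Return the owning user name of path, or the numeric uid if unresolvable."""
    uid = os.lstat(path).st_uid
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return str(uid)


def unexpected_owners(execute_dir, pattern=r"slot.+", get_owner=owner_name):
    """List (dirname, owner) for dir_* sandboxes whose owner does not match pattern.

    `get_owner` is injectable so the logic can be tested without root or
    real slot accounts.
    """
    regex = re.compile(pattern)
    found = []
    with os.scandir(execute_dir) as entries:
        for entry in entries:
            if not entry.name.startswith("dir_"):
                continue
            if not entry.is_dir(follow_symlinks=False):
                continue
            owner = get_owner(entry.path)
            if not regex.fullmatch(owner):
                found.append((entry.name, owner))
    return sorted(found)


if __name__ == "__main__":
    # Path taken from the listings above; it may not exist on other machines.
    target = "/home/condor/execute"
    if os.path.isdir(target):
        for name, owner in unexpected_owners(target):
            print(f"{name} owned by {owner}")
```

On the first node above this would report dir_1043940 and dir_1062269; on the second it would report all four directories.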
Any idea how this would be happening? Log entries to look for? Ever seen
it before? Any config changes to try?
Thanks,
--john
--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing Department of Computer Sciences
HTCondor Technical Lead 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132 Madison, WI 53706-1685