HTCondor Project List Archives




Re: [Condor-devel] New-style locking



On 12/17/2010 11:27 AM, Brian Bockelman wrote:

On Dec 17, 2010, at 10:18 AM, Matthew Farrellee wrote:

On 12/17/2010 10:54 AM, Brian Bockelman wrote:
Hi folks,

Can someone explain the new-style locking to me (or at least point me
to the design document)?  I just upgraded to a bleeding-edge Condor
and new-style locking started to be used for the first time.

I saw lots of files being created in /tmp/condorLocks (which is
inappropriate; that's what /var/lock/condor is for).  Inside it,
there are hundreds of files (after running the schedd for a few
minutes, 2000+ total entries in the directory tree) and a lot of
directories.  The files are owned by many users and the directories
(again, owned by many users) are world-writeable.

Note - it would be very helpful if you could organize the locks by
owner, instead of having an apparently random scheme.  The semantics
are probably identical, but it'll help sysadmins understand what's
happening.

I see a lot of errors in the ScheddLog along the lines of this:

12/17/10 09:45:11 (pid:5168) directory_util::rec_touch_file: File
/tmp/condorLocks/29/85/0/285458.lockc cannot be created (Permission
denied)
12/17/10 09:45:11 (pid:5168) directory_util::rec_touch_file: File
/tmp/condorLocks/29/85/0/285458.lockc cannot be created (Permission
denied)
12/17/10 09:45:11 (pid:5168) FileLock::FileLock: File locks cannot be
created on local disk - will fall back on locking the actual file.
12/17/10 09:45:11 (pid:5168) Warning: Failed to open event rotation
lock file /var/log/condor/EventLog.lock: 13 (Permission denied)

I don't know what EUID is being used to create the lock file, so I
don't know whether the Permission Denied errors are appropriate.  The
EventLog.lock issues aren't new, but the
directory_util::rec_touch_file lines are new.  I think the
EventLog.lock has always been rotated with the wrong permissions in
the most recent versions of Condor.

So, there are lots of things happening, quite a few errors in the logs,
but it appears the system is working.  I would appreciate whatever
background folks can provide.

Brian

The relevant ticket is #1310. I've been trolling through its code lately.

+1 re a configurable location for condorLocks, currently hardcoded as
TMP/condorLocks

You'd also like to see hash_func(filename) -> UID/ha/sh/value.lock
instead of ha/sh/UID/value.lock? Or maybe just knowing where the UID
sits in the scheme is enough.

If you set D_PRIV you'll see what EUID is used during lock creation. It may be interacting badly with the EVENT_LOG.
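For readers following along, here is a rough sketch of the two layouts being compared. It is not the Condor implementation: the hash (CRC32 rendered as ten decimal digits), the function name lock_path, and the uid_first flag are all assumptions made for illustration. Only the ha/sh/UID/value shape, the .lockc suffix, and the /tmp/condorLocks root come from the paths quoted in this thread.

```python
import os
import zlib


def lock_path(target, uid, root="/tmp/condorLocks", uid_first=False):
    """Build a lock-file path for `target` (hypothetical helper).

    The real hash function is not shown in this thread; CRC32 is a stand-in.
    """
    # Render the hash as a fixed-width decimal string so it can be split.
    digits = f"{zlib.crc32(os.fsencode(target)):010d}"
    ha, sh, value = digits[:2], digits[2:4], digits[4:]
    if uid_first:
        # Proposed layout: everything for one owner under a single directory.
        parts = [str(uid), ha, sh]
    else:
        # Layout observed in the thread: ha/sh/UID/value.lockc
        parts = [ha, sh, str(uid)]
    return os.path.join(root, *parts, value + ".lockc")


print(lock_path("/var/log/condor/EventLog", 0))
print(lock_path("/var/log/condor/EventLog", 0, uid_first=True))
```

Passing root as a parameter also shows what a configurable lock location would look like to callers, for example root="/var/lock/condor".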


Thanks Matt.  I posted some comments on #1310.  It seems
CreateHashName uses getuid instead of geteuid, meaning the schedd
creates all lock files with UID=0.  Other than that, it appears to be
working fine.

I think EVENT_LOG and locking have been interacting poorly for
several versions now.  I can't find any tickets for it though.  I can
confirm that it is attempting to take the event log lock with the
EUID of the user.  In my case, it looks like this:

[root@gpn-husker condorLocks]# ll /tmp/condorLocks/29/85/0/285458.lockc
-rw-r--r-- 1 ligo grid 0 Dec 17 09:42 /tmp/condorLocks/29/85/0/285458.lockc

So, whoever has the first event after the schedd is first started
with new locking ends up owning the EventLog lock file forever...
luckily (unfortunately?) for us, ligo is going to be hanging around
here for quite a while.
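To make the consequence concrete, here is a simplified sketch of the classic owner/group/other check applied to a lock file with mode 0644. It assumes the lock file has to be opened for writing, ignores ACLs and directory permissions, and uses made-up numeric UIDs and GIDs; the function name may_open_for_write is hypothetical.

```python
import stat


def may_open_for_write(mode, owner_uid, owner_gid, uid, gids=()):
    """Approximate the POSIX write-permission check for a regular file."""
    if uid == 0:
        # root bypasses the permission bits
        return True
    if uid == owner_uid:
        return bool(mode & stat.S_IWUSR)
    if owner_gid in gids:
        return bool(mode & stat.S_IWGRP)
    return bool(mode & stat.S_IWOTH)


# Lock file as listed above: -rw-r--r--, owned by one user (UID 1000 here).
MODE = 0o644
print(may_open_for_write(MODE, 1000, 100, uid=1000))               # owner
print(may_open_for_write(MODE, 1000, 100, uid=1001, gids=(100,)))  # other user
```

Under these assumptions only the owner (and root) can open the file for writing, which matches the Permission denied lines in the ScheddLog for any later user.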

Brian


I can confirm the getuid issue.

/tmp/condorLocks/38/91/0:
total 0
0 -rw-r--r--. 1 matt matt 0 Dec 17 11:31 760032.lockc

The idea was to use the uid of the owner of the file to be locked (the target file). I'm not sure even using geteuid would be a good idea. The lock file needs to be locatable using only information about the target file, and the process uid/euid does not qualify.
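A small sketch of the three candidate UID sources discussed here, using Python's os module rather than the Condor code. The helper name uid_candidates is hypothetical. The point it illustrates is that the first two values depend on the calling process, while the third depends only on the target file.

```python
import os


def uid_candidates(target):
    """Return the UIDs a lock-path scheme could key on (illustrative only)."""
    return {
        # Real UID of the process: 0 for a daemon started as root,
        # even after it switches its effective UID.
        "getuid": os.getuid(),
        # Effective UID: changes whenever the daemon switches privileges.
        "geteuid": os.geteuid(),
        # Owner of the file being locked: the same for every process.
        "target_owner": os.stat(target).st_uid,
    }


print(uid_candidates("/"))
```

Keying the path on target_owner means any process that can stat the target computes the same lock-file location, whatever identity it happens to be running under.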

Best,


matt