HTCondor Project List Archives




Re: [Condor-devel] New-style locking



On Dec 17, 2010, at 10:18 AM, Matthew Farrellee wrote:

> On 12/17/2010 10:54 AM, Brian Bockelman wrote:
>> Hi folks,
>> 
>> Can someone explain the new-style locking to me (or at least point me
>> to the design document)?  I just upgraded to a bleeding-edge Condor
>> and new-style locking started to be used for the first time.
>> 
>> I saw lots of files being created in /tmp/condorLocks (which is
>> inappropriate, that's what /var/lock/condor is for).  Inside it,
>> there are hundreds of files (after running the schedd for a few
>> minutes, 2000+ total entries in the directory tree) and a lot of
>> directories.  The files are owned by many users and the directories
>> (again, owned by many users) are world-writeable.
>> 
>> Note - it would be very helpful if you could organize the locks by
>> owner, instead of having an apparently random scheme.  The semantics
>> are probably identical, but it'll help sysadmins understand what's
>> happening.
>> 
>> I see a lot of errors in the ScheddLog along the lines of this:
>> 
>> 12/17/10 09:45:11 (pid:5168) directory_util::rec_touch_file: File
>> /tmp/condorLocks/29/85/0/285458.lockc cannot be created (Permission
>> denied)
>> 12/17/10 09:45:11 (pid:5168) directory_util::rec_touch_file: File
>> /tmp/condorLocks/29/85/0/285458.lockc cannot be created (Permission
>> denied)
>> 12/17/10 09:45:11 (pid:5168) FileLock::FileLock: File locks cannot be
>> created on local disk - will fall back on locking the actual file.
>> 12/17/10 09:45:11 (pid:5168) Warning: Failed to open event rotation
>> lock file /var/log/condor/EventLog.lock: 13 (Permission denied)
>> 
>> I don't know what EUID is being used to create the lock file, so I
>> don't know whether the Permission Denied errors are appropriate.  The
>> EventLog.lock issues aren't new, but the
>> directory_util::rec_touch_file lines are new.  I think the
>> EventLog.lock has always been rotated with the wrong permissions in
>> the most recent versions of Condor.
>> 
>> So, there's lots of things happening, quite a few errors in the logs,
>> but it appears the system is working.  I would appreciate whatever
>> background folks can provide.
>> 
>> Brian
> 
> The relevant ticket is #1310. I've been trolling through its code lately.
> 
> +1 re a configurable location for condorLocks, currently hardcoded as
> TMP/condorLocks
> 
> You'd also like to see hash_func(filename) -> UID/ha/sh/value.lock
> instead of ha/sh/UID/value.lock? Or maybe just knowing where the UID is in the scheme is enough.
> 
> If you set D_PRIV you'll see what EUID is used during lock creation. It may be interacting badly with the EVENT_LOG.
> 

Thanks Matt.  I posted some comments on #1310.  It seems CreateHashName calls getuid instead of geteuid, so the schedd builds every lock file path with UID=0 no matter which user it is acting for.  Other than that, it appears to be working fine.
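
To make the getuid/geteuid point concrete, here is a rough sketch in Python of the two layouts discussed above (ha/sh/UID/value.lock and UID/ha/sh/value.lock).  It is not Condor's code: lock_path, its parameters, and the crc32-based hash are all made up for illustration, and the real CreateHashName almost certainly hashes differently.

```python
import os
import zlib
from pathlib import PurePosixPath


def lock_path(filename, uid, base="/tmp/condorLocks", uid_first=False):
    """Build a hypothetical lock path for `filename`.

    The hash below is a stand-in (crc32), NOT the one Condor uses.
    """
    value = zlib.crc32(filename.encode("utf-8"))
    digits = f"{value:010d}"          # zero-pad so both buckets always exist
    ha, sh = digits[:2], digits[2:4]  # two levels of hash buckets
    name = f"{value}.lockc"
    if uid_first:
        parts = (str(uid), ha, sh, name)   # UID/ha/sh/value.lockc
    else:
        parts = (ha, sh, str(uid), name)   # ha/sh/UID/value.lockc
    return PurePosixPath(base, *parts)


if __name__ == "__main__":
    target = "/var/log/condor/EventLog"
    # In a daemon started as root that switches only its effective UID,
    # getuid() keeps returning 0 while geteuid() follows the switched user.
    print(lock_path(target, os.getuid()))
    print(lock_path(target, os.geteuid()))
    print(lock_path(target, os.geteuid(), uid_first=True))
```

Run as an ordinary user, the first two lines print the same path; they only diverge in a process whose real and effective UIDs differ, which is the situation the schedd is in.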

I think EVENT_LOG and locking have been interacting poorly for several versions now.  I can't find any tickets for it though.  I can confirm that it is attempting to take the event log lock with the EUID of the user.  In my case, it looks like this:

[root@gpn-husker condorLocks]# ll /tmp/condorLocks/29/85/0/285458.lockc
-rw-r--r-- 1 ligo grid 0 Dec 17 09:42 /tmp/condorLocks/29/85/0/285458.lockc

So, whoever has the first event after the schedd is first started with new-style locking owns the EventLog lock file forever... luckily (unfortunately?) for us, ligo is going to be hanging around here for quite a while.
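
For anyone wondering why the listing above is a problem for every other user, here is a small sketch of the classic owner/group/other write check applied to a mode of -rw-r--r--.  The helper may_write and the numeric IDs are invented for the example, and it ignores root's override and ACLs, so treat it as an approximation of what the kernel does.

```python
import stat


def may_write(mode, owner_uid, owner_gid, uid, gids):
    """Approximate the POSIX write-permission check for a non-root user.

    Only the first matching class (owner, then group, then other) is used.
    """
    if uid == owner_uid:
        return bool(mode & stat.S_IWUSR)
    if owner_gid in gids:
        return bool(mode & stat.S_IWGRP)
    return bool(mode & stat.S_IWOTH)


if __name__ == "__main__":
    mode = 0o100644                      # regular file, rw-r--r--
    print(stat.filemode(mode))
    # Hypothetical IDs: 500 stands in for ligo, 100 for the grid group.
    print(may_write(mode, 500, 100, uid=500, gids={100}))  # the owner
    print(may_write(mode, 500, 100, uid=501, gids={100}))  # same group
    print(may_write(mode, 500, 100, uid=502, gids={200}))  # anyone else
```

Only the owner gets write access under that mode, so any other EUID trying to open the same lock file for writing would be refused.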

Brian
