Re: [Condor-users] condor q high availability problem
Hi Steve,
Sorry this is so delayed. This was just a regular NFS-exported
directory, which wasn't exported in the required fashion. The
permission problems with the directories in spool that couldn't be
chowned didn't turn out to be the killer. The basic result was that it
failed when trying to manage the lock files (SCHEDD.lock): root on
machine A couldn't write to a file owned by root on machine B.
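(For anyone debugging the same thing, a quick way to see whether root is
being squashed on the mount - the spool path is ours, the test file name
is just illustrative:

    # as root on one of the schedd machines, on the NFS-mounted spool
    touch /opt/sw/Sponge/share/spool/root_write_test
    ls -ln /opt/sw/Sponge/share/spool/root_write_test
    rm -f /opt/sw/Sponge/share/spool/root_write_test

If the touch fails with permission denied, or the file shows up owned by
a nobody-style uid such as 65534 rather than 0, root is being squashed.)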
The fix is obviously a properly exported NFS mount.
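For reference, the shape of export that avoids this is one with
no_root_squash for the schedd machines. A minimal /etc/exports sketch,
with placeholder hostnames (the spool path is the one from the logs
below):

    # on the NFS server; scheddA/scheddB are placeholders for the real hosts
    /opt/sw/Sponge/share/spool  scheddA(rw,sync,no_root_squash) scheddB(rw,sync,no_root_squash)

followed by re-exporting with 'exportfs -ra' on the server.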
Regards,
James
On Sat, Mar 8, 2008 at 12:41 AM, Steven Timm <timm@xxxxxxxx> wrote:
> Hi James--if this works out for you, there are other people who
> would be interested to know what type of shared disk space you
> were using for the high-availability schedd. If I remember correctly,
> each of the cluster* directories in the schedd spool area is
> supposed to be owned by the individual user who is
> running the job, at least if you are in a unix uid/gid model where
> every condor user is running as an independent uid.
>
> Steve Timm
>
>
>
>
> On Fri, 7 Mar 2008, Wojtek Goscinski wrote:
>
> > Howdy,
> >
> > I'm currently in the process of implementing high availability of our
> > Condor queue. I've set up SPOOL to point to a shared space, as
> > described in section 3.10.1 of the Condor manual.
> > However, this shared space is ONLY writable and readable by the condor
> > user - this is a current limitation of the way we're creating and
> > sharing a common mount between the machines which will run the schedd.
> >
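> > For context, the relevant knobs from section 3.10.1 look roughly like
> > the sketch below - the paths are illustrative rather than our real
> > ones, and the exact macro names are worth checking against the manual
> > for your Condor version:
> >
> >   # job queue/spool on the shared filesystem, visible to every
> >   # machine that may run the schedd
> >   SPOOL          = /shared/condor/spool
> >   # let the condor_master fail the schedd over between machines
> >   MASTER_HA_LIST = SCHEDD
> >   # lock on the shared filesystem that decides which machine's
> >   # schedd is currently active
> >   HA_LOCK_URL    = file:/shared/condor/spool
> >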
> > Currently, things seem to be running ok. I've switched one machine
> > over to the new configuration and it locks the spool and manages the
> > queue without problems - other machines will be switched over during
> > downtime.
> >
> > However, I'm seeing the following error messages in my SchedLog. I
> > assume this has to do with the limitations of our mount. I'm wondering
> > whether this is a serious problem that will bite us later on. As I
> > mentioned, at the moment things seem to be running fine.
> >
> > SchedLog:3/7 16:25:58 (fd:11) (pid:2235) Error: Unable to chown
> > '/opt/sw/Sponge/share/spool/cluster1.proc0.subproc0' from 109 to
> > 42407.1089
> > SchedLog:3/7 16:25:58 (fd:11) (pid:2235) (1.0) Failed to chown
> > /opt/sw/Sponge/share/spool/cluster1.proc0.subproc0 from 109 to
> > 42407.1089. Job may run into permissions problems when it starts.
> >
> > SchedLog.old:3/7 16:09:29 (fd:11) (pid:1688) Error: Unable to chown
> > '/opt/sw/Sponge/share/spool/cluster569.proc43.subproc0' from 42407 to
> > 109.109
> > SchedLog.old:3/7 16:09:29 (fd:11) (pid:1688) (569.43) Failed to chown
> > /opt/sw/Sponge/share/spool/cluster569.proc43.subproc0 from 42407 to
> > 109.109. User may run into permissions problems when fetching
> > sandbox.
> >
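> > (Aside for anyone reading along: the numeric ids in those lines can be
> > resolved to account names on the schedd host with something like
> >
> >   getent passwd 109 42407
> >
> > assuming those ids exist in the passwd database there.)
> >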
> > Any comments or suggestions are most welcome.
> >
> > Regards,
> >
> > James
>
> --
> ------------------------------------------------------------------
> Steven C. Timm, Ph.D (630) 840-8525
> timm@xxxxxxxx http://home.fnal.gov/~timm/
> Fermilab Computing Division, Scientific Computing Facilities,
> Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.