Re: [HTCondor-users] Problems setting up condor on local node, jobs do not start
- Date: Fri, 13 Sep 2013 08:10:40 -0500
- From: Brian Bockelman <bbockelm@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Problems setting up condor on local node, jobs do not start
On Sep 13, 2013, at 6:39 AM, Alex Seeholzer <alex.seeholzer@xxxxxxx> wrote:
> hi condor-users
>
> I am trying to set up condor 8.1.0 on a local ubuntu 12.04 cluster, and running into quite a few problems even on a single node setup with fairly standard config files. Here is my progression so far:
>
> - ps -efwwww | grep condor_ gives
> condor 21958 1 0 13:05 ? 00:00:00 /usr/sbin/condor_master -pidfile /var/run/condor/condor.pid
> root 21961 21958 0 13:05 ? 00:00:00 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 10000000 -S 60 -C 124
> condor 21962 21958 0 13:05 ? 00:00:00 condor_collector -f
> condor 21963 21958 0 13:05 ? 00:00:00 condor_negotiator -f
> condor 21964 21958 0 13:05 ? 00:00:00 condor_schedd -f
> condor 21965 21958 0 13:05 ? 00:00:00 condor_startd -f
>
> - condor_status returns nothing with the vanilla config files; I had to set ALLOW_WRITE = * to get any nodes to appear. Even setting the machine's own IP manually did not work. If I set ALLOW_WRITE = * I can continue, although this is not really satisfactory.
Before opening write access to the world, check /var/log/condor/CollectorLog for PERMISSION DENIED lines; they should show which address is being rejected.
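To make that log check concrete, here is a minimal sketch in Python. It assumes the denial entries contain the literal phrase PERMISSION DENIED, and that the log lives at the path mentioned above. The helper name `find_denials` is made up for illustration and is not part of HTCondor.

```python
import os

# Assumption: denial entries contain this literal phrase.
DENIAL_MARKER = "PERMISSION DENIED"
# Log path as mentioned in this thread; adjust for your installation.
COLLECTOR_LOG = "/var/log/condor/CollectorLog"


def find_denials(lines):
    """Return the lines containing the denial marker, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if DENIAL_MARKER in line]


def main(path=COLLECTOR_LOG):
    """Print every denial entry found in the log at `path`, if it is readable."""
    if not os.path.exists(path):
        print("log not found: " + path)
        return
    try:
        with open(path, errors="replace") as handle:
            for entry in find_denials(handle):
                print(entry)
    except OSError as exc:
        # The log may only be readable by root or the condor user.
        print("could not read " + path + ": " + str(exc))


if __name__ == "__main__":
    main()
```

Whatever address appears in the matching lines is the one to add to the ALLOW_* settings, rather than using a wildcard.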
> - submitting test jobs does not work. jobs are listed in condor_q as idle. I have 8 available nodes to run the job.
> - running condor_q -analyze shows me that they have not been considered by the matchmaker; checking NegotiatorLog gives me a
> condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from collector
> - If I change
> ALLOW_NEGOTIATOR = $(CONDOR_HOST), $(IP_ADDRESS) -> ALLOW_NEGOTIATOR = *
Again, I'd look at the CollectorLog to see why your hosts are getting denied.
> jobs seem to get started but then I get:
>
> Error from slot3@mynodename: Failed to open 'myhomedir/testjob/first.job.10.2.out' as standard output: Permission denied (errno 13)
>
What's the UID_DOMAIN on each host? If it is not the same on the worker node and the submit node, the job will run as user 'nobody', which typically cannot write to your home directory; that would explain the errno 13.
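A rough way to compare the values is sketched below. It reads the local setting with `condor_config_val UID_DOMAIN` and then compares values you have collected by hand from each host. The function names and the host names in the usage note are hypothetical, and the comparison is a plain string equality, which is an assumption about how the values should match.

```python
import shutil
import subprocess


def local_uid_domain():
    """Return this host's UID_DOMAIN via condor_config_val, or None if unavailable."""
    if shutil.which("condor_config_val") is None:
        return None
    result = subprocess.run(
        ["condor_config_val", "UID_DOMAIN"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()


def mismatched_hosts(domains, reference_host):
    """Return the hosts whose UID_DOMAIN differs from the reference host's value.

    `domains` maps a host name to the UID_DOMAIN string collected on that host.
    """
    expected = domains[reference_host]
    return sorted(host for host, value in domains.items() if value != expected)


if __name__ == "__main__":
    print("local UID_DOMAIN:", local_uid_domain())
```

For example, with made-up values, `mismatched_hosts({"submit": "example.org", "worker1": "example.org", "worker2": "worker2.example.org"}, "submit")` returns `["worker2"]`, which would be the node to fix.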
> Any ideas on how to fix this?
> Thanks, alex
>
> Remark:
> I chose the dev 8.1.0 channel, since 8.0.2 still has python2.6 bindings which I could not provide in ubuntu 12.04 without further hassle. I went through further hassle, however, and this does not change the behaviour described above.
Although they're not posted on the website, HTCondor does a build for Ubuntu12 (linking against python2.7). The nightlies are here:
http://submit-2.batlab.org/results/continuous.php
(Note that these are nightlies, not releases; I don't know where to find the release tarballs for Ubuntu 12.)
TimT -- any reason the Ubuntu release can't be posted on the website alongside Debian?
Brian