Thanks to everyone who has responded trying to help me with this problem. I've tried some of the suggestions and am still having the problem. Here is what I have done so far.
I am submitting a simple job named testlinux3.sub with the following contents:
Executable = /bin/hostname
Requirements = UidDomain == "condor.calumet.purdue.edu" && Arch == "X86_64"
Universe = vanilla
transfer_files = ALWAYS
Output = hostname3.out
Log = hostname3.log
Queue
I submit the job with condor_submit testlinux3.sub and it enters the queue. It sits there for 30 minutes and then flocks to condor.calumet.purdue.edu as expected; however, I immediately start getting shadow exceptions. At this point the log shows the following (IPs have been omitted to protect the guilty :) ):
000 (251318.000.000) 07/06 16:19:55 Job submitted from host: <x.x.x.x:57608>
...
001 (251318.000.000) 07/06 16:50:05 Job executing on host: <x.x.x.x:23601>
...
007 (251318.000.000) 07/06 16:50:13 Shadow exception!
Error from starter on vm1@xxxxxxxxxxxxxxxxxxxxxxxxx: Failed to execute '/usr/local/condor/home/execute/dir_14129/condor_exec.exe condor_exec.exe': No such file or directory
0 - Run Bytes Sent By Job
10740 - Run Bytes Received By Job
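The event header lines in the log excerpt above share a regular shape: a three-digit event code, a (cluster.proc.subproc) job ID, a date, a time, and a message. A minimal Python sketch for pulling those fields out and picking the shadow exceptions from a longer log might look like this. The regular expression and the function names are assumptions inferred only from the three header lines quoted here, not from a specification of the log format.

```python
import re

# Pattern inferred from the header lines quoted above (an assumption, not an
# official format): "NNN (cluster.proc.subproc) MM/DD HH:MM:SS message"
EVENT_RE = re.compile(
    r"^(?P<code>\d{3}) \((?P<job>\d+\.\d+\.\d+)\) "
    r"(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<message>.*)$"
)

def parse_events(log_text):
    """Return one dict per event header line found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:
            events.append(match.groupdict())
    return events

def shadow_exceptions(events):
    """Keep only the events whose message mentions a shadow exception."""
    return [e for e in events if "Shadow exception" in e["message"]]

# The header lines from the log excerpt in this message.
SAMPLE_LOG = """\
000 (251318.000.000) 07/06 16:19:55 Job submitted from host: <x.x.x.x:57608>
...
001 (251318.000.000) 07/06 16:50:05 Job executing on host: <x.x.x.x:23601>
...
007 (251318.000.000) 07/06 16:50:13 Shadow exception!
"""

for event in shadow_exceptions(parse_events(SAMPLE_LOG)):
    print(event["code"], event["job"], event["time"], event["message"])
```

Lines that do not match the header shape (the "..." separators and the indented detail lines) are skipped, so the error text from the starter would have to be collected separately.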
Permissions on /usr/local/condor/home/execute are:
drwxrwxrwt 2 root root 4.0K Jul 6 15:15 execute
There are no other files or directories inside the execute directory. Condor runs as root on this server. Also, I have configured this server to use LOWPORT = 23410 and HIGHPORT = 23914. As you can see from the log above, the execute host's port (23601) falls within that range.
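The port check described above can be automated for longer logs. The sketch below extracts the port from each <address:port> string and compares it with the range quoted in this message. The regular expression and the helper names are illustrative assumptions; the two sample lines are copied from the log excerpt.

```python
import re

# Port range quoted in the message above.
LOWPORT = 23410
HIGHPORT = 23914

# Matches the "<address:port>" form seen in the log. The address is masked
# in the post, so only the port is captured.
ADDRESS_RE = re.compile(r"<[^:<>]+:(\d+)>")

def ports_in_text(text):
    """Return every port number that appears as <address:port> in text."""
    return [int(port) for port in ADDRESS_RE.findall(text)]

def in_range(port, low=LOWPORT, high=HIGHPORT):
    """True when port lies inside the inclusive [low, high] range."""
    return low <= port <= high

submit_line = "Job submitted from host: <x.x.x.x:57608>"
execute_line = "Job executing on host: <x.x.x.x:23601>"

for line in (submit_line, execute_line):
    for port in ports_in_text(line):
        print(port, in_range(port))
# prints:
# 57608 False
# 23601 True
```

Only the execute host's port is expected to be in the range here, since the range was configured on that server; the submit host is a different machine with its own settings.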
What else can I do to check this?
Thanks again.
John Alberts
Technical Assistant for EMS
alberts@xxxxxxxxxxxxxxxxxx
219-989-2083
CLO 332
http://public.xdi.org/=john.alberts
________________________________
From: condor-users-bounces@xxxxxxxxxxx on behalf of Dan Bradley
Sent: Thu 7/6/2006 9:25 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] How To TroubleShoot Flocking
By the way: the reference to "condor_exec.exe" is expected. This is the
name Condor runs the user's executable as (i.e. argv[0]). Failure to
execute the job is most often the result of files not being accessible
from the execute node. I assume this is a vanilla universe job. What
file-transfer settings are you using?
--Dan
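To answer the file-transfer question for a given submit file, it can help to list just the transfer-related commands it contains. The following Python sketch does that for the testlinux3.sub contents posted in this thread. The parser is deliberately simple (it splits each line on the first "="), and the set of command names it looks for is an assumption about which settings are relevant, not an exhaustive list.

```python
# Submit commands related to file transfer. "transfer_files" is the one used
# in the submit file posted in this thread; the others are further
# condor_submit commands assumed to be worth checking.
TRANSFER_KEYS = {
    "transfer_files",
    "should_transfer_files",
    "when_to_transfer_output",
    "transfer_input_files",
    "transfer_executable",
}

def parse_submit(text):
    """Parse 'key = value' lines into a dict with lowercased keys."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skips blank lines, comments, and the bare "Queue"
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip()
    return settings

def transfer_settings(settings):
    """Return only the file-transfer commands present in the submit file."""
    return {k: v for k, v in settings.items() if k in TRANSFER_KEYS}

# Contents of testlinux3.sub as posted in this thread.
SUBMIT_TEXT = """\
Executable = /bin/hostname
Requirements = UidDomain == "condor.calumet.purdue.edu" && Arch == "X86_64"
Universe = vanilla
transfer_files = ALWAYS
Output = hostname3.out
Log = hostname3.log
Queue
"""

print(transfer_settings(parse_submit(SUBMIT_TEXT)))
# prints: {'transfer_files': 'ALWAYS'}
```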
Kewley, J (John) wrote:
> [don't treat below as gospel - I haven't flocked in a while so some
> things may have changed or I may have mis-spelled things]
> There are a few subtle things that can stop flocking from working:
> * set FLOCK_TO and FLOCK_FROM at both ends for a 2 way flock
> * HOSTALLOW values may need to be changed to include these other machines
> * If you have security enabled, then this might need to be made more
> flexible to include other authentication mechanisms
> * Machines in other pool may be of a different ARCH or OpSys
> * Your jobs may be set up to use a shared filestore (NFS for instance)
> which isn't available from the other pool.
> You can use
> condor_config_val -pool NODE_NAME -name NODE_NAME val
> where val is one of
> hostallow_write, hostallow_read, flock_to, flock_from
> to see what values are set for the different machines
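When several machines and several settings need checking, the commands can be generated rather than typed by hand. The Python sketch below only builds the command lines, following the form given above, and does not run them. The host names are hypothetical placeholders, and the function name is an assumption for illustration.

```python
# Settings named in the message above.
SETTINGS = ["hostallow_write", "hostallow_read", "flock_to", "flock_from"]

def config_val_commands(nodes, settings=SETTINGS):
    """Build one condor_config_val argument list per (node, setting) pair,
    using the form quoted above: -pool NODE_NAME -name NODE_NAME val."""
    return [
        ["condor_config_val", "-pool", node, "-name", node, setting]
        for node in nodes
        for setting in settings
    ]

# Hypothetical machine names; substitute the real nodes in each pool.
NODES = ["submit.example.org", "execute.example.org"]

for command in config_val_commands(NODES):
    print(" ".join(command))
```

Each printed line can be pasted into a shell, or the argument lists can be handed to subprocess.run on a machine where the Condor tools are installed.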
> But the more usual culprits are firewalls.
> Are there any firewalls between the pools? (or is one pool behind a NAT)
> Remember that for jobs to flock, every submit node needs to be able to
> talk to every execute node, and vice versa, over the fixed ports and
> the upper port range, over both TCP and UDP.
> If that is not the case, you'll have to relax the firewalls or use GCB.
> See also
> http://www.allhands.org.uk/2005/proceedings/papers/431.pdf
> for more info on firewalls in a Condor Pool
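A rough way to test the TCP half of that requirement is to attempt a connection to each port of interest from the other side of the firewall. The Python sketch below uses only the standard socket module. The function names are assumptions for illustration, and it does not cover UDP. A refused connection can also just mean that nothing is listening on that port at the moment, so a failure is a hint rather than proof of a firewall block.

```python
import socket

def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

def unreachable_ports(host, ports, timeout=2.0):
    """Return the subset of ports that could not be reached over TCP."""
    return [p for p in ports if not tcp_port_open(host, p, timeout)]

# Example call (hypothetical host name; run from a submit node):
#   unreachable_ports("execute.example.org", [23410, 23601, 23914])
```

Checking a whole port range this way can be slow when a firewall silently drops packets, because every blocked port waits for the full timeout; sampling a few ports from the range is usually enough for a first check.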
> Cheers
> JK
>
> -----Original Message-----
> *From:* condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx]*On Behalf Of *John Alberts
> *Sent:* Wednesday, July 05, 2006 8:41 PM
> *To:* Condor-Users Mail List
> *Subject:* [Condor-users] How To TroubleShoot Flocking
>
> Hi. I am trying to set up flocking between two Condor pools. I have
> complete control of and access to one pool; on the other I can log in
> using ssh and submit jobs. The administrator of the other pool is
> currently on vacation and said he has configured flocking to/from
> our pool. I'm trying to test it, and it seems like flocking isn't
> working.
>
> I was wondering how I can troubleshoot flocking to see what the
> culprit is. I already tried to submit a job whose requirements can
> only be fulfilled on the other pool. condor_q -analyze
> <jobid> shows that all machines can't meet the requirements. I
> have also run condor_status -pool <otherpoolname> and it properly
> displays all available machines on the other pool. I'm not sure
> what to check next.
>
> Note: There is a firewall between the pools and our network admin
> has already configured the firewall to allow traffic between pools.
>
> Thanks for any help.
>
> John Alberts
> Technical Assistant for EMS
> alberts@xxxxxxxxxxxxxxxxxx
> 219-989-2083
> CLO 332
> http://public.xdi.org/=john.alberts
>
>------------------------------------------------------------------------
>
>_______________________________________________
>Condor-users mailing list
>To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>subject: Unsubscribe
>You can also unsubscribe by visiting
>https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
>The archives can be found at either
>https://lists.cs.wisc.edu/archive/condor-users/
>http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR
>