Thanks to everyone who has responded trying to help me with this problem.
I've tried some of the suggestions and am still having the problem. Here
is what I have done so far.

I am submitting a simple job named testlinux3.sub with the following
contents:

    Executable = /bin/hostname
    Requirements = UidDomain == "condor.calumet.purdue.edu" && Arch == "X86_64"
    Universe = vanilla
    transfer_files = ALWAYS
    Output = hostname3.out
    Log = hostname3.log
    Queue

I use condor_submit testlinux3.sub to submit the job and it goes in the
queue. It sits in the queue for 30 minutes and then it flocks to
condor.calumet.purdue.edu as expected; however, I immediately start
getting shadow errors. At this point the log shows:
(ip's have been omitted to protect the guilty :) )

    000 (251318.000.000) 07/06 16:19:55 Job submitted from host: <x.x.x.x:57608>
    ...
    001 (251318.000.000) 07/06 16:50:05 Job executing on host: <x.x.x.x:23601>
    ...
    007 (251318.000.000) 07/06 16:50:13 Shadow exception!
        Error from starter on vm1@xxxxxxxxxxxxxxxxxxxxxxxxx: Failed to execute '/usr/local/condor/home/execute/dir_14129/condor_exec.exe condor_exec.exe': No such file or directory
        0  -  Run Bytes Sent By Job
        10740  -  Run Bytes Received By Job

Permissions on /usr/local/condor/home/execute are:

    drwxrwxrwt 2 root root 4.0K Jul 6 15:15 execute

There is no other file or directory inside the execute directory.
Condor runs as root on this server.

Also, I have configured this server to use Lowport: 23410 and
Highport: 23914. As you can see from the log above, it appears to be in
the proper range.

What else can I do to check this?

Thanks again.

John Alberts
Technical Assistant for EMS
alberts@xxxxxxxxxxxxxxxxxx
219-989-2083
CLO 332
http://public.xdi.org/=john.alberts

________________________________

From: condor-users-bounces@xxxxxxxxxxx on behalf of Dan Bradley
Sent: Thu 7/6/2006 9:25 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] How To TroubleShoot Flocking

By the way: the reference to "condor_exec.exe" is expected.
This is the name Condor runs the user's executable as (i.e. argv[0]).

Failure to execute the job is most often the result of files not being
accessible from the execute node. I assume this is a vanilla universe
job. What file-transfer settings are you using?

--Dan

Kewley, J (John) wrote:

> [don't treat below as gospel - I haven't flocked in a while so some
> things may have changed or I may have mis-spelled things]
>
> There are a few subtle things that can stop flocking working:
>
> * set FLOCK_TO and FLOCK_FROM at both ends for a 2 way flock
> * HOSTALLOW values may need to be changed to include these other
>   machines
> * If you have security enabled - then this might need to be made more
>   flexible to include other authentication mechanisms
> * Machines in other pool may be of a different ARCH or OpSys
> * Your jobs may be set up to use a shared filestore (NFS for instance)
>   which isn't available from the other pool.
>
> You can use
>     condor_config_val -pool NODE_NAME -name NODE_NAME val
> where val is one of
>     hostallow_write, hostallow_read, flock_to, flock_from
> to see what values are set for the different machines
>
> But the more usual culprits are firewalls.
> Are there any firewalls between the pools? (or is one pool behind a NAT)
>
> Remember that for jobs to flock, every submit node needs to be able to
> talk to every execute node and vice versa over the fixed ports and
> upper port range, all over both tcp and udp.
>
> If that is not the case, you'll have to relax the firewalls or use GCB.
>
> See also
>     http://www.allhands.org.uk/2005/proceedings/papers/431.pdf
> for more info on firewalls in a Condor Pool
>
> Cheers
>
> JK
>
>     -----Original Message-----
>     *From:* condor-users-bounces@xxxxxxxxxxx
>     [mailto:condor-users-bounces@xxxxxxxxxxx]*On Behalf Of *John Alberts
>     *Sent:* Wednesday, July 05, 2006 8:41 PM
>     *To:* Condor-Users Mail List
>     *Subject:* [Condor-users] How To TroubleShoot Flocking
>
>     Hi. I am trying to setup flocking between 2 condor pools.
>     1 pool I have complete control/access to, the other pool I can
>     log in using ssh and submit jobs. The administrator of the other
>     pool is currently on vacation and said he has configured flocking
>     to/from our pool. I'm trying to test it, and it seems like
>     flocking isn't working.
>
>     I was wondering how I can troubleshoot flocking to see what the
>     culprit is. I already tried to submit a job whose requirements can
>     only be fulfilled on the other pool. Condor_status -analyze
>     <jobid> shows that all machines can't meet the requirements. I
>     have also run condor_status -pool <otherpoolname> and it properly
>     displays all available machines on the other pool. I'm not sure
>     what to check next.
>
>     Note: There is a firewall between the pools and our network admin
>     has already configured the firewall to allow traffic between pools.
>
>     Thanks for any help.
>
>     John Alberts
>     Technical Assistant for EMS
>     alberts@xxxxxxxxxxxxxxxxxx
>     219-989-2083
>     CLO 332
>     http://public.xdi.org/=john.alberts
>
>------------------------------------------------------------------------
>
>_______________________________________________
>Condor-users mailing list
>To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>subject: Unsubscribe
>You can also unsubscribe by visiting
>https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
>The archives can be found at either
>https://lists.cs.wisc.edu/archive/condor-users/
>http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR