
Re: [Condor-users] job stuck in idle mode - HasFileTransfer



Mr. Agarwal,

First of all, based on your situation, FILESYSTEM_DOMAIN should be set to $(FULL_HOSTNAME) on each machine (not "10.8.0.1, condor-mstr"), since the two machines don't share a filesystem.  For the same reason, "should_transfer_files" in your submit file should always be set to "YES".  After you change the configuration file, restart both computers to make sure Condor has a fresh start with the new configuration settings.
That is odd.  HasFileTransfer should be defined, even if it's false for some reason.  What version of Condor are you running?  ('condor_version' will tell you.)  Also, recheck the StartLog for unusual errors or warnings.
Do you get anything when you run 'condor_status -long | grep -i transfer'?  If not, what is the complete output of 'condor_status -long'?
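
For checking one machine ad by script rather than by eye, here is a minimal Python sketch. It assumes the 'condor_status -long' output has been saved as text and contains "Name = value" lines. The function name parse_ad and the sample ad are hypothetical, and the sketch handles a single ad only (a real pool prints one ad per slot, separated by blank lines).

```python
def parse_ad(long_output: str) -> dict:
    """Parse 'Attribute = value' lines from one machine ad into a dict."""
    ad = {}
    for line in long_output.splitlines():
        # Split on the first " = " only, so values containing "==" stay intact.
        name, sep, value = line.partition(" = ")
        if sep:
            ad[name.strip()] = value.strip()
    return ad


# Hypothetical sample resembling the execute node described in this thread.
sample = '''Name = "slot1@exec-node"
FileSystemDomain = "10.8.0.1, condor-mstr"
Memory = 2048
'''

ad = parse_ad(sample)
# Attributes the job's auto-generated requirement refers to.
missing = [a for a in ("HasFileTransfer", "FileSystemDomain") if a not in ad]
print(missing)  # ['HasFileTransfer']
```

On a live pool, the text could come from subprocess.run(["condor_status", "-long"], capture_output=True, text=True).stdout instead of the sample string.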

Best Regards,
 - Garrett
condor.cs.wlu.edu

From: condor-users-bounces@xxxxxxxxxxx [condor-users-bounces@xxxxxxxxxxx] on behalf of Shiv Agarwal [shiv@xxxxxxxxxxx]
Sent: Friday, August 19, 2011 7:06 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] job stuck in idle mode - HasFileTransfer

Garrett,

Appreciate your quick reply. I tried the commands you mentioned.

condor_status -long | grep ^HasFileTransfer  - did not show any results

condor_status -long | grep ^FileSystemDomain - showed "10.8.0.1, condor-mstr" 

10.8.0.1 is the IP address of my master node and "condor-mstr" is its hostname.

On my execute node, FILESYSTEM_DOMAIN = 10.8.0.1, condor-mstr. I set it to both because when I run condor_config_val -v FILESYSTEM_DOMAIN on my master node it shows "condor-mstr", but on my execute node the same command shows the IP address, "10.8.0.1".

I do not have NFS setup so I do need to transfer the files.

I don't even see any errors anywhere and what is driving me crazy is that the master does not even seem to try to transfer files. It just presumes that the execute node does not allow it as if something was preset when the execute node first connected to the master node.

This is my submit file 

Universe   = vanilla
Requirements  = Arch == "INTEL" &&  Memory >= 32
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
Executable = simple
Arguments  = 4 10
Log        = outsimple.log
Output     = outsimple.$(Process).out
Error      = outsimple.error
Queue


Shiv

On Fri, Aug 19, 2011 at 3:54 PM, Koller, Garrett <kollerg14@xxxxxxxxxxxx> wrote:
Mr. Agarwal,

I don't think TRUST_UID_DOMAIN is the problem.  Run 'condor_status -long | grep ^HasFileTransfer' and 'condor_status -long | grep ^FileSystemDomain' to find out which of the two conditions is failing.  First of all, I'm assuming these two conditions have been automatically inserted into your job's requirements because you enabled file transfer in the submission file or Condor needs it by default.  Assuming that file transfer can work on all of your machines, HasFileTransfer should be true for all of your machines and FileSystemDomain should be set to the domain that all of the machines belong to (such as "cs.wisc.edu"), depending on your situation.  Check the FILESYSTEM_DOMAIN variable in the configuration files.  If your machines all share a similar filesystem (using NFS or mounted home directories or something), they should all be set to the same internet subdomain that they all belong to.
I know this is basic stuff, but hopefully this will prompt you to check your configuration to see if anything is wrong.  Besides that, I don't know exactly what causes Condor to set HasFileTransfer to true or false.  Search the documentation for descriptions of these variables for more information.
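
A minimal sketch of how the two conditions combine in ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == "condor-mstr" ). It assumes ClassAd-style three-valued logic, where a missing attribute evaluates to UNDEFINED and a job matches only if the whole expression is true. This is an illustration in Python, not Condor's matchmaking code; tri_or, transfer_requirement and matches are hypothetical names.

```python
from typing import Optional


def tri_or(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    """Three-valued OR; None stands in for ClassAd UNDEFINED."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def transfer_requirement(machine_ad: dict, submit_domain: str) -> Optional[bool]:
    """Evaluate HasFileTransfer || (FileSystemDomain == submit_domain)."""
    has_transfer = machine_ad.get("HasFileTransfer")  # None if not advertised
    domain = machine_ad.get("FileSystemDomain")
    # Assumption: string comparison with == ignores case, as in ClassAds.
    same_domain = None if domain is None else domain.lower() == submit_domain.lower()
    return tri_or(has_transfer, same_domain)


def matches(machine_ad: dict, submit_domain: str) -> bool:
    """Only a definite True counts as a match; UNDEFINED does not."""
    return transfer_requirement(machine_ad, submit_domain) is True


# The execute node as reported in this thread: no HasFileTransfer attribute,
# and a FileSystemDomain that differs from the submit machine's.
execute_node = {"FileSystemDomain": "10.8.0.1, condor-mstr"}
print(matches(execute_node, "condor-mstr"))  # False
```

Under these assumptions, either advertising HasFileTransfer as true or making FileSystemDomain equal on both machines would be enough for this clause to match.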

Best Regards,
 - Garrett Heath Koller
kollerg14@xxxxxxxxxxxx

From: condor-users-bounces@xxxxxxxxxxx [condor-users-bounces@xxxxxxxxxxx] on behalf of Shiv Agarwal [shiv@xxxxxxxxxxx]
Sent: Friday, August 19, 2011 6:16 PM
To: condor-users
Subject: [Condor-users] job stuck in idle mode - HasFileTransfer

I have setup a small condor pool with 1 master node and 1 execute node.

I see no error messages in the master or worker node log files whatsoever. In fact, the worker node does not even receive the request to execute the job. From my understanding, the master node itself decides not to send the job to the execute node.

condor_q -analyze shows me that this particular requirement did not match:

 ( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == "condor-mstr" ) )  0


I have even set TRUST_UID_DOMAIN = True


Please HELP!


--
Shiv Agarwal
