Hi,

I have a Windows-only HTCondor pool, and I’m trying to submit a very simple task to that pool from another Windows machine outside the pool using the grid universe. The batch file that’s being run is on a network drive that’s accessible by all machines involved, and I don’t care about storing stdout, stderr, and log files, so I don’t want any file transfer to happen. As a result, I’ve set transfer_executable to False and remote_ShouldTransferFiles
to “NO”. Here are the contents of my submit file:

universe = grid
# This is accessible to all machines
executable = //FileServer/path/to/file/test.bat
transfer_executable = False
concurrency_limits = 100
accounting_group = group_condor
accounting_group_user = farnhamj
grid_resource = condor HeadNode.aqrcapital.com HeadNode.aqrcapital.com
remote_universe = vanilla
+remote_RunAsOwner = True
+remote_requirements = HasFincad == True
+remote_ShouldTransferFiles = "NO"
queue

Once the task makes it onto the machine I’m calling HeadNode, it ends up staying Idle forever, because the condor_starter tries and fails to start the job. I found the following message in
the StarterLog.slot1 log on the machine that was trying to start the task:

06/05/15 17:50:22 (pid:2092) Create_Process: CreateProcess failed, errno=267
06/05/15 17:50:22 (pid:2092) SharedPortEndpoint: Inside stop listener.
06/05/15 17:50:22 (pid:2092) Create_Process(//FileServer/path/to/file/test.bat,, ...) failed:
06/05/15 17:50:22 (pid:2092) In OwnerProfile::loaded()
06/05/15 17:50:22 (pid:2092) Failed to start job, exiting
06/05/15 17:50:22 (pid:2092) ShutdownFast all jobs.
06/05/15 17:50:22 (pid:2092) Got ShutdownFast when no jobs running.
06/05/15 17:50:22 (pid:2092) HOOK_JOB_EXIT not configured.
06/05/15 17:50:22 (pid:2092) Entering JICShadow::updateShadow()
06/05/15 17:50:22 (pid:2092) Sent job ClassAd update to startd.
06/05/15 17:50:22 (pid:2092) Leaving JICShadow::updateShadow(): success
06/05/15 17:50:22 (pid:2092) Inside JICShadow::transferOutput(void)
06/05/15 17:50:22 (pid:2092) JICShadow::transferOutput(void): Transferring...
06/05/15 17:50:22 (pid:2092) Inside JICShadow::transferOutputMopUp(void)
06/05/15 17:50:22 (pid:2092) dirscat: dirpath = /
06/05/15 17:50:22 (pid:2092) dirscat: subdir = C:\condor\execute
06/05/15 17:50:22 (pid:2092) Initializing Directory: curr_dir = /\C:\condor\execute\
06/05/15 17:50:22 (pid:2092) **** condor_starter (condor_STARTER) pid 2092 EXITING WITH STATUS 0

The last four lines look suspicious to me. It seems like Condor is trying to run out of C:\condor\execute instead of the location of the script, //FileServer/path/to/file/test.bat, which might
be why the condor_starter is failing to start the job.

In addition, when I use condor_q -l to look at the job’s ClassAd on the machine I’m calling HeadNode, I see the following:

Iwd = "C:\condor\spool\2133\0\cluster22133.proc0.subproc0"

This doesn’t look right--shouldn’t the initial working directory be //FileServer/path/to/file, the directory that contains test.bat?

Finally, every machine in question has the same value set for FILESYSTEM_DOMAIN, which was my attempt to avoid issues accessing the //FileServer/path/to/file UNC path.

I know this is a detailed question--thanks for any help you can provide.

Regards,
Jesse Farnham
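
P.S. In case it helps with diagnosis, here is a minimal sketch of how the errno in the "CreateProcess failed" StarterLog line could be decoded. The lookup table and the function name explain_create_process_failure are hypothetical helpers of my own, not part of HTCondor, and the table is only a hand-picked subset of the Windows system error codes. Assuming the starter reports the raw Windows error, 267 would be ERROR_DIRECTORY, which would fit the suspicion that the working directory (Iwd), rather than the executable path, is what CreateProcess is rejecting.

```python
import re

# Hand-picked subset of Windows system error codes (from winerror.h).
# This is an illustrative table, not a complete one.
WINDOWS_ERRORS = {
    2: "ERROR_FILE_NOT_FOUND: The system cannot find the file specified.",
    3: "ERROR_PATH_NOT_FOUND: The system cannot find the path specified.",
    5: "ERROR_ACCESS_DENIED: Access is denied.",
    267: "ERROR_DIRECTORY: The directory name is invalid.",
}

# Matches the starter log line format quoted in this message.
CREATE_PROCESS_FAILED = re.compile(r"CreateProcess failed, errno=(\d+)")


def explain_create_process_failure(log_line):
    """Describe the errno in a 'CreateProcess failed' log line, or return None."""
    match = CREATE_PROCESS_FAILED.search(log_line)
    if match is None:
        return None  # not a CreateProcess failure line
    code = int(match.group(1))
    return WINDOWS_ERRORS.get(code, f"unknown Windows error code {code}")


line = "06/05/15 17:50:22 (pid:2092) Create_Process: CreateProcess failed, errno=267"
print(explain_create_process_failure(line))
```

This only interprets the error code; it does not tell me which directory the starter actually passed to CreateProcess.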