[HTCondor-users] Issues with Two Machine Pool on Windows 7


Hello,

I am a new HTCondor user. I have searched the internet and parts of this forum, but I am still having trouble getting my HTCondor pool set up. I believe the problem may be permission or configuration related.

I'm hoping that with your wisdom you can give me a lead.

This is the version information:

C:\Projects\FMAChallanges\Challenge_Model_1\TUFLOW\runs]condor_status -version

$CondorVersion: 8.2.6 Dec 10 2014 BuildID: 287355 $

$CondorPlatform: x86_64_Windows8 $

CONFIGURATION

I am running a two-computer pool: a central manager and a second computer, C1. I have no local configuration files set up. For this example I have replaced the real IP addresses and host names with "Central Manager" and "C1" respectively.

On my central manager I have the following config content:

#===============================================

#---------------------CENTRAL MANAGER-----------------------------

RELEASE_DIR = C:\condor

LOCAL_CONFIG_FILE = $(LOCAL_DIR)\condor_config.local

REQUIRE_LOCAL_CONFIG_FILE = FALSE

LOCAL_CONFIG_DIR = $(LOCAL_DIR)\config

use SECURITY : HOST_BASED

CONDOR_HOST = WEBR1525.xxxxx.xxxx

UID_DOMAIN = xxxxx

CONDOR_ADMIN = xxxx.xxx@xxxx

SMTP_SERVER = mailblahblah

COLLECTOR_NAME = FloodiesMod

COLLECTOR_HOST = $(CONDOR_HOST)

ALLOW_READ = *

ALLOW_WRITE = $(CONDOR_HOST), $(IP_ADDRESS), *

ALLOW_ADMINISTRATOR = $(IP_ADDRESS)

ALLOW_NEGOTIATOR = $(IP_ADDRESS)

ALLOW_DAEMON = *

JAVA = C:\PROGRA~2\Java\JRE18~1.0_2\bin\java.exe

START = TRUE

SUSPEND = FALSE

WANT_SUSPEND = TRUE

WANT_VACATE = FALSE

PREEMPT = FALSE

DAEMON_LIST = MASTER SCHEDD COLLECTOR NEGOTIATOR STARTD

#===============================================

On C1 I have the following config:

#===============================================

#-------------------------------C1-----------------------------------------

RELEASE_DIR = C:\condor

LOCAL_CONFIG_FILE = $(LOCAL_DIR)\condor_config.local

REQUIRE_LOCAL_CONFIG_FILE = FALSE

LOCAL_CONFIG_DIR = $(LOCAL_DIR)\config

use SECURITY : HOST_BASED

CONDOR_HOST = WEBR1525.xxxxx.xxxx

UID_DOMAIN = xxxxx

CONDOR_ADMIN = xxxx.xxx@xxxx

SMTP_SERVER = mailblahblah

COLLECTOR_NAME = FloodiesMod

COLLECTOR_HOST = $(CONDOR_HOST)

ALLOW_READ = *

ALLOW_WRITE = $(CONDOR_HOST), $(IP_ADDRESS), *

ALLOW_ADMINISTRATOR = $(IP_ADDRESS)

JAVA = C:\PROGRA~2\Java\JRE18~1.0_2\bin\java.exe

START = TRUE

SUSPEND = FALSE

WANT_SUSPEND = TRUE

WANT_VACATE = FALSE

PREEMPT = FALSE

DAEMON_LIST = MASTER SCHEDD STARTD

#===============================================
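To see exactly where the two configurations above differ, a short Python sketch can parse each one as simple `NAME = value` lines and compare the results. The parsing is a simplification for illustration only: it skips the `use` line, does no macro expansion, and is not how HTCondor itself reads its configuration. The helper names `parse_config` and `diff_configs` are made up for this example, and only the relevant lines are copied from the listings above.

```python
def parse_config(text):
    """Parse simple 'NAME = value' lines; skip blanks, comments, and lines without '='."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        settings[name.strip().upper()] = value.strip()
    return settings


def diff_configs(a, b):
    """Return {name: (value_in_a, value_in_b)} for settings that differ."""
    names = sorted(set(a) | set(b))
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}


# Only the lines relevant to the comparison, copied from the listings above.
CENTRAL_MANAGER = """
ALLOW_READ = *
ALLOW_WRITE = $(CONDOR_HOST), $(IP_ADDRESS), *
ALLOW_ADMINISTRATOR = $(IP_ADDRESS)
ALLOW_NEGOTIATOR = $(IP_ADDRESS)
ALLOW_DAEMON = *
DAEMON_LIST = MASTER SCHEDD COLLECTOR NEGOTIATOR STARTD
"""

C1 = """
ALLOW_READ = *
ALLOW_WRITE = $(CONDOR_HOST), $(IP_ADDRESS), *
ALLOW_ADMINISTRATOR = $(IP_ADDRESS)
DAEMON_LIST = MASTER SCHEDD STARTD
"""

differences = diff_configs(parse_config(CENTRAL_MANAGER), parse_config(C1))
for name, (cm_value, c1_value) in differences.items():
    print(f"{name}: central manager={cm_value!r}, C1={c1_value!r}")
```

With the listings as posted, the only differences are ALLOW_NEGOTIATOR and ALLOW_DAEMON, which appear only on the central manager, and DAEMON_LIST, which omits COLLECTOR and NEGOTIATOR on C1.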

RUN TESTING

I have managed to run the following submit description file successfully by submitting it on the central manager and letting it run on the central manager. The executable C:\TUFLOW\w64\TUFLOW_iSP_w64.exe then completes its simulation without problems.

#===============================================

## Runfile.txt

#===============================================

universe = vanilla

executable = C:\TUFLOW\w64\TUFLOW_iSP_w64.exe

arguments = "-b -x -s 15ft FMA_T1_~s1~_001.tcf"

output = TUFLOW.out

error = TUFLOW.err

log = example1.log

should_transfer_files = IF_NEEDED

when_to_transfer_output = ON_EXIT

queue

#===============================================

This is where the problems start.

If I submit the same job from my central manager and try to run it on computer C1, using the Requirements command as in RunfileC1.txt:

#===============================================

## RunfileC1.txt

#===============================================

universe = vanilla

executable = C:\TUFLOW\w64\TUFLOW_iSP_w64.exe

arguments = "-b -x -s 15ft FMA_T1_~s1~_001.tcf"

#input = input.in

output = TUFLOW.out

error = TUFLOW.err

log = example1.log

Requirements = (machine == "WEBR1436. .xxxxx.xxxx ")

should_transfer_files = IF_NEEDED

when_to_transfer_output = ON_EXIT

queue

#===============================================

LOG ERRORS

I don't get any errors per se. The starter log for slot 2 suggests that the scratch directory execute\dir_2456 was created and used successfully. However, when I look in the execute directory on computer C1, there are no files or folders in it.

StarterLog.slot2 on computer C1

01/20/15 18:35:17 (pid:2456) ******************************************************

01/20/15 18:35:17 (pid:2456) ** condor_starter (CONDOR_STARTER) STARTING UP

01/20/15 18:35:17 (pid:2456) ** C:\condor\bin\condor_starter.exe

01/20/15 18:35:17 (pid:2456) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)

01/20/15 18:35:17 (pid:2456) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON

01/20/15 18:35:17 (pid:2456) ** $CondorVersion: 8.2.6 Dec 10 2014 BuildID: 287355 $

01/20/15 18:35:17 (pid:2456) ** $CondorPlatform: x86_64_Windows8 $

01/20/15 18:35:17 (pid:2456) ** PID = 2456

01/20/15 18:35:17 (pid:2456) ** Log last touched 1/20 18:27:07

01/20/15 18:35:17 (pid:2456) ******************************************************

01/20/15 18:35:17 (pid:2456) Using config source: C:\condor\condor_config

01/20/15 18:35:17 (pid:2456) Using local config sources:

01/20/15 18:35:17 (pid:2456) C:\condor\condor_config.local

01/20/15 18:35:17 (pid:2456) config Macros = 48, Sorted = 47, StringBytes = 1190, TablesBytes = 1368

01/20/15 18:35:17 (pid:2456) CLASSAD_CACHING is OFF

01/20/15 18:35:17 (pid:2456) Daemon Log is logging: D_ALWAYS D_ERROR

01/20/15 18:35:17 (pid:2456) DaemonCore: command socket at <IP C1>

01/20/15 18:35:17 (pid:2456) DaemonCore: private command socket at < IP C1>

01/20/15 18:35:17 (pid:2456) GLEXEC_JOB not supported on this platform; ignoring

01/20/15 18:35:17 (pid:2456) Communicating with shadow < IP Central Master >

01/20/15 18:35:17 (pid:2456) Submitting machine is "webr1525.xxxxx"

01/20/15 18:35:17 (pid:2456) setting the orig job name in starter

01/20/15 18:35:17 (pid:2456) setting the orig job iwd in starter

01/20/15 18:35:17 (pid:2456) Chirp config summary: IO false, Updates false, Delayed updates true.

01/20/15 18:35:17 (pid:2456) Initialized IO Proxy.

01/20/15 18:35:17 (pid:2456) Setting resource limits not implemented!

01/20/15 18:35:18 (pid:2456) File transfer completed successfully.

01/20/15 18:35:19 (pid:2456) Job 121.0 set to execute immediately

01/20/15 18:35:19 (pid:2456) Starting a VANILLA universe job with ID: 121.0

01/20/15 18:35:19 (pid:2456) Tracking process family by login "condor-slot2"

01/20/15 18:35:19 (pid:2456) IWD: C:\condor\execute\dir_2456

01/20/15 18:35:19 (pid:2456) Output file: C:\condor\execute\dir_2456\_condor_stdout

01/20/15 18:35:19 (pid:2456) Error file: C:\condor\execute\dir_2456\_condor_stderr

01/20/15 18:35:19 (pid:2456) Renice expr "10" evaluated to 10

01/20/15 18:35:19 (pid:2456) About to exec C:\condor\execute\dir_2456\condor_exec.exe -b -x -s 15ft FMA_T1_~s1~_001.tcf

01/20/15 18:35:19 (pid:2456) Running job as user condor-slot2

01/20/15 18:35:19 (pid:2456) Create_Process succeeded, pid=3296

01/20/15 18:35:19 (pid:2456) Process exited, pid=3296, status=-1073741515

01/20/15 18:35:19 (pid:2456) Got SIGQUIT. Performing fast shutdown.

01/20/15 18:35:19 (pid:2456) ShutdownFast all jobs.

01/20/15 18:35:23 (pid:2456) **** condor_starter (condor_STARTER) pid 2456 EXITING WITH STATUS 0

ShadowLog on Central Server

01/20/15 18:35:17 ** condor_shadow (CONDOR_SHADOW) STARTING UP

01/20/15 18:35:17 ** C:\condor\bin\condor_shadow.exe

01/20/15 18:35:17 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)

01/20/15 18:35:17 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON

01/20/15 18:35:17 ** $CondorVersion: 8.2.6 Dec 10 2014 BuildID: 287355 $

01/20/15 18:35:17 ** $CondorPlatform: x86_64_Windows8 $

01/20/15 18:35:17 ** PID = 6404

01/20/15 18:35:17 ** Log last touched 1/20 18:29:37

01/20/15 18:35:17 ******************************************************

01/20/15 18:35:17 Using config source: C:\condor\condor_config

01/20/15 18:35:17 Using local config sources:

01/20/15 18:35:17 C:\condor\condor_config.local

01/20/15 18:35:17 config Macros = 47, Sorted = 47, StringBytes = 1205, TablesBytes = 400

01/20/15 18:35:17 CLASSAD_CACHING is OFF

01/20/15 18:35:17 Daemon Log is logging: D_ALWAYS D_ERROR

01/20/15 18:35:17 DaemonCore: command socket at <IP Central Master>

01/20/15 18:35:17 DaemonCore: private command socket at < IP Central Master >

01/20/15 18:35:17 Initializing a VANILLA shadow for job 121.0

01/20/15 18:35:17 (121.0) (6404): Request to run on slot2@WEBR1436. .xxxxx.xxxx < IP C1> was ACCEPTED

01/20/15 18:35:19 (121.0) (6404): Job 121.0 terminated: exited with status -1073741515

01/20/15 18:35:19 (121.0) (6404): Reporting job exit reason 100 and attempting to fetch new job.

01/20/15 18:35:19 (121.0) (6404): **** condor_shadow (condor_SHADOW) pid 6404 EXITING WITH STATUS 100
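Both logs report the same exit status, -1073741515. On Windows, large negative values like this are usually a signed 32-bit view of an NTSTATUS code, so reinterpreting the number as unsigned hexadecimal makes it easier to look up. A minimal Python sketch of that conversion follows; the helper name `to_ntstatus_hex` is made up for illustration.

```python
def to_ntstatus_hex(exit_status):
    """Reinterpret a signed 32-bit exit status as an unsigned hex string."""
    return f"0x{exit_status & 0xFFFFFFFF:08X}"


# Exit status reported in StarterLog.slot2 and ShadowLog above.
print(to_ntstatus_hex(-1073741515))
```

For the status above this gives 0xC0000135, which Windows documents as STATUS_DLL_NOT_FOUND. If that reading is right, the executable was started on C1 but could not find a DLL it depends on. That would fit a job that runs on the central manager yet exits immediately on C1.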

If you could help me out, I would be very grateful.

If you have any other pointers regarding the config, those would also be much appreciated.

Kind regards,
Mitch.

Mitchell Smith
Flood/Coastal Engineer
BMT WBM Pty Ltd

