[Condor-users] condor 7.2 on windows--dumb batch file fails that worked on 7.1.0
- Date: Wed, 7 Jan 2009 16:04:18 -0600
- From: "Grant Goodyear" <grant@xxxxxxxxxxxxxxxxx>
- Subject: [Condor-users] condor 7.2 on windows--dumb batch file fails that worked on 7.1.0
We have a 100-node Windows cluster running 7.1.0, except for one
machine (plowshare) that I've updated to 7.2.
I'm having difficulties getting the 7.2 machine to run jobs, so I
assembled a stupidly simple batch file and submitted just that. On a
randomly chosen 7.1.0 machine (43), there's no problem--the job runs,
and the output file contains what one would expect. On the 7.2
machine, however, the job terminates with exit code 128, and nothing
is written to the output or error files.
mystupid.bat -- "executable"
-------------------
mkdir temp
echo "dir:"
dir
set TMP=%_CONDOR_SCRATCH_DIR%\temp
set TEMP=%_CONDOR_SCRATCH_DIR%\temp
echo "dir temp"
dir temp
whoami
mystupid.log.43 -- log file on a 7.1.0 machine
-----------------------
000 (108.000.000) 01/07 15:08:07 Job submitted from host: <34.52.12.4:38333>
...
001 (108.000.000) 01/07 15:08:10 Job executing on host: <34.52.8.225:1055>
...
005 (108.000.000) 01/07 15:08:15 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
866 - Run Bytes Sent By Job
138 - Run Bytes Received By Job
866 - Total Bytes Sent By Job
138 - Total Bytes Received By Job
mystupid.out.43 -- output file on a 7.1.0 machine
-----------------------
"dir:"
Volume in drive C has no label.
Volume Serial Number is 80BB-A5FA
Directory of C:\condor\execute\dir_1280
01/07/2009 03:08 PM <DIR> .
01/07/2009 03:08 PM <DIR> ..
01/07/2009 03:05 PM 138 condor_exec.bat
01/07/2009 03:08 PM 0 mystupid.err
01/07/2009 03:08 PM 0 mystupid.out
01/07/2009 03:08 PM <DIR> temp
3 File(s) 138 bytes
3 Dir(s) 37,877,338,112 bytes free
"dir temp"
Volume in drive C has no label.
Volume Serial Number is 80BB-A5FA
Directory of C:\condor\execute\dir_1280\temp
01/07/2009 03:08 PM <DIR> .
01/07/2009 03:08 PM <DIR> ..
0 File(s) 0 bytes
2 Dir(s) 37,877,334,016 bytes free
enaus00053043\condor-reuse-slot1
mystupid.log.plowshare -- log file on the 7.2 machine
----------------------------------
000 (109.000.000) 01/07 15:10:14 Job submitted from host: <34.52.12.4:38333>
...
001 (109.000.000) 01/07 15:10:16 Job executing on host: <34.52.8.222:4465>
...
005 (109.000.000) 01/07 15:10:16 Job terminated.
(1) Normal termination (return value 128)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
138 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
138 - Total Bytes Received By Job
mystupid.out.plowshare -- output file on the 7.2 machine (empty)
----------------------------------
0-byte file
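For anyone who wants to compare the two job logs mechanically rather than by eye, here is a minimal Python sketch. It is not part of the original test: the function name termination_codes is made up for illustration, and the patterns are inferred only from the two log excerpts above, so other Condor versions may format these events differently.

```python
import re

# Header of a job-terminated event, e.g.
#   005 (109.000.000) 01/07 15:10:16 Job terminated.
# (pattern inferred from the excerpts above; an assumption, not a spec)
EVENT_RE = re.compile(r"^005 \((\d+)\.(\d+)\.\d+\)")

# Detail line of that event, e.g.
#   (1) Normal termination (return value 128)
RETURN_RE = re.compile(r"\(return value (-?\d+)\)")


def termination_codes(log_text):
    """Return a list of (job_id, return_value), one per 005 event found."""
    results = []
    current_job = None
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        header = EVENT_RE.match(line)
        if header:
            # "109.000" in the log corresponds to job ID "109.0"
            current_job = f"{int(header.group(1))}.{int(header.group(2))}"
            continue
        found = RETURN_RE.search(line)
        if found and current_job is not None:
            results.append((current_job, int(found.group(1))))
            current_job = None
    return results


if __name__ == "__main__":
    sample = """005 (109.000.000) 01/07 15:10:16 Job terminated.
    (1) Normal termination (return value 128)"""
    print(termination_codes(sample))  # prints [('109.0', 128)]
```

Run against the two logs above, this would report 0 for job 108.0 and 128 for job 109.0, which is the whole difference visible at the user-log level.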
Help, please?
Here are some possibly relevant log snippets from the 7.2 machine:
StarterLog.slot1
-----------------------
1/7 15:10:15 ******************************************************
1/7 15:10:15 ** condor_starter (CONDOR_STARTER) STARTING UP
1/7 15:10:15 ** C:\condor\bin\condor_starter.exe
1/7 15:10:15 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
1/7 15:10:15 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
1/7 15:10:15 ** $CondorVersion: 7.2.0 Dec 21 2008 BuildID: none $
1/7 15:10:15 ** $CondorPlatform: INTEL-WINNT50 $
1/7 15:10:15 ** PID = 3132
1/7 15:10:15 ** Log last touched 1/7 14:58:15
1/7 15:10:15 ******************************************************
1/7 15:10:15 Using config source: C:\condor\condor_config
1/7 15:10:15 Using local config sources:
1/7 15:10:15 C:\condor/condor_config.local
1/7 15:10:15 DaemonCore: Command Socket at <34.52.8.222:4585>
1/7 15:10:15 GLEXEC_JOB not supported on this platform; ignoring
1/7 15:10:15 Setting resource limits not implemented!
1/7 15:10:15 Communicating with shadow <34.52.12.4:41983>
1/7 15:10:15 Submitting machine is "feynman.corp.halliburton.com"
1/7 15:10:15 setting the orig job name in starter
1/7 15:10:15 setting the orig job iwd in starter
1/7 15:10:15 File transfer completed successfully.
1/7 15:10:16 Job 109.0 set to execute immediately
1/7 15:10:16 Starting a VANILLA universe job with ID: 109.0
1/7 15:10:16 Tracking process family by login "condor-reuse-slot1"
1/7 15:10:16 IWD: C:\condor\execute\dir_3132
1/7 15:10:16 Output file: C:\condor\execute\dir_3132\mystupid.out
1/7 15:10:16 Error file: C:\condor\execute\dir_3132\mystupid.err
1/7 15:10:16 Renice expr "10" evaluated to 10
1/7 15:10:16 About to exec C:\WINNT\system32\cmd.exe /Q /C condor_exec.bat
1/7 15:10:16 Create_Process succeeded, pid=2116
1/7 15:10:16 Process exited, pid=2116, status=128
1/7 15:10:16 Got SIGQUIT. Performing fast shutdown.
1/7 15:10:16 ShutdownFast all jobs.
1/7 15:10:16 **** condor_starter (condor_STARTER) pid 3132 EXITING WITH STATUS 0
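The two StarterLog lines that matter most here are the "About to exec" line and the "Process exited" line. As a rough sketch (Python, with a made-up helper name summarize_starter_log, and patterns based solely on the excerpt above rather than on any documented log format), they could be pulled out like this:

```python
import re

# Patterns inferred from the StarterLog excerpt above (assumptions):
#   1/7 15:10:16 About to exec C:\WINNT\system32\cmd.exe /Q /C condor_exec.bat
#   1/7 15:10:16 Process exited, pid=2116, status=128
EXEC_RE = re.compile(r"About to exec (.+)$")
EXIT_RE = re.compile(r"Process exited, pid=(\d+), status=(-?\d+)")


def summarize_starter_log(log_text):
    """Collect the last exec'd command line and all (pid, status) exits."""
    summary = {"command": None, "exits": []}
    for line in log_text.splitlines():
        exec_match = EXEC_RE.search(line)
        if exec_match:
            summary["command"] = exec_match.group(1).strip()
            continue
        exit_match = EXIT_RE.search(line)
        if exit_match:
            pid = int(exit_match.group(1))
            status = int(exit_match.group(2))
            summary["exits"].append((pid, status))
    return summary


if __name__ == "__main__":
    sample = r"""1/7 15:10:16 About to exec C:\WINNT\system32\cmd.exe /Q /C condor_exec.bat
1/7 15:10:16 Create_Process succeeded, pid=2116
1/7 15:10:16 Process exited, pid=2116, status=128"""
    result = summarize_starter_log(sample)
    print(result["command"])
    print(result["exits"])
```

This only restates what the log already says (cmd.exe was launched on condor_exec.bat and exited with status 128 within the same second); it does not explain why.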
StartLog
------------
1/7 15:10:14 slot1: match_info called
1/7 15:10:14 slot1: Received match <34.52.8.222:4465>#1231361361#8#...
1/7 15:10:14 slot1: State change: match notification protocol successful
1/7 15:10:14 slot1: Changing state: Unclaimed -> Matched
1/7 15:10:14 slot1: Request accepted.
1/7 15:10:14 slot1: Remote owner is grant@xxxxxxxxxxxxxxxxxxxx
1/7 15:10:14 slot1: State change: claiming protocol successful
1/7 15:10:14 slot1: Changing state: Matched -> Claimed
1/7 15:10:14 slot1: Got activate_claim request from shadow (<34.52.12.4:36569>)
1/7 15:10:14 slot1: Remote job ID is 109.0
1/7 15:10:15 slot1: Got universe "VANILLA" (5) from request classad
1/7 15:10:15 slot1: State change: claim-activation protocol successful
1/7 15:10:15 slot1: Changing activity: Idle -> Busy
1/7 15:10:16 slot1: Called deactivate_claim_forcibly()
1/7 15:10:16 condor_write(): Socket closed when trying to write 56 bytes to <34.52.12.4:40339>, fd is 228
1/7 15:10:16 Buf::write(): condor_write() failed
1/7 15:10:16 slot1: State change: received RELEASE_CLAIM command
1/7 15:10:16 slot1: Changing state and activity: Claimed/Busy -> Preempting/Vacating
1/7 15:10:16 Starter pid 3132 exited with status 0
1/7 15:10:16 slot1: State change: starter exited
1/7 15:10:16 slot1: State change: No preempting claim, returning to owner
1/7 15:10:16 slot1: Changing state and activity: Preempting/Vacating -> Owner/Idle
1/7 15:10:16 slot1: State change: IS_OWNER is false
1/7 15:10:16 slot1: Changing state: Owner -> Unclaimed
The submission machine is a Linux box running 7.1.0.
Thanks,
Grant
--
Grant Goodyear
web: http://www.grantgoodyear.org
e-mail: grant@xxxxxxxxxxxxxxxxx