Hi Everyone,
I think there might be a bug in Condor v6.8.3 when running on Windows
2003. I have two Windows 2003 servers and a Windows XP box connected to
one pool. The pool manager is on a Windows 2003 box, which does not run
any jobs.
I have a job consisting of a batch file that copies PHP onto the
execute machine and then runs a PHP script. In this pool, when the job
is assigned to the Windows XP machine, it runs without problems. However,
when it is assigned to the Windows 2003 box, I get the error below
(more info to follow).
-------------------------------------------------------------------------------------------------------
001 (030.000.000) 01/15 16:51:18 Job executing on host: <192.168.2.202:4544>
...
007 (030.000.000) 01/15 16:51:18 Shadow exception!
Error from starter on vm1@STAGING:
Create_Process(C:\WINDOWS\system32\cmd.exe,/Q /C condor_exec.bat
translate_desc_en_pt.php, VIDEOID, ...) failed
0 - Run Bytes Sent By Job
8139560 - Run Bytes Received By Job
...
---------------------------------------------------------------------------------------------
Submit file
--------------------------------------------------------------------------------------------
# file name: my_program.condor
# Condor submit description file for my_program
Executable = p.bat
Universe = vanilla
Error = logs/$(cluster).err.log
Output = logs/$(cluster).out.log
Log = logs/$(cluster).log
initialdir = files
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = translate_desc_en_pt.php,php.exe,gtkextra.dll,iconv.dll,intl.dll,libgdk-0.dll,libglade.dll,libglib-2.0-0.dll,libgmodule-2.0-0.dll,libgobject-2.0-0.dll,libgthread-2.0-0.dll,libgtk-0.dll,libxml2.dll,php4ts.dll,php.ini,php.ini-gtk,php_gtk.dll,php_gtk_combobutton.dll,php_gtk_extra.dll,php_gtk_libglade.dll,php_gtk_scintilla.dll,php_gtk_scrollpane.dll,php_gtk_spaned.dll,php_gtk_sqpane.dll,php_win.exe,php-cgi.exe,zlib.dll
Arguments = translate_desc_en_pt.php, VIDEOID
#Arguments = -?
Requirements = OpSys != "Dummy" && Arch != "Dummy"
Queue
--------------------------------------------------------------------------------------------
1/15 16:51:15 ******************************************************
1/15 16:51:15 ** condor_shadow (CONDOR_SHADOW) STARTING UP
1/15 16:51:15 ** C:\condor\bin\condor_shadow.exe
1/15 16:51:15 ** $CondorVersion: 6.8.3 Jan 5 2007 $
1/15 16:51:15 ** $CondorPlatform: INTEL-WINNT50 $
1/15 16:51:15 ** PID = 3948
1/15 16:51:15 ** Log last touched 1/15 16:51:13
1/15 16:51:15 ******************************************************
1/15 16:51:15 Using config source: C:\condor\condor_config
1/15 16:51:15 Using local config sources:
1/15 16:51:15 C:\condor/condor_config.local
1/15 16:51:15 DaemonCore: Command Socket at <192.168.2.124:4788>
1/15 16:51:15 Initializing a VANILLA shadow for job 30.0
1/15 16:51:15 (30.0) (3948): Request to run on <192.168.2.202:4544> was
ACCEPTED
1/15 16:51:18 (30.0) (3948): ERROR "Error from starter on vm1@STAGING:
Create_Process(C:\WINDOWS\system32\cmd.exe,/Q /C condor_exec.bat
translate_desc_en_pt.php, VIDEOID, ...) failed" at line 643 in file
..\src\condor_shadow.V6.1\pseudo_ops.C
1/15 16:53:59 ******************************************************
I then looked up a previous user's post describing a similar problem.
According to the
http://condor.optena.com/display/CONDOR/Common+Windows+Problems page,
a VM1_USER setting is needed in the configuration, so I have added
one.
With that in place, the error I get is the one below:
1/15 17:18:48 ******************************************************
1/15 17:18:48 ** condor_shadow (CONDOR_SHADOW) STARTING UP
1/15 17:18:48 ** C:\condor\bin\condor_shadow.exe
1/15 17:18:48 ** $CondorVersion: 6.8.3 Jan 5 2007 $
1/15 17:18:48 ** $CondorPlatform: INTEL-WINNT50 $
1/15 17:18:48 ** PID = 2080
1/15 17:18:48 ** Log last touched 1/15 17:11:04
1/15 17:18:48 ******************************************************
1/15 17:18:48 Using config source: C:\condor\condor_config
1/15 17:18:48 Using local config sources:
1/15 17:18:48 C:\condor/condor_config.local
1/15 17:18:48 DaemonCore: Command Socket at <192.168.2.124:1098>
1/15 17:18:48 Initializing a VANILLA shadow for job 32.0
1/15 17:18:48 (32.0) (2080): Request to run on <192.168.2.202:3310> was
ACCEPTED
1/15 17:18:49 (32.0) (2080): condor_read(): recv() returned -1, errno =
10054, assuming failure reading 5 bytes from <192.168.2.202:3310>.
1/15 17:18:49 (32.0) (2080): Can no longer talk to condor_starter
<192.168.2.202:3310>
1/15 17:18:49 (32.0) (2080): Trying to reconnect to disconnected job
1/15 17:18:49 (32.0) (2080): LastJobLeaseRenewal: 1168881529 Mon Jan 15
17:18:49 2007
1/15 17:18:49 (32.0) (2080): JobLeaseDuration: 1200 seconds
1/15 17:18:49 (32.0) (2080): JobLeaseDuration remaining: 1200
1/15 17:18:49 (32.0) (2080): Attempting to locate disconnected starter
1/15 17:18:49 (32.0) (2080): Found starter: <192.168.2.202:3362>
1/15 17:18:49 (32.0) (2080): Attempting to reconnect to starter
<192.168.2.202:3362>
1/15 17:18:50 (32.0) (2080): attempt to connect to <192.168.2.202:3362>
failed: connect errno = 10061 connection refused.
1/15 17:18:50 (32.0) (2080): Attempt to reconnect failed: Failed to
connect to starter <192.168.2.202:3362>
1/15 17:18:50 (32.0) (2080): JobLeaseDuration remaining: 1199
1/15 17:18:50 (32.0) (2080): Scheduling another attempt to reconnect in
8 seconds
1/15 17:18:58 (32.0) (2080): Attempting to locate disconnected starter
1/15 17:18:58 (32.0) (2080): locateStarter(): ClaimId
(<192.168.2.202:3310>#1168881462#1) and GlobalJobId (
cellast-cxo5mw2#1168881032#32.0 ) not found
1/15 17:18:58 (32.0) (2080): Reconnect FAILED: Job not found at
execution machine
1/15 17:18:58 (32.0) (2080): **** condor_shadow (condor_SHADOW) EXITING
WITH STATUS 107
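For anyone reading the log above: the two errno values are standard Winsock error codes. 10054 is WSAECONNRESET (the other end dropped the connection) and 10061 is WSAECONNREFUSED (nothing was listening on that port). That would fit the starter on the Windows 2003 box going away right after the claim was accepted, although the log alone does not prove it. Below is a minimal Python sketch for pulling these codes out of a shadow log. The function name and the lookup table are my own and have nothing to do with Condor itself.

```python
import re

# Winsock error codes seen in the shadow log above (standard Windows values).
WINSOCK_ERRORS = {
    10054: "WSAECONNRESET (connection reset by peer)",
    10061: "WSAECONNREFUSED (connection refused)",
}

ERRNO_RE = re.compile(r"errno\s*=\s*(\d+)")


def decode_errnos(log_text):
    """Return (code, description) for every 'errno = N' in log_text, in order."""
    # Mail wrapping can split "errno =" from its number, so flatten whitespace first.
    flat = " ".join(log_text.split())
    return [
        (int(code), WINSOCK_ERRORS.get(int(code), "unknown"))
        for code in ERRNO_RE.findall(flat)
    ]


sample = """1/15 17:18:49 (32.0) (2080): condor_read(): recv() returned -1, errno =
10054, assuming failure reading 5 bytes from <192.168.2.202:3310>.
1/15 17:18:50 (32.0) (2080): attempt to connect to <192.168.2.202:3362>
failed: connect errno = 10061 connection refused."""

if __name__ == "__main__":
    for code, meaning in decode_errnos(sample):
        print(code, meaning)
```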
My gut feeling is that it's a bug in the file transfer of multiple
files on Windows 2003. The reason I suspect multiple files is that a
simple hello world job, which transfers only hello.exe, runs with no
problems; the failure only shows up when several files are transferred.
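One way the multiple-file theory could be checked is to submit the same job with a growing number of input files and see at what point the Windows 2003 box starts failing. Below is a rough Python sketch that writes such a series of submit files. The settings are copied from my submit file above, while the function name, output file names, and step size are just illustrative choices. With only some of the files present, the PHP script itself will probably fail, but the point is to see whether the starter still gets as far as launching the batch file.

```python
from pathlib import Path

# Settings copied from the submit file above; only transfer_input_files varies.
TEMPLATE = """Executable = p.bat
Universe = vanilla
Error = logs/$(cluster).err.log
Output = logs/$(cluster).out.log
Log = logs/$(cluster).log
initialdir = files
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = {files}
Arguments = translate_desc_en_pt.php, VIDEOID
Queue
"""


def make_submit_files(input_files, out_dir, step=5):
    """Write one submit file per growing prefix of input_files; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Prefix lengths: step, 2*step, ... and finally the full list.
    counts = list(range(step, len(input_files), step)) + [len(input_files)]
    paths = []
    for count in counts:
        path = out_dir / f"test_{count:02d}_files.condor"
        path.write_text(TEMPLATE.format(files=",".join(input_files[:count])))
        paths.append(path)
    return paths


if __name__ == "__main__":
    # A shortened file list for illustration; the real job transfers many more.
    files = ["translate_desc_en_pt.php", "php.exe", "php4ts.dll", "php.ini", "zlib.dll"]
    for submit_path in make_submit_files(files, "bisect", step=2):
        print(submit_path.name)
```

Each generated file can then be passed to condor_submit in turn, with the job steered to the Windows 2003 machine in whatever way the pool normally allows.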
The exact same job description works fine on the same pool when it runs
on Windows XP.
Any thoughts would be much appreciated.
Regards
Mark
--
Mark Ellul
Research and Development Manager
www.cellcast.tv
150 Great Portland Street
London
W1W 6QD
UK
Tel: (020) 7190 0300
Fax: (020) 7190 0301