Subject: Re: [Condor-users] Job fails to run / Job leaves around unkillable processes
Here is everything in the starter log from the last 2 seconds of running that process. As you can see from the log below, IWD is set to C:\condor\execute\dir_6728. You can also see the starter failing to delete that directory later, even though it created the directory itself. Again, usernames and domains have been changed to protect the guilty. I'm not sure why the starter can create a directory and copy an executable into it, but then can neither run the executable nor delete the directory afterward. This is very strange.
10/29 12:47:28 ERROR "Create_Process(C:\condor\execute\dir_6728\condor_exec.exe,, ...) failed: " at line 530 in file ..\src\condor_starter.V6.1\os_proc.cpp
10/29 12:47:28 ShutdownFast all jobs.
10/29 12:47:28 Got ShutdownFast when no jobs running.
10/29 12:47:28 Attempting to remove C:\condor\execute\dir_6728 as SuperUser (system)
10/29 12:47:28 Removing "C:\condor\execute\dir_6728" as SuperUser (system) failed: /bin/rm exited with status 32
10/29 12:47:28 perm::init() starting up for account (SYSTEM) domain (NT AUTHORITY)
10/29 12:47:28 perm::init: Found Account Name SYSTEM
10/29 12:47:28 set_acls() found a matching ACE already in the ACL, so skipping the add
10/29 12:47:28 set_acls() found a matching ACE already in the ACL, so skipping the add
10/29 12:47:28 set_acls() found a matching ACE already in the ACL, so skipping the add
10/29 12:47:28 set_acls() found a matching ACE already in the ACL, so skipping the add
10/29 12:47:28 set_acls() found a matching ACE already in the ACL, so skipping the add
10/29 12:47:28 Attempting to remove C:\condor\execute\dir_6728 as SuperUser (system)
10/29 12:47:28 Removing "C:\condor\execute\dir_6728" as SuperUser (system) failed: /bin/rm exited with status 32
10/29 12:47:28 ERROR: C:\condor\execute\dir_6728 still exists after trying to add Full control to ACLs for PRIV_ROOT
10/29 12:47:28 Deleting the StarterHookMgr
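For anyone comparing several retries, a rough Python sketch for tallying these failures from a StarterLog might look like the following. The "MM/DD HH:MM:SS message" line shape and the two message patterns are assumptions taken only from the excerpt above, and summarize_starter_log is a made-up helper name, not part of Condor.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpt above: "MM/DD HH:MM:SS message".
LINE_RE = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def summarize_starter_log(lines):
    """Count Create_Process failures and failed directory removals per path.

    Hypothetical helper: it only recognizes the two message shapes that
    appear in the log excerpt in this thread.
    """
    create_failures = Counter()
    remove_failures = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a timestamped log line
        msg = m.group(2)
        # First argument of Create_Process(...) is the executable path.
        cp = re.search(r"Create_Process\((.+?),", msg)
        if cp and "failed" in msg:
            create_failures[cp.group(1)] += 1
        # Only the 'Removing "..." ... failed' lines count as removal failures.
        rm = re.search(r'^Removing "(.+?)" as .* failed', msg)
        if rm:
            remove_failures[rm.group(1)] += 1
    return create_failures, remove_failures
```

Feeding it the lines of StarterLog.slot1 (for example via open(path).readlines()) gives two counters keyed by path, which makes it easier to see how many attempts failed for each execute directory.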
On Mon, Nov 1, 2010 at 8:02 AM, John (TJ) Knoeller <johnkn@xxxxxxxxxxx> wrote:
Is there a line in the starter log that looks like this?
IWD: <some path>
It would appear before the message that Create_Process failed. This is
the initial working directory. If it's different
from the path to the executable, then that might be the directory
that's invalid.
-tj
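The comparison suggested above could be sketched in Python roughly as follows. This assumes the log contains a line of the form "IWD: <some path>" as described, plus the Create_Process error line; iwd_matches_exec_dir is a hypothetical helper, and ntpath is used so Windows paths compare correctly on any platform.

```python
import ntpath
import re


def iwd_matches_exec_dir(lines):
    """Return (iwd, exec_dir, same), or None if either line is missing.

    Hypothetical helper; the "IWD: <path>" shape is an assumption based
    on the description in this thread.
    """
    iwd = exec_path = None
    for line in lines:
        m = re.search(r"IWD: (.+)$", line.rstrip())
        if m:
            iwd = m.group(1).strip()
        m = re.search(r"Create_Process\((.+?),", line)
        if m:
            exec_path = m.group(1)
    if iwd is None or exec_path is None:
        return None
    exec_dir = ntpath.dirname(exec_path)

    def norm(p):
        # Windows paths are case-insensitive; normalize before comparing.
        return ntpath.normcase(ntpath.normpath(p))

    return iwd, exec_dir, norm(iwd) == norm(exec_dir)
```

If the third element comes back False, the IWD path is the one worth checking for existence and permissions from the job's point of view.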
On 10/29/2010 3:00 PM, Torrin Jones wrote:
Thanks for the response. Good points. However . . .
V: is actually a physical hard drive on my computer, and at
the moment, condor is only installed on my computer. I was
doing a test to see if my software would work with the latest
version, so everything is contained on this one machine, and
condor should be able to get at V:. I also checked that these
directories actually exist. They do, and as far as I can tell,
they are accessible by anybody, including condor (which is
running as NT AUTHORITY\SYSTEM).
After all this, I wanted to be sure, so I moved everything to
c:\temp and changed all paths in the submit description file to
relative paths and then submitted to condor to see if anything
changed. Unfortunately, I still have the same problem.
I've attached the new submit description file and the output
log file. IP addresses, port numbers, usernames, etc. have been
changed to protect the guilty. Below is what came out of
StarterLog.slot1.
10/29 12:47:28 ERROR "Create_Process(C:\condor\execute\dir_6728\condor_exec.exe,, ...) failed: " at line 530 in file ..\src\condor_starter.V6.1\os_proc.cpp
On Fri, Oct 29, 2010 at 8:51 AM, John (TJ) Knoeller <johnkn@xxxxxxxxxxx> wrote:
Yep, 267 is "The
directory name is invalid". From looking at your .job
file, I'm wondering whether the invalid directory is
v:\temp\condor or v:\shared\condor rather than
c:\condor\execute\dir_6136, as the error message seems to
imply.
I'm guessing that v: is a network drive. So I gotta
wonder, is v: really valid in the context of the job?
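As an aside, a quick way to turn a Win32 error number into its message without searching MSDN can be sketched in Python as below. describe_win32_error is a hypothetical helper; on Windows it defers to ctypes.FormatError, and elsewhere it falls back to a tiny table holding only the one code discussed in this thread (267, which the Windows headers name ERROR_DIRECTORY).

```python
import sys

# Fallback table for non-Windows machines; it deliberately holds only the
# code mentioned in this thread.
KNOWN_WIN32_ERRORS = {
    267: "The directory name is invalid.",  # ERROR_DIRECTORY
}


def describe_win32_error(code):
    """Return the system message for a Win32 error code (hypothetical helper)."""
    if sys.platform == "win32":
        import ctypes  # FormatError is only available on Windows

        return ctypes.FormatError(code).strip()
    return KNOWN_WIN32_ERRORS.get(code, "unknown Win32 error %d" % code)
```

On a Windows box, describe_win32_error(267) asks the system for the text, so the result follows the system's language settings.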
10/28 08:35:33 ERROR "Create_Process(C:\condor\execute\dir_6136\condor_exec.exe,, ...) failed: " at line 530 in file ..\src\condor_starter.V6.1\os_proc.cpp
MSDN says 267 means "The directory name
is invalid." However, the directory is
there. Here is the scenario. I submit a small
job. condor_dummy.job attached. All
condor_dummy.exe does is print out a line like
this . . .
Run by DOMAIN\USER on COMPUTERNAME at DATE TIME.
It's basically a quick condor test.
Anyway, I submit the job and condor tries to
run it. However it fails and I get the above
message in the StarterLog.slot1. Here is the
kicker. It will retry and fail. However, if I
leave it in the queue long enough, it will
eventually succeed. When I ran the job
yesterday, it tried 28 times. The final time,
it succeeded. Here is another thing I'm seeing.
After it succeeded, I looked in Process
Explorer and saw 27 condor_exec.exe processes running.
The condor_exec.exe processes were unkillable. I tried
every approach I could think of: killing them
as Admin, as NT AUTHORITY\SYSTEM, even attaching a
debugger to them and killing them that way.
Nothing worked.
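To count the leftover processes without opening Process Explorer each time, something like the Python sketch below could be used. It shells out to the Windows tasklist command with its /FI, /FO CSV and /NH options; parse_tasklist_csv and leftover_pids are hypothetical helper names, and this only lists the processes, it does not attempt to kill them.

```python
import csv
import io
import subprocess


def parse_tasklist_csv(output, image_name="condor_exec.exe"):
    """Return PIDs from `tasklist /FO CSV /NH` output for the given image."""
    pids = []
    for row in csv.reader(io.StringIO(output)):
        # Data rows start with image name and PID; the "INFO: No tasks ..."
        # line printed when nothing matches has a single field and is skipped.
        if len(row) >= 2 and row[0].lower() == image_name.lower() and row[1].isdigit():
            pids.append(int(row[1]))
    return pids


def leftover_pids(image_name="condor_exec.exe"):
    """Windows only: ask tasklist for processes with the given image name."""
    out = subprocess.run(
        ["tasklist", "/FI", "IMAGENAME eq " + image_name, "/FO", "CSV", "/NH"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_tasklist_csv(out, image_name)
```

Calling len(leftover_pids()) after each failed attempt would show whether every failed Create_Process leaves one more condor_exec.exe behind.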