Subject: Re: [Condor-users] Job fails to run / Job leaves around unkillable processes
so where it says here
10/29 12:47:27 TokenCache contents:
USER@DOMAIN
you have replaced the real value with USER@DOMAIN, correct? And the
value you are getting for the TokenCache is reasonable for your system?
If the directory exists, then this pretty much has to be a
privilege problem of some sort. If this were a Unix system, I might
suspect that condor_exec.exe didn't have the execute bit set. But
Windows filesystems don't have execute bits, so the only other
possibility that seems reasonable to me is that USER@DOMAIN doesn't
have the right to read dir_6728.
I'm assuming that C: is formatted as NTFS, which has support for
user-level access rights.
-tj
On 11/3/2010 2:50 PM, Torrin Jones wrote:
Here is everything in the starter log from the last 2 seconds
of running that process. As you can see from the log below, IWD
is set to C:\condor\execute\dir_6728. You can also see it
failing to delete that directory later. This is a directory
that it created. Again, usernames and domains have been changed
to protect the guilty. I'm not sure why the starter is allowed
to create a directory and copy an executable into it, but then
can't run the executable or later delete the directory. This is
very strange.
10/29 12:47:28 ERROR
"Create_Process(C:\condor\execute\dir_6728\condor_exec.exe,,
...) failed: " at line 530 in file
..\src\condor_starter.V6.1\os_proc.cpp
10/29 12:47:28 ShutdownFast all jobs.
10/29 12:47:28 Got ShutdownFast when no jobs running.
10/29 12:47:28 Attempting to remove
C:\condor\execute\dir_6728 as SuperUser (system)
10/29 12:47:28 Removing "C:\condor\execute\dir_6728" as
SuperUser (system) failed: /bin/rm exited with status 32
10/29 12:47:28 perm::init() starting up for account (SYSTEM)
domain (NT AUTHORITY)
10/29 12:47:28 perm::init: Found Account Name SYSTEM
10/29 12:47:28 set_acls() found a matching ACE already in the
ACL, so skipping the add
10/29 12:47:28 set_acls() found a matching ACE already in the
ACL, so skipping the add
10/29 12:47:28 set_acls() found a matching ACE already in the
ACL, so skipping the add
10/29 12:47:28 set_acls() found a matching ACE already in the
ACL, so skipping the add
10/29 12:47:28 set_acls() found a matching ACE already in the
ACL, so skipping the add
10/29 12:47:28 Attempting to remove
C:\condor\execute\dir_6728 as SuperUser (system)
10/29 12:47:28 Removing "C:\condor\execute\dir_6728" as
SuperUser (system) failed: /bin/rm exited with status 32
10/29 12:47:28 ERROR: C:\condor\execute\dir_6728 still exists
after trying to add Full control to ACLs for PRIV_ROOT
10/29 12:47:28 Deleting the StarterHookMgr
On Mon, Nov 1, 2010 at 8:02 AM, John (TJ) Knoeller <johnkn@xxxxxxxxxxx>
wrote:
Is there a line in the
starter log that looks like this
IWD: <some path>
It would be before the message that Create_Process failed.
This is the initial directory. If it's different
from the path to the executable, then that might be the
directory that's invalid.
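A minimal sketch of that check, for anyone who would rather script it
than read the log by eye. It assumes each starter log entry sits on one
line, that the IWD entry looks like "IWD: <some path>", and that the
failure line contains "Create_Process(<exe>," followed by "failed". The
two regular expressions and both function names are illustrative
assumptions, not anything taken from Condor itself.

```python
import ntpath
import re

# Assumed line shapes (not taken from Condor's source):
#   "<timestamp> IWD: <path>"
#   "<timestamp> ERROR "Create_Process(<exe>,, ...) failed: " ..."
IWD_RE = re.compile(r"\bIWD:\s*(?P<path>.+?)\s*$")
EXE_RE = re.compile(r"Create_Process\((?P<exe>[^,]+),")


def iwd_before_failure(lines):
    """Return (iwd, exe) for the first Create_Process failure, else None.

    iwd is the most recent IWD path seen before the failure line,
    or None if no IWD line preceded it.
    """
    iwd = None
    for line in lines:
        match = IWD_RE.search(line)
        if match:
            iwd = match.group("path")
            continue
        match = EXE_RE.search(line)
        if match and "failed" in line:
            return iwd, match.group("exe")
    return None


def iwd_differs(iwd, exe):
    """True if the IWD is not the directory that holds the executable."""
    # ntpath handles Windows-style paths on any platform.
    exe_dir = ntpath.normcase(ntpath.dirname(exe))
    return exe_dir != ntpath.normcase(iwd)
```

Feeding it the lines of StarterLog.slot1 (for example, an open file
object) gives the IWD and executable path for the first failure. If
iwd_differs() returns True, the IWD is the directory to look at first.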
-tj
On 10/29/2010 3:00 PM, Torrin Jones wrote:
Thanks for the response. Good
points. However . . .
V: is actually a physical hard drive on my
computer and at the moment, condor is only installed
on my computer. I was doing a test to see if my
software would work with the latest version. So
everything is contained on my computer that has V as
a physical hard drive. So condor should be able to
get at it. I also checked to see whether these
directories actually exist. They do, and as far
as I can tell, they are accessible by anybody,
including condor (which is running as NT
AUTHORITY\SYSTEM).
After all this, I wanted to be sure, so I moved
everything to c:\temp and changed all paths in the
submit description file to relative paths and then
submitted to condor to see if anything changed.
Unfortunately, I still have the same problem.
I've attached the new submit description file and
the output log file. IP address, port numbers,
usernames, etc. have been changed to protect the
guilty. Below is what came out of
StarterLog.slot1.
10/29 12:47:28 ERROR
"Create_Process(C:\condor\execute\dir_6728\condor_exec.exe,,
...) failed: " at line 530 in file
..\src\condor_starter.V6.1\os_proc.cpp
On Fri, Oct 29, 2010 at
8:51 AM, John (TJ) Knoeller <johnkn@xxxxxxxxxxx>
wrote:
yep
267 is "The directory name is invalid".
From looking at your .job file, I'm
wondering if the invalid directory isn't
v:\temp\condor or v:\shared\condor rather
than c:\condor\execute\dir_6136 as the error
message seems to imply.
I'm guessing that v: is a network drive. So
I have to wonder, is v: really valid in the
context of the job?
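For reference, a small lookup sketch for the numeric codes that come up
in this thread. The names and messages are the standard Win32 system
error codes from winerror.h. Reading the "rm exited with status 32"
message quoted elsewhere in this thread as a Win32 error code is an
assumption, and describe() is just an illustrative helper name.

```python
# Win32 system error codes relevant to this thread (from winerror.h).
WIN32_ERRORS = {
    # Assumption: the rm exit status 32 in the starter log is a Win32 code.
    32: ("ERROR_SHARING_VIOLATION",
         "The process cannot access the file because it is being used "
         "by another process."),
    # The code reported with the Create_Process failure.
    267: ("ERROR_DIRECTORY", "The directory name is invalid."),
}


def describe(code):
    """Return a one-line description of a Win32 error code."""
    name, text = WIN32_ERRORS.get(code, ("UNKNOWN", "not in this table"))
    return f"{code} ({name}): {text}"
```

On a Windows machine, ctypes.FormatError(code) returns the system's own
message text for any code, so the table is only needed when reading logs
somewhere else.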
10/28 08:35:33 ERROR
"Create_Process(C:\condor\execute\dir_6136\condor_exec.exe,,
...) failed: " at line 530 in
file
..\src\condor_starter.V6.1\os_proc.cpp
The MSDN says 267 means, "The
directory name is invalid."
However, the directory name is
there. Here is the scenario. I
submit a small job.
condor_dummy.job attached. All
condor_dummy.exe does is print out a
line like this . . .
Run by DOMAIN\USER on
COMPUTERNAME at DATE TIME.
It's basically a quick condor
test.
Anyway, I submit the job and
condor tries to run it. However it
fails and I get the above message in
the StarterLog.slot1. Here is the
kicker. It will retry and fail.
However, if I leave it in the queue
long enough, it will eventually
succeed. When I ran the job
yesterday, it tried 28 times. The
final time, it succeeded. Here is
another thing I'm seeing. After it
succeeded, I looked in Process
Explorer and saw 27 condor_exec.exe
running. The condor_exec.exe's were
unkillable. I tried every approach
I could think of: killing them as
Admin, as NT AUTHORITY\SYSTEM, even
putting a debugger on them and
killing them that way. Nothing
works.