
Re: [HTCondor-users] Starter exited with status -1073740940



That error code is 0xC0000374 which is STATUS_HEAP_CORRUPTION defined in ntstatus.h.
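
For anyone who wants to repeat the conversion: the starter's exit status is logged as a signed 32-bit integer, while NTSTATUS values are normally written as unsigned hex. A minimal sketch (the helper name to_ntstatus is just for illustration, it is not part of HTCondor):

```python
def to_ntstatus(exit_status: int) -> str:
    """Reinterpret a signed 32-bit exit status as an unsigned NTSTATUS hex string."""
    # Masking with 0xFFFFFFFF maps negative values onto their 32-bit
    # two's-complement representation.
    return "0x{:08X}".format(exit_status & 0xFFFFFFFF)


print(to_ntstatus(-1073740940))  # prints 0xC0000374
```

The resulting value can then be looked up in ntstatus.h.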

The priv change is logged only after it has succeeded, so the crash is in
whatever happens after this line:

c:\condor\execute\dir_18128\userdir\src\condor_starter.v6.1\basestarter.cpp:1789

The next things to happen are:
  mkdir the working directory
  write machine ad and job ad into the working directory
  set ACLs on the working directory
  if (ENCRYPT_EXECUTE_DIRECTORY) load ADVAPI32.dll to get the EncryptFile() function and use it to encrypt the working dir
  chdir to working dir
  dprintf( D_FULLDEBUG, "Done moving to directory \"%s\"\n", WorkingDir.Value() );

So since you aren't seeing "Done moving to directory...", the problem must happen as we are setting up the working directory.

Can you tell how far into the process we got?
Was the working directory made?
Were the ads written to it?
Do you have encryption enabled?
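
One rough way to answer the first two questions is to inspect the scratch directory after the crash. The sketch below makes several assumptions that should be checked against your own setup: that the scratch directory survives the crash, that it is named dir_<starter pid> under the EXECUTE directory (the path shown is hypothetical), and that the ads are written as .machine.ad and .job.ad. The function name setup_progress is made up for this example.

```python
import os


def setup_progress(scratch_dir, ad_names=(".machine.ad", ".job.ad")):
    """Report which working-directory setup steps left evidence on disk."""
    # Step 1: was the working directory created at all?
    made = os.path.isdir(scratch_dir)
    # Step 2: were the ad files written into it? (file names are an assumption)
    ads = {
        name: made and os.path.isfile(os.path.join(scratch_dir, name))
        for name in ad_names
    }
    return {"directory_made": made, "ads_written": ads}


if __name__ == "__main__":
    # Hypothetical path: adjust to your EXECUTE directory and starter pid.
    print(setup_progress(r"C:\condor\execute\dir_336"))
```

For the encryption question, running condor_config_val ENCRYPT_EXECUTE_DIRECTORY on the execute node should show whether the knob is set.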

None of this code ever calls exit, so the exit must be happening down inside some library.

-tj


On 7/10/2014 1:21 PM, Ben Cotton wrote:
I'm running HTCondor 8.2.1 in a small cluster on AWS and I'm having a
hard time getting my Windows jobs to run. The Windows execute node is
Server 2k8 R2 (which HTCondor identifies as Windows 7). The job
matches, appears to start, but then the condor_starter.exe dies. The
StartLog records:

07/10/14 17:48:47 condor_read() failed: recv(fd=1012) returned
-1, errno = 10054 , reading 5 bytes from <127.0.0.1:50882>.
07/10/14 17:48:47 IO: Failed to read packet header
07/10/14 17:48:47 Closing job ClassAd update socket from starter.
07/10/14 17:48:47 Starter pid 336 exited with status -1073740940

From the StarterLog:
07/10/14 17:48:47 (fd:7) (pid:336) (D_HOSTNAME) Daemon client (shadow)
address determined: name: "ip-10-151-7-218.ec2.internal", pool:
"NULL", alias: "NULL", addr: "<10.151.7.218:48140?noUDP>"
07/10/14 17:48:47 (fd:7) (pid:336) (D_ALWAYS) Communicating with
shadow <10.151.7.218:48140?noUDP>
07/10/14 17:48:47 (fd:7) (pid:336) (D_ALWAYS) Submitting machine is
"ip-10-151-7-218.ec2.internal"
07/10/14 17:48:47 (fd:7) (pid:336) (D_SYSCALLS) Doing
CONDOR_register_starter_info
07/10/14 17:48:47 (fd:7) (pid:336) (D_NETWORK) condor_write(fd=604
<10.151.7.218:59144>,,size=515,timeout=300,flags=0,non_blocking=0)
07/10/14 17:48:47 (fd:7) (pid:336) (D_NETWORK) condor_read(fd=604
<10.151.7.218:59144>,,size=5,timeout=300,flags=0,non_blocking=0)
07/10/14 17:48:47 (fd:7) (pid:336) (D_NETWORK) condor_read(fd=604
<10.151.7.218:59144>,,size=8,timeout=300,flags=0,non_blocking=0)
07/10/14 17:48:47 (fd:7) (pid:336) (D_ALWAYS) setting the orig job
name in starter
07/10/14 17:48:47 (fd:7) (pid:336) (D_ALWAYS) setting the orig job iwd
in starter
07/10/14 17:48:47 (fd:7) (pid:336) (D_PRIV) PRIV_CONDOR -->
PRIV_CONDOR at c:\condor\execute\dir_18128\userdir\src\condor_starter.v6.1\basestarter.cpp:1789

And then it goes poof. I see on the MagicNumbers page[1] that negative
statuses might mean "Possibly missing libraries or missing functions
in libraries on Windows. Try running from the command line to see if
you get any errors." I tried running from the command line and got no
output, error or otherwise. The other daemons seem to be fine. Any
ideas what's going on here?

[1] https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=MagicNumbers


Thanks,
BC