Hi,
I have been using Condor as a user for some time. Recently I started using MSYS (http://www.mingw.org/wiki/msys), calling MSYS commands from some scripts. We have 8 condor_starter processes running concurrently on an 8-core Windows XP machine.
To reproduce the issue, I created a script that just copies some files and then removes them using MSYS commands.
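A minimal sketch of this kind of copy-then-remove test, written in Python purely for illustration (it is not the actual script). The function name copy_then_remove, the 60-second timeout, and the assumption that cp and rm on PATH resolve to the MSYS binaries are all hypothetical. The timeout is there only so that a hung cp.exe raises an error instead of blocking the job forever.

```python
import os
import subprocess


def copy_then_remove(src_dir, dst_dir, cp="cp", rm="rm", timeout=60):
    """Copy each regular file in src_dir to dst_dir with cp, then delete the copies with rm.

    cp and rm are assumed to be the MSYS executables found on PATH (hypothetical setup).
    Returns the number of files that were copied and removed.
    """
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        if not os.path.isfile(src):
            continue  # skip subdirectories
        dst = os.path.join(dst_dir, name)
        # check=True raises CalledProcessError on a non-zero exit status;
        # timeout raises TimeoutExpired if the command hangs.
        subprocess.run([cp, src, dst], check=True, timeout=timeout)
        copied.append(dst)
    for dst in copied:
        subprocess.run([rm, "-f", dst], check=True, timeout=timeout)
    return len(copied)
```

Each Condor job would call something like copy_then_remove(source, scratch) once, so that 8 such jobs run the MSYS commands side by side on the machine.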
I submitted 10000 jobs running this script (with MSYS calls) to the Condor farm. After the jobs have been running concurrently for some time (1 or 2 hours), one MSYS command (e.g. cp.exe) hangs. I attached the MinGW gdb to the hanging command and got the following call stack.
(gdb) where
#0 0x7d61002e in strchr () from C:\WINDOWS\system32\ntdll.dll
#1 0x7d666ea1 in ntdll!RtlCopyUnicodeString () from C:\WINDOWS\system32\ntdll.dll
#2 0x00000000 in ?? ()
While this process is hanging, subsequent jobs that call MSYS commands fail mysteriously with the following error.
cp: cannot stat `s:/data/regtestfiles/main/current/regtest/infrastructure/regutils/reg_run/reg_copy_files//prev/reg.rout': No such file or directory
...
The file does actually exist. If I kill the hanging process, subsequent jobs go back to normal.
I checked in Process Explorer: the hanging process and my script are both child processes of condor_master->condor_starter.
Btw, it works fine on Windows 7.
I also tried running the same workload from 8 cmd prompts (not from condor_starter), and everything ran fine.
Thanks in advance for your replies and insight.
Regards,
Mun Soon