The interesting bit seems to be the part of the call stack below,
which suggests a memory trashing bug. Could you send the core dump
for the CRED as well? Also, you say this is repeatable. Are the core
dumps for MASTER and CRED always the same?
Address Frame
77427F1A 01AFE2E0 RtlAnsiStringToUnicodeString+171
7742730A 01AFE3D8 RtlEnumerateGenericTableWithoutSplaying+548
77427545 01AFE3F4 RtlEnumerateGenericTableWithoutSplaying+783
770B9A26 01AFE408 HeapFree+14
1001E4E6 01AFE448 pcre_version+2396
0048473C 01AFE47C Regex::~Regex (c:\condor\execute\dir_560\userdir\src\condor_c++_util\regex.cpp:72)
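One quick way to check whether two dumps describe the same crash is to compare their headers. The following is a minimal Python sketch, not a Condor tool: the function names are made up for illustration, the header layout is taken from the two headers quoted in this thread, and reading "01:00066F1A" as a section:offset pair inside the module is an assumption.

```python
import re

# Header layout as quoted in this thread. \s+ also matches line breaks,
# so the pattern tolerates fields that were wrapped onto separate lines.
HEADER_RE = re.compile(
    r"PID:\s*(?P<pid>\d+)\s+"
    r"Exception code:\s*(?P<code>[0-9A-Fa-f]{8})\s+(?P<name>\w+)\s+"
    r"Fault address:\s*(?P<addr>[0-9A-Fa-f]{8})\s+"
    r"(?P<section>[0-9A-Fa-f]{2}):(?P<offset>[0-9A-Fa-f]{8})\s+"
    r"(?P<module>\S+)"
)


def parse_header(dump_text):
    """Return the header fields of a core.*.WIN32 text dump as a dict."""
    match = HEADER_RE.search(dump_text)
    if match is None:
        raise ValueError("no dump header found")
    return match.groupdict()


def fault_location(header):
    """Module plus section:offset (assumed meaning of the '01:...' field)."""
    return (header["module"].lower(), header["section"], header["offset"].upper())


def same_fault(dump_a, dump_b):
    """True if both dumps faulted at the same place in the same module."""
    return fault_location(parse_header(dump_a)) == fault_location(parse_header(dump_b))


# The two headers quoted in this thread.
MASTER_HEADER = r"""PID: 4052
Exception code: C0000005 ACCESS_VIOLATION
Fault address: 77427F1A 01:00066F1A C:\Windows\system32\ntdll.dll
"""
FIRST_REPORT_HEADER = r"""PID: 660
Exception code: C0000005 ACCESS_VIOLATION
Fault address: 77427F1A 01:00066F1A C:\Windows\system32\ntdll.dll
"""

if __name__ == "__main__":
    print(same_fault(MASTER_HEADER, FIRST_REPORT_HEADER))
```

The sketch compares the module and section:offset rather than the absolute fault address, on the assumption that a module's load address may differ from one run to the next. A matching header only shows that the fault landed in the same place; the call stacks would still need to be compared to confirm that the same code path led there.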
On 2/10/2011 4:52 PM, Michael O'Donnell wrote:
Here is what the core.MASTER.WIN32 file states. I do not know enough
about interpreting these, but the other files seem similar.
Thanks
//=====================================================
PID: 4052
Exception code: C0000005 ACCESS_VIOLATION
Fault address: 77427F1A 01:00066F1A C:\Windows\system32\ntdll.dll
Registers:
EAX:0126CCB8
EBX:01340000
ECX:00000000
EDX:00000000
ESI:0126CCB0
EDI:0127A100
CS:EIP:001B:77427F1A
SS:ESP:0023:01AFE2B8 EBP:01AFE2E0
DS:0023 ES:0023 FS:003B GS:0000
Flags:00010246
Call stack:
Address Frame
77427F1A 01AFE2E0 RtlAnsiStringToUnicodeString+171
7742730A 01AFE3D8 RtlEnumerateGenericTableWithoutSplaying+548
77427545 01AFE3F4 RtlEnumerateGenericTableWithoutSplaying+783
770B9A26 01AFE408 HeapFree+14
1001E4E6 01AFE448 pcre_version+2396
0048473C 01AFE47C Regex::~Regex (c:\condor\execute\dir_560\userdir\src\condor_c++_util\regex.cpp:72)
0048F9E7 01AFE4B8 __ArrayUnwind (f:\dd\vctools\crt_bld\self_x86\crt\prebuild\eh\ehvecdtr.cpp:128)
0048FA86 01AFE4CC `eh vector destructor iterator' (f:\dd\vctools\crt_bld\self_x86\crt\prebuild\eh\ehvecdtr.cpp:134)
004BB986 01AFE4D0 __NLG_Return2+0
004B0E60 01AFE4FC _local_unwind4+80
004B0F2C 01AFE510 @_EH4_LocalUnwind@16+10
0049F227 01AFE544 _except_handler4+187
77425F79 01AFE568 RtlRaiseStatus+B4
77425F4B 01AFE930 RtlRaiseStatus+86
773C9C0F 01AFE954 WinSqmStartSession+490
773C4081 01AFE97C RtlGetLengthWithoutTrailingPathSeperators+431
77425F79 01AFE9A0 RtlRaiseStatus+B4
77425F4B 01AFEA50 RtlRaiseStatus+86
77425DD7 01AFED78 KiUserExceptionDispatcher+F
7742730A 01AFEE70 RtlEnumerateGenericTableWithoutSplaying+548
77427545 01AFEE8C RtlEnumerateGenericTableWithoutSplaying+783
770B9A26 01AFEEA0 HeapFree+14
1001E4E6 01AFEEE0 pcre_version+2396
0048473C 01AFEF14 Regex::~Regex (c:\condor\execute\dir_560\userdir\src\condor_c++_util\regex.cpp:72)
0048FA52 01AFEF48 `eh vector destructor iterator' (f:\dd\vctools\crt_bld\self_x86\crt\prebuild\eh\ehvecdtr.cpp:134)
00478523 01AFEF98 MapFile::CanonicalMapEntry::`vector deleting destructor'+20
00478B06 01AFF044 ExtArray<MapFile::CanonicalMapEntry>::operator[] (c:\condor\execute\dir_560\userdir\src\condor_c++_util\extarray.h:152)
0041FA9F 01AFF0E8 Authentication::map_authentication_name_to_canonical_name (c:\condor\execute\dir_560\userdir\src\condor_io\authentication.cpp:408)
004205AE 01AFF16C Authentication::authenticate_inner (c:\condor\execute\dir_560\userdir\src\condor_io\authentication.cpp:358)
0042066D 01AFF190 Authentication::authenticate (c:\condor\execute\dir_560\userdir\src\condor_io\authentication.cpp:113)
0042069F 01AFF1B0 Authentication::authenticate (c:\condor\execute\dir_560\userdir\src\condor_io\authentication.cpp:86)
0041AEC8 01AFF1FC ReliSock::perform_authenticate (c:\condor\execute\dir_560\userdir\src\condor_io\reli_sock.cpp:973)
0041AF5D 01AFF21C ReliSock::authenticate (c:\condor\execute\dir_560\userdir\src\condor_io\reli_sock.cpp:1001)
00422F56 01AFF264 SecManStartCommand::authenticate_inner (c:\condor\execute\dir_560\userdir\src\condor_io\condor_secman.cpp:1797)
00426497 01AFF2C0 SecManStartCommand::startCommand_inner (c:\condor\execute\dir_560\userdir\src\condor_io\condor_secman.cpp:1164)
00422941 01AFF2E8 SecManStartCommand::startCommand (c:\condor\execute\dir_560\userdir\src\condor_io\condor_secman.cpp:1095)
00425638 01AFF32C SecManStartCommand::DoTCPAuth_inner (c:\condor\execute\dir_560\userdir\src\condor_io\condor_secman.cpp:2114)
00425C3A 01AFF3F0 SecManStartCommand::sendAuthInfo_inner (c:\condor\execute\dir_560\userdir\src\condor_io\condor_secman.cpp:1375)
00498934 01AFF44C vfprintf_helper (f:\dd\vctools\crt_bld\self_x86\crt\src\vfprintf.c:79)
00416EE4 01AFF478 _condor_dprintf_va (c:\condor\execute\dir_560\userdir\src\condor_util_lib\dprintf.c:385)
00413F27 01AFF50C dprintf (c:\condor\execute\dir_560\userdir\src\condor_util_lib\dprintf_common.c:76)
00422941 01AFF534 SecManStartCommand::startCommand (c:\condor\execute\dir_560\userdir\src\condor_io\condor_secman.cpp:1095)
00423BAD 01AFF55C SecMan::startCommand (c:\condor\execute\dir_560\userdir\src\condor_io\condor_secman.cpp:984)
0046A09F 01AFF5D8 Daemon::startCommand (c:\condor\execute\dir_560\userdir\src\condor_daemon_client\daemon.cpp:581)
0046BA1F 01AFF61C Daemon::startCommand (c:\condor\execute\dir_560\userdir\src\condor_daemon_client\daemon.cpp:634)
0046BA55 01AFF65C Daemon::startCommand (c:\condor\execute\dir_560\userdir\src\condor_daemon_client\daemon.cpp:643)
0047F29A 01AFF6B0 DCMessenger::sendBlockingMsg (c:\condor\execute\dir_560\userdir\src\condor_daemon_client\dc_message.cpp:352)
0046A9AE 01AFF6E0 Daemon::sendBlockingMsg (c:\condor\execute\dir_560\userdir\src\condor_daemon_client\daemon.cpp:2307)
0043B950 01AFF718 DaemonCore::Send_Signal (c:\condor\execute\dir_560\userdir\src\condor_daemon_core.v6\daemon_core.cpp:5367)
0043CB3A 01AFF748 DaemonCore::Send_Signal (c:\condor\execute\dir_560\userdir\src\condor_daemon_core.v6\daemon_core.cpp:5116)
0040520A 01AFF784 daemon::Kill (c:\condor\execute\dir_560\userdir\src\condor_master.v6\masterdaemon.cpp:1187)
00407970 01AFF7D0 daemon::Reconfig (c:\condor\execute\dir_560\userdir\src\condor_master.v6\masterdaemon.cpp:1226)
00438C74 01AFF9C8 DaemonCore::HandleReq (c:\condor\execute\dir_560\userdir\src\condor_daemon_core.v6\daemon_core.cpp:4894)
00438EAD 01AFF9D8 DaemonCore::HandleReq (c:\condor\execute\dir_560\userdir\src\condor_daemon_core.v6\daemon_core.cpp:3772)
00439067 01AFFA0C DaemonCore::CallSocketHandler_worker (c:\condor\execute\dir_560\userdir\src\condor_daemon_core.v6\daemon_core.cpp:3468)
0043938E 01AFFA2C DaemonCore::CallSocketHandler_worker_demarshall (c:\condor\execute\dir_560\userdir\src\condor_daemon_core.v6\daemon_core.cpp:3424)
00439663 01AFFA54 DaemonCore::CallSocketHandler (c:\condor\execute\dir_560\userdir\src\condor_daemon_core.v6\daemon_core.cpp:3412)
0043B664 01AFFAF0 DaemonCore::Driver (c:\condor\execute\dir_560\userdir\src\condor_daemon_core.v6\daemon_core.cpp:3325)
0048F1A1 01AFFAFC free (f:\dd\vctools\crt_bld\self_x86\crt\src\free.c:110)
//=====================================================
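For anyone who would rather skim a dump like the one above than read it frame by frame, here is a rough Python sketch that pulls the frames out of the "Call stack:" section and reports the first one that carries a source file and line. The function names and the `under` filter are hypothetical, and the parsing assumes only what is visible in this dump: each frame starts with two 8-digit hex values, followed by a symbol and an optional source location in parentheses.

```python
import re

# A frame starts with two 8-digit hex values (address, frame pointer).
# Everything up to the next such pair is the symbol, optionally followed
# by "(path:line)". Matching across line breaks lets this handle both
# one-frame-per-line dumps and dumps whose lines were wrapped in transit.
FRAME_RE = re.compile(
    r"\b([0-9A-Fa-f]{8})\s+([0-9A-Fa-f]{8})\s+(.*?)"
    r"(?=\s+[0-9A-Fa-f]{8}\s+[0-9A-Fa-f]{8}\s|\s*$)",
    re.DOTALL,
)
SOURCE_RE = re.compile(r"\s*\(([A-Za-z]:\\[^()]*:\d+)\)$")


def parse_frames(dump_text):
    """Return a list of dicts: address, frame, symbol, source (or None)."""
    _, _, stack = dump_text.partition("Call stack:")
    stack = stack.split("//==", 1)[0]  # stop at the closing separator line
    frames = []
    for address, frame_ptr, rest in FRAME_RE.findall(stack):
        rest = " ".join(rest.split())  # undo any line wrapping
        source = None
        match = SOURCE_RE.search(rest)
        if match:
            source = match.group(1)
            rest = rest[:match.start()]
        frames.append({"address": address, "frame": frame_ptr,
                       "symbol": rest, "source": source})
    return frames


def first_frame_with_source(frames, under=None):
    """First frame that has a source location, optionally under a path."""
    for frame in frames:
        if frame["source"] and (under is None or under.lower() in frame["source"].lower()):
            return frame
    return None


# The top of the call stack from the dump in this message.
SAMPLE = r"""Call stack:
Address Frame
77427F1A 01AFE2E0
RtlAnsiStringToUnicodeString+171
770B9A26 01AFE408 HeapFree+14
1001E4E6 01AFE448
pcre_version+2396
0048473C 01AFE47C Regex::~Regex
(c:\condor\execute\dir_560\userdir\src\condor_c++_util\regex.cpp:72)
0048F9E7 01AFE4B8
__ArrayUnwind
(f:\dd\vctools\crt_bld\self_x86\crt\prebuild\eh\ehvecdtr.cpp:128)
"""

if __name__ == "__main__":
    hit = first_frame_with_source(parse_frames(SAMPLE), under=r"\src\condor")
    print(hit["symbol"], hit["source"])
```

The first frame with source information is only where the symptom surfaces in the application's own code. With heap corruption, the code that actually damaged the memory may have run much earlier and need not appear on this stack at all.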
- - - - - - - - - - - - - - - - - - - - - - - - - -
Michael O'Donnell
ADP Software Specialist, ASRC Management Services
USGS Fort Collins Science Center
2150 Centre Ave., Bldg C
Fort Collins, CO 80526
Phone: 970.226.9407
Fax: 970.226.9230
Email: odonnellm@xxxxxxxx
The file being generated should be a core dump file. You should be
able to look inside it to see where Condor is crashing, or send it
our way for us to investigate.
Z
Condor Project
On Thu, Feb 10, 2011 at 2:23 PM, Michael O'Donnell <odonnellm@xxxxxxxx> wrote:
While trying to figure this out I noticed a couple of things. First, my
cred service is dying on the central manager, which produces the
core.CRED.WIN32 file. If I delete this file the service will generally
restart, but sometimes I have to restart the Condor service to get the
cred service to start again.
I am also noticing that a core.STARTD.WIN32 file is created on my submit
machine, and this might be related to why jobs are remaining idle.
However, I do not know what any of this means. The load average on the
CM is about 30%, with spikes as high as 70%. This seems a little high
since we are not running any other services on the server. The collector
is usually at about 25%, and the spikes are caused by the other Condor
services (mainly the negotiator).
Searching Google for access violations in C:\Windows\system32\ntdll.dll
and for memory problems turns up plenty of results, but because they
vary and because we were not having problems before, I am not making a
lot of progress trying to figure this out. It does seem like these files
are related to the inability of jobs to match when in fact I know that
machines are available.
thanks,
mike
I have noticed on our central manager that two files are created:
core.MASTER.WIN32 and core.CRED.WIN32
The header content of the files includes:
PID: 660
Exception code: C0000005 ACCESS_VIOLATION
Fault address: 77427F1A 01:00066F1A C:\Windows\system32\ntdll.dll
If I delete the files they are re-created, and I do not recall seeing
the files in the past. Does anyone know what this access violation is
about? Could there be a problem with antivirus or something? Our pool is
functioning, except that all jobs remain idle, which started after
expanding our pool from 100 cores to 200 cores (posted earlier today:
"[Condor-users] Job remains in idle (worked until I increased pool
size)"). I don't think this is related, but I am trying to troubleshoot
this.
Thank you for your help,
Mike
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/