Hi Alain, Todd, and all the Condor team,

I am enclosing below a similar error message received from the same machine; this time it has some more information about its memory status.
-Guy

This is an automated email from the Condor system on machine
"L002W021.pubclass.ad.bgu.ac.il". Do not reply.

"C:\Condor/bin/condor_startd.exe" on "L002W021.pubclass.ad.bgu.ac.il" died due
to exception STACK_OVERFLOW. Condor will automatically restart this process in
17 seconds.

*** Last 20 line(s) of file StartLog:
1/1 12:33:02 ** C:\Condor\bin\condor_startd.exe
1/1 12:33:02 ** $CondorVersion: 6.6.10 Jun 22 2005 $
1/1 12:33:02 ** $CondorPlatform: INTEL-WINNT50 $
1/1 12:33:02 ** PID = 904
1/1 12:33:02 ******************************************************
1/1 12:33:02 Using config file: C:\Condor\condor_config
1/1 12:33:02 Using local config files: C:\Condor/condor_config.local
1/1 12:33:02 DaemonCore: Command Socket at <132.72.69.42:4054>
1/1 12:33:02 "C:\Condor/bin/condor_starter.exe -classad" did not produce any output, ignoring
1/1 12:33:02 "C:\Condor/bin/condor_starter.pvm -classad" did not produce any output, ignoring
1/1 12:33:02 "C:\Condor/bin/condor_starter.std -classad" did not produce any output, ignoring
1/1 12:33:02 New machine resource allocated
1/1 12:33:07 no loadavg samples this minute, maybe thread died???
1/1 12:33:07 About to run initial benchmarks.
1/1 12:33:13 Completed initial benchmarks.
1/1 12:33:13 State change: IS_OWNER is false
1/1 12:33:13 Changing state: Owner -> Unclaimed
1/1 12:33:13 new Packet failed. out of memory
1/1 12:33:13 ERROR "new Packet failed. out of memory" at line 625 in file ..\src\condor_io\SafeMsg.C
1/1 12:33:13 Deleting Cronmgr
*** End of file StartLog

*** Last entry in core file core.STARTD.WIN32
============================
Exception code: C00000FD STACK_OVERFLOW
Fault address:  004503A7 01:0004F3A7 C:\Condor\bin\condor_startd.exe

Registers:
EAX:0000BB54  EBX:00000000  ECX:0011DD74  EDX:008AF718  ESI:00804638  EDI:00000000
CS:EIP:001B:004503A7  SS:ESP:0023:00120D6C  EBP:00120D80
DS:0023  ES:0023  FS:0038  GS:0000
Flags:00010202

Call stack:
Address  Frame     Logical addr   Module
004503A7 00120D80  0001:0004F3A7  C:\Condor\bin\condor_startd.exe
0041790A 00120E14  0001:0001690A  C:\Condor\bin\condor_startd.exe
00408D61 00120F94  0001:00007D61  C:\Condor\bin\condor_startd.exe
00408768 001211C8  0001:00007768  C:\Condor\bin\condor_startd.exe
0043976E 001211E8  0001:0003876E  C:\Condor\bin\condor_startd.exe
00436553 00121204  0001:00035553  C:\Condor\bin\condor_startd.exe
0044265C 0012FDC0  0001:0004165C  C:\Condor\bin\condor_startd.exe
0044535A 0012FDF8  0001:0004435A  C:\Condor\bin\condor_startd.exe
0043D556 0012FE30  0001:0003C556  C:\Condor\bin\condor_startd.exe
00444C3E 00489698  0001:00043C3E  C:\Condor\bin\condor_startd.exe
*** End of file core.STARTD.WIN32
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator: tel-zur@xxxxxxxxxxxx
The Official Condor Homepage is http://www.cs.wisc.edu/condor

Todd Tannenbaum wrote:
On Tue, Jan 03, 2006 at 12:59:31PM +0000, Angel de Vicente wrote:
> thanks for the suggestion. I would love to know how to get the standard
> universe without the shadow/IO abilities, anyone?

In the v6.7.x series, you can put want_remote_io = false into your submit
file. Does this accomplish what you want?

From the condor_submit man page (again, ver 6.7.x):

want_remote_io = <True | False>
    This option controls how a file is opened and manipulated in a standard
    universe job. If this option is true, which is the default, then the
    condor_shadow makes all decisions about how each and every file should be
    opened by the executing job. This entails a network round trip (or more)
    from the job to the condor_shadow and back again for every single open(),
    in addition to other needed information about the file. If set to false,
    then when the job queries the condor_shadow for the first time about how
    to open a file, the condor_shadow will inform the job to automatically
    perform all of its file manipulation on the local file system on the
    execute machine, and any file remapping will be ignored. This means that
    there must be a shared file system (such as NFS or AFS) between the
    execute machine and the submit machine, and that ALL paths that the job
    could open on the execute machine must be valid. The ability of the
    standard universe job to checkpoint, possibly to a checkpoint server, is
    not affected by this attribute. However, when the job resumes it will be
    expecting the same file system conditions that were present when the job
    checkpointed.

regards,
Todd

_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users
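[Editor's note: for readers who want to try the option Todd describes, a minimal standard-universe submit file might look like the sketch below. The executable name and the /shared/... paths are placeholders, not from this thread; as the man-page excerpt above notes, want_remote_io requires the v6.7.x series, and with it set to false every path must be reachable on the execute machine via a shared file system.]

```
# Hypothetical submit-description file -- names and paths are illustrative only.
universe       = standard
executable     = my_job                  # must be relinked with condor_compile
want_remote_io = false                   # skip shadow round trips for each open()
input          = /shared/data/in.dat     # must be valid on the execute machine too
output         = /shared/data/out.dat    # requires a shared FS such as NFS or AFS
error          = /shared/data/err.log
log            = my_job.log
queue
```

Submit as usual with condor_submit; checkpointing is unaffected by this setting, but a resumed job expects the same file-system conditions it saw when it checkpointed.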