Re: [Condor-devel] ProcAPI Messages
- Date: Thu, 14 May 2009 08:46:29 -0500
- From: Matthew Farrellee <matt@xxxxxxxxxx>
- Subject: Re: [Condor-devel] ProcAPI Messages
Have you checked the procd's log? It is often configured with

  PROCD_LOG = $(LOG)/ProcLog

You can also turn the debug level up in the procd with

  PROCD_DEBUG = D_FULLDEBUG
Best,
matt
Jim Summers wrote:
> Hello All,
>
> I set RESTART_PROCD_ON_ERROR to false and then started up the daemons again.
> It confirmed my suspicions that the schedd was still not happy. Actually, it
> is condor_procd that seems to be the one having trouble. The SchedLog has
> the following:
> ===
> 5/14 08:14:55 (fd:12) (pid:93500) DaemonCore--> id = 15, when = 1242335695,
> period = 0, handler_descrip=<dc_touch_lock_files>
> 5/14 08:14:55 (fd:12) (pid:93500) DaemonCore--> id = 6, when = 1242335702,
> period = 28807, handler_descrip=<DaemonCore::refreshDNS()>
> 5/14 08:14:55 (fd:12) (pid:93500) DaemonCore--> id = 11, when = 1242393295,
> period = 86400, handler_descrip=<CleanJobQueue>
> 5/14 08:14:55 (fd:12) (pid:93500)
> 5/14 08:14:55 (fd:12) (pid:93500) leaving DaemonCore NewTimer, id=7
> 5/14 08:14:55 (fd:12) (pid:93500) DaemonCore Timeout() Complete, returning 5
> 5/14 08:14:55 (fd:12) (pid:93500) selector 0xbffff554 resetting
> 5/14 08:14:55 (fd:12) (pid:93500) selector 0xbffff554 adding fd 8 ()
> 5/14 08:14:55 (fd:12) (pid:93500) selector 0xbffff554 adding fd 9 ()
> 5/14 08:14:55 (fd:12) (pid:93500) selector 0xbffff554 adding fd 7 ()
> 5/14 08:14:55 (fd:12) (pid:93500) selector 0xbffff554 adding fd 10 ()
> 5/14 08:14:55 (fd:12) (pid:93500) selector 0xbffff554 adding fd 3 ()
> 5/14 08:14:55 (fd:12) (pid:93500) Calling Handler <HandleDC_SERVICEWAITPIDS()>
> for Signal 60009 <DC_SERVICEWAITPIDS>
> 5/14 08:14:55 (fd:12) (pid:93500) DaemonCore: pid 93501 exited with status 10,
> invoking reaper 1 <condor_procd reaper>
> 5/14 08:14:55 (fd:12) (pid:93500) procd (pid = 93501) exited unexpectedly with
> status 10
> 5/14 08:14:55 (fd:12) (pid:93500) Config 'RESTART_PROCD_ON_ERROR': no prefix
> ==> 'False'
> 5/14 08:14:55 (fd:12) (pid:93500) ERROR "ProcD has failed" at line 599 in file
> proc_family_proxy.cpp
> 5/14 08:15:06 (fd:3) (pid:93520) LOGS_USE_TIMESTAMP is undefined, using
> default value of False
> ===
>
> I am not sure what the issue at hand could be.
>
> Ideas or suggestions greatly appreciated.
>
> Thanks
>
>
>
> Jim Summers wrote:
>> Hello All,
>>
>> Still working on getting 7.2.2 to fly on OS X Leopard. After hacking the
>> age, creation_time and sample_time variables to use an unsigned long type,
>> things seemed to get better, but still no joy. Then Peter's suggestion of
>> upping the maxprocs helped things a lot, but alas, still no joy.
>>
>> Now the condor_master, schedd, startd are all running but no job submissions
>> will run. It seems that schedd is generating tons of log and consuming
>> anywhere from 18-25% of the cpu as reported by top.
>>
>> I found an OS X crash report message for condor_procd in /var/log/system.log.
>> Here are the contents of the crash report:
>>
>> Process: condor_procd [1324]
>> Path: /usr/local/condor/sbin/condor_procd
>> Identifier: condor_procd
>> Version: ??? (???)
>> Code Type: X86 (Native)
>> Parent Process: condor_schedd [98692]
>>
>> Date/Time: 2009-05-13 16:20:23.369 -0500
>> OS Version: Mac OS X 10.5.6 (9G3553)
>> Report Version: 6
>>
>> Exception Type: EXC_BAD_ACCESS (SIGBUS)
>> Exception Codes: KERN_PROTECTION_FAILURE at 0x0000000000000219
>> Crashed Thread: 0
>>
>> Thread 0 Crashed:
>> 0 condor_procd 0x0000633b
>> ProcFamilyMonitor::snapshot() + 63 (proc_family_monitor.cpp:509)
>> 1 condor_procd 0x000072ce
>> ProcFamilyMonitor::ProcFamilyMonitor(int, long, int) + 1270
>> 2 condor_procd 0x00002495 main + 451 (procd_main.cpp:337)
>> 3 condor_procd 0x00001e46 start + 54
>>
>> Thread 0 crashed with X86 Thread State (32-bit):
>> eax: 0x000001f5 ebx: 0x00006308 ecx: 0xbffff8cc edx: 0x00800224
>> edi: 0x00000000 esi: 0x00100350 ebp: 0xbffff818 esp: 0xbffff7d0
>> ss: 0x0000001f efl: 0x00010206 eip: 0x0000633b cs: 0x00000017
>> ds: 0x0000001f es: 0x0000001f fs: 0x00000000 gs: 0x00000037
>> cr2: 0x00000219
>>
>> Binary Images:
>> 0x1000 - 0x12ff0 +condor_procd ??? (???)
>> <b3c764c9b34f126e5833933112905bc2> /usr/local/condor/sbin/condor_procd
>> 0x8fe00000 - 0x8fe2db43 dyld 97.1 (???) <9736a715ebabb914fef61680520dc1e0>
>> /usr/lib/dyld
>> 0x9234b000 - 0x92352fe9 libgcc_s.1.dylib ??? (???)
>> <e280ddf3f5fb3049e674edcb109f389a> /usr/lib/libgcc_s.1.dylib
>> 0x936ca000 - 0x936e8fff libresolv.9.dylib ??? (???)
>> <39f6d8651f3dca7a1534fa04322e6763> /usr/lib/libresolv.9.dylib
>> 0x9372d000 - 0x93894ff3 libSystem.B.dylib ??? (???)
>> <0ddbaae699690b09239f69dea7d0fbb0> /usr/lib/libSystem.B.dylib
>> 0x9487f000 - 0x948dcffb libstdc++.6.dylib ??? (???)
>> <7d389389a99ce696726cf4c8980cc505> /usr/lib/libstdc++.6.dylib
>> 0x96fa9000 - 0x96fadfff libmathCommon.A.dylib ??? (???)
>> /usr/lib/system/libmathCommon.A.dylib
>> 0xffff0000 - 0xffff1780 libSystem.B.dylib ??? (???) /usr/lib/libSystem.B.dylib
>>
>> It is beyond my comprehension, but I thought it may help.
>>
>>
>> I am going to find the code contribution agreement and get that submitted
>> also. Although at this point I am not confident that the unsigned long was
>> the right thing to do.
>>
>> TIA
>>
>> Peter Keller wrote:
>>> On Mon, May 11, 2009 at 03:32:20PM -0500, Jim Summers wrote:
>>>> Hello All,
>>>>
>>>> I modified the procapi.h file so that all of the age, creation_time and
>>>> sample_time variables use an unsigned long type. That seems to have fixed the
>>>> ProcAPI errors that we were seeing.
>>>>
>>>> But now we are seeing the following in SchedLog:
>>>> 5/11 14:53:38 (fd:7) (pid:57011) In
>>>> DaemonCore::Create_Process(/usr/local/condor/sbin/condor_procd,...)
>>>> 5/11 14:53:38 (fd:7) (pid:57011) PRIV_CONDOR --> PRIV_ROOT at daemon_core.cpp:6852
>>>> 5/11 14:53:38 (fd:7) (pid:57011) PRIV_ROOT --> PRIV_CONDOR at daemon_core.cpp:6885
>>>> 5/11 14:53:38 (fd:11) (pid:57011) Create Process: fork() failed: Resource
>>>> temporarily unavailable (35)
>>>> 5/11 14:53:38 (fd:7) (pid:57011) start_procd: unable to execute the procd
>>>> 5/11 14:53:38 (fd:5) (pid:57011) Close_Pipe(pipe_end=65536) succeeded
>>>> 5/11 14:53:38 (fd:5) (pid:57011) Close_Pipe(pipe_end=65537) succeeded
>>>> 5/11 14:53:38 (fd:5) (pid:57011) ERROR "unable to start the ProcD" at line 620
>>>> in file proc_family_proxy.cpp
>>>>
>>>> I am not sure what to do at this point?
>>>>
>>>> Ideas / Suggestions?
>>> Do you have your process limit set really low for your uid?
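
For reference, errno 35 on Mac OS X is EAGAIN, which fork() returns when a process limit has been reached. A minimal sketch for checking the per-user limit, assuming Python is available on the machine and the script is run as the same user the Condor daemons run as (`nproc_limits` and `describe_limit` are illustrative names, not part of Condor):

```python
import resource


def describe_limit(value):
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def nproc_limits():
    """Return the (soft, hard) per-user process limits seen by this process."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


if __name__ == "__main__":
    soft, hard = nproc_limits()
    print("max user processes: soft=%s hard=%s"
          % (describe_limit(soft), describe_limit(hard)))
```

`ulimit -u` in the shell reports the same soft limit. If the value is close to the number of processes that user already owns, fork() failing with "Resource temporarily unavailable" is the expected symptom.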
>>>
>>> As for the code changes, you could attach a patch to your message
>>> and we can see if we can apply it. I'll have to scrutinize the patch
>>> closely because even though one might think the age of a process can't
>>> be negative, due to kernel issues a negative age actually could be
>>> calculated, so I'd need to do some inspection.
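
The concern about negative ages can be illustrated with a small sketch. It is not the ProcAPI code: the function names are hypothetical, the timestamps are made up for illustration, and `ctypes` is only used to mimic how C stores the result in a `long` versus an `unsigned long`:

```python
import ctypes


def age_signed(sample_time, creation_time):
    """Age stored in a signed C long: a creation time ahead of the
    sample time yields a small negative number."""
    return ctypes.c_long(sample_time - creation_time).value


def age_unsigned(sample_time, creation_time):
    """Same subtraction stored in an unsigned C long: a negative
    result wraps around to a very large positive number."""
    return ctypes.c_ulong(sample_time - creation_time).value


if __name__ == "__main__":
    sample = 1242335695    # illustrative sample time (seconds)
    creation = 1242335698  # illustrative creation time, 3 s "in the future"
    print("signed age:  ", age_signed(sample, creation))
    print("unsigned age:", age_unsigned(sample, creation))
```

With an unsigned type, a creation time that is a few seconds ahead of the sample time shows up as an enormous age rather than a small negative one, so any check for `age < 0` would never fire.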
>>>
>>> Have you signed our code contribution agreement?
>>>
>>> Thank you.
>>>
>>> -pete
>>> _______________________________________________
>>> Condor-devel mailing list
>>> Condor-devel@xxxxxxxxxxx
>>> https://lists.cs.wisc.edu/mailman/listinfo/condor-devel
>