Re: [Condor-devel] ProcAPI Messages
- Date: Tue, 02 Jun 2009 00:07:18 -0500
- From: Matthew Farrellee <matt@xxxxxxxxxx>
- Subject: Re: [Condor-devel] ProcAPI Messages
Jim Summers wrote:
>
>
> Matthew Farrellee wrote:
>> All I can say is those are mighty large pids.
>>
>
> I am sure that number grew rapidly earlier in the week when I had the
> schedd_restart set to true. It would just continually restart the
> schedd and also procd was firing.
>
> I bounced the machine and the pids are back to smaller numbers. But
> the schedd still dies and procd eventually can't start. Same errors as
> below.
>
> I also tried setting the D_ALL for procd but keep getting errors. Is
> there any way I can turn up the debug for procd?
PROCD_DEBUG = D_FULLDEBUG (or maybe D_ALL, lotsa spam).
Best,
matt
> TIA
>
>> Best,
>>
>>
>> matt
>>
>> Jim Summers wrote:
>>> Matthew Farrellee wrote:
>>>> Have you checked the Procd's log? It is often configured with PROCD_LOG
>>>> = $(LOG)/ProcLog
>>>>
>>>> You can also turn the debug level up in the procd with PROCD_DEBUG =
>>>> D_FULLDEBUG
>>> Made those changes and here is what I am seeing:
>>> ==
>>> bash-3.2# cat ProcLog.SCHEDD
>>> ***********************************
>>> * condor_procd STARTING UP
>>> ***********************************
>>> taking a snapshot...
>>> no methods have determined process 93860 to be in a monitored family
>>> ERROR: master has exited
>>> ***********************************
>>> * condor_procd STARTING UP
>>> ***********************************
>>> taking a snapshot...
>>> no methods have determined process 93872 to be in a monitored family
>>> ERROR: master has exited
>>> ***********************************
>>> * condor_procd STARTING UP
>>> ***********************************
>>> taking a snapshot...
>>> no methods have determined process 93880 to be in a monitored family
>>> ERROR: master has exited
>>> ***********************************
>>> * condor_procd STARTING UP
>>> ***********************************
>>> taking a snapshot...
>>> no methods have determined process 93885 to be in a monitored family
>>> ERROR: master has exited
>>> ***********************************
>>> * condor_procd STARTING UP
>>> ***********************************
>>> taking a snapshot...
>>> no methods have determined process 93891 to be in a monitored family
>>> ERROR: master has exited
>>> ***********************************
>>> * condor_procd STARTING UP
>>> ***********************************
>>> taking a snapshot...
>>> no methods have determined process 93898 to be in a monitored family
>>> ERROR: master has exited
>>> ***********************************
>>> * condor_procd STARTING UP
>>> ***********************************
>>> taking a snapshot...
>>> no methods have determined process 93902 to be in a monitored family
>>> ERROR: master has exited
>>> ***********************************
>>> * condor_procd STARTING UP
>>> ***********************************
>>> taking a snapshot...
>>> no methods have determined process 93908 to be in a monitored family
>>> ERROR: master has exited
>>> ==
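The ProcLog excerpt above repeats the same three-step cycle eight times: startup banner, one unmonitored pid, then "ERROR: master has exited". For longer logs, a small script can condense that pattern. This is only a sketch: the function name `summarize_proclog` is made up for illustration, and the patterns are copied from the lines quoted above rather than from any documented ProcLog format.

```python
import re

# Pattern taken from the ProcLog lines quoted in this thread (an assumption
# about the format, not a documented interface).
PID_RE = re.compile(
    r"no methods have determined process (\d+) to be in a monitored family"
)


def summarize_proclog(text):
    """Count procd startups and 'master has exited' errors, and collect
    the pids that procd could not place in a monitored family."""
    startups = 0
    master_exits = 0
    pids = []
    for raw in text.splitlines():
        line = raw.strip()
        if "condor_procd STARTING UP" in line:
            startups += 1
        elif line == "ERROR: master has exited":
            master_exits += 1
        else:
            match = PID_RE.search(line)
            if match:
                pids.append(int(match.group(1)))
    return {
        "startups": startups,
        "master_exits": master_exits,
        "unmonitored_pids": pids,
    }


if __name__ == "__main__":
    import sys

    # Usage: python summarize_proclog.py ProcLog.SCHEDD
    with open(sys.argv[1]) as handle:
        print(summarize_proclog(handle.read()))
```

If the startup count and the "master has exited" count match, as they do in the excerpt, every procd start ended the same way, which points at one recurring cause rather than an intermittent one.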
>>>
>>> Is there another parameter I have overlooked?
>>>
>>> Thanks Again,
>>>
>>>> Best,
>>>>
>>>>
>>>> matt
>>>>
>>>> Jim Summers wrote:
>>>>> Hello All,
>>>>>
>>>>> I set RESTART_PROCD_ON_ERROR to false and then started up the daemons
>>>>> again. It confirmed my suspicions that the schedd was still not
>>>>> happy. Actually it seems that it is something called condor_procd
>>>>> that seems to be the one that is having trouble. The SchedLog has
>>>>> the following:
>>>>> ===
>>>>> 5/14 08:14:55 (fd:12) (pid:93500) DaemonCore--> id = 15, when =
>>>>> 1242335695, period = 0, handler_descrip=<dc_touch_lock_files>
>>>>> 5/14 08:14:55 (fd:12) (pid:93500) DaemonCore--> id = 6, when =
>>>>> 1242335702, period = 28807, handler_descrip=<DaemonCore::refreshDNS()>
>>>>> 5/14 08:14:55 (fd:12) (pid:93500) DaemonCore--> id = 11, when =
>>>>> 1242393295, period = 86400, handler_descrip=<CleanJobQueue>
>>>>> 5/14 08:14:55 (fd:12) (pid:93500)
>>>>> 5/14 08:14:55 (fd:12) (pid:93500) leaving DaemonCore NewTimer, id=7
>>>>> 5/14 08:14:55 (fd:12) (pid:93500) DaemonCore Timeout() Complete,
>>>>> returning 5
>>>>> 5/14 08:14:55 (fd:12) (pid:93500) selector 0xbffff554 resetting
>>>>> 5/14 08:14:55 (fd:12) (pid:93500) selector 0xbffff554 adding fd 8 ()
>>>>> 5/14 08:14:55 (fd:12) (pid:93500) selector 0xbffff554 adding fd 9 ()
>>>>> 5/14 08:14:55 (fd:12) (pid:93500) selector 0xbffff554 adding fd 7 ()
>>>>> 5/14 08:14:55 (fd:12) (pid:93500) selector 0xbffff554 adding fd 10 ()
>>>>> 5/14 08:14:55 (fd:12) (pid:93500) selector 0xbffff554 adding fd 3 ()
>>>>> 5/14 08:14:55 (fd:12) (pid:93500) Calling Handler
>>>>> <HandleDC_SERVICEWAITPIDS()> for Signal 60009 <DC_SERVICEWAITPIDS>
>>>>> 5/14 08:14:55 (fd:12) (pid:93500) DaemonCore: pid 93501 exited with
>>>>> status 10, invoking reaper 1 <condor_procd reaper>
>>>>> 5/14 08:14:55 (fd:12) (pid:93500) procd (pid = 93501) exited
>>>>> unexpectedly with status 10
>>>>> 5/14 08:14:55 (fd:12) (pid:93500) Config 'RESTART_PROCD_ON_ERROR': no
>>>>> prefix ==> 'False'
>>>>> 5/14 08:14:55 (fd:12) (pid:93500) ERROR "ProcD has failed" at line
>>>>> 599 in file proc_family_proxy.cpp
>>>>> 5/14 08:15:06 (fd:3) (pid:93520) LOGS_USE_TIMESTAMP is undefined,
>>>>> using default value of False
>>>>> ===
>>>>>
>>>>> I am not sure what the issue at hand could be?
>>>>>
>>>>> Ideas or suggestions greatly appreciated.
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>
>>>>> Jim Summers wrote:
>>>>>> Hello All,
>>>>>>
>>>>>> Still working on getting 7.2.2 to fly on OS X Leopard. After
>>>>>> hacking the age, creation_time and sample_time variables to use an
>>>>>> unsigned long type, things seemed to get better, but still no joy.
>>>>>> Then Peter's suggestion of upping the maxprocs helped
>>>>>> things a lot, but alas, still no joy.
>>>>>>
>>>>>> Now the condor_master, schedd, startd are all running but no job
>>>>>> submissions will run. It seems that schedd is generating tons of
>>>>>> log and consuming anywhere from 18-25% of the cpu as reported by top.
>>>>>>
>>>>>> I found an OS X crash report message for condor_procd in
>>>>>> /var/log/system.log. Here are the contents of the crash report:
>>>>>>
>>>>>> Process: condor_procd [1324]
>>>>>> Path: /usr/local/condor/sbin/condor_procd
>>>>>> Identifier: condor_procd
>>>>>> Version: ??? (???)
>>>>>> Code Type: X86 (Native)
>>>>>> Parent Process: condor_schedd [98692]
>>>>>>
>>>>>> Date/Time: 2009-05-13 16:20:23.369 -0500
>>>>>> OS Version: Mac OS X 10.5.6 (9G3553)
>>>>>> Report Version: 6
>>>>>>
>>>>>> Exception Type: EXC_BAD_ACCESS (SIGBUS)
>>>>>> Exception Codes: KERN_PROTECTION_FAILURE at 0x0000000000000219
>>>>>> Crashed Thread: 0
>>>>>>
>>>>>> Thread 0 Crashed:
>>>>>> 0 condor_procd 0x0000633b
>>>>>> ProcFamilyMonitor::snapshot() + 63 (proc_family_monitor.cpp:509)
>>>>>> 1 condor_procd 0x000072ce
>>>>>> ProcFamilyMonitor::ProcFamilyMonitor(int, long, int) + 1270
>>>>>> 2 condor_procd 0x00002495 main + 451
>>>>>> (procd_main.cpp:337)
>>>>>> 3 condor_procd 0x00001e46 start + 54
>>>>>>
>>>>>> Thread 0 crashed with X86 Thread State (32-bit):
>>>>>> eax: 0x000001f5 ebx: 0x00006308 ecx: 0xbffff8cc edx: 0x00800224
>>>>>> edi: 0x00000000 esi: 0x00100350 ebp: 0xbffff818 esp: 0xbffff7d0
>>>>>> ss: 0x0000001f efl: 0x00010206 eip: 0x0000633b cs: 0x00000017
>>>>>> ds: 0x0000001f es: 0x0000001f fs: 0x00000000 gs: 0x00000037
>>>>>> cr2: 0x00000219
>>>>>>
>>>>>> Binary Images:
>>>>>> 0x1000 - 0x12ff0 +condor_procd ??? (???)
>>>>>> <b3c764c9b34f126e5833933112905bc2>
>>>>>> /usr/local/condor/sbin/condor_procd
>>>>>> 0x8fe00000 - 0x8fe2db43 dyld 97.1 (???)
>>>>>> <9736a715ebabb914fef61680520dc1e0> /usr/lib/dyld
>>>>>> 0x9234b000 - 0x92352fe9 libgcc_s.1.dylib ??? (???)
>>>>>> <e280ddf3f5fb3049e674edcb109f389a> /usr/lib/libgcc_s.1.dylib
>>>>>> 0x936ca000 - 0x936e8fff libresolv.9.dylib ??? (???)
>>>>>> <39f6d8651f3dca7a1534fa04322e6763> /usr/lib/libresolv.9.dylib
>>>>>> 0x9372d000 - 0x93894ff3 libSystem.B.dylib ??? (???)
>>>>>> <0ddbaae699690b09239f69dea7d0fbb0> /usr/lib/libSystem.B.dylib
>>>>>> 0x9487f000 - 0x948dcffb libstdc++.6.dylib ??? (???)
>>>>>> <7d389389a99ce696726cf4c8980cc505> /usr/lib/libstdc++.6.dylib
>>>>>> 0x96fa9000 - 0x96fadfff libmathCommon.A.dylib ??? (???)
>>>>>> /usr/lib/system/libmathCommon.A.dylib
>>>>>> 0xffff0000 - 0xffff1780 libSystem.B.dylib ??? (???)
>>>>>> /usr/lib/libSystem.B.dylib
>>>>>>
>>>>>> It is beyond my comprehension, but I thought it may help.
>>>>>>
>>>>>>
>>>>>> I am going to find the code contribution agreement and get that
>>>>>> submitted also. Although at this point I am not confident that the
>>>>>> unsigned long was the right thing to do.
>>>>>>
>>>>>> TIA
>>>>>>
>>>>>> Peter Keller wrote:
>>>>>>> On Mon, May 11, 2009 at 03:32:20PM -0500, Jim Summers wrote:
>>>>>>>> Hello All,
>>>>>>>>
>>>>>>>> I modified the procapi.h file so that all of the age, creation_time
>>>>>>>> and sample_time variables use an unsigned long type. That seems
>>>>>>>> to have fixed the ProcAPI errors that we were seeing.
>>>>>>>>
>>>>>>>> But now we are seeing the following in SchedLog:
>>>>>>>> 5/11 14:53:38 (fd:7) (pid:57011) In
>>>>>>>> DaemonCore::Create_Process(/usr/local/condor/sbin/condor_procd,...)
>>>>>>>> 5/11 14:53:38 (fd:7) (pid:57011) PRIV_CONDOR --> PRIV_ROOT at
>>>>>>>> daemon_core.cpp:6852
>>>>>>>> 5/11 14:53:38 (fd:7) (pid:57011) PRIV_ROOT --> PRIV_CONDOR at
>>>>>>>> daemon_core.cpp:6885
>>>>>>>> 5/11 14:53:38 (fd:11) (pid:57011) Create Process: fork() failed:
>>>>>>>> Resource temporarily unavailable (35)
>>>>>>>> 5/11 14:53:38 (fd:7) (pid:57011) start_procd: unable to execute
>>>>>>>> the procd
>>>>>>>> 5/11 14:53:38 (fd:5) (pid:57011) Close_Pipe(pipe_end=65536)
>>>>>>>> succeeded
>>>>>>>> 5/11 14:53:38 (fd:5) (pid:57011) Close_Pipe(pipe_end=65537)
>>>>>>>> succeeded
>>>>>>>> 5/11 14:53:38 (fd:5) (pid:57011) ERROR "unable to start the ProcD"
>>>>>>>> at line 620 in file proc_family_proxy.cpp
>>>>>>>>
>>>>>>>> I am not sure what to do at this point?
>>>>>>>>
>>>>>>>> Ideas / Suggestions?
>>>>>>> Do you have your process limit set really low for your uid?
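The "fork() failed: Resource temporarily unavailable (35)" line quoted above is EAGAIN, which fork() returns when the caller has hit a process limit. One quick way to see the per-user limit is to ask for RLIMIT_NPROC from the account that runs the daemons. A minimal sketch using Python's standard `resource` module follows; the helper name `nproc_limits` is made up here, and on OS X the system-wide sysctl limits may also apply, which this does not check.

```python
import resource


def nproc_limits():
    """Return the (soft, hard) per-user process limits for the current
    user. None means the limit is reported as unlimited."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)

    def normalize(value):
        return None if value == resource.RLIM_INFINITY else value

    return normalize(soft), normalize(hard)


if __name__ == "__main__":
    soft, hard = nproc_limits()
    print("soft limit:", "unlimited" if soft is None else soft)
    print("hard limit:", "unlimited" if hard is None else hard)
```

Run it as the same user the schedd forks under; a soft limit close to the number of processes that user already owns would be consistent with the fork() failure.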
>>>>>>>
>>>>>>> As for the code changes, you could attach a patch to your message
>>>>>>> and we can see if we can apply it. I'll have to scrutinize the patch
>>>>>>> closely because even though one might think the age of a process
>>>>>>> can't
>>>>>>> be negative, due to kernel issues a negative age actually could be
>>>>>>> calculated, so I'd need to do some inspection.
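To illustrate the concern about negative ages: if a process's creation time comes out slightly later than the sample time, a signed subtraction gives a small negative number, while an unsigned one wraps to a huge positive value that looks like a valid age. The sketch below emulates that in Python. It assumes the age is computed as sample_time minus creation_time and that unsigned long is 32 bits, as on the 32-bit x86 build in the crash report; neither detail is confirmed from procapi.h, and the function names are illustrative only.

```python
# Assumption: unsigned long is 32 bits wide (32-bit x86 build).
MASK32 = 0xFFFFFFFF


def age_signed(sample_time, creation_time):
    """Age as a signed subtraction: a negative result stays visible."""
    return sample_time - creation_time


def age_unsigned32(sample_time, creation_time):
    """Age as 32-bit unsigned arithmetic: a negative result wraps around."""
    return (sample_time - creation_time) & MASK32


if __name__ == "__main__":
    sample = 1242335695      # timestamp taken from the SchedLog above
    creation = sample + 2    # hypothetical: creation reported 2s "later"
    print(age_signed(sample, creation))      # -2
    print(age_unsigned32(sample, creation))  # 4294967294
```

With the signed type the bad value can be detected and handled; with the unsigned type the same input turns into an age of roughly 136 years, which may be why switching types silences the ProcAPI errors without fixing their cause.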
>>>>>>>
>>>>>>> Have you signed our code contribution agreement?
>>>>>>>
>>>>>>> Thank you.
>>>>>>>
>>>>>>> -pete
>>>>>>> _______________________________________________
>>>>>>> Condor-devel mailing list
>>>>>>> Condor-devel@xxxxxxxxxxx
>>>>>>> https://lists.cs.wisc.edu/mailman/listinfo/condor-devel
>