HTCondor Project List Archives

Re: [Condor-devel] ProcAPI Messages

Date: Sat, 16 May 2009 09:31:14 -0500
From: Jim Summers <jsummers@xxxxxxxxxxxxxxxxx>
Subject: Re: [Condor-devel] ProcAPI Messages



Matthew Farrellee wrote:

All I can say is those are mighty large pids.

I am sure that number grew rapidly earlier in the week when I had the schedd_restart set to true. It would just continually restart the schedd, and procd was firing as well.

I bounced the machine and the pids are back to smaller numbers. But schedd still dies and procd eventually can't start. Same errors as below.

I also tried setting D_ALL for procd but keep getting errors. Is there any way I can turn up the debug for procd?

TIA

Best,


matt

Jim Summers wrote:

Matthew Farrellee wrote:

Have you checked the Procd's log? It is often configured with
PROCD_LOG = $(LOG)/ProcLog

You can also turn the debug level up in the procd with
PROCD_DEBUG = D_FULLDEBUG

Made those changes and here is what I am seeing:
==
bash-3.2# cat ProcLog.SCHEDD
***********************************
* condor_procd STARTING UP
***********************************
taking a snapshot...
no methods have determined process 93860 to be in a monitored family
ERROR: master has exited
***********************************
* condor_procd STARTING UP
***********************************
taking a snapshot...
no methods have determined process 93872 to be in a monitored family
ERROR: master has exited
***********************************
* condor_procd STARTING UP
***********************************
taking a snapshot...
no methods have determined process 93880 to be in a monitored family
ERROR: master has exited
***********************************
* condor_procd STARTING UP
***********************************
taking a snapshot...
no methods have determined process 93885 to be in a monitored family
ERROR: master has exited
***********************************
* condor_procd STARTING UP
***********************************
taking a snapshot...
no methods have determined process 93891 to be in a monitored family
ERROR: master has exited
***********************************
* condor_procd STARTING UP
***********************************
taking a snapshot...
no methods have determined process 93898 to be in a monitored family
ERROR: master has exited
***********************************
* condor_procd STARTING UP
***********************************
taking a snapshot...
no methods have determined process 93902 to be in a monitored family
ERROR: master has exited
***********************************
* condor_procd STARTING UP
***********************************
taking a snapshot...
no methods have determined process 93908 to be in a monitored family
ERROR: master has exited
==
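
The log above repeats the same three-step cycle (startup banner, one unmatched pid, "master has exited"). A minimal sketch for tallying that cycle in a longer ProcLog follows. The function name `summarize_proclog` is hypothetical, and the line patterns are taken only from the excerpt above, so other procd versions may word these messages differently.

```python
import re
from collections import Counter

# Pattern copied from the log excerpt in this thread; not an official format.
UNMATCHED = re.compile(
    r"no methods have determined process (\d+) to be in a monitored family"
)

def summarize_proclog(text):
    """Count procd startups and ERROR lines, and collect unmatched pids."""
    counts = Counter()
    pids = []
    for raw in text.splitlines():
        line = raw.strip()
        if line == "* condor_procd STARTING UP":
            counts["startups"] += 1
        elif line.startswith("ERROR:"):
            counts[line] += 1
        else:
            match = UNMATCHED.match(line)
            if match:
                pids.append(int(match.group(1)))
    return counts, pids

sample = """\
***********************************
* condor_procd STARTING UP
***********************************
taking a snapshot...
no methods have determined process 93860 to be in a monitored family
ERROR: master has exited
***********************************
* condor_procd STARTING UP
***********************************
taking a snapshot...
no methods have determined process 93872 to be in a monitored family
ERROR: master has exited
"""

counts, pids = summarize_proclog(sample)
print(counts["startups"], counts["ERROR: master has exited"], pids)
```

If every startup is paired with exactly one "master has exited" error, as here, the procd is failing at the same point on each restart rather than degrading over time.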

Is there another parameter I have overlooked?

Thanks Again,

Best,


matt

Jim Summers wrote:

Hello All,

I set RESTART_PROCD_ON_ERROR to false and then started up the daemons
again. It confirmed my suspicions that the schedd was still not
happy.  Actually it seems that it is something called condor_procd
that seems to be the one that is having trouble.  The SchedLog has
the following:
===
5/14 08:14:55 (fd:12) (pid:93500) DaemonCore--> id = 15, when =
1242335695, period = 0, handler_descrip=<dc_touch_lock_files>
5/14 08:14:55 (fd:12) (pid:93500) DaemonCore--> id = 6, when =
1242335702, period = 28807, handler_descrip=<DaemonCore::refreshDNS()>
5/14 08:14:55 (fd:12) (pid:93500) DaemonCore--> id = 11, when =
1242393295, period = 86400, handler_descrip=<CleanJobQueue>
5/14 08:14:55 (fd:12) (pid:93500)
5/14 08:14:55 (fd:12) (pid:93500) leaving DaemonCore NewTimer, id=7
5/14 08:14:55 (fd:12) (pid:93500) DaemonCore Timeout() Complete,
returning 5
5/14 08:14:55 (fd:12) (pid:93500) selector 0xbffff554 resetting
5/14 08:14:55 (fd:12) (pid:93500) selector 0xbffff554 adding fd 8 ()
5/14 08:14:55 (fd:12) (pid:93500) selector 0xbffff554 adding fd 9 ()
5/14 08:14:55 (fd:12) (pid:93500) selector 0xbffff554 adding fd 7 ()
5/14 08:14:55 (fd:12) (pid:93500) selector 0xbffff554 adding fd 10 ()
5/14 08:14:55 (fd:12) (pid:93500) selector 0xbffff554 adding fd 3 ()
5/14 08:14:55 (fd:12) (pid:93500) Calling Handler
<HandleDC_SERVICEWAITPIDS()> for Signal 60009 <DC_SERVICEWAITPIDS>
5/14 08:14:55 (fd:12) (pid:93500) DaemonCore: pid 93501 exited with
status 10, invoking reaper 1 <condor_procd reaper>
5/14 08:14:55 (fd:12) (pid:93500) procd (pid = 93501) exited
unexpectedly with status 10
5/14 08:14:55 (fd:12) (pid:93500) Config 'RESTART_PROCD_ON_ERROR': no
prefix ==> 'False'
5/14 08:14:55 (fd:12) (pid:93500) ERROR "ProcD has failed" at line
599 in file proc_family_proxy.cpp
5/14 08:15:06 (fd:3) (pid:93520) LOGS_USE_TIMESTAMP is undefined,
using default value of False
===

I am not sure what the issue at hand could be?

Ideas or suggestions greatly appreciated.

Thanks



Jim Summers wrote:

Hello All,

Still working on getting 7.2.2 to fly on OS X Leopard.  After
hacking the age, creation_time and sample_time variables to use an
unsigned long type, things seemed to get better, but still no joy.
Then using Peter's suggestion of upping the maxprocs helped
things a lot, but alas, still no joy.

Now the condor_master, schedd, startd are all running but no job
submissions will run.  It seems that schedd is generating tons of
log and consuming anywhere from 18-25% of the cpu as reported by top.

I found an OS X crash report message for condor_procd in
/var/log/system.log. Here are the contents of the crash report:

Process:         condor_procd [1324]
Path:            /usr/local/condor/sbin/condor_procd
Identifier:      condor_procd
Version:         ??? (???)
Code Type:       X86 (Native)
Parent Process:  condor_schedd [98692]

Date/Time:       2009-05-13 16:20:23.369 -0500
OS Version:      Mac OS X 10.5.6 (9G3553)
Report Version:  6

Exception Type:  EXC_BAD_ACCESS (SIGBUS)
Exception Codes: KERN_PROTECTION_FAILURE at 0x0000000000000219
Crashed Thread:  0

Thread 0 Crashed:
0   condor_procd                        0x0000633b
ProcFamilyMonitor::snapshot() + 63 (proc_family_monitor.cpp:509)
1   condor_procd                        0x000072ce
ProcFamilyMonitor::ProcFamilyMonitor(int, long, int) + 1270
2   condor_procd                        0x00002495 main + 451
(procd_main.cpp:337)
3   condor_procd                        0x00001e46 start + 54

Thread 0 crashed with X86 Thread State (32-bit):
   eax: 0x000001f5  ebx: 0x00006308  ecx: 0xbffff8cc  edx: 0x00800224
   edi: 0x00000000  esi: 0x00100350  ebp: 0xbffff818  esp: 0xbffff7d0
    ss: 0x0000001f  efl: 0x00010206  eip: 0x0000633b   cs: 0x00000017
    ds: 0x0000001f   es: 0x0000001f   fs: 0x00000000   gs: 0x00000037
   cr2: 0x00000219

Binary Images:
     0x1000 -    0x12ff0 +condor_procd ??? (???)
<b3c764c9b34f126e5833933112905bc2> /usr/local/condor/sbin/condor_procd
0x8fe00000 - 0x8fe2db43  dyld 97.1 (???)
<9736a715ebabb914fef61680520dc1e0> /usr/lib/dyld
0x9234b000 - 0x92352fe9  libgcc_s.1.dylib ??? (???)
<e280ddf3f5fb3049e674edcb109f389a> /usr/lib/libgcc_s.1.dylib
0x936ca000 - 0x936e8fff  libresolv.9.dylib ??? (???)
<39f6d8651f3dca7a1534fa04322e6763> /usr/lib/libresolv.9.dylib
0x9372d000 - 0x93894ff3  libSystem.B.dylib ??? (???)
<0ddbaae699690b09239f69dea7d0fbb0> /usr/lib/libSystem.B.dylib
0x9487f000 - 0x948dcffb  libstdc++.6.dylib ??? (???)
<7d389389a99ce696726cf4c8980cc505> /usr/lib/libstdc++.6.dylib
0x96fa9000 - 0x96fadfff  libmathCommon.A.dylib ??? (???)
/usr/lib/system/libmathCommon.A.dylib
0xffff0000 - 0xffff1780  libSystem.B.dylib ??? (???)
/usr/lib/libSystem.B.dylib

It is beyond my comprehension, but I thought it may help.


I am going to find the code contribution agreement and get that
submitted also.  Although at this point I am not confident that the
unsigned long was the right thing to do.

TIA

Peter Keller wrote:

On Mon, May 11, 2009 at 03:32:20PM -0500, Jim Summers wrote:

Hello All,

I modified the procapi.h file so that all of the age, creation_time
and sample_time variables use an unsigned long type.  That seems
to have fixed the ProcAPI errors that we were seeing.

But now we are seeing the following in SchedLog:
5/11 14:53:38 (fd:7) (pid:57011) In
DaemonCore::Create_Process(/usr/local/condor/sbin/condor_procd,...)
5/11 14:53:38 (fd:7) (pid:57011) PRIV_CONDOR --> PRIV_ROOT at
daemon_core.cpp:6852
5/11 14:53:38 (fd:7) (pid:57011) PRIV_ROOT --> PRIV_CONDOR at
daemon_core.cpp:6885
5/11 14:53:38 (fd:11) (pid:57011) Create Process: fork() failed:
Resource temporarily unavailable (35)
5/11 14:53:38 (fd:7) (pid:57011) start_procd: unable to execute
the procd
5/11 14:53:38 (fd:5) (pid:57011) Close_Pipe(pipe_end=65536) succeeded
5/11 14:53:38 (fd:5) (pid:57011) Close_Pipe(pipe_end=65537) succeeded
5/11 14:53:38 (fd:5) (pid:57011) ERROR "unable to start the ProcD"
at line 620 in file proc_family_proxy.cpp

I am not sure what to do at this point?

Ideas / Suggestions?

Do you have your process limit set really low for your uid?
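
For context on the question above: "Resource temporarily unavailable" is the EAGAIN error, which fork() returns when, among other causes, the per-user process limit would be exceeded (EAGAIN is numbered 35 on Mac OS X, matching the log). A minimal sketch for checking the limit that applies to the current uid follows. It assumes a Unix system where Python's `resource` module exposes `RLIMIT_NPROC`, and the helper name `describe_nproc_limit` is hypothetical.

```python
import errno
import os
import resource

def describe_nproc_limit():
    """Return the soft and hard per-user process limits as a short string."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)

    def fmt(value):
        return "unlimited" if value == resource.RLIM_INFINITY else str(value)

    return f"uid {os.getuid()}: soft={fmt(soft)} hard={fmt(hard)}"

print(describe_nproc_limit())
# Show the symbolic name and message for the error seen in the SchedLog.
print(errno.errorcode[errno.EAGAIN], os.strerror(errno.EAGAIN))
```

Run this as the same user the daemons run as, since the limit is per uid. A low soft limit relative to the number of processes already running under that uid would be consistent with the fork() failure.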

As for the code changes, you could attach a patch to your message
and we can see if we can apply it. I'll have to scrutinize the patch
closely because even though one might think the age of a process can't
be negative, due to kernel issues a negative age actually could be
calculated, so I'd need to do some inspection.
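
To illustrate the concern about negative ages, here is a minimal sketch of what happens when a negative difference is stored in an unsigned field. It is not the ProcAPI code: the function names are hypothetical, and it assumes a 32-bit long, as on the 32-bit x86 build shown in the crash report.

```python
import ctypes

def age_signed(sample_time, creation_time):
    """Age as a signed 32-bit value; a negative result stays negative."""
    return ctypes.c_int32(sample_time - creation_time).value

def age_unsigned(sample_time, creation_time):
    """Age as an unsigned 32-bit value; a negative result wraps around."""
    return ctypes.c_uint32(sample_time - creation_time).value

# Hypothetical case: the process appears to start 5 seconds
# after the sample was taken.
sample_time = 1242335695
creation_time = 1242335700

print(age_signed(sample_time, creation_time))    # -5
print(age_unsigned(sample_time, creation_time))  # 4294967291
```

With a signed type the bad value can be detected with a simple `age < 0` check. With an unsigned type the same situation shows up as a process that appears to be about 136 years old, so any sanity check would have to be written differently.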

Have you signed our code contribution agreement?

Thank you.

-pete
_______________________________________________
Condor-devel mailing list
Condor-devel@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-devel


--
Jim Summers
Computer Science - University of Oklahoma
