My condor_master daemon on the Central Manager machine in my cluster was continuously
taking 25% of the CPU, so I turned logging up to D_ALL to see what was
going on. When I did that, I got the following output over and over (it
filled 40 MB of logs in about 20 seconds.)

10/10 12:43:20 (fd:15) (pid:2345) In DaemonCore Timeout()
10/10 12:43:20 (fd:15) (pid:2345)
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> Timers
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> ~~~~~~
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> id = 7, when = 1160502226, period = 300, handler_descrip=<Daemons::UpdateCollector()>
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> id = 0, when = 1160502234, period = 300, handler_descrip=<check_session_cache>
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> id = 9, when = 1160502239, period = 300, handler_descrip=<Daemons::CheckForNewExecutable()>
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> id = 4, when = 1160502249, period = 60, handler_descrip=<ProcFamily::takesnapshot>
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> id = 5, when = 1160502249, period = 60, handler_descrip=<ProcFamily::takesnapshot>
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> id = 6, when = 1160502249, period = 60, handler_descrip=<ProcFamily::takesnapshot>
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> id = 2, when = 1160502294, period = 240, handler_descrip=<self_monitor>
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> id = 3, when = 1160502534, period = 0, handler_descrip=<DaemonCore::ReInit()>
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> id = 1, when = 1160503142, period = 1801, handler_descrip=<handle_cookie_refresh>
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> id = 10, when = 1160505527, period = 0, handler_descrip=<DaemonCore::HungChildTimeout>
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> id = 11, when = 1160505527, period = 0, handler_descrip=<DaemonCore::HungChildTimeout>
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> id = 12, when = 1160505536, period = 0, handler_descrip=<DaemonCore::HungChildTimeout>
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore--> id = 8, when = 1160534934, period = 86400, handler_descrip=<run_preen()>
10/10 12:43:20 (fd:15) (pid:2345)
10/10 12:43:20 (fd:15) (pid:2345) DaemonCore Timeout() Complete, returning 26

The
return value seems to slowly go up, but everything else stays the same. A
Google search on "HungChildTimeout" or "DaemonCore Timers" didn't
give me anything, so I'm hoping someone on this list can offer some insight… Thanks
a lot
-Colin