I see an alarming amount of time passing in your logs in the following:

06/28/12 16:49:09 (pid:18733) Sent ad to 1 collectors for pandey3@xxxxxxxxxxxxxxx
06/28/12 16:50:02 (pid:18733) Failed to start non-blocking update to unknown.
06/28/12 16:50:51 (pid:18733) Failed to start non-blocking update to unknown.
06/28/12 16:51:44 (pid:18733) Failed to start non-blocking update to unknown.
06/28/12 16:52:39 (pid:18733) Failed to start non-blocking update to unknown.
06/28/12 16:53:29 (pid:18733) Failed to start non-blocking update to unknown.
06/28/12 16:54:23 (pid:18733) Failed to start non-blocking update to unknown.
06/28/12 16:55:06 (pid:18733) Failed to start non-blocking update to unknown.
06/28/12 16:56:02 (pid:18733) Failed to start non-blocking update to unknown.

Every time this "update to unknown" message appears in your logs, it is preceded by a long gap in time. The occurrence of many of these in a row above probably caused the final collapse. I'm not sure what would cause this. Have you got something in your flocking list that isn't a valid DNS name?
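If it helps, here's a rough, untested sketch for sanity-checking that list. It assumes your flock targets live in FLOCK_TO on that host (FLOCK_COLLECTOR_HOSTS would be another knob worth checking) and that condor_config_val is on the PATH:

#!/usr/bin/env python
# Rough sketch: resolve every host named in the schedd's flock list and
# report any entry that does not resolve. Assumes condor_config_val is in
# PATH and that FLOCK_TO is where the flocking targets are defined.
import socket
import subprocess

def flock_hosts(knob="FLOCK_TO"):
    out = subprocess.check_output(["condor_config_val", knob])
    # Entries may be separated by commas and/or whitespace; drop any :port.
    hosts = out.decode().replace(",", " ").split()
    return [h.split(":")[0] for h in hosts]

for host in flock_hosts():
    try:
        socket.getaddrinfo(host, None)
        print("OK      %s" % host)
    except socket.gaierror as err:
        print("NO DNS  %s (%s)" % (host, err))

Anything that prints NO DNS would be worth fixing or removing from the list.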
--Dan

On 6/28/12 4:29 PM, Ben Cotton wrote:

We've been seeing a problem on our XSEDE Condor frontend (running version 7.6.7) where, after a few hours, the schedd seems to fall apart. Job submissions hang, the number of running shadow processes drops to zero or near-zero, and condor_q returns:

-- Failed to fetch ads from: <128.211.128.45:51941> : tg-condor.rcac.purdue.edu
SECMAN:2007:Failed to end classad message.

The only way to get jobs running again is to restart Condor, but since it only takes a few hours for the schedd to keel over, we're not seeing much job completion. Other identically configured schedds do not exhibit this behavior. There are no obvious indications of other system problems, and I'm at a loss for where to look next. Logs from the host can be found at:

http://boilergrid.rcac.purdue.edu/tickets/tg-condor_schedd_deaths/

From the count of running condor_shadow processes, it appears the problem begins manifesting itself around 16:56. Any guidance would be greatly appreciated.

Thanks,
BC
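(For anyone wanting to reproduce the shadow-count observation above, a minimal polling sketch, not necessarily how these counts were gathered, could be as simple as:

#!/usr/bin/env python
# Rough sketch: log the number of running condor_shadow processes once a
# minute. The interval and output format here are arbitrary choices.
import subprocess
import time

INTERVAL = 60  # seconds between samples

while True:
    ps = subprocess.check_output(["ps", "-e", "-o", "comm="]).decode()
    count = sum(1 for name in ps.splitlines() if name.strip() == "condor_shadow")
    print("%s  running shadows: %d" % (time.strftime("%Y-%m-%d %H:%M:%S"), count))
    time.sleep(INTERVAL)

A sustained drop in that count lines up with the 16:56 onset noted above.)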