Yeah, I suppose that would've been helpful. I was able to un-wedge things by holding a slew of the jobs on that schedd. If I then release the whole mess of them, I can get this to happen again.

SchedLog excerpts are attached from when the negotiation tries to run.

-Preston
Attachment: SchedLog (binary data)
On Mar 1, 2006, at 11:55 AM, Jaime Frey wrote:
> On Feb 28, 2006, at 3:00 PM, Preston Smith wrote:
>
> > Right as our condor pools reach about 100% capacity, one of the
> > busiest schedds basically stops running jobs.. almost all run down
> > to idle..
> >
> > The negotiator logs:
> >
> > 2/28 15:44:45 Got NO_MORE_JOBS; done negotiating
> > 2/28 15:44:45 Negotiating with user@xxxxxxxxxxxxxxx at <128.211.128.11:59684>
> > 2/28 15:45:15 condor_read(): timeout reading buffer.
> > 2/28 15:45:15 Failed to get reply from schedd
> > 2/28 15:45:15 Error: Ignoring schedd for this cycle
> >
> > condor_q on that schedd shows:
> >
> > 3342 jobs; 3330 idle, 10 running, 2 held
> >
> > ShadowLog on 128.211.128.11 shows:
> >
> > 2/28 15:48:08 (21939.0) (32200): condor_read(): timeout reading buffer.
> > 2/28 15:48:08 (21939.0) (32200): AUTHENTICATE: handshake failed!
> > 2/28 15:48:08 (21939.0) (32200): Authentication Error AUTHENTICATE:1002:Failure performing handshake
> >
> > Any suggestions on troubleshooting these timeouts? We're running
> > 6.6.10..
>
> The most useful information would be the schedd log of 128.211.128.11
> at the time of the timeout.
>
> +--------------------------------+-----------------------------------+
> | Jaime Frey                     | I used to be a heavy gambler.     |
> | jfrey@xxxxxxxxxxx              | But now I just make mental bets.  |
> | http://www.cs.wisc.edu/~jfrey/ | That's how I lost my mind.        |
> +--------------------------------+-----------------------------------+
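For anyone who wants to pull just the part of a schedd log around one of these timeouts, here is a rough sketch in Python. It assumes every log entry starts with a "M/D HH:MM:SS" stamp like the excerpts quoted above, that lines without a stamp belong to the entry before them, and that the year has to be supplied because the log does not record it. The function names and the "SchedLog" path in the usage note are made up for illustration; none of this is part of Condor.

```python
from datetime import datetime, timedelta


def parse_stamp(line, year=2006):
    """Return the timestamp at the start of a log line, or None if absent.

    Assumes the "M/D HH:MM:SS" prefix seen in the quoted logs; the year is
    an assumption supplied by the caller.
    """
    parts = line.split(None, 2)
    if len(parts) < 2:
        return None
    try:
        return datetime.strptime(f"{year}/{parts[0]} {parts[1]}",
                                 "%Y/%m/%d %H:%M:%S")
    except ValueError:
        return None


def lines_near(lines, when, margin_seconds=60):
    """Keep log lines stamped within margin_seconds of `when`.

    Lines without a stamp are treated as continuations of the previous
    stamped line and are kept or dropped along with it.
    """
    margin = timedelta(seconds=margin_seconds)
    keep = []
    in_window = False
    for line in lines:
        stamp = parse_stamp(line)
        if stamp is not None:
            in_window = abs(stamp - when) <= margin
        if in_window:
            keep.append(line)
    return keep
```

To use it, read the log with something like `open("SchedLog").read().splitlines()` and pass `datetime(2006, 2, 28, 15, 45, 15)` as `when`, which is the moment the negotiator reported "Failed to get reply from schedd".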
--
Preston Smith <psmith@xxxxxxxxxx>
Systems Research Engineer
Rosen Center for Advanced Computing, Purdue University