Re: [Condor-users] Trivial jobs occasionally running for hours
- Date: Wed, 13 Oct 2010 17:14:31 +0100
- From: Paul Haldane <paul.haldane@xxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] Trivial jobs occasionally running for hours
[I wouldn't normally top-post, but it's the best I can do without losing the replies]
Thanks - Mark Calleja pointed me at the error that I was consistently overlooking. I'd also had one of those "doh" moments shortly after sending the original message and had spotted the (now very obvious) error in the logs - "Sock::bindWithin - failed to bind any port within (9600 ~ 9700)".
I've changed the range to 9600-19700 and restarted. I no longer get the errors, but the first attempt only ran one job at a time and the second isn't running any (though they're all nicely queued). I suspect this is a completely unrelated problem.
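For reference, the change amounts to something like the following in condor_config on the manager/submit node (a sketch only - I'm assuming the range is being set with the LOWPORT/HIGHPORT knobs; the values are just the ones mentioned above), followed by restarting the daemons so the new range is picked up:

    # condor_config on the central manager / submit node (sketch)
    # widen the range of ports the daemons and shadows may bind to
    LOWPORT  = 9600
    HIGHPORT = 19700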
I have wondered whether, as you suggest, using trivial jobs for testing is unfair on the system. I should probably set up a more substantial test job (and be more patient).
Thanks
Paul
From: Ian Chesal
Sent: 13 October 2010 16:53
To: Condor-Users Mail List
Subject: Re: [Condor-users] Trivial jobs occasionally running for hours
Condor isn't particularly adept at handling very short running jobs. So if your jobs are only running an echo and then ending, you could end up with a lot of machines Claimed+Idle, as the schedd can't spawn shadows fast enough to keep up with the job startup rate.
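If you want to see whether that's what's happening, a query along these lines from the central manager (illustrative, not the only way to do it) will list the slots stuck in that state:

    # list slots that are Claimed but not actually running anything
    condor_status -constraint 'State == "Claimed" && Activity == "Idle"'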
And then there's Windows. I don't know where to start with Windows. There are certainly a lot of issues that come up when you're using Windows with network storage and you want to use that storage repeatedly and quickly. And ports. Windows doesn't seem to recycle the pool of available ports very quickly, so shadows coming and going in rapid succession will exhaust that pool and you'll end up with no-comm errors between the shadows and startds.
That's my experience at least.
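One rough way to check the port side of it is to count sockets stuck in TIME_WAIT on a worker while a batch of jobs is churning (just an illustrative check, nothing Condor-specific):

    REM on a Windows worker: count sockets sitting in TIME_WAIT
    netstat -an | find /c "TIME_WAIT"

If that number is large and still climbing while jobs churn, slow port recycling is a likely suspect.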
I suggest adding a sleep to your simple jobs so that they don't run for less than 2-3 minutes and see if that improves the situation. You could also throttle the job startup rate at the schedd. Get yourself to a stable state and then start to decrease the sleep time or increase the job startup rate until you're seeing the failures, then back it off a bit.
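For example, a padded version of the test job plus a schedd throttle might look something like this (a sketch to start from - the file name is made up, the sleep is the usual ping idiom for Windows batch, and I'm assuming the JOB_START_COUNT / JOB_START_DELAY knobs for the throttle):

    REM hello.bat on the Windows workers -- pad the job out to roughly 3 minutes
    echo hello
    ping -n 180 127.0.0.1 > NUL

    # condor_config on the schedd -- start at most one job every 2 seconds
    JOB_START_COUNT = 1
    JOB_START_DELAY = 2

Once that's stable, shrink the sleep or raise the start rate until the failures reappear, then back off.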
- Ian
On Wed, Oct 13, 2010 at 11:25 AM, Paul Haldane <paul.haldane@xxxxxxxxxxxxxxx> wrote:
We're in the process of refreshing our Condor provision and have a cluster with a 7.4.3 Linux central manager/submit node and 7.4.2 Windows 7 worker nodes.
Occasionally we see trivial jobs being accepted by nodes and then (apparently) running for hours. At the moment this is a completely trivial job (batch script doing "echo hello") - I'm queuing 100 instances of the job.
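For completeness, the submit description is nothing exotic - roughly along these lines (the file names here are illustrative rather than the real ones):

    # submit file (sketch; file names are illustrative)
    universe   = vanilla
    executable = hello.bat
    output     = hello.out
    error      = hello.err
    log        = hello.log
    queue 100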
The workers are staying online all the time (they get pinged every five minutes as part of our data gathering).
I sometimes see the following in STARTD log but not associated with the slot or job which is hanging around ...
10/13 12:45:54 slot3: State change: claim-activation protocol successful
10/13 12:45:54 slot3: Changing activity: Idle -> Busy
10/13 12:45:54 condor_read() failed: recv() returned -1, errno = 10054 , reading 5 bytes from <127.0.0.1:51442>.
10/13 12:45:54 IO: Failed to read packet header
10/13 12:45:54 Starter pid 3752 exited with status 4
10/13 12:45:54 slot3: State change: starter exited
10/13 12:45:54 slot3: Changing activity: Busy -> Idle
Picking one of the four jobs that are hanging around I see the following in ShadowLog on the central manager/submit node ...
10/13 12:19:11 Initializing a VANILLA shadow for job 1772.21
10/13 12:19:11 (1772.21) (13082): Request to run on slot1@xxxxxxxxxxxxxxxxxxxxxxx <10.15.0.62:49389> was ACCEPTED
10/13 12:19:11 (1772.21) (13082): Sock::bindWithin - failed to bind any port within (9600 ~ 9700)
10/13 12:19:11 (1772.21) (13082): ERROR: SECMAN:2003:TCP auth connection to <10.8.232.5:9605> failed.
10/13 12:19:11 (1772.21) (13082): Failed to send alive to <10.8.232.5:9605>, will try again...
10/13 12:19:18 (1772.21) (13082): Sock::bindWithin - failed to bind any port within (9600 ~ 9700)
10/13 12:19:18 (1772.21) (13082): Failed to connect to transfer queue manager for job 1772.21 (/home/ucs/200/nph9/esw3/ex.err): CEDAR:6001:Failed to connect to <10.8.232.5:9697>.
10/13 12:19:18 (1772.21) (13082): Sending NO GoAhead for 10.15.0.62 to send /home/ucs/200/nph9/esw3/ex.err.
10/13 12:19:18 (1772.21) (13082): Failed to connect to transfer queue manager for job 1772.21 (/home/ucs/200/nph9/esw3/ex.err): CEDAR:6001:Failed to connect to <10.8.232.5:9697>.
10/13 12:34:18 (1772.21) (13082): Sock::bindWithin - failed to bind any port within (9600 ~ 9700)
10/13 12:34:18 (1772.21) (13082): Can't connect to queue manager: CEDAR:6001:Failed to connect to <10.8.232.5:9697>
Whilst the jobs were being farmed out I'd see occasional failures of condor_status with the same "CEDAR:6001" error.
My instinct is that something on the manager/submit node is running out of resources (file descriptors, ports, something else I've not thought of) and that Condor's view of things then gets out of sync with reality.
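In case it helps, these are the sorts of checks I've been running on the manager so far (illustrative commands, not an exhaustive list):

    # on the Linux central manager / submit node
    ulimit -n                                    # per-process open-file limit
    cat /proc/sys/fs/file-max                    # system-wide file handle limit
    cat /proc/sys/net/ipv4/ip_local_port_range   # ephemeral port range
    ss -s                                        # socket usage summary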
(a) Is there anything in particular I should be looking at in terms of system limits on the manager? I've looked at http://www.cs.wisc.edu/condor/condorg/linux_scalability.html and don't think I'm hitting any of those (but happy to be told that is where I should be looking).
(b) Any other logs/tools I should be looking at to help diagnose this?
Thanks
Paul
--
Paul Haldane
Information Systems and Services
Newcastle University