> > Ian,
> >
> > Can you give an estimate of the number of shadows that are running at
> > any given time on this box?
>
> The user currently has 52 condor_shadows running. We have
> the potential for users to run 200+ simultaneous jobs with
> all the startd nodes in our system.
>
> > Depending on the system's memory configuration, processor speed, and
> > non-condor load, the schedd can fall behind when spawning shadows,
> > causing this type of thing to happen.
> > Depending on the nature of your jobs, there are ways to push the
> > schedd harder, but that is a whole separate topic.
>
> I think I need to know this. It's not unreasonable for our
> users to need to run 100+ jobs simultaneously. We submit
> vanilla jobs that use condor's file transfer mechanism to
> transfer a small (~3.5k bytes) script to kick the job off.
> The remainder of the file transfer is done over FTP by our job
> and does not use condor. Then upon job completion condor
> returns wrapper.log and wrapper.err to the user. The log file
> is typically around 33k bytes.
>
> > There is also a registry setting that can affect the number of shadows
> > your schedd can spawn based on Desktop Heap. The default value for
> > Desktop Heap on Windows is 512k. Brooklin Gore at Micron has found
> > that 60 shadows consume about 256k of Desktop Heap. Therefore if
> > you're trying to run much over 100 shadows, they'll instantly crash
> > before they can do anything. The solution is to modify the following
> > registry key:
> >
> > HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Session Manager\SubSystems\Windows
> >
> > The third SharedSection value controls the desktop heap. You can try,
> > say, 1024 and see if things improve. You can read more about this
> > problem here:
> >
> > http://support.microsoft.com/kb/q184802/
>
> The only problem is I'm seeing these condor_wrapper errors on
> machines running Linux as well. The only thing in common is
> the machines are running >30 or so shadow processes. Is this
> known to affect Linux as well?
>
> What about the error message creating the user on the starter
> side? It appears that each time ttc-eahmed3 tries to start a
> job this error occurs. The starter then allows ttc-eahmed3 to
> hang on to the claimed machine and about 12 minutes later the
> starter tries again to create the user. So far the second
> attempt that I've observed has always succeeded, but could it
> get into an infinite loop trying to do this? That may be
> what happened to my machines over the weekend (I didn't think
> to look at the starter.vm# logs though).
>
> - Ian
My user's schedd went to >90 jobs and, sure enough, it began to report more
jobs running than there really were (by quite a bit). So I've made the
recommended registry change to the machine and rebooted it. It's in a
good state now: condor_q -name and condor_status -sub are showing the
same queued and running count for the user. The number of jobs running
is only 48 right now. I'll keep an eye on it for the next couple of
hours.
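
For reference, here is a rough sketch of the Desktop Heap arithmetic quoted
above. It only restates the figures from this thread (about 256k of heap per
60 shadows) and assumes heap use grows linearly with the shadow count, which
may not hold on other machines. The function names and the sample registry
string are made up for illustration; check the real value on your own box.

```python
import re

# Figures quoted in the thread: roughly 60 shadows per 256k of Desktop Heap.
# This is one site's measurement, so treat it as an assumption, not a constant.
SHADOWS_MEASURED = 60
HEAP_KB_MEASURED = 256


def third_shared_section_kb(windows_value: str) -> int:
    """Pull the third SharedSection number out of the 'Windows' registry string."""
    match = re.search(r"SharedSection=(\d+),(\d+),(\d+)", windows_value)
    if match is None:
        raise ValueError("no three-part SharedSection setting found")
    return int(match.group(3))


def estimated_max_shadows(heap_kb: int) -> int:
    """Estimate how many shadows fit in heap_kb, assuming linear heap use."""
    return heap_kb * SHADOWS_MEASURED // HEAP_KB_MEASURED


if __name__ == "__main__":
    # Hypothetical registry string, shortened for illustration only.
    sample = r"%SystemRoot%\system32\csrss.exe SharedSection=1024,3072,512 Windows=On"
    heap_kb = third_shared_section_kb(sample)
    print(heap_kb, estimated_max_shadows(heap_kb))
    print(1024, estimated_max_shadows(1024))
```

By this estimate the 512k default tops out at around 120 shadows, and 1024
would allow roughly twice that, which is consistent with trouble starting
somewhere past 100.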
- Ian