> Ian,
>
> Can you give an estimate of the number of shadows that are
> running at any given time on this box?
The user currently has 52 condor_shadow processes running. With all the
startd nodes in our system, users could potentially run 200+ simultaneous
jobs.
> Depending on the system's memory configuration, processor
> speed, and non-condor load, the schedd can fall behind when
> spawning shadows, causing this type of thing to happen.
> Depending on the nature of your jobs, there are ways to push
> the schedd harder, but that is a whole separate topic.
I think I need to know this. It's not unreasonable for our users to need
to run 100+ jobs simultaneously. We submit vanilla universe jobs that use
Condor's file transfer mechanism to transfer a small (~3.5 KB) script that
kicks the job off. The remainder of the file transfer is done over FTP by
our job and does not go through Condor. Upon job completion, Condor
returns wrapper.log and wrapper.err to the user; the log file is typically
around 33 KB.
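For reference, a minimal submit file for this kind of job would look
something like the sketch below (the executable name is just a placeholder
for our actual kickoff script):

  # rough sketch only -- executable name is a placeholder
  universe                = vanilla
  executable              = wrapper_kickoff.bat
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT
  transfer_output_files   = wrapper.log, wrapper.err
  queue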
> There is also a registry setting that can affect the number
> of shadows your schedd can spawn based on Desktop Heap. The
> default value for Desktop Heap on windows is 512k. Brooklin
> Gore at Micron has found that 60 shadows consume about 256k
> of Desktop Heap. Therefore if you're trying to run much over
> 100 shadows, they'll instantly crash before they can do
> anything. The solution is to modify the following registry key:
>
> HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Session
> Manager\SubSystems\Windows
>
> The third SharedSection value controls the desktop heap. You
> can try, say 1024, and see if things improve. You can read
> more about this problem here:
>
> http://support.microsoft.com/kb/q184802/
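> For illustration, the Windows value in that key typically contains a
> fragment that looks like:
>
>     SharedSection=1024,3072,512
>
> The third number (512 by default) is the desktop heap, in KB, available
> to non-interactive processes such as the shadows; that is the value you
> would raise to 1024.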
The only problem is that I'm seeing these condor_wrapper errors on
machines running Linux as well. The only thing they have in common is that
they are running more than about 30 shadow processes. Is this known to
affect Linux as well?
What about the error message about creating the user on the starter side?
It appears that this error occurs each time ttc-eahmed3 tries to start a
job. The starter then lets ttc-eahmed3 hold on to the claimed machine, and
about 12 minutes later the starter tries again to create the user. So far
every second attempt I've observed has succeeded, but could it get into an
infinite loop trying to do this? That may be what happened to my machines
over the weekend (I didn't think to look at the starter.vm# logs, though).
- Ian