[Condor-users] condor-mpi observation...
I have been using Condor (6.6) to run MPI jobs for a couple of weeks
now, and I've noticed that it only works well when a given user has
a single job submitted. The problem is that gathering
machines in the pool to actually run the mpi job is done by the
DedicatedScheduler user. Since DedicatedScheduler and the actual user
each have their own user priorities, DedicatedScheduler can kick the
real user's MPI jobs off while trying to secure machines for some other
MPI job for the *same* user. This means that when I submit two jobs,
and after my user priority has risen a bit, my two jobs start
competing for resources. This has resulted in a stalemate several
times, with DedicatedScheduler hanging on to resources seemingly
indefinitely and neither job actually executing. This is a huge waste
of resources.
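
To illustrate the kind of stalemate described above, here is a toy
Python model. It is not Condor code and says nothing about how the
DedicatedScheduler is actually implemented: the pool size, job sizes,
and function names are all made up, and the round-robin claiming is
only an assumption about how partial allocations could end up
blocking each other.

```python
# Toy model only: hypothetical names and numbers, not Condor internals.

def greedy_claim(pool_size, demands):
    """Give free machines out one at a time, round-robin, to every job
    that still needs more. Jobs keep what they claim and never release it."""
    held = [0] * len(demands)
    free = pool_size
    while free > 0:
        progressed = False
        for i, need in enumerate(demands):
            if free > 0 and held[i] < need:
                held[i] += 1
                free -= 1
                progressed = True
        if not progressed:
            break  # every job is satisfied; leftover machines stay free
    running = [held[i] == demands[i] for i in range(len(demands))]
    return held, running


def all_or_nothing(pool_size, demands):
    """Give a job machines only if its whole request fits in what is free."""
    held = [0] * len(demands)
    free = pool_size
    for i, need in enumerate(demands):
        if need <= free:
            held[i] = need
            free -= need
    running = [held[i] == demands[i] for i in range(len(demands))]
    return held, running


if __name__ == "__main__":
    # Assumed scenario: 8 machines, two MPI jobs that each need 6.
    print(greedy_claim(8, [6, 6]))
    print(all_or_nothing(8, [6, 6]))
```

In this made-up scenario the greedy version leaves each job holding
four machines with neither able to start, while the all-or-nothing
version lets the first job run and the second wait its turn.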
I'm wondering if this issue has been addressed, or if it will be
addressed in future versions. Using the "real" user during the
allocation cycle would seem to make more sense here and perhaps
partially resolve this issue. I've "solved" this by not running more
than one MPI job at a time, but this is far from optimal. Are
allocations handled differently in 6.7?
rok