Hello,
When I submit N instances of a job, roughly N/2 of them finish in the
expected time and the other N/2 take noticeably longer. The system
has 10 nodes each with 32 slots and uses a shared filesystem
(GlusterFS). All of the executables and data files are located on the
shared filesystem; however, the problem does not seem to be an I/O or
network bottleneck.
When submitting 2 instances, the timings are as follows:
Instance 1:
real    7m13.950s
user    5m36.766s
sys     0m14.436s
Instance 2:
real    6m2.555s
user    5m35.747s
sys     0m13.170s
When submitting 22 instances, the difference in times is more drastic.
The times fall into the following two categories:
Category 1:
real    18m28.193s
user    5m39.153s
sys     0m15.111s
Category 2:
real    6m12.578s
user    5m36.433s
sys     0m12.644s
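To make the gap easier to see, here is a minimal sketch that computes the share of wall-clock time each run actually spent on the CPU, i.e. (user + sys) / real. The helper names to_seconds and cpu_fraction are made up for illustration; the input values are the timings quoted above. It only quantifies the symptom and does not identify the cause.

```python
import re


def to_seconds(stamp):
    """Convert a `time`-style value such as '18m28.193s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unexpected time format: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def cpu_fraction(real, user, system):
    """Fraction of wall-clock time spent on the CPU: (user + sys) / real."""
    return (to_seconds(user) + to_seconds(system)) / to_seconds(real)


# Timings copied from the runs above, as (real, user, sys).
runs = {
    "2 instances, Instance 1": ("7m13.950s", "5m36.766s", "0m14.436s"),
    "2 instances, Instance 2": ("6m2.555s", "5m35.747s", "0m13.170s"),
    "22 instances, Category 1": ("18m28.193s", "5m39.153s", "0m15.111s"),
    "22 instances, Category 2": ("6m12.578s", "5m36.433s", "0m12.644s"),
}

for label, times in runs.items():
    print(f"{label}: {cpu_fraction(*times):.0%} of wall-clock time on CPU")
```

For the 22-instance run this works out to roughly 32% for Category 1 versus about 94% for Category 2. The user and sys times are nearly identical in both categories, so the slow jobs are not doing more work; they spend most of their elapsed time off the CPU.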
Does anybody have insight into this issue?
Thanks,
Vishal