Mailing List Archives
Authenticated access
|
|
|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[HTCondor-users] [PATCH] Speeding up condor_dagman submission
- Date: Tue, 07 Jul 2015 14:49:38 +0100
- From: Brian Candler <b.candler@xxxxxxxxx>
- Subject: [HTCondor-users] [PATCH] Speeding up condor_dagman submission
Following up to
https://www-auth.cs.wisc.edu/lists/htcondor-users/2014-December/msg00053.shtml
In summary: on both Linux and OSX, dagman imposes a 12-second startup
delay. This is annoying, especially for running automated test suites
and when using SUBDAG EXTERNAL. It's still an issue with 8.3.6.
I have been analysing the code further. The calculation boils down to
this in src/condor_procapi/processid.cpp:
int
ProcessId::computeWaitTime() const
{
// convert the precision range to seconds
double range_sec = ((double)precision_range) / time_units_in_sec;
// calculate the time to sleep until
// 1 range for the child
// 1 range as a buffer
// 1 range for the process that will reuse this pid
int sleepLength = (int)ceil(3.0 * range_sec);
// ensure it is atleast one second
if( sleepLength < 1 ){
sleepLength = 1;
}
return( sleepLength );
}
The attributes which I find set in the ProcessId object are
precision_range=400, time_units_in_sec=100.0.
These values in turn come from src/condor_procapi/procapi.cpp, method
createProcessId().
* TIME_UNITS_IN_SEC is obtained from the system; under Linux it uses HZ.
After casting to double it's 100.0. This works fine.
* If you don't pass in a precision range pointer[^1] then the default
value is NULL, and it so it uses the DEFAULT_PRECISION_RANGE which is 4,
multiplies it by TIME_UNITS_IN_SEC[^2] and rounds up. That makes 400
Now, the attribute "precision_range" is in jiffies; you can this in the
above code, because it divides it by time_units_in_sec to make seconds.
The return value is in whole seconds.
So if precision range is 4 we would allow 4 jiffies for the child, 4
jiffies as a buffer, 4 jiffies for the process that will re-use this
pid. That seems reasonable.
But the caller has already multiplied precision_range by
TIME_UNITS_PER_SECOND, so they are passing in whole seconds' worth of
jiffies! I can't see any reason why they should be doing that. And if
that was really what was intended, why not just pass in seconds in the
first place?
If that's the logic error, then the fix is simple:
--- src/condor_procapi/procapi.cpp.orig 2015-04-07 16:10:11.000000000
+0100
+++ src/condor_procapi/procapi.cpp 2015-07-07 14:07:20.583286037 +0100
@@ -3276,10 +3276,6 @@
precision_range = &DEFAULT_PRECISION_RANGE;
}
- // convert to the same time units as the rest of
- // the process id
- *precision_range = (int)ceil( ((double)*precision_range) *
TIME_UNITS_PER_SEC);
-
/* Initialize the Process Id
This WILL ALWAYS create memory the caller is responsible for.
*/
This reduces the sleep time to the minimum 1 second[^3], and has the
nice side-effect of also fixing another bug described in [^2] below.
I don't know when this code was introduced; it exists in
git show 37f4b37d:src/condor_procapi/procapi.cpp
which is the earliest commit I can find on this file in git history
(from Nov 2008)
However this all depends on what the semantics of "precision_range" are
supposed to be. I don't actually understand why it's necessary to do
*any* sleeping when creating a procId, let alone three lots of 4
seconds. Obviously it's trying to avoid some sort of race condition, but
I don't understand what the race actually is. After all, the
user-visible procId is allocated sequentially, not based on the time of day.
And just to give another inconsistency:
src/deployment_tools/uniq_pid_tool_main.cpp has help text which says
that the "--precision" argument is in seconds, but it doesn't multiply
it up, so "--precision 4" really means "4 jiffies". If that's what's
intended, the help text could be easily fixed.
Anyone who understands this code, would you please raise your hand!
Thanks,
Brian Candler.
[^1] I can see only one place in the code which passes in the
precision_range pointer, and this is
src/deployment_tools/uniq_pid_tool_main.cpp. It passes null unless
you've given a command line argument.
[^2] This code is *very* dubious, since if you pass null, it takes the
value stored in DEFAULT_PRECISION_RANGE and multiplies it in-place.
Therefore if you call this function more than once, each time you call
it the value will increase by a factor of 100 !
I see no particular reason to pass a pointer for this optional argument;
precision_range could just be an integer, and a value of 0 or -1 could
be used to mean "use the default".
This would be an easy change, although it involves changing the function
signature (I hope not part of a public API?) and the place where it is
called from uniq_pid_tool_main.cpp
[^3] If it were to return microseconds for usleep then the delay could
be reduced to 0.12 seconds instead of 1 second, which would be nice, but
I can live without that.