Re: [HTCondor-users] Numerous, short jobs using HTCondor
- Date: Thu, 14 Jan 2016 09:45:01 +0100
- From: Mathieu Peyréga <mathieu.peyrega@xxxxxxxxx>
- Subject: Re: [HTCondor-users] Numerous, short jobs using HTCondor
Hello
I had quite a similar issue, and it seems to have been solved by adding the
following to my condor_config file:
JOB_RENICE_INCREMENT = 0
SYSAPI_GET_LOADAVG = False
WANT_VACATE_VANILLA = False
WANT_SUSPEND_VANILLA = False
START_VANILLA = True
SUSPEND_VANILLA = False
CONTINUE_VANILLA = True
PREEMPT_VANILLA = False
KILL_VANILLA = False
(see "New to ht condor and have basic questions" thrad on this mail-list)
Regards,
Mathieu
On 13/01/2016 17:23, Matthew Hinton wrote:
> Hi,
>
> We currently need to use HTCondor to run a large number (order 10k) of
> short jobs (each taking approximately 10 seconds). I believe that HTCondor
> is not really designed for this, but these jobs are an adaptation of older
> jobs, which take on the order of minutes, to a new, more finely split
> dataset, so we still need the resource management provided by HTCondor.
>
> I've had some fairly large issues getting tests of this to run in
> reasonable times, so I was wondering if there are any settings or
> configuration options I should be looking at to improve this.
>
> Current condor version: 8.5.1, all systems on Ubuntu 14.04.
> All jobs use the vanilla universe. We have a single manager, which runs
> the SCHEDD, COLLECTOR and NEGOTIATOR, plus 5 STARTD nodes.
>
> Steps to reproduce:
>
> Set up a dag containing 10,000 jobs, labelled "JOB x test.sub"
> where test.sub:
> executable = /bin/sleep
> arguments = 1
> universe = vanilla
> transfer_executable = false
> requirements = TARGET.Machine == "<machine with 48 slots>"
> queue
>
> Submit that dag.
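For concreteness, the DAG input file here is presumably nothing more than
10,000 independent JOB lines (no PARENT/CHILD edges), all pointing at the
same submit file; node names below are arbitrary:

# test.dag -- 10,000 independent nodes
JOB job00001 test.sub
JOB job00002 test.sub
# ... one JOB line per node ...
JOB job10000 test.sub

This is then submitted with condor_submit_dag.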
>
> The real processing time of these jobs should be 10,000 s (10,000 jobs x
> 1 s each) / 48 slots, roughly 208 s, which is under 3.5 minutes.
> However, this dag takes approximately 30 mins to complete, meaning that
> the overhead for this (albeit extreme) example is around 900% of real
> processing time.
>
> We currently have DAGMAN_MAX_SUBMITS_PER_INTERVAL set to 200, but this
> doesn't seem to be the issue: the jobs are in the schedd queue, they are
> just not taking the expected 1 s to run. Instead we are seeing run times
> of up to 9 seconds.
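For reference, DAGMAN_MAX_SUBMITS_PER_INTERVAL is only one of DAGMan's
throttling knobs; a sketch of the related settings, with the values shown
purely as illustrations rather than recommendations:

DAGMAN_MAX_SUBMITS_PER_INTERVAL = 200  # node jobs submitted per DAGMan cycle
DAGMAN_USER_LOG_SCAN_INTERVAL = 5      # seconds between scans of the node job logs
DAGMAN_MAX_JOBS_IDLE = 1000            # pause submission while this many node jobs are idle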
>
> We see the same issue by changing the above submit file to "queue 10000"
> and submitting that.
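That is, the non-DAG variant of the submit file is identical except for the
queue statement:

executable = /bin/sleep
arguments = 1
universe = vanilla
transfer_executable = false
requirements = TARGET.Machine == "<machine with 48 slots>"
queue 10000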
>
> Please, could someone explain what is going on here which is taking so
> long? I would certainly expect some overhead, but this seems very high
> to me. If anyone has any suggestions on what to try to reduce this, then
> it would be greatly appreciated!
>
> Thanks,
>
> --
> Matt Hinton
>
--
tel : +33 (0)6 87 30 83 59
######################################################################
##
## condor_config
##
## This is the global configuration file for condor. This is where
## you define where the local config file is. Any settings
## made here may potentially be overridden in the local configuration
## file. KEEP THAT IN MIND! To double-check that a variable is
## getting set from the configuration file that you expect, use
## condor_config_val -v <variable name>
##
## condor_config.annotated is a more detailed sample config file
##
## Unless otherwise specified, settings that are commented out show
## the defaults that are used if you don't define a value. Settings
## that are defined here MUST BE DEFINED since they have no default
## value.
##
######################################################################
## Where have you installed the bin, sbin and lib condor directories?
RELEASE_DIR = C:\condor
## Where is the local condor directory for each host? This is where the local config file(s), logs and
## spool/execute directories are located. This is the default for Linux and Unix systems.
#LOCAL_DIR = $(TILDE)
## This is the default on Windows systems
#LOCAL_DIR = $(RELEASE_DIR)
## Where is the machine-specific local config file for each host?
LOCAL_CONFIG_FILE = $(LOCAL_DIR)\condor_config.local
## If your configuration is on a shared file system, then this might be a better default
#LOCAL_CONFIG_FILE = $(RELEASE_DIR)\etc\$(HOSTNAME).local
## If the local config file is not present, is it an error? (WARNING: This is a potential security issue.)
REQUIRE_LOCAL_CONFIG_FILE = FALSE
## The normal way to do configuration with RPMs is to read all of the
## files in a given directory that don't match a regex as configuration files.
## Config files are read in lexicographic order.
LOCAL_CONFIG_DIR = $(LOCAL_DIR)\config
#LOCAL_CONFIG_DIR_EXCLUDE_REGEXP = ^((\..*)|(.*~)|(#.*)|(.*\.rpmsave)|(.*\.rpmnew))$
## Use a host-based security policy. By default CONDOR_HOST and the local machine will be allowed
use SECURITY : HOST_BASED
## To expand your condor pool beyond a single host, set ALLOW_WRITE to match all of the hosts
#ALLOW_WRITE = *.cs.wisc.edu
## FLOCK_FROM defines the machines that grant access to your pool via flocking. (i.e. these machines can join your pool).
#FLOCK_FROM =
## FLOCK_TO defines the central managers that your schedd will advertise itself to (i.e. these pools will give matches to your schedd).
#FLOCK_TO = condor.cs.wisc.edu, cm.example.edu
##--------------------------------------------------------------------
## Values set by the condor_configure script:
##--------------------------------------------------------------------
CONDOR_HOST = $(FULL_HOSTNAME)
NETWORK_INTERFACE = 192.168.1.181
COLLECTOR_NAME = ATLAS
UID_DOMAIN =
CONDOR_ADMIN =
SMTP_SERVER =
ALLOW_READ = *
ALLOW_WRITE = $(CONDOR_HOST), $(IP_ADDRESS)
ALLOW_ADMINISTRATOR = $(IP_ADDRESS)
JAVA = C:\PROGRA~2\Java\JRE18~1.0_6\bin\java.exe
use POLICY : ALWAYS_RUN_JOBS
WANT_VACATE = False
WANT_SUSPEND = False
START = True
SUSPEND = False
CONTINUE = True
PREEMPT = False
KILL = False
JOB_RENICE_INCREMENT = 0
SYSAPI_GET_LOADAVG = False
WANT_VACATE_VANILLA = False
WANT_SUSPEND_VANILLA = False
START_VANILLA = True
SUSPEND_VANILLA = False
CONTINUE_VANILLA = True
PREEMPT_VANILLA = False
KILL_VANILLA = False
NEGOTIATOR_CONSIDER_PREEMPTION = False
DAEMON_LIST = MASTER SCHEDD COLLECTOR NEGOTIATOR STARTD