Why are you running a Debian 7 build of HTCondor on a
machine called ubuntu?
HTCondor should ideally be built for the Linux distribution
that you are using.
...Tim
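One way to check for this kind of mismatch is to compare the platform string that HTCondor reports with the distribution the machine actually runs. The following is a minimal sketch, assuming a bash shell. The helper name condor_platform is made up for illustration; it only parses the $CondorPlatform line that condor_version and the daemon log banners print.

```shell
#!/bin/bash
# Hypothetical helper (not part of HTCondor): read text on stdin and print
# the platform token from the first line of the form
#   $CondorPlatform: x86_64_Debian7 $
condor_platform() {
    sed -n 's/.*\$CondorPlatform: *\([^ ]*\) *\$.*/\1/p' | head -n 1
}

# Example use on a machine with HTCondor installed (left commented out so
# the snippet has no side effects when sourced):
#   condor_version | condor_platform
#   . /etc/os-release && echo "$ID $VERSION_ID"   # assumes /etc/os-release exists
```

If the first command prints a Debian platform while the second reports an Ubuntu release, the installed package was built for a different distribution than the one in use.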
On 02/15/2017 02:24 AM, Minjun Hong wrote:
Hi, I'm following the 'HTCondor Quick Start Guide'
(https://research.cs.wisc.edu/htcondor/manual/quickstart.html).
After I submitted a job, it ran for about 5 seconds and then
went from the RUN state back to the IDLE state.
Much later, its output file was written successfully.
I could not measure exactly how long it took, but I estimate
about 20 to 30 minutes.
I think something is wrong, so I am asking the condor-users
mailing list about this problem.
Below is all the information I have about the current state
of my machines.
The job file is:
#!/bin/bash
# file name: sleep.sh
TIMETOWAIT="10"
echo "sleeping for $TIMETOWAIT seconds"
/bin/sleep $TIMETOWAIT
The submit specification file is:
executable = sleep.sh
log = sleep.log
output = outfile.txt
error = errors.txt
should_transfer_files = Yes
when_to_transfer_output = ON_EXIT
queue
Its log file (sleep.log) is:
000 (012.000.000) 02/10 21:34:25 Job submitted from host: <10.150.21.171:9618?addrs=10.150.21.171-9618+[--1]-9618&noUDP&sock=42970_bd0c_3>
...
001 (012.000.000) 02/10 21:34:26 Job executing on host: <10.150.21.170:9618?addrs=10.150.21.170-9618+[--1]-9618&noUDP&sock=297370_fa77_62>
...
007 (012.000.000) 02/10 21:34:28 Shadow exception!
    Error from slot1@ubuntu: Create_Process failed to register the job with the ProcD
    0 - Run Bytes Sent By Job
    114 - Run Bytes Received By Job
...
## above message repeated ##
001 (012.000.000) 02/10 22:02:30 Job executing on host: <10.150.21.171:9618?addrs=10.150.21.171-9618+[--1]-9618&noUDP&sock=42970_bd0c_4>
...
006 (012.000.000) 02/10 22:02:38 Image size of job updated: 380
    1 - MemoryUsage of job (MB)
    380 - ResidentSetSize of job (KB)
...
005 (012.000.000) 02/10 22:02:41 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
    24 - Run Bytes Sent By Job
    114 - Run Bytes Received By Job
    24 - Total Bytes Sent By Job
    2052 - Total Bytes Received By Job
    Partitionable Resources :    Usage  Request  Allocated
       Cpus                 :                 1          1
       Disk (KB)            :        9        1   27474539
       Memory (MB)          :        1        1       4025
...
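The elapsed time that was hard to measure by hand can be read off the event log itself by comparing the timestamps of the 000 (submitted) and 005 (terminated) events. The following is a rough sketch, assuming a bash shell with awk, that each event header sits on one line as in the log above, and that both events fall on the same calendar day (the timestamps carry no year). The function name job_elapsed_seconds is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the number of seconds between the first
# "000" (job submitted) event and the last "005" (job terminated) event
# in an HTCondor job event log given as the first argument.
# Assumption: the fourth whitespace-separated field of an event header
# is an HH:MM:SS time, and both events happened on the same day.
job_elapsed_seconds() {
    awk '
        function to_seconds(hms,    p) {
            split(hms, p, ":")
            return p[1] * 3600 + p[2] * 60 + p[3]
        }
        $1 == "000" && !have_start { start = to_seconds($4); have_start = 1 }
        $1 == "005"                { stop = to_seconds($4); have_stop = 1 }
        END {
            if (have_start && have_stop) print stop - start
            else exit 1
        }
    ' "$1"
}
```

Called as `job_elapsed_seconds sleep.log` on the log above, the 21:34:25 submission and 22:02:41 termination give 1696 seconds, a little over 28 minutes, which matches the estimate of 20 to 30 minutes.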
When I checked the ShadowLog (/var/log/condor/ShadowLog),
it says:
02/07/17 15:57:54 ** condor_shadow (CONDOR_SHADOW) STARTING UP
02/07/17 15:57:54 ** /usr/sbin/condor_shadow
02/07/17 15:57:54 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
02/07/17 15:57:54 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
02/07/17 15:57:54 ** $CondorVersion: 8.6.0 Jan 26 2017 BuildID: 395190 $
02/07/17 15:57:54 ** $CondorPlatform: x86_64_Debian7 $
02/07/17 15:57:54 ** PID = 209324
02/07/17 15:57:54 ** Log last touched 2/7 15:57:52
02/07/17 15:57:54 ******************************************************
02/07/17 15:57:54 Using config source: /etc/condor/condor_config
02/07/17 15:57:54 Using local config sources:
02/07/17 15:57:54 /etc/condor/condor_config.local
02/07/17 15:57:54 config Macros = 67, Sorted = 67, StringBytes = 1769, TablesBytes = 1112
02/07/17 15:57:54 CLASSAD_CACHING is OFF
02/07/17 15:57:54 Daemon Log is logging: D_ALWAYS D_ERROR
02/07/17 15:57:54 SharedPortEndpoint: waiting for connections to named socket 209324_bd59
02/07/17 15:57:54 DaemonCore: command socket at <10.150.21.170:9618?addrs=10.150.21.170-9618+[--1]-9618&noUDP&sock=209324_bd59>
02/07/17 15:57:54 DaemonCore: private command socket at <10.150.21.170:9618?addrs=10.150.21.170-9618+[--1]-9618&noUDP&sock=209324_bd59>
02/07/17 15:57:54 ERROR "Assertion ERROR on (job_ad_file)" at line 165 in file /slots/01/dir_17483/sources/src/condor_shadow.V6.1/shadow_v61_main.cpp
Additionally, here is my HTCondor configuration.
1. condor_config on the central manager machine:
LOCAL_DIR = /var
## Where is the machine-specific local config file for each host?
LOCAL_CONFIG_FILE = /etc/condor/condor_config.local
## If your configuration is on a shared file system, then this might be a better default
#LOCAL_CONFIG_FILE = $(RELEASE_DIR)/etc/$(HOSTNAME).local
## If the local config file is not present, is it an error? (WARNING: This is a potential security issue.)
REQUIRE_LOCAL_CONFIG_FILE = false
STARTER_ALLOW_RUNAS_OWNER = TRUE
## The normal way to do configuration with RPMs is to read all of the
## files in a given directory that don't match a regex as configuration files.
## Config files are read in lexicographic order.
LOCAL_CONFIG_DIR = /etc/condor/config.d
#LOCAL_CONFIG_DIR_EXCLUDE_REGEXP = ^((\..*)|(.*~)|(#.*)|(.*\.rpmsave)|(.*\.rpmnew))$
## Use a host-based security policy. By default CONDOR_HOST and the local machine will be allowed
use SECURITY : HOST_BASED
## To expand your condor pool beyond a single host, set ALLOW_WRITE to match all of the hosts
ALLOW_WRITE = nickeys-*.xxxxx.ac.kr
ALLOW_READ = nickeys-*.xxxxx.ac.kr
## FLOCK_FROM defines the machines that grant access to your pool via flocking. (i.e. these machines can join your pool).
FLOCK_FROM = nickeys-fs.xxxxx.ac.kr, nickeys-1.xxxxx.ac.kr, nickeys-2.xxxxx.ac.kr, nickeys-3.xxxxx.ac.kr, nickeys-4.xxxxx.ac.kr, nickeys-5.xxxxx.ac.kr, nickeys-6.xxxxx.ac.kr, nickeys-7.xxxxx.ac.kr, nickeys-8.xxxxx.ac.kr
## FLOCK_TO defines the central managers that your schedd will advertise itself to (i.e. these pools will give matches to your schedd).
FLOCK_TO = nickeys-fs.xxxxx.ac.kr, nickeys-1.xxxxx.ac.kr, nickeys-2.xxxxx.ac.kr, nickeys-3.xxxxx.ac.kr, nickeys-4.xxxxx.ac.kr, nickeys-5.xxxxx.ac.kr, nickeys-6.xxxxx.ac.kr, nickeys-7.xxxxx.ac.kr, nickeys-8.xxxxx.ac.kr
UID_DOMAIN = xxxxx.ac.kr
RUN = $(LOCAL_DIR)/run/condor
LOG = $(LOCAL_DIR)/log/condor
LOCK = $(LOCAL_DIR)/lock/condor
SPOOL = $(LOCAL_DIR)/lib/condor/spool
EXECUTE = $(LOCAL_DIR)/lib/condor/execute
BIN = $(RELEASE_DIR)/bin
LIB = $(RELEASE_DIR)/lib/condor
INCLUDE = $(RELEASE_DIR)/include/condor
SBIN = $(RELEASE_DIR)/sbin
LIBEXEC = $(RELEASE_DIR)/lib/condor/libexec
SHARE = $(RELEASE_DIR)/share/condor
GANGLIA_LIB64_PATH = /lib,/usr/lib,/usr/local/lib
PROCD_ADDRESS = $(RUN)/procd_pipe
## What machine is your central manager?
CONDOR_HOST = nickeys-fs.xxxxx.ac.kr
FILESYSTEM_DOMAIN = xxxxx.ac.kr
## This macro determines what daemons the condor_master will start and keep its watchful eyes on.
## The list is a comma or space separated list of subsystem names
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
2. condor_config.local on the execute machine:
FILESYSTEM_DOMAIN = xxxxx.ac.kr
That is all the information I have about my HTCondor system.
Please give me any hint, however small. I have been struggling
with this problem for 3 days.
I could not find any clue about it, even by googling.
Sincerely,
--
Tim Theisen
Release Manager
HTCondor & Open Science Grid
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin - Madison
4261 Computer Sciences and Statistics
1210 W Dayton St
Madison, WI 53706-1685
+1 608 265 5736