Re: [HTCondor-users] Please help me; about Shadow exception!

Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

Hi, I'm following the 'HTCondor Quick Start Guide' (https://research.cs.wisc.edu/htcondor/manual/quickstart.html).

After I submitted a job, it ran for about 5 seconds and then went from the running state back to the idle state.

After a long wait, the job finally completed and its output file was written correctly.

I did not measure the delay exactly, but I estimate it was about 20 to 30 minutes.
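
For a more precise figure, the delay can be read off the job log shown further down: the first shadow exception is stamped 21:34:28 and the run that finally succeeded started at 22:02:30. A minimal sketch of that arithmetic, assuming both timestamps fall on the same day (the helper name to_seconds is only illustrative):

```shell
#!/bin/bash
# Convert an HH:MM:SS timestamp to seconds since midnight.
to_seconds() {
    h=${1%%:*}      # text before the first ':'
    rest=${1#*:}    # text after the first ':'
    m=${rest%%:*}
    s=${rest#*:}
    # Strip one leading zero so values like "08" are not parsed as octal.
    echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

exception_time="21:34:28"   # 007 Shadow exception event in sleep.log
rerun_time="22:02:30"       # second 001 Job executing event in sleep.log

# Assumption: both events happened on the same day (02/10 in the log).
delay=$(( $(to_seconds "$rerun_time") - $(to_seconds "$exception_time") ))
echo "delay: $(( delay / 60 )) min $(( delay % 60 )) s"
```

This prints a delay of 28 min 2 s, which is consistent with the 20 to 30 minute estimate.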

I suspect something is wrong, so I am asking the HTCondor-users mailing list about it.

Below is all the information I have about the current state of my machines.

The job file is:

#!/bin/bash
# file name: sleep.sh
TIMETOWAIT="10"
echo "sleeping for $TIMETOWAIT seconds"
/bin/sleep $TIMETOWAIT

The submit description file is:

executable = sleep.sh
log = sleep.log
output = outfile.txt
error = errors.txt
should_transfer_files = Yes
when_to_transfer_output = ON_EXIT
queue

Its log file (sleep.log) is:

000 (012.000.000) 02/10 21:34:25 Job submitted from host: <10.150.21.171:9618?addrs=10.150.21.171-9618+[--1]-9618&noUDP&sock=42970_bd0c_3>
...
001 (012.000.000) 02/10 21:34:26 Job executing on host: <10.150.21.170:9618?addrs=10.150.21.170-9618+[--1]-9618&noUDP&sock=297370_fa77_62>
...
007 (012.000.000) 02/10 21:34:28 Shadow exception!
Error from slot1@ubuntu: Create_Process failed to register the job with the ProcD
0 - Run Bytes Sent By Job
114 - Run Bytes Received By Job
...
## above message repeated ##
001 (012.000.000) 02/10 22:02:30 Job executing on host: <10.150.21.171:9618?addrs=10.150.21.171-9618+[--1]-9618&noUDP&sock=42970_bd0c_4>
...
006 (012.000.000) 02/10 22:02:38 Image size of job updated: 380
1 - MemoryUsage of job (MB)
380 - ResidentSetSize of job (KB)
...
005 (012.000.000) 02/10 22:02:41 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
24 - Run Bytes Sent By Job
114 - Run Bytes Received By Job
24 - Total Bytes Sent By Job
2052 - Total Bytes Received By Job
Partitionable Resources :    Usage  Request Allocated
   Cpus                 :                 1         1
   Disk (KB)            :        9        1  27474539
   Memory (MB)          :        1        1      4025
...
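
Because the log excerpt above abbreviates the repeated exceptions, it can help to count how many times the event actually occurred. A minimal sketch, assuming the log is the sleep.log named in the submit file; the function name count_shadow_exceptions is hypothetical, and only standard grep options are used:

```shell
#!/bin/bash
# Print the number of "Shadow exception!" events recorded in a job log.
# $1 is the path to the log file (for example, sleep.log).
count_shadow_exceptions() {
    # -F matches the text literally; -c prints the count of matching lines.
    grep -c -F 'Shadow exception!' "$1"
}
```

For example, running `count_shadow_exceptions sleep.log` in the submit directory prints the number of 007 events in the log.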

When I checked the ShadowLog (/var/log/condor/ShadowLog), it shows:

02/07/17 15:57:54 ** condor_shadow (CONDOR_SHADOW) STARTING UP
02/07/17 15:57:54 ** /usr/sbin/condor_shadow
02/07/17 15:57:54 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
02/07/17 15:57:54 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
02/07/17 15:57:54 ** $CondorVersion: 8.6.0 Jan 26 2017 BuildID: 395190 $
02/07/17 15:57:54 ** $CondorPlatform: x86_64_Debian7 $
02/07/17 15:57:54 ** PID = 209324
02/07/17 15:57:54 ** Log last touched 2/7 15:57:52
02/07/17 15:57:54 ******************************************************
02/07/17 15:57:54 Using config source: /etc/condor/condor_config
02/07/17 15:57:54 Using local config sources:
02/07/17 15:57:54 /etc/condor/condor_config.local
02/07/17 15:57:54 config Macros = 67, Sorted = 67, StringBytes = 1769, TablesBytes = 1112
02/07/17 15:57:54 CLASSAD_CACHING is OFF
02/07/17 15:57:54 Daemon Log is logging: D_ALWAYS D_ERROR
02/07/17 15:57:54 SharedPortEndpoint: waiting for connections to named socket 209324_bd59
02/07/17 15:57:54 DaemonCore: command socket at <10.150.21.170:9618?addrs=10.150.21.170-9618+[--1]-9618&noUDP&sock=209324_bd59>
02/07/17 15:57:54 DaemonCore: private command socket at <10.150.21.170:9618?addrs=10.150.21.170-9618+[--1]-9618&noUDP&sock=209324_bd59>
02/07/17 15:57:54 ERROR "Assertion ERROR on (job_ad_file)" at line 165 in file /slots/01/dir_17483/sources/src/condor_shadow.V6.1/shadow_v61_main.cpp

Additionally, here is my HTCondor configuration.

1. condor_config on the central manager machine:

LOCAL_DIR = /var
## Where is the machine-specific local config file for each host?
LOCAL_CONFIG_FILE = /etc/condor/condor_config.local
## If your configuration is on a shared file system, then this might be a better default
#LOCAL_CONFIG_FILE = $(RELEASE_DIR)/etc/$(HOSTNAME).local
## If the local config file is not present, is it an error? (WARNING: This is a potential security issue.)
REQUIRE_LOCAL_CONFIG_FILE = false
STARTER_ALLOW_RUNAS_OWNER = TRUE
## The normal way to do configuration with RPMs is to read all of the
## files in a given directory that don't match a regex as configuration files.
## Config files are read in lexicographic order.
LOCAL_CONFIG_DIR = /etc/condor/config.d
#LOCAL_CONFIG_DIR_EXCLUDE_REGEXP = ^((\..*)|(.*~)|(#.*)|(.*\.rpmsave)|(.*\.rpmnew))$
## Use a host-based security policy. By default CONDOR_HOST and the local machine will be allowed
use SECURITY : HOST_BASED
## To expand your condor pool beyond a single host, set ALLOW_WRITE to match all of the hosts
ALLOW_WRITE = nickeys-*.xxxxx.ac.kr
ALLOW_READ = nickeys-*.xxxxx.ac.kr
## FLOCK_FROM defines the machines that grant access to your pool via flocking. (i.e. these machines can join your pool).

FLOCK_FROM = nickeys-fs.xxxxx.ac.kr, nickeys-1.xxxxx.ac.kr, nickeys-2.xxxxx.ac.kr, nickeys-3.xxxxx.ac.kr, nickeys-4.xxxxx.ac.kr, nickeys-5.xxxxx.ac.kr, nickeys-6.xxxxx.ac.kr, nickeys-7.xxxxx.ac.kr, nickeys-8.xxxxx.ac.kr

## FLOCK_TO defines the central managers that your schedd will advertise itself to (i.e. these pools will give matches to your schedd).

FLOCK_TO = nickeys-fs.xxxxx.ac.kr, nickeys-1.xxxxx.ac.kr, nickeys-2.xxxxx.ac.kr, nickeys-3.xxxxx.ac.kr, nickeys-4.xxxxx.ac.kr, nickeys-5.xxxxx.ac.kr, nickeys-6.xxxxx.ac.kr, nickeys-7.xxxxx.ac.kr, nickeys-8.xxxxx.ac.kr

UID_DOMAIN = xxxxx.ac.kr
RUN = $(LOCAL_DIR)/run/condor
LOG = $(LOCAL_DIR)/log/condor
LOCK = $(LOCAL_DIR)/lock/condor
SPOOL = $(LOCAL_DIR)/lib/condor/spool
EXECUTE = $(LOCAL_DIR)/lib/condor/execute
BIN = $(RELEASE_DIR)/bin
LIB = $(RELEASE_DIR)/lib/condor
INCLUDE = $(RELEASE_DIR)/include/condor
SBIN = $(RELEASE_DIR)/sbin
LIBEXEC = $(RELEASE_DIR)/lib/condor/libexec
SHARE = $(RELEASE_DIR)/share/condor
GANGLIA_LIB64_PATH = /lib,/usr/lib,/usr/local/lib
PROCD_ADDRESS = $(RUN)/procd_pipe
## What machine is your central manager?
CONDOR_HOST = nickeys-fs.xxxxx.ac.kr
FILESYSTEM_DOMAIN = xxxxx.ac.kr
## This macro determines what daemons the condor_master will start and keep its watchful eyes on.
## The list is a comma or space separated list of subsystem names
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD

2. condor_config.local on the execute machine:

FILESYSTEM_DOMAIN = xxxxx.ac.kr

That is all the information I have about my HTCondor setup.

Any hint, however small, would be appreciated. I have been struggling with this problem for 3 days.

I could not find any clue about it, even after searching on Google.

Sincerely,
