
[HTCondor-users] Trouble trying to make HTCondor work in a Docker container



Howdy Condorfolk,

I'm trying to get Condor to run using Kubernetes and have run into a problem when using CCB (to make port mapping work).

For my purposes I don't need any security inside the Docker container, so my condor_config.local looks like the following. (Once this works, the startd will be removed and the minions added.) The IP address is obtained from the Docker $HOSTNAME using getent, because when I used the host name instead I couldn't get even this far with the shared port enabled.

NO_DNS = TRUE
TRUST_UID_DOMAIN = TRUE
UID_DOMAIN = *

USE_SHARED_PORT = TRUE
SHARED_PORT_ARGS = -p 9886

SEC_DEFAULT_NEGOTIATION = NEVER
SEC_DEFAULT_AUTHENTICATION = NEVER

ALLOW_READ          = *,*@*
ALLOW_WRITE         = *,*@*
ALLOW_ADMINISTRATOR = *,*@*
ALLOW_CONFIG        = *,*@*
ALLOW_NEGOTIATOR    = *,*@*
ALLOW_DAEMON        = *,*@*

DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD, SHARED_PORT

CONDOR_HOST = 172.17.0.71
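
The address lookup described above can be sketched roughly as follows. This is only a minimal sketch, not the script from the Dockerfile: the function names first_address and condor_host_line are hypothetical, and it assumes getent and awk are available inside the container.

```shell
# Hypothetical helper: print the address column from the first line of
# `getent hosts`-style input ("ADDRESS NAME [ALIAS...]").
first_address() {
    awk 'NR == 1 { print $1 }'
}

# Hypothetical helper: emit a CONDOR_HOST line for a host name
# (defaults to $HOSTNAME). Returns non-zero if the name does not resolve.
condor_host_line() {
    addr=$(getent hosts "${1:-$HOSTNAME}" | first_address)
    [ -n "$addr" ] || return 1
    printf 'CONDOR_HOST = %s\n' "$addr"
}
```

Something like `condor_host_line >> /etc/condor/condor_config.local` at container start would then append a line of the same form as the CONDOR_HOST setting above.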

But when a job is submitted it never starts and the error messages are:

SchedLog:
11/09/14 01:12:00 (pid:39) Shadow pid 100 for job 1.0 exited with status 4
11/09/14 01:12:00 (pid:39) ERROR: Shadow exited with job exception code!

ShadowLog:
11/09/14 01:12:00 (1.0) (105): ERROR "Can no longer talk to condor_starter <172.17.0.71:9886>" at line 220 in file

StartLog:
11/09/14 01:12:00 Starter pid 101 died on signal 11 (signal 11 (Segmentation fault))

StarterLog.slot1:
Stack dump for process 101 at timestamp 1415495520 (12 frames)
/usr/lib64/condor/libcondor_utils_8_2_3.so(dprintf_dump_stack+0x12d)[0x7f44e8c7ab2d]
/usr/lib64/condor/libcondor_utils_8_2_3.so(_Z18linux_sig_coredumpi+0x40)[0x7f44e8d9ccc0]
/lib64/libpthread.so.0(+0xf710)[0x7f44e4975710]
/lib64/libc.so.6(+0x81171)[0x7f44e4653171]
condor_starter(_ZN9JICShadow18publishStarterInfoEPN14compat_classad7ClassAdE+0x7c)[0x437acc]
condor_starter(_ZN9JICShadow19registerStarterInfoEv+0x31)[0x437ee1]
condor_starter(_ZN9JICShadow4initEv+0xba)[0x43974a]
condor_starter(_ZN8CStarter4InitEP19JobInfoCommunicatorPKcbiii+0x575)[0x445915]
condor_starter(_Z9main_initiPPc+0x61)[0x431661]
/usr/lib64/condor/libcondor_utils_8_2_3.so(_Z7dc_mainiPPc+0x1790)[0x7f44e8d9ee40]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7f44e45f0d5d]
condor_starter[0x41fb19]


The complete logs follow.

SchedLog:
11/09/14 01:11:04 (pid:39) ******************************************************
11/09/14 01:11:04 (pid:39) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
11/09/14 01:11:04 (pid:39) ** /usr/sbin/condor_schedd
11/09/14 01:11:04 (pid:39) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
11/09/14 01:11:04 (pid:39) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
11/09/14 01:11:04 (pid:39) ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
11/09/14 01:11:04 (pid:39) ** $CondorPlatform: x86_64_RedHat6 $
11/09/14 01:11:04 (pid:39) ** PID = 39
11/09/14 01:11:04 (pid:39) ** Log last touched time unavailable (No such file or directory)
11/09/14 01:11:04 (pid:39) ******************************************************
11/09/14 01:11:04 (pid:39) Using config source: /etc/condor/condor_config
11/09/14 01:11:04 (pid:39) Using local config sources:
11/09/14 01:11:04 (pid:39)    /etc/condor/condor_config.local
11/09/14 01:11:04 (pid:39) config Macros = 68, Sorted = 68, StringBytes = 1662, TablesBytes = 2496
11/09/14 01:11:04 (pid:39) CLASSAD_CACHING is ENABLED
11/09/14 01:11:04 (pid:39) Daemon Log is logging: D_ALWAYS D_ERROR
11/09/14 01:11:04 (pid:39) SharedPortEndpoint: waiting for connections to named socket 34_770f_5
11/09/14 01:11:04 (pid:39) DaemonCore: command socket at <172.17.0.71:9886?sock=34_770f_5>
11/09/14 01:11:04 (pid:39) DaemonCore: private command socket at <172.17.0.71:9886?sock=34_770f_5>
11/09/14 01:11:04 (pid:39) History file rotation is enabled.
11/09/14 01:11:04 (pid:39)   Maximum history file size is: 20971520 bytes
11/09/14 01:11:04 (pid:39)   Number of rotated history files is: 2
11/09/14 01:11:04 (pid:39) Received a superuser command
11/09/14 01:11:09 (pid:39) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
11/09/14 01:11:09 (pid:39) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
11/09/14 01:11:09 (pid:39) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
11/09/14 01:12:00 (pid:39) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
11/09/14 01:12:00 (pid:39) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
11/09/14 01:12:00 (pid:39) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
11/09/14 01:12:00 (pid:39) Sent ad to central manager for submit@*
11/09/14 01:12:00 (pid:39) Sent ad to 1 collectors for submit@*
11/09/14 01:12:00 (pid:39) Using negotiation protocol: NEGOTIATE
11/09/14 01:12:00 (pid:39) Negotiating for owner: submit@*
11/09/14 01:12:00 (pid:39) AutoCluster:config() significant attributes changed to JobUniverse,LastCheckpointPlatform,NumCkpts
11/09/14 01:12:00 (pid:39) Checking consistency running and runnable jobs
11/09/14 01:12:00 (pid:39) Tables are consistent
11/09/14 01:12:00 (pid:39) Rebuilt prioritized runnable job list in 0.000s.
11/09/14 01:12:00 (pid:39) Finished negotiating for submit in local pool: 1 matched, 0 rejected
11/09/14 01:12:00 (pid:39) Starting add_shadow_birthdate(1.0)
11/09/14 01:12:00 (pid:39) Started shadow for job 1.0 on slot1@ <172.17.0.71:9886?sock=34_770f_6> for submit, (shadow pid = 100)
11/09/14 01:12:00 (pid:39) Shadow pid 100 for job 1.0 exited with status 4
11/09/14 01:12:00 (pid:39) ERROR: Shadow exited with job exception code!
11/09/14 01:12:00 (pid:39) Checking consistency running and runnable jobs
11/09/14 01:12:00 (pid:39) Tables are consistent
11/09/14 01:12:00 (pid:39) Rebuilt prioritized runnable job list in 0.000s.
11/09/14 01:12:00 (pid:39) match (slot1@ <172.17.0.71:9886?sock=34_770f_6> for submit) switching to job 1.0
11/09/14 01:12:00 (pid:39) Starting add_shadow_birthdate(1.0)
11/09/14 01:12:00 (pid:39) Started shadow for job 1.0 on slot1@ <172.17.0.71:9886?sock=34_770f_6> for submit, (shadow pid = 105)
11/09/14 01:12:00 (pid:39) Shadow pid 105 for job 1.0 exited with status 4
11/09/14 01:12:00 (pid:39) ERROR: Shadow exited with job exception code!
... repeated ad infinitum ...

ShadowLog:
11/09/14 01:12:00 ******************************************************
11/09/14 01:12:00 ** condor_shadow (CONDOR_SHADOW) STARTING UP
11/09/14 01:12:00 ** /usr/sbin/condor_shadow
11/09/14 01:12:00 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
11/09/14 01:12:00 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
11/09/14 01:12:00 ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
11/09/14 01:12:00 ** $CondorPlatform: x86_64_RedHat6 $
11/09/14 01:12:00 ** PID = 100
11/09/14 01:12:00 ** Log last touched time unavailable (No such file or directory)
11/09/14 01:12:00 ******************************************************
11/09/14 01:12:00 Using config source: /etc/condor/condor_config
11/09/14 01:12:00 Using local config sources:
11/09/14 01:12:00    /etc/condor/condor_config.local
11/09/14 01:12:00 config Macros = 70, Sorted = 70, StringBytes = 1752, TablesBytes = 1168
11/09/14 01:12:00 CLASSAD_CACHING is OFF
11/09/14 01:12:00 Daemon Log is logging: D_ALWAYS D_ERROR
11/09/14 01:12:00 SharedPortEndpoint: waiting for connections to named socket 39_3a9b_1
11/09/14 01:12:00 DaemonCore: command socket at <172.17.0.71:9886?sock=39_3a9b_1>
11/09/14 01:12:00 DaemonCore: private command socket at <172.17.0.71:9886?sock=39_3a9b_1>
11/09/14 01:12:00 Initializing a VANILLA shadow for job 1.0
11/09/14 01:12:00 (1.0) (100): Request to run on slot1@ <172.17.0.71:9886?sock=34_770f_6> was ACCEPTED
11/09/14 01:12:00 (1.0) (100): ERROR "Can no longer talk to condor_starter <172.17.0.71:9886>" at line 220 in file /slots/12/dir_40576/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
11/09/14 01:12:00 ******************************************************
11/09/14 01:12:00 ** condor_shadow (CONDOR_SHADOW) STARTING UP
11/09/14 01:12:00 ** /usr/sbin/condor_shadow
11/09/14 01:12:00 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
11/09/14 01:12:00 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
11/09/14 01:12:00 ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
11/09/14 01:12:00 ** $CondorPlatform: x86_64_RedHat6 $
11/09/14 01:12:00 ** PID = 105
11/09/14 01:12:00 ** Log last touched 11/9 01:12:00
11/09/14 01:12:00 ******************************************************
11/09/14 01:12:00 Using config source: /etc/condor/condor_config
11/09/14 01:12:00 Using local config sources:
11/09/14 01:12:00    /etc/condor/condor_config.local
11/09/14 01:12:00 config Macros = 70, Sorted = 70, StringBytes = 1752, TablesBytes = 1168
11/09/14 01:12:00 CLASSAD_CACHING is OFF
11/09/14 01:12:00 Daemon Log is logging: D_ALWAYS D_ERROR
11/09/14 01:12:00 SharedPortEndpoint: waiting for connections to named socket 39_3a9b_2
11/09/14 01:12:00 DaemonCore: command socket at <172.17.0.71:9886?sock=39_3a9b_2>
11/09/14 01:12:00 DaemonCore: private command socket at <172.17.0.71:9886?sock=39_3a9b_2>
11/09/14 01:12:00 Initializing a VANILLA shadow for job 1.0
11/09/14 01:12:00 (1.0) (105): Request to run on slot1@ <172.17.0.71:9886?sock=34_770f_6> was ACCEPTED
11/09/14 01:12:00 (1.0) (105): ERROR "Can no longer talk to condor_starter <172.17.0.71:9886>" at line 220 in file /slots/12/dir_40576/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
11/09/14 01:12:00 ******************************************************
... repeated ad infinitum ...

StartLog:
11/09/14 01:11:04 ******************************************************
11/09/14 01:11:04 ** condor_startd (CONDOR_STARTD) STARTING UP
11/09/14 01:11:04 ** /usr/sbin/condor_startd
11/09/14 01:11:04 ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
11/09/14 01:11:04 ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON
11/09/14 01:11:04 ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
11/09/14 01:11:04 ** $CondorPlatform: x86_64_RedHat6 $
11/09/14 01:11:04 ** PID = 42
11/09/14 01:11:04 ** Log last touched time unavailable (No such file or directory)
11/09/14 01:11:04 ******************************************************
11/09/14 01:11:04 Using config source: /etc/condor/condor_config
11/09/14 01:11:04 Using local config sources:
11/09/14 01:11:04    /etc/condor/condor_config.local
11/09/14 01:11:04 config Macros = 68, Sorted = 68, StringBytes = 1662, TablesBytes = 2496
11/09/14 01:11:04 CLASSAD_CACHING is ENABLED
11/09/14 01:11:04 Daemon Log is logging: D_ALWAYS D_ERROR
11/09/14 01:11:04 SharedPortEndpoint: waiting for connections to named socket 34_770f_6
11/09/14 01:11:04 DaemonCore: command socket at <172.17.0.71:9886?sock=34_770f_6>
11/09/14 01:11:04 DaemonCore: private command socket at <172.17.0.71:9886?sock=34_770f_6>
11/09/14 01:11:05 VM-gahp server reported an internal error
11/09/14 01:11:05 VM universe will be tested to check if it is available
11/09/14 01:11:05 History file rotation is enabled.
11/09/14 01:11:05   Maximum history file size is: 20971520 bytes
11/09/14 01:11:05   Number of rotated history files is: 2
11/09/14 01:11:05 Allocating auto shares for slot type 0: Cpus: auto, Memory: auto, Swap: auto, Disk: auto
slot type 0: Cpus: 1.000000, Memory: 249, Swap: 25.00%, Disk: 25.00%
slot type 0: Cpus: 1.000000, Memory: 249, Swap: 25.00%, Disk: 25.00%
slot type 0: Cpus: 1.000000, Memory: 249, Swap: 25.00%, Disk: 25.00%
slot type 0: Cpus: 1.000000, Memory: 249, Swap: 25.00%, Disk: 25.00%
11/09/14 01:11:05 slot1: New machine resource allocated
11/09/14 01:11:05 Setting up slot pairings
11/09/14 01:11:05 slot2: New machine resource allocated
11/09/14 01:11:05 Setting up slot pairings
11/09/14 01:11:05 slot3: New machine resource allocated
11/09/14 01:11:05 Setting up slot pairings
11/09/14 01:11:05 slot4: New machine resource allocated
11/09/14 01:11:05 Setting up slot pairings
11/09/14 01:11:05 CronJobList: Adding job 'mips'
11/09/14 01:11:05 CronJobList: Adding job 'kflops'
11/09/14 01:11:05 CronJob: Initializing job 'mips' (/usr/libexec/condor/condor_mips)
11/09/14 01:11:05 CronJob: Initializing job 'kflops' (/usr/libexec/condor/condor_kflops)
11/09/14 01:11:05 slot1: State change: IS_OWNER is false
11/09/14 01:11:05 slot1: Changing state: Owner -> Unclaimed
11/09/14 01:11:05 State change: RunBenchmarks is TRUE
11/09/14 01:11:05 slot1: Changing activity: Idle -> Benchmarking
11/09/14 01:11:05 BenchMgr:StartBenchmarks()
11/09/14 01:11:05 slot2: State change: IS_OWNER is false
11/09/14 01:11:05 slot2: Changing state: Owner -> Unclaimed
11/09/14 01:11:05 State change: RunBenchmarks is TRUE
11/09/14 01:11:05 slot2: Changing activity: Idle -> Benchmarking
11/09/14 01:11:05 slot2: Changing activity: Benchmarking -> Idle
11/09/14 01:11:05 slot3: State change: IS_OWNER is false
11/09/14 01:11:05 slot3: Changing state: Owner -> Unclaimed
11/09/14 01:11:05 State change: RunBenchmarks is TRUE
11/09/14 01:11:05 slot3: Changing activity: Idle -> Benchmarking
11/09/14 01:11:05 slot3: Changing activity: Benchmarking -> Idle
11/09/14 01:11:05 slot4: State change: IS_OWNER is false
11/09/14 01:11:05 slot4: Changing state: Owner -> Unclaimed
11/09/14 01:11:05 State change: RunBenchmarks is TRUE
11/09/14 01:11:05 slot4: Changing activity: Idle -> Benchmarking
11/09/14 01:11:05 slot4: Changing activity: Benchmarking -> Idle
11/09/14 01:11:32 State change: benchmarks completed
11/09/14 01:11:32 slot1: Changing activity: Benchmarking -> Idle
11/09/14 01:12:00 slot1: Request accepted.
11/09/14 01:12:00 slot1: Remote owner is submit@*
11/09/14 01:12:00 slot1: State change: claiming protocol successful
11/09/14 01:12:00 slot1: Changing state: Unclaimed -> Claimed
11/09/14 01:12:00 slot1: match_info called
11/09/14 01:12:00 slot1: Got activate_claim request from shadow (172.17.0.71)
11/09/14 01:12:00 slot1: Remote job ID is 1.0
11/09/14 01:12:00 slot1: Got universe "VANILLA" (5) from request classad
11/09/14 01:12:00 slot1: State change: claim-activation protocol successful
11/09/14 01:12:00 slot1: Changing activity: Idle -> Busy
11/09/14 01:12:00 Starter pid 101 died on signal 11 (signal 11 (Segmentation fault))
11/09/14 01:12:00 slot1: State change: starter exited
11/09/14 01:12:00 slot1: Changing activity: Busy -> Idle
11/09/14 01:12:00 slot1: Got activate_claim request from shadow (172.17.0.71)
11/09/14 01:12:00 slot1: Remote job ID is 1.0
11/09/14 01:12:00 slot1: Got universe "VANILLA" (5) from request classad
11/09/14 01:12:00 slot1: State change: claim-activation protocol successful
11/09/14 01:12:00 slot1: Changing activity: Idle -> Busy
11/09/14 01:12:00 Starter pid 106 died on signal 11 (signal 11 (Segmentation fault))
11/09/14 01:12:00 slot1: State change: starter exited
... repeated ad infinitum ...

StarterLog.slot1:
11/09/14 01:12:00 (pid:101) ** condor_starter (CONDOR_STARTER) STARTING UP
11/09/14 01:12:00 (pid:101) ** /usr/sbin/condor_starter
11/09/14 01:12:00 (pid:101) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
11/09/14 01:12:00 (pid:101) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
11/09/14 01:12:00 (pid:101) ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
11/09/14 01:12:00 (pid:101) ** $CondorPlatform: x86_64_RedHat6 $
11/09/14 01:12:00 (pid:101) ** PID = 101
11/09/14 01:12:00 (pid:101) ** Log last touched time unavailable (No such file or directory)
11/09/14 01:12:00 (pid:101) ******************************************************
11/09/14 01:12:00 (pid:101) Using config source: /etc/condor/condor_config
11/09/14 01:12:00 (pid:101) Using local config sources:
11/09/14 01:12:00 (pid:101)    /etc/condor/condor_config.local
11/09/14 01:12:00 (pid:101) config Macros = 70, Sorted = 69, StringBytes = 1759, TablesBytes = 2568
11/09/14 01:12:00 (pid:101) CLASSAD_CACHING is OFF
11/09/14 01:12:00 (pid:101) Daemon Log is logging: D_ALWAYS D_ERROR
11/09/14 01:12:00 (pid:101) SharedPortEndpoint: waiting for connections to named socket 42_57bb_3
11/09/14 01:12:00 (pid:101) DaemonCore: command socket at <172.17.0.71:9886?sock=42_57bb_3>
11/09/14 01:12:00 (pid:101) DaemonCore: private command socket at <172.17.0.71:9886?sock=42_57bb_3>
11/09/14 01:12:00 (pid:101) Communicating with shadow <172.17.0.71:9886?sock=39_3a9b_1>
11/09/14 01:12:00 (pid:101) Submitting machine is "172.17.0.71"
Stack dump for process 101 at timestamp 1415495520 (12 frames)
/usr/lib64/condor/libcondor_utils_8_2_3.so(dprintf_dump_stack+0x12d)[0x7f44e8c7ab2d]
/usr/lib64/condor/libcondor_utils_8_2_3.so(_Z18linux_sig_coredumpi+0x40)[0x7f44e8d9ccc0]
/lib64/libpthread.so.0(+0xf710)[0x7f44e4975710]
/lib64/libc.so.6(+0x81171)[0x7f44e4653171]
condor_starter(_ZN9JICShadow18publishStarterInfoEPN14compat_classad7ClassAdE+0x7c)[0x437acc]
condor_starter(_ZN9JICShadow19registerStarterInfoEv+0x31)[0x437ee1]
condor_starter(_ZN9JICShadow4initEv+0xba)[0x43974a]
condor_starter(_ZN8CStarter4InitEP19JobInfoCommunicatorPKcbiii+0x575)[0x445915]
condor_starter(_Z9main_initiPPc+0x61)[0x431661]
/usr/lib64/condor/libcondor_utils_8_2_3.so(_Z7dc_mainiPPc+0x1790)[0x7f44e8d9ee40]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7f44e45f0d5d]
condor_starter[0x41fb19]
... repeated ad infinitum ...
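
As an aid for reading the stack dump above, the mangled C++ symbol names can be pulled out of the frames and demangled. The following is a minimal sketch under assumptions: frame_symbols is a hypothetical helper name, and the frame layout it matches is simply the one visible in the dump (binary, then "(symbol+0xoffset)", then "[address]").

```shell
# Hypothetical helper: extract the symbol from each stack-dump frame, i.e.
# the text between "(" and "+0x..." on lines such as
#   condor_starter(_ZN9JICShadow4initEv+0xba)[0x43974a]
# Frames with no symbol name, e.g. "(+0xf710)" or "condor_starter[0x41fb19]",
# and all non-frame log lines are skipped.
frame_symbols() {
    sed -n 's/^[^(]*(\([A-Za-z_][A-Za-z0-9_]*\)+0x[0-9a-fA-F]*).*$/\1/p'
}
```

If c++filt from binutils is installed, `frame_symbols < StarterLog.slot1 | c++filt` should print readable names; the frame just below the libc frame, for instance, demangles to JICShadow::publishStarterInfo(compat_classad::ClassAd*).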

This is built using the official HTCondor RPM for 8.2.3 on CentOS 6.

bash-4.1# condor_version
$CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
$CondorPlatform: x86_64_RedHat6 $
bash-4.1# uname -a
Linux 9933c89c4b4d 3.16.4-tinycore64 #1 SMP Tue Oct 14 01:10:32 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

The Dockerfile can be seen here: https://github.com/jimwhite/dockerfiles/tree/master/condor/kubernetes

Any help would be greatly appreciated: I don't really know what to try next, and this has already taken far more time than expected (with no end in sight yet). Would trying an earlier version of Condor be worthwhile?

Thanks!

Jim