
[HTCondor-users] evicted multicore jobs





Hi everyone,

I'm running multicore jobs and all of them are being evicted, and I don't
understand why.
This is the shadow log output for one job (21509.0):

11/14/17 15:34:00 (21509.0) (3772013): Calling Timer handler 33
(dc_touch_log_file)
11/14/17 15:34:00 (21509.0) (3772013): Return from Timer handler 33
(dc_touch_log_file)
11/14/17 15:34:02 (21509.0) (3772013): Calling Timer handler 13
(checkPeriodic)
11/14/17 15:34:02 (21509.0) (3772013): Return from Timer handler 13
(checkPeriodic)
11/14/17 15:34:15 (21509.0) (3772013): Calling Timer handler 2 (check_parent)
11/14/17 15:34:15 (21509.0) (3772013): Return from Timer handler 2
(check_parent)
11/14/17 15:34:35 (21509.0) (3772013): In handleJobRemoval(), sig 10
11/14/17 15:34:35 (21509.0) (3772013): setting exit reason on
slot1_1@xxxxxxxxxxxxxx to 102
11/14/17 15:34:35 (21509.0) (3772013): Resource slot1_1@xxxxxxxxxxxxxx
changing state from EXECUTING to FINISHED
11/14/17 15:34:35 (21509.0) (3772013): Requesting graceful removal of job.
11/14/17 15:34:35 (21509.0) (3772013): Entering
DCStartd::deactivateClaim(graceful)
11/14/17 15:34:35 (21509.0) (3772013):
DCStartd::deactivateClaim(DEACTIVATE_CLAIM,...) making connection to
<192.168.181.182:9618?addrs=192.168.181.182-9618+[--1]-9618&noUDP&sock=8572_41ea_3>
11/14/17 15:34:35 (21509.0) (3772013): condor_write(fd=5
<192.168.181.182:9618>,,size=134,timeout=20,flags=0,non_blocking=0)
11/14/17 15:34:35 (21509.0) (3772013): SharedPortClient: sent connection
request to <192.168.181.182:9618> for shared port id 8572_41ea_3
11/14/17 15:34:35 (21509.0) (3772013): condor_write(fd=5
<192.168.181.182:9618>,,size=785,timeout=20,flags=0,non_blocking=0)
11/14/17 15:34:35 (21509.0) (3772013): encrypting secret
11/14/17 15:34:35 (21509.0) (3772013): condor_write(fd=5
<192.168.181.182:9618>,,size=164,timeout=20,flags=0,non_blocking=0)
11/14/17 15:34:35 (21509.0) (3772013): condor_read(fd=5
<192.168.181.182:9618>,,size=21,timeout=20,flags=0,non_blocking=0)
11/14/17 15:34:35 (21509.0) (3772013): condor_read(): fd=5
11/14/17 15:34:35 (21509.0) (3772013): condor_read(): select returned 1
11/14/17 15:34:35 (21509.0) (3772013): condor_read(fd=5
<192.168.181.182:9618>,,size=23,timeout=20,flags=0,non_blocking=0)
11/14/17 15:34:35 (21509.0) (3772013): condor_read(): fd=5
11/14/17 15:34:35 (21509.0) (3772013): condor_read(): select returned 1
11/14/17 15:34:35 (21509.0) (3772013): DCStartd::deactivateClaim:
successfully sent command
11/14/17 15:34:35 (21509.0) (3772013): CLOSE TCP <192.168.181.13:29201> fd=5
11/14/17 15:34:35 (21509.0) (3772013): Killed starter (graceful) at
<192.168.181.182:9618?addrs=192.168.181.182-9618+[--1]-9618&noUDP&sock=8572_41ea_3>
11/14/17 15:34:37 (21509.0) (3772013): Calling Handler <HandleSyscalls> (1)
11/14/17 15:34:37 (21509.0) (3772013): condor_read(fd=4 startd
slot1_1@xxxxxxxxxxxxxx,,size=21,timeout=300,flags=0,non_blocking=0)
11/14/17 15:34:37 (21509.0) (3772013): condor_read(): fd=4
11/14/17 15:34:37 (21509.0) (3772013): condor_read(): select returned 1
11/14/17 15:34:37 (21509.0) (3772013): condor_read(fd=4 startd
slot1_1@xxxxxxxxxxxxxx,,size=452,timeout=300,flags=0,non_blocking=0)
11/14/17 15:34:37 (21509.0) (3772013): condor_read(): fd=4
11/14/17 15:34:37 (21509.0) (3772013): condor_read(): select returned 1
11/14/17 15:34:37 (21509.0) (3772013): Inside
RemoteResource::updateFromStarter()
11/14/17 15:34:37 (21509.0) (3772013): condor_write(fd=4 startd
slot1_1@xxxxxxxxxxxxxx,,size=29,timeout=300,flags=0,non_blocking=0)
11/14/17 15:34:37 (21509.0) (3772013): Return from Handler
<HandleSyscalls> 0.000604s
11/14/17 15:34:39 (21509.0) (3772013): Calling Handler <HandleSyscalls> (1)
11/14/17 15:34:39 (21509.0) (3772013): condor_read(fd=4 startd
slot1_1@xxxxxxxxxxxxxx,,size=21,timeout=300,flags=0,non_blocking=0)
11/14/17 15:34:39 (21509.0) (3772013): condor_read(): fd=4
11/14/17 15:34:39 (21509.0) (3772013): condor_read(): select returned 1
11/14/17 15:34:39 (21509.0) (3772013): condor_read(fd=4 startd
slot1_1@xxxxxxxxxxxxxx,,size=158,timeout=300,flags=0,non_blocking=0)
11/14/17 15:34:39 (21509.0) (3772013): condor_read(): fd=4
11/14/17 15:34:39 (21509.0) (3772013): condor_read(): select returned 1
11/14/17 15:34:39 (21509.0) (3772013): Inside
RemoteResource::updateFromStarter()
11/14/17 15:34:39 (21509.0) (3772013): Inside RemoteResource::resourceExit()
11/14/17 15:34:39 (21509.0) (3772013): condor_write(fd=4 startd
slot1_1@xxxxxxxxxxxxxx,,size=29,timeout=300,flags=0,non_blocking=0)
11/14/17 15:34:39 (21509.0) (3772013): Job 21509.0 is being evicted from
slot1_1@xxxxxxxxxxxxxx
11/14/17 15:34:39 (21509.0) (3772013):
Daemon::startCommand(QMGMT_WRITE_CMD,...) making connection to
<81.180.86.133:20346>
11/14/17 15:34:39 (21509.0) (3772013): CONNECT bound to
<81.180.86.133:3323> fd=5 peer=<81.180.86.133:20346>
11/14/17 15:34:39 (21509.0) (3772013): condor_write(fd=5 schedd at
<81.180.86.133:20346>,,size=818,timeout=300,flags=0,non_blocking=0)
11/14/17 15:34:39 (21509.0) (3772013): condor_write(fd=5 schedd at
<81.180.86.133:20346>,,size=40,timeout=300,flags=0,non_blocking=0)
............................
11/14/17 15:34:39 (21509.0) (3772013): CLOSE TCP <81.180.86.133:3323> fd=5
11/14/17 15:34:39 (21509.0) (3772013): CLOSE TCP  fd=20
11/14/17 15:34:39 (21509.0) (3772013): **** condor_shadow (condor_SHADOW)
pid 3772013 EXITING WITH STATUS 102
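
Most of the excerpt above is debug-level socket traffic, so the few lines that describe the removal are easy to miss. The following is a minimal sketch, not an HTCondor tool, of one way to rejoin the mail-wrapped lines and keep only those records. The function names and the keyword list are assumptions made for illustration; the keywords are simply substrings copied from the log above and are not an exhaustive list of what a shadow can print.

```python
import re

# A log record starts with a timestamp such as "11/14/17 15:34:35 ".
# One or two month digits are accepted so a clipped "1/14/17" still matches.
TIMESTAMP = re.compile(r"^\d{1,2}/\d{1,2}/\d{2} \d{2}:\d{2}:\d{2} ")

# Substrings taken from the excerpt above (illustrative, not exhaustive).
KEYWORDS = (
    "handleJobRemoval",
    "exit reason",
    "being evicted",
    "EXITING WITH STATUS",
)


def join_wrapped(lines):
    """Merge continuation lines (no leading timestamp) into the previous record."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if TIMESTAMP.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def key_events(lines, keywords=KEYWORDS):
    """Return only the records that contain one of the keywords."""
    return [r for r in join_wrapped(lines) if any(k in r for k in keywords)]


if __name__ == "__main__":
    sample = """\
11/14/17 15:34:35 (21509.0) (3772013): In handleJobRemoval(), sig 10
11/14/17 15:34:35 (21509.0) (3772013): setting exit reason on
slot1_1@xxxxxxxxxxxxxx to 102
11/14/17 15:34:35 (21509.0) (3772013): Requesting graceful removal of job.
11/14/17 15:34:39 (21509.0) (3772013): **** condor_shadow (condor_SHADOW)
pid 3772013 EXITING WITH STATUS 102
"""
    for record in key_events(sample.splitlines()):
        print(record)
```

Run on the full excerpt, this would leave the handleJobRemoval, exit reason, eviction and exit status records, which are the ones that show the order of events.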

And the Startd log for the same job is:

11/14/17 15:34:35 Calling Handler
<SharedPortEndpoint::HandleListenerAccept> (0)
11/14/17 15:34:35 SharedPortEndpoint: received command 76
SHARED_PORT_PASS_SOCK on named socket
ca0b82630723aa5347090cd7dabd050dd2c84dee3cbd9f6af878bffd3daa11fc/8572_41ea_3
11/14/17 15:34:35 SharedPortEndpoint: received forwarded connection from
<192.168.181.13:29201>.
11/14/17 15:34:35 Return from Handler
<SharedPortEndpoint::HandleListenerAccept> 0.000208s
11/14/17 15:34:35 Calling Handler
<DaemonCommandProtocol::WaitForSocketData> (2)
11/14/17 15:34:35 Calling HandleReq <command_handler> (0) for command 403
(DEACTIVATE_CLAIM) from condor_pool@xxxxxxxx <192.168.181.13:29201>
11/14/17 15:34:35 slot1_1: Called deactivate_claim()
11/14/17 15:34:35 slot1_1: In Starter::kill() with pid 1446, sig 15 (SIGTERM)
11/14/17 15:34:35 Send_Signal(): Doing kill(1446,15) [SIGTERM]
11/14/17 15:34:35 slot1_1: Using max vacate time of 600s for this job.
11/14/17 15:34:35 Return from HandleReq <command_handler> (handler:
0.000193s, sec: 0.000s, payload: 0.000s)
11/14/17 15:34:35 Return from Handler
<DaemonCommandProtocol::WaitForSocketData> 0.000480s


Does anyone have any idea?


Thanks in advance,
Mihai