
Re: [HTCondor-users] evicted multicore jobs



On 11/15/2017 4:11 AM, Mihai Ciubancan wrote:
Hi Todd,

Sorry that I didn't mention in the first email: all the jobs are removed
by the user:

  condor_history 21509.0 -l|grep Remove
PeriodicRemove = false
RemoveReason = "via condor_rm (by user griduser01)"
OnExitRemove = true
I also want to mention that I'm running ATLAS (CERN) jobs in the Docker
universe, and the memory and CPU requested are:

RequestCpus = 8
RequestMemory = 15000
while the machines have 8 cores and 16 GB RAM.
I notice that the jobs are also killed while the input files are being
transferred, so very little memory is in use.
I didn't find any pattern (clue) for the job-killing process. Jobs can run
for a few minutes or up to 1 hour.

Best wishes,
Mihai


Hi Mihai,

According to HTCondor, someone (or some script or service) contacted the condor_schedd on your submit machine, authenticated as user griduser01, and told HTCondor to remove job 21509.0. When HTCondor is told to remove a job, it behaves exactly as the logs and history ad show: it kills any running instance of that job and then removes the job from the queue.

Maybe some admin at your site set up a script somewhere that periodically scans the queue and issues condor_rm on jobs it doesn't like?
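(For comparison, a schedd-side cleanup policy in condor_config could also remove jobs automatically; the fragment below is a purely hypothetical example of such a policy, not something your logs show. In that case, though, RemoveReason would typically cite the periodic policy rather than "via condor_rm (by user ...)", so your history output points at an external condor_rm.)

```
# HYPOTHETICAL example of an admin-side auto-removal policy.
# Removes any job that has been running (JobStatus == 2) for
# more than one hour:
SYSTEM_PERIODIC_REMOVE = (JobStatus == 2) && \
    ((time() - EnteredCurrentStatus) > 3600)
```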

If the job was submitted by a remote service, it seems the remote service decided to remove it after a while...

regards
Todd




On 11/14/2017 8:06 AM, Mihai Ciubancan wrote:



Hi everyone,

I'm running multicore jobs and they are all evicted, and I don't
understand the reason.

Hi Mihai,

When you say "evicted", do you mean the job went from state Running back
to state Idle (i.e. still queued and waiting to be rescheduled), or did
something else happen?

From a quick look at the shadow log below, it looks like job 21509.0
was either removed or put on hold.

Is the job still in the queue (i.e. does it still appear with condor_q)?

If not, what does "condor_history -l 21509.0" reveal?  If the job was
removed or placed on hold, the attributes RemoveReason or HoldReason,
respectively, should tell why.
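If you end up checking many jobs, a small filter can pull just those attributes out of the long (`-l`) ClassAd output. The sketch below is generic Python string handling, not an HTCondor API; the attribute names are the ones from the history output above, and the sample text is illustrative:

```python
def classad_reasons(long_output: str) -> dict:
    """Extract removal/hold attributes from `condor_history -l` text.

    Parses simple `Name = Value` lines; quoted string values are
    unquoted. Anything that is not one of the wanted attributes is
    ignored.
    """
    wanted = {"RemoveReason", "HoldReason", "PeriodicRemove", "OnExitRemove"}
    reasons = {}
    for line in long_output.splitlines():
        if "=" not in line:
            continue
        name, _, value = line.partition("=")
        name, value = name.strip(), value.strip()
        if name in wanted:
            reasons[name] = value.strip('"')
    return reasons

# Sample taken from the condor_history output in this thread:
sample = """PeriodicRemove = false
RemoveReason = "via condor_rm (by user griduser01)"
OnExitRemove = true"""

print(classad_reasons(sample)["RemoveReason"])
# -> via condor_rm (by user griduser01)
```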

regards
Todd



This is the shadow log output for a job (21509.0):

11/14/17 15:34:00 (21509.0) (3772013): Calling Timer handler 33
(dc_touch_log_file)
11/14/17 15:34:00 (21509.0) (3772013): Return from Timer handler 33
(dc_touch_log_file)
11/14/17 15:34:02 (21509.0) (3772013): Calling Timer handler 13
(checkPeriodic)
11/14/17 15:34:02 (21509.0) (3772013): Return from Timer handler 13
(checkPeriodic)
11/14/17 15:34:15 (21509.0) (3772013): Calling Timer handler 2
(check_parent)
11/14/17 15:34:15 (21509.0) (3772013): Return from Timer handler 2
(check_parent)
11/14/17 15:34:35 (21509.0) (3772013): In handleJobRemoval(), sig 10
11/14/17 15:34:35 (21509.0) (3772013): setting exit reason on
slot1_1@xxxxxxxxxxxxxx to 102
11/14/17 15:34:35 (21509.0) (3772013): Resource slot1_1@xxxxxxxxxxxxxx
changing state from EXECUTING to FINISHED
11/14/17 15:34:35 (21509.0) (3772013): Requesting graceful removal of
job.
11/14/17 15:34:35 (21509.0) (3772013): Entering
DCStartd::deactivateClaim(graceful)
11/14/17 15:34:35 (21509.0) (3772013):
DCStartd::deactivateClaim(DEACTIVATE_CLAIM,...) making connection to
<192.168.181.182:9618?addrs=192.168.181.182-9618+[--1]-9618&noUDP&sock=8572_41ea_3>
11/14/17 15:34:35 (21509.0) (3772013): condor_write(fd=5
<192.168.181.182:9618>,,size=134,timeout=20,flags=0,non_blocking=0)
11/14/17 15:34:35 (21509.0) (3772013): SharedPortClient: sent connection
request to <192.168.181.182:9618> for shared port id 8572_41ea_3
11/14/17 15:34:35 (21509.0) (3772013): condor_write(fd=5
<192.168.181.182:9618>,,size=785,timeout=20,flags=0,non_blocking=0)
11/14/17 15:34:35 (21509.0) (3772013): encrypting secret
11/14/17 15:34:35 (21509.0) (3772013): condor_write(fd=5
<192.168.181.182:9618>,,size=164,timeout=20,flags=0,non_blocking=0)
11/14/17 15:34:35 (21509.0) (3772013): condor_read(fd=5
<192.168.181.182:9618>,,size=21,timeout=20,flags=0,non_blocking=0)
11/14/17 15:34:35 (21509.0) (3772013): condor_read(): fd=5
11/14/17 15:34:35 (21509.0) (3772013): condor_read(): select returned 1
11/14/17 15:34:35 (21509.0) (3772013): condor_read(fd=5
<192.168.181.182:9618>,,size=23,timeout=20,flags=0,non_blocking=0)
11/14/17 15:34:35 (21509.0) (3772013): condor_read(): fd=5
11/14/17 15:34:35 (21509.0) (3772013): condor_read(): select returned 1
11/14/17 15:34:35 (21509.0) (3772013): DCStartd::deactivateClaim:
successfully sent command
11/14/17 15:34:35 (21509.0) (3772013): CLOSE TCP <192.168.181.13:29201>
fd=5
11/14/17 15:34:35 (21509.0) (3772013): Killed starter (graceful) at
<192.168.181.182:9618?addrs=192.168.181.182-9618+[--1]-9618&noUDP&sock=8572_41ea_3>
11/14/17 15:34:37 (21509.0) (3772013): Calling Handler <HandleSyscalls>
(1)
11/14/17 15:34:37 (21509.0) (3772013): condor_read(fd=4 startd
slot1_1@xxxxxxxxxxxxxx,,size=21,timeout=300,flags=0,non_blocking=0)
11/14/17 15:34:37 (21509.0) (3772013): condor_read(): fd=4
11/14/17 15:34:37 (21509.0) (3772013): condor_read(): select returned 1
11/14/17 15:34:37 (21509.0) (3772013): condor_read(fd=4 startd
slot1_1@xxxxxxxxxxxxxx,,size=452,timeout=300,flags=0,non_blocking=0)
11/14/17 15:34:37 (21509.0) (3772013): condor_read(): fd=4
11/14/17 15:34:37 (21509.0) (3772013): condor_read(): select returned 1
11/14/17 15:34:37 (21509.0) (3772013): Inside
RemoteResource::updateFromStarter()
11/14/17 15:34:37 (21509.0) (3772013): condor_write(fd=4 startd
slot1_1@xxxxxxxxxxxxxx,,size=29,timeout=300,flags=0,non_blocking=0)
11/14/17 15:34:37 (21509.0) (3772013): Return from Handler
<HandleSyscalls> 0.000604s
11/14/17 15:34:39 (21509.0) (3772013): Calling Handler <HandleSyscalls>
(1)
11/14/17 15:34:39 (21509.0) (3772013): condor_read(fd=4 startd
slot1_1@xxxxxxxxxxxxxx,,size=21,timeout=300,flags=0,non_blocking=0)
11/14/17 15:34:39 (21509.0) (3772013): condor_read(): fd=4
11/14/17 15:34:39 (21509.0) (3772013): condor_read(): select returned 1
11/14/17 15:34:39 (21509.0) (3772013): condor_read(fd=4 startd
slot1_1@xxxxxxxxxxxxxx,,size=158,timeout=300,flags=0,non_blocking=0)
11/14/17 15:34:39 (21509.0) (3772013): condor_read(): fd=4
11/14/17 15:34:39 (21509.0) (3772013): condor_read(): select returned 1
11/14/17 15:34:39 (21509.0) (3772013): Inside
RemoteResource::updateFromStarter()
11/14/17 15:34:39 (21509.0) (3772013): Inside
RemoteResource::resourceExit()
11/14/17 15:34:39 (21509.0) (3772013): condor_write(fd=4 startd
slot1_1@xxxxxxxxxxxxxx,,size=29,timeout=300,flags=0,non_blocking=0)
11/14/17 15:34:39 (21509.0) (3772013): Job 21509.0 is being evicted from
slot1_1@xxxxxxxxxxxxxx
11/14/17 15:34:39 (21509.0) (3772013):
Daemon::startCommand(QMGMT_WRITE_CMD,...) making connection to
<81.180.86.133:20346>
11/14/17 15:34:39 (21509.0) (3772013): CONNECT bound to
<81.180.86.133:3323> fd=5 peer=<81.180.86.133:20346>
11/14/17 15:34:39 (21509.0) (3772013): condor_write(fd=5 schedd at
<81.180.86.133:20346>,,size=818,timeout=300,flags=0,non_blocking=0)
11/14/17 15:34:39 (21509.0) (3772013): condor_write(fd=5 schedd at
<81.180.86.133:20346>,,size=40,timeout=300,flags=0,non_blocking=0)
............................
11/14/17 15:34:39 (21509.0) (3772013): CLOSE TCP <81.180.86.133:3323>
fd=5
11/14/17 15:34:39 (21509.0) (3772013): CLOSE TCP  fd=20
11/14/17 15:34:39 (21509.0) (3772013): **** condor_shadow
(condor_SHADOW)
pid 3772013 EXITING WITH STATUS 102

And the Startd log for the same job is:

11/14/17 15:34:35 Calling Handler
<SharedPortEndpoint::HandleListenerAccept> (0)
11/14/17 15:34:35 SharedPortEndpoint: received command 76
SHARED_PORT_PASS_SOCK on named socket
ca0b82630723aa5347090cd7dabd050dd2c84dee3cbd9f6af878bffd3daa11fc/8572_41ea_3
11/14/17 15:34:35 SharedPortEndpoint: received forwarded connection from
<192.168.181.13:29201>.
11/14/17 15:34:35 Return from Handler
<SharedPortEndpoint::HandleListenerAccept> 0.000208s
11/14/17 15:34:35 Calling Handler
<DaemonCommandProtocol::WaitForSocketData> (2)
11/14/17 15:34:35 Calling HandleReq <command_handler> (0) for command
403
(DEACTIVATE_CLAIM) from condor_pool@xxxxxxxx <192.168.181.13:29201>
11/14/17 15:34:35 slot1_1: Called deactivate_claim()
11/14/17 15:34:35 slot1_1: In Starter::kill() with pid 1446, sig 15
(SIGTERM)
11/14/17 15:34:35 Send_Signal(): Doing kill(1446,15) [SIGTERM]
11/14/17 15:34:35 slot1_1: Using max vacate time of 600s for this job.
11/14/17 15:34:35 Return from HandleReq <command_handler> (handler:
0.000193s, sec: 0.000s, payload: 0.000s)
11/14/17 15:34:35 Return from Handler
<DaemonCommandProtocol::WaitForSocketData> 0.000480s


Does anyone have any idea?


Thanks in advance,
Mihai

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/




Dr. Mihai Ciubancan
IT Department
National Institute of Physics and Nuclear Engineering "Horia Hulubei"
Str. Reactorului no. 30, P.O. BOX MG-6
077125, Magurele - Bucharest, Romania
http://www.ifin.ro
Work:   +40214042360
Mobile: +40761345687
Fax:    +40214042395




--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685