
Re: [HTCondor-users] evicted multicore jobs



Hi Todd,

Sorry that I didn't mention this in the first email: all the jobs are removed
by the user:

 condor_history 21509.0 -l|grep Remove
PeriodicRemove = false
RemoveReason = "via condor_rm (by user griduser01)"
OnExitRemove = true

I also want to mention that I'm running ATLAS (CERN) jobs in the Docker
universe, and the requested CPUs and memory are:

RequestCpus = 8
RequestMemory = 15000
while the machines have 8 cores and 16 GB of RAM.
I notice that the jobs are also killed while the input files are being
transferred, when very little memory is in use.
I didn't find any pattern (clue) in the job-killing process. Jobs can run for
a few minutes or up to 1 hour.
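For anyone following along, a minimal sketch of the diagnostic Todd suggested:
query the history for the cluster and print the reason attributes for every
proc at once, rather than grepping one job. The cluster ID 21509 is taken from
this thread; adjust it (and run this on a submit node) for your own site.

```shell
#!/bin/sh
# Sketch: show why jobs in a cluster left the queue.
# RemoveReason / HoldReason are standard job ClassAd attributes;
# -af:j prints each with its job ID prefixed.
if command -v condor_history >/dev/null 2>&1; then
    condor_history 21509 -af:j RemoveReason HoldReason
else
    echo "condor_history not found; run this on an HTCondor submit node"
fi
```

A RemoveReason of "via condor_rm (by user ...)" like the one above means the
eviction in the shadow log is just the schedd tearing the job down after an
explicit condor_rm, not the startd preempting it, so the next place to look is
whatever submitted the job (e.g. the ATLAS pilot factory) and why it issued the
remove.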

Best wishes,
Mihai



> On 11/14/2017 8:06 AM, Mihai Ciubancan wrote:
>>
>>
>>
>> Hi everyone,
>>
>> I'm running multicore jobs and all are evicted and I don't understand
>> what
>> is the reason.
>
> Hi Mihai,
>
> When you say "evicted", do you mean the job went from state Running back
> to state Idle (i.e. still queued and waiting to be rescheduled), or did
> something else happen?
>
> Doing just a quick look at the shadow log below, looks like job 21509.0
> was either removed or put on hold.
>
> Is the job still in the queue (i.e. does it still appear with condor_q)?
>
> If not, what does "condor_history -l 21509.0" reveal?  If the job was
> removed or placed on hold, the attributes RemoveReason or HoldReason,
> respectively, should tell why.
>
> regards
> Todd
>
>
>
>> This the shadow log output for a job(21509.0):
>>
>> 11/14/17 15:34:00 (21509.0) (3772013): Calling Timer handler 33
>> (dc_touch_log_file)
>> 11/14/17 15:34:00 (21509.0) (3772013): Return from Timer handler 33
>> (dc_touch_log_file)
>> 11/14/17 15:34:02 (21509.0) (3772013): Calling Timer handler 13
>> (checkPeriodic)
>> 11/14/17 15:34:02 (21509.0) (3772013): Return from Timer handler 13
>> (checkPeriodic)
>> 11/14/17 15:34:15 (21509.0) (3772013): Calling Timer handler 2
>> (check_parent)
>> 11/14/17 15:34:15 (21509.0) (3772013): Return from Timer handler 2
>> (check_parent)
>> 11/14/17 15:34:35 (21509.0) (3772013): In handleJobRemoval(), sig 10
>> 11/14/17 15:34:35 (21509.0) (3772013): setting exit reason on
>> slot1_1@xxxxxxxxxxxxxx to 102
>> 11/14/17 15:34:35 (21509.0) (3772013): Resource slot1_1@xxxxxxxxxxxxxx
>> changing state from EXECUTING to FINISHED
>> 11/14/17 15:34:35 (21509.0) (3772013): Requesting graceful removal of
>> job.
>> 11/14/17 15:34:35 (21509.0) (3772013): Entering
>> DCStartd::deactivateClaim(graceful)
>> 11/14/17 15:34:35 (21509.0) (3772013):
>> DCStartd::deactivateClaim(DEACTIVATE_CLAIM,...) making connection to
>> <192.168.181.182:9618?addrs=192.168.181.182-9618+[--1]-9618&noUDP&sock=8572_41ea_3>
>> 11/14/17 15:34:35 (21509.0) (3772013): condor_write(fd=5
>> <192.168.181.182:9618>,,size=134,timeout=20,flags=0,non_blocking=0)
>> 11/14/17 15:34:35 (21509.0) (3772013): SharedPortClient: sent connection
>> request to <192.168.181.182:9618> for shared port id 8572_41ea_3
>> 11/14/17 15:34:35 (21509.0) (3772013): condor_write(fd=5
>> <192.168.181.182:9618>,,size=785,timeout=20,flags=0,non_blocking=0)
>> 11/14/17 15:34:35 (21509.0) (3772013): encrypting secret
>> 11/14/17 15:34:35 (21509.0) (3772013): condor_write(fd=5
>> <192.168.181.182:9618>,,size=164,timeout=20,flags=0,non_blocking=0)
>> 11/14/17 15:34:35 (21509.0) (3772013): condor_read(fd=5
>> <192.168.181.182:9618>,,size=21,timeout=20,flags=0,non_blocking=0)
>> 11/14/17 15:34:35 (21509.0) (3772013): condor_read(): fd=5
>> 11/14/17 15:34:35 (21509.0) (3772013): condor_read(): select returned 1
>> 11/14/17 15:34:35 (21509.0) (3772013): condor_read(fd=5
>> <192.168.181.182:9618>,,size=23,timeout=20,flags=0,non_blocking=0)
>> 11/14/17 15:34:35 (21509.0) (3772013): condor_read(): fd=5
>> 11/14/17 15:34:35 (21509.0) (3772013): condor_read(): select returned 1
>> 11/14/17 15:34:35 (21509.0) (3772013): DCStartd::deactivateClaim:
>> successfully sent command
>> 11/14/17 15:34:35 (21509.0) (3772013): CLOSE TCP <192.168.181.13:29201>
>> fd=5
>> 11/14/17 15:34:35 (21509.0) (3772013): Killed starter (graceful) at
>> <192.168.181.182:9618?addrs=192.168.181.182-9618+[--1]-9618&noUDP&sock=8572_41ea_3>
>> 11/14/17 15:34:37 (21509.0) (3772013): Calling Handler <HandleSyscalls>
>> (1)
>> 11/14/17 15:34:37 (21509.0) (3772013): condor_read(fd=4 startd
>> slot1_1@xxxxxxxxxxxxxx,,size=21,timeout=300,flags=0,non_blocking=0)
>> 11/14/17 15:34:37 (21509.0) (3772013): condor_read(): fd=4
>> 11/14/17 15:34:37 (21509.0) (3772013): condor_read(): select returned 1
>> 11/14/17 15:34:37 (21509.0) (3772013): condor_read(fd=4 startd
>> slot1_1@xxxxxxxxxxxxxx,,size=452,timeout=300,flags=0,non_blocking=0)
>> 11/14/17 15:34:37 (21509.0) (3772013): condor_read(): fd=4
>> 11/14/17 15:34:37 (21509.0) (3772013): condor_read(): select returned 1
>> 11/14/17 15:34:37 (21509.0) (3772013): Inside
>> RemoteResource::updateFromStarter()
>> 11/14/17 15:34:37 (21509.0) (3772013): condor_write(fd=4 startd
>> slot1_1@xxxxxxxxxxxxxx,,size=29,timeout=300,flags=0,non_blocking=0)
>> 11/14/17 15:34:37 (21509.0) (3772013): Return from Handler
>> <HandleSyscalls> 0.000604s
>> 11/14/17 15:34:39 (21509.0) (3772013): Calling Handler <HandleSyscalls>
>> (1)
>> 11/14/17 15:34:39 (21509.0) (3772013): condor_read(fd=4 startd
>> slot1_1@xxxxxxxxxxxxxx,,size=21,timeout=300,flags=0,non_blocking=0)
>> 11/14/17 15:34:39 (21509.0) (3772013): condor_read(): fd=4
>> 11/14/17 15:34:39 (21509.0) (3772013): condor_read(): select returned 1
>> 11/14/17 15:34:39 (21509.0) (3772013): condor_read(fd=4 startd
>> slot1_1@xxxxxxxxxxxxxx,,size=158,timeout=300,flags=0,non_blocking=0)
>> 11/14/17 15:34:39 (21509.0) (3772013): condor_read(): fd=4
>> 11/14/17 15:34:39 (21509.0) (3772013): condor_read(): select returned 1
>> 11/14/17 15:34:39 (21509.0) (3772013): Inside
>> RemoteResource::updateFromStarter()
>> 11/14/17 15:34:39 (21509.0) (3772013): Inside
>> RemoteResource::resourceExit()
>> 11/14/17 15:34:39 (21509.0) (3772013): condor_write(fd=4 startd
>> slot1_1@xxxxxxxxxxxxxx,,size=29,timeout=300,flags=0,non_blocking=0)
>> 11/14/17 15:34:39 (21509.0) (3772013): Job 21509.0 is being evicted from
>> slot1_1@xxxxxxxxxxxxxx
>> 11/14/17 15:34:39 (21509.0) (3772013):
>> Daemon::startCommand(QMGMT_WRITE_CMD,...) making connection to
>> <81.180.86.133:20346>
>> 11/14/17 15:34:39 (21509.0) (3772013): CONNECT bound to
>> <81.180.86.133:3323> fd=5 peer=<81.180.86.133:20346>
>> 11/14/17 15:34:39 (21509.0) (3772013): condor_write(fd=5 schedd at
>> <81.180.86.133:20346>,,size=818,timeout=300,flags=0,non_blocking=0)
>> 11/14/17 15:34:39 (21509.0) (3772013): condor_write(fd=5 schedd at
>> <81.180.86.133:20346>,,size=40,timeout=300,flags=0,non_blocking=0)
>> ............................
>> 11/14/17 15:34:39 (21509.0) (3772013): CLOSE TCP <81.180.86.133:3323>
>> fd=5
>> 11/14/17 15:34:39 (21509.0) (3772013): CLOSE TCP  fd=20
>> 11/14/17 15:34:39 (21509.0) (3772013): **** condor_shadow
>> (condor_SHADOW)
>> pid 3772013 EXITING WITH STATUS 102
>>
>> And the Startd log for the same job is:
>>
>> 11/14/17 15:34:35 Calling Handler
>> <SharedPortEndpoint::HandleListenerAccept> (0)
>> 11/14/17 15:34:35 SharedPortEndpoint: received command 76
>> SHARED_PORT_PASS_SOCK on named socket
>> ca0b82630723aa5347090cd7dabd050dd2c84dee3cbd9f6af878bffd3daa11fc/8572_41ea_3
>> 11/14/17 15:34:35 SharedPortEndpoint: received forwarded connection from
>> <192.168.181.13:29201>.
>> 11/14/17 15:34:35 Return from Handler
>> <SharedPortEndpoint::HandleListenerAccept> 0.000208s
>> 11/14/17 15:34:35 Calling Handler
>> <DaemonCommandProtocol::WaitForSocketData> (2)
>> 11/14/17 15:34:35 Calling HandleReq <command_handler> (0) for command
>> 403
>> (DEACTIVATE_CLAIM) from condor_pool@xxxxxxxx <192.168.181.13:29201>
>> 11/14/17 15:34:35 slot1_1: Called deactivate_claim()
>> 11/14/17 15:34:35 slot1_1: In Starter::kill() with pid 1446, sig 15
>> (SIGTERM)
>> 11/14/17 15:34:35 Send_Signal(): Doing kill(1446,15) [SIGTERM]
>> 11/14/17 15:34:35 slot1_1: Using max vacate time of 600s for this job.
>> 11/14/17 15:34:35 Return from HandleReq <command_handler> (handler:
>> 0.000193s, sec: 0.000s, payload: 0.000s)
>> 11/14/17 15:34:35 Return from Handler
>> <DaemonCommandProtocol::WaitForSocketData> 0.000480s
>>
>>
>> Have any one any idea?
>>
>>
>> Thanks in advance,
>> Mihai
>>
>> _______________________________________________
>> HTCondor-users mailing list
>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
>> with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/htcondor-users/
>>
>


Dr. Mihai Ciubancan
IT Department
National Institute of Physics and Nuclear Engineering "Horia Hulubei"
Str. Reactorului no. 30, P.O. BOX MG-6
077125, Magurele - Bucharest, Romania
http://www.ifin.ro
Work:   +40214042360
Mobile: +40761345687
Fax:    +40214042395