
Re: [HTCondor-users] job cannot reconnect to starter running MPI



While the job is running the logs are quiet; that is why there are no records between 14:37:13 and 14:40:04.

I ran the same test: when the application starts, HTCondor stops recording information in the logs (14:37:13) until the problem appears and the errors are added to the logs (14:40:04).
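If it helps, I can raise the debug levels on the execute nodes so that something does get recorded during that quiet window. A minimal local-config sketch, using the standard HTCondor debug settings (whether they will actually expose the problem is not guaranteed):

  # Execute-node local config: make the startd and starter chattier.
  # D_FULLDEBUG logs activity while the job runs; D_COMMAND also
  # records every command a daemon receives (e.g. deactivate_claim).
  STARTD_DEBUG  = D_FULLDEBUG D_COMMAND
  STARTER_DEBUG = D_FULLDEBUG
  # Submit-side shadow verbosity, to get the matching window there:
  SHADOW_DEBUG  = D_FULLDEBUG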


Thanks in advance!

----- Original Message -----
> From: "Todd L Miller" <tlmiller@xxxxxxxxxxx>
> To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
> Sent: Thursday, June 8, 2017 16:17:51
> Assunto: Re: [HTCondor-users] job cannot reconnect to starter running MPI
> 
> > I was monitoring the logs in the execution nodes as suggested earlier
> > and I got some errors just after HTCondor set the job from Running ->
> > Idle.
> 
> 	My question is, why is the job going idle?  What was in the
> starter log before it caught the sigquit (why is it being killed)?  The
> ShadowLog you quoted below runs from 14:37:13 to 14:40:04; what does the
> StartLog and StarterLog.* say for that period of time for those nodes?
> Why is the StartLog you showed full of deactivate_claim_forcibly()?
> 
> - ToddM
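Below are the logs from the two execution nodes for the window around the failure, as requested. To pull out the relevant window and to check the lease I used the commands below; note that the log directory /var/opt/condor/log is an assumption on my part (the StarterLog only confirms /var/opt/condor/config):

  # Extract the failure window from the node-side logs
  # (adjust the path to your LOG setting):
  grep '^06/08/17 17:3' /var/opt/condor/log/StartLog
  grep '^06/08/17 17:3' /var/opt/condor/log/StarterLog.slot*

  # How long the shadow/starter pair will wait for a reconnect is the
  # job lease; read it back from the job ad of our job 1212.0:
  condor_q -l 1212.0 | grep -i JobLeaseDuration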


== NODE01: StarterLog.slot* ==

06/08/17 17:30:16 About to exec /home/carlosadean/condor-examples/test_cosmosis/mp1script /mnt/eups/packages/Linux64/cosmosis_portal/1.4+1/bin/cosmosis --mpi /home/carlosadean/condor-examples/test_cosmosis/acf_grid.ini
06/08/17 17:30:16 Running job as user root
06/08/17 17:30:16 Create_Process succeeded, pid=12999
06/08/17 17:30:16 IOProxy: accepting connection from 10.1.255.219
06/08/17 17:30:16 condor_write() failed: send() 1 bytes to <10.1.255.219:37322> returned -1, timeout=0, errno=32 Broken pipe.
06/08/17 17:30:16 IOProxyHandler: closing connection to 10.1.255.219
06/08/17 17:30:20 IOProxy: accepting connection from 10.1.255.219
06/08/17 17:30:20 IOProxyHandler: closing connection to 10.1.255.219
06/08/17 17:30:20 IOProxy: accepting connection from 10.1.255.219
06/08/17 17:30:20 IOProxyHandler: closing connection to 10.1.255.219
06/08/17 17:32:55 Got SIGQUIT.  Performing fast shutdown.
06/08/17 17:32:55 ShutdownFast all jobs.
06/08/17 17:32:55 Got SIGTERM. Performing graceful shutdown.
06/08/17 17:32:55 ShutdownGraceful all jobs.
06/08/17 17:32:55 Process exited, pid=12999, status=0
06/08/17 17:32:55 condor_write(): Socket closed when trying to write 330 bytes to <10.1.1.12:46279>, fd is 6
06/08/17 17:32:55 Buf::write(): condor_write() failed
06/08/17 17:32:55 condor_write(): Socket closed when trying to write 78 bytes to <10.1.1.12:46279>, fd is 6
06/08/17 17:32:55 Buf::write(): condor_write() failed
06/08/17 17:32:55 Failed to send job exit status to shadow
06/08/17 17:32:55 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
06/08/17 17:32:55 Returning from CStarter::JobReaper()
06/08/17 17:33:37 Can't open directory "/var/opt/condor/config" as PRIV_UNKNOWN, errno: 2 (No such file or directory)
06/08/17 17:33:37 Setting maximum accepts per cycle 8.
06/08/17 17:33:37 ******************************************************
06/08/17 17:33:37 ** condor_starter (CONDOR_STARTER) STARTING UP
06/08/17 17:33:37 ** /opt/condor/sbin/condor_starter
06/08/17 17:33:37 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
06/08/17 17:33:37 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON

== NODE01: StartLog ==

06/08/17 17:30:15 slot4: Remote job ID is 1212.0
06/08/17 17:30:15 slot4: Got universe "PARALLEL" (11) from request classad
06/08/17 17:30:15 slot4: State change: claim-activation protocol successful
06/08/17 17:30:15 slot4: Changing activity: Idle -> Busy
06/08/17 17:30:15 slot5: Got activate_claim request from shadow (10.1.1.12)
06/08/17 17:30:15 slot5: Remote job ID is 1212.0
06/08/17 17:30:15 slot5: Got universe "PARALLEL" (11) from request classad
06/08/17 17:30:15 slot5: State change: claim-activation protocol successful
06/08/17 17:30:15 slot5: Changing activity: Idle -> Busy
06/08/17 17:32:54 slot10: State change: SUSPEND is TRUE
06/08/17 17:32:54 slot10: Changing activity: Busy -> Suspended
06/08/17 17:32:54 Starter pid 12987 exited with status 0
06/08/17 17:32:54 slot10: State change: starter exited
06/08/17 17:32:54 slot10: Changing activity: Suspended -> Idle
06/08/17 17:32:55 slot6: Called deactivate_claim_forcibly()
06/08/17 17:32:55 slot7: Called deactivate_claim_forcibly()
06/08/17 17:32:55 slot8: Called deactivate_claim_forcibly()
06/08/17 17:32:55 slot9: Called deactivate_claim_forcibly()
06/08/17 17:32:55 slot10: Called deactivate_claim_forcibly()
06/08/17 17:32:55 slot1: Called deactivate_claim_forcibly()
06/08/17 17:32:55 slot2: Called deactivate_claim_forcibly()
06/08/17 17:32:55 slot3: Called deactivate_claim_forcibly()
06/08/17 17:32:55 slot4: Called deactivate_claim_forcibly()
06/08/17 17:32:55 slot5: Called deactivate_claim_forcibly()
06/08/17 17:32:55 condor_write(): Socket closed when trying to write 28 bytes to <10.1.1.12:42628>, fd is 6
06/08/17 17:32:55 Buf::write(): condor_write() failed

== NODE02: StarterLog.slot* ==

06/08/17 17:30:39 IOProxy: accepting connection from 10.1.255.217
06/08/17 17:30:39 IOProxyHandler: closing connection to 10.1.255.217
06/08/17 17:30:39 IOProxy: accepting connection from 10.1.255.217
06/08/17 17:30:39 IOProxyHandler: closing connection to 10.1.255.217
06/08/17 17:30:39 IOProxy: accepting connection from 10.1.255.217
06/08/17 17:30:39 IOProxyHandler: closing connection to 10.1.255.217
06/08/17 17:32:54 Got SIGQUIT.  Performing fast shutdown.
06/08/17 17:32:54 ShutdownFast all jobs.
06/08/17 17:32:55 Got SIGTERM. Performing graceful shutdown.
06/08/17 17:32:55 ShutdownGraceful all jobs.
06/08/17 17:33:13 Process exited, pid=8645, status=0
06/08/17 17:33:13 condor_write(): Socket closed when trying to write 336 bytes to <10.1.1.12:40804>, fd is 6
06/08/17 17:33:13 Buf::write(): condor_write() failed
06/08/17 17:33:13 condor_write(): Socket closed when trying to write 78 bytes to <10.1.1.12:40804>, fd is 6
06/08/17 17:33:13 Buf::write(): condor_write() failed
06/08/17 17:33:13 Failed to send job exit status to shadow
06/08/17 17:33:13 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
06/08/17 17:33:13 Returning from CStarter::JobReaper()
06/08/17 17:33:36 Can't open directory "/var/opt/condor/config" as PRIV_UNKNOWN, errno: 2 (No such file or directory)
06/08/17 17:33:36 Setting maximum accepts per cycle 8.
06/08/17 17:33:36 ******************************************************
06/08/17 17:33:36 ** condor_starter (CONDOR_STARTER) STARTING UP

== NODE02: StartLog ==

06/08/17 17:30:15 slot4: Got universe "PARALLEL" (11) from request classad
06/08/17 17:30:15 slot4: State change: claim-activation protocol successful
06/08/17 17:30:15 slot4: Changing activity: Idle -> Busy
06/08/17 17:30:15 slot5: Got activate_claim request from shadow (10.1.1.12)
06/08/17 17:30:15 slot5: Remote job ID is 1212.0
06/08/17 17:30:15 slot5: Got universe "PARALLEL" (11) from request classad
06/08/17 17:30:15 slot5: State change: claim-activation protocol successful
06/08/17 17:30:15 slot5: Changing activity: Idle -> Busy
06/08/17 17:32:54 slot6: Called deactivate_claim_forcibly()
06/08/17 17:32:55 slot7: Called deactivate_claim_forcibly()
06/08/17 17:32:55 slot8: Called deactivate_claim_forcibly()
06/08/17 17:32:55 slot9: Called deactivate_claim_forcibly()
06/08/17 17:32:55 slot10: Called deactivate_claim_forcibly()
06/08/17 17:32:55 slot1: Called deactivate_claim_forcibly()
06/08/17 17:32:55 slot2: Called deactivate_claim_forcibly()
06/08/17 17:32:55 slot3: Called deactivate_claim_forcibly()
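
One thing I will try next, given the "JobExit() failed, waiting for job lease to expire or for a reconnect attempt" lines above, is lengthening the job lease so the starter waits longer for the shadow to reconnect. A sketch of the submit-file change (7200 is just an example value; I have not verified how well parallel-universe reconnect works in practice):

  # Submit description file fragment; job_lease_duration is in seconds.
  universe           = parallel
  job_lease_duration = 7200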


--
Carlos Adean
IT Team
linea.gov.br
skype: carlosadean