Re: [HTCondor-users] "aborted by the user" in successful job
- Date: Tue, 9 Apr 2019 14:02:22 +0000
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] "aborted by the user" in successful job
On 4/9/2019 4:32 AM, Alex Armstrong wrote:
> Dear htcondor users,
>
> Is there a reason why I would see abort events (see [1]) in the logs of
> my successful condor jobs? I have not run condor_rm on the job below,
> which is why it finished normally and returns the desired output. The
> full order of log events is below at [2].
>
I am a bit confused. Both your job event log entry [1] and your snippet
apparently from the ShadowLog at [2] show the same thing, that job
986318.954 was removed by condor_rm. [2] also shows that job 986321.018
did terminate normally, but that has nothing to do with what happened to
job 986318.954. Note it is possible, although unlikely, for a removed
job to still deposit the desired output in your home directory due to
race conditions - for instance, if the job completed on the execute node
at the same second you do a condor_rm on the submit node.
> I am trying to parse the log files to determine which jobs were aborted
> and need to be re-run. However, the abort event (i.e. 009) is appearing
> in log files of jobs that were not aborted and so I cannot use that as a handle
> for identifying user aborted jobs.
>
HTCondor will never "abort" (remove) jobs on its own without being told
to do so. Either condor_rm was run, or some policy expression in the
submit file or condor_config file was configured to remove the job upon
some condition (like after X failures) - but in the latter
case, I don't think the abort entry would say "via condor_rm". I think
the only way you see the "via condor_rm" is if indeed condor_rm was run.
Did you submit these jobs via DAGMan, i.e., did you use condor_submit_dag?
If so, be aware that jobs submitted by DAGMan are removed if you remove
the DAGMan job itself.
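
Since the confusion above comes from events of two different jobs sharing one log, here is a minimal sketch of grouping events by job ID before looking for 009. It assumes the event header format shown in [1] and [2], with indented detail lines following each header; the function names `events_by_job` and `aborted_jobs` are made up for illustration and are not part of HTCondor.

```python
import re
from collections import defaultdict

# Header line of an event, in the format shown in [1] and [2], e.g.
# "009 (986318.954.000) 04/08 13:20:03 Job was aborted by the user."
# Assumption: every event starts with such a line.
EVENT_HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) \S+ \S+ (.*)$")

ABORT_EVENT = "009"  # "Job was aborted by the user."


def events_by_job(lines):
    """Map "cluster.proc" to its events, in log order, as [code, text] pairs.

    Lines that are not event headers (for example "via condor_rm ...")
    are appended to the text of the most recent event.
    """
    events = defaultdict(list)
    current = None
    for raw in lines:
        line = raw.strip()
        if not line or line == "...":  # skip blanks and "..." separator lines
            continue
        match = EVENT_HEADER.match(line)
        if match:
            code, cluster, proc, message = match.groups()
            current = [code, message]
            events[f"{cluster}.{proc}"].append(current)
        elif current is not None:
            current[1] += " " + line
    return dict(events)


def aborted_jobs(lines):
    """Return the sorted IDs of jobs that logged an abort (009) event."""
    return sorted(
        job
        for job, job_events in events_by_job(lines).items()
        if any(code == ABORT_EVENT for code, _ in job_events)
    )
```

Run over the lines in [2], this would report only 986318.954 as aborted, while 986321.018 ends with a 005 (terminated) event. Checking the text of the 009 event for "via condor_rm" then tells you whether condor_rm was involved.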
Hope this helps
Todd
> Thanks,
> Alex
>
> [1]
> 009 (986318.954.000) 04/08 13:20:03 Job was aborted by the user.
>     via condor_rm (by user alarmstr)
>
> [2]
> 000 (986321.018.000) 04/08 13:16:38 Job submitted from host:
> 028 (986321.018.000) 04/08 13:16:38 Job ad information event triggered.
> 001 (986321.018.000) 04/08 13:17:19 Job executing on host:
> 028 (986321.018.000) 04/08 13:17:19 Job ad information event triggered.
> 006 (986321.018.000) 04/08 13:17:27 Image size of job updated: 34912
> 028 (986321.018.000) 04/08 13:17:27 Job ad information event triggered.
> 024 (986318.954.000) 04/08 13:20:03 Job reconnection failed
> 028 (986318.954.000) 04/08 13:20:03 Job ad information event triggered.
> 009 (986318.954.000) 04/08 13:20:03 Job was aborted by the user.
> 028 (986318.954.000) 04/08 13:20:03 Job ad information event triggered.
> 006 (986321.018.000) 04/08 13:22:27 Image size of job updated: 991768
> 028 (986321.018.000) 04/08 13:22:27 Job ad information event triggered.
> 005 (986321.018.000) 04/08 13:24:19 Job terminated.
>
> (1) Normal termination (return value 0)
>
> 028 (986321.018.000) 04/08 13:24:19 Job ad information event triggered.
>