Hi all,
I have been debugging the issue, and I have noticed three things:
1) When the hook gets called, the execution environment has already
been deleted, but the hook is not aware of it. I checked by running
pwd and trying both ls and ls .. within the hook: the directory that
pwd reports (the one under EXECUTE) is no longer there.
2) The hook now (as of HTCondor 8.0) gets killed after 1 or 2
seconds, even if HOOKNAME_HOOK_JOB_EXIT_TIMEOUT is set to 300
(obviously, HOOKNAME matches the hook name).
3) The output directory is deleted while the script is executing. I
tried a loop with sleep 1 and an ls on each iteration: in the first
second the files are there, in the next they are not.
In short, it seems that the cleanup process ignores the hook and goes
ahead deleting everything. The job process ended naturally, so I don't
think settings such as KILLING_TIMEOUT should even apply. Has this
code path been changed recently? Where could I look for this in the
source code? Any pointer would be most welcome.
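A minimal sketch of the kind of probe described in points 1 and 3,
assuming it is pasted at the top of the hook script. The function
names and the log path are made up for illustration; nothing here is
an HTCondor interface.

```shell
#!/bin/bash
# Diagnostic helpers for a job exit hook (names are illustrative only).

# Print one timestamped line saying whether a directory still exists.
probe_dir() {
    local dir="$1" entries
    if [ -d "$dir" ]; then
        entries=$(ls -A "$dir" | wc -l | tr -d ' ')
        echo "$(date +%T) $dir exists, $entries entries"
    else
        echo "$(date +%T) $dir is gone"
    fi
}

# Probe a directory once per second, a fixed number of times.
watch_dir() {
    local dir="$1" count="${2:-5}" i=0
    while [ "$i" -lt "$count" ]; do
        probe_dir "$dir"
        sleep 1
        i=$((i + 1))
    done
}
```

Calling watch_dir "$PWD" 5 >> /tmp/hook_exit_debug.log 2>&1 from the
hook would leave one line per second, which shows at which point the
directory disappears (or that the hook was killed before the loop
finished).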
Thanks,
Joan
On 01/07/13 12:28, Joan J. Piles
wrote:
Hi all,
We have been having trouble with our JOB_EXIT_HOOKS, both in
HTCondor 7.8 and in HTCondor 8.0. Some of them (and the number is,
strangely, increasing with time) don't get any job ClassAd at all.
At first we thought it could be a timeout issue (we had our share
of these as well), but it doesn't seem to be the case as the hook
script continues its execution. Just in case, we have set both
KILLING_TIMEOUT and xxxxx_HOOK_JOB_EXIT_TIMEOUT to 300 seconds,
which should be more than enough for it.
The first thing our hook script tries to do is to dump the whole
ClassAd to a file (for debugging purposes), and it is creating
empty files:
#!/bin/bash
# Dump the job ClassAd received on stdin to a temporary file.
TMPFILE=$(mktemp /tmp/condorlog.XXXXXX)
cat > "$TMPFILE"
The script keeps going from there (reading the stored ClassAd and
processing it). We can see that the script tries to do its job,
but it complains about not having any data to work on. That's why
we have ruled out a timeout.
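One way to narrow this down is to have the dump step report how many
bytes it actually received, so that an empty ClassAd is flagged at
the moment of capture rather than later in processing. The following
is only a sketch: dump_stdin is a hypothetical helper name, and the
temporary file pattern is the one from the script above.

```shell
#!/bin/bash
# Hypothetical helper: save stdin to a temp file and report its size.
dump_stdin() {
    local tmpfile bytes
    tmpfile=$(mktemp /tmp/condorlog.XXXXXX) || return 1
    cat > "$tmpfile"
    # Byte count of what was captured; 0 means no ClassAd arrived.
    bytes=$(wc -c < "$tmpfile" | tr -d ' ')
    echo "$tmpfile $bytes"
}
```

The hook would call dump_stdin as its first step and log the printed
"path size" pair; a size of 0 means the hook's stdin was empty or
already closed when it started.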
I found a similar report on the list from four years ago [1], but
it did not seem to reach a resolution. Is there anything I could do
to further debug this issue?
Thanks,
Joan
[1]:
https://lists.cs.wisc.edu/archive/htcondor-users/2009-July/msg00165.shtml
--
--------------------------------------------------------------------------
Joan Josep Piles Contreras - Analista de sistemas
I3A - Instituto de Investigación en Ingeniería de Aragón
Tel: 876 55 51 47 (ext. 845147)
http://i3a.unizar.es -- jpiles@xxxxxxxxx
--------------------------------------------------------------------------