HTCondor Project List Archives




Re: [Condor-devel] how could Condor do this? (was Re: [Condor-users] Suppress Windows error dialogs popping up for crashing Condor jobs)



The only ways I know of to control this are the registry key (system-wide) or from within the application itself.

So no, I don't think so.

-tj


On 5/18/2011 5:05 PM, Todd Tannenbaum wrote:
TJ -

Is there a way *Condor* could change the system default exception filter behavior for the jobs it runs?

regards
Todd


John (TJ) Knoeller wrote:
SetErrorMode will prevent errors from Windows regarding failed disk access or missing DLLs, but it has no effect on the messages that result from a crash (i.e., can't read from or write to memory, invalid instruction, etc.). Those messages come from a default ExceptionFilter that is wrapped around all of the threads in a process.
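
For a job that is launched from a Python wrapper (as described elsewhere in this thread), the SetErrorMode call could be made in the wrapper before the job starts, since a child process inherits its parent's error mode unless it is created with CREATE_DEFAULT_ERROR_MODE. The following is only a sketch under assumptions: run_job and set_quiet_error_mode are hypothetical names, and including SEM_NOGPFAULTERRORBOX is an addition not discussed above. As reported in this thread, SetErrorMode alone did not suppress the crash dialog on XP, so this should not be read as a complete fix.

```python
import ctypes
import subprocess
import sys

# Flag values documented for the Win32 SetErrorMode function.
SEM_FAILCRITICALERRORS = 0x0001   # no critical-error (e.g. failed disk access) dialogs
SEM_NOGPFAULTERRORBOX = 0x0002    # no general-protection-fault dialog
SEM_NOOPENFILEERRORBOX = 0x8000   # no "file not found" style dialogs

QUIET_MODE = (SEM_FAILCRITICALERRORS
              | SEM_NOGPFAULTERRORBOX
              | SEM_NOOPENFILEERRORBOX)


def set_quiet_error_mode(mode=QUIET_MODE):
    """Apply the error mode on Windows and return the previous mode.

    On any other platform nothing is changed and None is returned.
    """
    if sys.platform != "win32":
        return None
    return ctypes.windll.kernel32.SetErrorMode(mode)


def run_job(argv):
    """Hypothetical wrapper: set the error mode, run the job, return its exit code."""
    set_quiet_error_mode()
    # The child inherits the wrapper's error mode by default.
    return subprocess.call(argv)
```

A wrapper script would call run_job(["myjob.exe", "arg"]) (a made-up command line) and pass the returned exit code on to Condor.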

To get rid of the dialog boxes on memory access errors, you need to wrap your threads in a __try/__except that returns EXCEPTION_EXECUTE_HANDLER and have the handler call ExitProcess(error_code), instead of letting the default ExceptionFilter deal with the exceptions.

Alternatively, you can call SetUnhandledExceptionFilter and provide a default filter that exits the process with an error code:
http://msdn.microsoft.com/en-us/library/ms680634(VS.85).aspx

The registry setting that you found basically changes the behavior of the system default exception filter.

-tj


On 5/18/2011 10:53 AM, Derrick Karimi wrote:
Thanks for your input. I still need a check like the one you describe, based on CPU usage and a kill time. In particular I want to guard against infinite-loop bugs, or against a message box that was popped up on purpose. The developers know they aren't supposed to pop up an error message explicitly in a job. They are supposed to call our wrapped versions of all API calls, which route anything that would produce a GUI through a switch that just logs when in Condor mode.

My problem was an unexpected crash that made Windows produce an invalid memory access error message, and for some reason the programmatic method of using the Windows API SetErrorMode was missing this one on Windows XP. The registry key I listed fixed that, but it is a tough choice between telling the customer to edit their registry and spending additional dev/test/doc time on a configuration tool for them. If there is a way to keep that dialog from appearing entirely from within the code on XP, I would love to know about it.

As for implementing your idea of automatically killing a long-running job that is not using much CPU: do you implement this in Condor with a periodic remove? Or do you implement it in a thread of the Condor job via its Python wrapper?


--Derrick

On Wed, May 18, 2011 at 9:30 AM, Michael O'Donnell <odonnellm@xxxxxxxx> wrote:

    Derrick, I have run into similar problems, and generally this is handled in the application. One thought is to check whether the developers can add a switch that causes the program to exit with an error reported on STDOUT rather than a popup message. I was working on a numerical hydrologic model that was written by someone else in Fortran, and it essentially had a popup that required the user to click OK when the program completed successfully (as if you would not know the program had completed its analysis successfully). Anyhow, I was able to change the underlying code so that popups did not occur. I would imagine this could be done in your case.

    Most of the applications that I run are wrapped inside a Python script, which gives me a better programming language than something like DOS batch files. VBS or something else could also be used. I had also looked into sendkeys, but I had a difficult time getting it to work because there was something different about the window station environment (a popup occurs, but it does not actually exist). Although sendkeys worked when running the application locally, it would not work when executed via Condor.

    A couple of other ideas: evaluate the CPU usage of the exe task, and if it falls below a threshold and remains there for a certain duration, kill it. You can also set a maximum runtime for a Condor job and kill the job if this is exceeded. Although these methods work, in my opinion the best method is to add a switch or something similar that allows error messages to be sent to STDOUT instead of a popup. There may be a better way, but this is what I did in the past.
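
The CPU-threshold and maximum-runtime ideas above could be sketched in a Python wrapper roughly as follows. This is a minimal illustration, not the wrapper actually used: the names idle_too_long and run_with_deadline are hypothetical, and how the CPU readings are collected is deliberately left out (the function only decides what to do with readings it is given).

```python
import subprocess
import sys


def idle_too_long(cpu_samples, threshold, max_idle_samples):
    """Return True if the most recent max_idle_samples readings are all below threshold.

    cpu_samples is a list of CPU-usage readings (oldest first) taken at a
    fixed interval, so max_idle_samples stands in for "a certain duration".
    """
    if max_idle_samples <= 0:
        raise ValueError("max_idle_samples must be positive")
    if len(cpu_samples) < max_idle_samples:
        return False  # not enough history yet to decide
    return all(s < threshold for s in cpu_samples[-max_idle_samples:])


def run_with_deadline(argv, max_seconds):
    """Run argv and kill it if it runs longer than max_seconds.

    Returns (exit_code, timed_out).
    """
    proc = subprocess.Popen(argv)
    try:
        return proc.wait(timeout=max_seconds), False
    except subprocess.TimeoutExpired:
        proc.kill()   # the job exceeded its maximum runtime
        proc.wait()   # reap the process so it releases its resources
        return proc.returncode, True
```

For example, run_with_deadline(["myjob.exe"], 3600) (a made-up command and limit) would give the job one hour before killing it. A polling loop that appends readings to a list and calls idle_too_long after each one would cover the low-CPU case.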


    mike





    From: Derrick Karimi <derrick.karimi@xxxxxxxxx>
    To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
    Date: 05/18/2011 07:13 AM
    Subject: [Condor-users] Suppress Windows error dialogs popping up for crashing Condor jobs
    Sent by: condor-users-bounces@xxxxxxxxxxx



    Hi,

    I am working on fault tolerance on our system. When our jobs run, sometimes they crash. I told the developers to fix the code, but they told me to rerun the job because they can't reproduce the problem... I will work on their attitude later.

    My problem was Windows popping up various error reporting and crash dialogs. When the dialog pops up, the process won't exit until the user clicks OK, and eventually Condor will restart the job. The first process is still holding resources and the second process keeps failing. After mucking with 4 different places in the registry and UI on XP, Vista and 7 (as well as every place in the UI where I could control error reporting, and disabling the error reporting service), I was still seeing popups. I started using the Windows SetErrorMode function, which in practice only worked for me on Windows 7 and Vista. I was still seeing a popup, 'Application Error, memory could not be "read"', on a simple null value dereference.

    Finally I came across the article
    http://support.microsoft.com/kb/128642
    which tells you to set in the registry:
    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Windows\ErrorMode = 2

    This seems to suppress the failure dialog on the XP systems.
    As a note: I am still not sure whether you also need to disable the Dr. Watson debugger, but I have done that on the way to finding this solution.
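
If editing the registry by hand is not an option, the same value could be written by a small configuration script. The sketch below uses Python's standard winreg module and the key and value named above; the function name set_system_error_mode is hypothetical, the script is assumed to run with administrator rights, and whether a restart is needed for the change to take effect is not covered here.

```python
import sys

# Key and value taken from the registry path quoted above (under HKEY_LOCAL_MACHINE).
ERROR_MODE_KEY = r"SYSTEM\CurrentControlSet\Control\Windows"
ERROR_MODE_VALUE = "ErrorMode"
NO_DIALOGS = 2  # the setting reported in this thread to suppress the dialog on XP


def set_system_error_mode(mode=NO_DIALOGS):
    """Write the system-wide ErrorMode value. Windows only; needs admin rights."""
    if sys.platform != "win32":
        raise OSError("the ErrorMode registry value exists only on Windows")
    import winreg  # standard library, available only on Windows
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, ERROR_MODE_KEY, 0,
                        winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, ERROR_MODE_VALUE, 0, winreg.REG_DWORD, mode)
```

Because this changes behavior for every process on the machine, it is the system-wide option rather than a per-job one.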

    --Derrick
    _______________________________________________
    Condor-users mailing list
    To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
    subject: Unsubscribe
    You can also unsubscribe by visiting
    https://lists.cs.wisc.edu/mailman/listinfo/condor-users

    The archives can be found at:
    https://lists.cs.wisc.edu/archive/condor-users/







