Thanks for your input. I still need a check like the one you describe,
based on CPU usage and a kill time. In particular I want to guard
against infinite-loop bugs, or against a developer popping up a
message box on purpose. They know they aren't supposed to pop up an
error message explicitly in a job; they are supposed to call our
wrapped versions of all API calls, which route anything that would
produce a GUI through a switch that just logs when in Condor mode.
My problem was an unexpected crash that made Windows produce an
invalid memory access error message, and for some reason the
programmatic method of calling the Windows API SetErrorMode was missing
this one on Windows XP. The registry key I listed fixed that, but it
is a tough choice between telling the customer to edit their
registry and spending additional dev/test/doc time on a configuration
tool for them. If there is a way to keep that dialog from appearing
entirely from within the code on XP, I would love to know about it.
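For reference, here is a minimal sketch of how the SetErrorMode call could be made from a Python job wrapper through ctypes. The function name suppress_error_dialogs is only illustrative, and the sketch assumes the wrapper starts the real job as a child process, which inherits the error mode unless it is created with CREATE_DEFAULT_ERROR_MODE. As described above, this alone did not stop the dialog on XP, so treat it as a starting point rather than a fix.

```python
import ctypes
import sys

# Flag values as defined in the Windows SDK headers.
SEM_FAILCRITICALERRORS = 0x0001   # no critical-error-handler message box
SEM_NOGPFAULTERRORBOX = 0x0002    # no general-protection-fault dialog
SEM_NOOPENFILEERRORBOX = 0x8000   # no "file not found" message box

QUIET_FLAGS = (SEM_FAILCRITICALERRORS
               | SEM_NOGPFAULTERRORBOX
               | SEM_NOOPENFILEERRORBOX)


def suppress_error_dialogs():
    """Ask Windows not to show fault dialogs for this process.

    Returns the previous error mode, or None when not running on Windows.
    """
    if sys.platform != "win32":
        return None
    return ctypes.windll.kernel32.SetErrorMode(QUIET_FLAGS)
```

The wrapper would call suppress_error_dialogs() once, before launching the job executable.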
As for implementing your idea of automatically killing a long-running
job that is not using much CPU: do you implement this in Condor with a
periodic remove? Or do you implement it in a thread of the Condor job
via its Python wrapper?
--Derrick
On Wed, May 18, 2011 at 9:30 AM, Michael O'Donnell
<odonnellm@xxxxxxxx> wrote:
Derrick, I have run into similar problems, and generally this is
handled in the application. One thought is to check whether the
developers can add a switch that makes the program write the error to
STDOUT and exit with an error code instead of showing a popup message.
I was working on a numerical hydrologic model that was written by
someone else in Fortran, and they essentially had a popup that required
the user to click OK when the program completed successfully (as if you
would not know the program completed its analysis successfully).
Anyhow, I was able to change the underlying code so popups did not
occur. I would imagine this could be done in your case.
Most of the applications I run are wrapped inside a Python script,
which gives me a better programming language than something like DOS
batch files. VBS or something else could also be used. I had also
looked into sendkeys, but I had a difficult time getting it to work
because there was something different about the window station
environment (a popup occurs, but it does not actually exist), and
although sendkeys worked when running the application locally, it would
not work when executed via Condor.
A couple of other ideas involve monitoring the exe task. One is to
watch its CPU usage: if it falls below a threshold and remains there
for a certain duration, kill it. You can also set a maximum runtime for
a Condor job and kill it if this is exceeded. Although these methods
work, in my opinion the best method is to add a switch or something
similar that sends error messages to STDOUT instead of a popup. There
may be a better way, but this is what I did in the past.
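The two watchdog ideas above (maximum runtime and sustained low CPU) could be sketched in a Python wrapper roughly as follows. This is an illustration under assumptions, not the code actually used: run_with_watchdog and its parameters are hypothetical names, and sample_cpu is a caller-supplied function returning the job's recent CPU usage in percent (for example backed by a third-party library such as psutil), because the standard library has no portable way to read another process's CPU usage.

```python
import subprocess
import sys
import time


def run_with_watchdog(cmd, max_runtime, idle_threshold=1.0, idle_limit=300.0,
                      poll_interval=5.0, sample_cpu=None):
    """Run cmd; kill it if it runs too long or stays idle too long.

    sample_cpu is a hypothetical callable that takes the Popen object and
    returns recent CPU usage in percent. When it is None, only the
    max_runtime check is applied.
    Returns (exit_code, reason); reason is "exited", "timeout" or "idle".
    """
    proc = subprocess.Popen(cmd)
    start = time.monotonic()
    idle_since = None  # time at which CPU usage first dropped below threshold
    while True:
        code = proc.poll()
        if code is not None:
            return code, "exited"
        now = time.monotonic()
        reason = None
        if now - start > max_runtime:
            reason = "timeout"
        elif sample_cpu is not None:
            if sample_cpu(proc) < idle_threshold:
                if idle_since is None:
                    idle_since = now
                if now - idle_since >= idle_limit:
                    reason = "idle"
            else:
                idle_since = None  # busy again, reset the idle timer
        if reason is not None:
            proc.kill()
            proc.wait()
            return proc.returncode, reason
        time.sleep(poll_interval)
```

Note that proc.kill() terminates only the process the wrapper started; any children that process spawned would need separate handling.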
mike
From: Derrick Karimi <derrick.karimi@xxxxxxxxx>
To: Condor-Users Mail List <condor-users@xxxxxxxxxxx>
Date: 05/18/2011 07:13 AM
Subject: [Condor-users] Suppress Windows error dialogs popping up for
crashing Condor jobs
Sent by: condor-users-bounces@xxxxxxxxxxx
Hi,
I am working on fault tolerance on our system. When our jobs run, they
sometimes crash. I told the developers to fix the code, but they told
me to rerun the job because they can't reproduce the problem... I will
work on their attitude later.
My problem was Windows popping up various error reporting and crash
dialogs. When the dialog pops up, the process won't exit until the user
clicks OK, and eventually Condor will restart the job. The first
process is still holding resources, and the second process keeps
failing. After mucking with 4 different places in the registry and UI
on XP, Vista and 7 (as well as every place in the UI where I could
control error reporting, and disabling the error reporting service), I
was still seeing popups. I started using the Windows SetErrorMode
function, which in practice only worked for me on Windows 7 and Vista.
On XP I was still seeing a popup (Application Error, memory could not
be "read") on a simple null value dereference.
Finally I came across the article http://support.microsoft.com/kb/128642
which tells you to set in the registry:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Windows\ErrorMode = 2
This seems to suppress the failure dialog on the XP systems.
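If the registry change had to be scripted rather than done by hand, a small Python sketch using the standard winreg module might look like the following. The function and constant names are illustrative only; the sketch assumes ErrorMode is stored as a REG_DWORD, that the script runs with administrator rights, and that a reboot may be needed before the new value takes effect.

```python
import sys

# Key path and value name taken from the registry setting quoted above.
ERROR_MODE_KEY = r"SYSTEM\CurrentControlSet\Control\Windows"
ERROR_MODE_VALUE_NAME = "ErrorMode"
ERROR_MODE_NO_DIALOGS = 2


def set_system_error_mode(value=ERROR_MODE_NO_DIALOGS):
    """Write the ErrorMode value under HKEY_LOCAL_MACHINE.

    Returns True when the value was written, False when not on Windows.
    Raises OSError if the key cannot be opened (e.g. access denied).
    """
    if sys.platform != "win32":
        return False
    import winreg  # Windows-only standard library module
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, ERROR_MODE_KEY, 0,
                        winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, ERROR_MODE_VALUE_NAME, 0,
                          winreg.REG_DWORD, value)
    return True
```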
As a note: I am still not sure whether you also need to disable the
Dr. Watson debugger, but I did that on the way to finding this
solution.
--Derrick
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
<mailto:condor-users-request@xxxxxxxxxxx> with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/
--
--Derrick