[condor-users] Dagman stalling with shadow exception messages?
- Date: Tue, 6 Apr 2004 21:28:20 -0700
- From: "Michael S. Root" <mike@xxxxxxxxxxxxxx>
- Subject: [condor-users] Dagman stalling with shadow exception messages?
Hi all,

I'm trying to track down the cause of this intermittent problem
we're having with Condor 6.6.1 on RedHat 8. For some reason, every once
in a while, a dagman job will get "stuck" in a state where it can no
longer submit any jobs to any render machines.
In the last three months we've run well over 500 dagman jobs here (each
with as many as 1600 individual jobs), and things generally work pretty
well. I'd say this has happened on at most 5% of the dagman jobs. When
it does happen, it happens after many of the jobs in the dag have already
run, and when there are definitely resources available that should match
the remaining jobs. There are no dag dependencies that it could be
waiting on. There's plenty of RAM and disk space on both the submit host
and the render hosts. The only workaround seems to be to delete the dag
job from the queue and re-submit the remaining jobs (which then proceed to
run fine).
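(For reference, the clean-up amounts to roughly the two commands below. The
dagman cluster id and the dag file name are placeholders, and I'm assuming the
re-submit goes through the rescue dag that dagman leaves behind when its job
is removed:)

    # remove the stuck dagman job (its node jobs go with it)
    condor_rm <dagman_cluster_id>
    # re-submit whatever hadn't finished yet, via the rescue dag
    condor_submit_dag mydag.dag.rescue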
Anyone else have a problem like this, or have we uncovered an obscure bug
in dagman?
Thanks for any help....
-Mike
Here are snippets from some logs which may be relevant:
----dagman.out (this line repeated many times...):
4/6 21:00:24 Event: ULOG_SHADOW_EXCEPTION for Condor Job
st006_comp_tk25__296-300 (22190.0.0)
-- ShadowLog on submit host:
4/6 21:00:27 Initializing a VANILLA shadow
4/6 21:00:27 (22190.0) (7173): Request to run on <192.168.1.111:32771> was
ACCEPTED
4/6 21:00:27 (22190.0) (7173): ERROR "Can no longer talk to condor_starter
on execute machine (192.168.1.111)" at line 63 in file NTreceivers.C
----------------------
Note: The above message is repeated for any render host that gets matched,
and the hosts are definitely up and visible to the submit host. In
addition, that same render host will happily render other jobs from other
dags in other people's queues.
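(Roughly what I check from the submit host to verify that; render111 here is
just a stand-in for the machine at 192.168.1.111:)

    ping -c 1 render111
    condor_status render111    # vm1 shows up and reports its State/Activity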
-- StartLog on render host:
4/6 21:00:02 DaemonCore: Command received via UDP from host
<192.168.1.88:36926>
4/6 21:00:02 DaemonCore: received command 403 (DEACTIVATE_CLAIM), calling
handler (command_handler)
4/6 21:00:02 vm1: Called deactivate_claim()
4/6 21:00:02 Starter pid 16818 exited with status 0
4/6 21:00:02 vm1: State change: starter exited
4/6 21:00:02 vm1: Changing activity: Busy -> Idle
4/6 21:00:02 DaemonCore: Command received via TCP from host
<192.168.1.88:45808>
4/6 21:00:02 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY),
calling handler (command_handler)
4/6 21:00:02 vm1: Called deactivate_claim_forcibly()
4/6 21:00:02 DaemonCore: Command received via UDP from host
<192.168.1.88:36926>
4/6 21:00:02 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_handler)
4/6 21:00:02 vm1: State change: received RELEASE_CLAIM command
4/6 21:00:02 vm1: Changing state and activity: Claimed/Idle ->
Preempting/Vacating
4/6 21:00:02 vm1: State change: No preempting claim, returning to owner
4/6 21:00:02 vm1: Changing state and activity: Preempting/Vacating ->
Owner/Idle
4/6 21:00:02 vm1: State change: IS_OWNER is false
4/6 21:00:02 vm1: Changing state: Owner -> Unclaimed
4/6 21:00:02 DaemonCore: Command received via UDP from host
<192.168.1.88:36926>
4/6 21:00:02 DaemonCore: received command 443 (RELEASE_CLAIM), calling
handler (command_handler)
4/6 21:00:02 Error: can't find resource with capability
(<192.168.1.111:32771>#7698602094)
----------------------
Note: That last line puzzles me. I don't know what the #7698602094 refers
to.