[Condor-users] sched crash on dag removal on windows
- Date: Wed, 11 Jan 2006 12:54:36 +0100
- From: "Horvatth Szabolcs" <szabolcs@xxxxxxxxxxxxx>
- Subject: [Condor-users] sched crash on dag removal on windows
Hi,
I sent the following mail to condor-admin, but besides the automatic response (case #13153)
I have received no reply for quite a few days.
Since it is still a major problem for me, I am posting it here too; maybe someone
knows how to avoid the problem.
Cheers,
Szabolcs
--
I tried 6.7.14 to test whether my dag problems had been fixed, but had no luck.
When I removed a dag job from the queue, all of its child jobs were left in the queue
and I received the usual scheduler crash message.
(I kept quite a few crash messages from the past and found a few with similar log-file errors,
so the log path issue is not specific to 6.7.14.)
---
"C:\Condor/bin/condor_schedd.exe" on "snoopy.digicpictures.local" died due to exception ACCESS_VIOLATION.
Condor will automatically restart this process in 10 seconds.
*** Last 20 line(s) of file SchedLog:
1/5 11:29:19 IO: Failed to read packet header
1/5 11:29:19 IO: Failed to read packet header
1/5 11:29:20 IO: Failed to read packet header
1/5 11:29:24 DaemonCore: Command received via TCP from host <192.168.0.71:3595>
1/5 11:29:24 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
1/5 11:29:24 UserLog::initialize: fopen("X:\temp\CondorJobs\1136456862/x:/temp/CondorJobs/1136456862/_AfterEffects_render_090_PU_010_v001_1136456862.dag.dagman.log") failed - errno 22 (Invalid argument)
1/5 11:29:24 WARNING: Invalid user log file specified: X:\temp\CondorJobs\1136456862/x:/temp/CondorJobs/1136456862/_AfterEffects_render_090_PU_010_v001_1136456862.dag.dagman.log
1/5 11:29:26 IO: Failed to read packet header
1/5 11:29:30 IO: Failed to read packet header
1/5 11:29:38 IO: Failed to read packet header
1/5 11:29:39 IO: Failed to read packet header
1/5 11:29:44 IO: Failed to read packet header
1/5 11:29:44 IO: Failed to read packet header
1/5 11:29:46 DaemonCore: Command received via TCP from host <192.168.0.71:3602>
1/5 11:29:46 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
1/5 11:29:46 IO: Failed to read packet header
1/5 11:29:46 DaemonCore: Command received via TCP from host <192.168.0.71:3604>
1/5 11:29:46 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
1/5 11:29:46 DaemonCore: Command received via TCP from host <192.168.0.71:3605>
1/5 11:29:46 DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
*** End of file SchedLog
*** Last entry in core file core.SCHEDD.WIN32
==============================
Exception code: C0000005 ACCESS_VIOLATION
Fault address: 0040A8D6 01:000098D6 C:\Condor\bin\condor_schedd.exe
Registers:
EAX:00923A14
EBX:00000000
ECX:00019AE4
EDX:00000002
ESI:0000000E
EDI:00019AE4
CS:EIP:001B:0040A8D6
SS:ESP:0023:001292C4 EBP:001292C8
DS:0023 ES:0023 FS:003B GS:0000
Flags:00010206
Call stack:
Address Frame
0040A8D6 001292C8 DestroyProc+1EB
0040A85C 001293F0 DestroyProc+171
004143D0 00129658 Scheduler::actOnJobs+C28
0047599C 0012C0E0 DaemonCore::HandleReq+15E5
0047439A 0012D108 DaemonCore::ServiceCommandSocket+D5
00414425 0012D368 Scheduler::actOnJobs+C7D
0047599C 0012FDF0 DaemonCore::HandleReq+15E5
004741DD 0012FE34 DaemonCore::Driver+977
0047BDB4 0012FF68 dc_main+A4C
0047BEC3 0012FF80 main+CE
004A1A5E 00000001 mainCRTStartup+C5
*** End of file core.SCHEDD.WIN32
---
This part of the SchedLog looks suspicious: the path passed to fopen seems to be a bad concatenation of the same
root directory with itself, once with backslashes and once with forward slashes:
X:\temp\CondorJobs\1136456862/x:/temp/CondorJobs/1136456862/_AfterEffects_render_090_PU_010_v001_1136456862.dag.dagman.log
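To illustrate how a string like this could come about, here is a minimal Python sketch. It is only a guess at the mechanism, not the actual schedd code: the function names are hypothetical, and I am assuming the job's initial working directory is X:\temp\CondorJobs\1136456862. If the absolute-path check only looks for a leading slash and ignores a drive letter followed by a forward slash, the log path gets appended to the working directory and produces exactly the string from the log. A drive-aware check (Python's ntpath module here) leaves the path alone.

```python
import ntpath

# Assumed initial working directory of the dag job (not confirmed by the logs).
IWD = r"X:\temp\CondorJobs\1136456862"
# The "log" value from the submit file.
LOG = ("x:/temp/CondorJobs/1136456862/"
       "_AfterEffects_render_090_PU_010_v001_1136456862.dag.dagman.log")


def naive_full_path(iwd, log):
    """Hypothetical check: a path is absolute only if it starts with a slash."""
    if log.startswith("\\") or log.startswith("/"):
        return log
    # "x:/..." is not recognized as absolute, so it is appended to the iwd.
    return iwd + "/" + log


def drive_aware_full_path(iwd, log):
    """Treat 'x:/...' and 'x:\\...' as absolute, as Windows itself does."""
    if ntpath.isabs(log):
        return log
    return ntpath.join(iwd, log)


if __name__ == "__main__":
    print(naive_full_path(IWD, LOG))        # the doubled path from SchedLog
    print(drive_aware_full_path(IWD, LOG))  # the log path, unchanged
```

If something like this is what happens, the invalid file name would explain the fopen failure with errno 22, although it does not by itself explain the ACCESS_VIOLATION in DestroyProc.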
This is the submit file of the dag job:
# Filename: x:/temp/CondorJobs/1136456862/_AfterEffects_render_090_PU_010_v001_1136456862.dag.condor.sub
# Generated by condor_submit_dag x:/temp/CondorJobs/1136456862/_AfterEffects_render_090_PU_010_v001_1136456862.dag
universe = scheduler
executable = C:\Condor\bin\condor_dagman.exe
getenv = True
output = x:/temp/CondorJobs/1136456862/_AfterEffects_render_090_PU_010_v001_1136456862.dag.lib.out
error = x:/temp/CondorJobs/1136456862/_AfterEffects_render_090_PU_010_v001_1136456862.dag.lib.out
log = x:/temp/CondorJobs/1136456862/_AfterEffects_render_090_PU_010_v001_1136456862.dag.dagman.log
remove_kill_sig = SIGUSR1
on_exit_remove = (ExitBySignal == false || ExitSignal =!= 9)
arguments = -f -l . -Debug 3 -Lockfile x:/temp/CondorJobs/1136456862/_AfterEffects_render_090_PU_010_v001_1136456862.dag.lock -Condorlog X:\temp\CondorJobs\1136456862\logs/Job.log -Dag x:/temp/CondorJobs/1136456862/_AfterEffects_render_090_PU_010_v001_1136456862.dag -Rescue x:/temp/CondorJobs/1136456862/_AfterEffects_render_090_PU_010_v001_1136456862.dag.rescue
environment =_CONDOR_DAGMAN_LOG=x:/temp/CondorJobs/1136456862/_AfterEffects_render_090_PU_010_v001_1136456862.dag.dagman.out|_CONDOR_MAX_DAGMAN_LOG=0
queue
I hope this helps in tracking down these problems, since handling dag jobs on Windows is a major pain.
Cheers,
Szabolcs