Re: [Condor-users] Lazy jobs that never really start running
- Date: Wed, 06 Jul 2005 14:54:21 +0200
- From: "Horvatth Szabolcs" <szabolcs@xxxxxxxxxxxxx>
- Subject: Re: [Condor-users] Lazy jobs that never really start running
>You're transferring dagman itself? Why?
I use the default condor_submit_dag command, and that creates a submit file that
transfers the executable by default. At least that is how it seems to me...
>condor_status reports what the *collector* says. this is always
>delayed (or plain inaccurate if there are problems with a machine as
>it tends to fail to report the right thing).
I see. And how can I get the *real* computer info?
>The machine may be losing track of the shadows.
>How about the MasterLog (reports of processes dying and the like)?
I don't see anything like that. It looks OK.
>Does a condor_reconfig do the same?
No, reconfig does not fix the problem.
>How about net stop condor/net start condor?
I tried that, but the condor process could not be stopped (that's why I had to restart
the machine). I was kinda surprised that the jobs went along nicely, except for a DAG job
that "forgot" to submit its child tasks after it completed.
BTW, when I finally removed the jobs from the queue (I used leave_in_queue = True to be able to
restart jobs) I got the following error mail
(and it happens once in a while...):
---
To: szabolcs.horvatth@xxxxxxxxxxxxxxxxx
From: SYSTEM@snoopy
Subject: [Condor] Problem
This is an automated email from the Condor system
on machine "snoopy.digicpictures.local". Do not reply.
"C:\Condor/bin/condor_schedd.exe" on "snoopy.digicpictures.local" died due to exception ACCESS_VIOLATION.
Condor will automatically restart this process in 10 seconds.
*** Last 20 line(s) of file SchedLog:
7/6 14:37:17 Writing record to user logfile=//sv/rendertest/ch/logs/_dagLog.log owner=szabolcs
7/6 14:37:17 init_user_ids: want user 'szabolcs@DIGICPICTURES', current is '(null)@(null)'
7/6 14:37:17 init_user_ids: Already have handle for szabolcs@DIGICPICTURES, so returning.
7/6 14:37:17 TokenCache contents:
szabolcs@DIGICPICTURES
7/6 14:37:17 ENABLE_USERLOG_LOCKING is undefined, using default value of True
7/6 14:37:17 TokenCache contents:
szabolcs@DIGICPICTURES
7/6 14:37:18 KEEP_OUTPUT_SANDBOX is undefined, using default value of False
7/6 14:37:18 Saving classad to history file
7/6 14:37:18 Writing record to user logfile=//sv/rendertest/g/logs/_dagLog.log owner=szabolcs
7/6 14:37:18 init_user_ids: want user 'szabolcs@DIGICPICTURES', current is '(null)@(null)'
7/6 14:37:18 init_user_ids: Already have handle for szabolcs@DIGICPICTURES, so returning.
7/6 14:37:18 TokenCache contents:
szabolcs@DIGICPICTURES
7/6 14:37:18 ENABLE_USERLOG_LOCKING is undefined, using default value of True
7/6 14:37:18 TokenCache contents:
szabolcs@DIGICPICTURES
7/6 14:37:18 KEEP_OUTPUT_SANDBOX is undefined, using default value of False
7/6 14:37:18 Saving classad to history file
*** End of file SchedLog
*** Last entry in core file core.SCHEDD.WIN32
==============================
Exception code: C0000005 ACCESS_VIOLATION
Fault address: 0040940D 01:0000840D C:\Condor\bin\condor_schedd.exe
Registers:
EAX:00D34E6C
EBX:00000000
ECX:00002E24
EDX:7FFE0304
ESI:000001D4
EDI:00002E24
CS:EIP:001B:0040940D
SS:ESP:0023:001292C4 EBP:001292C8
DS:0023 ES:0023 FS:003B GS:0000
Flags:00010206
Call stack:
Address Frame
0040940D 001292C8 DestroyProc+1E5
00409393 001293F0 DestroyProc+16B
00412E7B 00129658 Scheduler::actOnJobs+C30
0046D5CA 0012C0E0 DaemonCore::HandleReq+15D9
0046BFD4 0012D108 DaemonCore::ServiceCommandSocket+CA
00412EC8 0012D368 Scheduler::actOnJobs+C7D
0046D5CA 0012FDF0 DaemonCore::HandleReq+15D9
0046BE1F 0012FE30 DaemonCore::Driver+918
004734A4 0012FF68 dc_main+A44
004735B3 0012FF80 main+CE
00496BCD 00000001 mainCRTStartup+C5
*** End of file core.SCHEDD.WIN32
---
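
In case it helps anyone reading the dump above: a minimal Python sketch that splits the "Call stack" lines into address, frame pointer, symbol and offset, and subtracts the offset to get the start address of each function. The names (FRAME_RE, parse_call_stack, SAMPLE) are made up for illustration and are not part of Condor; it only assumes the "ADDRESS FRAME Symbol+OFFSET" layout seen in the mail, with all numbers in hex.

```python
import re

# One frame line in the dump looks like: "0040940D 001292C8 DestroyProc+1E5"
# (return address, frame pointer, symbol name, hex offset into the symbol).
FRAME_RE = re.compile(
    r"^([0-9A-Fa-f]{8})\s+([0-9A-Fa-f]{8})\s+(\S+?)\+([0-9A-Fa-f]+)$"
)

def parse_call_stack(text):
    """Return a list of dicts, one per frame line found in text."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if not match:
            continue  # skips headers such as "Address Frame"
        address, frame, symbol, offset = match.groups()
        addr = int(address, 16)
        off = int(offset, 16)
        frames.append({
            "address": addr,
            "frame": int(frame, 16),
            "symbol": symbol,
            "offset": off,
            # Start of the function = address minus offset into it.
            "function_start": addr - off,
        })
    return frames

# First three frames copied from the crash report above.
SAMPLE = """Address Frame
0040940D 001292C8 DestroyProc+1E5
00409393 001293F0 DestroyProc+16B
00412E7B 00129658 Scheduler::actOnJobs+C30
"""

frames = parse_call_stack(SAMPLE)
for f in frames:
    print("%-24s starts at %08X" % (f["symbol"], f["function_start"]))
```

Both DestroyProc frames resolve to the same start address, so the top two entries are the same function calling itself, reached from Scheduler::actOnJobs.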
Cheers,
Szabolcs