
Re: [Condor-users] Lazy jobs that never really start running



>You're transferring dagman itself? Why?

I use the default condor_submit_dag command, and that creates a submit file that
transfers the executable by default. At least it seems so to me...


>condor_status reports what the *collector* says. this is always
>delayed (or plain inaccurate if there are problems with a machine as
>it tends to fail to report the right thing).

I see. And how can I get the *real* computer info?


>The machine may be losing track of the shadows.
 
>How about the MasterLog (reports of processes dying and the like)?

I don't see anything like that. It looks OK.

 
>Does a condor_reconfig do the same?

No, reconfig does not fix the problem.


>How about net stop condor/net start condor?

I tried that, but the condor process could not be stopped (that's why I had to restart
the machine). I was kinda surprised that the jobs went along nicely, except for a DAG job
that "forgot" to submit its child tasks after it completed.


BTW, when I finally removed the jobs from the queue (I used leave_in_queue = True to be able to
restart jobs), I got the following error mail:
(And it happens once in a while...)

---
To: szabolcs.horvatth@xxxxxxxxxxxxxxxxx
From: SYSTEM@snoopy
Subject: [Condor] Problem

This is an automated email from the Condor system
on machine "snoopy.digicpictures.local".  Do not reply.

"C:\Condor/bin/condor_schedd.exe" on "snoopy.digicpictures.local" died due to exception ACCESS_VIOLATION.

Condor will automatically restart this process in 10 seconds.

*** Last 20 line(s) of file SchedLog:
7/6 14:37:17 Writing record to user logfile=//sv/rendertest/ch/logs/_dagLog.log owner=szabolcs

7/6 14:37:17 init_user_ids: want user 'szabolcs@DIGICPICTURES', current is '(null)@(null)'
7/6 14:37:17 init_user_ids: Already have handle for szabolcs@DIGICPICTURES, so returning.
7/6 14:37:17 TokenCache contents: 
szabolcs@DIGICPICTURES
7/6 14:37:17 ENABLE_USERLOG_LOCKING is undefined, using default value of True
7/6 14:37:17 TokenCache contents: 
szabolcs@DIGICPICTURES
7/6 14:37:18 KEEP_OUTPUT_SANDBOX is undefined, using default value of False
7/6 14:37:18 Saving classad to history file
7/6 14:37:18 Writing record to user logfile=//sv/rendertest/g/logs/_dagLog.log owner=szabolcs

7/6 14:37:18 init_user_ids: want user 'szabolcs@DIGICPICTURES', current is '(null)@(null)'
7/6 14:37:18 init_user_ids: Already have handle for szabolcs@DIGICPICTURES, so returning.
7/6 14:37:18 TokenCache contents: 
szabolcs@DIGICPICTURES
7/6 14:37:18 ENABLE_USERLOG_LOCKING is undefined, using default value of True
7/6 14:37:18 TokenCache contents: 
szabolcs@DIGICPICTURES
7/6 14:37:18 KEEP_OUTPUT_SANDBOX is undefined, using default value of False
7/6 14:37:18 Saving classad to history file
*** End of file SchedLog

*** Last entry in core file core.SCHEDD.WIN32

==============================
Exception code: C0000005 ACCESS_VIOLATION
Fault address:  0040940D 01:0000840D C:\Condor\bin\condor_schedd.exe

Registers:
EAX:00D34E6C
EBX:00000000
ECX:00002E24
EDX:7FFE0304
ESI:000001D4
EDI:00002E24
CS:EIP:001B:0040940D
SS:ESP:0023:001292C4  EBP:001292C8
DS:0023  ES:0023  FS:003B  GS:0000
Flags:00010206

Call stack:
Address   Frame
0040940D  001292C8  DestroyProc+1E5
00409393  001293F0  DestroyProc+16B
00412E7B  00129658  Scheduler::actOnJobs+C30
0046D5CA  0012C0E0  DaemonCore::HandleReq+15D9
0046BFD4  0012D108  DaemonCore::ServiceCommandSocket+CA
00412EC8  0012D368  Scheduler::actOnJobs+C7D
0046D5CA  0012FDF0  DaemonCore::HandleReq+15D9
0046BE1F  0012FE30  DaemonCore::Driver+918
004734A4  0012FF68  dc_main+A44
004735B3  0012FF80  main+CE
00496BCD  00000001  mainCRTStartup+C5

*** End of file core.SCHEDD.WIN32
---


Cheers,
Szabolcs