Hi,
I have these flags set for STARTD_DEBUG:
D_COMMAND D_FULLDEBUG D_MACHINE D_LOAD D_MATCH D_JOB D_ACCOUNTANT
About the ProcAPI error: yes, it's there. Here is a clip from the log.
It is sporadic, and appears to occur in bunches, so I missed it until
you asked.
/**
5/10 14:54:33 vm1: Computing percent CPU usage with pids: 5371 5373 5378 5421 5424 5425 5631 5632
5/10 14:54:33 ProcAPI: pid 5421 does not exist.
5/10 14:54:33 ProcAPI::getProcSetInfo: Pid 5421 does not exist, ignoring.
5/10 14:54:33 ProcAPI: pid 5424 does not exist.
5/10 14:54:33 ProcAPI::getProcSetInfo: Pid 5424 does not exist, ignoring.
5/10 14:54:33 ProcAPI: pid 5425 does not exist.
5/10 14:54:33 ProcAPI::getProcSetInfo: Pid 5425 does not exist, ignoring.
5/10 14:54:33 ProcAPI: pid 5631 does not exist.
5/10 14:54:33 ProcAPI::getProcSetInfo: Pid 5631 does not exist, ignoring.
5/10 14:54:33 ProcAPI: pid 5632 does not exist.
5/10 14:54:33 ProcAPI::getProcSetInfo: Pid 5632 does not exist, ignoring.
**/
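(A note on what that message means: a /proc-based monitor decides a pid
"does not exist" once its /proc entry has vanished. Below is a minimal
sketch of such a check - an illustration of the general technique, not
Condor's actual ProcAPI code.)
/**
#include <cstdio>
#include <sys/stat.h>

// A pid "does not exist" for a /proc-based monitor once /proc/<pid>
// is gone, i.e. the process exited between the time the pid list was
// built and the time the monitor sampled it - which would explain why
// the messages come in bunches right after short-lived children exit.
bool pid_exists(int pid) {
    char path[64];
    std::snprintf(path, sizeof(path), "/proc/%d", pid);
    struct stat sb;
    return ::stat(path, &sb) == 0;   // fails with ENOENT once the pid is gone
}
**/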
I came across the same error in another context. Here is the log,
though I am not sure whether it is relevant to the problem.
/**
5/2 18:08:21 Load avg: 0.00 0.00 0.01
5/2 18:08:21 vm1: Computing percent CPU usage with pids: 5976 5978 5983 6026 6029
5/2 18:08:21 vm1: Percent CPU usage for those pids is: 0.000000
5/2 18:08:21 vm1: LoadQueue: Adding 5 entries of value 0.000000
5/2 18:08:21 vm1: LoadQueue: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
5/2 18:08:21 vm1: LoadQueue: Size: 60 Avg value: 0.00 Share of system load: 0.00
5/2 18:08:21 vm2: Computing percent CPU usage with pids: 5974 5975
5/2 18:08:21 vm2: Percent CPU usage for those pids is: 0.000000
5/2 18:08:21 vm2: LoadQueue: Adding 5 entries of value 0.000000
5/2 18:08:21 vm2: LoadQueue: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
5/2 18:08:21 vm2: LoadQueue: Size: 60 Avg value: 0.00 Share of system load: 0.00
5/2 18:08:21 SystemLoad: 0.000 TotalCondorLoad: 0.000 TotalOwnerLoad: 0.000
5/2 18:08:21 vm1: SystemLoad: 0.00 CondorLoad: 0.00 OwnerLoad: 0.00
5/2 18:08:21 vm2: SystemLoad: 0.00 CondorLoad: 0.00 OwnerLoad: 0.00
5/2 18:08:21 vm2: State change: PREEMPT is TRUE
5/2 18:08:21 vm2: State change: WANT_VACATE is TRUE
5/2 18:08:21 vm2: In Starter::kill() with pid 5974, sig 101 (DC_SIGCONTINUE)
5/2 18:08:21 STARTD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
5/2 18:08:21 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
5/2 18:08:21 vm2: Canceled claim timer (134)
5/2 18:08:21 vm2: Changing state and activity: Claimed/Suspended -> Preempting/Vacating
5/2 18:08:21 vm2: In Starter::kill() with pid 5974, sig 15 (SIGTERM)
5/2 18:08:21 STARTD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
5/2 18:08:21 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
5/2 18:08:21 DaemonCore: No more children processes to reap.
5/2 18:08:21 Starter pid 5974 exited with status 0
5/2 18:08:21 ProcAPI::buildFamily failed: parent 5974 not found on system.
5/2 18:08:21 ProcAPI: pid 5974 does not exist.
5/2 18:08:21 ProcAPI: pid 5975 does not exist.
5/2 18:08:21 vm2: State change: starter exited
5/2 18:08:21 Entered vacate_client <192.168.55.100:34034> mcf000.tifr-mcfarm...
5/2 18:08:21 STARTD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
5/2 18:08:21 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
5/2 18:08:21 vm2: State change: No preempting claim, returning to owner
5/2 18:08:21 vm2: Changing state and activity: Preempting/Vacating -> Owner/Idle
5/2 18:08:21 vm2: State change: IS_OWNER is false
5/2 18:08:21 vm2: Changing state: Owner -> Unclaimed
5/2 18:08:21 DaemonCore: No more children processes to reap.
**/
I checked the list of pids that appear in the startd log:
5/10 12:09:18 vm2: Computing percent CPU usage with pids: 23718 23720 23725 24346 24404 24405 24409 26504 27293 27294 27296 1238
/** clippings from "top"
23718 root    15   0  1964 1964  1552 S     0.0  0.0   0:10   0 condor_starter
23720 samgrid 25  10   272  272   228 S N   0.0  0.0   0:00   1 time
23725 samgrid 25  10  1068 1068   900 S N   0.0  0.0   0:00   1 sandbox_mgr
24346 samgrid 25  10  1104 1104   912 S N   0.0  0.0   0:00   1 jim_rapper.sh
24404 samgrid 25  10   344  344   288 S N   0.0  0.0   0:00   1 time
24405 samgrid 25  10  1136 1136   924 S N   0.0  0.0   0:00   1 mc_sam_rapper.s
24409 samgrid 29  10  1060 1060   904 S N   0.0  0.0   0:00   0 accelerator.sh
26504 samgrid 25  10 13684  13M  2248 S N   0.0  0.6   0:03   0 python
27293 samgrid 25  10   988  988   852 S N   0.0  0.0   0:00   0 sh
27294 samgrid 25  10  1016 1016   876 S N   0.0  0.0   0:00   0 Tata20737488711
27296 samgrid 25  10  9272 9272  1744 S N   0.0  0.4   0:00   1 python
 1238 samgrid 29  10   504  504   428 S N   0.0  0.0   0:00   0 sleep
**/
But I notice that the pids listed against vm2 (shown above) are all
related to the user process running on vm1. The one big user process
(shown below) that takes up 100% of vm1 is not listed in the log
against vm1:
/**
27356 samgrid 35  10  156M 156M 28424 R N  50.0  7.7 600:31   1 d0gstar2.x
**/
This does not seem right, because on the node with the older Linux the
log shows about 20 pids (for one user job) against the corresponding
vm. On the problem nodes, Condor seems to be watching only about half
of the user processes that "top" lists.
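One plausible mechanism for pids going missing, sketched below: a
/proc-based monitor typically builds its watch list by following each
process's parent pid back to the starter. If an intermediate parent (a
wrapper script, say) exits, its children are reparented to init, the
chain breaks, and they silently drop out of the list. This is only an
illustration of the general approach (the function names here are
hypothetical), not Condor's actual ProcAPI::buildFamily code:
/**
#include <cstdio>
#include <cstdlib>
#include <dirent.h>
#include <set>

// Illustration only: collect the descendants of 'root' by scanning
// /proc/<pid>/stat, whose fourth field is the parent pid.  A process
// whose parent has already exited is reparented to init, so its chain
// back to 'root' breaks and it silently falls out of the family.
static int parent_of(int pid) {
    char path[64], comm[256];
    char state;
    int self, ppid = -1;
    std::snprintf(path, sizeof(path), "/proc/%d/stat", pid);
    FILE* f = std::fopen(path, "r");
    if (!f) return -1;
    // Naive parse of "<pid> (<comm>) <state> <ppid> ..."; a comm that
    // contains spaces would need more careful handling than %s.
    if (std::fscanf(f, "%d %255s %c %d", &self, comm, &state, &ppid) != 4)
        ppid = -1;
    std::fclose(f);
    return ppid;
}

std::set<int> family_of(int root) {
    std::set<int> fam = { root };
    bool grew = true;
    while (grew) {                     // repeat until no new members appear
        grew = false;
        DIR* d = opendir("/proc");
        if (!d) break;
        for (dirent* e; (e = readdir(d)) != nullptr; ) {
            char* end;
            long pid = std::strtol(e->d_name, &end, 10);
            if (*end != '\0' || fam.count((int)pid)) continue;
            if (fam.count(parent_of((int)pid))) {  // parent already in family
                fam.insert((int)pid);
                grew = true;
            }
        }
        closedir(d);
    }
    return fam;
}
**/
If d0gstar2.x's parent chain no longer reaches the starter's pid, it
would drop out of the watched list exactly as observed above.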
I hope I am reporting all that is relevant, though this mail has become
rather long.
Thank you sincerely.
Nagaraj
Dan Bradley wrote:
Nagaraj,
Condor reads /proc/<pid>/stat to get the CPU utilization of a process.
I've observed that this works under Scientific Linux 3, so I don't know
why it is reporting no load for your processes.
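(As a rough illustration of what that read involves - this is not
Condor's code, just the standard Linux /proc layout, where fields 14
and 15 of /proc/<pid>/stat hold cumulative user and system time in
clock ticks:)
/**
#include <cstdio>

// Illustration: read cumulative user+system CPU time (in clock ticks)
// for one pid from /proc/<pid>/stat; fields 14 and 15 are utime and
// stime.  Percent CPU is the delta between two such samples divided
// by the elapsed ticks.  Returns false if the pid is gone or the
// parse fails - the case logged as "pid does not exist" above.
bool cpu_ticks(int pid, unsigned long& total) {
    char path[64];
    std::snprintf(path, sizeof(path), "/proc/%d/stat", pid);
    FILE* f = std::fopen(path, "r");
    if (!f) return false;
    unsigned long utime = 0, stime = 0;
    // Skip pid, comm, state and fields 4-13, then read utime/stime.
    // (Naive: a comm containing spaces would throw this off.)
    int n = std::fscanf(f,
        "%*d %*s %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
        &utime, &stime);
    std::fclose(f);
    total = utime + stime;
    return n == 2;
}
**/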
Looking at the code that reads /proc/<pid>/stat, I see that some
failures to read the file are only reported if you add D_FULLDEBUG to
your STARTD_DEBUG list. Could you please do that and see if there are
errors containing the string "ProcAPI::getProcInfo()"?
Also, does the list of pids that it is checking make sense? In your
log below, there is a message "Computing percent CPU usage with pids:
3559 3561 3566 3871 3929 3930 3934 4450 5239 5240 5242 12775".
Presumably one or more of these was generating the cpu load. Can you
please verify that? The question I am wondering about is whether the
problem is the reading of process stat information or whether it is
the list of pids itself that is the trouble. If there are pids
associated with the job that are generating cpu load but which are not
in the list, that would cause a problem similar to what you are
observing. If you just use 'top' to see the top few pids and verify
that these are in the list being watched by Condor, that should be
enough to see if this is really the problem or not.
--Dan
Nagaraj Panyam wrote:
Hi Dan,
Yes, by "user process" I refer to a job run by Condor on behalf of a
user. There is no other user process, because the nodes are dedicated
compute nodes.
I did try the D_LOAD setting, but I could not make much out of the
output - the load values don't seem to agree with the Claimed/Busy
status. Here is a clipping from a node for which condor_status reports
"vm1 owner/idle loadav=1.000" and "vm2 claimed/busy loadav=0.000". I
request you to have a look at the output and see if you find a clue,
or please advise me what to look out for.
/****
5/9 11:34:33 Load avg: 1.00 1.00 0.96
5/9 11:34:33 vm1: LoadQueue: Adding 5 entries of value 0.000000
5/9 11:34:33 vm1: LoadQueue: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
5/9 11:34:33 vm1: LoadQueue: Size: 60 Avg value: 0.00 Share of system load: 0.00
5/9 11:34:33 vm2: Computing percent CPU usage with pids: 3559 3561 3566 3871 3929 3930 3934 4450 5239 5240 5242 12775
5/9 11:34:33 ProcAPI: new boottime = 1145611802; old_boottime = 1145611802; /proc/stat boottime = 1145611803; /proc/uptime boottime = 1145611802
5/9 11:34:33 vm2: Percent CPU usage for those pids is: 0.000000
5/9 11:34:33 vm2: LoadQueue: Adding 5 entries of value 0.000000
5/9 11:34:33 vm2: LoadQueue: 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
5/9 11:34:33 vm2: LoadQueue: Size: 60 Avg value: 0.00 Share of system load: 0.00
5/9 11:34:33 SystemLoad: 1.000 TotalCondorLoad: 0.000 TotalOwnerLoad: 1.000
5/9 11:34:33 vm1: SystemLoad: 1.00 CondorLoad: 0.00 OwnerLoad: 1.00
5/9 11:34:33 vm2: SystemLoad: 0.00 CondorLoad: 0.00 OwnerLoad: 0.00
***/
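Reading those LoadQueue lines: the startd appears to keep a sliding
window of the last 60 CPU-usage samples per vm and report their
average as that vm's Condor load. Since every sample pushed in above
is 0.000000, the average - and hence CondorLoad - stays pinned at
zero. A toy model of that bookkeeping, reconstructed from the log
messages rather than from Condor's code:
/**
#include <deque>
#include <numeric>

// Toy model of the LoadQueue in the log: a fixed-size window of recent
// per-vm CPU-usage samples whose average becomes the vm's Condor load.
class LoadQueue {
    std::deque<double> q_;
    static const size_t cap_ = 60;     // "Size: 60" in the log
public:
    void add(double v, int n = 5) {    // "Adding 5 entries of value ..."
        while (n-- > 0) {
            if (q_.size() == cap_) q_.pop_front();
            q_.push_back(v);
        }
    }
    double avg() const {               // "Avg value: 0.00"
        if (q_.empty()) return 0.0;
        return std::accumulate(q_.begin(), q_.end(), 0.0) / q_.size();
    }
};
// With nothing but 0.0 samples ever added, avg() stays 0.0, which is
// why CondorLoad never rises even though the system load is 1.0.
**/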
Just to summarize my earlier mail - something seems to have changed
with the OS upgrade, because with the older OS the same node (same
condor_config) ran a user process on each vm. Also, if I turn
Hyperthreading on for the same node, two vm's run user processes, and
two are stuck in the "Claimed/Busy" state.
Thanks a lot for helping me with this.
Nagaraj
Dan Bradley wrote:
Nagaraj,
Off the top of my head, I can't think of any reason for this change in
behavior. When you say "user process" you are referring to a job run
by Condor on behalf of a user, right? You are not talking about
processes run by users outside of Condor. Just want to be sure I
understand.
You could try adding D_LOAD in your STARTD_DEBUG settings. This will
show extra information about what's going on while monitoring system
load.
--Dan
On May 8, 2006, at 12:46 PM, P. Nagaraj wrote:
Hi,
We have dual cpu nodes in our condor pool. Some of these run an older
version of Linux (RH 7.1), and these take two user jobs as shown
below. There is a user process on each cpu here. CondorVersion is
6.6.5 and the platforms are all Intel/Linux.
vm1 Claimed/Busy/LoadAv=1.000/Mem=502
vm2 Claimed/Busy/LoadAv=1.020/Mem=502
When I upgrade the nodes (Scientific Linux 3), the behaviour changes,
condor_config being unchanged. The vm that is running the user process
shows up as vm2 below, while the vm that has no LoadAv shows up as
Busy.
vm1 Claimed/Busy/LoadAv=0.000/Mem=500
vm2 Owner/Idle/LoadAv=1.000/Mem=500
Referring to the vm1 just above, its classad attributes are as shown
below.
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
CondorLoadAvg = 0.000000
LoadAvg = 0.000000
TotalLoadAvg = 1.000000    -- this is from the other vm, which has LoadAv=1
TotalCondorLoadAvg = 0.000000
CpuBusyTime = 0
CpuIsBusy = FALSE
State = "Claimed"
Activity = "Busy"
Start = (((LoadAvg - CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" && State != "Owner"))
Requirements = START
Why is there this mismatch - CpuIsBusy = FALSE and Activity = "Busy"?
The load averages indicate that a process can start, but something
makes the Activity "Busy". How do I find out why one vm is always
Busy, which was not so before the OS upgrade?
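For what it's worth, plugging the quoted values into the CpuBusy
expression shows the attributes are internally consistent, so the
puzzle is where the load gets credited rather than the expression
itself. A worked evaluation (plain arithmetic on the values above, not
Condor code):
/**
#include <cstdio>

// Worked evaluation of the classad expressions quoted above, using
// the reported attribute values.
int main() {
    double LoadAvg = 0.0;         // LoadAvg = 0.000000
    double CondorLoadAvg = 0.0;   // CondorLoadAvg = 0.000000
    bool CpuBusy = (LoadAvg - CondorLoadAvg) >= 0.5;
    std::printf("CpuBusy = %s\n", CpuBusy ? "TRUE" : "FALSE");  // FALSE
    // So CpuIsBusy = FALSE is consistent with the CpuBusy expression;
    // the Activity = "Busy" label instead tracks the claim state (a
    // starter is running a job there).  The anomaly in the data is
    // that the job's load shows up against the other vm, leaving this
    // vm's LoadAvg at 0.0.
    return 0;
}
**/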
Similarly, if HT is enabled in the BIOS, there are 4 vm's on a node.
Two of these are Claimed/Busy (LoadAv=0) and the other two are
Owner/Idle (running the user processes, LoadAv=1).
Thanks in advance for any help on this
Nagaraj
--
+----------------------------------+--------------------------------------+
Nagaraj Panyam                     | Office tel: +91-22-22782610
Dept of High Energy Physics       | Office fax: +91-22-22804610
Tata Instt. of Fundamental Research| Home tel  : +91-22-22804936
Mumbai - 400 005, INDIA            | **Email** : pn@xxxxxxxxxxx
+----------------------------------+--------------------------------------+
_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users