Re: [Gems-users] panic and send mondo case


Date: Wed, 15 Oct 2008 16:20:14 +0200
From: "Daniel Sánchez Pedreño" <sanatox@xxxxxxxxx>
Subject: Re: [Gems-users] panic and send mondo case
Dear list,

I've been running some experiments with GEMS 2.1 using opal+ruby. For that, I've created checkpoints with a sarek machine running Solaris 10. The problem I am experiencing is that several simulations never stop. This is because, before executing the magic call that terminates execution, the program crashes with this error (in this case, the program is ocean with 8 processors):

send mondo timeout [1148663 NACK 0 BUSY]
IDSR 0x2  cpuids: 0x0
panic: failed to stop cpu0

panic[cpu1]/thread=30005389300: send_mondo_set: timeout

000002a100a7ad00 SUNW,UltraSPARC-III+:send_mondo_set+454 (2a100a7aee0, 8000000000000000, 2080d1c42, 1, 2a100a7adf0, 0)
  %l0-3: 0000000000002aaa 0000000000000007 0000000000000007 0000000000000000
  %l4-7: 0000000001209800 00000002080d1c83 0000000000000002 0000000000000040
000002a100a7ae30 unix:xt_some+194 (2a100a7b108, 2a100a7af30, fd, fffffffffffffff8, 2a100a7aee8, 0)
  %l0-3: 00000000018850f4 000002a100a7af30 0000000000000002 0000000000000000
  %l4-7: 00000000000000fd 000002a100a7af78 000002a100a7af30 0000000000000000
000002a100a7afc0 unix:sfmmu_cache_flush+a0 (c2ab, 0, 2a100a7b108, 11ff800, fd, fffffffffffffff8)
  %l0-3: 0000030002a8c000 0000000000002000 00000700006a4d80 0000000000000000
  %l4-7: 0000000000000001 0000030002a8c000 fffffffffffffff8 00000000000000ff
000002a100a7b150 unix:sfmmu_vac_conflict+90 (300035dca60, 0, 700006a4d80, 366000, 1b3, 182a000)
  %l0-3: 0000000000000001 0000000000000fdf 0000000000000001 0000000000000002
  %l4-7: 0000000000000000 0000000000000001 0000000000000000 0000000000000000
000002a100a7b250 unix:sfmmu_tteload_addentry+248 (300035dca60, 300038ecd80, 2a100a7b4d0, 366000, 20000, 8000000)
  %l0-3: 00000700006a4d80 0000000000000000 00000300038ece18 0000000000000000
  %l4-7: 000002a100a7b368 000000000005e03a 000000000005e03b 0000000000000003
000002a100a7b370 unix:___const_seg_900001201+9928 (300035dca60, 2a100a7b4d0, 366000, 2a100a7b4d8, 0, 4)
  %l0-3: 0000000000000003 0000000000020040 0000000000010000 00000700000d4ac0
  %l4-7: 0000000000000000 0000000000000000 0000000000000000 0000000080000000
000002a100a7b420 unix:hat_memload+f8 (300035dca60, 366000, 700006a4d80, f, 1079000, 10)
  %l0-3: 0000000000000000 0000000000000000 00000300035bb0f0 ffffffffffffffff
  %l4-7: 0000000c000ecad4 0000000001884c00 0000000000000001 00000700006a4d80
000002a100a7b4e0 genunix:segvn_faultpage+32c (2a100a7b760, 300039587e0, 366000, 0, 0, 300035bb0f0)
  %l0-3: 0000000000000001 0000000000000000 000000000000000f 0000000000000000
  %l4-7: 0000000000000000 0000000000000002 0000000000000000 0000030003a11590
000002a100a7b600 genunix:segvn_fault+ac0 (0, 300039587e0, 368000, 1076f00, 366000, 0)
  %l0-3: 0000000000000002 0000000000000000 0000000000000000 0000000000366000
  %l4-7: 000002a100a7b760 0000000000000000 0000030003a11590 0000030003b2b538
000002a100a7b7c0 genunix:as_fault+4c8 (300039587e0, 30003964388, 366000, 300035bb168, 18bee98, 0)
  %l0-3: 0000000000000000 0000000000000002 00000300035bb140 00000300039587e0
  %l4-7: 0000000000002000 00000000018bf580 0000000000366000 0000000000002000
000002a100a7b8d0 unix:pagefault+ac (366000, 0, 2, 0, 300035bb0f0, 2)
  %l0-3: 0000030002a8c000 0000000000000000 0000030002a8c000 0000000000000000
  %l4-7: 0000000001881000 000000000187e400 0000000000000000 0000030003a78fc8
000002a100a7b990 unix:trap+d44 (2a100a7bb90, ffffffffffffff1d, 0, 2, 15d1c, 0)
  %l0-3: 0000000000000000 0000030003a78fc8 0000000000010031 0000030003964388
  %l4-7: 0000000000000000 0000000000000004 000000000182a000 0000030003a791a8

panic: entering debugger (continue to save dump)
Type 'go' to resume
ERROR: idle_a_cpu: grab_cpu 0 failed
ERROR: idle_other_cpus: cpu id 0 failed to stop: state 1
debugger entered.
{1} ok

I've executed the same experiments without opal (just ruby), and then they finish properly. This is essentially the same problem that Doe experienced some months ago and reported here: https://lists.cs.wisc.edu/archive/gems-users/2008-January/msg00008.shtml

Any idea what is happening here?

PS. The threads are attached to a processor using the pset_bind function to avoid migration.