7. Every #pragma omp parallel or #pragma omp parallel for has an implicit barrier at the end; for worksharing constructs it can only be removed with a nowait clause.
 8. You have added the barrier in the wrong place; with the current placement you are actually introducing artificial load imbalance. Put the barrier /before/ the magic instruction (see the sketch after this list).
 9. Our art source code differs. I show that loop as schedule(dynamic), not schedule(static). schedule(dynamic) should do a better job of reducing imbalance if there is variance in iteration duration. We are using SPEC OMP2001 version string 35.
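
A sketch of what I mean in 8 and 9, reusing the names from your snippet below (everything else here is assumed, not your actual code):

/* Sketch only: synchronize all threads first, then mark the start of the
 * region of interest, and let the runtime balance iterations dynamically.
 * Variable names follow the quoted loop; their declarations are assumed. */
#pragma omp barrier            /* every thread reaches this point first */
puts("magic1");
MAGIC(0x40000);                /* ...then the magic breakpoint fires    */

#pragma omp for private(k, m, n, gPassFlag) schedule(dynamic)
  for (ij = 0; ij < ijmx; ij++)
  {
    /* ... loop body exactly as in your snippet ... */
  }

puts("magic2");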
 On Thu, Feb 25, 2010 at 2:31 PM,   <ubaid001@xxxxxxx> wrote: 
Hi,
1. Lack of explicit locking does not imply lack of spinning. There are
implicit barriers at the end of most OpenMP #pragmas, and looking at art's
source code, I see that there are several #pragmas without barrier elision.
Moreover, it is possible to spin in the OS, or in other processes.
 
 
    Yes, I agree that there are pragmas, but I have added barriers wherever needed.
 
I run Simics until this magic breakpoint, then load Opal and Ruby and simulate the parallel loop.
 
puts("magic1"); 
MAGIC(0x40000); 
#pragma omp barrier// the barrier so that all Simics proc can reach the same pt 
 
#pragma omp for private (k,m,n, gPassFlag) schedule(static)    for (ij = 0; ij < ijmx; ij++)       {        j = ((ij/inum) * gStride) + gStartY; 
      i = ((ij%inum) * gStride) +gStartX; 
      k=0; 
      for (m=j;m<(gLheight+j);m++) 
#pragma noswp 
        for (n=i;n<(gLwidth+i);n++) 
          f1_layer[o][k++].I[0] = cimage[m][n]; 
                      gPassFlag =0; 
      gPassFlag = match(o,i,j, &mat_con[ij], busp); 
 
      if (gPassFlag==1) 
      { 
#ifdef DEBUG 
       printf(" at X= %d Y = %d\n",i,j); 
#endif 
       if (set_high[o][0]==TRUE) 
       { 
         highx[o][0] = i; 
         highy[o][0] = j; 
         set_high[o][0] = FALSE; 
       } 
       if (set_high[o][1]==TRUE)        { 
         highx[o][1] = i; 
         highy[o][1] = j; 
         set_high[o][1] = FALSE; 
       } 
      } 
 
#ifdef DEBUG 
      else if (DB3) 
       printf("0.00#%dx%da%2.1fb%2.1f\n",i,j,a,b); 
#endif 
 
      }   
puts("magic2"); 
 
    } 
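
(For context, MAGIC(0x40000) is the Simics magic-instruction macro. On our SPARC target it should be roughly the following, going by simics/magic-instruction.h; I am quoting from memory, so treat the exact encoding as an assumption.)

/* Assumed SPARC v9 definition from simics/magic-instruction.h:
 * sethi into %g0 is architecturally a no-op, but Simics can raise a
 * magic breakpoint whenever it executes one; 0x40000 is the value
 * used for MAGIC_BREAKPOINT on SPARC. */
#define MAGIC(n) do {                                \
        __asm__ __volatile__ ("sethi " #n ", %g0");  \
} while (0)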
 
 
This load imbalance was present on other benchmarks as well. Moreover, these benchmarks do not show a load imbalance when run on an actual multiprocessor system (a 4-processor Itanium 2 Montecito SMP, with each processor being dual core).
2. It is entirely possible that something other than art is running on your
cores. Look into pset_bind and processor_bind. With OpenMP, I'd recommend
pset_create instead of explicit binding.
 
 
  This is a good possibility. I will look into this.
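
For reference, a minimal sketch of that suggestion, assuming the Serengeti target runs Solaris; the CPU ids and the choice to leave CPU 0 unassigned are my assumptions, and this is untested:

#include <stdio.h>
#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>
#include <sys/pset.h>

/* Create a processor set holding CPUs 1-3 and bind the calling process
 * (and therefore its OpenMP worker LWPs) to it, so nothing else gets
 * scheduled on those CPUs.  CPU 0 stays in the default set because
 * Solaris requires at least one CPU outside every processor set.
 * Creating processor sets needs sufficient privileges. */
int bind_to_pset(void)
{
    psetid_t pset;
    processorid_t cpu;

    if (pset_create(&pset) != 0) {
        perror("pset_create");
        return -1;
    }
    for (cpu = 1; cpu <= 3; cpu++) {
        if (pset_assign(pset, cpu, NULL) != 0)
            perror("pset_assign");
    }
    if (pset_bind(pset, P_PID, P_MYID, NULL) != 0) {
        perror("pset_bind");
        return -1;
    }
    return 0;
}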
 
3. I'm sure it's been said on this list before (because I have said it) that
instruction count is a BAD metric for multithreaded code.
 
 
  Yes, but the faulty core is off from the others in executed instructions by almost 100%, which seems spurious.
 
 
 
Suhail
On Feb 25 2010, Dan Gibson wrote:
 
I believe what you are observing is inherent to the simulation, and to real 
executions, for the following reasons: 
 
1. Lack of explicit locking does not imply lack of spinning. There are 
implicit barriers at the end of most OpenMP #pragmas, and looking at art's 
source code, I see that there are several #pragmas without barrier elision. 
Moreover, it is possible to spin in the OS, or in other processes. 
2. It is entirely possible that something other than art is running on your 
cores. Look into pset_bind and processor_bind. With OpenMP, I'd recommend
pset_create instead of explicit binding. 
3. /Simics/ does not choose what code runs on which core. The operating 
system does that. Look for ways to affect the OS, not Simics. 
4. I'm sure it's been said on this list before (because I have said it) that
instruction count is a BAD metric for multithreaded code. 
5. art on my Serengeti target takes a LOT of TLB misses (one almost every
iteration). I'm not sure if individual cores would react differently or not 
to TLB misses. 
6. art uses a dynamically-scheduled parallel section. Load imbalance in 
those iterations would cause one core to lag or complete early. 
 
Regards, 
Dan 
 
On Thu, Feb 25, 2010 at 1:51 PM, <ubaid001@xxxxxxx> wrote: 
 
 
Since there are no mutex locks, no processor is spinning. I have only my
benchmark running on my Simics target machine (Serengeti). Is there a
possibility that the faulty core is running some other program rather than
the art thread?

Also, is there any way in Simics to bind a thread to a particular processor,
so that I know for sure that all my processors are running the user thread?
 
 
 
Suhail 
 
 
 
On Feb 25 2010, ubaid001@xxxxxxx wrote: 
 
 Hi, 
 
I had brought up this issue earlier. I am running the SPEC OMP benchmark (art)
on a 4-core CMP system (Opal + Ruby).

There seems to be a huge difference in the number of instructions
executed between one processor and the rest. I know that there are no mutex
locks in the code. In fact, I load Opal and Ruby only from the parallel
section of the program.
 
One core either lags behind or leads the other processors, and this
happens on every single simulation.
 
Can anyone shed more light on this? 
 
Suhail 
 
--
http://www.cs.wisc.edu/~gibson
[esc]:wq!