Re: [Gems-users] high execution time despite high hit ratio


Date: Wed, 09 Apr 2008 11:04:04 -0500
From: Dan Gibson <degibson@xxxxxxxx>
Subject: Re: [Gems-users] high execution time despite high hit ratio
I should add that profiling we've done suggests that Simics (2.2.19) behaves very differently (i.e. much slower) when ANY timing model is installed, even a one-line timing module that always returns a stall time of zero.
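
For reference, such a module really is trivial. Below is a minimal sketch, assuming the Simics 2.x C module API (the "null-timing" class name and the exact structure are illustrative, not the module we actually profiled):

/* null-timing.cpp -- timing model whose operate() always returns a
 * zero-cycle stall.  Even this forces Simics to take the
 * memory-hierarchy callback on every memory transaction, which is
 * enough to knock it out of its fast execution mode. */
#include <simics/api.h>

static cycles_t
null_operate(conf_object_t *obj, conf_object_t *space,
             map_list_t *map, generic_transaction_t *mem_op)
{
    return 0;  /* never stall the processor */
}

static conf_object_t *
null_new_instance(parse_object_t *pa)
{
    conf_object_t *obj = MM_ZALLOC(1, conf_object_t);
    SIM_object_constructor(obj, pa);
    return obj;
}

extern "C" void
init_local(void)
{
    class_data_t cd = {};
    cd.new_instance = null_new_instance;
    cd.description  = "Timing model that always stalls for zero cycles.";
    conf_class_t *cls = SIM_register_class("null-timing", &cd);

    static timing_model_interface_t iface = {};
    iface.operate = null_operate;
    SIM_register_interface(cls, "timing_model", &iface);
}

Hook an instance up through the memory space's timing_model attribute (e.g. phys_mem0.timing_model; object names depend on your target machine) and the slowdown appears even though no stall is ever requested.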

Regards,
Dan

Niket wrote:

Mladen wrote:
> I have L1 cache statistics that basically say I have an almost perfect
> hit ratio in both the instruction and the data cache. Still, the
> execution time is several times higher than when I run the same
> application in Simics without Ruby.
>
> I expected the difference to be much smaller, since the cache hit
> ratio is so good.

I guess by "execution time" you mean simulation time. Simulation time is directly related to the amount of work being done. When you simulate with Ruby, every memory access (even an L1 hit) invokes Ruby, which stalls the Simics processor and does some timing work; when the timing simulation finishes, Ruby wakes Simics up again. Without Ruby, none of this happens. Take a look at the ISCA tutorial for GEMS; it mentions the slowdown to expect when Ruby/Opal are loaded.

Cheers,
Niket

> Do you have a possible explanation for this strange result? Since I'm
> not an expert on Ruby, I can imagine that some of the parameters in
> the configuration file might affect the overall result significantly.
> You can find my configuration parameters below.
>
> Regards,
> Mladen

g_RANDOM_SEED: 1

g_DEADLOCK_THRESHOLD: 50000
g_FORWARDING_ENABLED: false

RANDOMIZATION: false
g_SYNTHETIC_DRIVER: false
g_DETERMINISTIC_DRIVER: false

// MOSI_CMP_token1 parameters
g_FILTERING_ENABLED: false
g_DISTRIBUTED_PERSISTENT_ENABLED: true
g_RETRY_THRESHOLD: 1
g_DYNAMIC_TIMEOUT_ENABLED: true
g_FIXED_TIMEOUT_LATENCY: 300

g_trace_warmup_length: 1000000
g_bash_bandwidth_adaptive_threshold: 0.75

g_tester_length: 0
// # of synthetic locks == 16 * 128
g_synthetic_locks: 2048
g_deterministic_addrs: 1
g_SpecifiedGenerator: DetermInvGenerator
g_callback_counter: 0
g_NUM_COMPLETIONS_BEFORE_PASS: 0

g_think_time: 5
g_hold_time:  5
g_wait_time:  5

PROTOCOL_DEBUG_TRACE: true
DEBUG_FILTER_STRING: none
DEBUG_VERBOSITY_STRING: none
DEBUG_START_TIME: 0
DEBUG_OUTPUT_FILENAME: none

SIMICS_RUBY_MULTIPLIER: 2
OPAL_RUBY_MULTIPLIER: 2

TRANSACTION_TRACE_ENABLED: false
USER_MODE_DATA_ONLY: false
PROFILE_HOT_LINES: false

PROFILE_ALL_INSTRUCTIONS: false
PRINT_INSTRUCTION_TRACE: false
BLOCK_STC: false
PERFECT_MEMORY_SYSTEM: false
PERFECT_MEMORY_SYSTEM_LATENCY: 0
DATA_BLOCK: false

REMOVE_SINGLE_CYCLE_DCACHE_FAST_PATH: false

// *********************************************
// CACHE & MEMORY PARAMETERS
// *********************************************

g_SIMICS: true

L1_CACHE_ASSOC: 4
L1_CACHE_NUM_SETS_BITS: 7
L2_CACHE_ASSOC: 4
L2_CACHE_NUM_SETS_BITS: 11
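
// For orientation: together with g_DATA_BLOCK_BYTES (64, set below),
// these four values determine the cache capacities. A quick sanity
// check -- illustrative arithmetic only, not GEMS code (whether the L2
// figure is per bank or per chip depends on how the protocol maps banks):

#include <cstdio>

int main()
{
    const long block   = 64;       // g_DATA_BLOCK_BYTES
    const long l1_sets = 1L << 7;  // L1_CACHE_NUM_SETS_BITS = 7
    const long l2_sets = 1L << 11; // L2_CACHE_NUM_SETS_BITS = 11
    const long assoc   = 4;        // both caches are 4-way

    // capacity = sets * associativity * block size
    std::printf("L1: %ld KB\n", l1_sets * assoc * block / 1024); // 32 KB
    std::printf("L2: %ld KB\n", l2_sets * assoc * block / 1024); // 512 KB
    return 0;
}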

// 32 bits = 4 GB address space
g_MEMORY_SIZE_BYTES: 4294967296
g_DATA_BLOCK_BYTES: 64
g_PAGE_SIZE_BYTES: 4096
g_NUM_PROCESSORS: 0
g_NUM_L2_BANKS: 0
g_NUM_MEMORIES: 0
g_PROCS_PER_CHIP: 1

// The following group of parameters are calculated.  They must
// _always_ be left at zero.
g_NUM_CHIPS: 0
g_NUM_CHIP_BITS: 0
g_MEMORY_SIZE_BITS: 0
g_DATA_BLOCK_BITS: 0
g_PAGE_SIZE_BITS: 0
g_NUM_PROCESSORS_BITS: 0
g_PROCS_PER_CHIP_BITS: 0
g_NUM_L2_BANKS_BITS: 0
g_NUM_L2_BANKS_PER_CHIP: 0
g_NUM_L2_BANKS_PER_CHIP_BITS: 0
g_NUM_MEMORIES_BITS: 0
g_NUM_MEMORIES_PER_CHIP: 0
g_MEMORY_MODULE_BITS: 0
g_MEMORY_MODULE_BLOCKS: 0

// determines whether the lowest bits of a block address are used to
// index into an L2 cache bank or into the sets of a single bank
// lowest                                                       highest
// true:  g_DATA_BLOCK_BITS | g_NUM_L2_BANKS_PER_CHIP_BITS | L2_CACHE_NUM_SETS_BITS
// false: g_DATA_BLOCK_BITS | L2_CACHE_NUM_SETS_BITS | g_NUM_L2_BANKS_PER_CHIP_BITS
MAP_L2BANKS_TO_LOWEST_BITS: true
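
// A sketch of what the flag means for address decoding (illustrative
// only: the helper below is made up, and the 2-bit bank field is
// hypothetical, since g_NUM_L2_BANKS is computed at init):

#include <cstdint>

const int kBlockBits = 6;  // g_DATA_BLOCK_BITS for 64 B lines
const int kSetBits   = 11; // L2_CACHE_NUM_SETS_BITS
const int kBankBits  = 2;  // hypothetical: 4 L2 banks per chip

// Which L2 bank a block address maps to under each setting.
uint64_t bank_of(uint64_t addr, bool map_banks_to_lowest_bits)
{
    if (map_banks_to_lowest_bits)
        // true: bank bits sit directly above the block offset
        return (addr >> kBlockBits) & ((1ULL << kBankBits) - 1);
    else
        // false: bank bits sit above the set-index bits
        return (addr >> (kBlockBits + kSetBits)) & ((1ULL << kBankBits) - 1);
}

// With the lowest bits, consecutive blocks round-robin across banks;
// with the highest bits, each bank owns a contiguous address region.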

// TIMING PARAMETERS
DIRECTORY_CACHE_LATENCY: 6

NULL_LATENCY: 1
ISSUE_LATENCY: 2
CACHE_RESPONSE_LATENCY: 12
L1_RESPONSE_LATENCY: 1
L2_RESPONSE_LATENCY: 6
COLLECTOR_REQUEST_LATENCY: 1
MEMORY_RESPONSE_LATENCY_MINUS_2: 78
DIRECTORY_LATENCY: 80
NETWORK_LINK_LATENCY: 40
COPY_HEAD_LATENCY: 4
ON_CHIP_LINK_LATENCY: 1
RECYCLE_LATENCY: 3
L2_RECYCLE_LATENCY: 5
TIMER_LATENCY: 10000
TBE_RESPONSE_LATENCY: 1
PERIODIC_TIMER_WAKEUPS: true
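
// One naming pitfall in this group: judging only by the parameter's
// name (not verified against the Ruby source), the value is specified
// with two cycles already subtracted, so:

const int MEMORY_RESPONSE_LATENCY_MINUS_2 = 78;
const int effective_memory_latency = MEMORY_RESPONSE_LATENCY_MINUS_2 + 2; // 80,
// which would match DIRECTORY_LATENCY above.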

// constants used by TM protocols
RETRY_LATENCY: 100
RESTART_DELAY: 1000
PROFILE_EXCEPTIONS: false
PROFILE_XACT: false
XACT_NUM_CURRENT: 0 // must be 0
XACT_LAST_UPDATE: 0 // must be 0

// constants used by CMP protocols
// cache bank access times
L1_REQUEST_LATENCY: 2
L2_REQUEST_LATENCY: 4
// Allows only a single access to a multi-cycle L2 bank.
// Ensures the cache array is only accessed once for every L2_REQUEST_LATENCY
// cycles.  However, the TBE table can be accessed in parallel.
SINGLE_ACCESS_L2_BANKS: true
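
// The effect can be pictured as a per-bank "busy until" timestamp. A
// simplified sketch of the behavior described in the comment above
// (names invented; the real logic lives in the generated controllers):

#include <cstdint>

const uint64_t L2_REQUEST_LATENCY = 4;  // bank access time, as above

struct L2Bank {
    uint64_t busy_until;  // first cycle the data array is free again

    // The data array may start at most one access every
    // L2_REQUEST_LATENCY cycles; otherwise the request is recycled.
    bool try_access_array(uint64_t now) {
        if (now < busy_until)
            return false;
        busy_until = now + L2_REQUEST_LATENCY;
        return true;
    }

    // TBE (MSHR-like) lookups are not gated by the data array.
    bool access_tbe(uint64_t) { return true; }
};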

// Ruby cycles between when a sequencer issues a request and it arrives at
// the L1 cache controller
SEQUENCER_TO_CONTROLLER_LATENCY: 4

// Number of transitions each controller state machine can complete per cycle
// i.e. the number of ports to each controller
// L1cache is the sum of the L1I and L1D cache ports
L1CACHE_TRANSITIONS_PER_RUBY_CYCLE: 32
// Note: if SINGLE_ACCESS_L2_BANKS is enabled, this will probably enforce a
// much greater constraint on the concurrency of an L2 cache bank
L2CACHE_TRANSITIONS_PER_RUBY_CYCLE: 32
DIRECTORY_TRANSITIONS_PER_RUBY_CYCLE: 32
COLLECTOR_TRANSITIONS_PER_RUBY_CYCLE: 32

// Maximum number of requests (including SW prefetches) outstanding from
// the sequencer (Note: this also includes items buffered in the store
// buffer)
g_SEQUENCER_OUTSTANDING_REQUESTS: 16

// Number of TBEs available for demand misses, ALL prefetches, and replacements
// used by one-level protocols
NUMBER_OF_TBES: 128
// two-level protocols
NUMBER_OF_L1_TBES: 32
NUMBER_OF_L2_TBES: 32

// Number of Monitor Activity Entries
NUMBER_OF_MATES: 4

// NOTE: Finite buffering allows us to simulate a realistic virtual
// cut-through routed network with idealized flow control.
FINITE_BUFFERING: false
// All message buffers within the network (i.e. the switch's input and
// output buffers) are set to the size specified below by FINITE_BUFFER_SIZE
FINITE_BUFFER_SIZE: 3
// g_SEQUENCER_OUTSTANDING_REQUESTS (above) controls the number of demand
// requests issued by the sequencer.  The PROCESSOR_BUFFER_SIZE controls the
// number of requests in the mandatory queue.
// Only affects the simulation when FINITE_BUFFERING is enabled.
PROCESSOR_BUFFER_SIZE: 10
// The PROTOCOL_BUFFER_SIZE limits the size of all other buffers connecting
// to controllers.  Controls the number of requests issued by the L2 HW
// prefetcher.
PROTOCOL_BUFFER_SIZE: 32
TSO: false

g_MASK_PREDICTOR_CONFIG: AlwaysBroadcast
g_TOKEN_REISSUE_THRESHOLD: 2

g_PERSISTENT_PREDICTOR_CONFIG: None

g_NETWORK_TOPOLOGY: HIERARCHICAL_SWITCH
g_CACHE_DESIGN: NUCA
g_endpoint_bandwidth: 10000
g_adaptive_routing: true
NUMBER_OF_VIRTUAL_NETWORKS: 6
FAN_OUT_DEGREE: 4
g_PRINT_TOPOLOGY: false

// the following variables must be calculated
g_NUM_DNUCA_BANK_SET_BITS: 0
g_NUM_BANKS_IN_BANK_SET_BITS: 0
g_NUM_BANKS_IN_BANK_SET: 0

// NUCA variables
g_NUM_DNUCA_BANK_SETS: 32
g_NUCA_PREDICTOR_CONFIG: NULL
ENABLE_MIGRATION: false
ENABLE_REPLICATION: false
COLLECTOR_HANDLES_OFF_CHIP_REQUESTS: false

// PERFECT_DNUCA_SEARCH is only valid for DNUCA protocols;
// assumes perfect L2 cache knowledge at the L1 controllers
PERFECT_DNUCA_SEARCH: true



_______________________________________________
Gems-users mailing list
Gems-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/gems-users
Use Google to search the GEMS Users mailing list by adding "site:https://lists.cs.wisc.edu/archive/gems-users/" to your search.


--
http://www.cs.wisc.edu/~gibson [esc]:wq!
