Dan and Mike,
Thanks for the quick response. I double-checked REMOVE_SINGLE_CYCLE_DCACHE_FAST_PATH, and it is set to true. In several sets of simulations, I set L1_REQUEST_LATENCY, L1_RESPONSE_LATENCY, and SEQUENCER_TO_CONTROLLER_LATENCY alternately to 2 and 11. I am running fft on an 8-processor CMP. Surprisingly, with all three set to 11, Ruby_cycles increased by only about 10 percent. This is strange, because the L1 hit rates of the Splash2 benchmarks should be quite high. Any ideas?
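To put a number on why this surprises me, here is a rough sanity check. The 95 percent hit rate and the 100-cycle miss penalty below are made-up illustrative figures, not measurements from my runs:

    avg. memory time, fast path on:   0.95 * 1  + 0.05 * 100 =  5.95 cycles/access
    avg. memory time, hit latency 11: 0.95 * 11 + 0.05 * 100 = 15.45 cycles/access

If memory time were any significant fraction of Ruby_cycles, I would expect much more than a 10 percent increase, unless the extra hit latency is being hidden or never actually applied.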
Could you please take a quick look at the parameters I attached and see if I am missing something?
And one more thing: what exactly is this if statement doing? (It is in the doRequest function in Sequencer.C.)
if (hit && ((request.getType() == CacheRequestType_IFETCH) ||
            !REMOVE_SINGLE_CYCLE_DCACHE_FAST_PATH)) {
  // Single-cycle fast path: instruction fetches always take it; data
  // accesses take it only if the fast path has not been removed.
  DEBUG_MSG(SEQUENCER_COMP, MedPrio, "Fast path hit");
  hitCallback(request, *data_ptr, GenericMachineType_L1Cache);
  return true;
}
Is there any relation between this and L1 latency?
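My guess is that when the fast path is not taken, the request is instead handed to the L1 controller through the mandatory queue, with SEQUENCER_TO_CONTROLLER_LATENCY as the enqueue delay; something like the line below, which I am reconstructing from memory, so the identifiers may not match the actual Sequencer.C:

    // Non-fast-path (my reconstruction): pass the request to the L1 cache
    // controller after a fixed sequencer-to-controller delay.
    m_chip_ptr->m_L1Cache_mandatoryQueue_vec[m_version]->enqueue(msg, SEQUENCER_TO_CONTROLLER_LATENCY);

If that is right, this if statement is exactly what decides whether a hit costs 1 cycle or SEQUENCER_TO_CONTROLLER_LATENCY cycles.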
Thanks,
Mojtaba
================ Begin System Configuration Print ================
Ruby Configuration ------------------ protocol: MSI_MOSI_CMP_directory simics_version: Simics 3.0.22 compiled_at: 02:46:55, Nov 11 2006 RUBY_DEBUG: false hostname: m45-010.pool g_RANDOM_SEED: 1
g_DEADLOCK_THRESHOLD: 50000 g_FORWARDING_ENABLED: false RANDOMIZATION: false g_SYNTHETIC_DRIVER: false g_SYNTHETIC_GENERATOR: locks g_DETERMINISTIC_DRIVER: false g_FILTERING_ENABLED: false g_DISTRIBUTED_PERSISTENT_ENABLED: true
g_DYNAMIC_TIMEOUT_ENABLED: true g_RETRY_THRESHOLD: 1 g_FIXED_TIMEOUT_LATENCY: 300 g_trace_warmup_length: 1000000 g_bash_bandwidth_adaptive_threshold: 0.75 g_tester_length: 0 g_synthetic_locks: 2048
g_deterministic_addrs: 1 g_SpecifiedGenerator: DetermInvGenerator g_callback_counter: 0 g_NUM_COMPLETIONS_BEFORE_PASS: 0 g_think_time: 5 g_hold_time: 5 g_wait_time: 5 PROTOCOL_DEBUG_TRACE: true DEBUG_FILTER_STRING: none
DEBUG_VERBOSITY_STRING: none DEBUG_START_TIME: 0 DEBUG_OUTPUT_FILENAME: none SIMICS_RUBY_MULTIPLIER: 2 OPAL_RUBY_MULTIPLIER: 2 TRANSACTION_TRACE_ENABLED: false USER_MODE_DATA_ONLY: false PROFILE_HOT_LINES: false
PROFILE_ALL_INSTRUCTIONS: false PRINT_INSTRUCTION_TRACE: false BLOCK_STC: false PERFECT_MEMORY_SYSTEM: false PERFECT_MEMORY_SYSTEM_LATENCY: 0 DATA_BLOCK: false REMOVE_SINGLE_CYCLE_DCACHE_FAST_PATH: true
g_SIMICS: true L1_CACHE_ASSOC: 4 L1_CACHE_NUM_SETS_BITS: 8 L2_CACHE_ASSOC: 8 L2_CACHE_NUM_SETS_BITS: 10 g_MEMORY_SIZE_BYTES: 4294967296 g_DATA_BLOCK_BYTES: 64 g_PAGE_SIZE_BYTES: 4096 g_NUM_PROCESSORS: 8
g_NUM_L2_BANKS: 4 g_NUM_MEMORIES: 4 g_PROCS_PER_CHIP: 8 g_NUM_CHIPS: 1 g_NUM_CHIP_BITS: 0 g_MEMORY_SIZE_BITS: 32 g_DATA_BLOCK_BITS: 6 g_PAGE_SIZE_BITS: 12 g_NUM_PROCESSORS_BITS: 3 g_PROCS_PER_CHIP_BITS: 3
g_NUM_L2_BANKS_BITS: 2 g_NUM_L2_BANKS_PER_CHIP_BITS: 2 g_NUM_L2_BANKS_PER_CHIP: 4 g_NUM_MEMORIES_BITS: 2 g_NUM_MEMORIES_PER_CHIP: 4 g_MEMORY_MODULE_BITS: 24 g_MEMORY_MODULE_BLOCKS: 16777216 MAP_L2BANKS_TO_LOWEST_BITS: true
DIRECTORY_CACHE_LATENCY: 1 NULL_LATENCY: 1 ISSUE_LATENCY: 2 CACHE_RESPONSE_LATENCY: 1 L2_RESPONSE_LATENCY: 22 L1_RESPONSE_LATENCY: 11 COLLECTOR_REQUEST_LATENCY: 1 MEMORY_RESPONSE_LATENCY_MINUS_2: 118
DIRECTORY_LATENCY: 1 NETWORK_LINK_LATENCY: 1 COPY_HEAD_LATENCY: 1 ON_CHIP_LINK_LATENCY: 1 RECYCLE_LATENCY: 1 L2_RECYCLE_LATENCY: 1 TIMER_LATENCY: 10000 TBE_RESPONSE_LATENCY: 1 PERIODIC_TIMER_WAKEUPS: true
LOG_BASE: 4294967296 RETRY_LATENCY: 100 RESTART_DELAY: 1000 PROFILE_EXCEPTIONS: false PROFILE_XACT: false XACT_NUM_CURRENT: 0 XACT_LAST_UPDATE: 0 L1_REQUEST_LATENCY: 11 L2_REQUEST_LATENCY: 1
SINGLE_ACCESS_L2_BANKS: true SEQUENCER_TO_CONTROLLER_LATENCY: 11 L1CACHE_TRANSITIONS_PER_RUBY_CYCLE: 32 L2CACHE_TRANSITIONS_PER_RUBY_CYCLE: 32 DIRECTORY_TRANSITIONS_PER_RUBY_CYCLE: 32 COLLECTOR_TRANSITIONS_PER_RUBY_CYCLE: 32
g_SEQUENCER_OUTSTANDING_REQUESTS: 16 NUMBER_OF_TBES: 128 NUMBER_OF_MATES: 4 NUMBER_OF_L1_TBES: 32 NUMBER_OF_L2_TBES: 32 FINITE_BUFFERING: false FINITE_BUFFER_SIZE: 3 PROCESSOR_BUFFER_SIZE: 10 PROTOCOL_BUFFER_SIZE: 32
TSO: false g_MASK_PREDICTOR_CONFIG: AlwaysBroadcast g_TOKEN_REISSUE_THRESHOLD: 2 g_PERSISTENT_PREDICTOR_CONFIG: None g_NETWORK_TOPOLOGY: PT_TO_PT g_CACHE_DESIGN: NUCA g_endpoint_bandwidth: 1000 g_adaptive_routing: true
NUMBER_OF_VIRTUAL_NETWORKS: 5 FAN_OUT_DEGREE: 4 g_PRINT_TOPOLOGY: true g_NUM_DNUCA_BANK_SETS: 32 g_NUM_DNUCA_BANK_SET_BITS: 0 g_NUM_BANKS_IN_BANK_SET_BITS: 0 g_NUM_BANKS_IN_BANK_SET: 0 PERFECT_DNUCA_SEARCH: true
g_NUCA_PREDICTOR_CONFIG: NULL ENABLE_MIGRATION: false ENABLE_REPLICATION: false COLLECTOR_HANDLES_OFF_CHIP_REQUESTS: false XACT_LENGTH: 0 XACT_SIZE: 0
Dan Gibson wrote:
Have you tried modifying L1_REQUEST_LATENCY? Some of the CMP protocols use L1_REQUEST_LATENCY instead of L1_RESPONSE_LATENCY... I do not recall which.
Regards, Dan Gibson
Mojtaba Mehrara wrote:
Hi,
I am trying to increase the L1 hit latency in the MSI_MOSI_CMP_directory protocol. I did what Mike said in the following post:
"
The L1_RESPONSE_LATENCY, like most of the specified latencies, is specific to an individual protocol. Adjusting the L1 hit latency is unfortunately not at all straightforward. By default, the L1 hit latency is always 1 cycle. This can be changed by turning off "fast path hits", controlled by the REMOVE_SINGLE_CYCLE_DCACHE_FAST_PATH flag. A fast path hit is one where the Ruby sequencer (ruby/sequencer.C) directly checks the permissions in the L1 caches before actually issuing a request to Ruby. If you turn this off, the L1 hit latency can be controlled by the SEQUENCER_TO_CONTROLLER_LATENCY parameter.
Sorry this is confusing...hopefully we can clean this up in the future.
--Mike
"
However, I notice almost no change in Ruby_cycles when I increase SEQUENCER_TO_CONTROLLER_LATENCY from 2 to 11.
My other parameters follow. (I have unrealistically set some delays to 1 to minimize their effect.)
By tracking down a specific trace in tester.exec, I noticed that L1_REQUEST_LATENCY and L1_RESPONSE_LATENCY are the delays between the L1 and the L2 and have nothing to do with the L1 hit latency itself. Is this correct? (I have tried increasing these two anyway, but I still did not notice much difference in performance.)
Am I missing something here?
One more thing. As one of the previous posts suggested, I tried to get the L1 miss rate by commenting out the guard around the following line in system/Sequencer.C, so that the sample is always recorded:

    // if (!REMOVE_SINGLE_CYCLE_DCACHE_FAST_PATH) {
    g_system_ptr->getProfiler()->addPrimaryStatSample(msg, m_chip_ptr->getID());
    // }

But the reported miss rates on the Splash2 benchmarks are very high (more than 90%!). Is it possible that this is the source of my problem with the L1 hit latency? If so, what should I do, and how should I measure the actual miss rate?
Thanks in advance,
Mojtaba