Wednesday 27 May 2015

Fixing cache issue

What are the configurations?

  • Enabling cache/MMU
  • Enabling SCU
  • Integrating with existing initialization for Raspberry Pi 1 

   

Enabling cache/MMU


For Raspberry Pi 1 

The caches can be enabled simply by setting the relevant bits in the CP15 Control Register.

Cache/mmu setup for Raspberry Pi starts in
~/rtems/c/src/lib/libbsp/arm/shared/mminit.c 

This file implements a single memory initialization function.
BSP_START_TEXT_SECTION void bsp_memory_management_initialize(void)

It contains ARM1176-specific code, so only part of the function will be the same for the Pi 2. A separate initialization function would be better.


Next comes setting up the translation table and turning on the caches/MMU through CP15.
The existing code already supports this in the file
~/rtems/c/src/lib/libbsp/arm/shared/include/arm-cp15-start.h 
in the function arm_cp15_start_setup_translation_table_and_enable_mmu_and_cache():

/* Enable MMU and cache */
  ctrl |= ARM_CP15_CTRL_I | ARM_CP15_CTRL_C | ARM_CP15_CTRL_M;

  arm_cp15_set_control(ctrl);
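
As a rough illustration of what those three flags amount to, the sketch below computes the resulting control register value as plain arithmetic, so it can be checked on a host machine. The bit positions (M = bit 0, C = bit 2, I = bit 12) are the architectural ones for the CP15 Control Register; the SKETCH_ macro names and the function are hypothetical stand-ins, not the RTEMS definitions.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the RTEMS ARM_CP15_CTRL_* flags. */
#define SKETCH_CTRL_M (UINT32_C(1) << 0)   /* MMU enable */
#define SKETCH_CTRL_C (UINT32_C(1) << 2)   /* data/unified cache enable */
#define SKETCH_CTRL_I (UINT32_C(1) << 12)  /* instruction cache enable */

/* Return the control register value with the MMU and both caches enabled. */
static uint32_t sketch_ctrl_enable_mmu_and_cache(uint32_t ctrl)
{
  return ctrl | SKETCH_CTRL_I | SKETCH_CTRL_C | SKETCH_CTRL_M;
}
```

Starting from a control value of 0, the three flags together give 0x1005.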


For Raspberry Pi 2

For the Cortex-A7, the SMP bit in the CP15 Auxiliary Control Register must additionally be set before the caches can be enabled. This is not currently done, so by default the caches and MMU are disabled.
This is similar to the Cortex-A9.

After this, enable the instruction and data caches and the MMU.
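
A minimal sketch of setting the SMP bit, assuming it sits at bit 6 of the Auxiliary Control Register as on the Cortex-A7. The names are hypothetical and this is not the RTEMS code; the register access is only compiled for ARMv7 targets, so the bit arithmetic can still be checked on a host.

```c
#include <stdint.h>

/* Assumption: ACTLR.SMP is bit 6 on the Cortex-A7 (hypothetical macro name). */
#define SKETCH_ACTLR_SMP (UINT32_C(1) << 6)

/* Pure helper: the ACTLR value with the SMP bit set. */
static uint32_t sketch_actlr_with_smp(uint32_t actlr)
{
  return actlr | SKETCH_ACTLR_SMP;
}

#if defined(__arm__) && defined(__ARM_ARCH) && (__ARM_ARCH >= 7)
/* Read-modify-write of the Auxiliary Control Register (CP15 c1, c0, 1). */
static void sketch_enable_smp_bit(void)
{
  uint32_t actlr;

  __asm__ volatile ("mrc p15, 0, %0, c1, c0, 1" : "=r" (actlr));
  actlr = sketch_actlr_with_smp(actlr);
  __asm__ volatile ("mcr p15, 0, %0, c1, c0, 1" : : "r" (actlr) : "memory");
  __asm__ volatile ("isb" : : : "memory");
}
#endif
```

On the target, sketch_enable_smp_bit() would have to run before the caches and MMU are enabled.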

For reference, see the Xilinx Zynq BSP:
~/libbsp/arm/xilinx-zynq/startup/bspstartmmu.c

The function arm_cp15_start_setup_mmu_and_cache() will be required for the Cortex-A7.

Next, as for the Pi 1 (with the respective parameters), comes a call to arm_cp15_start_setup_translation_table_and_enable_mmu_and_cache(), which will also be used for the A7.


When it comes to the MMU, it is about the translation table and access permissions. Here only the translation table concerns us.
The ARM MMU allows two types of translation: section based (which requires a single level of translation) and page based (which requires two levels of translation). I see that we use section-based translation, so the MMU uses a single translation table.
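
To make the single-level lookup concrete, here is a small host-side sketch of a section translation. It assumes the usual ARMv7 layout: 4096 first-level entries, each mapping 1 MiB, with the section base address in descriptor bits [31:20]. The function names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_SECTION_SHIFT 20u    /* each section covers 1 MiB */
#define SKETCH_SECTION_COUNT 4096u  /* 4 GiB / 1 MiB first-level entries */

/* Index of the first-level entry that covers a virtual address. */
static uint32_t sketch_section_index(uint32_t va)
{
  return va >> SKETCH_SECTION_SHIFT;
}

/*
 * Translate a virtual address through a section-only table: the upper
 * 12 bits come from the descriptor, the lower 20 bits from the address.
 */
static uint32_t sketch_section_translate(const uint32_t *ttb, uint32_t va)
{
  uint32_t desc = ttb[sketch_section_index(va)];

  return (desc & UINT32_C(0xFFF00000)) | (va & UINT32_C(0x000FFFFF));
}
```

With an identity mapping, as is typical for a BSP, entry i simply holds i << 20 plus the attribute bits.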
  • The translation table is set up in the arm-cp15-start.h file by arm_cp15_start_set_translation_table_entries(ttb, &config_table[i]).
  • The ARM memory configuration for Raspberry Pi under RTEMS is provided by arm_cp15_start_mmu_config_table[] in mm_config_table.c.
  • The memory attributes for ARM memory are controlled using several flags which are defined in arm-cp15.h. Each flag is a combination of bits, and the flags are present in every translation table entry. The bits follow the ARM v7 translation table descriptor format for sections.


  • Relevant to the issue at hand are the bits controlling cacheability. These are the TEX[2:0], C and B bits.
  • There are also settings that control the attributes of the translation table memory itself. These are set through the TTBR0 register (in the format that supports the multiprocessing extensions).
I have reused this configuration table and experimented only with the settings of these bits; otherwise the table is the same as that for the Pi 1. There is not much official documentation about the Pi 2 caches, but from the references I see that normal memory (basically ROM and RAM for ARM) should use a write-back, write-allocate policy for cacheable regions for better performance.
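
For orientation, the sketch below assembles a section descriptor from its fields. The bit positions follow my reading of the ARMv7 short-descriptor section format (type in bits [1:0], B = bit 2, C = bit 3, domain in bits [8:5], AP[1:0] in bits [11:10], TEX in bits [14:12], base in bits [31:20]). The function and its parameters are hypothetical and are not the flags from arm-cp15.h.

```c
#include <stdint.h>

/* Build an ARMv7 section descriptor; the remaining bits are left at 0. */
static uint32_t sketch_section_descriptor(
  uint32_t base,    /* physical base address, 1 MiB aligned */
  uint32_t tex,     /* TEX[2:0] */
  uint32_t c,       /* C bit */
  uint32_t b,       /* B bit */
  uint32_t ap,      /* AP[1:0] */
  uint32_t domain   /* domain number, 0..15 */
)
{
  return (base & UINT32_C(0xFFF00000))
    | ((tex & 7u) << 12)
    | ((ap & 3u) << 10)
    | ((domain & 15u) << 5)
    | ((c & 1u) << 3)
    | ((b & 1u) << 2)
    | 2u; /* bits [1:0] = 0b10 marks a section entry */
}
```

For example, a section at 0x00100000 with TEX=0,0,1, C,B=1,1, AP[1:0]=1,1 and domain 0 encodes as 0x00101C0E.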

Existing settings include:
TEX[2:0]=1,1,1 & C,B=1,1 -> write back, normal memory (for both inner and outer). In the ARMv7 section descriptor encoding with TEX[2]=1, the value 1,1 in TEX[1:0] and in C,B selects write back without write allocate, and no write allocate is costly.
This has been applied to the cacheable regions of memory (I identified these as having the CACHED suffix in the macro in mm_config_table).

Without turning the caches/MMU on I get 83333 dhrystones/sec. After turning them on I get 76923 dhrystones/sec.
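
To put the two figures in proportion, a one-line helper is enough. This is only arithmetic on the numbers quoted above; the function name is hypothetical.

```c
/* Percentage change of a measured score relative to a baseline score. */
static double sketch_relative_change_percent(double baseline, double measured)
{
  return (measured - baseline) / baseline * 100.0;
}
```

With the figures above, sketch_relative_change_percent(83333.0, 76923.0) comes out at roughly -7.7, so turning the caches on with the existing settings made the benchmark about 7.7% slower.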

Tried changes:
  1. TEX[2:0]=1,0,1 & C,B=0,1 -> cacheable memory: write back, write allocate, normal memory. This significantly reduced performance (58823 dhrystones/sec).
  2. TEX[2:0]=0,0,1 & C,B=1,1 -> write back, write allocate, normal memory (region does not remain cacheable?). Here the performance was similar to that obtained after just enabling the caches, except that on the first run of dhrystone I get 83333 dhrystones/sec with this change and 76923 dhrystones/sec without it. On subsequent executions of dhrystone the performance fluctuates between the two figures.
  3. Changes to TTBR0, which controls the translation table memory region attributes.
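
The encodings above are easy to misread, so here is a small decoder for the case TEX[2]=1, written from my reading of the ARMv7 short-descriptor format: TEX[1:0] gives the outer policy and C,B the inner policy, using the same 2-bit encoding. The function names are hypothetical, and the other TEX values are deliberately not handled.

```c
#include <stddef.h>
#include <stdint.h>

/* 2-bit cache policy encoding used when TEX[2] == 1. */
static const char *sketch_cache_policy(uint32_t two_bits)
{
  switch (two_bits & 3u) {
    case 0u: return "non-cacheable";
    case 1u: return "write-back, write-allocate";
    case 2u: return "write-through, no write-allocate";
    default: return "write-back, no write-allocate";
  }
}

/* Outer policy from TEX[1:0]; NULL if TEX[2] is clear (not handled here). */
static const char *sketch_outer_policy(uint32_t tex)
{
  return (tex & 4u) ? sketch_cache_policy(tex & 3u) : NULL;
}

/* Inner policy from C (high bit) and B (low bit); NULL if TEX[2] is clear. */
static const char *sketch_inner_policy(uint32_t tex, uint32_t c, uint32_t b)
{
  return (tex & 4u) ? sketch_cache_policy(((c & 1u) << 1) | (b & 1u)) : NULL;
}
```

By this decoding, TEX=1,0,1 with C,B=0,1 is write-back, write-allocate for both inner and outer, while TEX=1,1,1 with C,B=1,1 is write-back, no write-allocate for both. TEX=0,0,1 with C,B=1,1 falls outside this form; the architecture defines it directly as outer and inner write-back, write-allocate.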

 Code

For now, I have not considered integrating the two Pi variants. I have replaced the existing
BSP_START_TEXT_SECTION void bsp_memory_management_initialize(void)

with the initialization required for the Pi 2.

The link below explains the setup for the ARM v7 architecture.
  • Invalidation of caches is not done.
  1. Enable the SMP bit (TRM for Cortex A7 mpcore section 4.3.27 System Control) before enabling the caches/MMU or performing any cache and TLB maintenance operations.
  2. Call arm_cp15_start_setup_mmu_and_cache(). I commented out the branch prediction enable (TRM for Cortex A7 mpcore section 4.3.27 System Control, description for the Z bit).
  3. Set up the translation table and set the caches/MMU enable bits by calling arm_cp15_start_setup_translation_table_and_enable_mmu_and_cache(). This is the same as for the Pi 1.
    • In the subsequent call to arm_cp15_start_setup_translation_table(), I have added code to configure the bits in TTBR0 for inner and outer write-back, write-allocate, cacheable, shareable memory (v7 setup link above, https://github.com/mrvn/test/blob/master/mmu.cc).
    • Set domain clients.
    • Enable MMU/caches. Invalidate the branch predictor (ARM v7 Architecture reference manual section B2.2.6 under Branch prediction maintenance operations).
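
For the "set domain clients" step, the sketch below shows how a Domain Access Control Register value is put together: each of the 16 domains has a 2-bit field, where 0b01 means client (permissions in the descriptor are checked) and 0b11 means manager (permissions are not checked). The names are hypothetical, and which domain the BSP actually uses is not assumed here.

```c
#include <stdint.h>

#define SKETCH_DAC_NO_ACCESS 0u
#define SKETCH_DAC_CLIENT    1u
#define SKETCH_DAC_MANAGER   3u

/* Return dacr with the 2-bit access field of one domain replaced. */
static uint32_t sketch_dac_set(uint32_t dacr, unsigned domain, uint32_t access)
{
  unsigned shift = 2u * (domain & 15u);

  dacr &= ~(UINT32_C(3) << shift);
  return dacr | ((access & 3u) << shift);
}
```

For instance, making only domain 15 a client yields 0x40000000.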



Changed TTBR0 translation table memory attributes (using the multiprocessor extensions register format). (TRM for Cortex A7 mpcore section 5.2.1 Memory types and attributes)
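
A sketch of what such a TTBR0 value could look like, assuming the multiprocessing extensions format with TTBCR.N = 0 (16 KiB aligned table): IRGN[0] is bit 6 and IRGN[1] is bit 0, RGN is bits [4:3] and S is bit 1. The macro and function names are hypothetical and this is not necessarily the exact value used in my change.

```c
#include <stdint.h>

/* TTBR0 attribute bits, multiprocessing extensions format (hypothetical names). */
#define SKETCH_TTBR_IRGN_WBWA (UINT32_C(1) << 6) /* IRGN = 0b01: bit 6 set, bit 0 clear */
#define SKETCH_TTBR_RGN_WBWA  (UINT32_C(1) << 3) /* RGN = 0b01 */
#define SKETCH_TTBR_S         (UINT32_C(1) << 1) /* shareable */

/* TTBR0 for table walks that are inner/outer write-back, write-allocate, shareable. */
static uint32_t sketch_ttbr0_value(uint32_t table_base)
{
  return (table_base & ~UINT32_C(0x3FFF))
    | SKETCH_TTBR_IRGN_WBWA
    | SKETCH_TTBR_RGN_WBWA
    | SKETCH_TTBR_S;
}
```

For a table at 0x00004000 this gives 0x0000404A.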




 References 

  • Cortex A7 MPCore Technical Reference Manual
  • ARM v7 Architecture Reference Manual
  • ARM1176JZF-S Technical Reference Manual
  • Existing RTEMS code for Raspberry Pi, Xilinx-zynq, Realview-pbx-a9 BSPs



Next: Cache problem solved!

See: More on Cache/MMU configuration

Previous: Introduction to the project






 

Tuesday 26 May 2015

Introduction to the project

As part of this project I will be working with the Raspberry Pi 2. This new model, launched after the Raspberry Pi Model B+, brings a major improvement: a quad-core processor in its SoC. The project aims to improve BSP support for the Pi 2.

Raspberry Pi 2 MOD B
                           

A little background before proceeding...

With some preliminary changes to existing Raspberry Pi BSP code, the new Pi was up and running using its single core. However, after running benchmarks on it under RTEMS, results indicated that changes to cache configurations are required to improve performance. I will be fixing this as part of the project.

Also, with just one functional core out of four, performance is largely limited. And so, the other important goal is to bring up the secondary cores and enable support for symmetric multiprocessing. This will include running tests to ensure a stable SMP environment and check performance.

Quick links to the two broad sub-parts: