Friday, 21 August 2015

Future Work

So as part of GSoC 2015 I had to fix cache configurations and enable a stable SMP environment for Raspberry Pi 2.

With respect to cache configurations, an improvement has been obtained, but the issue of text section access permissions is not yet solved completely. Some workarounds sidestep the problem without actually solving it. This matter has to be looked into after the program.

Next, in the case of SMP support, I came across issues whose roots could eventually be identified. I could figure out what needs to be done to get basic SMP working and have tried to implement it. However, some problems persist and SMP still does not work on Pi 2. After this edition of GSoC, the remaining implementation hurdles have to be resolved and SMP enabled on Pi 2.

Interrupt handling for Pi 2 has to be set up and integrated with the existing implementation.

Apart from the above, work remains to be done in relation to Pi 2 peripheral support.

I would certainly like to continue working with Raspberry Pi 2 BSP Support!

Generating Interprocessor interrupt

Since the missing IPI was the problem, the next step is to implement sending of interrupts between cores. This too is not straightforward, as Pi 2 does not have the Generic Interrupt Controller (GIC) found in most ARM based multicore processors. As an alternative, I suggest using the mailboxes of each core for this purpose.

How is the mailbox used?

Out of the four mailboxes of each core, one mailbox is reserved for this purpose. I have used mailbox 3. This might have to be changed to one of mailboxes 0, 1 or 2, as mailbox 3 has a specific purpose during SMP start up.
The mailbox interrupt stays asserted as long as the mailbox register holds a non-zero value. Thus, the core wanting to send an interrupt can simply write a non-zero value to mailbox 3 of the target core(s). For now, I just set the mailbox to a value of 1. This has been added through the _CPU_SMP_Send_interrupt() function in ~/raspberrypi/startup/bspsmp.c .

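void _CPU_SMP_Send_interrupt( uint32_t target_processor_index )
{
  /* Generates an IPI by setting a bit in mailbox 3 of the target core.
     The pointer is volatile so the register write is not optimized away. */
  volatile uint32_t *target_mb_write = (volatile uint32_t *)
    (BCM2836_MAILBOX_3_WRITE_SET_BASE + 0x10 * target_processor_index);

  *target_mb_write = 0x1;
}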


The IPI handler will basically be a handler for the mailbox 3 interrupt. The handler resets the contents of the receiving core's mailbox 3 to zero, which clears the interrupt. Then the RTEMS IPI handler _SMP_Inter_processor_interrupt_handler() is called.

/* writing 1 to the read/clear register clears that mailbox bit, and with it the interrupt */
*mb_read_clr = 0x01;

_SMP_Inter_processor_interrupt_handler();


As part of initialization for IPI support, the mailbox 3 contents have to be reset to zero and the mailbox 3 interrupt enabled in the interrupt control register for each core.

In order to be able to handle the interrupt appropriately, it has to be registered first. This requires an interrupt vector to be associated with the mailbox 3 interrupt, which will be used by all cores. This interrupt vector is not yet determined. Once determined, the handler can be installed for the interrupt vector.
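The send and acknowledge cycle described in this section can be tried out on a host machine with a small model of one mailbox. This is only a sketch: the type mailbox_model and all function names are made up for illustration, and the set/clear behaviour (bits written as 1 to the write-set register are set, bits written as 1 to the read/clear register are cleared) is an assumption based on how the two registers are used above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Host-side model of one mailbox. On hardware the content is reached
   through two separate registers (write-set and read/clear). */
typedef struct {
  uint32_t value;
} mailbox_model;

/* Write-set register: bits written as 1 are OR-ed into the mailbox. */
static void mailbox_write_set(mailbox_model *mb, uint32_t bits)
{
  mb->value |= bits;
}

/* Read/clear register: bits written as 1 are cleared in the mailbox. */
static void mailbox_write_clear(mailbox_model *mb, uint32_t bits)
{
  mb->value &= ~bits;
}

/* The mailbox interrupt stays pending while the content is non-zero. */
static bool mailbox_irq_pending(const mailbox_model *mb)
{
  return mb->value != 0;
}

/* Sender side: raise an IPI on the target core's mailbox. */
static void model_send_ipi(mailbox_model *target)
{
  assert(target != NULL);
  mailbox_write_set(target, 0x1);
}

/* Receiver side: acknowledge the interrupt by clearing every set bit.
   Returns true if there was an IPI to handle. */
static bool model_handle_ipi(mailbox_model *own)
{
  assert(own != NULL);
  if (!mailbox_irq_pending(own)) {
    return false;
  }
  mailbox_write_clear(own, own->value);
  return true;
}
```

In the real handler, the call to _SMP_Inter_processor_interrupt_handler() would follow the acknowledge step.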

Raspberry Pi Interrupt Handling

Some part of the interrupt handling for Raspberry Pi happens in ~/raspberrypi/irq/irq.c . How to merge IPI initialization and handling with the existing RPi interrupt handling has to be considered.

References

  • I was later able to confirm the mailbox approach with this reference


Monday, 10 August 2015

Problems in starting secondary cores

Provide the start address and the secondary cores will execute from there. But I have still not been able to get the Raspberry Pi 2 to start with all four cores running soundly.

  1. Without a debugging environment, I was just guessing what the potential problems could be. So firstly, I tried to make sure that my code was writing the start address to the desired locations, because if that does not happen the cores certainly cannot be started. When I did get a debugging environment, I used the RPi 1 (RPi 2 couldn't be used). With gdb I used the x command, which displays the contents of an address, to see whether 0x4000009C, 0x400000AC and 0x400000BC held the address of _start. They did. I also ran a piece of code on Pi 1 to verify that _start is the right location to jump to.

  2. Since the jump address and code were not the problem, I debugged further. I referred to ~/cpukit/score/include/rtems/score/percpu.h, which documents SMP start up in terms of the states of the CPUs. The cores send and receive events using the functions _CPU_SMP_Processor_event_receive() and _CPU_SMP_Processor_event_broadcast() to carry out state changes. This did not require any RPi 2 specific implementation. After understanding this, I came to IPIs. I had understood earlier that IPIs are sent using the GIC to wake the cores. I didn't see where an IPI would be used for Pi 2, and so it had not been considered till then.

  3. On investigating IPIs, I found that _CPU_SMP_Send_interrupt() in ~/libbsp/arm/shared/arm-a9mpcore-smp.c is the function that generates the IPI. I referred to the ARM Generic Interrupt Controller Architecture Specification (version 2.0) to understand it better. To make sure this was the problem I used some mundane methods. I used breakpoints to confirm that the function was indeed invoked. Then I commented out the body of this function and built the code for Realview-pbx-a9. But the SMP tests in ~/testsuites/smptests ran normally! It took me some time to realize that there was a build anomaly. When I deleted the build and created another one from scratch (for some other purpose, with this change still present), the tests failed to run! I could finally get hold of the problem!

  4. Next I have to figure out how to implement the IPI. In the absence of a GIC, I am considering the mailboxes of each core to get this done (I could find no references for checking whether this will work). Each core has 4 mailboxes. I think one can be used for communication between processors. I am working on this currently.

Next: Generating Interprocessor interrupt




Starting secondary cores with RTEMS

So I could find the missing jump address where the secondary cores should start executing from for RTEMS. We use the _start() function defined in the boot.S file. When starting from there, required initializations for all modes are done correctly for each core. So I have used the following instructions which will write the start address to specific memory locations for each core.

   "ldr r2 , =_start\n"
   "ldr r1 , =0x4000009C\n"
   "str r2 , [r1]\n"
   "ldr r1 , =0x400000AC\n"
   "str r2 , [r1]\n"
   "ldr r1 , =0x400000BC\n"
   "str r2 , [r1]\n"

That's it! Starting the secondary cores is that straightforward for bare metal development with the Pi 2. With RTEMS, this has to blend with the process RTEMS uses to bring up SMP for a variant.

So, next we need to decide where this process should be started. The primary core is the only working core at this time so the task of starting other cores has to be done by it. This should be done in an early stage of boot up. I have added a function raspberrypi_wake_secondary_processors() which will be called by core 0 from the bsp_start_hook_0().

When the secondary cores reach bsp_start_hook_0() further initializations like setting up translation table and MMU will be done by calling the function start_on_secondary_processor() there. This is the function which will eventually lead the secondary cores to SMP set up through a call to _SMP_Start_multitasking_on_secondary_processor().

While the secondary cores execute their thread, the primary core proceeds to boot_card(), and then to rtems_initialize_data_structures() in ~/cpukit/sapi/src/exinit.c, where the call to _SMP_Handler_initialize() is the entry point for SMP initialization from its side.

However, starting SMP didn't turn out to be as easy as it seemed. There were issues, and understanding the flow of execution in RTEMS for the primary and secondary cores helped.

I used available debugging tools to understand the execution and identify problems. This post on debugging explains what I did.



 Code


As mentioned above, the RTEMS SMP start sequence has to be followed for proper start up. Part of this sequence are certain functions which must be defined. The definitions depend on the BSP.
  • For a9mpcore BSPs, the file ~/libbsp/arm/shared/arm-a9mpcore-smp.c is shared. Only the function _CPU_SMP_Start_processor() is specific to each variant. For Pi 2, I have defined these functions in ~/libbsp/arm/raspberrypi/startup/bspsmp.c. This file has to be added to the Raspberry Pi Makefile.am.
  • Also the variable bsp_processor_count has to be given a default value for BSPs supporting SMP. This is 4 in the case of Pi 2 since it has 4 cores. For this we need to add the following line to the linkcmds file:
    bsp_processor_count = DEFINED (bsp_processor_count) ? bsp_processor_count : 4;
  • When running an application, the maximum number of processors that it may use is specified using CONFIGURE_SMP_MAXIMUM_PROCESSORS, which is defined at build time in the system.h file of the application.
  • The number of processors which will eventually be used is the minimum of the number actually present in hardware and the number configured. This can be found in ~/cpukit/score/src/smp.c in _SMP_Handler_initialize():
    cpu_count = cpu_count < cpu_max ? cpu_count : cpu_max;
  • The configure.ac for Raspberry Pi had to be modified a bit to enable SMP support in the generated configure script. It adds support for the --enable-smp option for Raspberry Pi.
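To make the processor count rule from the list above concrete, here is a minimal stand-alone sketch. The function name effective_cpu_count() is hypothetical; only the ternary expression mirrors the line quoted from smp.c.

```c
#include <assert.h>
#include <stdint.h>

/* Number of processors that end up being used: the smaller of what the
   hardware provides (cpu_count) and what the application configured
   (cpu_max, i.e. CONFIGURE_SMP_MAXIMUM_PROCESSORS). */
static uint32_t effective_cpu_count(uint32_t cpu_count, uint32_t cpu_max)
{
  assert(cpu_max > 0);
  return cpu_count < cpu_max ? cpu_count : cpu_max;
}
```

So a Pi 2 (bsp_processor_count of 4) running an application configured for 2 processors would use 2 cores, while an application configured for 8 would still get only 4.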

Thursday, 6 August 2015

Cache problem solved!

There were some settings I tried based on my understanding, as explained in the post Fixing cache issue. All the fixes revolved around the translation table register settings and the configuration flags for the translation table in memory used by the MMU. I narrowed it down to the structure arm_cp15_start_section_config arm_cp15_start_mmu_config_table[] in mminit.c because, after all the settings, it seemed that data was not being cached at all, indicating something wrong in the flags which control the cacheability of memory sections.




As mentioned earlier, RTEMS uses 1 MB memory sections. The macros used for the .flags member are defined in arm-cp15.h. These are a combination of TEX, shareability, cacheability and access permission control bits. While working on this, I made changes to the macros aimed at enabling caching.
It took me time to realize that one section, the text section (which holds the code), was not cached, and that none of the changes I made in arm-cp15.h affected the flag used for it. The flag being used was ARMV7_MMU_READ_WRITE. So I added the cache and buffer enable bits.

{
  .begin = (uint32_t) bsp_section_text_begin,
  .end = (uint32_t) bsp_section_text_end,
  .flags = /* ARMV7_MMU_CODE_CACHED */
    ARMV7_MMU_READ_WRITE | ARM_MMU_SECT_C | ARM_MMU_SECT_B
}


Was this the only missing setting? Yes! And we could get the tremendous speed up of the faster Raspberry Pi 2 cores! 

So, this seemed to settle the cache problem I was looking to solve. But what I observed next was that some of the other ARM BSPs use a macro, ARMV7_CP15_START_DEFAULT_SECTIONS, to define their memory map, which contains the default sections and their default settings.

Why couldn't Raspberry Pi use it as well? The only difference was the flag used for the text section. The default sections use ARMV7_MMU_CODE_CACHED, which grants only read permission. But when I used the same flag for Raspberry Pi, the system wouldn't start up. This raised another issue - why did the text section need write permission along with read? What was being written to a strictly read-only area of memory? This needed some debugging to find where the execution failed.
But I had to put this aside till I could figure out how to debug it. The answer was QEMU, not with Pi 2 obviously, but Pi 1 was there, and both use the same arm_cp15_start_mmu_config_table[]. So this is how I set up the debugging environment.

And here is what I found...

The problem comes up when bsp_start_clear_bss() is invoked in bsp_start_hook_1(). While trying to set the BSS memory region to '0', the write enters the text section. With read-only settings, this leads to an exception and the start up cannot proceed.




Looks like the bss and text sections overlap (which is a bad thing to happen...)
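A quick way to reason about such an overlap is to compare the two address ranges directly. The sketch below is an illustration only: the type section_range and the function names are hypothetical, and the addresses in the usage note are made up. It checks both a byte-level overlap and whether two ranges fall into the same 1 MB MMU section, since access permissions are applied per section; which of the two is actually happening here is not established.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MMU_SECTION_SIZE 0x00100000U /* 1 MB sections */

typedef struct {
  uint32_t begin; /* inclusive */
  uint32_t end;   /* exclusive */
} section_range;

/* True if the two half-open ranges share at least one byte. */
static bool ranges_overlap(section_range a, section_range b)
{
  assert(a.begin <= a.end && b.begin <= b.end);
  return a.begin < b.end && b.begin < a.end;
}

/* Expand a range to the 1 MB boundaries the MMU actually maps.
   Assumes the range does not end in the last megabyte of the
   address space (the rounding would wrap around). */
static section_range widen_to_mmu_sections(section_range r)
{
  section_range w;
  w.begin = r.begin & ~(MMU_SECTION_SIZE - 1U);
  w.end = (r.end + MMU_SECTION_SIZE - 1U) & ~(MMU_SECTION_SIZE - 1U);
  return w;
}

/* True if the two ranges touch a common 1 MB MMU section. */
static bool share_mmu_section(section_range a, section_range b)
{
  return ranges_overlap(widen_to_mmu_sections(a), widen_to_mmu_sections(b));
}
```

For example, a text range of 0x8000..0x48000 and a bss range of 0x50000..0x60000 do not overlap byte for byte, yet both live in the first 1 MB section, so read-only flags on that section would also block writes to bss.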

So while trying to fix cache performance problem, an important issue has come up. Next I am looking at finding a solution for this as well :)
 

Wednesday, 5 August 2015

More on Cache/MMU configuration

Enabling Snoop Control Unit (SCU)

From what I find here and from some more references, there is no separate control for enabling the SCU in Raspberry Pi 2. Since there is no official documentation available for this I have gone ahead assuming SCU is on by default.


 Conditionally setting up Pi 1 or 2

There are only a few differences between the configurations required for the two. These are mostly in the initial CP15 controls. Otherwise, they both follow the same initialization method. The same memory configuration structure arm_cp15_start_mmu_config_table[] in mminit.c is used for both.
So, to provide for BSP specific set up of the controls, I have added a function raspberrypi_setup_mmu_and_cache() to bspstarthooks.c . The value of BSP_IS_RPI2 determines whether the BSP being built is Pi 1 or Pi 2. A value of 1 means the BSP is Pi 2. The required parts of the code are conditionally compiled and then the controls are passed to bsp_memory_management_initialize() .



Code

This is how the controls are set up for Pi 1 and 2 in raspberrypi_setup_mmu_and_cache()
 
#if (BSP_IS_RPI2 == 1)
  bsp_initial_mmu_ctrl_clear = ARM_CP15_CTRL_A;
  bsp_initial_mmu_ctrl_set = ARM_CP15_CTRL_AFE | ARM_CP15_CTRL_Z; 
#else
  bsp_initial_mmu_ctrl_clear = 0;
  bsp_initial_mmu_ctrl_set = ARM_CP15_CTRL_AFE | ARM_CP15_CTRL_S
                  | ARM_CP15_CTRL_XP; 
#endif



After the above set up, the initialization function is called:

bsp_memory_management_initialize(
    bsp_initial_mmu_ctrl_set,
    bsp_initial_mmu_ctrl_clear
  );
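To illustrate what the two masks do, here is a host-runnable sketch of applying them to a control register value. It assumes the initialization clears the bits in the clear mask first and then sets the bits in the set mask; apply_ctrl() is a hypothetical helper, and the CTRL_* macros are stand-ins for the corresponding ARM_CP15_CTRL_* definitions, using the architectural SCTLR bit positions.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for ARM_CP15_CTRL_*: bit positions in the CP15 control
   register (SCTLR). */
#define CTRL_M   (1U << 0)  /* MMU enable */
#define CTRL_A   (1U << 1)  /* alignment fault checking */
#define CTRL_C   (1U << 2)  /* data cache enable */
#define CTRL_Z   (1U << 11) /* branch prediction enable */
#define CTRL_I   (1U << 12) /* instruction cache enable */
#define CTRL_AFE (1U << 29) /* access flag enable */

/* Clear the unwanted control bits, then set the wanted ones. */
static uint32_t apply_ctrl(uint32_t ctrl, uint32_t set, uint32_t clear)
{
  assert((set & clear) == 0); /* a bit cannot be both set and cleared */
  ctrl &= ~clear;
  ctrl |= set;
  return ctrl;
}
```

With the Pi 2 values above, a register that starts with only the A bit set ends up with A cleared and AFE and Z set; the cache and MMU enable bits (I, C, M) are added later when the translation table is in place.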



Debugging

When it came to debugging, I could not get hold of a tool to debug my Pi 2 directly. QEMU was the only resource at my disposal. It does not support Pi 2 as of now, but supports Pi 1 and some SMP capable BSPs. So, to solve my issues, I used QEMU to emulate the Pi 1 and Realview-pbx-a9 multicore BSPs as and when each of them was needed.

With QEMU I used the arm-rtems4.11-gdb to step through the code and set breakpoints.

Using QEMU with Raspberry Pi(v1)

There is a modified QEMU source for Pi which can be obtained from
https://github.com/Torlus/qemu/tree/rpi

Some help on how to build and run QEMU for Pi is present here
http://wiki.osdev.org/Raspberry_Pi_Bare_Bones

If the kernel executable does not run and QEMU appears to hang (which happened to me), then the kernel entry address needs to be changed. I used the mkimage tool to create an image with the correct load address and then ran it with QEMU. This is how I got an RKI image ready for QEMU.

mkimage -A arm -O rtems -T kernel -a 0x00008000 -e 0x00008000 -C none -d rki.bin kernel.img

Using QEMU with Realview-pbx-a9

An executable can be run easily here. This is how I ran an RTEMS "Hello World" application

qemu-system-arm  -M realview-pbx-a9 -m 256M -kernel hello.exe -serial stdio

The number of cores to be used can be specified with the -smp option. For example, the following runs QEMU with 2 cores

qemu-system-arm  -M realview-pbx-a9 -m 256M -kernel smp01.exe -smp 2 -serial stdio

Debugging with QEMU

This is a nice reference for using gdb with QEMU
http://wiki.osdev.org/Kernel_Debugging

The same can be achieved using RTEMS tools (specifically the arm-rtems4.11-objcopy and arm-rtems4.11-gdb utilities)

While using gdb, the "thread N" command can be used to switch between cores; it lets you step through the thread that core N is running (N=1,2,3..)







Saturday, 20 June 2015

RTEMS SMP Initialization

This post is about my understanding related to how RTEMS initializes SMP.

From start.S, execution first goes to bsp_start_hook_0, then to arm_a9mpcore_start_hook_0 (in arm-a9mpcore-start.h), where the secondary cores call the function arm_a9mpcore_start_on_secondary_processor(), which performs cache and MMU initialization for them. The primary core goes on to call boot_card().

After the bsp hooks, starting from boot_card(), I traced some of the functions for the primary core (of the highly modular code :)) that I could get hold of for now. Below is a picture that captures the sequence very briefly.

 
Much of the SMP work is based on a9mpcore. Parts of the SMP code lead to a9mpcore specific functions in arm-a9mpcore-smp.c and also to the BSP specific implementation in bspsmp.c . So what I see is that, to keep this sequence consistent for Pi 2 as well, we will have to provide Pi 2 specific code for these functions instead of using the definitions from these files.

There is arm-gic based interrupt handling, which will not be relevant to Pi 2 as it simply does not have that interrupt controller.


Next: Starting secondary cores with RTEMS

Tuesday, 16 June 2015

Getting SMP started on Raspberry Pi 2

When the Raspberry Pi 2 starts up, the primary core (core 0) executes the initial boot sequence while the secondary cores (cores 1, 2 and 3) are powered on and wait for a jump address to be specified.

Each core has four mailboxes. The secondary cores poll a particular mailbox and wait for its contents to become non-zero. The value read provides the jump address.

The cores read from mailbox 3. The physical address of mailbox 3 for each of the three secondary cores can be obtained as

0x4000008C + 0x10 * CPU_ID  for CPU_ID=1,2,3
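The formula above can be written as a small helper. The macro and function names here are hypothetical; only the base address and the 0x10 stride come from the formula.

```c
#include <assert.h>
#include <stdint.h>

#define MAILBOX_3_BASE 0x4000008CU /* mailbox 3 of core 0 */
#define MAILBOX_STRIDE 0x10U       /* distance between consecutive cores */
#define CORE_COUNT     4U

/* Physical address of mailbox 3 for the given core. */
static uint32_t mailbox3_address(uint32_t cpu_id)
{
  assert(cpu_id < CORE_COUNT);
  return MAILBOX_3_BASE + MAILBOX_STRIDE * cpu_id;
}
```

For the three secondary cores this gives 0x4000009C, 0x400000AC and 0x400000BC.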

The jump address can be that of a function that will be executed by the cores.

The details of this function have to be identified.


Next: RTEMS SMP Initialization

Go to: Starting secondary cores with RTEMS

Wednesday, 27 May 2015

Fixing cache issue

What are the configurations?

  • Enabling cache/MMU
  • Enabling SCU
  • Integrating with existing initialization for Raspberry Pi 1 

   

Enabling cache/MMU


For Raspberry Pi 1 

Cache can be enabled simply by setting the bits in CP15 Control Register.

Cache/mmu setup for Raspberry Pi starts in
~/rtems/c/src/lib/libbsp/arm/shared/mminit.c 

This file implements a single memory initialization function.
BSP_START_TEXT_SECTION void bsp_memory_management_initialize(void)

It has ARM1176 specific code, so only a part of the function will be the same for Pi 2. A separate initialization function would be better.


Next comes setting up the translation table and turning on the caches/MMU through CP15.
This is supported by existing code in the file
~/rtems/c/src/lib/libbsp/arm/shared/include/arm-cp15-start.h
in the function arm_cp15_start_setup_translation_table_and_enable_mmu_and_cache()

/* Enable MMU and cache */
  ctrl |= ARM_CP15_CTRL_I | ARM_CP15_CTRL_C | ARM_CP15_CTRL_M;

  arm_cp15_set_control(ctrl);


For Raspberry Pi 2

For Cortex-A7, to enable the caches, an additional SMP bit in the CP15 Auxiliary Control Register has to be set. This is not currently done, and so by default the caches and MMU are disabled.
This is similar to Cortex-A9.

Following this, enable the instruction and data caches and the MMU.
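A minimal sketch of setting the SMP bit is shown below. It assumes the SMP bit is bit 6 of the Auxiliary Control Register, as given in the Cortex-A7 MPCore TRM; the macro and function names are hypothetical. The register access itself only compiles for an ARMv7 target, so the bit manipulation is kept in a separate function that can be checked on any host.

```c
#include <assert.h>
#include <stdint.h>

#define ACTLR_SMP (1U << 6) /* assumed position of ACTLR.SMP on Cortex-A7 */

/* Pure helper: returns the ACTLR value with the SMP bit set. */
static uint32_t actlr_enable_smp(uint32_t actlr)
{
  return actlr | ACTLR_SMP;
}

#if defined(__arm__) && defined(__ARM_ARCH) && (__ARM_ARCH >= 7)
/* On target: read-modify-write ACTLR (CP15 c1, c0, opcode2 = 1).
   Must run before the caches/MMU are enabled. */
static void cortex_a7_set_smp_bit(void)
{
  uint32_t actlr;

  __asm__ volatile ("mrc p15, 0, %0, c1, c0, 1" : "=r" (actlr));
  actlr = actlr_enable_smp(actlr);
  __asm__ volatile ("mcr p15, 0, %0, c1, c0, 1" : : "r" (actlr));
  __asm__ volatile ("isb");
}
#endif
```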

Reference to Xilinx-zynq
~/libbsp/arm/xilinx-zynq/startup/bspstartmmu.c

The function arm_cp15_start_setup_mmu_and_cache() will be required for Cortex-A7.

Next, as for Pi 1 (with the respective parameters), comes a call to the function arm_cp15_start_setup_translation_table_and_enable_mmu_and_cache(), which will also be used for A7.


When it comes to the MMU, it is about the translation table and access permissions. Here only the translation table concerns us.
The ARM MMU allows two types of translation: section based (which requires a single level of translation) and page based (which requires two levels of translation). I see that we use section based translation, so there is a single translation table used by the MMU.
  • The translation table is set up in arm-cp15-start.h file under arm_cp15_start_set_translation_table_entries(ttb, &config_table [i]) . 
  • The ARM memory configuration for Raspberry Pi under RTEMS is provided by arm_cp15_start_mmu_config_table[] in mm_config_table.c .
  • The memory attributes for ARM memory are controlled using several flags which are defined in  arm-cp15.h . These flags are a combination of bits and are present for each entry in the translation table. These bits are according to the ARM v7 translation table descriptor format for sections.


  • Relevant to the issue at hand are the bits controlling cacheability. These are TEX[2:0],  C and B bits.
  • There are settings to control the translation table memory itself. This is done through TTBR0 register (format which supports multiprocessor extensions)
I have reused this mm_config_table[]. I took a look at the configurations for these bits (experimenting only with these settings; otherwise the mm_config_table is the same as that for Pi 1). There is not much official documentation about Pi 2 caches, but from the references I see that normal memory (basically ROM and RAM for ARM) should use a write back, write allocate policy for cacheable regions for better performance.
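As a rough illustration of the single-level, section based translation described above, the sketch below computes which translation table entry covers an address and builds identity-mapped section entries. All names are hypothetical and this is not the RTEMS implementation; it only assumes the ARMv7 short-descriptor layout, where a 4 GB address space split into 1 MB sections gives 4096 entries and the two lowest descriptor bits are 0b10 for a section.

```c
#include <assert.h>
#include <stdint.h>

#define SECTION_SHIFT 20U                    /* 1 MB sections */
#define SECTION_SIZE  (1U << SECTION_SHIFT)
#define TTB_ENTRIES   4096U                  /* 4 GB / 1 MB */
#define DESC_SECTION  0x2U                   /* descriptor type: section */

/* Index of the translation table entry that covers an address. */
static uint32_t section_index(uint32_t address)
{
  return address >> SECTION_SHIFT;
}

/* Section descriptor: 1 MB aligned base address, attribute flags, type. */
static uint32_t section_descriptor(uint32_t physical_address, uint32_t flags)
{
  assert((flags & ~(SECTION_SIZE - 1U)) == 0); /* flags stay below bit 20 */
  return (physical_address & ~(SECTION_SIZE - 1U)) | flags | DESC_SECTION;
}

/* Identity-map every section touched by the non-empty range [begin, end). */
static void map_sections(uint32_t *ttb, uint32_t begin, uint32_t end,
                         uint32_t flags)
{
  uint32_t i;
  uint32_t iend;

  assert(begin < end);
  iend = section_index(end - 1U) + 1U;
  for (i = section_index(begin); i < iend; ++i) {
    ttb[i] = section_descriptor(i << SECTION_SHIFT, flags);
  }
}
```

Because every entry covers a full megabyte, two memory regions that share a megabyte also share one set of attribute flags.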

Existing settings include:
TEX[2:0]=1,1,1 & C,B=1,1 -> write back, no write allocate, normal memory (for both inner and outer). No write allocate is costly.
This has been applied to the cacheable regions of memory (I identified these as having the CACHED suffix in the macro in mm_config_table).

Without turning the caches/MMU on I get 83333 dhrystones/sec. After turning them on I get 76923 dhrystones/sec.

Tried changes:
  1. TEX[2:0]=1,0,1 & C,B=0,1 -> cacheable memory: write back, write allocate, normal memory. This significantly reduced performance (58823 dhrystones/sec).
  2. TEX[2:0]=0,0,1 & C,B=1,1 -> write back, write allocate, normal memory (region does not remain cacheable?). Here the performance was similar to the performance obtained after just enabling caches. The exception is that, in this case, on the first run of dhrystone I get 83333 dhrystones/sec, whereas without this change I get 76923 dhrystones/sec. On subsequent executions of dhrystone the performance fluctuates between the two figures.
  3. Changes to TTBR0, which controls the translation table memory region attributes.
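To keep track of what a given TEX/C/B combination means, the bits can be packed and decoded as below. This is a sketch under assumptions: it uses the ARMv7 short-descriptor section layout (B at bit 2, C at bit 3, TEX at bits 14:12) with TEX remap disabled, and it only decodes the TEX[2]=1 form, where TEX[1:0] gives the outer policy and C,B the inner policy. The names are hypothetical and are not the macros from arm-cp15.h.

```c
#include <assert.h>
#include <stdint.h>

#define SECT_B         (1U << 2)
#define SECT_C         (1U << 3)
#define SECT_TEX_SHIFT 12U
#define SECT_TEX_MASK  (0x7U << SECT_TEX_SHIFT)

/* Two-bit cache policy encoding used when TEX[2] = 1. */
typedef enum {
  POLICY_NON_CACHEABLE = 0,
  POLICY_WB_WA = 1,    /* write back, write allocate */
  POLICY_WT_NO_WA = 2, /* write through, no write allocate */
  POLICY_WB_NO_WA = 3  /* write back, no write allocate */
} cache_policy;

/* Pack TEX[2:0], C and B into their section descriptor positions. */
static uint32_t section_cache_bits(uint32_t tex, uint32_t c, uint32_t b)
{
  assert(tex <= 7U && c <= 1U && b <= 1U);
  return (tex << SECT_TEX_SHIFT) | (c ? SECT_C : 0U) | (b ? SECT_B : 0U);
}

/* Outer policy = TEX[1:0]; only meaningful when TEX[2] = 1. */
static cache_policy outer_policy(uint32_t flags)
{
  uint32_t tex = (flags & SECT_TEX_MASK) >> SECT_TEX_SHIFT;

  assert((tex & 0x4U) != 0U);
  return (cache_policy) (tex & 0x3U);
}

/* Inner policy = {C, B}; only meaningful when TEX[2] = 1. */
static cache_policy inner_policy(uint32_t flags)
{
  uint32_t c = (flags & SECT_C) ? 1U : 0U;
  uint32_t b = (flags & SECT_B) ? 1U : 0U;

  assert((flags & (0x4U << SECT_TEX_SHIFT)) != 0U);
  return (cache_policy) ((c << 1) | b);
}
```

For instance, the first tried change (TEX=1,0,1 and C,B=0,1) decodes to write back, write allocate for both the outer and the inner cache.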

 Code

For now, I have not considered integration of the two Pi variants. I have replaced the existing  
BSP_START_TEXT_SECTION void bsp_memory_management_initialize(void)

with initialization required for Pi2.

The link below explains the set up for the ARM v7 architecture
  •  Invalidation of caches is not done.
  1. Enable the SMP bit (TRM for Cortex A7 MPCore, section 4.3.27 System Control) before enabling caches/MMU or performing any cache and TLB maintenance operations.
  2. Call arm_cp15_start_setup_mmu_and_cache(). The branch prediction enable is commented out (TRM for Cortex A7 MPCore, section 4.3.27 System Control, description of the Z bit).
  3. Set up the translation table and set the caches/MMU enable bits by calling arm_cp15_start_setup_translation_table_and_enable_mmu_and_cache(). This is the same as for Pi 1.
    • In the subsequent call to arm_cp15_start_setup_translation_table(), I have added code to configure the bits in TTBR0 for inner and outer write-back, write-allocate, cacheable, shareable memory (v7 set up link above, https://github.com/mrvn/test/blob/master/mmu.cc).
    • Set domain clients.
    • Enable MMU/caches. Invalidate the branch predictor (ARM v7 Architecture Reference Manual, section B2.2.6, under Branch prediction maintenance operations).



Changed TTBR0 translation table memory attributes (using the multiprocessor extensions register format). (TRM for Cortex A7 mpcore section 5.2.1 Memory types and attributes)




 References 

  • Cortex A7 MPCore Technical Reference Manual
  • ARM v7 Architecture Reference Manual
  • ARM1176jzfs Technical Reference Manual
  • Existing RTEMS code for Raspberry Pi, Xilinx-zynq, Realview-pbx-a9 BSPs



Next: Cache problem solved!

See: More on Cache/MMU configuration

Previous: Introduction to the project






 

Tuesday, 26 May 2015

Introduction to the project

As part of this project I will be working with the Raspberry Pi 2. This new model, launched after the Raspberry Pi Model B+, brings a major improvement in the form of the quad-core processor in its SoC. The project aims to improve BSP support for the Pi 2.

Raspberry Pi 2 MOD B
                           

A little background before proceeding...

With some preliminary changes to existing Raspberry Pi BSP code, the new Pi was up and running using its single core. However, after running benchmarks on it under RTEMS, results indicated that changes to cache configurations are required to improve performance. I will be fixing this as part of the project.

Also, with just one functional core out of four, performance is largely limited. And so, the other important goal is to bring up the secondary cores and enable support for symmetric multiprocessing. This will include running tests to ensure a stable SMP environment and check performance.

Quick links to the two broad sub-parts: