Friday 21 August 2015

Future Work

As part of GSoC 2015, my goals were to fix the cache configuration and enable a stable SMP environment for the Raspberry Pi 2.

With respect to the cache configuration, performance has improved, but the issue of text section access permissions is not yet fully solved. Some workarounds sidestep the problem without fixing it. This has to be looked into after the program.

For SMP support, I ran into issues whose root causes I was eventually able to identify. I worked out what is needed to get basic SMP working and have tried to implement it. However, some problems persist and SMP still does not work on the Pi 2. After this edition of GSoC, the remaining implementation hurdles have to be resolved and SMP enabled on the Pi 2.

Interrupt handling for the Pi 2 has to be set up and integrated with the existing implementation.

Apart from the above, work remains to be done in relation to Pi 2 peripheral support.

I would certainly like to continue working with Raspberry Pi 2 BSP Support!

Generating Interprocessor interrupt

Since the missing IPI was the problem, the next step is to implement sending interrupts between cores. This too is not straightforward, as the Pi 2 does not have the Generic Interrupt Controller (GIC) present in most ARM-based processors. As an alternative, I suggest using the mailboxes of each core for this purpose.

How is the mailbox used?

Out of the four mailboxes of each core, one mailbox is reserved for this purpose. I have used mailbox 3. This might have to be changed to one of mailboxes 0, 1 or 2, as mailbox 3 has a specific purpose during SMP start up.
The mailbox interrupt is generated as long as a non-zero value is written to the mailbox register. Thus, the core wanting to send an interrupt can simply write a non-zero value to mailbox 3 of the target core(s). For now, I just set the mailbox to a value of 1. This has been added through the _CPU_SMP_Send_interrupt() function in ~/raspberrypi/startup/bspsmp.c.

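void _CPU_SMP_Send_interrupt( uint32_t target_processor_index )
{
  /* Generates IPI: a non-zero write to the target core's mailbox 3 raises its mailbox interrupt */
  /* volatile keeps the compiler from optimizing away the register write */
  volatile uint32_t *target_mb_write =
    (volatile uint32_t *)(BCM2836_MAILBOX_3_WRITE_SET_BASE + 0x10 * target_processor_index);
  *target_mb_write = 0x1;
}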


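The IPI handler will basically be a handler for the mailbox 3 interrupt. The handler resets the contents of the receiving core's mailbox 3 to zero by writing the set bit back to the mailbox read/clear register. This clears the interrupt. Then the RTEMS IPI handler _SMP_Inter_processor_interrupt_handler() is called.

/* writing 1 to a bit of the mailbox read/clear register clears that bit; with the mailbox back at zero the interrupt is cleared */
    *mb_read_clr = 0x01;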
   
    _SMP_Inter_processor_interrupt_handler();


As part of initialization for IPI support, the mailbox 3 contents have to be reset to zero and mailbox 3 interrupt enabled in the interrupt control register for each core.
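
The send, handle and initialize steps described above can be tried out on a host machine with a small simulation. This is only a sketch under stated assumptions: the registers are modelled as plain variables rather than memory-mapped hardware, every sketch_ name is hypothetical and not part of RTEMS, and the write-set / write-one-to-clear behaviour and the "bit n enables mailbox n" interrupt control layout follow my reading of the BCM2836 ARM-local peripherals description.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_CORE_COUNT  4u
#define SKETCH_IPI_MAILBOX 3u   /* mailbox reserved for IPIs in this sketch */

/* Simulated per-core registers standing in for the memory-mapped ones. */
typedef struct {
  uint32_t mailbox[4];        /* current contents of mailboxes 0..3 */
  uint32_t mailbox_irq_ctrl;  /* assumed layout: bit n set => mailbox n raises an IRQ */
} sketch_core_regs;

static sketch_core_regs sketch_cores[SKETCH_CORE_COUNT];

/* Write-set register: bits written as 1 are ORed into the mailbox. */
void sketch_mailbox_write_set(uint32_t core, uint32_t bits)
{
  assert(core < SKETCH_CORE_COUNT);
  sketch_cores[core].mailbox[SKETCH_IPI_MAILBOX] |= bits;
}

/* Read/clear register: bits written as 1 are cleared in the mailbox. */
void sketch_mailbox_write_clear(uint32_t core, uint32_t bits)
{
  assert(core < SKETCH_CORE_COUNT);
  sketch_cores[core].mailbox[SKETCH_IPI_MAILBOX] &= ~bits;
}

/* The interrupt is pending while it is enabled and the mailbox is non-zero. */
int sketch_ipi_pending(uint32_t core)
{
  assert(core < SKETCH_CORE_COUNT);
  const sketch_core_regs *r = &sketch_cores[core];
  return (r->mailbox_irq_ctrl & (1u << SKETCH_IPI_MAILBOX)) != 0
    && r->mailbox[SKETCH_IPI_MAILBOX] != 0;
}

/* Initialization: zero the mailbox, then enable its interrupt. */
void sketch_ipi_init(uint32_t core)
{
  sketch_mailbox_write_clear(core, 0xFFFFFFFFu);
  sketch_cores[core].mailbox_irq_ctrl |= 1u << SKETCH_IPI_MAILBOX;
}

/* Sender side: any non-zero value raises the interrupt on the target. */
void sketch_ipi_send(uint32_t target_core)
{
  sketch_mailbox_write_set(target_core, 0x1u);
}

/* Receiver side: clear the bit first; the RTEMS IPI handler would run next. */
void sketch_ipi_handle(uint32_t core)
{
  sketch_mailbox_write_clear(core, 0x1u);
}
```

Calling sketch_ipi_init(1), then sketch_ipi_send(1), makes sketch_ipi_pending(1) report a pending interrupt until sketch_ipi_handle(1) runs. On real hardware each helper would become a volatile access to the corresponding register.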

In order to handle the interrupt appropriately, it has to be registered first. This requires that an interrupt vector be associated with the mailbox 3 interrupt, which will be used by all cores. This interrupt vector is not yet determined. Once it is, the handler can be installed for that vector.

Raspberry Pi Interrupt Handling

Part of the interrupt handling for the Raspberry Pi happens in ~/raspberrypi/irq/irq.c. How to merge IPI initialization and handling with the existing RPi interrupt handling has to be considered.

References

  • I was later able to confirm the mailbox approach with this reference


Monday 10 August 2015

Problems in starting secondary cores

In principle, you provide the start address and the secondary cores execute from there. But I have still not been able to get the Raspberry Pi 2 to start with all four cores running soundly.

  1. Without a debugging environment, I was just guessing what the potential problems could be. So first, I tried to make sure that my code was writing the start address to the desired locations, because if that is not happening the cores certainly cannot be started. When I did get a debugging environment, I used the RPi 1 (the RPi 2 couldn't be used). With gdb I used the x command (x address), which displays the contents of an address, to see whether 0x4000009C, 0x400000AC and 0x400000BC held the address of _start. That turned out to be fine. I also ran a piece of code on the Pi 1 to verify that _start is the right location to jump to.

  2. Since the jump address and code were not the problem, I debugged further. I referred to ~/cpukit/score/include/rtems/score/percpu.h, which documents SMP start up in terms of the states of the CPUs. The cores send and receive events using the functions _CPU_SMP_Processor_event_receive() and _CPU_SMP_Processor_event_broadcast() to carry out state changes. This did not require any RPi 2 specific implementation. After understanding this, I came to IPIs. I had understood earlier that IPIs are sent using the GIC to wake the cores. I didn't see where an IPI would be used for the Pi 2, so it had not been considered till then.

  3. On investigating IPIs, I found that _CPU_SMP_Send_interrupt() in ~/libbsp/arm/shared/arm-a9mpcore-smp.c is the function that generates the IPI. I referred to the ARM Generic Interrupt Controller Architecture Specification (version 2.0) to understand it better. To make sure this was the problem, I used some mundane methods. I used breakpoints to confirm that the function was indeed invoked. Then I commented out the body of this function and built the code for Realview-pbx-a9. But the SMP tests in ~/testsuites/smptests ran normally! I spent some more time before I realized there was a build anomaly: the stale build had not picked up the change. When I deleted the build and created another one from scratch (for some other purpose, with this change also present), the tests failed to run! I had finally got hold of the problem!

  4. Next I have to figure out how to implement the IPI. In the absence of a GIC, I am considering the mailboxes of each core to get this done (I could find no references to check whether this will work). Each core has 4 mailboxes. I think one can be used for communication between processors. I am working on this currently.

Next: Generating Interprocessor interrupt




Starting secondary cores with RTEMS

I found the missing jump address from which the secondary cores should start executing for RTEMS. We use the _start() function defined in the boot.S file. When starting from there, the required initializations for all modes are done correctly for each core. So I have used the following instructions, which write the start address to specific memory locations for each core.

   "ldr r2 , =_start\n"
   "ldr r1 , =0x4000009C\n"
   "str r2 , [r1]\n"
   "ldr r1 , =0x400000AC\n"
   "str r2 , [r1]\n"
   "ldr r1 , =0x400000BC\n"
   "str r2 , [r1]\n"

That's it! Starting the secondary cores is as straightforward as this for bare metal development with the Pi 2. With RTEMS, this has to blend with the process RTEMS uses to bring up SMP for a variant.
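
The address arithmetic behind those three stores can be sketched in C. This is a minimal illustration, not the code used in the BSP: the sketch_ names are hypothetical, the core 0 base address 0x4000008C is an assumption derived by stepping back 0x10 from 0x4000009C, and the function takes a pointer to the register block so it can be exercised on a host with an ordinary array.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed address of core 0's slot; cores 1..3 follow at 0x10-byte steps. */
#define SKETCH_START_SLOT_CORE0  0x4000008Cu
#define SKETCH_START_SLOT_STRIDE 0x10u
#define SKETCH_CORE_COUNT        4u

/* Address that receives the entry point for a given core. */
uint32_t sketch_start_slot_address(uint32_t core)
{
  assert(core < SKETCH_CORE_COUNT);
  return SKETCH_START_SLOT_CORE0 + SKETCH_START_SLOT_STRIDE * core;
}

/*
 * Writes the entry address into the slots of cores 1..3.
 * 'core0_slot' points at core 0's slot; on the target it would be
 * (volatile uint32_t *) SKETCH_START_SLOT_CORE0, on a host any array
 * of at least 13 words works.
 */
void sketch_wake_secondary_cores(volatile uint32_t *core0_slot, uint32_t entry)
{
  for (uint32_t core = 1; core < SKETCH_CORE_COUNT; ++core) {
    /* The stride is in bytes, the pointer indexes 32-bit words. */
    core0_slot[(SKETCH_START_SLOT_STRIDE / 4u) * core] = entry;
  }
}
```

sketch_start_slot_address() returns 0x4000009C, 0x400000AC and 0x400000BC for cores 1, 2 and 3, matching the constants in the assembly above.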

So, next we need to decide where this process should be started. The primary core is the only working core at this time so the task of starting other cores has to be done by it. This should be done in an early stage of boot up. I have added a function raspberrypi_wake_secondary_processors() which will be called by core 0 from the bsp_start_hook_0().

When the secondary cores reach bsp_start_hook_0() further initializations like setting up translation table and MMU will be done by calling the function start_on_secondary_processor() there. This is the function which will eventually lead the secondary cores to SMP set up through a call to _SMP_Start_multitasking_on_secondary_processor().

While the secondary cores execute their thread, the primary core proceeds from bootcard() to rtems_initialize_data_structures() in ~/cpukit/sapi/src/exinit.c, where the call to _SMP_Handler_initialize() is the entry point for SMP initialization from its side.

However, starting SMP didn't turn out to be as easy as it seemed. There were issues, and understanding the flow of execution in RTEMS for the primary and secondary cores helped.

I used available debugging tools to understand the execution and identify problems. This post on debugging explains what I did.



 Code


As mentioned above, the RTEMS SMP start sequence has to be followed for proper start up. Part of this sequence consists of certain functions which must be defined. The definitions depend on the BSP.
  • For a9mpcore BSPs, the file ~/libbsp/arm/shared/arm-a9mpcore-smp.c is shared. Only the function _CPU_SMP_Start_processor() is specific to each variant. For the Pi 2, I have defined these functions in ~/libbsp/arm/raspberrypi/startup/bspsmp.c. This file has to be added to the Raspberry Pi Makefile.am.
  • Also, the variable bsp_processor_count has to be given a default value for BSPs supporting SMP. This is 4 for the Pi 2 since it has 4 cores. For this, we need to add the following line to the linkcmds file: bsp_processor_count = DEFINED(bsp_processor_count) ? bsp_processor_count : 4;
  • When running an application, the maximum number of processors that it may use is specified using CONFIGURE_SMP_MAXIMUM_PROCESSORS, which is defined at build time in the system.h file for the application.
  • The number of processors which will eventually be used is the minimum of the number actually present in hardware and the number configured. This can be found in ~/cpukit/score/src/smp.c in _SMP_Handler_initialize(): cpu_count = cpu_count < cpu_max ? cpu_count : cpu_max;
  • The configure.ac for the Raspberry Pi had to be modified a bit to enable SMP support in the generated configure script. It adds support for the --enable-smp option for the Raspberry Pi.
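
How the processor count is settled can be shown with a tiny sketch. It assumes nothing beyond the two rules listed above; the function names are hypothetical and only mirror the linkcmds default and the minimum taken in _SMP_Handler_initialize().

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the linkcmds rule: keep an explicit value, else default to 4. */
uint32_t sketch_bsp_processor_count(int is_defined, uint32_t defined_value)
{
  return is_defined ? defined_value : 4u;
}

/* Mirrors the minimum of the hardware count and the configured maximum. */
uint32_t sketch_used_processor_count(uint32_t cpu_count, uint32_t cpu_max)
{
  return cpu_count < cpu_max ? cpu_count : cpu_max;
}
```

For example, an application configured for 2 processors on the 4-core Pi 2 ends up using 2, while one configured for 8 is limited to the 4 that exist.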

Thursday 6 August 2015

Cache problem solved!

There were some settings I tried based on my understanding, as explained in the post Fixing cache issue. All the fixes revolved around the translation table register settings and the configuration flags for the translation table in memory used by the MMU. I narrowed it down to the structure arm_cp15_start_section_config arm_cp15_start_mmu_config_table[] in mminit.c, because after all the settings it seemed that data was not being cached at all, indicating something wrong in the flags which control the cacheability of memory sections.




As mentioned earlier, RTEMS uses 1MB memory sections. The macros used for the .flags member are defined in arm-cp15.h. These are a combination of TEX, shareability, cacheability and access permission control bits. While working on this, I made changes to the macros aimed at enabling caching.
It took me time to realize that one section, the text section (which holds the code), was not cached, and that none of the changes I made in arm-cp15.h affected the flag used for it. The flag being used was ARMV7_MMU_READ_WRITE. So I added the cache and buffer enable bits.

{
    .begin = (uint32_t) bsp_section_text_begin,
    .end = (uint32_t) bsp_section_text_end,
    .flags = /*ARMV7_MMU_CODE_CACHED*/ARMV7_MMU_READ_WRITE | ARM_MMU_SECT_C | ARM_MMU_SECT_B
  }
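
To make the role of the two added bits concrete, here is a small sketch of how such section flags are composed. The bit positions are those of the ARMv7-A short-descriptor 1MB section entry; the SKETCH_ names are hypothetical and do not reproduce the actual definitions in arm-cp15.h, and TEX is assumed to be 0 with TEX remap disabled.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions in an ARMv7-A short-descriptor section entry. */
#define SKETCH_SECT_TYPE (0x2u)       /* bits [1:0] = 0b10: 1MB section */
#define SKETCH_SECT_B    (1u << 2)    /* bufferable */
#define SKETCH_SECT_C    (1u << 3)    /* cacheable */
#define SKETCH_SECT_AP0  (1u << 10)   /* access flag when SCTLR.AFE is set */
#define SKETCH_SECT_AP2  (1u << 15)   /* 1 = read-only, 0 = read-write */

/* With TEX = 0, C = 1 and B = 1 select write-back cacheable memory. */
int sketch_section_is_write_back(uint32_t flags)
{
  return (flags & (SKETCH_SECT_C | SKETCH_SECT_B))
    == (SKETCH_SECT_C | SKETCH_SECT_B);
}

/* A section is writable as long as AP[2] is clear. */
int sketch_section_is_writable(uint32_t flags)
{
  return (flags & SKETCH_SECT_AP2) == 0;
}

/* Read-write section without caching: the situation before the fix. */
uint32_t sketch_flags_read_write(void)
{
  return SKETCH_SECT_TYPE | SKETCH_SECT_AP0;
}

/* Read-write section with C and B added: the situation after the fix. */
uint32_t sketch_flags_read_write_cached(void)
{
  return sketch_flags_read_write() | SKETCH_SECT_C | SKETCH_SECT_B;
}

/* Read-only cached section, as one would expect for code. */
uint32_t sketch_flags_code_cached(void)
{
  return sketch_flags_read_write_cached() | SKETCH_SECT_AP2;
}
```

The difference between the last two helpers is only the read-only bit, which is the difference that matters for the text section.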


Was this the only missing setting? Yes! And we could get the tremendous speed up of the faster Raspberry Pi 2 cores! 

So, this seemed to settle the cache problem I was looking to solve. But what I noticed next was that some of the other ARM BSPs could use a macro, ARMV7_CP15_START_DEFAULT_SECTIONS, to define their memory map, which contains the default sections and their default settings.


Why couldn't the Raspberry Pi use it as well? The only difference was the flag used for the text section. The default sections use ARMV7_MMU_CODE_CACHED, which grants only read permission. When I used the same flag for the Raspberry Pi, the system wouldn't start up. This raised another question: why did the text section need write permission along with read? What was being written to a strictly read-only area of memory? Finding where the execution failed needed some debugging.
But I had to put this aside till I could figure out how to debug it. The answer was QEMU: not with the Pi 2, obviously, but the Pi 1 was supported, and both use the same arm_cp15_start_mmu_config_table[]. So this is how I set up the debugging environment.

And here is what I found...

The problem comes up when bsp_start_clear_bss() is invoked in bsp_start_hook_1(). While setting the BSS memory region to zero, the write enters the text section. With read-only settings, this leads to an exception and the start up cannot proceed.
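
A quick way to reason about such a fault is to compare the section boundaries directly. The sketch below is an illustration only, with hypothetical names and regions treated as non-empty half-open ranges [begin, end). Besides a byte-wise overlap it also checks whether two regions touch the same 1MB MMU section, since the translation table cannot give different permissions to two regions inside one 1MB section; whether that is what happens here is not established.

```c
#include <assert.h>
#include <stdint.h>

/* A memory region [begin, end), assumed to be non-empty. */
typedef struct {
  uint32_t begin;
  uint32_t end;
} sketch_region;

/* True if the two regions have at least one byte in common. */
int sketch_regions_overlap(sketch_region a, sketch_region b)
{
  return a.begin < b.end && b.begin < a.end;
}

/* True if the two regions touch at least one common 1MB section. */
int sketch_regions_share_section(sketch_region a, sketch_region b)
{
  uint32_t a_first = a.begin >> 20;
  uint32_t a_last = (a.end - 1u) >> 20;
  uint32_t b_first = b.begin >> 20;
  uint32_t b_last = (b.end - 1u) >> 20;

  return a_first <= b_last && b_first <= a_last;
}
```

On the target, the regions would be filled from the linker symbols for the text and BSS sections (bsp_section_text_begin and bsp_section_text_end are the ones used in the configuration table above).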




Looks like the bss and text sections overlap (which is a bad thing to happen...)

So while trying to fix cache performance problem, an important issue has come up. Next I am looking at finding a solution for this as well :)
 

Wednesday 5 August 2015

More on Cache/MMU configuration

Enabling Snoop Control Unit (SCU)

From what I find here and from some more references, there is no separate control for enabling the SCU in Raspberry Pi 2. Since there is no official documentation available for this I have gone ahead assuming SCU is on by default.


 Conditionally setting up Pi 1 or 2

 There are only a few differences between the configurations required for the two. These are mostly in the initial cp15 controls. Otherwise, they both follow the same initialization method. The same memory configuration structure arm_cp15_start_mmu_config_table[] in mminit.c is used for both.
So to provide for BSP specific set up of the controls, I have added a function raspberrypi_setup_mmu_and_cache() to bspstarthooks.c. The value of BSP_IS_RPI2 determines whether the BSP being used is Pi 1 or Pi 2. A value of 1 implies the BSP is Pi 2. The required parts of the code are conditionally compiled and then the controls are passed to bsp_memory_management_initialize().



Code

This is how the controls are set up for Pi 1 and 2 in raspberrypi_setup_mmu_and_cache()
 
#if (BSP_IS_RPI2 == 1)
  bsp_initial_mmu_ctrl_clear = ARM_CP15_CTRL_A;
  bsp_initial_mmu_ctrl_set = ARM_CP15_CTRL_AFE | ARM_CP15_CTRL_Z; 
#else
  bsp_initial_mmu_ctrl_clear = 0;
  bsp_initial_mmu_ctrl_set = ARM_CP15_CTRL_AFE | ARM_CP15_CTRL_S
                  | ARM_CP15_CTRL_XP; 
#endif



After the above set up, the initialization function is called

bsp_memory_management_initialize(
    bsp_initial_mmu_ctrl_set,
    bsp_initial_mmu_ctrl_clear
  );
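
What the set and clear masks do to the control register can be sketched as follows. This assumes the initialization first clears the bits in the clear mask and then sets the bits in the set mask; it is not copied from RTEMS. The SKETCH_ names are hypothetical, and the bit positions are those of the ARM system control register (A: bit 1, S: bit 8, Z: bit 11, XP: bit 23, AFE: bit 29).

```c
#include <assert.h>
#include <stdint.h>

/* System control register bits used by the two configurations. */
#define SKETCH_CTRL_A   (1u << 1)   /* alignment fault checking */
#define SKETCH_CTRL_S   (1u << 8)   /* system protection (ARMv6) */
#define SKETCH_CTRL_Z   (1u << 11)  /* branch prediction */
#define SKETCH_CTRL_XP  (1u << 23)  /* extended page tables (ARMv6) */
#define SKETCH_CTRL_AFE (1u << 29)  /* access flag enable */

/* Assumed update rule: drop the 'clear' bits, then add the 'set' bits. */
uint32_t sketch_apply_ctrl(uint32_t current, uint32_t set, uint32_t clear)
{
  return (current & ~clear) | set;
}

/* Masks matching the Pi 2 branch of the conditional code above. */
uint32_t sketch_rpi2_ctrl(uint32_t current)
{
  return sketch_apply_ctrl(
    current,
    SKETCH_CTRL_AFE | SKETCH_CTRL_Z,
    SKETCH_CTRL_A
  );
}

/* Masks matching the Pi 1 branch of the conditional code above. */
uint32_t sketch_rpi1_ctrl(uint32_t current)
{
  return sketch_apply_ctrl(
    current,
    SKETCH_CTRL_AFE | SKETCH_CTRL_S | SKETCH_CTRL_XP,
    0u
  );
}
```

For the Pi 2 this turns alignment checking off and branch prediction on, while all bits outside the two masks keep their previous value.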



Debugging

When it came to debugging, I could not get hold of a tool to debug my Pi 2 directly. QEMU was the only resource at my disposal. It does not support Pi 2 as of now, but supports Pi 1 and some SMP capable BSPs. So, to solve my issues, I used QEMU to emulate the Pi 1 and Realview-pbx-a9 multicore BSPs as and when each of them was needed.

With QEMU I used the arm-rtems4.11-gdb to step through the code and set breakpoints.

Using QEMU with Raspberry Pi(v1)

There is a modified QEMU source for Pi which can be obtained from
https://github.com/Torlus/qemu/tree/rpi

Some help on how to build and run QEMU for Pi is present here
http://wiki.osdev.org/Raspberry_Pi_Bare_Bones

If the kernel executable does not run and QEMU appears to hang (which happened to me), then the kernel entry address needs to be changed. I used the mkimage tool to create an image with the correct load address and then ran it with QEMU. This is how I got an RKI image ready for QEMU.

mkimage -A arm -O rtems -T kernel -a 0x00008000 -e 0x00008000 -C none -d rki.bin kernel.img

Using QEMU with Realview-pbx-a9

An executable can be run easily here. This is how I ran an RTEMS "Hello World" application

qemu-system-arm  -M realview-pbx-a9 -m 256M -kernel hello.exe -serial stdio

The number of cores to be used can be specified with the -smp option. For example, the following runs QEMU with 2 cores

qemu-system-arm  -M realview-pbx-a9 -m 256M -kernel smp01.exe -smp 2 -serial stdio

Debugging with QEMU

This is a nice reference for using gdb with QEMU
http://wiki.osdev.org/Kernel_Debugging

The same can be achieved using RTEMS tools (specifically the arm-rtems4.11-objcopy and arm-rtems4.11-gdb utilities)

While using gdb, the " thread N " command can be used to switch between cores and it will let you step through the thread which core N is running (N=1,2,3..)