Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SMP for Cortex-A MPcore is unstable #207

Closed
masayuki2009 opened this issue Feb 4, 2020 · 29 comments · Fixed by #254, #292 or #325
Closed

SMP for Cortex-A MPcore is unstable #207

masayuki2009 opened this issue Feb 4, 2020 · 29 comments · Fixed by #254, #292 or #325
Labels
bug Something isn't working

Comments

@masayuki2009
Copy link
Contributor

There were several discussions on SMP issues for Cortex-A MPcore in NuttX mailing list before.
Yesterday I tried sabre-6quad:smp configuration with QEMU, and noticed that it's unstable even
if I changed the configuration to use only 2cores instead of 4cores.

Actually, heap memory was sometimes corrupted during booting up the secondary core.
Even if it succeeded to boot, heap memory was corrupted when a new task is created.
They are very basic use-cases and should work with QEMU.

So, I think something is wrong with Cortex-A SMP implementation, because Cortex-M SMP
(e.g. Sony Spresense in dual core mode) and RISC-V SMP (K210 Maix-bit) are much more stable.

@masayuki2009
Copy link
Contributor Author

masayuki2009 commented Feb 4, 2020

I thought the memory corruption issue relates to an IPI (Inter Processor Interrupt) issue.
But I noticed that there was no nested interrupts.

Also, I noticed that there is a global temporary storage called irqtmp in arm_vectors.S.
I'm not sure why this storage is needed for Cortex-A implementation (perhaps, to store
registers to change processor mode?) but the irqtmp would be accessed from multiple
CPUs so I think the storage should be protected or should be allocated per CPU.

@masayuki2009
Copy link
Contributor Author

Hello @gregory-nutt,

What do you think about my comments on irqtmp in armv7-a/arm_vectors.S?

@patacongo
Copy link
Contributor

Hi @masayuki2009 , this week i have guests and we will be at the beach for 3 more days. I have only my cellphone now so I will not be able to do much.

From the name, that irqtmo is probably used to handle nested SGI interrupts, but I do not know for sure now.

@patacongo
Copy link
Contributor

I have a vew minutes before we leave.

"From the name, that irqtmo is probably used to handle nested SGI interrupts, but I do not know for sure now."

No, that little storage area is use during normal interrupt processing.

"storage should be protected or should be allocated per CPU."

Yes, I think that is true. Also fiqtmp (FIQs are used with TrustZone). These is other *tmp storage, but these are fatal crashes. These need to be redesigned.

@masayuki2009
Copy link
Contributor Author

Hi @patacongo , I'm not in hurry so enjoy your holiday.

@patacongo patacongo added the bug Something isn't working label Feb 4, 2020
@patacongo
Copy link
Contributor

patacongo commented Feb 4, 2020

Also "They are very basic use-cases and should work with QEMU."

It has been awhile since I have used the sabre-6quad, but this historically has not been an issue on real hardware. I wonder if this is just lucky timing on real hardware, or a QEMU emulation difference?

@masayuki2009
Copy link
Contributor Author

It has been awhile since I have used the sabre-6quad,
but this historically has not been an issue on real hardware.
I wonder if this is just lucky timing on real hardware, or a QEMU emulation difference?

I think it was lucky.

In my experience, QEMU has some differences from a real hardaware.

  1. No cache emulation.
  2. Emulation speed depends on host CPU
  3. No peripheral emulation (in most cases, only CPU and UART)
  4. No alignment exceptions (RISC-V only ?)

However, QEMU is very useful for debugging. (You can debug multi cores)

@masayuki2009
Copy link
Contributor Author

Hi @patacongo,

I also noticed that calling setirqstack in arm_vectors.S destroys $r5 which should be preserved.
This might be a one of root cause.

@patacongo
Copy link
Contributor

patacongo commented Feb 8, 2020

@masayuki2009 "I also noticed that calling setirqstack in arm_vectors.S destroys $r5 which should be preserved."

Are you referring to:

 263         setirqstack r1, r5                              /* SP = IRQ stack top */

it looks like R2 is available. Do you think that should be:

 263         setirqstack r1, r2                              /* SP = IRQ stack top */

No... a value in R2 is assumed at line 315, but I think r3 is available.

@patacongo
Copy link
Contributor

I don't have a test setup here so I can't verify it. If you approve, I can submit the PR.

If you were to document how to use QEMU in your test case, I would try that too. However, I have not used QEMU much (and not for many years), so I am not so interested in that learning curve.

@xiaoxiang781216
Copy link
Contributor

@masayuki2009 QEMU is very important for automatiton test in the next step, it's great if you can create a basic config for both riscv and arm and README file to describe how to setup the simulator and debuging.

@patacongo
Copy link
Contributor

For the case of the i.MX6 QEMU, this would be appropriate in boards/arm/imx6/sabre-6quad/README.txt

@masayuki2009
Copy link
Contributor Author

@xiaoxiang781216 @patacongo I'll add comments on how to run sabre-6quad nuttx image on QEMU later. (FYI, running RISC-V FE310 and K210 images on QEMU is already described)

@masayuki2009
Copy link
Contributor Author

@xiaoxiang781216 @patacongo I've just sent a PR #233

@patacongo
Copy link
Contributor

patacongo commented Feb 12, 2020

I don't see any simple way to implement the first issue you mention, that where g_irqtmp would be invalid in the case of nested interrupts. Nested interrupts would occur only in the SMP case where the SGI inter-processor interrupts are non-maskable.

I see only two solutions and neither are simple:

  1. Implement IRQ/FIQ stacks. I am not clear on the design that would use these stacks, but stacking the data should be able to support nesting.

  2. Implement ICCMPR interrupt controls and prohibit nesting interrupts altogether.

@masayuki2009
Copy link
Contributor Author

Hi @patacongo,

I just tried the latest upstream code with QEMU and confirmed that $r5 is not destroyed now.
What I did is that I added a breakpoint at Line 276 (the next line to .Lintnested:) in arm_vectors.S
then run the nuttx with QEMU and the breakpoint was not hit. Actually ps/free/smp commands worked.

However, if I modified defconfig to add a hello application and type free on nsh, it freezed.
Though the above breakpoint was not hit, heap memory was corrupted in this case.

--- a/boards/arm/imx6/sabre-6quad/configs/smp/defconfig
+++ b/boards/arm/imx6/sabre-6quad/configs/smp/defconfig
@@ -21,11 +21,13 @@ CONFIG_ARCH_STACKDUMP=y
CONFIG_BOARD_LOOPSPERMSEC=99369
CONFIG_BOOT_RUNFROMSDRAM=y
CONFIG_BUILTIN=y
+CONFIG_DEBUG_FULLOPT=y
+CONFIG_DEBUG_SYMBOLS=y
CONFIG_DEV_ZERO=y
+CONFIG_EXAMPLES_HELLO=y
CONFIG_FS_PROCFS=y
CONFIG_HAVE_CXX=y
CONFIG_HAVE_CXXINITIALIZE=y
-CONFIG_HOST_WINDOWS=y
CONFIG_IMX6_UART1=y
CONFIG_IMX_DDR_SIZE=1073741824
CONFIG_INTELHEX_BINARY=y

@masayuki2009
Copy link
Contributor Author

@patacongo I found and fixed another bug for armv7-a smp which is a root cause of this issue.
Please see #292 for details.

@patacongo
Copy link
Contributor

Thanks. I have merged that.

@masayuki2009
Copy link
Contributor Author

masayuki2009 commented Feb 20, 2020

@patacongo I've just sent another PR #325 which improves i.mx6 SMP stability. Now ostest works with QEMU (4cores enabled)

However, as I commented, nested interrupt issue might happen. (actually not happens when ostest is running)

@xiaoxiang781216 xiaoxiang781216 linked a pull request Feb 20, 2020 that will close this issue
@masayuki2009
Copy link
Contributor Author

However, as I commented, nested interrupt issue might happen.
(actually not happens when ostest is running)

To reproduce this issue, more complicated use-cases which generate high frequency interrupts would be needed. For examples, DMA, Ethernet, SDIO ... should be involved but no drivers are implemented so far.

@masayuki2009
Copy link
Contributor Author

I found DMA for i.mx6 is not supported in QEMU.

@patacongo
Copy link
Contributor

Sorry, I didn't mean to close this. I hit the wrong button again, apparently.

@patacongo patacongo reopened this Feb 20, 2020
@patacongo
Copy link
Contributor

@masayuki2009 Should this issue be closed now? You did a lot of work with Cortex-A and got parity with the other architectures, right?

@masayuki2009
Copy link
Contributor Author

@masayuki2009 Should this issue be closed now?
You did a lot of work with Cortex-A and got parity with the other architectures, right?

@patacongo Yes, but I checked SMP for Cortex-A with QMU only.
I believe it should work with a real board but I don't have any boards.
So could you check it with your i.MX6 board?

@patacongo
Copy link
Contributor

Per recommendation of @masayuki2009

@xiaoxiang781216
Copy link
Contributor

I thought the memory corruption issue relates to an IPI (Inter Processor Interrupt) issue.
But I noticed that there was no nested interrupts.

@masayuki2009 Is it still true that "cpsid i" can't disable IPI interrupt?
https://github.com/apache/incubator-nuttx/blob/master/boards/arm/imx6/sabre-6quad/README.txt#L178-L200
If yes, could you point the document which mention this special behaviour?

@masayuki2009
Copy link
Contributor Author

@xiaoxiang781216

@masayuki2009 Is it still true that "cpsid i" can't disable IPI interrupt?

No, "cpsid" ** can *** disable IPI interrupt.

https://github.com/apache/incubator-nuttx/blob/master/boards/arm/imx6/sabre-6quad/README.txt#L178-L200

Oh, I updated the TODO file but forgot to update the README.txt
#2346

Anyway, I will create a new PR to update the README.txt

@xiaoxiang781216
Copy link
Contributor

Ok, I will send PR to remove all related trick.

@xiaoxiang781216
Copy link
Contributor

Here is the patch: #4094

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
3 participants