Update README.md
rohanverma94 authored Apr 30, 2018
1 parent 8a25c25 commit 7e3eb09
Showing 1 changed file with 14 additions and 9 deletions.
23 changes: 14 additions & 9 deletions README.md
@@ -1,19 +1,24 @@
# stm32f4xxx-SIMD-add
This is a demonstration of SIMD code on an stm32f4xxx microcontroller. This repository is based on my answer on Quora: [Rohan Verma's answer](https://qr.ae/TUT1OT)

Now, I’m a fan of the **pair-programming** style of working out a solution, because it lets you gauge the efficiency of your counterpart better. So let's devise this solution together, starting with some information about **SIMD programming** on ARM microcontrollers.
**What is SIMD?**
SIMD stands for Single Instruction stream, Multiple Data stream. The same set of instructions is executed in parallel on different sets of data.
This cuts the hardware control logic needed for the same amount of calculation by roughly a factor of N, where N is the width of the SIMD unit.
The SIMD computation model is shown below:
![SIMD pipeline.](main-qimg-84df36e69e2decf629e743020eb31fd0.jpg)
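
To make the model concrete, here is a minimal sketch in C. The names `LANES`, `add_scalar` and `add_simd_model` are made up for illustration and are not part of this repository. The loop only models the data flow of one SIMD "instruction"; on real SIMD hardware the lanes are processed in parallel.

```c
#include <stdint.h>

#define LANES 4 /* hypothetical SIMD width N, chosen for illustration */

/* SISD: one add produces one result. */
static uint8_t add_scalar(uint8_t a, uint8_t b) {
    return (uint8_t)(a + b);
}

/* SIMD model: the same add is applied to every lane of the operands.
   This loop is only a software model of what the hardware does at once. */
static void add_simd_model(const uint8_t a[LANES], const uint8_t b[LANES],
                           uint8_t out[LANES]) {
    for (int lane = 0; lane < LANES; lane++) {
        out[lane] = add_scalar(a[lane], b[lane]);
    }
}
```

Calling `add_simd_model` with `{1, 2, 3, 4}` and `{10, 20, 30, 40}` fills `out` with the four lane-wise sums.
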
An instruction pipeline is a very different concept from SIMD. The pipeline accepts multiple instructions issued by the CPU's dispatcher, and each of them must then pass through the stages of the instruction cycle to complete.
Pipelining does not increase the number of instruction *streams* being processed in parallel; the single stream simply flows through a longer channel, as it were.

A SIMD machine has one hardware instruction pointer and multiple hardware channels for writing data to memory. It is very different from pipelining: a SIMD machine executes a single instruction over multiple pieces of data. **So pipelining, from the programmer’s perspective, can be called SISD.**

The instruction pipeline is a way of bringing parallelism to a single-core CPU, so from the programmer's perspective it is SISD, while from the hardware perspective it is more like MIMD.

**The ARM Cortex-M4 has a 3-stage pipeline.**

Now, before diving into the code, I want you to look at the CMSIS standard's intrinsics for the ARM Cortex-M4 instruction set.

[Intrinsic Functions for SIMD Instructions (only Cortex-M4 and Cortex-M7)](https://arm-software.github.io/CMSIS_5/Core/html/group__intrinsic__SIMD__gr.html)
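
To get a feel for what these intrinsics compute, here is a portable sketch of a four-lane byte add, the operation CMSIS exposes as `__UADD8`. The helpers `uadd8_portable` and `add_bytes_packed` are hypothetical names of my own, written so the idea can be tried on a desktop compiler. Unlike the real instruction, this sketch does not update the APSR.GE flags.

```c
#include <stdint.h>
#include <string.h>

/* Portable stand-in for a packed 4 x 8-bit unsigned add. Each lane wraps
   modulo 256 and no carry crosses into the neighbouring lane. */
static uint32_t uadd8_portable(uint32_t x, uint32_t y) {
    /* Add the low 7 bits of every lane; these carries stay inside the lane. */
    uint32_t low = (x & 0x7F7F7F7Fu) + (y & 0x7F7F7F7Fu);
    /* Fold the top bit of each lane back in without a carry-out. */
    return low ^ ((x ^ y) & 0x80808080u);
}

/* Add two byte arrays four elements per step.
   Assumption: n is a multiple of 4. */
static void add_bytes_packed(const uint8_t *a, const uint8_t *b,
                             uint8_t *out, uint32_t n) {
    for (uint32_t i = 0; i < n; i += 4) {
        uint32_t x, y, r;
        memcpy(&x, a + i, sizeof x); /* pack 4 bytes into one word */
        memcpy(&y, b + i, sizeof y);
        r = uadd8_portable(x, y);    /* one packed add covers 4 lanes */
        memcpy(out + i, &r, sizeof r);
    }
}
```

On the actual target you would replace `uadd8_portable(x, y)` with the CMSIS intrinsic `__UADD8(x, y)`, which maps to a single instruction.
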
Now look at this very simple code:
```
for(uint32_t i = 0 ; i < 1024 ; i++){
@@ -22,9 +27,9 @@ Now look at the very simple code:
}
```

If you compile this code with the **arm-none-eabi-gcc cross-compiler**, the instruction pipeline takes care of the machine code produced.
And as far as I know, gcc is always a smart-ass and will automatically unroll loops where it sees fit, but given that the target is an embedded device, the gcc cross-compiler does not have the liberty to act like the smart-ass it normally would.
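
If you would rather not depend on what the compiler decides on its own, GCC 8 and later accept a `#pragma GCC unroll N` hint placed directly before a loop. Below is a minimal sketch. The function `add_arrays` and the element-wise add in the loop body are assumptions of mine, since the loop body is not shown here, and whether the loop really gets unrolled still depends on the optimization level and flags.

```c
#include <stdint.h>

#define N 1024u /* same loop bound as the loop in this README */

/* Hypothetical element-wise add over three arrays of N elements. */
static void add_arrays(const uint32_t *a, const uint32_t *b, uint32_t *c) {
    /* Ask GCC to unroll the following loop 4 times. Compilers that do not
       know this pragma ignore it, possibly with a warning. */
#pragma GCC unroll 4
    for (uint32_t i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }
}
```

The result is the same with or without the pragma; only the generated machine code may differ.
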
So here is the optimized version (loop unrolling) for a **pipelined CPU (without SIMD):**
```
//Optimization - without SIMD
for(uint32_t i = 0 ; i < 1024 ; i = i+4){