# pfloat: An 8-/16-/32-bit floating point number family

Key words: floating point number representation, variable precision,
CNN simulation, reduced bit size, FP8, FP16, FP32, bfloat, BFLOAT16,
convolutional neural network

## Introduction
Floating point representations of 'real world' numbers are required for simulations
of 'real world' systems. The number of available bits as well as
the meaning of those bits has a profound impact on the accuracy of such simulations.

There's a range of floating point number representations available, notably
the ones standardized by IEEE, plus some important additional formats
such as bfloat16.

The IEEE formats (e.g. float, double) use the MSB as sign bit, the next 8 or 11 bits as
biased exponent, with the remainder of the (LSB-)bits as mantissa bits.
Each exponent bit pattern covers a binary bucket, with the mantissa bits determining
the actual number inside this binary bucket.
The number of exponent and mantissa bits is kept constant, which implies that the
exponent bias is a constant.
These formats cover the number range from +max to -max, where max is typically a very
large number.
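
For readers less familiar with this layout, the sketch below (plain C++,
independent of this library) extracts the three fields of a standard 32-bit float:

```
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    float f = 6.5f;  // 6.5 = 1.625 * 2^2
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // reinterpret the float's bit pattern

    unsigned sign     = bits >> 31;            // 1 sign bit (MSB)
    unsigned exponent = (bits >> 23) & 0xFFu;  // 8 biased exponent bits
    unsigned mantissa = bits & 0x7FFFFFu;      // 23 mantissa bits (LSBs)

    // value = (-1)^sign * 2^(exponent - 127) * (1 + mantissa * 2^-23)
    std::printf("sign=%u exponent=%u mantissa=0x%06X\n", sign, exponent, mantissa);
    // prints: sign=0 exponent=129 mantissa=0x500000
    return 0;
}
```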

The number of exponent bits determines the dynamic range of representable numbers, while
the number of mantissa bits determines the accuracy.

Here we address general and specific cases where the number of exponent and mantissa bits
is _not equal_ for all representable floating point numbers. We implement a selected
set of 8-/16-/32-bit floating point types which are judged to be good candidates
for simulations of CNNs (Convolutional Neural Networks).

## The generalized case: vfloat

We consider a family of floating point representations, which we
name **vfloat** for the current discussion,
which are basically characterized by not having a
constant (but **v**ariable) exponent bias as well as
a **v**ariable number of exponent- and mantissa-bits.

In other words: Precious bits are invested to divide the
entire range of representable numbers into several (sub-)ranges.
Each (sub-)range can have its own exponent bias, the magnitude
of which is _implicitly_ defined by the specific vfloat type.
The exponent bias for vfloat is encoded in range bits. In
addition, each encoded range of a vfloat can have its own
distinct number of sign-/exponent-/mantissa-bits.

For example, the vfloat type
`vfloat8_32_2_5_0_1` would be an 8-bit type with a per-range split between
exponent and mantissa, with the following exponent/mantissa-distribution for the 4 ranges:
| 2 = 0b10 | **0** | 1 | 5 | 2^4 |
| 3 = 0b11 | **1** | 2 | 4 | 2^5 |

As a special case, (plus-/minus-) zero is defined as exponent and mantissa
bits all zero in range 0,
resulting in the smallest non-zero magnitude of representable numbers in the
above example being 2^-32.
The largest representable number magnitude is equal to:
range_bias times 2^range11_exponent_bits times sum of one and fractional contribution
from all-ones mantissa bits = 2^5*2^1*(1+(2^4 - 1)*2^-4) = 124.0
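
As a quick sanity check of that arithmetic (our own snippet, independent of the library):

```
#include <cstdio>

int main() {
    double range_bias    = 32.0;               // 2^5
    double exponent_part = 2.0;                // 2^1
    double mantissa_part = 1.0 + 15.0 / 16.0;  // 1 + (2^4 - 1) * 2^-4
    std::printf("%.1f\n", range_bias * exponent_part * mantissa_part);  // prints 124.0
    return 0;
}
```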

With 1 sign bit and 2 range bits, there are 6 different
combinations of exponent and mantissa bits available, each of which
can be assigned to any of the 4 (signed) ranges; this
allows for the definition of 2'592 different vfloat8
types for each specified smallest exponent bucket.

While it's deemed prohibitive to implement all possible vfloat types, it quickly becomes obvious
that a well-chosen subset offers improved accuracy where it matters, at the
cost of lower resolution ('less accuracy') where it does not hurt (much).
We note (for completeness only) that one can design vfloat types with
non-consecutive number coverage. For example, there could be one range
dedicated to numbers close to zero, e.g. [0.0, 2^-64..2^-60), with numbers in
[2^-60..2^-4) not being representable, and a second range covering [2^-4 .. 1.0).
For such cases, we'd recommend considering ranges with all
bits being assigned to exponent (see effect of such an approach in
range 0b01 above, which covers 32 binary buckets).

Of particular interest (for us) are vfloat type designs with good
accuracy close to (+-) 1.0 (Volt), while having dynamic ranges
of either ~2^-30 (~1nVolt) or 2^-15 (~30uVolt), with the
possibility to handle double those dynamic ranges (~2^-60) when
multiplying two vfloat numbers. We also see potential use for
unsigned vfloat types.

## The specific case: pfloat

### Design principles

To limit the number of vfloat types that need to be implemented,
a couple of self-imposed limitations are introduced:
1) The range of representable numbers is limited either to magnitudes between 0.0 and 1.0 (_both_ inclusive), or such as to be compatible with IEEE754-2008 (aka float);
2) The number of ranges is either 4 or 16 (2 or 4 range bits);
3) The number of exponent bits between two adjacent ranges can differ by a maximum of 1 bit, and is either progressively increasing or decreasing (or remains constant);
4) The number of mantissa bits is maximized for numbers close to magnitude 1.0, at the cost of a reduced number of mantissa bits for very small (or very large) numbers;
5) The respective dynamic ranges are chosen to be 'potentially relevant' for CNN simulations;
6) For pfloat types covering [0.0..1.0] and [-1.0..1.0], a bit pattern of range-bits all zero, exponent-bits all zero, mantissa-bits all zero represents 0.0, and a bit pattern of range-bits all zero, exponent-bits all zero, mantissa-bits 0b0...01 represents 1.0;
7) For pfloat types covering [0.0..1.0] and [-1.0..1.0], there is obviously no 'inf', and we abstain from defining a 'nan' (disclaimer: we're considering using -0.0 to represent 'nan' in the future)

The progressive increase/decrease of exponent bits over ranges
leads to the 'p' in the **p**float naming convention.

A very important side effect of the above: The limitation
to number ranges of magnitude [0.0..1.0] has the effect
that the result of a multiplication of two numbers is
guaranteed to fall in this number range. With appropriate
scaling, even the final result of multiply-add operations on
vectors will fall in this number range. This greatly
simplifies handling of overflow, simply by avoiding it.

Disclaimer: The above limitations might or might not be
what you're looking for. If not, please keep looking (elsewhere)...

### Implemented pfloat number formats

A total of ten number formats are addressed here. The nomenclature used is straightforward:
- The name 'pfloat' indicates membership in the type family;
- A prefix 'u' means unsigned type; absence of this prefix implicitly means signed type;
- A postfix number indicates how many bits are used in the type implementation (8 --> 8 bits, 16 --> 16 bits, 32 --> 32 bits);
- The postfix [high|low] gives a hint about the size of the dynamic range;
- Absence of a dynamic range postfix for 16- and 32-bit types indicates that the dynamic range is the same as for `float`.


| #bits | Signed | #ranges | Type name | #mantissa bits | range | dynamic range |
Expand All @@ -125,17 +154,27 @@ A total of ten number formats are addressed here. The nomenclature used is strai

Table 1: Summary of implemented floating point types

The above number formats are best visualized by showing the
exponent-mantissa distributions as well as the bias values for
each pfloat type as a function of the related ranges:
![pfloat formats overview](./doc/graphs/pfloatSummary.png "Figure 1: pfloat formats overview")
Figure 1: pfloat formats overview

As one can easily appreciate from the figure above, the resolution
(= number of mantissa bits) is optimized for numbers close to 1.0.
For example, a pfloat16 number in range [0.5..8.0) has 11 mantissa
bits, and an upfloat16l(ow) even has 13 mantissa bits in range [0.25..1.0].
This compares favorably to a BFLOAT16, which only has 7 mantissa
bits in those ranges. When compared to a 'brute force' implementation of
an 8-bit floating point number format, e.g. with a fixed assignment of 1 sign,
5 exponent and 2 mantissa bits,
the 8-bit pfloat family members also have more mantissa bits for magnitudes close to 1.0.

This comes at a cost, for example for pfloat16 numbers
smaller than 2^-31 (~0.5nV), where only 5-6 mantissa bits are available, as
opposed to BFLOAT16 with 7 bits.
Conversion of float from/to BFLOAT16
(implemented with underlying type uint16_t) is available, in case it is needed for comparisons.

Note: The range bit patterns are chosen such that small
magnitudes have small range encoding values, so that the encoded bit pattern
monotonically increases for increasing floating point magnitudes.
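
A practical consequence of this monotonicity (illustrated below with a
hypothetical helper, not an actual library function) is that magnitude
comparisons can be performed directly on the raw encodings:

```
#include <cstdint>

// Hypothetical illustration: with an encoding that grows monotonically with
// magnitude, two same-signed pfloat8 raw bytes can be compared as plain
// unsigned integers, without decoding range, exponent or mantissa.
bool magnitudeLess(std::uint8_t rawA, std::uint8_t rawB) {
    return rawA < rawB;
}
```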

### pfloat math

For the initial version, we abstained from implementing dedicated bit-wise
ALU versions of all possible math operations for the various pfloat types.
Instead, we convert the pfloat inputs to 32-bit floating point,
execute the math operation, apply a truncation to the
mantissa bits (so as to mimic a reduced-size ALU), and convert back
to pfloat, thereby applying a selectable rounding
mechanism. The number of used mantissa bits is programmable, either
via default `#define` value, or in each math
operation (except when using operators, which always use the default value).
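
A sketch of what the truncation step might look like (our illustration, assuming
a simple AND mask on the intermediate float's bit pattern, in the style of the
0xFFFFF800/0xFFFFFF80 masks used in the vector example below; `truncateMantissa`
is not the library's actual API):

```
#include <cstdint>
#include <cstring>

// Sketch: clear low-order mantissa bits of a float with an AND mask, mimicking
// a reduced-size ALU. A mask of 0xFFFFF800 clears the 11 LSBs and keeps 12 of
// the 23 mantissa bits; 0xFFFFFF80 keeps 16.
float truncateMantissa(float value, std::uint32_t mask) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    bits &= mask;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```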

We implemented the basic math operations (+, -, *, /) and comparison
operators (==, !=, >=, <=, <, >). We implemented all relevant type
conversions (with double and pfloat32 only partially covered initially).
We also implemented single- and dual-argument math functions (all that are
supported by the `cmath` library). Last but not least, we also
implemented various vector operations, in particular multiply-add and similar.

We applied template programming to enable straightforward use of different pfloat types.
For example, it is easily possible to use multiply-add with one vector (weights?)
represented in upfloat8l, the other vector (inputs?) represented as pfloat8h, with
the result returned as pfloat16h. This can deliver up to 10 bits of resolution in the result.

It's worth noting that a CPU cache line size of 64 bytes - as is in widespread use -
can hold 64 pfloat8 elements, or one pfloat8 8x8 matrix.

A multiply-add operation of a first vector consisting of unsigned
element type `upfloat8high` (e.g. cell output) values with a second
vector consisting of signed element type `pfloat8low` (e.g. synaptic weights)
values, each input vector having 64 elements (and hence consuming 64 Bytes
of memory each), and generating an output
of type `pfloat16high` can be implemented like this:
```
auto multiplyAddResult_pfloat =
    pfloatMultiplyAdd<pfloat16high, upfloat8high_64_t, pfloat8low_64_t>(
        vector1, vector2,    // the two 64-element input vectors
        1.0f/32.0f,          // explicit scaling factor
        pfloat_n::nearest,   // rounding method
        0xFFFFF800,          // mantissa truncation mask (keeps 12 bits)
        0xFFFFFF80);         // mantissa truncation mask (keeps 16 bits)
```
Note the explicit
scaling factor, rounding method and mantissa truncation to 12/16 bits. The `auto` type specifier
can be used because the return type (`pfloat16high`) is specified as a template parameter.

Implementation of matrix math is prepared but not done yet.

### Build and install

After the usual `git clone` operation, create a build directory in the directory
where you have cloned it and enter it, then call `cmake ..`, `make`, `make install`:

```
cd $yourDirectory
git clone https://github.com/IBM/pfloat.git
cd pfloat
mkdir build
cd build
cmake ..
make
make install
```

This will build a static and shared library, as well as an
executable `runtests`. The libraries are installed in $yourDirectory/pfloat/lib,
the executable in $yourDirectory/pfloat/bin. Note that if you want the libs in /usr/local/lib,
you can either change the CMakeLists.txt, or copy the libs manually.

One `#pragma GCC diagnostic ignored "-Wignored-attributes"` is used to suppress related compiler warnings.

### Use

The `./tests/runtests.cpp` file can be used to look at various examples.

Essentially, the pfloat types can be used as if they all
were float. Hence, code like the following is perfectly valid.
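
The snippet below is a minimal sketch (assuming construction from float and the
operator set described above; the header name is an assumption):

```
#include "pfloat.h"  // header name assumed for this sketch

int main() {
    pfloat16 a(0.5f);   // assumes float-conversion constructors
    pfloat16 b(0.25f);

    pfloat16 sum     = a + b;  // behaves like float addition
    pfloat16 product = a * b;  // behaves like float multiplication

    bool smaller = product < sum;  // comparison operators as for float
    return smaller ? 0 : 1;
}
```
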
Disclaimers:

- Increase test coverage
- Implement matrix math (including tensorflow-like operations)
- Play around with pfloat format combinations in math operations (as e.g. relevant for CNN simulation)
- Play around with bit sizes of underlying math
- Complete implementation of pfloat32, and for completeness add pfloat64
- Complete random rounding implementation and add multi-bit stochastic rounding
