# pfloat: An 8-/16-/32-bit floating point number family

Key words: floating point number representation, variable precision,
CNN simulation, reduced bit size, FP8, FP16, FP32, bfloat, BFLOAT16,
convolutional neural network

## Introduction
Floating point representations of 'real world' numbers are required for simulations
of 'real world' systems. The number of available bits as well as
the meaning of those bits has a profound impact on the accuracy of such simulations.

There's a range of floating point number representations available, notably
the ones standardized by IEEE, plus some important additional formats
such as bfloat16.

The IEEE formats (e.g. float, double) use the MSB as sign bit, the next 8 or 11 bits as
biased exponent, with the remainder of the (LSB-)bits as mantissa bits.
Each exponent bit pattern covers a binary bucket, with the mantissa bits determining
the actual number inside this binary bucket.
The number of exponent and mantissa bits is kept constant, which implies that the
exponent bias is a constant.
These formats cover the number range from +max to -max, where max is typically a very
large number.
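
For readers less familiar with this layout, the sketch below (plain C++,
independent of this library) extracts the three fields of a standard 32-bit float:

```
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    float f = 6.5f;  // 6.5 = 1.625 * 2^2
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // reinterpret the float's bit pattern

    unsigned sign     = bits >> 31;            // 1 sign bit (MSB)
    unsigned exponent = (bits >> 23) & 0xFFu;  // 8 biased exponent bits
    unsigned mantissa = bits & 0x7FFFFFu;      // 23 mantissa bits (LSBs)

    // value = (-1)^sign * 2^(exponent - 127) * (1 + mantissa * 2^-23)
    std::printf("sign=%u exponent=%u mantissa=0x%06X\n", sign, exponent, mantissa);
    // prints: sign=0 exponent=129 mantissa=0x500000
    return 0;
}
```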

The number of exponent bits determines the dynamic range of representable numbers, while
the number of mantissa bits determines the accuracy.

Here we address general and specific cases where the number of exponent and mantissa bits
is _not equal_ for all representable floating point numbers. We implement a selected
set of 8-/16-/32-bit floating point types which are judged to be good candidates
for simulations of CNNs (Convolutional Neural Networks).

## The generalized case: vfloat

We consider a family of floating point representations, which we
name **vfloat** for the current discussion,
which are basically characterized by not having a
constant (but **v**ariable) exponent bias as well as
a **v**ariable number of exponent- and mantissa-bits.

In other words: Precious bits are invested to divide the
entire range of representable numbers into several (sub-)ranges.
Each (sub-)range can have its own exponent bias, the magnitude
of which is _implicitly_ defined by the specific vfloat type.
The exponent bias for vfloat is encoded in range bits. In
addition, each encoded range of a vfloat can have its own
distinct number of sign-/exponent-/mantissa-bits.

For example, the vfloat type
`vfloat8_32_2_5_0_1` would be an 8-bit type with a per-range split between
exponent and mantissa, with the following exponent/mantissa-distribution for the 4 ranges:
| 2 = 0b10 | **0** | 1 | 5 | 2^4 |
| 3 = 0b11 | **1** | 2 | 4 | 2^5 |

As a special case, (plus-/minus-) zero is defined as exponent and mantissa
bits all zero in range 0,
resulting in the smallest non-zero magnitude of representable numbers in the
above example being 2^-32.
The largest representable number magnitude is equal to:
range_bias times 2^range11_exponent_bits times sum of one and fractional contribution
from all-ones mantissa bits = 2^5*2^1*(1+(2^4 - 1)*2^-4) = 124.0
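
As a quick sanity check of that arithmetic (our own snippet, independent of the library):

```
#include <cstdio>

int main() {
    double range_bias    = 32.0;               // 2^5
    double exponent_part = 2.0;                // 2^1
    double mantissa_part = 1.0 + 15.0 / 16.0;  // 1 + (2^4 - 1) * 2^-4
    std::printf("%.1f\n", range_bias * exponent_part * mantissa_part);  // prints 124.0
    return 0;
}
```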

With 1 sign bit and 2 range bits, there are 6 different
combinations of exponent and mantissa bits available, each of which
can be assigned to any of the 4 (signed) ranges; this
allows for the definition of 2'592 different vfloat8
types for each specified smallest exponent bucket.

While it's deemed prohibitive to implement all possible vfloat types, it quickly becomes obvious
that a well-chosen subset offers improved accuracy where it matters, at the
cost of lower resolution ('less accuracy') where it does not hurt (much).
We note (for completeness only) that one can design vfloat types with
non-consecutive number coverage. For example, there could be one range
dedicated to numbers close to zero, e.g. [0.0, 2^-64..2^-60), with numbers in
[2^-60..2^-4) not being representable, and a second range covering [2^-4 .. 1.0).
For such cases, we'd recommend considering ranges with all
bits being assigned to exponent (see effect of such an approach in
range 0b01 above, which covers 32 binary buckets).

Of particular interest (for us) are vfloat type designs with good
accuracy close to (+-) 1.0 (Volt), while having dynamic ranges
of either ~2^-30 (~1nVolt) or 2^-15 (~30uVolt), with the
possibility to handle double those dynamic ranges (~2^-60) when
multiplying two vfloat numbers. We also see potential use for
unsigned vfloat types.

## The specific case: pfloat

### Design principles

To limit the number of vfloat types that need to be implemented,
a couple of self-imposed limitations are introduced:
1) The range of representable numbers is limited either to magnitudes between 0.0 and 1.0 (_both_ inclusive), or such as to be compatible with IEEE754-2008 (aka float);
2) The number of ranges is either 4 or 16 (2 or 4 range bits);
3) The number of exponent bits between two adjacent ranges can differ by a maximum of 1 bit, and is either progressively increasing or decreasing (or remains constant);
4) The number of mantissa bits is maximized for numbers close to magnitude 1.0, at the cost of a reduced number of mantissa bits for very small (or very large) numbers;
5) The respective dynamic ranges are chosen to be 'potentially relevant' for CNN simulations;
6) For pfloat types covering [0.0..1.0] and [-1.0..1.0], a bit pattern of range-bits all zero, exponent-bits all zero, mantissa-bits all zero represents 0.0, and a bit pattern of range-bits all zero, exponent-bits all zero, mantissa-bits 0b0...01 represents 1.0;
7) For pfloat types covering [0.0..1.0] and [-1.0..1.0], there is obviously no 'inf', and we abstain from defining a 'nan' (disclaimer: we're considering using -0.0 to represent 'nan' in the future)

The progressive increase/decrease of exponent bits over ranges
leads to the 'p' in the **p**float naming convention.

A very important side effect of the above: The limitation
to number ranges of magnitude [0.0..1.0] has the effect
that the result of a multiplication of two numbers is
guaranteed to fall in this number range. With appropriate
scaling, even the final result of multiply-add operations on
vectors will fall in this number range. This greatly
simplifies handling of overflow, simply by avoiding it.

Disclaimer: The above limitations might or might not be
what you're looking for. If not, please keep looking (elsewhere)...

### Implemented pfloat number formats

A total of ten number formats are addressed here. The nomenclature used is straightforward:
- The name 'pfloat' indicates membership in the type family;
- A prefix 'u' means unsigned type; absence of this prefix implicitly means signed type;
- A postfix number indicates how many bits are used in the type implementation (8 --> 8 bits, 16 --> 16 bits, 32 --> 32 bits);
- The postfix [high|low] gives a hint about the size of the dynamic range;
- Absence of a dynamic range postfix for 16- and 32-bit types indicates that the dynamic range is the same as for `float`.


| #bits | Signed | #ranges | Type name | #mantissa bits | range | dynamic range |
Expand All @@ -125,17 +154,27 @@ A total of ten number formats are addressed here. The nomenclature used is strai

Table 1: Summary of implemented floating point types

The above number formats are best visualized by showing the
exponent-mantissa distributions as well as the bias values for
each pfloat type as a function of the related ranges:
![pfloat formats overview](./doc/graphs/pfloatSummary.png "Figure 1: pfloat formats overview")
Figure 1: pfloat formats overview

As one can easily appreciate from the figure above, the resolution
(= number of mantissa bits) is optimized for numbers close to 1.0.
For example, a pfloat16 number in range [0.5..8.0) has 11 mantissa
bits, and an upfloat16l(ow) even has 13 mantissa bits in range [0.25..1.0].
This compares favorably to a BFLOAT16, which only has 7 mantissa
bits in those ranges. When compared to a 'brute force' implementation of
an 8-bit floating point number format, e.g. with a fixed assignment of 1 sign,
5 exponent and 2 mantissa bits,
the 8-bit pfloat family members also have more mantissa bits for magnitudes close to 1.0.

This comes at a cost, for example for pfloat16 numbers
smaller than 2^-31 (~0.5nV), where only 5-6 mantissa bits are available, as
opposed to BFLOAT16 with 7 bits.
Conversion of float from/to BFLOAT16
(implemented with underlying type uint16_t) is available, in case it is needed for comparisons.

Note: The range bit patterns are chosen such that small
magnitudes have small range encoding values, so that the encoded bit pattern
monotonically increases for increasing floating point magnitudes.
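
A practical consequence of this monotonicity (illustrated below with a
hypothetical helper, not an actual library function) is that magnitude
comparisons can be performed directly on the raw encodings:

```
#include <cstdint>

// Hypothetical illustration: with an encoding that grows monotonically with
// magnitude, two same-signed pfloat8 raw bytes can be compared as plain
// unsigned integers, without decoding range, exponent or mantissa.
bool magnitudeLess(std::uint8_t rawA, std::uint8_t rawB) {
    return rawA < rawB;
}
```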

### pfloat math

For the initial version, we abstained from implementing dedicated bit-wise
ALU versions of all possible math operations for the various pfloat types.
Instead, we convert the pfloat inputs to 32-bit floating point,
execute the math operation, apply a truncation to the
mantissa bits (so as to mimic a reduced-size ALU), and convert back
to pfloat, thereby applying a selectable rounding
mechanism. The number of used mantissa bits is programmable, either
via default `#define` value, or in each math
operation (except when using operators, which always use the default value).
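
A sketch of what the truncation step might look like (our illustration, assuming
a simple AND mask on the intermediate float's bit pattern, in the style of the
0xFFFFF800/0xFFFFFF80 masks used in the vector example below; `truncateMantissa`
is not the library's actual API):

```
#include <cstdint>
#include <cstring>

// Sketch: clear low-order mantissa bits of a float with an AND mask, mimicking
// a reduced-size ALU. A mask of 0xFFFFF800 clears the 11 LSBs and keeps 12 of
// the 23 mantissa bits; 0xFFFFFF80 keeps 16.
float truncateMantissa(float value, std::uint32_t mask) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    bits &= mask;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```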

We implemented the basic math operations (+, -, *, /) and comparison
operators (==, !=, >=, <=, <, >). We implemented all relevant type
conversions (with double and pfloat32 only partially covered initially).
We also implemented single- and dual-argument math functions (all that are
supported by the `cmath` library). Last but not least, we also
implemented various vector operations, in particular multiply-add and similar.

We applied template programming to enable straightforward use of different pfloat types.
For example, it is easily possible to use multiply-add with one vector (weights?)
represented in upfloat8l, the other vector (inputs?) represented as pfloat8h, with
the result returned as pfloat16h. This can deliver up to 10 bits of resolution in the result.

It's worth noting that a CPU cache line size of 64 bytes - as is in widespread use -
can hold 64 pfloat8 elements, or one pfloat8 8x8 matrix.

A multiply-add operation of a first vector consisting of unsigned
element type `upfloat8high` (e.g. cell output) values with a second
vector consisting of signed element type `pfloat8low` (e.g. synaptic weights)
values, each input vector having 64 elements (and hence consuming 64 Bytes
of memory each), and generating an output
of type `pfloat16high` can be implemented like this:
```
auto multiplyAddResult_pfloat =
    pfloatMultiplyAdd<pfloat16high, upfloat8high_64_t, pfloat8low_64_t>(
        vector1, vector2,    // the two 64-element input vectors
        1.0f/32.0f,          // explicit scaling factor
        pfloat_n::nearest,   // rounding method
        0xFFFFF800,          // mantissa truncation mask (keeps 12 bits)
        0xFFFFFF80);         // mantissa truncation mask (keeps 16 bits)
```
Note the explicit
scaling factor, rounding method and mantissa truncation to 12/16 bits. The `auto` type specifier
can be used because the return type (`pfloat16high`) is specified as a template parameter.

Implementation of matrix math is prepared but not done yet.

### Build and install

After the usual `git clone` operation, create a build directory in the directory
where you have cloned it and enter it, then call `cmake ..`, `make`, `make install`:

```
cd $yourDirectory
git clone https://github.com/IBM/pfloat.git
cd pfloat
mkdir build
cd build
cmake ..
make
make install
```

This will build a static and shared library, as well as an
executable `runtests`. The libraries are installed in $yourDirectory/pfloat/lib,
the executable in $yourDirectory/pfloat/bin. Note that if you want the libs in /usr/local/lib,
you can either change the CMakeLists.txt, or copy the libs manually.

One `#pragma GCC diagnostic ignored "-Wignored-attributes"` is used to suppress related compiler warnings.

### Use

The `./tests/runtests.cpp` file can be used to look at various examples.

Essentially, the pfloat types can be used as if they all
were float. Hence, code like the following is perfectly valid.
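
The snippet below is a minimal sketch (assuming construction from float and the
operator set described above; the header name is an assumption):

```
#include "pfloat.h"  // header name assumed for this sketch

int main() {
    pfloat16 a(0.5f);   // assumes float-conversion constructors
    pfloat16 b(0.25f);

    pfloat16 sum     = a + b;  // behaves like float addition
    pfloat16 product = a * b;  // behaves like float multiplication

    bool smaller = product < sum;  // comparison operators as for float
    return smaller ? 0 : 1;
}
```
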
Disclaimers:

- Increase test coverage
- Implement matrix math (including tensorflow-like operations)
- Play around with pfloat format combinations in math operations (as e.g. relevant for CNN simulation)
- Play around with bit sizes of underlying math
- Complete implementation of pfloat32, and for completeness add pfloat64
- Complete random rounding implementation and add multi-bit stochastic rounding
