`Uint8List.fromList([...])` is ~10x slower than `Uint8List(length)..[0] = #..[1] = #` #56052

matanlurey · 2024-06-20T02:36:46Z

Possibly related: #32080.

It would be nice to have an intrinsic for a list-literal or perhaps document fromList as inefficient.

Here is a micro-benchmark I wrote:

import 'dart:io' as io;
import 'dart:typed_data';

import 'package:benchmark_harness/benchmark_harness.dart';

/// Benchmark for creating a [Uint8List] from a list of bytes.
///
/// Shows that [Uint8List.new] is ~10x faster than [Uint8List.fromList].
void main() {
  _Uint8ListFromList().report();
  _Uint8ListWithLength().report();
}

final class _Uint8ListFromList extends BenchmarkBase {
  _Uint8ListFromList() : super('Uint8List.fromList');

  late Uint8List _list;

  @override
  void run() {
    _list = Uint8List.fromList([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]);
  }

  @override
  void teardown() {
    // Use the list so it is not optimized away.
    _writeBytesAvoidDeopt(_list);
  }
}

final class _Uint8ListWithLength extends BenchmarkBase {
  _Uint8ListWithLength() : super('Uint8List.withLength');

  late Uint8List _list;

  @override
  void run() {
    final list = Uint8List(10);
    list[0] = 0;
    list[1] = 1;
    list[2] = 2;
    list[3] = 3;
    list[4] = 4;
    list[5] = 5;
    list[6] = 6;
    list[7] = 7;
    list[8] = 8;
    list[9] = 9;
    _list = list;
  }

  @override
  void teardown() {
    // Use the list so it is not optimized away.
    _writeBytesAvoidDeopt(_list);
  }
}

void _writeBytesAvoidDeopt(Uint8List list) {
  final tmpDir = io.Directory.systemTemp.createTempSync('nox.benchmark');
  final file = io.File(
    '${tmpDir.path}${io.Platform.pathSeparator}nox.benchmark',
  );
  file.writeAsBytesSync(list);
  file.deleteSync();
  tmpDir.deleteSync();
}

The text was updated successfully, but these errors were encountered:

dart-github-bot · 2024-06-20T02:37:13Z

Labels: area-core-library, type-enhancement
Summary: The Uint8List.fromList() constructor is significantly slower than creating a Uint8List with a specified length and then setting each element individually. This performance difference is likely due to the way fromList() handles copying data.

lrhn · 2024-06-20T10:50:22Z

It's unsurprising that .fromList is slower. If nothing else, it allocates the normal list and initializes that first, which likely uses ~8x the memory of the Uint8List on a 64-bit runtime. Then that list is passed to a constructor which spends some extra time checking if its input is a typed-data list, so it can do memcpy.

10x seems excessive, but that may just be beause Uint8List(10)..[0] = 0..[1] = 1..[2] = 2.... is incredibly fast, basically as fast as initializing the memory, and [0, ..., 9] is just as efficient, but initializes 8x the memory, and then copies it once to the Uint8List.

(The obvious optimization would be to optimize away the intermediate list so that Uint8List.fromList([e1, ..., en]) is compiled as Uint8List(n)..[0] = e1... ..[n-1] = en.)

lrhn · 2024-06-26T13:43:05Z

Doesn't seem to be list allocation that costs. Changing the list to a const [1, ... ,10] only improves performance by ~5%.

A quick look at the code shows that the definition of .fromList just calls setRange.
That's probably what it does. If using a (reusable) Uint8List as the source, performance improves ~x3. Still another x3 needed until the speed of just writing the bytes directly.

@mkustermann Typed data, there's always more to optimize! :)

rakudrama · 2024-06-26T19:22:06Z

If using a (reusable) Uint8List as the source, performance improves ~x3. Still another x3 needed until the speed of just
writing the bytes directly.

@mkustermann Typed data, there's always more to optimize! :)

For a 10-element list we are already far down the slippery slope of having too many tests to select special cases to close the final x3. I think the language needs something to help us dispatch more quickly to the appropriate specialized kernel.

I would like the full glory of control flow collections to be available to typed data lists, with both modifiable, unmodifiable, and perhaps const variants. Let the compilers and runtime do things that we would be scared to make available to the general developer.

lrhn · 2024-06-26T19:54:30Z

Indeed, the only thing here that would match the direct storing approach would be recognizing the pattern .fromList(literal) and converting it to direct stores. Possible, but doesn't generalize.

A typed-data literal, fx Uint8List{1, 2, 3, 4 ...}, could more easily be recognized and optimized, and/or allow general collection elements.

(I'd like the feature to be available to user types too, maybe requiring the type to have a special constructor or interface, we can still special-case platform types.)

mkustermann · 2024-06-27T10:56:38Z

It would be nice to have an intrinsic for a list-literal or perhaps document fromList as inefficient.

Uint8List.fromList() isn't inefficient by itself, it depends on the argument being passed and what we know about that object.

   _list = Uint8List.fromList([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]);

Right now this compiles to this:

  0: B0[graph]:0 {
      v0 <- Constant(#null) T{Null?}
      v3 <- Constant(#TypeArguments: (H16e735c6) [Type: int]) T{TypeArguments}
      v4 <- Constant(#10) [10, 10] T{_Smi}
      v6 <- Constant(#0) [0, 0] T{_Smi}
      v7 <- Constant(#1) [1, 1] T{_Smi}
      v8 <- Constant(#2) [2, 2] T{_Smi}
      v9 <- Constant(#3) [3, 3] T{_Smi}
      v10 <- Constant(#4) [4, 4] T{_Smi}
      v11 <- Constant(#5) [5, 5] T{_Smi}
      v12 <- Constant(#6) [6, 6] T{_Smi}
      v13 <- Constant(#7) [7, 7] T{_Smi}
      v14 <- Constant(#8) [8, 8] T{_Smi}
      v15 <- Constant(#9) [9, 9] T{_Smi}
}
  2: B49[function entry]:2 {
      v2 <- Parameter(0 @rdi) T{_Uint8ListFromList}
}
  3:     ParallelMove fp[-1] <- rdi
  4:     CheckStackOverflow:8(stack=0, loop=0)
  5:     ParallelMove rbx <- C, r10 <- C
  6:     v5 <- CreateArray:12(v3, v4) T{_List}
  8:     ParallelMove fp[-2] <- rax
  8:     StoreIndexed([_List] v5, v6, v6, NoStoreBarrier)
 10:     StoreIndexed([_List] v5, v7, v7, NoStoreBarrier)
 12:     StoreIndexed([_List] v5, v8, v8, NoStoreBarrier)
 14:     StoreIndexed([_List] v5, v9, v9, NoStoreBarrier)
 16:     StoreIndexed([_List] v5, v10, v10, NoStoreBarrier)
 18:     StoreIndexed([_List] v5, v11, v11, NoStoreBarrier)
 20:     StoreIndexed([_List] v5, v12, v12, NoStoreBarrier)
 22:     StoreIndexed([_List] v5, v13, v13, NoStoreBarrier)
 24:     StoreIndexed([_List] v5, v14, v14, NoStoreBarrier)
 26:     StoreIndexed([_List] v5, v15, v15, NoStoreBarrier)
 27:     ParallelMove rdx <- C
 28:     v47 <- AllocateObject:10(cls=_GrowableList, v3 T{TypeArguments}) T{_GrowableList}
 29:     ParallelMove rcx <- rax, rax <- fp[-2]
 30:     ParallelMove fp[-3] <- rcx
 30:     StoreField(v47 . GrowableObjectArray.data = v5 T{_List}, NoStoreBarrier)
 32:     StoreField(v47 T{_GrowableList} . GrowableObjectArray.length = v4 T{_Smi}, NoStoreBarrier)
 33:     ParallelMove rax <- C
 34:     v64 <- AllocateTypedData:10(v4 T{_Smi}) T{_Uint8List}
 35:     ParallelMove rdi <- rax, rsi <- C, rdx <- C, rbx <- fp[-3], r8 <- C, rax <- rax
 36:     ParallelMove fp[-2] <- rax
 36:     StaticCall:166( _slowSetRange@7027147<0> v64 T{_Uint8List}, v169 T{_Smi}, v170 T{_Smi}, v47 T{_GrowableList}, v169 T{_Smi}, using unchecked entrypoint, result_type = T{Null?})
 37:     ParallelMove rax <- fp[-2], rcx <- fp[-1]
 38:     StoreField(v2 T{_Uint8ListFromList} . _list@15459445 = v64 T{_Uint8List})
 39:     ParallelMove rax <- C
 40:     DartReturn:22(v0)
*** END CFG

Without special compiler magic for the particular use (i.e. relying on normal compiler optimizations), the compiler would need to

inline _slowSetRange here (which it doesn't because it's way too big)
recognize the list length is known at that point & unroll the loop (which would be bad for code size)
run store-to-load forwarding pass to avoid loading from growable array
notice that the growable array doesn't escape, the only uses are stores and it can therefore remove the stores & allocation

This is too much to ask from the compiler.

We could recognize this particular pattern higher up in the stack (e.g. kernel level) and rewrite it there. That would probably be ok, but not ideal as we introduce hand-crafted optimizations for library features where we depend on knowing the library implementation details.

Ideally we'd want to allow developers expressing the concept of bytes in the language - via const and non-const typed data literals. That would mostly solve this case.

If using a (reusable) Uint8List as the source, performance improves ~x3. Still another x3 needed until the speed of just writing the bytes directly.

@mkustermann Typed data, there's always more to optimize! :)

Just using a top-level Uint8List.fromList([...]) as source in this benchmark results in suboptimal code because we have to lazily initialize global, don't track length property in global type flow analysis, ... (also some other issues, e.g. filed #56096)

a-siva · 2024-07-10T20:40:25Z

@matanlurey is your benchmark inspired by some pattern that is used frequently in Flutter framework code ?

matanlurey · 2024-07-11T19:38:40Z

Nothing specific - it was just surprising to me that what seemed like the easiest way to create a list of bytes was the least performant.

a-siva added the triaged Issue has been triaged by sub team label Jul 10, 2024

a-siva added the P3 A lower priority bug or feature request label Jul 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`Uint8List.fromList([...])` is ~10x slower than `Uint8List(length)..[0] = #..[1] = #` #56052

`Uint8List.fromList([...])` is ~10x slower than `Uint8List(length)..[0] = #..[1] = #` #56052

matanlurey commented Jun 20, 2024

dart-github-bot commented Jun 20, 2024

lrhn commented Jun 20, 2024 •

edited

Loading

lrhn commented Jun 26, 2024

rakudrama commented Jun 26, 2024

lrhn commented Jun 26, 2024

mkustermann commented Jun 27, 2024 •

edited

Loading

a-siva commented Jul 10, 2024

matanlurey commented Jul 11, 2024

Uint8List.fromList([...]) is ~10x slower than Uint8List(length)..[0] = #..[1] = # #56052

Uint8List.fromList([...]) is ~10x slower than Uint8List(length)..[0] = #..[1] = # #56052

Comments

matanlurey commented Jun 20, 2024

dart-github-bot commented Jun 20, 2024

lrhn commented Jun 20, 2024 • edited Loading

lrhn commented Jun 26, 2024

rakudrama commented Jun 26, 2024

lrhn commented Jun 26, 2024

mkustermann commented Jun 27, 2024 • edited Loading

a-siva commented Jul 10, 2024

matanlurey commented Jul 11, 2024

`Uint8List.fromList([...])` is ~10x slower than `Uint8List(length)..[0] = #..[1] = #` #56052

`Uint8List.fromList([...])` is ~10x slower than `Uint8List(length)..[0] = #..[1] = #` #56052

lrhn commented Jun 20, 2024 •

edited

Loading

mkustermann commented Jun 27, 2024 •

edited

Loading