Optimize non-atomic memory allocation #14679

BlobCodes · 2024-06-09T21:31:41Z

This PR adds two new well-known functions used in the compiler:

__crystal_calloc64
__crystal_calloc_atomic64

These functions are analogous to __crystal_malloc64 and __crystal_malloc_atomic64, but they guarantee that any memory allocated through them is cleared.
This enables an optimization: Crystal no longer has to clear this memory itself, as it must for memory allocated with the malloc versions.

If the __crystal_calloc* functions cannot be found, the old behaviour is used.

Additionally, two new GC methods calloc and calloc_atomic have been added with the same behaviour.

The description of the GC method malloc (which clears memory in bdwgc and doesn't clear memory with no GC) has been updated to reflect that it does not always clear memory. Unless the underlying GC is changed, this is not a breaking change.
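Whichever path the compiler takes, the observable contract of Pointer.malloc stays the same: the returned memory reads as zero. A minimal check of that contract, where all_zero? is a hypothetical helper written only for this example:

```crystal
# Hypothetical helper: true when the first `count` bytes at `ptr` are zero.
def all_zero?(ptr : Pointer(UInt8), count : Int32) : Bool
  Slice.new(ptr, count).all?(&.zero?)
end

# Pointer.malloc is documented to return zero-initialized memory,
# regardless of whether the allocator or the compiler did the clearing.
bytes = Pointer(UInt8).malloc(4096)
puts all_zero?(bytes, 4096)
```

Any change to which layer clears the memory should leave this check passing.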

In the case of bdwgc, only non-atomic memory allocations got faster.

Code:

require "benchmark"

Benchmark.ips(calculation: 60) do |x|
  x.report("malloc 8B") { Pointer(String).malloc(1) }
end

Benchmark.ips(calculation: 60) do |x|
  x.report("malloc 8KiB") { Pointer(String).malloc(2 ** 10) }
end

Benchmark.ips(calculation: 60) do |x|
  x.report("malloc 128MiB") { Pointer(String).malloc(2 ** 24) }
end

Results:

Bytesize	Before	After
8B	8.02ns	7.35ns
8KiB	824.02ns	746.20ns
128MiB	23.44ms	11.84ms

As these results show, large memory allocations benefit substantially (the 128MiB case takes about half the time), while small memory allocations see only a small improvement.
Also, it may be interesting to see how often LLVM can remove the memset completely.

More advanced benchmarks must still be done.

BlobCodes · 2024-06-09T21:44:37Z

A logical follow-up to this PR would be to expose a non-clearing variant of Pointer.malloc in the stdlib (e.g. as Pointer.malloc_unsafe) to speed up collection types without inner pointers.
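A rough sketch of what such a non-clearing allocation could look like, built on the existing GC.malloc_atomic. The helper name malloc_uninitialized is hypothetical and not part of the stdlib; the sketch assumes T holds no references to GC-managed objects, since atomic memory is not scanned.

```crystal
# Hypothetical helper, not a stdlib API: allocates room for `count` values
# of T without clearing the memory. The memory is atomic (not scanned by
# the GC), so T must not contain pointers to GC-managed objects.
def malloc_uninitialized(type : T.class, count : Int32) : Pointer(T) forall T
  raise ArgumentError.new("negative count") if count < 0
  GC.malloc_atomic(LibC::SizeT.new(count) * sizeof(T)).as(Pointer(T))
end

buf = malloc_uninitialized(Float64, 4)
# Every slot is written before it is read, so the missing clear is harmless.
4.times { |i| buf[i] = i * 1.5 }
puts buf[3]
```

The caller takes on the obligation to initialize every element before reading it, which is why a name signalling unsafety was suggested.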

HertzDevil · 2024-06-09T22:03:39Z

src/compiler/crystal/codegen/codegen.cr

 end
 end

 pre_initialize_aggregate(type, struct_type, type_ptr)
 end

 def pre_initialize_aggregate(type, struct_type, ptr)
- memset ptr, int8(0), size_t(struct_type.size)


.pre_initialize must clear the memory because it is intended for use with uninitialized memory, regardless of where that memory came from. If the change in #allocate_aggregate is still needed, then it should not call #pre_initialize_aggregate.

BlobCodes · Jun 9, 2024

This line was removed because pre_initialize_aggregate is also used in allocate_aggregate.

The memset specific to the .pre_initialize primitive was moved here:

crystal/src/compiler/crystal/codegen/primitives.cr

Line 754 in 696ebdb

memset ptr, int8(0), size_t(struct_type.size)

So everything should still work the same.

HertzDevil · Jun 9, 2024

Then the name #pre_initialize_aggregate isn't accurate anymore, because it now does less than the corresponding primitive. I suggest renaming it to something more descriptive.

HertzDevil · Jun 9, 2024

But maybe this is also fine and the .pre_initialize primitive itself shouldn't clear the memory, so the caller can save every bit of redundant memset?

ysbaddaden · Jun 10, 2024

I think it's fine, but there should be a comment stating that pre_initialize_aggregate expects the memory to have already been set to zero, or it should take a clear argument, so the intent becomes clear.
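To illustrate the "clear argument" idea from this thread, here is a runtime analogue written as a sketch. It is not the compiler's codegen helper: pre_initialize, its parameters, and the single-byte type_id header are all hypothetical stand-ins.

```crystal
# Hypothetical runtime analogue of the codegen helper discussed above.
# Pass `clear: true` for memory of unknown content and `clear: false`
# for memory the caller already knows to be zeroed.
def pre_initialize(ptr : Pointer(UInt8), size : Int32, type_id : UInt8, *, clear : Bool) : Nil
  ptr.clear(size) if clear
  ptr[0] = type_id # stand-in for writing the object header and defaults
end

buf = Pointer(UInt8).malloc(8)
8.times { |i| buf[i] = 0xFF_u8 } # simulate stale content
pre_initialize(buf, 8, 7_u8, clear: true)
puts buf[0] # the stand-in header byte
puts buf[1] # cleared
```

Making the argument required forces every call site to state whether it relies on pre-zeroed memory, which addresses the concern about hidden expectations.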

ysbaddaden · 2024-06-10T07:55:06Z

Nice speedup!

Though, I'm not sure about naming. The C calloc function involves two traits: it allocates an array of n elements of size bytes each, and the memory is then set to zero. We'd skip the main trait here.

I'd prefer to expose something more explicit, for example just GC.malloc(size, clear: true) and __crystal_malloc(size, clear: true) and same for the atomic versions.

BlobCodes · 2024-06-10T14:36:53Z

> I'd prefer to expose something more explicit, for example just GC.malloc(size, clear: true) and __crystal_malloc(size, clear: true) and same for the atomic versions.

The __crystal_malloc* functions are funs, not defs, so we don't have named args.
Adding a second arg would be a breaking change, since the new compiler could no longer use older stdlibs.

Also, I don't really see why clear should be a param instead of a function invariant while the same isn't true for atomic (ex. GC.malloc(20, atomic: true)).
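The two API shapes under discussion can be contrasted in a small sketch. All names below are hypothetical and chosen so they do not clash with the real __crystal_* symbols; GC.malloc_atomic is used as the base allocation because it does not clear memory on its own under bdwgc.

```crystal
# Shape 1: a `def` with a named argument, as in the proposed `clear: true`.
def demo_malloc_atomic(size : Int32, *, clear : Bool = false) : Pointer(UInt8)
  ptr = GC.malloc_atomic(LibC::SizeT.new(size)).as(Pointer(UInt8))
  ptr.clear(size) if clear
  ptr
end

# Shape 2: a `fun` exports one C-style symbol with a fixed parameter list
# (no default values, no overloads), so each invariant gets its own symbol.
fun demo_malloc_atomic64(size : UInt64) : Void*
  GC.malloc_atomic(LibC::SizeT.new(size))
end

fun demo_calloc_atomic64(size : UInt64) : Void*
  ptr = GC.malloc_atomic(LibC::SizeT.new(size))
  ptr.as(Pointer(UInt8)).clear(size)
  ptr
end

a = demo_malloc_atomic(16, clear: true)
b = demo_calloc_atomic64(16_u64).as(Pointer(UInt8))
puts a[15] == 0 && b[15] == 0
```

With separate symbols, a compiler can probe for the calloc variant and fall back to the malloc one when it is missing, without changing the signature that older stdlibs already export.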

> Though, I'm not sure about naming. The C calloc function involves two traits: allocate an array of n elements of size bytes then memory is set to zero, but we'd skip the main trait here.

The calloc function doesn't necessarily involve manually clearing (memset-ing) the allocated memory. For example, the Unix mmap syscall used for allocating large memory regions hands out contiguous 4KiB memory pages, which are only actually committed to a program on access and are always cleared by the kernel itself on commit.
Since calloc can rely on this for freshly mapped memory, nothing needs to be cleared, and thus committed, for large allocations.

The main invariant of calloc (the memory is cleared) is upheld, however that is achieved.
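The guarantee described here can be observed from Crystal by calling the C calloc directly. This is only a sketch: DemoLibC is a hypothetical binding declared for this example, with_calloc is a hypothetical helper, and whether the allocator really skips the memset for a given size is an implementation detail that this code does not measure.

```crystal
# Hypothetical binding declared only for this example.
lib DemoLibC
  fun calloc(nmemb : LibC::SizeT, size : LibC::SizeT) : Void*
  fun free(ptr : Void*) : Void
end

# Allocates `count` zeroed bytes with calloc, yields them, then frees them.
def with_calloc(count : Int32, &)
  raw = DemoLibC.calloc(LibC::SizeT.new(count), LibC::SizeT.new(1))
  raise "calloc failed" if raw.null?
  begin
    yield raw.as(Pointer(UInt8))
  ensure
    DemoLibC.free(raw)
  end
end

size = 64 * 1024 * 1024 # 64 MiB, large enough that many allocators map fresh pages
with_calloc(size) do |ptr|
  # Only the first and last byte are read; calloc guarantees both are zero
  # no matter how (or whether) the rest of the region was ever touched.
  puts ptr[0] == 0 && ptr[size - 1] == 0
end
```

The caller sees only the invariant (zeroed memory); how much work was needed to establish it is left to the allocator and the kernel.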

BlobCodes added 2 commits June 9, 2024 22:30

Optimize non-atomic memory allocation

cff557a

Make __crystal_calloc* builtin optional

696ebdb

HertzDevil reviewed Jun 9, 2024


This was referenced Jun 10, 2024

Non-atomic memory is cleared twice #14677

Open

Allocation of non-zeroed memory for performance boost #14687

Open

Blacksmoke16 added kind:feature performance topic:stdlib:runtime labels Jun 10, 2024

BlobCodes mentioned this pull request Jun 13, 2024

Fix unnecessarily clearing clean memory in codegen #14710

Open
