Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize non-atomic memory allocation #14679

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

BlobCodes
Copy link
Contributor

@BlobCodes BlobCodes commented Jun 9, 2024

Closes #14677
Closes #14678

This PR adds two new well-known functions used in the compiler:

  • __crystal_calloc64
  • __crystal_calloc_atomic64

These functions are analogues to __crystal_malloc64 and __crystal_malloc_atomic64, but they guarantee that any memory allocated using them is cleared.
This can be used as an optimization as crystal mustn't clear this memory as required with memory allocated using the malloc versions.

If the __crystal_calloc* functions cannot be found, the old behaviour is used.

Additionally, two new GC methods calloc and calloc_atomic have been added with the same behaviour.

The description of the GC method malloc (which clears memory in bdwgc and doesn't clear memory with no GC) has been updated to reflech that it does not always clear any memory. Unless the underlying GC is changed, this is not a breaking change.


In the case of bdwgc, only non-atomic memory allocations got faster.

Code:

require "benchmark"

Benchmark.ips(calculation: 60) do |x|
  x.report("malloc") { Pointer(String).malloc(1) }
end

Benchmark.ips(calculation: 60) do |x|
  x.report("malloc") { Pointer(String).malloc(2 ** 10) }
end

Benchmark.ips(calculation: 60) do |x|
  x.report("malloc") { Pointer(String).malloc(2 ** 24) }
end

Results:

Bytesize Before After
8B 8.02ns 7.35ns
8KiB 824.02ns 746.20ns
128MiB 23.44ms 11.84ms

As can be seen from these results, large memory allocations profit a lot while small memory allocations only see a small improvement.
Also, it may be interesting to see how often LLVM can remove the memset completely.

More advanced benchmarks must still be done.

@BlobCodes
Copy link
Contributor Author

A logical follow-up to this PR would be to expose a non-clearing variant of Pointer.malloc in the stdlib (ex. as Pointer.malloc_unsafe) to speed up collection types without inner pointers.

end
end

pre_initialize_aggregate(type, struct_type, type_ptr)
end

def pre_initialize_aggregate(type, struct_type, ptr)
memset ptr, int8(0), size_t(struct_type.size)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.pre_initialize must clear the memory because it is intended for use with uninitialized memory, regardless of where that memory came from. If the change in #allocate_aggregate is still needed then it should not call #pre_initialize_aggregate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line was removed because pre_initialize_aggregate is also used in allocate_aggregate.

The memset specific to the .pre_initialize primitive was moved here:

memset ptr, int8(0), size_t(struct_type.size)

So everything should still work the same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then the naming #pre_initialize_aggregate isn't accurate anymore, because it now does less than the corresponding primitive. I suggest renaming it to something more descriptive

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But maybe this is also fine and the .pre_initialize primitive itself shouldn't clear the memory, so the caller can save every bit of redundant memset?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine, but there should be a comment that states that pre_initialize_aggregate expects the memory to have already been set to zero, or have a clear argument, so the intent becomes clear.

@ysbaddaden
Copy link
Contributor

Nice speedup!

Though, I'm not sure about naming. The C calloc function involves two traits: allocate an array of n elements of size bytes then memory is set to zero, but we'd skip the main trait here.

I'd prefer to expose something more explicit, for example just GC.malloc(size, clear: true) and __crystal_malloc(size, clear: true) and same for the atomic versions.

@BlobCodes
Copy link
Contributor Author

BlobCodes commented Jun 10, 2024

I'd prefer to expose something more explicit, for example just GC.malloc(size, clear: true) and __crystal_malloc(size, clear: true) and same for the atomic versions.

The __crystal_malloc* functions are funs, not defs, so we don't have named args.
Adding a second arg is a breaking change since the new compiler couldn't use older stdlibs anymore.

Also, I don't really see why clear should be a param instead of a function invariant while the same isn't true for atomic (ex. GC.malloc(20, atomic: true)).

Though, I'm not sure about naming. The C calloc function involves two traits: allocate an array of n elements of size bytes then memory is set to zero, but we'd skip the main trait here.

The calloc function doesn't necessarily involve manually clearing (memset-ing) the allocated memory. For example, the Unix mmap syscall used for allocating large memory regions uses continuous 4KiB memory pages which are only actually commited to a program on access and always cleared on commit by the kernel itself.
Since calloc can always assume this is true, no memory needs to be cleared and thus commited for large allocations.

The main invariant of calloc (the memory is cleared) is implemented - however that may be.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GC.malloc doesn't clear memory (as specified) for gc_none Non-atomic memory is cleared twice
4 participants