Alessio1989

_aligned_malloc with big alignment waste of space is madness


Hi folks, I'm trying to find out the best way to allocate "big chunks" (several MiB) with "big alignments" (from a couple of MiB to several MiB).

 

Currently I'm doing some testing with _aligned_malloc, but it does not behave well as the alignment grows: with the biggest alignment I am currently using (32 MiB) I am losing roughly 1/8 of the address space! That's a lot of space, and wasting memory makes me a very sad panda.

 

So the question is: is there any way to tell the OS (Windows) not to waste tons of address space (both virtual and physical), or is there a better alternative to _aligned_malloc/_aligned_free on Windows?

 

EDIT: please note that, in the MSVC headers, _mm_malloc is an alias of _aligned_malloc on x86 processors.


One thing you could do is use VirtualAlloc, requesting an offset in your address space that lies on the alignment boundary you want, and if there's nothing overlapping, Windows should give it to you. That's pretty hackish and not reliable but it will get the job done with no waste of space as you asked for. I don't know that there is a way to ask Windows to find such an aligned memory block by itself, but if your alignment is large enough I suppose you could quickly try and scan your address space looking for a free spot.

 

How come you need such large alignment boundaries though? Usually you don't need alignments larger than the page size... what are you trying to do?

Actually, I am doing some experiments across different projects and environments; it is too long to explain...
 
However, if we talk about game development, one of the biggest alignments (4 MiB) comes directly from the requirements of multisample resources in D3D12. Since one of my machines has a Haswell iGPU (which means UMA support), and since its virtual address space is only 31 bits wide, minimizing the space wasted by memory alignment makes sense even for this "insane" alignment if I want to try some "dirty" things with D3D12 and UMA devices.
 
I will look into VirtualAlloc; it should support large pages (2/4 MiB) on x86 CPUs, right?


Why do you need 32MB alignment?

 

Aligned allocation works by allocating alignment + requested size, then offsetting the returned pointer. So with a 32MB alignment, every allocation takes at least 32MB.
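A minimal C++ sketch of the over-allocate-then-offset scheme described above, built on plain malloc. The names sketch_aligned_alloc and sketch_aligned_free are hypothetical, and this is not the actual CRT implementation of _aligned_malloc; it only illustrates why each allocation carries up to one alignment's worth of padding.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: returns `size` bytes whose address is a multiple of
// `alignment` (assumed to be a power of two), or nullptr on failure.
void* sketch_aligned_alloc(std::size_t size, std::size_t alignment) {
    if (alignment == 0 || (alignment & (alignment - 1)) != 0) return nullptr;

    // Worst-case padding plus room to remember the pointer malloc gave us.
    const std::size_t overhead = (alignment - 1) + sizeof(void*);
    if (size > SIZE_MAX - overhead) return nullptr;  // request would overflow

    void* raw = std::malloc(size + overhead);
    if (raw == nullptr) return nullptr;

    // Skip the bookkeeping slot, then round up to the next aligned address.
    const std::uintptr_t first_usable =
        reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    const std::uintptr_t aligned = (first_usable + mask) & ~mask;

    // Store the original pointer just below the aligned block.
    reinterpret_cast<void**>(aligned)[-1] = raw;
    return reinterpret_cast<void*>(aligned);
}

// Frees a block obtained from sketch_aligned_alloc.
void sketch_aligned_free(void* p) {
    if (p != nullptr) std::free(reinterpret_cast<void**>(p)[-1]);
}
```

With a 32 MiB alignment the overhead term is just under 32 MiB per call, which is the waste the thread is about.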

For example, for 16-byte alignment: if the pointer returned by malloc is only 4-byte aligned, you may need to offset it by up to 12 bytes. If the memory is 8-byte aligned, you may need to offset it by 8 bytes. If the memory is 1-byte aligned, you may need to offset it by up to 15 bytes. Hence the minimum allocation must be 16 bytes + requested size.

 

In the 32MB alignment case, if malloc returns the pointer 0x04000004, you need to offset it by 33,554,428 bytes to make it a multiple of 33,554,432.
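The offsets quoted above can be checked with a few lines of C++. This is only a sketch: padding_for is a hypothetical helper name, and the alignment is assumed to be a power of two.

```cpp
#include <cstdint>

// Number of bytes to add to `address` to reach the next multiple of
// `alignment` (assumed to be a power of two). Returns 0 if already aligned.
inline std::uint64_t padding_for(std::uint64_t address, std::uint64_t alignment) {
    const std::uint64_t misalignment = address & (alignment - 1);
    return misalignment == 0 ? 0 : alignment - misalignment;
}
```

For the 32MB case, padding_for(0x04000004, 0x02000000) gives 33554428, so the aligned block would start at 0x06000000.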

Now you know where the wasted space comes from. With this approach it's unavoidable.

 

For stuff like meeting D3D12 requirements (like 4MB alignment) you seriously should use VirtualAlloc. But for meeting the alignment, you should do the alignment yourself. If you think the wasted storage is too big, manage the memory manually so that you can reuse the memory space that comes after the allocation start and before the aligned offset.




Last time I used VirtualAlloc in a 32-bit process the granularity (page size and alignment) was 4K, and it will give you multiples of that. I don't know if there's a limit. In 32-bit land you could probably allocate the entire address space at once if it were all free. No idea what happens in 64-bit land.

If you want to avoid fragmentation you can scan your process with VirtualQuery first to see what address ranges are already in use, and determine what a good starting address is to pass in.

From my own investigations on 32-bit processes in the past few years, typically you'll have a 32-bit EXE down around 0x00400000, your main thread stack will be at a lower address such as 0x00120000, and all of the DLLs will typically load much higher in RAM. This gives you a big empty space between the end of your EXE and the start of DLLs for heaps and stuff. It's likely that your runtime (MSVCRT or whatever) will create a/some heap(s) before your code can execute, though. I haven't analyzed any heavily multithreaded apps to know where they're likely to create additional thread stacks.
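The scan-then-pick-an-address idea can be sketched without touching the Windows API. In the C++ below, the list of occupied ranges is assumed to have been collected beforehand (on Windows, presumably by walking the process with VirtualQuery) and sorted by base address. UsedRange and find_aligned_gap are hypothetical names, and a return value of 0 stands for "no suitable gap".

```cpp
#include <cstdint>
#include <vector>

// One occupied region of the address space (hypothetical type).
struct UsedRange {
    std::uint64_t base;
    std::uint64_t size;
};

// Returns the lowest address in [lo, hi) that is a multiple of `alignment`
// (a power of two) and has `size` free bytes after it, or 0 if none exists.
// `used_sorted` must be sorted by base; addresses are assumed to stay far
// below 2^64, so the additions cannot overflow.
std::uint64_t find_aligned_gap(const std::vector<UsedRange>& used_sorted,
                               std::uint64_t lo, std::uint64_t hi,
                               std::uint64_t size, std::uint64_t alignment) {
    const std::uint64_t mask = alignment - 1;
    std::uint64_t cursor = lo;  // everything below cursor is known to be taken
    for (const UsedRange& r : used_sorted) {
        const std::uint64_t candidate = (cursor + mask) & ~mask;
        if (candidate + size <= r.base) return candidate;  // fits before r
        if (r.base + r.size > cursor) cursor = r.base + r.size;
    }
    const std::uint64_t candidate = (cursor + mask) & ~mask;
    return (candidate + size <= hi) ? candidate : 0;
}
```

The returned address would then be the one to pass to VirtualAlloc as the requested base. Another thread or DLL can still grab the range in between, so the request may fail and should be retried.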


So, VirtualAlloc looks like the only feasible way (I will run some tests forcing a 2 MiB page size instead of 4 KiB on x64...).

 

For smaller alignments (e.g. SSE/AVX alignment requirements) that's not a big deal, especially for automatic data, where the compiler or the language itself can do the job for us.

 

Take notes about CPU-dream architecture: fast and efficient way to return arbitrary memory alignment without waste of space...


Last time I used VirtualAlloc in a 32-bit process the granularity (page size and alignment) was 4K, and it will give you multiples of that. I don't know if there's a limit. In 32-bit land you could probably allocate the entire address space at once if it were all free. No idea what happens in 64-bit land.


I'm not sure what the max size (if any) is, but I believe the smallest permitted allocation of VirtualAlloc is the value of SYSTEM_INFO.dwAllocationGranularity. I seem to recall that in 64-bit Windows that value is actually 64k, so VirtualAlloc will give you multiples of that. There's a bit more info here, not sure how up to date it is.
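As a rough illustration of the rounding involved: as far as I know from the VirtualAlloc documentation, a requested reservation address is rounded down to a multiple of the allocation granularity, and sizes are rounded up to whole pages. The C++ sketch below assumes the typical values of 64 KiB and 4 KiB; real code should read dwAllocationGranularity and dwPageSize via GetSystemInfo instead of hard-coding them. The constant and function names are hypothetical.

```cpp
#include <cstdint>

// Assumed typical values; query GetSystemInfo on a real system.
constexpr std::uint64_t kAssumedGranularity = 64 * 1024;  // dwAllocationGranularity
constexpr std::uint64_t kAssumedPageSize    = 4 * 1024;   // dwPageSize

// Round `value` down to a multiple of `pow2` (a power of two).
constexpr std::uint64_t round_down(std::uint64_t value, std::uint64_t pow2) {
    return value & ~(pow2 - 1);
}

// Round `value` up to a multiple of `pow2` (a power of two).
constexpr std::uint64_t round_up(std::uint64_t value, std::uint64_t pow2) {
    return (value + pow2 - 1) & ~(pow2 - 1);
}

// A 32 MiB-aligned address is also granularity-aligned, so it is a valid
// base address to request for a reservation.
static_assert((32ull * 1024 * 1024) % kAssumedGranularity == 0,
              "32 MiB is a multiple of the assumed granularity");
```

Under these assumptions a requested base of 0x12345678 would be rounded down to 0x12340000, and a 5000-byte request would occupy 8192 bytes (two pages).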


If you write your own allocator, you can add the wasted/padding space back to your freelist. E.g. if you allocate 32MB but only use the first 4MB of it, then you've got a 28MB chunk that goes into your freelist, for use by allocations that have alignment requirements of 4MB (or 2MB/1MB/512KB/etc...) and a size of 28MB or less.
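A minimal C++ sketch of that idea, working on address ranges only (no real memory is touched). Block and carve_aligned are hypothetical names, the free list is just a vector, and the alignment is assumed to be a power of two; a real allocator would also merge and search the free list.

```cpp
#include <cstdint>
#include <vector>

// An address range owned by the allocator (hypothetical type).
struct Block {
    std::uint64_t base;
    std::uint64_t size;
};

// Carves `size` bytes aligned to `alignment` out of `chunk`. The unused head
// (before the aligned address) and tail (after the block) are appended to
// `free_list` instead of being wasted. Returns false if the chunk is too small.
bool carve_aligned(Block chunk, std::uint64_t size, std::uint64_t alignment,
                   Block& out, std::vector<Block>& free_list) {
    const std::uint64_t mask = alignment - 1;
    const std::uint64_t aligned = (chunk.base + mask) & ~mask;
    const std::uint64_t end = chunk.base + chunk.size;
    if (aligned < chunk.base || aligned > end || size > end - aligned) return false;

    if (aligned > chunk.base) {
        free_list.push_back({chunk.base, aligned - chunk.base});  // head padding
    }
    const std::uint64_t used_end = aligned + size;
    if (used_end < end) {
        free_list.push_back({used_end, end - used_end});  // tail remainder
    }
    out = {aligned, size};
    return true;
}
```

Carving 4MB (4MB-aligned) out of a 32MB chunk based at 0x02000000 leaves a single 28MB free block starting at 0x02400000, which is itself 4MB-aligned and can serve later requests.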


Take notes about CPU-dream architecture: fast and efficient way to return arbitrary memory alignment without waste of space...

It's nothing to do with CPU architecture. The hardware doesn't care what you do with the 31.99 MB of extra space you allocated to get an address which is a multiple of 0x2000000.



 
My dream CPU architecture does not come with DDR or any other form of RAM support; instead it supports big cache modules with direct physical address access. That's why it's called a "dream architecture".
 
 

If you write your own allocator, you can add the wasted/padding space back to your freelist. E.g. if you allocate 32MB but only use the first 4MB of it, then you've got a 28MB chunk that goes into your freelist, for use by allocations that have alignment requirements of 4MB (or 2MB/1MB/512KB/etc...) and a size of 28MB or less.



Yes, that was my intention, but it is still a crazy waste of space with direct _mm_malloc implementations if I use an alignment greater than a few KiB (it looks like it doesn't use large pages). Beyond 1 MiB it becomes insane.
