Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Very slow pixel-conversion inside glTexture2D (texture.c) #265

Closed
kas1e opened this issue Jan 16, 2021 · 30 comments
Closed

Very slow pixel-conversion inside glTexture2D (texture.c) #265

kas1e opened this issue Jan 16, 2021 · 30 comments

Comments

@kas1e
Copy link
Contributor

kas1e commented Jan 16, 2021

Hi ptitSeb!

As you may notice in the closed Irrlicht Engine issue, I found out that something very weird and slow happens when we load textures to VRAM. And does not matter the size of the textures: 3MB or just 10KB, it always the same constantly slow.

We with Daniel spend some hours on debugging and find out, that when we call glTexImage2D() with something like 8-8-8-8-REV or even 1-5-5-5-REV, the speed of loading textures drops in a factor of 6. When we use something like GL_UNSIGNED_BYTE (so no conversion happens) then it's fast.

By massive speed drop, I mean: GL_UNSIGNED_BYTE takes 1 second for load up 10 10-20kb textures, and GL_UNSIGNED_8_8_8_8_REV for the same 10 textures texture takes 6 (!) seconds. Imagine what happens when we load 30 textures :)

What we do next, is we just grab the latest GL4ES, and replace in texture.c all big_endian ifdefs on some crap, so do not call conversion and the same test code with 8_8_8_8_REV start to be very fast as expected.

Now, we have a wild guess that this is probably due to the heap activity (malloc/free). Because speed drop is constant, and do not relate at all to the size of textures (at least between 10kb and 3MB of tested ones). That just a guess, of course. And such conversions as done in texture.c by themselves should be a matter of some milliseconds, not like it now 6 seconds for 10 textures :) That why we think about heap activity being an issue there.

Probably we can solve it with one of those solutions:

1). create and use a big (eventually growing) per-context scratch-buffer for stuff like that.
2). for smaller temporary buffers, use dynamic allocation on the stack.

Currently, we have no needs to worry about all the types for tests, as 8_8_8_8_REV is one which already dead slow :)

Of course, I will be glad to donate to a solution with no problems. And other PPC machines (and not only, but any for which need conversion, just currently PPC) will benefit too :)

Thanks!

@ptitSeb
Copy link
Owner

ptitSeb commented Jan 17, 2021

Ah, yes, sorry for not responding on that.

Texture upload on GLES hardware is very limited compared to what OpenGL can do, so gl4es has a complicated scheme to handle that.
Texture upload with gl4es can follow 3 paths:

  1. No need to conversion: then it's the fast way (if there is no need also to create mipmap and the texture needs no other conversion). Texture is uploaded as-is
  2. Need to convert, but the format is one of the frequently used and so a "fast path" exist: coversion takes place, so the upload will be slower, but the "fast path" conversion is there, and use some fast byte-shuffling porcess to convert texture
  3. Need to convert, but the source or destination is a rarely used format: the convertion then use a generic slow conversion mecanism, that use float as intermediary format. The upload there is the slower.

The 2 and 3 cases also will use a malloc to create a new texure in a compatible format.

I can add a "fast path" with source == GL_UNSIGNED_INT_8_8_8_8_REV but the malloc/free will stay here.

If the speed issue is really the malloc, that something else I need to work on (creation of a pair of scratch texture in glstate structure for all conversion, because I sometimes needs 2 when conversion also include resizing)

@kas1e
Copy link
Contributor Author

kas1e commented Jan 17, 2021

Hi, thanks for your answer!

I can add a "fast path" with source == GL_UNSIGNED_INT_8_8_8_8_REV but the malloc/free will stay here.

We may try firstly that way for sake of tests, if speed will be the same as GL_UNSIGNED_BYTE, then all fine. But as I say it does not matter if the texture is 10 kilobytes of 3 megabytes, speed drops are the same. Meaning that it's not actual conversion takes time, but some "handler" of conversion (malloc/free i.e. heap ?)

If the speed issue is really the malloc, that something else I need to work on (creation of a pair of scratch texture in glstate structure for all conversion, because I sometimes need 2 when conversion also include resizing)

IMHO if the "fast path" theme will not work, and speed still will be slow, then we can be sure it "malloc/free" probably and can follow that way.

In the worse case, Daniel says he can add a conversion in ogles2.library, so no conversion needs for GL4ES, and he says it will cost nothing in terms of speed, so probably speed drops because of conversion should be not that big as we have now, just milliseconds. But of course, it's better if things will be handled inside GL4ES, as it will benefit everyone.

If you can create some test-branch with "fast path" I can test it right away :) So we will know if it path-conversion issue, or just general handling (malloc/free). But that constant speed drop not related to the size of texture somehow means it not the "path route" issue. IMHO.

ptitSeb added a commit that referenced this issue Jan 17, 2021
@ptitSeb
Copy link
Owner

ptitSeb commented Jan 17, 2021

I tried to add that fast path. I'm unsure it will trigger, so please add a printf in line 866 of pixels.c to be sure it's used, as a test.

@kas1e
Copy link
Contributor Author

kas1e commented Jan 17, 2021

Saying pixel.c:853 error: ‘GL_INT_8_8_8_8_REV’ undeclared (first use in this function); did you mean ‘GL_INT8_REV’? Probably UNSIGNED_INT ?

@ptitSeb
Copy link
Owner

ptitSeb commented Jan 17, 2021

I meant GL_UNSIGNED_INT_8_8_8_8_REV
I pushed a fix, sorry.

@kas1e
Copy link
Contributor Author

kas1e commented Jan 17, 2021

Compiles fine, but prinfs didn't showups there, so it not triggered and same slow then :)

@ptitSeb
Copy link
Owner

ptitSeb commented Jan 17, 2021

Do you know what is the exact conversion going on ? (that would be the commented printf in line 780 to get that info)

@kas1e
Copy link
Contributor Author

kas1e commented Jan 17, 2021

Uncommeted that prinf and for each loaded texture i have:

pixel conversion: 1024x1024 - GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV (4) ==> GL_RGBA, GL_UNSIGNED_BYTE (4), transform=0, align=4, src_width=4096(0), dst_width=4096(0)
pixel conversion: 1024x1024 - GL_RGBA, GL_UNSIGNED_BYTE (4) ==> GL_BGRA, GL_UNSIGNED_BYTE (4), transform=0, align=4, src_width=4096(0), dst_width=4096(0) 

@ptitSeb
Copy link
Owner

ptitSeb commented Jan 17, 2021

2 conversions for each textures? Mmmm, I need to check. Texture conversion code is a bit of a mess, because there are many special case to handle for many games...

@ptitSeb
Copy link
Owner

ptitSeb commented Jan 17, 2021

Can you activate debug log in texture.c to see what are the call that generates thoses 2 conversions?

@kas1e
Copy link
Contributor Author

kas1e commented Jan 17, 2021

you mean just #define DEBUG ?

@kas1e
Copy link
Contributor Author

kas1e commented Jan 17, 2021

If so, then for each texture i had:

glTexImage2D on target=GL_TEXTURE_2D with unpack_row_length(0), size(1024,1024) and skip(0,0), format(internal)=GL_BGRA(GL_RGBA), type=GL_UNSIGNED_INT_8_8_8_8_REV, data=0x5d535008, level=0 (mipmap_need=0, mipmap_auto=0, base_level=-1, max_level=-1) => texture=267 (streamed=0), glstate->list.compiling=0
pixel conversion: 1024x1024 - GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV (4) ==> GL_RGBA, GL_UNSIGNED_BYTE (4), transform=0, align=4, src_width=4096(0), dst_width=4096(0)
pixel conversion: 1024x1024 - GL_RGBA, GL_UNSIGNED_BYTE (4) ==> GL_BGRA, GL_UNSIGNED_BYTE (4), transform=0, align=4, src_width=4096(0), dst_width=4096(0)

@ptitSeb
Copy link
Owner

ptitSeb commented Jan 17, 2021

Ok thanks, I'll check how to improve this

ptitSeb added a commit that referenced this issue Jan 17, 2021
… used 2 pass where 1 would be enough (to help #265)
@ptitSeb
Copy link
Owner

ptitSeb commented Jan 17, 2021

Ok, I pushed something, it should be better now.

@kas1e
Copy link
Contributor Author

kas1e commented Jan 17, 2021

Omg, you did it!

Everything the same fast with EDT_OPENGL as if I just use EDT_SOFTWARE in IrrLicht in terms of texture loading! Yeah!

So it wasn't then free/malloc ?

Btw, I noticed another issue with the latest gl4es, it happens at least with IrrLicht, but maybe a general one: on loading it says about pixel/vertex shader compilation failed at positions 585 and 605 for vertex and 580 and 586 at the pixel, saying "invalid token". But I need to do more tests to see if it now only with IrrLicht, or general and will create a separate topic for (and btw, no more ARGB errors with Irrlich, you were right:) )

@ptitSeb
Copy link
Owner

ptitSeb commented Jan 17, 2021

Ok good 👍
Yeah, open another ticket, as this one is to be closed :)

@kas1e
Copy link
Contributor Author

kas1e commented Jan 17, 2021

Oh, one more thing before we close it: Daniel find out that Nova supports it all natively (those types), and so he added things to ogles2 so no need for conversion. That what he wrote in the readme:

- support for the following additional texture data types:
  GL_UNSIGNED_SHORT_1_5_5_5_REV and GL_UNSIGNED_BYTE_3_3_2, GL_UNSIGNED_BYTE_2_3_3_REV, GL_UNSIGNED_INT_10_10_10_2,
  GL_UNSIGNED_INT_2_10_10_10_REV.
  Those are not defined for the OpenGL ES 2 standard, they are extensions of our implementation.
  GL_UNSIGNED_SHORT_1_5_5_5_REV has been added to work around a very high texture conversion performance hit in certain
  games / libs (like 6 seconds instead of less than 1 to load and prepare a texture).
  The others have been added because it was a 10 minutes thingy anyway, Nova natively supports those formats.
- added GL_OES_vertex_type_10_10_10_2 and GL_EXT_texture_type_2_10_10_10_REV to the extensions string.
- support for the texture data types GL_UNSIGNED_INT_8_8_8_8 and GL_UNSIGNED_INT_8_8_8_8_REV.
  On our big endian machines GL_UNSIGNED_INT_8_8_8_8 is actually equivalent to GL_UNSIGNED_BYTE, whereas
  GL_UNSIGNED_INT_8_8_8_8_REV is like GL_UNSIGNED_BYTE with bytes swapped to the little endian order.
- unfortunately there are no existing extension strings for all those formats. May well be that our ogles2.lib is the
  world's only OpenGL ES 2.x implementation that supports all those texture data types at all ;)
  Therefore I had to come up with our own extension strings so that client applications and libs like gl4es can
  actually benefit from those newly supported data types:
  GL_AOS4_texture_format_RGB332, GL_AOS4_texture_format_RGB332REV, GL_AOS4_texture_format_RGBA1555REV,
  GL_AOS4_texture_format_RGBA8888 and GL_AOS4_texture_format_RGBA8888REV.

And that means that it will be faster than any conversion inside of gl4es. And that speed increase really need it: even if it will speed up things just by 20%, that for our machines will be pretty cool :)

But this all also untested. So, can you please add another flag to the build process, like "AMIGAOS4_NATIVE_TYPES=1" maybe, or maybe some ifdef (without worry about build-process changes for now), so we can not use conversion for all those types, but set them directly for amigaos4 only?

By that, I can test ogles2 in those terms, and if it will be faster, we can keep it (via ifdefs, of course, so other PPC-machines will have what we have now, while on amigaos4 we will have it native and without conversion at all, for all those formats).

So those ones types which gl4es used will be really cool to have ifdefed to not use conversion at all, but just directly our formats. Can you do so if it not so much hassle?

Thanks a bunch!

@kas1e
Copy link
Contributor Author

kas1e commented Jan 17, 2021

PS. I think at least formats like GL_AOS4_texture_format_RGBA8888REV and GL_AOS4_texture_format_RGBA1555REV are must, while not sure about other ones like GL_AOS4_texture_format_RGBA8888 if there any conversion for that one inside of gl4es? And those 2 GL_AOS4_texture_format_RGB332, GL_AOS4_texture_format_RGB332REV probably of no use in gl4es ?

Or that all can be still used and redirected all of them, so no conversion will be for any of them?

@kas1e
Copy link
Contributor Author

kas1e commented Jan 18, 2021

Thinking more about it, we IMHO no need any new define in the CMAKE or something, we just need to add it via pure AMIGAOS4 ifdef. I.e. keep BIG_ENDIAN where it is (for other possible PPC machines), but in all the parts where currently you do texture conversion, just add AMIGAOS4 ifdef and set it to those types, so no conversion for AMIGAOS4 will be done only. For other PPC machines it will still have a place.

I checked Irrlicht code in COpenGLTexture.cpp, and those types they surely handle:

GL_UNSIGNED_SHORT_1_5_5_5_REV
GL_UNSIGNED_SHORT_5_6_5
GL_UNSIGNED_BYTE
GL_UNSIGNED_INT_8_8_8_8_REV

So GL_UNSIGNED_BYTE you already didn't convert and all go as it, then GL_UNSIGNED_SHORT_1_5_5_5_REV and GL_UNSIGNED_INT_8_8_8_8_REV need to be ifdefed for AMIGAOS4 to not use conversion, but just GL_AOS4_texture_format_RGBA1555REV and GL_AOS4_texture_format_RGBA8888REV. But not sure about GL_UNSIGNED_SHORT_5_6_5 if it needs any handle (and if gl4es do any conversion for this type?)

@ptitSeb
Copy link
Owner

ptitSeb commented Jan 18, 2021

Yeah ok, I'll the support for those format, according to the extension string defined by Daniel.

@kas1e
Copy link
Contributor Author

kas1e commented Jan 18, 2021

Thanks a bunch!

Just probably in yesterday's part where we add ifdef BIG_ENDIAN in pixel.c should be something like #ifdef BIG_ENDIAN && !defined(AMIGAOS4) (so this will not waste time as well for unnecessary stuff). Or even in the whole pixel.c then?

@ptitSeb
Copy link
Owner

ptitSeb commented Jan 18, 2021

No need. Once I add the support for those format, convert will not be called for those format anyway.

@kas1e
Copy link
Contributor Author

kas1e commented Jan 18, 2021

Ah, that is better. So if the format is caught, then it handled, if not, then usually PPC-way. Good!

@ptitSeb
Copy link
Owner

ptitSeb commented Jan 18, 2021

Ok, I have added the formats (well the RGB332 are not used by gl4es, but I've never encountered anything using it).
You can try now.

@kas1e
Copy link
Contributor Author

kas1e commented Jan 18, 2021

Tested : builds fine, on running it give us:

LIBGL: Extension GL_AOS4_texture_format_RGB332 detected
LIBGL: Extension GL_AOS4_texture_format_RGB332REV detected
LIBGL: Extension GL_AOS4_texture_format_RGBA1555REV detected and used
LIBGL: Extension GL_AOS4_texture_format_RGBA8888 detected and used
LIBGL: Extension GL_AOS4_texture_format_RGBA8888REV detected and used 

But when I run Irrlichts test case over it: everything just fully green.

Probably need to add some debugs to see if things work as expected from the gl4es side? Maybe again in pixel.c and texture.c enable debugs ?

@kas1e
Copy link
Contributor Author

kas1e commented Jan 18, 2021

I tried an older version of ogles2.library, where we didn't have those things implemented, and all renders correct (so go old conversion way). Now to understand : is it something in gl4es or in ogles2 need dealing with :) Maybe still some big-endian ifdefs play a role there while should't ?

@ptitSeb
Copy link
Owner

ptitSeb commented Jan 18, 2021

I added support for those format: when they are encounter, I dont try to convert them and just use them. So I send those format to gles2 lib "as-is". I don't know, gl4es doesn't do anything here in fact.

@kas1e
Copy link
Contributor Author

kas1e commented Jan 18, 2021

Aha ok, will drop that info on Daniel (and big thanks for worry!)

@kas1e
Copy link
Contributor Author

kas1e commented Jan 18, 2021

Ok, Daniel fixes it, it was just some issue about Swizzles, now everything fine!

Through speed not increased at all, which kind of strange too. Imho conversion still takes some time, and when we without it should be a little bit faster anyway?

I added printfs to the pixel.c , where we yesterday add big_endian ifdefs, and they surely not called. So .. it can be that speed differences between your conversion, and no conversion at all even not visibly and there is nothing else we can do out of it.

But anyway, have no conversion anyway better and hope somewhere faster.

Thanks a bunch! Some first part of donation on the way!

@kas1e kas1e closed this as completed Jan 18, 2021
@ptitSeb
Copy link
Owner

ptitSeb commented Jan 18, 2021

Ok, Daniel fixes it, it was just some issue about Swizzles, now everything fine!

Through speed not increased at all, which kind of strange too. Imho conversion still takes some time, and when we without it should be a little bit faster anyway?

The "fast path" conversion is pretty fast. I assume the actual texture upload to the GPU memory take much more time than the fast-path conversion (as you CPU do integer stuff quite fast), the time to convert or not is negligeable compared to the upload to gpu mem...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants