wintertime, on 11 Mar 2013 - 14:15, said:
cr88192, on 10 Mar 2013 - 16:42, said:
related to a prior post:
I have observed that at 64x64 pixels and stored as RGBA, the Unicode BMP is too large to effectively fit inside a 32-bit process (although DXT1 makes more sense here).
Sorry, this point doesn't seem to add up to me.
On one side you feel the need to store high-quality glyphs as 32-bit RGBA (just for those nearly invisible colored edges that make use of pixel ordering on LCDs?) when you could possibly get away with something like 4*2-bit, 8-bit or even 1-bit; on the other side you want to conserve (probably far fewer) bytes by cutting complete character ranges beyond 0xffff? I would guess Asian users would be happier with low-quality glyphs in a game than with only having "half" the characters they need.
Maybe you can just load more language ranges after first use? Also, there are huge empty, reserved, or private-use ranges in Unicode, so it should be much less than 0x10ffff anyway.
this observation was mostly made with my font-processing tools, which were crashing due to trying to malloc too much data, and failing.
these tools basically just naively malloc image buffers for the entire character space, for sake of processing. 32px was the effective upper limit for having everything fit in the process (and not crashing the tool).
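as a rough sketch of the arithmetic behind that limit (assuming one square glyph buffer per BMP code point, 4 bytes per pixel for RGBA, and DXT1's 8 bytes per 4x4 block; the function names are just for illustration, not the tool's actual code):

```python
# Rough memory estimate for naively allocating one image buffer
# per code point in the Basic Multilingual Plane (U+0000..U+FFFF).
BMP_CODE_POINTS = 0x10000
MIB = 1024 * 1024

def rgba_bytes(glyph_px, code_points=BMP_CODE_POINTS):
    """Bytes for square RGBA glyphs at 4 bytes per pixel."""
    return code_points * glyph_px * glyph_px * 4

def dxt1_bytes(glyph_px, code_points=BMP_CODE_POINTS):
    """Bytes for DXT1 glyphs: each 4x4 pixel block is stored in 8 bytes."""
    blocks_per_side = (glyph_px + 3) // 4
    return code_points * blocks_per_side * blocks_per_side * 8

for px in (32, 64):
    print(px, "px:", rgba_bytes(px) // MIB, "MiB RGBA,",
          dxt1_bytes(px) // MIB, "MiB DXT1")
```

at 64px the RGBA buffers alone come to 1 GiB, which leaves little room for anything else in a 32-bit address space, while 32px needs 256 MiB.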
this wouldn't much affect the engine, apart from if using CJK characters and ending up pulling in a large part of the character space (only accessed parts of the font-space are converted). in-engine, RGBA exists as an intermediate stage, mostly prior to converting into DXT1 (for upload to the GPU).
as for strings, the issue is that mostly things like strings and similar take up a fair amount of heap-space in my engine (although, granted, not nearly as much as voxel terrain, vertex arrays, ..., which as-is are the majority of the memory use). (~ 1GB is typically used for voxels and VAs and similar).
as is, it would be a difference mostly of around ~ 600MB for UTF-32, vs ~ 150MB for UTF-8 (ASCII-range is by far dominant). (EDIT: most of this is internal text/data, relatively little end-user directed text). for most things, it makes sense to stick with UTF-8.
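the roughly 4x ratio is what one would expect for mostly-ASCII text, since UTF-8 uses 1 byte per ASCII character and UTF-32 always uses 4. a minimal sketch to check this on sample strings (the strings here are made up for illustration):

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf32_bytes) for a string, with no BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-32-le"))

# Hypothetical samples: an internal ASCII string and a CJK string.
for sample in ("r_drawdistance 4096", "日本語"):
    utf8, utf32 = encoded_sizes(sample)
    print(repr(sample), "UTF-8:", utf8, "bytes, UTF-32:", utf32, "bytes")
```

for CJK text the gap narrows (3 bytes vs 4 bytes per BMP character), so the savings depend on ASCII being dominant.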
(my engine's MM is able to dump how much of what types of memory allocations are made).
the main thing which would be affected by UTF-32 (assuming nearly everything else remaining UTF-8) would be the console buffers, which would go from around 500kB to 1MB, but granted, this isn't really a huge issue (that or reworking how effects work). probably it would also affect the in-console text-editor, which is basically partly integrated with the console. (EDIT/ADD: consoles store a buffer for a 1024x768 window, which works out to 128x96 chars with an 8px char, using 2 words for each character, and with 10 consoles, or 491kB vs 983kB).
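the console figures can be reproduced as follows (a sketch, assuming a "word" is 16 bits in the current layout and 32 bits in the UTF-32 case, which is what the 491kB and 983kB numbers imply; the function and parameter names are made up here):

```python
def console_buffer_bytes(bytes_per_word, win_w=1024, win_h=768, char_px=8,
                         words_per_char=2, consoles=10):
    """Total bytes for all console text buffers."""
    cols = win_w // char_px   # 128 columns
    rows = win_h // char_px   # 96 rows
    return cols * rows * words_per_char * bytes_per_word * consoles

print(console_buffer_bytes(2))  # 16-bit words
print(console_buffer_bytes(4))  # 32-bit words
```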
most of the code in-engine works directly between UTF-8 formatted strings, with a few edge-cases where UTF-16 is used.
most of the conversion code knows about surrogate pairs and other things, though M-UTF-8 is typically the "canonical" storage, partly due to JVM influence (and, like Java and ECMAScript, my scripting language uses UTF-16 as its canonical string format, though M-UTF-8 is often used internally). (actual heap usage due to UTF-16 strings is fairly insignificant, given how infrequently they are used at present).
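for reference, a minimal sketch of an M-UTF-8 encoder, assuming "M-UTF-8" here means the JVM's Modified UTF-8 (U+0000 stored as the two bytes C0 80, and code points above U+FFFF stored as a surrogate pair with each half encoded as 3 bytes). this is not the engine's conversion code, just an illustration of the format:

```python
def encode_mutf8(text):
    """Encode a str as JVM-style Modified UTF-8."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # NUL is written as an overlong 2-byte form, so the
            # encoded data never contains a zero byte.
            out += b"\xc0\x80"
        elif cp < 0x10000:
            # BMP characters use ordinary UTF-8 (1 to 3 bytes).
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Supplementary characters: split into a UTF-16 surrogate
            # pair, then encode each surrogate as 3 bytes.
            cp -= 0x10000
            high = 0xD800 | (cp >> 10)
            low = 0xDC00 | (cp & 0x3FF)
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)

print(encode_mutf8("A\0\U0001F600").hex())
```

a supplementary character thus takes 6 bytes instead of the 4 that standard UTF-8 would use.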
(EDIT/ADD: a cheap/lazy solution found for console: an effect-flag now indicates that the background-color field encodes 4 more character bits (with the background color coming from prior character in this case), subscript/superscript now uses a single bit, which if set uses strikeout to indicate which it is... this allows for effectively 20 bit characters).
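a sketch of how such a packing could look. the bit layout below is entirely hypothetical (the actual field positions in the console cells are not given above): a 16-bit character word, plus an attribute word with a 4-bit foreground, a 4-bit background field, and one effect flag saying that the background field holds character bits 16..19 instead:

```python
EXT_CHAR_FLAG = 0x0100  # hypothetical flag bit in the attribute word

def pack_cell(code_point, fg, bg):
    """Pack a code point of up to 20 bits into (char_word, attr_word)."""
    if not 0 <= code_point < (1 << 20):
        raise ValueError("code point does not fit in 20 bits")
    char_word = code_point & 0xFFFF
    high = code_point >> 16
    if high:
        # Background field is reused for the 4 extra character bits.
        attr = EXT_CHAR_FLAG | (high << 4) | (fg & 0xF)
    else:
        attr = ((bg & 0xF) << 4) | (fg & 0xF)
    return char_word, attr

def unpack_cell(char_word, attr, prev_bg):
    """Return (code_point, fg, bg); bg comes from the prior cell if extended."""
    fg = attr & 0xF
    field = (attr >> 4) & 0xF
    if attr & EXT_CHAR_FLAG:
        return char_word | (field << 16), fg, prev_bg
    return char_word, fg, field

cell = pack_cell(0x1F600, fg=7, bg=2)
print(unpack_cell(*cell, prev_bg=3))
```

note that 20 bits covers U+0000 through U+FFFFF, so everything except plane 16 (a private-use plane) is representable.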
or such...
Edited by cr88192, 12 March 2013 - 12:36 AM.