quote:Original post by SiCrane
baskuenen: Maybe I'm not understanding your lookup table, but with only 1024 entries, it seems you can only store the reciprocals of the run length, which means you still have to do a multiply instruction to get the increment value. So the combination of the load with the multiply would seem worse than a straight divide, even if there is no cache miss. As I originally understood it, it sounded like you wanted a lookup table the size of the screen (1024 x 768), which obviously would have major cache problems.
But let's assume that I'm just misunderstanding your lookup table and it works efficiently with regard to best-case instruction time. Let's take a PIII Katmai: it has a 16 KB 4-way associative L1 data cache. If you load in a 4 KB lookup table, you've taken up a cache line in each and every set. If we have a 32-bit framebuffer at 1024 x 768, then each horizontal scanline will similarly consume a cache line in each and every set of the cache. Because of the structure of a frame buffer, we know that all the pixels along a vertical line fall in the same cache set. Now let's draw a vertical line straight from the top of the screen to the bottom. We've just loaded 768 different cache lines in and out of one set in the L1 cache. Not only that, but the L2 cache on a PIII Katmai is only 256 KB and 8-way associative, so whatever was flushed from the L1 cache is also no longer sitting in the L2 cache. So when you try to use any part of the lookup table that sat in that set, you'll have to do a fetch from main memory, which takes about 115 clock cycles. Not only will purely vertical lines do this, but many near-vertical lines as well, since horizontally adjacent pixels also tend to fall in the same cache line. Many lines with a slope somewhere between vertical and 1 will also cause entire cache sets to be flushed, which induces the less costly L1 cache miss on table lookup, still on the order of 10 additional cycles. And after three or four near-horizontal line calls, it seems likely that previously unused elements of the lookup table will also be flushed from the L1 cache. So for random lines, you'll almost always have to load the lookup table value from the L2 cache or worse.
Things are even more grim if you consider the P4, which only has an 8 KB L1 data cache, or the PII, where the cache miss penalties are much higher.
SiCrane, thanks for your explanation. I still cannot agree with you.
First, I'm assuming we're reading the pixel before writing - otherwise it won't be loaded into the cache.
Second, for a maximum-length vertical line, you have 768 pixels. Every cache line is 32 bytes: 32 * 768 = 24 KB of cache needed. You're right when we have a small cache. (I assumed we had at least 128 KB, like the "old" Pentiums.) Isn't the cache getting smarter nowadays? Like evicting the least-used memory first? I'm not sure about this though.
Third - using a 1024x768 lookup table is overdoing it a bit. I would suggest a lookup table containing the values of 1/DeltaWidth or 1/DeltaHeight. And yes - we would need an extra multiply to put this value to use (dx/dy or dy/dx).
Just some code - not really important
FLD  LookupTbl[eax*4]  ; eax = dx                ; 1 cycle
FMUL [dy]                                        ; 1 cycle
; do some integer thing                          ; 1 cycle loss due to FISTP
FISTP [result]                                   ; 3 cycles (or was it 5?)
MOV  eax, [result]     ; paired instruction      ; 1/2 cycle for loading into register again
; total of 6.5 CPU cycles needed for the calculation
If the cache is only 8 KB in size, using it for drawing (including the read before write) would make things really slow.
If we don't use the cache for writing to the screen, the bit lost to our lookup table is acceptable.
Am I wrong?
Is it true that cache sizes are decreasing in newer computers? Why? To make them a small bit faster, or because of the costs?
Some last words:
One thing is for sure: If your lines are very short - Bresenham is your solution!