• FEATURED
• FEATURED
• FEATURED
• FEATURED
• FEATURED

View more

View more

View more

### Image of the Day Submit

IOTD | Top Screenshots

### The latest, straight to your Inbox.

Subscribe to GameDev.net Direct to receive the latest updates and exclusive content.

# why processors work in bytes and not bits

Old topic!

Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.

21 replies to this topic

### #1lomateron  Members

Posted 19 September 2013 - 12:15 AM

making a 3D image by raycasting a voxel grid like this:

If every voxel just uses one bit and there is a for loop that checks the voxels in the way from the origin of the ray to the end of the ray when it hits a voxel with a 1 value bit

That raycasting method will run pretty fast if the hardware would permit that fro loop to just check bit by bit, and it will uses 8 times less memory

Edited by lomateron, 19 September 2013 - 12:16 AM.

### #2Álvaro  Members

Posted 19 September 2013 - 12:27 AM

You don't need to waste memory at all: You can use every bit of every byte and access them by shifting. If raycasting the way you describe were very fast in a bit-addressable machine, it would also be quite fast on a modern computer: I expect the performance bottlenecks will be things like cache misses and branch misprediction, not access to individual bits.

### #3lomateron  Members

Posted 19 September 2013 - 12:37 AM

access them by shifting

yeah I know but that shifting wastes some time

In c++ is there any way to read bit by bit without modifying a byte?

Edited by lomateron, 19 September 2013 - 12:37 AM.

### #4Hodgman  Moderators

Posted 19 September 2013 - 12:37 AM

Computers work in a mixture of bits, bytes and different kinds of words and lines too, depending on the component
e.g. on a 32-bit CPU, you can't really just operate on 1 bytes, you're usually operating on a whole word (and using masking to only seem to work on one byte), meanwhile your memory controller might work on 512bit lines of data, fetched from 8bit-aligned addresses...

Most CPUs have lots of nice instructions to work with bits, such as popcnt (count the 1's in a word), find index of first 1 from left/right end of word, count leading/trailing zeros, etc... You can use these to very efficiently iterate through arrays of bits.

access them by shifting

yeah I know but that shifting wastes some time
In c++ is there any way to read bit by bit without modifying a byte?

On a modern CPU you can't access a word byte-by-byte without shifting either
The amount of time that is "wasted" is probably so small that you won't notice if it's written well. Fetching your arrays of bytes from RAM to the CPU will likely be orders of magnitude slower than the CPU code for iterating through them.

Can you post some psuedo-code of what you want to do? Something like this?
rayLength = 0
for( int i=0; i!=end; ++i )
if( bits[i] )
rayLength = i;
break;

Edited by Hodgman, 19 September 2013 - 12:41 AM.

### #5lomateron  Members

Posted 19 September 2013 - 01:44 AM

the for loop will just find the position of the voxel by following the ray and checking the voxels that are in its way until it gets to the voxel that has something,

then the position is used as a pointer to another voxel grid that has varibles that define the color, brightnes, or any other thing of the pixel

bit array[100][100][100]; //the 3D voxel grid
byte3 pos;
//starting position of the ray. If the grid is bigger than 256x256x256 then this will require a bigger space than a byte, same for next variables
byte3 dir; //direction of the ray(it is not normalized, it can't be duhh) and one bit will be used for sign
for(byte i=0;i<100;i++)
{
read=(dir*i/100)+pos; //this operation will require a space bigger than a byte
}
the pointer is in "read" now


Edited by lomateron, 19 September 2013 - 01:56 AM.

### #6Vortez  Members

Posted 19 September 2013 - 02:14 AM

I suggest reading about boolean operation, like "and", "or", "xor" and "not", being the most importants. Suppose you want to read the first bit of a 32 bits integer, you do something like this:

// True if the first bit is set to 1

if((x & 0x00000001) > 0)

What this operation does is set every bit of x to 0, but keep the first one unchanged (it dosent change x, but a copy of it). But what if you want the fifth one?

if(((x & (1 << 4)) > 0)

This work if you are testing a single bit at least.

I also made a class that can read a bit from a starting adresses, so it can read the 1000 bit of a very long array if you wish so, or it can work for a single BYTE as well

#include "Bits.h"

////////////////////////////////////////////////////////////////////////////////////////////////
// Return information about the bit position into the bitfield
////////////////////////////////////////////////////////////////////////////////////////////////
BYTE* Setup(void *pBitField, DWORD BitIndx, DWORD *pBitShift)
{
BYTE *pData = (BYTE*)pBitField;
pData += BitIndx / 8;
*pBitShift = BitIndx % 8;
return pData;
}

////////////////////////////////////////////////////////////////////////////////////////////////
// Return true if the bit is set to 1
////////////////////////////////////////////////////////////////////////////////////////////////
bool ReadBit(void *pBitField, DWORD BitIndx, DWORD DataSize)
{
// Make sure we aren't testing beyond the data memory limit
if(DataSize > 0 && (BitIndx/8)+1 > DataSize)
return false;

// Tell how many bits to shift in the byte pointer below
DWORD BitShift = 0;
// Make a BYTE pointer to simplify things...
BYTE *pData = Setup(pBitField, BitIndx, &BitShift);
// Return true if the bit is set to 1
return ((*pData >> BitShift) & 0x01) > 0;
}

////////////////////////////////////////////////////////////////////////////////////////////////
// Write 0 or 1 into the bitfield
////////////////////////////////////////////////////////////////////////////////////////////////
void WriteBit(void *pBitField, DWORD BitIndx, BYTE BitValue, DWORD DataSize)
{
// Make sure we aren't writing beyond the data memory limit
if(DataSize > 0 && (BitIndx/8)+1 > DataSize)
return;

// Make sure the value here is only 0 or 1
BitValue = (BitValue & 0x01);

// Tell how many bits to shift in the byte pointer below
DWORD BitShift = 0;
// Make a BYTE pointer to simplify things...
BYTE *pData = Setup(pBitField, BitIndx, &BitShift);

// Set the bit to 0 or 1
switch(BitValue)
{
case 1:	 *pData |=  (0x01 << BitShift); break;
default: *pData &= ~(0x01 << BitShift); break;
}
}



I also made this bit array class, which let you use the [ ] to set or get any bits you want in the array, might be a bit advanced for you but anyway, here it is

#include "BitArray.h"

CBitArray::CBitArray()
{
pBuffer = NULL;
BufferSize = 0;
BitsCount  = 0;
}

CBitArray::~CBitArray()
{
FreeMem();
}

bool CBitArray::IsAllocated()
{
return pBuffer != NULL;
}

void CBitArray::Allocate(UINT NumBits)
{
if(IsAllocated() || NumBits == 0){return;}

BitsCount  = NumBits;
BufferSize = ((NumBits-1)/8)+1;

pBuffer = new BYTE[BufferSize];
ClearAll();
}

void CBitArray::Resize(UINT NumBits)
{
if(!IsAllocated() || NumBits == 0){return;}

BYTE *pTmpBuffer = new BYTE[BufferSize];
memcpy(pTmpBuffer, pBuffer, BufferSize);

UINT NewBitsCount  = NumBits;
UINT NewBufferSize = ((NumBits-1)/8)+1;

delete [] pBuffer;
pBuffer = NULL;
pBuffer = new BYTE[NewBufferSize];

ZeroMemory(pBuffer, NewBufferSize);
if(NewBufferSize >= BufferSize){
memcpy(pBuffer, pTmpBuffer, BufferSize);
} else {
memcpy(pBuffer, pTmpBuffer, NewBufferSize);
}

BitsCount  = NewBitsCount;
BufferSize = NewBufferSize;

delete [] pTmpBuffer;
}

void CBitArray::FreeMem()
{
if(IsAllocated()){
delete [] pBuffer;
pBuffer = NULL;
BufferSize = 0;
BitsCount  = 0;
}
}

bool CBitArray::operator[](UINT Indx)
{
if(!IsAllocated() || Indx >= BitsCount){return false;}
}

void CBitArray::ClearAll()
{
if(!IsAllocated()){return;}
ZeroMemory(pBuffer, BufferSize);
}

void CBitArray::SetAll()
{
if(!IsAllocated()){return;}
memset(pBuffer, 1, BufferSize);
}

bool CBitArray::GetBit(UINT Indx)
{
if(!IsAllocated() || Indx >= BitsCount){return false;}
}

void CBitArray::SetBit(UINT Indx, bool Value)
{
if(!IsAllocated() || Indx >= BitsCount){return;}
WriteBit(pBuffer, Indx, Value, BufferSize);
}



I don't use those classes anymore, since i now know how to work with the boolean operators, but still it could be usefull to learn from, or make complex operations much easier. (Never really used the bit array class, so it haven't changed since i wrote it, but i tested it a lot and it work)

You can even do it using assembly if you feel couragous.

Edited by Vortez, 19 September 2013 - 02:34 AM.

### #7Hodgman  Moderators

Posted 19 September 2013 - 02:30 AM

I imagine that the dir*i/100 bit will get a bit messy when working in small quantized units like this.

I'd approach converting the ray into a voxelized version of itself, e.g. in 2D, you've got an idealized line that you want to trace:

You'd end up with a pixel version of the line, like this, which can be described as a series of 'spans' - a horizontal start/end point per row.

So, for example, the span on row 4 starts at 4 and ends at 6, or the span on row 3 starts at 7 and ends at 11.

For each span, you can iterate through the bytes that the span covers, and AND the span against the actual voxel data to check for collisions.
(This code is untested, and assumes the line is travelling from left to right. Right to left is a simple variation)
int checkSpan( int first, int last, unsigned char* rowOfBits )
{
//for each byte covered by the span
int byteBegin = first >> 3;
int byteEnd   = (last+1) >> 3;
for( int byte = byteBegin; byte != byteEnd; ++byte )
{
//find the range of the span within this byte
int firstBit = max(0, first - 8*byte);
int lastBit = min(7, last - 8*byte);
int numBits = lastBit - firstBit + 1;

//Create a mask that has 1's in this span, and 0's outside of it.
//Use arithmetic shift to generate a mask numBits wide in the high end.
//Use logical shift move the generated mask so it begins at firstBit.
unsigned char mask = ((unsigned char) (((signed char)0x80) >> (numBits-1)) >> (8-numBits-firstBit));

//check the mask against the row data to see if we overlapped a solid voxel
unsigned char collision = mask & rowOfBits[byte];
if( collision )
{
//find the index of the first bit in the overlapping area
int firstBitHit = LeastSignificantBitSet( collision ); //bsf instruction
//return the index of that bit within the row
return firstBitHit + byte*8;
}
}
return -1;
}
You wouldn't do this with bytes though -- you'd use the native word size of the CPU. For a 64-bit CPU, you could be checking 64-voxel wide spans per iteration of that loop.
Because the CPU works with words instead of bits, this basically means that your inner loop is now automatically parallelized 64x! Instead of having to work through bit by bit, you can work in bulk in huge arrays of bits at once.

If you went all out on a modern 2013 CPU, you could use 512bit AVX registers to be checking 512-voxel wide spans at once, and parallelize the algorithm over 4 cores to be checking 4 of these spans at once, for a total of up to 2048 voxels per iteration of that for loop above.

n.b. however, that algorithm only performs at peak performance for near-horizontal lines. For vertical lines you have lots and lots of very, very thin spans, which produce small masks, which means you don't get the full parallelization benefit of the large word size.
To combat this, you could store two copies of your voxel data set -- one as above, and the second one rotated 90º so that the bytes store 8 values in the y-direction instead of in the x-direction.
For a 3D data set, you'd store a 3rd copy of the data, laid out so that the bytes stores 8 values in the z-direction.

You'd then first check which axis the ray is strongest in, and use either the x, y, or z optimized version of the voxel data.

### #8RobTheBloke  Members

Posted 19 September 2013 - 05:29 AM

I suggest reading about boolean bitwise operations, like "and", "or", "xor", "not",  "andnot", "blendv", "left shift", & "right shift", being the most important.

FTFY

If you went all out on a modern 2013 CPU, you could use 512bit AVX registers to be checking 512-voxel wide spans at once, and parallelize the algorithm over 4 cores to be checking 4 of these spans at once, for a total of up to 2048 voxels per iteration of that for loop above.

Knights ferry support can probably wait until later, I'd probably recommend the 256bit AVX registers first!

If you do go this route, consider using slightly odd bit orders for your voxels (rather than 0, 1, 2, 3). It should help wit the AVX side of things a little bit. I'd also probably recommend a minimum of Haswell support for this. Whilst you can do 'some' things using bitwise ops in AVX1, you'd really need the full compliment of integer ops from AVX2 to make it 'easy'.


DWORD0:   0 8 16 24 ....
DWORD1:   1 9 17 25 ....
DWORD2:   2 10 18 26 ....
....
DWORD7:   7 15 23 31 ....


Edited by RobTheBloke, 19 September 2013 - 05:35 AM.

### #9/ fir   Members

Posted 19 September 2013 - 05:40 AM

Computers work in a mixture of bits, bytes and different kinds of words and lines too, depending on the component
e.g. on a 32-bit CPU, you can't really just operate on 1 bytes, you're usually operating on a whole word (and using masking to only seem to work on one byte), meanwhile your memory controller might work on 512bit lines of data, fetched from 8bit-aligned addresses...

I never heard about it.. doe this mean that when you write a c code like this

char c = 10;

c+=10;

youre working on words?

### #10wintertime  Members

Posted 19 September 2013 - 06:32 AM

Thats all just semantics and depends on how you define working on words.

If the processor occupies a whole bus with a width of a dword or even more and also sends a few mask bits indicating only one byte should be changed, you could say its working on one byte, but also that its working on whole words.

Same if the processor reads in a whole cacheline, changes one byte and then writes the cacheline back, you could say it worked on changing that one byte or you could say it worked on the whole cacheline.

### #11Cosmic314  Members

Posted 19 September 2013 - 06:51 AM

access them by shifting

yeah I know but that shifting wastes some time

In c++ is there any way to read bit by bit without modifying a byte?

In ARM architectures a shift amount is available in nearly all instructions.  For example (see if you can determine what it does -- LSL stands for Logical Shift Left)

LSL    R1,  R0,  #0x1


Is equivalent to (with a side-effect on R1):

ADD    R0,  R0,  R0  LSL #0x1


ARM (>= v6 IIRC) provides unsigned and signed bitfield extract and insert instructions.  To extract the upper byte of a word:

UBFX    R0,  R0, #24, #8

;  Start at bit 24 of R0 and extract the next 8 bits and place into R0 (treat as unsigned)



I would highly recommend at least becoming acquainted with assembly language.  I'm not saying to go out and become an expert or anything, but questions like these will be cleared up very quickly if you have a moment to look at how your source is converted to native code.

Edited by Cosmic314, 19 September 2013 - 07:36 AM.

### #12Hodgman  Moderators

Posted 19 September 2013 - 07:21 AM

I never heard about it.. doe this mean that when you write a c code like this

char c = 10;
c+=10;

youre working on words?

Inside the CPU, you're always operating on 'registers'. Typically we define a "word" as being the size of a CPU register.

CPUs cannot work on RAM directly; they can only work on their registers. They have instructions that let them move data between RAM and their registers.

In between the CPU and RAM, there is a CPU-cache, which acts as a buffer for RAM. Each of these 3 components will likely operate on different sizes of data.

e.g.

*Between RAM and the cache, you might only be able to move 1024 bits at a time! Another cache might only be able to move 512 bits at a time. This is hidden from the programmer most of the time.

* Between the Cache and the CPU, you might only be able to move 32 bits at a time. Another CPU might allow you to move 8, 16, 32, or 64 bits to/from the cache. Again, it's up to the compiler to translate your high-level C++ code into specific instructions for each type of CPU.

On x86, the general purpose registers are 32-bits wide.

When I compile that code (without optimizations) and look at the assembly, the x86 asm is telling the CPU to:

* Write the byte "10" to RAM (to the local variable "c").

* Read that byte from RAM back into a register (download 8bits from the local variable "c" into a 32bit register).

* Add 10 to the register.

* Write the lower 8 bits of the register back to RAM (to the local variable "c").

So yes, whenever you're performing modifications on that char, it's actually being stored in a 32-bit ("word sized") register inside the CPU, and in 8 bits of RAM.

When those 8 bits are downloaded from RAM to the CPU register, the data is moved there through the cache. The cache probably isn't downloading just those 8 bits, it's probably downloading 512 bits of RAM to the cache (in case you need the nearby variables too!), and then passing on just those 8 out of the 512 to the CPU (and keeping the rest cached in case you need them soon).

A computer architecture course or text-book will teach these topic.

Edited by Hodgman, 19 September 2013 - 07:28 AM.

### #13Cosmic314  Members

Posted 19 September 2013 - 07:22 AM

Computers work in a mixture of bits, bytes and different kinds of words and lines too, depending on the component
e.g. on a 32-bit CPU, you can't really just operate on 1 bytes, you're usually operating on a whole word (and using masking to only seem to work on one byte), meanwhile your memory controller might work on 512bit lines of data, fetched from 8bit-aligned addresses...

I never heard about it.. doe this mean that when you write a c code like this

char c = 10;

c+=10;

youre working on words?

Perhaps too simple of an example.  When I compile that code the assembler doesn't even do an add operation.  It simply places a decimal 20 into the register representing c.  That is, the compiler performs that add for you.

It translates into:

MOV    r5,  #0x14   ; place decimal 20 (hex 0x14) into r5


I'll work up an example later that's a little more involved.

### #14LorenzoGatti  Members

Posted 19 September 2013 - 07:30 AM

The most basic reasons for processors to operate on large words is efficiency: all the control work and circuitry is amortized over alarger quantity of useful computational circuits. For example, an addition instruction uses the same instruction cache and does the same decoding work whether the adder it activates is one bit wide (possibly two gates only) or a sophisticated carry-lookahead marvel that takes the same time to add, say, 128-bit integers.

Omae Wa Mou Shindeiru

### #15osmanb  Members

Posted 19 September 2013 - 10:43 AM

n.b. however, that algorithm only performs at peak performance for near-horizontal lines. For vertical lines you have lots and lots of very, very thin spans, which produce small masks, which means you don't get the full parallelization benefit of the large word size.
To combat this, you could store two copies of your voxel data set -- one as above, and the second one rotated 90º so that the bytes store 8 values in the y-direction instead of in the x-direction.
For a 3D data set, you'd store a 3rd copy of the data, laid out so that the bytes stores 8 values in the z-direction.

Yes, to all of this. Alternatively, if you wanted to get more clever, you'd use a 3-D tiling pattern when you lay out your voxel data (Morton order, etc...). Then you have a few choices: Do fast coarse tests for contiguous 3D chunks, with the need to do more precise tests within chunks that indicate they have a collision, or apply the same tiling operation to your ray (chunked according to your tile size). In any case, this avoids needing to store multiple copies of the data.

### #16lomateron  Members

Posted 19 September 2013 - 01:09 PM

Is someone here inspired by this post and started a real time raycasting voxel image generator?

I am starting it but using GPU, I think it will be faster there.

Edited by lomateron, 19 September 2013 - 01:10 PM.

### #17/ fir   Members

Posted 19 September 2013 - 02:45 PM

I never heard about it.. doe this mean that when you write a c code like this

char c = 10;
c+=10;

youre working on words?

Inside the CPU, you're always operating on 'registers'. Typically we define a "word" as being the size of a CPU register.

CPUs cannot work on RAM directly; they can only work on their registers. They have instructions that let them move data between RAM and their registers.

In between the CPU and RAM, there is a CPU-cache, which acts as a buffer for RAM. Each of these 3 components will likely operate on different sizes of data.

e.g.

*Between RAM and the cache, you might only be able to move 1024 bits at a time! Another cache might only be able to move 512 bits at a time. This is hidden from the programmer most of the time.

* Between the Cache and the CPU, you might only be able to move 32 bits at a time. Another CPU might allow you to move 8, 16, 32, or 64 bits to/from the cache. Again, it's up to the compiler to translate your high-level C++ code into specific instructions for each type of CPU.

On x86, the general purpose registers are 32-bits wide.

When I compile that code (without optimizations) and look at the assembly, the x86 asm is telling the CPU to:

* Write the byte "10" to RAM (to the local variable "c").

* Read that byte from RAM back into a register (download 8bits from the local variable "c" into a 32bit register).

* Add 10 to the register.

* Write the lower 8 bits of the register back to RAM (to the local variable "c").

So yes, whenever you're performing modifications on that char, it's actually being stored in a 32-bit ("word sized") register inside the CPU, and in 8 bits of RAM.

When those 8 bits are downloaded from RAM to the CPU register, the data is moved there through the cache. The cache probably isn't downloading just those 8 bits, it's probably downloading 512 bits of RAM to the cache (in case you need the nearby variables too!), and then passing on just those 8 out of the 512 to the CPU (and keeping the rest cached in case you need them soon).

A computer architecture course or text-book will teach these topic.

If on register what is that register - whole rax or maybe al ?

I very seldom wrote asm routines and checking assembly

so I dont know, but it would be worth to check it (should take decent modern assembly course but almost all info on assembly I encounter is from the old age)

### #18pinebanana  Members

Posted 20 September 2013 - 04:21 AM

I suggest reading about boolean bitwise operations, like "and", "or", "xor", "not",  "andnot", "blendv", "left shift", & "right shift", being the most important.

FTFY

Why'd you cross "not" out? Is it not useful (no pun intended)?

anax - An open source C++ entity system

### #19/ fir   Members

Posted 20 September 2013 - 05:36 AM

I suggest reading about boolean bitwise operations, like "and", "or", "xor", "not",  "andnot", "blendv", "left shift", & "right shift", being the most important.

FTFY

Why'd you cross "not" out? Is it not useful (no pun intended)?

this pun is interesting, Not is not usefull. hehe

### #20Álvaro  Members

Posted 20 September 2013 - 06:58 AM

I suggest reading about boolean bitwise operations, like "and", "or", "xor", "not",  "andnot", "blendv", "left shift", & "right shift", being the most important.

FTFY

Why'd you cross "not" out? Is it not useful (no pun intended)?

This is the kind of sentence that gets Yoda in trouble: "Definitely useful not is." I use it all the time.

What are "andnot" and "blendv"?

Old topic!

Guest, the last post of this topic is over 60 days old and at this point you may not reply in this topic. If you wish to continue this conversation start a new topic.