Assembler endianness and Linkers - How do they work in detail

Started by
5 comments, last by Zaoshi Kaba 5 years, 8 months ago

I have some doubts about hex formats (assembler output) and linkers:

1.- I disassembled a raw binary (no ELF, PE, etc. headers) of x64 assembly code and got this result:

 


0:  66 89 c8                mov    ax,cx
3:  e8 00 00 00 00          call   8 <gh>
0000000000000008 <gh>:
8:  66 89 c2                mov    dx,ax

I understand how byte offsets work ('66' is byte 0, '89' is byte 1, 'c8' is byte 2, and the call instruction starts at byte 3, which is why '3:' is there). But by that logic, shouldn't 'call gh' be encoded as 'e8 00 00 00 00 00 00 00 08' instead of 'e8 00 00 00 00', since the byte offset of the first instruction of gh ('mov dx,ax') is 8 and the output is 64-bit?

 

2.- Using the example above, if the endianness is little-endian, how would the assembler swap the bytes? Per instruction? Like:

 


Original, no endianness
{
	66 89 c8
	e8 00 00 00 00 (assuming this is correct and I'm wrong in question 1.-)
	66 89 c2
}

to

{
	c8 89 66
	00 00 00 00 e8
	c2 89 66
}

3.- And then big-endian would look like the original, "no endianness" code from question 2.-?

4.- Suppose I mark gh as .globl. Would the assembler then create a map table file recording that gh is at 'e8 00 00 00 00' (again, assuming this is correct and I'm wrong in question 1.-), and would the linker look into that map file so that, if another object file calls gh, it translates 'call gh' to 'e8 00 00 00 00'?

2 hours ago, Iris_Technologies said:

I understand how byte offsets work ('66' is byte 0, '89' is byte 1, 'c8' is byte 2, and the call instruction starts at byte 3, which is why '3:' is there). But by that logic, shouldn't 'call gh' be encoded as 'e8 00 00 00 00 00 00 00 08' instead of 'e8 00 00 00 00', since the byte offset of the first instruction of gh ('mov dx,ax') is 8 and the output is 64-bit?

e8 00 00 00 00 is a call with a relative offset of 0. The offset is relative to the next instruction, not to the start of the code. Thus a call with offset 0 simply calls the next instruction, which would have been executed anyway if the call did not exist.
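To make the "relative to the next instruction" rule concrete, here is a minimal Python sketch that works out where the call in the original listing lands. The helper name `call_target` is made up for illustration, and it only handles the 5-byte `E8 rel32` form discussed in this thread.

```python
import struct

def call_target(code: bytes, call_offset: int) -> int:
    """Return the byte offset that an E8 rel32 CALL at call_offset transfers to."""
    if code[call_offset] != 0xE8:
        raise ValueError("not a near relative CALL (opcode E8)")
    # The 4 bytes after the opcode are a signed 32-bit little-endian displacement.
    (rel32,) = struct.unpack_from("<i", code, call_offset + 1)
    next_instruction = call_offset + 5  # 1 opcode byte + 4 displacement bytes
    return next_instruction + rel32

# The 11 bytes from the disassembly in the question.
code = bytes.fromhex("66 89 c8 e8 00 00 00 00 66 89 c2")
print(hex(call_target(code, 3)))  # 0x8
```

The call starts at offset 3 and is 5 bytes long, so the next instruction is at offset 8, and a displacement of 0 lands exactly there, which is where gh begins.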

 

2 hours ago, Iris_Technologies said:

2.- Using the example above, if the endianness is little-endian, how would the assembler swap the bytes? Per instruction? Like:

 

2 hours ago, Iris_Technologies said:

3.- And then big-endian would look like the original, "no endianness" code from question 2.-?

I might be wrong, but I believe endianness does not affect the instructions themselves, only data. In your example only the call instruction has data. For example, e8 08 00 00 00 (little-endian, call with offset 8) would become e8 00 00 00 08 (big-endian).
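As a rough illustration of that byte swap, the following Python sketch packs the same 32-bit offset both ways with the standard `struct` module. The big-endian line is purely hypothetical for comparison, since x86/x64 stores the displacement little-endian.

```python
import struct

offset = 8
little = struct.pack("<i", offset)  # the byte order x86/x64 uses
big = struct.pack(">i", offset)     # hypothetical big-endian layout, for comparison

def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated two-digit hex."""
    return " ".join(f"{b:02x}" for b in data)

print("e8 " + hex_bytes(little))  # e8 08 00 00 00
print("e8 " + hex_bytes(big))     # e8 00 00 00 08
```

Only the four offset bytes change order; the opcode byte e8 stays first in both cases.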

First things first: endianness only affects multi-byte values. The x86/x64 instruction stream is primarily single bytes - generally only things like immediate operands and offsets are multi-byte values.

Now let's take a closer look at the instruction stream. The first instruction is


66 89 c8

The first byte, 66, is an operand-size override prefix. Here it indicates that the instruction it prefixes uses 16-bit operands instead of 32-bit.

The next byte, 89, is the instruction opcode. This one indicates a MOV instruction that moves data from one register to another register or to memory.

The final byte, C8 (11 001 000b), is the ModR/M byte, which encodes additional information about the operands. The top two bits (11b) combined with the bottom three bits (000b) indicate that the register/memory operand (which is the destination for this instruction) is the AX register, and bits 3-5 (001b) indicate that the register-only operand (the source here) is CX.
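The field split described above can be sketched in a few lines of Python. The names `decode_modrm` and `describe_mov_89` are made up for this example, and the sketch only covers the register-to-register case (mod = 11b) of opcode 89 with the 66 prefix, as used in this thread.

```python
# 16-bit register names indexed by their 3-bit encoding.
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]

def decode_modrm(byte: int):
    """Split a ModR/M byte into its (mod, reg, rm) fields."""
    mod = (byte >> 6) & 0b11   # bits 7-6
    reg = (byte >> 3) & 0b111  # bits 5-3
    rm = byte & 0b111          # bits 2-0
    return mod, reg, rm

def describe_mov_89(modrm: int) -> str:
    """Describe '66 89 /r' for the register-direct case only (mod == 11b)."""
    mod, reg, rm = decode_modrm(modrm)
    if mod != 0b11:
        raise NotImplementedError("memory operands are not handled in this sketch")
    # For opcode 89 the r/m field is the destination and reg is the source.
    return f"mov {REG16[rm]},{REG16[reg]}"

print(decode_modrm(0xC8))     # (3, 1, 0)
print(describe_mov_89(0xC8))  # mov ax,cx
print(describe_mov_89(0xC2))  # mov dx,ax
```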

Note that these bytes have to come in this specific order; this is required for the CPU's instruction decoder to actually be able to decode the instruction stream.

The next instruction is


e8 00 00 00 00

The first byte, E8, is the opcode. E8 is a near relative CALL. Its operand is a 32-bit signed offset relative to the first byte of the next instruction. Since the target IS the next instruction, it's just 00000000.

The final instruction is


66 89 c2

which is almost identical to the first one, the only difference being the ModR/M byte. C2 instead of C8 means that the destination is the DX register, and the source is AX.

The only multi-byte value here, and thus the only thing that endianness is relevant for, is the 32-bit offset for the call instruction, though in this case the offset is the same either way. The final instruction stream is


66 89 C8 E8 00 00 00 00 66 89 C2

If, for the sake of demonstration, the target of the CALL were 4000 (00000FA0h) bytes after the next instruction, the CALL would look like this instead:


E8 A0 0F 00 00
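Going the other way, here is a small Python sketch of how an assembler might produce these bytes from two byte offsets. `encode_call` is a hypothetical helper, not any real assembler's API, and it assumes the 5-byte `E8 rel32` form.

```python
import struct

def encode_call(call_offset: int, target_offset: int) -> bytes:
    """Encode a near relative CALL located at call_offset that targets target_offset."""
    next_instruction = call_offset + 5         # opcode byte + 4-byte displacement
    rel32 = target_offset - next_instruction   # distance from the next instruction
    return b"\xe8" + struct.pack("<i", rel32)  # signed 32-bit, little-endian

print(encode_call(3, 8).hex())         # e800000000
print(encode_call(3, 8 + 4000).hex())  # e8a00f0000
```

The first call reproduces the listing from the question; the second places the target 4000 bytes past the next instruction, giving the A0 0F 00 00 displacement shown above.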

 

2 hours ago, Zaoshi Kaba said:

e8 00 00 00 00 is a call with a relative offset of 0. The offset is relative to the next instruction, not to the start of the code. Thus a call with offset 0 simply calls the next instruction, which would have been executed anyway if the call did not exist.

Now another question comes to mind: how does the assembler handle negative offsets? Suppose a function starts at byte offset 3 but the call is at byte offset 8. What would the encoding look like?

 

EDIT

Nvm, I tried it for myself:

I had this kind of code


gh:
mov ax, dx

hh:
mov dx, cx
call gh

so the call would be at byte offset 6 and the gh function at 0.
The encoding comes out as 'e8 f5 ff ff ff'. It looks like negative offsets are stored as 2^32 (one more than the maximum 32-bit value, 'ff ff ff ff') minus the distance back from the next instruction after the call. The call starts at byte offset 6 and is 5 bytes long, so the next instruction is at byte offset b (11 in decimal). The target is at 0, so the distance back is b, and 0x100000000 - 0xb = 0xfffffff5, which is stored little-endian as 'f5 ff ff ff'.
 

3 hours ago, Iris_Technologies said:

It looks like negative offsets are stored as 2^32 (one more than the maximum 32-bit value, 'ff ff ff ff') minus the distance back from the next instruction after the call.

It's just a typical two's complement integer. It is the same way negative values are stored in most signed integer types.
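For reference, a short Python sketch of the two's complement conversion applied to the offset from the example above (target gh at byte 0, next instruction at byte 11). The helper names are made up for illustration.

```python
def to_twos_complement(value: int, bits: int = 32) -> int:
    """Return the unsigned bit pattern that represents value in two's complement."""
    return value & ((1 << bits) - 1)

def from_twos_complement(raw: int, bits: int = 32) -> int:
    """Interpret an unsigned bit pattern as a signed two's complement value."""
    sign_bit = 1 << (bits - 1)
    return raw - (1 << bits) if raw & sign_bit else raw

rel32 = 0 - 11  # target offset minus offset of the next instruction
raw = to_twos_complement(rel32)
print(hex(raw))                          # 0xfffffff5
print(raw.to_bytes(4, "little").hex())   # f5ffffff
print(from_twos_complement(0xFFFFFFF5))  # -11
```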

8 hours ago, Iris_Technologies said:

Now another question comes to mind: how does the assembler handle negative offsets?

It seems you have figured that out yourself. Just for future reference, x86 instruction encodings are not a military secret: https://c9x.me/x86/html/file_module_x86_id_26.html

It states that the relative offset is a signed integer, so with a 32-bit offset a call can reach approximately 2 billion bytes forward or backward.

This topic is closed to new replies.
