Finding Machine Language Encodings

15 Feb 2017

In this post, I’ll show how to use the Microsoft Macro Assembler (MASM) and DUMPBIN to figure out the machine language encoding of a particular x86 assembly language instruction.

In the previous post, I showed how to build a tiny x86 assembler. I stated matter-of-factly that nop is encoded as 90h, inc eax is 40h, and so forth. It’s reasonable to ask

  • if you wanted to test the assembler, how could you verify that the encodings it produces are correct?
  • if you wanted to extend the assembler with more instructions, how could you find the encodings of those instructions?

To test the assembler, it would be helpful to have an oracle – if we knew “correct” encodings for specific instructions, we could compare those encodings to the encodings produced by our assembler.

If you wanted to extend the assembler, how could you find encodings of new instructions? The correct answer is, “Read the Intel® 64 and IA-32 Architectures Software Developer’s Manuals.” They are the definitive source for x86 instruction encodings. However, they’re not an easy read… so it’s helpful to have a way to generate encodings for sample instructions. (I’ll admit, when I didn’t understand parts of the manuals, I looked at a few sample instruction encodings, then went back to Intel’s documentation and used the samples to figure it out.)

So… how can we find encodings of x86 instructions? There are lots of ways to do this, but here’s one.

The Microsoft Macro Assembler (MASM) and DUMPBIN

The good news is, we’re not the first people to build an x86 assembler. So, an easy way to find the machine language encoding of an assembly language instruction is to assemble it using another assembler, then look at the encoding that assembler produces.

The Microsoft Macro Assembler (MASM) is an industry standard assembler, and it ships with Microsoft Visual Studio (since it’s the assembler underlying Microsoft Visual C++). We’ll use that.

We’ll also use another Visual Studio command line tool called DUMPBIN (which, I assume, stands for “dump binary,” although according to its startup banner, it’s the “Microsoft COFF/PE Dumper”). DUMPBIN is useful for many things, but one feature will be particularly useful here: it includes a disassembler.

To run MASM and DUMPBIN, you’ll need Visual Studio with Visual C++ installed. If you don’t have it yet, you can download Visual Studio Community Edition for free. After the installer starts, when prompted to choose the type of installation, select a Custom installation. You will then be prompted to select features; expand Programming Languages and select Visual C++. Finish the installation. When you want to run the commands shown below (ml and dumpbin), you will need to start a Visual Studio Command Prompt. In the Start menu, this will be an item labeled “VS2015 x86 Native Tools Command Prompt” or something similar.

Finding x86 instruction encodings using MASM and DUMPBIN

Suppose, for example, that we want to find the encoding of mov eax,12345678h. We can write a small procedure in assembly language with this as the first instruction:

.model flat
.code
example PROC
    mov eax, 12345678h
    nop
    nop
    nop
    ret
example ENDP
END

If you save the above as a file named insn.asm, you can assemble it using MASM:

ml /c insn.asm

This produces an object file, insn.obj. Now, you can disassemble this object file using DUMPBIN:

dumpbin /DISASM insn.obj

This produces the following output:

Microsoft (R) COFF/PE Dumper Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.


Dump of file insn.obj

File Type: COFF OBJECT

example:
  00000000: B8 78 56 34 12     mov         eax,12345678h
  00000005: 90                 nop
  00000006: 90                 nop
  00000007: 90                 nop
  00000008: C3                 ret

  Summary

           0 .data
          A0 .debug$S
           9 .text

What’s on each line? From right to left:

  • The last part of the line (mov eax,12345678h) is the assembly language representation of an instruction.
  • Immediately before that is the machine language encoding of that instruction, written as a sequence of hexadecimal numbers (B8 78 56 34 12). Each number corresponds to one byte of the encoding.
  • The first part of the line (00000000:) is a label. The label gives the offset of the first byte of that instruction, in hexadecimal. If the procedure includes jump instructions, these offsets are also used to specify the jump destination (e.g., an unconditional jump to the beginning of this procedure would be displayed as jmp 00000000).
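
As a rough illustration of that layout, here is a minimal Python sketch that splits one DUMPBIN disassembly line into the three parts described above. The function name parse_disasm_line and the regular expression are assumptions made for this sketch, based only on the sample output shown in this post; other DUMPBIN versions may format lines differently.

```python
import re

# Assumed layout, taken from the sample output above:
#   <8 hex digit offset>: <hex byte> <hex byte> ...   <disassembly text>
LINE_RE = re.compile(r"^\s*([0-9A-Fa-f]{8}):\s+((?:[0-9A-Fa-f]{2} )+)\s*(.*)$")

def parse_disasm_line(line):
    """Return (offset, encoding bytes, instruction text), or None if no match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    offset = int(match.group(1), 16)
    encoding = bytes.fromhex(match.group(2).strip())
    text = " ".join(match.group(3).split())  # collapse the column padding
    return offset, encoding, text

offset, encoding, text = parse_disasm_line(
    "  00000000: B8 78 56 34 12     mov         eax,12345678h")
print(offset, " ".join(f"{b:02X}" for b in encoding), text)

# The four bytes after the opcode B8 hold the immediate operand,
# least significant byte first (little-endian).
print(hex(int.from_bytes(encoding[1:], "little")))
```

Lines that do not start with an offset (the banner, "File Type", the summary) simply yield None, so the same function can be used to filter DUMPBIN's full output.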

Show the encoding of an instruction – showinsn.bat

If you want to use the above technique repeatedly, you can make it easier by creating a Windows batch file that combines the above steps into a single command. We’ll call it showinsn.bat. You can use it like this:

showinsn "mov eax, 12345678h"

It will display only the “interesting” line from DUMPBIN’s output:

  00000000: B8 78 56 34 12     mov         eax,12345678h

Here are the full contents of showinsn.bat. Note that I’ve added a few labels for the purpose of testing jump instructions (try showinsn "jmp back" or showinsn "je l10"). The DUP lines insert runs of nop bytes (90h) to pad out the distances between the labels:

@echo off

REM Displays the machine language encoding of an x86 instruction by assembling
REM a small program using the Microsoft Macro Assembler, then disassembling the
REM resulting object file using DUMPBIN.
REM
REM For jump instructions, there are seven labels to jump to:
REM
REM - "back" is the start of the jump.
REM   The encoded instruction will be a backward jump by the number of bytes
REM   in the jump instruction (probably 2).
REM
REM - "zero" is the instruction immediately after the jump.
REM   The encoded instruction will be a 0-byte jump.
REM
REM - "l5", "l10", "l100", "l200", and "l300" are 5, 10, 100, 200, and 300
REM   bytes after the jump, respectively.

if "%~1" == "" (
	echo Example Usage: showinsn "mov eax, ebx"
	exit /B 1
)

REM %~1 is the first command line argument with double-quotes removed.
REM This is important below since we do not want the quotes in the .asm file.

echo .model flat             > temp.asm
echo .code                  >> temp.asm
echo main PROC              >> temp.asm
echo back: %~1              >> temp.asm
echo zero: nop              >> temp.asm
echo       nop              >> temp.asm
echo       nop              >> temp.asm
echo       nop              >> temp.asm
echo       nop              >> temp.asm
echo   l5: nop              >> temp.asm
echo       nop              >> temp.asm
echo       nop              >> temp.asm
echo       nop              >> temp.asm
echo       nop              >> temp.asm
echo  l10: nop              >> temp.asm
echo       BYTE 89 DUP(90h) >> temp.asm
echo l100: nop              >> temp.asm
echo       BYTE 99 DUP(90h) >> temp.asm
echo l200: nop              >> temp.asm
echo       BYTE 99 DUP(90h) >> temp.asm
echo l300: nop              >> temp.asm
echo main ENDP              >> temp.asm
echo END                    >> temp.asm

ml /nologo /c temp.asm > NUL && ^
dumpbin /DISASM temp.obj | find "00000000:"

if errorlevel 1 (
	echo NO ENCODING                    %~1
	del temp.asm temp.obj
	exit /b 1
)

del temp.asm temp.obj

REM ---------------------------------------------------------------------------
REM Copyright (C) 2017 Jeffrey L. Overbey.  Use of this source code is governed
REM by a BSD-style license posted at http://blog.jeff.over.bz/license/
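
To make the jump labels more concrete, the following Python sketch predicts the bytes one would expect for an unconditional jmp to each label. It assumes 32-bit code and an assembler that picks the shortest form that fits (EB with an 8-bit displacement, otherwise E9 with a 32-bit displacement). The names predict_jmp, LABEL_DISTANCES, and JMP_FORMS are hypothetical; whatever showinsn actually prints is the authority.

```python
# Distances (in bytes) from the end of the jump instruction to each label,
# as laid out by showinsn.bat.  "back" is handled separately below.
LABEL_DISTANCES = {"zero": 0, "l5": 5, "l10": 10,
                   "l100": 100, "l200": 200, "l300": 300}

# (opcode, total instruction length, displacement width in bytes)
JMP_FORMS = [(0xEB, 2, 1),   # jmp rel8  (short)
             (0xE9, 5, 4)]   # jmp rel32 (near)

def predict_jmp(label):
    """Predict the bytes for `jmp <label>`, preferring the shortest form."""
    for opcode, length, width in JMP_FORMS:
        # Displacements are measured from the end of the jump instruction,
        # so jumping back to the jump itself means going back `length` bytes.
        disp = -length if label == "back" else LABEL_DISTANCES[label]
        limit = 1 << (8 * width - 1)
        if -limit <= disp < limit:
            return bytes([opcode]) + disp.to_bytes(width, "little", signed=True)
    raise ValueError("displacement does not fit in 32 bits")

for label in ["back", "zero", "l5", "l10", "l100", "l200", "l300"]:
    print(label, " ".join(f"{b:02X}" for b in predict_jmp(label)))
```

Under these assumptions, l200 and l300 are the labels that no longer fit in a signed 8-bit displacement, which is why they are useful for exercising the longer form.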

All your instructions are belong to me – showall.bat

Finally, we return to the original question… what if you wanted to test the x86 assembler we wrote in the previous post? Armed with our showinsn.bat batch file, it’s not difficult to generate encodings for every single instruction our assembler supports:

@echo off
REM Shows machine language encodings of all instructions supported by the
REM x86 subset assembler.  Requires showinsn.bat.

set immeds=0 1 0Fh 0F0h 12345678h 0FFFFFFFEh 0FFFFFFFFh
set regs32=eax ecx edx ebx esp ebp esi edi

call showinsn nop

for %%r in (%regs32%) do (
	for %%n in (%immeds%) do (
		call showinsn "mov %%r, %%n"
	)
)

for %%r in (%regs32%) do (
	for %%s in (%regs32%) do (
		call showinsn "mov %%r, DWORD PTR [%%s]"
	)
)

for %%r in (%regs32%) do (
	for %%s in (%regs32%) do (
		call showinsn "mov DWORD PTR [%%s], DWORD PTR %%r"
	)
)

for %%i in (mov add sub and or xor cmp) do (
	for %%r in (%regs32%) do (
		for %%s in (%regs32%) do (
			call showinsn "%%i %%r, %%s"
		)
	)
)

for %%i in (inc dec not neg mul imul div idiv) do (
	for %%r in (%regs32%) do (
		call showinsn "%%i %%r"
	)
)

call showinsn cdq

for %%i in (shl shr sar) do (
	for %%r in (%regs32%) do (
		call showinsn "%%i %%r, cl"
	)
)

for %%i in (shl shr sar) do (
	for %%r in (%regs32%) do (
		for %%n in (0 1 2 3 32 64 65 255) do (
			call showinsn "%%i %%r, %%n"
		)
	)
)

for %%i in (push pop call) do (
	for %%r in (%regs32%) do (
		call showinsn "%%i %%r"
	)
)

for %%n in (0 1 2 4 16 256) do (
	call showinsn "ret %%n"
)

for %%i in (jmp jb jae je jne jbe ja jl jge jle jg) do (
	for %%l in (back zero l5 l10 l100 l200 l300) do (
		call showinsn "%%i %%l"
	)
)

REM ---------------------------------------------------------------------------
REM Copyright (C) 2017 Jeffrey L. Overbey.  Use of this source code is governed
REM by a BSD-style license posted at http://blog.jeff.over.bz/license/

This takes several minutes to run (I never said this was an efficient way to find instruction encodings…) and produces 1021 lines of output.
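
The line count can be checked by tallying the showinsn calls made by each loop in showall.bat. This short Python sketch does the arithmetic, assuming every call prints exactly one line (either an encoding or NO ENCODING); the dictionary keys are just descriptive labels.

```python
REGS = 8     # eax ecx edx ebx esp ebp esi edi
IMMEDS = 7   # 0 1 0Fh 0F0h 12345678h 0FFFFFFFEh 0FFFFFFFFh

calls = {
    "nop": 1,
    "mov reg, imm": REGS * IMMEDS,
    "mov reg, [reg]": REGS * REGS,
    "mov [reg], reg": REGS * REGS,
    "mov/add/sub/and/or/xor/cmp reg, reg": 7 * REGS * REGS,
    "inc/dec/not/neg/mul/imul/div/idiv reg": 8 * REGS,
    "cdq": 1,
    "shl/shr/sar reg, cl": 3 * REGS,
    "shl/shr/sar reg, imm": 3 * REGS * 8,   # 8 shift counts
    "push/pop/call reg": 3 * REGS,
    "ret imm": 6,
    "jumps": 11 * 7,                        # 11 mnemonics, 7 labels
}

total = sum(calls.values())
print(total)  # 1021
```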

Exercise: Testing the assembler

At this point, we can produce a long list of machine language encodings… enough to test our x86 assembler from the previous post almost exhaustively. It’s an interesting exercise to try to build an automated test suite for the x86 assembler from this file. I’ll leave that to you. (I hacked something together with a shell script, posted below. In retrospect, I should have used Perl, but this got me by.)
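
For readers who want a starting point, here is a minimal Python sketch of the comparison idea, not the shell-script solution listed below. The oracle entries are encodings quoted in this post. toy_assemble is a hypothetical stand-in for the assembler under test (it handles only a few instruction forms), and check and REGS32 are likewise names invented for this sketch.

```python
REGS32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def toy_assemble(text):
    """Hypothetical stand-in for the assembler under test."""
    mnemonic, _, operands = text.partition(" ")
    if text == "nop":
        return b"\x90"
    if text == "ret":
        return b"\xC3"
    if mnemonic == "inc":
        return bytes([0x40 + REGS32.index(operands.strip())])
    if mnemonic == "mov":
        reg, imm = [part.strip() for part in operands.split(",")]
        # MASM-style hex literal (trailing h) or a plain decimal number
        value = int(imm[:-1], 16) if imm.endswith("h") else int(imm)
        return bytes([0xB8 + REGS32.index(reg)]) + value.to_bytes(4, "little")
    raise ValueError("unsupported instruction: " + text)

# (instruction text, expected bytes in hex), as reported by MASM and DUMPBIN
ORACLE = [
    ("mov eax,12345678h", "B8 78 56 34 12"),
    ("nop", "90"),
    ("ret", "C3"),
    ("inc eax", "40"),
]

def check(assemble, oracle):
    """Return a list of (text, expected, actual) for every mismatch."""
    failures = []
    for text, expected_hex in oracle:
        expected = bytes.fromhex(expected_hex)
        actual = assemble(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures

print(check(toy_assemble, ORACLE))
```

An empty list means the assembler agrees with the oracle on every entry. In a real test suite, the oracle table would be generated from the output of showall.bat rather than typed in by hand.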

Download the source code

Code from this post (finding machine language encodings):

Source Code:  showinsn.bat    64 lines
              showall.bat     76 lines
              Total:          140 lines
Output:       test-insns.txt

Solution to the exercise (testing the x86 assembler):

Source Code:  generate-test.sh    192 lines
Output:       test-x86asm.c
Makefiles:    GNUmakefile   (GNU Make on Linux/macOS)
              Makefile      (NMAKE on Windows)

Published on 15 Feb 2017