Editorial Note: Over the last two weeks, we’ve done the groundwork for building a disassembler: We’ve seen how to find documentation for a machine’s instruction format, how to read machine code by hand, and how to build up a machine-readable opcode map. Now it’s time to build the disassembler itself.
This article will walk through the process of building a disassembler for 8086 integer instructions in Python. It takes a “bottom-up” approach, beginning with low-level data structures, and ending with a method that disassembles an arbitrary sequence of bytes.
Overview
If you want to jump to the punchline, you can download the complete source code for the disassembler here, and the opcode table here. (Save the opcode table as a file named 8086_table.txt in the same directory as the source Python file, or the latter won’t work properly.) If you’re more interested in the scenic route, let’s begin by considering a typical line of assembly code output by a disassembler:
13CB:5A92 55 PUSH BP
This line contains 4 parts, each of which we can consider in turn.
- The address of the instruction
- The machine code comprising the instruction
- A mnemonic
- Zero, one, or two arguments
Addresses
An address is made up of a segment:offset pair. Let’s make a pass at writing a Python class to support these addresses. We’ll include support for:
- Initialization
- String conversion
- Address arithmetic
class Address(object): def __init__(self, segment, offset):
self.segment = segment
self.offset = offset
def __str__(self):
return '%04X:%04X'%(self.segment, self.offset)
def calc_relative_address(self, displacement):
return Address(self.segment, (self.offset+displacement)&0xffff)
Machine Code
Machine code is just a series of bytes. A simple class to support it should include support for:
- Initialization
- String conversion
- Length calculation
- Opcode retrieval
In order to keep things nice and pretty, we should pad machine code out to the length of the longest legal instruction. If we display instruction prefixes and segment overrides as their own instructions, an 8086 instruction can consist of at most:
- 1 opcode byte
- 1 ModR/M byte
- 2 displacement bytes
- 2 immediate operand bytes
This comes to 6 instruction bytes, or 12 displayed characters of machine code. Here’s the class:
class MachineCode(object):
def __init__(self, bytes):
self.bytes = bytes
def __str__(self):
return '%-12s'%''.join('%02X'%byte for byte in self.bytes)
def __len__(self):
return len(self.bytes)
def get_opcode(self):
return self.bytes[0]
Mnemonics
The mnemonic is just a string. To keep things pretty (again) let’s pad it to the length of the longest legal mnemonic. We can find that length from the opcode map, with a little Python:
import re
max(len(a[1]) for a in map(re.compile(r'\s+').split, file('8086_table.txt', 'rb').readlines()) if len(a) > 1)
This returns a maximum length of 6 – for LOOPNZ, as it happens. With that, let’s define a little class, with support for:
- Initialization
- String conversion
class Mnemonic(object):
def __init__(self, name):
self.name = name
def __str__(self):
return '%-6s'%self.name
Arguments
So far, so simple. Now we come to arguments, where the waters get deeper. An 8086 assembly instruction can include one or more of the following types of arguments:
Addresses | Memory addresses |
Dereferences | Contents of memory addresses (may require disambiguation, e.g. “WORD PTR”) |
Registers | 8 or 16 bit general-purpose or segment registers |
Integers | 8 or 16 bit immediate integer arguments; 8 bit arguments may be sign-extended |
Constants | Intrinsic integer arguments |
Offsets | Code addresses relative to the end of the current instruction |
Several of these argument types share a common complication: their assembly code representation depends not only on the arguments themselves, but upon the overall instruction in which they are found. For instance, 8-bit Integer arguments may require a leading sign (“+” or “-“) to indicate sign extension, depending on the opcode of the instruction in which they are used. As another example, Offset arguments display a function address that depends on the location and size of their instruction, in addition to the offset argument.
In order to address this shared complication, all arguments will inherit from base Argument class that will track the parent instruction.
class Argument(object):
_parent = None
def set_parent(self, parent):
self._parent = parent
Instructions
Now, since we’re about to begin defining argument classes which refer to an Instruction class, let’s take a minute to define a first pass at an Instruction class, with support for:
- Initialization
- String conversion
class Instruction(object):
def __init__(self, addr, code, mnemonic, *args):
self.addr
= addr
self.code
= code
self.mnemonic = mnemonic
self.args
= args
for a in self.args: a.set_parent(self)
def __str__(self):
core = ' '.join(map(str, (self.addr, self.code, self.mnemonic)))
if (self.args): return core + ' ' + ', '.join(map(str, self.args))
else: return core
Arguments (Constants)
Now let’s begin defining argument classes proper. The simplest types of argument are the Constants; these can be represented with nothing more than a class which inherits from both Argument and int:
class Arg_Constant(Argument, int):
pass
This definition is about as simple as it gets, but provides the core functionality we’ll define for all five argument types:
- Initialization
- String conversion
Arguments (Addresses)
After Constants, Addresses are the next simplest types of arguments. They can be represented as a class which inherits from Argument and Address. (They’ll initialize properly because Argument has no __init__ function, so Address.__init__ will be called instead.)
class Arg_Address(Argument, Address):
pass
Arguments (Offsets)
Moving on to a meatier class of arguments, we come to the Offsets. These arguments calculate a new code address based on the address of their instruction, the length of their instruction, and a signed 8- or 16-bit offset.
class Arg_Offset(Argument):
def __init__(self, offset, offset_size):
self.offset = offset
# offset will be read from the machine code as an unsigned integer
if (self.offset >= pow(2, offset_size-1)):
self.offset -= pow(2, offset_size)
def __str__(self):
addr = self._parent.addr
code = self._parent.code
return str(addr.calc_relative_address(len(code)+self.offset))[-4:]
Arguments (Registers)
Next are the Registers. Although simple on the surface, Registers have a non-trivial initialization scheme. Registers can be selected either by name, or by a 3-bit code and a type, e.g. “the 8-bit register associated with code 4” – which would be AH, in practice. (The code and the type may not be discovered at the same point in the parsing process.) Let’s begin by defining all the register names, using the Enum class from a few weeks ago:
# Wrap a 1:1 mapping of symbol names to literals class Enum(object):
def __init__(self, lut):
self.lut = lut
self.rlut = dict((v, k) for (k, v) in lut.items())
def __getattr__(self, name):
return self.lut.get(name)
def __getitem__(self, value):
return self.rlut.get(value) # Define names (e.g. reg_set.AL), provide numeric->text LUT (e.g. reg_set[3]) reg_set = Enum({'AL':0x00,'CL':0x01,'DL':0x02,'BL':0x03,
'AH':0x04,'CH':0x05,'DH':0x06,'BH':0x07,
'AX':0x10,'CX':0x11,'DX':0x12,'BX':0x13,
'SP':0x14,'BP':0x15,'SI':0x16,'DI':0x17,
'ES':0x30,'CS':0x31,'SS':0x32,'DS':0x33,})
Now we can define the Register argument class:
class Arg_Register(Argument):
type_lut = {'b':0x00, 'v':0x10, 'w':0x10, 'S':0x30}
def __init__(self, name=None, code=None, type=None):
if (type): type = self.type_lut[type]
if (name):
reg = reg_set.__getattr__(name)
code = reg&0xf; type = reg&0xf0
self.code = code
self.type = type
def __str__(self):
return reg_set[self.code+self.type]
def set_type(self, type):
self.type = self.type_lut[type]
Arguments (Integers)
The penultimate arguments are the Integers. Integers are, like Registers, apparently simple, but complicated by the fact that 8-bit sign-extended, 8-bit non-sign-extended, and 16-bit Integers are all displayed differently in the assembly code. In particular, 8-bit sign-extended Integers are associated with one particular opcode (0x83
) and must be handled specially:
class Arg_Integer(Argument):
def __init__(self, value, value_size):
self.value
= value
self.value_size = value_size
def __str__(self):
if (self._parent.code.get_opcode() == 0x83):
# Special case: display as an 8-bit sign-extended value
if (self.value >= 128): value = self.value - 256
else: value = self.value
return '%+03X'%value
else:
return '%0*X'%(self.value_size/4, self.value)
Arguments (Dereferences)
Dereferences are the last, and most complex, form of argument encountered in assembly code. Aside from the fact that a Dereference consists of between 1 and 3 sub-parts (base register, index register, and displacement), some Dereferences are qualified by type declarations: “BYTE PTR”, “WORD PTR”, or “FAR”. The rules governing when a type declaration is applied are a little arbitrary, and therefore awkward to support in code. To summarize them:
- Dereference arguments with an operand type of ‘b’ (as specified in the opcode map) may take a “BYTE PTR” qualifier
- Dereference arguments with an operand type of ‘w’ or ‘v’ (as specified in the opcode map) may take a “WORD PTR” qualifier
- Dereference arguments with an operand type of ‘p’ (as specified in the opcode map) may take a “FAR” qualifier
- Dereferences associated with opcodes
0x80, 0x81, 0x82, 0x83, 0xC6, 0xC7, 0xD0, 0xD1, 0xD2, 0xD3, 0xF6, 0xF7, 0xFE, or 0xFF
are always qualified, except for: - Dereferences associated with a Mnemonic of PUSH, CALL or JMP, which aren’t qualified, except for:
- FAR Dereferences associated with a Mnemonic of CALL or JMP, which are qualified
Here’s the code. It’s a little on the ugly side, due to all the different forms Dereferences can take:
class Arg_Dereference(Argument): q_opcodes = set([ 0x80, 0x81, 0x82, 0x83, 0xC6, 0xC7, 0xD0, 0xD1, 0xD2, 0xD3, 0xF6, 0xF7, 0xFE, 0xFF]) nq_mnemonics = set(['PUSH', 'CALL', 'JMP']) q_far_mnemonics = set(['CALL', 'JMP']) q_lut = {'b':'BYTE PTR', 'w':'WORD PTR', 'v':'WORD PTR', 'p':'FAR'} def __init__(self, type=None, base=None, index=None, disp=None, disp_size=None):
self.type = type self.base = base self.index = index self.disp = disp self.disp_size = disp_size def __str__(self): # Build up the dereference, one piece at a time rv = []; n_terms = 0 # Qualifier, if applicable if (self._parent.code.get_opcode() in self.q_opcodes): if ((not str(self._parent.mnemonic).strip() in self.nq_mnemonics) or ((str(self._parent.mnemonic).strip() in self.q_far_mnemonics) and (self.type == 'p'))): rv.append(self.q_lut[self.type] + ' ') # Bracket rv.append('[') # Optional base and index terms if (self.base != None): rv.append(reg_set[self.base]); n_terms += 1 if (self.index != None): if (n_terms): rv.append('+') rv.append(reg_set[self.index]); n_terms += 1 # Displacement if (self.disp != None): if (self.disp_size == 8): # 8-bit displacement; always preceeded by at least one term disp = self.disp if (disp >= 128): disp -= 256 rv.append('%+03X'%disp); n_terms += 1 else: # 16-bit displacement if (n_terms): rv.append('+') rv.append('%04X'%self.disp); n_terms += 1 # Bracket rv.append(']') # Concatenate and return return ''.join(rv) def set_type(self, type): self.type = type
Disassembler (Skeleton)
At this point, we have data structures into which raw machine code can be parsed, and from which we can generate disassembly. The next step is to actually do the parsing, which isn’t really as complicated as it might sound. Let’s begin with the Disassembler class definition and __init__ function, and add functions as we go.
class Disassembler(object):
def __init__(self):
self.index = 0
self.bytes = ''
self.first_addy = Address(0, 0)
self.modrm = None
The __init__ function just sets up the state, more for illustration than anything else. A brief guide to the purpose of the members:
- bytes stores the raw machine code as a Python string
- index tracks the current read position in bytes
- first_addy stores the Address that will be assigned to the first instruction read from bytes, and from which all subsequent Addresses will be computed
- modrm tracks whether the modrm byte has been read from an instruction
Now, let’s define some low-level parsing functions. We’ll need to read:
- Opcodes
- ModR/M bytes
- 1- and 2-byte integers
Disassembler (Integers)
Integers are simple enough; we just read LSB bytes from the current read position, throwing an exception if we run off the end of the array.
class Disassembler(object):
# ... snip ...
def read_integer(self, num_bytes):
rv = sum(ord(self.bytes[self.index+i])*pow(2, i*8) for i in range(num_bytes))
self.index += num_bytes; return rv
Disassembler (ModR/M)
ModR/M bytes are a little more difficult. A ModR/M byte decomposes into 3 bit fields (A 2-bit mod, a 3-bit reg, and a 3-bit r/m.) The mod and r/m fields in turn yield one of 32 addressing forms, some of which include an 8- or 16-bit displacement immediately following the ModR/M byte, and all of which can be read about, in gory detail, on page 2-4 of the Intel documentation.
The upshot of all this is that the process of parsing a ModR/M “byte” can involve reading between 1 and 3 bytes, can produce either a Register or an Address argument (in addition to the reg field, which is also of interest to some instructions), and can be triggered at as many as three points in the instruction parsing process (when parsing the opcode, and when parsing both arguments). So, the ModR/M parsing code is longer:
class Disassembler(object):
# ... snip ... def read_modrm(self): if (self.modrm): return modrm = ord(self.bytes[self.index]); self.index += 1 mod = modrm>>6; reg=(modrm>>3)&7; rm=modrm&7 if (mod == 0): if (rm == 6): # Special case self.modrm = {'rm':Arg_Dereference(disp=self.read_integer(2), disp_size=16), 'reg':reg} return disp = None; disp_size = None elif (mod == 1): disp = self.read_integer(1); disp_size = 8 elif (mod == 2): disp = self.read_integer(2); disp_size = 16 elif (mod == 3): self.modrm = {'rm':Arg_Register(code=rm), 'reg':reg} return if (rm == 0): base = reg_set.BX; index = reg_set.SI elif (rm == 1): base = reg_set.BX; index = reg_set.DI elif (rm == 2): base = reg_set.BP; index = reg_set.SI elif (rm == 3): base = reg_set.BP; index = reg_set.DI elif (rm == 4): base = None; index = reg_set.SI elif (rm == 5): base = None; index = reg_set.DI elif (rm == 6): base = reg_set.BP; index = None elif (rm == 7): base = reg_set.BX; index = None self.modrm = {'rm':Arg_Dereference(base=base, index=index, disp=disp, disp_size=disp_size), 'reg':reg}
Disassembler (Opcodes)
Opcodes are easy enough to read – they’re just single bytes – but a little more complex to decompose into mnemonics and argument descriptors. (In fact, for group opcodes, such a decomposition isn’t even possible without parsing the ModR/M byte, as the reg field affects the meaning of the primary opcode.) The process of decomposition also brings us, at last, to the opcode map we derived last week.
We’ll need to load the opcode maps in order to parse the opcodes. I chose to import those maps as module-level objects. (Note that the following code assumes that you’ve imported the os and re modules, that the 8086_table.txt file contains the plain-text opcode map, and that the 8086_table.txt file is in the same directory as the module in which the following code lives:
# Define opcode maps
lines = file(os.path.join(os.path.split(__file__)[0], '8086_table.txt'), 'rb').readlines()
ops = filter(None, map(re.compile(r'([0-9A-F]{2})\s+(.+)').match, lines))
opcode_map = dict((int(m.group(1), 16), m.group(2).split()) for m in ops)
ops = filter(None, map(re.compile(r'(GRP\S+)/(\d)\s+(.+)').match, lines))
opcode_extension_map = dict(((m.group(1), int(m.group(2))), m.group(3).split()) for m in ops)
del lines; del ops
With the maps available, we can now parse the opcodes:
class Disassembler(object): # ... snip ...
def read_opcode(self):
opcode = ord(self.bytes[self.index]); self.index += 1
mnemonic = opcode_map[opcode][0]; arguments = opcode_map[opcode][1:]
# Handle group opcodes, and opcode extensions
if (mnemonic[:3] == 'GRP'):
self.read_modrm()
extension = opcode_extension_map[(mnemonic, self.modrm['reg'])]
if (extension[0] == '--'):
# Special case - illegal extension, use the '???' psuedo-mnemonic
mnemonic = '???'
else:
mnemonic = extension[0]
# Override the primary opcode's argument descriptors iff the extension's set is non-empty
if (extension[1:]): arguments = extension[1:]
return mnemonic, arguments
Disassembler (Arguments)
We’re almost ready to parse complete instructions; the only missing piece is a function to convert the opcode map’s argument descriptors (‘Eb’, ‘Mp’, etc.) into the Argument sub-classes we defined earlier. The code is a little long, but not particularly tricky.
class Disassembler(object):
# ... snip ...
size_lut = {'b':8, 'v':16, 'w':16}
def make_argument(self, desc):
if (desc[0] == 'e'):
# Register name; has a deeper meaning for post-8086 processors
return Arg_Register(name=desc[1:])
elif (reg_set.__getattr__(desc) != None):
# Register name
return Arg_Register(name=desc)
elif (desc in '13'):
# Constant
return Arg_Constant(int(desc))
elif (desc == 'M'):
# psuedo-dereference
self.read_modrm()
rm = self.modrm['rm']; rm.set_type('v')
return rm
elif (desc[0] == 'A'):
offset = self.read_integer(2)
return Arg_Address(self.read_integer(2), offset)
elif (desc[0] in 'EM'):
self.read_modrm()
rm = self.modrm['rm']; rm.set_type(desc[1])
return rm
elif (desc[0] == 'G'):
self.read_modrm()
return Arg_Register(code=self.modrm['reg'], type=desc[1])
elif (desc[0] == 'I'):
if (desc[1] == '0'):
# Very special case
value = self.read_integer(1)
if (value == 0xa): return
return Arg_Integer(value, 8)
else:
size = self.size_lut[desc[1]]
return Arg_Integer(self.read_integer(size/8), size)
elif (desc[0] == 'J'):
size = self.size_lut[desc[1]]
return Arg_Offset(self.read_integer(size/8), size)
elif (desc[0] == 'O'):
return Arg_Dereference(type=desc[1], disp=self.read_integer(2), disp_size=16)
elif (desc[0] == 'S'):
self.read_modrm()
return Arg_Register(code=self.modrm['reg'], type='S')
else:
raise "Unknown argument description %s" % desc
Disassembler (Instructions)
Now we can parse instructions.
class Disassembler(object):
# ... snip ...
def read_instruction(self):
self.modrm = None; start = self.index
mnemonic, arguments = self.read_opcode()
address = self.first_addy.calc_relative_address(start)
if (mnemonic == '--'):
# Illegal opcode, use the 'DB' pseudo-instruction
mnemonic = Mnemonic('DB')
arguments = [Arg_Integer(ord(self.bytes[start]), 8)]
else:
mnemonic = Mnemonic(mnemonic)
arguments = filter(None, map(self.make_argument, arguments))
code = MachineCode(map(ord, self.bytes[start:self.index]))
return Instruction(address, code, mnemonic, *arguments)
We can also parse entire blocks of machine code:
class Disassembler(object):
# ... snip ...
def disassemble(self, bytes, segment=0, offset=0, trap=0, quiet=0):
self.index
= 0
self.bytes
= bytes
self.first_addy = Address(segment, offset)
# If we're not trapping exceptions, just iterate over the instructions
if (not trap):
while (self.index < len(self.bytes)): yield(self.read_instruction())
return
# If we are trapping, optionally print an error message pointing to the problem
while (self.index < len(self.bytes)):
try:
start = self.index
yield(self.read_instruction())
except:
if (not quiet): print 'Fault in instruction at %d' % start
break
Pingback: Things that were not immediately obvious to me » Blog Archive » Adoption