Disassembler (Part 3)

Editorial Note: Over the last two weeks, we’ve done the groundwork for building a disassembler: We’ve seen how to find documentation for a machine’s instruction format, how to read machine code by hand, and how to build up a machine-readable opcode map. Now it’s time to build the disassembler itself.

This article will walk through the process of building a disassembler for 8086 integer instructions in Python. It takes a “bottom-up” approach, beginning with low-level data structures, and ending with a method that disassembles an arbitrary sequence of bytes.

Overview

If you want to jump to the punchline, you can download the complete source code for the disassembler here, and the opcode table here. (Save the opcode table as a file named 8086_table.txt in the same directory as the source Python file, or the latter won’t work properly.) If you’re more interested in the scenic route, let’s begin by considering a typical line of assembly code output by a disassembler:

13CB:5A92 55           PUSH   BP

This line contains 4 parts, each of which we can consider in turn.

  • The address of the instruction
  • The machine code comprising the instruction
  • A mnemonic
  • Zero, one, or two arguments

Addresses

An address is made up of a segment:offset pair. Let’s make a pass at writing a Python class to support these addresses. We’ll include support for:

  • Initialization
  • String conversion
  • Address arithmetic
class Address(object):
    def __init__(self, segment, offset):
        self.segment = segment
        self.offset  = offset
    def __str__(self):
        return '%04X:%04X'%(self.segment, self.offset)

    def calc_relative_address(self, displacement):
        return Address(self.segment, (self.offset+displacement)&0xffff)

Machine Code

Machine code is just a series of bytes. A simple class to support it should include support for:

  • Initialization
  • String conversion
  • Length calculation
  • Opcode retrieval

In order to keep things nice and pretty, we should pad machine code out to the length of the longest legal instruction. If we display instruction prefixes and segment overrides as their own instructions, an 8086 instruction can consist of at most:

  • 1 opcode byte
  • 1 ModR/M byte
  • 2 displacement bytes
  • 2 immediate operand bytes

This comes to 6 instruction bytes, or 12 displayed characters of machine code. Here’s the class:

class MachineCode(object):
    def __init__(self, bytes):
        self.bytes = bytes
    def __str__(self):
        return '%-12s'%''.join('%02X'%byte for byte in self.bytes)
    def __len__(self):
        return len(self.bytes)

    def get_opcode(self):
        return self.bytes[0]

Mnemonics

The mnemonic is just a string. To keep things pretty (again) let’s pad it to the length of the longest legal mnemonic. We can find that length from the opcode map, with a little Python:

import re
max(len(a[1]) for a in map(re.compile(r'\s+').split, file('8086_table.txt', 'rb').readlines()) if len(a) > 1)

This returns a maximum length of 6 – for LOOPNZ, as it happens. With that, let’s define a little class, with support for:

  • Initialization
  • String conversion
class Mnemonic(object):
    def __init__(self, name):
        self.name = name
    def __str__(self):
        return '%-6s'%self.name

Arguments

So far, so simple. Now we come to arguments, where the waters get deeper. An 8086 assembly instruction can include one or more of the following types of arguments:

Addresses Memory addresses
Dereferences Contents of memory addresses (may require disambiguation, e.g. “WORD PTR”)
Registers 8 or 16 bit general-purpose or segment registers
Integers 8 or 16 bit immediate integer arguments; 8 bit arguments may be sign-extended
Constants Intrinsic integer arguments
Offsets Code addresses relative to the end of the current instruction

Several of these argument types share a common complication: their assembly code representation depends not only on the arguments themselves, but upon the overall instruction in which they are found. For instance, 8-bit Integer arguments may require a leading sign (“+” or “-“) to indicate sign extension, depending on the opcode of the instruction in which they are used. As another example, Offset arguments display a function address that depends on the location and size of their instruction, in addition to the offset argument.

In order to address this shared complication, all arguments will inherit from base Argument class that will track the parent instruction.

class Argument(object):
    _parent = None
    def set_parent(self, parent):
        self._parent = parent

Instructions

Now, since we’re about to begin defining argument classes which refer to an Instruction class, let’s take a minute to define a first pass at an Instruction class, with support for:

  • Initialization
  • String conversion
class Instruction(object):
    def __init__(self, addr, code, mnemonic, *args):
        self.addr         = addr
        self.code         = code
        self.mnemonic    = mnemonic
        self.args         = args

        for a in self.args: a.set_parent(self)
    def __str__(self):
        core = ' '.join(map(str, (self.addr, self.code, self.mnemonic)))
        if (self.args): return core + ' ' + ', '.join(map(str, self.args))
        else: return core

Arguments (Constants)

Now let’s begin defining argument classes proper. The simplest types of argument are the Constants; these can be represented with nothing more than a class which inherits from both Argument and int:

class Arg_Constant(Argument, int):
    pass

This definition is about as simple as it gets, but provides the core functionality we’ll define for all five argument types:

  • Initialization
  • String conversion

Arguments (Addresses)

After Constants, Addresses are the next simplest types of arguments. They can be represented as a class which inherits from Argument and Address. (They’ll initialize properly because Argument has no __init__ function, so Address.__init__ will be called instead.)

class Arg_Address(Argument, Address):
    pass

Arguments (Offsets)

Moving on to a meatier class of arguments, we come to the Offsets. These arguments calculate a new code address based on the address of their instruction, the length of their instruction, and a signed 8- or 16-bit offset.

class Arg_Offset(Argument):
    def __init__(self, offset, offset_size):
        self.offset = offset

        # offset will be read from the machine code as an unsigned integer
        if (self.offset >= pow(2, offset_size-1)):
            self.offset -= pow(2, offset_size)
    def __str__(self):
        addr = self._parent.addr
        code = self._parent.code
        return str(addr.calc_relative_address(len(code)+self.offset))[-4:]

Arguments (Registers)

Next are the Registers. Although simple on the surface, Registers have a non-trivial initialization scheme. Registers can be selected either by name, or by a 3-bit code and a type, e.g. “the 8-bit register associated with code 4” – which would be AH, in practice. (The code and the type may not be discovered at the same point in the parsing process.) Let’s begin by defining all the register names, using the Enum class from a few weeks ago:

# Wrap a 1:1 mapping of symbol names to literals
class Enum(object):
    def __init__(self, lut):
        self.lut    = lut
        self.rlut   = dict((v, k) for (k, v) in lut.items())
    def __getattr__(self, name):
        return self.lut.get(name)
    def __getitem__(self, value):
        return self.rlut.get(value)

# Define names (e.g. reg_set.AL), provide numeric->text LUT (e.g. reg_set[3])
reg_set = Enum({'AL':0x00,'CL':0x01,'DL':0x02,'BL':0x03,
                'AH':0x04,'CH':0x05,'DH':0x06,'BH':0x07,
                'AX':0x10,'CX':0x11,'DX':0x12,'BX':0x13,
                'SP':0x14,'BP':0x15,'SI':0x16,'DI':0x17,
                'ES':0x30,'CS':0x31,'SS':0x32,'DS':0x33,})

Now we can define the Register argument class:

class Arg_Register(Argument):
    type_lut = {'b':0x00, 'v':0x10, 'w':0x10, 'S':0x30}
    def __init__(self, name=None, code=None, type=None):
        if (type): type = self.type_lut[type]
        if (name):
            reg = reg_set.__getattr__(name)
            code = reg&0xf; type = reg&0xf0
        self.code = code
        self.type = type
    def __str__(self):
        return reg_set[self.code+self.type]

    def set_type(self, type):
        self.type = self.type_lut[type]

Arguments (Integers)

The penultimate arguments are the Integers. Integers are, like Registers, apparently simple, but complicated by the fact that 8-bit sign-extended, 8-bit non-sign-extended, and 16-bit Integers are all displayed differently in the assembly code. In particular, 8-bit sign-extended Integers are associated with one particular opcode (0x83) and must be handled specially:

class Arg_Integer(Argument):
    def __init__(self, value, value_size):
        self.value         = value
        self.value_size    = value_size
    def __str__(self):
        if (self._parent.code.get_opcode() == 0x83):
            # Special case: display as an 8-bit sign-extended value
            if (self.value >= 128): value = self.value - 256
            else: value = self.value
            return '%+03X'%value
        else:
            return '%0*X'%(self.value_size/4, self.value)

Arguments (Dereferences)

Dereferences are the last, and most complex, form of argument encountered in assembly code. Aside from the fact that a Dereference consists of between 1 and 3 sub-parts (base register, index register, and displacement), some Dereferences are qualified by type declarations: “BYTE PTR”, “WORD PTR”, or “FAR”. The rules governing when a type declaration is applied are a little arbitrary, and therefore awkward to support in code. To summarize them:

  • Dereference arguments with an operand type of ‘b’ (as specified in the opcode map) may take a “BYTE PTR” qualifier
  • Dereference arguments with an operand type of ‘w’ or ‘v’ (as specified in the opcode map) may take a “WORD PTR” qualifier
  • Dereference arguments with an operand type of ‘p’ (as specified in the opcode map) may take a “FAR” qualifier
  • Dereferences associated with opcodes 0x80, 0x81, 0x82, 0x83, 0xC6, 0xC7, 0xD0, 0xD1, 0xD2, 0xD3, 0xF6, 0xF7, 0xFE, or 0xFF are always qualified, except for:
  • Dereferences associated with a Mnemonic of PUSH, CALL or JMP, which aren’t qualified, except for:
  • FAR Dereferences associated with a Mnemonic of CALL or JMP, which are qualified

Here’s the code. It’s a little on the ugly side, due to all the different forms Dereferences can take:

class Arg_Dereference(Argument):
    q_opcodes        = set([    0x80, 0x81, 0x82, 0x83, 0xC6, 0xC7,
                                0xD0, 0xD1, 0xD2, 0xD3, 0xF6, 0xF7, 0xFE, 0xFF])
    nq_mnemonics    = set(['PUSH', 'CALL', 'JMP'])
    q_far_mnemonics = set(['CALL', 'JMP'])
    q_lut            = {'b':'BYTE PTR', 'w':'WORD PTR', 'v':'WORD PTR', 'p':'FAR'}

    def __init__(self, type=None, base=None, index=None, disp=None, disp_size=None):
        self.type         = type
        self.base         = base
        self.index        = index
        self.disp         = disp
        self.disp_size    = disp_size
    def __str__(self):
        # Build up the dereference, one piece at a time
        rv = []; n_terms = 0
        # Qualifier, if applicable
        if (self._parent.code.get_opcode() in self.q_opcodes):
            if ((not str(self._parent.mnemonic).strip() in self.nq_mnemonics) or
                ((str(self._parent.mnemonic).strip() in self.q_far_mnemonics) and (self.type == 'p'))):
                rv.append(self.q_lut[self.type] + ' ')
        # Bracket
        rv.append('[')
        # Optional base and index terms
        if (self.base != None):
            rv.append(reg_set[self.base]); n_terms += 1
        if (self.index != None):
            if (n_terms): rv.append('+')
            rv.append(reg_set[self.index]); n_terms += 1
        # Displacement
        if (self.disp != None):
            if (self.disp_size == 8):
                # 8-bit displacement; always preceeded by at least one term
                disp = self.disp
                if (disp >= 128): disp -= 256
                rv.append('%+03X'%disp); n_terms += 1
            else:
                # 16-bit displacement
                if (n_terms): rv.append('+')
                rv.append('%04X'%self.disp); n_terms += 1
        # Bracket
        rv.append(']')
        # Concatenate and return
        return ''.join(rv)

    def set_type(self, type):
        self.type    = type

Disassembler (Skeleton)

At this point, we have data structures into which raw machine code can be parsed, and from which we can generate disassembly. The next step is to actually do the parsing, which isn’t really as complicated as it might sound. Let’s begin with the Disassembler class definition and __init__ function, and add functions as we go.

class Disassembler(object):
    def __init__(self):
        self.index         = 0
        self.bytes         = ''
        self.first_addy    = Address(0, 0)
        self.modrm         = None

The __init__ function just sets up the state, more for illustration than anything else. A brief guide to the purpose of the members:

  • bytes stores the raw machine code as a Python string
  • index tracks the current read position in bytes
  • first_addy stores the Address that will be assigned to the first instruction read from bytes, and from which all subsequent Addresses will be computed
  • modrm tracks whether the modrm byte has been read from an instruction

Now, let’s define some low-level parsing functions. We’ll need to read:

  • Opcodes
  • ModR/M bytes
  • 1- and 2-byte integers

Disassembler (Integers)

Integers are simple enough; we just read LSB bytes from the current read position, throwing an exception if we run off the end of the array.

class Disassembler(object):
    # ... snip ...
    def read_integer(self, num_bytes):
        rv = sum(ord(self.bytes[self.index+i])*pow(2, i*8) for i in range(num_bytes))
        self.index += num_bytes; return rv

Disassembler (ModR/M)

ModR/M bytes are a little more difficult. A ModR/M byte decomposes into 3 bit fields (A 2-bit mod, a 3-bit reg, and a 3-bit r/m.) The mod and r/m fields in turn yield one of 32 addressing forms, some of which include an 8- or 16-bit displacement immediately following the ModR/M byte, and all of which can be read about, in gory detail, on page 2-4 of the Intel documentation.

The upshot of all this is that the process of parsing a ModR/M “byte” can involve reading between 1 and 3 bytes, can produce either a Register or an Address argument (in addition to the reg field, which is also of interest to some instructions), and can be triggered at as many as three points in the instruction parsing process (when parsing the opcode, and when parsing both arguments). So, the ModR/M parsing code is longer:

class Disassembler(object):
    # ... snip ...
    def read_modrm(self):
        if (self.modrm): return

        modrm = ord(self.bytes[self.index]); self.index += 1
        mod = modrm>>6; reg=(modrm>>3)&7; rm=modrm&7

        if (mod == 0):
            if (rm == 6):
                # Special case
                self.modrm = {'rm':Arg_Dereference(disp=self.read_integer(2), disp_size=16), 'reg':reg}
                return
            disp = None; disp_size = None
        elif (mod == 1):
            disp = self.read_integer(1); disp_size = 8
        elif (mod == 2):
            disp = self.read_integer(2); disp_size = 16
        elif (mod == 3):
            self.modrm = {'rm':Arg_Register(code=rm), 'reg':reg}
            return

        if (rm == 0):
            base = reg_set.BX; index = reg_set.SI
        elif (rm == 1):
            base = reg_set.BX; index = reg_set.DI
        elif (rm == 2):
            base = reg_set.BP; index = reg_set.SI
        elif (rm == 3):
            base = reg_set.BP; index = reg_set.DI
        elif (rm == 4):
            base = None; index = reg_set.SI
        elif (rm == 5):
            base = None; index = reg_set.DI
        elif (rm == 6):
            base = reg_set.BP; index = None
        elif (rm == 7):
            base = reg_set.BX; index = None

        self.modrm = {'rm':Arg_Dereference(base=base, index=index, disp=disp, disp_size=disp_size), 'reg':reg}

Disassembler (Opcodes)

Opcodes are easy enough to read – they’re just single bytes – but a little more complex to decompose into mnemonics and argument descriptors. (In fact, for group opcodes, such a decomposition isn’t even possible without parsing the ModR/M byte, as the reg field affects the meaning of the primary opcode.) The process of decomposition also brings us, at last, to the opcode map we derived last week.

We’ll need to load the opcode maps in order to parse the opcodes. I chose to import those maps as module-level objects. (Note that the following code assumes that you’ve imported the os and re modules, that the 8086_table.txt file contains the plain-text opcode map, and that the 8086_table.txt file is in the same directory as the module in which the following code lives:

# Define opcode maps
lines = file(os.path.join(os.path.split(__file__)[0], '8086_table.txt'), 'rb').readlines()
ops = filter(None, map(re.compile(r'([0-9A-F]{2})\s+(.+)').match, lines))
opcode_map = dict((int(m.group(1), 16), m.group(2).split()) for m in ops)
ops = filter(None, map(re.compile(r'(GRP\S+)/(\d)\s+(.+)').match, lines))
opcode_extension_map = dict(((m.group(1), int(m.group(2))), m.group(3).split()) for m in ops)
del lines; del ops

With the maps available, we can now parse the opcodes:

class Disassembler(object):
    # ... snip ...
    def read_opcode(self):
        opcode = ord(self.bytes[self.index]); self.index += 1
        mnemonic = opcode_map[opcode][0]; arguments = opcode_map[opcode][1:]

        # Handle group opcodes, and opcode extensions
        if (mnemonic[:3] == 'GRP'):
            self.read_modrm()
            extension = opcode_extension_map[(mnemonic, self.modrm['reg'])]
            if (extension[0] == '--'):
                # Special case - illegal extension, use the '???' psuedo-mnemonic
                mnemonic = '???'
            else:
                mnemonic = extension[0]
            # Override the primary opcode's argument descriptors iff the extension's set is non-empty
            if (extension[1:]): arguments = extension[1:]

        return mnemonic, arguments

Disassembler (Arguments)

We’re almost ready to parse complete instructions; the only missing piece is a function to convert the opcode map’s argument descriptors (‘Eb’, ‘Mp’, etc.) into the Argument sub-classes we defined earlier. The code is a little long, but not particularly tricky.

class Disassembler(object):
    # ... snip ...
    size_lut = {'b':8, 'v':16, 'w':16}
    def make_argument(self, desc):
        if (desc[0] == 'e'):
            # Register name; has a deeper meaning for post-8086 processors
            return Arg_Register(name=desc[1:])
        elif (reg_set.__getattr__(desc) != None):
            # Register name
            return Arg_Register(name=desc)
        elif (desc in '13'):
            # Constant
            return Arg_Constant(int(desc))
        elif (desc == 'M'):
            # psuedo-dereference
            self.read_modrm()
            rm = self.modrm['rm']; rm.set_type('v')
            return rm
        elif (desc[0] == 'A'):
            offset = self.read_integer(2)
            return Arg_Address(self.read_integer(2), offset)
        elif (desc[0] in 'EM'):
            self.read_modrm()
            rm = self.modrm['rm']; rm.set_type(desc[1])
            return rm
        elif (desc[0] == 'G'):
            self.read_modrm()
            return Arg_Register(code=self.modrm['reg'], type=desc[1])
        elif (desc[0] == 'I'):
            if (desc[1] == '0'):
                # Very special case
                value = self.read_integer(1)
                if (value == 0xa): return
                return Arg_Integer(value, 8)
            else:
                size = self.size_lut[desc[1]]
                return Arg_Integer(self.read_integer(size/8), size)
        elif (desc[0] == 'J'):
            size = self.size_lut[desc[1]]
            return Arg_Offset(self.read_integer(size/8), size)
        elif (desc[0] == 'O'):
            return Arg_Dereference(type=desc[1], disp=self.read_integer(2), disp_size=16)
        elif (desc[0] == 'S'):
            self.read_modrm()
            return Arg_Register(code=self.modrm['reg'], type='S')
        else:
            raise "Unknown argument description %s" % desc

Disassembler (Instructions)

Now we can parse instructions.

class Disassembler(object):
    # ... snip ...
    def read_instruction(self):
        self.modrm = None; start = self.index
        mnemonic, arguments = self.read_opcode()

        address = self.first_addy.calc_relative_address(start)
        if (mnemonic == '--'):
            # Illegal opcode, use the 'DB' pseudo-instruction
            mnemonic = Mnemonic('DB')
            arguments = [Arg_Integer(ord(self.bytes[start]), 8)]
        else:
            mnemonic = Mnemonic(mnemonic)
            arguments = filter(None, map(self.make_argument, arguments))

        code = MachineCode(map(ord, self.bytes[start:self.index]))
        return Instruction(address, code, mnemonic, *arguments)

We can also parse entire blocks of machine code:

class Disassembler(object):
    # ... snip ...
    def disassemble(self, bytes, segment=0, offset=0, trap=0, quiet=0):
        self.index         = 0
        self.bytes         = bytes
        self.first_addy    = Address(segment, offset)

        # If we're not trapping exceptions, just iterate over the instructions
        if (not trap):
            while (self.index < len(self.bytes)): yield(self.read_instruction())
            return

        # If we are trapping, optionally print an error message pointing to the problem
        while (self.index < len(self.bytes)):
            try:
                start = self.index
                yield(self.read_instruction())
            except:
                if (not quiet): print 'Fault in instruction at %d' % start
                break
Share and Enjoy:
  • Twitter
  • Facebook
  • Digg
  • Reddit
  • HackerNews
  • del.icio.us
  • Google Bookmarks
  • Slashdot
This entry was posted in Projects, Python, Reverse Engineering. Bookmark the permalink.

One Response to Disassembler (Part 3)

  1. Pingback: Things that were not immediately obvious to me » Blog Archive » Adoption