This week, I post some remarks following up on the just-concluded disassembler tutorial; there are always a few loose ends to tie up, and I wanted to clarify or expand on:
- The opcode map
- Addressing
- Demo code
- Hardening
Opcode Map Updates
In the course of writing last week’s disassembler, I made some changes to the opcode map I’d previously constructed in order to bring it into line with DOS DEBUG, which was, for good or ill, my reference point for the project. I also updated the presentation of the HTML version of the map, which I discovered looked bloody awful on Macs; small Courier fonts seem to render particularly badly on Mac OS X.
There are a large number of mnemonics which are simply aliases for one another – in particular, mnemonics for the conditional jump (Jcc) instructions. Which mnemonic is used is largely a matter of personal preference, and the preferences of DOS DEBUG (and, therefore, of my disassembler) turned out to differ somewhat from those of the reference documents I used to construct the first version of the map.
None of these differences is earth-shattering; I just wanted to note them:
Opcode | Old Mnemonic | New Mnemonic |
77 | JNBE | JA |
7A | JP | JPE |
7B | JNP | JPO |
7D | JNL | JGE |
7F | JNLE | JG |
E0 | LOOPNE | LOOPNZ |
E1 | LOOPE | LOOPZ |
F2 | REPNE | REPNZ |
F3 | REPE | REPZ |
As a (hopefully) final note on the 8086 map, I should add that it contains two minor errors, deliberately added in order to mimic the behaviour of DOS DEBUG. You can read about these exciting errata on the map’s homepage.
Addresses Matter
I want to say a few words about the importance of addressing for real-mode code. In a segmented memory model, there are many ways to refer to the same physical memory address. For instance, 3333:0008 and 3303:0308 refer to the same physical address (0x33338). In most circumstances these different forms of address are equivalent, but they are *not* interchangeable when addressing an instruction with an Offset argument.
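As a quick check of that arithmetic, here is a minimal sketch of the real-mode address calculation: the segment is multiplied by 16 and the offset is added. The helper name `physical` is made up for illustration and is not part of the disassembler.

```python
def physical(segment, offset):
    """Return the 20-bit physical address for a real-mode segment:offset pair."""
    # The segment is shifted left four bits (multiplied by 16) and the offset
    # is added. The 8086 has only 20 address lines, so the result wraps at 1 MB.
    return ((segment << 4) + offset) & 0xFFFFF

print(hex(physical(0x3333, 0x0008)))  # 0x33338
print(hex(physical(0x3303, 0x0308)))  # 0x33338
```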
Consider the following instruction:
137B:0374 E918D0 JMP D38F
This is a relative jump instruction, with a 16-bit offset. It is located at physical address 0x13B24, which can be addressed 4096 different ways (13B2:0004, 13B1:0014, 13B0:0024, …, 03B3:FFF4). The applied relative offset is 0xD01B (a specified offset of 0xD018, plus the 3-byte length of the instruction itself). What physical address does this instruction jump to? The answer is: “It depends on how the instruction is addressed.”
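The count of 4096 aliases quoted above can be checked by brute force. The sketch below is only illustrative: the function name `aliases` is hypothetical, and it ignores addresses that are only reachable by wrapping past the 1 MB boundary.

```python
def aliases(phys):
    """Yield every (segment, offset) pair that maps to the physical address phys."""
    for offset in range(0x10000):
        base = phys - offset
        # The segment base must be a non-negative multiple of 16 whose
        # segment value (base / 16) fits in 16 bits.
        if base >= 0 and base % 16 == 0 and (base >> 4) <= 0xFFFF:
            yield (base >> 4, offset)

pairs = list(aliases(0x13B24))
print(len(pairs))               # 4096
print('%04X:%04X' % pairs[0])   # 13B2:0004
print('%04X:%04X' % pairs[-1])  # 03B3:FFF4
```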
If the applied offset can be added to the offset used to address the jump instruction without producing a result in excess of 0xFFFF, the instruction will jump to a physical address equal to the sum of the instruction’s physical address and the applied offset. If the sum of the applied offset and the offset used to address the jump instruction is greater than 0xFFFF, then the instruction will jump to a physical address equal to the sum of the instruction’s physical address and the applied offset, less 0x10000. (Naturally, the resulting segmented addresses will differ as well, but here I’m interested in demonstrating that there’s a difference in function, not just form.)
As an example, let’s re-consider the earlier instruction, addressed from a different segment:
107B:3374 E918D0 JMP 038F
This instruction is still located at the same physical address (0x107B*16 + 0x3374 = 0x13B24) and still has the same applied offset of 0xD01B, but now it jumps to 107B:038F instead of the earlier destination of 137B:D38F. These segmented addresses are *not* equivalent: The first maps to a physical address of 0x10B3F, and the second to 0x20B3F.
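Both cases can be reproduced with a few lines of Python. This is a sketch of the arithmetic only, not code from the disassembler; the name `near_jump_target` and its parameters are assumptions made for the example.

```python
def near_jump_target(segment, offset, displacement, length=3):
    """Return (segment, target_offset, physical) for a near relative JMP/CALL."""
    # The displacement is added to the offset of the *next* instruction and
    # the result is truncated to 16 bits; the segment is left unchanged.
    target = (offset + length + displacement) & 0xFFFF
    return segment, target, ((segment << 4) + target) & 0xFFFFF

# The same three bytes (E9 18 D0) at the same physical address, addressed two ways.
for seg, off in ((0x137B, 0x0374), (0x107B, 0x3374)):
    s, t, p = near_jump_target(seg, off, 0xD018)
    print('%04X:%04X -> %04X:%04X (physical 0x%05X)' % (seg, off, s, t, p))
```

The first line printed shows a destination of 137B:D38F (physical 0x20B3F); the second shows 107B:038F (physical 0x10B3F), because the 16-bit offset wrapped.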
The point of all this is that, while it normally doesn’t matter too much how you address code when disassembling (or executing!) it, the offsets you use to address instructions containing Offset arguments (e.g. certain JMPs and CALLs) turn out to matter quite a bit. Almost every such instruction has two legitimate physical interpretations (an exception to this rule might be instructions at very low or very high physical memory addresses) and which interpretation will prevail depends entirely upon how the instruction is addressed.
Happily, this sort of thing is an artifact of segmented memory models, which have largely become a thing of the past. If you like to poke around real-mode code, however, you ought to be aware of it.
Demo
Here’s an example of the disassembler in action, to make up for last week’s rather disappointing denouement. This block of code assumes that the disassembler code from last week is stored in the file dasm3.py, that the MZ Header support code from week 1 of the tutorial is stored in the file mz.py, and that both files are available on the Python path:
import mz
import dasm3

# Writes the instructions retrieved from the first 64K (if available) of data following
# the executable header. The segment may be specified, but the initial offset is presumed
# to be zero.
def dump_first_code_segment(dst_pn, src_pn, segment=0):
    hdr = mz.MZ_Header(src_pn)
    src_fp = file(src_pn, 'rb')
    src_fp.seek(hdr.calc_code_start())
    d = dasm3.Disassembler()
    dst_fp = file(dst_pn, 'wb')
    seg_len = min(1<<16, hdr.calc_length()-hdr.calc_code_start())
    block = 0
    for i in d.disassemble(src_fp.read(seg_len), segment=segment, trap=1, quiet=1):
        dst_fp.write(str(i)+'\n')
        # Report progress each time the offset crosses a 4K boundary
        if ((i.addr.offset >> 12) != block):
            block = i.addr.offset >> 12
            print 'Processed address ' + str(i.addr)
This code comprises a very crude disassembler for DOS executables. It assumes that the portion of the executable immediately following the MZ header data will be loaded at CS:0000 and that it consists of machine code (as opposed to program data). Both guesses are actually pretty good.
Hardening
While running some tests against random bytestreams, I encountered a number of crashes in the disassembler related to invalid ModR/M bytes. I saw crashes caused by two types of conditions:
- ModR/M bytes that specified a register (mod = 3) when a memory operand was required
- ModR/M bytes with a reg greater than 3 when a segment register was being specified
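To make the two conditions concrete, here is a small sketch that splits a ModR/M byte into its fields and flags both cases. It is not how dasm3.py detects them; the function names and the `needs_memory` and `segment_reg` flags are hypothetical.

```python
def split_modrm(byte):
    """Split a ModR/M byte into its (mod, reg, rm) fields."""
    # Layout: bits 7-6 are mod, bits 5-3 are reg, bits 2-0 are r/m.
    return (byte >> 6) & 0x3, (byte >> 3) & 0x7, byte & 0x7

def modrm_problems(byte, needs_memory=False, segment_reg=False):
    """Return a list describing which of the two conditions a byte triggers."""
    mod, reg, rm = split_modrm(byte)
    problems = []
    if needs_memory and mod == 3:
        problems.append('register form where a memory operand is required')
    if segment_reg and reg > 3:
        problems.append('segment register code %d out of range' % reg)
    return problems

print(modrm_problems(0xC0, needs_memory=True))  # mod = 3
print(modrm_problems(0x20, segment_reg=True))   # reg = 4
```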
In response, I added some special-case code to the Arg_Register class, which now looks like this:
class Arg_Register(Argument):
    # The 'p' type is illegal, but will be quietly treated as a WORD register
    type_lut = {'b':0x00, 'v':0x10, 'w':0x10, 'S':0x30}
    def __init__(self, name=None, code=None, type=None):
        if (type): type = self.type_lut.get(type, 0x10)
        if (name):
            reg = reg_set.__getattr__(name)
            code = reg&0xf; type = reg&0xf0
        self.code = code
        self.type = type
    def __str__(self):
        if (self.type == 0x30):
            # Hack to handle illegal segment register codes
            return reg_set[(self.code&0x3)+self.type]
        else:
            return reg_set[self.code+self.type]
    def set_type(self, type):
        self.type = self.type_lut.get(type, 0x10)
If you want to use this code, you’ll have to patch it in manually, as I didn’t retrofit it into last week’s post. Happy disassembly!