Disassembler (Part 2)

Editorial Note: This article is the second in a three-part series on writing an 8086 disassembler. Today we’ll cover the practical issues involved in finding an opcode map; we saw last week that such a map is central to the process of disassembly. Next week, we’ll use this map (and Python!) to build a disassembler for 8086 integer instructions.

At first, it seems pretty easy to find an opcode map for an 8086 processor: just consult Intel’s documentation. Unfortunately, there are two problems with this approach. First, and most importantly, the quality of the published maps is somewhat poor. The second problem is that the 8086 is a very old (c. 1978) chip, and documentation dedicated to it (as opposed to later members of its family) is not easy to come by. Both problems can be overcome by consulting multiple resources.

Errata

There is no way to sugar-coat a somewhat surprising fact: many official Intel opcode maps are wrong. The maps in both my (treasured) 1995 Pentium Processor Family Developer’s Manual Volume 3: Architecture and Programming Manual (ISBN: 1-55512-247-7) and the 1997 Intel Architecture Software Developer’s Manual Volume 2: Instruction Set Reference (ISR) contain significant errors. The 2008 Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2B: Instruction Set Reference, N-Z appears much improved, although still not perfect.

To illustrate my remarks about the quality of the available opcode maps, here are the errors (relevant to 8086 instructions) that I found in the 1997 ISR:

Opcode  Erratum
33      Arguments should be Gv, Ev (not Gb, Ev)
82      Arguments should be Eb, Ib (not Ev, Ib)
83      Arguments should be Ev, Ib (not Eb, Ib)
84      Arguments should be Gb, Eb (not Eb, Gb; the operation of TEST makes this difference largely irrelevant)
85      Arguments should be Gv, Ev (not Ev, Gv; the operation of TEST makes this difference largely irrelevant)
86      Arguments should be Gb, Eb (not Eb, Gb; the operation of XCHG makes this difference largely irrelevant)
87      Arguments should be Gv, Ev (not Ev, Gv; the operation of XCHG makes this difference largely irrelevant)
A4      Arguments should be Yb, Xb (not Xb, Yb; neither argument appears explicitly in assembly code)
A5      Arguments should be Yv, Xv (not Xv, Yv; neither argument appears explicitly in assembly code)
C6      The reg field of the following ModR/M byte must be zero; the map omits this detail
C7      The reg field of the following ModR/M byte must be zero; the map omits this detail
E0      Mnemonic should be LOOPNE (or LOOPNZ)
39      Misprint – mnemonic should be CMP
3A      Misprint – mnemonic should be CMP
3B      Misprint – mnemonic should be CMP
3C      Misprint – mnemonic should be CMP
3D      Misprint – mnemonic should be CMP
3E      Misprint – map should indicate that 3E is the DS segment override prefix
3F      Misprint – mnemonic should be AAS
8F      The reg field of the following ModR/M byte must be zero; the map omits this detail
9A      Argument should be Ap (not aP)
9D      Mnemonic should be POPF
FF/3    Argument should be Mp (not Ep)
FF/5    Argument should be Mp (not Ep)
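
Several of the entries above hinge on the reg field of the ModR/M byte, so it may help to see how that field is extracted. The following is a minimal Python sketch, not part of the opcode map itself: the names split_modrm, is_documented_encoding and REG_MUST_BE_ZERO are my own illustrative choices, and the check covers only the three opcodes (8F, C6, C7) that the list flags.

```python
# Opcodes from the errata list whose following ModR/M byte must have reg == 0.
# (Illustrative constant; the name is an assumption, not taken from any manual.)
REG_MUST_BE_ZERO = {0x8F, 0xC6, 0xC7}


def split_modrm(byte):
    """Split a ModR/M byte into its (mod, reg, rm) fields."""
    mod = (byte >> 6) & 0b11    # bits 7-6
    reg = (byte >> 3) & 0b111   # bits 5-3
    rm = byte & 0b111           # bits 2-0
    return mod, reg, rm


def is_documented_encoding(opcode, modrm):
    """Return False when an opcode that requires reg == 0 has another reg value."""
    _, reg, _ = split_modrm(modrm)
    if opcode in REG_MUST_BE_ZERO:
        return reg == 0
    return True


# 0x06 is binary 00 000 110: mod=0, reg=0, rm=6.
print(split_modrm(0x06))
# C6 followed by ModR/M 06 has reg == 0, so it passes the check.
print(is_documented_encoding(0xC6, 0x06))
# ModR/M 0E is binary 00 001 110 (reg == 1), so C6 0E fails the check.
print(is_documented_encoding(0xC6, 0x0E))
```

A map that omits the reg constraint would treat both of the last two byte pairs as the same instruction, which is why the omission counts as an erratum.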

Evolution

The 8086 was released long before the advent of the WWW, and there is relatively little on-line Intel documentation dedicated to it. One can easily find references for later (mostly Pentium-family) processors, but these documents contain more information than we need.

Ever since IBM’s System/360 machines, backwards compatibility has been a key to the commercial success of computer hardware. Intel has learned this lesson well, as even its most modern 2008 CPUs are (almost entirely) compatible with code written for their distant 1978 ancestor, the Intel 8086 CPU.

Intel’s impressive commitment to backwards compatibility means that buried within the opcode maps of their most recent processors are the 8086 maps we’re looking for. It just takes a certain amount of detective work to pull the core 8086 instructions from inside the layers that have wrapped the instruction set over the years: additional 80186 and 80286 instructions, 32-bit extensions from the 80386, MMX, SSE, and so forth. Such work is worthwhile because the 8086 map is far smaller than its descendants and therefore much easier to represent in software.
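
As a rough illustration of peeling away those layers, here is a small Python sketch. It assumes a hypothetical map in which every entry has been annotated by hand with the processor that introduced it; the MODERN_MAP list, the "introduced" field and the restrict_to function are all invented for this example and do not reflect the format of any Intel document.

```python
# Processor generations in release order (only the ones used below).
ERA_ORDER = ["8086", "80186", "80286", "80386"]

# A handful of hand-annotated entries; a real map has hundreds.
# The "introduced" field is a hypothetical annotation, not Intel's format.
MODERN_MAP = [
    {"opcode": 0x50, "mnemonic": "PUSH", "introduced": "8086"},
    {"opcode": 0x60, "mnemonic": "PUSHA", "introduced": "80186"},
    {"opcode": 0x63, "mnemonic": "ARPL", "introduced": "80286"},
    {"opcode": 0x66, "mnemonic": "operand-size prefix", "introduced": "80386"},
    {"opcode": 0x90, "mnemonic": "NOP", "introduced": "8086"},
]


def restrict_to(entries, cpu):
    """Keep only the entries introduced on `cpu` or an earlier generation."""
    limit = ERA_ORDER.index(cpu)
    return [e for e in entries if ERA_ORDER.index(e["introduced"]) <= limit]


core = restrict_to(MODERN_MAP, "8086")
print([f"{e['opcode']:02X} {e['mnemonic']}" for e in core])
```

Filtering by generation is only the first step. Some opcodes changed meaning between generations, and this sketch makes no attempt to handle that; those cases still need to be checked by hand.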

Resources

To compile my version of the opcode map, I cross-referenced several sources:

  • My 1995 hardcopy Architecture and Programming Manual (ISBN: 1-55512-247-7)
  • The aforementioned 1997 Instruction Set Reference
  • A really neat opcode map ostensibly pulled from a June 1978(!) Intel mcs-86 product description
  • The “coder32” X86 Opcode and Instruction Reference from x86asm.net
  • Experimental results from DOS DEBUG

I should mention that the best non-experimental resource was the coder32 reference. It appears to be a higher quality reference than even the 2008 Intel documentation. I would have relied upon it more heavily, but its large size and argument format did not immediately recommend themselves to me.
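
The cross-referencing itself is easy to mechanize once each source has been typed in. The Python sketch below is one possible approach, not a record of how I actually did it: the function find_disagreements and the source names are hypothetical, the "isr_1997" operands for 33 and 82 are the misprints listed in the errata above, and "second_source" is a stand-in holding the corrected values rather than a transcription of any particular document.

```python
def find_disagreements(sources):
    """Return {opcode: {source_name: operands}} for opcodes where sources differ."""
    opcodes = set()
    for table in sources.values():
        opcodes.update(table)

    conflicts = {}
    for opcode in sorted(opcodes):
        # Collect what each source says about this opcode (if it lists it at all).
        readings = {
            name: table[opcode]
            for name, table in sources.items()
            if opcode in table
        }
        if len(set(readings.values())) > 1:
            conflicts[opcode] = readings
    return conflicts


# Illustrative data only: operand strings keyed by opcode.
sources = {
    "isr_1997": {0x31: "Ev, Gv", 0x33: "Gb, Ev", 0x82: "Ev, Ib"},
    "second_source": {0x31: "Ev, Gv", 0x33: "Gv, Ev", 0x82: "Eb, Ib"},
}

for opcode, readings in find_disagreements(sources).items():
    print(f"{opcode:02X}: {readings}")
```

A report like this only says where the sources disagree, not which one is right. Settling each conflict still takes a third reference or an experiment in DOS DEBUG.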

Conclusion

The results of my efforts can be seen here.

If you’re setting out to build a disassembler, you’re going to need an opcode map, and you’re likely to be disappointed with what’s available. Fortunately, with a little diligence, you can cross-check enough different resources to build up your own map, and gain a non-trivial sense of the processor’s architecture while you’re at it.

Next week, this will pay off as we use the plain-text version of the opcode map to power a disassembler.
