Assembly language programming often evokes images of a time when real programmers hurled boulders the size of small UNIVACs. The mystique is there for those whose computer science training is primarily in higher level languages.
But programming is programming, so one language shouldn’t be much harder to pick up than another, right? In theory, perhaps. But in practice, there are myriad details.
Hurling Boulders: An Overview of Assembly Language Programming
Mastering assembly language programming requires understanding many more details than mastering a higher level language does. The complexities lie not so much in the language itself as in the knowledge required of the underlying microchip platform, the various elements of the programming toolchain, and the many variations encountered when switching between tools and target platforms.
Here’s a short survey, using the x86 (your typical PC) as an example.
Every target platform has its own specific assembly language. Why? Because an assembly language is a symbolic language representing the electronic operation codes recognized by the underlying microprocessor. Which codes are available depends on how the digital system is organized. That organization therefore determines how things can be done by the system. This means that the organization of the electronic elements of the computing system ends up determining the language elements that are available to the programmer to work with.
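To make the "symbols standing in for operation codes" idea concrete, here is a minimal sketch of a toy assembler, written in Python purely for illustration. It handles only a handful of 32-bit x86 instructions in Intel syntax; the function names assemble_line and assemble are made up for this example, and a real assembler does vastly more.

```python
import struct

# 32-bit register numbers as they are encoded in x86 opcodes.
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

# A few one-byte instructions that take no operands.
SIMPLE = {"nop": 0x90, "ret": 0xC3, "int3": 0xCC}


def assemble_line(line):
    """Encode one instruction from a tiny, hypothetical subset of x86."""
    text = line.split(";", 1)[0].strip().lower()   # drop any comment
    if not text:
        return b""
    mnemonic, _, operands = text.partition(" ")
    if mnemonic in SIMPLE and not operands.strip():
        return bytes([SIMPLE[mnemonic]])
    if mnemonic == "mov":
        dst, src = [part.strip() for part in operands.split(",")]
        if dst in REG32:
            try:
                value = int(src, 0) & 0xFFFFFFFF
            except ValueError:
                raise ValueError(f"only immediates are supported: {line!r}")
            # MOV r32, imm32 is opcode 0xB8 plus the register number,
            # followed by the immediate as 4 little-endian bytes.
            return bytes([0xB8 + REG32[dst]]) + struct.pack("<I", value)
    raise ValueError(f"unsupported instruction: {line!r}")


def assemble(source):
    """Assemble a multi-line source string into raw machine code bytes."""
    return b"".join(assemble_line(line) for line in source.splitlines())
```

For instance, assemble("mov eax, 1\nret").hex() gives b801000000c3: one opcode byte that names both the operation and the register, four bytes of data, and a one-byte return. A different processor would need a completely different table.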
Along with platform specific instructions, the assembly language programmer must deal with a toolchain typically consisting of a number of elements, each of which has its own idiosyncrasies. Switching between tools in the toolchain is not trivial, and an effective assembly language programmer must understand these idiosyncrasies and be able to interchange the required parts of the toolchain.
What’s in a Toolchain?
The following will give an idea of what is involved:
1. The Target Platform
First choose your hardware (computing) target platform. Is it going to be a system-on-chip or a computer system consisting of a number of interacting chips? What will be the architecture of this chip? Will it be an Intel x86 microprocessor? A Motorola 68K microprocessor? An Intel 8051 microcontroller? A Microchip PIC microcontroller? An ARM microcontroller?
Once you have your processor, you have a processor organization and processor features which determine the programmer’s interface, i.e. the opcodes (instruction set). You have to get a copy of that instruction set. Where? Get the chip datasheet. And its User Guide, if there is one. And the chip’s Application notes. You have to understand at least a minimal subset of these in order to write your program and have it work as you expect.
If you’re programming for a bare chip, i.e. with no operating system wrapping your binary, then you are in charge of all timing, sequencing, etc. There is nothing negotiating between your instructions and the electronic signals they cause. You and your instructions are the only things (and all the things) driving the schedule. (Now, this actually makes things easier for deterministic, hard real-time software development.)
If you’re programming within an operating system environment (16-bit MS-DOS, 32-bit MS Windows, Linux, or a real-time operating system, i.e. an RTOS), you’ll need to understand the functions that the operating system makes available as well as the calling convention required to use these functions. For 16-bit MS-DOS, functions aren’t too difficult, but for 32-bit assembly language programming on Windows, you will need to understand how to work with the Win32 API, passing parameters on the stack, cleaning up the stack, and understanding which registers are preserved between function calls. Unless you have specific requirements that dictate the use of assembly language (or unless education or masochism is the goal), you will typically be much better served by programming in a language that is at least at the level of the operating system. In both the Win32 and the *NIX cases, this language is C.
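To see what "passing parameters on the stack" and "cleaning up the stack" amount to, here is a rough model in Python of the stack pointer (ESP) during one 32-bit call. It compares cdecl, the usual C convention, with stdcall, which most Win32 API functions use. The function simulate_call and the starting ESP value are invented for this sketch; it tracks only the stack pointer and ignores register preservation (in both conventions the callee is expected to preserve EBX, ESI, EDI and EBP).

```python
WORD = 4  # bytes per stack slot in 32-bit code


def simulate_call(args, convention, esp=0x1000):
    """Trace ESP through one call; returns a list of (step, esp) pairs."""
    if convention not in ("cdecl", "stdcall"):
        raise ValueError(f"unknown convention: {convention!r}")
    trace = [("start", esp)]
    for arg in reversed(args):          # both conventions push right to left
        esp -= WORD
        trace.append((f"caller: push {arg}", esp))
    esp -= WORD                         # CALL pushes the return address
    trace.append(("caller: call", esp))
    arg_bytes = WORD * len(args)
    if convention == "stdcall":
        esp += WORD + arg_bytes         # RET n pops return address and arguments
        trace.append((f"callee: ret {arg_bytes}", esp))
    else:
        esp += WORD                     # plain RET pops only the return address
        trace.append(("callee: ret", esp))
        esp += arg_bytes                # the caller removes its own arguments
        trace.append((f"caller: add esp, {arg_bytes}", esp))
    return trace
```

Both traces end with ESP back where it started. The difference is who does the cleanup: under stdcall the callee's ret 8 does it, while under cdecl the caller has to follow the call with add esp, 8. Get that wrong in hand-written assembly and the stack drifts by a few bytes on every call.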
2. The Host Platform
Now you have to choose a host, i.e. the platform on which you are going to compose the program. The easiest situation is if your host and target are the same. If they aren’t, then you’re going to need a cross-assembler. But you’re also going to need an emulator so that you can test your program on the host machine before cross-assembling it for testing on the target.
3. Assembly Language Dialects for the same chip
Depending on your choice of target, there could be different DIALECTS of assembly language for the specific chip. For the x86 microprocessor, for example, there is the Intel syntax and the AT&T syntax. Though knowing one you could probably READ the other, they are different enough that you will not immediately be able to WRITE in a dialect that you are not practiced in. Elements of your toolchain will typically have a decided preference for one syntax, i.e. they will not typically be dialect indifferent. For example, the GNU toolchain is AT&T syntax oriented, while the NASM, MASM, etc. line of assemblers is Intel syntax oriented.
So, if you choose a Windows PC as your host and target (best for learning), you have to choose the dialect of your assembly language: AT&T / Motorola syntax or Intel x86 syntax.
You’ll find that the AT&T syntax is a more regularized dialect, but Intel x86 syntax is often the choice for beginners since it appears less daunting at the outset (though its idiosyncrasies make it less appealing once you know what you’re doing).
4. Assemblers and Assembler-specific syntax and code organization
Now you need to pick an assembler. x86 assemblers will typically require that your source code be written in a particular syntax dialect (Intel or AT&T), and so, if you want to choose an assembler first, then you will need to use the dialect that it prefers.
In addition, every assembler will introduce coding conventions and definitions: pseudo-ops (db, equ, etc.), memory and pointer specifying conventions, as well as macros, pre-processor definitions, etc. The syntax for these language elements often varies from assembler to assembler. Is it mov ax, es:foo or mov ax, [es:foo]?
Finally, there may be some different (or limited) choices when it comes time to assemble, and you’ll need to know what you’re doing to get things to roll along without errors.
If you choose Intel x86 syntax, you have several choices of assemblers. I recommend NASM (Netwide Assembler).
If you choose AT&T syntax, you can use gas (GNU Assembler), which is part of the gcc toolchain.
NOTE: you can also get gas to assemble Intel syntax by placing the directive .intel_syntax noprefix in the source file, though the documentation for this is somewhat thin.
5. Automatic Syntax Converters
If you find yourself torn between wanting to code in one dialect and wanting to use a toolchain element that requires the other dialect, you do have a last-ditch option of using an automatic syntax converter that takes you back and forth between the two. For the x86, there is Intel2ATT and ATT2Intel.
Though the converter can take a large part of the pain away, not everything will always go smoothly (as with anything automatic), and so you’ll find that you end up having to know enough about both dialects anyway.
NOTE: gcc IS able to spit out Intel x86 syntax with the switch -masm=intel, but this is not compilable by NASM. ATT2Intel gets most of the details right so is the recommended tool to get from GCC generated assembly code in AT&T syntax to Intel assembly code that can be assembled with NASM.
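To give a feel for what such a converter has to do, here is a deliberately tiny sketch in Python. It is not how Intel2ATT or ATT2Intel work internally; intel_to_att is a hypothetical helper that handles only two-operand instructions on 32-bit registers and immediates. It shows the three most visible differences: operand order, the % and $ prefixes, and the size suffix on the mnemonic.

```python
REGS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"}


def intel_to_att(line):
    """Convert 'op dst, src' (32-bit registers/immediates only) to AT&T syntax."""
    mnemonic, _, operands = line.strip().lower().partition(" ")
    dst, src = [part.strip() for part in operands.split(",")]
    if dst not in REGS:
        raise ValueError(f"destination must be a 32-bit register: {line!r}")

    def operand(text):
        if text in REGS:
            return "%" + text        # registers take a % prefix
        int(text, 0)                 # raises ValueError if this is not a number
        return "$" + text            # immediates take a $ prefix

    # AT&T puts the source first and adds a size suffix (l = 32-bit "long").
    return f"{mnemonic}l {operand(src)}, {operand(dst)}"
```

So intel_to_att("mov eax, ebx") returns "movl %ebx, %eax". Memory operands, where the two dialects differ most, are left out entirely, which is exactly where real converters tend to stumble.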
6. Using a Stripped-down C Compiler as an Automatic Assembly Code Generator
One way to get reasonably legible assembly language code (minus the comments!) is to write the logic in C and then use a stripped down compiler to compile it to assembly. The key here is getting the compiler to generate the least amount of unnecessary code.
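For GCC, use gcc -S (compile to assembly and stop) with the following switches:
-fno-exceptions : don’t generate exception handling code
-s : strip out symbolic information (this acts at link time, so it affects a final executable rather than the -S output)
and one of the two optimization levels:
-O0 : don’t optimize
-Os : optimize for size (i.e. don’t emit unnecessary instructions)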
Or use Fabrice Bellard’s Tiny C Compiler (tcc) and disassemble the executable (PEBrowse works well here).
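If you generate assembly this way often, it helps to script the gcc invocation. The following Python sketch builds a gcc -S command line from the switches discussed in this section; the helper names gcc_asm_command and generate_asm and the file names are made up for illustration, and generate_asm assumes gcc is installed and on your PATH.

```python
import shutil
import subprocess


def gcc_asm_command(source, output, optimize_for_size=True, intel_syntax=False):
    """Build a gcc command line that stops after generating assembly."""
    cmd = ["gcc", "-S", "-fno-exceptions",
           "-Os" if optimize_for_size else "-O0"]   # the two are alternatives
    if intel_syntax:
        cmd.append("-masm=intel")                   # x86 targets only
    return cmd + ["-o", output, source]


def generate_asm(source, output, **options):
    """Run the command if gcc is available; return True on success."""
    if shutil.which("gcc") is None:
        return False
    return subprocess.run(gcc_asm_command(source, output, **options)).returncode == 0
```

Calling gcc_asm_command("foo.c", "foo.s") yields the equivalent of gcc -S -fno-exceptions -Os -o foo.s foo.c. Comparing the -O0 and -Os output for the same C function is a quick way to learn how much of the generated code is bookkeeping.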
7. A Cornucopia of x86 Object Code Formats
Your chosen x86 assembler will take your source assembly code and assemble it into object code, and spit out an object file. But there are many choices here too: omf, win32, coff, a.out, elf, etc. The object file you are hoping to produce imposes its own syntax and organizational requirements which means that it affects HOW you write your assembly language program and arrange it in your source file. In other words, your assembly language source file cannot typically be run through the same pipe to assemble it into different object file formats.
Another key reason to care about object code formats is if you wish to take advantage of pre-assembled functionality from other libraries or from the operating system. If this functionality exists in the form of object code that you can link to, then your code will need to be assembled to a compatible object format in order for the linker to link all of the necessary pieces.
8. Linkers
To get a stand-alone executable that you can run on your target, you need a linker. There are many linkers, each having its own syntax and options. The linker will need to be able to accept the object file format(s) that you got from your assembler with the correct word length (8-, 16-, 32-, or 64-bit).
9. Linking Libraries
The linker may need run-time or static libraries to resolve definitions. These will also need to be in a compatible format. You will need to know where they are, how to link them in (more syntax), and the right settings.
10. x86 Executable (Binary)
Finally, if you have done all this right, you will get an executable that can run on an x86 platform, either in a COM (DOS) window or directly on the OS (Win32 or Linux). The executable, too, will have a format (e.g. PE-i386, elf, etc.), and this will determine which disassemblers and/or debuggers you can use with it.
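When you are unsure what a tool actually produced, the first few bytes of the file usually tell you. Here is a small Python sketch that recognizes two common cases by their signatures: ELF files (including their 32- or 64-bit class) and DOS/Windows files that begin with an MZ header. The function identify_binary is a hypothetical helper, and it makes no attempt at the other formats mentioned above, such as omf or a.out.

```python
import struct


def identify_binary(data):
    """Guess the container format of a binary from its leading bytes."""
    if data[:4] == b"\x7fELF":
        # Byte 4 of an ELF header is the class: 1 = 32-bit, 2 = 64-bit.
        elf_class = data[4] if len(data) > 4 else 0
        bits = {1: "32-bit", 2: "64-bit"}.get(elf_class, "unknown class")
        return f"ELF ({bits})"
    if data[:2] == b"MZ":
        # DOS-style header; a PE file stores the offset of its
        # "PE\0\0" signature as a little-endian dword at 0x3C.
        if len(data) >= 0x40:
            (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
            if data[pe_offset:pe_offset + 4] == b"PE\0\0":
                return "PE (Windows)"
        return "MZ (DOS executable)"
    return "unknown"
```

You would typically feed it the start of a file, e.g. identify_binary(open(path, "rb").read(4096)), where path is whatever your linker wrote. A mismatch here (say, a 64-bit ELF when you meant to build 32-bit) explains a lot of otherwise baffling linker and debugger complaints.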
11. Library (Static or DLL)
If you are creating a library, you have further considerations: static (.a or .lib) or dynamic (with its various flavors). But if dynamic, you also have to know whether the code that is produced is position independent, relocatable, or neither.
And though you now have an executable, it doesn’t end here. Your program is likely not correct in all of its details (at least not the first several times through). Some mistakes will be obvious quickly. But for others you may find that you need a debugging method.
Debugging an assembly language program isn’t easy without good tools. And good tools add their own additional intricacies to what you have already had to master.
But now that you’ve seen the roadmap, stick around and play a while. Hurling boulders does wonders for that feeling of invincibility!