How to Obtain MLRISC
Problem Statement
MLRISC Based Compiler
MLRISC Intermediate Representation
MLRisc Generation
Back End Optimizations
Register Allocation
Machine Description
Garbage Collection Safety
System Integration
Graphical Interface
Line Counts
Systems Using MLRISC
Future Work
Architecture of MLRISC
The MLTREE Language
MLTree Extensions
MLTree Utilities
Instruction Selection
Machine Code Emitters
Delay Slot Filling
Span Dependency Resolution
The Graph Library
The Graph Visualization Library
Basic Compiler Graphs
SSA Optimizations
ILP Optimizations
Optimizations for VLIW/EPIC Architectur...
Register Allocator
Back Ends
The Alpha Back End
The PA RISC Back End
The Sparc Back End
The Intel x86 Back End
The PowerPC Back End
The MIPS Back End
The TI C6x Back End
Basic Types
Client Defined Constants
Client Defined Pseudo Ops
Instruction Streams
Label Expressions

Machine Description

Machine Description
-Why MDGen?
--Reserved Words
--Syntactic Sugar
-Elaboration Semantics
--Syntactic Overloading
-Basic Structure of A Machine Descriptio...
Describing the Architecture
-Architecture type
-Storage class
Specifying the Machine Encoding
-Defining New Instruction Formats
--Primitive formats
--Derived formats
-Generating Encoding Functions
-Encoding Variable Length Instructions
Specifying the Assembly Formats
-Assembly Case Declaration
-Assembly Annotations
-Generating Assembly Functions
Defining the Instruction Set
Specifying Instruction Semantics
How to Run the Tool
Machine Description
Syntax Highlighting Macros


MDGen is a simple tool for generating various modules in the MLRISC customizable code generator directly from machine descriptions. These descriptions contain architectural information such as:
  1. How the the register file(s) are organized.
  2. How instructions are encoded in machine code: MLRISC uses this information to generate machine instructions directly into a byte stream. Directly machine code generation is used in the SML/NJ compiler.
  3. How instructions are pretty printed in assembly: this is used for debugging and also for assembly output for other non-SML/NJ backends.
  4. How instructions are internally represented in MLRISC.
  5. Other information needed for performing optimizations, which include:
    1. The register transfer list (RTL) that defines the operational semantics of the instruction.
    2. Delay slot mechanisms.
    3. Information for performing span dependency resolution.
    4. Pipeline and reservation table characteristics.

Currently, item 5 is not ready for prime time.

Why MDGen?

MLRISC manipulates all instruction sets via a set of abstract interfaces, which allows the programmer to arbitrarily choose an instruction representation that is most convenient for a particular architecture. However, various functions that manipulate this representation must be provided by the instruction set's programmer. As the number and complexities of each optimizations grow, and as the number of architectures increases, the functions for manipulating the instructions become more numerous and complex. In order to keep the effort of developing and maintaining an instruction set manageable, the MDGen tool is developed to (partially) automate this task.


MDGen's machine descriptions are written in a syntax that is very much like that of Standard ML. Most core SML constructs are recognized. In addition, new declaration forms specific to MDGen are used to specify architectural information.

Reserved Words

All SML keywords are reserved words in MDGen. In addition, the following keywords are also reserved:

    always architecture assembly at backwards big bits branching called
    candidate cell cells cellset debug delayslot dependent endian field
    fields formats forwards instruction internal little locations lowercase
    name never nodelayslot nullified opcode ordering padded pipeline predicated
    register rtl signed span storage superscalar unsigned uppercase 
    verbatim version vliw when
Two kinds are quotations marks are also reserved:
    [[ ]]
    `` ''
The first [[ ]] is for describing semantics. The second `` '' is for describing assembly syntax.

Syntactic Sugar

MDGen recognizes the following syntactic sugar.
Record abbreviations
Record expressions such as \sml{{x=x,y=y,z=z}} can be simplified to just \sml{{x,y,z}}.
Binary literals
Literals in binary can be written with the prefix 0b (for integer types) or 0wb (for word types). For example, 0wb101111 is the same as 0wx2f and 0w79.
Bit slices
A bit slice, which extracts a range of bits from a word, can be written using an at expression. For example, w at [16..18] means the same thing as Word32.andb(Word32.>>(w, 0w16),0w7), i.e. it extracts bit 16 to 18 from w. The least significant bit the zeroth bit.

In general, we can write:

   w at [range1, range2, ..., rangen]
to extract a sequence of slices from w and concatenate them together. For example, the expression
    0wxabcd at [0..3, 4..7, 8..11, 12..15]
swap the 4 nybbles from the 16-bit word, and evaluates to 0wxdcba.

Signature declarations of the form
    val x y z : int -> int
can be used as a shorthand for the more verbose:
    val x : int -> int
    val y : int -> int
    val z : int -> int

Elaboration Semantics

Unfortunately, there is no complete formal semantics of how an MDGen specification elaborates. But generally speaking, a machine description is a just a structure (in the SML sense). Different components of this structure describe different aspects of the architecture.

Syntactic Overloading

In general, the syntactic overloading are used heavily in MDGen. There are three types of definitions:
  • Definitions that defines properties of the instruction set.
  • Definitions of functions and terms that are in the RTL meta-language. The syntax of MDGen's RTL language is borrowed heavily from Lambda-RTL, which in turns is borrowed heavily from SML.
  • Definitions of functions and types that are to be included in the output generated by the MDGen tool. These are usually auxiliary helper functions and definitions.
In general, entities of type 2, when appearing in other context, are properly meta-quoted in the semantics quotations [[ ]].

Basic Structure of A Machine Description

The machine description for an architecture are defined via an architecture declaration, which has the following general form.

 architecture name =
    architecture type declaration
    endianess declaration
    storage class declarations
    locations declarations
    assembly case declarations
    delayslot declaration
    instruction machine encoding format declarations
    nested structure declarations
    instruction definition

Describing the Architecture

Architecture type

Architecture type declaration specifies whether the architecture is a superscalar or a VLIW/EPIC machine. Currently, this information is ignored.

    architecture type declaration ::= superscalar | vliw

Storage class

Storage class declarations specify various information about the registers in the architecture. For example, the Alpha has 32 general purpose registers and 32 floating point registers. In addition, MLRISC requires that each architecture specifies a (pseudo) register type\footnote{Called cellkind in MLRISC.} for holding condition codes (CC). To specify these information in MDGen, we can say:

      GP "r" = 32 cells of 64 bits in cellset called "register"
                 assembly as (fn (30,_) => "$sp"
                               | (r,_)  => "$"^Int.toString r
    | FP "f" = 32 cells of 64 bits in cellset called "floating point register"
                 assembly as (fn (f,_) => "$f"^Int.toString f)
    | CC "cc" = cells of 64 bits in cellset GP called "condition code register"
                 assembly as "cc"
  • There are 32 64-bit general purpose registers, 32 64-bit floating point registers, while CC is not a real register type.
  • Cellsets are used by MLRISC for annotating liveness information in the program. The clause in cellset states that register type GP and FP are allotted their own components in the cellset, while the register type CC are put in the same cellset component as GP.
  • The clause assembly as specifies how each register is to be pretty printed. On the Alpha, general purpose register are pretty printed with prefix $, while floating point registers are pretty printed with the prefix $f. A special case is made for register 30, which is the stack pointer, and is pretty printing as $sp. Pseudo condition code registers are pretty printed with the prefix cc.


Special locations in the register files can be declared using the locations declarations. On the Alpha, GPR 30 is the stack pointer, GPR 28 and floating point register 30 are used as the assembly temporaries. This special constants can be defined as follows:

        stackptrR = $GP[30]
    and asmTmpR   = $GP[28]
    and fasmTmp   = $FP[30]

Specifying the Machine Encoding


The endianess declaration specifies whether the machine is little endian or big endian so that the correct machine instruction encoding functions can be generated. The general syntax of this is:

    endianess declaration ::= little endian | big endian
The Alpha is little endian, so we just say:
     little endian

Defining New Instruction Formats

How instructions are encoded are specified using instruction format declarations. An instruction format declaration has the following syntax:
   instruction machine encoding format declarations ::=
      instruction formats n bits 
      | format2
      | format3
      | ...
      | formatn-1
      | formatn
Each encoding format can be a primitive format, or a derived format.

Primitive formats

A primitive format is simply specified by giving it a name and specifying the position, names and types of its fields. This is usually the same way it is described in a architectural reference manual.

Here is how we specify some of the (32 bit) primitive instruction formats used in the Alpha.

    instruction formats 32 bits
      Memory{opc:6, ra:5, rb:GP 5, disp: signed 16} 
    | Jump{opc:6=0wx1a,ra:GP 5,rb:GP 5,h:2,disp:int signed 14}  
    | Memory_fun{opc:6, ra:GP 5, rb:GP 5, func:16}     
    | Branch{opc:branch 6, ra:GP 5, disp:signed 21}         
    | Fbranch{opc:fbranch 6, ra:FP 5, disp:signed 21}        
    | Operate0{opc:6,ra:GP 5,rb:GP 5,sbz:13..15=0,_:1=0,func:5..11,rc:GP 5}
    | Operate1{opc:6,ra:GP 5,lit:signed 13..20,_:1=1,func:5..11,rc:GP 5} 
For example, the format Memory
      Memory{opc:6, ra:5, rb:GP 5, disp: signed 16} 
has a 6-bit opcode field, a 5-bit ra field, a 5-bit rb field which always hold a general purpose register, and a 16-bit sign-extended displacement field. The field to the left is positioned at the most significant bits, while the field to the right is positioned at the least. The widths of these fields must add up to 32 bits.

Similarly, the format Jump

   Jump{opc:6=0wx1a,ra:GP 5,rb:GP 5,h:2,disp:int signed 14}  
contains a 6-bit opcode field which always hold the constant 0x1a, two 5-bit fields ra and rb which are of type GP, and a 14-bit sign-extended field of type integer.

Each field in a primitive format has one of 5 forms:

    name : position 
    name : position = value 
    name : type position 
    name : type position = value 
    _           : position = value 
where position is either a width, or a bits range n..m, with an optional signed prefix. The last form, with a wild card for the field name, can be used to specify an anonymous field that always has a fixed value.

By default, a field has type Word32.word. If a type T is specified, then the function emit_T is implicitly called to convert the type into the appropriate encoding. The function emit_T are generated automatically by MDGen if it is a cellkind defined by the storage class declaration, or if it is a primitive type such as integer or boolean. There are also other ways to automatically generate this function (more on this later.)

For example, the format Operate1

    Operate1{opc:6,ra:GP 5,lit:signed 13..20,_:1=1,func:5..11,rc:GP 5} 
states that bits 26 to 31 are allocated to field opc, bits 21 to 25 are allocated to field ra, which is of type GP, bits 13 to 20 are allocated to field lit, bit 12 is a single bit of value 1, etc.

MDGen generates a function for each primitive format declaration of the same name that can be used for emitting the instruction. In the case of the Alpha, the following functions are generated:

    val Memory : {opc:Word32.word, ra:Word32.word, 
                  rb:int, disp:Word32.word} -> unit
    val Jump   : {ra:int, rb:int, disp:Word32.word} -> unit
    val Operate1 : {opc:Word32.word, ra:int, lit:Word32.word,
                    func:Word32.word, rc:int} -> unit

Derived formats

Derived formats are simply instruction formats that are defined in terms of other formats. On the alpha, we have a Operate format that simplifies to either Operate0 or Operate1, depending on whether the second argument is a literal or a register.
    Operate{opc,ra,rb,func,rc} =
      (case rb of
        I.REGop rb => Operate0{opc,ra,rb,func,rc}
      | I.IMMop i  => Operate1{opc,ra,lit=itow i,func,rc}
      | I.HILABop le => Operate1{opc,ra,lit=High{le=le},func,rc}
      | I.LOLABop le => Operate1{opc,ra,lit=Low{le=le},func,rc}
      | I.LABop le => Operate1{opc,ra,lit=itow(LabelExp.valueOf le),func,rc}

Generating Encoding Functions

In MLRISC, we represent an instruction as a set of ML datatypes. Some of these datatypes represent specific fields or opcodes of the instructions. MDGen lets us to associate a machine encoding to each datatype constructor directly in the specification, and automatically generates an encoding function for these datatypes.

There are two different ways of specifying an encoding. The first way is just to write the machine encoding directly next the constructor. Here's an example directly from the Alpha description:

    structure Instruction =
       datatype branch! =  (* table C-2 *)
          BR   0x30
                    | BSR 0x34
                               | BLBC 0x38
        | BEQ  0x39 | BLT 0x3a | BLE  0x3b
        | BLBS 0x3c | BNE 0x3d | BGE  0x3e
        | BGT  0x3f
       datatype fbranch! = (* table C-2 *)
                      FBEQ 0x31 | FBLT 0x32
        | FBLE 0x33             | FBNE 0x35
        | FBGE 0x36 | FBGT 0x37
The datatypes branch and fbranch represent specific branch opcodes for integer branches BRANCH, or floating point branches FBRANCH. On the Alpha, instruction BR is encoded with an opcode of 0x30, instruction BSR is encoded as 0x34 etc. MDGen automatically generates two functions
     val emit_branch : branch -> Word32.word
     val emit_fbranch : branch -> Word32.word
that perform this encoding.

In the specification for the instruction set, we state that the BRANCH instruction should be encoded using format Branch, while the FBRANCH instruction should be encoded using format Fbranch.

    structure MC =
       (* Auxiliary function for computing the displacement of a label *)
       fun disp ... = ...
    | BRANCH of branch * $GP * Label.label
      Branch{opc=branch,ra=GP,disp=disp label}
    | FBRANCH of fbranch * $FP * Label.label
      Fbranch{opc=fbranch,ra=FP,disp=disp label}
    | ...
Since the primitive instructions formats Branch and FBranch are defined with branch and fbranch as the type in the opcode field
    | Branch{opc:branch 6, ra:GP 5, disp:signed 21}          
    | Fbranch{opc:fbranch 6, ra:FP 5, disp:signed 21}       
the functions emit_branch and emit_fbranch are implicitly called.

Another way to specify an encoding is to specify a range, as in the following example:

    datatype fload[0x20..0x23]! = LDF | LDG | LDS | LDT
    datatype fstore[0x24..0x27]! = STF | STG | STS | STT
This states that LDF should be assigned the encoding 0x20, LDG the encoding 0x21 etc. This form is useful for specifying a consecutive range.

Encoding Variable Length Instructions

Most architectures nowadays have fixed length encodings for instructions. There are some notatable exceptions, however. The Intel x86 architecture uses a legacy variable length encoding. Modern RISC machines developed for embedded systems may utilize space-reduction compression schemes in their instruction sets. Finally, VLIW machines usually have some form of NOP compression scheme for compacting issue packets.

Specifying the Assembly Formats

Assembly Case Declaration

The assembly case declaration specifies whether the assembly should be emitted in lower case, upper case, or verbatim. If either lower case or upper case is specified, all literal strings are converted to the appropriate case. The general syntax of this declaration is:

    assembly case declaration ::=
       lowercase assembly
     | uppercase assembly
     | verbatim  assembly

Assembly Annotations

Assembly output are specified in the assembly meta quotations `` '', or string quotations " ". For example, here is a fragment from the Alpha description:

    | LOAD of {ldOp:load, r: $GP, b: $GP, d:operand, mem:Region.region}
      ``<ldOp>\t<r>, <d>()<mem>''
    | STORE of {stOp:store, r: $GP, b: $GP, d:operand, mem:Region.region}
      ``<stOp>\t<r>, <d>()<mem>''
    | BRANCH of branch * $GP * Label.label
      ``<branch>\t<GP>, <label>''
    | FBRANCH of fbranch * $FP * Label.label
      ``<fbranch>\t<FP>, <label>''
    | CMOVE of {oper:cmove, ra: $GP, rb:operand, rc: $GP}
      ``<oper>\t<ra>, <rb>, <rc>''
    | FOPERATE of {oper:foperate, fa: $FP, fb: $FP, fc: $FP}
      ``<oper>\t<fa>, <fb>, <fc>''
    | ...
All characters within the quotations `` '' have the same interpretation as in the string quotation " ", except when they are delimited by the backquotes < >. Here's how the backquote is interpreted:
  • If it is <x> and x is a variable name of type t, and if an assembly function of type t is defined, then it will be invoked to convert x to the appropriate text.
  • If it is <x> and x is a variable name of type t, and if an assembly function of type t is NOT defined, then the function emit_x will be called to pretty print x.
  • If it is <e> where e is a general expression, then it will be used directly.

Generating Assembly Functions

Similar to machine encodings, we can attach assembly annotations to datatype definitions and let MDGen generate the assembly functions for us. Annotations take two forms, explicit or implicit. Explicit annotations are enclosed within assembly quotations `` ''.

For example, on the Alpha the datatype operand is used to represent an integer operand. This datatype is defined as follows:

    datatype operand =
        REGop of $GP                     ``<GP>'' 
      | IMMop of int                     ``<int>''
      | HILABop of LabelExp.labexp       ``hi(<labexp>)''
      | LOLABop of LabelExp.labexp       ``lo(<labexp>)''
      | LABop of LabelExp.labexp         ``<labexp>''
      | CONSTop of Constant.const        ``<const>''
Basicaly this states that REGop r should be pretty printed as $r, IMMop i as i, HILABexp le as hi(le), etc.

Implicit assembly annotations are specified by simply attaching an exclamation mark at the end of the datatype name. This states that the assembly output is the same as the name of the datatype constructor\footnote{But appropriately modified by the assembly case declaration.}. For example, the datatype operate is a listing of all integer opcodes used in MLRISC.

    datatype operate! = (* table C-5 *)
        ADDL  (0wx10,0wx00)                       | ADDQ (0wx10,0wx20)
                            | CMPBGE(0wx10,0wx0f) | CMPEQ (0wx10,0wx2d)
      | CMPLE (0wx10,0wx6d) | CMPLT (0wx10,0wx4d) | CMPULE (0wx10,0wx3d)
      | CMPULT(0wx10,0wx1d) | SUBL  (0wx10,0wx09)
      | SUBQ  (0wx10,0wx29)
      | S4ADDL(0wx10,0wx02) | S4ADDQ (0wx10,0wx22) | S4SUBL (0wx10,0wx0b)
      | S4SUBQ(0wx10,0wx2b) | S8ADDL (0wx10,0wx12) | S8ADDQ (0wx10,0wx32)
      | S8SUBL(0wx10,0wx1b) | S8SUBQ (0wx10,0wx3b)
      | AND   (0wx11,0wx00) | BIC    (0wx11,0wx08) | BIS (0wx11,0wx20)
                                                   | EQV (0wx11,0wx48)
      | ORNOT (0wx11,0wx28) | XOR    (0wx11,0wx40)
      | EXTBL (0wx12,0wx06) | EXTLH  (0wx12,0wx6a) | EXTLL(0wx12,0wx26)
      | EXTQH (0wx12,0wx7a) | EXTQL  (0wx12,0wx36) | EXTWH(0wx12,0wx5a)
      | EXTWL (0wx12,0wx16) | INSBL  (0wx12,0wx0b) | INSLH(0wx12,0wx67)
      | INSLL (0wx12,0wx2b) | INSQH  (0wx12,0wx77) | INSQL(0wx12,0wx3b)
      | INSWH (0wx12,0wx57) | INSWL  (0wx12,0wx1b) | MSKBL(0wx12,0wx02)
      | MSKLH (0wx12,0wx62) | MSKLL  (0wx12,0wx22) | MSKQH(0wx12,0wx72)
      | MSKQL (0wx12,0wx32) | MSKWH  (0wx12,0wx52) | MSKWL(0wx12,0wx12)
      | SLL   (0wx12,0wx39) | SRA    (0wx12,0wx3c) | SRL  (0wx12,0wx34)
      | ZAP   (0wx12,0wx30) | ZAPNOT (0wx12,0wx31)
      | MULL  (0wx13,0wx00)                        | MULQ (0wx13,0wx20)
                            | UMULH  (0wx13,0wx30)
      | SGNXL "addl" (0wx10,0wx00) (* same as ADDL *)
This definitions states that ADDL should be pretty printed as addl, ADDQ as addq, etc. However, the opcode SGNXL is pretty printed as addl since it has been explicitly overridden.

Defining the Instruction Set

How the instruction set is represented is declared using the instruction declaration. For example, here's how the Alpha instruction set is defined:
      DEFFREG of $FP
    | LDA of {r: $GP, b: $GP, d:operand}	
    | LDAH of {r: $GP, b: $GP, d:operand} 
    | LOAD of {ldOp:load, r: $GP, b: $GP, d:operand, mem:Region.region}
    | STORE of {stOp:store, r: $GP, b: $GP, d:operand, mem:Region.region}
    | FLOAD of {ldOp:fload, r: $FP, b: $GP, d:operand, mem:Region.region}
    | FSTORE of {stOp:fstore, r: $FP, b: $GP, d:operand, mem:Region.region}
    | JMPL of {r: $GP, b: $GP, d:int} * Label.label list
    | JSR of {r: $GP, b: $GP, d:int} * C.cellset * C.cellset * Region.region
    | RET of {r: $GP, b: $GP, d:int} 
    | BRANCH of branch * $GP * Label.label   
    | FBRANCH of fbranch * $FP * Label.label  
    | ...
The instruction declaration defines a datatype and specifies that this datatype is used to represent the instruction set. Generally speaking, the instruction set's designer has complete freedom in how the datatype is structured, but there are a few simple rules that she should follow:
  • If a field represents a register, it should be typed with the appropriate storage types $GP, $FP, etc.~instead of int. MDGen will treat its value in the correct manner; for example, during assembly emission a field declared type int is printed as an integer, while a field declared type $GP is displayed as a general purpose register.
  • MDGen recognizes the following special types: label, labexp, region, and cellset.

Specifying Instruction Semantics

MLRISC performs all optimizations at the granulariy of individual instructions, specialized to the architecture at hand. Many optimizations are possible only if the ``semantics'' of the instructions set to are properly specified. MDGen contains a register transfer language (RTL) sub-language that let us to describe instruction semantics in a modular and succinct manner.

The semantics of this RTL sub-language has been borrowed heavily from Norman Ramsey's and Jack Davidson's Lambda RTL. There are a few main differences, however:

  • The syntax of our RTL language is closer to that of ML than Lambda RTL.
  • Our RTL language, like that of MDGen, is tied closely to MLRISC.

How to Run the Tool

Machine Description

Here are some machine descriptions in varing degree of completion.

Syntax Highlighting Macros

Lal George
Allen Leung
SML/NJ Validate this page
Generated by mltex2html
Last modified: Thu Jan 9 19:38:15 EST 2003 by leunga@slinky