Garbage Collection Safety
High level languages such as SML make use of garbage collectors
to reclaim unused storage at runtime. For generality, I assume that
a precise, compacting garbage collector is used. In general,
low-level optimizations that reorder instructions
pass gc safepoints, when applied naively,
are not safe. In general, two general classes of safety issues can be identified:
- derived values
A derived value is a value that are
dependent on the addresses of one of more heap allocated objects
and/or the recent branch history.
When these allocated objects
are moved by the garbage collector,
has to be adjusted accordingly.
For example, inductive variable elimination may transformed an array
indexing into a running pointer to the middle of an array object.
Such running pointer is a derived value and is dependent on the
starting address of the array.
The main difficulty in handling a derived value
during garbage collection is that sometimes it is impossible or
counter-productive to recompute from .
For example, when the recent branch history is unknown, or when the
precise relationship between and cannot
be inferred from context.
We call these unrecoverable derived values.
- incomplete allocation
If heap allocation is performed inlined, then code motion may
render some allocation incomplete at a gc safepoint. In general, incomplete
allocation has to be completed, or rolled backed and then reexecuted
after garbage collection, when the source language semantics allow it.
Typically, two gc safepoints cannot be separated by an unbounded
number of allocations, which implies that in general, optimizations that move
instructions between basic blocks are unsafe when naively applied,
which greatly limits the class of optimizations in such an environment
to trivial basic block level optimizations.
framework is a necessity.
MLRISC contains a gc-safety framework
for performing aggressive machine level optimizations, including SSA-based
scalar optimizations, global instruction scheduling, and global
register allocation. Unlike previous work in this area, phases that
perform optimizations and phases that maintain and update
garbage collection information are completely separate, and the optimizer
is constructed in a fully modular manner. In particular,
gc-types and safety constraints
by the source language semantics, the object representation,
and the target architectures.
This framework has the following overall structure:
- Garbage collection invariants annotation
The front-end client is responsible for annotating each
value in the program with a gc type, which is
used to specify the abstract object representation,
and the constraints on code motion that may be applied to such a value.
The front-end uses an architecture independent RTL
language for representing the program, thus this annotation
phase is portable between target architectures.
- GC constraints propagation
After instruction selection, gc constraint are propagated throughout
the machine level program representation. Again, for portability, gc typing
rules are specified in terms of the RTL of
the machine instructions. In this phase, unsafe code motions which
exposes unrecoverable derived values to gc safepoints are automatically
identified. (Pseudo) control dependence and anti-control dependence
constraints are then added the program representation to prohibit all
gc-unsafe code motions.
- Machine level optimizations
After constraints propagation, traditional
machine level optimizations such as
SSA optimizations and/or global scheduling are applied, without regard
to gc information. This is safe because
all gc safety constraints have been translated into the appropriate
code motion constraints.
- GC type propagation and gc code generation
GC type inference is performed when all optimizations
have been performed. GC safepoints are then
identified and the root sets are determined. In addition, compensation
code are inserted at gc points for repairing recoverable derived values.
In the presence of concurrency, i.e. multiple threads
of control that communicate via a shared heap, the above framework
will have to slightly extended. As in before, we assume that
context switching can only occur at well-defined
The crucial aspect is that values that are live at safepoints must be
classified as local or global.
Local values are only observable from
the local thread, while global values are potentially observable and mutable
from other threads. The invariants to maintain are as follows:
- Only local and recoverable derived values may be live at a safepoint,
- Only local and recoverable allocation may be incomplete at a safepoint
Last modified: Thu Jan 9 19:38:15 EST 2003 by leunga@slinky