
8  Memory Management Via Regions

8.1  Introduction

C gives programmers complete control over how memory is managed. An expert programmer can exploit this to write very fast and/or space-efficient programs. However, bugs that creep into memory-management code can cause crashes and are notoriously hard to debug.

Languages like Java and ML use garbage collectors instead of leaving memory management in the hands of ordinary programmers. This makes memory management much safer, since the garbage collector is written by experts, and it is used, and therefore debugged, by every program. However, removing memory management from the control of the application programmer can make for slower programs.

Safety is the main goal of Cyclone, so we provide a garbage collector. But, like C, we also want to give programmers as much control over memory management as possible, without sacrificing safety. Cyclone's region system is a way to give programmers more explicit control over memory management.

In Cyclone, objects are placed into regions. A region is simply an area of memory that is allocated and deallocated all at once (but not for our two special regions; see below). So to deallocate an object, you deallocate its region, and when you deallocate a region, you deallocate all of the objects in the region. Regions are sometimes called ``arenas'' or ``zones.''
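The all-at-once deallocation described above can be sketched in plain C. This is only an illustrative model and not how Cyclone implements regions: the names `region`, `region_alloc`, `region_free`, and `region_demo` are hypothetical, and C offers none of Cyclone's static safety checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Header in front of every object; max_align_t keeps the payload aligned. */
union header {
    union header *next;
    max_align_t align;
};

/* Hypothetical region: a chain of objects that are only ever freed together. */
struct region {
    union header *objects;
};

static void region_init(struct region *r) {
    r->objects = NULL;
}

/* Place a new object of `size` bytes into the region; NULL on failure. */
static void *region_alloc(struct region *r, size_t size) {
    union header *h = malloc(sizeof *h + size);
    if (h == NULL) return NULL;
    h->next = r->objects;
    r->objects = h;
    return h + 1;   /* payload starts right after the header */
}

/* Deallocate the region, and with it every object it holds.
   Returns how many objects were released. */
static size_t region_free(struct region *r) {
    size_t released = 0;
    while (r->objects != NULL) {
        union header *next = r->objects->next;
        free(r->objects);
        r->objects = next;
        released++;
    }
    return released;
}

/* Example: two objects are allocated separately but released in one step. */
static size_t region_demo(void) {
    struct region r;
    region_init(&r);
    int *count = region_alloc(&r, sizeof *count);
    double *ratio = region_alloc(&r, sizeof *ratio);
    assert(count != NULL && ratio != NULL);
    *count = 42;
    *ratio = 2.5;
    return region_free(&r);
}
```

Note that the sketch has no operation for freeing a single object: the only way to release `count` or `ratio` is to release the whole region.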

Cyclone has four kinds of region:
Stack regions
As in C, local variables are allocated on the runtime stack; the stack grows when a block is entered, and it shrinks when the block exits. We call the area on the stack allocated for the local variables of a block the stack region of the block. A stack region has a fixed size---it is just large enough to hold the locals of the block, and no more objects can be placed into it. The region is deallocated when the block containing the declarations of the local variables finishes executing. With respect to regions, the parameters of a function are considered locals---when a function is called, its actual parameters are placed in the same stack region as the variables declared at the start of the function.

Lexical regions
Cyclone also has lexical regions, which are so named because, like stack regions, their lifetime is delimited by the surrounding scope. Unlike stack regions, however, you can add new objects to a lexical region over time. You create a lexical region in Cyclone with a statement,
  region  identifier;  statement
This declares and allocates a new lexical region, named identifier, and executes statement. After statement finishes executing, the region is deallocated. Within statement, objects can be added to the region, as we will explain below.

Typically, statement is a compound statement:
  { region identifier;
     statement1
    ...
     statementn
  }


The heap region
Cyclone has a special region called the heap. There is only one heap, whose type is denoted `H, and it is never deallocated. New objects can be added to the heap at any time (the heap can grow). Cyclone uses a garbage collector to automatically remove objects from the heap when they are no longer needed. You can think of garbage collection as an optimization that tries to keep the size of the heap small. (Alternatively, you can avoid garbage collection altogether by specifying the -nogc flag when building the executable.)

Dynamic regions
Stack and lexical regions obey a strictly last-in-first-out (LIFO) lifetime discipline. This is often convenient for storing temporary data, but sometimes, the lifetime of data cannot be statically determined. Such data can be allocated in a dynamic region. A dynamic region supports deallocation at (essentially) any program point. However, before the data in a dynamic region may be accessed, the dynamic region must be opened. The open operation fails by throwing an exception if the dynamic region has already been freed. Note that each data access within a dynamic region does not require a check. Rather, you can open a given dynamic region once, access the data many times with no additional cost, and then exit the scope of the open. Thus, dynamic regions amortize the cost of checking whether or not data are still live and localize failure points. We describe dynamic regions in detail in Section 9.7.
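The "check once at open, then access freely" behavior of dynamic regions can be modeled in plain C. This is a sketch under stated assumptions, not Cyclone's implementation: `dyn_region`, `dyn_open`, and the other names are hypothetical, and the failed open returns NULL where Cyclone would throw an exception.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical dynamic region holding an array of ints. */
struct dyn_region {
    int *data;        /* objects stored in the region */
    size_t len;
    bool live;        /* cleared when the region is freed */
    unsigned checks;  /* number of liveness checks performed so far */
};

/* Create the region and fill it with 1, 2, ..., len. */
static bool dyn_create(struct dyn_region *k, size_t len) {
    k->data = malloc(len * sizeof *k->data);
    k->len = len;
    k->live = (k->data != NULL);
    k->checks = 0;
    for (size_t i = 0; k->live && i < len; i++) k->data[i] = (int)i + 1;
    return k->live;
}

/* Deallocation may happen at essentially any point. */
static void dyn_free(struct dyn_region *k) {
    free(k->data);
    k->data = NULL;
    k->live = false;
}

/* The single liveness check: NULL stands in for Cyclone's exception. */
static int *dyn_open(struct dyn_region *k) {
    k->checks++;
    return k->live ? k->data : NULL;
}

/* Open once, then read every element with no further checks. */
static bool dyn_sum(struct dyn_region *k, long *out) {
    int *p = dyn_open(k);
    if (p == NULL) return false;   /* the only failure point */
    long total = 0;
    for (size_t i = 0; i < k->len; i++) total += p[i];
    *out = total;
    return true;
}

/* Returns 0 when the model behaves as described in the text. */
static int dyn_demo(void) {
    struct dyn_region k;
    long total = -1;
    if (!dyn_create(&k, 4)) return -1;
    bool first = dyn_sum(&k, &total);    /* region live: sum is 10 */
    dyn_free(&k);
    bool second = dyn_sum(&k, &total);   /* region freed: open fails */
    return (first && !second && total == 10 && k.checks == 2) ? 0 : 1;
}
```

In C nothing stops the pointer returned by `dyn_open` from being kept and used after `dyn_free`; the point of Cyclone's design is that the type system confines such pointers to the scope of the open.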

Cyclone forbids dereferencing dangling pointers. This restriction is part of the type system: it's a compile-time error if a dangling pointer (a pointer into a deallocated region or to a deallocated object) might be dereferenced. There are no run-time checks of the form, ``is this pointing into a live region?'' As explained below, each pointer type has a region and objects of the type may only point into that region.

8.2  Allocation

You can create a new object on the heap using one of a few kinds of expression, built from the keywords new, malloc, and calloc. Objects within regions can be created using the analogous expressions built from rnew, rmalloc, and rcalloc. Note that all six of these are keywords.

In each of the region forms, the first argument specifies a region handle. The Cyclone library has a global variable Core::heap_region, which is a handle for the heap region. So, for example, rnew(heap_region) expr allocates memory in the heap region and initializes it with expr. Moreover, new expr can be replaced with rnew(heap_region) expr.
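The idea that the heap is reached through an ordinary handle, so that new expr is just rnew(heap_region) expr, can be modeled in plain C. Everything named here (`handle`, `heap_handle`, `r_malloc`, `r_new_int`, `new_int`) is a hypothetical stand-in and not part of the Cyclone library, and the explicit `free` calls exist only because C has no garbage collector or region construct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical handle: a fixed-capacity bump allocator, or the heap. */
struct handle {
    unsigned char *base;  /* NULL marks the heap handle */
    size_t used;
    size_t cap;
};

/* Stand-in for Core::heap_region in this model. */
static struct handle heap_handle = { NULL, 0, 0 };

/* Analogue of rmalloc: the first argument says where the object goes. */
static void *r_malloc(struct handle *h, size_t size) {
    if (h->base == NULL) return malloc(size);      /* heap handle */
    size_t a = _Alignof(max_align_t);
    size_t start = (h->used + a - 1) / a * a;      /* round up for alignment */
    if (start > h->cap || size > h->cap - start) return NULL;
    h->used = start + size;
    return h->base + start;
}

/* Analogue of `rnew(h) expr` for an int: allocate, then initialize. */
static int *r_new_int(struct handle *h, int value) {
    int *p = r_malloc(h, sizeof *p);
    if (p != NULL) *p = value;
    return p;
}

/* Analogue of `new expr`: the same operation with the heap handle. */
static int *new_int(int value) {
    return r_new_int(&heap_handle, value);
}

/* Allocates one int in a region and one on the heap, returns their sum. */
static int alloc_demo(void) {
    struct handle r = { malloc(64), 0, 64 };
    if (r.base == NULL) return -1;
    int *a = r_new_int(&r, 7);
    int *b = new_int(35);
    int sum = (a != NULL && b != NULL) ? *a + *b : -1;
    free(b);        /* C model only: Cyclone's heap is garbage-collected */
    free(r.base);   /* releases the region, and `a` along with it */
    return sum;
}
```

The design point is that `new_int` adds nothing of its own; it only fixes the handle argument, which mirrors the equivalence stated above.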

The only way to create an object in a stack region is declaring it as a local variable. Cyclone does not currently support salloc; use a lexical region instead.

8.3  Common Uses

Although the type system associated with regions is complicated, there are some simple common idioms. If you understand these idioms, you should be able to easily write programs using regions, and port many legacy C programs to Cyclone. The next subsection will explain the usage patterns of unique and reference-counted pointers, since they are substantially more restrictive than other pointers.

Remember that every pointer points into a region, and although the pointer can be updated, it must always point into that same region (or a region known to outlive that region). The region that the pointer points to is indicated in its type, but omitted regions are filled in by the compiler according to context.

When regions are omitted from pointer types in function bodies, the compiler tries to infer the region. However, it can sometimes be too ``eager'' and end up rejecting code. For example, in
void f1(int x) {
  int * y = new 42;
  y = &x;
}
the compiler uses y's initializer to decide that y's type is int * `H. Hence the assignment is illegal, because the parameter's region (called `f1) does not outlive the heap. On the other hand, this function type-checks:
void f2(int x) {
  int * y = &x;
  y = new 42;
}
because y's type is inferred to be int * `f2 and the assignment makes y point into a region that outlives `f2. We can fix our first function by being more explicit:
void f1(int x) {
  int *`f1 y = new 42;
  y = &x;
}
Function bodies are the only places where the compiler tries to infer the region by how a pointer is used. In function prototypes, type declarations, and top-level global declarations, the rules for the meaning of omitted region annotations are fixed. This is necessary for separate compilation: we often have no information other than the prototype or declaration.

In the absence of region annotations, function-parameter pointers are assumed to point into any possible region. Hence, given
void f(int * x, int * y);
we could call f with two stack pointers, a lexical-region pointer and a heap-pointer, etc. Hence this type is the ``most useful'' type from the caller's perspective. But the callee's body (f) may not type-check with this type. For example, x cannot be assigned a heap pointer because we do not know that x points into the heap. If this is necessary, we must give x the type int *`H. Other times, we may not care what region x and y are in so long as they are the same region. Again, our prototype for f does not indicate this, but we could rewrite it as
void f(int *`r x, int *`r y);
Finally, we may need to refer to the region for x or y in the function body. If we omit the names (relying on the compiler to make up names), then we obviously won't be able to do so.

Formally, omitted regions in function parameters are filled in by fresh region names and the function is ``region polymorphic'' over these names (as well as all explicit regions).

In the absence of region annotations, function-return pointers are assumed to point into the heap. Hence the following function will not type-check:
int * f(int * x) { return x; }
Both of these functions will type-check:
int * f(int *`H x) { return x; }
int *`r f(int *`r x) {return x; }
The second one is more useful because it can be called with any region.

In type declarations (including typedef) and top-level variables, omitted region annotations are assumed to point into the heap. In the future, the meaning of typedef may depend on where the typedef is used. In the meantime, the following code will type-check because it is equivalent to the first function in the previous example:
typedef int * foo_t;
foo_t f(foo_t x) { return x; }
If you want to write a function that creates new objects in a region determined by the caller, your function should take a region handle as one of its arguments. The type of a handle is region_t<`r>, where `r is the region information associated with pointers into the region. For example, this function allocates a pair of integers into the region whose handle is r:
  $(int,int)*`r f(region_t<`r> r, int x, int y) { 
     return rnew(r) $(x,y);
  }
Notice that we used the same `r for the handle and the return type. We could have also passed the object back through a pointer parameter like this:
  void f2(region_t<`r> r,int x,int y,$(int,int)*`r *`s p){ 
    *p = rnew(r) $(7,9); 
  }
Notice that we have been careful to indicate that the region where *p lives (corresponding to `s) may be different from the region for which r is the handle (corresponding to `r). Here's how to use f2:
  { region rgn;
    $(int,int) *`rgn x = NULL; 
    f2(rgn,3,4,&x);
  }
The `s and `rgn in our example are unnecessary because they would be inferred.

typedef, struct, and datatype declarations can all be parameterized by regions, just as they can be parameterized by types. For example, here is part of the list library.
  struct List<`a,`r>{`a hd; struct List<`a,`r> *`r tl;};
  typedef struct List<`a,`r> *`r list_t<`a,`r>;

  // return a fresh copy of the list in r2
  list_t<`a,`r2> rcopy(region_t<`r2> r2, list_t<`a> x) {
    list_t result, prev;

    if (x == NULL) return NULL;
    result = rnew(r2) List{.hd=x->hd,.tl=NULL};
    prev = result;
    for (x=x->tl; x != NULL; x=x->tl) {
      prev->tl = rnew(r2) List{.hd=x->hd,.tl=NULL};
      prev = prev->tl;
    }
    return result;
  }  
  list_t<`a> copy(list_t<`a> x) {
    return rcopy(heap_region, x);
  }

  // Return the length of a list. 
  int length(list_t x) {
    int i = 0;
    while (x != NULL) {
      ++i;
      x = x->tl;
    }
    return i;
  }
The type list_t<type,rgn> describes pointers to lists whose elements have type type and whose ``spines'' are in rgn.

The functions are interesting for what they don't say. Specifically, when types and regions are omitted from a type instantiation, the compiler uses rules similar to those used for omitted regions on pointer types. More explicit versions of the functions would look like this:
  list_t<`a,`r2> rcopy(region_t<`r2> r2, list_t<`a,`r1> x) {
    list_t<`a,`r2> result, prev;
    ...
  }
  list_t<`a,`H> copy(list_t<`a,`r> x) { ... }
  int length(list_t<`a,`r> x) { ... }

8.4  Type-Checking Regions

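Because of recursive functions, there can be any number of live regions at run time. The compiler uses the following general strategy to ensure that only pointers into live regions are dereferenced: it gives regions compile-time region names, decorates each pointer type and handle type with region names, associates with each program point a capability (a set of region names), and requires that each dereference or allocation use only region names that are in the current capability. This strategy is probably too vague to make sense at this point, but it may help to refer back to it as we explain specific aspects of the type system.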

Note that in the rest of the documentation (and in common parlance) we abuse the word ``region'' to refer both to region names and to run-time collections of objects. Similarly, we confuse a block of declarations, its region-name, and the run-time space allocated for the block. (With loops and recursive functions, ``the space allocated'' for the block is really any number of distinct regions.) But in the rest of this section, we painstakingly distinguish region names, regions, etc.

8.4.1  Region Names

Given a function, we associate a distinct region name with each program point that creates a region, such as a block or a region construct. The region name for the heap is `H. Region names associated with program points within a function should be distinct from each other, distinct from any region names appearing in the function's prototype, and should not be `H. (So you cannot use H as a label name, for example.) Because the function's return type cannot mention a region name for a block or region construct in the function, it is impossible to return a pointer to deallocated storage.

In region r <`r> s, region r s, and region r = open(k) s, the type of r is region_t<`r> (assuming that k has type region_key_t<`r,_>). In other words, the handle is decorated with the region name for the construct. Pointer types' region names are explicit, although you generally rely on inference to put in the correct one for you.

8.4.2  Capabilities

In the absence of explicit effects (see below), the capability for a program point consists of the region names known to be live there, including `H and the region names that appear in the function's prototype. For each dereference or allocation operation, we simply check that the region name for the type of the object is in the capability. It takes extremely tricky code (such as existential region names) to make the check fail.
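The capability check can be pictured as a set-membership test. The following C sketch is only a model of that idea, not the compiler's algorithm: region names are represented as bits, and `access_ok`, `enter`, `leave`, and the `RGN_*` constants are hypothetical. It also assumes that a region name joins the capability when its block or region construct is entered and leaves it on exit.

```c
#include <assert.h>
#include <stdbool.h>

/* One bit per region name; a capability is a set of such bits. */
typedef unsigned region_set;
enum { RGN_H = 1, RGN_F = 2, RGN_R = 4 };   /* `H, a function block `f, a region `r */

/* A dereference or allocation is accepted only if every region name
   on the type is present in the capability. */
static bool access_ok(region_set type_regions, region_set capability) {
    return (type_regions & ~capability) == 0;
}

/* Assumed behavior: entering a construct adds its name, leaving removes it. */
static region_set enter(region_set cap, region_set name) { return cap | name; }
static region_set leave(region_set cap, region_set name) { return cap & ~name; }

/* Records four checks as bits 0..3 of the result. */
static int capability_demo(void) {
    region_set cap = RGN_H | RGN_F;            /* heap plus the function's block */
    int results = 0;
    results |= access_ok(RGN_F, cap) << 0;     /* accepted */
    results |= access_ok(RGN_R, cap) << 1;     /* rejected: `r not in scope yet */
    cap = enter(cap, RGN_R);
    results |= access_ok(RGN_R, cap) << 2;     /* accepted inside the region */
    cap = leave(cap, RGN_R);
    results |= access_ok(RGN_R, cap) << 3;     /* rejected after the region ends */
    return results;
}
```

In Cyclone all of this happens at compile time, so the rejected cases are type errors and not run-time failures.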

8.4.3  Type Declarations

Each pointer and region handle type is decorated with a set of region names. This set of region names is referred to as the store or effect qualifier of a type. For instance, a pointer type might be int *`r+`H. This is the type of a pointer to an integer that resides in the region named `r or in the heap region, `H. Similarly, a region handle type might be region_t<`r1+`r2>, which indicates a handle to the region named `r1 or `r2.

A struct, typedef, or datatype declaration may be parameterized by any number of effect qualifiers. The store qualifiers are placed in the list of type parameters. They may be followed by their kind; i.e. ``::E''. For example, given
  struct List<`a,`r>{`a hd; struct List<`a,`r> *`r tl;};
the type struct List<int,`H> is for a list of ints in the heap, while the type struct List<int,`l> is for a list of ints in some lexical region. Notice that all of the ``cons cells'' of the List will be in the same region (the type of the tl field uses the same region name `r that is used to instantiate the recursive instance of struct List<`a,`r>). However, we could instantiate `a with a pointer type that has a different region name.

8.4.4  Subtyping and Effect Qualifiers

A pointer type's effect qualifier is part of the type. If e1 and e2 are pointers, then e1 = e2 is well-typed only if all the region names mentioned in the effect qualifier of e2's type also appear in the effect qualifier of e1's type. For instance, both assignments to b below are legal, while the assignment to a is not.
  void foo(int *`r a) {
    int *`r+`H b = a;
    if(!a) b = new 1;
    a = b;
  }
The store qualifier in the type of b indicates that it can point into the region named `r or into the heap region. Therefore, initializing b with a pointer into `r is certainly consistent with its store qualifier. Similarly, the second assignment to b is legal since it is updated to point to the heap region. However, the assignment to a is not permitted, since the declared type of a claims that it is a pointer into the region named `r alone and b may actually point to the heap.

For handles, if `r is a region name, there is at most one value of type region_t<`r> (there are 0 if `r is a block's name), so there is little use in creating variables of type region_t<`r>. However, it could be useful to create variables of type region_t<`r1+`r2>. This is the type of handle to either the region named `r1 or to the region named `r2. This is illustrated by the following code:
  void foo(int a, region_t<`r> h) {
    region_t<`r+`H> rh = h;
    if(a) {
      rh = heap_region;
    }
  }
As always, this form of subtyping for effect qualifiers is not permitted under a reference. Thus, the assignment in the program below is not legal.
  void foo(int *`H *`r a) {
    int *`r+`H *`r b = a;
  }
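The assignment rule of this section amounts to a subset test on effect qualifiers, with exact equality required under a reference. The C sketch below models that reading with bit sets; it is not taken from the compiler, and `assign_ok`, `assign_under_ref_ok`, and the `RGN_*` constants are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* One bit per region name; a qualifier is a set of such bits. */
typedef unsigned qualifier;
enum { RGN_H = 1, RGN_R = 2 };   /* `H and `r */

/* e1 = e2 on plain pointers: every region named in e2's qualifier
   must also be named in e1's qualifier. */
static bool assign_ok(qualifier dst, qualifier src) {
    return (src & ~dst) == 0;
}

/* Under a reference no widening is allowed, so the inner
   qualifiers have to be identical. */
static bool assign_under_ref_ok(qualifier dst_inner, qualifier src_inner) {
    return dst_inner == src_inner;
}

/* Replays the examples from the text; each verdict becomes one bit. */
static int subtyping_demo(void) {
    qualifier a = RGN_R;            /* int *`r a     */
    qualifier b = RGN_R | RGN_H;    /* int *`r+`H b  */
    int bits = 0;
    bits |= assign_ok(b, a) << 0;                              /* b = a: legal     */
    bits |= assign_ok(b, RGN_H) << 1;                          /* b = new 1: legal */
    bits |= assign_ok(a, b) << 2;                              /* a = b: rejected  */
    bits |= assign_under_ref_ok(RGN_R | RGN_H, RGN_H) << 3;    /* int *`r+`H *`r b = a: rejected */
    return bits;
}
```

The same subset test covers handles: a handle of type region_t<`r> may be stored in a variable of type region_t<`r+`H>, but not the other way around.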

8.4.5  Function Calls

If a function parameter or result has type int *`r or region_t<`r>, the function is polymorphic over the region name `r. That is, the caller can instantiate `r with any region in the caller's current capability as long as the region has the correct kind. This instantiation is usually implicit, so the caller just calls the function and the compiler uses the types of the actual arguments to infer the instantiation of the region names (just like it infers the instantiation of type variables).

The callee is checked knowing nothing about `r except that it is in its capability (plus whatever can be determined from explicit outlives assumptions), and that it has the given kind. For example, it will be impossible to assign a parameter of type int*`r to a global variable. Why? Because the global would have to have a type that allowed it to point into any region. There is no such type because we could never safely follow such a pointer (since it could point into a deallocated region).

8.4.6  Explicit and Default Effects

If you are not using existential types, you now know everything you need to know about Cyclone regions and memory management. Even if you are using these types and functions over them (such as the closure library in the Cyclone library), you probably don't need to know much more than ``provide a region that the hidden types outlive''.

The problem with existential types is that when you ``unpack'' the type, you no longer know that the regions into which the fields point are allocated. We are sound because the corresponding region names are not in the capability, but this makes the fields unusable. To make them usable, we do not hide the capability needed to use them. Instead, we use a region bound that is not existentially bound.

If the contents of existential packages contain only heap pointers, then `H is a fine choice for a region bound.

These issues are discussed in Section 13.

