Most modern programming languages support the dynamic creation and deletion of data structures (also called objects). A chunk of memory is allocated for a new structure when requested. When a structure is no longer needed, the corresponding chunk of memory may be reused for other purposes.
When new objects are repeatedly created but unused ones are not discarded, long-running programs eventually run out of memory. This is called a memory leak. Fortunately, this error is relatively easy to detect by monitoring memory usage and printing statistics about the types of allocated and deallocated objects.
The other type of error is discarding objects that are still in use, which causes memory corruption. For example, a Color object may be created and used by two Window objects. When discarding one of the Window objects, the program may erroneously discard the referenced Color object as well; when the remaining Window object then accesses the discarded Color object, the results are unpredictable.
Indeed, discarded objects may not be reused immediately and may be accessed a number of times without apparent problem. Eventually, the associated memory is reallocated to another object. Subsequent accesses will return incorrect values, which may cause considerable damage if used as pointers, for example.
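The Color and Window scenario can be imitated with a toy slot pool. This is only a sketch: Pool, alloc and free are hypothetical names, and a slot index stands in for a pointer.

```python
class Pool:
    """Toy fixed-size heap; a slot index plays the role of a pointer."""

    def __init__(self, size):
        self.slots = [None] * size
        self.free_list = list(range(size - 1, -1, -1))  # slot 0 is handed out first

    def alloc(self, value):
        index = self.free_list.pop()
        self.slots[index] = value
        return index

    def free(self, index):
        # The stale contents are left in place until the slot is reused.
        self.free_list.append(index)


pool = Pool(4)
color = pool.alloc("red")                      # shared by two windows
window_a = {"color": color}
window_b = {"color": color}

pool.free(window_a["color"])                   # bug: window_b still uses the color
stale_read = pool.slots[window_b["color"]]     # old value is still there

font = pool.alloc("Helvetica 12")              # the freed slot is reused
corrupt_read = pool.slots[window_b["color"]]   # window_b now reads a font
```

The first read after the erroneous free still succeeds, which is why such bugs can go unnoticed until the slot is handed out again.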
The only way to prevent such errors is to have the system automatically determine which objects are no longer in use. This is called garbage collection.
Objects may first be allocated consecutively in memory. Since each object may have a different size, the size must be attached to the object. Often, when an object of n bytes is requested, n+4 bytes are reserved: the first 4 bytes store the size, and the address of the remaining n bytes is returned.
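The size-header layout can be sketched over a plain byte array. The names BumpHeap, alloc and size_of are illustrative, and the little-endian 32-bit header is an assumption; the text only says that 4 bytes hold the size.

```python
import struct

HEADER = 4  # bytes used to record the object size, as in the text


class BumpHeap:
    def __init__(self, capacity):
        self.memory = bytearray(capacity)
        self.top = 0  # offset of the first unused byte

    def alloc(self, n):
        """Reserve n + 4 bytes and return the offset of the n usable bytes."""
        if self.top + HEADER + n > len(self.memory):
            raise MemoryError("heap exhausted")
        struct.pack_into("<I", self.memory, self.top, n)  # write the size header
        address = self.top + HEADER
        self.top = address + n
        return address

    def size_of(self, address):
        """Read back the size stored just before the object."""
        return struct.unpack_from("<I", self.memory, address - HEADER)[0]


heap = BumpHeap(64)
a = heap.alloc(10)
b = heap.alloc(20)
```

Because the size sits immediately before the returned address, a later discard operation needs only the address to learn how many bytes are being given back.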
When an object is discarded, the size information is checked. A list of discarded objects is maintained to make them available for reuse. At the next allocation, the list of discarded objects is examined to find an object of the correct size. The first object of the correct or greater size may be selected. Alternatively, the object with the closest matching size (still greater than or equal to the request) may be selected. In either case, allocation takes quite some time and the memory becomes more and more fragmented as discarded objects are split to allocate slightly smaller objects. Checking for contiguous free memory among the discarded objects, in order to regroup them into larger, less fragmented blocks, is a possibility but adds to the allocation time.
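The two selection policies described above can be compared on a small list of discarded blocks. This is a simplified sketch: blocks are (address, size) pairs, the size header is ignored, and first_fit, best_fit and allocate are hypothetical names.

```python
def first_fit(free_blocks, n):
    """Index of the first block of size >= n, or None."""
    for i, (_, size) in enumerate(free_blocks):
        if size >= n:
            return i
    return None


def best_fit(free_blocks, n):
    """Index of the smallest block of size >= n, or None."""
    candidates = [(size, i) for i, (_, size) in enumerate(free_blocks) if size >= n]
    return min(candidates)[1] if candidates else None


def allocate(free_blocks, n, choose):
    """Take n bytes from a free block; return its address, or None.

    The unused tail of the chosen block goes back on the list, which is
    how small fragments accumulate over time.
    """
    i = choose(free_blocks, n)
    if i is None:
        return None
    address, size = free_blocks.pop(i)
    if size > n:
        free_blocks.append((address + n, size - n))
    return address


blocks = [(0, 64), (100, 24), (200, 32)]
first_list, best_list = list(blocks), list(blocks)
first = allocate(first_list, 20, first_fit)   # takes the 64-byte block
best = allocate(best_list, 20, best_fit)      # takes the 24-byte block
```

Both policies scan the list, which is the source of the allocation cost mentioned above; best fit leaves a 4-byte fragment here, while first fit carves up the largest block.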
More elaborate algorithms achieve better results. For example, objects may be allocated in a limited number of fixed sizes corresponding to the powers of 2 (8, 16, 32, 64, ..., 1 GB, 2 GB, 4 GB). Furthermore, these objects may be aligned on addresses that are a multiple of their size and be created in contiguous pairs. Finding unused objects of the right size is easy because the number of sizes is limited, and a pair of smaller objects is obtained by splitting an object of the next larger size in two. Finding contiguous discarded objects to merge also becomes easier since only contiguous pairs of the same size exist.
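The power-of-two scheme is commonly known as a buddy system. The following is a minimal sketch under stated assumptions: the smallest block is 8 bytes, the heap starts at address 0, and BuddyHeap, order_for and buddy_of are illustrative names.

```python
MIN_ORDER = 3  # smallest block: 2**3 = 8 bytes


def order_for(n):
    """Smallest order k >= MIN_ORDER with 2**k >= n."""
    k = MIN_ORDER
    while (1 << k) < n:
        k += 1
    return k


def buddy_of(address, order):
    """Blocks are size-aligned, so a block and its buddy differ in one bit."""
    return address ^ (1 << order)


class BuddyHeap:
    def __init__(self, max_order):
        self.max_order = max_order
        self.free = {k: set() for k in range(MIN_ORDER, max_order + 1)}
        self.free[max_order].add(0)  # one large free block at address 0

    def alloc(self, n):
        want = order_for(n)
        k = want
        while k <= self.max_order and not self.free[k]:
            k += 1  # look for a larger block to split
        if k > self.max_order:
            raise MemoryError("no block large enough")
        address = min(self.free[k])
        self.free[k].remove(address)
        while k > want:  # split in two, keep the lower half
            k -= 1
            self.free[k].add(address + (1 << k))
        return address, want

    def release(self, address, order):
        # Merge with the buddy for as long as the buddy is also free.
        while order < self.max_order:
            buddy = buddy_of(address, order)
            if buddy not in self.free[order]:
                break
            self.free[order].remove(buddy)
            address = min(address, buddy)
            order += 1
        self.free[order].add(address)


heap = BuddyHeap(max_order=6)   # 64 bytes in total
a = heap.alloc(10)              # rounded up to a 16-byte block
b = heap.alloc(5)               # rounded up to an 8-byte block
heap.release(*a)
heap.release(*b)
```

Alignment is what makes merging cheap: the buddy's address is found with a single exclusive-or rather than a search through the free list.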
Even with these more elaborate algorithms, dynamic memory allocation is somewhat expensive. Several programs using dynamically allocated linked lists spend almost half of their processing time in the memory allocation routines. Moreover, at least 4 bytes of information are added to each allocated object and the size is rounded up to the next power of two. The resulting size is thus between n+4 and 2(n+4)-1 bytes, an overhead of between 4 and n+7 bytes. If request sizes are spread evenly between two consecutive powers of two, blocks are on average about three-quarters full. In other words, on average, approximately 1/4 of the memory used for dynamic allocation is wasted.
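The 1/4 figure can be checked numerically. The sketch below assumes a 4-byte header, a smallest block of 8 bytes, and request sizes spread evenly across one power-of-two range; block_size and wasted_fraction are hypothetical helper names.

```python
HEADER = 4


def block_size(n):
    """Power-of-two block that holds n payload bytes plus the header."""
    size = 8
    while size < n + HEADER:
        size *= 2
    return size


def wasted_fraction(requests):
    """Share of the allocated memory that holds no payload bytes."""
    requests = list(requests)
    allocated = sum(block_size(n) for n in requests)
    return (allocated - sum(requests)) / allocated


# Payload sizes for which n + 4 covers 1025..2048 evenly.
fraction = wasted_fraction(range(1021, 2045))
```

Under these assumptions the wasted share comes out close to one quarter; a different distribution of request sizes would give a different figure.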
To ensure that objects are never discarded while still in use, objects should be retained as long as they are still reachable. The memory management system must then identify the objects that are unreachable and discard them. It is not necessary to discard the objects immediately when they become unreachable; it may be done somewhat later.
While garbage collection prevents memory corruption, memory leaks are still possible. Pointers to objects that are no longer needed must still be cleared; otherwise, the object remains reachable and cannot be discarded. Fortunately, memory leaks are relatively easy to track down.
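A leak of this kind can be observed in a garbage-collected language. The sketch below uses Python's standard weakref and gc modules; the Image class and the cache table are hypothetical, and the immediate reclamation shown relies on calling gc.collect() explicitly.

```python
import gc
import weakref


class Image:
    """Stand-in for a large object."""


cache = {}                      # long-lived table that is never pruned

image = Image()
probe = weakref.ref(image)      # observes the object without keeping it alive
cache["logo"] = image
del image                       # the program no longer needs the image...

gc.collect()
alive_while_cached = probe() is not None   # ...but the cache still reaches it

cache.clear()                   # clearing the last pointer makes it unreachable
gc.collect()
freed_after_clear = probe() is None
```

The collector cannot know that the cached entry is unneeded; only clearing the pointer makes the object eligible for collection.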
There are several algorithms for garbage collection. Most start from the local and global variables and follow all the pointers recursively to determine all the reachable objects. For this, each object should have a type tag identifying it, instead of simply its size. This way, the object type reveals its size as well as the locations of pointers within its structure.
The reachable objects may be marked and then the memory scanned for unmarked and thus unreachable objects to discard. In that case, a list of discarded objects must be maintained and fragmentation may be a problem.
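A mark-and-sweep pass can be sketched on a toy object graph. Obj, mark and sweep are illustrative names; a real collector would find the pointers through the type tag rather than through a refs list.

```python
class Obj:
    def __init__(self, name):
        self.name = name
        self.refs = []        # pointers to other objects
        self.marked = False


def mark(roots):
    """Mark every object reachable from the roots (iterative depth-first)."""
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if not obj.marked:
            obj.marked = True
            stack.extend(obj.refs)


def sweep(heap):
    """Return (survivors, discarded) and clear the marks for the next cycle."""
    survivors, discarded = [], []
    for obj in heap:
        (survivors if obj.marked else discarded).append(obj)
        obj.marked = False
    return survivors, discarded


a, b, c, d = (Obj(name) for name in "abcd")
a.refs.append(b)
c.refs.append(d)
d.refs.append(c)              # c and d point at each other but are unreachable
heap = [a, b, c, d]

mark(roots=[a])
heap, free_list = sweep(heap)
```

Note that the cycle formed by c and d is discarded even though each object is still pointed to, because neither can be reached from a root.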
Another strategy is to copy the reachable objects into another, initially empty memory region. The pointers must be adjusted to reflect the new position of the copied objects. The initial memory region can then be freed for reuse. One advantage of this method is that it avoids fragmentation and may increase locality of reference. The disadvantage is that objects are moved and pointers must be adjusted. Furthermore, addresses cannot be used as immutable object identifiers, for example as hash codes in tables. Extra memory is also consumed while copying objects to the new memory region.
It is possible, however, to have several memory regions, each containing successively older objects. In most cases, new objects have a greater probability of being discarded quickly, so the region containing new objects should be collected more often.
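The copying strategy can be sketched as follows, in the spirit of Cheney's breadth-first algorithm. The list to_space stands in for the new memory region, the forward table records where each object has moved, and all names are hypothetical.

```python
class Obj:
    def __init__(self, name, refs=()):
        self.name = name
        self.refs = list(refs)


def copy_collect(roots):
    """Copy everything reachable from roots into a fresh region.

    Returns (new_roots, to_space). The forward table maps each old object
    to its copy so that shared objects are copied only once.
    """
    to_space = []
    forward = {}

    def copy(obj):
        if id(obj) not in forward:
            clone = Obj(obj.name, obj.refs)  # refs still point to old objects
            forward[id(obj)] = clone
            to_space.append(clone)
        return forward[id(obj)]

    new_roots = [copy(r) for r in roots]
    scan = 0
    while scan < len(to_space):              # scan pointer chases the copies
        clone = to_space[scan]
        clone.refs = [copy(r) for r in clone.refs]  # adjust the pointers
        scan += 1
    return new_roots, to_space


leaf = Obj("leaf")
left = Obj("left", [leaf])
right = Obj("right", [leaf])
root = Obj("root", [left, right])
garbage = Obj("garbage", [root])   # points at live data but is unreachable

new_roots, to_space = copy_collect([root])
```

Only reachable objects are ever touched, so the cost is proportional to the live data rather than to the size of the region being vacated.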
Garbage collection is required to prevent memory corruption errors. It also simplifies the decoupling between libraries and applications; there is no debate about whether the calling program or the library allocates and discards the objects used to return results. It also relieves the programmer of memory management tasks such as keeping track of how many Window objects reference a Color object to know when the Color object may be discarded.
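The bookkeeping that garbage collection makes unnecessary looks roughly like the manual reference count below. This is a sketch only: ref_count, discarded and close are hypothetical names, and setting a flag stands in for actually freeing memory.

```python
class Color:
    def __init__(self, name):
        self.name = name
        self.ref_count = 0
        self.discarded = False


class Window:
    def __init__(self, color):
        self.color = color
        color.ref_count += 1        # bookkeeping the programmer must remember

    def close(self):
        self.color.ref_count -= 1
        if self.color.ref_count == 0:
            self.color.discarded = True   # stands in for freeing the memory
        self.color = None


red = Color("red")
w1, w2 = Window(red), Window(red)
w1.close()
still_usable = not red.discarded    # w2 still references the color
w2.close()
```

Every place that stores or drops a Color pointer must update the count correctly; forgetting an increment leads to corruption, and forgetting a decrement leads to a leak.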
The price to pay for this may be increased memory consumption, although traditional dynamic memory allocation routines have a similar overhead. More important is the garbage collection time, which is usually larger than the traditional dynamic allocation time. While this overhead was measured at around 10 to 20% in the past, recent implementations claim an almost negligible overhead, possibly below 5%. Nevertheless, this time overhead may be concentrated and disrupt the execution. An interactive application, for instance, may stop for garbage collection for 5 seconds every few minutes. Fortunately, incremental garbage collectors have been developed; they require more frequent but much shorter collections.
Run-time type identification is not available in most classical structured languages. Its inclusion is being debated for C++, and it is not yet available in most C++ compilers. The cost of run-time type identification is fairly minor. Indeed, each dynamically allocated object already stores its size and may also store a method table. Objects may instead store a pointer to a type descriptor. This type descriptor, shared by all objects of a given type, then stores the size, the method table, and possibly information about parent classes and fields (in particular, the pointers to follow during garbage collection).
The benefits of tagging each object with its type are numerous. It enables garbage collection, supports writing objects to and reading them from files, and allows checking the type of objects at run time whenever appropriate.
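A shared type descriptor might look like the sketch below. The field names (size, pointer_offsets, methods, parent) and the Shape and Window types are assumptions chosen for illustration, not a description of any particular language implementation.

```python
from dataclasses import dataclass


@dataclass
class TypeDescriptor:
    name: str
    size: int                   # object size in bytes
    pointer_offsets: tuple      # where pointers live inside the object
    methods: dict               # method table
    parent: "TypeDescriptor | None" = None

    def is_a(self, other):
        """Run-time type check that walks the chain of parent types."""
        t = self
        while t is not None:
            if t is other:
                return True
            t = t.parent
        return False


@dataclass
class HeapObject:
    type: TypeDescriptor        # one pointer replaces per-object size and methods
    fields: dict


SHAPE = TypeDescriptor("Shape", 8, (), {"area": lambda self: 0})
WINDOW = TypeDescriptor(
    "Window", 16, (8,),
    {"area": lambda self: self.fields["w"] * self.fields["h"]},
    parent=SHAPE,
)

win = HeapObject(WINDOW, {"w": 3, "h": 4})
area = win.type.methods["area"](win)   # dispatch through the descriptor
```

Each object carries a single pointer, while the size, method table and pointer layout are stored once per type; a collector would read pointer_offsets to know which fields to follow.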