Persistent Objects

Many programs read information from persistent storage, perform some processing and then write the result back to persistent storage. Defining the format of these input and output files on the persistent storage medium is tedious and cumbersome.

Proponents of the persistent programming paradigm view programs as having volatile objects, which exist only during the current execution, and persistent objects, which continue to exist between program executions. The persistent objects are automatically retrieved from the disk when they are actually accessed in the program, and modifications to these objects are automatically written back to disk as they happen or before the program terminates. This relieves the programmer of a number of low-level considerations but raises numerous interesting technical issues.

Translation between memory and disk representations

The automatic translation from the memory to the disk representation implies that each object has a type tag and that the run-time libraries have information about the type. This run-time information must list the fields, their position and their type. Simple fields like integers, floats or character strings are simply converted to a suitable disk representation (ASCII, or binary with a neutral byte ordering). Pointers must be replaced by an adequate disk representation, usually some unique integer identifier.
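The translation of simple fields can be sketched as follows. This is only an illustration under stated assumptions: the type table TYPE_TABLE, the tag value 1, the "Point" type and the encode and decode functions are hypothetical names, objects are modelled as dictionaries, and pointer fields are left out.

```python
import struct

# Hypothetical run-time type table: type tag -> (type name, [(field name, field kind)]).
TYPE_TABLE = {
    1: ("Point", [("x", "int"), ("y", "int"), ("label", "str")]),
}

def encode(tag, obj):
    """Write the type tag, then every field in the order given by the type table."""
    _, fields = TYPE_TABLE[tag]
    parts = [struct.pack("!H", tag)]          # "!" selects big-endian byte order
    for name, kind in fields:
        value = obj[name]
        if kind == "int":
            parts.append(struct.pack("!q", value))
        elif kind == "float":
            parts.append(struct.pack("!d", value))
        elif kind == "str":
            data = value.encode("utf-8")
            parts.append(struct.pack("!I", len(data)) + data)
        else:
            raise ValueError(f"unknown field kind: {kind}")
    return b"".join(parts)

def decode(buf):
    """Read the type tag, then use the type table to rebuild the fields."""
    (tag,) = struct.unpack_from("!H", buf, 0)
    offset = 2
    _, fields = TYPE_TABLE[tag]
    obj = {}
    for name, kind in fields:
        if kind == "int":
            (obj[name],) = struct.unpack_from("!q", buf, offset)
            offset += 8
        elif kind == "float":
            (obj[name],) = struct.unpack_from("!d", buf, offset)
            offset += 8
        elif kind == "str":
            (length,) = struct.unpack_from("!I", buf, offset)
            offset += 4
            obj[name] = buf[offset:offset + length].decode("utf-8")
            offset += length
        else:
            raise ValueError(f"unknown field kind: {kind}")
    return tag, obj
```

Because the byte order is fixed by the format string rather than by the host, a record written on one architecture can be read back on another.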

Several objects written together to disk are often called a pickle. If objects in a pickle contain pointers to other objects that are not in the pickle, these pointers have no meaning when the objects are read back, unless each object is manually assigned an identifier that is unique among all pickles and program invocations. For this reason, many languages, like Modula-3 and Eiffel, provide pickling procedures that receive an object as argument. This object and all those recursively reachable from it are included in the pickle, so there are no pointers to objects outside the pickle. Each object in the pickle is assigned a unique identifier. When a pickle is reloaded, all the pointers are converted to the addresses of the new objects.
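A minimal sketch of such a pickling procedure is given below. It is not the Modula-3 or Eiffel implementation; the Node class and the write_pickle and read_pickle functions are hypothetical, and the pickle is kept as an in-memory list of records rather than written to a file.

```python
class Node:
    """Hypothetical object with a value and pointers to other nodes."""
    def __init__(self, value):
        self.value = value
        self.links = []

def write_pickle(root):
    """Number every node reachable from root and replace pointers by those numbers."""
    ids = {}       # id(node) -> identifier inside the pickle
    order = []     # nodes listed in identifier order; the root gets identifier 0
    pending = [root]
    while pending:
        node = pending.pop()
        if id(node) in ids:
            continue               # already included (shared node or cycle)
        ids[id(node)] = len(order)
        order.append(node)
        pending.extend(node.links)
    return [(node.value, [ids[id(t)] for t in node.links]) for node in order]

def read_pickle(records):
    """Create the new objects first, then convert identifiers back to pointers."""
    nodes = [Node(value) for value, _ in records]
    for node, (_, link_ids) in zip(nodes, records):
        node.links = [nodes[i] for i in link_ids]
    return nodes[0]
```

Creating all the objects before resolving any identifier is what lets cycles and shared objects be restored correctly.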

Pointers to procedures are usually treated differently. The source code may not be available and the executable code cannot run on other architectures. Instead, the procedure name and argument types are stored (perhaps compressed into a hash code). Upon reloading, if the same procedure exists in the reading program, its address is used.
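The idea of storing a procedure by name can be sketched as follows. The functions procedure_to_disk and procedure_from_disk are hypothetical, and this sketch records only the module and procedure name, not the argument types.

```python
import importlib
import math

def procedure_to_disk(proc):
    """Store a procedure as (module name, procedure name) instead of its code."""
    return (proc.__module__, proc.__qualname__)

def procedure_from_disk(ref):
    """Look the name up in the reading program; None if no such procedure exists."""
    module_name, name = ref
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, name, None)

saved = procedure_to_disk(math.sqrt)      # a pair of strings, safe to write to disk
restored = procedure_from_disk(saved)     # the reading program's own math.sqrt
```

The reading program must decide what to do when the lookup fails; returning None here is only one possible choice.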

Lazy loading and unloading

Writing and reading objects explicitly, through the pickling mechanisms, is easier than defining input and output file formats. The next step in this direction is transparent loading on demand. Pointer variables are initialized with the desired globally unique object identifiers. Then, when (and if) these variables are accessed, the corresponding object is loaded from disk. This object in turn references other objects but these will be loaded only if accessed.

The advantage of this is that objects are accessed just as if they were all in memory although only the accessed objects are actually loaded. The problem is to implement this loading on demand efficiently and to assign unique global identifiers to all objects.

Among the possible implementations, three are mainly in use. In the first, each pointer to an object that may be on disk (often called a smart pointer) contains the unique identifier, the memory address and a flag indicating whether the memory address is known. Upon each access through the smart pointer, the flag is checked; if the memory address is known, the access proceeds. Otherwise, a table is searched to find whether the object has been loaded and, if so, its memory address; the memory address field of the smart pointer can then be set and used. If the object is not loaded, it gets loaded and inserted in the table.

The overhead of this implementation is one test for each access plus searching through the table and loading the object upon the first access. This is relatively simple to implement and the overhead is very small. To unload an object, all the smart pointers referring to it must have their flag reset to indicate that the memory address is invalid. Thus, each object must keep a list of the associated smart pointers.
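This first implementation might look like the sketch below. All names (ObjectStore, SmartPointer, locate, unload) are hypothetical, a dictionary stands in for the disk, and an object reference stands in for the memory address.

```python
class ObjectStore:
    """Hypothetical store: `disk` stands in for persistent storage."""
    def __init__(self, disk):
        self.disk = disk          # identifier -> saved state
        self.table = {}           # identifier -> object loaded in memory
        self.referrers = {}       # identifier -> smart pointers caching the address
        self.loads = 0            # number of loads from disk, for illustration

    def locate(self, oid):
        """Return the loaded object, loading it from disk on first use."""
        if oid not in self.table:
            self.table[oid] = dict(self.disk[oid])
            self.loads += 1
        return self.table[oid]

    def unload(self, oid):
        """Drop the object and invalidate every smart pointer that cached it."""
        for pointer in self.referrers.pop(oid, []):
            pointer.known = False
            pointer.address = None
        self.table.pop(oid, None)

class SmartPointer:
    def __init__(self, store, oid):
        self.store = store
        self.oid = oid            # unique identifier
        self.address = None       # stands in for the memory address
        self.known = False        # flag: is `address` valid?

    def deref(self):
        if not self.known:        # the one test paid on every access
            self.address = self.store.locate(self.oid)
            self.known = True
            self.store.referrers.setdefault(self.oid, []).append(self)
        return self.address
```

In this sketch the list of associated smart pointers is kept in the store, keyed by identifier, rather than inside each object; either placement serves the same purpose when unloading.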

The second possible implementation is to use a virtual address as object identifier within the program. This virtual address is the position where the object will be loaded, when (and if) it is required. Upon accessing a pointer, if the object is not there, a page fault occurs. The page fault interrupt service routine then loads the needed object at that address.

The overhead of this implementation is servicing the interrupt and loading the object upon the first access, and nothing for the subsequent accesses. Interrupt service routines are difficult to write and often non-portable. Furthermore, interrupt service routines are costly in CPU time on many operating systems. Therefore, even though this implementation has no overhead once the object is loaded, the higher loading cost often makes it no more effective than the preceding one. Unloading objects does not involve special actions. The virtual address remains reserved for the object in case it gets reloaded.

A third possible implementation is to only store the unique object identifier in the smart pointers. Each time an access is made, the table is searched to find the object in memory and load it if required. The overhead is one hash table access for each smart pointer access. Unloading objects only requires removing them from the table. This method is very simple to implement and very flexible but carries a relatively large overhead upon each access.
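For comparison, the third implementation can be sketched as follows, again with hypothetical names (IdTable, IdPointer, fetch) and a dictionary standing in for the disk.

```python
class IdTable:
    """Hypothetical table of loaded objects, keyed by unique identifier."""
    def __init__(self, disk):
        self.disk = disk          # identifier -> saved state
        self.loaded = {}          # identifier -> object in memory
        self.lookups = 0          # number of table searches, for illustration

    def fetch(self, oid):
        self.lookups += 1         # paid on every single access
        if oid not in self.loaded:
            self.loaded[oid] = dict(self.disk[oid])
        return self.loaded[oid]

    def unload(self, oid):
        self.loaded.pop(oid, None)    # no pointer needs to be invalidated

class IdPointer:
    """Smart pointer that stores only the identifier, never an address."""
    def __init__(self, table, oid):
        self.table = table
        self.oid = oid

    def deref(self):
        return self.table.fetch(self.oid)
```

Since no pointer ever caches an address, unloading cannot leave a stale reference behind, which is where the flexibility comes from.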

Persistence through reachability

In most implementations, persistent and volatile objects are different. Persistent objects have unique identifiers and are only accessed through smart pointers. Other implementations are more ambitious: objects become persistent simply by being reachable from the persistent roots. The persistent roots are variables declared as leading to persistent objects.

Thus, whenever a pointer to volatile objects is stored in a persistent object, these objects become persistent. The mechanisms involved are refinements of those presented above.
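The reachability rule itself can be illustrated with a short traversal. This sketch only marks objects; the Obj class, its persistent attribute and the promote function are hypothetical, and a real implementation would also assign identifiers and install smart pointers.

```python
class Obj:
    """Hypothetical object holding pointers to other objects."""
    def __init__(self, name):
        self.name = name
        self.refs = []
        self.persistent = False

def promote(roots):
    """Mark every object reachable from the persistent roots as persistent."""
    pending = list(roots)
    while pending:
        obj = pending.pop()
        if obj.persistent:
            continue              # already visited
        obj.persistent = True
        pending.extend(obj.refs)
```

An object that no persistent root leads to stays volatile and disappears when the program ends.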

Updating the persistent storage

When the program ends, all persistent objects may be written back to persistent storage. Similarly, when an object is unloaded, it may be written back to disk. Often, most objects are unmodified and there is no need to update them in the persistent store. In some implementations, every object has a flag indicating whether it was modified. This flag is set by all the state-modifying methods of the object. Then, only modified objects are written back, which saves some time.
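A per-object modification flag might be sketched as follows. The Account class, its deposit method and the write_back function are hypothetical examples, and a dictionary stands in for the persistent store.

```python
class Account:
    """Hypothetical persistent object with a modification flag."""
    def __init__(self, oid, balance):
        self.oid = oid
        self.balance = balance
        self.modified = False

    def deposit(self, amount):
        self.balance += amount
        self.modified = True      # every state-modifying method sets the flag

def write_back(objects, store):
    """Write only the modified objects to the store and clear their flags."""
    written = []
    for obj in objects:
        if obj.modified:
            store[obj.oid] = obj.balance
            obj.modified = False
            written.append(obj.oid)
    return written
```

The scheme relies on every state-modifying method setting the flag; a method that forgets to do so silently loses its update.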

In other implementations, a flag is maintained for each memory page. Memory pages are initially set read-only. When an object on a page is modified, a memory protection violation interrupt occurs. The interrupt service routine sets the flag for the page and removes the read-only protection. For this, the persistent store must work at the granularity of memory pages.

An important problem is the update frequency. If modified objects are only written back at a later time, a lot of modifications may be lost if the program suddenly crashes. Thus, in some cases, every time an object is modified its state is written back to the persistent store.

As an optimization, it is possible to write the complete object state once in a while (checkpoint) and only write the state changes (in a change log) when an object is updated through state modifying methods. If the change log becomes too long, a new checkpoint should be written and the change log emptied. When a crash occurs, the last checkpoint and the change log are used to get back to the exact state prior to the crash.
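The checkpoint and change log scheme can be sketched as follows. The ChangeLogStore class and its methods are hypothetical, the object state is modelled as a dictionary, and the checkpoint and log are kept in memory where a real store would write them to disk.

```python
class ChangeLogStore:
    """Hypothetical store keeping a checkpoint plus a log of later changes."""
    def __init__(self, max_log=3):
        self.checkpoint = {}      # complete state at the last checkpoint
        self.log = []             # (key, value) changes made since then
        self.max_log = max_log    # log length that triggers a new checkpoint

    def record(self, state, key, value):
        """Apply one change to the live state and append it to the change log."""
        state[key] = value
        self.log.append((key, value))
        if len(self.log) > self.max_log:
            self.write_checkpoint(state)

    def write_checkpoint(self, state):
        self.checkpoint = dict(state)
        self.log.clear()          # the log restarts from the new checkpoint

    def recover(self):
        """Rebuild the state prior to a crash: checkpoint, then replay the log."""
        state = dict(self.checkpoint)
        for key, value in self.log:
            state[key] = value
        return state
```

The threshold max_log trades recovery time against checkpoint cost: a longer log means fewer full writes but more changes to replay after a crash.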


Copyright 1995 Michel Dagenais, dagenais@vlsi.polymtl.ca, Wed Mar 8 14:41:03 EST 1995