Currently Genezzo is a single-process database -- only one process on a single server may access the database files at a time. The current Genezzo codebase also does not implement transaction rollback of data already written to disk. The Clustered Genezzo project will add support for multi-server multi-process access. The multi-process support will be specifically designed to support shared data clusters. Transaction rollback support will also be added. These additions will be done via Genezzo's Havok and SysHook extension mechanisms, so little modification of the base code will be necessary.
This implementation of Clustered Genezzo will rely heavily on two outside components: a Cluster File System and a Distributed Lock Manager. Initial choices for these components have been made to enable development. It is expected these components will later be replaced, either with other outside components or with new components developed as part of the Clustered Genezzo project.
The remainder of this document discusses the details of the design and proposed implementation of Clustered Genezzo. Many notes detail alternative design decisions and places for further improvement.
A shared data cluster with a topology like that shown below provides numerous benefits when deploying an application like the Genezzo database. These benefits are in the areas of scalability, high availability, and, most recently, affordability. Several of these benefits arise because every server can equally access every shared disk. The cluster scales in compute power by adding servers, and in capacity by adding disks. The data stored by the cluster is highly available because a single server failure doesn't prevent access to any of the data, and because the SAN disks can easily be set up in a RAID configuration to guard against disk failures. Finally, clusters built with commodity AMD/Intel processors, ATA/SATA hard drives, and ethernet adapters and switches are very affordable. The major drawback to shared data clusters is the greater complexity of the operating system(s) and applications required to utilize them. The holy grail at both the operating system and database level is the "single system image": the illusion presented to higher-level applications and users that the cluster is a single monolithic system whose underlying hardware complexity can safely be ignored.
The configuration of the Genezzo test cluster is discussed here.
One major concern with any file system will be the implementation of the fsync system call. When implementing database transactions it is important to guarantee that data has really been written to disk; fsync is used to force all cached data to disk. When buffers are passed between processes on different machines via the disk, another process may read an inconsistent block due to a partially completed write. This is quite possible when database blocks are larger than file system blocks. We can detect this failure case using a checksum, and the block will be reread if a mismatch is found.
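As a concrete illustration, here is a minimal Perl sketch of such a checksum-retry read. It assumes an already-open data file handle and a caller-supplied checksum check; the names read_block_checked, BLOCK_SIZE, and MAX_RETRY_COUNT are illustrative, not taken from the Genezzo code:

    use strict;
    use warnings;
    use Fcntl qw(SEEK_SET);

    use constant BLOCK_SIZE      => 4096;
    use constant MAX_RETRY_COUNT => 5;

    sub read_block_checked {
        my ($fh, $block_number, $verify_checksum) = @_;
        for my $attempt (1 .. MAX_RETRY_COUNT) {
            sysseek($fh, $block_number * BLOCK_SIZE, SEEK_SET)
                or die "seek failed: $!";
            my $got = sysread($fh, my $block, BLOCK_SIZE);
            die "read failed: $!" unless defined $got;
            # A torn (partially written) block shows up as a checksum mismatch;
            # reread in the hope that the writer on another node has finished.
            return $block if $got == BLOCK_SIZE && $verify_checksum->($block);
            select(undef, undef, undef, 0.1);    # brief pause before retrying
        }
        die "block $block_number failed checksum after " . MAX_RETRY_COUNT . " attempts";
    }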
In the past clustered databases (such as Oracle Parallel Server or RAC) have been implemented utilizing one database instance per machine, where each instance has its own internal lock manager. With row level locking the lock information has been stored directly in the data blocks with the rows. The DLM has been used to implement cache coherence between database instances on different machines, not database transaction-level locks. In our case we will not initially be implementing a separate instance-level (intra-machine) lock manager. Instead the DLM will provide all locking services even if only a single machine is running.
Early tests of OpenDLM's scalability found that it runs out of locks at around 360,000 locks. Only 350MB out of 1GB of memory had been consumed, so the issue was not simple memory exhaustion. Initially we will restrict our usage to 100,000 locks. Used directly, this would restrict the size of an update or select to 400MB, assuming a 4K database page size. We will instead hash the lock names into a space of 100,000 locks, trading off the possibility of false lock blocking for a larger potential update size.
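A small Perl sketch of this lock-name hashing is shown below; the choice of MD5 and the "HLK-" prefix for hashed names are assumptions made for illustration, not details of the Genezzo or OpenDLM code:

    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    use constant LOCK_SPACE => 100_000;

    # Map an arbitrary lock name such as "BLK-3-1287" onto one of the
    # LOCK_SPACE hashed lock names actually requested from the DLM.
    sub hashed_lock_name {
        my ($lock_name) = @_;
        my $n = unpack('N', md5($lock_name));   # first 4 bytes of the digest
        return sprintf('HLK-%05d', $n % LOCK_SPACE);
    }

    # Distinct block locks may collide on the same hashed name, producing
    # the false lock blocking accepted above.
    print hashed_lock_name('BLK-3-1287'), "\n";
    print hashed_lock_name('BLK-7-42'),   "\n";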
There may also be issues with the performance of OpenDLM. In testing it took 1.9 seconds to obtain 10,000 locks, or 5263 locks/second. In the 100,000 lock limit case above it would take 19 seconds to lock the entire database. This is probably acceptable.
While OpenDLM provides many complex locking services, we will initially utilize a very simple set. We will not utilize the async APIs, avoiding issues with callback synchronization in Perl. By utilizing a simple set it should be easier to move to a different DLM (or write our own) in the future.
Each Genezzo process is started with a globally unique process id. Currently it is allocated by requesting a lock [SVR-processId] in EXCLUSIVE NOWAIT mode, then incrementing the processId and retrying if that processId is already in use (this N^2 technique may need to be revisited with a large number of processes). All processes share a single undo file, but distinct ranges of blocks are preallocated to each process. Under normal operation only the owning process reads or writes its portion of the undo file. When a process failure occurs, the first Genezzo server process to detect the failure will roll back any failed transaction using the undo file information. Exclusive access to each undo file section is maintained by that process's single DLM lock [SVR-processId].
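A sketch of this allocation loop follows, assuming a hypothetical try_exclusive_nowait() wrapper around the lock manager (OpenDLM itself provides no such Perl binding):

    use strict;
    use warnings;

    # Allocate a globally unique Genezzo process id by probing [SVR-processId]
    # locks in EXCLUSIVE NOWAIT mode until one is granted. $dlm is a handle to
    # the lock manager; try_exclusive_nowait() is a hypothetical wrapper.
    sub allocate_process_id {
        my ($dlm) = @_;
        my $pid = 1;
        while (1) {
            # Success means no live process owns this slot; the lock is held
            # for the life of the process and protects its undo file section.
            return $pid if $dlm->try_exclusive_nowait("SVR-$pid");
            $pid++;    # slot taken; probe the next one (the N^2 behaviour noted above)
        }
    }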
The undo file records the current status of each process (CLEAR, COMMITTED, ROLLEDBACK, PENDING), with one status block per process. A range of blocks per process lists fileNumber-blockNumber pairs: the blocks written to by that process's current transaction. Before a data block is written, its before image is copied to the tail of its associated data file, to a slot at filelength + blocknum.
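The following sketch shows one way the resulting offsets could be computed. The 4K block size, the fixed maximum process count, the number of undo blocks per process, and the interpretation of "filelength + blocknum" as block-sized slots past the pre-transaction end of file are all assumptions made for illustration, not Genezzo's actual layout:

    use strict;
    use warnings;

    use constant BLOCK_SIZE              => 4096;
    use constant MAX_PROCESSES           => 128;
    use constant UNDO_BLOCKS_PER_PROCESS => 16;

    # Byte offset of a process's status block (CLEAR/COMMITTED/ROLLEDBACK/PENDING).
    sub status_block_offset {
        my ($process_id) = @_;
        return $process_id * BLOCK_SIZE;
    }

    # Byte offset of the Nth block in the range preallocated to a process for
    # recording fileNumber-blockNumber pairs.
    sub undo_block_offset {
        my ($process_id, $n) = @_;
        return (MAX_PROCESSES + $process_id * UNDO_BLOCKS_PER_PROCESS + $n) * BLOCK_SIZE;
    }

    # Byte offset of a block's before image at the tail of its data file, where
    # $file_length is the data file's length before the transaction started.
    sub before_image_offset {
        my ($file_length, $block_number) = @_;
        return $file_length + $block_number * BLOCK_SIZE;
    }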
Note that only before images of blocks are generated, and they are only retained for the duration of the transaction. No long-lived redo or log file is generated. Cluster-wide recovery with multiple redo log files is quite complex, and it is believed many customers cannot accept the downtime implied by a restore-backups + mount-offline-logs + roll-logs-forward style recovery. Instead we will rely on the availability of cheap SATA RAID hardware to cover the media failure case. Simple process death or power failure is covered by the undo file scheme.
An online hot-backup facility can be added via a tablespace or cluster-wide lock which must be held in SHARED mode to enable writes to the tablespace. A backup process would obtain it in EXCLUSIVE mode, locking out all writers for the duration of the backup.
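A sketch of that interlock, using hypothetical lock_shared()/lock_exclusive()/unlock() wrappers around the DLM:

    use strict;
    use warnings;

    # Writers wrap each tablespace write in a SHARED acquisition of a
    # per-tablespace backup lock; the backup tool takes the same lock in
    # EXCLUSIVE mode for the duration of the copy.
    sub write_with_backup_lock {
        my ($dlm, $tablespace, $do_write) = @_;
        $dlm->lock_shared("BACKUP-$tablespace");
        eval { $do_write->() };
        my $err = $@;
        $dlm->unlock("BACKUP-$tablespace");
        die $err if $err;
    }

    sub run_hot_backup {
        my ($dlm, $tablespace, $do_copy) = @_;
        # Blocks until all in-flight writers release their SHARED locks,
        # then locks out new writers until the copy completes.
        $dlm->lock_exclusive("BACKUP-$tablespace");
        eval { $do_copy->() };
        my $err = $@;
        $dlm->unlock("BACKUP-$tablespace");
        die $err if $err;
    }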
Alternately all transactions could be streamed to a remote log server. This is similar to how many databases accomplish replication.
Each data block has the following format:

    blockType
    fileNumber
    blockNumber
    blockRevisionNumber
    currentProcess (or none)
    ...metadata...
    ...row data...
    checksum
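As an illustration, a block with this layout could be packed and verified as follows; the 32-bit field widths, the encoding of "none", and the use of Perl's built-in byte checksum are assumptions, not the actual Genezzo on-disk format:

    use strict;
    use warnings;

    use constant BLOCK_SIZE => 4096;
    use constant NO_PROCESS => 0xFFFFFFFF;   # encoding of "none" (an assumption)

    # Pack the header fields listed above in front of the payload, pad to a
    # full block, and append a 4-byte checksum.
    sub pack_block {
        my ($type, $file_no, $block_no, $rev, $cur_proc, $payload) = @_;
        my $body = pack('N5', $type, $file_no, $block_no, $rev, $cur_proc) . $payload;
        die "payload too large" if length($body) > BLOCK_SIZE - 4;
        $body .= "\0" x (BLOCK_SIZE - 4 - length $body);    # leave room for checksum
        return $body . pack('N', unpack('%32C*', $body));
    }

    # Recompute the checksum over everything but the last 4 bytes and compare;
    # a mismatch indicates a torn or corrupted block (see the retry logic above).
    sub block_checksum_ok {
        my ($block) = @_;
        my $stored   = unpack('N', substr($block, -4));
        my $computed = unpack('%32C*', substr($block, 0, -4));
        return $stored == $computed;
    }

    # Example: a freshly formatted block not touched by any transaction.
    my $block = pack_block(1, 3, 1287, 0, NO_PROCESS, '');
    die "bad checksum" unless block_checksum_ok($block);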
If currentProcess is not none, a transaction has modified this block and may still be in progress. currentProcess identifies the slot in the undo file to consult: if that slot contains a commit record, the transaction committed and currentProcess can safely be set to none. Otherwise the transaction is treated as rolled back, and the block needs to be replaced with the before image from the tail of the data file.
blockRevisionNumber is incremented once for each committed transaction this block participates in. Note that no global System Commit Number or Log Sequence Number is maintained, which eliminates the need for a global sequence generation algorithm. The algorithms in this document do not currently use the blockRevisionNumber; it is useful for log-based recovery, multi-version read consistency / row-level locking, etc.
As SQL statements are processed blocks will be requested from the buffer cache. Each request will form a lock name [BLK-fileNumber-blockNumber]. The process will determine whether this block is already locked by the process. If not, a blocking SHARED mode lock request will be made.
    def blockReadRequest(fileNumber, blockNumber)
        form lock name [BLK-fileNumber-blockNumber]
        if(not lock already held by process)
            blocking SHARED lock request
            add lock to hash and list of locks held by process
        end

        loadBlock:
        checksumMatch = false
        retries = 0
        // all block reads follow this checksum-retry logic;
        // it is only listed here
        while(not checksumMatch and retries < maxRetryCount)
            read block from disk
            compute checksumMatch for block
            retries = retries + 1
        end
        if(not checksumMatch)
            raise BlockCorruptionException
        end
        if(block currentProcess is not none (and not current process!))
            if(recoverProcess(currentProcess of block) == FALSE)
                sleep 5 seconds // couldn't get lock; hopefully someone else
                                // is in the middle of recovering this process
            end
            goto loadBlock
        end
    end

The Genezzo push-hash implementation automatically detects any attempts to write to a block. The first time a block is written in a transaction, the SHARED lock must be promoted to EXCLUSIVE and the old contents of the block must be written to the undo file.
    def blockWriteRequest(fileNumber, blockNumber, blockLocationInBufferCache)
        if(currentProcess field is not none)
            // already written to undo
            return
        end
        form lock name [BLK-fileNumber-blockNumber]
        find lock in hash of locks held by process
        if(not lock already held by process in EXCLUSIVE mode)
            blocking EXCLUSIVE lock conversion request
            update local record of lock state
        end
        append current (old) contents of block to tail of data file
        fsync data file
        add fileno-blockno to process undo block in undo file
            (written to two blocks for recoverability)
        fsync undo file
        set currentProcess field of block in buffer cache to this process's undo slot
    end

At this point the block may be updated and safely written out to disk at any time by the buffer cache. It may LRU in and out of the buffer cache repeatedly, so we do support updates larger than the buffer cache.
At commit time the buffer cache must be written out to disk (the same as in single-process Genezzo). Then the process status in the undo file is set to COMMITTED. Each of the updated data blocks then has its currentProcess field set to none. Finally the process's slot in the undo file is set to CLEAR, and the locks are released.
    def commit()
        write all dirty buffer cache buffers to disk
        fsync data file(s)
        write COMMITTED status to process slot in undo file
        fsync undo file
        // At this point we have now persistently marked the transaction as
        // COMMITTED. Prior to this point recovery will roll back the
        // transaction. After this point recovery will complete the transaction.
        read undo slots in undo file sequentially to determine list of updated blocks
        for each block listed in undo slots
            find block in buffer cache (loading if not found)
            set currentProcess field in block to none
            write out data block
        end
        fsync data file(s)
        write CLEAR status to process slot in undo file
        fsync undo file
        free all data buffers
        release all locks in lock list
    end

In the case of a rollback we can use the undo file to back out the changes:
    def rollback()
        write ROLLEDBACK status to process slot in undo file
        fsync undo file
        invalidate buffer cache
        for each block listed in undo slots in undo file
            write block from data file tail to corresponding data file location
        end
        fsync data file(s)
        write CLEAR status to process slot in undo file
        fsync undo file
        release all locks in lock list
    end
    def recoverProcess(failedProcess)
        // returns TRUE when the undo has been applied,
        // FALSE when the lock could not be obtained
        nonblocking lock request for [SVR-failedProcess]
        if(lock request failed)
            return FALSE
        end
        committed = FALSE
        look up process status of failedProcess in undo file
        if(COMMITTED status found)
            committed = TRUE
        end
        for each block listed in undo slots of failedProcess
            // These blocks won't be in our buffer cache, since they
            // were updated by someone else's failed transaction
            if(committed)
                read data block into local temp buffer // NOT buffer cache
                set currentProcess field to none
                write out data block
            else
                write block from tail of data file to corresponding data file location
            end
        end
        fsync data file(s)
        write CLEAR status to process slot of failedProcess in undo file
        fsync undo file
        return TRUE
    end
The second major buffer cache difficulty is that the cache currently must be invalidated at the completion of each transaction, since all the locks are released. One alternative is to store the blockRevisionNumber in the Lock Value Block (a 32-byte user-defined field) provided by OpenDLM. If the value matches the blockRevisionNumber of the block in the buffer cache, a disk read can be avoided. However, writing a new value requires upgrading the lock to EXCLUSIVE mode, which would be difficult for blocks under high contention. The lock would also need to be down-converted to NULL mode rather than released, so that the DLM preserves the value. Finally, with hashing of locks the lock name would also need to be written in the value field, and only one value could be retained when names collide. A second option is to have the server processes retain locks after committing, and only release a lock when an async callback requests it. This would require async callback programming in Perl.
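For the first alternative, the Lock Value Block contents might be encoded and compared roughly as follows; the 28-byte-name-plus-32-bit-revision layout is an assumption for illustration only:

    use strict;
    use warnings;

    # Encode the (hashed) lock name and the blockRevisionNumber into the
    # 32-byte Lock Value Block: 28 bytes of name plus a 32-bit revision.
    sub pack_lvb {
        my ($lock_name, $revision) = @_;
        return pack('a28 N', $lock_name, $revision);
    }

    # On re-acquiring a lock, compare the value block against the cached copy
    # of the block; a match means the cached block is current and the disk
    # read can be skipped.
    sub cached_block_still_valid {
        my ($lvb, $lock_name, $cached_revision) = @_;
        my ($name, $revision) = unpack('a28 N', $lvb);
        $name =~ s/\0+\z//;                    # strip pack() padding
        return $name eq $lock_name && $revision == $cached_revision;
    }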
File system performance will be strongly influenced by the implementation of the underlying Cluster File System. If it is built directly on raw devices there will not be any other file system buffering, and all reads will hit the disk (or disk cache). Cluster file systems built on top of the operating system file system have additional levels of buffer cache. These levels of cache reduce the need for a large Genezzo buffer cache, but introduce a greater likelihood of fsync problems.
Performance will also be impacted by the lack of serial writes. Traditional enterprise databases have one dedicated disk per database server log, so each server can perform serial writes; they also perform write-ahead logging and minimize fsyncs of the data files. We instead have a single randomly accessed undo file shared between all processes on all servers. A dedicated undo disk per process would be cost prohibitive.
A minimal Apache 2.0 process (with mod_perl) consumes about 6 MB [need to verify with code sharing, etc.]. The Genezzo code is small, so total process size will be determined primarily by the size of the buffer cache. With a 4 MB buffer cache the total process would be about 10 MB, and a 1 GB server could run say 80 Apache Genezzo child processes. This is small compared to the hundreds of user connections supported by commercial databases, but the difference is that we are using a stateless web protocol instead of persistent connections. If a user makes small, simple requests and waits tens of seconds between them, then a single Apache process can serve dozens of users.
Note the stateless protocol means user transactions cannot span multiple HTTP requests. Instead some type of server-side stored procedures will be required to bundle multiple SQL statements into a single transaction. Stored procedures would of course be written in Perl. A simple alternative is to create separate mod_perl web pages for each procedure. They could be shared between the Apache servers in the cluster via the Cluster File System. Another alternative is to store the procedure definitions in the database. Large bulk loads would need to be done using a command-line tool (similar to gendba.pl).
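A minimal mod_perl 2 sketch of such a stored-procedure page is shown below. The handler bundles several SQL statements into one transaction per HTTP request; the database handle and its do/commit/rollback calls are placeholders for whatever embedding API the Apache child would actually use, and the module and table names are invented for the example:

    package Genezzo::Cluster::TransferProc;   # illustrative name

    use strict;
    use warnings;
    use Apache2::RequestRec ();
    use Apache2::RequestIO  ();
    use Apache2::Const -compile => qw(OK SERVER_ERROR);

    # Placeholder: obtain the Apache child's database handle. The real page
    # would open or reuse a Genezzo connection here.
    sub get_database_handle { die "not implemented in this sketch" }

    sub handler {
        my $r  = shift;
        my $db = get_database_handle();

        # All statements in this block commit or roll back together,
        # within a single HTTP request.
        my $ok = eval {
            $db->do("update accounts set bal = bal - 100 where id = 1");
            $db->do("update accounts set bal = bal + 100 where id = 2");
            $db->commit;
            1;
        };
        unless ($ok) {
            eval { $db->rollback };
            return Apache2::Const::SERVER_ERROR;
        }

        $r->content_type('text/plain');
        $r->print("transfer complete\n");
        return Apache2::Const::OK;
    }

    1;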
With the locking model described above we still have the potential for "phantoms". To prevent phantoms in indexed tables we can acquire additional locks on the index pages at the positions before and after where phantom rows would be inserted. Preventing phantoms on non-indexed columns is more difficult.