GT.M Database Basics (from iothrash utility manual)

GT.M Database Engine Basics

Database engines typically operate on two types of files: database files and journal files. In the case of GT.M, update information is written sequentially to the end of journal files to harden or commit transactions. Once an update is written to a journal file, it is considered recoverable, and the corresponding updates to the database can follow later.
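The journal-then-database ordering described above can be sketched as follows. This is a minimal illustration, not GT.M's implementation: the function name, the record layout (a 4-byte length, an 8-byte offset, then the data) and the fsync() after every journal append are all assumptions made for the example.

```python
import os

def journaled_update(journal_path, db_path, offset, data):
    """Append an update to the journal, then apply it to the database file.

    Hypothetical record layout: 4-byte length, 8-byte offset, payload.
    """
    record = len(data).to_bytes(4, "big") + offset.to_bytes(8, "big") + data

    # Step 1: append the record to the end of the journal and harden it.
    # (Forcing it to disk on every update is an assumption of this sketch.)
    with open(journal_path, "ab") as journal:
        journal.write(record)
        journal.flush()
        os.fsync(journal.fileno())

    # Step 2: the update is now recoverable, so the database file
    # (which must already exist) can be updated afterwards.
    with open(db_path, "r+b") as db:
        db.seek(offset)
        db.write(data)
```

If the process dies between the two steps, the journal still holds the record, which is what makes the update recoverable.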

The GT.M database engine is daemonless. Processes run as normal user processes, and cooperate to manage the database. The logic resides in a shared library provided by GT.M that the processes link dynamically. Thus, any process can read or write any database file to which it has access permissions, and any process can append to journal files based on access permissions. In a small application, there are frequently tens of concurrently active processes; in a large application, there can be thousands of active processes concurrently accessing the same database and journal files.

Journal files are not read during normal operation, and grow to a specified maximum size (less than or equal to 4GB). In the actual GT.M database engine, once a journal file reaches its limit, it is closed and a new journal file is created; the new journal file contains a back pointer to its predecessor. A lock ensures that at any given time, only one process is appending to a specific journal file. Different processes can concurrently append to different journal files. Although journal records are of variable size, and journal writes can be of variable size, an optimization in the engine ensures that, under conditions of moderate to heavy load (which is when benchmarking and performance become interesting), journal writes are a multiple of the natural block size of the file system, starting at a multiple of that natural block size from the beginning of the file.
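The switch to a new journal file with a back pointer can be sketched as below. The class name, the file naming scheme and the one-line `PREV=` header are hypothetical and chosen only for illustration; the real journal file format is different, and the lock that serializes appenders is not modeled here.

```python
import os

class JournalSketch:
    """Toy journal that switches to a new file when a size limit is reached."""

    def __init__(self, directory, max_size):
        self.directory = directory
        self.max_size = max_size      # GT.M's limit is at most 4GB
        self.generation = 0
        self.path = self._create(previous=None)

    def _create(self, previous):
        path = os.path.join(self.directory, f"journal_{self.generation}.mjl")
        # Hypothetical header: a single line naming the predecessor file.
        with open(path, "wb") as f:
            f.write(f"PREV={previous or ''}\n".encode())
        return path

    def append(self, record):
        if os.path.getsize(self.path) + len(record) > self.max_size:
            # Limit reached: stop using the old file and start a successor
            # whose header points back to it.
            self.generation += 1
            self.path = self._create(previous=self.path)
        with open(self.path, "ab") as f:
            f.write(record)
```

Following the `PREV=` lines from the newest file backwards yields the chain of journal files in reverse order of creation.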

Database files are read from and written to randomly. Even in update-intensive transaction processing applications, reads frequently outnumber writes. Database files have a block size, which is a multiple of 512 bytes ranging from 512 through 65024 bytes. For optimal performance, the database block size should be a multiple of the natural block size of the underlying file system; popular block sizes are 4KB, 8KB and 16KB. A database file can have 134217728 (128M) blocks, so with the popular 8KB block size, a single database file can have a maximum size of 1TB.
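The size limit follows directly from the block count and the block size, as this short check shows. The function names are made up for the example; the limits are the ones quoted in the text, with 128M read as 128 × 2^20 and 8KB as 8192 bytes.

```python
MAX_BLOCKS = 128 * 2**20   # 128M blocks per database file

def is_valid_block_size(size):
    """A block size is a multiple of 512 bytes from 512 through 65024."""
    return 512 <= size <= 65024 and size % 512 == 0

def max_database_bytes(block_size):
    """Largest database file size, in bytes, for a given block size."""
    if not is_valid_block_size(block_size):
        raise ValueError(f"invalid block size: {block_size}")
    return MAX_BLOCKS * block_size

# 2**27 blocks of 2**13 bytes each is 2**40 bytes, i.e. 1TB.
print(max_database_bytes(8192) // 2**40)   # prints 1
```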

Database blocks are always at an offset from the file header that is a multiple of the block size. Database IO occurs on blocks. Reads are always of block size bytes. There are two write options: partial and full block writes. When writing data into a block, it is usually the case that the block is not full. Partial block writes simply write the actual data. On some IO subsystems, this can lead to less than optimal performance, because the IO subsystem may read an entire sector from disk, modify it, and write it back. For IO subsystems that bypass the read if an entire block is written, GT.M can optionally be configured to perform full block writes, where an entire block's worth of data is written, even if what is written consists of valid data followed by garbage (i.e., by performing more IO between the CPU and the IO subsystem, the IO subsystem may perform physical IO more efficiently). Deployments typically benchmark both settings and use whichever gives them better performance.
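The difference between the two write options comes down to how many bytes are handed to the IO subsystem. In this sketch the function names and the `header_size` parameter are assumptions for illustration, and zero padding stands in for whatever leftover bytes a real full block write would carry.

```python
def block_offset(header_size, block_size, block_number):
    """File offset of a block: the header plus a multiple of the block size."""
    return header_size + block_number * block_size

def write_payload(valid_data, block_size, full_block):
    """Return the bytes that would be written for one block.

    Partial write: only the valid data.
    Full block write: valid data padded out to the block size.
    """
    if len(valid_data) > block_size:
        raise ValueError("data does not fit in one block")
    if full_block:
        # Zero padding is a stand-in for the "garbage" mentioned in the text.
        return valid_data + bytes(block_size - len(valid_data))
    return valid_data
```

With a 4096-byte block holding 1000 bytes of valid data, a partial write transfers 1000 bytes and a full block write transfers 4096.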

Each database file has either zero or one active journal file. Databases that are not journaled are typically used for scratch or temporary data, data that is static, and/or data that is easily restored. If a SAN is not used, in large applications, it is common for a database file and its journal file to reside on different disks and disk controllers. In smaller applications (such as a primary care clinic), they may reside on the same RAID device.

A logical database can be spread over multiple regions, where a region is a database file and its journal file. Between seven and twenty regions is common in a large institution, with individual database files typically in the tens of GB and a few in the hundreds of GB.

There are periodic epochs: checkpoints at which all database blocks in shared memory and all journal records are written to disk, and an fsync() operation is performed. An epoch involves a burst of writes by one process followed by fsync() of the database file and its journal file. Epochs of different database files happen at different times.
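An epoch for one region can be sketched as a burst of writes followed by fsync() on both files. The function and parameter names are hypothetical, the in-memory dictionary stands in for blocks held in shared memory, and writing the journal before the database is an assumption of this example.

```python
import os

def run_epoch(db, journal, dirty_blocks, pending_records,
              header_size, block_size):
    """Write out everything buffered for one region, then fsync both files.

    db, journal     -- binary file objects open for writing
    dirty_blocks    -- {block_number: bytes} standing in for shared memory
    pending_records -- list of journal records not yet written
    """
    # Burst of journal writes, then force them to disk.
    for record in pending_records:
        journal.write(record)
    journal.flush()
    os.fsync(journal.fileno())

    # Burst of database block writes, then force them to disk.
    for number in sorted(dirty_blocks):
        db.seek(header_size + number * block_size)
        db.write(dirty_blocks[number])
    db.flush()
    os.fsync(db.fileno())

    # Nothing is left to write until new updates arrive.
    pending_records.clear()
    dirty_blocks.clear()
```

Because each region has its own database file and journal file, calling this per region at different times mirrors the statement that epochs of different database files are not synchronized.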