GT.M Database Basics (from iothrash utility manual)
GT.M Database Engine Basics
Database engines typically operate on two types of files: database files and journal files. In the case of GT.M,
update information is written sequentially to the end of journal files to harden or commit transactions.
Once an update is written to a journal file, it is considered recoverable and updates to the database can follow subsequently.
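The ordering described above (journal first, database second) can be sketched in a few lines of Python. This is an illustration only, not GT.M's implementation: the function name journaled_update, the record layout (4-byte length, 8-byte offset, payload), and the choice to call fsync() on every update are all assumptions made for the example.

```python
import os

def journaled_update(journal_path, db_path, offset, data):
    """Append an update to the journal, harden it, then apply it to the database.

    Hypothetical record layout (not GT.M's journal format):
    4-byte payload length, 8-byte database offset, payload bytes.
    """
    record = len(data).to_bytes(4, "big") + offset.to_bytes(8, "big") + data

    # Step 1: write the record sequentially to the end of the journal
    # and force it to stable storage. After this point the update is
    # considered recoverable.
    with open(journal_path, "ab") as journal:
        journal.write(record)
        journal.flush()
        os.fsync(journal.fileno())

    # Step 2: only now update the database file itself.
    with open(db_path, "r+b") as db:
        db.seek(offset)
        db.write(data)

    return len(record)
```

If the process fails between the two steps, the journal already holds the update, which is why the database write can safely follow later.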
The GT.M database engine is daemonless. Processes run as normal user processes,
and cooperate to manage the database. The logic resides in a shared library provided
by GT.M that the processes link dynamically. Thus, any process can read or write any
database file to which it has access permissions, and any process can append to
journal files based on access permissions. In a small application, there are
frequently tens of concurrently active processes; in a large application, there
can be thousands of active processes concurrently accessing the same database and journal files.
- Journal files are not read during normal operation, and grow to a specified maximum size (less
than or equal to 4GB). In the actual GT.M database engine, once a journal file reaches its limit,
it is closed and a new journal file is created; the new journal file contains a back pointer to its
predecessor. A lock ensures that at any given time, only one process is appending to a specific
journal file. Different processes can concurrently append to different journal files. Although
journal records are of variable size, and journal writes can be of variable size, an optimization in
the engine ensures that, under conditions of moderate to heavy load (which is when
benchmarking and performance become interesting), journal writes are a multiple of the natural
block size of the file system, starting at a multiple of that natural block size from the beginning of
the journal file.
- Database files are read from and written to randomly. Even in update-intensive
processing applications, reads frequently outnumber writes. Database files have a block size,
which is a multiple of 512 bytes ranging from 512 through 65024 bytes. Recommended database
block sizes for optimal performance should be a multiple of the natural block size of the
underlying file system, and popular block sizes are 4KB, 8KB and 16KB. A database file can
have 134217728 (128M) blocks, so with the popular 8KB block size, a single database file can
have a maximum size of 1TB.
- Database blocks are always at an offset from the file header that is a multiple of the block size.
Database IO occurs on blocks. Reads are always of block size bytes. There are two write
options, partial and full block writes. When writing data into a block, it is usually the case that
the block is not full. Partial block writes simply write the actual data. On some IO subsystems,
this can lead to less than optimal performance, because the IO subsystem may read an entire
sector from disk, modify it, and write it back. For IO subsystems that bypass the read if an entire
block is written, GT.M can optionally be configured to perform full block writes, where an entire
block's worth of data are written, even if what is written consists of valid data and garbage (i.e.,
by performing more IO between the CPU and the IO subsystem, the IO subsystem may perform
physical IO more efficiently). Deployments typically benchmark both settings and use whichever
gives them the better performance.
- Each database file has either zero or one active journal file. Databases that are not journaled are
typically used for scratch or temporary data, data that is static, and/or data that is easily
restored. If a SAN is not used, in large applications, it is common for a database file and its
journal file to reside on different disks and disk controllers. In smaller applications (such as a
primary care clinic), they may reside on the same RAID device.
- A logical database can be spread over multiple regions, where a region is a database file and its
journal file. Between seven and twenty regions is common in a large institution, with individual
database files typically in the tens of GB and a few in the hundreds of GB.
- There are periodic epochs: checkpoints at which all database blocks in shared memory and
all journal records are written to disk, and an fsync() operation is performed. An epoch involves
a burst of writes by one process followed by fsync() of the database file and its journal file.
Epochs of different database files happen at different times.
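
The block size and alignment rules in the list above reduce to simple arithmetic, sketched here in Python. The function names are hypothetical, and header_size is a placeholder parameter because the text does not state the size of the file header. The default of 128M blocks is the limit quoted above.

```python
def is_valid_block_size(size):
    """A database block size is a multiple of 512 bytes, from 512 through 65024."""
    return size % 512 == 0 and 512 <= size <= 65024

def max_database_bytes(block_size, max_blocks=128 * 2**20):
    """Largest database file (excluding the header) for a given block size."""
    return block_size * max_blocks

def block_offset(block_number, block_size, header_size):
    """Byte offset of a block: a multiple of the block size past the file header.

    header_size is an assumed parameter, not a documented value.
    """
    return header_size + block_number * block_size

def is_aligned_journal_write(start, length, fs_block_size):
    """True if a journal write starts on, and spans a whole number of,
    natural file system blocks."""
    return start % fs_block_size == 0 and length % fs_block_size == 0
```

For example, max_database_bytes(8192) evaluates to 2**40 bytes, which is the 1TB figure given for the 8KB block size.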