Server — MariaDB — Replication
The binary log article introduced replication as one of two reasons the binary log exists, alongside point-in-time recovery. It then immediately noted that February has no replicas and moved on. This article is the part that got deferred: what replication actually is, how it works at a conceptual level, and what it would mean for February’s architecture if it were ever worth implementing.
Nothing in this article needs to be done. It is context, not instructions.
What replication is
MariaDB replication is a mechanism for keeping one or more database servers — called replicas — continuously synchronised with a primary server. Every change made on the primary is captured in the binary log, transmitted to each replica, and applied there in the same order. The result is a replica that reflects the primary’s state with a small delay, typically milliseconds to seconds depending on network conditions and write volume.
The two main use cases are high availability and read scaling. High availability: if the primary fails, a replica can be promoted to take its place, minimising downtime. Read scaling: read-heavy applications can be directed at replicas rather than the primary, distributing the query load across multiple servers.
Neither of these applies to February in its current form. February runs a handful of lightly used databases that generate very little query traffic. The complexity of maintaining replication is not justified by the problem it would solve. But understanding the mechanism matters because it clarifies why the binary log is structured the way it is, why the server_id setting exists in the configuration, and what the path looks like if the server ever grows to a point where replication becomes worthwhile.
How it works
The binary log is the foundation. The primary records every data-modifying operation to the binary log as it happens. Each event in the log has a position, a timestamp, and the content of the change, formatted according to the configured binlog format: the SQL statement itself in STATEMENT mode, the actual row data before and after the change in ROW mode, or a per-statement choice between the two in MIXED mode.
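For orientation only, this is roughly how the binary log can be inspected from a MariaDB client session. It is a sketch that assumes binary logging is already enabled as described in the binary log article, and the output depends entirely on the server it runs on.

```sql
-- Confirm that binary logging is on, and which format and server ID are configured.
SELECT @@log_bin, @@binlog_format, @@server_id;

-- List the binary log files the server currently holds, with their sizes.
SHOW BINARY LOGS;

-- Show the first ten events of the first binary log file.
-- Each row carries the event's start position (Pos), its type,
-- the Server_id that originated it, and End_log_pos.
SHOW BINLOG EVENTS LIMIT 10;
```

SHOW BINLOG EVENTS does not print event timestamps. The mariadb-binlog command-line tool decodes the same files and includes them.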
A replica connects to the primary over a standard MariaDB connection and identifies itself as a replication client. It tells the primary which position in the binary log it wants to start from, and the primary begins streaming events from that point. The replica writes these events to its own local log called the relay log, then applies them to its own data in a separate thread.
This two-thread architecture is intentional. The IO thread handles receiving events from the primary and writing them to the relay log. The SQL thread handles reading from the relay log and applying the changes. Separating them means a slow apply operation on the replica does not stall the receipt of new events from the primary. The replica can be temporarily behind the primary in its apply position while still keeping up with the incoming event stream.
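As a rough illustration of the pieces described above, the following sketch shows what attaching a replica by log file and position could look like. Every host name, account name, password, log file name, and offset here is a placeholder invented for the example, not a value from February's configuration.

```sql
-- On the primary: an account the replica uses to connect (hypothetical names).
CREATE USER 'repl'@'replica.example.lan' IDENTIFIED BY 'change-me';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'replica.example.lan';

-- On the replica (which needs its own unique server_id):
-- say where the primary is and where in its binary log to start reading.
CHANGE MASTER TO
  MASTER_HOST = 'february.example.lan',   -- hypothetical host name
  MASTER_USER = 'repl',
  MASTER_PASSWORD = 'change-me',
  MASTER_LOG_FILE = 'mariadb-bin.000042', -- hypothetical log file
  MASTER_LOG_POS = 4;                     -- hypothetical offset

-- Start both replication threads.
START SLAVE;

-- The two threads can be controlled independently. Stopping only the SQL
-- thread pauses applying while the IO thread keeps filling the relay log.
STOP SLAVE SQL_THREAD;
START SLAVE SQL_THREAD;
```

MariaDB 10.5 and later also accept REPLICA in place of SLAVE in statements such as START SLAVE and SHOW SLAVE STATUS. The older spelling is used here because it works on both old and new versions.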
The gap between the primary’s current position and the replica’s apply position is called replication lag. A lag of zero means the replica is fully caught up. A lag of a few seconds is normal under moderate write load. A lag that grows continuously means the replica cannot apply changes as fast as the primary produces them, which is a capacity problem.
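Lag is usually read from the replica's own status output. A minimal sketch, to be run on a replica rather than on a standalone server like February:

```sql
-- Run on the replica. Fields worth reading in the result:
--   Slave_IO_Running, Slave_SQL_Running : whether each of the two threads is running
--   Read_Master_Log_Pos                 : how far the IO thread has received
--   Exec_Master_Log_Pos                 : how far the SQL thread has applied
--   Seconds_Behind_Master               : estimated apply lag in seconds,
--                                         NULL when replication is not running
SHOW SLAVE STATUS;
```

A Seconds_Behind_Master value that keeps climbing over successive checks is the continuously growing lag described above.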
GTID replication
Traditional replication tracks position by binary log filename and byte offset. This works but creates operational complexity: filenames and offsets are local to each server, so if a replica needs to be reconfigured to follow a different primary, the administrator must work out the equivalent log file and position on the new primary by hand.
Global Transaction Identifiers (GTID) solve this. In GTID-based replication, every transaction committed on the primary is assigned a unique identifier. In MariaDB it has the form domain_id-server_id-sequence_number, for example 0-1-42. Replicas track which GTIDs they have applied rather than a file position. Reconfiguring a replica to follow a different primary becomes a matter of pointing it at the new primary; the replica knows which transactions it has already applied and requests only what it is missing.
MariaDB has supported GTID replication since version 10.0. The server_id = 1 line in February’s configuration is a prerequisite for any replication, GTID-based or not: every server in a replication topology needs a unique numeric identifier, and each GTID embeds the identifier of the server that originated the transaction, which is part of what keeps GTIDs unique across the whole topology. On a standalone server with no replicas, the value is arbitrary. On a server that participates in replication, it must be unique.
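To make this concrete, here is a sketch of how GTID state can be read and how a replica could be repointed under GTID. MariaDB writes a GTID as three numbers: a replication domain ID, the server ID, and a sequence number. The host name is a placeholder, and the example GTID in the comment only illustrates the format.

```sql
-- On any server: the two numeric parts that prefix every GTID it generates.
SELECT @@gtid_domain_id, @@server_id;

-- The GTID of the last transaction written to this server's binary log.
-- The value has the shape domain-server-sequence, such as 0-1-42.
SELECT @@gtid_binlog_pos;

-- On a replica: the last GTID it has applied from its primary.
SELECT @@gtid_slave_pos;

-- Repointing a replica at a different primary (hypothetical host name).
-- No log file or byte offset is given; the replica resumes from its own GTID position.
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST = 'new-primary.example.lan',
  MASTER_USE_GTID = slave_pos;
START SLAVE;
```

Compared with file-and-position replication, the administrator supplies only the new host. The MASTER_LOG_FILE and MASTER_LOG_POS arguments are not needed.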
What replication does not replace
Replication is often conflated with backup. It is not a backup.
A replica reflects the primary’s state continuously, including mistakes. If a developer runs DELETE FROM records; on the primary without a WHERE clause, that deletion replicates to every replica within seconds. The data is gone on all servers simultaneously. Replication propagates changes; it does not protect against them.
Point-in-time recovery using the binary log is the mechanism that addresses accidental data loss. Backups using borgmatic address disaster recovery at the hardware level. Replication addresses availability and read distribution. These are three different problems and the tools for them are not interchangeable.
This is worth being explicit about because the combination of “we have replication” and “therefore we do not need backups” is a pattern that has caused real data loss at real organisations. Having a replica does not mean you have a backup.
Galera cluster
Galera is a different model: virtually synchronous multi-primary replication where every node can accept writes. Before a transaction commits on the node that received it, its write set is replicated to every node and must pass a certification check for conflicts; the other nodes then apply it shortly afterwards. This all but eliminates replication lag and allows high availability without manual failover, but at a significant cost in write latency: every commit has to wait for replication and certification across the cluster before it completes.
Galera is appropriate for high-write, high-availability database tiers where the synchronous guarantee is worth the latency cost. For February’s workload, the overhead of Galera would be measurable and the benefit non-existent. It is worth knowing the option exists for larger deployments.
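For a sense of what operating such a cluster involves, the sketch below reads a few of the wsrep status variables a Galera node exposes. It assumes a node that is already running as part of a cluster; on a server without Galera enabled, such as February, the result is not meaningful.

```sql
-- Run on a Galera node.
--   wsrep_cluster_size        : number of nodes this node currently sees in the cluster
--   wsrep_cluster_status      : whether this node belongs to the primary component
--   wsrep_local_state_comment : this node's state, for example Synced
--   wsrep_ready               : whether this node is accepting queries
SHOW GLOBAL STATUS WHERE Variable_name IN
  ('wsrep_cluster_size', 'wsrep_cluster_status',
   'wsrep_local_state_comment', 'wsrep_ready');
```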
What February has instead
February’s current approach to database availability and recovery:
Availability is handled by the UPS and NUT setup. February stays online through power interruptions and shuts down cleanly before the battery runs out. A clean shutdown means InnoDB does not need crash recovery on the next start, and the binary log is in a consistent state.
Point-in-time recovery is handled by the combination of borgmatic logical backups and the binary log. A failure between borgmatic runs can be recovered to within seconds of the failure point, provided the binary log files written since the last backup are still readable.
Disaster recovery is handled by borgmatic’s daily backup to the NAS. A complete hardware failure can be recovered from the most recent backup, accepting the loss of data since the last backup window.
What February does not have is automatic failover. If February’s hardware fails entirely, there is no other server to promote. Recovery requires restoring to new hardware. For a homelab server with a single user base and no SLA, that is an acceptable limitation.
If February ever grows into something running business-critical services with multiple users and an expectation of continuous availability, the replication and Galera options described here are the natural next steps. At that point the binary log configuration, the GTID-ready server_id setting, and the logical backup strategy are all already in place. Adding a replica becomes a matter of configuring the new server and pointing it at February rather than rebuilding the foundation from scratch.
A note on the series
The MariaDB sub-series is now complete. It has covered installation, configuration, the three storage engines in use on February, logging, TLS, the binary log, user management, backup, and this conceptual overview of replication. That is more depth on a single component than most homelab guides give to the entire server stack.
The reason is that databases are where data lives, and data is what matters. The rest of February’s infrastructure — DNS, WireGuard, LoRaWAN, the UPS, the firewall — can be rebuilt from documentation. The data in the databases cannot.