# Physical log based replication
This is a replication scheme that relies on the physical logs generated by Berkeley DB to replicate changes to other nodes asynchronously. The replicant nodes (a.k.a. physreps) pull the physical logs from the source machine and apply them locally.

It is important to note that physreps apply the logs out-of-band and, thus, should not be considered part of the source cluster.
In order to enable replication, a `replicate_from` line must be added to the physrep's lrl file (i.e., the lrl of the database copy created with `copycomdb2`). This line can take either a valid Comdb2 cluster tier, a hostname, or a comma-separated list of hostnames (without any spaces).

```
replicate_from source_dbname @source_host/source_tier
```
Physical replicants started with the above configuration will replicate directly off the specified source host or one of the nodes in the given cluster tier.
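For illustration, a physrep pulling from a hypothetical source database `proddb` could use any of the following forms (the tier and host names are made up for the example):

```
replicate_from proddb @prod
replicate_from proddb @prodhost1
replicate_from proddb @prodhost1,prodhost2,prodhost3
```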
## Setting up a tiered replication topology
Physical replicants pull logs from the source by executing a `SELECT` query. Thus, every physrep added to the system adds to the overall load on the source host/cluster.

In order to avoid having the source support all physreps directly, one can set up tiered replication, in which some physical replicants become the source for other replicants, keeping some load off the top-level source host/cluster.
```
      S                  |      -- tier 0
     / \                 |
    /   \                | log-flow
   R1    R2              |      -- tier 1
  /  \     \             |
 /    \     \            v
R3     R4    R5                 -- tier 2
```
Note: The source cluster nodes are always considered at tier 0.
Setting up a tiered replication topology requires two metadata tables to maintain the current state of replication as well as the replication topology:

- `comdb2_physreps`
- `comdb2_physrep_connections`
These tables are updated automatically to reflect changes as replicants join or leave the system and, thus, are not designed to be modified manually under normal circumstances. In order to keep the load evenly spread, these tables are consulted to ensure that a certain fanout (`physrep_fanout`) is maintained across all the nodes. The LSN information in the `comdb2_physreps` table is used by all the nodes to pause log deletion.
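As a rough sketch of that log-deletion logic: a source node must retain every log file that some registered replicant still needs, which amounts to a lower-bound query along the following lines (illustrative only; the query the server runs internally may differ):

```sql
-- The lowest log file still needed by any known physrep; a source
-- node must not delete log files at or above this number.
SELECT MIN(file) FROM comdb2_physreps WHERE file IS NOT NULL
```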
## Algorithm
On start, a physical replicant executes `sys.physrep.register_replicant()` against the `physrep_metadb`, which, in turn, responds with a list of potential nodes that can be used as the source of physical logs. The replicant then picks a node from the list and tries to connect to it. On successful connection, the replicant executes `sys.physrep.update_registry()` against the `physrep_metadb`, confirming that the replicant is now connected to a node.
Upon successful registration, the physical replicants execute `SELECT .. FROM comdb2_transaction_logs` to retrieve (pull) log records, which are then applied locally. The replicants also periodically execute `sys.physrep.keepalive()` against the `physrep_metadb` to report their current LSN. This information is used by the nodes to control log deletion.
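A minimal sketch of the pull query, assuming the table-function form of `comdb2_transaction_logs` that takes a starting LSN written as `'{file:offset}'` (the exact arguments physreps use are internal):

```sql
-- Pull log records starting at LSN {1:0}; with the blocking_physrep
-- tunable enabled, the query blocks waiting for new records.
SELECT * FROM comdb2_transaction_logs('{1:0}')
```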
## Cross-tier replication
In certain setups, where a TCP connection is not permitted from a lower replication tier (say, development) to a higher replication tier (say, production), the pull strategy discussed above will not work.
So, in order to let physical replicants running in the development tier replicate from source nodes running in the production tier, we force them to wait for a connection from a node in the production tier; when a connection arrives, the replicants use it to execute the (reverse) query that reads logs from `comdb2_transaction_logs`.
```
   ,------- S <-------,
   | (1) connect      |
   |                  |              (production)
---|------------------|--------------------------------------------------------
   |                  |              (development)
   |                  | (2) execute `SELECT .. FROM comdb2_transaction_logs`
   `------> R1 -------'
```
Now, the question is: how would the nodes in production know which node(s) in development to connect to? They get this information from the `physrep_metadb`'s `comdb2_physrep_sources` table. This table can be updated with all the source->target information, based on which a replicant (in development) waits for a connection (`sys.physrep.should_wait_for_con()`) and the source nodes (in production) connect to the physrep nodes by getting a list from the same table (`sys.physrep.get_reverse_hosts()`). The `comdb2_physrep_sources` table allows the user to specify the source and target dbs by either hostname, tier, or cluster.
A database will attempt to establish a reverse connection with a target if its dbname matches the row's `source_dbname` field, and its hostname, cluster name, or tier matches the row's `source_host` field. When a source sees that a target is specified by tier, it will attempt to establish a reverse connection to all of the machines listed under comdb2db for that tier. If a target is listed by cluster name, the source will attempt to establish a reverse connection to all of the hosts in that cluster, as represented by the bbcpu.lst.
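For example, a row like the following (database, host, and tier names are hypothetical) tells the production database `proddb`, running on tier `prod`, to reverse-connect to the development physrep `devdb` on host `devhost1`:

```sql
INSERT INTO comdb2_physrep_sources(dbname, host, source_dbname, source_host)
    VALUES('devdb', 'devhost1', 'proddb', 'prod')
```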
NOTE: In cross-tier replication, the replication metadata tables must be hosted by a separate database running in the lower (development) tier.
## Physical replication metadata tables
### comdb2_physreps

```sql
CREATE TABLE comdb2_physreps(dbname CSTRING(60),
                             host CSTRING(120),
                             file INT,
                             offset INT,
                             firstfile INT,
                             last_keepalive DATETIME,
                             state CSTRING(60),
                             UNIQUE (dbname, host))
```
Physical replicant states:

- `Pending`: The node has requested to become a physical replicant (registration in progress)
- `Active`: The node is a physical replicant
- `Inactive`: The node might not be actively replicating; it is not allowed to become a replication source
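The table can be queried directly on the metadb to see where each replicant currently stands; for instance, a read-only inspection query (not part of the replication protocol) could be:

```sql
-- Current apply position and health of every registered physrep
SELECT dbname, host, file, offset, state, last_keepalive
  FROM comdb2_physreps
 ORDER BY dbname, host
```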
### comdb2_physrep_connections

```sql
CREATE TABLE comdb2_physrep_connections(dbname CSTRING(60),
                                        host CSTRING(120),
                                        source_dbname CSTRING(60),
                                        source_host CSTRING(120),
                                        UNIQUE (dbname, host, source_dbname, source_host),
                                        FOREIGN KEY(dbname, host) REFERENCES comdb2_physreps(dbname, host) ON DELETE CASCADE)
```
### comdb2_physrep_sources

```sql
CREATE TABLE comdb2_physrep_sources(dbname CSTRING(60),
                                    host CSTRING(120),
                                    source_dbname CSTRING(60),
                                    source_host CSTRING(120),
                                    UNIQUE (dbname, host, source_dbname, source_host))
```
## Tunables
- `blocking_physrep`: The `SELECT .. FROM comdb2_transaction_logs` query executed by physical replicants blocks waiting for the next log record. (Default: `false`)
- `physrep_check_minlog_freq_sec`: Check the minimum log number to keep this often, in seconds. (Default: `10`)
- `physrep_debug`: Print extended physrep trace. (Default: `off`)
- `physrep_exit_on_invalid_logstream`: Exit physreps on an invalid logstream. (Default: `off`)
- `physrep_fanout`: Maximum number of physical replicants that a node can service. (Default: `8`)
- `physrep_hung_replicant_check_freq_sec`: Check for hung physical replicants this often, in seconds. (Default: `10`)
- `physrep_hung_replicant_threshold`: Report if a physical replicant has been inactive for this duration, in seconds. (Default: `60`)
- `physrep_i_am_metadb`: Marks the node as a 'physical replication metadb'. (Default: `off`)
- `physrep_keepalive_freq_sec`: Periodically send the current LSN to the source node at this interval, in seconds. (Default: `10`)
- `physrep_max_candidates`: Maximum number of candidates that should be returned to a new physical replicant during registration. (Default: `6`)
- `physrep_metadb_host`: List of physical replication metadb cluster hosts.
- `physrep_metadb_name`: Physical replication metadb cluster name.
- `physrep_reconnect_penalty`: Seconds a physrep waits before retrying against the same node. (Default: `5`)
- `physrep_register_interval`: Interval, in seconds, for physical replicant re-registration. (Default: `3600`)
- `physrep_shuffle_host_list`: Shuffle the host list returned by `register_replicant()` before connecting to the hosts. (Default: `off`)
- `physrep_source_dbname`: Physical replication source cluster dbname.
- `physrep_source_host`: List of physical replication source cluster hosts.
- `revsql_allow_command_execution`: Allow processing and execution of commands over the reverse connection that has come in as part of the request. This is mostly intended for testing. (Default: `off`)
- `revsql_cdb2_debug`: Print extended reverse-sql cdb2-related trace. (Default: `off`)
- `revsql_connect_freq_sec`: This node will attempt to reverse connect to the remote host at this frequency, in seconds. (Default: `5`)
- `revsql_debug`: Print extended reverse-sql trace. (Default: `off`)
- `revsql_host_refresh_freq_sec`: The frequency, in seconds, at which the reverse-connection host list is refreshed. (Default: `5`)
- `revsql_force_rte`: Force rte-mode for all reverse connections. (Default: `on`)
- `connect_remote_rte`: Force rte-mode for both fdb and revsql connections. (Default: `off`)
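Assuming the usual comdb2 convention that such tunables can be set as plain `name value` lines in the lrl file (an assumption; deployments may also set them dynamically), a physrep that blocks on the log stream and prints debug trace might carry:

```
blocking_physrep on
physrep_debug on
physrep_reconnect_penalty 5
```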
## lrl configurations
- `replicate_from dbname @host/tier`: Sets the source host/cluster. It is required for all physical replicants.
- `replicate_wait`: Tells the physical replicant to wait for this many seconds before applying the log records.
- `physrep_metadb`: If set, all the nodes will connect to this database (rather than the source host/cluster specified via `replicate_from`) for the replication metadata tables.
- `physrep_fanout_override`: This is set on the metadb, and allows per-database overrides of the `physrep_fanout` tunable. The `physrep_fanout_override` message-trap allows this to be set dynamically; the `physrep_fanout_dump` message-trap prints the current overrides.
- `physrep_ignore`: All log records that belong to any of these tables are ignored by physical replicants.
- `nonames`: Forces system database file names to not carry the database name. This setting is required for physical-log based replication to work properly.
- `elect_highest_committed_gen`: Biases elections by the highest generation in the logfile. This setting is required for physical-log based replication to work properly.
- `autoanalyze`: Since physical replicants run in read-only mode, autoanalyze would not have any effect on the state of the 'stat' tables. Thus, it is recommended to keep it turned off.
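Putting the required pieces together, a minimal physrep lrl file might look like the following sketch (all names and paths are hypothetical; `name` and `dir` are the standard lrl directives for the database name and data directory):

```
name  devdb
dir   /db/devdb
replicate_from proddb @prod
nonames
elect_highest_committed_gen
```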
## Stored procedures
- `sys.cmd.start_replication`: Start replication threads
- `sys.cmd.stop_replication`: Stop replication threads

Note: Based on the current lrls (and the existence of the replication metadata tables), the replication threads are created automatically on start.
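These two can be invoked manually via the standard stored-procedure syntax, e.g.:

```sql
EXEC PROCEDURE sys.cmd.start_replication()
```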
There are additional stored procedures that nodes participating in physical replication execute to perform various tasks, from setting up to maintaining the state of physical replication. These procedures should be considered internal and should not be executed manually.

- `sys.physrep.register_replicant`: A request to register a new physical replicant. This is the first step of registration.
- `sys.physrep.update_registry`: As the second step of registration, the physrep executes this to confirm that it has successfully connected to a node.
- `sys.physrep.keepalive`: Executed by replicants periodically to report their current LSNs.
- `sys.physrep.reset_nodes`: Executed to reset the states of nodes in the replication metadb.
- `sys.physrep.topology`: Returns a list of all the nodes in the system along with their tiers.
- `sys.physrep.get_reverse_hosts`: Returns pairs of dbname/host against which the source node must periodically send `reversesql` requests.
- `sys.physrep.should_wait_for_con`: Used by physreps to check whether they should wait for a `reversesql` connection from a source host.
## Log-triggers / Log-queue-triggers
Log-triggers provide a way for a physical replicant to redirect writes to queues to either a file or a Kafka topic. They provide a solution for products which previously utilized consumers under local replication. As consuming directly from a queue requires the ability to write to the database (which is not possible for physical replicants), log-triggers are being developed as an alternate notification mechanism. As log-triggers are under active development, we expect them to change as we receive feedback from customers.
The log-trigger sub-system introduces two new lrl-directives:
- `qdump <queue-name> <file-name> <maxsize>`: All write-traffic directed to the specified queue-name will be written to file-name. The maxsize parameter places a limit on the amount of memory that is allowed to remain queued in memory. Specifying 0 for maxsize allows this to grow without bound.
- `qkafka <queue-name> <kafka-topic> <maxsize>`: All write-traffic directed to the specified queue-name will be written to the specified kafka-topic. As with qdump, the maxsize argument places a limit on the amount of memory allowed in the outgoing queue, with 0 allowing it to grow without bound. In a clustered physrep deployment, qkafka will only write to Kafka from the current master node.
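For instance (the queue name, topic, and file path below are hypothetical), a physrep's lrl could carry:

```
qdump  orders_queue /tmp/orders_queue.dump 16777216
qkafka orders_queue orders_topic 16777216
```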