Physical log based replication

This is a replication scheme that relies on the physical logs generated by Berkeley DB to replicate changes to other nodes asynchronously. The replicant nodes (a.k.a. physreps) pull the physical logs from the source machine and apply them locally.

It is important to note that physreps are applying the logs out-of-band and, thus, should not be considered part of the source cluster.

In order to enable replication, replicate_from must be added to the physrep's lrl file (i.e. the lrl of the database copy created with copycomdb2). This line can take either a valid Comdb2 cluster tier, a hostname, or a comma-separated list of hostnames (without any spaces).

replicate_from source_dbname @source_host/source_tier

Physical replicants started with the above configuration will start replicating directly off of the specified source host or one of the nodes in the cluster tier.
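
For example, assuming a hypothetical source database named proddb running in a tier named prod, the physrep's lrl could contain:

  replicate_from proddb @prod

A hostname or a comma-separated list of hostnames can be used in place of the tier.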

Setting up a tiered replication topology

Physical replicants pull logs from the source by executing a SELECT query. Thus, every physrep added to the system adds to the overall load on the source host/cluster.

In order to avoid having the source support all physreps directly, one can set up tiered replication, in which some physical replicants become the source for other replicants, thus keeping some load off of the top-level source host/cluster.


                            S       |                    -- tier 0 
                           / \      |
                          /   \     | log-flow
                         R1   R2    |                    -- tier 1
                        / \     \   |
                       /   \     \  v
                      R3   R4    R5                      -- tier 2

Note: The source cluster nodes are always considered at tier 0.

Setting up tiered replication topology requires 2 base tables to maintain the current state of replication as well as the replication topology.

  • comdb2_physreps
  • comdb2_physrep_connections

These tables are updated automatically as replicants join or leave the system and are therefore not meant to be modified manually under normal circumstances. In order to keep the load evenly spread, these tables are consulted to ensure that a certain fanout (physrep_fanout) is maintained across all the nodes. The LSN information in the comdb2_physreps table is used by all the nodes to pause log deletion.
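
To inspect the current topology, one can query these tables on the physrep_metadb, for example:

  SELECT dbname, host, source_dbname, source_host FROM comdb2_physrep_connections

For the diagram above, R1 and R2 would show S as their source, while R3 and R4 would show R1, and R5 would show R2.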

Algorithm

On start, a physical replicant executes sys.physrep.register_replicant() against the physrep_metadb, which, in turn, responds with a list of potential nodes that can be used as the source of physical logs. The replicant then picks a node from the list and tries to connect to it. On successful connection, the replicant executes sys.physrep.update_registry() against the physrep_metadb, confirming that it is now connected to a node.

Upon successful registration, the physical replicants execute SELECT .. FROM .. comdb2_transaction_logs to retrieve (PULL) log records, which are then applied locally. The replicants also periodically execute sys.physrep.keepalive() against the physrep_metadb to report their current LSN. This information is used by the nodes to control log deletion.
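
As a rough sketch, the pull query looks like the following, assuming comdb2_transaction_logs is given the replicant's current LSN (the '{1:0}' value below is a placeholder):

  SELECT * FROM comdb2_transaction_logs('{1:0}')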

Cross-tier replication

In certain setups, where a TCP connection is not permitted from a lower replication tier (say, development) to a higher replication tier (say, production), the PULL strategy discussed above will not work.

So, in order to allow physical replicants running in the development tier to replicate from source nodes running in the production tier, we force them to wait for a connection from a node in the production tier; when a connection arrives, the replicants use it to execute the (reverse) query to read logs from comdb2_transaction_logs.

                      ,------- S <-------,
                      |(1) connect       |
                      |                  |                         (production)
----------------------|------------------|-------------------------------------
                      |                  |                        (development)
                      |                  |(2) execute `SELECT .. FROM comdb2_transaction_logs`
                      `------> R1 -------'

Now, the question is: how would the nodes in production know which node(s) in development to connect to? They get this information from the physrep_metadb's comdb2_physrep_sources table. This table can be populated with all of the source->target information: a replicant (in development) consults it to decide whether to wait for a connection (sys.physrep.should_wait_for_con()), and the source nodes (in production) get the list of physrep nodes to connect to from the same table (sys.physrep.get_reverse_hosts()). The comdb2_physrep_sources table allows the user to specify the source and target dbs by hostname, tier, or cluster. A source database will attempt to establish a reverse connection with a target if the row's source_dbname matches the source's dbname and the source_host field matches the source's hostname, cluster name, or tier. When a source sees that a target is specified by tier, it will attempt to establish a reverse connection to all of the machines listed under comdb2db for that tier. If a target is listed by cluster name, the source will attempt to establish a reverse connection to all of the hosts in that cluster, as listed in bbcpu.lst.

NOTE: In cross-tier replication, the replication metadata tables must be hosted by a separate database running in the lower (development) tier.
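
As an illustration, a row like the following (all names are hypothetical) would make the production source proddb establish a reverse connection to the development replicant devdb running on dev-host1:

  INSERT INTO comdb2_physrep_sources(dbname, host, source_dbname, source_host)
      VALUES('devdb', 'dev-host1', 'proddb', 'prod')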

Physical replication metadata tables

comdb2_physreps

CREATE TABLE comdb2_physreps(dbname CSTRING(60),
                             host CSTRING(120),
                             file INT,
                             offset INT,
                             firstfile INT,
                             last_keepalive DATETIME,
                             state CSTRING(60),
                             UNIQUE (dbname, host))
  • Physical replicant states:
    • Pending : The node has requested to become a physical replicant (registration in-progress)
    • Active : The node is a physical replicant
    • Inactive : The node might not be actively replicating; not allowed to become a replication source

comdb2_physrep_connections

CREATE TABLE comdb2_physrep_connections(dbname CSTRING(60),
                                        host CSTRING(120),
                                        source_dbname CSTRING(60),
                                        source_host CSTRING(120),
                                        UNIQUE (dbname, host, source_dbname, source_host), 
                                        FOREIGN KEY(dbname, host) REFERENCES comdb2_physreps(dbname, host) ON DELETE CASCADE)

comdb2_physrep_sources

CREATE TABLE comdb2_physrep_sources(dbname CSTRING(60),
                                    host CSTRING(120),
                                    source_dbname CSTRING(60),
                                    source_host CSTRING(120),
                                    UNIQUE (dbname, host, source_dbname, source_host))

Tunables

  • blocking_physrep: The SELECT .. FROM comdb2_transaction_logs query executed by physical replicants blocks for the next log record. (Default: false)
  • physrep_check_minlog_freq_sec: Check the minimum log number to keep this often. (Default: 10)
  • physrep_debug: Print extended physrep trace. (Default: off)
  • physrep_exit_on_invalid_logstream: Exit physreps on invalid logstream. (Default: off)
  • physrep_fanout: Maximum number of physical replicants that a node can service. (Default: 8)
  • physrep_hung_replicant_check_freq_sec: Check for hung physical replicant this often. (Default: 10)
  • physrep_hung_replicant_threshold: Report if the physical replicant has been inactive for this duration. (Default: 60)
  • physrep_i_am_metadb: Marks the node as ‘physical replication metadb’. (Default: off)
  • physrep_keepalive_freq_sec: Periodically send the current LSN to the source node at this interval. (Default: 10)
  • physrep_max_candidates: Maximum number of candidates that should be returned to a new physical replicant during registration. (Default: 6)
  • physrep_metadb_host: List of physical replication metadb cluster hosts.
  • physrep_metadb_name: Physical replication metadb cluster name.
  • physrep_reconnect_penalty: Seconds a physrep waits before retrying the same node. (Default: 5)
  • physrep_register_interval: Interval for physical replicant re-registration. (Default: 3600)
  • physrep_shuffle_host_list: Shuffle the host list returned by register_replicant() before connecting to the hosts. (Default: off)
  • physrep_source_dbname: Physical replication source cluster dbname.
  • physrep_source_host: List of physical replication source cluster hosts.
  • revsql_allow_command_execution: Allow processing and execution of commands over the reverse connection that has come in as part of the request. This is mostly intended for testing. (Default: off)
  • revsql_cdb2_debug: Print extended reverse-sql cdb2-related trace. (Default: off)
  • revsql_connect_freq_sec: This node will attempt to reverse connect to the remote host at this frequency. (Default: 5 seconds)
  • revsql_debug: Print extended reverse-sql trace. (Default: off)
  • revsql_host_refresh_freq_sec: The frequency at which the reverse connection host list will be refreshed. (Default: 5 seconds)
  • revsql_force_rte: Force rte-mode for all reverse connections. (Default: on)
  • connect_remote_rte: Force rte-mode for both fdb and revsql connections. (Default: off)
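
Assuming these tunables can be set in the lrl file using the usual name-value form, a node could, for instance, adjust its fanout and keepalive interval with lines like the following (the values are arbitrary):

  physrep_fanout 16
  physrep_keepalive_freq_sec 30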

lrl configurations

  • replicate_from dbname @host/tier: This line sets the source host/cluster. It is required for all physical replicants.
  • replicate_wait: Tells the physical replicant to wait for this many seconds before applying the log records.
  • physrep_metadb: If set, all the nodes will connect to this database (instead of the source host/cluster specified via replicate_from) for the replication metadata tables.
  • physrep_fanout_override: This is set on the metadb and allows per-database overrides of the 'physrep_fanout' tunable. The 'physrep_fanout_override' message-trap allows this to be set dynamically. The 'physrep_fanout_dump' message-trap prints the current overrides.
  • physrep_ignore: All the log records that belong to any of these tables are ignored by physical replicants.
  • nonames: This configuration forces system database file names to not carry the database name. This setting is required for physical-log based replication to work properly.
  • elect_highest_committed_gen: Bias election by the highest generation in the logfile. This setting is required for physical-log based replication to work properly.
  • autoanalyze: Since physical replicants run in read-only mode, autoanalyze would not have any effect on the state of ‘stat’ tables. Thus, it is recommended to keep it turned off.
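
Putting some of these together, a minimal physrep lrl might look roughly like the following (the database name, tier, and wait value are placeholders); elect_highest_committed_gen should also be enabled as noted above:

  replicate_from proddb @prod
  replicate_wait 60
  nonames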

Stored procedures

  • sys.cmd.start_replication: Start replication threads
  • sys.cmd.stop_replication: Stop replication threads

Note: Based on the current lrls (and the existence of replication metadata tables), the replication threads are created automatically on start.
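
For example, an operator could stop and restart the replication threads on a physrep with something like the following cdb2sql invocations (the database name is a placeholder, and the usual exec procedure syntax is assumed):

  cdb2sql physrepdb local "exec procedure sys.cmd.stop_replication()"
  cdb2sql physrepdb local "exec procedure sys.cmd.start_replication()"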

There are additional stored procedures that nodes participating in physical replication execute to perform various tasks, from setting up to maintaining the state of physical replication. These procedures should be considered internal and should not be executed manually.

  • sys.physrep.register_replicant: A request to register a new physical replicant. This is the first step of registration.
  • sys.physrep.update_registry: As the second step of registration, the physrep executes this to confirm that it is successfully connected to a node.
  • sys.physrep.keepalive: Is executed by replicants periodically to report their current LSNs.
  • sys.physrep.reset_nodes: Is executed to reset states of nodes in replication metadb.
  • sys.physrep.topology: Returns a list of all the nodes in the system alongside their tiers.
  • sys.physrep.get_reverse_hosts: Returns pairs of dbname/host against which the source node must periodically send reversesql requests.
  • sys.physrep.should_wait_for_con: This is used by physreps to check whether they should wait for a ‘reversesql’ connection from a source host.

Log-triggers / Log-queue-triggers

  • log-triggers provide a way for a physical replicant to redirect writes to queues to either a file or a Kafka topic. They provide a solution for products which previously utilized consumers under local-replication. As consuming directly from a queue requires the ability to write to the database (which is not possible for physical replicants), log-triggers are being developed as an alternate notification mechanism. As log-triggers are under active development, we expect them to change as we receive feedback from customers.

The log-trigger sub-system introduces two new lrl-directives:

  • qdump <queue-name> <file-name> <maxsize>: All write-traffic directed to the specified ‘queue-name’ will be written to ‘file-name’. The maxsize parameter places a limit on the amount of data allowed to remain queued in memory. Specifying 0 for maxsize allows this to grow without bound.
  • qkafka <queue-name> <kafka-topic> <maxsize>: All write-traffic directed to the specified ‘queue-name’ will be written to the specified ‘kafka-topic’. As with qdump, the maxsize argument places a limit on the amount of memory we allow in the outgoing queue, with 0 allowing it to grow without bound. In a clustered physrep deployment, qkafka will only write to Kafka from the current master node.
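
For example, the following lrl lines (queue, file, and topic names are hypothetical) would mirror writes to a queue both to a local file with no size cap and to a Kafka topic with an in-memory limit of 100MB:

  qdump mytrigger /tmp/mytrigger.dump 0
  qkafka mytrigger my_kafka_topic 104857600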