LRL files
Comdb2 configuration files have an .lrl extension for historic reasons. No configuration file is required. If no file is specified, the database will pick reasonable defaults. On startup, the database will look for the following files, and read them, in order:
$COMDB2_ROOT/etc/cdb2/config/comdb2.lrl
$COMDB2_ROOT/etc/cdb2/config/comdb2_local.lrl
$COMDB2_ROOT/etc/cdb2/config.d/*.lrl
dbname.lrl
(for your database’s name)
COMDB2_ROOT
is set at compile time (defaults to /opt/bb/
). Setting a $COMDB2_ROOT
environment variable overrides
the default value.
The intent behind multiple config files is to make deployment of settings easier. comdb2.lrl
is meant to contain settings
that are identical across many databases in your application. comdb2_local.lrl
is for settings that are local to this
machine. Files under config.d
are useful for organizing settings in multiple files (the ‘include’ directive in .lrl files
can also accomplish this).
The last file read is the database specific lrl file. This contains settings specific to a given database. Comdb2 will look for this file in one of these locations, in this order:
- Current directory
- In database directory if it’s specified with
--dir /path/to/db
on command line - In the default database location, if the database directory isn’t specified
- In the location specified with ‘–lrl /path/to/dbname.lrl` on command line
Lines beginning with ‘#’ are treated as comments and skipped. Other lines should consist of a directive and its value.
Database tunables
Cache size
Cache size can be set with the the ‘cache’ tunable. It takes a cache size and a unit (kb, mb, gb). For example
cache 64 mb
This gets you 64 megabytes of cache. The database will print its cache size when starting up:
Cache:64MB Segments:1 Segment-size:64MB
The default cache size is 64MB. You can set policies on a machine by capping the min and max settings for cachesize (putting these in global comdb2.lrl or comdb2_local.lrl is a good idea. For example to set the minimum cache size at 8MB, and restrict the maximum at 1GB (say, for a shared machine), one can have:
cachekbmin 8192
cachekbmax 1048576
Maximum and minimum values are given in KB.
Thread Pools
There are various threadpools in Comdb2 responsible for specific work. They all take a common set of options.
Name | Function | Default size | Default max queue depth |
---|---|---|---|
appsockpool | Pool for application connections | Unlimited | 0 |
sqlenginepool | Pool for sql runner threads | 48 | 500 |
iopool | Cache flusher threads | 4 | 8000 |
The following options are available for configuring thread pool parameters:
Parameters | Function |
---|---|
dump_on_full | If set, argument is on ) will dump the current state of the threadpool when the queue is full |
exit_on_error | If set (argument is on ), database will exit if it can’t spin up new threads for this threadpool. |
linger | How long threads above mint will hang around (in seconds) before exiting. |
maxagems | Max age of an item on a queue in ms. Items older than given will be dropped. |
maxq | Maximum queue depth. If maxt threads are active and none are available, items are enqueued. If the queue reaches this depth, requests to enqueue further are dropped. |
maxqover | Maximum queue override depth. Queued items below this limit won’t generate warnings. |
maxt | Maximum number of threads to keep around. Lower this you don’t get gains from additional concurrency for the specific subsystem. |
mint | Minimum number of threads to keep around. Threads above this value will exit after linger seconds. Raise this if the thread pool reports lots of thread creates. |
Examples:
To limit the number of concurrent SQL threads for this node to 12:
sqlenginepool maxt 12
To dump the SQL thread pool every time the queue is full:
sqlenginepool maxq dump_on_error on
Parallel sql execution
A subset of sql queries can run in parallel, using multiple sql engines and scaling up the numbers of rows per second returned to the client.
The statements that can be run in parallel are decomposed in a set of sql statements, which run in parallel in dedicated sql engines.
The sql engine initiating the parallel execution becomes the coordinator, which does row aggregation and final computations before sending
the rows back to the client.
A trivial case is the UNION ALL
query. Each SELECT
query that is part of a UNION ALL
statement runs in parallel with the rest.
This feature makes possible to scale up the throughput of sql queries, proportional with the amount of allocated resources. Additionally, this
feature can also benefit cases where per row retrieval cost is high, either due to I/O latency or the computation required.
Settings:
Option | Default | Description |
---|---|---|
dohsql_disable | 0 | Disable parallel sql execution (the SQL decomposition is still performed) |
dohast_disable | 0 | Disable SQL decomposition phase, required to distribute the sql query (in effect, disables the parallel execution mode). |
dohast_verbose | 0 | Enable debug information for parallel execution phase |
dohsql_max_queued_kb_highwm | 10000 | Maximum shard queue size, in KB; throttles amount of cached rows by each parallel component |
dohsql_max_threads | 8 | Allow only up to 8 parallel components. If more are required, statement runs sequential |
dohsql_pool_thread_slack | 1 | Reserve a number of sql engines to run only non-parallel load (including parallel components). |
dohsql_sc_max_threads | 8 | Allow only up to 8 parallel schema changes. If more are required, they runs sequential |
Networks
There are a few references to various “networks” in the descriptions below. All machines specified by the
cluster nodes
tunable (or added dynamically) form connections. There is 2 dedicated threads for each machine
for reading and writing messages. There’s 3 “networks” formed for each database (so 3 sets of connections).
Each network has its own set of tunables. Messages are queued for each destination, and are flushed when
necessary. All nodes periodically send heartbeats. Lack of a heartbeat is treated like a network drop.
Network tunables for controlling heartbeat parameters, queue sizes, etc. are listed below. The following
sections describe each network’s uses.
Option | Default | Description |
---|---|---|
heartbeat_check_time | 10 (seconds) | Consider an error if no heartbeat for this many seconds |
nax_max_mem | 0 (not set) | Maximum size (in MB) of items keep on replication network queue before dropping (per replicant) |
noudp | Disables udp . |
|
osql_bkoff_netsend | 100 ms | On a full offload net queue, attempt to wait this long before attempting to resend |
osql_bkoff_netsend_lmt | 300000 | Wait a total of this many ms attempting to send on the offload net |
osql_heartbeat_send_time | 5 (sec) | Like heartbeat_send_time for the offload network |
udp | set | Transaction acks are sent back to master via UDP. Since UDP is potentially lossy, replicants will inject the current LSN ack into their TCP channel to the master every 500 ms. On a lossy network, if you see lots of 500ms transactions, you may want to disable UDP. Such cases aren’t typical. |
Replication
The replication network handles replication-specific traffic between database nodes. This includes election messages, log messages (ie: normal replication), requests for re-transmits, etc. Steps 3 and 4 in Transaction Life cycle go over the replication network.
Offload
The offload net is used by replicants to send work to the master to apply. Replicants run SQL and assemble a set of record modifications to be applied. Steps 2 and 5 in Transaction Life cycle go over the offload net.
Signal
The signal net is reserved for future use.
Sync commands
These tunables control how often the database flushes its log to disk, and what nodes are required to acknowledge a log entry before control is returned to the user. All these tunables start with the ‘sync’ command, and the following word determine which option is being set. The default mode is to NOT flush logs on every transaction, and to wait for all nodes to acknowledge a transaction before returning success.
Option | Default | Description |
---|---|---|
full | not set | Logs are flushed on every commit |
log-delete | on | on enables log deletion, off disables it |
log-delete-age-set | Takes epoch time in seconds, log files older than the given time are eligible for deletion when not needed | |
log-delete-before | not set | Make all log files older than current time are eligible for deletion when not needed |
log-delete-now | set | Make all log files eligible for deletion when not needed |
log-sync-time | 10 (seconds) | Sets how often we flush the logs to disk. |
none | not set | Asynchronous replication, don’t wait for any replication acknowledgments before returning success. |
normal | set | Logs are NOT flushed on every commit (ie: you can lose transactions if all the nodes in the cluster crash). This is the default. This option also implies fully synchronous replication - ie: wait for all nodes to ack before returning success. |
rep_always_wait | not set | Wait for all nodes, including those that aren’t connected (not recommended) |
room | not set | Wait for acknowledgments only from nodes in the same availability zone/data center (see room affinity). |
Machine class configuration overrides
By default, the risk machine class names are, from the least risky deployment to the highest risk: ‘dev’/’test’, ‘alpha’, ‘uat’, ‘prod’. The following lrl option overrides this hierarchy to a client custom configuration that matches a different deployment stage scheme.
classes <lowest_risk_name> <higher_risk_name> ... <highest_risk_name>
The machine specific, as perceived by the server using the involved lrl file, can be overridden using the following lrl option:
machine_class <name>
NOTE: the overriding name should match one of the existing class names, otherwise the client access will be denied.
Allow/Disallow commands
These commands control machine class permissions. For every incoming connection, Comdb2 will try to read its class
from comdb2db once. allow
and disallow
can be used to set up policies for which classes of machines are allowed
to access databases. For example, you may have a blanket rule that production databases can only be accessed by
production machines, etc. Commands have this syntax:
allow|disallow writes|cluster|broadcast [from] class|machine
Class should be one of ‘dev’, ‘alpha’, ‘beta’, ‘prod’. Any other value is assumed to be a machine name.
writes
allows/disallows transactions that change the state of the database from the given source. cluster
allows/disallows forming clusters with machines from the given source. broadcast
allows/disallows consumer
connections from the given source.
Switches
Switches are simple on/off settings that generally control debugging or experimental features. They can
be enabled with the on
option, and disabled with off
. Unless debugging/developing Comdb2, toggling these
is generally not advised.
Option | Default | Description |
---|---|---|
abort_on_ufid_mismatch | on | Abort on open if the on-disk ufid doesn’t match the ufid in the transaction log. |
accept_on_child_nets | off | listen on separate port for osql/signal nets |
allow_broken_datetimes | on | Allow broken datetimes |
allow_parallel_rep_on_pagesplit | on | allow parallel rep on pgsplit |
allow_parallel_rep_on_prefix | on | allow parallel rep on bam_prefix |
already_aborted_trace | off | Print trace when dd_abort skips an ‘already-aborted’ locker |
bad_lrl_fatal | off | Unrecognized lrl options are fatal errors |
broadcast_check_rmtpol | on | Check rmtpol before sending triggers |
cache_sockpool_tran | on | Cache sent rows for an sql transaction, and survive a master swing |
catch_response_on_retry | on | print trace when we try to send replies on a retry |
check_page_in_recovery | off | verify that a page has or hasn’t gotten corrupt |
check_sparse_files | off | When allocating a page, check that we aren’t creating a sparse file |
check_sql_source | off | Check sql source |
check_sqlite_numeric_types | on | Report if our numeric conversion disagrees with SQLite’s |
check_wrong_db | on | Return error if connecting to wrong database |
comptxn_inherit_locks | on | Compensating transactions inherit pagelocks |
consumer_rtcpu | on | Don’t send update broadcasts to machines that are marked offline |
core_on_sparse_file | off | Generate a core if we catch berkeley creating a sparse file |
dbglog_use_sockpool | on | Use sockpool for connections opened for dbglog |
debug_mpalloc_size | off | Alarm on suspicious allocation requests; |
debug_temp_tables | off | Debug temp tables |
dflt_livesc | on | Use live schema change by default |
dflt_plansc | on | Use planned schema change by default |
disable_blob_check | off | return immediately in check_blob_buffers |
disable_etc_services_lookup | off | When on, disables using /etc/services first to resolve ports |
disable_exit_on_thread_error | off | don’t exit on thread errors |
disable_new_si_overhead | off | return immediately in several new snapisol functions |
disable_rowlocks_logging | off | Don’t add logical logging for rowlocks |
disable_rowlocks | off | Follow rowlocks codepath but don’t lock |
disable_stable_for_ipu | on | For inplace update tables, disable stable find-next cursors |
disable_update_shadows | off | stub out update shadows code |
dump_after_byteswap | off | dump page after byteswap |
dump_locks_on_repwait | off | Dump locks on repwaits |
dump_page_on_byteswap_error | off | fsnap a malformed page from byteswap |
emptystrnum | on | Empty strings don’t convert to numbers |
enable_osql_logging | off | Log every osql packet and operation |
enable_osql_longreq_logging | off | Log untruncated osql strings |
fix_cstr | on | Fix validation of cstrings |
flush_check_active_peer | on | Check if still have active connection when trying to flush |
force_old_cursors | off | Replicant will use old cursors |
gather_rowlocks_on_replicant | on | Replicant will gather rowlocks |
key_updates | on | Update non-dupe keys instead of delete/add |
lightweight_rename | off | When enabled, rename will add an alias instead of completely changing the table name |
lock_timing | on | Berkeley DB will keep stats on time spent waiting for locks |
locks_check_waiters | off | Light a flag if a lockid has waiters |
mask_internal_tunables | on | When enabled, comdb2_tunables system table would not list INTERNAL tunables |
memp_pg_timing | on | Berkeley DB will keep stats on time spent in __memp_pg |
memp_timing | off | Berkeley DB will keep stats on time spent in __memp_fget |
no_timeouts_on_release_locks | on | Disable client-timeouts if we’re releasing locks |
only_match_on_commit | off | Only rep_verify_match on commit records |
parallel_sync | on | Run checkpoint/memptrickle code with parallel writes |
pflt_toblock_lcl | on | Prefault toblock operations locally |
pflt_toblock_rep | on | Prefault toblock operations on replicants |
pfltverbose | on | Verbose errors in prefaulting code |
plannedsc | on | Use planned schema change by default |
prefix_foreign_keys | on | Allow foreign key to be a prefix of your key |
print_blockp_stats | off | print thread-count in block processor |
private_blkseq | on | Keep a private blkseq |
random_rowlocks | off | Grab random, guaranteed non-conflicting rowlocks |
release_locks_trace | off | Print trace if we release locks |
rep_printlock | off | Print locks in rep commit |
replicate_rowlocks | on | Replicate rowlocks |
repverifyrecs | off | Verify every berkeley log record received |
requeue_on_tran_dispatch | on | Requeue transactional statement if not enough threads |
reset_queue_cursor_mode | on | Reset queue consumeer read cursor after each consume |
rl_retry_on_deadlock | on | retry micro commit on deadlock |
rowlocks_commit_on_waiters | off | Don’t commit a physical transaction unless there are lock waiters |
rowlocks_deadlock_trace | off | Prints deadlock trace in phys.c |
schemachange_perms | on | Check if schema change allowed from source machines |
scpushlogs | on | Push to next log after a schema changes |
shalloc_timing | on | Berkeley DB will keep stats on time |
sql_release_locks_in_update_shadows | on | Release sql locks in update_shadows on lockwait |
sql_release_locks_on_emit_row_lockwait | off | Release sql locks when we are about to emit a row |
sql_release_locks_on_si_lockwait | on | Release sql locks from si if the rep thread is waiting |
sql_release_locks_on_slow_reader | on | Release sql locks if a tcp write to the client blocks |
sqlclient_use_random_readnode | off | Sql client will use random sql allocation by default |
sqlite3openserial | on | Serialize calls to sqlite3_open to prevent excess CPU |
superset_foreign_keys | on | Allow foreign key to be a superset of your key |
support_datetime_in_triggers | on | Enable support for datetime/interval types in triggers |
t2t | off | New tag->tag conversion code |
thread_stats | on | Berkeley DB will keep stats on what its threads are doing |
track_curtran_locks | off | Print curtran lockinfo |
track_queue_time | on | Track time sql requests spend on queue |
update_startlsn_printstep | off | Print steps walked in update_startlsn code |
use_blkseq | on | Enable blkseq |
use_fastseed_for_comdb2_seqno | off | Use fastseed instead of context for comdb2_seqno unique values |
utxnid_log | on | When enabled, 8-byte transaction IDs will be written to log records. |
verbose_net | off | Net prints lots of messages |
verbose_toblock_backouts | off | print verbose toblock backout trace |
verbose_waiter_flag | off | Print trace setting the waiter flag in lock code |
verify_all_pools | off | verify objects are returned to the correct pools |
verify_dbreg | off | Periodically check if dbreg entries are correct |
verify_directio | off | Run expensive checks on directio calls |
verifycheckpoints | off | Highly paranoid checkpoint validity checks |
verifylsn | on | Verify if LSN written before writing page |
warn_cstr | on | Warn on validation of cstrings |
sqllogger
commands
There’s a detailed SQL logging system in place. It allows fine-grained data collection on queries. This
can be controlled with the sqllogger
commands. Logs themselves are binary files that are dumped into
the database directory with the name $DBNAME.sqllog. When rolled, they before $DBNAME.sqllog.1,
$DBNAME.sqllog.2, etc.
SQL logger command | Defaul | Description |
---|---|---|
sqllogger on | Enables sql logging | |
sqllogger asyncsize | 4194304 | For async logging, the max size of unflushed query data |
sqllogger async | 1 | Log asynchronously - don’t write to file in the same thread as SQL query |
sqllogger every | 1 | Log every Nth statement (1 means every), occasionally good for sampling queries |
sqllogger keep | 2 | Keep this many copies of rolled logs |
sqllogger rollat | not set | Automatically roll logs when they get to this size (in bytes) |
sqllogger sync | 1 | Log synchronously - write to file in the same thread as SQL query |
Incoherent nodes overview
One problem with synchronous replication is that commit time is dominated by the slowest node. What’s worse, a node that stops responding altogether can hold up replication forever. To avoid these problems, the master may choose to mark nodes incoherent. Incoherent nodes continue to operate as before with 2 exceptions:
- They will no longer service any requests - any new queries/transactions that come in are rejected and will go to another node.
- The master will no longer wait for replication acknowledgments from incoherent nodes.
There’s 2 ways for a node to be marked incoherent. The first catches outright broken nodes. If any nodes have acknowledged a transaction, and any other node takes longer than a configured threshold to respond, it’s considered broken and marked incoherent. The second option captures nodes that are statistically slower than other nodes. A node where the running average of acknowledgment times over the last minute twice as bad as the second bad node (withing some minimum threshold) is considered slow, and will become incoherent.
Incoherent nodes are self-healing. If a node marked incoherent catches up to the master, the master will mark it coherent again, will resume waiting for replication, and the node will be free to service requests. If a node was marked slow (second method, listed above), it will become coherent once its running average is back within the configured threshold.
The tunables in this section control the various tunables used to decide when nodes are to be considered incoherent
DB layer tunables:
Option | Default | Description |
---|---|---|
incoherent_alarm_time | 120 (seconds) | Warn about incoherent nodes this often. |
incoherent_msg_freq | 3600 (seconds) | Drop a warning file if a node is incoherent (see incoherent nodes overview) for more information. |
max_incoherent_nodes | 1 | Number of incoherent nodes before an alarm triggers. |
BDB layer tunables (see bdbattr tunables)
Option | Default | Description |
---|---|---|
REPMETHODMAXSLEEP | 300 (SECS) | Delay commits by at most this much if forced to delay by incoherent nodes |
REPTIMEOUT_LAG | 50 (PERCENT) | Used in replication. Once a node has received our update, we will wait REPTIMEOUT_LAG% of the time that took for all other nodes. |
REPTIMEOUT_MAXMS | 5 * 60 * 1000 (MSECS) | We should wait this long for one node to acknowledge replication. If after this time we have failed to replicate anywhere then the entire cluster is incoherent! |
REPTIMEOUT_MINMS | 10000 (MSECS) | Even if the first node comes back quickly, wait at least this many ms to replicate to other nodes. |
REP_DEBUG_DELAY | 0 (MSECS) | For debugging - set an artificial replication delay |
SKIPDELAYBASE | 100 (MSECS) | Delay commits by at least this much if forced to delay by incoherent nodes |
TOOMANYSKIPPED | 2 (QUANTITY) | Call for election again and delay commits if more than this many nodes are incoherent |
sql_tranlevel_default
options
These options allow you to set the default SQL transaction level across the database. Any setting here can be overwritten by running ‘SET TRANSACTION …’ after connecting to the database. See Isolation Levels for available transaction levels.
sql_tranlevel option | Description |
---|---|
sql_tranlevel_default default | Set to default isolation level |
sql_tranlevel_default recom | Set to READ COMMITTED isolation level |
sql_tranlevel_default snapshot | Set to SNAPSHOT isolation level |
sql_tranlevel_default serial | Set to SERIALIZABLE isolation level |
Query limit commands
Queries running in the database can be limited in cost. The database can optionally warn when a query reaches a certain cost, or prevent it from running further.
querylimit option | Description |
---|---|
querylimit maxcost N | Set the maximum cost the database will allow for a query to N |
querylimit maxcost off | Turn off returning errors for queries over a cost limit |
querylimit warn maxcost N | Warn if a query has a cost > N |
querylimit warn maxcost off | Turn off warnings for queries over a cost limit |
Transaction limit settings
These options limit a transaction on the master node.
Option | Default | Description |
---|---|---|
max_cascaded_rows_per_txn | 0 | Set the max allowed cascaded rows written (updated) per transaction |
max_time_per_txn_ms | 0 | Set the max time allowed for a transaction to complete |
max_wr_rows_per_txn | 0 | Set the max written rows per transaction – inludes all rows written by a transaction, including cascaded operations |
Decimal rounding options
The rounding policy for rounding for columns of decimal types can be controlled with the decimal_rounding
option.
Available options are:
decimal_rounding option | Description |
---|---|
decimal_rounding DEC_ROUND_NONE | Don’t round |
decimal_rounding DEC_ROUND_05UP | Round for reround |
decimal_rounding DEC_ROUND_CEILING | Round towards +infinity |
decimal_rounding DEC_ROUND_DOWN | Round towards 0 (truncate) |
decimal_rounding DEC_ROUND_FLOOR | Round towards -infinity |
decimal_rounding DEC_ROUND_HALF_DOWN | 0.5 rounds down |
decimal_rounding DEC_ROUND_HALF_EVEN | 0.5 rounds to nearest even (default) |
decimal_rounding DEC_ROUND_HALF_UP | 0.5 rounds up |
decimal_rounding DEC_ROUND_UP | Round away from 0 |
berkattr
tunables
There’s a few low-level tunables that control options in BerkeleyDB subsystems present in Comdb2.
All options below are preceded with “berkattr” in the configuration file.
Option | Default | Description |
---|---|---|
abort_on_replicant_log_write | 0 | Abort if replicant is writing to logs |
abort_zero_lsn_memp_put | 0 | Abort on memp_fput pages with zero headers |
abort_zero_lsn_writes | 0 | Abort on writing pages with zero headers |
always_run_recovery | 1 | Replicant always runs recovery after rep_verify |
apprec_track_lsn_ranges | 1 | During recovery track lsn ranges |
blocking_latches | 0 | Block on latch rather than deadlock |
btpf_cu_gap | 5 | How close a cursor should be (pages) to the prefaulted limit before prefaulting again |
btpf_enabled | 0 | Enables index pages read ahead |
btpf_min_th | 1 | Preload pages only if the tree has height less than this parameter |
btpf_pg_gap | 0 | Min. number of records to the page limit before read ahead |
btpf_wndw_inc | 1 | Increment factor for the number of pages read ahead |
btpf_wndw_max | 1000 | Maximum number of pages read ahead |
btpf_wndw_min | 100 | Minimum number of pages read ahead |
cache_lc_check | 0 | Check LC cache system on every transaction |
cache_lc_debug | 0 | Lots of verbose messages out of LC cache system |
cache_lc_max | 16 | Keep this many transactions around in LC cache |
cache_lc_memlimit_tran | 1048576 | Limit per transaction memory used by LC cache |
cache_lc_memlimit | 2097152 | Limit total memory used by LC cache (0 = unlimited). |
cache_lc_trace_evictions | 0 | Print a message at the point of eviction |
cache_lc_trace_misses | 0 | Print a message on cache miss |
cache_lc | 0 | Collect logs into LSN_COLLECTIONs as they come in |
check_applied_lsns_debug | 0 | Lots of verbose trace for debugging applied LSNs |
check_applied_lsns_fatal | 0 | Abort if check_applied_lsns fails |
check_applied_lsns | 0 | Check transaction that its LSNs have been applied |
check_pwrites_debug | 0 | Read page after direct pwrite, check that it matches |
check_pwrites | 0 | Read page after direct pwrite, check that it matches |
check_zero_lsn_writes | 1 | Warn on writing pages with zero LSNs |
commit_map_debug | 0 | Produce debug output in commit lsn map |
consolidate_dbreg_ranges | 1 | Combine adjacent dbreg ranges for same file |
db_lock_lsn_step | 1024 | Stepup for preallocated db_lock_lsns |
dbreg_errors_fatal | 0 | dbreg errors fatal |
debug_addrem_dbregs | 0 | Generate debug records for addrems |
debug_deadlock_replicant_percent | 0 | Percent of replicant events getting deadlocks |
debug_enospc_chance | 0 | DEBUG %% random ENOSPC on writes |
flush_scan_dbs_first | 0 | Don’t hold bufpool mutex while opening files for flush |
ilock_step | 2048 | Stepup for preallocated ilock-latches |
iomap_enabled | 1 | Map file that tells comdb2ar to pause while we fsync |
latch_max_poll | 5 | Poll latch this many times before returning deadlock |
latch_max_wait | 5000 | Block at most this many microseconds before returning deadlock |
latch_poll_us | 1000 | Poll latch this many microseconds before retrying |
latch_timed_mutex | 1 | Use a timed mutex |
lockerid_node_step | 128 | Stepup for preallocated lids |
log_applied_lsns | 0 | Log applied LSNs to log |
log_cursor_cache | 0 | Cache log cursors |
lsnerr_logflush | 1 | Flush log on lsn error |
lsnerr_pgdump_all | 0 | Dump page on LSN errors on all nodes |
lsnerr_pgdump | 1 | Dump page on LSN errors |
max_latch_lockerid | 10000 | Size of latch lockerid array |
max_latch | 200000 | Size of latch array |
num_write_retries | 8 | number of times to retry writes on ENOSPC |
preallocate_max | 256 * MEGABYTE | Pre-allocation size |
preallocate_on_writes | 0 | Pre-allocate on writes |
recovery_processor_poll_interval_us | 1000 | Recovery processor wakes this often to check workers |
recovery_verify_fatal | 0 | Abort if recovery_verify is set, and fails. |
recovery_verify | 0 | After recovery, run a full pass to make sure everything is applied |
sgio_enabled | 0 | Do scatter gather I/O |
sgio_max | 10 * MEGABYTE | Max scatter gather I/O to do at one time |
skip_sync_if_direct | 1 | Don’t fsync files if directio enabled |
start_recovery_at_dbregs | 1 | Start recovery at dbregs |
tracked_locklist_init | 10 | Initial allocation count for tracked locks |
warn_nondbreg_records | 0 | warn on non-dbreg records before checkpoint |
warn_on_replicant_log_write | 1 | Warn if replicant is writing to logs |
bdbattr
tunables
These tunables control the ‘BDB’ layer of the db. Its main responsibility is setting tunables that control lower-level functions like replication, various timeouts, etc. These are set with the ‘setattr’ tunable. The next word determines which option to set, and the following word determines its value.
Option | Default (type) | Description |
---|---|---|
ALLOW_OFFLINE_UPGRADES | 0 (BOOLEAN) | Allow machines marked offline to become master. |
AUTODEADLOCKDETECT | 1 (BOOLEAN) | When enabled, deadlock detection will run on every lock conflict. When disabled, it’ll run periodically (every DEADLOCKDETECTMS ms) |
BLOBSTRIPE | 0 (BOOLEAN) | Settable at ONLY at database creation time. Create stripes for every variable length field (eg: blobs) |
BULK_SQL_MODE | 1 (BOOLEAN) | Enable reading data in bulk when performing a scan (alternative is single-stepping a cursor) |
BULK_SQL_THRESHOLD | 2 (QUANTITY) | Use bulk retrieval of data on scan after this many next operations |
CHECKPOINTRAND | 30 (SECS) | Stagger scheduled checkpoints by this random amount within this many seconds (to prevent multiple databases from checkpointing at the same time) |
CHECKPOINTTIMEPOLL | 100 (MSECS) | Poll a random amount of time < this before writing a checkpoint (attempting to prevent multiple databases from checkpointing at the same exact time) |
CHECKPOINTTIME | 60 (SECS) | Write a checkpoint at this interval. |
CHECK_LOCKER_LOCKS | 0 (BOOLEAN) | Sanity check locks at end of transaction |
COMMITDELAYMAX | 0 (QUANTITY) | Caps the max transaction delay time. COMMITDELAY won’t set above this value, unless set manually. |
COMMITDELAY | 0 (MSECS) | Add a delay after every commit. This is occasionally useful to throttle the transaction rate. |
DEADLOCKDETECTMS | 100 (MSECS) | When automatic deadlock detection is disabled, run the deadlock detector this often. |
DEADLOCK_LEAST_WRITES_EVER | 1 (BOOLEAN) | If AUTODEADLOCKDETECT is off, prefer transaction with least write as deadlock victim |
DEADLOCK_MOST_WRITES | 0 (BOOLEAN) | If AUTODEADLOCKDETECT is off, prefer transaction with most write as deadlock victim |
DEADLOCK_WRITERS_WITH_LEAST_WRITES | 1 (BOOLEAN) | If AUTODEADLOCKDETECT is off, prefer transaction with least write as deadlock victim |
DEADLOCK_YOUNGEST_EVER | 0 (BOOLEAN) | If AUTODEADLOCKDETECT is off, prefer youngest transaction as deadlock victim |
DEF_MIN_KEEP_LOGS | 5 (QUANTITY) | Keep at least this many logs around, even if they are eligible for deletion (unless disk space is very low) |
DELAYED_OLDFILE_CLEANUP | 1 (BOOLEAN) | If set, don’t delete unused data/index files in the critical path of schema change - schedule them for deletion later. |
DISABLE_PAGEORDER_RECSZ_CHK | 0 (BOOLEAN) | If set, allow page-order table scans even for larger record sizes where they don’t necessarily lead to improvement. |
DISABLE_PGORDER_MIN_NEXTS | 1000 (QUANTITY) | Don’t disable page order table scans for tables less than this many pages. |
DISABLE_PGORDER_THRESHOLD | 60 (PERCENT) | Disable page order table scans if skipping this percentage of pages on a scan |
DISABLE_UPDATE_STRIPE_CHANGE | 1 (BOOLEAN) | Enable to move records between stripes on an update. |
DISABLE_WRITER_PENALTY_DEADLOCK | 0 (BOOLEAN) | If set, won’t shrink max #writers on deadlock |
DTASTRIPE | 8 (QUANTITY) | Partition each table’s data into this many stripes. Note that this is ONLY settable at database creation time. |
ELECTTIMEBASE | 50 (MSECS) | Master election timeout base value |
ELECT_DISABLE_NETSPLIT_PATCH | 0 (BOOLEAN) | When false - on a net split, the side with the master keeps it instead of downgrading. New masters still can’t be elected without a quorum. |
ENABLE_TEMPTABLE_CLEAN_EXIT | 0 (BOOLEAN) | On exit, clean up temp tables (they are deleted on next startup regardless). |
FULLRECOVERY | 0 (BOOLEAN) | Instead of recovering from the last checkpoint, run recover from the start of the available logs. |
GENID_COMP_THRESHOLD | 60 (QUANTITY) | Try to compress rowids if the record data is smaller than this size. |
HOSTILE_TAKEOVER_RETRIES | 0 (QUANTITY) | Attempt to take over mastership if the master machine is marked offline, and the current machine is online. |
INDEX_PRIORITY_BOOST | 1 (BOOLEAN) | Treat index pages as higher priority in the buffer pool. |
KEEP_REFERENCED_FILES | 1 (BOOLEAN) | Don’t remove any files that may still be referenced by the logs. |
LITTLE_ENDIAN_BTREES | 1 (BOOLEAN) | Enabling this sets byte ordering for pages to little endian |
LOGSEGMENTS | 1 (QUANTITY) | Changing this can create multiple logfile segments. Multiple segments can allow the log to be written while other segments are being flushed. |
LOG_DEBUG_CTRACE_THRESHOLD | 20 (QUANTITY) | Limit trace about log file deletion to this many events |
LOWDISKTHRESHOLD | 95 (PERCENT) | Sets the low headroom threshold (percent of filesystem full) above which Comdb2 will start removing logs against set policy. |
MASTER_REJECT_REQUESTS | 1 (BOOLEAN) | Master will reject SQL requests - they’ll be routed to a replicant. The master can serve SQL requests, but it’s better to avoid it for better workload balancing. |
MASTER_REJECT_SQL_IGNORE_SANC | 0 (BOOLEAN) | If MASTER_REJECT_REQUESTS is set, reject if no other connected nodes are available. |
MAX_SQL_IDLE_TIME | 3600 (QUANTITY) | Warn when an SQL connection remains idle for this long. |
MAX_VLOG_LSNS | 10000000 (QUANTITY) | Apply up to this many replication record trying to maintain a snapshot transaction. |
MEMPTRICKLEMSECS | 1000 (MSECS) | Pause for this many ms between runs of the cache flusher. |
MEMPTRICKLEPERCENT | 99 (PERCENT) | Try to keep at least this percentage of the buffer pool clean. Write pages periodically until that’s achieved. |
MINREPTIMEOUT | 10000 (MSECS) | Wait at leas this long for a replication event before marking a node incoherent (see incoherent nodes overview) |
MIN_KEEP_LOGS_AGE | 0 (SECS) | Keep logs that are at least this old (0 - do not keep logs based on time) |
NEW_MASTER_DUMMY_ADD_DELAY | 5 (SECS) | Force a transaction after this delay, after becoming master. |
NOMASTER_ALERT_SECONDS | 60 (QUANTITY) | Replicants will alarm there’s no master for this many seconds. |
NUMBERKDBCACHES | 0 (QUANTITY) | Split the cache into this many segments. |
OSYNC | 0 (BOOLEAN) | Enable O_SYNC on writes. Reads will still use filesystem cache. |
PAGEDEADLOCK_MAXPOLL | 5 (QUANTITY) | If retrying on deadlock (see PAGEDEADLOCK_RETRIES ), poll up to this many ms on each retry. |
PAGEDEADLOCK_RETRIES | 500 (QUANTITY) | On a page deadlock, retry the page operation up to this many times. |
PAGE_EXTENT_SIZE | 0 (QUANTITY) | If set, allocate pages in blocks of this many (extents) |
PAGE_ORDER_TABLESCAN | 1 (BOOLEAN) | Scan tables in order of pages, not in order of rowids (faster for non-sparse tables) |
PRINT_FLUSH_LOG_MSG | 0 (BOOLEAN) | Produce trace when flushing log files. |
RECOVERY_PAGES | 0 (QUANTITY) | Disabled if set to 0. Othersize, number of pages to write in addition to writing datapages. This works around corner recovery cases on questionable filesystems. |
REPLIMIT | 256 * 1024 (BYTES) | Replication messages will be limited to this size |
REPSLEEP | 0 (MSECS) | Add a delay on replicants before completing processing a log record. |
REPTIMEOUT | 20 (SECS) | Replication timeout. |
REP_DB_PAGESIZE | 0 (QUANTITY) | Page size for BerkeleyDB’s replication cache db. |
REP_LONGREQ | 1 (SECS) | Warn if replication events are taking this long to process. |
REP_LSN_CHAINING | 0 (BOOLEAN) | If set, will force transactions on replicant to always release locks in LSN order. |
REP_MEMSIZE | 524288 (QUANTITY) | Maximum size for a local copy of log records for transaction processors on replicants. Larger transactions will read from the log directly. |
REP_PROCESSORS | 4 (QUANTITY) | Try to apply this many transactions in parallel in the replication stream |
REP_WORKERS | 16 (QUANTITY) | Size of worker pool for applying page changes on behalf of transactions (only has effect when REP_PROCESSORS is set |
ROWLOCKS_PAGELOCK_OPTIMIZATION | 1 (BOOLEAN) | Upgrade rowlocks to pagelocks if possible on cursor traversals. |
SEQNUM_WAIT_INTERVAL | 500 (QUANTITY) | Wake up to check the state of the world this often while waiting for replication acks. |
SOSQL_DDL_MAX_COMMIT_WAIT_SEC | 259200 (SECS) | Wait for the master to commit a DDL transaction for up to this long |
SOSQL_MAX_COMMIT_WAIT_SEC | 600 (SECS) | Wait for the master to commit a transaction for up to this long |
SOSQL_POKE_FREQ_SEC | 5 (QUANTITY) | On replicants, check this often for transaction status. |
SOSQL_POKE_TIMEOUT_SEC | 12 (QUANTITY) | On replicants, when checking on master for transaction status, retry the check after this many seconds. |
SQLBULKSZ | 2097152 (BYTES) | For index/data scans, the database will retrieve data in bulk instead of singlestepping a cursor. This set the buffer size for the bulk retrieval. |
SQLITE_SORTER_TEMPDIR_REQFREE | 6 (PERCENT) | Refuse to create a sorter for queries if less than this percent of disk space is available (and return an error to the application) |
SQL_QUERY_IGNORE_NEWER_UPDATES | 0 (BOOLEAN) | In transaction modes below SNAPSHOT, skip records updated after the current transaction started. |
SQL_QUEUEING_CRITICAL_TRACE | 100 (QUANTITY) | Produce trace when SQL request queue is this deep. |
SQL_QUEUEING_DISABLE_TRACE | 0 (BOOLEAN) | Disable trace when SQL requests are starting to queue. |
TABLESCAN_CACHE_UTILIZATION | 20 (PERCENT) | Attempt to keep no more than this percentage of the buffer pool of table scans. |
TEMPTABLE_CACHESZ | 262144 (BYTES) | Cache size for temporary tables. Temp tables do not share the database’s main buffer pool. |
TEMPTABLE_MEM_THRESHOLD | 512 (QUANTITY) | If in-memory temp tables contain more than this many entries, spill them to disk. |
ZLIBLEVEL | 6 (QUANTITY) | If zlib compression is enabled, this determines the compression level. |
Auto analyze options
These options control when the database kicks off analyze automatically. Auto-analyze
will be run by the database when the database does some fixed count of inserts/deletes. Various counters
can be queried with the stat autonalyze
command.
Option | Default (type) | Description |
---|---|---|
AUTOANALYZE | 0 (BOOLEAN) | Set to enable auto-analyze |
AA_COUNT_UPD | 0 (BOOLEAN) | Also consider updates towards the count of operations |
AA_LLMETA_SAVE_FREQ | 1 (QUANTITY) | Persist change counters per table on every N’th iteration (called every CHK_AA_TIME seconds) |
AA_MIN_PERCENT_JITTER | 300 (QUANTITY) | Additional jitter factor for determining percent change. |
AA_MIN_PERCENT | 20 (QUANTITY) | Percent change above which we kick off analyze |
CHK_AA_TIME | 180 (SECS) | Check whether we should start analyze this often |
MIN_AA_OPS | 100000 (QUANTITY) | Start analyze after this many operations |
MIN_AA_TIME | 7200 (SECS) | Don’t re-run auto-analyze if already ran within this many seconds |
SQL planner tunables
Option | Default (type) | Description |
---|---|---|
PLANNER_EFFORT | 1 (QUANTITY) | Planner effort (try harder) levels, default is 1 |
PLANNER_SHOW_SCANSTATS | 0 (BOOLEAN) | After each query, display statistics about index/data paths taken. |
PLANNER_WARN_ON_DISCREPANCY | 0 (BOOLEAN) | After each query warn if the estimate and actual cost are significantly different |
SHOW_COST_IN_LONGREQ | 1 (BOOLEAN) | Show query cost in the database long requests log (see logs.html#long-requests-log) |
Schema change tunables
Option | Default (type) | Description |
---|---|---|
SC_RESTART_SEC | 60 (QUANTITY) | Delay restarting schema change for this many seconds after startup/new master election |
INDEXREBUILD_SAVE_EVERY_N | 10 (QUANTITY) | Save schema change state to every n-th row for index only rebuilds |
SC_CHECK_LOCKWAITS_SEC | 1 (QUANTITY) | Frequency of checking lockwaits during schemachange in seconds |
SC_DECREASE_THRDS_ON_DEADLOCK | 1 (BOOLEAN) | Decrease number of schema change threads on deadlock – way to have schema change backoff |
SC_FORCE_DELAY | 0 (BOOLEAN) | Force schemachange to delay after every record inserted – to have sc backoff |
SC_NO_REBUILD_THR_SLEEP | 10 (QUANTITY) | Sleep this many microsec when conversion threads count is at max |
SC_PAUSE_REDO | 0 (BOOLEAN) | Pauses the newsc asynchronous redo-thread for testing |
SC_USE_NUM_THREADS | 0 (QUANTITY) | Start up to this many threads for parallel rebuilding durin schema change. 0 means use one per dtastripe . Setting is capped at dtastripe . |
UDP tunables
UDP option | description | |
---|---|---|
RAND_UDP_FAILS | 0 (QUANTITY) | For testing, rate of drop of UDP packets |
UDP_AVERAGE_OVER_EPOCHS | 4 (QUANTITY) | Average over these many TCP epochs |
UDP_DROP_DELTA_THRESHOLD | 10 (QUANTITY) | Warn if delta of dropped packets exceeds this threshold |
UDP_DROP_WARN_PERCENT | 10 (PERCENT) | Warn only if percentage of dropped packets exceeds this |
UDP_DROP_WARN_TIME | 300 (SECS) | No more than one warning per UDP_DROP_WARN_TIME seconds |
2PC tunables
2PC option | Default | Description |
---|---|---|
ENABLE_2PC | 1 (BOOLEAN) | Uses 2-phased commits for fdb-distributed transactions |
ALLOW_COORDINATOR | n/a (DB/TIER) | Allow this db/tier to act as a coordinator |
COORDINATOR_SYNC_ON_COMMIT | 1 (BOOLEAN) | Coordinator syncs log and waits for replicants on distributed commit |
COORDINATOR_BLOCK_UNTIL_DURABLE | 1 (BOOLEAN) | Coordinator blocks until commit is durable |
COORDINATOR_WAIT_PROPAGATE | 1 (BOOLEAN) | Coordinator blocks until participants complete distributed commit |
FLUSH_ON_PREPARE | 1 (BOOLEAN) | Master flushes log on prepare |
FLUSH_REPLICANT_ON_PREPARE | 1 (BOOLEAN) | Replicants flush log on prepare |
WAIT_FOR_PREPARE_SEQNUM | 1 (BOOLEAN) | Participants fail on prepare if replicated to less than quorum |
DISTTXN_ASYNC_MESSAGE | 1 (BOOLEAN) | Send 2-pc messages asynchronously |
DISTTXN_RANDOM_RETRY_POLL | 500 (MSECS) | Coordinator replicant random max poll time on distributed deadlock |
DISTTXN_HANDLE_CACHE | 1 (BOOLEAN) | Disttxn caches handles to peers |
DISTTXN_HANDLE_LINGER_TIME | 60 (SECS) | Time an unused disttxn-handle persists |
Rowlock tunables
Rowlocks option | Description | |
---|---|---|
ACK_ON_REPLAG_THRESHOLD | 0 | Drop the undo interval to 1 after this many deadlocks |
MAX_ROWLOCKS_REPOSITION | 10 | Release a physical cursor an re-establish |
PHYSICAL_ACK_INTERVAL | 0 | For logical transactions, have the slave send an ‘ack’ after this many physical operations |
PHYSICAL_COMMIT_INTERVAL | 512 | Force a physical commit after this many physical operations |
ROWLOCKS_MICRO_COMMIT | 1 | Commit on every btree operation |
Misc tunables
Misc option | Default | Description |
---|---|---|
ASOF_THREAD_DRAIN_LIMIT | 0 | How many entries at maximum should the BEGIN TRANSACTION AS OF thread drain per run |
ASOF_THREAD_POLL_INTERVAL_MS | 500 | For how long should the BEGIN TRANSACTION AS OF thread sleep after draining its work queue |
DELETE_OLD_FILE_DEBUG | 0 | Spew debug info about deleting old files during schema change. |
DISABLE_CACHING_STMT_WITH_FDB | 1 | Don’t cache query plans for statements with foreign table references |
DISABLE_SELECTVONLY_TRAN_NOP | 0 | Disable verifying rows selected via SELECTV if there’s no other actions done by the same transaction |
DONT_BLOCK_DELETE_FILES_THREAD | 0 | Don’t delete any files that would cause delete files thread to block |
GENID48_WARN_THRESHOLD | 500000000 | Print a warning when there are only a few genids remaining */ |
NET_INORDER_LOGPUTS | 1 | Attempt to order messages to ensure they go out in LSN order |
RCACHE_COUNT | 257 | Number of entries in root page cache |
RCACHE_PGSZ | 4096 | Size of pages in root page cache |
REP_VERIFY_LIMIT_ENABLED | 1 | Enable aborting replicant if it doesn’t make sufficient progress while rolling back logs to sync up to master. |
REP_VERIFY_MAX_TIME | 300 | Maximum amount of time we allow a replicant to roll back its logs in an attempt to sync up to the master. |
REP_VERIFY_MIN_PROGRESS | 10485760 | Abort replicant if it doesn’t make this much progress while rolling back logs to sync up to master. |
RLLIST_STEP | 10 | Reallocate rowlock lists in steps of this size. |
SC_VIA_DDL_ONLY | 0 | If DDL_ONLY is set, we don’t do checks needed for comdb2sc |
file_permissions | 0660 | Default filesystem permissions for database files. |
Log configuration
Log option | Default | Description |
---|---|---|
DEBUG_LOG_DELETION | 0 | Enable to see when/why database log files are being deleted |
LOGDELETELOWFILENUM | -1 (QUANTITY) | Set to set the lowest deleteable log file number. |
LOGFILEDELTA | 4096 (BYTES) | Treat lsn differences less than this as “close enough” for the purpose of determining if a new node is in sync |
LOGFILESIZE | 41943040 (BYTES) | Attempt to keep each log file around this size. |
LOGMEMSIZE | 10485760 (BYTES) | Use this much memory for a log file in-memory buffer. |
LOG_DELETE_LOW_HEADROOM_BREAKTIME | 10 | Try to delete logs this many times if the filesystem is getting full before giving up |
MIN_KEEP_LOGS_AGE_HWM | 0 | … |
Replay detection tunables
Comdb2 databases allow the client APIs to replay a transaction if its outcome is uncertain (ie: client issues commit, but the database drops a connection, so uncertain whether it committed). This system is internally called “blkseq” (block sequence). Tunables to control it are below
BLKSEQ option | Default | Description |
---|---|---|
DISABLE_SERVER_SOCKPOOL | 1 | Don’t get connections to other databases from sockpool. |
PRIVATE_BLKSEQ_CACHESZ | 4194304 | Cache size of the blkseq table |
PRIVATE_BLKSEQ_CLOSE_WARN_TIME | 100 | Warn when it takes longer than this many MS to roll a blkseq table |
PRIVATE_BLKSEQ_ENABLED | 1 | Sets whether dupe detection is enabled |
PRIVATE_BLKSEQ_MAXAGE | 20 | Maximum time in seconds to let “old” transactions live |
PRIVATE_BLKSEQ_STRIPES | 8 | Number of stripes for the blkseq table |
TIMEOUT_SERVER_SOCKPOOL | 10 | Timeout for getting a connection to another database from sockpool. |
Coherency options
Comdb2 will mark nodes that are non-responsive “incoherent”. Incoherent nodes will not respond to new queries and replication will not wait for them acknowledge events. Nodes stay incoherent until the are caught up to the rest of the cluster. The master node will send replicants a coherency lease to allow them to stay coherent. It’ll withhold the lease if a node is either not responding or is significantly slower than the other nodes.
Coherency option | Default(type) | Description |
---|---|---|
ADDITIONAL_DEFERMS | 0 | Wait-fudge to ensure that a replicant has gone incoherent |
ADD_RECORD_INTERVAL | 1 | Add a record every |
CATCHUP_ON_COMMIT | 1 | Replicant to INCOHERENT_WAIT rather than INCOHERENT on commit if within CATCHUP_WINDOW |
CATCHUP_WINDOW | 1000000 | Start waiting in waitforseqnum if replicant is within this many bytes of master |
COHERENCY_LEASE | 200 | A coherency lease grants a replicant the right to be coherent for this many MS. |
COHERENCY_LEASE_UDP | 1 | Use udp to issue leases |
COMMITDELAYBEHINDTHRESH | 1048576 (BYTES) | Call for election again and ask the master to delay commits if we’re further than this far behind on startup. |
DOWNGRADE_PENALTY | 10000 | Prevent upgrades for at least this many ms after a downgrade |
GOOSE_REPLICATION_FOR_INCOHERENT_NODES | 0 (BOOLEAN) | Call for election for nodes affected by COMMITDELAYBEHINDTHRESH |
LEASE_RENEW_INTERVAL | 50 | How often we renew leases |
MAKE_SLOW_REPLICANTS_INCOHERENT | 1 (BOOLEAN) | Make slow replicants incoherent. |
REMOVE_COMMITDELAY_ON_COHERENT_CLUSTER | 1 | Stop delaying commits when the all the nodes in the cluster are coherent. |
SLOWREP_INACTIVE_TIMEOUT | 5000 (SECS) | If a “slow” replicant hasn’t responded in this long, mark him incoherent. |
SLOWREP_INCOHERENT_FACTOR | 2 (QUANTITY) | Make replicants that are this many times worse than the second worst replicant incoherent. This is the threshold for WARN_SLOW_REPLICANTS and MAKE_SLOW_REPLICANTS_INCOHERENT . |
SLOWREP_INCOHERENT_MINTIME | 2 (MSECS) | Ignore replicantion events faster than this. |
TRACK_REPLICATION_TIMES_MAX_LSNS | 50 (QUANTITY) | Track replication times for up to this many transactions |
TRACK_REPLICATION_TIMES | 1 (BOOLEAN) | Track how long each replicant takes to ack all transactions. |
WARN_SLOW_REPLICANTS | 1 (BOOLEAN) | Warn if any replicant’s average response times over the last 10 seconds are significantly worse than the second worst replicant’s. |
Init time options
These options apply only at database creation time. Use -create -lrl /path/to/lrlfile.lrl
with the
pre-populated options to create the database
Option | Default | Description |
---|---|---|
blobstripe | 1 | Also stripe blob files (several physical files per blob field) |
dtastripe | 1 | Stripe each table across this many files (several physical files per table) |
init_with_bthash | 0 (off) | Turn on a root page lookaside cache for tables |
init_with_compr | On | Turns out compression on table data. |
init_with_compr_blobs | On | Turns out compression on blob/vutf8 columns. |
init_with_genid48 | Off | Use incrementing numbers for rowids instead of timestamps |
init_with_inplace_update | On | Do record inplace inplace instead of a delete/add. This is far more efficient. |
init_with_instant_schema_change | On | When possible (eg: when just adding fields) schema change will not rebuild the underlying tables. |
init_with_ondisk_header | On | Add an extra header to records for versioning and compression information. This can be enabled on tables individually with ALTER TABLE . This controls the default for new tables. Prerequisite for other init_with_* options below. |
init_with_rowlocks | Off | Use rowlocks instead of pagelocks for short-term transaction locks |
init_with_rowlocks_master_only | Off | Alternate rowlocks scheme that only requires short-term rowlocks on the master |
table | Multiple table options can be added to an lrl file to add tables at database init time. Arguments are table name and path to .csc2 file. |
Start time options
These options are toggable at start time of the server.
Option | Default | Description |
---|---|---|
archive_on_init | ON | Archive preexisting database files on init. |
cache | 64 mb | Database cache size, see cache size |
cachekb | see cache size | |
cachekbmax | see cache size | |
cachekbmin | see cache size | |
checksums | On | Checksum data pages. Turning this off is highly discouraged. |
cluster nodes | List of nodes that comprise the cluster for this database. See setting up clusters | |
dedicated_network_suffixes | Suffix to append to node name when server has extra network interfaces that comdb2 is able to use to ensure resiliency when losing one network, example: if eth1 and eth2 are extra network cards on the server and the dns hostnames assigned to the ips of the respective cards are node1_eth1 and node1_eth2, then the option here should be set as: dedicated_network_suffixes _eth1 _eth2 | |
dir | $COMDB2_ROOT/var/cdb2/$DBNAME |
Database directory |
directio | On | Bypass filesystem cache for page I/O |
fullrecovery | Off, recovery runs from previous checkpoint | Attempt to run database recovery from the beginning of available logs |
largepages | 0 | Enables large pages. |
nonames | Off | Use database name for some environment files (older setting, should remain off) |
remsql_whitelist databases | If this option is set, when another DB makes a connection to this DB, we will only allown processing of that request if that other DB’s name is in the whitelist, otherwise it will receive an error, example: remsql_whitelist databases db1 db2 db3 . |
Runtime options
These options are toggle-able at runtime.
Option | Default | Description |
---|---|---|
ack_trace | not set | Every second, produce trace for ack messages |
allow | See permissioning commands | |
allow_lua_print | 0 | Enable to allow stored procedures to print trace on DB’s stdout |
allow_user_schema | 0 | Enable to allow per-user schemas |
analyze_comp_threads | 10 | Number of thread to use when generating samples for computing index statistics |
analyze_comp_threshold | 104857600 | Index file size above which we’ll do sampling, rather than scan the entire index. |
analyze_tbl_threads | 5 | Number of threads to go through generated samples when generating index statistics |
appsockpool | See thread pools | |
appsockslimit | 500 | Start warning on this many connections to the database |
berkattr | See BerkeleyDB attributes | |
blob_mem_mb | not set | Blob allocator - sets the max memory limit to allow for blob values (in MB). |
blobmem_sz_thresh_kb | not set | Sets the threshold (in kb) above which blobs are allocated by the blob allocator. |
cache_flush_interval | 30 (s) | Flushes buffer-cache page numbers to logs/pagelist on this interval. The database pre-heats the buffercache with these pages when it starts. Setting to 0 disables. |
chkpoint_alarm_time | 60 (sec) | Warn if checkpoints are taking more than this many seconds. |
clean_exit_on_sigterm | 1 | When enabled, SIGTERM will cause database to do an orderly shutdown. When disabled follows system SIGTERM default (terminate, no core) |
clrpol | See permissioning commands | |
commit_delay_on_copy_ms | 0 | Amount of time each commit will be delayed if a copy is ongoing |
commit_delay_timeout_seconds | 10 | Period of time a master will delay-commits if a copy is ongoing |
commitdelaymax | 0 | Introduce a delay after each transaction before returning control to the application. Occasionally useful to allow replicants to catch up on startup with a very busy system. |
crc32c | set | Use crc32c (alternate faster implementation of CRC32, different checksums) for page checksums |
crypto | See Authentication and Encryption | |
ctrace_dbdir | not set | If set, debug trace files will go to the data directory instead of `$COMDB2_ROOT/var/log/cdb2/) |
ctrace_nlogs | 7 | When rolling trace files, keep this many. The older files will have incrementing number suffixes (.1, .2, etc.) |
ctrace_rollat | 0 | Roll database debug trace file ($COMDB2_ROOT/var/log/cdb2/$dbname.trc.c ) at specified size. Set to 0 to never roll. |
deadlock_rep_retry_max | not set | If set, will reset the deadlock mode after this many deadlocks on the replicant while applying the log stream. |
debugthreads | off | If set to ‘on’ enables trace on thread events. |
decimal_rounding | DEC_ROUND_HALF_EVEN | See decimal rounding options |
default_sql_mspace_kbsz | 1024 | Default size of memory regions owned by SQL threads, in KB |
delay_sql_lock_release | 1 | Delay release locks in cursor move if bdb lock desired but client sends rows back |
disable_cache_internal_nodes | Disable enable_cache_internal_nodes | |
disable_inplace_blob_optimization | Disables enable_inplace_blob_optimization | |
disable_inplace_blobs | Disables enable_inplace_blobs (needs enable_inplace_blob_optimization, and enable_osql_blob_optimization also enabled - which they are by default) | |
disable_lowpri_snapisol | ||
disable_new_snapshot | Disables alternate snapshot implementation | |
disable_osql_blob_optimization | Disables disable_osql_blob_optimization | |
disable_overflow_page_trace | 1 | If set, warn when a page order table scan encounters an overflow page. |
disable_page_latches | Turns off page latches | |
disable_pageorder_recsz_check | 0 | If set, allow page order table scans even for pages with overflows. |
disable_partial_indexes | Disables partial indices | |
disable_prefault_udp | Disable enable_prefault_udp |
|
disable_replicant_latches | Turns off page latches on replicants | |
disable_sparse_lockerid_map | Disables enable_sparse_lockerid_map | |
disable_sql_dlmalloc | not set | If set, will use default system malloc for SQL state machines. By default, each thread running SQL gets a dedicated memory pool. |
disable_temptable_pool | Disables the pool of temp tables set by temptable_limit , temp tables are created as needed. |
|
disable_upgrade_ahead | Disables enable_upgrade_ahead |
|
disallow | See permissioning commands | |
do | At the end of processing config files, execute the rest of this line as an operational command, see operational Commands | |
dump_cache_max_pages | 0 | Maximum number of pages that will be written into the default pagelist |
dumpthreadonexit | off | If set to ‘on’ dump resources held by a thread on exit |
early | set | When set, replicants will ack a transaction as soon as they acquire locks - not that replication must succeed at that point, and reads on that node will either see the records or block. |
enable_bulk_import | 0 | Enable API to quickly bring in tables from another database |
enable_bulk_import_different_tables | 0 | Enable API to bring in tables from another databases that are not present in the current database |
enable_cache_internal_nodes | set | Btree internal nodes have a higher cache priority. |
enable_inplace_blob_optimization | Enables inplace blob updates (blobs are updated in place in their b-tree when possible, not deleted/added) | |
enable_inplace_blobs | set | Don’t update the rowid of a blob entry on an update |
enable_lowpri_snapisol | 0 | Give lower priority to locks acquired when updating snapshot state |
enable_new_snapshot | 0 | Experimental Enable new SNAPSHOT implementation |
enable_new_snapshot_asof | 0 | Experimental Enable new BEGIN TRANSACTION AS OF implementation |
enable_new_snapshot_logging | 0 | Experimental Enable alternate logging scheme |
enable_osql_blob_optimization | set | Replicant tracks which columns are modified in a transaction to allow blob updates to be omitted if possible |
enable_overflow_page_trace | If set, don’t warn when a page order table scan encounters an overflow page. | |
enable_pageorder_recsz_check | Disables enable_pageorder_recsz_check | |
enable_partial_indexes | not set | If set, allows partial index definitions in table schema. See partial indices |
enable_prefault_udp | not set | Send lossy prefault requests to replicants |
enable_selectv_range_check | not set | Experimental If set, SELECTV will send ranges for verification, not every touched record. |
enable_serial_isolation | 0 | Enable to allow SERIALIZABLE level transactions to run against the database |
enable_snapshot_isolation | 0 | Enable to allow SNAPSHOT level transactions to run against the database |
enable_sparse_lockerid_map | set | If set, allocates a sparse map of lockers for deadlock resolution |
enable_sql_stmt_caching | not set | Enable caching of query plans. If followed by “all” will cache all queries, including those without parameters. |
enable_tagged_api | 0 | |
enable_upgrade_ahead | not set | Occasionally update read records to the newest schema version (saves some processing when reading them later) |
externalauth | off | Enable use of external auth plugin |
forbid_remote_admin | set | Disallow admin SQL sessions unless it is on the same machine as the database |
gbl_exit_on_pthread_create_fail | 1 | If set, database will exit if thread pools aren’t able to create threads. |
heartbeat_send_time | 5 (seconds) | Send heartbeats this often. |
include | Include file given as argument. Named file will be processed before continuing processing the current file. | |
ioqueue | 0 | Max depth of the I/O prefaulting queue |
iothreads | 0 | Number of threads to use for I/O prefaulting |
keycompr | Enable index compression (applies to newly allocated index pages, rebuild table to force for all pages, see REBUILD | |
load_cache_max_pages | 0 | Maximum number of pages that will be prefaulted into the bufferpool cache. |
load_cache_threads | 8 | Number of threads that will prefault a pagelist into the bufferpool cache. |
location | Sets up default file locations - see file locations | |
lock_conflict_trace | Off | Dump count of lock conflicts every second |
log_delete_after_backup | 0 | Set log deletion policy to disable log deletion (can be set by backups, thought the default backups provided by copycomdb2 use a different mechanism) |
log_delete_before_startup | 0 | Set log deletion policy to disable logs older than database startup time. |
log_delete_now | 1 | Set log deletion policy to delete logs as soon as possible. |
logmsg | Controls the database logging level - accepts logging commands. | |
master_retry_poll_ms | 100 | Have a node wait this long after a master swing before retrying a transaction |
master_swing_osql_verbose | not set | Produce verbose trace for SQL handlers detecting a master change |
max_lua_instructions | 10000 | Max lua opcodes to execute before we assume the stored procedure is looping and kill it |
max_sqlcache_hints | 100 | Max number of “hinted” query plans to keep (global) - see cdb2_use_hints() |
max_sqlcache_per_thread | 10 | Max number of plans to cache per sql thread (statement cache is per-thread, but see hints below) |
maxappsockslimit | 1400 | Start dropping new connections on this many connections to the database |
maxcolumns | 255 | Raise the maximum permitted number of columns per table. There’s a hard limit of 1024. |
maxlockers | 256 | Initial size of the lockers table (there’s no current maximum) |
maxosqltransfer | 50000 | Maximum number of records modifications allowed per transaction |
maxq | 192 | Maximum queue depth for write requests |
maxretries | 500 | Maximum number of times a transactions will be retried on a deadlock |
maxsockcached | 500 | After this many connections, start requesting that further connections are no longer pooled. |
maxtxn | 128 | Maximum concurrent transactions. |
maxwt | 8 | Maximum number of threads processing write requests |
memp_dump_cache_threshold | 20 | Don’t flush the bufferpool pagelist until at least this percentage of pages has been modified. |
mempget_timeout | 60 (seconds) | |
memstat_autoreport_freq | 180 (sec) | Dump memory usage to trace files at this frequency |
nice | not set | If set, will call nice() with this value to set the database nice level |
no_ack_trace | Turns off ack trace | |
no_lock_conflict_trace | On | Turns off lock_conflict_trace |
no_rep_process_txn_trace | Unsets rep_process_txn_trace | |
no_round_robin_stripes | ||
nocrc32c | Disables crc32c , fall back to CRC32 |
|
noearly | Disables early . With noearly replicant will wait to commit a transaction locally and release locks before acking. The only advantage is that subsequent reads won’t block on this transaction’s lock - the replication semantics don’t change. |
|
nokeycompr | Disable index compression (applies to newly allocated index pages, just like keycompr ) |
|
norcache | Disables rcache |
|
noreallearly | Disables reallyearly |
|
notimeout | not set | Turns off SQL timeouts |
nowatch | not set | Disable watchdog. Watchdog aborts the database if basic things like creating threads, allocating memory, etc. doesn’t work. |
nullfkey | Constraints are enforced for all key values | Do not enforce foreign key constraints for null keys. |
num_record_converts | 100 | During schema changes, pack this many records into a transaction. |
on/off | Enable/disable various switches - see switches | |
osql_verify_ext_chk | 1 | For block transaction mode only - after this many verify errors, see if transaction is non-commitable - see default isolation level |
osql_verify_retry_max | 499 | Retry a transaction on a verify error this many times - see optimistic concurrency control |
osqlprefaultthreads | 0 | If set, send prefaulting hints to nodes. |
osync | Off | Enables O_SYNC on data files (reads still go through FS cache) if directio isn’t set |
page_latches | not set | Experimental If set, in rowlocks mode, will acquire fast latches on pages instead of full locks. |
pagedeadlock_maxpoll | 5 (ms) | Randomly poll for this many ms and retry a deadlocked component of a rowlocks transaction |
pagedeadlock_retries | 500 | Retry a deadlocked component of a rowlock transaction this many times before reporting deadlock for the transaction. |
pageordertablescan | set | Table scans read the table in page order, not row order. |
pbkdf2_iterations | 4096 | Number of PBKDF2 iterations. PBKDF2 is used for password hashing. The higher the value, the more secure and the more computationally expensive. The mininum number of iterations is 4096. |
prefaulthelperthreads | 0 | Max number of prefault helper threads. |
print_deadlock_cycles | 100 | Print deadlock cycle every n-th time a transaction encounters a deadlock. Set to 1 to turn off, set to 1 to print all deadlock cycles. |
print_syntax_err | not set | Trace all SQL with syntax errors. |
query_plan_percentage | 50 | Alarm if the average cost per row of current query plan is n percent above the cost for different query plan. |
querylimit | See query limit commands | |
queuepoll | 0 | Occasionally wake up and poll consumer queues even when no events require it |
rcache | set | Keep a lookaside cache of root pages for b-trees |
reallearly | not set | Ack as soon as a commit record is seen by the replicant (before it’s applied). This effectively makes replication asynchronous, so reads may not see the effects of a committed transaction yet. |
rep_process_txn_trace | not set | If set, report processing time on replicant for all transactions |
repchecksum | 0 | Enable to do additional check-summing of replication stream (log records in replication stream already have checksums) |
replicant_latches | not set | Experimental Also acquire latches on replicants |
replicate_local | 0 | When enabled, record all database events to a comdb2_oplog table. This can be used to set clusters/instances that are fed data from a database cluster. Alternate ways of doing this are planned, so enabling this option should not be needed in the near future. |
report_deadlock_verbose | 0 | If set, dump the current thread’s stack for every deadlock. |
reqldiffstat | 60 (sec) | Set how often the database will dump various usage statistics (each entry will include changes in the last interval) |
reqltruncate | 1 | Disable to always log full SQL queries in request logs (they are truncated by default to save space) |
resource | not set | Registers a file with the databases. Can be referred to from stored procedures. |
round_robin_stripes | 0 | Alternate to which table stripe new records are written. The default is to keep stripe affinity by writer. |
sbuftimeout | not set | Set a timeout on client connections, connections drop if they |
sc_del_unused_files_threshold | ||
setattr | Change bdb tunables - see bdb tunables | |
setclass | See permissioning commands | |
set_snapshot_impl | “modsnap” | Set the implementation to be used for snapshot isolation. Can be one of “original”, “new”, or “modsnap”. |
setsqlattr | See (SQL tunables)[#sql-tunables] | |
sockbplog_sockpool | off | Osql bplog sent over sockets is using local sockpool |
sockbplog | off | Osql bplog is sent from replicants to master on their own socket |
sql_time_threshold | 5000 (ms) | Sets the threshold time in ms after which queries are reported as running a long time. |
sql_tranlevel_default | Sets the default SQL transaction level for the database, see (SQL transaction levels)[#sql-transaction-levels] | |
sqlenginepool | See thread pools | |
sqlflush | not set | Force flushing the current record stream to client every specified number of records |
sqllogger | See request logging | |
sqlsortermaxmmapsize | 2147418112 | maximum amount of file-backed mmap size in bytes to give the sqlite sorter |
sqlsortermem | 314572800 | maximum amount of memory to give the sqlite sorter |
stack_at_lock_get | not set | Collect comdb2_stack for every lock |
stack_at_lock_handle | not set | Collect comdb2_stack for every handle-lock |
stack_at_write_lock | not set | Collect comdb2_stack for every write-lock |
survive_n_master_swings | 600 | Have a node retry applying a transaction against a new master this many times before giving up. |
sync | See sync command | |
tablepenaltyincpercent | See BDB_ATTR_DISABLE_WRITER_PENALTY_DEADLOCK | |
tablescan_cache_utilization | 20 | Percent of cache to allow to be used for table scans. |
temptable_limit | 8192 | Set the maximum number of temporary tables the database can create |
throttle_txn_chunks_msec | 0 | Wait that many milliseconds before starting a new transaction chunk |
throttlesqloverlog | 5 (sec) | On a full queue of SQL requests, dump the current thread pool this often |
update_shadows_interval | 0 | Set to higher than 0 to update snaphots on every Nth operation (default is for every operation) |
use_parallel_schema_change | 1 | Scan stripes for a table in parallel during schema change. |
use_planned_schema_change | 1 | Only change entities that need to change on a schema change. Disable to always rebuild all data files and indices for the changing table. |
Compatibility
In order to allow seamless migration between older and newer Comdb2 versions, some tunables are introduced to disable incompatible features introduced in newer versions.
legacy_tunables
(introduced in 7.0
)
This tunable disables all new features/behaviours that, if used, would have prevented downgrading to older versions. It implicitly enables following configurations:
allow_negative_column_size
berkattr elect_highest_committed_gen 0
clean_exit_on_sigterm off
create_default_user
ddl_cascade_drop 0
decoupled_logputs off
disable_inplace_blob_optimization
disable_inplace_blobs
disable_osql_blob_optimization
disable_tpsc_tblvers
disallow write from beta if prod
dont_forbid_ulonglong
dont_init_with_inplace_updates
dont_init_with_instant_schema_change
dont_init_with_ondisk_header
dont_prefix_foreign_keys
dont_sort_nulls_with_header
dont_superset_foreign_keys
enable_sql_stmt_caching none
enable_tagged_api
env_messages
init_with_time_based_genids
legacy_schema on
logmsg level info
logmsg notimestamp
logmsg skiplevel
logput window 1
no_null_blob_fix
no_static_tag_blob_fix
noblobstripe
nochecksums
nocrc32c
nokeycompr
norcache
nullfkey off
nullsort high
off fix_cstr
off osql_odh_blob
off return_long_column_names
on accept_on_child_nets
on disable_etc_services_lookup
online_recovery off
osql_send_startgen off
queuedb_genid_filename off
reorder_socksql_no_deadlock off
setattr DIRECTIO 0
setattr ENABLE_SEQNUM_GENERATIONS 0
setattr MASTER_LEASE 0
setattr NET_SEND_GBLCONTEXT 1
unnatural_types 1
usenames
legacy_schema
(introduced in 7.0
)
Turning it off
would enable support for newer (backwards incompatible) CSC2
constructs introduced in 7.0
. The features include: uniqnulls (UNIQUE/PRIMARY
KEY in DDL), partial index, index on expression and non-null default value for
datetime fields. This list could grow in future.
noenv_messages
(introduced in 7.0
)
Turing it on
would enable support for upsert and partial index, both
introduced in 7.0
.