How to Set Up repmgr for PostgreSQL High Availability: Installation, Deployment, and Key Components Explained

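I. repmgr Overview

repmgr is an open-source tool suite for managing and extending PostgreSQL's built-in replication and failover mechanisms. Its main functions are setting up standby servers, monitoring replication status, performing failover (automatically or manually) when a failure occurs, and carrying out controlled switchovers.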

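Replication cluster: In repmgr, a "replication cluster" is a group of PostgreSQL servers connected by streaming replication. Data is replicated between these servers, which keeps the standbys consistent with the primary and provides high availability.
Node: Within a replication cluster, a "node" is a single PostgreSQL server instance. Each node acts as either a primary or a standby.
Upstream node: From a standby's point of view, the "upstream node" is the node it receives replication data from. This is usually the primary, but with cascading replication it can be another standby.
Failover: A "failover" takes place when the primary fails and a selected standby is promoted to become the new primary. The `repmgrd` daemon can be configured for automatic failover to keep the service interruption as short as possible.
Switchover: A "switchover" is a controlled operation that deliberately moves the primary role to a standby. Unlike a failover, it is carried out while the primary is still healthy, for example for planned maintenance.
Fencing: After a failover, the old primary must be prevented from unexpectedly rejoining the cluster and causing conflicting writes (a split-brain situation), so a "fencing" strategy is required. Fencing keeps the old primary isolated from the rest of the cluster.
Witness server: repmgr supports a "witness server", which takes no part in data replication but holds a copy of the cluster metadata. During a failover it gives the standbys an extra vantage point for judging whether the primary has really failed or is merely unreachable because of a network partition, so that a new primary is promoted only when appropriate.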
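The automatic failover mentioned in the Failover entry is switched on in repmgr.conf. The following is a minimal sketch, not a complete setup: `append_failover_settings` is a hypothetical helper, and the `/etc/repmgr.conf` path inside the commands is assumed to match the configuration file used later in this article. The three parameters (`failover`, `promote_command`, `follow_command`) are standard repmgr.conf settings.

```shell
#!/bin/sh
# Append a minimal automatic-failover section to a repmgr.conf file.
# $1: path of the repmgr.conf to extend (hypothetical helper, not part of repmgr).
append_failover_settings() {
    cat >> "$1" <<'EOF'
# let repmgrd promote a standby automatically when the primary fails
failover='automatic'
# command repmgrd runs on the standby chosen as the new primary
promote_command='repmgr standby promote -f /etc/repmgr.conf --log-to-file'
# command repmgrd runs on the remaining standbys; %n is the new primary's node ID
follow_command='repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'
EOF
}
```

For example, `append_failover_settings /etc/repmgr.conf` would be run on every node. repmgrd additionally expects `shared_preload_libraries = 'repmgr'` in postgresql.conf, followed by a PostgreSQL restart.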

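II. Components

repmgr consists of two main components:

repmgr: a command-line tool for administrative tasks such as:
1. Setting up and starting standby servers
2. Promoting a standby server to become the new primary
3. Switching roles between the primary and a standby
4. Displaying the status of each server in the replication cluster

repmgrd: a daemon that actively monitors the replication cluster and performs the following tasks:
1. Monitoring replication performance and recording the related data
2. Performing failover by detecting a failure of the primary and automatically promoting the most suitable standby
3. Passing notifications about cluster events to a user-defined script, which can then carry out tasks such as sending email alerts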
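The event notifications described above are delivered by having repmgr run a command named in repmgr.conf. Here is a small sketch of such a handler. The script path, the `EVENT_LOG` variable, and the `format_event` function are assumptions made for illustration; the placeholders `%n`, `%e`, `%s`, `%t`, and `%d` are the ones repmgr substitutes for node ID, event type, success flag, timestamp, and details.

```shell
#!/bin/sh
# Hypothetical handler for repmgr's event_notification_command.
# Assumed registration in repmgr.conf (the script path is a placeholder):
#   event_notification_command='/usr/local/bin/repmgr_event.sh %n %e %s "%t" "%d"'
# Arguments: node ID, event type, success flag (1 or 0), timestamp, details.
format_event() {
    if [ "$3" = "1" ]; then ev_status="OK"; else ev_status="FAILED"; fi
    printf '[%s] node %s: %s %s (%s)\n' "$4" "$1" "$2" "$ev_status" "$5"
}

# When invoked with the five arguments, append the formatted event to a log file.
# EVENT_LOG is an assumed variable; a mail or chat command could replace the redirect.
if [ "$#" -ge 5 ]; then
    format_event "$@" >> "${EVENT_LOG:-/tmp/repmgr_events.log}"
fi
```

Writing to a log file keeps the sketch free of external dependencies; the same formatted line could just as well be piped to a mail command.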

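III. Installation and Deployment

1. Environment

Key points

Deployment on Windows is not supported
All nodes in a cluster must run the same PostgreSQL major version
repmgr must be installed on every node in the cluster, at the same version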
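The version rule above can be checked before any node is registered. The sketch below compares PostgreSQL version strings; `pg_major` and `same_major` are hypothetical helper names, and the version strings are assumed to come from something like `psql -Atc 'SHOW server_version'` run against each node.

```shell
#!/bin/sh
# Print the major version of a PostgreSQL server_version string.
# From PostgreSQL 10 onward the major version is the first number ("15.4" -> "15");
# for 9.x releases it is the first two numbers ("9.6.24" -> "9.6").
pg_major() {
    case "$1" in
        9.*) printf '%s\n' "$1" | cut -d. -f1,2 ;;
        *)   printf '%s\n' "$1" | cut -d. -f1 ;;
    esac
}

# Succeed only if every version string passed in has the same major version.
same_major() {
    sm_first="$(pg_major "$1")"
    for sm_v in "$@"; do
        [ "$(pg_major "$sm_v")" = "$sm_first" ] || return 1
    done
}
```

A call such as `same_major 15.4 15.6 || echo "version mismatch"` would flag a cluster that mixes major versions.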

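repmgr and PostgreSQL version compatibility (the matrix is also available on GitHub and in the official documentation)

repmgr version	Supported PostgreSQL versions
repmgr 5.4	9.4, 9.5, 9.6, 10, 11, 12, 13, 15, 16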
repmgr 5.3	9.4, 9.5, 9.6, 10, 11, 12, 13, 14, 15
repmgr 5.2	9.4, 9.5, 9.6, 10, 11, 12, 13
repmgr 5.1	9.3, 9.4, 9.5, 9.6, 10, 11, 12
repmgr 5.0	9.3, 9.4, 9.5, 9.6, 10, 11, 12
repmgr 4.x	9.3, 9.4, 9.5, 9.6, 10, 11
repmgr 3.x	9.3, 9.4, 9.5, 9.6
repmgr 2.x	9.0, 9.1, 9.2, 9.3, 9.4

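2. Installation

Modify the PostgreSQL parameters

Note: this assumes PostgreSQL is already installed (if you are not familiar with the process, a single yum install takes care of it).


# Edit postgresql.conf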
listen_addresses = '*'
wal_level = logical
wal_log_hints = on

# Restart PostgreSQL so the changes take effect
systemctl restart postgresql-15
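Before restarting, it can help to confirm that the file really contains the settings this walkthrough relies on. The following is a rough sketch: `conf_value` and `check_repmgr_prereqs` are hypothetical helpers, and the parser assumes simple `name = value` lines with the values spelled exactly as shown above (for instance `on`, not `true`).

```shell
#!/bin/sh
# Print the value of parameter $2 in the postgresql.conf-style file $1.
# The last uncommented assignment wins, matching how PostgreSQL reads the file.
conf_value() {
    sed -n "s/^[[:space:]]*${2}[[:space:]]*=[[:space:]]*//p" "$1" |
        sed -e 's/[[:space:]]*#.*$//' -e 's/[[:space:]]*$//' \
            -e "s/^'//" -e "s/'\$//" |
        tail -n 1
}

# Succeed if wal_level and wal_log_hints are explicitly set as this article expects.
check_repmgr_prereqs() {
    case "$(conf_value "$1" wal_level)" in
        replica|logical) ;;
        *) echo "wal_level must be replica or logical" >&2; return 1 ;;
    esac
    [ "$(conf_value "$1" wal_log_hints)" = "on" ] || {
        echo "wal_log_hints is expected to be on" >&2
        return 1
    }
}
```

The path to pass in depends on the installation; the check only reads the file and never modifies it.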

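Create the database and account required by repmgr


# Create the repmgr user and database
create user repmgr with superuser password 'repmgr123';
create database repmgr owner repmgr;

Configure authentication in pg_hba.conf
# Allow user repmgr to make replication connections over the local socket and from 10.248.32.x (with a /24 mask, each host line matches the whole 10.248.32.0/24 subnet)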
local   replication   repmgr                              trust
host    replication   repmgr      10.248.32.187/24        trust
host    replication   repmgr      10.248.32.188/24        trust

# Allow user repmgr to connect to the repmgr database from the same sources
local   repmgr   repmgr                              trust
host    repmgr   repmgr      10.248.32.187/24        trust
host    repmgr   repmgr      10.248.32.188/24        trust

# Reload PostgreSQL to apply the pg_hba.conf changes
systemctl reload postgresql-15
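The pg_hba.conf entries above follow a fixed pattern: one replication rule and one database rule per node. As an illustration, the sketch below generates them from a list of node addresses. `hba_rules` is a hypothetical helper, and the `/32` mask is an assumption that restricts each rule to a single host instead of the whole subnet.

```shell
#!/bin/sh
# Print pg_hba.conf rules that let one user reach both the "replication"
# pseudo-database and a named database from each listed node address.
# Arguments: auth method, user, database, then one IPv4 address per node.
hba_rules() {
    hr_method="$1"; hr_user="$2"; hr_db="$3"; shift 3
    for hr_ip in "$@"; do
        printf 'host    %-13s %-10s %-20s %s\n' replication "$hr_user" "$hr_ip/32" "$hr_method"
        printf 'host    %-13s %-10s %-20s %s\n' "$hr_db" "$hr_user" "$hr_ip/32" "$hr_method"
    done
}
```

For the two nodes in this article, `hba_rules trust repmgr repmgr 10.248.32.187 10.248.32.188` prints four rules that could be appended to pg_hba.conf before the reload.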

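Passwordless SSH between the repmgr nodes and passwordless PostgreSQL connections


# On any one node, generate a key pair as the postgres user (press Enter at every prompt)
ssh-keygen -t rsa -b 4096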
Generating public/private rsa key pair.
Enter file in which to save the key (/var/lib/postgresql/.ssh/id_rsa):
Created directory '/var/lib/postgresql/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /var/lib/postgresql/.ssh/id_rsa.
Your public key has been saved in /var/lib/postgresql/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:fokF65XAW82Z8xI1SJuPlmCKnEuchkj6uder8nVp+c4 postgres@cda1-032187-test-tb-postgresql-goodscenter
The key's randomart image is:
+---[RSA 4096]----+
|           ...o  |
|       .   o.* . |
|  .     + + X    |
| o . + + O o B   |
|. . . O S + = o  |
| . . o + * o .   |
|  o  .o O o      |
|  .....o +       |
|  .+o... .E      |
+----[SHA256]-----+

cat /var/lib/postgresql/.ssh/id_rsa.pub >/var/lib/postgresql/.ssh/authorized_keys

# Copy the key material to each of the other nodes
scp -r /var/lib/postgresql/.ssh/ root@other-ip:/var/lib/postgresql/
-- On the other nodes, fix the permissions and ownership
chmod 0700 /var/lib/postgresql/.ssh/
chmod 0600 /var/lib/postgresql/.ssh/*
chown postgres:postgres /var/lib/postgresql/.ssh/ -R

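# On every node, add these entries to ~/.pgpass of the postgres user (format: host:port:database:user:password)
ip1:5432:repmgr:repmgr:repmgr123
ip2:5432:repmgr:repmgr:repmgr123

chmod 0600 ~/.pgpass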
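The same password file has to exist on every node, so generating it from a host list avoids typos. The following sketch does that; `write_pgpass` is a hypothetical helper and assumes the password contains no `:` or `\` characters, which would need escaping in this file format.

```shell
#!/bin/sh
# Write one .pgpass line per host and make the file readable by its owner only.
# Arguments: target file, port, database, user, password, then host names.
write_pgpass() {
    wp_file="$1"; wp_port="$2"; wp_db="$3"; wp_user="$4"; wp_pass="$5"; shift 5
    # Create or truncate the file with a restrictive umask.
    ( umask 077; : > "$wp_file" ) || return 1
    for wp_host in "$@"; do
        printf '%s:%s:%s:%s:%s\n' "$wp_host" "$wp_port" "$wp_db" "$wp_user" "$wp_pass" >> "$wp_file"
    done
    # libpq ignores the file unless group and others have no access.
    chmod 0600 "$wp_file"
}
```

An invocation such as `write_pgpass ~/.pgpass 5432 repmgr repmgr repmgr123 ip1 ip2` reproduces the two entries shown above. If password authentication were used for replication connections instead of trust, matching lines with `replication` in the database field would also be needed.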

Add the primary node


-- Register the primary node (IP1); start with its repmgr configuration file
cat /etc/repmgr.conf
node_id=****
node_name='****'
conninfo='host=**** port=**** user=**** dbname=**** connect_timeout=****'
data_directory='/pgdata/'
ssh_options='-q -o ConnectTimeout=10'

-- Set the file ownership
chown postgres:postgres /etc/repmgr.conf

-- Register the primary node
su - postgres
repmgr -f /etc/repmgr.conf primary register
INFO: connecting to primary database...
NOTICE: attempting to install extension "repmgr"
NOTICE: "repmgr" extension successfully installed
NOTICE: primary node record (ID: 1) registered
-- Verify the cluster
repmgr -f /etc/repmgr.conf cluster show
 ID | Name          | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------------------------
**** | **** | primary | * running |          | default  | 100      | 1        | host=**** port=**** user=**** dbname=**** connect_timeout=****
-- The corresponding record in the repmgr metadata table
repmgr=# select * from nodes;
-[ RECORD 1 ]----+-------------------------------------------------------------------------
node_id          | ****
upstream_node_id |
active           | t
node_name        | ****
type             | primary
location         | default
priority         | 100
conninfo         | host=**** port=**** user=**** dbname=**** connect_timeout=****
slot_name        |
config_file      | /etc/repmgr.conf

# Load test data on the primary (pg1); the database name is passed as a positional argument
psql -c "create database demo01;"
pgbench -i -s 20 demo01
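Every node needs its own repmgr.conf, and only the node ID, node name, and host differ between them. A template function is one way to keep the files consistent. In this sketch `render_repmgr_conf` is a hypothetical helper, and the port, user, database name, and `connect_timeout` value are assumptions, since the article masks the real ones.

```shell
#!/bin/sh
# Print a repmgr.conf for one node.
# Arguments: node ID, node name, host, data directory.
# port, user, dbname and connect_timeout below are assumed example values.
render_repmgr_conf() {
    cat <<EOF
node_id=$1
node_name='$2'
conninfo='host=$3 port=5432 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='$4'
ssh_options='-q -o ConnectTimeout=10'
EOF
}
```

On the first node this might be used as `render_repmgr_conf 1 ip1 10.248.32.187 /pgdata/ > /etc/repmgr.conf`, followed by the ownership change and `primary register` steps shown above.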

Add the standby node


-- Register the standby node (IP2); start with its repmgr configuration file
cat /etc/repmgr.conf
node_id=****
node_name='****'
conninfo='host=**** port=**** user=**** dbname=**** connect_timeout=****'
data_directory='/pgdata/'
ssh_options='-q -o ConnectTimeout=10'
-- Set the file ownership
chown postgres:postgres /etc/repmgr.conf
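-- Use --dry-run to check whether the standby can be cloned
The dry run mainly checks the following:
  the destination data directory
  that max_wal_senders on the source node is at least 2
  the wal_log_hints setting
  if every check passes, the real run executes pg_basebackup -l "repmgr base backup"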

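-- Stop PostgreSQL on the standby first (use the service name that matches the installed version)
systemctl stop postgresql-16
-- The clone can only run while PostgreSQL is stopped (add --force if the local data directory is not empty)
repmgr -h ip -U user -d database -f /etc/repmgr.conf standby clone --dry-run
-- Once the dry run passes, run the same command without --dry-run to perform the clone; the output below is from that real run
repmgr -h ip -U user -d database -f /etc/repmgr.conf standby clone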
NOTICE: destination directory "/pgdata" provided
INFO: connecting to source node
DETAIL: connection string is: host=**** port=**** user=**** dbname=****
DETAIL: current installation size is 337 MB
INFO: replication slot usage not requested;  no replication slot will be set up for this standby
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: checking and correcting permissions on existing directory "/pgdata"
NOTICE: starting backup (using pg_basebackup)...
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing:
  pg_basebackup -l "repmgr base backup"  -D /pgdata -h ip -p port -U user -X stream
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: pg_ctl -D /pgdata start
HINT: after starting the server, you need to register this standby with "repmgr standby register"


-- Start the standby
systemctl start postgresql-16

-- Register the standby node
repmgr -f /etc/repmgr.conf standby register
INFO: connecting to local node "ip2" (ID: 2)
INFO: connecting to primary database
WARNING: --upstream-node-id not supplied, assuming upstream node is primary (node ID: 1)
INFO: standby registration complete
NOTICE: standby node "ip2" (ID: 2) successfully registered

-- Show the cluster status
repmgr -f /etc/repmgr.conf cluster show
 ID | Name          | Role    | Status    | Upstream      | Location | Priority | Timeline | Connection string
----+---------------+---------+-----------+---------------+----------+----------+----------+--------------------------------------------------------------------------
 1  | ip1 | primary | * running |               | default  | 100      | 1        | host=**** port=**** user=**** dbname=**** connect_timeout=****
 2  | ip2 | standby |   running |               | default  | 100      | 1        | host=**** port=**** user=**** dbname=**** connect_timeout=****
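The table printed by `cluster show` can also be checked from a script, for example to confirm that the new standby is really running. The sketch below parses that table; `running_nodes` is a hypothetical helper, and it assumes the column order shown above and treats only the statuses `running` and `* running` as healthy.

```shell
#!/bin/sh
# Read "repmgr cluster show" output on stdin and print the names of nodes
# whose Role column equals $1 and whose Status column is "running" or "* running".
running_nodes() {
    awk -F'|' -v role="$1" '
        NF >= 4 {
            # Trim the padding around the first four columns (ID, Name, Role, Status).
            for (i = 1; i <= 4; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
            if ($3 == role && ($4 == "running" || $4 == "* running")) print $2
        }'
}
```

Used as `repmgr -f /etc/repmgr.conf cluster show | running_nodes standby`, it prints one healthy standby name per line and skips the header and separator rows.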

-- Switch the primary and standby roles (run this on the standby that is to become the new primary)
repmgr -f /etc/repmgr.conf standby switchover

repmgr -f /etc/repmgr.conf cluster show
 ID | Name          | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+---------------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------------------------
 1  | ip1 | standby |   running |          | default  | 100      | 3        | host=**** port=**** user=**** dbname=**** connect_timeout=****
 2  | ip2 | primary | * running |          | default  | 100      | 4        | host=**** port=**** user=**** dbname=**** connect_timeout=****
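Because a switchover stops the current primary, it is reasonable to rehearse it first with the `--dry-run` option of `standby switchover`. The wrapper below is a sketch of that idea: `checked_switchover` is a hypothetical function, and `REPMGR` and `REPMGR_CONF` are assumed variables for overriding the binary and the configuration path.

```shell
#!/bin/sh
# Run a switchover only after repmgr's own dry run succeeds.
# Must be executed on the standby that is to become the new primary.
checked_switchover() {
    cs_bin="${REPMGR:-repmgr}"
    cs_conf="${REPMGR_CONF:-/etc/repmgr.conf}"
    if "$cs_bin" -f "$cs_conf" standby switchover --dry-run; then
        "$cs_bin" -f "$cs_conf" standby switchover
    else
        echo "dry run failed; switchover not attempted" >&2
        return 1
    fi
}
```

Running `checked_switchover` as the postgres user on ip2 would perform the same operation as the command above, but stop early if the dry run reports a problem such as missing SSH access to the current primary.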