logo头像

You name it , I got it !

[筆記] 在ubuntu 18.04安裝psql 11 以及 pg_auto_failover / install psql 11 and pg_auto_failover in ubuntu 18.04

最近都在弄postgresql

備份、還原測試得差不多了,就等著看到時候要用什麼方式

前幾天看到 pg_auto_failover 這個postgresql 的extension

https://github.com/citusdata/pg_auto_failover

感覺挺不錯的,看起來設定很簡單,雖然之前已經測試了 keepalived 做 HA

不過,反正當作練功嘛,多測試一套也不錯!

基本的邏輯是 一台 monitor , 一台 master/primary node ,一台 slave/secondary node 組成一個cluster

官方提供的架構圖如下

pg_auto_failover architecture

當master/primary node 上面的 postgresql 服務死掉了,slave/secondary node 會自動接手

等到master/primary node 回來之後,會自動降級為 slave/secondary node

而原本的 slave/secondary node 就會變成 master/primary


安裝相關套件

以下步驟在三個node 都要操作

安裝相依套件

1
sudo apt install make libssl1.0.0 libssl-dev libkrb5-3 libkrb5-dev libpq5 libpq-dev pgdg-keyring ssl-cert postgresql-plperl-11 postgresql-pltcl-11 postgresql-plpython-11 postgresql-plpython3-11 postgresql-11 postgresql-common postgresql-server-dev-11 postgresql-client-11 postgresql-client-common postgresql-doc-11 tcl8.6 libjson-perl tcl-tclreadline

安裝完成後,切換到postgres身份,把原本系統內的postgresql 檔存放目錄還有資料庫檔案目錄都清掉

1
2
3
4
sudo su - postgres
mv /etc/postgresql/11 /etc/postgresql/11.bak
mv /var/lib/postgresql/11 /var/lib/postgresql/11.bak
exit

開始安裝 pg_auto_failover

1
2
curl https://install.citusdata.com/community/deb.sh | sudo bash
sudo apt install postgresql-11-auto-failover -y

postgresql 的相關執行檔路徑,預設不會載入到PATH變數中,所以要自己手動增加

直接加入 /etc/environment ,或者是去修改使用者的 .profile 載入正確的 $PATH 都可以

1
2
#replace PATH variable in /etc/environment
PATH="/usr/lib/postgresql/11/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games"

因為要用 pg_auto_failover 來控制postgresql ,所以要把系統內建的服務先停掉,並且設定開啟不啟動

1
2
sudo systemctl disable postgresql
sudo systemctl stop postgresql

為了讓三台機器可以直接用hostname溝通,所以要修改 /etc/hosts

1
2
3
4
#add to /etc/hosts
192.168.11.151 monitor monitor.abc.com
192.168.11.152 pg-primary pg-primary.abc.com
192.168.11.153 pg-slave pg-slave.abc.com

以上完成在三台node上,安裝postgresql/pg_auto_failover 的工作


設定 monitor node

在monitor node操作

建立monitor node

1
sudo runuser -l postgres -c "/usr/bin/pg_autoctl create monitor --pgdata /var/lib/postgresql/11/main --pgport 5432 --nodename monitor"

這個步驟會在pg_hba.conf 中,新增一筆紀錄

1
host "pg_auto_failover" "autoctl_node" 192.168.11.0/24 trust # Auto-generated by pg_auto_failover

檢查一下狀態,看看有沒有安裝成功

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
2019-09-19 06:56:09 [administrator@monitor ~]$ sudo runuser -l postgres -c "psql -c '\'du"
List of roles
Role name | Attributes | Member of
--------------+------------------------------------------------------------+-----------
autoctl | | {}
autoctl_node | | {}
postgres | Superuser, Create role, Create DB, Replication, Bypass RLS | {}


2019-09-19 06:56:15 [administrator@monitor ~]$ sudo runuser -l postgres -c "psql -c '\'l"
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
------------------+----------+-----------+---------+-------+-----------------------
pg_auto_failover | autoctl | SQL_ASCII | C | C |
postgres | postgres | SQL_ASCII | C | C |
template0 | postgres | SQL_ASCII | C | C | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | SQL_ASCII | C | C | =c/postgres +
| | | | | postgres=CTc/postgres
(4 rows)

OK,看起來沒有問題

啟動pg_auto_failover monitor

1
2
3
4
5
6
2019-09-19 06:56:19 [administrator@monitor ~]$ sudo runuser -l postgres -c "pg_autoctl run --pgdata /var/lib/postgresql/11/main"
06:57:38 INFO Managing PostgreSQL installation at "/var/lib/postgresql/11/main"
06:57:38 INFO PostgreSQL is running in "/var/lib/postgresql/11/main" on port 5432
06:57:38 INFO The version of extenstion "pgautofailover" is "1.0" on the monitor
06:57:38 INFO pg_auto_failover monitor is ready at postgres://autoctl_node@monitor:5432/pg_auto_failover
06:57:38 INFO Contacting the monitor to LISTEN to its events

這個指令會停留在console畫面上,要不就是在最後加上

放去背景執行
1
2
3
4
5
6

要不就是用systemd/supervisor 來控制,這個後面再來改

就先放著讓他跑,還可以順便觀察cluster變動時的狀態

再開一個terminal 跑底下的指令,產生monitor node 的URI ,在建立maser/slave node 的時候會用到

2019-09-19 06:58:39 [administrator@monitor ~]$ sudo runuser -l postgres -c “pg_autoctl show uri –pgdata /var/lib/postgresql/11/main”
postgres://autoctl_node@monitor:5432/pg_auto_failover
2019-09-19 06:59:01 [administrator@monitor ~]$

1
2
3
4
5
6
7
8

以上完成monitor node 的準備工作

***

#### 設定master/primary node

**建立pg_auto_failover master/primary node**

2019-09-19 15:04:16 [administrator@pg-primary ~]$ sudo runuser -l postgres -c “pg_autoctl create postgres –pgdata /var/lib/postgresql/11/main –pgport 5432 –nodename pg-primary –monitor postgres://autoctl_node@monitor:5432/pg_auto_failover”
15:04:19 INFO Found pg_ctl for PostgreSQL 11.5 at /usr/lib/postgresql/11/bin/pg_ctl
15:04:19 INFO Registered node pg-primary:5432 with id 1 in formation “default”, group 0.
15:04:19 INFO Writing keeper init state file at “/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/11/main/pg_autoctl.init”
15:04:19 INFO Successfully registered as “single” to the monitor.
15:04:21 INFO Initialising a PostgreSQL cluster at “/var/lib/postgresql/11/main”
15:04:21 INFO Postgres is not running, starting postgres
15:04:21 INFO /usr/lib/postgresql/11/bin/pg_ctl –pgdata /var/lib/postgresql/11/main –options “-p 5432” –options “-h *” –wait start
15:04:21 INFO CREATE DATABASE postgres;
15:04:21 INFO The database “postgres” already exists, skipping.
15:04:21 INFO FSM transition from “init” to “single”: Start as a single node
15:04:21 INFO Initialising postgres as a primary
15:04:21 INFO Transition complete: current state is now “single”
15:04:21 INFO Keeper has been succesfully initialized.
2019-09-19 15:04:21 [administrator@pg-primary ~]$

1
2

**啟動 pg_auto_failover master/primary node**

2019-09-19 15:14:34 [administrator@pg-primary ~]$ sudo runuser -l postgres -c “pg_autoctl run –pgdata /var/lib/postgresql/11/main”
15:15:11 INFO Managing PostgreSQL installation at “/var/lib/postgresql/11/main”
15:15:11 INFO The version of extenstion “pgautofailover” is “1.0” on the monitor
15:15:11 INFO pg_autoctl service is starting
15:15:11 INFO Calling node_active for node default/1/0 with current state: single, PostgreSQL is running, sync_state is “”, WAL delta is -1.
15:15:16 INFO Calling node_active for node default/1/0 with current state: single, PostgreSQL is running, sync_state is “”, WAL delta is -1.
15:15:21 INFO Calling node_active for node default/1/0 with current state: single, PostgreSQL is running, sync_state is “”, WAL delta is -1.

1
2

一樣,執行完之後,會停留在畫面上,所以開另一個視窗來執行以下檢查的指令

2019-09-19 15:17:08 [administrator@pg-primary ~]$ sudo runuser -l postgres -c “pg_autoctl show state –pgdata /var/lib/postgresql/11/main”
Name | Port | Group | Node | Current State | Assigned State
———–+——–+——-+——-+——————-+——————
pg-primary | 5432 | 0 | 1 | single | single

2019-09-19 15:17:16 [administrator@pg-primary ~]$

1
2
3
4
5
6
7
8

以上,master/primary node 設定結束

***

####設定 slave/secondary node

**建立 pg_auto_failover slave/secondary node**

2019-09-19 07:04:08 [administrator@pg-slave ~]$ sudo runuser -l postgres -c “pg_autoctl create postgres –pgdata /var/lib/postgresql/11/main –pgport 5432 –nodename pg-slave –monitor postgres://autoctl_node@monitor:5432/pg_auto_failover”
07:19:57 INFO Found pg_ctl for PostgreSQL 11.5 at /usr/lib/postgresql/11/bin/pg_ctl
07:19:57 INFO Registered node pg-slave:5432 with id 2 in formation “default”, group 0.
07:19:57 INFO Writing keeper init state file at “/var/lib/postgresql/.local/share/pg_autoctl/var/lib/postgresql/11/main/pg_autoctl.init”
07:19:57 INFO Successfully registered as “wait_standby” to the monitor.
07:19:57 INFO FSM transition from “init” to “wait_standby”: Start following a primary
07:19:57 INFO Transition complete: current state is now “wait_standby”
07:20:02 INFO FSM transition from “wait_standby” to “catchingup”: The primary is now ready to accept a standby
07:20:02 INFO The primary node returned by the monitor is pg-primary:5432
07:20:02 INFO Initialising PostgreSQL as a hot standby
07:20:02 INFO Running /usr/lib/postgresql/11/bin/pg_basebackup -w -h pg-primary -p 5432 –pgdata /var/lib/postgresql/11/backup -U pgautofailover_replicator –write-recovery-conf –max-rate 100M –wal-method=stream –slot pgautofailover_standby …
07:20:04 INFO pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/2000028 on timeline 1
pg_basebackup: starting background WAL receiver
0/23751 kB (0%), 0/1 tablespace (…ostgresql/11/backup/backup_label)
13103/23751 kB (55%), 0/1 tablespace (…gresql/11/backup/base/13091/2602)
23761/23761 kB (100%), 0/1 tablespace (…esql/11/backup/global/pg_control)
23761/23761 kB (100%), 1/1 tablespace
pg_basebackup: write-ahead log end point: 0/20000F8
pg_basebackup: waiting for background process to finish streaming …
pg_basebackup: base backup completed

07:20:04 INFO Postgres is not running, starting postgres
07:20:04 INFO /usr/lib/postgresql/11/bin/pg_ctl –pgdata /var/lib/postgresql/11/main –options “-p 5432” –options “-h *” –wait start
07:20:04 INFO PostgreSQL started on port 5432
07:20:04 INFO Transition complete: current state is now “catchingup”
07:20:04 INFO Keeper has been succesfully initialized.
2019-09-19 07:20:04 [administrator@pg-slave ~]$

1
2
3
可以看到在建立slave/secondary node 的時候,就會開始第一次的同步

**啟動 pg_auto_failover slave/secondary node

2019-09-19 07:20:04 [administrator@pg-slave ~]$ sudo runuser -l postgres -c “pg_autoctl run –pgdata /var/lib/postgresql/11/main”
07:21:33 INFO Managing PostgreSQL installation at “/var/lib/postgresql/11/main”
07:21:33 INFO The version of extenstion “pgautofailover” is “1.0” on the monitor
07:21:33 INFO pg_autoctl service is starting
07:21:33 INFO Calling node_active for node default/2/0 with current state: catchingup, PostgreSQL is running, sync_state is “”, WAL delta is -1.
07:21:33 INFO FSM transition from “catchingup” to “secondary”: Convinced the monitor that I’m up and running, and eligible for promotion again
07:21:33 INFO Transition complete: current state is now “secondary”
07:21:34 INFO Calling node_active for node default/2/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
07:21:39 INFO Calling node_active for node default/2/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.

1
2

開新視窗確認狀態

2019-09-19 07:22:03 [administrator@pg-slave ~]$ sudo runuser -l postgres -c “pg_autoctl show state –pgdata /var/lib/postgresql/11/main”
Name | Port | Group | Node | Current State | Assigned State
———–+——–+——-+——-+——————-+——————
pg-primary | 5432 | 0 | 1 | primary | primary
pg-slave | 5432 | 0 | 2 | secondary | secondary

2019-09-19 07:22:45 [administrator@pg-slave ~]$

1
2
3
4
5
6

正常的話,應該就是會出現兩個 node ,一個 pg-primary 一個pg-slave

**generate URI for applications**

這個步驟是用來產生給client/applications 連線用的連線字串(connection string)

2019-09-19 07:29:56 [administrator@pg-slave ~]$ sudo runuser -l postgres -c “pg_autoctl show uri –formation default –pgdata /var/lib/postgresql/11/main”
postgres://pg-slave:5432,pg-primary:5432/postgres?target_session_attrs=read-write
2019-09-19 07:30:06 [administrator@pg-slave ~]$

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
以後就都通過這個URI來存取這個cluster

切記,**不可以用系統內建的postgresql service,一定要用 pg_auto_failover 來啟動DB**

以上 slave/secondary node 設定完成

***

設定上並不複雜,比起keepalived 要簡單太多了..

那就要評估看看在異常狀況發生,切換資料庫時的表現了

接下來就繼續測試auto failover 的功能!

***

#### 測試 auto failover

首先,在 pg-prmiary node 上,透過上面產生的URI來進入psql

然後建立一個測試db、建立一個測試table,插入幾筆資料

2019-09-20 10:58:21 [administrator@pg-primary ~]$ sudo runuser -l postgres -c “psql postgres://pg-slave:5432,pg-primary:5432/postgres?target_session_attrs=read-write”
psql (11.5 (Ubuntu 11.5-1.pgdg18.04+1))
Type “help” for help.

postgres=# create database testdb_1058;
CREATE DATABASE
postgres=# create table testtbl (serial int);
CREATE TABLE
postgres=# insert into testtbl values (111);
INSERT 0 1
postgres=# insert into testtbl values (222);
INSERT 0 1
postgres=# insert into testtbl values (3);
INSERT 0 1
postgres=# insert into testtbl values (444);
INSERT 0 1
postgres=# select * from testtbl;

serial

111
222
  3
444

(4 rows)

postgres=# \q
2019-09-20 10:59:30 [administrator@pg-primary ~]$

1
2

然後切換到 pg-slave ,一樣透過URI進入psql 撈看看資料

2019-09-20 03:02:12 [administrator@pg-slave ~]$ sudo runuser -l postgres -c “psql postgres://pg-slave:5432,pg-primary:5432/postgres?target_session_attrs=read-write”
psql (11.5 (Ubuntu 11.5-1.pgdg18.04+1))
Type “help” for help.

postgres=# select * from testtbl;

serial

111
222
  3
444

(4 rows)

postgres=#

1
2
3
4
5
6

確認透過URI的確是可以正常的存取資料

在兩台單機上(master/slave)上也都可以看到同樣的資料,代表資料有正確的被同步到兩台node

**pg-slave**

2019-09-20 03:03:06 [administrator@pg-slave ~]$ sudo runuser -l postgres -c psql
psql (11.5 (Ubuntu 11.5-1.pgdg18.04+1))
Type “help” for help.

postgres=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
————-+———-+———–+———+——-+———————–
nexus | postgres | SQL_ASCII | C | C |
postgres | postgres | SQL_ASCII | C | C |
template0 | postgres | SQL_ASCII | C | C | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | SQL_ASCII | C | C | =c/postgres +
| | | | | postgres=CTc/postgres
testdb_1058 | postgres | SQL_ASCII | C | C |
(5 rows)

postgres=# select * from testtbl;

serial

111
222
  3
444

(4 rows)

postgres=#

1
2

**pg-master**

2019-09-20 10:59:30 [administrator@pg-primary ~]$ sudo runuser -l postgres -c psql
psql (11.5 (Ubuntu 11.5-1.pgdg18.04+1))
Type “help” for help.

postgres=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
————-+———-+———–+———+——-+———————–
nexus | postgres | SQL_ASCII | C | C |
postgres | postgres | SQL_ASCII | C | C |
template0 | postgres | SQL_ASCII | C | C | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | SQL_ASCII | C | C | =c/postgres +
| | | | | postgres=CTc/postgres
testdb_1058 | postgres | SQL_ASCII | C | C |
(5 rows)

postgres=# select * from testtbl;

serial

111
222
  3
444

(4 rows)

postgres=#

1
2
3
4

接著來把本來是master的機器關掉,看看會有什麼變化

首先,在pg-slave 這台機器上,就會看到cluster 有狀況,primary 不見了

03:06:02 INFO Calling node_active for node default/4/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
03:06:08 INFO Calling node_active for node default/4/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
03:06:13 INFO Calling node_active for node default/4/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
03:06:18 ERROR PostgreSQL cannot reach the primary server: the system view pg_stat_wal_receiver has no rows.
03:06:18 INFO Calling node_active for node default/4/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is -1.
03:06:23 ERROR PostgreSQL cannot reach the primary server: the system view pg_stat_wal_receiver has no rows.
03:06:23 INFO Calling node_active for node default/4/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is -1.
03:06:28 ERROR PostgreSQL cannot reach the primary server: the system view pg_stat_wal_receiver has no rows.
03:06:28 INFO Calling node_active for node default/4/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is -1.
03:06:33 ERROR PostgreSQL cannot reach the primary server: the system view pg_stat_wal_receiver has no rows.
03:06:33 INFO Calling node_active for node default/4/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is -1.
03:06:38 ERROR PostgreSQL cannot reach the primary server: the system view pg_stat_wal_receiver has no rows.
03:06:38 INFO Calling node_active for node default/4/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is -1.

1
2

這時候在pg-slave 還能不能透過URI 連線進去psql操作?

2019-09-20 03:08:06 [administrator@pg-slave ~]$ sudo runuser -l postgres -c “psql postgres://pg-slave:5432,pg-primary:5432/postgres?target_session_attrs=read-write”
psql (11.5 (Ubuntu 11.5-1.pgdg18.04+1))
Type “help” for help.

postgres=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
————-+———-+———–+———+——-+———————–
nexus | postgres | SQL_ASCII | C | C |
postgres | postgres | SQL_ASCII | C | C |
template0 | postgres | SQL_ASCII | C | C | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | SQL_ASCII | C | C | =c/postgres +
| | | | | postgres=CTc/postgres
testdb_1058 | postgres | SQL_ASCII | C | C |
(5 rows)

postgres=# create database testdb_1108;
CREATE DATABASE
postgres=# create table testtbl_1108 (serial int);
CREATE TABLE
postgres=# insert into testtbl_1108 values (999);
INSERT 0 1
postgres=# insert into testtbl_1108 values (888);
INSERT 0 1
postgres=# insert into testtbl_1108 values (77);
INSERT 0 1
postgres=# insert into testtbl_1108 values (66);
INSERT 0 1
postgres=# insert into testtbl_1108 values (55);
INSERT 0 1
postgres=# \q
2019-09-20 03:09:34 [administrator@pg-slave ~]$

1
2

很好,看起來沒有問題,這時候把 pg-primary 打開,執行啟動 pg_auto_failover node 的指令,看看會發生什麼事

2019-09-20 11:12:13 [administrator@pg-primary ~]$ sudo runuser -l postgres -c “pg_autoctl run”
11:12:29 INFO Managing PostgreSQL installation at “/var/lib/postgresql/11/main”
11:12:29 INFO The version of extenstion “pgautofailover” is “1.0” on the monitor
11:12:29 INFO pg_autoctl service is starting
11:12:29 INFO Calling node_active for node default/1/0 with current state: primary, PostgreSQL is not running, sync_state is “”, WAL delta is -1.
11:12:29 INFO Postgres is not running, starting postgres
11:12:29 INFO /usr/lib/postgresql/11/bin/pg_ctl –pgdata /var/lib/postgresql/11/main –options “-p 5432” –options “-h *” –wait start
11:12:30 WARN PostgreSQL was not running, restarted with pid 1407
11:12:30 INFO FSM transition from “primary” to “demoted”: A failover occurred, no longer primary
11:12:30 INFO Transition complete: current state is now “demoted”
11:12:30 INFO Calling node_active for node default/1/0 with current state: demoted, PostgreSQL is not running, sync_state is “”, WAL delta is -1.
11:12:30 INFO FSM transition from “demoted” to “catchingup”: A new primary is available. First, try to rewind. If that fails, do a pg_basebackup.
11:12:30 INFO The primary node returned by the monitor is pg-slave:5432
11:12:30 INFO Rewinding PostgreSQL to follow new primary pg-slave:5432
11:12:30 ERROR Connection to database failed: could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket “/var/run/postgresql/.s.PGSQL.5432”?

11:12:30 ERROR Failed to get the postgresql.conf path from the local postgres server, see above for details
11:12:30 WARN Failed to rewind demoted primary to standby, trying pg_basebackup instead
11:12:30 INFO Initialising PostgreSQL as a hot standby
11:12:30 INFO Target directory exists: “/var/lib/postgresql/11/main”, stopping PostgreSQL
11:12:30 INFO pg_ctl: no server running

11:12:30 INFO pg_ctl stop failed, but PostgreSQL is not running anyway
11:12:30 INFO Running /usr/lib/postgresql/11/bin/pg_basebackup -w -h pg-slave -p 5432 –pgdata /var/lib/postgresql/11/backup -U pgautofailover_replicator –write-recovery-conf –max-rate 100M –wal-method=stream –slot pgautofailover_standby …
11:12:33 INFO pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/12000028 on timeline 6
pg_basebackup: starting background WAL receiver
0/46937 kB (0%), 0/1 tablespace (…ostgresql/11/backup/backup_label)
46947/46947 kB (100%), 0/1 tablespace (…esql/11/backup/global/pg_control)
46947/46947 kB (100%), 1/1 tablespace
pg_basebackup: write-ahead log end point: 0/120000F8
pg_basebackup: waiting for background process to finish streaming …
pg_basebackup: base backup completed

11:12:33 INFO Postgres is not running, starting postgres
11:12:33 INFO /usr/lib/postgresql/11/bin/pg_ctl –pgdata /var/lib/postgresql/11/main –options “-p 5432” –options “-h *” –wait start
11:12:33 INFO PostgreSQL started on port 5432
11:12:33 INFO Drop replication slot “pgautofailover_standby”
11:12:33 INFO Transition complete: current state is now “catchingup”
11:12:33 INFO Calling node_active for node default/1/0 with current state: catchingup, PostgreSQL is running, sync_state is “”, WAL delta is -1.
11:12:38 INFO Calling node_active for node default/1/0 with current state: catchingup, PostgreSQL is running, sync_state is “”, WAL delta is -1.
11:12:43 INFO Calling node_active for node default/1/0 with current state: catchingup, PostgreSQL is running, sync_state is “”, WAL delta is -1.
11:12:48 INFO Calling node_active for node default/1/0 with current state: catchingup, PostgreSQL is running, sync_state is “”, WAL delta is -1.
11:12:48 INFO FSM transition from “catchingup” to “secondary”: Convinced the monitor that I’m up and running, and eligible for promotion again
11:12:48 INFO Transition complete: current state is now “secondary”
11:12:48 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
11:12:53 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
11:12:58 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
11:13:03 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.

1
2
3
4
5

可以看到 pg_auto_failover 偵測到了cluster node 有意異動,開始同步資料,然後切換角色,從primary變成了 secondary


同一時間,在pg-slave 的訊息也有變動

03:12:34 INFO Calling node_active for node default/4/0 with current state: wait_primary, PostgreSQL is running, sync_state is “async”, WAL delta is 0.
03:12:39 INFO Calling node_active for node default/4/0 with current state: wait_primary, PostgreSQL is running, sync_state is “async”, WAL delta is 0.
03:12:45 INFO Calling node_active for node default/4/0 with current state: wait_primary, PostgreSQL is running, sync_state is “async”, WAL delta is 0.
03:12:50 INFO Calling node_active for node default/4/0 with current state: wait_primary, PostgreSQL is running, sync_state is “async”, WAL delta is 0.
03:12:50 INFO FSM transition from “wait_primary” to “primary”: A healthy secondary appeared
03:12:50 INFO Enabling synchronous replication
03:12:50 INFO Transition complete: current state is now “primary”
03:12:50 INFO Calling node_active for node default/4/0 with current state: primary, PostgreSQL is running, sync_state is “sync”, WAL delta is 0.
03:12:55 INFO Calling node_active for node default/4/0 with current state: primary, PostgreSQL is running, sync_state is “sync”, WAL delta is 0.

1
2
3
4

pg-slave 的狀態從原本的secondary 變成了 primary

來檢查看看

2019-09-20 03:19:04 [administrator@pg-slave ~]$ sudo runuser -l postgres -c “pg_autoctl show state”
Name | Port | Group | Node | Current State | Assigned State
———–+——–+——-+——-+——————-+——————
pg-primary | 5432 | 0 | 1 | secondary | secondary
pg-slave | 5432 | 0 | 4 | primary | primary

2019-09-20 03:19:07 [administrator@pg-slave ~]$

1
2
3
4
5
6

跟上面第一次執行的結果不同了,兩台角色互換了,這也是預期中的結果

那在pg-master 離線期間的異動資料呢?在 pg-master上查得到嗎?

先透過 URI存取DB 檢查看看

2019-09-20 11:12:16 [administrator@pg-primary ~]$ sudo runuser -l postgres -c “psql postgres://pg-slave:5432,pg-primary:5432/postgres?target_session_attrs=read-write”
psql (11.5 (Ubuntu 11.5-1.pgdg18.04+1))
Type “help” for help.

postgres=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
————-+———-+———–+———+——-+———————–
nexus | postgres | SQL_ASCII | C | C |
postgres | postgres | SQL_ASCII | C | C |
template0 | postgres | SQL_ASCII | C | C | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | SQL_ASCII | C | C | =c/postgres +
| | | | | postgres=CTc/postgres
testdb_1058 | postgres | SQL_ASCII | C | C |
testdb_1108 | postgres | SQL_ASCII | C | C |
(6 rows)

postgres=# select * from testtbl_1108;

serial

999
888
 77
 66
 55

(5 rows)

postgres=#

1
2
3
4

GOOD!看來資料也同步過來了

再來看看單機的狀態

2019-09-20 11:22:07 [administrator@pg-primary ~]$ sudo runuser -l postgres -c “psql”
psql (11.5 (Ubuntu 11.5-1.pgdg18.04+1))
Type “help” for help.

postgres=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
————-+———-+———–+———+——-+———————–
nexus | postgres | SQL_ASCII | C | C |
postgres | postgres | SQL_ASCII | C | C |
template0 | postgres | SQL_ASCII | C | C | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | SQL_ASCII | C | C | =c/postgres +
| | | | | postgres=CTc/postgres
testdb_1058 | postgres | SQL_ASCII | C | C |
testdb_1108 | postgres | SQL_ASCII | C | C |
(6 rows)

postgres=# select * from testtbl_1108;

serial

999
888
 77
 66
 55

(5 rows)

postgres=#

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22

很好,確認資料有正確的抄寫過來!

***

以上測試可以發現 pg_auto_failover 這個postgresql的extension 還真的很好用!

設定簡單、快速、方便,設定完成後也不用傷腦筋,反正有異常,會自動幫你搞定master-slave之間的主從關係

資料也都會自動同步好,真的是非常推薦啊!

UPDATE

#### 將 pg_auto_failover 用 systemd 管理

每次都要手動執行 pg_autoctl run 太麻煩了,應該用systemd 或者是 supervisor 來管理

而pg_auto_failover 有指令可以做成 systemd 的service 檔案

既然人家都做好了,當然就直接用 systemd 來做

執行以下指令,產生 systemd configuration

sudo runuser -l postgres -c “pg_autoctl -q show systemd” | sudo tee /etc/systemd/system/pgautofailover.service

1
2
3
4
5
6
7
在/etc/systemd/system底下會多出一個 pgautofailover.service檔案

內容就是 pg_auto_failover 所提供的 systemd config

接下來就可以直接用 sudo service pgautofailover start 來啟動

沒有什麼問題,不過呢,這樣的作法,會在 /var/log/syslog 塞滿了 pg_auto_failover 的紀錄

Sep 20 14:59:34 pg-primary systemd[1]: Started pg_auto_failover.
Sep 20 14:59:34 pg-primary pg_autoctl[7817]: 14:59:34 INFO Managing PostgreSQL installation at “/var/lib/postgresql/11/main”
Sep 20 14:59:34 pg-primary pg_autoctl[7817]: 14:59:34 INFO Found a stale pidfile at “/tmp/pg_autoctl/var/lib/postgresql/11/main/pg_autoctl.pid”
Sep 20 14:59:34 pg-primary pg_autoctl[7817]: 14:59:34 WARN Removing the stale pid file “/tmp/pg_autoctl/var/lib/postgresql/11/main/pg_autoctl.pid”
Sep 20 14:59:34 pg-primary pg_autoctl[7817]: 14:59:34 INFO The version of extenstion “pgautofailover” is “1.0” on the monitor
Sep 20 14:59:34 pg-primary pg_autoctl[7817]: 14:59:34 INFO pg_autoctl service is starting
Sep 20 14:59:34 pg-primary pg_autoctl[7817]: 14:59:34 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 14:59:39 pg-primary pg_autoctl[7817]: 14:59:39 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 14:59:44 pg-primary pg_autoctl[7817]: 14:59:44 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 14:59:49 pg-primary pg_autoctl[7817]: 14:59:49 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 14:59:54 pg-primary pg_autoctl[7817]: 14:59:54 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 14:59:59 pg-primary pg_autoctl[7817]: 14:59:59 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:00:04 pg-primary pg_autoctl[7817]: 15:00:04 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:00:09 pg-primary pg_autoctl[7817]: 15:00:09 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:00:15 pg-primary pg_autoctl[7817]: 15:00:14 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:00:20 pg-primary pg_autoctl[7817]: 15:00:20 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:00:25 pg-primary pg_autoctl[7817]: 15:00:25 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:00:30 pg-primary pg_autoctl[7817]: 15:00:30 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:00:35 pg-primary pg_autoctl[7817]: 15:00:35 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:00:40 pg-primary pg_autoctl[7817]: 15:00:40 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:00:45 pg-primary pg_autoctl[7817]: 15:00:45 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:00:50 pg-primary pg_autoctl[7817]: 15:00:50 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:00:55 pg-primary pg_autoctl[7817]: 15:00:55 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:01:00 pg-primary pg_autoctl[7817]: 15:01:00 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:01:05 pg-primary pg_autoctl[7817]: 15:01:05 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:01:10 pg-primary pg_autoctl[7817]: 15:01:10 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:01:15 pg-primary pg_autoctl[7817]: 15:01:15 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:01:20 pg-primary pg_autoctl[7817]: 15:01:20 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:01:25 pg-primary pg_autoctl[7817]: 15:01:25 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:01:30 pg-primary pg_autoctl[7817]: 15:01:30 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:01:35 pg-primary pg_autoctl[7817]: 15:01:35 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:01:40 pg-primary pg_autoctl[7817]: 15:01:40 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:01:45 pg-primary pg_autoctl[7817]: 15:01:45 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:01:50 pg-primary pg_autoctl[7817]: 15:01:50 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:01:55 pg-primary pg_autoctl[7817]: 15:01:55 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:02:00 pg-primary pg_autoctl[7817]: 15:02:00 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:02:05 pg-primary pg_autoctl[7817]: 15:02:05 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:02:10 pg-primary pg_autoctl[7817]: 15:02:10 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:02:15 pg-primary pg_autoctl[7817]: 15:02:15 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:02:20 pg-primary pg_autoctl[7817]: 15:02:20 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:02:25 pg-primary pg_autoctl[7817]: 15:02:25 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:02:30 pg-primary pg_autoctl[7817]: 15:02:30 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:02:35 pg-primary pg_autoctl[7817]: 15:02:35 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:02:40 pg-primary pg_autoctl[7817]: 15:02:40 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:02:46 pg-primary pg_autoctl[7817]: 15:02:46 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:02:51 pg-primary pg_autoctl[7817]: 15:02:51 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:02:56 pg-primary pg_autoctl[7817]: 15:02:56 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:03:01 pg-primary pg_autoctl[7817]: 15:03:01 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:03:06 pg-primary pg_autoctl[7817]: 15:03:06 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:03:11 pg-primary pg_autoctl[7817]: 15:03:11 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:03:16 pg-primary pg_autoctl[7817]: 15:03:16 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:03:21 pg-primary pg_autoctl[7817]: 15:03:21 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:03:26 pg-primary pg_autoctl[7817]: 15:03:26 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:03:31 pg-primary pg_autoctl[7817]: 15:03:31 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:03:36 pg-primary pg_autoctl[7817]: 15:03:36 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:03:41 pg-primary pg_autoctl[7817]: 15:03:41 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:03:46 pg-primary pg_autoctl[7817]: 15:03:46 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:03:51 pg-primary pg_autoctl[7817]: 15:03:51 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:03:57 pg-primary pg_autoctl[7817]: 15:03:57 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:04:02 pg-primary pg_autoctl[7817]: 15:04:02 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:04:07 pg-primary pg_autoctl[7817]: 15:04:07 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:04:12 pg-primary pg_autoctl[7817]: 15:04:12 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:04:17 pg-primary pg_autoctl[7817]: 15:04:17 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:04:22 pg-primary pg_autoctl[7817]: 15:04:22 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:04:27 pg-primary pg_autoctl[7817]: 15:04:27 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:04:32 pg-primary pg_autoctl[7817]: 15:04:32 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:04:37 pg-primary pg_autoctl[7817]: 15:04:37 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:04:42 pg-primary pg_autoctl[7817]: 15:04:42 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:04:47 pg-primary pg_autoctl[7817]: 15:04:47 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
Sep 20 15:04:52 pg-primary pg_autoctl[7817]: 15:04:52 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.

1
2
3
4
5
6

這樣下去不用多久,syslog 一定爆,而且嚴重干擾到其他的系統訊息

所以要修改一下 config ,把log 寫到另外的檔案去

在/etc/systemd/system/pgautofailover.service 加入底下兩行

StandardOutput=file:/var/log/pgautofailover.log
StandardError=file:/var/log/pgautofailover-error.log

1
2
3
4
5
6
7
8
9
10
11
然後 sudo systemctl daemon-reload 接著重起 sudo service pgautofailover restart

就會把訊息都寫到 /var/log/pgautofailover.log , /var/log/pgautofailover-error.log

再用 logrotate 來管理就可以了

這個動作需要在三台node都執行

不過我在想應該是可以不要產生那麼多log ,只要紀錄 critical 就好了?來試試看加入 LogLevelMax=3 的效果

在 /etc/systemd/system/pgautofailover.service 檔案中的service 區段 加入

LogLevelMax=3

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
loglevel 的等級定義看這邊

https://www.ctrl.blog/entry/systemd-log-levels.html


* emergency (0)
* alert (1)
* critical (2)
* error (3)
* warning (4)
* notice (5)
* info (6)
* debug (7)

完成後,一樣執行

sudo systemctl daemon-reload
sudo service pgautofailover restart

1
看一下檔案內容,怎麼就沒有訊息了!

2019-09-20 15:23:15 [administrator@pg-primary ~]$ sudo tail -30 /var/log/pgautofailover.log
15:14:22 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
15:14:27 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
15:14:32 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
15:14:37 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
15:14:42 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
15:14:47 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
15:14:52 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
15:14:57 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
15:15:02 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
15:15:07 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
15:15:12 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
15:15:17 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
15:15:22 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
15:15:27 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
15:15:32 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
15:15:37 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
15:15:42 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
15:15:47 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
15:15:52 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
15:15:58 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
15:16:03 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
15:16:08 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
15:16:13 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
15:16:18 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
15:16:23 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
15:16:28 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
15:16:34 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
15:16:39 INFO Calling node_active for node default/1/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 0.
15:16:39 WARN Smart shutdown: received signal Terminated
15:16:39 INFO pg_autoctl service stopping
2019-09-20 15:23:29 [administrator@pg-primary ~]$

1
2

可是我的服務活著啊!

2019-09-20 15:23:29 [administrator@pg-primary ~]$ sudo service pgautofailover status
● pgautofailover.service - pg_auto_failover
Loaded: loaded (/etc/systemd/system/pgautofailover.service; disabled; vendor preset: enabled)
Active: active (running) since Fri 2019-09-20 15:16:39 CST; 7min ago
Main PID: 8611 (pg_autoctl)
Tasks: 1 (limit: 2321)
CGroup: /system.slice/pgautofailover.service
└─8611 /usr/bin/pg_autoctl run

Sep 20 15:16:39 pg-primary systemd[1]: Started pg_auto_failover.
2019-09-20 15:23:56 [administrator@pg-primary ~]$

1
2

拿另一台來做實驗,先不要加入 loglevelmax=3 ,看看log到底長怎樣!

2019-09-20 07:25:34 [administrator@pg-slave ~]$ sudo service pgautofailover stop
2019-09-20 07:25:58 [administrator@pg-slave ~]$ sudo tail -10 /var/log/syslog
Sep 20 07:25:32 pg-slave pg_autoctl[5840]: 07:25:32 INFO Calling node_active for node default/4/0 with current state: primary, PostgreSQL is running, sync_state is “sync”, WAL delta is 0.
Sep 20 07:25:37 pg-slave pg_autoctl[5840]: 07:25:37 INFO Calling node_active for node default/4/0 with current state: primary, PostgreSQL is running, sync_state is “sync”, WAL delta is 0.
Sep 20 07:25:42 pg-slave pg_autoctl[5840]: 07:25:42 INFO Calling node_active for node default/4/0 with current state: primary, PostgreSQL is running, sync_state is “sync”, WAL delta is 0.
Sep 20 07:25:47 pg-slave pg_autoctl[5840]: 07:25:47 INFO Calling node_active for node default/4/0 with current state: primary, PostgreSQL is running, sync_state is “sync”, WAL delta is 0.
Sep 20 07:25:52 pg-slave pg_autoctl[5840]: 07:25:52 INFO Calling node_active for node default/4/0 with current state: primary, PostgreSQL is running, sync_state is “sync”, WAL delta is 0.
Sep 20 07:25:57 pg-slave pg_autoctl[5840]: 07:25:57 INFO Calling node_active for node default/4/0 with current state: primary, PostgreSQL is running, sync_state is “sync”, WAL delta is 0.
Sep 20 07:25:58 pg-slave systemd[1]: Stopping pg_auto_failover…
Sep 20 07:25:58 pg-slave pg_autoctl[5840]: 07:25:58 WARN Smart shutdown: received signal Terminated
Sep 20 07:25:58 pg-slave pg_autoctl[5840]: 07:25:58 INFO pg_autoctl service stopping
Sep 20 07:25:58 pg-slave systemd[1]: Stopped pg_auto_failover.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16

好吧,看不出有什麼差別! 直接把primary 關機測試! 結果還是看不出來 0rz

放棄加入loglevel 的想法,就用logrotate 來管理吧!

UPDATE

用systemd 來管理的作法失敗了,重起之後, pgautofailover 不會自動啟動...

找了很久找不出原因,改用 supervisor來做好了

***

#### 設定 supervisor

**安裝supervisor**

sudo apt install supervisor
sudo vim /etc/supervisor/conf.d/pgautofailover.conf
加入以下內容
[program:pgautofailover]
command = pg_autoctl run
user = postgres
autostart=true
autorestart=true
redirect_stderr = true
environment=PGDATA=/var/lib/postgresql/11/main,HOME=/var/lib/postgresql

1
重新啟動 supervisor & 檢查狀態

2019-09-20 17:26:22 [administrator@pg-primary postgresql]$ sudo service supervisor restart
2019-09-20 17:26:57 [administrator@pg-primary postgresql]$ sudo supervisorctl status
pgautofailover RUNNING pid 14554, uptime 0:00:05
2019-09-20 17:27:03 [administrator@pg-primary postgresql]$

1
檢查 supervisor log

2019-09-20 17:27:03 [administrator@pg-primary postgresql]$ sudo tail -30 /var/log/supervisor/pgautofailover-stdout—supervisor-5V8qET.log
17:26:58 INFO Managing PostgreSQL installation at “/var/lib/postgresql/11/main”
17:26:58 INFO Found a stale pidfile at “/tmp/pg_autoctl/var/lib/postgresql/11/main/pg_autoctl.pid”
17:26:58 WARN Removing the stale pid file “/tmp/pg_autoctl/var/lib/postgresql/11/main/pg_autoctl.pid”
17:26:58 INFO The version of extenstion “pgautofailover” is “1.0” on the monitor
17:26:58 INFO pg_autoctl service is starting
17:26:58 INFO Calling node_active for node default/1/0 with current state: primary, PostgreSQL is running, sync_state is “sync”, WAL delta is 0.
17:27:03 INFO Calling node_active for node default/1/0 with current state: primary, PostgreSQL is running, sync_state is “sync”, WAL delta is 0.
17:27:08 INFO Calling node_active for node default/1/0 with current state: primary, PostgreSQL is running, sync_state is “sync”, WAL delta is 0.
17:27:13 INFO Calling node_active for node default/1/0 with current state: primary, PostgreSQL is running, sync_state is “sync”, WAL delta is 0.
17:27:18 INFO Calling node_active for node default/1/0 with current state: primary, PostgreSQL is running, sync_state is “sync”, WAL delta is 0.
2019-09-20 17:27:21 [administrator@pg-primary postgresql]$

1
2
3
4
5
6

簡單多了 ... 在剩下的monitor/slave 也一樣操作安裝、設定supervisor 就可以了

喔,順便把剛剛新增的 systemd config 給砍了..

不過不砍也沒差,反正開機也不會自己啟動 = =+

2019-09-20 17:24:17 [administrator@pg-slave ~]$ sudo rm -rf /etc/systemd/system/pgautofailover.service

#安裝supervisor
2019-09-20 17:23:08 [administrator@pg-slave ~]$ sudo apt install supervisor
Reading package lists… Done
Building dependency tree
Reading state information… Done
The following additional packages will be installed:
python-meld3 python-pkg-resources
Suggested packages:
python-setuptools supervisor-doc
The following NEW packages will be installed:
python-meld3 python-pkg-resources supervisor
0 upgraded, 3 newly installed, 0 to remove and 167 not upgraded.
Need to get 415 kB of archives.
After this operation, 2,138 kB of additional disk space will be used.
Do you want to continue? [Y/n] y
Get:1 http://archive.ubuntu.com/ubuntu bionic/main amd64 python-pkg-resources all 39.0.1-2 [128 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 python-meld3 all 1.0.2-2 [30.9 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic/universe amd64 supervisor all 3.3.1-1.1 [256 kB]
Fetched 415 kB in 0s (6,752 kB/s)
Selecting previously unselected package python-pkg-resources.
(Reading database … 103586 files and directories currently installed.)
Preparing to unpack …/python-pkg-resources_39.0.1-2_all.deb …
Unpacking python-pkg-resources (39.0.1-2) …
Selecting previously unselected package python-meld3.
Preparing to unpack …/python-meld3_1.0.2-2_all.deb …
Unpacking python-meld3 (1.0.2-2) …
Selecting previously unselected package supervisor.
Preparing to unpack …/supervisor_3.3.1-1.1_all.deb …
Unpacking supervisor (3.3.1-1.1) …
Processing triggers for ureadahead (0.100.0-20) …
Setting up python-meld3 (1.0.2-2) …
Setting up python-pkg-resources (39.0.1-2) …
Setting up supervisor (3.3.1-1.1) …
Created symlink /etc/systemd/system/multi-user.target.wants/supervisor.service → /lib/systemd/system/supervisor.service.
Processing triggers for systemd (237-3ubuntu10.29) …
Processing triggers for man-db (2.8.3-2) …
Processing triggers for ureadahead (0.100.0-20) …
2019-09-20 17:23:37 [administrator@pg-slave ~]$ sudo vim /etc/supervisor/conf.d/pgautofailover.conf

#貼上上面的 conf 內容
2019-09-20 17:30:12 [administrator@pg-slave ~]$ sudo service supervisor restart
2019-09-20 17:30:58 [administrator@pg-slave ~]$ sudo supervisorctl status
pgautofailover RUNNING pid 2232, uptime 0:00:04
2019-09-20 17:31:03 [administrator@pg-slave ~]$ sudo tail -30 /var/log/supervisor/pgautofailover-stdout—supervisor-LC_FG8.log
17:30:59 INFO Managing PostgreSQL installation at “/var/lib/postgresql/11/main”
17:30:59 INFO Found a stale pidfile at “/tmp/pg_autoctl/var/lib/postgresql/11/main/pg_autoctl.pid”
17:30:59 WARN Removing the stale pid file “/tmp/pg_autoctl/var/lib/postgresql/11/main/pg_autoctl.pid”
17:30:59 INFO The version of extenstion “pgautofailover” is “1.0” on the monitor
17:30:59 INFO pg_autoctl service is starting
17:30:59 INFO Calling node_active for node default/4/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 320.
17:31:04 INFO Calling node_active for node default/4/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 320.
17:31:09 INFO Calling node_active for node default/4/0 with current state: secondary, PostgreSQL is running, sync_state is “”, WAL delta is 320.
2019-09-20 17:31:13 [administrator@pg-slave ~]$

1
2
3
4
5
6
7
8

搞定,收工!準備寫 ansible playbook!


***
UPDATE

新增用logrotate 來管理 pgautofailover logfile

sudo vim /etc/logrotate.d/pgautofailover

/var/log/supervisor/pgautofail*.log {
daily
rotate 7
copytruncate
delaycompress # today and yesterday will not compress
compress
missingok
notifempty
}

1
2
3
UPDATE

node 消失,需要手動在monitor上執行以下指令來移除node ,才能夠再次加入新的node

2019-09-24 01:11:09 [administrator@monitor ~]$ sudo runuser -l postgres -c “psql postgres://autoctl_node@monitor:5432/pg_auto_failover”
psql (11.5 (Ubuntu 11.5-1.pgdg18.04+1))
Type “help” for help.

pg_auto_failover=> select pgautofailover.remove_node(‘pg-third’)
pg_auto_failover-> ;

remove_node

t
(1 row)

pg_auto_failover=>
`