Background
See also:
...
https://docs.linbit.com/doc/users-guide-84/s-resolve-split-brain/
Symptoms
Code Block | ||
---|---|---|
|
...
stop cluster, on broken node first!
...
| |
cat /proc/drbd
-->
GIT-hash: a4d5de01fffd7e4cde48a080e2c686f9e8cebf4c build by mockbuild@, 2017-09-15 14:23:22
1: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown r-----
ns:0 nr:119823323 dw:119823323 dr:2128 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0 |
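cs:StandAlone means the node has dropped its replication connection and is not trying to reconnect.
Check this on both nodes. After a split brain, at least one node shows cs:StandAlone; the peer may show cs:WFConnection instead if it did not detect the split brain itself.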
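The check described above can be scripted. The following is a minimal sketch, assuming DRBD 8.4 (which exposes /proc/drbd) and status lines shaped like the example output; the function names `drbd_conn_states` and `drbd_has_standalone` are hypothetical helpers, not part of DRBD.

```shell
#!/bin/sh
# Print "<minor> <connection state>" for every DRBD device found in
# /proc/drbd-style text read from stdin.
drbd_conn_states() {
    sed -n 's/^ *\([0-9][0-9]*\): cs:\([A-Za-z]*\).*/\1 \2/p'
}

# Succeed (exit 0) if at least one device in the text on stdin is StandAlone.
drbd_has_standalone() {
    drbd_conn_states | grep -q ' StandAlone$'
}

# Usage on a live node:
#   drbd_conn_states < /proc/drbd
#   drbd_has_standalone < /proc/drbd && echo "possible split brain"
```

Run it on both nodes and compare the reported states before deciding which node to treat as broken.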
Find out which node is active in the PCS cluster
Code Block | ||
---|---|---|
| ||
pcs status
-->
Cluster name: portal
Stack: corosync
Current DC: acd-store1 (version 1.1.16-12.el7_4.7-94ff4df) - partition with quorum
Last updated: Sun Mar 18 18:05:32 2018
Last change: Fri Feb 16 00:07:51 2018 by root via cibadmin on acd-store2
2 nodes configured
3 resources configured
Node acd-store1: standby
Online: [ acd-store2 ]
Full list of resources:
Resource Group: haproxy_group
ClusterDataJTELSharedMount (ocf::heartbeat:Filesystem): Started acd-store2
ClusterIP (ocf::heartbeat:IPaddr2): Started acd-store2
samba (systemd:smb): Started acd-store2
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled |
In the example above, acd-store1 is in standby. The most important thing to check is which server the resources are started on.
In this case, the resources are started on acd-store2.
acd-store2 is therefore treated as the healthy (NON BROKEN) node, and the other node as the broken node.
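To read the active node out of the status text without scanning it by eye, a small filter can help. This is a sketch only: it assumes the `pcs status` resource lines look like the example above ("(agent): Started node"), which may differ in other pcs versions, and `pcs_active_nodes` is a hypothetical helper name.

```shell
#!/bin/sh
# Print each distinct node name that follows "Started" on a resource line
# in pcs status text read from stdin.
pcs_active_nodes() {
    sed -n 's/.*):[[:space:]]*Started[[:space:]]*\([^[:space:]]*\).*/\1/p' | sort -u
}

# Usage on a cluster node:
#   sudo pcs status | pcs_active_nodes
```

If the filter prints more than one node name, resources are spread across nodes and the healthy node has to be chosen manually.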
Standby the broken node in the PCS cluster
This command can be run on either machine.
Code Block | ||
---|---|---|
| ||
sudo pcs cluster standby |
...
acd- |
...
lb-broken
--> Verify this with
pcs status |
On the broken node
Code Block | ||
---|---|---|
| ||
sudo drbdadm disconnect jtelshared |
...
sudo drbdadm secondary jtelshared |
...
sudo drbdadm connect --discard-my-data jtelshared |
...
On the healthy node
Code Block | ||
---|---|---|
| ||
sudo drbdadm primary jtelshared |
...
sudo drbdadm connect jtelshared |
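The order of the commands on the two nodes matters, so it can help to wrap them in functions that print what they would do before anything is executed. The sketch below assumes the resource name jtelshared from this page; `run`, `recover_broken_node`, `recover_healthy_node`, and the `DRY_RUN` and `RES` variables are hypothetical names introduced for illustration. It does not stop on errors, so check each command's result when running for real.

```shell
#!/bin/sh
# DRY_RUN defaults to 1: commands are printed instead of executed.
DRY_RUN=${DRY_RUN:-1}
# DRBD resource name (assumption: the resource used on this page).
RES=${RES:-jtelshared}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Run on the BROKEN node first: its local changes are discarded.
recover_broken_node() {
    run sudo drbdadm disconnect "$RES"
    run sudo drbdadm secondary "$RES"
    run sudo drbdadm connect --discard-my-data "$RES"
}

# Run on the healthy node afterwards.
recover_healthy_node() {
    run sudo drbdadm primary "$RES"
    run sudo drbdadm connect "$RES"
}
```

Calling `recover_broken_node` with the default settings only prints the three commands; set `DRY_RUN=0` once the node roles have been double-checked.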
...
start cluster
...
Check re-sync activity
The re-sync can take a long time; the example output below estimates more than 26 hours.
Monitor its progress with:
cat /proc/drbd
Example output:
Code Block | ||
---|---|---|
| ||
[root@storage01 ~]# cat /proc/drbd
version: 8.4.10-1 (api:1/proto:86-101)
GIT-hash: a4d5de01fffd7e4cde48a080e2c686f9e8cebf4c build by mockbuild@, 2017-09-15 14:23:22
1: cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate C r-----
ns:0 nr:1411538 dw:121234862 dr:2128 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:184698664
[>....................] sync'ed: 0.8% (180368/181744)M
finish: 26:12:15 speed: 1,940 (2,760) want: 2,120 K/sec |
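For scripted monitoring, the percentage can be pulled out of the progress line. This is a minimal sketch that assumes the "sync'ed: N%" format shown in the example output; `drbd_sync_percent` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the re-sync percentage found in /proc/drbd-style text on stdin.
# Prints nothing if no re-sync is in progress.
drbd_sync_percent() {
    sed -n "s/.*sync'ed:[[:space:]]*\([0-9.]*\)%.*/\1/p"
}

# Usage on the node being synced:
#   drbd_sync_percent < /proc/drbd
#   watch -n 60 'cat /proc/drbd'
```

Empty output means no device reports a running re-sync, which is expected once the sync has finished.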
Restart PCS node
Code Block | ||
---|---|---|
| ||
sudo pcs cluster unstandby |
...
acd- |
...
lb-broken
--> Verify this with
pcs status |
Check everything
Code Block | ||
---|---|---|
|
...
sudo pcs resource show |
...
check DRBD
sudo cat /proc/drbd |
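The final DRBD check can be reduced to a pass/fail test. The sketch below treats a device as healthy when its status line shows cs:Connected and ds:UpToDate/UpToDate, which is the normal state of a fully synced DRBD 8.4 resource; `drbd_is_healthy` is a hypothetical helper name, and the line format is assumed to match the examples on this page.

```shell
#!/bin/sh
# Exit 0 only if the /proc/drbd-style text on stdin contains at least one
# device and every device is Connected with both disks UpToDate.
drbd_is_healthy() {
    awk '
        /^ *[0-9]+: cs:/ {
            devices++
            if ($0 !~ /cs:Connected / || $0 !~ /ds:UpToDate\/UpToDate/) bad++
        }
        END { exit ((devices == 0 || bad > 0) ? 1 : 0) }
    '
}

# Usage:
#   drbd_is_healthy < /proc/drbd && echo "DRBD OK" || echo "DRBD needs attention"
```

A device that is still in SyncTarget or StandAlone state makes the check fail, so it is only expected to pass after the re-sync has completed.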