CentOS 8.x

| Code Block |
|---|
| title | Unstandby broken node |
|---|
|
| sudo pcs resource show
sudo cat /proc/drbd
pcs cluster start acd-lb-broken
pcs node unstandby acd-lb-broken
--> Verify this with
pcs status |
|
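Beyond a plain `pcs status`, the node list itself can be checked with `pcs status nodes`. A minimal sketch of that check, assuming the node name `acd-lb-broken` used in this runbook; the sample output embedded below is illustrative only, not captured from a real cluster:

```shell
# Sketch: confirm the node left standby after "pcs node unstandby".
# On a live cluster, feed this from: sudo pcs status nodes
# The sample output below is illustrative only.
nodes_output=" Online: acd-lb-master acd-lb-broken
 Standby:
 Offline:"
if echo "$nodes_output" | grep "^ Standby:" | grep -q "acd-lb-broken"; then
    echo "acd-lb-broken is still in standby"
else
    echo "acd-lb-broken is active again"
fi
```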
Untune the transfer (Second Node) - CentOS 7.x only

If the transfer was tuned, then untune it (on the broken node). Note: it won't hurt to run this command anyway.

| Translations Ignore |
|---|
| Code Block |
|---|
| title | drbd - Untune Transfer |
|---|
| drbdadm adjust jtelshared |
|
Check everything

| Translations Ignore |
|---|
| Code Block |
|---|
| pcs status
# CentOS 7.x
cat /proc/drbd
# CentOS 8.x
drbdadm status
# On other Linux machines
ls /home/jtel/shared
# Windows
dir \\acd-store\shared |
|
File System Corrupt

Sometimes, when DRBD fails, the file system will also become corrupt. In this case both nodes might be primary, but neither will have the share mounted, and the command mount /srv/jtel/shared will fail. It may then be necessary to repair the file system.

Symptoms

| Translations Ignore |
|---|
| Code Block |
|---|
[17354513.483526] XFS (drbd1): log mount/recovery failed: error -22
[17354513.483569] XFS (drbd1): log mount failed
[17355040.104433] XFS (drbd1): Mounting V5 Filesystem
[17355040.122234] XFS (drbd1): Corruption warning: Metadata has LSN (56:112832) ahead of current LSN (56:112733). Please unmount and run xfs_repair (>= v4.3) to resolve.
[17355040.122239] XFS (drbd1): log mount/recovery failed: error -22
[17355040.122322] XFS (drbd1): log mount failed |
|
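The corruption warning above carries two log sequence numbers; the metadata LSN being ahead of the current log LSN is exactly the inconsistency xfs_repair reconciles. A small sketch extracting both values from the kernel message shown above:

```shell
# Sketch: pull both LSNs out of the XFS corruption warning shown above.
line="XFS (drbd1): Corruption warning: Metadata has LSN (56:112832) ahead of current LSN (56:112733)."
meta=$(echo "$line" | sed -E 's/.*Metadata has LSN \(([0-9]+:[0-9]+)\).*/\1/')
current=$(echo "$line" | sed -E 's/.*current LSN \(([0-9]+:[0-9]+)\).*/\1/')
echo "metadata LSN $meta is ahead of current LSN $current"
```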
Repairing

On one of the nodes (choose the one which will become primary):

| Translations Ignore |
|---|
| Code Block |
|---|
xfs_repair /dev/drbd/by-res/jtelshared/0
pcs resource cleanup |
|
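After the repair and cleanup, Pacemaker should remount the share. A hedged manual check (a sketch, assuming the share path /srv/jtel/shared from this runbook; mountpoint is part of util-linux):

```shell
# Sketch: verify the share is mounted again on the repaired node.
if mountpoint -q /srv/jtel/shared; then
    echo "share mounted - resources should be starting"
else
    echo "share not mounted - inspect 'pcs status' for failed resources"
fi
```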
This should mount and start the resources on that node. Then treat the other node as "broken" and proceed as in the split-brain situation.

Stalled Resync

If the DRBD resync stalls - the output of cat /proc/drbd will show "stalled" - it may be necessary to restart the machine. This has been observed once, and restarting resolved the situation; not much more is known about this state or its cause at this time.

Failed Connect (Unrelated data, aborting)

When the secondary has been told to discard its data, and all of the commands to start the sync have been entered on both the healthy and the broken node, sometimes cat /proc/drbd will not report a connection. Check /var/log/messages. If you see output like this:

| Code Block |
|---|
kernel: block drbd0: uuid_compare()=-1000 by rule 100
kernel: block drbd0: Unrelated data, aborting! |
Then the metadata has become corrupt and must be completely reconstructed on the bad node. Use the following commands to recreate the metadata on the broken node:

| Code Block |
|---|
drbdadm down jtelshared
drbdadm wipe-md jtelshared
drbdadm create-md jtelshared |
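For reference, the follow-on step of placing the broken node in secondary and discarding its data is, on DRBD 8.x, typically done with the commands below. This is a sketch based on the standard split-brain recovery procedure, not taken from this runbook - verify it against the "On the broken node" section referenced below. It is wrapped in a function so it can be reviewed before running as root:

```shell
# Sketch (assumption: standard DRBD 8.x split-brain recovery commands;
# verify against the "On the broken node" steps this runbook refers to).
# Wrapped in a function so it can be reviewed before running as root.
resync_broken_node() {
    drbdadm up jtelshared
    drbdadm secondary jtelshared
    drbdadm connect --discard-my-data jtelshared
}
```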
Then proceed with the re-sync as above (start with the part "On the broken node", with the commands to place this node in secondary and discard its data).

Failed Connect (Unknown Connection)

This produces errors like this:

| Code Block |
|---|
??: Failure: (162) Invalid configuration request
additional info from kernel:
unknown connection |
In this case, drbd might not be loaded and enabled. Execute the following code on the broken node and then proceed as above: | Code Block |
|---|
modprobe drbd
systemctl enable drbd
systemctl start drbd |
|
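A quick way to confirm the module and service actually came up after the commands above - a sketch, wrapped in a function so the commands can be reviewed before running as root on the broken node:

```shell
# Sketch: post-checks after loading and enabling drbd.
check_drbd() {
    lsmod | grep -q "^drbd " && echo "drbd module loaded"
    systemctl is-enabled drbd
    systemctl is-active drbd
}
```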