File system corrupt after re-adding software RAID 1 member after test. Why?

A colleague and I set up a software RAID 1 with mdadm consisting of two physical disks, with two partitions on the virtual device. The setup went fine, and booting directly from one of the RAID disks yielded:



# cat /proc/mdstat 
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md127 : active raid1 sda1[0] sdb1[1]
92094464 blocks super 1.2 [2/2] [UU]

md1 : active (auto-read-only) raid1 sda2[0] sdb2[2]
4069376 blocks super 1.2 [2/2] [UU]

unused devices: <none>


To test our setup, we then shut the machine down, disconnected one of the disks, and restarted. The system came up fine, naturally in a degraded state:



Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md127 : active raid1 sda1[1]
92094464 blocks super 1.2 [2/1] [_U]

md1 : active (auto-read-only) raid1 sda2[2]
4069376 blocks super 1.2 [2/1] [_U]

unused devices: <none>


Next, we shut the machine down again, reconnected the disconnected disk, and disconnected the other disk. Again, everything went fine, with the following expected state:



Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md127 : active raid1 sda1[0]
92094464 blocks super 1.2 [2/1] [U_]

md1 : active (auto-read-only) raid1 sda2[0]
4069376 blocks super 1.2 [2/1] [U_]

unused devices: <none>


Finally, we shut down one last time, reconnected everything, and booted. This is what we got:



Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md1 : active (auto-read-only) raid1 sdb2[2] sda2[0]
4069376 blocks super 1.2 [2/2] [UU]

md127 : active raid1 sdb1[1]
92094464 blocks super 1.2 [2/1] [_U]

unused devices: <none>


As you can see, the first partition (second entry, they were swapped for some reason) is in a degraded state (the second is not, but that's just a swap partition). We weren't particularly worried by this. After all, it's expected that the two partitions aren't exactly equal any more after the simulated alternating failure of the disks. We added the missing partition like this:



# mdadm --manage /dev/md127 --add /dev/sda1
mdadm: re-added /dev/sda1


We expected the partition on /dev/sda to sync with (be overwritten by) the one on /dev/sdb. Instead, we ended up with a corrupt file system (numerous errors within seconds).





After this experience, I rebooted from a third disk, reinitialised the file system on /dev/md127 (with the -c option to mkfs.ext4 for good measure), and rebooted back into the now-functioning RAID. Then, once more, we shut down, disconnected one disk, booted, shut down again, reconnected the disk (this time leaving the other disk connected), and booted. Now we got this:



Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md127 : active raid1 sda1[0]
92094464 blocks super 1.2 [2/1] [U_]

md1 : active (auto-read-only) raid1 sdb2[2] sda2[0]
4069376 blocks super 1.2 [2/2] [UU]

unused devices: <none>


Now we're afraid the same thing will happen again if we just use the --add option as above.



I have two questions:




  1. What caused the file system corruption after simulating the alternating failure? My guess is that it has something to do with both disks diverging from the state just before the first disconnection, and this somehow tricked mdadm --add into not doing a resync. What would have been the correct sequence of commands to tell mdadm to use the mounted state as authoritative and sync the added disk to it?

  2. In our current situation (one simulated failure and then a reconnect, i.e. only one of the disks diverged from the state just before disconnection), what is the proper way to re-add the missing device? Can I just use the add command as above, and will it resync? Why didn't it resync automatically?


If it helps, here is the current output from mdadm --examine:



# mdadm --examine /dev/sda1
/dev/sda1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 726d9204:889a4c89:b7a1bdb9:a77d8130
Name : testhost:0 (local to host testhost)
Creation Time : Mon Feb 4 14:39:21 2019
Raid Level : raid1
Raid Devices : 2

Avail Dev Size : 184188928 (87.83 GiB 94.30 GB)
Array Size : 92094464 (87.83 GiB 94.30 GB)
Data Offset : 131072 sectors
Super Offset : 8 sectors
Unused Space : before=130984 sectors, after=0 sectors
State : clean
Device UUID : 46077734:6a094293:96f92dc3:0a09706e

Update Time : Tue Feb 5 13:36:59 2019
Bad Block Log : 512 entries available at offset 72 sectors
Checksum : 139d1d09 - correct
Events : 974


Device Role : Active device 0
Array State : A. ('A' == active, '.' == missing, 'R' == replacing)
# mdadm --examine /dev/sdb1
/dev/sdb1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 726d9204:889a4c89:b7a1bdb9:a77d8130
Name : testhost:0 (local to host testhost)
Creation Time : Mon Feb 4 14:39:21 2019
Raid Level : raid1
Raid Devices : 2

Avail Dev Size : 184188928 (87.83 GiB 94.30 GB)
Array Size : 92094464 (87.83 GiB 94.30 GB)
Data Offset : 131072 sectors
Super Offset : 8 sectors
Unused Space : before=130984 sectors, after=0 sectors
State : clean
Device UUID : dcffbed3:147347dc:b64ebb8d:97ab5956

Update Time : Tue Feb 5 10:47:41 2019
Bad Block Log : 512 entries available at offset 72 sectors
Checksum : e774af76 - correct
Events : 142


Device Role : Active device 1
Array State : AA ('A' == active, '.' == missing, 'R' == replacing)









hard-drive software-raid raid-1 mdadm

asked Feb 5 at 12:56 – Alexander Klauer

  • Did you allow a significant amount of time between each test to allow the RAID to rebuild itself?

    – Ramhound
    Feb 5 at 13:02

  • @Ramhound No. Between the 1st and 2nd test, there was nothing to rebuild. Should we have waited after the 2nd test before executing --add? OTOH, would a member that is no longer considered part of the array by the kernel even be rebuilt?

    – Alexander Klauer
    Feb 5 at 13:07

1 Answer

I found out what went wrong. The mdadm documentation says:




When a device is added to an active array, mdadm checks to see if it has metadata on it which suggests that it was recently a member of the array. If it does, it tries to "re-add" the device. If there have been no changes since the device was removed, or if the array has a write-intent bitmap which has recorded whatever changes there were, then the device will immediately become a full member of the array and those differences recorded in the bitmap will be resolved.


(emphasis mine)



Since both devices had diverged from the state at the time of the initial disconnection, their recorded changes were mutually incompatible, and re-adding one without a full resync shredded the filesystem.
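Divergence like this shows up directly in the Events counters that mdadm --examine reports (in the output quoted in the question, /dev/sda1 is at 974 events while /dev/sdb1 is at 142). A minimal way to compare the two members before deciding how to proceed, assuming both devices are visible and using the device names from the question:

# The member with the higher event count holds the more recent data
mdadm --examine /dev/sda1 /dev/sdb1 | grep -E '^/dev|Events'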



The solution for such a case is to call mdadm --zero-superblock on the missing device before adding it. This will force a clean rebuild.
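In command form, the sequence for the fully diverged case would have looked roughly like this. It is only a sketch, using the device names from the question, and it assumes the array is running degraded on the member whose data you want to keep (/dev/sdb1 here):

# Wipe the stale md superblock so mdadm cannot attempt a bare re-add
mdadm --zero-superblock /dev/sda1
# Add the now-blank partition back; a full resync from the surviving member follows
mdadm --manage /dev/md127 --add /dev/sda1
# Watch the rebuild progress
cat /proc/mdstat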



The second case, where only one of the devices diverged, was probably harmless, though I haven't tried it. In the case of an actual failure, where you have to replace the physical drive, you should also be fine, since the replacement drive carries no md metadata in the first place.
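As a preventive measure for future tests, the write-intent bitmap mentioned in the quoted documentation can be added to an existing array. For the normal case, where a single member disappears and later comes back unchanged, a re-add then only resyncs the blocks that were written while it was gone. A minimal sketch, assuming the array name from the question:

# Add an internal write-intent bitmap to the running array
mdadm --grow --bitmap=internal /dev/md127
# Verify it is active (a "bitmap:" line also appears in /proc/mdstat)
mdadm --detail /dev/md127 | grep -i bitmap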






answered Feb 5 at 15:32 – Alexander Klauer