
I'd like to set up a software RAID 10 array (4x2TB) for a workstation using a fresh install, but I'm finding non-applicable, conflicting, or outdated resources online. I'd like to get the community's advice on setting up such an array, as there appears to be a plethora of possible configurations.

Specifically, I'll be using the workstation for image analysis (i.e., typical file sizes from several hundred MB to several GB), so if there are optimizations that can be made for such purposes, that would be great.

Cheers

Jorge Castro
  • 71,754

3 Answers

1

I don't remember exactly what resources I followed when setting up the RAID on my server, but I think this article was my main source of information. Some important points:

  • Use mdadm and not dmraid.
  • Use /dev/disk/by-id/ paths to point at the disks, instead of /dev/sda etc… It's easier to map them to the physical devices in case you need to replace a disk or such (a creation sketch follows this list).
  • Be patient. At first I thought my RAID would be ready after the 5-hour initial setup time. Then it took another day to rebuild itself and actually be ready. (4x2TB)
  • Use a separate disk/partition/RAID for primary OS installs. It's easier to manage everything if you keep the large data RAID separate from the primary OS and its data. It's much easier to recover a small OS disk than to rebuild a huge multi-terabyte array if something goes bad on the OS side.
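
A minimal sketch of what creating such an array might look like, assuming four whole data disks and the OS on a separate drive; the by-id names below are placeholders for the serial-numbered links you will actually see under /dev/disk/by-id/:

    # Hypothetical 4-disk RAID10 built from by-id paths (names are placeholders).
    sudo mdadm --create /dev/md0 --level=10 --raid-devices=4 \
        /dev/disk/by-id/ata-DISK_A /dev/disk/by-id/ata-DISK_B \
        /dev/disk/by-id/ata-DISK_C /dev/disk/by-id/ata-DISK_D

    # Watch the initial sync; the array is usable, but slower, until it finishes.
    cat /proc/mdstat

    # Record the array so it assembles on boot, then refresh the initramfs.
    sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
    sudo update-initramfs -u
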
dobey
  • 40,982
  • Not entirely, you still need to identify your drives in advance, preferably by serial number, so when one is reported failed it's easy to remove just that one. I'll take a label and copy the serial # to the cable end of the drive. The by-id name usually includes the serial number. – ppetraki Dec 14 '12 at 14:21
  • @ppetraki Ok, fair enough. It would save you some time in mapping the physical disk visually to the name given. However, are you sure this actually works like this? It seems not to work for me when doing auto-assembling by array-UUID. It reports plain names in the array configuration (e.g. sda1) no matter what the original paths to the disks were at the time of creation. – gertvdijk Dec 14 '12 at 14:30
  • You don't need to use the by-id names in assembling the array, but it does help you organize your plan, especially if you have tons of disks; they're just symlinks to sd devices. MD just scans for metadata, it doesn't care about disk names, and neither does the admin tool, which is unfortunate as now I have to spend the extra time to look up sda => scsi-SATA_TOSHIBA_THNS128_10US10BTT02Z, which is a lot more meaningful. It would be a great feature to add, I just haven't got around to it. – ppetraki Dec 14 '12 at 14:35
  • @dobey - so far, that's the best description for a RAID 10 setup I've seen. Thx. Pity that it's dated 2008. – Prophet60091 Dec 14 '12 at 14:48
  • @Prophet60091 Regarding the mdadm commands, nothing has really changed. The Grub part is a bit outdated, though. – gertvdijk Dec 14 '12 at 14:49
  • Sorry, updated the answer to mention physical mapping instead. Performance wise, the default RAID 10 settings are probably fine, assuming you're using SATA-II or SATA-III disks and controller. Unless you want to do RAID 5 or 6 and spend a lot of money to do it right, you're probably not going to see any useful performance gains. – dobey Dec 14 '12 at 15:02
  • @dobey - thx. I'll try to benchmark results using real-world tests and PTS. Nevertheless, your answer to the question posed is shaping up to the best so far. – Prophet60091 Dec 14 '12 at 15:09
  • Yes, with grub2 you don't have to do anything goofy anymore, you can just have one big raid partition on all the disks and install grub to all of them, no need for a /boot partition. – psusi Dec 14 '12 at 15:32
  • Mucking with grub is only relevant if you're using a RAID as your primary OS drive. I would recommend using a separate drive for the OS, and if you want to RAID it, use 2 smaller drives in RAID 1 or something simple, with a larger RAID 10 array for data. However, even without a /boot partition, on newer 64 bit hardware, you may need to have an EFI boot partition to be able to boot the OS. – dobey Dec 14 '12 at 16:19
  • I like the idea of a fifth drive, but I don't think I can shoehorn a fifth drive into the micro-ATX chassis :D Thx to dobey and @gertvdijk for the discussion above. If you come across other well-documented guides let me know. – Prophet60091 Dec 14 '12 at 21:14
  • Oh. I'd recommend an external enclosure for the RAID then, with its own power supply, and hot swappable drive bays. :) – dobey Dec 14 '12 at 21:49
1

With RAID10 in the given situation I see only two variables that are candidates for optimization:

  • Chunk size

    Set it to something larger than the default of 512 KiB to minimize the overhead for linear reads/writes of large files. Try it on a small partition on your machine to see what gives the best performance, e.g. 1 MB, 2 MB, 5 MB, 10 MB... (a rough benchmarking sketch follows the layout diagrams below).

  • Near vs Far layout

    Comparable to RAID1+0 vs RAID0+1. Far is a bit faster, as read performance is more like RAID0. Yet the near layout is the default because it has a slightly higher chance of surviving a second disk failure (some probability math on this below). A more visual idea of the difference is below, happily stolen from the SLES mdadm documentation:

    Near looks like

    sda1 sdb1 sdc1 sdd1
      0    0    1    1
      2    2    3    3
      4    4    5    5
    

    Far looks like

    sda1 sdb1 sdc1 sdd1
      0    1    2    3
      4    5    6    7       
      . . .
      3    0    1    2
      7    4    5    6
    

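A rough benchmarking sketch, assuming four small spare test partitions (/dev/sd[abcd]2 here is purely hypothetical) so you can compare chunk sizes and layouts before committing the real array. --assume-clean skips the initial resync, which is fine for a throwaway test array but not for production use:

    # Try a few chunk sizes (values are in KiB) with the far layout (f2).
    for chunk in 512 1024 2048 4096; do
        sudo mdadm --create /dev/md9 --level=10 --layout=f2 --chunk=$chunk \
            --raid-devices=4 --assume-clean --run /dev/sd[abcd]2
        sudo mkfs.ext4 -q /dev/md9 && sudo mount /dev/md9 /mnt
        # Sequential write test; dd prints throughput on stderr
        # (adjust count to fit your test partitions).
        sudo dd if=/dev/zero of=/mnt/testfile bs=1M count=4096 conv=fdatasync
        sudo umount /mnt
        sudo mdadm --stop /dev/md9
        sudo mdadm --zero-superblock /dev/sd[abcd]2
    done
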
Update about far vs near redundancy from the discussion in the comments. Suppose sda fails:

       near
sda1 sdb1 sdc1 sdd1
  -    0    1    1
  -    2    3    3
  -    4    5    5

then sdc or sdd can still fail, while in far:

        far
sda1 sdb1 sdc1 sdd1
  -    1    2    3
  -    5    6    7       
  . . .
  -    0    1    2
  -    4    5    6

now only sdc can fail, as a failed sdb drive would make block 4 inaccessible and a failed sdd drive would make block 3 inaccessible.

Conclusion: chances of surviving a 2-disk failure are higher when using a near layout. (can someone do the math here for a quantitative number?)
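
A back-of-the-envelope count for this 4-disk example, following the diagrams above: there are 6 possible two-disk failure pairs in total (4 choose 2).

    near: survives any pair except the two mirror pairs {sda,sdb} and {sdc,sdd}  ->  4/6 = 2/3
    far:  survives only the pairs {sda,sdc} and {sdb,sdd}                        ->  2/6 = 1/3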

gertvdijk
  • 67,947
  • thx for that link. I like the distinction made between "Complex vs. Nested RAID 10". That particular nuance was never clear to me. – Prophet60091 Dec 14 '12 at 14:46
  • far doesn't have any better chance of survival ( data is on the same disk, just different location ), it is just optimized for reading rather than writing. There is also the offset layout, which is more of a balance between the two. – psusi Dec 14 '12 at 15:28
  • @psusi Suppose one drive has failed in a 4-disk RAID10. Then only one out of the working three could fail without blowing up the array in a far-layout, whereas two out of three can fail in near-layout. At least this is what I understood a year ago when I set up my RAID10 array. I am unable to find that source again. – gertvdijk Dec 14 '12 at 15:34
  • @gertvdijk, no.. once again, the backup copy is on the same drive with far layout, just all towards the end of the drive rather than mixed throughout. – psusi Dec 14 '12 at 23:54
  • @psusi No, that's exactly making the difference! Suppose sda fails. In near, then sdc or sdd can fail, but not sdb. In far, only sdc can fail because sdb can't fail (missing block 4) and sdd can't either (missing block 3). – gertvdijk Dec 15 '12 at 00:01
  • No, both layouts can handle some double disk failures, just which ones differ. With far it is sdb and sdd or sda and sdc. – psusi Dec 15 '12 at 01:21
  • Also worth noting is that the far layout cannot be resized. I recently picked up three 1 TB WD Blue drives cheap and set them up in raid10,offset with a 16M chunk size, and it gets nearly 500 MB/s sequential read throughput. I only initialized part of the array (with the -z switch) to avoid the long wait for resyncing the whole thing, and can grow it as needed (on the fly). – psusi Jan 04 '14 at 05:47
-3

Picking up some hot spares in advance would be a good idea. Also take these notes into account:

Recommended storage scheme for home server? (LVM/JBOD/RAID 5...)

See footnote [1] in the above link to see what happens with cheap storage when you need it the most.

This is all a moot point, however, until you profile how your target application actually uses storage. You might find that parallelism is possible, so one block device can be used for reading the original data and one for writing the results. This could be further abstracted behind a RAID0 (until the HBA reports QUEUE_FULL), with the results backed up via rsync.
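
A minimal hedged sketch of that backup leg; the device names, paths, and remote host are placeholders, not a recommendation for your exact layout:

    # Hypothetical scratch RAID0 holding in-flight results (fast, but no redundancy)...
    sudo mdadm --create /dev/md3 --level=0 --raid-devices=2 /dev/sde /dev/sdf

    # ...with finished results pushed off-box regularly, e.g. from a cron job.
    rsync -a --partial /mnt/results/ backup@archive.example.com:/archive/results/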

It really depends. Saying "I'm doing image analysis" without defining the workload or level of service just isn't enough; even if you did, that level of performance analysis is real work, and it's something "I" wouldn't do in my spare time. My intention is to get you thinking about your application so you can create your own solutions. Spindles are always the slowest part of your system, so plan accordingly.

One idea, if you wish to do the multi-array approach, would be to create two RAID 1s on separate controllers and add those MD devices to an LVM VG for management. Sure, a RAID 10 is fast, but it's still one storage queue; now you have two, and with separate controllers there's no HBA queue sharing either.
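
A rough sketch of that layout, assuming hypothetical device names (ideally one mirror pair per controller) and example volume names:

    # Two RAID1 pairs, one per controller (device names are examples only).
    sudo mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda /dev/sdb
    sudo mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdc /dev/sdd

    # Pool both mirrors into one LVM volume group for flexible management.
    sudo pvcreate /dev/md1 /dev/md2
    sudo vgcreate data_vg /dev/md1 /dev/md2
    sudo lvcreate -l 100%FREE -n images data_vg
    sudo mkfs.xfs /dev/data_vg/images   # a filesystem that handles large files well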

Performance Notes:

Remember, SW RAID is no different from HW RAID: understand how it works, or when it fails you might end up more at risk than if you had, say, spent your energy on a regular backup strategy (rsync.net) instead. I've lost count of the number of users who've lost everything because they didn't read the manual and actually test the failure modes.

ppetraki
  • 5,483
  • Yes, some amazing IO requirements are definitely in mind (image analysis). If I had the money, this would be purely an SSD affair. Also, it's my understanding that SW RAID is more flexible and cheaper than HW RAID...I've even read that it's faster, but that's neither here nor there for this particular question. Thx for the link though. – Prophet60091 Dec 14 '12 at 14:38
  • In which case, asynchronous backup is cheaper; start with a RAID0 and rsync to replicate your work to a safe location. I recommend rsync.net, works well with open standards: rsync, duplicity, & deja-dup. You might want to have two arrays, on separate controllers: one that does all the reading of the original data, and one that serves as the destination of the processed result; nice for multi-threaded processing (pipeline). You also might want to consider using XFS or a similar fs that excels at large files; the storage array is just one piece of the puzzle. – ppetraki Dec 14 '12 at 14:47
  • interesting: so one RAID0 dedicated to reading, and another RAID0 dedicated to writing, all the while backing up the processed results. I'll have to explore this option as well. Thx. – Prophet60091 Dec 14 '12 at 15:01
  • If you're doing batch processing, something as simple as dividing the list into 4 parts and spawning a pipeline for each segment gets you instant parallelism. You can get really cute here and chunk the list by total size and spawn a job for every 500GB, for example. – ppetraki Dec 14 '12 at 15:08
  • Several things are wrong with this answer: 1) lvm over raid10 is better than lvm over two raid1s, 2) you don't need to replace with the exact same model with or without partitions, 3) hardware or software raid does not matter; raid is no substitute for having regular backups! – psusi Dec 14 '12 at 15:23
  • 1) RAID 1's vs RAID 10 is subjective, 2) I've seen problems with tooling wrt block devices without partitions (udev), and using partitions creates headroom for bad block management (see linked post), 3) sure, agreed, but SW gets you into more trouble faster than HW RAID, as users get comfortable with the easy setup and give no thought to managing faults. I've seen more customers lose data to SW RAID not because of its reliability, but because they treated it with less respect than a HW RAID; the latter is usually better documented, in one place. – ppetraki Dec 15 '12 at 18:04
  • @ppetraki, lvm over two raid1s effectively is raid10, only more complex and less manageable. Using partitions has nothing at all to do with bad block management; that is handled internally by the drive. User error is user error whether it is hardware or software raid, and the point was that neither one is a license to go without backups. – psusi Dec 15 '12 at 23:54
  • @psusi, bad blocks: the firmware on the cheaper drives (SATA, not SAS) will lie to you, overstep the "free list", and commandeer blocks previously committed to primary storage. It will literally eat your data to save the disk. There's a bias somewhere where taking blocks from the end is more important than leaving it unrepaired, which is fine if you have a filesystem with some partitions, since that's all offset-based access; block mirroring, on the other hand, depends on the fact that disk X is YYYY blocks, and if it becomes YYYY-1 it becomes ineligible to remain in the array forever. – ppetraki Dec 17 '12 at 14:10
  • @ppetraki, I have never seen or heard of such a thing, do you have any sources? That would be an incredibly stupid thing to do since if it just fails the write then the OS at least can avoid using that sector and inform the user of the problem rather than have other random parts of the fs trashed. I find it very hard to believe that any vendor would do something so foolish with nothing to gain. – psusi Dec 18 '12 at 16:57
  • @psusi, personal experience. This isn't the sort of stuff you see press releases about, it's the kind of bug vendors silently fix in their firmware with an abstract description and hope nobody notices. When you work on storage for living you get to see the worst failure modes. I've also personally fixed lost interrupts on hotplug (firmware just drops dead) and other things people think are safe. Most storage adapters for example can't survive more than 50 consecutive diskpulls before something terrible happens: firmware, driver, mid-layer; it's always something. I can share test details... – ppetraki Dec 18 '12 at 18:54