More/Better Internal Storage on the Toshiba AC100 – Part 2

Following my research for the previous article about the performance of SD/CF/USB flash modules, the only conclusion I could reach is that most of them are pretty dire. The only notable exception among the SD cards seems to be the latest generation of SanDisk Extreme Pro (95MB/s) cards, which just about manage to squeeze out enough random-write performance to match a 7200rpm disk. That is still feeble compared to any reasonable SSD, so I wanted to see what else could be done about installing extra storage with good performance into an AC100.

What I came across is this: the SuperTalent RC8 USB stick. It may look like a USB stick, but it is actually a full-on SSD, featuring a SandForce 1200 flash controller. I figured this was worth a shot, even though the specifications indicate it is rather large (far too large to fit inside an AC100 in its standard form). Stripped out of its casing, however, it looked like the RC8 might just fit inside the AC100.

This is what I ended up with. There appears to be only one place inside an AC100 where a bare RC8 circuit board could be fitted. You will need the following:

1) P3MU mini-PCIe USB break-out module

2) SuperTalent RC8 USB stick

3) Custom-made USB cable (a male and a female type A USB connector, some single-core wire, and some skill with a soldering iron)

Measure out exactly how long you need the cable to be – there is no room to tuck away excess cable inside an AC100. Here is what my cable layout ended up looking like.

AC100 motherboard with P3MU and custom USB cable fitted

This is what it looks like with the top panel fitted. Note the large cut-out that has been made below the mini-PCIe slot access hole.

AC100 modified to receive RC8 USB SSD

And again with the screws fitted. Note that one of the screw holes is in the area that had to be cut out. This shouldn’t affect the structural integrity of the AC100, though. Also note that the right speaker cable has been re-routed slightly to now go over the LED ribbon cable.

AC100 modified to receive RC8 SSD

This is what it looks like with the RC8 attached. Now you can see why the cut-out in the top panel was exactly the shape it was – I specifically cut out the minimum possible amount to allow the RC8 to fit.

Toshiba AC100 with the SuperTalent RC8 USB SSD installed

I also put a piece of thin transparent sticky tape over it to hold it in place, just to make sure nothing can short out against the underside of the keyboard.

Toshiba AC100 with the SuperTalent RC8 SSD

And that is pretty much it. Put the keyboard back in and bolt it all together. The metal part of the USB connector will sit a tiny bit above the line of the panel, but once the keyboard is back on, the only way you’ll notice is if you already know there is a tiny bulge there.

Your AC100 should now be able to handle ~2000 IOPS on both random reads and random writes, along with the much better life expectancy that proper flash management brings.

At this point I would like to point out just how impressed I am with the SuperTalent RC8 USB SSD. Not only is the performance phenomenal (for a USB stick, at least), but it really behaves like a SATA SSD – to the point where you can use tools like hdparm and smartctl on it (yes, it even supports SMART).
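
For example, something along these lines should work for poking at it (the device node is just an assumption – check dmesg for the actual one; the -d sat option tells smartctl to use SAT pass-through over the USB bridge):

hdparm -I /dev/sdb
smartctl -d sat -a /dev/sdb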

Flash Module Benchmark Collection: SD Cards, CF Cards, USB Sticks

Having spent a considerable amount of time, effort, and ultimately money trying to find decently performing SD, CF and USB flash modules, I feel I really need to make the lives of other people with the same requirements easier by publishing my findings – especially since I have been unable to find a reasonably comprehensive data source with similar information.

Unfortunately, virtually all SD/microSD (referred to as uSD from now on), CF and USB flash modules have truly atrocious performance for use as normal disks (e.g. when running the OS from them on a small, low power or embedded device), regardless of what their advertised performance may be. The performance problem is specifically related to their appalling random-write performance, so this is the figure that you should be specifically paying attention to in the tables below.

As you will see, the sequential read and write performance of flash modules is generally quite good, as is random-read performance. On their own, however, these figures are largely irrelevant to the overall performance you will observe when running the operating system from the card, if the random-write performance is below a certain level. And yes, your system will write several MB to the disk just by booting up, before you even log in, so don’t think that it’s all about reads and that writes are irrelevant.
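
If you want to verify that for yourself, the cumulative amount written to a disk since boot can be read straight out of /proc/diskstats (field 10 is sectors written; sda is assumed here):

awk '$3 == "sda" { print $10 * 512 / 1024 " KB written since boot" }' /proc/diskstats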

For comparison, a cheap 5400rpm laptop disk can typically achieve around 90 IOPS on both random reads and random writes with a typical (4KB) block size. This is an important figure to bear in mind purely to be able to see just how appalling the random write performance of most removable flash media is.

All media was primed with two passes of:

 dd if=/dev/urandom of=/dev/$device bs=1M oflag=direct

in order to simulate long term use and ensure that the performance figures reasonably accurately reflect what you might expect after the device has been in use for some time.

There are two sets of results:

1) Linear read/write test performed using:

dd if=/dev/$device of=/dev/null    iflag=direct
dd if=/dev/zero    of=/dev/$device oflag=direct

The linear read-write test script I use can be downloaded here.

2) Random read/write test performed using:

iozone -i 0 -i 2 -I -r 4K -s 512m -o -O +r +D -f /path/to/file

In all cases, the test size was 512MB. Partitions are aligned to 2MB boundaries. The file system is ext4 with a 4KB block size (-b 4096) and a 16-block (64KB) stripe-width (-E stride=1,stripe-width=16), no journal (-O ^has_journal), and mounted without access time logging (-o noatime). The partition used for the tests starts at half of the card’s capacity, e.g. on a 16GB card, the test partition spans the space from 8GB up to the end. This is done in order to nullify the effect of some cards having faster flash at the front of the card.
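
For reference, the test file system is created and mounted roughly like this (device node and mount point are placeholders; the partition itself is assumed to already start on a 2MB boundary in the second half of the card):

mkfs.ext4 -b 4096 -E stride=1,stripe-width=16 -O ^has_journal /dev/sdb2
mount -o noatime /dev/sdb2 /mnt/test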

The data here covers only the first modules I have tested and will be extensively updated as and when I test additional modules. A single module can take over 24 hours to complete testing if its performance is poor (e.g. 1 IOPS) – and unfortunately, most of them are that bad, even those made by reputable manufacturers.

The dd linear test is probably more meaningful if you intend to use the flash card in a device that only ever performs large, sequential writes (e.g. a digital camera). For everything else, however, the dd figures are meaningless and you should instead be paying attention to the iozone results, particularly the random-write (r-w) figures. Good random-write performance also usually indicates a better flash controller, which means better wear leveling and better longevity of the card, so all other things being similar, the card with faster random-write performance is the one to get.

Due to WordPress being a little too rigid in its templates to allow for wide tables, you can see the SD / CF / USB benchmark data here. This table will be updated a lot so check back often.

More/Better Internal Storage on the Toshiba AC100

One of the unfortunate things about the AC100 is that the internal storage isn’t removable, and thus isn’t easily upgradable or replaceable. The latter could be an issue in the longer term because it is flash memory, so it will eventually wear out, and since it is relatively basic eMMC, I don’t expect the flash controller to be particularly advanced when it comes to wear leveling and minimizing write amplification. Using the SD slot is an option, but if we are running the operating system from it, we cannot use it for removable media, which could be handy. We could use a USB stick instead, but then we lose the only USB port on the machine. There is no SATA controller inside the AC100.

What can be done about this? Well, models that have a 3G modem have it on a mini-PCIe USB card. Even though Tegra 2 has a PCIe controller built into it, the mini-PCIe slot isn’t fully wired up – only the USB lines are connected. Since most of us can tether a data connection via our phones, and since this is more cost effective than paying for two separate mobile connections, the 3G module isn’t particularly vital. The main issue is that the slot only has USB wired up. So what we would need is a USB mini-PCIe SSD. Is there such a thing? It turns out that there is. I have been able to find two:

  1. EMPhase Mini PCIe USB S1 SSD
  2. InnoDisk miniDOM-U SSD

The specification of the two modules is virtually identical (both use SLC flash, among other similarities), so I decided to investigate both of them. Unfortunately, when I contacted an EMPhase re-seller, they called me back after speaking to the manufacturer and talked me out of buying one, citing unspecified issues.

My local InnoDisk re-seller was more interested in selling me a product, but there were two reasons why, despite very good pre-sales service, I ultimately decided against buying one of these. The first and foremost was the performance specification: according to the manufacturer’s own figures, random access performance with 4KB blocks is 1440 random-read IOPS but only 30 random-write IOPS. The second was cost: at approximately 4x the price per GB of similarly performing SLC SD cards, this module had to be discarded on the basis of cost effectiveness.

Having discarded the above modules, there are still a few alternative options available. The low risk, tidy options include an SD mini-PCIe USB adapter and a micro-SD mini-PCIe USB adapter. They are very reasonably priced so I got one of each for testing, and I am pleased to say that they work absolutely fine in the AC100. Here is what they look like fitted into the AC100.

Dual micro-SD mini-PCIe USB Adapter

SD mini-PCIe USB Adapter

The SD cards will appear as USB disks. If you use the dual micro-SD adapter you can RAID the two cards together.
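
If you want to try that, a minimal sketch with Linux software RAID would be something like this (device names are assumptions, and RAID0 is just one possible choice – the cards simply show up as ordinary USB disks):

mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc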

Unfortunately, I have found that the best results are achieved using a single SD card, purely because I haven’t found any micro-SD cards that have reasonable performance when it comes to random-write IOPS. SD cards fare a little better, but the best SD card I have found in terms of random write IOPS still tops out at a mere 19 random write IOPS using 4KB blocks. Still, it is 2/3 of the marketed figures for the InnoDisk SSD at 4x lower price per GB, and the performance just about scrapes past what I would consider minimal requirements for reasonable use.

I am currently putting together a list of SD, micro SD and USB flash devices and consistent benchmark performance figures for them, which should hopefully help you to choose the ones most suitable for your application. I hope to have the article up reasonably soon, but don’t expect it too soon – benchmarking SD cards takes a long time to do properly.

Enabling Write-Read-Verify Feature on Disks

Given the appalling reliability of modern disks, any feature that helps ensure data integrity and early detection of failure has to be deemed a good thing. What caught my attention recently is that all of the Seagate Barracuda disks I have (a number of ST31000333AS, ST31000340AS and ST31000528AS models) support the Write-Read-Verify feature. But there is a snag – disks from different batches, even of the same model, seem to disagree about the default state of this feature. Worse, the feature gets reset to its default setting on every reboot. This wouldn’t be a problem if the usual tool for such things on Linux, hdparm, had an option for controlling the state of this feature – but it doesn’t. So I wrote a patch to add control of the write-read-verify capability to hdparm. Hopefully this will help keep your data a little safer.
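
As a quick sanity check, the ATA identify data shows whether a given disk advertises the feature at all (device node is a placeholder; features that are currently enabled are marked with an asterisk in the hdparm output):

hdparm -I /dev/sda | grep -i 'write-read-verify'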

The Appalling Quality of Hard Disks

I decided to write this article after having spent the last three years fighting unreliable and buggy disks of certain brands. The most prominent anti-star of this article is the disk model HD501LJ – a 500GB SATA disk. If you Google the model number, I am sure you can find out who the manufacturer of it is.

The story begins back in February 2008 when I bought two of these disks for the machine I was building. Approximately 15 months later, both of the disks (used in a RAID1 mirror) failed about 20 minutes apart with massive unrecoverable media failure, taking the data with them. This was annoying and inconvenient, but I learned decades ago about the importance of keeping backups, so it wasn’t that serious a setback – more an annoyance at the waste of time than anything else.

Before I say anything else, I would like to say that the quality of the service provided by Rexo (the company in the UK that handles warranty replacements for this particular disk manufacturer) is superb. They always send the replacement disk the same day the faulty disks arrive, and as the saga that I am about to recount unfolded, they even sent couriers to pick up faulty disks and deliver the replacements at the same time. Their superb service was in fact the only reason why I bought more disks made by the same manufacturer. This turned out to have been a big mistake, since Rexo no longer handle warranty services directly – you have to arrange it via the manufacturer’s web site. No doubt they were too efficient and helpful toward the end customers for the manufacturer’s liking.

The real story begins with the disks that arrived as replacements under warranty. One of them worked OK, and passed all the cursory tests I threw at it (short and long SMART tests and a rudimentary pass of badblocks). The other initially passed the tests, but approximately 50% of the time, the actuator would get stuck and the disk would just click indefinitely when trying to power up. Power cycling typically rectified the problem but I wasn’t prepared to put up with it, so it went back to get replaced.

The next replacement disk exhibited a different interesting problem. It failed the SMART tests immediately, on a sector that was beyond the LBA-addressable range. This turned into a pending sector and couldn’t be fixed because it wasn’t a writable sector – it was a spare sector for remapping unrecoverable sectors. Clearly the firmware is buggy in its handling of such a condition – it doesn’t handle the physical and logical block addressing correctly. Since this rendered the built-in SMART diagnostics useless, this disk went back to be replaced again. Another 4 disks arrived after it, all with the same issue. This indicates a systematic fault and very poor quality control procedures on refurbished disks.

Eventually my case got escalated to their engineering department, and one of the engineers hand-picked a disk on which the problem wasn’t manifesting, ran a full set of tests on it (including SMART, which shockingly does not form a part of the quality control checks on disks from this manufacturer – or at least it didn’t in 2009), and sent me that disk to replace the one that was faulty.

Now, another year later, the problem manifested again on one of those disks (a bad sector at the LBA+1 address). The only way this could be cured (refusal to reallocate sectors on overwrite and bad sectors beyond the end of the LBA-addressable range) was by performing a secure erase. Afterwards, the disks passed the full SMART self-diagnostics and badblocks tests. The HD501LJ and HD103UJ, however, had an additional problem. Once the security on the disks was activated and the password set (required to perform a secure erase), they didn’t automatically disable security upon the erase. It also appears that the security implementation is buggy, and if the disk is secured, it will cause the machine to crash during booting on certain combinations of BIOS/SATA controller and motherboard. I worked around this by putting the disks in a machine that didn’t crash and disabling the security on the disks manually.
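
For reference, the secure erase procedure itself amounts to something like the following with hdparm (the device node is a placeholder, the password is a throwaway, and a drive that reports itself as “frozen” in hdparm -I will need a suspend/resume cycle or a different machine first):

hdparm --user-master u --security-set-pass temp /dev/sdX
hdparm --user-master u --security-erase temp /dev/sdX
hdparm --user-master u --security-disable temp /dev/sdX

The last command is the manual security disable mentioned above, for disks that are left with security still enabled after the erase.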

Over time I did a bit more investigative work on these disks, and found additional bugginess in the firmware. I found that unreadable sectors that come up as “pending” sectors in SMART, once written to, disappear rather than show up as reallocated sectors. The pending sector count goes down to 0, and the reallocated sector count stays at 0. This is extremely bad behaviour, and affects not just the HD501LJ model, but also the 1TB HD103UJ disk from the same manufacturer. Since it isn’t limited to a specific model, it seems likely that it affects all disks made by this manufacturer. I should also point out that there is another disk model on which I have observed exactly the same behaviour: the WD5000AAKS.

You have been warned – these disks lie about the number of reallocated sectors they have, which means that one of the most important metrics indicating the health of the disk is missing. In some cases these disks also refused to reallocate the sectors on overwrite. It is worth noting that Google’s research on disk reliability shows the sector reallocation count to be the most reliable indicator of a disk’s imminent failure – they found that 40% of failed disks showed reallocated sectors (see Figure 14 in the linked document). It seems reasonable to assume that the disks made by the two manufacturers in question do not track this value precisely in order to reduce warranty claims by keeping the problem unnoticed for longer – no doubt hoping that you won’t notice until just after the warranty period expires. Any manufacturer that does this doesn’t deserve your custom – spend your money instead on a brand of disks that works correctly! Consider yourself warned.
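
If you want to keep an eye on these counters on your own disks, the two SMART attributes in question are easy enough to watch (device node assumed):

smartctl -A /dev/sda | egrep 'Reallocated_Sector_Ct|Current_Pending_Sector'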

None of the remedial actions listed above are something an average user would likely be able to carry out, and a more knowledgeable user would get there in the end after spending more time on the task than the cost of a disk would justify. All I can say in conclusion is that unless your data and time are worthless, buy disks made by a more competent manufacturer. Good manufacturers make a bad model from time to time. Bad manufacturers make bad models all the time.

Update: I have recently come across an interesting article on disk failure rates. I cannot help but wonder how much higher the return rates would be on disks from the two manufacturers mentioned above, whose disks I’ve had the misfortune to own, if they weren’t misreporting reallocated sectors.

Disk and File System Optimization

While RAID and flash disks have become much more common over recent years, some of the old advice on extracting the best performance out of them appears to have fallen into disuse for most common setups. In this article I will try to cover the basic file system optimisations that every Linux system administrator should know and apply regularly.

Performance from the ground up

The default RAID block size of 64-256KB offered by most controllers and by Linux software RAID is way too big for normal use. It will kill the performance of small IO operations without yielding a significant increase in performance for large IOs – and sometimes it even hurts large IO operations, too.

To pick the optimum RAID block size, we have to consider the capability of the disks.

Multi-sector transfers

Modern disks can handle transfers of multiple sectors in a single operation, thus significantly reducing the overheads caused by the latencies of the bus. You can find the multi-sector transfer capability of a disk using hdparm. Here is an example:

# hdparm -i /dev/hda

/dev/hda:

 Model=WDC WD400BB-75FRA0, FwRev=77.07W77, SerialNo=WD-WMAJF1111111
 Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
 RawCHS=16383/16/63, TrkSize=57600, SectSize=600, ECCbytes=74
 BuffType=DualPortCache, BuffSize=2048kB, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=78125000
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: Unspecified:  ATA/ATAPI-1 ATA/ATAPI-2 ATA/ATAPI-3
ATA/ATAPI-4 ATA/ATAPI-5 ATA/ATAPI-6

 * signifies the current active mode

The thing we are interested in here is the following:

MaxMultSect=16, MultSect=16

I have not seen any physical disks recently that have this figure at anything other than 16, and 16 sectors * 512 bytes/sector = 8KB. Thus, 8KB is a reasonable choice for the RAID block (a.k.a. chunk in software RAID) size for traditional mechanical disks.
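
With Linux software RAID the chunk size is fixed at array creation time, so a hypothetical array following this advice might be created along these lines (device names and RAID level are placeholders; mdadm takes the chunk size in KB):

mdadm --create /dev/md0 --level=6 --raid-devices=8 --chunk=8 /dev/sd[b-i]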

There are a few exceptions to these figures – some modern disks, such as the very recent Western Digital ones, support sector sizes of 4KB, so it is important to check the details and make sure you know what you’re dealing with. Also, some virtualization platforms provide virtual disk emulation that supports transfers of as many as 128 sectors. However, in the case of virtual disk images, especially sparse ones, it is impossible to make any reasonable guesstimates about the underlying physical hardware, so none of this applies in a meaningful way.

Flash memory

For flash based solid state disks, however, there are a few additional things to consider. Flash disks can only be erased in relatively large blocks, typically between 128KiB and 512KiB. The size of writes can have a massive influence on performance because, in the extreme case, a single 512-byte sector write still ends up causing a whole 512KiB block to be erased and re-written. Worse, if we are not careful about the way we align our partitions, we could easily end up with file system blocks (typically 4KB) that span multiple physical flash blocks, which means that every write to one of those spanning blocks ends up writing two flash blocks. This is bad for both performance and longevity of the disk.

Disk Geometry – Mechanical Disks

There is an additional complication in that alignment of the virtual disk geometry also plays a major role with flash memory.

On a mechanical disk the geometry is variable due to the inherently required logical translation that compensates for the varying number of sectors per cylinder (inner cylinders have a smaller circumference and thus fewer sectors than the outer cylinders). This means that any optimization we may try to do to ensure that superblock and extent beginnings never span cylinder boundaries (and thus avoid a track-to-track seek overhead, which is nowadays, fortunately, very low) is relatively meaningless, because the next cylinder shrink could throw out our calculation. While this used to be a worthwhile optimisation 25 years ago, it is, sadly, no longer the case.

There is a useful side-effect of this translation that one should be aware of. Since the outer cylinders have more sectors and the rpm is constant, it follows that the beginning of the disk is faster than the end. Thus, the most performance-critical partitions (e.g. swap) should be physically at the front of the disk. The difference in throughput between the beginning and the end of the disk can be as much as two-fold, so this is quite important!

Disk Geometry – Flash

Flash disks require no such translation, so careful geometry alignment is both useful and worthwhile. To take advantage of it, we first have to look at the erase block size of the flash disk in use. If we are lucky, the manufacturer will have provided the erase block size in the documentation. Most, however, don’t seem to. In the absence of definitive documentation, we can try to guesstimate it by doing some benchmarking. The theory is simple – we disable hardware disk caching (hdparm -W0) and test the speed of unbuffered writes to the disk using:

dd if=/dev/zero of=/dev/[hs]d[a-z] oflag=direct bs=[8192|16384|32768|65536|131072|262144|524288]

What we should be able to observe is that the performance will increase nearly linearly up to erase block size (typically, but not always, 128KiB), and then go flat.
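
A minimal sketch of such a probe (destructive to anything on the disk; the device node is a placeholder, 256MiB is written at each block size, and the last line of dd output gives the throughput):

hdparm -W0 /dev/sdX
for bs in 8192 16384 32768 65536 131072 262144 524288; do
    echo "block size: $bs"
    dd if=/dev/zero of=/dev/sdX bs=$bs count=$(( 268435456 / bs )) oflag=direct 2>&1 | tail -n 1
done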

Once we have this, we need to partition the disk with the geometry such that cylinders always start at the beginning of an erase block. Since these will always be powers of 2, the default CHS geometry with 255 heads and 63 sectors per track is pretty much the worst that can be chosen. If we set it for 128 heads and 32 sectors per track, however, things become much more sane for aligning cylinders to erase block boundaries. This yields 2MB cylinders which should work well for just about all flash disks. Thus, we can run fdisk by explicitly telling it the geometry:

fdisk -H 128 -S 32 /dev/[hs]d[a-z]

One important thing to note is that the first partition (physically) on the disk doesn’t start at sector 0. This is a hangover from DOS days, but if we used the first cylinder as-is, we would end up messing up the alignment of our first partition. So, what we can do instead is make a partition spanning only the 1st cylinder and simply not use it. We waste a bit of space but that is hardly a big deal. Alternatively, we could put the /boot partition at the beginning of the disk, as it changes very infrequently and is never accessed after booting.

Next we have to look at what options are available for the file system we intend to use. The ext2/3/4 file systems provide several parameters that are worth looking at.

Stride

The stride parameter is used to adjust the file system layout so that data and metadata for each block are placed on different disks. This improves performance because the operations can be parallelized.

This is specifically related to RAID – on a single disk we cannot distribute this load, and there is more to be gained by keeping the data and metadata in adjacent blocks to avoid seek times and make better use of read-ahead.

The stride parameter should be set so that file system block size (usually 4KB) * stride = RAID chunk size. In this case the block size is 4KB and the RAID chunk is 8KB, so stride = 2.

mkfs.ext4 -E stride=2

Stripe Width

This is a setting that both RAID arrays and flash media can benefit from. It aims to arrange blocks so that writes cover a whole stripe at once, rather than incur the double hit of the read-modify-write operation that RAID levels with parity (RAID 3, 4, 5, 6) suffer from.
This benefit is also directly applicable to flash media, because on flash we have to write an entire erase block, so cramming more useful data-writes into that single operation has a positive effect both in terms of performance and disk longevity. If the erase block size (or stripe width for RAID) is, for example, 128KiB, we should set stripe-width = 128KiB / 4KiB = 32:

mkfs.ext4 -E stripe-width=32

Block Groups

So far so good, but we’re not done yet. Next we need to consider the extent / block group size. The beginning of each block group contains a superblock for that group. It is the top of that group’s inode subtree, and needs to be checked to find any file/block in that group. That means the beginning block of a block group is a major hot-spot for I/O, as it has to be accessed for every I/O operation on that group. This, in turn, means that for anything like reasonable performance we need to have the block group beginnings distributed evenly across all the disks in the RAID array, or else one disk will end up doing most of the work while the others are sitting idle.

For example, the default for ext2/3/4 is 32768 blocks in a block group. The adjustment can only be made in increments of 8 blocks (32KB assuming 4KB blocks). Other file systems may have different granularity.
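
You can check what an existing file system is using with dumpe2fs (device node is a placeholder):

dumpe2fs -h /dev/sda1 | grep 'Blocks per group'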

The optimum number of blocks in a group will depend on the RAID level and the number of disks in the array, but you can simplify it into a RAID0 equivalent for the purpose of this exercise, e.g. 8 disks in RAID6 can be considered to be 6 disks in RAID0. Ideally you want the block group size to align to the stripe width plus or minus one stride, so that the block group beginnings rotate among the disks (upward for +1 stride, downward for -1 stride; both achieve the same effect).

The stripe width in the case described is 8KB * 6 disks = 48KB. So, for optimal performance, the block group should align to a multiple of 8KB * 7 disks = 56KB. Be careful here – in the example case we need a number that is a multiple of 56KB, but not a multiple of 48KB because if they line up, we haven’t achieved anything and are back where we started!

56KB is 14 4KB blocks. Without getting involved in a major factoring exercise, 28,000 blocks sounds good (default is 32768 for ext3, which is in a reasonable ball park). 28,000*4KB is a multiple of 56KB but not 48KB, so it looks like a reasonable choice in this example.
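
A quick sanity check of that arithmetic in the shell (28,000 blocks * 4KB = 112,000KB; the first remainder should be zero, the second should not):

echo $(( 28000 * 4 % 56 ))   # 0  - a multiple of 56KB
echo $(( 28000 * 4 % 48 ))   # 16 - not a multiple of 48KB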

Obviously, you’ll have to work out the optimal numbers for your specific RAID configuration; the above example is for:
disk multi-sector = 16
ext3 block size = 4KB
RAID chunk size = 8KB
ext3 stripe-width = 12
ext3 stride = 2
RAID = 6
disks = 8

mkfs.ext4 -g 28000

In case a flash disk is being used, the default value of 32768 is fine since this results in block groups that are 128MB in size. 128MB is a clean multiple of all likely erase block sizes, so no adjustment is necessary.

Journal Size

Journal size can also be adjusted to optimize array performance. Ideally, the journal should be sized to a whole multiple of the stripe size. In the example above, this means a multiple of 48KB. The default is 128MB, which doesn’t quite fit, but 126MB (for example) does.

mkfs.ext4 -J size=126

Since flash disks typically have very fast reads and access times, it is possible to forgo journalling altogether. Some crash-proofing will be lost, but fsck will typically complete very quickly on an SSD, minimizing the need for a journal in environments that don’t require the extra degree of crash-proof data consistency. If journalling is not required, simply use the ext2 file system instead:

mkfs.ext2

or disable the journal:

mkfs.ext4 -O ^has_journal

Growing

If you are certain that the file system will never need to be grown, you can adjust the amount of space reserved for future resizing. Unfortunately, the growth limit has to be a few percent bigger than the current file system size, but this is still better than the default of 1000x bigger or 16TB, whichever is smaller. This will also free up some space for data.

mkfs.ext4 -E resize=6T

Crippling Abstraction

The sort of finesse explained above that can be applied to extract better (and sometimes _massively_ better) performance from disks is one of the key reasons why LVM (Logical Volume Management) should be avoided where possible. It abstracts things and encourages a lack of forward thinking. Adding a new volume is the same as adding a new disk to a software RAID to stretch it – it’ll upset the block group size calculation and disrupt the advantage of load balancing across all the disks in the array that we have just so carefully established. By doing this you can cripple the performance of some operations from scaling linearly with the number of disks down to being bogged down by the performance of just one disk.

This can make a massive difference to the IOPS figures you get out of a storage system. There is scope for offsetting this, but it reduces the flexibility somewhat. You could carve up the storage into identical logical volumes, each of which is carefully aligned to the underlying physical storage, and add logical volumes in appropriate quantities (rather than just one at a time) so that the block groups and journal size still align in an optimal way.
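
If LVM really cannot be avoided, there is at least some scope for keeping the alignment intact, e.g. by forcing the start of the physical volume’s data area onto a stripe boundary (48KB here, matching the example above; the device is a placeholder):

pvcreate --dataalignment 48k /dev/md0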