USB flash drive failure and replacement
By Patrick Wigmore
Recording performance begins to degrade
It started to look like switching the USB power off during the day had been a mistake. For some unknown reason, the webcam seemed to become progressively more unstable after starting to use that configuration.
I had two working theories.
The first was that the webcam was most reliable when left switched on permanently. If that was correct, then the only solution was going to be to leave the power on all the time, and possibly add in an additional hardware feature to turn off the IR floodlight separately.
The second theory was that switching the power off every day caused filesystem corruption on the USB flash drive, because the USB reset procedure was not able to unmount the filesystem for some reason.
I added some additional steps to terminate and kill any lingering v4l2-ctl processes prior to unmounting the flash drive.
sftp-server can also prevent the flash drive from unmounting, if it has recently been used to access files on the drive. Since I wasn’t using sftp-server any more, I removed it.
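The pre-unmount steps amounted to something like the following sketch (the mount point /mnt/usbdrive is an assumption, and this is the shape of the thing rather than the exact script):

# Ask any lingering capture processes to exit, then force the stragglers
killall v4l2-ctl 2>/dev/null
sleep 2
killall -9 v4l2-ctl 2>/dev/null
# With nothing holding the filesystem open, the unmount should succeed
umount /mnt/usbdrive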
The gradual realisation that the flash drive had failed
When it seemed the webcam was unstable, it wrote files that caused read errors when I tried to rsync them to my laptop. This highlighted the fact that the filesystem had the option ‘errors=remount-ro’ enabled, because as soon as rsync encountered any one such file, everything would fail as the filesystem was remounted read-only. I changed this to ‘errors=continue’:
# Change the mount options on the first fstab entry, then save the change
uci set fstab.@mount[0].options='errors=continue'
uci commit
Having run fsck.fat on the flash drive’s filesystem and corrected the errors it found (certainly not for the first time), I hoped the more careful unmounting would avoid further corruption.
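For reference, the repair run was along these lines, assuming the flash drive’s filesystem shows up as /dev/sda1:

# Check the FAT filesystem and automatically repair any errors found
# (run while the filesystem is unmounted)
fsck.fat -a -v /dev/sda1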
I also realised that, when the webcam seemed to be playing up, v4l2-ctl could end up dropping to ridiculously low frame rates and taking absolutely ages to finish what should be a 30-second run.
I started to get frames from previous videos showing through into new files. I’d been moving the camera around the garden and quite frequently a video would contain flashes of footage from a previous location. Given the general shadowy appearance of the infrared footage, these glitches had a bit of a horror vibe to them!
At first I thought these ghost frames must be somehow coming from the webcam itself, but I now believe they were a symptom of the flash drive’s slow demise.
I tried installing fstrim on the Home Hub and running it on the flash drive, thinking that perhaps the drive had simply run out of spare blocks for shuffling data around, but the device didn’t support the necessary discard feature.
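The attempt went roughly like this (package name and mount point as I remember or assume them):

opkg update
opkg install fstrim
# Ask the controller to discard all unused filesystem blocks
fstrim -v /mnt/usbdrive
# ...which failed with something along the lines of:
# fstrim: /mnt/usbdrive: the discard operation is not supported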
Then I decided to try overwriting the flash drive with zeroes and reformatting it. If nothing else, it would stop the ghost frames. But the overwrite operation was too slow. I decided to remove the flash drive and attempt the same operation on my laptop.
On the laptop, trying to overwrite the flash drive with zeroes proved that the flash drive itself was the bottleneck. Writing was extremely slow (~30KiB/s). At that rate, it would have taken days to complete, so I cancelled the operation.
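The overwrite itself was nothing exotic; something like this, with /dev/sdX standing in for the flash drive (double-check the device node, since this destroys everything on it):

# Zero the whole device; at ~30KiB/s, 64GB works out to roughly 24 days
dd if=/dev/zero of=/dev/sdX bs=4M status=progress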
Trying to revive the flash drive
Secure erase?
The Internet suggested that the ATA Secure Erase command might be useful as a substitute for trim/discard. I tried that, but it didn’t work; the drive did not seem to support the commands.
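For the record, the usual sequence is two hdparm security commands, sketched below with a placeholder password and device; on this drive they were simply rejected:

# Set a temporary user password, which unlocks the security feature set
hdparm --user-master u --security-set-pass pass /dev/sdX
# Issue the erase itself, which should mark every block as free
hdparm --user-master u --security-erase pass /dev/sdX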
Magic exFAT?
After much faffing around, I eventually had the idea of formatting the drive as exFAT, and doing so on the entire block device, rather than having a partition table. This is the way they had formatted it in the factory. My thinking was that perhaps the controller on the drive is capable of understanding the exFAT filesystem and could use that capability to free up unused blocks.
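In other words, formatting the raw device rather than a partition; something like this, again with /dev/sdX as a placeholder:

# exFAT across the whole block device, no partition table, as the factory did
mkfs.exfat /dev/sdX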
At first I thought the exFAT format produced interesting results. It took a couple of minutes to format, and then a little moment to mount, and when I tried writing some data, it suddenly started bursting at 200MiB/s, on and off. When I unmounted it, it took ages to unmount, with the drive’s LED flashing on and off. “Oh good,” I thought, “it must be freeing up used blocks!” Then it dawned on me: the OS had probably enabled write caching for the exFAT filesystem. No wonder it was writing at 200MiB/s if it was writing to memory. The delay when unmounting was just the OS flushing the cache onto the disk!
Trimming sector ranges
hdparm has a --trim-sector-ranges command, which I thought was worth trying.
It uses an LBA:sector-count addressing format. Given that hdparm -I reported a 512-byte sector size, I assumed there were 64000000000/512 = 125000000 sectors, with LBAs therefore ranging from 0 to 124999999.
I tried a range somewhere in the middle of the drive, and it said it succeeded.
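A single-range run looks something like the following; note that hdparm insists on an explicit acknowledgement flag before it will issue a trim (the LBA and device are placeholders):

# Trim 65535 sectors starting near the middle of the drive
hdparm --please-destroy-my-drive --trim-sector-ranges 62500000:65535 /dev/sdX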
The --trim-sector-ranges-stdin command looked like a good way to do the entire drive, but I could not get it to work; it just complained that my ranges had numerical results out of range. Instead, I generated a shell script to run the command repeatedly with different ranges. I cheated and used a spreadsheet to generate the list of commands to run, pasting them into a text editor.
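A shell loop could have done the same job as the spreadsheet; a sketch, assuming 125000000 sectors and a limit of 65535 sectors per range:

# Emit one hdparm command per 65535-sector range across the whole device
for start in $(seq 0 65535 124999999); do
  count=65535
  # Clamp the final range so it doesn't run past the end of the drive
  remaining=$((125000000 - start))
  [ "$remaining" -lt "$count" ] && count=$remaining
  echo "hdparm --please-destroy-my-drive --trim-sector-ranges ${start}:${count} /dev/sdX"
done > trim-whole-drive.sh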
The first attempt worked for about five seconds, but then got stuck, with hdparm waiting for the disk. This didn’t look like it was going to resolve itself, so I unplugged the drive and amended the script to put a half-second delay between the hdparm commands, on the assumption that sending too many in quick succession had crashed the flash drive’s controller. Assuming the time taken by hdparm itself was negligible, this would take about 15 minutes to run, and hopefully not get stuck.
It worked for a little while, but then got stuck again. Subsequent runs seemed to manage only a handful of hdparm --trim-sector-ranges commands before getting stuck, and then the flash drive had to be unplugged and re-plugged before any more such commands would work.
So, I reworked my spreadsheet to produce commands containing batches of 20 ranges, in the hope that these would work better. It turned out I could reliably get three batches of 20 ranges operated upon before the drive seemed to lock up. But because there were far fewer commands than there had been without batching, it was tolerable to just run three commands, unplug, re-plug, wait, run three more, and so on; I managed to go through the whole lot in less than an hour.
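--trim-sector-ranges accepts multiple lba:count pairs in one invocation, so each batched command just strings 20 ranges together; abbreviated to three ranges here:

hdparm --please-destroy-my-drive --trim-sector-ranges 0:65535 65535:65535 131070:65535 /dev/sdX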
The question then was whether it had actually achieved anything! I did notice the drive was getting warmer than it had been. It also seemed like the locking up had less to do with how many commands I was executing and more to do with how long the drive had been plugged in, which suggested it might be doing something to itself that didn’t kick in until a few moments after being plugged in. So, I left it plugged in for a while after executing the final commands.
Promisingly, the disk did not seem to have any partitions after trimming, which was the expected outcome in the event that the trimming was successful.
In the name of reducing future filesystem corruption by means of journalling, I decided to format the drive with an ext4 filesystem. This took a disappointingly and discouragingly long time (minutes), during which I thought further about the implications of an ext4 filesystem and decided it would be better to stick with FAT32, as journalling might wear the drive prematurely. That was assuming that the drive’s speed had improved at all.
Unfortunately, the drive did not seem to get any faster, only slower. It dwindled to complete unusability.
Replacing the USB flash drive
Now realising that I must have completely worn out the flash, I gave up trying to recover the USB flash drive and decided to switch to using a high-endurance microSD card mounted in a USB microSD card reader. Unlike USB flash drives, these are explicitly sold for continuous video recording, high write endurance and challenging environmental conditions.
I also realised that the problem caused by the lack of ‘trim’ is more subtle than I previously made out: it only arises when the disk is not allowed to fill up to capacity.
If existing files are overwritten in a loop-recording style, like a dash cam, or the disk repeatedly fills up completely, gets fully deleted, and fills up again, then the wear on the flash is going to be fairly even.
But if files are regularly deleted, leaving lots of free space in the filesystem, then the writes are likely to be concentrated on the same flash blocks. The flash controller still thinks the whole disk is full, so only the blocks the filesystem chooses for new files ever get written, which is likely to wear the same blocks repeatedly.
Trying to give the replacement the best chance
One benefit of microSD cards is that they do tend to support ‘blockerase’, which can be used like ‘trim’ to tell the card’s controller about free blocks. However, this is not an ongoing process, especially if the card is mounted in a USB reader, since almost all USB readers present the card as a generic mass-storage disk that doesn’t support TRIM, rather than as a raw MMC block device.
So, either I needed to get ‘trim’ working through the USB reader, somehow, or I needed to implement loop recording, or I needed to regularly remove the card, put it in my laptop’s PCI Express SD card reader, and run blockerase upon it.
I decided to implement loop recording, as it was the path of least resistance; the USB card reader did not pass through the ability to run blkdiscard on the SD card.
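With the card in a reader that does expose it as a raw MMC device, such as the laptop’s internal one, the whole-card discard would be a one-liner (device node assumed; this throws away all data on the card):

# Tell the card's controller that every block is free
blkdiscard /dev/mmcblk0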
Loop recording created the challenge of how to selectively copy only the most recent files from the camera. I needed some way to exclude .mjpg files whose names matched those of existing .mp4 files, apart from the file name extension.
rsync doesn’t seem to include a built-in way to filter these files out, but it permits the exclusion of files matching patterns listed in a file, so I decided to generate a list of mjpg files to exclude by using ls and sed to produce a list of the mp4 files with their file extensions changed to mjpg.
# Rotate the previous run's exclude list out of the way
mv /home/patrick/hhcapture/exclude-mjpgs /home/patrick/hhcapture/exclude-mjpgs-last
# List the local files, rewrite any timestamp-named .mp4/.mjpg to its .mjpg
# equivalent, then merge (sort -u) with the previous list to build the new one
ls -1 /home/patrick/tmp/ramdisk | sed -n "{ s/^\([0-9]\{4\}-\(0[1-9]\|1[0-2]\)-\(0[1-9]\|[1-2][0-9]\|30\|31\)T\(0[0-9]\|1[0-9]\|2[0-3]\)[0-5][0-9][0-6][0-9]\)\\.m\(p4\|jpg\)$/\1.mjpg/p }" | sort -u - /home/patrick/hhcapture/exclude-mjpgs-last > /home/patrick/hhcapture/exclude-mjpgs
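The list then feeds into the transfer via rsync’s --exclude-from option; a sketch, with the Home Hub’s hostname and source path as assumptions:

rsync -av --exclude-from=/home/patrick/hhcapture/exclude-mjpgs root@homehub:/mnt/usbdrive/ /home/patrick/tmp/ramdisk/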
This doesn’t cover the situation where the .mp4 file has been deleted, or where no .mp4 file was created from the .mjpg. But generally speaking, the .mp4 files should outlast the .mjpgs, and any .mjpgs that didn’t have an .mp4 created from them were either missed out by mistake (in which case they need transferring again) or were empty files posing very little transfer overhead, which will simply get deleted by mjpg2mp4 again.