Backblaze B2 Data Corruption Bug

Friday, February 26, 2021


In the many years of faithfully trusting my offsite backups (almost 10 TB) to Backblaze B2, I never really questioned its data integrity. It was simply inconceivable that such a widely-used data storage service could possibly return corrupted data. After all, correctly storing and returning data is basically the entire premise of B2, and yet, here we are…

This is the story of how, for an unknown (to me) amount of time, Backblaze B2 returned corrupted data for a small fraction of my files, preventing a successful restore. Luckily, I did not find out this hard fact during disaster recovery.

TL;DR: Backblaze B2 had a bug that caused it to return corrupted data for an unknown length of time, but at least 23 days according to publicly known reports.

Duplicacy

I use Duplicacy to back up my data to Backblaze B2. Duplicacy is an officially recommended B2 client; see the B2 Integrations page. A quick summary of the relevant bits: Duplicacy performs incremental, deduplicated, and encrypted backups by splitting the original files into chunks and storing those chunks in B2. Chunks are represented in B2 as files of around 1-10 MB each.
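
For illustration only, here is a minimal Python sketch of the general idea, with a made-up fixed chunk size and a local directory standing in for B2. Duplicacy's real chunking is variable-size, content-defined, and encrypted, so treat this purely as a conceptual model of how files become content-addressed chunk files of the kind that end up in B2:

import hashlib
from pathlib import Path

CHUNK_SIZE = 4 * 1024 * 1024  # illustrative fixed size; real chunking is variable-size

def store_chunks(path, store_dir):
    # Split a file into chunks, name each by its content hash, and store it only
    # if a chunk with that name doesn't already exist (this is the deduplication).
    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)
    names = []
    with open(path, "rb") as f:
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            name = hashlib.sha256(data).hexdigest()
            target = store / name
            if not target.exists():
                target.write_bytes(data)  # in the real tool: encrypt, then upload to B2
            names.append(name)
    return names  # the snapshot records this list so the file can be reassembled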

In the past, I had periodically run duplicacy check -a to verify that every chunk exists, but this does not verify chunk contents. Downloading data from B2 is relatively costly (about $100 to pull down the full 10 TB), and I figured that if my data made it to B2, it was probably fine…

A few days ago, I decided it was time to perform a full verification of my backup data just to be safe. After downloading a whole lot of data and verifying the bits (and racking up a B2 bill in the process), Duplicacy reported a bunch of failures like this:

$ duplicacy check -r 265 -chunks -threads 32 -persist
...
Failed to decrypt the chunk aec070645fe53ee3b3763059376134f058cc337247c978add178b6ccdfb0019f: cipher: message authentication failed; retrying
Failed to decrypt the chunk aec070645fe53ee3b3763059376134f058cc337247c978add178b6ccdfb0019f: cipher: message authentication failed
...
42 out of 1405935 chunks are corrupted

Duplicacy writes a MAC into each chunk, and the error message tells us that the MAC doesn’t match the chunk’s contents. Somewhere along the pipeline, some bits have changed.
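
To illustrate why even a single flipped bit triggers that error, here is a small standard-library sketch using an HMAC. Duplicacy's actual format uses authenticated encryption rather than a bare HMAC, but the effect is the same: any change to the chunk's bytes makes the stored authentication tag fail to verify.

import hashlib, hmac, os

key = os.urandom(32)      # stand-in for the backup's secret key
chunk = os.urandom(1024)  # stand-in for an encrypted chunk body
tag = hmac.new(key, chunk, hashlib.sha256).digest()  # tag stored with the chunk

corrupted = bytearray(chunk)
corrupted[100] ^= 0x01    # flip a single bit, as if corrupted in storage or transit

for label, body in (("intact", chunk), ("corrupted", bytes(corrupted))):
    ok = hmac.compare_digest(tag, hmac.new(key, body, hashlib.sha256).digest())
    print(label, "->", "authentication ok" if ok else "message authentication failed")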

I then noticed a very odd behavior: if I attempted to verify just those 42 corrupted chunks, a few more would successfully verify; if I ran it again, a few more would verify; and so on, until after about 20 tries, 38 of the 42 corrupt chunks had eventually verified. Duplicacy tries each chunk 4 times before giving up, so the 4 remaining chunks had still failed to verify after roughly 80 download attempts each (about 20 runs × 4 attempts).

I made a post on the Duplicacy forum as I was certain I must have been doing something wrong, or perhaps Duplicacy was buggy. Gilbert, the developer (who is very helpful, by the way), suggested it might be memory corruption and asked that I try again without -threads 32 and to try a different machine. I did so… with the same results.

Backblaze B2

Here’s where it gets interesting with respect to B2. Before proceeding, here’s a little background on Duplicacy and B2:

  • B2’s b2_upload_file API accepts a client-computed SHA1 hash of the file, and it will reject uploads received with a mismatched hash. This should prevent corrupted data from being written to B2 in the first place, unless of course the corruption happens before the client computes the SHA1.
  • Duplicacy always provides this SHA1 on uploads.
  • Upon download, aside from the file contents, B2 also returns the SHA1 that was originally supplied.
  • The official B2 CLI tool checks the hash after download and throws an error if verification fails (a sketch of this check follows the list).
  • Duplicacy does not check the B2 hash on download, but instead checks its own MAC inside the file, which in this case serves the same function: rejection of corrupted downloads.
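
As a rough sketch of that download-side check (this is not the official CLI's code, and it assumes you have already obtained an authorized download URL and auth token from the B2 API), you recompute the SHA1 locally and compare it to the X-Bz-Content-Sha1 header B2 returns, which is the hash the client supplied at upload:

import hashlib
import requests

def download_and_verify(download_url, auth_token):
    # Fetch a file and verify it against the SHA1 that was recorded at upload time.
    resp = requests.get(download_url, headers={"Authorization": auth_token})
    resp.raise_for_status()
    expected = resp.headers["X-Bz-Content-Sha1"]     # hash supplied by the uploader
    actual = hashlib.sha1(resp.content).hexdigest()  # hash of what we actually received
    if actual != expected:
        raise ValueError(f"sha1 checksum mismatch: expected {expected}, got {actual}")
    return resp.content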

I took the list of 38 corrupted-but-later-verified chunks and used the B2 CLI to try to download each chunk one by one. Most of them failed to download, with the CLI tool reporting sha1 checksum mismatch, which should never happen: it means the data was corrupted somewhere between B2's successful acceptance of the upload and the post-download hash verification on the client. Still, 3 of the 38 downloaded successfully.
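
For reference, a small loop along these lines (the bucket name, file list, and paths are placeholders) is enough to drive the same b2 download-file-by-name command shown below over a list of chunk hashes and record which ones fail:

import subprocess

bucket = "my-bucket"                       # placeholder bucket name
with open("corrupted-chunks.txt") as f:    # placeholder: one chunk hash per line
    for line in f:
        chunk = line.strip()
        result = subprocess.run(
            ["b2", "download-file-by-name", bucket,
             f"chunks/{chunk}", f"/tmp/{chunk}"],
            capture_output=True, text=True,
        )
        print(chunk, "ok" if result.returncode == 0 else "FAILED")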

Even more alarmingly, I managed to find a chunk that flip-flopped between sha1 checksum mismatch and successful download within seconds:

[user@host duplicacy]$ b2 download-file-by-name [bucket-name] chunks/b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c 1
1: 100%|███████████████████████████████████| 3.27M/3.27M [00:00<00:00, 44.9MB/s]
ConsoleTool command error
Traceback (most recent call last):
  File "b2/console_tool.py", line 1521, in run_command
  File "b2/console_tool.py", line 690, in run
  File "logfury/v0_1/trace_call.py", line 84, in wrapper
  File "b2sdk/bucket.py", line 170, in download_file_by_name
  File "logfury/v0_1/trace_call.py", line 84, in wrapper
  File "b2sdk/transfer/inbound/download_manager.py", line 122, in download_file_from_url
  File "b2sdk/transfer/inbound/download_manager.py", line 134, in _validate_download
b2sdk.exception.ChecksumMismatch: sha1 checksum mismatch -- bad data
ERROR: sha1 checksum mismatch -- bad data
[user@host duplicacy]$ b2 download-file-by-name [bucket-name] chunks/b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c 2
2: 100%|███████████████████████████████████| 3.27M/3.27M [00:00<00:00, 49.9MB/s]
ConsoleTool command error
Traceback (most recent call last):
  File "b2/console_tool.py", line 1521, in run_command
  File "b2/console_tool.py", line 690, in run
  File "logfury/v0_1/trace_call.py", line 84, in wrapper
  File "b2sdk/bucket.py", line 170, in download_file_by_name
  File "logfury/v0_1/trace_call.py", line 84, in wrapper
  File "b2sdk/transfer/inbound/download_manager.py", line 122, in download_file_from_url
  File "b2sdk/transfer/inbound/download_manager.py", line 134, in _validate_download
b2sdk.exception.ChecksumMismatch: sha1 checksum mismatch -- bad data
ERROR: sha1 checksum mismatch -- bad data
[user@host duplicacy]$ b2 download-file-by-name [bucket-name] chunks/b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c 3
3: 100%|███████████████████████████████████| 3.27M/3.27M [00:00<00:00, 47.1MB/s]
File name:    chunks/b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c
File id:      [file-id]
File size:    3267110
Content type: application/octet-stream
Content sha1: f1d2d2f924e986ac86fdf7b36c94bcdf32beec15
checksum matches

A few other interesting characteristics of these broken files were:

  • For affected chunks, a successful download did not imply further successful downloads. In other words, downloading the same chunk repeatedly may succeed once but then fail repeatedly after that.
  • Failures seemed correlated in time: retrying failed chunks repeatedly in a loop did not seem to help (I tried 4 arbitrarily-chosen chunks 100 times each and got 400 SHA1 mismatches), but waiting hours or days seemed to improve the odds of a successful download.

To rule out anything to do with my local setup, I tried the B2 CLI client not only on my regular backup machine (which has ECC RAM and RAID 6) but also on two other physical machines, one running a different OS and another tethered to my cell data connection instead of my landline Internet. All of them produced the same results.

At this point, it seemed like a bug on Backblaze’s end must be the only explanation, so I submitted a support ticket.

Backblaze’s Response

Backblaze confirmed to me that this issue was affecting multiple customers. Gilbert (the Duplicacy developer) pointed me to issue #3268 on Restic (another backup client, similar to Duplicacy), where users encountered the same data corruption; that thread provides some more detail.

Essentially what is happening is that files in B2 are sharded into 20 pieces, any 17 of which can be used to reconstruct the file, i.e. up to 3 shards can fail before data loss occurs. Backblaze did not verify checksums when reading data to serve a download; instead, an asynchronous job scanned for corruption after the fact. A bad batch of hard drives caused this job to run more slowly than usual, lengthening the window during which corrupted data was being served.
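
As a back-of-the-envelope illustration (my own sketch of the general "verify on read" idea, not Backblaze's actual implementation): a 17-of-20 scheme costs about 1.18x in storage and tolerates 3 lost shards, and verifying on read simply means checking the reconstructed bytes against the checksum recorded at upload before serving them:

import hashlib

DATA_SHARDS, TOTAL_SHARDS = 17, 20
print("storage overhead:", round(TOTAL_SHARDS / DATA_SHARDS, 3))    # ~1.176x
print("shard failures tolerated:", TOTAL_SHARDS - DATA_SHARDS)      # 3

def serve_verified(reconstruct, stored_sha1, attempts=3):
    # Verify-on-read sketch: only return data whose SHA1 matches the checksum
    # recorded at upload time; otherwise try reconstructing again (e.g. from a
    # different combination of shards) instead of serving corrupted bytes.
    for _ in range(attempts):
        data = reconstruct()  # hypothetical callable that rebuilds the file from shards
        if hashlib.sha1(data).hexdigest() == stored_sha1:
            return data
    raise IOError("reconstructed data never matched the stored checksum")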

The good news is that it seems very unlikely any data was permanently lost; however, it's quite surprising to me that Backblaze went so many years without verifying reads, knowingly returning corrupted data on a small percentage of downloads.

On March 1, 2021, a fix was applied which added checksum verification on download. All of my files seem to be readable now, so hopefully this particular issue is fixed for everyone.


