
Add parity file support [$5] #314

Open
kenkendk opened this issue Aug 5, 2014 · 35 comments · May be fixed by #4574

Comments

@kenkendk (Member) commented Aug 5, 2014

From kenneth@hexad.dk on December 07, 2010 22:27:02

If a volume archive is broken, either due to transfer issues or an issue on the backend, the data will likely not be recoverable, especially if it is encrypted.

One solution to this would be to simply store a parity file together with the volumes.

The most prominent parity application I know of is par2: http://www.par2.net/

Original issue: http://code.google.com/p/duplicati/issues/detail?id=314

Want to back this issue? Post a bounty on it! We accept bounties via Bountysource.

@kenkendk (Member Author)

There is an open source implementation here (in Java):
https://www.backblaze.com/blog/reed-solomon/

And one here (in C#):
https://zxingnet.codeplex.com/SourceControl/latest#1530218

@mxxcon commented Sep 28, 2018

Added a $5 bounty to try to get more attention to this feature...

@warwickmm changed the title to "Add parity file support [$5]" on Nov 25, 2018
@BlueBlock (Contributor)

Is parity recovery no longer seen as needed? Or is this concept already covered by a supported compression method?

@Pectojin (Member)

Parity is still a nice feature to add.

We don't have any way to combat block loss other than being lucky that the blocks are still on the machine.

@BlueBlock (Contributor)

> Parity is still a nice feature to add.
>
> We don't have any way to combat block loss other than being lucky that the blocks are still on the machine.

Ok, thanks. I'm thinking of adding par2, at least to the backup process, to prevent bit rot.

I'm thinking the par2 set of files would be zipped so there is a 1:1 file ratio.

That's also why I'm starting with a subfolder file limit of 2000, since I may have another 2000 par2 files in a directory.

@Pectojin (Member)

Would you see it as a separate file so that it's easier to later "add" for existing backups?

I had been thinking a little about how to best deal with that, since we obviously want to grandfather in existing backups without making them reupload everything.

Perhaps parity could be added when it is detected as missing during verification? That way we wouldn't even need to download more data.

@BlueBlock (Contributor) commented Aug 12, 2019

I'm thinking a separate file.

The par files can be toggled by an option and added to an existing backup.

The idea is that it all automagically creates par files and repairs volumes as they travel to and from the remote backend.

I think the adding of par files can be fairly easily accomplished. That could go first.

Then I would look to add auto-recovery during the retrieval of remote files. If a volume file is bad, check for and pull the par files, repair the volume, re-upload the repaired file, and use it for whatever purpose it was pulled.

@BlueBlock (Contributor)

I'm wondering what options to provide:

enable-parity-file : (bool) true/false
parity-file-redundancy : (int) 0-100 to indicate the percentage

From the user's perspective, would just having parity-file-redundancy be enough? If it equals 0 then parity is off, and 1-100 would have it enabled.

Or are both options needed? Having both would allow parity-file-redundancy to have a default of, say, 5.

@drwtsn32x (Contributor)

@BlueBlock - my vote is for a single option. The default for parity-file-redundancy should be 0 (disabled).
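
To make the single-option semantics concrete, here is a minimal hedged sketch. The option name is the one proposed above; the parser is purely illustrative and not Duplicati's actual option plumbing:

```csharp
using System;

// Illustrative parse of the proposed single option:
// absent or 0 => parity disabled; 1-100 => redundancy percentage.
static int ParseParityRedundancy(string? raw)
{
    if (!int.TryParse(raw, out var pct) || pct <= 0)
        return 0; // parity disabled
    if (pct > 100)
        throw new ArgumentOutOfRangeException(nameof(raw),
            "parity-file-redundancy must be between 0 and 100");
    return pct;
}
```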

@BlueBlock (Contributor)

That makes it simple.

I have some ideas for how to allow existing backups to add par2 files, but any approach will mean pulling down a copy of each file in order to add the par2 files.

So far I have par2 file creation for each uploaded backend file. When retrieving a backend file, if the hash does not match, it pulls the par2 file, performs a repair, and uploads the repaired file back to the backend. The repaired file is then used for whatever operation requested it.

I'm figuring even 1% redundancy would be good to protect against possible bit rot. I'd probably run at 5-10% for really important things.
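
A rough sketch of that create/repair flow, shelling out to par2cmdline. Only the `create`/`repair` verbs and the `-r` redundancy flag come from par2cmdline; the class, method names, and file layout are hypothetical:

```csharp
using System;
using System.Diagnostics;

// Hypothetical wrapper around the external par2cmdline binary
// ("par2" on PATH; par2cmdline.exe on Windows).
static class ParityHelper
{
    static int RunPar2(string args)
    {
        var p = Process.Start(new ProcessStartInfo
        {
            FileName = "par2",
            Arguments = args,
            UseShellExecute = false,
        }) ?? throw new InvalidOperationException("could not start par2");
        p.WaitForExit();
        return p.ExitCode; // par2cmdline returns 0 on success
    }

    // Run alongside the upload of each backend volume.
    public static bool CreateParity(string volumePath, int redundancyPercent) =>
        RunPar2($"create -r{redundancyPercent} \"{volumePath}.par2\" \"{volumePath}\"") == 0;

    // Run when a downloaded volume fails its hash check: fetch the matching
    // .par2 file from the backend first, then attempt an in-place repair.
    public static bool TryRepair(string volumePath) =>
        RunPar2($"repair \"{volumePath}.par2\"") == 0;
}
```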

@BlueBlock (Contributor) commented Aug 23, 2019

I like the idea of going with one consolidated option.

For the default value, wouldn't "1" be better? At 1% the parity file is pretty small and offers at least some protection.

Note that since a parity file exists in a 1:1 relationship with backend files, the number of files effectively doubles. This won't be a problem once I have the subfolder work done.

@drwtsn32x (Contributor)

Maybe... The only negative I can think of is if people are using backends that are nearing their max file limit. That may push it over the edge.

@BlueBlock (Contributor)

> Maybe... The only negative I can think of is if people are using backends that are nearing their max file limit. That may push it over the edge.

The subfolder solution will take care of that even for existing backups.

@BlueBlock (Contributor) commented Aug 23, 2019

Would it be useful to add "Parity" to the usage-reporter feature stats?
I haven't looked at the usage-reporter website code. Does anyone have instructions on setting up a dev copy of the site?

@Pectojin (Member)

I believe the usage reporter only runs on Google App Engine.

@BlueBlock (Contributor)

> I believe the usage reporter only runs on Google App Engine.

OK thanks. I'll look at it a bit and see if I can run a copy of it.

@ts678 (Collaborator) commented Aug 25, 2019

> Would it be useful to add "Parity" to the usage-reporter feature stats?

What is this saying? Is this a proposal for the future if the feature gets done, or is it of any use right now?

> OK thanks. I'll look at it a bit and see if I can run a copy of it.

This interests me quite a lot, because so little is known about which Duplicati versions are being used.

The forum threads "Survey for Linux users, which version of Mono do you have installed?" and "Release: 2.0.4.23 (beta) 2019-07-14" both made a pitch for a usage-reporter server-side update to allow access to info it's already collecting. Further discussion should probably be in its own thread though, unless it ties into the parity feature well...

@mxxcon commented Aug 26, 2019

> So far I have par2 file creation for each uploaded backend file. When retrieving a backend file, if the hash does not match, it pulls the par2 file, performs a repair, and uploads the repaired file back to the backend. The repaired file is then used for whatever operation requested it.

What would happen in a compaction situation where some backed-up files/blocks were deleted? Wouldn't the whole par2 become invalid, and wouldn't you have to re-retrieve the whole backup set to recreate the par2 files?

@BlueBlock (Contributor)

> > Would it be useful to add "Parity" to the usage-reporter feature stats?
>
> What is this saying? Is this a proposal for the future if the feature gets done, or is it of any use right now?

Just if it would be useful to track the subfolder feature usage in usage-reporter.

@BlueBlock (Contributor)

> > So far I have par2 file creation for each uploaded backend file. When retrieving a backend file, if the hash does not match, it pulls the par2 file, performs a repair, and uploads the repaired file back to the backend. The repaired file is then used for whatever operation requested it.
>
> What would happen in a compaction situation where some backed-up files/blocks were deleted? Wouldn't the whole par2 become invalid, and wouldn't you have to re-retrieve the whole backup set to recreate the par2 files?

Each backend volume file has a par2 file generated during the upload process, so any changes travel together: a volume produced by compaction gets a fresh par2 file when it is uploaded.

@BlueBlock (Contributor) commented Aug 26, 2019

For Windows I have par2cmdline.exe, but does anyone have a recommendation for handling this on the Linux side as an external dependency? Do we just require the user to install par2cmdline through apt-get, etc.?

@jibsaramnim

> Do we just require the user to install par2cmdline through apt-get, etc.?

I think this would be a good way to go: it is easy to check for the presence of the tool and, if it's not installed but the user wants to use the parity feature, notify them that it needs to be installed.

(I know the question was posed a while ago now, but the issue is still open, I have only just stumbled upon it, and I would love to see parity file support, so I thought I'd add my $0.02. I hope that's alright!)

@duplicatibot

This issue has been mentioned on Duplicati. There might be relevant details there:

https://forum.duplicati.com/t/bountysource-pocketing-peoples-money/10092/1

@Korkman commented Jun 26, 2020

> check for the presence of the tool and notify the user of the need for its installation

This could be done as a warning message when parity is enabled but binaries are not found.
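
A minimal sketch of such a check, probing PATH for the binary. The function name is hypothetical, and the warning call in the comment is a placeholder for whatever logging Duplicati actually uses:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Runtime.InteropServices;

// Detect whether the external par2 binary is reachable via PATH, so a
// warning can be raised when parity is enabled but the tool is missing.
static bool Par2Available()
{
    var exe = RuntimeInformation.IsOSPlatform(OSPlatform.Windows) ? "par2.exe" : "par2";
    return (Environment.GetEnvironmentVariable("PATH") ?? string.Empty)
        .Split(Path.PathSeparator)
        .Where(dir => !string.IsNullOrWhiteSpace(dir))
        .Any(dir => File.Exists(Path.Combine(dir, exe)));
}

// Illustrative use at backup start (logging call is a placeholder):
// if (parityRedundancy > 0 && !Par2Available())
//     Log.Warn("parity is enabled but par2 was not found on PATH");
```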

@samuel-w (Contributor) commented May 9, 2021

There's also a Reed-Solomon implementation for C#: https://github.com/antiduh/ErrorCorrection

Edit: And a C# parchive implementation: https://github.com/heksesang/Parchive.NET
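
For intuition about what such libraries provide: the degenerate single-parity case is plain XOR (as in RAID 5), where one parity shard can rebuild any one lost shard; Reed-Solomon generalizes this to recovering multiple lost shards. A minimal sketch of the XOR case, for illustration only and not a substitute for the libraries above:

```csharp
// Degenerate erasure code: one XOR parity shard over equal-length data shards.
// Any single missing shard equals the XOR of the parity with all survivors.
static byte[] XorParity(byte[][] shards)
{
    var parity = new byte[shards[0].Length];
    foreach (var shard in shards)
        for (int i = 0; i < parity.Length; i++)
            parity[i] ^= shard[i];
    return parity;
}

// Recovery: missing = XorParity(new[] { parity, survivor1, survivor2, ... })
```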

@cmpute commented Jun 28, 2021

> Maybe... The only negative I can think of is if people are using backends that are nearing their max file limit. That may push it over the edge.

I propose implementing a simple Reed-Solomon code, then storing the parity of each dblock in the corresponding dindex file, storing the parity of each dindex file in the dlist file, and finally giving the dlist file its own parity stored as a separate file (perhaps with a suffix like dpar). The total number of files then remains almost the same as before. Although it looks a little cumbersome, it at least guarantees that parity is available for all files.

With the PAR2 standard, the number of files will at least double, somewhat reducing directory efficiency.
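
To restate the proposed layout as a sketch (the `.dpar` suffix is part of this proposal; the dblock/dindex/dlist kinds and name patterns are Duplicati's existing ones):

```csharp
// Where the parity for a given remote file would live under this proposal.
static string ParityLocation(string remoteFileName) =>
    remoteFileName switch
    {
        var n when n.Contains(".dblock.") => "embedded in the paired dindex file",
        var n when n.Contains(".dindex.") => "embedded in the dlist file",
        var n when n.Contains(".dlist.")  => n + ".dpar (separate sidecar file)",
        _ => "no parity",
    };
```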

@mxxcon commented Jun 28, 2021

> I propose implementing a simple Reed-Solomon code, then storing the parity of each dblock in the corresponding dindex file, storing the parity of each dindex file in the dlist file, and finally giving the dlist file its own parity stored as a separate file (perhaps with a suffix like dpar). The total number of files then remains almost the same as before. Although it looks a little cumbersome, it at least guarantees that parity is available for all files.
>
> With the PAR2 standard, the number of files will at least double, somewhat reducing directory efficiency.

I think this would be sub-optimal, because if you decided to change the parity ratio or just stop using it completely, you'd need to reshuffle your whole backup. Whereas if parity were stored in separate files, only those would need to be touched.

Also, if your backup destination is nearing a file limit, I think you have bigger problems to worry about. If your backup suddenly grows in size, you'd be unable to back up anyway.

@cmpute commented Jun 28, 2021

@mxxcon Why is a reshuffle required to change the parity ratio? The parity is calculated from the final zip files; if the parity format changes, you only need to update the parity data stored in the dindex and dlist files. So the only problem is that the dindex files need to be downloaded, but in my experience those files are relatively small.

Using external parity files is also viable, though I don't think Par2 is an elegant approach.

@cmpute commented Jun 28, 2021

I'm not quite familiar with the underlying logic of dindex and dlist, though, so I'd appreciate any corrections to my understanding.

@ts678 (Collaborator) commented Jun 28, 2021

> underlying logic of dindex and dlist

"How the backup process works" gets into dblock and dlist a little, but glosses over dindex, which is just an index to its dblock.
"Developer documentation" and "Local database format" are a bit deeper. The IndexBlockLink table pairs the dblock and dindex files.

There's not a constant relationship between a dlist file and the block storage (dblock and dindex). Compacting files at the backend produces never-before-seen dblock and dindex files (and deletes obsolete ones), all without having to change any dlist files...

Deduplication means that new dlist files can refer to the same blocks if the source didn't change. All of the references are hashes.
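
As an illustration of that pairing, a hedged query against the local database. The table and column names follow my reading of the "Local database format" docs and should be verified against an actual database; the database path is a placeholder:

```csharp
using System;
using Microsoft.Data.Sqlite;

// List which dindex file pairs with which dblock file via IndexBlockLink.
using var db = new SqliteConnection("Data Source=backup.sqlite");
db.Open();
using var cmd = db.CreateCommand();
cmd.CommandText = @"
    SELECT idx.Name, blk.Name
    FROM IndexBlockLink l
    JOIN Remotevolume idx ON idx.ID = l.IndexVolumeID
    JOIN Remotevolume blk ON blk.ID = l.BlockVolumeID";
using var reader = cmd.ExecuteReader();
while (reader.Read())
    Console.WriteLine($"{reader.GetString(0)} <-> {reader.GetString(1)}");
```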

@cmpute commented Jun 28, 2021

@ts678 Thanks for the key information! A few points are still unclear to me:

  1. Is a dindex mandatory for each dblock file? Is the relationship between the dindex and dblock guaranteed to be 1-to-1?
  2. So the dindex is only used for reconstructing the database? Then a dindex itself is actually redundant information?
  3. My idea is to store the parity of each dblock in its dindex, and have the dlist only store the parity of that parity data in the dindex files (i.e. second-order parity). So it won't introduce any problem if a new dlist is created referring to different blocks.

@ts678 (Collaborator) commented Jul 2, 2021

> Is a dindex mandatory for each dblock file?

No, but it should be there. A missing dindex will be recreated from DB information. A DB recreate will prefer dindex over dblock (because they're smaller), but will search all the dblocks if it has to. Duplicati.CommandLine.RecoveryTool uses dblock, not dindex. This makes it more likely to work if a dindex is messed up. If a dblock is messed up, the source data that it holds is unavailable...

> Is the relationship between the dindex and dblock guaranteed to be 1-to-1?

I don't think anything guarantees it, but it's typical and it should be so. IndexBlockLink is 1-to-1, but files can be mismatched. A mismatch is pretty much guaranteed in the middle of a backup, because the dindex and dblock don't finish uploading at the same time... There are some states that try to keep track of all this and clean up remote messes, e.g. if a backup got interrupted before it finished.

> So the dindex is only used for reconstructing the database? Then a dindex itself is actually redundant information?

Reconstructing a partial temporary database is also used for "Direct restore from backup files". Without dindex, if you had a 10 TB backup and you wanted to direct-restore a 1 KB file, you would download 10 TB looking for the right blocks.

C:\ProgramData\Duplicati\duplicati-2.0.5.114_canary_2021-03-10>Duplicati.CommandLine.exe help index-file-policy
  --index-file-policy (Enumeration): Determines usage of index files
    The index files are used to limit the need for downloading dblock files
    when there is no local database present. The more information is recorded
    in the index files, the faster operations can proceed without the
    database. The tradeoff is that larger index files take up more remote
    space and which may never be used.
    * values: None, Lookup, Full
    * default value: Full

so yes, it's redundant, but it's also very useful.

> have the dlist only store the parity of that parity data in the dindex files (i.e. second-order parity)

Then where is the parity for the whole dindex file? How do we get to the parity data in the dindex if the dindex file itself is corrupted?

> So it won't introduce any problem if a new dlist is created referring to different blocks.

I don't follow. This proposal assumes a dlist is tied to a dindex, but that isn't so. The dblock and dindex files will change during compact, which could affect any number of dlist files. Typically a backup creates many dblock and dindex files; which dindex is a dlist tied to?

@cmpute commented Jul 2, 2021

@ts678 Thanks for your excellent explanation! I read through the code these days and found that my initial understanding of the dindex files was partially incorrect (it would be great if the wiki provided more information on this). Indeed, I was thinking that the dlist is associated with dindex files, but actually dlist and dindex have no direct connection; they are both connected to dblock volumes.

So I plan to continue with what @BlueBlock proposed, with a little modification. A separate PR was created.

@duplicatibot

This issue has been mentioned on Duplicati. There might be relevant details there:

https://forum.duplicati.com/t/the-compacting-process-is-very-dangerous/10832/17
