
Monday, June 20, 2016

dedup

I'm trying to get the Eclipse IDE (Integrated Development Environment) set up and working on my recently repaired Linux system, and simultaneously I want to become familiar with the ins and outs of its operation. Nothing better for this than tackling a small programming project. I have an old program that I've been meaning to port to Linux, so this looks like the right time and place to tackle it. dedup is not a fancy program, but it can be useful, especially for someone like me who copies whole directories willy-nilly just to make sure I have saved a copy of what I need.

I started working on this program about a week ago, but I was so muddle-headed I wasn't making any headway. I don't know if my allergies were acting up or I had some kind of bug, but in any case I wasn't sleeping well, and the result was that my head felt like it was full of wool. This weekend things cleared up a bit and I was able to think clearly about this program. And then, because I have found that writing down what I am trying to accomplish often helps clarify the problem, I wrote up the following explanation. I have since implemented my revised plan. It needs a little more testing and polishing, but I expect to post it before too long.


dedup is a program for deleting duplicate files

Given a starting point on a directory tree, it follows each branch to the end, then backs up to the last node and takes the next branch.
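
For concreteness, here is a minimal sketch of that walk using the POSIX opendir/readdir interface. The post doesn't show the program's actual code, so the function names here (walk, on_file) are my own:

    #include <stdio.h>
    #include <string.h>
    #include <dirent.h>
    #include <sys/stat.h>

    static void on_file(const char *path)   /* placeholder for the per-file work */
    {
        printf("file: %s\n", path);
    }

    static void walk(const char *dir)
    {
        DIR *d = opendir(dir);
        if (!d)
            return;
        struct dirent *e;
        while ((e = readdir(d)) != NULL) {
            if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
                continue;                    /* skip self and parent links */
            char path[4096];
            snprintf(path, sizeof path, "%s/%s", dir, e->d_name);
            struct stat st;
            if (lstat(path, &st) != 0)
                continue;
            if (S_ISDIR(st.st_mode))
                walk(path);                  /* follow the branch to its end */
            else if (S_ISREG(st.st_mode))
                on_file(path);               /* found a file */
        }
        closedir(d);
    }

    int main(int argc, char **argv)
    {
        walk(argc > 1 ? argv[1] : ".");
        return 0;
    }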

When it encounters a file, as opposed to a directory, which could be anywhere on its travels, it marks the spot and then proceeds to explore the rest of the tree. If it encounters a file with the same name, it compares the lengths of the two files and then, if they are the same, it compares their contents. If the contents are identical, it deletes the most-recently-encountered file.
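
A sketch of that comparison step might look like the following; the names same_contents and delete_if_duplicate are my own, and the byte-by-byte comparison is just one straightforward way to do it:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/stat.h>

    /* byte-by-byte comparison, only attempted once the lengths match */
    static int same_contents(const char *a, const char *b)
    {
        FILE *fa = fopen(a, "rb"), *fb = fopen(b, "rb");
        int same = (fa != NULL && fb != NULL);
        while (same) {
            int ca = fgetc(fa), cb = fgetc(fb);
            if (ca != cb)
                same = 0;               /* differ, or one file ended early */
            else if (ca == EOF)
                break;                  /* both ended together: identical */
        }
        if (fa) fclose(fa);
        if (fb) fclose(fb);
        return same;
    }

    /* 'first' was found earlier in the walk, 'later' just now */
    static void delete_if_duplicate(const char *first, const char *later)
    {
        struct stat s1, s2;
        if (stat(first, &s1) != 0 || stat(later, &s2) != 0)
            return;
        if (s1.st_size == s2.st_size && same_contents(first, later))
            unlink(later);      /* drop the most-recently-encountered copy */
    }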

It then returns to the previously marked spot and resumes its exploration of the directory tree.

The first version of this program returns to the 'previously marked spot' by starting over and counting the number of files it encounters. For the first file this won't take much time or trouble, but if the directory tree contains a large number of files, by the time you get towards the end it could take a very long time.

As a matter of fact, this version of the program will completely traverse the directory tree once for each unique file found therein, so the running time grows roughly with the square of the number of files. It is not what you would call time efficient, but then it doesn't use very much memory.
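
To make the restart-and-count scheme concrete, here is a sketch under the same assumptions as the walk above (the name nth_file is mine): to get back to file number k, you start over at the root and count regular files until you reach it.

    #include <stdio.h>
    #include <string.h>
    #include <dirent.h>
    #include <sys/stat.h>

    /* Walk from 'dir', counting down *k at each regular file; when *k
       reaches zero, report that file's path in 'out'. */
    static void nth_file(const char *dir, long *k, char *out, size_t outlen)
    {
        DIR *d = opendir(dir);
        if (!d)
            return;
        struct dirent *e;
        while (*k >= 0 && (e = readdir(d)) != NULL) {
            if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
                continue;
            char path[4096];
            snprintf(path, sizeof path, "%s/%s", dir, e->d_name);
            struct stat st;
            if (lstat(path, &st) != 0)
                continue;
            if (S_ISDIR(st.st_mode))
                nth_file(path, k, out, outlen);
            else if (S_ISREG(st.st_mode) && (*k)-- == 0)
                snprintf(out, outlen, "%s", path);   /* the marked spot */
        }
        closedir(d);
    }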


A slightly improved version would 'mark the spot' by writing down its current location in the directory tree, and then, when it had finished processing one file, it could go immediately back to the 'marked spot' without having to traverse the entire tree all over again. This would speed things up by approximately a factor of two, and it wouldn't require any more memory, or at least not enough to notice.

Another way to do this would be to record the name of every file and the pathname of the directory where that file is stored. A simple sort operation on the list of filenames would let you quickly identify candidates for comparison and possible deletion. The problem with this is that if you have a very large number of files, the list could take a large amount of memory. However, computers these days have large amounts of memory, so the speed advantage should make this worthwhile.

So the way we are going to do this is to traverse the entire directory tree once, recording the names and paths of all the files as we go. We then sort the list of filenames, scan the list looking for duplicates, compare the metadata of any files that share a name, and, if warranted, compare the files themselves. If they are identical, we delete the newer of the two files.
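
In outline, and again with names of my own invention (struct entry, by_name), the plan looks something like this:

    #include <stdlib.h>
    #include <string.h>

    struct entry {
        const char *name;       /* the filename itself */
        const char *path;       /* pathname of the directory holding it */
    };

    static int by_name(const void *a, const void *b)
    {
        return strcmp(((const struct entry *)a)->name,
                      ((const struct entry *)b)->name);
    }

    /* after the walk has filled entries[0..n-1] */
    static void find_duplicates(struct entry *entries, size_t n)
    {
        qsort(entries, n, sizeof entries[0], by_name);
        for (size_t i = 1; i < n; i++)
            if (strcmp(entries[i - 1].name, entries[i].name) == 0) {
                /* candidates: same name. Compare sizes, then contents,
                   then delete the newer copy, as sketched earlier. */
            }
    }

Sorting brings identical filenames next to each other, so the scan for duplicates only has to look at adjacent entries.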

Filenames and paths are stored in a string table. Each entry in our list of filenames will then simply hold two pointers into this string table, one to the pathname and one to the filename.
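
Here is a sketch of one way such a string table might look; the names are mine. I've taken a small liberty and used offsets instead of raw pointers, since the table may move in memory if it has to grow:

    #include <stdlib.h>
    #include <string.h>

    static char  *strtab;        /* one big block holding all the strings */
    static size_t strtab_used;   /* bytes consumed so far */
    static size_t strtab_cap;    /* bytes currently allocated */

    /* copy s (with its NUL) into the table and return its offset */
    static size_t strtab_add(const char *s)
    {
        size_t len = strlen(s) + 1;
        while (strtab_used + len > strtab_cap) {
            strtab_cap = strtab_cap ? strtab_cap * 2 : 4096;
            char *p = realloc(strtab, strtab_cap);
            if (!p)
                exit(1);         /* out of memory: give up, for this sketch */
            strtab = p;
        }
        memcpy(strtab + strtab_used, s, len);
        strtab_used += len;
        return strtab_used - len;
    }

    struct entry {
        size_t path_off;   /* offset of the directory pathname */
        size_t name_off;   /* offset of the filename */
    };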

It might be more memory efficient to break up the pathnames into sequences of strings and store the indices for those individual segments, but that would complicate matters: we would need to constantly search for previously used pathname segments, which would mean a binary search routine and an insert routine. I'm not sure the memory saved would be worth the effort involved in coding or execution.

The last problem is how to decide how much memory to allocate in the first place, and then, if you discover it is not enough, how to allocate more in such a manner that all your work is transferred transparently to the new allocation.
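
For what it's worth, the usual C answer is realloc, which handles the transparent transfer itself by copying the old contents into the new, larger block. A sketch, reusing the hypothetical entry type from above:

    #include <stdlib.h>

    struct entry { size_t path_off, name_off; };   /* as in the earlier sketch */

    static struct entry *entries;
    static size_t entry_count, entry_cap;

    static int add_entry(struct entry e)
    {
        if (entry_count == entry_cap) {
            size_t newcap = entry_cap ? entry_cap * 2 : 1024;  /* double when full */
            struct entry *p = realloc(entries, newcap * sizeof *p);
            if (!p)
                return -1;     /* out of memory; the old array is still intact */
            entries = p;       /* old contents already copied by realloc */
            entry_cap = newcap;
        }
        entries[entry_count++] = e;
        return 0;
    }

Doubling the capacity each time keeps the number of reallocations small, so the copying cost stays negligible even for a large number of files.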
