Report directories with contents that exist elsewhere even if scattered












I want to generate a report of directories that I know I can safely delete (even if each requires a quick manual verification), because I know that their full contents, all the way down, exist elsewhere--even if, and especially if, the duplicate files are scattered randomly over the volume, possibly in wildly different directory layouts, among files that don’t exist in the directory in question.



In other words, the directory structure and contents won’t be identical. But 100% of the contained files, individually, will be duplicated...somewhere, anywhere, on the same FS.



Given my workflow and use-case below, it should be clear this will almost always be a one-way relationship. 100% of the file content of dir1 may exist elsewhere, with different file names and directory structures, often more than one copy per file.



For example, copies of dir1/file1 may exist in dir2 and dir3. Copies of dir1/file2 may exist in dir2 and dir4. dir2, dir3, and/or dir4 may also contain their own unique files, as well as copies of files from other directories. But dir1 can most definitely be safely deleted.



In other words, the relationship isn’t symmetric: dir1 has 100% redundancy scattered about, but dir2, dir3, dir4, etc. won’t necessarily. (They might, and therefore might also be deletion candidates themselves, but the main candidate in question for now is dir1.)



The rest of this question isn’t strictly necessary for understanding and answering it; it just addresses some tangential “why?” and “have you tried…?” questions.



Here's the use case generating the need, which actually seems to be fairly common (or at least not uncommon), at least with variations on the end result:




  1. On location:


    1. I take GBs of photos and videos.

    2. Each day, I move the files from memory cards, into folders organized by camera name and date, onto a redundant array of portable USB HDDs.

    3. When I have time, I organize copies of those files into a folder structure like “(photos|videos)/year/date”, with filenames prepended with “yyyymmdd-hhmmss”. (In other words, the original structure gets completely scrambled, and not always in such a predictable way.) Those organized copies go on an SSD for a faster workflow, but I leave the original, unmanaged copies on the slow redundant storage for backup, with the copies physically separated except during the copying step.



  2. Back at home:


    1. I move all the unmanaged files from the USB HDD array to a "permanent" (larger, more robust, and continuously cloud-backed-up) array, as an original source of truth in case my workflow goes sideways.

    2. Do post-processing on the organized copies on the SSD. (Leaving the original raw files untouched other than being renamed--and saving changes to new files.)

    3. Once I'm finished and have done whatever was intended with the results, I move the entire SSD file structure onto the same larger "permanent" array as the originals. (But remember, the directory structure is completely different from the original SD card-dump structure.)




Ideally in that workflow, I'd also delete the original card-dump folders that are now unnecessary. The problem is, as in life, my workflow is constantly interrupted. Either I don't have time to finish organizing on location, or it gets put on hold for a while once home, or I don't organize exactly the same way every time, or I just get confused about what exists where and am afraid to delete anything. Oftentimes before heading out, I’ll do a copy of the portable media onto the permanent array just in case, even if I suspect it may already exist 2 or 3 times already. (I’m not OCD. Just scarred by experience.)

Sometimes (less so in later years) I reorganize my entire logical directory structure. Other times I update it midstream going forward, leaving previous ones alone. Over the many years I’ve also moved and completely lost track of where (and how) the “card-dump” files go. Sometimes my on-location workflow, as well-defined and tested as it is, results in uncertain states of various folders, so I make even more backup copies “just in case”.

I also used to write programs that would create thousands of folder symlinks in order to view my massive folder structure differently (like a filesystem “pivot table”). But then I later rsync'ed the whole filesystem to replacement arrays while forgetting to set the “preserve hardlinks and symlinks” flags, wound up with full copies of what previously were clearly just links, and then over time lost track of which were actually the originals. (Try doing this with 20 years of photos/videos, and 30 years of other data, with better results!)



In other words, I have millions of large files all over the place, most of them unnecessarily redundant, in a big beautiful mess. And I need to fix it. Not just to save the space (long since taken care of), but to reduce the confusion of what is safely (and more importantly canonically) where. The first step in this, for me, is to delete the thousands of folders whose content I know with high confidence (though not necessarily certainty) is 100% distributed elsewhere--even if each deletion candidate requires quick manual verification.



What’s humanly impossible in one lifetime is generating the initial list. Ideally, the list would be “all files in this directory exist elsewhere but in a different directory layout, with those directories also containing non-matching files”. But at minimum, “all files in this directory also exist elsewhere”.
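
To make that concrete, here's a rough brute-force sketch of that minimum report ("all files in this directory also exist elsewhere"). It assumes GNU find/xargs/sha256sum plus awk, hashes every file exactly once, and then checks, for each directory, whether every file under it has an identical copy at some path outside it. It's purely illustrative--as far as I can tell, none of the tools below provide this--and hashing everything up front is exactly the brute force a real tool would avoid by pre-filtering on size first:

    #!/usr/bin/env bash
    # Sketch only: list directories in which every regular file has an
    # identical copy somewhere outside that directory. File names containing
    # newlines or backslashes are not handled.
    set -euo pipefail
    ROOT=${1:-.}

    find "$ROOT" -type f -print0 | xargs -0 -r sha256sum | awk '
    {
        hash = substr($0, 1, 64)    # sha256sum output: 64 hex chars, 2-char gap, path
        path = substr($0, 67)
        file_hash[path] = hash
        hash_paths[hash] = hash_paths[hash] SUBSEP path

        # register the file with every ancestor directory of its path
        n = split(path, comp, "/")
        d = comp[1]
        if (n > 1 && d != "")
            dir_files[d] = dir_files[d] SUBSEP path
        for (i = 2; i < n; i++) {
            d = d "/" comp[i]
            dir_files[d] = dir_files[d] SUBSEP path
        }
    }
    END {
        for (d in dir_files) {
            ok = 1
            nf = split(dir_files[d], fl, SUBSEP)
            for (i = 2; i <= nf; i++) {             # fl[1] is empty padding
                np = split(hash_paths[file_hash[fl[i]]], pl, SUBSEP)
                external = 0
                for (j = 2; j <= np; j++)           # any copy NOT under d?
                    if (index(pl[j], d "/") != 1) { external = 1; break }
                if (!external) { ok = 0; break }
            }
            if (ok) print d
        }
    }'

Nested directories that all qualify will each be reported; sorting the output and keeping only the topmost entries would trim that.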



I've researched and tested about a dozen deduplication solutions, some of which come tantalizingly close to also solving this problem--but not close enough. My "permanent" array has had inline ZFS deduplication enabled full-time for years. Even though it cuts write throughput to about 25% of normal, I can afford to wait--but I can't afford the many thousands of dollars in extra drive space that would be needed for a couple of decades of twice- and sometimes thrice-duplicated photo and video data (not to mention being stored on a stripe of 3-way mirrors).



I've just provisioned a local automatic backup array (to complement cloud backup). I went with Btrfs RAID1 to avoid the potential problem of the same storage software hitting the same bugs at the same time (which has happened to me before with ZFS, fortunately only resulting in a temporary inability to mount). This solution also has the beautiful feature of being able to easily scale the array up or out a single disk at a time :-), which is good because that is a very expensive and time-consuming proposition on my big primary ZFS array.



Anyway, the only reason that’s relevant to the question is that Btrfs has a plethora of good utilities for offline deduplication, some of which, as I said, come tantalizingly close to solving this problem--but not close enough. A quick summary of what I've tried:





  • rdfind: Fast matching algorithm, great for deduplication via hardlinks. The problem is that hardlinking could result in disaster for any user (all users?). While partially OK for my distinctly separate requirement of saving space among large redundant media files regardless of name or location, I found it disastrous for other things that can't easily be untangled: it also hardlinks identical files that have no business being the same file, such as the various metadata files that OSes and applications automatically generate (e.g. "Thumbs.db"), most of which are the same across hundreds or thousands of directories but absolutely have to be able to be different. Referencing the same file can and almost certainly will result in losing data later--possibly trivially, possibly not. It does have an option for deduplicating via Btrfs reflinks (which can later diverge with CoW), but that feature is marked "experimental".


  • duperemove: Dedupes with Btrfs reflinks, so that's an acceptable (great, even) approach for saving disk space while allowing files to diverge later. (Except that currently Btrfs apparently un-deduplicates files when defragmenting [depending on the kernel?], even snapshots. A terrible quirk, but I avoid it by never defragmenting and accepting the consequences.) The problem with duperemove is that, since it blindly checksums every file in the search, it’s incredibly slow and works the disks long and hard--basically a poor man’s array scrub. It takes several days on my array. (bedup, bees, and some others are similar in that regard, even if very different in other ways. rdfind and some others are smarter: they first compare file sizes, then the first few bytes, then the last few bytes, and only when all of those match do they resort to a checksum; see the sketch below.)


  • rmlint: This currently seems the best fit for my other requirement of just saving disk space. It has two options for Btrfs reflinking (kernel-mode atomic cloning, and the slightly less robust 'cp --reflink' method). The scanning algorithm is the fastest I've tested; hashing can be bumped up to sha256 and higher (including bit-for-bit); and it has many useful options to satisfy many of my requirements. (Except, as best as I can tell, the one in this question.)


There are many other deduping utilities, including fdupes, fslint, etc. I’ve pretty much tested (or read about) them all, even if they don’t have Btrfs support (since that’s mostly irrelevant to this question). None of them, with the possible exception of rmlint, come close to doing what I need.
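
As an aside, the cheap-filters-first strategy mentioned for rdfind above is easy to sketch. The following is only an illustration of the idea (not rdfind's actual code, and the probe sizes are arbitrary): candidates are grouped by file size plus the first and last 16 bytes, and only files that agree on all of that get fully checksummed.

    # Illustration only (not rdfind's implementation): hash a file only after
    # the cheap probes--size, first bytes, last bytes--already match another file.
    probe() {   # print "size:first-16-bytes-hex:last-16-bytes-hex"
        local f=$1 size first last
        size=$(stat -c %s -- "$f")
        first=$(head -c 16 -- "$f" | od -An -tx1 | tr -d ' \n')
        last=$(tail -c 16 -- "$f" | od -An -tx1 | tr -d ' \n')
        printf '%s:%s:%s\n' "$size" "$first" "$last"
    }

    declare -A seen     # cheap signature -> first path seen with it
    find "${1:-.}" -type f -print0 | while IFS= read -r -d '' f; do
        sig=$(probe "$f")
        if [[ -n ${seen[$sig]:-} ]]; then
            # probes match: only now pay for full checksums
            h1=$(sha256sum -- "$f" | cut -d' ' -f1)
            h2=$(sha256sum -- "${seen[$sig]}" | cut -d' ' -f1)
            [[ $h1 == "$h2" ]] && printf 'duplicate: %s == %s\n' "$f" "${seen[$sig]}"
        else
            seen[$sig]=$f   # a real tool keeps the whole group, not just the first hit
        fi
    done

That ordering is what lets rdfind-style scanners avoid the checksum-everything pass that makes duperemove take days here.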










Tags: btrfs, zfs, storage, deduplication, copy-on-write






asked Aug 22 '18 at 19:17 by Jim
edited Feb 9 at 0:43 by Rui F Ribeiro






















1 Answer






































You can use a program like fdupes to create a hard link to one file from two identical files. This already has the benefit of saving space on your disk.



After you do this, if you have one directory that contains only files with a link count greater than one, you know that each of the files exists somewhere else on the disk.



To find the directories containing only files with a link count greater than one, you can use find to get a list of all directories, then use find again to eliminate the directories that contain files with link count one.

This example doesn't address spaces in file or directory names.



    for dir in `find . -type d`; do
      # report $dir only if it contains no file with link count 1
      if test -z "$(find "$dir" -maxdepth 1 -links 1 -print -quit)"; then
        echo "$dir"
      fi
    done





answered Aug 22 '18 at 19:34 by RalfFriedl; edited Aug 22 '18 at 19:50


























• With your first paragraph, I thought it was a knee-jerk answer of "try dedup program X", which I wrote multiple times I've tried--and actively rely on for the absolutely critical related requirement of slashing drive space consumption. (But that doesn't answer this question.) But you totally redeemed yourself with the second paragraph! Brilliant, especially in its simplicity! Why didn't I think of that? Only one catch: how to automate that and generate some kind of report or list?
  – Jim, Aug 22 '18 at 19:42











• Also, for others reading this answer, note that hardlinks can be--and almost certainly are--a disastrous solution when applied to a large subset of a filesystem, unless you use a utility that can focus solely on specific filetypes and/or a minimum filesize. I go over why in my question. The short reason: you wind up unintentionally making things the same object that absolutely are not logically the same--including thousands of files you may not even be aware exist or that you rely on. Using Btrfs reflinks instead of hardlinks solves that particular problem, with similar results in most other ways.
  – Jim, Aug 22 '18 at 19:47













• I added some commands to find such directories.
  – RalfFriedl, Aug 22 '18 at 19:50











• So close! (And possibly good enough.) The only problem is "-maxdepth 1". Ideally, the files in the entire subtree would also have to have >1 link (at the highest level that satisfies, and no lower). I'd have to test removing that flag. I think it would generate an accurate list of directories, but one highly redundant in terms of nesting. (E.g. it would report dir1/sub1dir1/sub2dir1, dir1/sub1dir1, and dir1. Odious when involving thousands of results.) Granted, smarter might be orders of magnitude more complex! I may have to accept this answer though, and either way, thanks, and brilliant!
  – Jim, Aug 22 '18 at 20:02











• The second challenge could be solved with string work. E.g. generate a simple list of folders with the solution in your answer. Next, go through that list in two nested loops, removing lines that are substrings of other lines. You'd have to redo the entire "scan and snip" operation after every folder deletion, as link counts will change after each folder deletion, randomly throughout the entire folder structure. It will be PAINFULLY slow in bash! But highly useful. I'll accept your answer, and if I script a coherent total solution I will add an FYI answer with it (if I have enough points to).
  – Jim, Aug 22 '18 at 20:14
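
Following up on those last two comments, here is an untested sketch of how the whole-subtree check plus "keep only the topmost directory" pruning could be strung together. The file names candidates.txt and topmost.txt are just illustrative; -type f is an addition that skips symlinks; and, as with the answer's one-level version, a file whose only other links live inside the same subtree still passes the test, so the quick manual verification step stays.

    # 1) Whole-subtree variant of the answer's check: keep a directory only if
    #    no regular file anywhere below it has link count 1. Directories that
    #    contain no regular files at all are also reported.
    find . -type d -print0 | while IFS= read -r -d '' dir; do
        if test -z "$(find "$dir" -type f -links 1 -print -quit)"; then
            printf '%s\n' "$dir"
        fi
    done > candidates.txt

    # 2) Prune to the topmost candidates. After a C-locale sort, a directory
    #    always precedes its descendants, so a line can be dropped as soon as
    #    some already-kept line is a proper ancestor of it.
    LC_ALL=C sort candidates.txt | awk '
        { for (i = 1; i <= n; i++) if (index($0, kept[i] "/") == 1) next }
        { kept[++n] = $0; print }
    ' > topmost.txt

As the last comment notes, link counts change after every deletion, so the whole pass would have to be re-run between deletions.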











          Your Answer








          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "106"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: false,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: null,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f464218%2freport-directories-with-contents-that-exist-elsewhere-even-if-scattered%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          1














          You can use a program like fdupes to create a hard link to one file from two identical files. This already has the benefit to save space on your disk.



          After you do this, if you have one directory that contains only files with a link count greater than one, you know that each of the files exists somewhere else on the disk.



          To find the directories with only files with link count greater than one, you can use find to get a list of all directories, then again use find to eliminate the directories that contain files with link count one.



          This example doesn't address spaces in file or directory names.



          for dir in `find . -type d`; do
          if test -z "$(find $dir -maxdepth 1 -links 1 -quit)"; then
          echo $dir
          fi
          done





          share|improve this answer


























          • With your first paragraph, I thought it was a knee-jerk answer of "try dedup program X", which I wrote multiple times I've tried--and actively rely on for the absolutely critical related requirement of slashing drive space consumption. (But doesn't answer this question.) But you totally redeemed yourself with the second paragraph! Brilliant, esp in it's simplicity! Why didn't I think of that? Only one catch: How to automate that and generate some kind of report or list?

            – Jim
            Aug 22 '18 at 19:42











          • Also, for others reading this answer, note that hardlinks can and almost certainly are a disastrous solution when applied to a large subset of a filesystem, unless you use a utility that can focus solely on specific filetypes an/dor minimum filesize. I go over why, in my question. The short reason: you wind up unintentionally making things the same object, that absolutely are not logically the same--including thousands of files you may not even be aware exists or that you rely on. Using Btrfs reflinks instead of hardlinks solves that particular problem with similar results in most other ways.

            – Jim
            Aug 22 '18 at 19:47













          • I added some commands to find such directories.

            – RalfFriedl
            Aug 22 '18 at 19:50











          • So close! (And possibly good enough.) The only problem is "-maxdepth 1". Ideally, the files in the entire subtree would also have to have >1 link. (At the highest level that satisfies and no lower.) I'd have to test removing that flag. I think it would generate an accurate list of directories, but highly redundant in terms of nesting. (E.g. would report dir1/sub1dir1/sub2dir1, dir1/sub2dir1, and dir1. Odious when involving thousands of results.) Granted, smarter might be orders of magnitude more complex! I may have to accept this answer though, and either way, thanks, and brilliant!

            – Jim
            Aug 22 '18 at 20:02











          • The second challenge could be solved with string work. E.g. generate a simple list of folders with the solution in your answer. Next, go through that list in two nested loops, removing lines that are substrings of other lines. You'd have to do the entire "scan an snip" operation after every folder deletion, as link counts will change after each folder deletion, randomly throughout the entire folder structure. Will be PAINFULLY slow in bash! But highly useful. I'll accept your answer, and if I script a coherent total solution will add an FYI answer with it (if I have enough points to).

            – Jim
            Aug 22 '18 at 20:14
















          1














          You can use a program like fdupes to create a hard link to one file from two identical files. This already has the benefit to save space on your disk.



          After you do this, if you have one directory that contains only files with a link count greater than one, you know that each of the files exists somewhere else on the disk.



          To find the directories with only files with link count greater than one, you can use find to get a list of all directories, then again use find to eliminate the directories that contain files with link count one.



          This example doesn't address spaces in file or directory names.



          for dir in `find . -type d`; do
          if test -z "$(find $dir -maxdepth 1 -links 1 -quit)"; then
          echo $dir
          fi
          done





          share|improve this answer


























          • With your first paragraph, I thought it was a knee-jerk answer of "try dedup program X", which I wrote multiple times I've tried--and actively rely on for the absolutely critical related requirement of slashing drive space consumption. (But doesn't answer this question.) But you totally redeemed yourself with the second paragraph! Brilliant, esp in it's simplicity! Why didn't I think of that? Only one catch: How to automate that and generate some kind of report or list?

            – Jim
            Aug 22 '18 at 19:42











          • Also, for others reading this answer, note that hardlinks can and almost certainly are a disastrous solution when applied to a large subset of a filesystem, unless you use a utility that can focus solely on specific filetypes an/dor minimum filesize. I go over why, in my question. The short reason: you wind up unintentionally making things the same object, that absolutely are not logically the same--including thousands of files you may not even be aware exists or that you rely on. Using Btrfs reflinks instead of hardlinks solves that particular problem with similar results in most other ways.

            – Jim
            Aug 22 '18 at 19:47













          • I added some commands to find such directories.

            – RalfFriedl
            Aug 22 '18 at 19:50











          • So close! (And possibly good enough.) The only problem is "-maxdepth 1". Ideally, the files in the entire subtree would also have to have >1 link. (At the highest level that satisfies and no lower.) I'd have to test removing that flag. I think it would generate an accurate list of directories, but highly redundant in terms of nesting. (E.g. would report dir1/sub1dir1/sub2dir1, dir1/sub2dir1, and dir1. Odious when involving thousands of results.) Granted, smarter might be orders of magnitude more complex! I may have to accept this answer though, and either way, thanks, and brilliant!

            – Jim
            Aug 22 '18 at 20:02











          • The second challenge could be solved with string work. E.g. generate a simple list of folders with the solution in your answer. Next, go through that list in two nested loops, removing lines that are substrings of other lines. You'd have to do the entire "scan an snip" operation after every folder deletion, as link counts will change after each folder deletion, randomly throughout the entire folder structure. Will be PAINFULLY slow in bash! But highly useful. I'll accept your answer, and if I script a coherent total solution will add an FYI answer with it (if I have enough points to).

            – Jim
            Aug 22 '18 at 20:14














          1












          1








          1







          You can use a program like fdupes to create a hard link to one file from two identical files. This already has the benefit to save space on your disk.



          After you do this, if you have one directory that contains only files with a link count greater than one, you know that each of the files exists somewhere else on the disk.



          To find the directories with only files with link count greater than one, you can use find to get a list of all directories, then again use find to eliminate the directories that contain files with link count one.



          This example doesn't address spaces in file or directory names.



          for dir in `find . -type d`; do
          if test -z "$(find $dir -maxdepth 1 -links 1 -quit)"; then
          echo $dir
          fi
          done





          share|improve this answer















          You can use a program like fdupes to create a hard link to one file from two identical files. This already has the benefit to save space on your disk.



          After you do this, if you have one directory that contains only files with a link count greater than one, you know that each of the files exists somewhere else on the disk.



          To find the directories with only files with link count greater than one, you can use find to get a list of all directories, then again use find to eliminate the directories that contain files with link count one.



          This example doesn't address spaces in file or directory names.



          for dir in `find . -type d`; do
          if test -z "$(find $dir -maxdepth 1 -links 1 -quit)"; then
          echo $dir
          fi
          done






          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Aug 22 '18 at 19:50

























          answered Aug 22 '18 at 19:34









          RalfFriedlRalfFriedl

          5,4303925




          5,4303925













          • With your first paragraph, I thought it was a knee-jerk answer of "try dedup program X", which I wrote multiple times I've tried--and actively rely on for the absolutely critical related requirement of slashing drive space consumption. (But doesn't answer this question.) But you totally redeemed yourself with the second paragraph! Brilliant, esp in it's simplicity! Why didn't I think of that? Only one catch: How to automate that and generate some kind of report or list?

            – Jim
            Aug 22 '18 at 19:42











          • Also, for others reading this answer, note that hardlinks can and almost certainly are a disastrous solution when applied to a large subset of a filesystem, unless you use a utility that can focus solely on specific filetypes an/dor minimum filesize. I go over why, in my question. The short reason: you wind up unintentionally making things the same object, that absolutely are not logically the same--including thousands of files you may not even be aware exists or that you rely on. Using Btrfs reflinks instead of hardlinks solves that particular problem with similar results in most other ways.

            – Jim
            Aug 22 '18 at 19:47













          • I added some commands to find such directories.

            – RalfFriedl
            Aug 22 '18 at 19:50











          • So close! (And possibly good enough.) The only problem is "-maxdepth 1". Ideally, the files in the entire subtree would also have to have >1 link. (At the highest level that satisfies and no lower.) I'd have to test removing that flag. I think it would generate an accurate list of directories, but highly redundant in terms of nesting. (E.g. would report dir1/sub1dir1/sub2dir1, dir1/sub2dir1, and dir1. Odious when involving thousands of results.) Granted, smarter might be orders of magnitude more complex! I may have to accept this answer though, and either way, thanks, and brilliant!

            – Jim
            Aug 22 '18 at 20:02











          • The second challenge could be solved with string work. E.g. generate a simple list of folders with the solution in your answer. Next, go through that list in two nested loops, removing lines that are substrings of other lines. You'd have to do the entire "scan an snip" operation after every folder deletion, as link counts will change after each folder deletion, randomly throughout the entire folder structure. Will be PAINFULLY slow in bash! But highly useful. I'll accept your answer, and if I script a coherent total solution will add an FYI answer with it (if I have enough points to).

            – Jim
            Aug 22 '18 at 20:14
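
          As a simpler alternative to the nested-loop substring scan, here is an untested sketch that sorts the candidates with a trailing slash appended, so every directory lands immediately before its own descendants, and then drops any entry lying under the most recently kept one. File names are placeholders, directory names are assumed not to contain newlines, and as noted above the whole scan still has to be rerun after any actual deletion:

          # candidate-dirs.txt: output of the whole-subtree loop sketched earlier
          sed 's:$:/:' candidate-dirs.txt | sort | while IFS= read -r dir; do
              if [ -n "$keep" ] && [ "${dir#"$keep"}" != "$dir" ]; then
                  continue    # "$dir" lies inside the already-kept "$keep"
              fi
              keep="$dir"
              echo "${dir%/}"
          done > topmost-candidate-dirs.txt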


















