Will we ever “find” files whose names are changed by “find”? Why not?


























While answering an older question, it struck me that find, in the following example, could potentially process files multiple times:



find dir -type f -name '*.txt' \
    -exec sh -c 'mv "$1" "${1%.txt}_hello.txt"' sh {} ';'


or the more efficient



find dir -type f -name '*.txt' \
    -exec sh -c 'for n; do mv "$n" "${n%.txt}_hello.txt"; done' sh {} +


The command finds .txt files and changes their filename suffix from .txt to _hello.txt.
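For illustration, a run might look like this (a sketch; the directory contents are made up):

$ ls dir
a.txt  b.txt
$ find dir -type f -name '*.txt' -exec sh -c 'mv "$1" "${1%.txt}_hello.txt"' sh {} ';'
$ ls dir
a_hello.txt  b_hello.txt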



While doing so, the directories will start accumulating new files whose names match the *.txt pattern, namely these _hello.txt files.



Question: Why are they not actually processed by find? In my experience they aren't, and we don't want them to be either, as that would introduce a sort of infinite loop. This is also the case with mv replaced by cp, by the way.



The POSIX standard says:




If a file is removed from or added to the directory hierarchy being searched it is unspecified whether or not find includes that file in its search.




Since it's unspecified whether new files will be included, maybe a safer approach would be



find dir -type d -exec sh -c '
    for n in "$1"/*.txt; do
        test -f "$n" && mv "$n" "${n%.txt}_hello.txt"
    done' sh {} ';'


Here, we look not for files but for directories, and the for loop in the embedded sh script expands its glob once, before the first iteration, so we don't have the same potential issue.
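To see that single-expansion property in isolation (a minimal sketch, independent of find; the demo directory is made up):

$ mkdir demo && touch demo/a.txt demo/b.txt
$ ( cd demo && for n in *.txt; do mv "$n" "${n%.txt}_hello.txt"; done )
$ ls demo
a_hello.txt  b_hello.txt

The glob *.txt expands to a.txt b.txt before the first mv runs, so the freshly created *_hello.txt files are never revisited by the loop.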



The GNU find manual does not explicitly say anything about this and neither does the OpenBSD find manual.










Tag: find

asked Feb 13 '18 at 18:56 by Kusalananda (last edited Jul 3 '18 at 9:15)












  • "it is unspecified whether or not" – I wonder why the authors of the find utility are not concerned about such tricky behavior
    – RomanPerekhrest
    Feb 13 '18 at 19:07






  • readdir has essentially the same specification. I would expect it to be potentially filesystem-specific, even (loading one block of directory entries at a time is pretty reasonable).
    – Michael Homer
    Feb 13 '18 at 19:12










  • @RomanPerekhrest, it's unspecified in the POSIX specification. That doesn't mean the authors of the find utility aren't concerned with the behavior. It means that how to handle that case is left to the authors of any given implementation of the find utility, rather than being specified. (If that seems unclear, I recommend you fully clear up the words "specification" and "implementation" as they apply to software.)
    – Wildcard
    Feb 13 '18 at 21:51












  • @Wildcard, the words "specification" and "implementation" are clear. A quote from the question: "The GNU find manual does not explicitly say anything about this and neither does the OpenBSD find manual." So, that's bad ... and I shouldn't be compelled to like that
    – RomanPerekhrest
    Feb 13 '18 at 22:01










  • @RomanPerekhrest, ah, I see. Your first comment quoted the POSIX spec, so I wasn't sure. I wouldn't say it's bad, though (that the find devs don't mention it). As described in ilkkachu's answer, that's filesystem-level behavior, not even specified for the readdir() call. There are going to be race conditions no matter what.
    – Wildcard
    Feb 13 '18 at 22:04




















1 Answer
































Can find find files that were created while it was walking the directory?



In brief: Yes, but it depends on the implementation. It's probably best to write the conditions so that already processed files are ignored.



As mentioned, POSIX makes no guarantees either way, just as it makes none for the underlying readdir() library call:




If a file is removed from or added to the directory after the most recent call to opendir() or rewinddir(), whether a subsequent call to readdir() returns an entry for that file is unspecified.






I tested find on my Debian system (GNU find, Debian package version 4.6.0+git+20161106-2): strace showed that it read the full directory before doing anything.
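That observation can be reproduced along these lines (a sketch, assuming Linux, where strace is available and directory reads appear as getdents/getdents64 system calls):

$ strace -e trace=getdents,getdents64 find dir -type f > /dev/null

All the getdents calls for a directory should then appear before find prints or acts on any of that directory's entries.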



Browsing the source code a bit further suggests that GNU find uses parts of gnulib to read directories, and there's this in gnulib/lib/fts.c (gl/lib/fts.c in the find tarball):



/* If possible (see max_entries, below), read no more than this many directory
   entries at a time.  Without this limit (i.e., when using non-NULL
   fts_compar), processing a directory with 4,000,000 entries requires ~1GiB
   of memory, and handling 64M entries would require 16GiB of memory.  */
#ifndef FTS_MAX_READDIR_ENTRIES
# define FTS_MAX_READDIR_ENTRIES 100000
#endif
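Since the macro is wrapped in #ifndef, the limit can also be overridden at build time rather than by editing the source (a sketch, assuming a findutils source tree with the usual autotools build):

$ ./configure CPPFLAGS=-DFTS_MAX_READDIR_ENTRIES=100
$ make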


I changed that limit to 100, and did



mkdir test; cd test; touch {0000..2999}.foo
find . -type f -exec sh -c 'mv "$1" "${1%.foo}.barbarbarbarbarbarbarbar"' sh {} \; -print


resulting in such hilarious results as this file, which got renamed five times:



1046.barbarbarbarbarbarbarbar.barbarbarbarbarbarbarbar.barbarbarbarbarbarbarbar.barbarbarbarbarbarbarbar.barbarbarbarbarbarbarbar




Obviously, a very large directory (more than 100,000 entries) would be needed to trigger that effect on a default build of GNU find, but a trivial readdir + process loop without caching would be even more vulnerable.



In theory, if the OS always placed renamed files last in the order in which readdir() returns them, a simple implementation like that could even fall into an endless loop.



On Linux, readdir() in the C library is implemented on top of the getdents() system call, which returns multiple directory entries in one go. This means that later calls to readdir() might return files that were already removed, but for very small directories you'd effectively get a snapshot of the starting state. I don't know about other systems.



In the above test, I renamed to a longer file name on purpose, to prevent the file name from being overwritten in place. Even so, the same test with a same-length rename also produced double and triple renames. If and how this matters would of course depend on the filesystem internals.



Considering all this, it's probably prudent to avoid the whole issue by making the find expression not match the files that were already processed. That is, to add -name "*.foo" in my example or ! -name "*_hello.txt" to the command in the question.
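Applied to the command from the question, that guard would look like this (same rename logic as before; only the ! -name test is added):

find dir -type f -name '*.txt' ! -name '*_hello.txt' \
    -exec sh -c 'for n; do mv "$n" "${n%.txt}_hello.txt"; done' sh {} +

Note that this also skips any files that already ended in _hello.txt before the run, which may or may not be what you want.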






answered Feb 13 '18 at 21:01 by ilkkachu























  • This would seem to indicate that default GNU find indeed would have issues with directories holding more than 100K files (or maybe entries of any type?), and that my precaution is not as silly as I first thought (well, having 100K files in a directory is in itself a bit silly). I will look for similar code in my native OpenBSD find as soon as I get a chance.
    – Kusalananda
    Feb 13 '18 at 21:06












  • Interesting. Better be more careful with find regexps in the future.
    – Rui F Ribeiro
    Feb 13 '18 at 21:21










  • Actually, I wonder if the alternative approach would work on such a directory... expanding a glob to more than 100K pathnames? That would have its own issues.
    – Kusalananda
    Feb 13 '18 at 22:22










