Will we ever “find” files whose names are changed by “find”? Why not?
While answering an older question it struck me that find, in the following example, could potentially process files multiple times:

    find dir -type f -name '*.txt' \
        -exec sh -c 'mv "$1" "${1%.txt}_hello.txt"' sh {} ';'

or the more efficient

    find dir -type f -name '*.txt' \
        -exec sh -c 'for n; do mv "$n" "${n%.txt}_hello.txt"; done' sh {} +
The command finds .txt files and changes their filename suffix from .txt to _hello.txt.

While doing so, the directories will start accumulating new files whose names match the *.txt pattern, namely these _hello.txt files.
Question: Why are these new files not actually processed by find? In my experience they aren't, and we don't want them to be either, as that would introduce a sort of infinite loop. This is also the case with mv replaced by cp, by the way.
The POSIX standard says (my emphasis):

    If a file is removed from or added to the directory hierarchy being searched it is unspecified whether or not find includes that file in its search.
Since it's unspecified whether new files will be included, maybe a safer approach would be

    find dir -type d -exec sh -c '
        for n in "$1"/*.txt; do
            test -f "$n" && mv "$n" "${n%.txt}_hello.txt"
        done' sh {} ';'
Here, we don't look for files but for directories, and the for loop of the inner sh script expands its glob once, before the first iteration, so we don't have the same potential issue.
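As a quick sanity check of that glob behavior (my own toy example, not part of the commands above), you can watch the word list expand exactly once even though the loop body creates new matching files:

    mkdir demo && cd demo
    touch a.txt b.txt
    for n in *.txt; do
        cp "$n" "${n%.txt}_hello.txt"   # creates new *.txt files mid-loop
    done
    ls   # a.txt a_hello.txt b.txt b_hello.txt -- the loop ran only twice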
The GNU find manual does not explicitly say anything about this, and neither does the OpenBSD find manual.
asked Feb 13 '18 at 18:56 by Kusalananda (edited Jul 3 '18 at 9:15)
"it is unspecified whether or not" - I wonder why the authors of the find utility are not concerned by such tricky behavior. – RomanPerekhrest, Feb 13 '18 at 19:07
readdir has essentially the same specification. I would expect it to be potentially filesystem-specific, even (loading one block of directory entries at a time is pretty reasonable). – Michael Homer, Feb 13 '18 at 19:12
@RomanPerekhrest, it's unspecified in the POSIX specification. That doesn't mean the authors of the find utility don't get concerned with the behavior. It means it's left to the authors of any given implementation of the find utility how to handle that case, rather than being specified. (If that seems unclear, I recommend you fully clear up the words "specification" and "implementation" as they apply to software.) – Wildcard, Feb 13 '18 at 21:51
@Wildcard, the words "specification" and "implementation" are clear. A quote from the question: "The GNU find manual does not explicitly say anything about this and neither does the OpenBSD find manual." So, that's bad ... and I shouldn't be compelled to like that. – RomanPerekhrest, Feb 13 '18 at 22:01
@RomanPerekhrest, ah, I see. Your first comment quoted the POSIX spec, so I wasn't sure. I wouldn't say it's bad, though (that the find devs don't mention it). As described in ilkkachu's answer, that's filesystem-level behavior, not even specified in the readdir() spec. There are going to be race conditions no matter what. – Wildcard, Feb 13 '18 at 22:04
1 Answer
Can find find files that were created while it was walking the directory?

In brief: yes, but it depends on the implementation. It's probably best to write the conditions so that already-processed files are ignored.
As mentioned, POSIX makes no guarantees either way, just as it makes none for the underlying readdir() library call:

    If a file is removed from or added to the directory after the most recent call to opendir() or rewinddir(), whether a subsequent call to readdir() returns an entry for that file is unspecified.
I tested find on my Debian system (GNU find, Debian package version 4.6.0+git+20161106-2). strace showed that it read the full directory before doing anything.
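An invocation along these lines reproduces that observation (my own command, not taken from the original answer; the exact syscall names vary by architecture and libc):

    strace -f -e trace=getdents,getdents64 find test -type f > /dev/null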
Browsing the source code a bit more, it seems that GNU find uses parts of gnulib to read the directories, and there's this in gnulib/lib/fts.c (gl/lib/fts.c in the find tarball):
    /* If possible (see max_entries, below), read no more than this many directory
       entries at a time.  Without this limit (i.e., when using non-NULL
       fts_compar), processing a directory with 4,000,000 entries requires ~1GiB
       of memory, and handling 64M entries would require 16GiB of memory.  */
    #ifndef FTS_MAX_READDIR_ENTRIES
    # define FTS_MAX_READDIR_ENTRIES 100000
    #endif
I changed that limit to 100, and did

    mkdir test; cd test; touch {0000..2999}.foo
    find . -type f -exec sh -c 'mv "$1" "${1%.foo}.barbarbarbarbarbarbarbar"' sh {} \; -print
resulting in such hilarious results as this file, which got renamed five times:
1046.barbarbarbarbarbarbarbar.barbarbarbarbarbarbarbar.barbarbarbarbarbarbarbar.barbarbarbarbarbarbarbar.barbarbarbarbarbarbarbar
Obviously, a very large directory (more than 100 000 entries) would be needed to trigger that effect on a default build of GNU find, but a trivial readdir+process loop without caching would be even more vulnerable.
In theory, if the OS always added renamed files last in the order in which readdir() returns them, a simple implementation like that could even fall into an endless loop.
On Linux, readdir() in the C library is implemented on top of the getdents() system call, which returns multiple directory entries in one go. This means that later calls to readdir() might return files that were already removed, but for very small directories you'd effectively get a snapshot of the starting state. I don't know about other systems.
In the above test, I did the renames to a longer file name on purpose, to prevent the file name from being overwritten in place. Even so, the same test with a same-length rename also produced double and triple renames. If and how this matters would of course depend on the filesystem internals.
Considering all this, it's probably prudent to avoid the whole issue by making the find expression not match the files that were already processed: that is, to add -name "*.foo" in my example, or ! -name "*_hello.txt" to the command in the question.
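Applied to the question's first command, that suggestion would read something like this (my rendering of the fix, not quoted verbatim from the answer):

    find dir -type f -name '*.txt' ! -name '*_hello.txt' \
        -exec sh -c 'mv "$1" "${1%.txt}_hello.txt"' sh {} ';'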
answered Feb 13 '18 at 21:01 by ilkkachu (edited 2 days ago)
This would seem to indicate that the default GNU find indeed would have issues with directories holding more than 100K files (or maybe entries of any type?) and that my precaution is not as silly as I first thought. (Well, having 100K files in a directory is in itself a bit silly.) I will look for similar code in my native OpenBSD find as soon as I get a chance. – Kusalananda, Feb 13 '18 at 21:06
Interesting. Better be more careful with find regexps in the future. – Rui F Ribeiro, Feb 13 '18 at 21:21
Actually, I wonder if the alternative approach would work on such a directory... expanding a glob to more than 100K pathnames? That would have its own issues. – Kusalananda, Feb 13 '18 at 22:22