Create file lists in the terminal and join the two files in a Python 3 script

I have a nested directory tree called 'dir'. I write the list of files from all subdirectories to a CSV file with the following command in a Linux terminal.



dir$ find . -type f -printf '%f\n' > old_names.csv


I am using detox code to change the filenames, and then I make a new list using



dir$ find . -type f -printf '%f\n' > new_names.csv


I would like to join these two lists into a new list with two columns, one holding the old names and one holding the new names.



To do that, I read both CSV files into pandas data frames and join them on the index in a Python 3 script, as follows:



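import os
import pandas as pd

# somepath is the directory that holds the two list files.
# header=None keeps the first filename from being read as a column header.
df_old = pd.read_csv(os.path.join(somepath, 'old_names.csv'), header=None, names=['old_name'])
df_new = pd.read_csv(os.path.join(somepath, 'new_names.csv'), header=None, names=['new_name'])
df_names = df_new.join(df_old)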


The problem is that I get wrong file pairs: the old and new names in a row do not belong to the same file.



When I open new_names.csv, I see that the file list is written in a different order than in old_names.csv, so joining on the index produces wrong pairs. How can I solve this problem?

linux python3

edited Jan 21 at 17:13 – Tomasz
asked Jan 21 at 16:55 – kutlus

  • There's no particular reason that the two finds must produce output in the same order, so the whole exercise may be flawed, but if you're in a situation where they do, do you need to be using Python to join them or would paste suffice?

    – Michael Homer
    Jan 21 at 19:11

  • Hi, thanks! I have many folders like that, so I wanted to join them in a for loop over directories in Python.

    – kutlus
    Jan 21 at 19:15

  • There's just no reason that this would work except under pretty controlled conditions (specific filesystems in use, possibly control of other simultaneous operations on the system). detox will tell you what changes it's making and you'd be better off to use that information instead, I think, rather than trying to reverse-engineer it.

    – Michael Homer
    Jan 21 at 19:18

1 Answer

The find command just outputs in the order the filesystem gives its directory entries in, without any sorting or processing. Depending on the filesystem you're using and other factors, renaming even a single file could change the iteration order, but changing all of them is quite likely to do so. Without a tightly-controlled environment there's no particular reason that two finds should give the same order like that.



For example, many modern filesystems store names in a hash table, and iterate in the order entries appear there. A tiny filename change may be much earlier or later in the table than the original, or even cause total re-hashing of the entire directory so that everything moves. There's no realistic way to put the pieces back together in that case.
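To make the consequence concrete, here is a minimal Python sketch of what happens when two listings of the same files come back in different orders and are then paired by position, which is effectively what a join on the index does. The filenames and the `rename` rule are made up for illustration; they are not detox's actual behaviour, and the reordering is simulated with `sorted` rather than produced by a real filesystem.

```python
# Hypothetical old names, in the order a first listing happened to return them.
old_listing = ["my file.txt", "Report (1).pdf", "a&b.csv"]


def rename(name):
    """Stand-in for a filename-cleaning step (an assumption, not detox itself)."""
    for bad in " ()&":
        name = name.replace(bad, "_")
    return name


# The true old -> new mapping is only known at rename time.
true_pairs = {old: rename(old) for old in old_listing}

# Simulate a second listing that returns the renamed files in another order.
new_listing = sorted(true_pairs.values())

# Pairing by position, as an index join does, mixes up the rows.
positional_pairs = list(zip(old_listing, new_listing))
mismatches = [(old, new) for old, new in positional_pairs if true_pairs[old] != new]
print(mismatches)
```

Any row in `mismatches` is an old name sitting next to some other file's new name, which is the symptom described in the question.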



It's possible that sorting the filenames might help, if they each have a unique unchanged prefix, but that's the only realistic sort of post-processing you could do and carry on with two separate files from two find runs. I don't recommend even trying that.





However, detox does have a -v option that prints out the changes it is making (and -n to print out what it would do). You could use that to produce your CSV file, or directly from Python using subprocess.run.



detox -v ... | sed -e 's/ -> /,/' > names.csv


would produce a CSV file at least as well as one of your finds, with the old and new names automatically matched up. For the basenames (like %f did) you'll need to postprocess, which you can do in Python if necessary, or in the shell.
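The basename post-processing could be done in Python along these lines. This is only a sketch: it assumes every rename line has the form `old/path -> new/path`, as the sed expression above implies, and the helper name `pairs_from_verbose_output` and the sample text are invented for illustration. In a real script the text might come from `subprocess.run(..., capture_output=True, text=True).stdout`.

```python
import csv
import io
import os


def pairs_from_verbose_output(text):
    """Return (old_basename, new_basename) pairs from 'old -> new' lines.

    Assumes one rename per line, separated by ' -> '; other lines are skipped.
    A filename that itself contains ' -> ' would be split at the wrong place.
    """
    pairs = []
    for line in text.splitlines():
        old, sep, new = line.partition(" -> ")
        if not sep:
            continue  # not a rename line
        pairs.append((os.path.basename(old), os.path.basename(new.strip())))
    return pairs


# Made-up sample standing in for captured output.
sample = "./sub/my file.txt -> ./sub/my_file.txt\n./a b/c (1).pdf -> ./a b/c_1.pdf\n"

# Write to an in-memory buffer here; a real script would open names.csv instead.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["old_name", "new_name"])
writer.writerows(pairs_from_verbose_output(sample))
print(buffer.getvalue())
```

Writing the rows with `csv.writer` instead of substituting a comma means that names containing commas or quotes are quoted properly in the output.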






edited Jan 21 at 21:00
answered Jan 21 at 19:41 – Michael Homer

  • Thank you Michael, this was helpful. I am convinced that I can't get around this. The detox code is not the detox package, just some functions I defined to replace some characters with others, which I called detox functions. Now I have decided to convert the first list into a data frame, go through each row by index, and apply the detox functions to each row to create the new name list.

    – kutlus
    Jan 21 at 21:58
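The approach described in this comment, deriving each new name directly from its old name so that the pairing never depends on listing order, might look like the sketch below. The function `clean_name` and its replacement rules are placeholders for the asker's own "detox functions", which are not shown in the post, and the in-memory text stands in for old_names.csv.

```python
import csv
import io


def clean_name(name):
    """Placeholder for the asker's own cleaning rules (assumed, not from the post)."""
    replacements = {" ": "_", "(": "", ")": "", "&": "and"}
    for bad, good in replacements.items():
        name = name.replace(bad, good)
    return name


def build_name_table(old_names):
    """Pair every old name with the new name computed from it."""
    return [(old, clean_name(old)) for old in old_names]


# Stand-in for the contents of old_names.csv, one name per line.
old_names_file = io.StringIO("my file.txt\nReport (1).pdf\na&b.csv\n")
old_names = [line.rstrip("\n") for line in old_names_file if line.strip()]

table = build_name_table(old_names)

# Write to an in-memory buffer here; a real script would open a CSV file instead.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["old_name", "new_name"])
writer.writerows(table)
print(out.getvalue())
```

With pandas, the same idea could be expressed as `df["new_name"] = df["old_name"].map(clean_name)` on a data frame read with `header=None, names=["old_name"]`. Either way, each row carries its own old and new name, so no second `find` run is needed.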


















