How to grep out all rows not matching the given pattern












I am pretty much a newbie to Unix/Linux, and this question has most likely been asked before; pardon me if so.

I have CSV files received on Linux, each coming with its own pattern. Here is one example:

$ head test.csv
wampproduct,wamp_date,wampregion,region_search_phrase,wamp,date_pull,end_of_month_dt
CD Short-Term WAMP,2010-1-1,MA,MA,0.8763918845487475,201901,2019-01-31
CD Short-Term WAMP,2010-1-1,RI,RI,0.8576695707678873,201901,2019-01-31
CD Short-Term WAMP,2010-1-1,NH,NH,0.9038538021630779,201901,2019-01-31
CD Short-Term WAMP,2010-1-1,CT,CT,0.9699202728104309,201901,2019-01-31
CD Short-Term WAMP,2010-1-1,VT,VT,1.0631714504202636,201901,2019-01-31
CD Short-Term WAMP,2010-1-1,PGH,PGH,0.9517353522520116,201901,2019-01-31
CD Short-Term WAMP,2010-1-1,COM,COM,0.7401903422784099,201901,2019-01-31
CD Short-Term WAMP,2010-1-1,DE,DE,0.8485585323154969,201901,2019-01-31
CD Short-Term WAMP,,2010-1-1,PHI|,PHI,,,,1.0009405151305597,201901,2019-01-31

As you might have noticed, all rows follow the pattern xxxx,xxxx,xxxx,xxxx,xxxx,xxx,xxxx.

However, there is a row (the last one in the example) with malformed data:
xxxx,,xxxx,xxx|,xxx,,,,xxx,xxxx

How can I write a command or a script to 1. define a regex for the expected pattern (ideally kept in a separate file), and 2. grep out the rows that do not match it from the original data? In this case, the last row should be grepped out.

Thank you very much.

Again, sorry for possible duplicates.










grep regular-expression






asked Feb 13 at 15:05









mdivk













  • You could start with a simple pattern (require at least one character in each field) up to a more complex one (dates must be numeric and in a certain range, regions should only be so many characters or of this list, etc).

    – Jeff Schaller
    Feb 13 at 15:10











  • How strict do you want the patterns to be?

    – Jeff Schaller
    Feb 13 at 15:11











  • -v displays all unmatched lines (invert match)

    – Giacomo Catenazzi
    Feb 13 at 15:11











  • In the example shown, if two or more consecutive commas occur, then grep -v ',,' file would return the unmatched rows.

    – JRFerguson
    Feb 13 at 15:15











  • The grepped out rows must be very strictly NOT matching the pattern, as this script is intending to do some data quality check before it can be ingested and ETL'ed later. Bad data needs to be spotted and returned to client for correction. Thanks

    – mdivk
    Feb 13 at 15:25



















1 Answer
Assuming none of the fields actually contains a quote:



awk -F, 'NF == 7' file
awk -F, 'NF != 7' file # show the "bad" lines


That prints any line with exactly 7 comma-separated fields.
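
Note that the header line also has 7 comma-separated fields, so it passes this check; if you want to restrict the check to the data rows, one option (just a sketch) is to skip the first record:

awk -F, 'NR > 1 && NF != 7' test.csv    # show only data rows with the wrong field count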



If this is a proper CSV file (where quoted fields may contain the field separator), then you'll want a CSV parser. I like ruby for one-liners:



ruby -rcsv -pe 'next unless CSV.parse_line($_).length == 7' test.csv
ruby -rcsv -pe 'next if CSV.parse_line($_).length == 7' test.csv # show the "bad"
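
For illustration, here is a quick check with a hypothetical quoted line (not taken from the sample file): a field that legitimately contains a comma throws off the naive comma count, while the CSV parser still sees 7 fields.

$ line='"CD Short-Term, WAMP",2010-1-1,MA,MA,0.87,201901,2019-01-31'
$ echo "$line" | awk -F, '{print NF}'        # naive split on commas counts 8 "fields"
8
$ echo "$line" | ruby -rcsv -e 'puts CSV.parse_line($stdin.gets).length'   # a real CSV parse counts 7
7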


Regexes can provide elegant solutions, but not here, IMO:



grep -E '^([^,]+,){6}[^,]+$' test.csv
grep -vE '^([^,]+,){6}[^,]+$' test.csv # show the "bad" lines
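
To unpack the regex: [^,]+ matches one or more non-comma characters (so an empty field fails), ([^,]+,){6} requires six such fields each followed by a comma, and the final [^,]+$ requires a seventh field running to the end of the line; the ^ and $ anchors forbid anything extra. If you would rather keep the regex in a separate file, as the question asks, grep can read patterns from a file with -f. A sketch, assuming a one-line pattern file named wamp_row.regex:

$ cat wamp_row.regex
^([^,]+,){6}[^,]+$
$ grep -vE -f wamp_row.regex test.csv    # print rows that do NOT match the expected shape
CD Short-Term WAMP,,2010-1-1,PHI|,PHI,,,,1.0009405151305597,201901,2019-01-31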





edited Feb 13 at 19:29

























answered Feb 13 at 15:18









glenn jackman
  • Thank you very much Glenn, adding -v to your command is the solution for me. It would be better if you could explain the regex for later readers.

    – mdivk
    Feb 13 at 15:34













  • When you say "grep out", do you mean show or exclude the "bad" lines?

    – glenn jackman
    Feb 13 at 18:13











  • to show the "bad" lines

    – mdivk
    Feb 13 at 19:15


















