How to grep out all rows not matching the given pattern
I am pretty much a newbie in Unix/Linux; this question has most likely been asked before, so pardon me if so.
I have CSV files received on Linux; each comes with its own pattern. Here is one example:
$ head test.csv
wampproduct,wamp_date,wampregion,region_search_phrase,wamp,date_pull,end_of_month_dt
CD Short-Term WAMP,2010-1-1,MA,MA,0.8763918845487475,201901,2019-01-31
CD Short-Term WAMP,2010-1-1,RI,RI,0.8576695707678873,201901,2019-01-31
CD Short-Term WAMP,2010-1-1,NH,NH,0.9038538021630779,201901,2019-01-31
CD Short-Term WAMP,2010-1-1,CT,CT,0.9699202728104309,201901,2019-01-31
CD Short-Term WAMP,2010-1-1,VT,VT,1.0631714504202636,201901,2019-01-31
CD Short-Term WAMP,2010-1-1,PGH,PGH,0.9517353522520116,201901,2019-01-31
CD Short-Term WAMP,2010-1-1,COM,COM,0.7401903422784099,201901,2019-01-31
CD Short-Term WAMP,2010-1-1,DE,DE,0.8485585323154969,201901,2019-01-31
CD Short-Term WAMP,,2010-1-1,PHI|,PHI,,,,1.0009405151305597,201901,2019-01-31
As you might have noticed, all rows follow the pattern xxxx,xxxx,xxxx,xxxx,xxxx,xxx,xxxx
However, one row (the last row in the example) has malformed data:
xxxx,,xxxx,xxx|,xxx,,,,xxx,xxxx
I wonder how I can write a command or a script to: 1. define the regex of the pattern (ideally putting it in a separate file); 2. grep out the unmatched rows from the original data; in this case, the last row should be grepped out.
Thank you very much.
Again, sorry for possible duplicates.
grep regular-expression
You could start with a simple pattern (require at least one character in each field) and work up to a more complex one (dates must be numeric and in a certain range, regions should only be so many characters or from this list, etc.).
– Jeff Schaller
Feb 13 at 15:10
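As a sketch of that stricter idea: every per-field format below is an assumption inferred only from the sample rows (free-text product, Y-M-D date, 2-3-letter region codes, a decimal, YYYYMM, then YYYY-MM-DD), not from a real spec, so adjust it to the actual field definitions:

```shell
# Sample rows from the question (one good, one malformed):
cat > test.csv <<'EOF'
wampproduct,wamp_date,wampregion,region_search_phrase,wamp,date_pull,end_of_month_dt
CD Short-Term WAMP,2010-1-1,MA,MA,0.8763918845487475,201901,2019-01-31
CD Short-Term WAMP,,2010-1-1,PHI|,PHI,,,,1.0009405151305597,201901,2019-01-31
EOF

# Hypothetical stricter pattern; each field's format is inferred from the
# sample data above. -v prints only the rows that do NOT match.
# tail -n +2 skips the header line, which would otherwise be flagged too.
tail -n +2 test.csv |
  grep -vE '^[^,]+,[0-9]{4}-[0-9]{1,2}-[0-9]{1,2},[A-Z]{2,3},[A-Z]{2,3},[0-9]+\.[0-9]+,[0-9]{6},[0-9]{4}-[0-9]{2}-[0-9]{2}$'
```

On the sample file this prints only the malformed PHI row; the well-formed MA row is suppressed because it satisfies every per-field sub-pattern.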
How strict do you want the patterns to be?
– Jeff Schaller
Feb 13 at 15:11
-v displays all unmatched lines (invert match)
– Giacomo Catenazzi
Feb 13 at 15:11
In the example shown, if two or more consecutive commas occur, then grep -v ',,' file would return the unmatched rows.
– JRFerguson
Feb 13 at 15:15
The grepped-out rows must very strictly NOT match the pattern, as this script is intended to do a data-quality check before the data can be ingested and ETL'ed later. Bad data needs to be spotted and returned to the client for correction. Thanks
– mdivk
Feb 13 at 15:25
asked Feb 13 at 15:05
mdivk
1 Answer
Assuming none of the fields actually contains a quote:
awk -F, 'NF == 7' file
awk -F, 'NF != 7' file # show the "bad" lines
That prints any line with exactly 7 comma-separated fields.
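As a small extension of this field-count check (my addition, not part of the original answer), awk's built-in FILENAME and FNR can report where each bad record sits, which helps when returning records to the client:

```shell
# Recreate the sample file from the question, then flag every row whose
# comma-separated field count differs from the expected 7, printing
# "file:line: row" so each bad record can be located.
printf '%s\n' \
  'wampproduct,wamp_date,wampregion,region_search_phrase,wamp,date_pull,end_of_month_dt' \
  'CD Short-Term WAMP,2010-1-1,MA,MA,0.8763918845487475,201901,2019-01-31' \
  'CD Short-Term WAMP,,2010-1-1,PHI|,PHI,,,,1.0009405151305597,201901,2019-01-31' > test.csv

awk -F, 'NF != 7 { print FILENAME ":" FNR ": " $0 }' test.csv
```

The empty fields in the malformed row push its field count to 11, so only that row (line 3 of the sample file) is reported; the header and the good row both have exactly 7 fields.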
If this is a proper CSV file (where quoted fields may contain the field separator), then you'll want a CSV parser. I like ruby for one-liners:
ruby -rcsv -pe 'next unless CSV.parse_line($_).length == 7' test.csv
ruby -rcsv -pe 'next if CSV.parse_line($_).length == 7' test.csv # show the "bad"
Regexes can provide elegant solutions, but not here, IMO:
grep -E '^([^,]+,){6}[^,]+$' test.csv
grep -vE '^([^,]+,){6}[^,]+$' test.csv # show the "bad" lines
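For the question's first requirement (keeping the regex in a separate file), grep's -f option reads patterns from a file, so the expected row shape can be maintained apart from the script. A sketch; the file name patterns.txt is my choice, not something from the answer:

```shell
# Store the expected row shape in its own file; grep -f reads one
# extended regular expression per line from that file.
printf '%s\n' '^([^,]+,){6}[^,]+$' > patterns.txt

# -v inverts the match: rows matching none of the stored patterns are
# printed. Fed here with the two data rows from the question's sample.
grep -vEf patterns.txt <<'EOF'
CD Short-Term WAMP,2010-1-1,MA,MA,0.8763918845487475,201901,2019-01-31
CD Short-Term WAMP,,2010-1-1,PHI|,PHI,,,,1.0009405151305597,201901,2019-01-31
EOF
```

Only the malformed row comes out: its empty fields break the "one or more non-comma characters, then a comma" groups, so it fails the anchored pattern. More patterns can be added to patterns.txt later, one per line, without touching the command.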
Thank you very much Glenn, adding -v to your command is the solution for me. It would be better if you could explain the regex for later readers.
– mdivk
Feb 13 at 15:34
When you say "grep out", do you mean show or exclude the "bad" lines?
– glenn jackman
Feb 13 at 18:13
to show the "bad" lines
– mdivk
Feb 13 at 19:15
edited Feb 13 at 19:29
answered Feb 13 at 15:18
glenn jackman
Thanks for contributing an answer to Unix & Linux Stack Exchange!