How to grep out all rows not matching the given pattern
I am pretty much a newbie in Unix/Linux, so this question has most likely been asked before; pardon me if so.

I receive CSV files on Linux, each with its own pattern. One example:

$ head test.csv
wampproduct,wamp_date,wampregion,region_search_phrase,wamp,date_pull,end_of_month_dt
CD Short-Term WAMP,2010-1-1,MA,MA,0.8763918845487475,201901,2019-01-31
CD Short-Term WAMP,2010-1-1,RI,RI,0.8576695707678873,201901,2019-01-31
CD Short-Term WAMP,2010-1-1,NH,NH,0.9038538021630779,201901,2019-01-31
CD Short-Term WAMP,2010-1-1,CT,CT,0.9699202728104309,201901,2019-01-31
CD Short-Term WAMP,2010-1-1,VT,VT,1.0631714504202636,201901,2019-01-31
CD Short-Term WAMP,2010-1-1,PGH,PGH,0.9517353522520116,201901,2019-01-31
CD Short-Term WAMP,2010-1-1,COM,COM,0.7401903422784099,201901,2019-01-31
CD Short-Term WAMP,2010-1-1,DE,DE,0.8485585323154969,201901,2019-01-31
CD Short-Term WAMP,,2010-1-1,PHI|,PHI,,,,1.0009405151305597,201901,2019-01-31

As you might have noticed, the rows follow the pattern xxxx,xxxx,xxxx,xxxx,xxxx,xxx,xxxx (seven comma-separated fields).

However, one row (the last row in the example) is malformed:
xxxx,,xxxx,xxx|,xxx,,,,xxx,xxxx

How can I write a command or a script to 1. define a regex for the expected pattern (ideally kept in a separate file), and 2. grep out the rows that do not match it? In this example, the last row should be grepped out.

Thank you very much.

Again, sorry for possible duplicates.

grep regular-expression

asked Feb 13 at 15:05 by mdivk

  • You could start with a simple pattern (require at least one character in each field) and work up to a more complex one (dates must be numeric and in a certain range, regions must be only so many characters or come from a fixed list, etc.).

    – Jeff Schaller
    Feb 13 at 15:10











  • How strict do you want the patterns to be?

    – Jeff Schaller
    Feb 13 at 15:11











  • grep's -v option displays all unmatched lines (invert match).

    – Giacomo Catenazzi
    Feb 13 at 15:11











  • In the example shown, if two or more consecutive commas occur, then grep -v ',,' file would return the unmatched rows.

    – JRFerguson
    Feb 13 at 15:15











  • The grepped-out rows must very strictly NOT match the pattern, as this script is intended to do a data-quality check before the data is ingested and ETL'ed later. Bad data needs to be spotted and returned to the client for correction. Thanks

    – mdivk
    Feb 13 at 15:25

1 Answer

Assuming none of the fields actually contains a quote:

awk -F, 'NF == 7' file
awk -F, 'NF != 7' file   # show the "bad" lines

The first command prints any line with exactly 7 comma-separated fields; the second inverts the test and prints the rest.

If this is a proper CSV file (where quoted fields may contain the field separator), then you'll want a CSV parser. I like ruby for one-liners:

ruby -rcsv -pe 'next unless CSV.parse_line($_).length == 7' test.csv
ruby -rcsv -pe 'next if CSV.parse_line($_).length == 7' test.csv   # show the "bad" lines

Regexes can provide elegant solutions, but not here IMO:

grep -E '^([^,]+,){6}[^,]+$' test.csv
grep -vE '^([^,]+,){6}[^,]+$' test.csv   # show the "bad" lines

In the pattern, ^([^,]+,){6} requires six non-empty fields each followed by a comma, and [^,]+$ requires a seventh non-empty field at the end of the line; -v inverts the match so that only the rows failing this check are printed.

answered Feb 13 at 15:18 (edited Feb 13 at 19:29) by glenn jackman
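
To also cover part 1 of the question (keeping the regex in a separate file), note that grep can read its patterns from a file with -f. A minimal sketch, not part of the answer above, assuming the seven-field pattern is saved in a file named row_pattern.txt (a hypothetical name):

# one ERE per line; each line describes an acceptable row format
printf '%s\n' '^([^,]+,){6}[^,]+$' > row_pattern.txt

# print rows that match none of the patterns in the file
grep -vEf row_pattern.txt test.csv

If several row formats are acceptable, list one pattern per line in the file; with -v, grep then reports only the rows that match none of them.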
  • Thank you very much, Glenn; adding -v to your command is the solution for me. It would be better if you could explain the regex for later readers.

    – mdivk
    Feb 13 at 15:34













  • When you say "grep out", do you mean show or exclude the "bad" lines?

    – glenn jackman
    Feb 13 at 18:13











  • to show the "bad" lines

    – mdivk
    Feb 13 at 19:15
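
Following up on the comments under the question about how strict the check should be: the pattern can be tightened from "seven non-empty fields" to per-field rules. The sketch below guesses the field formats from the sample rows (a year-month-day date, uppercase region codes, a decimal wamp, a six-digit date_pull, and an ISO end_of_month_dt); these rules are assumptions rather than the real data contract, so adjust them before using this for the quality check:

# print rows that fail the stricter per-field check (field formats are assumed)
grep -vE '^[^,]+,[0-9]{4}-[0-9]{1,2}-[0-9]{1,2},[A-Z]+,[A-Z]+,[0-9]+\.[0-9]+,[0-9]{6},[0-9]{4}-[0-9]{2}-[0-9]{2}$' test.csv

The header line will also be flagged by this check; if that is not wanted, it can be skipped by piping tail -n +2 test.csv into the same grep command.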