Grep a range of values with specific starting characters
I have 10GB files in which I want to count the occurrences of some specific text, i.e. TY[0-9].
Example file:
ABC,2A,2018-07-06,2018-06-20 00:00:00
BCD,TY1,2018-07-06,2018-06-20 00:00:00
EFG,TY2,2018-07-06,2018-06-20 00:00:00
IGH,2A,2018-07-06,2018-06-20 00:00:00
I want to get the count of all text starting with TY and then a digit. I tried using egrep but am not getting the correct result.
egrep "^TY[0-9]" Filename
awk grep
edited Jan 30 at 5:18 by Crypteya
asked Jun 21 '18 at 18:37 by Developer
^ means beginning of a line. Your TY entries are not at the beginning of a line.
– Thomas
Jan 30 at 11:15
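To make the comment above concrete, here is a minimal sketch run against the four sample lines from the question. The temporary file and the variable names are only for illustration. The second pattern, ',TY[0-9]', is a quick stand-in that anchors on the preceding comma; it would miss a TY value in the first field.

```shell
# Sample data from the question, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ABC,2A,2018-07-06,2018-06-20 00:00:00
BCD,TY1,2018-07-06,2018-06-20 00:00:00
EFG,TY2,2018-07-06,2018-06-20 00:00:00
IGH,2A,2018-07-06,2018-06-20 00:00:00
EOF

# Anchored at the start of the line: no line begins with TY, so the count is 0.
# grep exits non-zero when nothing matches, hence the "|| true".
anchored=$(grep -Ec '^TY[0-9]' "$sample" || true)

# Anchored on the comma that precedes a later field instead.
after_comma=$(grep -Ec ',TY[0-9]' "$sample" || true)

echo "anchored at line start: $anchored"    # 0
echo "anchored after a comma: $after_comma" # 2
rm -f "$sample"
```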
3 Answers
The main issue with your attempted solution is that it assumes that the string TY occurs at the start of the line (you are anchoring the expression there with ^), but it doesn't. It occurs at the start of the second comma-delimited field.
Using awk to count the number of times the second comma-delimited field in the file starts with the string TY followed by a digit:
awk -F, '$2 ~ /^TY[[:digit:]]/ { n++ } END { print n }' filename
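A small sketch of that awk command on the question's sample data, with two changes that are my own assumptions rather than part of the answer: it uses [0-9] instead of [[:digit:]] (some older awk implementations, such as older mawk releases, may not recognise POSIX character classes), and it prints n + 0 so that a file with no matching lines reports 0 instead of an empty line.

```shell
# Sample data from the question, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ABC,2A,2018-07-06,2018-06-20 00:00:00
BCD,TY1,2018-07-06,2018-06-20 00:00:00
EFG,TY2,2018-07-06,2018-06-20 00:00:00
IGH,2A,2018-07-06,2018-06-20 00:00:00
EOF

# Count the lines whose second comma-delimited field starts with TY and a digit.
count=$(awk -F, '$2 ~ /^TY[0-9]/ { n++ } END { print n + 0 }' "$sample")
echo "matching lines: $count"   # 2

# With no matching lines, n is never set: "print n" prints an empty line,
# while "print n + 0" prints 0.
empty=$(printf 'ABC,2A\n' | awk -F, '$2 ~ /^TY[0-9]/ { n++ } END { print n }')
zero=$(printf 'ABC,2A\n' | awk -F, '$2 ~ /^TY[0-9]/ { n++ } END { print n + 0 }')
echo "print n: '$empty', print n + 0: '$zero'"
rm -f "$sample"
```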
I'm wondering whether using cut in combination with grep would be quick? Cutting out the second column would give grep less data to work with, and so it may be quicker than just grep alone.
cut -d, -f2 filename | grep -c '^TY[[:digit:]]'
... but I'm not sure.
After some testing on my OpenBSD system, using a 1.1GB file, the cut+grep pipeline actually takes almost 50% less time than awk (8 seconds vs. 15 seconds). And a pure grep solution (grep -Ec '\<TY[0-9]' filename, taken from glenn's solution) takes 13 seconds.
So if the string is to be picked out of the second field only, one may gain some time by extracting only that field before matching.
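Before timing the three approaches on your own data, it may be worth checking that they agree on a smaller file. The sketch below is only an illustration: the generated file (the question's four sample lines repeated 1000 times) and the variable names are made up, and \< is an extension that GNU and BSD grep support but POSIX does not specify.

```shell
# Build a test file: the four sample lines repeated 1000 times
# (4000 lines, 2000 of which have a TY value in the second field).
big=$(mktemp)
awk 'BEGIN {
  for (i = 0; i < 1000; i++) {
    print "ABC,2A,2018-07-06,2018-06-20 00:00:00"
    print "BCD,TY1,2018-07-06,2018-06-20 00:00:00"
    print "EFG,TY2,2018-07-06,2018-06-20 00:00:00"
    print "IGH,2A,2018-07-06,2018-06-20 00:00:00"
  }
}' > "$big"

# The three approaches discussed above.
n_awk=$(awk -F, '$2 ~ /^TY[0-9]/ { n++ } END { print n + 0 }' "$big")
n_cut=$(cut -d, -f2 "$big" | grep -c '^TY[0-9]')
n_grep=$(grep -Ec '\<TY[0-9]' "$big")

echo "awk: $n_awk  cut+grep: $n_cut  grep: $n_grep"

# To time an approach on a real file, prefix it with "time", for example:
#   time sh -c 'cut -d, -f2 filename | grep -c "^TY[0-9]"'
rm -f "$big"
```

Timings depend heavily on the system, the grep and awk implementations, and the locale, so the numbers quoted above should be re-measured rather than assumed.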
edited Jan 30 at 6:51
answered Jun 21 '18 at 18:47 by Kusalananda
In your second example, why not cut -d, -f2 inputfile | grep -c [...] rather than | grep | wc -l?
– DopeGhoti
Jun 21 '18 at 18:59
@DopeGhoti Derrp. Yes. Thanks. Made it even quicker too.
– Kusalananda
Jun 21 '18 at 19:02
You want to use a word boundary instead of the start-of-line anchor:
$ grep -Ec '\<TY[0-9]' file
2
Note: that is a count of all lines with a "TY word". It is not a count of all "TY word"s. If you can have more than one per line, then count the matches instead:
$ grep -Eo '\<TY[0-9]' file | wc -l
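The difference between the two commands only shows up when a line holds more than one match. Here is a small sketch with a made-up three-line file (not the question's data), assuming a grep that supports the \< word-boundary extension and the -o option, as GNU and BSD grep do:

```shell
# Hypothetical input: the first line has two TY fields, the second has one.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ABC,TY1,TY2,2018-07-06
BCD,2A,TY3,2018-07-06
EFG,2A,2B,2018-07-06
EOF

# -c counts matching lines.
lines=$(grep -Ec '\<TY[0-9]' "$sample")

# -o prints each match on its own line, so wc -l counts matches.
matches=$(grep -Eo '\<TY[0-9]' "$sample" | wc -l)

echo "lines with a match: $lines"     # 2
echo "total matches: $((matches))"    # 3
rm -f "$sample"
```

Note that a word boundary also matches after characters such as -, so a field like X-TY2 would be counted by both commands.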
answered Jun 21 '18 at 18:45 by glenn jackman
If you want to find the number of occurrences of a comma-delimited field that consists of TY followed by one or more decimal digits, you could do:
<file perl -lne '$n += () = /(?<![^,])TY\d+(?![^,])/g; END{print 0+$n}'
Which on an input like:
TY1,TY2,TY,TYFOO
TY213,X-TY2,TY4
Would return 4 (TY1, TY2, TY213, TY4).
(?<!...) and (?!...) are respectively negative look-behind and look-ahead operators. So here, we're looking for TY followed by one or more (+) digits (\d), provided it's neither preceded nor followed by a character other than a comma.
Another way to do it would be to convert commas to newlines and count the number of resulting lines that consist of TY followed by one or more digits:
<file tr , '\n' | LC_ALL=C grep -xEc 'TY[[:digit:]]+'
(on my system, that's about 10 times as fast as the perl solution)
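As a quick check of the tr and grep pipeline on the two-line input above, here is a sketch that also compares it against an awk loop over the fields. The awk variant is not from this answer; it is an alternative added here only as a cross-check, and the temporary file and variable names are illustrative.

```shell
# The two-line example input from the answer.
sample=$(mktemp)
cat > "$sample" <<'EOF'
TY1,TY2,TY,TYFOO
TY213,X-TY2,TY4
EOF

# One field per line, then count lines that are exactly TY plus digits (-x).
n_tr=$(tr , '\n' < "$sample" | LC_ALL=C grep -xEc 'TY[[:digit:]]+')

# Cross-check: loop over every field in awk and test the whole field.
n_awk=$(awk -F, '{ for (i = 1; i <= NF; i++) if ($i ~ /^TY[0-9]+$/) n++ }
                 END { print n + 0 }' "$sample")

echo "tr+grep: $n_tr  awk: $n_awk"   # both 4
rm -f "$sample"
```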
edited Jun 21 '18 at 19:03
answered Jun 21 '18 at 18:51 by Stéphane Chazelas