Grep a range of values with specific starting characters












-1















I have 10 GB files in which I want to count the occurrences of a specific pattern, i.e. TY[0-9].



Example file:



ABC,2A,2018-07-06,2018-06-20 00:00:00
BCD,TY1,2018-07-06,2018-06-20 00:00:00
EFG,TY2,2018-07-06,2018-06-20 00:00:00
IGH,2A,2018-07-06,2018-06-20 00:00:00


I want to get the count of all text starting with TY and then a digit. I tried using egrep but am not getting the correct result.



egrep  "^TY[0-9]" Filename


































  • ^ means beginning of a line. Your TY entries are not at the beginning of a line.

    – Thomas
    Jan 30 at 11:15
















awk grep














asked Jun 21 '18 at 18:37 by Developer
edited Jan 30 at 5:18 by Crypteya























3 Answers


















3














The main issue with your attempted solution is that it assumes that the string TY occurs at the start of the line (you are anchoring the expression there with ^), but it doesn't. It occurs at the start of the second comma-delimited field.





Using awk to count the number of times the second comma-delimited field in the file starts with the string TY followed by a digit:



awk -F, '$2 ~ /^TY[[:digit:]]/ { n++ } END { print n+0 }' filename
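As a minimal sketch of the difference between the two anchors, the following recreates the four sample lines from the question in a temporary file (the variable names are only illustrative) and compares the line-anchored grep with the field-anchored awk test. It prints `n + 0` rather than `n` so that a file with no matches yields 0 instead of an empty line.

```shell
#!/bin/sh
# Recreate the sample data from the question in a temporary file.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
ABC,2A,2018-07-06,2018-06-20 00:00:00
BCD,TY1,2018-07-06,2018-06-20 00:00:00
EFG,TY2,2018-07-06,2018-06-20 00:00:00
IGH,2A,2018-07-06,2018-06-20 00:00:00
EOF

# ^ anchors at the start of the line, where TY never occurs.
# grep exits non-zero when nothing matches, hence the "|| true".
line_anchored=$(grep -Ec '^TY[0-9]' "$sample" || true)

# Here ^ anchors at the start of the second comma-delimited field.
field_anchored=$(awk -F, '$2 ~ /^TY[[:digit:]]/ { n++ } END { print n + 0 }' "$sample")

echo "line-anchored:  $line_anchored"
echo "field-anchored: $field_anchored"
```

On the sample data this should report 0 for the line-anchored pattern and 2 for the field-anchored one.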


I'm wondering whether using cut in combination with grep would be quick? Cutting out the second column would give grep less data to work with, and so it may be quicker than just grep alone.



cut -d, -f2 filename | grep -c '^TY[[:digit:]]'


... but I'm not sure.





After some testing on my OpenBSD system, using a 1.1GB file, the cut+grep is actually almost 50% quicker than awk (8 seconds vs. 15 seconds). And a pure grep solution (grep -Ec '\<TY[0-9]' filename, taken from glenn's solution) takes 13 seconds.



So if the string is to be picked out of the second field only, one may gain some time by extracting only that field before matching.
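To repeat this kind of comparison on your own system, a rough sketch could look like the following. It builds a small hypothetical test file by repeating the question's four sample lines 1000 times (a real benchmark would need a much larger file), wraps the three commands in illustrative functions, and checks that they agree on the count. The `\<` word-boundary operator is an extension found in GNU and BSD grep, not something POSIX guarantees.

```shell
#!/bin/sh
# Hypothetical test file: the four sample lines repeated 1000 times.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
i=0
while [ "$i" -lt 1000 ]; do
  printf '%s\n' \
    'ABC,2A,2018-07-06,2018-06-20 00:00:00' \
    'BCD,TY1,2018-07-06,2018-06-20 00:00:00' \
    'EFG,TY2,2018-07-06,2018-06-20 00:00:00' \
    'IGH,2A,2018-07-06,2018-06-20 00:00:00'
  i=$((i + 1))
done > "$sample"

# The three approaches discussed above, wrapped in illustrative functions.
count_awk()  { awk -F, '$2 ~ /^TY[[:digit:]]/ { n++ } END { print n + 0 }' "$1"; }
count_cut()  { cut -d, -f2 "$1" | grep -c '^TY[[:digit:]]'; }
count_grep() { grep -Ec '\<TY[0-9]' "$1"; }

# Print each count; all three should agree on this data.
for f in count_awk count_cut count_grep; do
  printf '%s: ' "$f"
  "$f" "$sample"
done
```

To measure speed, each command can be run separately under `time` in an interactive shell. Timings depend heavily on the awk and grep implementations, the locale, and disk caching, so the numbers quoted above should not be expected to carry over to other systems.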
































  • In your second example, why not cut -d, -f2 inputfile | grep -c [...] rather than | grep | wc -l?

    – DopeGhoti
    Jun 21 '18 at 18:59













  • @DopeGhoti Derrp. Yes. Thanks. Made it even quicker too.

    – Kusalananda
    Jun 21 '18 at 19:02





















2














You want to use a word boundary instead of the start-of-line anchor:



$ grep -Ec '\<TY[0-9]' file
2


Note: that is a count of all lines with a "TY word". It is not a count of all "TY word"s. If you can have more than one per line, then



$ grep -Eo '\<TY[0-9]' file | wc -l
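The difference between counting lines and counting matches can be seen on a small made-up input in which one line carries two TY fields. This is only a sketch: the data is invented for illustration, and both `-o` and `\<` are extensions available in GNU and BSD grep rather than POSIX features.

```shell
#!/bin/sh
# Invented input: the first line contains two TY fields, the last none.
tmp=$(mktemp)
trap 'rm -f "$tmp"' EXIT
printf '%s\n' 'TY1,TY2,2A' 'ABC,TY3,2A' 'ABC,2A,2B' > "$tmp"

# -c counts matching lines.
lines=$(grep -Ec '\<TY[0-9]' "$tmp")

# -o prints each match on its own line, so wc -l counts matches.
# tr strips the padding that some wc implementations add.
matches=$(grep -Eo '\<TY[0-9]' "$tmp" | wc -l | tr -d ' ')

echo "matching lines: $lines"
echo "matches:        $matches"
```

Here two lines match but there are three matches in total, which is why the second form is needed when a line can contain more than one TY field.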




































    1














    If you want to find the number of occurrences of a comma-delimited field that consists of TY followed by one or more decimal digits, you could do:



    <file perl -lne '$n += () = /(?<![^,])TY\d+(?![^,])/g; END{print 0+$n}'


    Which on an input like:



    TY1,TY2,TY,TYFOO
    TY213,X-TY2,TY4


    Would return 4 (TY1, TY2, TY213, TY4).



    (?<!...) and (?!...) are the negative lookbehind and negative lookahead operators, respectively. So here, we're looking for TY followed by one or more (+) digits (\d), provided it's neither preceded nor followed by a character other than ,.



    Another way to do it would be to convert commas to newlines and count the number of resulting lines that consist of TY followed by one or more digits:



    <file tr , '\n' | LC_ALL=C grep -xEc 'TY[[:digit:]]+'


    (on my system, that's about 10 times as fast as the perl solution)
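As a quick check of the tr + grep pipeline, the sketch below runs it on the two sample lines shown above. For comparison it also includes an awk loop over all fields, which is not part of the original answer but is assumed here to apply the same whole-field test.

```shell
#!/bin/sh
# The two sample lines from the example above.
tmp=$(mktemp)
trap 'rm -f "$tmp"' EXIT
printf '%s\n' 'TY1,TY2,TY,TYFOO' 'TY213,X-TY2,TY4' > "$tmp"

# One field per line, then count lines that are exactly TY plus digits.
count_tr=$(tr , '\n' < "$tmp" | LC_ALL=C grep -xEc 'TY[[:digit:]]+')

# Assumed equivalent: test every comma-delimited field in awk.
count_awk=$(awk -F, '{ for (i = 1; i <= NF; i++) if ($i ~ /^TY[0-9]+$/) n++ }
                     END { print n + 0 }' "$tmp")

echo "tr + grep: $count_tr"
echo "awk:       $count_awk"
```

Both should report 4 on this input (TY1, TY2, TY213, TY4); TY, TYFOO and X-TY2 are rejected because the whole field has to match.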































First answer (awk, cut + grep): answered Jun 21 '18 at 18:47 by Kusalananda; edited Jan 30 at 6:51













Second answer (grep with a word boundary): answered Jun 21 '18 at 18:45 by glenn jackman























Third answer (perl, tr + grep): answered Jun 21 '18 at 18:51 by Stéphane Chazelas; edited Jun 21 '18 at 19:03





























