Optimizing large file reads with a custom line terminator in Python












I asked for help accomplishing this on Stack Overflow:
How to efficiently read a large file with a custom newline character using Python?

I ended up providing my own answer, which is duplicated here. 👇



from pathlib import Path


class ChunkedFile(object):
    """
    Yields the file in 100 megabyte chunks (by default).
    If lineterminator is provided, reads through the file in chunks but yields lines.
    """

    def __init__(self, file_to_process, lineterminator=None):
        self._file = file_to_process
        self._lineterminator = lineterminator

    def read(self, encoding="utf-8", chunk_size=None):
        one_hundred_megabytes = 100 * 1024
        chunk_size = chunk_size or one_hundred_megabytes

        with self._file.open(encoding=encoding) as f:
            chunk = ""
            while True:
                data = f.read(chunk_size)
                if not data:  # EOF
                    break

                chunk += data
                if self._lineterminator is None:
                    yield chunk
                    chunk = ""

                elif self._lineterminator in chunk:
                    lines = chunk.split(self._lineterminator)
                    chunk = "" if chunk.endswith(self._lineterminator) else lines.pop()
                    for line in lines:
                        if line:  # Don't return empty lines.
                            yield line


if __name__ == "__main__":
    longabstract_file = Path(__file__).parent / "longabstract_en.csv"
    file_chunk_reader = ChunkedFile(longabstract_file, lineterminator="tln")
    for line in file_chunk_reader.read():
        print(line)


The code works, but I'm interested in further optimizations and proper error handling. Nothing broke when I ran it on a few 1 GB+ files, so it's a start! 🙃
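For example, would wrapping the consumption of the generator in a rough sketch like this be considered proper error handling, or is there a more idiomatic pattern? (Just a sketch building on the __main__ block above.)

try:
    for line in file_chunk_reader.read():
        print(line)
except FileNotFoundError:
    print(f"Missing file: {longabstract_file}")
except UnicodeDecodeError as error:
    print(f"Could not decode {longabstract_file} as UTF-8: {error}")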



PS - I prefer pure Python 3 code (if relevant). Thanks.










python python-3.x






asked Oct 15 '18 at 19:53 – GollyJer
  • It'd be great if you could add the problem description to your question so that one doesn't have to navigate over to Stack Overflow.
    – t3chb0t, Nov 4 '18 at 7:50




















2 Answers






1. Bug



If the file does not end with the line terminator, then the last line is lost. That's because the code says:



if not data:  # EOF
    break


but at this point there might still be an unterminated line remaining in chunk, so this needs to be:



if not data:  # EOF
    if chunk:
        yield chunk
    break


Perhaps in your case the file is required to always end with a line terminator, but if so it would be a good idea to assert this, rather than silently discarding data if the file does not meet the requirement:



if not data:  # EOF
    if chunk:
        raise ValueError("file does not end with line terminator")
    break


2. Separation of concerns



The code in the post has four concerns:




  1. Opening and closing a file.

  2. Generating chunks of 100 kilobytes from a file.

  3. Generating lines from a file with arbitrary line terminator.

  4. Filtering out blank lines.


The principle of separation of concerns suggests that these should not be combined into one function. Taking the four concerns in turn:




  1. As a general rule, if you have a function that operates on the contents of a file, it's better if the function takes a file-like object (not a filename) and leaves the opening and closing of the file up to the caller. That's because not all file-like objects correspond to named files. The caller might have a pipe from a call to subprocess.Popen, or they might have some data in memory wrapped as an io.StringIO. If an interface takes a filename and opens the file itself, then it can't be used with these other kinds of file-like objects.


  2. Reading fixed-size chunks from a file is already implemented by the read method.


  3. This is fine, but see below for some improvements.


  4. Filtering out blank lines might not be needed or wanted, so it should be left to the caller to decide whether to do it. It is in any case easy to implement using an if statement, or a comprehension, or filter.



3. Other points




  1. There are no docstrings.


  2. When you have a class with __init__ and one other method, then there is usually no need for a class: what you want is a function. (See Jack Diederich's PyCon 2012 talk "Stop Writing Classes" for a detailed look at this issue.)


  3. 100 × 1024 bytes is one hundred kilobytes (not megabytes).



  4. Instead of having a keyword argument with default value of None:



    def read(..., chunk_size=None):


    and then later setting it to a particular value if it tests false:



    one_hundred_megabytes = 100 * 1024
    chunk_size = chunk_size or one_hundred_megabytes


    just specify the keyword argument with the default value you want:



    def read(..., chunk_size=100 * 1024):



  5. The code appends to chunk before testing for the line terminator:



    chunk += data
    # ...
    elif self._lineterminator in chunk:


    but we know that if the line terminator appears, then it must appear in the new part of the chunk (that we just read). It can't appear in the old part because then it would have been split on the previous iteration. This means that if multiple chunks need to be read before the line terminator appears, then the initial part of the line gets searched again each time. This leads to quadratic runtime complexity if lines are much longer than the chunk size.



    For efficiency, we should test self._lineterminator in data instead.




  6. The code tests for the line terminator appearing in the chunk and then splits the chunk if it is found.



    elif self._lineterminator in chunk:
        lines = chunk.split(self._lineterminator)


    but this has to search the chunk twice, once for the in operator, and again for the split. It would be faster to split first and then see how many pieces there are:



    lines = chunk.split(self._lineterminator)
    if len(lines) > 1:



  7. Combining points 5 and 6 above, we should split data first and append to chunk afterwards:



    first, *rest = data.split(self._lineterminator)
    chunk += first
    if rest:
        yield chunk
        yield from rest[:-1]
        chunk = rest[-1]


  8. I think the names would be clearer if chunk were renamed part (because it is part of a line), and data were renamed chunk (because it is a chunk of text read from the file).



4. Revised code



def splitlines(file, newline, chunk_size=4096):
    """Generate lines from file (a file-like object), where a line is
    followed by the string newline (or the end of the file). Optional
    argument chunk_size is the size of the chunks read from the file.

    """
    part = ''
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            if part:
                yield part
            break
        first, *rest = chunk.split(newline)
        part += first
        if rest:
            yield part
            yield from rest[:-1]
            part = rest[-1]


The caller should write something like:



with open(longabstract_file) as f:
    for line in splitlines(f, "tln"):
        if line:  # ignore blank lines
            print(line)
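Because splitlines only needs a file-like object, it can also be exercised against an in-memory io.StringIO, which makes quick checks easy (a minimal sketch; the sample text is made up):

import io

# "tln" is the custom line terminator; the '' in the output is a blank line
# produced by two consecutive terminators.
sample = io.StringIO("firsttlnsecondtlntlnthird")
print(list(splitlines(sample, "tln")))
# -> ['first', 'second', '', 'third']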





answered Oct 16 '18 at 11:23 (edited Oct 16 '18 at 11:38) – Gareth Rees
  • Gareth... thank you for taking the time! This has been so instructive. --- You nailed the scent I was smelling. Specifically the separation of concerns. Thanks. --- Your version gets around the pointed-out bug elegantly. I was ignoring it because the files do end with a newline, but your solution is awesome. I've never even thought about short-circuiting with yield. Thanks! --- SPEED!! My "baseline" is simply reading through the 10 million line file line-by-line with a pass. This takes ~10 secs. My function takes ~22 MINUTES. Yours takes ~15 seconds!!! Clearly points 3.5, 3.6 and 3.7 are key! 🏎
    – GollyJer, Oct 17 '18 at 0:18

  • Also, do you want to post this code as an answer to my question on SO? ✅ is in order over there as well.
    – GollyJer, Oct 17 '18 at 0:20

  • Found a little bug and updated with a new version in my answer below. Feedback welcome. 🤪
    – GollyJer, 2 hours ago



















There is a bug in @Gareth Rees' code: it doesn't work if the newline string is split across a chunk_size boundary.



Here is a version that fixes the bug.



def splitlines(file, newline, chunk_size=4096):
    tail = ""
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            if tail:
                yield tail
            break
        lines = (tail + chunk).split(newline)
        tail = lines.pop(0)
        if lines:
            yield tail
            tail = lines.pop()
        yield from lines
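A quick way to check the boundary case (a minimal sketch; the sample text and the tiny chunk_size are made up so that "tln" straddles two reads):

import io

# chunk_size=3 forces the "tln" terminator to be split across two read() calls.
sample = io.StringIO("abtlncd")
print(list(splitlines(sample, "tln", chunk_size=3)))
# -> ['ab', 'cd']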





answered 2 hours ago – GollyJer