Training Tesseract-OCR for English language fonts



























I have about 3000 small images of single words that I am trying to convert to text.
I have installed Tesseract on my Windows 7 machine using the installer and successfully managed to OCR images through cmd and PowerShell.



 tesseract.exe imagename.png imagename 


produces a text file with the converted text.



The results I got were terrible, with only about 40% of characters successfully converted.
I would like to improve the results.



Does anyone know which optional configurations can be passed to this command?
The usage message is:



tesseract imagename outputbase [-l lang] [configfile [[+|-]varfile]...]


Also, could someone describe the training procedure? I am finding it hard to understand the documentation. I know that my text is in Times New Roman. Do I need to train Tesseract for Times New Roman, is it already built in, or is it possible to download files that allow Tesseract to recognize it?



































  • I found some docs for training: code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

    – andrew
    Jan 19 '11 at 20:44













  • After reading the instructions that @andrew (you) found, what part are you not understanding? How far have you gotten in that process?

    – Everett
    Aug 19 '12 at 9:20
















ocr tesseract-ocr














edited Sep 26 '12 at 17:55









climenole











asked Jan 19 '11 at 19:51









andrew
























1 Answer
































One way to improve the results is to preprocess the images, for example by removing any skew and thresholding them. You can use OpenCV for this. After that, you can train Tesseract on your text.
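
To make the thresholding step more concrete, here is a minimal sketch of Otsu's method in plain Python, so it runs without extra dependencies. It assumes the image is already available as nested lists of 8-bit grayscale values; the function names `otsu_threshold` and `binarize` are made up for this example and are not part of Tesseract or OpenCV. With OpenCV, the same step would normally be a single call to `cv2.threshold` with the `cv2.THRESH_OTSU` flag. Deskewing is not shown here.

```python
def otsu_threshold(pixels):
    """Return an Otsu threshold for an iterable of 8-bit gray values (0..255)."""
    # Build the gray-level histogram.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of the gray values of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no pixels in the lower class yet
        w_fg = total - w_bg
        if w_fg == 0:
            break     # every pixel is in the lower class; nothing left to split
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor); Otsu maximizes it.
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image, threshold):
    """Map a 2-D list of gray values to 0 (<= threshold) or 255 (> threshold)."""
    return [[255 if p > threshold else 0 for p in row] for row in image]


if __name__ == "__main__":
    # Hypothetical 3x4 image: dark strokes (about 30) on a light background (about 220).
    image = [
        [220, 220, 30, 220],
        [220, 35, 30, 215],
        [225, 220, 25, 220],
    ]
    t = otsu_threshold(p for row in image for p in row)
    for row in binarize(image, t):
        print(row)
```

The binarized output can then be saved as an image and passed to `tesseract` as before. Whether this helps depends on the scans; for very small word images, upscaling before thresholding may matter as much as the threshold itself.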





























answered Dec 8 '13 at 19:08

Pranaysharma






























