Text extraction of document files

I have this script that pulls text out of .docx, .doc and .pdf files and uploads that text to an Azure SQL Server. This is so users can search on the contents of those documents without using Windows Search / Azure Search.



The filenames are all in the following format:



firstname surname - id.extension



The id in the filename is the wrong one, though: it comes from an outdated database, and the new database that I am updating holds both (newID and oldID).



COLUMNS:




  • ID - New ID of the candidate record

  • OldID - Old ID of the candidate record (old database schema)

  • OriginalResumeID - Document ID that links the candidate table to the document table

  • CachedText - The field I am updating (it holds the document text); at the moment this will mostly be NULL


Here is the script:



## Get resume list
$params = @{
    'Database'       = $TRIS5DATABASENAME
    'ServerInstance' = $($AzureServerInstance.FullyQualifiedDomainName)
    'Username'       = $AdminLogin
    'Password'       = $InsecurePassword
    'query'          = "SELECT id, OldID, OriginalResumeID FROM Candidate WHERE OriginalResumeID IS NOT NULL"
}
$IDCheck = Invoke-Sqlcmd @params

## Word object
$files = Get-ChildItem -Force -Recurse $documentFolder -Include *.doc, *.pdf, *.docx
$word = New-Object -ComObject word.application
$word.Visible = $false
$saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat], "wdFormatText")

foreach ($file in $files) {
    Write-Output "Processing: $($file.FullName)"
    $doc = $word.Documents.Open($file.FullName)
    $fileName = $file.BaseName + '.txt'
    $doc.SaveAs("$env:TEMP$fileName", [ref]$saveFormat)
    Write-Output "File saved as $env:TEMP$fileName"
    $doc.Close()

    $4ID = $fileName.split('-')[1].replace(' ', '').replace(".txt", "")
    $text = Get-Content "$env:TEMP$fileName"
    $text = $text.replace("'", "''")

    $resumeID = $IDCheck | where {$_.OldID -eq $4id} | Select-Object OriginalResumeID
    $resumeID = $resumeID.OriginalResumeID

    <# Upload to azure #>
    $params = @{
        'Database'       = $TRIS5DATABASENAME
        'ServerInstance' = $($AzureServerInstance.FullyQualifiedDomainName)
        'Username'       = $AdminLogin
        'Password'       = $InsecurePassword
        'query'          = "Update Document SET CachedText = '$text' WHERE id = $ResumeID"
    }
    Invoke-Sqlcmd @params -ErrorAction "SilentlyContinue"
    Remove-Item -Force "$env:TEMP$fileName"
}

$word.Quit()


The problem is that running this on a large dataset, say 750,000 documents, takes far too long per document. I'm fairly certain this is because it has to search through the entire $IDCheck object of 750,000 records before it can get the OriginalResumeID of the record to upload to.
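
For scale, the Where-Object pipeline in the loop re-scans all 750,000 rows for every single file. A hashtable built once from $IDCheck and keyed on OldID would make each lookup constant-time; this is only a sketch of the idea (it assumes OldID values are unique), not part of the script above:

# Build the index once, before the file loop
$oldIdLookup = @{}
foreach ($row in $IDCheck) {
    $oldIdLookup[[string]$row.OldID] = $row.OriginalResumeID
}

# Inside the file loop the lookup then becomes a single hash read:
$resumeID = $oldIdLookup[$4ID]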



Running this against a smaller database is quite quick (around 200,000 documents per 24 hours). I was thinking I could check the Document table and only pull rows where the CachedText field is NULL, and loop that in batches of 50,000 documents so it would get quicker as it goes. The problem is that the Document table will be massive and will take a long time to search through every time this is called.
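
For reference, each pass of that batching idea would issue something like the following (just a sketch; none of this is in the current script, and $pendingDocs is only a placeholder name):

$params = @{
    'Database'       = $TRIS5DATABASENAME
    'ServerInstance' = $($AzureServerInstance.FullyQualifiedDomainName)
    'Username'       = $AdminLogin
    'Password'       = $InsecurePassword
    'query'          = "SELECT TOP (50000) id FROM Document WHERE CachedText IS NULL"
}
# $pendingDocs is a placeholder variable holding the next batch of unprocessed rows
$pendingDocs = Invoke-Sqlcmd @params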



Any help on speeding this up would be much appreciated.



EDIT:



It looks like it is the upload to Azure that is causing the delay:



<# Upload to azure #>
$params = @{
    'Database'       = $TRIS5DATABASENAME
    'ServerInstance' = $($AzureServerInstance.FullyQualifiedDomainName)
    'Username'       = $AdminLogin
    'Password'       = $InsecurePassword
    'query'          = "Update Document SET CachedText = '$text' WHERE id = $ResumeID"
}
Invoke-Sqlcmd @params -ErrorAction "SilentlyContinue"
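
For comparison, one alternative to calling Invoke-Sqlcmd once per row is to keep a single connection open and reuse a parameterized UPDATE. This is only a sketch (the connection string details are an assumption), but it would also remove the need for the manual quote-doubling:

# Open one connection up front (connection string details are an assumption)
$connectionString = "Server=tcp:$($AzureServerInstance.FullyQualifiedDomainName);Database=$TRIS5DATABASENAME;User ID=$AdminLogin;Password=$InsecurePassword;"
$connection = New-Object System.Data.SqlClient.SqlConnection($connectionString)
$connection.Open()

$command = $connection.CreateCommand()
$command.CommandText = "UPDATE Document SET CachedText = @text WHERE id = @id"
[void]$command.Parameters.Add("@text", [System.Data.SqlDbType]::NVarChar, -1)   # -1 means NVARCHAR(MAX)
[void]$command.Parameters.Add("@id", [System.Data.SqlDbType]::Int)

# Per document, inside the existing loop:
$command.Parameters["@text"].Value = $text
$command.Parameters["@id"].Value = $resumeID
[void]$command.ExecuteNonQuery()

# After the loop:
$connection.Close()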









  • Your Get-Content is super slow too. Use the -Raw parameter to get the text as a string. – wOxxOm, Sep 29 '18 at 7:02

  • Thanks for the response, can you explain why this would be faster? I did a test by creating 10000 text files in a folder and running both Get-Content and Get-Content -Raw on all of them; it took 7 seconds on both runs. – Owain Esau, Oct 1 '18 at 1:21

  • Those were smallish files, I guess. Get-Content without -Raw creates an array of lines, so with really large files the difference could be insane. This is a known peculiarity. – wOxxOm, Oct 1 '18 at 4:23

  • Thanks, this made a small difference; the files aren't very large. They are resumes, so 1-2 page documents. – Owain Esau, Oct 2 '18 at 0:07

time-limit-exceeded powershell pdf ms-word azure






asked Sep 28 '18 at 5:19, edited Sep 28 '18 at 7:06 – Owain Esau


1 Answer

I would try to use bulk copy to load all of the IDs and their CachedText at once into a staging table in Azure, and then do a single update on your Document table.



CREATE TABLE document
(
    docKey     BIGINT IDENTITY(1, 1) PRIMARY KEY,
    CachedText NVARCHAR(MAX),
    id         INT
);
CREATE TABLE document_stage
(
    CachedText NVARCHAR(MAX),
    id         INT
);


As you iterate over the files, you create a PSObject with the properties you want in your SQL table and add it to a collection. Then, after all files are done (or at set batching limits), you can use Out-DataTable to convert the collection into a data table, let SqlBulkCopy upload it to the stage table in one batch, and a single UPDATE will update your primary table.



UPDATE Document
SET
CachedText = stg.CachedText
FROM document_stage stg
WHERE document.id = stg.id;


PS script



$files = Get-ChildItem -force -recurse $documentFolder -include *.doc, *.pdf, *.docx

$stagedDataAsArray = @()
foreach ($file in $files) {

    $fileName = $file.name
    $4ID = $fileName.split('-')[1].replace(' ', '').replace(".txt", "")
    $text = Get-Content "$($file.FullName)"
    $text = $text.replace("'", "''")

    $resumeID = $IDCheck | where {$_.OldID -eq $4id} | Select-Object OriginalResumeID
    $resumeID = $resumeID.OriginalResumeID

    <# create the row and add it to our collection #>
    $fileInstance = New-Object -TypeName psobject
    $fileInstance | add-member -type NoteProperty -Name cachedText -Value $text
    $fileInstance | add-member -type NoteProperty -Name resumeID -Value $resumeID
    $stagedDataAsArray += $fileInstance
    Remove-Item -Force "$env:TEMP$fileName"
}

$stagedDataAsTable = $stagedDataAsArray | Out-DataTable
$cn = new-object System.Data.SqlClient.SqlConnection("YOUR AZURE DB CONNECTION STRING")
$cn.Open()
$bc = new-object ("System.Data.SqlClient.SqlBulkCopy") $cn
$bc.DestinationTableName = "dbo.document_stage"
$bc.WriteToServer($stagedDataAsTable)
$cn.Close()

$params = @{
    'Database'       = $TRIS5DATABASENAME
    'ServerInstance' = $($AzureServerInstance.FullyQualifiedDomainName)
    'Username'       = $AdminLogin
    'Password'       = $InsecurePassword
    'query'          = "UPDATE Document
                        SET CachedText = stg.CachedText
                        FROM document_stage stg
                        WHERE document.id = stg.id;"
}
Invoke-Sqlcmd @params -ErrorAction "SilentlyContinue"
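
One caveat with the loop above: $stagedDataAsArray += $fileInstance copies the whole array on every append, which itself gets expensive over hundreds of thousands of files, and Out-DataTable is an external helper you would need to have available. A variant that avoids both by filling a System.Data.DataTable directly (a sketch only, assuming the same two columns and column order as document_stage):

$stagedDataAsTable = New-Object System.Data.DataTable
[void]$stagedDataAsTable.Columns.Add("CachedText", [string])
[void]$stagedDataAsTable.Columns.Add("id", [int])

foreach ($file in $files) {
    # ... derive $text and $resumeID exactly as in the loop above ...
    [void]$stagedDataAsTable.Rows.Add($text, $resumeID)
}

# $stagedDataAsTable can then be passed straight to $bc.WriteToServer(...)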





– Brandon McClure
