
How to Convert PDF to Text

Converting PDF to Text:

Converting PDF to text is a common step in integration workflows. It automates data extraction and makes PDF content available for further processing.

Common Questions: 

  1. My PDF file has 2 pages, but only the first page is read.

  2. How do I find the number of pages in a PDF document?

  3. How do I split a PDF file?

Implementation: 

  1. First, download the following jar files using the links below:

  • fontbox-2.0.16-javadoc.jar

  • itextpdf-5.5.13.jar

  • pdfbox-2.0.16-javadoc.jar

https://mvnrepository.com/artifact/com.itextpdf/itextpdf/5.5.13

https://javadoc.io/doc/org.apache.pdfbox/pdfbox/2.0.26/index.html

Note: the scripts in this post import only iText classes. A -javadoc jar holds API documentation rather than compiled classes, so if you plan to call PDFBox or FontBox from a script, upload their binary jars instead.

  2. Once you have downloaded the jar files, upload them to Account Libraries.

Go to Account > Setup > Account Libraries.

  3. Once the files are uploaded, go to the Build tab and create a Custom Library component.

  4. Set the Custom Library type to Scripting, add the 3 jars mentioned above, and deploy the library to the Atom.

Boomi Process for Converting PDF to Text:

  1. In the Boomi process, once you receive the PDF file, add a Data Process shape to count the pages and split them using the code provided below. 

import com.itextpdf.text.Document; 
import com.itextpdf.text.pdf.PdfReader; 
import com.itextpdf.text.pdf.PdfCopy; 
import java.io.InputStream; 
import java.io.ByteArrayOutputStream; 
import java.io.ByteArrayInputStream; 
import java.util.ArrayList; 

// Function to split PDF pages 

ArrayList splitPdfPages(byte[] pdfBytes) { 
    PdfReader reader = new PdfReader(new ByteArrayInputStream(pdfBytes)); 
    int numPages = reader.getNumberOfPages(); 
    ArrayList pdfPages = new ArrayList(); 
    for (int i = 1; i <= numPages; i++) { 
        Document document = new Document(); 
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream(); 
        PdfCopy copy = new PdfCopy(document, outputStream); 
        document.open(); 
        copy.addPage(copy.getImportedPage(reader, i)); 
        document.close(); 
        pdfPages.add(outputStream); 
    } 
    reader.close(); 
    return pdfPages;
}

// Loop through each document in the data context 

for (int i = 0; i < dataContext.getDataCount(); i++) { 
    InputStream pdfInputStream = dataContext.getStream(i); 
    // Convert InputStream to ByteArray for reuse 
    ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream(); 
    byte[] buffer = new byte[1024]; 
    int bytesRead; 
    while ((bytesRead = pdfInputStream.read(buffer)) != -1) { 
        byteArrayOutputStream.write(buffer, 0, bytesRead); 
    } 
    byte[] pdfBytes = byteArrayOutputStream.toByteArray(); 


    // Get the total page count of the PDF 
    PdfReader reader = new PdfReader(new ByteArrayInputStream(pdfBytes)); 
    int totalPageCount = reader.getNumberOfPages();
    reader.close(); 

    // Split the PDF into individual pages 
    ArrayList pdfPages = splitPdfPages(pdfBytes); 

    // Store each split page as a separate document in the data context 
    for (int j = 0; j < pdfPages.size(); j++) { 
        ByteArrayOutputStream pdfPage = pdfPages.get(j); 

        // Set the total page count of the original document as a Dynamic Document Property 

        Properties props = dataContext.getProperties(i); 
        props.setProperty("document.dynamic.userdefined.PageCount", String.valueOf(totalPageCount));

        // Store the split page in the data context 
        dataContext.storeStream(new ByteArrayInputStream(pdfPage.toByteArray()), props); 
    }
}

  2. The page count is assigned to the Dynamic Document Property (DDP) PageCount.

  3. Use the code below to convert each split PDF page to text.
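
A later script can read this property back to decide how to treat a document. The sketch below is a minimal, stand-alone Java illustration, not part of the original process. It assumes the property key set by the split script above. The class name PageCountCheck and the helpers readPageCount and isMultiPage are hypothetical. In a Boomi script the Properties object would come from dataContext.getProperties(i) instead of being built by hand.

```java
import java.util.Properties;

class PageCountCheck {
    // Key written by the split script for the PageCount Dynamic Document Property.
    static final String PAGE_COUNT_KEY = "document.dynamic.userdefined.PageCount";

    // Hypothetical helper: returns the stored page count,
    // or 0 when the property is missing or is not a number.
    static int readPageCount(Properties props) {
        String raw = props.getProperty(PAGE_COUNT_KEY);
        if (raw == null) {
            return 0;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return 0;
        }
    }

    // Hypothetical helper: true when the original PDF had more than one page.
    static boolean isMultiPage(Properties props) {
        return readPageCount(props) > 1;
    }

    public static void main(String[] args) {
        // Stand-in for the properties of one document in the data context.
        Properties props = new Properties();
        props.setProperty(PAGE_COUNT_KEY, "2");
        System.out.println("Pages: " + readPageCount(props));
        System.out.println("Multi-page: " + isMultiPage(props));
    }
}
```

Falling back to 0 for a missing or malformed value is a design choice for this sketch; a real process might prefer to fail the document instead.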

import java.util.Properties;
import java.io.InputStream;
import java.io.ByteArrayInputStream;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

for (int i = 0; i < dataContext.getDataCount(); i++) {
    InputStream is = dataContext.getStream(i);
    Properties props = dataContext.getProperties(i);

    // Convert the InputStream to a PdfReader
    PdfReader reader = new PdfReader(is);

    // Extract the text with PdfTextExtractor. Each incoming document is a
    // single split page, so only page 1 needs to be read.
    String textFromPage = PdfTextExtractor.getTextFromPage(reader, 1);
    reader.close();

    // Convert the text back to an InputStream and store it
    is = new ByteArrayInputStream(textFromPage.getBytes());
    dataContext.storeStream(is, props);
}
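
One detail in the last step of the script above is worth a closer look: String.getBytes() with no argument uses the JVM's default charset, which can differ from one runtime to another. The sketch below is a stand-alone Java illustration of passing the charset explicitly. UTF-8 is an assumption here, and the class name TextToStream and the helpers toStream and readAll are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class TextToStream {
    // Hypothetical helper: encode extracted text as UTF-8 before storing it.
    static InputStream toStream(String text) {
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
    }

    // Hypothetical helper: read a stream fully and decode it as UTF-8,
    // using the same buffered loop as the split script.
    static String readAll(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            int bytesRead;
            while ((bytesRead = in.read(buffer)) != -1) {
                out.write(buffer, 0, bytesRead);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // \u20ac is the euro sign, a character outside ASCII.
        String text = "Invoice total: 42 \u20ac";
        String roundTrip = readAll(toStream(text));
        System.out.println("Round trip preserved text: " + roundTrip.equals(text));
    }
}
```

Whether UTF-8 is the right choice depends on what the downstream shapes expect, so treat the charset as something to confirm for your own process.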

Example: 

In this example, I read the PDF files from disk and used a Decision shape to route them by page count. Files with more than one page went down the false path, where they were split into single pages. I then used a Data Process shape to convert each split page to text, and once all the pages were converted, the resulting text documents had to be merged back into one.
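
The final merge can be pictured as joining the page texts in page order. The following is only a sketch in plain Java of that idea, not the merge used in the process itself. The class name PageMerger, the method mergePages, and the newline separator are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

class PageMerger {
    // Hypothetical helper: joins page texts in list order,
    // placing the separator between pages but not after the last one.
    static String mergePages(List<String> pageTexts, String separator) {
        StringBuilder merged = new StringBuilder();
        for (int i = 0; i < pageTexts.size(); i++) {
            if (i > 0) {
                merged.append(separator);
            }
            merged.append(pageTexts.get(i));
        }
        return merged.toString();
    }

    public static void main(String[] args) {
        // Stand-ins for the text extracted from each split page.
        List<String> pages = Arrays.asList("Text of page 1", "Text of page 2");
        System.out.println(mergePages(pages, "\n"));
    }
}
```

The sketch assumes the pages arrive in their original order; if they might not, a page-number property would be needed to sort them before merging.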
