Converting PDF to Text:
PDF-to-text conversion is often a required step in integration workflows: it automates data extraction and makes PDF content available for further processing.
Common Questions:
- My PDF file has 2 pages, but only the first page is read.
- How do I find the number of pages in a PDF document?
- How do I split a PDF file into separate pages?
Implementation:
1. Download the following JAR files from the links below:
- fontbox-2.0.16-javadoc.jar
- itextpdf-5.5.13.jar
- pdfbox-2.0.16-javadoc.jar
https://mvnrepository.com/artifact/com.itextpdf/itextpdf/5.5.13
https://javadoc.io/doc/org.apache.pdfbox/pdfbox/2.0.26/index.html
2. Once you have downloaded the JAR files, upload them to the account libraries: go to Account > Setup > Account Libraries.
3. Once the files are uploaded, go to the Build tab and create a custom library.
4. Set the custom library type to Scripting, add the three JAR files listed above, and deploy the library to the Atom.
Boomi Process for Converting PDF to Text:
1. In the Boomi process, once you receive the PDF file, add a Data Process shape that counts the pages and splits them into separate documents using the code below.
import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfCopy;
import java.io.InputStream;
import java.io.ByteArrayOutputStream;
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.Properties;

// Split a PDF into single-page PDFs, each returned as a ByteArrayOutputStream
ArrayList splitPdfPages(byte[] pdfBytes) {
    PdfReader reader = new PdfReader(new ByteArrayInputStream(pdfBytes));
    int numPages = reader.getNumberOfPages();
    ArrayList pdfPages = new ArrayList();
    // iText page numbers start at 1
    for (int i = 1; i <= numPages; i++) {
        Document document = new Document();
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        PdfCopy copy = new PdfCopy(document, outputStream);
        document.open();
        copy.addPage(copy.getImportedPage(reader, i));
        document.close();
        pdfPages.add(outputStream);
    }
    reader.close();
    return pdfPages;
}

// Loop through each document in the data context
for (int i = 0; i < dataContext.getDataCount(); i++) {
    InputStream pdfInputStream = dataContext.getStream(i);
    // Read the InputStream into a byte array so it can be reused
    ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
    byte[] buffer = new byte[1024];
    int bytesRead;
    while ((bytesRead = pdfInputStream.read(buffer)) != -1) {
        byteArrayOutputStream.write(buffer, 0, bytesRead);
    }
    byte[] pdfBytes = byteArrayOutputStream.toByteArray();
    // Get the total page count of the PDF
    PdfReader reader = new PdfReader(new ByteArrayInputStream(pdfBytes));
    int totalPageCount = reader.getNumberOfPages();
    reader.close();
    // Split the PDF into individual pages
    ArrayList pdfPages = splitPdfPages(pdfBytes);
    // Set the total page count of the original document as a Dynamic Document Property
    Properties props = dataContext.getProperties(i);
    props.setProperty("document.dynamic.userdefined.PageCount", String.valueOf(totalPageCount));
    // Store each split page as a separate document in the data context
    for (int j = 0; j < pdfPages.size(); j++) {
        ByteArrayOutputStream pdfPage = pdfPages.get(j);
        dataContext.storeStream(new ByteArrayInputStream(pdfPage.toByteArray()), props);
    }
}
2. The page count is assigned to the Dynamic Document Property (DDP) PageCount.
3. Use the code below to convert each split page from PDF to text.
import java.util.Properties;
import java.io.InputStream;
import java.io.ByteArrayInputStream;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

for (int i = 0; i < dataContext.getDataCount(); i++) {
    InputStream is = dataContext.getStream(i);
    Properties props = dataContext.getProperties(i);
    // Open the incoming PDF document with PdfReader
    PdfReader reader = new PdfReader(is);
    // Extract the text of page 1 only; after the split, each document holds a single page
    String textFromPage = PdfTextExtractor.getTextFromPage(reader, 1);
    reader.close();
    // Store the extracted text as the new document content, encoded as UTF-8
    is = new ByteArrayInputStream(textFromPage.getBytes("UTF-8"));
    dataContext.storeStream(is, props);
}
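The split script stores the page count under the key document.dynamic.userdefined.PageCount, so a later script can read it back from the document's Properties object. The following is a minimal sketch in plain Java of how that value might be parsed safely. The class PageCountReader and its methods are hypothetical names used for illustration only; they are not part of Boomi or iText, and the fallback of 0 for a missing or malformed value is an assumption.

```java
import java.util.Properties;

// Hypothetical helper for illustration; not part of Boomi or iText.
final class PageCountReader {
    // Key used by the split script in this article.
    static final String PAGE_COUNT_KEY = "document.dynamic.userdefined.PageCount";

    // Returns the stored page count, or 0 if the property is missing or not a number
    // (the 0 fallback is an assumption made for this sketch).
    static int readPageCount(Properties props) {
        String raw = props.getProperty(PAGE_COUNT_KEY);
        if (raw == null) {
            return 0;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return 0;
        }
    }

    // True when the original PDF had more than one page.
    static boolean isMultiPage(Properties props) {
        return readPageCount(props) > 1;
    }
}
```

Inside a Data Process script, the Properties object would come from dataContext.getProperties(i), as in the scripts above.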
Example:
In this example, I read the PDF files from disk and used a Decision shape to separate them by page count. If a PDF file had more than one page, it went down the false path. I then used a Data Process shape to convert each split page to text, and once all the pages were converted, merged the text back into a single document.
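The example does not show how the page texts are merged. As one possible approach, here is a minimal sketch in plain Java that joins the per-page strings in order with a separator between them. The class PageTextMerger, the merge method, and the choice of separator are all assumptions made for illustration; in a Boomi process the same result may be reached with a built-in combine step instead.

```java
import java.util.List;

// Hypothetical helper for illustration; not part of Boomi or iText.
final class PageTextMerger {
    // Joins page texts in order, writing the separator between pages only.
    // A null page text is treated as an empty page.
    static String merge(List<String> pageTexts, String separator) {
        StringBuilder merged = new StringBuilder();
        for (int i = 0; i < pageTexts.size(); i++) {
            if (i > 0) {
                merged.append(separator);
            }
            String text = pageTexts.get(i);
            merged.append(text == null ? "" : text);
        }
        return merged.toString();
    }
}
```

For instance, calling merge with the two page texts of a two-page PDF and "\n" as the separator yields the first page's text, a line break, then the second page's text.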