Extract Text From PDF in Java

This article explains how to extract text from a Pdf file in Java.
It covers three approaches:
  1. Extract All Text from a Pdf
  2. Read/Extract Text from a Specific Rectangle Area in a Pdf Page
  3. Read/Extract Text using SimpleTextExtractionStrategy
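The Pdf library we need: the one that provides the com.spire.pdf package imported below.
The example Pdf file: Additional.pdf, which all three examples load.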

Sample Code

Imported Packages
import com.spire.pdf.*;
import com.spire.pdf.exporting.text.SimpleTextExtractionStrategy;
import java.awt.geom.Rectangle2D;
import java.io.*;
Read/Extract All Text from a Pdf
//Instantiate a PdfDocument object
PdfDocument pdf = new PdfDocument();
//Load the Pdf file
pdf.loadFromFile("Additional.pdf");

StringBuilder sb = new StringBuilder();

//Extract text from every page of the Pdf
for (PdfPageBase page: (Iterable<PdfPageBase>) pdf.getPages()) {
    sb.append(page.extractText(true));
}

//Write the text into a .txt file; try-with-resources closes the writer
try (FileWriter writer = new FileWriter("ExtractText.txt")) {
    writer.write(sb.toString());
} catch (IOException e) {
    e.printStackTrace();
}

//Close the PdfDocument object
pdf.close();
Output: the extracted text is written to ExtractText.txt.
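Text extracted from a Pdf often contains non-ASCII characters, and FileWriter uses the platform default charset unless told otherwise. The sketch below shows one way to write the text with an explicit UTF-8 encoding using only the standard library. The class TextFileSketch and the helpers saveText and loadText are hypothetical names introduced for illustration, and the string stands in for the StringBuilder contents from the example above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class TextFileSketch {

    // Hypothetical helper (not part of the Pdf library): writes text as UTF-8.
    static void saveText(String text, Path target) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    // Hypothetical helper: reads the file back as UTF-8 to check what was written.
    static String loadText(Path source) throws IOException {
        return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the text collected from the Pdf pages.
        StringBuilder sb = new StringBuilder();
        sb.append("Stand-in for extracted text: caf\u00e9");

        Path out = Paths.get("ExtractText.txt");
        saveText(sb.toString(), out);

        // Confirm that the file content matches what was written.
        System.out.println(loadText(out).equals(sb.toString()));
    }
}
```

Files.write opens, writes, and closes the file in one call, so no explicit flush or close is needed.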
Read/Extract Text from a Specific Rectangle Area in a Pdf Page
//Instantiate a PdfDocument object
PdfDocument pdf = new PdfDocument();
//Load the Pdf file
pdf.loadFromFile("Additional.pdf");

//Get the first page of the Pdf
PdfPageBase page = pdf.getPages().get(0);

//Instantiate a Rectangle2D object
Rectangle2D rect = new Rectangle2D.Float();
//Set location (x, y) and size (width, height)
rect.setFrame(50, 50, 500, 100);

//Extract text from the given rectangle area in the first page
StringBuilder sb = new StringBuilder();
sb.append(page.extractText(rect));

//Write the text into a .txt file; try-with-resources closes the writer
try (FileWriter writer = new FileWriter("ExtractText.txt")) {
    writer.write(sb.toString());
} catch (IOException e) {
    e.printStackTrace();
}

//Close the PdfDocument object
pdf.close();
Output: the text found inside the rectangle is written to ExtractText.txt.
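The four arguments to setFrame in the example above are x, y, width, and height, in that order. The short sketch below uses only java.awt.geom to show which region those values describe. The class RectangleSketch and the method exampleRegion are hypothetical names used for illustration. How the Pdf library maps this rectangle onto the page (origin and units) is not shown here.

```java
import java.awt.geom.Rectangle2D;

class RectangleSketch {

    // Builds the same rectangle as the example above: x=50, y=50, width=500, height=100.
    static Rectangle2D exampleRegion() {
        Rectangle2D rect = new Rectangle2D.Float();
        rect.setFrame(50, 50, 500, 100);
        return rect;
    }

    public static void main(String[] args) {
        Rectangle2D rect = exampleRegion();

        // The rectangle spans x from 50 to 550 and y from 50 to 150.
        System.out.println("x range: " + rect.getMinX() + " to " + rect.getMaxX());
        System.out.println("y range: " + rect.getMinY() + " to " + rect.getMaxY());

        // A point inside the region, then a point to the right of it.
        System.out.println(rect.contains(60, 60));   // true
        System.out.println(rect.contains(600, 60));  // false
    }
}
```

Checking the bounds this way before extraction can help when the returned text is empty because the rectangle misses the intended region.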
Read/Extract Text using SimpleTextExtractionStrategy
//Instantiate a PdfDocument object
PdfDocument pdf = new PdfDocument();
//Load the Pdf file
pdf.loadFromFile("Additional.pdf");

//Get the first page of the Pdf
PdfPageBase page = pdf.getPages().get(0);

//Extract text from the first page using SimpleTextExtractionStrategy
SimpleTextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
StringBuilder sb = new StringBuilder();
sb.append(page.extractText(strategy));

//Write the text into a .txt file; try-with-resources closes the writer
try (FileWriter writer = new FileWriter("ExtractText.txt")) {
    writer.write(sb.toString());
} catch (IOException e) {
    e.printStackTrace();
}

//Close the PdfDocument object
pdf.close();
Output: the text of the first page is written to ExtractText.txt.
