In this article, we explain how to extract text from a PDF file in Java.
Contents:
- Read/Extract All Text from a PDF
- Read/Extract Text from a Specific Rectangle Area in a PDF Page
- Read/Extract Text using SimpleTextExtractionStrategy
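The PDF library we need: Spire.PDF for Java (the com.spire.pdf package used in the imports below).
The example PDF file: Additional.pdf (the file name loaded in each code sample).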
Sample Code
Imported Packages
import com.spire.pdf.*;
import com.spire.pdf.exporting.text.SimpleTextExtractionStrategy;
import java.awt.geom.Rectangle2D;
import java.io.*;
Read/Extract All Text from a PDF
// Instantiate a PdfDocument object
PdfDocument pdf = new PdfDocument();
// Load the PDF file
pdf.loadFromFile("Additional.pdf");
StringBuilder sb = new StringBuilder();
// Extract text from every page of the PDF
for (PdfPageBase page : (Iterable<PdfPageBase>) pdf.getPages()) {
    sb.append(page.extractText(true));
}
// Write the text to a .txt file; try-with-resources closes the writer even if writing fails
try (FileWriter writer = new FileWriter("ExtractText.txt")) {
    writer.write(sb.toString());
} catch (IOException e) {
    e.printStackTrace();
}
// Close the PdfDocument object
pdf.close();
Output: the extracted text is saved to ExtractText.txt.
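The loop above appends each page's text back to back, so page boundaries are lost in the output file. If you want to keep them visible, one option is to collect the per-page strings first and join them with a labeled separator. The sketch below uses only the standard library; `PageTextJoiner` and `joinPages` are hypothetical names introduced here for illustration, and the sample strings stand in for what `page.extractText(true)` would return.

```java
import java.util.Arrays;
import java.util.List;

class PageTextJoiner {
    // Hypothetical helper: prefixes each page's text with a "Page N" separator line
    public static String joinPages(List<String> pageTexts) {
        StringBuilder sb = new StringBuilder();
        String nl = System.lineSeparator();
        for (int i = 0; i < pageTexts.size(); i++) {
            sb.append("----- Page ").append(i + 1).append(" -----").append(nl);
            sb.append(pageTexts.get(i)).append(nl);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in strings; in the article's loop each one would come from a PDF page
        List<String> pages = Arrays.asList("First page text", "Second page text");
        System.out.print(joinPages(pages));
    }
}
```

In the extraction loop, you would add each page's text to a list instead of appending it directly to the StringBuilder, then pass the list to the helper before writing the file.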
Read/Extract Text from a Specific Rectangle Area in a PDF Page
// Instantiate a PdfDocument object
PdfDocument pdf = new PdfDocument();
// Load the PDF file
pdf.loadFromFile("Additional.pdf");
// Get the first page of the PDF
PdfPageBase page = pdf.getPages().get(0);
// Instantiate a Rectangle2D object
Rectangle2D rect = new Rectangle2D.Float();
// Set location and size: x, y, width, height
rect.setFrame(50, 50, 500, 100);
// Extract text from the given rectangle area on the first page
String text = page.extractText(rect);
// Write the text to a .txt file; try-with-resources closes the writer even if writing fails
try (FileWriter writer = new FileWriter("ExtractText.txt")) {
    writer.write(text);
} catch (IOException e) {
    e.printStackTrace();
}
// Close the PdfDocument object
pdf.close();
Output: the text found inside the rectangle is saved to ExtractText.txt.
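It is easy to misread the four numbers passed to `setFrame` as two corner points. They are x, y, width and height, so the rectangle in the sample spans x from 50 to 550 and y from 50 to 150. The short sketch below checks this with `java.awt.geom` alone; `RegionDemo` and `region` are hypothetical names, and the units and origin used by the PDF library are not covered here, since the article does not state them.

```java
import java.awt.geom.Rectangle2D;

class RegionDemo {
    // Builds a rectangle from x, y, width, height (the same argument order as setFrame)
    public static Rectangle2D region(double x, double y, double width, double height) {
        return new Rectangle2D.Double(x, y, width, height);
    }

    public static void main(String[] args) {
        Rectangle2D rect = region(50, 50, 500, 100);
        System.out.println("x range: " + rect.getMinX() + " to " + rect.getMaxX()); // 50.0 to 550.0
        System.out.println("y range: " + rect.getMinY() + " to " + rect.getMaxY()); // 50.0 to 150.0
        System.out.println(rect.contains(100, 100)); // true: the point lies inside the region
        System.out.println(rect.contains(100, 200)); // false: y = 200 is past the bottom edge at 150
    }
}
```

Printing the bounds like this before running the extraction is a quick way to confirm that the region covers the part of the page you expect.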
Read/Extract Text using SimpleTextExtractionStrategy
// Instantiate a PdfDocument object
PdfDocument pdf = new PdfDocument();
// Load the PDF file
pdf.loadFromFile("Additional.pdf");
// Get the first page of the PDF
PdfPageBase page = pdf.getPages().get(0);
// Extract text from the first page using SimpleTextExtractionStrategy
SimpleTextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
String text = page.extractText(strategy);
// Write the text to a .txt file; try-with-resources closes the writer even if writing fails
try (FileWriter writer = new FileWriter("ExtractText.txt")) {
    writer.write(text);
} catch (IOException e) {
    e.printStackTrace();
}
// Close the PdfDocument object
pdf.close();
Output: the text of the first page is saved to ExtractText.txt.
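All three samples write the result with `new FileWriter("ExtractText.txt")`, which encodes text with the platform default charset. That default is UTF-8 only from Java 18 onward, so on older runtimes non-ASCII characters from the PDF may be saved in a different encoding. A minimal sketch of writing the file as UTF-8 explicitly is shown below; `TextSaver` and `saveUtf8` are hypothetical names, and the sample string stands in for the extracted text.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class TextSaver {
    // Hypothetical helper: writes text as UTF-8, creating the file or replacing its content
    public static void saveUtf8(String text, Path target) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path out = Paths.get("ExtractText.txt");
        // Stand-in for the extracted text; \u00e9 is the letter e with an acute accent
        saveUtf8("caf\u00e9", out);
        System.out.println("Wrote " + Files.size(out) + " bytes to " + out);
    }
}
```

In the samples, the try block around FileWriter could be replaced with a call such as `saveUtf8(sb.toString(), Paths.get("ExtractText.txt"))`, keeping the same IOException handling.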