
Extract Text and Images from Word Document in Java


In this blog post, I’ll introduce an easy way to extract text and images from a Word document in a Java application.

Required Library

Free Spire.Doc for Java

Before using the code below, we need to download Free Spire.Doc for Java and import the Spire.Doc.jar file into our project. For a Maven project, you can refer to this online tutorial to install Free Spire.Doc for Java from the Maven repository.

Extract Text

Free Spire.Doc for Java provides a getText method in the Document class, which returns the text of a Word document as a string.

import com.spire.doc.Document;

import java.io.FileWriter;
import java.io.IOException;

public class ReadText{
    public static void main(String[] args) throws IOException {
        //load Word document        
        Document document = new Document();
        document.loadFromFile("C:\\Users\\Administrator\\Desktop\\sample.docx");

        //get text from document as string        
        String text=document.getText();

        //write string to a .txt file        
        writeStringToTxt(text, "ExtractedText.txt");
    }
    public static void writeStringToTxt(String content, String txtFileName) throws IOException {

        FileWriter fWriter = new FileWriter(txtFileName, true);
        try {
            fWriter.write(content);
        } catch (IOException ex) {
            ex.printStackTrace();
        } finally {
            try {
                fWriter.flush();
                fWriter.close();
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }
    }
}
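
The writeStringToTxt helper above can also be written with try-with-resources, which closes the writer automatically even when writing fails. The following is only a sketch of that alternative and is not part of Free Spire.Doc: the class name TextFileWriter is hypothetical, the UTF-8 encoding is my assumption (FileWriter falls back to the platform default encoding on older Java versions), and unlike the helper above it overwrites an existing file instead of appending to it.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class, used here for illustration only
final class TextFileWriter {

    // write the content to the given file as UTF-8, replacing any existing content
    static void writeStringToTxt(String content, String txtFileName) throws IOException {
        Path target = Paths.get(txtFileName);
        // try-with-resources flushes and closes the writer for us
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(content);
        }
    }

    public static void main(String[] args) throws IOException {
        // in the real program this string would come from document.getText()
        String text = "Sample text standing in for the extracted document text";
        writeStringToTxt(text, "ExtractedText.txt");
    }
}
```

Because the method no longer catches IOException itself, a failed write reaches the caller instead of only being printed to the console.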

Extract Images

Extracting images is a little more complicated than extracting text. We need to loop through the objects in the document, find the image objects and then extract them.

import com.spire.doc.Document;
import com.spire.doc.documents.DocumentObjectType;
import com.spire.doc.fields.DocPicture;
import com.spire.doc.interfaces.ICompositeObject;
import com.spire.doc.interfaces.IDocumentObject;


import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;

public class ReadTextAndImages {
    public static void main(String[] args) throws IOException {
        //load word document 
        Document document = new Document();
        document.loadFromFile("C:\\Users\\Administrator\\Desktop\\sample.docx");

        //create a Queue object 
        Queue<ICompositeObject> nodes = new LinkedList<ICompositeObject>();

        nodes.add(document);

        //create a List object 
        List<BufferedImage> images = new ArrayList<BufferedImage>();

        //loop through the child objects of the document 
        while (nodes.size() > 0) {
            ICompositeObject node = nodes.poll();

            for (int i = 0; i < node.getChildObjects().getCount(); i++) {
                IDocumentObject child = node.getChildObjects().get(i);
                if (child instanceof ICompositeObject) {
                    nodes.add((ICompositeObject) child);

                    if (child.getDocumentObjectType() == DocumentObjectType.Picture) {
                        DocPicture picture = (DocPicture) child;
                        images.add(picture.getImage());
                    }
                }
            }
        }
        //save images 
        for (int i = 0; i < images.size(); i++) {
            File file = new File(String.format("output/extractImageAndText-%d.png", i));
            ImageIO.write(images.get(i), "PNG", file);
        }
    }
}
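
The saving loop above writes into an output folder and ignores the return value of ImageIO.write. If the folder does not exist, the write typically fails with an IOException, and ImageIO.write returns false when no suitable writer is found. Below is a minimal sketch of a more defensive saving step. It uses only the standard Java library: the class name ImageSaver, the method saveAll and the file name pattern are hypothetical, and the blank BufferedImage in main is a placeholder for the images collected from the document.

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, used here for illustration only
final class ImageSaver {

    // save every image as a PNG file in outputDir and return the files written
    static List<File> saveAll(List<BufferedImage> images, Path outputDir) throws IOException {
        // create the folder (and missing parents); does nothing if it already exists
        Files.createDirectories(outputDir);

        List<File> written = new ArrayList<>();
        for (int i = 0; i < images.size(); i++) {
            File file = outputDir.resolve(String.format("extractedImage-%d.png", i)).toFile();
            // ImageIO.write returns false when no writer supports the format
            if (!ImageIO.write(images.get(i), "PNG", file)) {
                throw new IOException("No PNG writer available for image " + i);
            }
            written.add(file);
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // placeholder image; the real list would be filled by the traversal loop
        List<BufferedImage> images = new ArrayList<>();
        images.add(new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB));

        for (File file : saveAll(images, Paths.get("output"))) {
            System.out.println("Saved " + file.getPath());
        }
    }
}
```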

