Extract Text and Images from Word Document in Java

In this post, I’ll introduce an easy way to extract text and images from a Word document in a Java application.

Required Library

Free Spire.Doc for Java

Before running the code below, download Free Spire.Doc for Java and add the Spire.Doc.jar file to your project as a dependency. For a Maven project, you can instead install Free Spire.Doc for Java from the Maven repository by following the library's online tutorial.

Extract Text

Free Spire.Doc for Java provides a getText method in the Document class, which returns the text of a Word document as a string.

import com.spire.doc.Document;

import java.io.FileWriter;
import java.io.IOException;

public class ReadText{
    public static void main(String[] args) throws IOException {
        //load Word document
        Document document = new Document();
        document.loadFromFile("C:\\Users\\Administrator\\Desktop\\sample.docx");

        //get text from document as string
        String text = document.getText();

        //write string to a .txt file
        writeStringToTxt(text, "ExtractedText.txt");
    }

    public static void writeStringToTxt(String content, String txtFileName) throws IOException {
        //open the file in append mode, so repeated runs add to the existing content
        FileWriter fWriter = new FileWriter(txtFileName, true);
        try {
            fWriter.write(content);
        } catch (IOException ex) {
            ex.printStackTrace();
        } finally {
            try {
                fWriter.flush();
                fWriter.close();
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }
    }
}
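
The writeStringToTxt helper above closes the writer by hand in a finally block. As a sketch of an alternative, the same step can be written with try-with-resources, which closes the writer automatically. The class name WriteTextSketch, the Path parameter, the UTF-8 encoding and the placeholder text are assumptions for illustration and are not part of Spire.Doc. Unlike the helper above, this version overwrites the file instead of appending to it.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class WriteTextSketch {

    // Writes content to the target file as UTF-8, replacing any existing content.
    public static void writeStringToTxt(String content, Path target) throws IOException {
        // try-with-resources flushes and closes the writer even if write() throws
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(content);
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder standing in for the string returned by document.getText()
        String text = "Sample text extracted from a Word document.";
        Path target = Paths.get("ExtractedText.txt");
        writeStringToTxt(text, target);
        System.out.println("Wrote " + Files.size(target) + " bytes to " + target);
    }
}
```

In a real project, the placeholder string would be replaced by the result of document.getText(). Any IOException is passed on to the caller here rather than printed and swallowed, so a failed write does not go unnoticed.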

Extract Images

Extracting images is a little more complicated than extracting text. We need to loop through the objects in the document, find the picture objects, and then get the image from each one.

import com.spire.doc.Document;
import com.spire.doc.documents.DocumentObjectType;
import com.spire.doc.fields.DocPicture;
import com.spire.doc.interfaces.ICompositeObject;
import com.spire.doc.interfaces.IDocumentObject;


import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;

public class ReadTextAndImages {
    public static void main(String[] args) throws IOException {
        //load word document 
        Document document = new Document();
        document.loadFromFile("C:\\Users\\Administrator\\Desktop\\sample.docx");

        //queue of composite objects whose children still need to be visited
        Queue<ICompositeObject> nodes = new LinkedList<ICompositeObject>();
        nodes.add(document);

        //list that collects the extracted images
        List<BufferedImage> images = new ArrayList<BufferedImage>();

        //visit the objects level by level until no unvisited composite objects remain
        while (!nodes.isEmpty()) {
            ICompositeObject node = nodes.poll();

            for (int i = 0; i < node.getChildObjects().getCount(); i++) {
                IDocumentObject child = node.getChildObjects().get(i);
                if (child instanceof ICompositeObject) {
                    nodes.add((ICompositeObject) child);

                    if (child.getDocumentObjectType() == DocumentObjectType.Picture) {
                        DocPicture picture = (DocPicture) child;
                        images.add(picture.getImage());
                    }
                }
            }
        }
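        //save images as PNG files in the "output" folder
        File outputDir = new File("output");
        outputDir.mkdirs(); //ImageIO.write fails if the folder does not exist
        for (int i = 0; i < images.size(); i++) {
            File file = new File(outputDir, String.format("extractImageAndText-%d.png", i));
            ImageIO.write(images.get(i), "PNG", file);
        }
    }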
}
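
The queue-based loop above is a breadth-first traversal of the document tree. To show the idea without the library, here is a minimal sketch in plain Java. The Node class, the type strings and the sample tree are hypothetical stand-ins for Spire.Doc's document objects, not its real API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class TraversalSketch {

    // Hypothetical stand-in for a document object; not a Spire.Doc class.
    public static class Node {
        final String type;
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String type, String name) {
            this.type = type;
            this.name = name;
        }

        // Adds a child and returns this node so calls can be chained.
        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // Breadth-first search that collects the names of all "Picture" nodes.
    public static List<String> findPictures(Node root) {
        List<String> found = new ArrayList<>();
        Queue<Node> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            Node node = queue.poll();
            for (Node child : node.children) {
                if ("Picture".equals(child.type)) {
                    found.add(child.name);
                }
                // Enqueue every child so its own children are visited later
                queue.add(child);
            }
        }
        return found;
    }

    // Builds a small tree: one section with two paragraphs and a table.
    public static Node sampleTree() {
        Node paragraph1 = new Node("Paragraph", "p1").add(new Node("Picture", "logo"));
        Node cellParagraph = new Node("Paragraph", "p-in-cell").add(new Node("Picture", "chart"));
        Node table = new Node("Table", "t1").add(new Node("Cell", "c1").add(cellParagraph));
        Node paragraph2 = new Node("Paragraph", "p2").add(new Node("Picture", "photo"));
        Node section = new Node("Section", "s1").add(paragraph1).add(table).add(paragraph2);
        return new Node("Document", "doc").add(section);
    }

    public static void main(String[] args) {
        System.out.println(findPictures(sampleTree())); // prints [logo, photo, chart]
    }
}
```

Because the traversal goes level by level, the picture nested inside the table cell is found last, even though the table comes before the second paragraph. With a queue-based loop like this, the numbering of the saved images may therefore differ from their reading order in the document.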
