In this blog post, I’ll introduce an easy way to extract
text and images from a Word document in a Java application.
Required Library
Free Spire.Doc for Java
Before using the code below, we need to download Free
Spire.Doc for Java and import the Spire.Doc.jar file into our
project. For a Maven project, you can refer to this online
tutorial to install Free Spire.Doc for Java from the Maven
repository.
Extract Text
Free Spire.Doc for Java provides a getText method in the
Document class, which we can use to get the text of a Word document as a string.
import com.spire.doc.Document;

import java.io.FileWriter;
import java.io.IOException;

public class ReadText {

    public static void main(String[] args) throws IOException {
        //load Word document
        Document document = new Document();
        document.loadFromFile("C:\\Users\\Administrator\\Desktop\\sample.docx");

        //get text from document as string
        String text = document.getText();

        //write string to a .txt file
        writeStringToTxt(text, "ExtractedText.txt");
    }

    public static void writeStringToTxt(String content, String txtFileName) throws IOException {
        //open in append mode (second argument); try-with-resources flushes and closes the writer
        try (FileWriter fWriter = new FileWriter(txtFileName, true)) {
            fWriter.write(content);
        }
    }
}
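One thing worth keeping in mind: FileWriter uses the platform's default character encoding, so accented or non-Latin text from a Word document may not survive the trip on every machine. Below is a minimal sketch of a variant that writes the string as UTF-8 using only the standard library. The class name Utf8TextWriter and the method writeUtf8 are hypothetical names chosen for this example, and the sample string simply stands in for the value returned by document.getText(). Unlike the helper above, this version overwrites an existing file instead of appending to it.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class Utf8TextWriter {

    // Hypothetical helper (not part of Spire.Doc): writes content to target as UTF-8.
    // Files.newBufferedWriter creates the file or truncates an existing one by default.
    static void writeUtf8(String content, Path target) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(content);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the string returned by document.getText()
        String text = "na\u00efve caf\u00e9\nSecond paragraph";

        Path target = Paths.get("ExtractedText.txt");
        writeUtf8(text, target);
        System.out.println("Wrote " + Files.size(target) + " bytes to " + target);
    }
}
```

To use it with the example above, the call to writeStringToTxt could be swapped for writeUtf8(text, Paths.get("ExtractedText.txt")).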
Extract Images
Extracting images is a little more complicated than extracting text.
We need to loop through the objects in the document, find the image objects, and
then extract them.
import com.spire.doc.Document;
import com.spire.doc.documents.DocumentObjectType;
import com.spire.doc.fields.DocPicture;
import com.spire.doc.interfaces.ICompositeObject;
import com.spire.doc.interfaces.IDocumentObject;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;

public class ReadTextAndImages {

    public static void main(String[] args) throws IOException {
        //load Word document
        Document document = new Document();
        document.loadFromFile("C:\\Users\\Administrator\\Desktop\\sample.docx");

        //create a Queue object that holds the objects still to be visited
        Queue<ICompositeObject> nodes = new LinkedList<ICompositeObject>();
        nodes.add(document);

        //create a List object that collects the extracted images
        List<BufferedImage> images = new ArrayList<BufferedImage>();

        //loop through the child objects of the document
        while (!nodes.isEmpty()) {
            ICompositeObject node = nodes.poll();
            for (int i = 0; i < node.getChildObjects().getCount(); i++) {
                IDocumentObject child = node.getChildObjects().get(i);
                if (child instanceof ICompositeObject) {
                    nodes.add((ICompositeObject) child);
                    if (child.getDocumentObjectType() == DocumentObjectType.Picture) {
                        DocPicture picture = (DocPicture) child;
                        images.add(picture.getImage());
                    }
                }
            }
        }

        //make sure the output folder exists, then save images
        new File("output").mkdirs();
        for (int i = 0; i < images.size(); i++) {
            File file = new File(String.format("output/extractImageAndText-%d.png", i));
            ImageIO.write(images.get(i), "PNG", file);
        }
    }
}
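The loop above is a breadth-first walk over the document's object tree, driven by a queue. To make the traversal itself easier to see, here is a small sketch that runs the same idea on a hand-built tree without Spire.Doc. The Node class, its type and name fields, and the collect method are all hypothetical stand-ins invented for this illustration; they are not part of the library.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

class TreeWalkSketch {

    // Hypothetical stand-in for a document object; not a Spire.Doc class.
    static class Node {
        final String type;
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String type, String name) {
            this.type = type;
            this.name = name;
        }

        // Adds a child and returns this node so calls can be chained.
        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // Breadth-first walk: visit the tree level by level and collect
    // the names of all descendants whose type matches wantedType.
    static List<String> collect(Node root, String wantedType) {
        List<String> found = new ArrayList<>();
        Queue<Node> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            Node node = queue.poll();
            for (Node child : node.children) {
                if (wantedType.equals(child.type)) {
                    found.add(child.name);
                }
                queue.add(child); // visit this child's own children later
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Node doc = new Node("Document", "doc")
                .add(new Node("Paragraph", "p1")
                        .add(new Node("Picture", "logo")))
                .add(new Node("Table", "t1")
                        .add(new Node("Cell", "c1")
                                .add(new Node("Paragraph", "p2")
                                        .add(new Node("Picture", "chart")))))
                .add(new Node("Paragraph", "p3")
                        .add(new Node("Picture", "photo")));

        // prints [logo, photo, chart]
        System.out.println(collect(doc, "Picture"));
    }
}
```

Note that the picture nested inside the table comes out last even though the table appears before the final paragraph. A breadth-first walk returns shallower objects first, so the numbering of the saved image files may not match the order in which the images appear on the page.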