Java - Extract Images from PDF and Compress PDF Image Size

In document processing, PDF files often contain many images: scanned documents, product manuals, report attachments, contract images, and so on. Sometimes we need to extract images from a PDF for archiving, recognition, or further editing. In other cases, high-resolution images make a PDF file too large, and the images need to be compressed.

This article uses Spire.PDF for Java to demonstrate two common operations:

Extract images from a PDF document in Java
Compress high-quality images in a PDF document in Java

1. Preparation

The sample code uses a Maven project. First, add the Spire.PDF for Java dependency to your pom.xml file.

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>

<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf</artifactId>
        <version>12.6.1</version>
    </dependency>
</dependencies>

If you are using a regular Java project, you can instead add the corresponding JAR file to the project's classpath manually. For demonstration purposes, the following code uses local file paths; adjust them to match your own directory structure.

2. Extract Images from a PDF in Java

1. Implementation Logic

The core process of extracting images is straightforward:

Load the PDF file;
Traverse each page of the PDF;
Get the image information on the current page;
Save the image object as a local image file.

In Spire.PDF, you can use PdfImageHelper to get image information from a PDF page, then use PdfImageInfo.getImage() to obtain a BufferedImage object, and finally use ImageIO.write() to write it to a local file.

2. Sample Code

import com.spire.pdf.PdfDocument;
import com.spire.pdf.PdfPageBase;
import com.spire.pdf.utilities.PdfImageHelper;
import com.spire.pdf.utilities.PdfImageInfo;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

public class ExtractPdfImages {

    public static void main(String[] args) throws IOException {
        String inputPdf = "C:/pdf/input.pdf";
        String outputDir = "C:/pdf/images/";

        PdfDocument document = new PdfDocument();

        try {
            // Load the PDF file
            document.loadFromFile(inputPdf);

            // Create the output directory
            File dir = new File(outputDir);
            if (!dir.exists()) {
                dir.mkdirs();
            }

            PdfImageHelper imageHelper = new PdfImageHelper();
            int imageIndex = 1;

            // Traverse each page of the PDF
            for (int pageIndex = 0; pageIndex < document.getPages().getCount(); pageIndex++) {
                PdfPageBase page = document.getPages().get(pageIndex);

                // Get all images from the current page
                PdfImageInfo[] imageInfos = imageHelper.getImagesInfo(page);

                // Save images
                for (PdfImageInfo imageInfo : imageInfos) {
                    BufferedImage image = imageInfo.getImage();

                    String fileName = String.format(
                            "page-%d-image-%d.png",
                            pageIndex + 1,
                            imageIndex++
                    );

                    // Resolve against the directory so the path works with or without a trailing slash
                    File outputFile = new File(dir, fileName);
                    ImageIO.write(image, "PNG", outputFile);
                }
            }

            System.out.println("Image extraction completed. Total images exported: " + (imageIndex - 1));

        } finally {
            document.dispose();
        }
    }
}

3. Code Explanation

Instead of using a simple name like image-1.png, the code includes the page number in the file name, for example:

page-1-image-1.png
page-2-image-3.png

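This makes it easier to locate the source page of each extracted image later. When a PDF has many pages, sequential numbers alone make it hard to tell where each image came from. Note that the image number is a running count across the whole document and is not reset on each page, so page-2-image-3.png is the third image extracted overall, found on page 2.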
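If the extracted files need to sort in page order in a file manager, one option is to zero-pad the numbers. The following is a minimal sketch of that variation, not part of the sample above; the class name ImageFileNames and the method build are hypothetical, and the three-digit padding is an assumption that suits documents with fewer than 1000 pages and images.

```java
import java.util.Locale;

final class ImageFileNames {

    // Builds a name such as "page-002-image-003.png".
    // Locale.ROOT keeps the digits ASCII regardless of the default locale.
    static String build(int pageNumber, int imageNumber, String extension) {
        return String.format(Locale.ROOT, "page-%03d-image-%03d.%s",
                pageNumber, imageNumber, extension);
    }

    public static void main(String[] args) {
        System.out.println(build(2, 3, "png"));
    }
}
```

In the extraction loop, this helper could replace the String.format call, passing pageIndex + 1 and imageIndex++ as before.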

In this example, the extracted images are saved in PNG format. PNG is lossless, so large or photographic images can produce large output files. If the images are only used for preview or web display, you can save them as JPG instead.

For example:

ImageIO.write(image, "JPG", outputFile);

However, JPG is a lossy format and is more suitable for photos. It is not ideal for text screenshots, table screenshots, or images with transparent backgrounds. If you switch formats, also change the extension in the file name pattern from .png to .jpg.
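Because JPG has no alpha channel, an image with transparency may be rejected or rendered with wrong colors by the JDK's JPEG writer, depending on the JDK version. The sketch below uses only standard javax.imageio and java.awt classes: it flattens the image onto a white background and writes it with an explicit quality setting. The class name JpegExportSketch, its method names, and the white background are illustrative assumptions, not part of Spire.PDF.

```java
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Iterator;

final class JpegExportSketch {

    // Paints the source onto an opaque white RGB canvas, dropping any transparency.
    static BufferedImage toOpaqueRgb(BufferedImage source) {
        BufferedImage rgb = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = rgb.createGraphics();
        try {
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, rgb.getWidth(), rgb.getHeight());
            g.drawImage(source, 0, 0, null);
        } finally {
            g.dispose();
        }
        return rgb;
    }

    // Writes the image as JPEG; quality ranges from 0.0 (smallest) to 1.0 (best).
    static void writeJpeg(BufferedImage image, File outputFile, float quality) throws IOException {
        Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("jpg");
        if (!writers.hasNext()) {
            throw new IOException("No JPEG writer available");
        }
        ImageWriter writer = writers.next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(quality);

        // Remove any existing file first so stale bytes cannot remain at its end.
        Files.deleteIfExists(outputFile.toPath());
        try (ImageOutputStream out = ImageIO.createImageOutputStream(outputFile)) {
            writer.setOutput(out);
            writer.write(null, new IIOImage(toOpaqueRgb(image), null, null), param);
        } finally {
            writer.dispose();
        }
    }
}
```

In the extraction loop, a call such as JpegExportSketch.writeJpeg(image, outputFile, 0.8f) could stand in for ImageIO.write. The value 0.8f is only a starting point to tune against your own images.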

3. Compress Images in a PDF in Java

1. Applicable Scenarios

When a PDF file is too large, images are often one of the main reasons. For example:

The resolution of scanned documents is too high;
A report contains many high-definition images;
Images are inserted directly into the PDF without compression;
The PDF has only a few pages, but the file size reaches dozens of MB.

In such cases, you can try compressing the images in the PDF to reduce the overall file size.

However, image compression may reduce image quality to some extent. If the PDF is used for printing, archiving, or formal delivery, it is recommended to back up the original file first and compress a copy instead.
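The backup step can be sketched with the standard java.nio.file API. This is one possible approach under stated assumptions: the class name PdfBackupSketch, the method backupOriginal, and the ".bak" suffix are all hypothetical choices. An existing backup is never overwritten, so running the compression several times keeps the first, untouched copy.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class PdfBackupSketch {

    // Copies "input.pdf" to "input.pdf.bak" in the same directory, unless a backup already exists.
    static Path backupOriginal(Path inputPdf) throws IOException {
        Path backup = inputPdf.resolveSibling(inputPdf.getFileName() + ".bak");
        if (Files.notExists(backup)) {
            Files.copy(inputPdf, backup, StandardCopyOption.COPY_ATTRIBUTES);
        }
        return backup;
    }
}
```

Calling PdfBackupSketch.backupOriginal(Paths.get(inputPdf)) before loading the document would leave a copy of the original next to it.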

2. Sample Code

import com.spire.pdf.FileFormat;
import com.spire.pdf.PdfDocument;
import com.spire.pdf.PdfPageBase;
import com.spire.pdf.utilities.PdfImageHelper;
import com.spire.pdf.utilities.PdfImageInfo;

import java.io.File;

public class CompressPdfImages {

    public static void main(String[] args) {
        String inputPdf = "C:/pdf/input.pdf";
        String outputPdf = "C:/pdf/output-compressed.pdf";

        PdfDocument document = new PdfDocument();

        try {
            // Load the PDF file
            document.loadFromFile(inputPdf);

            // Disable incremental update to avoid abnormal file size increase after saving
            document.getFileInfo().setIncrementalUpdate(false);

            PdfImageHelper imageHelper = new PdfImageHelper();

            // Traverse each page and compress images on the page
            for (int pageIndex = 0; pageIndex < document.getPages().getCount(); pageIndex++) {
                PdfPageBase page = document.getPages().get(pageIndex);

                PdfImageInfo[] imageInfos = imageHelper.getImagesInfo(page);

                for (PdfImageInfo imageInfo : imageInfos) {
                    imageInfo.tryCompressImage();
                }
            }

            // Save the compressed PDF
            document.saveToFile(outputPdf, FileFormat.PDF);

            printFileSize(inputPdf, outputPdf);

        } finally {
            document.dispose();
        }
    }

    private static void printFileSize(String beforePath, String afterPath) {
        File before = new File(beforePath);
        File after = new File(afterPath);

        double beforeMb = before.length() / 1024.0 / 1024.0;
        double afterMb = after.length() / 1024.0 / 1024.0;

        System.out.printf("Before compression: %.2f MB%n", beforeMb);
        System.out.printf("After compression: %.2f MB%n", afterMb);
    }
}

4. Explanation of `setIncrementalUpdate(false)`

The compression sample contains this line:

document.getFileInfo().setIncrementalUpdate(false);

It is recommended to keep this step.

When a PDF is saved, it may use incremental update mode. This means that the modified content is appended to the end of the original file instead of rewriting the entire file. This approach is suitable for some editing scenarios, but when compressing images, it may cause old data to remain in the file. As a result, you may find that the images have been compressed, but the PDF file size has not decreased significantly.

Therefore, when compressing images in a PDF, disabling incremental update can make the saved result better match the expected compressed output.

5. Frequently Asked Questions

1. Why can’t some images be extracted from a PDF?

Not everything that looks like an image is necessarily an image object in a PDF.

For example:

The page content may be vector graphics;
Text and graphics may be generated by drawing instructions;
A scanned document may use one large image for the entire page;
Images in some PDFs may be specially encoded or packaged.

Therefore, in real projects, you should test with PDFs from different sources instead of judging compatibility based on only one sample file.

2. What if the PDF file size does not change much after compression?

There may be several reasons:

The main file size does not come from images, but from fonts, attachments, or other resources;
The images have already been compressed;
There are not many images in the PDF;
The original PDF has a special saving structure;
The compression ratio is limited, so the size change is not obvious.

It is recommended to compare the file size before and after compression and test with real business documents as much as possible.
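To make that comparison easier to act on, the size change can be expressed as a percentage. The sketch below is plain Java with no library dependency; the class name SizeReportSketch, its method names, and the idea of a minimum threshold are assumptions for illustration only.

```java
final class SizeReportSketch {

    // Returns the size reduction in percent; a negative value means the file grew.
    static double reductionPercent(long beforeBytes, long afterBytes) {
        if (beforeBytes <= 0) {
            throw new IllegalArgumentException("Original size must be positive");
        }
        return (beforeBytes - afterBytes) * 100.0 / beforeBytes;
    }

    // True when the compressed file is smaller by at least the given percentage.
    static boolean isWorthKeeping(long beforeBytes, long afterBytes, double minPercent) {
        return reductionPercent(beforeBytes, afterBytes) >= minPercent;
    }
}
```

For example, with the File objects from printFileSize, isWorthKeeping(before.length(), after.length(), 5.0) could decide whether to keep the compressed copy or fall back to the original. The 5 percent threshold is arbitrary.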

3. Will compression affect image clarity?

It may.

The goal of image compression is to reduce file size, which usually affects image quality to some degree. If the PDF needs to be printed, stamped, or archived, it is better not to overwrite the original file. Instead, output a new compressed version.

6. Conclusion

This article demonstrated two common PDF image processing tasks in Java, extracting images and compressing images, which come down to four steps:

Use PdfImageHelper.getImagesInfo() to get image information from PDF pages;
Use PdfImageInfo.getImage() to extract images;
Use PdfImageInfo.tryCompressImage() to compress images in a PDF;
Disable incremental update when saving the PDF to avoid ineffective compression.

In real business scenarios, it is recommended to encapsulate “image extraction” and “PDF compression” as separate utility methods, and add file size checks, exception handling, directory validation, and other logic. This makes the code easier to reuse for document archiving, attachment processing, or batch PDF compression.
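As a starting point for the directory validation mentioned above, the following sketch checks the input file and prepares the output directory using only java.nio.file. The class name PdfPathChecks and both method names are hypothetical, and throwing IllegalArgumentException for a bad input path is one design choice among several.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class PdfPathChecks {

    // Fails fast with a clear message when the input is missing, is a directory, or is unreadable.
    static Path requireReadableFile(String path) {
        Path file = Paths.get(path);
        if (!Files.isRegularFile(file) || !Files.isReadable(file)) {
            throw new IllegalArgumentException("Not a readable file: " + path);
        }
        return file;
    }

    // Creates the directory and any missing parents; does nothing if it already exists.
    static Path ensureDirectory(String path) throws IOException {
        return Files.createDirectories(Paths.get(path));
    }
}
```

Both samples in this article could call these checks before loadFromFile, so that a wrong path is reported before any PDF processing starts.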
