Replacing image in a PDF with Python

Being a freelancer is an interesting role. You come across a variety of projects. I recently worked on a project that involved replacing images in a PDF, and it taught me a few things.

  1. While there are a number of tools for dealing with PDFs in Python, the general-purpose tools can only do so much because of reason 2.
  2. A PDF is a dump of instructions to put things in specific places. It has no logical structure that would let general-purpose tools manipulate it in a consistent way.
  3. Not everything is bad. Additive changes like adding text or an image, whole-page changes like rotating or cropping, and read operations like text and image extraction are almost always possible.
  4. The trouble starts when you want to delete something and replace it with something else.

With that learnt, I set out to achieve the goal anyway.

Step 1 – Understanding the format

Humans invented the PDF format, which means they used words to describe things in the file, which means we can read them. So opening a PDF file in a text editor like VIM will show something like this.

[Screenshot: PDF in VIM]

Without getting into the entirety of the PDF spec, let us see what this means. A PDF is a collection of objects. Each object usually has an identifier of the form int int obj, followed by some metadata, and then a stream of binary data that starts with stream and ends with endstream. The object is closed by endobj. An image in our case would be represented as

16 0 obj
<< /Length 17 0 R /Type /XObject /Subtype /Image /Width 242 /Height 291 /Interpolate
true /ColorSpace 7 0 R /Intent /Perceptual /BitsPerComponent 8 /Filter /DCTDecode
>>
stream
Image binary data here like ÿØÿá^@VExif^@^@MM^@*^@^@^@^H^@^D^A^Z^@^E^@^@
endstream
endobj

So to successfully replace an image we will have to replace the image binary data and the metadata like width and height.
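To make that structure concrete, here is a small sketch that pulls the identifier, width, height and filter out of an object like the one above using regular expressions. The names image_metadata and SAMPLE are made up for illustration, and real PDF dictionaries can be laid out in ways this naive pattern matching does not handle.

```python
import re

# The image object from above, with a placeholder instead of real JPEG bytes.
SAMPLE = (
    b"16 0 obj\n"
    b"<< /Length 17 0 R /Type /XObject /Subtype /Image /Width 242 /Height 291 /Interpolate\n"
    b"true /ColorSpace 7 0 R /Intent /Perceptual /BitsPerComponent 8 /Filter /DCTDecode\n"
    b">>\n"
    b"stream\n"
    b"...image binary data...\n"
    b"endstream\n"
    b"endobj\n"
)


def image_metadata(obj_bytes):
    """Return the identifier and a few dictionary entries of one image object."""
    # Everything before the first "stream" keyword is identifier plus metadata.
    header = obj_bytes.split(b"stream", 1)[0]
    ident = re.match(rb"\s*(\d+) (\d+) obj", header)
    width = re.search(rb"/Width\s+(\d+)", header)
    height = re.search(rb"/Height\s+(\d+)", header)
    filt = re.search(rb"/Filter\s*/(\w+)", header)
    return {
        "object": (int(ident.group(1)), int(ident.group(2))),
        "width": int(width.group(1)),
        "height": int(height.group(1)),
        "filter": filt.group(1).decode("ascii"),
    }


print(image_metadata(SAMPLE))
```

The /DCTDecode filter means the stream holds JPEG data, so these are the values that must change together with the binary data.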

Step 2 – Uncompressing the PDF and extracting the images

Use a PDF manipulation toolkit called PDFtk.

pdftk sample.pdf output uncompressed.pdf uncompress

This command uncompresses the file, which makes it easier to read and manipulate. Let us open uncompressed.pdf in VIM to see the difference.

[Screenshot: uncompressed PDF in VIM]

Step 3 – Identifying the image to replace

A PDF is essentially a collection of objects, and a single file might contain multiple images. There is no way to pick out a particular image by looking at the binary data of the PDF file (unless you are from the Matrix). We first have to extract the images from the PDF and then match each PDF object to its image using metadata like height and width. To do that, install the pdfimages command-line tool (part of poppler-utils) and run pdfimages -list uncompressed.pdf. This lists all the images in the PDF with their metadata.

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     277   185  icc     3   8  jpeg   yes       11  0   113   113 69.2K  46%
   1     1 image     277   185  icc     3   8  jpeg   yes       10  0   113   113 31.9K  21%
   1     2 image     242   291  icc     3   8  jpeg   yes       12  0   112   112 55.2K  27%
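If there are many images, the listing can be filtered with a few lines of Python. The sketch below parses the table exactly as printed above and keeps the rows with a given width and height. The function names are made up for illustration, and the column layout is assumed from this listing; other versions of pdfimages may print different columns.

```python
LISTING = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     277   185  icc     3   8  jpeg   yes       11  0   113   113 69.2K  46%
   1     1 image     277   185  icc     3   8  jpeg   yes       10  0   113   113 31.9K  21%
   1     2 image     242   291  icc     3   8  jpeg   yes       12  0   112   112 55.2K  27%
"""


def parse_listing(text):
    """Turn the table printed by `pdfimages -list` into a list of dicts."""
    rows = []
    for line in text.splitlines()[2:]:  # skip the header and the dashes
        fields = line.split()
        if len(fields) != 16:  # "object ID" is two columns: number, generation
            continue
        rows.append({
            "page": int(fields[0]),
            "num": int(fields[1]),
            "width": int(fields[3]),
            "height": int(fields[4]),
            "enc": fields[8],
            "object": int(fields[10]),
            "generation": int(fields[11]),
            "size": fields[14],
        })
    return rows


def matching(rows, width, height):
    """Rows whose dimensions match the image we are looking for."""
    return [r for r in rows if (r["width"], r["height"]) == (width, height)]


for row in matching(parse_listing(LISTING), 277, 185):
    print(row["num"], row["object"], row["size"])
```

For this listing, two rows share the 277 x 185 dimensions, so the size column is what tells them apart.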

Next extract all the images in their original formats using

pdfimages -all uncompressed.pdf image

That extracts the files and names them after the prefix we provided, like this: image-000.jpg, image-001.jpg, image-002.jpg.

Now open your images, check each file's height, width and file size, and note the details of the one to replace. In my case the file details were:

  • height – 185
  • width – 277
  • size – 70836

There are two images that match the height and width; thankfully they have different file sizes.
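The same check can be scripted instead of opening each file by hand. Below is a minimal sketch that reads the width and height from a JPEG's start-of-frame segment and reports them with the file size. The function names are made up for illustration, the parser only handles ordinary JPEG files, and an imaging library would be the more robust choice.

```python
import struct
from pathlib import Path

# Start-of-frame markers; the segment after one of these holds the dimensions.
SOF_MARKERS = {0xC0, 0xC1, 0xC2, 0xC3, 0xC5, 0xC6, 0xC7,
               0xC9, 0xCA, 0xCB, 0xCD, 0xCE, 0xCF}


def jpeg_dimensions(data):
    """Return (width, height) read from the first start-of-frame segment."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG file")
    pos = 2
    while pos + 9 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xFF:  # padding byte before the real marker
            pos += 1
            continue
        (seg_len,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker in SOF_MARKERS:
            # Payload layout: precision (1 byte), height (2), width (2), ...
            height, width = struct.unpack(">HH", data[pos + 5:pos + 9])
            return width, height
        pos += 2 + seg_len  # skip the marker and its payload
    raise ValueError("no start-of-frame segment found")


def describe(path):
    """Return (width, height, size in bytes) for one extracted JPEG."""
    data = Path(path).read_bytes()
    width, height = jpeg_dimensions(data)
    return width, height, len(data)


if __name__ == "__main__":
    # Assumes the files extracted with the "image" prefix are in this folder.
    for path in sorted(Path(".").glob("image-*.jpg")):
        print(path.name, *describe(path))
```

The size in bytes printed here is the value to look for in the PDF in the next step.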

Step 4 – Identifying the object in PDF that represents the image

I opened uncompressed.pdf in VIM and searched for the most distinctive value I had found for the image: its size.

[Screenshot: identifying the image object]

Now we can read off the object identifier; in this case it is 11 0 obj.
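The search can also be done in Python rather than in an editor. The sketch below assumes that the size shows up as a direct /Length value in the object's dictionary, which is how I read the search result above; the function name and the DEMO bytes are made up for illustration.

```python
import re

# One match per "N G obj ... endobj" block.
OBJECT_RE = re.compile(rb"(\d+) (\d+) obj\b(.*?)endobj", re.DOTALL)


def find_object_by_length(pdf_bytes, length):
    """Return (number, generation) of the first object whose /Length is length."""
    wanted = re.compile(
        rb"/Length\s+" + str(length).encode("ascii") + rb"(?!\d)"
    )
    for match in OBJECT_RE.finditer(pdf_bytes):
        # Only look at the dictionary, not at the binary stream after it.
        header = match.group(3).split(b"stream", 1)[0]
        if wanted.search(header):
            return int(match.group(1)), int(match.group(2))
    return None


# Tiny stand-in for a PDF; a real file would be read with open(path, "rb").
DEMO = (
    b"10 0 obj\n<< /Length 32660 /Subtype /Image >>\nstream\nAAAA\nendstream\nendobj\n"
    b"11 0 obj\n<< /Length 70836 /Subtype /Image >>\nstream\nBBBB\nendstream\nendobj\n"
)
print(find_object_by_length(DEMO, 70836))
```

Binary stream data could in principle contain the text endobj and confuse the pattern, so treat the result as a hint and confirm it in the file.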

Step 5 – Replacing the image with another image

Now the job is to switch the object 11’s image data with our image’s data. You can use the following Python script to achieve that.

Download the file, change the OBJECT_ID value, save the file and run:

python process.py <your pdf> <new image>
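The script itself is not reproduced here, so the following is only a rough sketch of the idea and not necessarily what process.py does. It swaps the stream of one object for new JPEG bytes and updates /Width, /Height and /Length to match. The function names are made up, generation number 0 is assumed, and the width and height of the new image must be supplied by the caller.

```python
import re


def replace_image(pdf, obj_num, jpeg, width, height):
    """Return a copy of pdf (bytes) with the stream of object obj_num replaced."""
    # Locate "<obj_num> 0 obj"; generation 0 is an assumption.
    ident = re.search(rb"(?<!\d)%d 0 obj\b" % obj_num, pdf)
    if ident is None:
        raise ValueError("object %d not found" % obj_num)
    stream_kw = pdf.index(b"stream", ident.end())
    header = pdf[ident.end():stream_kw]
    if b"endobj" in header:
        raise ValueError("object %d has no stream" % obj_num)
    data_end = pdf.index(b"endstream", stream_kw + len(b"stream"))

    # Update the metadata so that it describes the new image.
    header = re.sub(rb"/Width\s+\d+", b"/Width %d" % width, header)
    header = re.sub(rb"/Height\s+\d+", b"/Height %d" % height, header)
    header = re.sub(rb"/Length\s+\d+(\s+\d+\s+R)?",
                    b"/Length %d" % len(jpeg), header)

    return (pdf[:ident.end()] + header + b"stream\n" + jpeg + b"\n"
            + pdf[data_end:])


def replace_image_in_file(pdf_path, image_path, out_path, obj_num, width, height):
    """Read the PDF and the new image, write the modified PDF to out_path."""
    # Binary mode ("rb"/"wb") matters on Python 3: a PDF is not text.
    with open(pdf_path, "rb") as f:
        pdf = f.read()
    with open(image_path, "rb") as f:
        jpeg = f.read()
    with open(out_path, "wb") as f:
        f.write(replace_image(pdf, obj_num, jpeg, width, height))
```

Because the object keeps its /DCTDecode filter, the new image should itself be a JPEG, and it is safest if it uses the same colour space as the old one. Changing the stream length also shifts the byte offsets of everything after it, so the cross-reference table goes stale; many viewers seem to tolerate or repair this, but a strict one may report the file as damaged.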

I just used one of the extracted images to replace another one. Here are the before and after images.

[Screenshot: PDF with the image replaced]

Step 6 – Compressing the file back (OPTIONAL)

Do this only if you really need to do it for some reason. It is usually cool to just use the uncompressed file.

pdftk uncompressed.pdf output replaced.pdf compress

Author: Arunmozhi

Arunmozhi is a freelance programmer and an open-source enthusiast.

4 thoughts on “Replacing image in a PDF with Python”

  1. Hello,
    Very good work !
    I’m trying to replicate your example but I get a corrupted PDF.
    How should the image format be?
    Thank you and congratulations for your code.

  2. I tried to run your code but it gives me:
    > TypeError: a bytes-like object is required, not ‘str’
    which is referring to:
    > f.write(contents[:start])

  3. I don’t remember whether I wrote the script in Python 2 or 3. From the error you posted, I think I used Python 2, where opening a file in “r” mode reads bytes. You are using Python 3, where a file opened in “r” mode is read as a string. Try specifying “rb” as the open mode instead of “r” and see if that solves your problem.
