Replacing image in a PDF with Python

Being a freelancer is an interesting role. You come across a variety of projects. I recently worked on a project involving replacing images in a PDF which taught me a couple of things.

  1. While there are a number of tools for dealing with PDFs in Python, the general-purpose tools can only do so much, because of point 2.
  2. A PDF is a dump of instructions to put things in specific places. It has no logical structure that would let general-purpose tools manipulate it in a consistent way.
  3. Not everything is bad. Almost all additive changes, like adding text or an image, and whole-page changes, like rotating or cropping, are usually possible, and so are read operations such as text and image extraction.
  4. The trouble starts when you want to delete something and replace it with something else.

With that learnt, I set out to achieve the goal anyway.

Step 1 – Understanding the format

Humans invented the PDF format, which means they used words to describe things in the file, which means we can read them. So opening a PDF file in a text editor like VIM will show something like this.

[Image: PDF in VIM]

Without getting into the entirety of the PDF spec, let us see what this means. A PDF is a collection of objects. Each object usually starts with an identifier of the form "int int obj", followed by some metadata and then a stream of binary data that starts with "stream" and ends with "endstream". The object is closed with "endobj". An image in our case would be represented as

16 0 obj
<< /Length 17 0 R /Type /XObject /Subtype /Image /Width 242 /Height 291 /Interpolate
true /ColorSpace 7 0 R /Intent /Perceptual /BitsPerComponent 8 /Filter /DCTDecode
>>
stream
Image binary data here like ÿØÿá^@VExif^@^@MM^@*^@^@^@^H^@^D^A^Z^@^E^@^@
endstream
endobj

So to successfully replace an image we will have to replace the image binary data and update the metadata, such as the width, height and length.
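To make that structure concrete, here is a minimal Python sketch that pulls the identifier, width, height and stream bytes out of a single object. The SAMPLE bytes are a made-up stand-in for the object shown above (the stream content is a placeholder, not a real JPEG), and describe_image_object is a hypothetical helper name. It assumes the keyword "stream" is followed by a single newline and that the image data does not itself contain "endstream".

```python
import re

# Hypothetical sample modelled on the object above; the stream is a placeholder.
SAMPLE = (
    b"16 0 obj\n"
    b"<< /Length 17 0 R /Type /XObject /Subtype /Image /Width 242 /Height 291 "
    b"/Interpolate true /ColorSpace 7 0 R /Intent /Perceptual "
    b"/BitsPerComponent 8 /Filter /DCTDecode\n"
    b">>\n"
    b"stream\n"
    b"\xff\xd8\xff\xe1...image bytes...\n"
    b"endstream\n"
    b"endobj\n"
)


def describe_image_object(raw):
    """Return the identifier, width, height and stream bytes of one object."""
    # Everything before the first "stream\n" is the identifier plus metadata.
    header, _, rest = raw.partition(b"stream\n")
    obj_id = re.match(rb"\s*(\d+) (\d+) obj", header)
    width = re.search(rb"/Width (\d+)", header)
    height = re.search(rb"/Height (\d+)", header)
    # The binary data runs up to the newline that precedes "endstream".
    data = rest[: rest.find(b"\nendstream")]
    return {
        "object": (int(obj_id.group(1)), int(obj_id.group(2))),
        "width": int(width.group(1)),
        "height": int(height.group(1)),
        "data": data,
    }


info = describe_image_object(SAMPLE)
print(info["object"], info["width"], info["height"], len(info["data"]))
```

The three things that have to change together when swapping an image are all visible here: the width, the height and the bytes between "stream" and "endstream".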

Step 2 – Uncompressing the PDF and extracting the images

Use a PDF manipulation toolkit called PDFtk.

pdftk sample.pdf output uncompressed.pdf uncompress

This command uncompresses the file, which makes it easier to read and manipulate. Let us open uncompressed.pdf in VIM to see the difference.

[Image: uncompressed pdf]
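If you would rather drive this step from Python, the command can be wrapped with subprocess. This is only a sketch: pdftk_command and run_pdftk are hypothetical helper names, and the argument order is copied from the command above. It assumes pdftk is installed and on your PATH.

```python
import shutil
import subprocess


def pdftk_command(src, dst, mode):
    """Build the pdftk command line used in this article."""
    if mode not in ("uncompress", "compress"):
        raise ValueError("mode must be 'uncompress' or 'compress'")
    return ["pdftk", src, "output", dst, mode]


def run_pdftk(src, dst, mode):
    """Run pdftk, failing early if the executable is not available."""
    if shutil.which("pdftk") is None:
        raise RuntimeError("pdftk was not found on PATH")
    # check=True raises CalledProcessError if pdftk exits with an error.
    subprocess.run(pdftk_command(src, dst, mode), check=True)


# Only print the command here; call run_pdftk(...) to actually execute it.
print(" ".join(pdftk_command("sample.pdf", "uncompressed.pdf", "uncompress")))
```

The same helper would cover the optional compress call in Step 6 by passing "compress" as the mode.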

Step 3 – Identifying the image to replace

A PDF is essentially a collection of objects, and a file might contain multiple images. There is no way to pick out a particular image from the binary data alone (unless you are from the Matrix). We first have to extract the images from the PDF and then match each PDF object to an image using metadata such as height and width. To do that, install the pdfimages command-line tool (part of poppler-utils) and run pdfimages -list uncompressed.pdf. This lists all the images in the PDF with their metadata.

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     277   185  icc     3   8  jpeg   yes       11  0   113   113 69.2K  46%
   1     1 image     277   185  icc     3   8  jpeg   yes       10  0   113   113 31.9K  21%
   1     2 image     242   291  icc     3   8  jpeg   yes       12  0   112   112 55.2K  27%
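A listing like this is easy to filter in Python. The sketch below parses the exact text shown above and keeps the rows with a given width and height. parse_listing and candidates are hypothetical helpers, and the parser assumes the 16-column layout of this listing, which may differ in other poppler versions. Note that in this listing the "object" column already appears to carry the object number (11 for the 69.2K image), which agrees with what Step 4 finds by hand.

```python
LISTING = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     277   185  icc     3   8  jpeg   yes       11  0   113   113 69.2K  46%
   1     1 image     277   185  icc     3   8  jpeg   yes       10  0   113   113 31.9K  21%
   1     2 image     242   291  icc     3   8  jpeg   yes       12  0   112   112 55.2K  27%
"""


def parse_listing(text):
    """Turn the rows of the listing into dictionaries."""
    rows = []
    for line in text.splitlines()[2:]:  # skip the header and the dashed rule
        fields = line.split()
        if len(fields) != 16:
            continue  # not a row in the layout shown above
        rows.append({
            "page": int(fields[0]),
            "num": int(fields[1]),
            "width": int(fields[3]),
            "height": int(fields[4]),
            "object": int(fields[10]),
            "generation": int(fields[11]),
            "size": fields[14],
        })
    return rows


def candidates(rows, width, height):
    """Keep only the images with the requested dimensions."""
    return [r for r in rows if r["width"] == width and r["height"] == height]


rows = parse_listing(LISTING)
for row in candidates(rows, 277, 185):
    print(row["num"], row["object"], row["size"])
```

Two rows survive the filter for 277 x 185, so dimensions alone are not enough here and the file size is still needed to tell them apart.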

Next extract all the images in their original formats using

pdfimages -all uncompressed.pdf image

That extracts the files and names them after the prefix we provided, like this: image-000.jpg image-001.jpg image-002.jpg.

Now open your images, check each file's height, width and file size, and note the details of the one to replace. In my case the file details were:

  • height – 185
  • width – 277
  • size – 70836

Two images match this height and width; thankfully they have different file sizes.
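Checking file sizes by hand gets tedious with many images, so here is a small sketch that finds the extracted files with a given byte size. files_with_size is a hypothetical helper, and the demo uses throwaway files of zero bytes in place of the real extracted images. The 32666 size for the second file is made up for illustration; only 70836 comes from the details above.

```python
import glob
import os
import tempfile


def files_with_size(pattern, size):
    """Return the files matching `pattern` whose size in bytes equals `size`."""
    return sorted(p for p in glob.glob(pattern) if os.path.getsize(p) == size)


# Demo with dummy files standing in for the output of pdfimages -all.
with tempfile.TemporaryDirectory() as tmp:
    for name, size in [("image-000.jpg", 70836), ("image-001.jpg", 32666)]:
        with open(os.path.join(tmp, name), "wb") as f:
            f.write(b"\0" * size)
    matches = files_with_size(os.path.join(tmp, "image-*.jpg"), 70836)
    match_names = [os.path.basename(p) for p in matches]

print(match_names)
```

On real data you would call files_with_size("image-*.jpg", 70836) in the directory where the images were extracted.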

Step 4 – Identifying the object in PDF that represents the image

I opened uncompressed.pdf in VIM and searched for the most distinctive value I had for the image: its size.

[Image: identifying the image object]

Now we can identify the object identifier, in this case it is 11 0 obj.
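The same search can be done in Python instead of VIM. The sketch below looks for a "/Length" entry with the given size and then walks back to the nearest object header before it. find_objects_by_length is a hypothetical helper and SAMPLE is a made-up miniature of an uncompressed PDF. It assumes the length is written as a plain number in the object's metadata, which is what the search in VIM relies on as well.

```python
import re

# Made-up miniature of an uncompressed PDF with two image objects.
SAMPLE = (
    b"%PDF-1.3\n"
    b"10 0 obj\n<<\n/Length 32666\n/Width 277\n/Height 185\n>>\n"
    b"stream\n...\nendstream\nendobj\n"
    b"11 0 obj\n<<\n/Length 70836\n/Width 277\n/Height 185\n>>\n"
    b"stream\n...\nendstream\nendobj\n"
)


def find_objects_by_length(pdf_bytes, length):
    """Return (number, generation) for objects declaring /Length == length."""
    # (?!\d) stops 7083 from matching inside 70836.
    pattern = re.compile(rb"/Length %d(?!\d)" % length)
    found = []
    for hit in pattern.finditer(pdf_bytes):
        # Look only at the text between the previous "endobj" and the hit.
        start = max(pdf_bytes.rfind(b"endobj", 0, hit.start()), 0)
        header = re.search(rb"(\d+) (\d+) obj", pdf_bytes[start:hit.start()])
        if header:
            found.append((int(header.group(1)), int(header.group(2))))
    return found


print(find_objects_by_length(SAMPLE, 70836))
```

For a real file, read it with open("uncompressed.pdf", "rb") and pass the bytes in. If more than one object comes back, fall back to comparing the width and height as well.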

Step 5 – Replacing the image with another image

Now the job is to switch the object 11’s image data with our image’s data. You can use the following Python script to achieve that.


import os
import sys

from PIL import Image

# Include the \n to ensure an exact match and avoid partial matches from 111, 211...
OBJECT_ID = b"\n11 0 obj"


def replace_image(filepath, new_image):
    # Read the PDF as bytes, since it contains binary image data
    with open(filepath, "rb") as f:
        contents = f.read()

    with Image.open(new_image) as image:
        width, height = image.size
    length = os.path.getsize(new_image)

    start = contents.find(OBJECT_ID)
    if start == -1:
        sys.exit("Object not found: " + OBJECT_ID.strip().decode())
    stream = contents.find(b"stream", start)
    image_beginning = stream + 7  # skip past "stream" and the newline after it

    # Process the metadata and update with new image's details
    meta = contents[start:image_beginning].split(b"\n")
    new_meta = []
    for item in meta:
        if b"/Width" in item:
            new_meta.append("/Width {0}".format(width).encode())
        elif b"/Height" in item:
            new_meta.append("/Height {0}".format(height).encode())
        elif b"/Length" in item:
            new_meta.append("/Length {0}".format(length).encode())
        else:
            new_meta.append(item)
    new_meta = b"\n".join(new_meta)

    # Find the end location (keep the newline before "endstream")
    image_end = contents.find(b"endstream", stream) - 1

    # Read the new image
    with open(new_image, "rb") as f:
        new_image_data = f.read()

    # Recreate the PDF file with the new image
    with open(filepath, "wb") as f:
        f.write(contents[:start])
        f.write(b"\n")
        f.write(new_meta)
        f.write(new_image_data)
        f.write(contents[image_end:])


if __name__ == "__main__":
    if len(sys.argv) == 3:
        replace_image(sys.argv[1], sys.argv[2])
    else:
        print("Usage: python process.py <pdf_file> <new_image>")

Save the script as process.py, change the OBJECT_ID value to the identifier of your object, and run:

python process.py <your pdf> <new image>

I just used one of the extracted images to replace another one. So here are the before and after images.

[Image: image replaced pdf]
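If the result does not open, one thing worth checking is whether the declared length still agrees with the data that was written. The sketch below is a hypothetical sanity check, not part of the script above: it reads the "/Length" value of the object and verifies that "endstream" sits exactly that many bytes after "stream". It assumes the length is written as a plain number and that "stream" is followed by a single newline. GOOD and BAD are made-up miniature files.

```python
import re


def length_is_consistent(pdf_bytes, object_id=b"11 0 obj"):
    """Check that `endstream` sits exactly /Length bytes after `stream`."""
    start = pdf_bytes.find(b"\n" + object_id)
    if start == -1:
        raise ValueError("object not found")
    stream = pdf_bytes.find(b"stream\n", start)
    if stream == -1:
        return False
    declared = re.search(rb"/Length (\d+)", pdf_bytes[start:stream])
    if declared is None:
        return False
    # Jump over the data using the declared length, then expect "endstream".
    data_end = stream + len(b"stream\n") + int(declared.group(1))
    tail = pdf_bytes[data_end:data_end + 12]
    return tail.lstrip(b"\r\n").startswith(b"endstream")


GOOD = b"%PDF-1.3\n11 0 obj\n<<\n/Length 5\n>>\nstream\nHELLO\nendstream\nendobj\n"
BAD = GOOD.replace(b"/Length 5", b"/Length 9")
print(length_is_consistent(GOOD), length_is_consistent(BAD))
```

Jumping ahead by the declared length, rather than searching for "endstream", avoids being fooled by image data that happens to contain that word.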

Step 6 – Compressing the file back (OPTIONAL)

Do this only if you really need to do it for some reason. It is usually cool to just use the uncompressed file.

pdftk uncompressed.pdf output replaced.pdf compress

Author: Arunmozhi

Arunmozhi is a freelance programmer and an open-source enthusiast.

15 thoughts on “Replacing image in a PDF with Python”

  1. Hello,
    Very good work !
    I’m trying to replicate your example but I get a corrupted PDF.
    How should the image format be?
    Thank you and congratulations for your code.

      1. Hi, I want to replace one image in a PDF (the same image appears N times in the file) with another. Is it a simple replacement for a logo in a PDF? How should the Python script change? Thanks

  2. I tried to run your code but it gives me:
    > TypeError: a bytes-like object is required, not ‘str’
    which is referring to:
    > f.write(contents[:start])

  3. I don’t remember whether I wrote the script in Python 2 or 3. But from the error you posted, I think I used Python 2, where opening the files in “r” mode read bytes, whereas you are using Python 3, where a file opened that way is read as a string. Try specifying “rb” as the open mode instead of “r” and see if that solves your problem.

    1. I tried your code, but I am still getting the same error.
      TypeError: a bytes-like object is required, not ‘str’
      I am using python3.6 and while opening the uncompressed.pdf file I am using “rb” mode.

  4. This is such a useful project! Thanks, Arunmozhi!
    I’m having similar issues with encodings. Would you mind emailing me so I can share my code with you?
    Thanks!

  5. I tried to adapt this code to a working version in the current version of Python (3.10.6)

    You can find the code on GitHub:


    import sys
    import os
    from PIL import Image
    # Include the \n to ensure exact match and avoid partials from 111, 211...
    OBJECT_ID = "\n11 0 obj"
    def replace_image(filepath, new_image):
        f = open(filepath, "rb")
        contents = f.read()
        f.close()
        image = Image.open(new_image)
        width, height = image.size
        length = os.path.getsize(new_image)
        start = contents.find(str.encode(OBJECT_ID))
        stream = contents.find(str.encode("stream"), start)
        image_beginning = stream + 7
        # Process the metadata and update with new image's details
        meta = contents[start: image_beginning]
        meta = meta.split(str.encode("\n"))
        new_meta = []
        for item in meta:
            if str.encode("/Width") in item:
                new_meta.append("/Width {0}".format(width))
            elif str.encode("/Height") in item:
                new_meta.append("/Height {0}".format(height))
            elif str.encode("/Length") in item:
                new_meta.append("/Length {0}".format(length))
            else:
                new_meta.append(item.decode(encoding='utf-8'))
        new_meta = "\n".join(new_meta)
        # Find the end location
        image_end = contents.find(str.encode("endstream"), stream) - 1
        # read the image
        f = open(new_image, "rb")
        new_image_data = f.read()
        f.close()
        # recreate the PDF file with the new image
        with open(filepath, "wb") as f:
            f.write(contents[:start])
            f.write(str.encode("\n"))
            f.write(str.encode(new_meta))
            f.write(new_image_data)
            f.write(contents[image_end:])
    #replace_image('pdfuncompressedfile.pdf', 'new_image')
    if __name__ == "__main__":
        if len(sys.argv) == 3:
            replace_image(sys.argv[1], sys.argv[2])
        else:
            print("Usage: python process.py <pdfuncompressedfile> <new_image>")

    Nice work Arunmozhi!

  6. Hi,
    the error is the same for every file I use with the example:

    File “….\process.py”, line 38
    image_end = contents.find(“endstream”, stream) – 1
    ^
    SyntaxError: invalid character ‘–’ (U+2013)

    How can I resolve it?

    1. Hi, I wrote this code 4 years back and don’t remember enough to debug this via comments. I would suggest you use the article and the code as a guide and rewrite it to suit your needs. I am sorry, I can’t help more than this.

  7. Hey, try to use the most recently created script, adapted from old Arunmozhi code, you can find it 3 comments above. Hope this helps.
