Being a freelancer is an interesting role. You come across a variety of projects. I recently worked on a project involving replacing images in a PDF which taught me a couple of things.
- While there are a number of tools for working with PDFs in Python, general-purpose tools can only do so much, because of the next point.
- A PDF is a dump of instructions to put things in specific places. There is no consistent logical structure behind it, so general-purpose tools cannot manipulate every PDF in a consistent way.
- Not everything is bad. Almost all additive changes like adding text or an image, and whole-page changes like rotating or cropping, are usually possible, and so are all read operations like text and image extraction.
- The issue is when you want to delete something and replace it with something else.
With that learnt, I set out to achieve the goal anyway.
Step 1 – Understanding the format
Humans invented the PDF format, which means they used words to describe things in the file, which means we can read them. So opening a PDF file in a text editor like VIM will show something like this.
Without getting into the entirety of the PDF spec, let us see what this means. A PDF is a collection of objects. There is usually an identifier like int int obj followed by some metadata and then a stream of binary information that starts with stream and ends with endstream, after which the object is closed with endobj. An image in our case would be represented as
16 0 obj
<< /Length 17 0 R /Type /XObject /Subtype /Image /Width 242 /Height 291 /Interpolate true /ColorSpace 7 0 R /Intent /Perceptual /BitsPerComponent 8 /Filter /DCTDecode >>
stream
Image binary data here like ÿØÿá^@VExif^@^@MM^@*^@^@^@^H^@^D^A^Z^@^E^@^@
endstream
endobj
So to successfully replace an image we will have to replace the image binary data and the metadata like width and height.
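To make the object layout concrete, here is a minimal Python sketch that pulls one object apart into its metadata dictionary and its stream. The function name read_object and the sample bytes are made up for illustration. It assumes the words stream, endstream and endobj never occur inside the binary data, which a real PDF parser cannot take for granted.

```python
import re


def read_object(pdf_bytes, obj_num, gen_num=0):
    """Return (dictionary_bytes, stream_bytes) for one object, or None if absent."""
    # Anchor on the preceding newline so "11 0 obj" does not match "111 0 obj".
    header = b"\n%d %d obj" % (obj_num, gen_num)
    start = pdf_bytes.find(header)
    if start == -1:
        return None
    end = pdf_bytes.find(b"endobj", start)
    body = pdf_bytes[start + len(header):end]

    # Split the metadata dictionary from the binary stream, if there is one.
    marker = body.find(b"stream")
    if marker == -1:
        return body.strip(), b""
    dictionary = body[:marker].strip()
    stream_start = marker + len(b"stream")
    stream_end = body.rfind(b"endstream")
    # Drop the line breaks that separate the keywords from the data.
    # (Simplification: this would also eat data bytes that happen to be CR/LF.)
    return dictionary, body[stream_start:stream_end].strip(b"\r\n")


# Hypothetical, heavily shortened stand-in for an uncompressed PDF.
SAMPLE = (
    b"%PDF-1.4\n"
    b"16 0 obj\n<<\n/Type /XObject\n/Subtype /Image\n"
    b"/Width 242\n/Height 291\n/Length 4\n>>\n"
    b"stream\nABCD\nendstream\nendobj\n"
)

if __name__ == "__main__":
    dictionary, stream = read_object(SAMPLE, 16)
    width = int(re.search(rb"/Width (\d+)", dictionary).group(1))
    print(width, len(stream))
```

The dictionary part is what carries /Width, /Height and /Length, and the stream part is what holds the image bytes, so both have to change together.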
Step 2 – Uncompressing the PDF and extracting the images
Use a PDF manipulation toolkit called PDFtk.
pdftk sample.pdf output uncompressed.pdf uncompress
This command uncompresses the file, which makes it easier to read and manipulate. Let us open uncompressed.pdf in VIM to see the difference.
Step 3 – Identifying the image to replace
A PDF is essentially a collection of objects, and a PDF file might contain multiple images. There is no way to identify a particular image in the binary data of the PDF file (unless you are from the Matrix). We will have to first extract the images from the PDF and match the PDF object to the image using its metadata like height and width. To do that, install the pdfimages command-line tool (part of poppler-utils) and run pdfimages -list uncompressed.pdf. This will list all the images in the PDF with their metadata.
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     277   185  icc     3   8  jpeg   yes       11  0   113   113 69.2K  46%
   1     1 image     277   185  icc     3   8  jpeg   yes       10  0   113   113 31.9K  21%
   1     2 image     242   291  icc     3   8  jpeg   yes       12  0   112   112 55.2K  27%
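If you would rather not read the listing by eye, it can be turned into Python dictionaries. This is a rough sketch that assumes every data row has the same 16 whitespace-separated fields as the output above. The function name and the key names (such as gen for the ID column) are my own choices, not something pdfimages defines.

```python
def parse_pdfimages_list(text):
    """Turn `pdfimages -list` output into a list of dicts, one per image."""
    columns = ["page", "num", "type", "width", "height", "color", "comp",
               "bpc", "enc", "interp", "object", "gen", "x_ppi", "y_ppi",
               "size", "ratio"]
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a page number; this skips the header and the dashes.
        if len(fields) != len(columns) or not fields[0].isdigit():
            continue
        row = dict(zip(columns, fields))
        for key in ("page", "num", "width", "height", "object", "gen"):
            row[key] = int(row[key])
        rows.append(row)
    return rows


# The listing from this post, pasted as a string.
LISTING = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     277   185  icc     3   8  jpeg   yes       11  0   113   113 69.2K  46%
   1     1 image     277   185  icc     3   8  jpeg   yes       10  0   113   113 31.9K  21%
   1     2 image     242   291  icc     3   8  jpeg   yes       12  0   112   112 55.2K  27%
"""

if __name__ == "__main__":
    for row in parse_pdfimages_list(LISTING):
        print(row["object"], row["width"], row["height"], row["size"])
```

The object and gen values are the two numbers that appear in front of obj in the PDF file, which is what Step 4 is after.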
Next extract all the images in their original formats using
pdfimages -all uncompressed.pdf image
That extracts the files and names them after the prefix we provided, like this: image-000.jpg, image-001.jpg, image-002.jpg.
Now open your images, check each file's height, width and file size, and note the details of the one to replace. In my case the file details were:
- height – 185
- width – 277
- size – 70836
Two images match the height and width; thankfully they have different file sizes.
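The file size is what breaks the tie: 70836 bytes divided by 1024 is about 69.2, which lines up with the 69.2K row in the listing. The sketch below does that comparison in Python. It assumes the size column is in units of 1024 bytes rounded to one decimal, which I am inferring from the numbers here rather than from the pdfimages documentation. The function names and the size_k key are hypothetical, and the rows are typed in by hand from the listing.

```python
def size_in_k(num_bytes):
    """Express a byte count in units of 1024 bytes, rounded to one decimal."""
    return round(num_bytes / 1024, 1)


def match_rows(rows, width, height, num_bytes, tolerance=0.05):
    """Return the listing rows whose dimensions and size fit the extracted file."""
    wanted = size_in_k(num_bytes)
    return [
        row for row in rows
        if row["width"] == width
        and row["height"] == height
        and abs(row["size_k"] - wanted) <= tolerance
    ]


# Rows copied from the pdfimages listing (size column with the "K" removed).
ROWS = [
    {"object": 11, "width": 277, "height": 185, "size_k": 69.2},
    {"object": 10, "width": 277, "height": 185, "size_k": 31.9},
    {"object": 12, "width": 242, "height": 291, "size_k": 55.2},
]

if __name__ == "__main__":
    # Width, height and byte count of the extracted file I want to replace.
    for row in match_rows(ROWS, 277, 185, 70836):
        print("candidate object:", row["object"])
```

In practice the width and height could come from PIL's Image.open(path).size and the byte count from os.path.getsize(path), the same calls the replacement script in Step 5 uses.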
Step 4 – Identifying the object in PDF that represents the image
I opened uncompressed.pdf in VIM and searched for the most unique value I had found for the image: its size. Now we can identify the object identifier; in this case it is 11 0 obj.
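The same search can be scripted instead of done in VIM. Here is a small sketch that looks for /Length followed by the byte count and then walks back to the nearest object header. It assumes the uncompressed file stores the length as a plain number inside the object, as it did in my file, and that no similar-looking text occurs inside binary data. The function name and the sample bytes are invented for the example.

```python
import re


def find_object_by_length(pdf_bytes, length):
    """Return (object_number, generation) of the object with /Length <length>."""
    match = re.search(rb"/Length %d\b" % length, pdf_bytes)
    if match is None:
        return None
    # The closest "N G obj" header before the match opens the enclosing object.
    headers = list(re.finditer(rb"(\d+) (\d+) obj\b", pdf_bytes[:match.start()]))
    if not headers:
        return None
    last = headers[-1]
    return int(last.group(1)), int(last.group(2))


# Hypothetical, shortened stand-in for uncompressed.pdf.
SAMPLE = (
    b"%PDF-1.4\n"
    b"10 0 obj\n<<\n/Length 32660\n/Width 277\n>>\nstream\nxx\nendstream\nendobj\n"
    b"11 0 obj\n<<\n/Length 70836\n/Width 277\n>>\nstream\nyy\nendstream\nendobj\n"
)

if __name__ == "__main__":
    print(find_object_by_length(SAMPLE, 70836))
```

With a real file you would read it using open("uncompressed.pdf", "rb").read() and pass in the size of the extracted image.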
Step 5 – Replacing the image with another image
Now the job is to switch the object 11’s image data with our image’s data. You can use the following Python script to achieve that.
import os
import sys

from PIL import Image

# Include the \n to ensure an exact match and avoid partials from 111, 211...
OBJECT_ID = b"\n11 0 obj"


def replace_image(filepath, new_image):
    # Read the PDF as bytes: it mixes text with binary image streams
    with open(filepath, "rb") as f:
        contents = f.read()

    with Image.open(new_image) as image:
        width, height = image.size
    length = os.path.getsize(new_image)

    start = contents.find(OBJECT_ID)
    if start == -1:
        raise ValueError("object not found in {0}".format(filepath))
    stream = contents.find(b"stream", start)
    # Skip the "stream" keyword and the newline that follows it
    image_beginning = stream + 7

    # Process the metadata and update with new image's details
    meta = contents[start:image_beginning].split(b"\n")
    new_meta = []
    for item in meta:
        if b"/Width" in item:
            new_meta.append(b"/Width %d" % width)
        elif b"/Height" in item:
            new_meta.append(b"/Height %d" % height)
        elif b"/Length" in item:
            new_meta.append(b"/Length %d" % length)
        else:
            new_meta.append(item)
    new_meta = b"\n".join(new_meta)

    # Find the end location (keeps the newline before "endstream")
    image_end = contents.find(b"endstream", stream) - 1

    # Read the new image as raw bytes
    with open(new_image, "rb") as f:
        new_image_data = f.read()

    # Recreate the PDF file with the new image
    with open(filepath, "wb") as f:
        f.write(contents[:start])
        f.write(b"\n")
        f.write(new_meta)
        f.write(new_image_data)
        f.write(contents[image_end:])


if __name__ == "__main__":
    if len(sys.argv) == 3:
        replace_image(sys.argv[1], sys.argv[2])
    else:
        print("Usage: python process.py <pdf_file> <new_image>")
Download the file, change the OBJECT_ID value, save the file and run:
python process.py <your pdf> <new image>
I just used one of the extracted images to replace another one. So here are the before and after images.
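One thing to watch for: the script copies the new file's bytes into the stream as they are, so the new file has to be in the encoding that the object's /Filter announces. For an object with /Filter /DCTDecode, like the one in Step 1, that means a JPEG file, and JPEG files start with the bytes FF D8 (the ÿØ at the start of the binary data shown earlier). Below is a small pre-flight check along those lines. The function names are made up, and treating every other filter as unsupported is my own conservative assumption, not a rule from the PDF spec.

```python
JPEG_SOI = b"\xff\xd8"  # every JPEG file starts with this Start Of Image marker


def looks_like_jpeg(data):
    """True when the data begins with the JPEG Start Of Image marker."""
    return data[:2] == JPEG_SOI


def can_swap(object_dictionary, new_image_data):
    """Allow the swap only when a JPEG object gets JPEG replacement data."""
    if b"/DCTDecode" in object_dictionary:
        return looks_like_jpeg(new_image_data)
    # Other filters would need re-encoded pixel data, which this sketch
    # does not attempt, so it refuses rather than risk a corrupted PDF.
    return False


if __name__ == "__main__":
    dictionary = b"<< /Subtype /Image /Filter /DCTDecode >>"
    print(can_swap(dictionary, b"\xff\xd8\xff\xe1rest-of-jpeg"))
    print(can_swap(dictionary, b"\x89PNG\r\n\x1a\nrest-of-png"))
```

A check like this before calling replace_image might catch a mismatched file before it ends up inside the PDF.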
Step 6 – Compressing the file back (OPTIONAL)
Do this only if you really need to do it for some reason. It is usually cool to just use the uncompressed file.
pdftk uncompressed.pdf output replaced.pdf compress
Hello,
Very good work !
I’m trying to replicate your example but I get a corrupted PDF.
How should the image format be?
Thank you and congratulations for your code.
Thank you. Having the image format in the same aspect ratio and format (png/jpeg) is ideal.
Hi, I want to replace an image that appears N times in a PDF with another image of the same type. Is it a simple replacement for a logo in a PDF? How would the Python script change? Thanks
I tried to run your code but it gives me:
> TypeError: a bytes-like object is required, not ‘str’
which is referring to:
> f.write(contents[:start])
I don’t remember if I wrote the script in Python 2 or 3. From the error you posted, I think I used Python 2, where opening the files in “r” mode read bytes, while you are using Python 3, where the file is read as a string. Try specifying “rb” as the open mode instead of “r” and see if that solves your problem.
I tried with your code. However getting the same error.
TypeError: a bytes-like object is required, not ‘str’
I am using python3.6 and while opening the uncompressed.pdf file I am using “rb” mode.
This is such a useful project! Thanks, Arunmozhi!
I’m having similar issues with encodings. Would you mind emailing me so I can share my code with you?
Thanks!
I no longer maintain this code. Kindly use this as a starter and build your own solution.
Hi sir, is it possible to put text in place of an image?
I am not sure actually.
I tried to adapt this code to a working version in the current version of Python (3.10.6)
You can find the code on GitHub:
Nice work Arunmozhi!
Hi,
with every file I use for the example, the error is the same:
File “….\process.py”, line 38
image_end = contents.find(“endstream”, stream) – 1
^
SyntaxError: invalid character ‘–’ (U+2013)
How can I resolve it?
Hi, I wrote this code 4 years back and don’t remember enough to debug this via comments. I would suggest you use the article and the code as a guide and rewrite it to suit your needs. I am sorry, I can’t help more than this.
Hey, try to use the most recently created script, adapted from old Arunmozhi code, you can find it 3 comments above. Hope this helps.