When you run
echo "x" > my_file and then check its MIME type using
file --mime-type my_file it reports
text/plain. But when you write the same content from Python with
with open("my_file_2", "w") as fp: fp.write("x")
and then check the MIME type, it reports
application/octet-stream. What’s the difference?
For the impatient
echo appends a newline to the file, and that extra byte is what leads the file utility to report it as a text file.
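Here is a minimal sketch of the fix on the Python side, assuming all you want is a file that is byte-for-byte identical to what echo produced. The helper names write_like_echo and mime_type are made up for this example. The call to file is skipped when the utility is not installed, and no output is shown for it because the result depends on your version of file.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def write_like_echo(path, text):
    """Write text plus a trailing newline, the way `echo text > path` does."""
    path.write_bytes((text + "\n").encode("ascii"))
    return path.read_bytes()


def mime_type(path):
    """Ask the `file` utility for the MIME type; None if `file` is missing."""
    if shutil.which("file") is None:
        return None
    result = subprocess.run(
        ["file", "--brief", "--mime-type", str(path)],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip()


with tempfile.TemporaryDirectory() as tmp:
    without_newline = Path(tmp) / "without_newline"
    without_newline.write_bytes(b"x")  # what fp.write("x") leaves on disk

    with_newline = Path(tmp) / "with_newline"
    print(write_like_echo(with_newline, "x"))  # b'x\n'

    # Compare what `file` says about each variant on your machine.
    print(mime_type(without_newline), mime_type(with_newline))
```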
For the curious
When I saw this question on Stack Overflow, I was really stumped, for the following reasons:
- I didn’t know the file utility could be used to get the MIME type of a file. I thought MIME types were only relevant in the context of web servers and clients. After all, MIME stands for Multipurpose Internet Mail Extensions.
- I thought operating systems usually use the file extension to decide the file type and, by extension, the MIME type. Don’t OSes warn us all the time when we touch the extension while renaming a file? So how does the file utility do this on files without any extension?
Let’s try adding extensions:
$ echo "x" > some_file.txt
$ file --mime-type some_file.txt
some_file.txt: text/plain
Okay, that’s all good. Now to the Python side:
with open("some_file_2.txt", "w") as fp: fp.write("x")
$ file --mime-type some_file_2.txt
some_file_2.txt: application/octet-stream
What? file doesn’t recognise file extensions?
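For contrast, here is a small sketch of a tool that really does decide by extension: Python’s standard mimetypes module. It only looks at the name and never opens the file, so it gives the same answer for both files above and has nothing to say about a name without an extension. This is just a comparison point, not how the file utility works.

```python
import mimetypes

# guess_type() inspects only the file name; the files do not even need to exist.
for name in ("some_file.txt", "some_file_2.txt", "my_file"):
    guessed, _encoding = mimetypes.guess_type(name)
    print(name, "->", guessed)

# some_file.txt -> text/plain
# some_file_2.txt -> text/plain
# my_file -> None
```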
The OS conspiracy theory
Maybe echo writes the MIME type as metadata onto the disk, because echo is a system utility and knows how to do that, while in Python the user (me) doesn’t? Clearly the operating system utilities are a cabal guarding some forbidden knowledge. And I am going to uncover it today, starting with the file utility, which seems to give different answers for different programs.
How does ‘file’ determine MIME Type?
The answers to this question have some useful information:
- MIME type is a fictional value. There is no inherent metadata field that stores the MIME type of a file.
- Each operating system uses a different technique to decide the file type. Windows uses the file extension, classic Mac OS uses creator and type codes, and Unix uses magic numbers.
- The file command guesses file type by reading the content and looking for magic numbers and strings.
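To make the magic-number idea concrete, here is a toy sketch of content sniffing. The SIGNATURES table and the sniff_mime function are hypothetical and cover only four well-known prefixes; the real file utility consults a much larger magic database and also runs separate text-detection tests, which this sketch leaves out on purpose.

```python
# Hypothetical, tiny signature table: (leading bytes, MIME type).
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
    (b"<?xml", "text/xml"),
]


def sniff_mime(data):
    """Return a MIME type if the content starts with a known signature, else None."""
    for magic, mime in SIGNATURES:
        if data.startswith(magic):
            return mime
    return None  # unknown: a real tool would go on to other heuristics


print(sniff_mime(b"%PDF-1.7\n"))      # application/pdf
print(sniff_mime(b"<?xml version="))  # text/xml
print(sniff_mime(b"x"))               # None
```

Note that the file name never enters the picture: only the bytes are inspected.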
Time to reveal the magic
Let us peer into the souls of these files in their purest forms where there is no magic but only 1s and 0s. So, I printed the binary representation of the two files.
$ xxd -b my_file
00000000: 01111000 00001010  x.
$ xxd -b my_file_2
00000000: 01111000  x
The file generated by echo has two bytes (notice the . after the x) whereas the file I created with Python only has one byte. What is that second byte?
>>> number = int('00001010', 2)
>>> chr(number)
'\n'
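The same inspection can be done without xxd. This is a small sketch, with to_bits as a made-up helper name, that prints each byte as eight binary digits; the byte strings stand in for the contents of the two files.

```python
def to_bits(data):
    """Render each byte as an 8-digit binary string, like one row of `xxd -b`."""
    return [format(byte, "08b") for byte in data]


print(to_bits(b"x\n"))  # ['01111000', '00001010']  (what echo wrote)
print(to_bits(b"x"))    # ['01111000']              (what fp.write("x") wrote)

# Going the other way recovers the original bytes.
print(bytes(int(bits, 2) for bits in to_bits(b"x\n")))  # b'x\n'
```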
And it turns out that, like in every movie about magic, there is no such thing as magic. Just clever people adding newlines to tell file it is a text file.
Creating a trick
Now that the trick is revealed, let’s create our own magic trick:
$ echo "<file></file>" > xml_file
$ file --mime-type xml_file
xml_file: text/plain
$ echo '<?xml version="1.0"?><file></file>' > xml_file
$ file --mime-type xml_file
xml_file: text/xml