Blog

Do Government Teachers Deserve Better Pay Than Private Teachers in Tamilnadu?

Recently, Tamilnadu’s government teachers went on strike and captured the attention of everyone in the state. Emotions ran high on all sides. Around 5 out of their 9 demands revolved around money: pay, pension and arrears. One argument that was heard often was:

Private school teachers do much more work for much less pay, so government school teachers shouldn’t be greedy

Is it really that way? I decided to investigate with whatever little data I could find. One key factor that can be used to quantify the workload of teachers is the student-to-teacher ratio: simply stated, the number of children each teacher is responsible for. The higher the number, the greater the workload: more notebooks to correct, more exam papers to evaluate, longer queues to handle … you get the idea.

With that in mind, let us put data to work.

Data Used

Calculations

With the above sources giving a neat count of schools, students and teachers based on the management type of the school, it was just a matter of selecting the right columns and dividing one by the other.

Student-to-teacher Ratio = No. of Students / No. of Teachers

I uploaded the dataset to Kaggle and wrote a kernel script to perform the above calculation for each type of school: Government, Government-Aided, and Private.
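The calculation itself can be sketched in a few lines of Python. The counts below are made-up placeholders, not the figures from the dataset, and the names `counts` and `student_teacher_ratio` are purely illustrative rather than taken from the Kaggle kernel.

```python
# Hypothetical counts per management type; NOT the real dataset figures.
counts = {
    "Government": {"students": 4000, "teachers": 200},
    "Government-Aided": {"students": 3000, "teachers": 75},
    "Private": {"students": 5000, "teachers": 250},
}


def student_teacher_ratio(students: int, teachers: int) -> float:
    """Number of students each teacher is responsible for."""
    return students / teachers


# One ratio per management type, exactly the division described above.
ratios = {
    management: student_teacher_ratio(c["students"], c["teachers"])
    for management, c in counts.items()
}

for management, ratio in ratios.items():
    print(f"{management}: {ratio:.1f} students per teacher")
```

With real data, the same dictionary comprehension would simply run over the per-management totals instead of the placeholder numbers.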

Observations

Here is the heatmap of the Student to teacher ratios.

student_teacher_ratios_table

There is a clear pattern. In some cases, government-aided school teachers have twice the workload of their peers in government or private schools. Aided school teachers also take on all the extra duties of government teachers, such as census data collection, electoral rolls, election booth staffing, etc.

Here is a graph to give a sense of how far removed the aided school teachers are from their peers.

student_teacher_ratio_plot

Conclusion

To answer the question asked in the title: I am not sure about government school teachers, but it certainly looks like the government-aided school teachers deserve better.

Replacing image in a PDF with Python

Being a freelancer is an interesting role. You come across a variety of projects. I recently worked on a project involving replacing images in a PDF which taught me a couple of things.

  1. While there are a number of tools to deal with PDFs in Python, the general-purpose tools can only do so much because of reason 2.
  2. A PDF is a dump of instructions to put things in specific places. It has no logical structure that would let general-purpose tools manipulate it in a consistent way.
  3. Not everything is bad. Almost all additive changes, like adding text or an image, and whole-page changes, like rotating or cropping, are usually possible, and so are read operations like text and image extraction.
  4. The trouble starts when you want to delete something and replace it with something else.

With that learnt, I set out to achieve the goal anyway.

Step 1 – Understanding the format

Humans invented the PDF format, which means they used words to describe things in the file, which means we can read them. So opening a PDF file in a text editor like VIM will show something like this.

PDF in VIM

Without getting into the entirety of the PDF spec, let us see what this means. A PDF is a collection of objects. There is usually an identifier like int int obj, followed by some metadata and then a stream of binary information that starts with stream and ends with endstream, followed by endobj. An image in our case would be represented as

16 0 obj
<< /Length 17 0 R /Type /XObject /Subtype /Image /Width 242 /Height 291 /Interpolate
true /ColorSpace 7 0 R /Intent /Perceptual /BitsPerComponent 8 /Filter /DCTDecode
>>
stream
Image binary data here like ÿØÿá^@VExif^@^@MM^@*^@^@^@^H^@^D^A^Z^@^E^@^@
endstream
endobj

So to successfully replace an image we will have to replace the image binary data and the metadata like width and height.

Step 2 – Uncompressing the PDF and extracting the images

Use a PDF manipulation toolkit called PDFtk.

pdftk sample.pdf output uncompressed.pdf uncompress

This command uncompresses the file, making it easier to read and manipulate. Let us open uncompressed.pdf in VIM to see the difference.

uncompressed pdf

Step 3 – Identifying the image to replace

A PDF is essentially a collection of objects, and a single file might contain multiple images. There is no way to identify a particular image by looking at the binary data in the PDF file (unless you are from the Matrix). We first have to extract the images from the PDF and then match each PDF object to its image using metadata like height and width. To do that, install the pdfimages command-line tool (part of poppler-utils) and run pdfimages -list uncompressed.pdf. This lists all the images in the PDF with their metadata.

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     277   185  icc     3   8  jpeg   yes       11  0   113   113 69.2K  46%
   1     1 image     277   185  icc     3   8  jpeg   yes       10  0   113   113 31.9K  21%
   1     2 image     242   291  icc     3   8  jpeg   yes       12  0   112   112 55.2K  27%

Next extract all the images in their original formats using

pdfimages -all uncompressed.pdf image

That extracts the files and names them after the prefix we provided, like this: image-000.jpg image-001.jpg image-002.jpg.

Now open your images, check each file’s height, width and file size, and note the details of the one to replace. In my case the file details were:

  • height – 185
  • width – 277
  • size – 70836

There are two images that match the height and width; thankfully, they have different file sizes.
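The matching can also be done in code. This is a minimal sketch that parses the listing shown above and picks the row whose width, height and rounded size agree with the file details. The helper names `parse_listing` and `find_object` are made up for illustration, and the sketch assumes the 16-column layout printed above and that the K in the size column means 1024 bytes (70836 / 1024 rounds to 69.2, which agrees with the first row).

```python
LISTING = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     277   185  icc     3   8  jpeg   yes       11  0   113   113 69.2K  46%
   1     1 image     277   185  icc     3   8  jpeg   yes       10  0   113   113 31.9K  21%
   1     2 image     242   291  icc     3   8  jpeg   yes       12  0   112   112 55.2K  27%
"""


def parse_listing(text):
    """Turn the `pdfimages -list` table into a list of dicts."""
    rows = []
    for line in text.splitlines()[2:]:  # skip the header and the dashes
        parts = line.split()
        if len(parts) != 16:
            continue  # not a data row in the layout shown above
        rows.append({
            "num": int(parts[1]),
            "width": int(parts[3]),
            "height": int(parts[4]),
            "object_id": int(parts[10]),  # first half of the "object ID" pair
            "size": parts[14],
        })
    return rows


def find_object(rows, width, height, size_bytes):
    """Return the object ID whose dimensions and rounded size match."""
    size_label = f"{size_bytes / 1024:.1f}K"  # assumption: K means 1024 bytes
    for row in rows:
        if (row["width"], row["height"], row["size"]) == (width, height, size_label):
            return row["object_id"]
    return None


print(find_object(parse_listing(LISTING), width=277, height=185, size_bytes=70836))
```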

Step 4 – Identifying the object in PDF that represents the image

I opened uncompressed.pdf in VIM and searched for the most distinctive value I had found for the image: its size.

identifying the image object

Now we can identify the object identifier, in this case it is 11 0 obj.
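The same lookup can be scripted instead of done by eye in an editor. The sketch below searches the raw bytes for a /Length entry with the known size and then walks back to the nearest object header. The function name `find_object_by_length` and the tiny `sample` document are hypothetical, and it assumes the uncompressed file stores the length directly in the object's dictionary.

```python
import re


def find_object_by_length(pdf: bytes, length: int):
    """Return (object number, generation) of the object declaring /Length `length`."""
    hit = re.search(rb"/Length\s+%d(?!\d)" % length, pdf)
    if hit is None:
        return None
    # The owning object is the last "N G obj" header before the match.
    headers = list(re.finditer(rb"(\d+)\s+(\d+)\s+obj", pdf[:hit.start()]))
    if not headers:
        return None
    return int(headers[-1].group(1)), int(headers[-1].group(2))


# Tiny stand-in for an uncompressed PDF, for illustration only.
sample = (
    b"%PDF-1.3\n"
    b"10 0 obj\n<< /Type /Page >>\nendobj\n"
    b"11 0 obj\n<< /Subtype /Image /Width 277 /Height 185 /Length 70836 >>\n"
    b"stream\n...\nendstream\nendobj\n"
)
print(find_object_by_length(sample, 70836))
```

On a real file, the bytes would come from `open("uncompressed.pdf", "rb").read()`.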

Step 5 – Replacing the image with another image

Now the job is to swap object 11’s image data with our image’s data. You can use the following Python script to achieve that.

Download the file, change the OBJECT_ID value, save the file and run:

python process.py <your pdf> <new image>
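As a rough sketch of the idea (not the original process.py), the replacement boils down to swapping the bytes between stream and endstream and patching /Width, /Height and /Length in the object's dictionary. The function `replace_image_stream` is a hypothetical helper, not a library API. It assumes an uncompressed file, a /Length stored as a plain number, and a new image that is a JPEG so the existing /DCTDecode filter still applies. The width and height of the new image are passed in by the caller.

```python
import re


def replace_image_stream(pdf: bytes, object_id: int, jpeg: bytes,
                         width: int, height: int) -> bytes:
    """Swap the stream of `<object_id> 0 obj` for `jpeg` and patch its metadata."""
    pattern = re.compile(
        rb"(?<!\d)%d 0 obj\s*(<<.*?>>)\s*stream\r?\n(.*?)\r?\nendstream" % object_id,
        re.DOTALL,
    )
    match = pattern.search(pdf)
    # Guard: if the match ran past endobj, this object has no stream of its own.
    if match is None or b"endobj" in match.group(1):
        raise ValueError(f"object {object_id} with a stream not found")

    header = match.group(1)
    # Patch the three values that must agree with the new image data.
    for key, value in ((b"Width", width), (b"Height", height),
                       (b"Length", len(jpeg))):
        header = re.sub(rb"/" + key + rb"\s+\d+",
                        b"/" + key + b" %d" % value, header, count=1)

    new_object = (b"%d 0 obj\n" % object_id + header
                  + b"\nstream\n" + jpeg + b"\nendstream")
    return pdf[:match.start()] + new_object + pdf[match.end():]
```

A call could look like `replace_image_stream(pdf_bytes, 11, jpeg_bytes, 277, 185)`, with both byte strings read from files opened in "rb" mode. Editing bytes this way leaves the offsets in the cross-reference table stale; passing the result through pdftk again, as in the optional step 6, is one way to have the table rebuilt.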

I just used one of the extracted images to replace another one. So here are the before and after images.

image replaced pdf

Step 6 – Compressing the file back (OPTIONAL)

Do this only if you really need to do it for some reason. It is usually cool to just use the uncompressed file.

pdftk uncompressed.pdf output replaced.pdf compress

Map of PM Modi’s Domestic Visits

PM Modi visited Tamil Nadu on 27th January 2019 for the AIIMS Hospital ground breaking ceremony. Twitter was trending with #GoBackModi and #TNWelcomesModi, and I was curious about how many times PM Modi had visited Tamil Nadu before.

The PM India site has a neat list of all the visits http://www.pmindia.gov.in/en/pm-visits/?visittype=domestic_visit

So, I created a map out of it.

Visits_by_PM_Modi.png

Update:

This map was replaced after some errors were discovered in the base data.

Python Technical Interview – An Experience

As a freelancer, one of the things that comes with getting a project/job is handling technical interviews. So far I had managed to convince clients with a work sample, a test project, etc. This was literally the first time I sat for a full technical interview, and it did teach me a few lessons. Let me document it for future use.

It started off with the basics of the language:

1. What is the difference between an iterable and an iterator?

Vincent Driessen provides a clear explanation of the difference, with examples, here: https://nvie.com/posts/iterators-vs-generators/
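In short, an iterable is anything `iter()` can turn into an iterator, and an iterator is the stateful object that hands out values through `next()`. A small sketch of the distinction (the variable names are just for illustration):

```python
numbers = [1, 2, 3]          # a list is an iterable, but not an iterator
iterator = iter(numbers)     # iter() asks the iterable for a fresh iterator

first = next(iterator)       # an iterator hands out one value at a time
rest = list(iterator)        # consuming the remaining values
leftover = list(iterator)    # an exhausted iterator stays empty

# An iterator is its own iterator; a list produces a new one every time.
iterator_is_self = iter(iterator) is iterator
list_is_self = iter(numbers) is numbers

print(first, rest, leftover, iterator_is_self, list_is_self)
```

The list can be looped over again and again, while the iterator is used up after one pass.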

As an aside, he has a number of really great posts, like his Git workflow model, which I have used in my projects. Bookmark it.

2. What is a Context Manager? What is its purpose? How is it different from a try…finally block? Why would you use one over another?

Context managers are functions/classes that allow us to allocate and release resources exactly when required. They are used with the with keyword.

The difference between context manager and try..finally block is explained in technical detail here: https://stackoverflow.com/questions/26096435/is-python-with-statement-exactly-equivalent-to-a-try-except-finally-bloc

But a simpler, more practical difference is given by Dan Bader: https://dbader.org/blog/python-context-managers-and-with-statement
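To make the comparison concrete, here is a minimal sketch with a toy `Resource` class (a made-up example, not taken from either link) used once through with and once through a hand-written try…finally. Both paths release the resource, but with the context manager the cleanup lives in the class instead of being repeated at every call site.

```python
events = []


class Resource:
    """Toy resource that records when it is acquired and released."""

    def __enter__(self):
        events.append("acquire")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        events.append("release")
        return False  # False means: do not swallow the exception


def use_with():
    with Resource():
        events.append("use")


def use_try_finally():
    # Roughly what the with statement does on our behalf.
    resource = Resource()
    resource.__enter__()
    try:
        events.append("use")
    finally:
        resource.__exit__(None, None, None)


use_with()
use_try_finally()
print(events)
```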

3. Can you tell me some advantages of Python over other languages?

I rambled something like: it is easier to read and write. The file structure (I should have said modules/packages) is great. Even modern iterations of JavaScript are copying the import from syntax. Native implementation of a lot of things in the standard library, etc.

But what my interviewer was looking for was the phrase “automatic garbage collection”, because the next question was

4. How does Python handle memory?

Python has automatic memory management and garbage collection. That is why we never have to worry about allocating and freeing memory by hand, as we do with C’s malloc and calloc functions.

5. Do you know how Python does that? Do you know about GIL?

Sheepish smiles and a series of no’s ensued. I ran into an issue a few months back, maybe a DB connection issue or something, which led me down a rabbit hole that ended with the GIL. I should have learnt it that day.

Anyway, here is the article about Python’s memory management. https://realpython.com/python-memory-management/
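For a hands-on feel, the two mechanisms can be observed with a weak reference. This is a small sketch that assumes CPython (other implementations may free objects later); the `Node` class and function names are illustrative only. In CPython, the GIL is also what keeps these reference count updates safe when several threads are running.

```python
import gc
import weakref


class Node:
    pass


def freed_by_refcount() -> bool:
    obj = Node()
    probe = weakref.ref(obj)  # a weak reference does not keep obj alive
    del obj                   # the last strong reference is gone
    return probe() is None    # on CPython the object is freed right away


def freed_by_cycle_collector() -> bool:
    a, b = Node(), Node()
    a.other, b.other = b, a   # reference cycle: the counts never reach zero
    probe = weakref.ref(a)
    del a, b
    gc.collect()              # the cycle detector reclaims the pair
    return probe() is None


print(freed_by_refcount(), freed_by_cycle_collector())
```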

6. Have you worked on projects involving multi-threading? What do you know about multi-threading?

I hadn’t. Someday maybe I will.

7. Can you explain in detail the steps involved in a form submit to response cycle?

https://developer.mozilla.org/en-US/docs/Learn/HTML/Forms/Sending_and_retrieving_form_data

8. How does the browser know where your server is when the information is submitted to a particular URL?

DNS servers – IP resolution

9. The server sends back text as a string. How do you see colorful information in the browser?

The text is parsed into DOM elements, which are rendered by the browser’s rendering engine.

10. If a browser is showing unreadable characters and question marks instead of displaying the information, what could be the reason?

Document Encoding mismatch. The server might send the data encoded in Unicode UTF-8 and the browser might be decoding it as ASCII or LATIN-1 resulting in weird characters and question marks being rendered in the browser.
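The effect is easy to reproduce in Python. In this sketch the string "café" stands in for whatever the server sends; the bytes are correct UTF-8, and only the decoding step differs.

```python
text = "café"                       # what the server means to send
payload = text.encode("utf-8")      # the bytes that travel over the wire

wrong_latin1 = payload.decode("latin-1")                 # wrong guess: Latin-1
wrong_ascii = payload.decode("ascii", errors="replace")  # wrong guess: ASCII
right = payload.decode("utf-8")                          # matching encoding

print(wrong_latin1)  # cafÃ©
print(wrong_ascii)   # caf followed by two replacement characters
print(right)         # café
```

Browsers usually show the replacement character as a question mark inside a diamond, which is where the question marks come from.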

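11. You said Unicode and UTF-8. What is the difference?

Unicode is the character set: it assigns a number (a code point) to every character. UTF-8, UTF-16 and so on are encodings that turn those code points into bytes. UTF-8 works in 8-bit units and uses one to four of them per character; UTF-16 works in 16-bit units and uses one or two per character.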

For deep dive into Unicode (a must): https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
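A quick way to see the split between code points and encodings is to encode a few characters and count the bytes. The helper name `byte_lengths` is made up for this sketch.

```python
def byte_lengths(char: str):
    """Return (code point, bytes in UTF-8, bytes in UTF-16) for one character."""
    code_point = f"U+{ord(char):04X}"
    utf8 = char.encode("utf-8")
    utf16 = char.encode("utf-16-le")  # "-le" avoids the 2-byte byte order mark
    return code_point, len(utf8), len(utf16)


for char in ("A", "é", "₹", "😀"):
    print(char, byte_lengths(char))
```

The same code point ends up as a different number of bytes depending on the encoding, which is why the two terms are not interchangeable.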

12. What kind of request does the browser make to a server? And what are the types of requests that can be made?

Browsers make HTTP requests. The methods are GET, POST, PUT, DELETE, HEAD, OPTIONS, etc. (I think I said UPDATE instead of PUT, silly.)

13. What is the difference between `==` and `===` in JavaScript?

StackOverflow: https://stackoverflow.com/questions/523643/difference-between-and-in-javascript

Some other questions that were asked:
1. Do you know Docker? Have you used AWS?
2. Do you know database schema design?
3. You have a SQL query that takes a long time to execute. How would you begin to make it faster? Do you know about query optimisation and execution plans?

What is the relation between the SC population and SC literacy?

TL;DR: Nothing (that I could find)

After I published a set of maps, one thing that people (including me) wondered is

What is the relationship between the population size of the SC community and their literacy? Are the regions that do better doing so because the SC are greater in number and are able to assert themselves better in society?

This post is aimed at answering that question based on the Census 2011 data available from Tamilnadu Government website: http://www.tn.gov.in/deptst/areaandpopulation.pdf

The following data points were used:
1. Total population
2. Overall Literacy
3. Total SC population
4. SC Literacy

  • The gap between Overall literacy rate and SC literacy was calculated
  • The percentage of SC population was calculated
  • Finally, the correlation coefficient between the SC percentage and Literacy Gap was calculated
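The three steps above can be sketched as follows. The district rows are invented placeholders, not Census 2011 values, and `pearson` is a plain textbook implementation of the correlation coefficient rather than the exact tool used for the charts.

```python
from math import sqrt

# Hypothetical rows; NOT real census figures.
districts = [
    {"total_pop": 1_000_000, "sc_pop": 200_000, "overall_lit": 80.0, "sc_lit": 72.0},
    {"total_pop": 2_000_000, "sc_pop": 300_000, "overall_lit": 75.0, "sc_lit": 70.0},
    {"total_pop": 1_500_000, "sc_pop": 450_000, "overall_lit": 85.0, "sc_lit": 74.0},
    {"total_pop": 800_000, "sc_pop": 80_000, "overall_lit": 70.0, "sc_lit": 66.0},
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Step 1: gap between overall literacy and SC literacy.
literacy_gap = [d["overall_lit"] - d["sc_lit"] for d in districts]
# Step 2: SC share of the total population, in percent.
sc_percentage = [100 * d["sc_pop"] / d["total_pop"] for d in districts]
# Step 3: correlation between the two series.
correlation = pearson(sc_percentage, literacy_gap)
print(round(correlation, 2))
```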

sc_lit_pop_correlation

There is a very weak (0.13), or almost no, correlation between the share of the SC population in society and their literacy gap. Here are the maps for comparison.

You might be thinking, forget the gap, maybe there is a correlation between their literacy and population number.

sc_pop_litrate_correlation

Nope. Still nothing. Here are the maps to compare.

Literacy Gap of SC community in TN districts

I was going through the Census 2011 data once again and Erode district’s low Scheduled Caste (SC) literacy rate caught my eye. It is not a particularly lagging district in overall literacy, but its SC literacy was lower than that of the least literate district, Dharmapuri. So I added the data to the TN Districts shapefile and visualised it to see how far the SC community lags across the districts.

Here are the maps

tn_overall_literacy_2011
tn_sc_literacy_2011
general_sc_literacy_gap

Correction

In an earlier map, the gap for Thoothukudi was shown as -14% because of a typo during data processing; the actual gap is around 6%. The map has been updated to reflect the change.

My observations

  1. Kongu Belt (Coimbatore, Tiruppur, Erode) is the worst. ~~The Gounder (land owning) community has ensured their position and the social ladder and ensured the peasantry remained uneducated and illiterate.~~

Update: While there might be an element of truth to it, the maps alone do not support that inference. I had made the above observation based on the number of issues that have appeared in the media, like the mid-day meal staff harassment, the two-tumbler system, etc.

  2. Dharmapuri is a peculiar case: it has the lowest overall literacy in TN, but it is also the only district where the SC community is more literate than the general population.
  3. Kanniyakumari, which tops the overall literacy rates, also tops the SC literacy. In fact, the SC community of Kanniyakumari is more literate than the general population of almost all other districts. I think it would be an interesting place for humanities research in the area of literacy, education and caste.

Data Source:

http://www.tn.gov.in/deptst/areaandpopulation.pdf

QGIS – Creating new column from existing using Python

Yesterday, I was working on the ward-level parks map of Chennai and had to join a CSV data layer with the boundary polygon layer. There was one issue: while my CSV file had the ward numbers as integers (1, 2, 3, etc.), the polygon layer had them as strings (Ward 1, Ward 2, Ward 3, etc.). So I thought, wouldn’t it be nice to just strip the word Ward and put the number in a new column, so that I can make a join by matching the ward numbers? Turns out Python integration in QGIS is so good that I did it without even searching the internet. Here is how.

  1. Open the Attribute table
  2. Open Field Calculator.
  3. Enter the “Output field name”
  4. Switch to “Function Editor”
  5. Click the [+] button to create a new function file.
  6. Change the function name and parameter, and return the value after stripping “Ward ” from the string. Read the docs given below the function editor to understand what’s going on in the file.
QGIS Field Calculator
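from qgis.core import *
from qgis.gui import *

# Registers strip_ward under the "Custom" group of the expression builder.
@qgsfunction(args='auto', group='Custom')
def strip_ward(name, feature, parent):
    # "Ward 12" -> "12": keep only the text after the last space.
    return name.split(" ")[-1]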

Now switch back to the Expression tab and call the function to calculate the new field.

strip_ward.png

Click OK. The new field with the computed value will now be created.

I had a simple use case, but one can use the power of Python to calculate anything from existing data and generate a new field based on it. I was really blown away by the level of Python integration in QGIS.