Data Science – First Impressions

After some thoughts on what to learn next, I enrolled myself into IBM Data Science Professional Certificate program. It is beginner level program to provide the necessary foundations for Data Science. I have completed 3 courses so far:

  1. What is Data Science?
  2. Tools for Data Science
  3. Data Science Methodology

I liked all the courses so far. Despite IBM Cloud tools’ big presence, the courses cover the concepts in a somewhat generic way.

This post is NOT a review of the course or certification.

This post is mostly about my first impressions on the domain of Data Science.

1. It’s not strictly a science

Data Science doesn’t have a clear definition. Everyone defines it the way they want. Murtaza Haider – author of Getting Started with Data Science puts it simply as

Data Science is what data scientists do

It is not statistics either, so I guess instead of inventing a new word for experimental statistics, data analysis, visualisation and model building, someone called it science.

2. Engineers will probably hate it

If there is one underlying principle that defines engineers, it is their love for certainty. Engineers strive to build systems governed by a set of rules that will produce predictable outcomes. Data Science is actually quite the other way around, you start with a question and move towards an answer which could be anything, valid or invalid. It is a frustrating experience to go the full way without knowing where. Sure, we get some hints here and there throughout the process. If one has the “journey is more important than destination” attitude, it is a good ride. But if you yearn for certainty or get frustrated mid way through the process, you will probably end up torturing the data to confess the way you want it to.

3. The Jargon

It is a field of jargon. Everything has a specific word/phrase. If you open the data in excel and see it, you will probably call it inspecting the data. As a data scientist, you do the same with a couple of lines of Python Code on a Jupyter Notebook and call it “Exploratory Data Analysis” or EDA, if you change value or format them to a specific standard is “Data Manipulation” or DML (did that thing even need a 3 letter acronym?). If you query some data, do some manipulations, it is a Extract-Transform-Load or ETL pipeline.

While some of them are valid terminologies (like ETL) which exist to communicate effectively, some of them are just marketing jargon which exists to make stuff seem bigger than they are.

4. The Ah..ha moments

This is the best thing about Data Science according to me. It is those moments when your perception is altered by the data or when the output is altered by the perception. Engineering by its nature is application of established principles, we get our ah..ha moments when we learn a new concept invented by a scientist. But in data science the data can produce such moments by their mere nature of being exploratory in nature. Since we start with a question and follow the data to the answer, it usually results in a light bulb moment.

5. Practicality

This is the area which is most conflicting for me. Things like “Data is the new oil” has been said for close to a decade now and the general consensus seems to be that more data the better. But both my personal experience as a teacher and the case study about a health insurance provider in the course have made me wary of the practicality of this approach. I can write about my personal experience as anecdotal evidence. But I think I would give a more data oriented reason (the post being about data science and all).

Explosion of healthcare administrators in the US

While studying the data science methodology, it becomes acutely clear how this happened. I am not saying data science is the root cause of the problem, but it is definitely a contributing factor as no model built by any data scientist can be static. It is iterative process which keeps cycle of data collection modelling and output going. So when overdone, it feels more of a problem than a solution.

6. The machines are coming for the data scientists

The hype for data scientists went mainstream, I think with the McKinsey report of 2018 saying there will be a shortage of 140,000 to 190,000 data professionals in US alone. Followed by data scientists becoming the highest paid in technical jobs. Going by IBM tools that I have used during the course, I don’t think the hype will age well.

Money is in automation – an estimated 70 to 90% time is spent by data scientists in collecting and preparing the data for the analysis. If there is anything that computers were invented for, it is to automate such mundane processes. Even if a company is saving 50% of that time, it is great cost saving. So automation tools will cut down the work and thus the demand for data scientists.

Side Note: You can create a free account at cloud.ibm.com and checkout their tools to get an idea of the direction and sophistication of the tools that are being developed.

Uncertainty of the outcome – as mentioned earlier, not all organisations will gain from having a data science team. Apart from the data, a variety of things like – organisation size, scope for data driven decision making and the talent of the data science team all impact the outcome of the exercise. Combined with off the shelf offerings from IT companies, data scientist’s role might shrink to just an computer operator in some cases.

No, I am not saying data scientists will become obsolete. Just that it is not going to live up to the hype.

Before you throw brick bats

I have completed only 3 of 9 courses in the program. Once I complete another 3 courses, I think I will have a better understanding of things and will revisit the topic again.

NiftyBot

The Mastodon ecosystem is really nice. The concept of Fediverse bringing in decentralized social network is a breath of fresh air. I created NiftyBot account in botsin.space – a dedicated server for Mastodon Bots.

What is NiftyBot?

  • It is a Mastodon Bot Account

What does it do?

  • It posts updates about Indian Markets
  • Currently it posts NSE closing report at 4.01 PM everyday. Sample post below

niftybot-sample

How does it work?

It is a Python script running in AWS Lambda.

lambda-niftybot

A scheduler tiggers the Lambda Function at 4.01 every Monday – Friday. The lambda function is a Python Script that fetches the necessary details from NSE’s output data and posts to Mastodon.

Source Code:

Some asked about if this bot is open source. Obviously, you see the source right here. 🙂 Still I will add the license here.

The above source code is released into the Public Domain. You can do what ever you want with it.

How much does it cost to run this Bot?

Nothing.

Numbers Please:

The AWS Lambda Free tier comes with 1 Million requests and 400,000 GBSec, which is a combination of how much memory we use and the time taken by our process. Since I have used the CloudWatch Scheduler Event as the trigger, I am using 20-22 requests, the Python function takes about 60 MB to run so running at the lowest memory of 128MB block, and usually completes in around 2600-2700 msec. The metrics says my worst billed event so far is about 0.3375 GBSec. With about 20-22 trading days in a month, I might use a total of 8-10 GBSeconds, leaving enough room to run many more bots like this one 🙂

Do Government Teachers Deserve Better Pay Than Private Teachers in Tamilnadu?

Recently Tamilnadu’s government teachers went on a strike and captured the attention of everyone in the state. There were emotions flying from everyone. Around 5 out their 9 demands revolved around money: pay, pension and arrears. An argument which could be heard around was:

Private school teachers do much more work for much less pay, so government school teachers shouldn’t be greedy

Is it really that way? I decided to investigate with whatever little data I could. One key factor that could be used to quantify the workload of teachers is Student-to-Teacher ratio, simply stated, the number of children each teacher is responsible for. Higher the number, more the workload, more notebooks to correct, more exam papers to evaluate, longer queues to handle … you get the idea.

With that in mind, let us put data to work.

Data Used

Calculations

With the above sources giving a neat count of schools, students and teachers based on the management type of the school, it was just a matter of selecting the right columns and dividing one by the other.

Student-to-teacher Ratio = No.of Students / No.of Teachers

I uploaded the dataset to Kaggle and wrote a kernel script to perform the above calculation for each type of schools: Government, Government-Aided, and Private schools.

Observations

Here is the heatmap of the Student to teacher ratios.

student_teacher_ratios_table

There is a clear pattern that can be observed. The government aided school teachers have in some cases twice as much workload as their peers in govt or private schools. Aided school teachers do the work of all the govt. teachers like Census data, Electoral rolls, Election booth staff..etc., too.

Here is graph to give a sense of how far removed are the aided school teachers from their peers.

student_teacher_ratio_plot

Conclusion

To answer the question asked in the title. I am not sure about government school teachers, but it certainly looks like the govt. aided school teachers deserve better.