After some thoughts on what to learn next, I enrolled myself into IBM Data Science Professional Certificate program. It is beginner level program to provide the necessary foundations for Data Science. I have completed 3 courses so far:
- What is Data Science?
- Tools for Data Science
- Data Science Methodology
I liked all the courses so far. Despite IBM Cloud tools’ big presence, the courses cover the concepts in a somewhat generic way.
This post is NOT a review of the course or certification.
This post is mostly about my first impressions on the domain of Data Science.
1. It’s not strictly a science
Data Science doesn’t have a clear definition. Everyone defines it the way they want. Murtaza Haider – author of Getting Started with Data Science puts it simply as
Data Science is what data scientists do
It is not statistics either, so I guess instead of inventing a new word for experimental statistics, data analysis, visualisation and model building, someone called it science.
2. Engineers will probably hate it
If there is one underlying principle that defines engineers, it is their love for certainty. Engineers strive to build systems governed by a set of rules that will produce predictable outcomes. Data Science is actually quite the other way around, you start with a question and move towards an answer which could be anything, valid or invalid. It is a frustrating experience to go the full way without knowing where. Sure, we get some hints here and there throughout the process. If one has the “journey is more important than destination” attitude, it is a good ride. But if you yearn for certainty or get frustrated mid way through the process, you will probably end up torturing the data to confess the way you want it to.
3. The Jargon
It is a field of jargon. Everything has a specific word/phrase. If you open the data in excel and see it, you will probably call it inspecting the data. As a data scientist, you do the same with a couple of lines of Python Code on a Jupyter Notebook and call it “Exploratory Data Analysis” or EDA, if you change value or format them to a specific standard is “Data Manipulation” or DML (did that thing even need a 3 letter acronym?). If you query some data, do some manipulations, it is a Extract-Transform-Load or ETL pipeline.
While some of them are valid terminologies (like ETL) which exist to communicate effectively, some of them are just marketing jargon which exists to make stuff seem bigger than they are.
4. The Ah..ha moments
This is the best thing about Data Science according to me. It is those moments when your perception is altered by the data or when the output is altered by the perception. Engineering by its nature is application of established principles, we get our ah..ha moments when we learn a new concept invented by a scientist. But in data science the data can produce such moments by their mere nature of being exploratory in nature. Since we start with a question and follow the data to the answer, it usually results in a light bulb moment.
This is the area which is most conflicting for me. Things like “Data is the new oil” has been said for close to a decade now and the general consensus seems to be that more data the better. But both my personal experience as a teacher and the case study about a health insurance provider in the course have made me wary of the practicality of this approach. I can write about my personal experience as anecdotal evidence. But I think I would give a more data oriented reason (the post being about data science and all).
Explosion of healthcare administrators in the US
While studying the data science methodology, it becomes acutely clear how this happened. I am not saying data science is the root cause of the problem, but it is definitely a contributing factor as no model built by any data scientist can be static. It is iterative process which keeps cycle of data collection modelling and output going. So when overdone, it feels more of a problem than a solution.
6. The machines are coming for the data scientists
The hype for data scientists went mainstream, I think with the McKinsey report of 2018 saying there will be a shortage of 140,000 to 190,000 data professionals in US alone. Followed by data scientists becoming the highest paid in technical jobs. Going by IBM tools that I have used during the course, I don’t think the hype will age well.
Money is in automation – an estimated 70 to 90% time is spent by data scientists in collecting and preparing the data for the analysis. If there is anything that computers were invented for, it is to automate such mundane processes. Even if a company is saving 50% of that time, it is great cost saving. So automation tools will cut down the work and thus the demand for data scientists.
Side Note: You can create a free account at cloud.ibm.com and checkout their tools to get an idea of the direction and sophistication of the tools that are being developed.
Uncertainty of the outcome – as mentioned earlier, not all organisations will gain from having a data science team. Apart from the data, a variety of things like – organisation size, scope for data driven decision making and the talent of the data science team all impact the outcome of the exercise. Combined with off the shelf offerings from IT companies, data scientist’s role might shrink to just an computer operator in some cases.
No, I am not saying data scientists will become obsolete. Just that it is not going to live up to the hype.
Before you throw brick bats
I have completed only 3 of 9 courses in the program. Once I complete another 3 courses, I think I will have a better understanding of things and will revisit the topic again.