Setting up my HomeServer – Part 3

I have so far written about the hardware, OS and deployment tooling (Part 1) and the networking setup (Part 2), and this post is about how I use Nomad to deploy software.

But before that…

I have some updates in the networking department. I had mentioned in Part 2 that I use Nginx as a reverse proxy to route traffic from the internet to my public apps like Misskey.

Picking up cues from Nemo’s setup, I experimented with Traefik and ended up replacing Nginx with it. The switch was made easier by the fact that Traefik natively supports Nomad service discovery, which means I don’t have to run something like Consul just to map services to the Traefik proxy.

Running Traefik

I was running Nginx as a systemd service on the OCI virtual machine, and Nomad was limited to running services on my Intel NUC. While moving to Traefik, I configured the OCI VM as a Nomad client and attached it to the Nomad “cluster”. Since Nomad runs over the Tailscale network, I didn’t have to fiddle with any of the networking/firewall settings on the OCI VM, which kept the setup simple.

Traefik runs as a Docker container on the VM in the “host” network mode so that it listens on the VM’s ports 80/443, which are open to the outside internet. I have specifically mapped the Traefik dashboard to the “tailscale” host network, allowing me to access the dashboard via SSH tunnelling without having to open port 8080 to the rest of the world.

// Nomad configuration

    network {
      port "http" {
        static = 80
      }
      port "https" {
        static = 443
      }

      # Dashboard port is bound only to the Tailscale interface
      port "admin" {
        static = 8080
        host_network = "tailscale"
      }
    }

    task "server" {
      driver = "docker"
      config {
        image = "traefik:2.9"
        ports = ["admin", "http", "https"]
        network_mode = "host"
        volumes = [
          "local/traefik.toml:/etc/traefik/traefik.toml",
          "local/ssl/cert.pem:/etc/ssl/cert.pem",
          "local/ssl/private.key:/etc/ssl/private.key",
        ]
      }
    }
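
For completeness, the Nomad integration mentioned earlier is enabled on the Traefik side inside the mounted traefik.toml. A minimal sketch of that part of the config – the entrypoint layout and endpoint address are placeholders, adjust to wherever your Nomad API is reachable on the Tailscale network:

// Sketch: Nomad provider section of traefik.toml (placeholder address)

    [entryPoints]
      [entryPoints.web]
        address = ":80"
      [entryPoints.websecure]
        address = ":443"

    [providers.nomad]
      exposedByDefault = false
      [providers.nomad.endpoint]
        # Nomad API reachable over the Tailscale network
        address = "http://nomad-server:4646"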

Deploying Applications

Each service is written as a Nomad job specification with its own network config and service definition. I then deploy the software from my laptop to my home server by running Terraform.
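
A trimmed sketch of the Terraform side, using the Nomad provider – the address and file path here are placeholders:

// Sketch: deploying a job spec with the Terraform Nomad provider (placeholder values)

    provider "nomad" {
      # Nomad API over the Tailscale network
      address = "http://intel-nuc:4646"
    }

    resource "nomad_job" "traefik" {
      jobspec = file("${path.module}/jobs/traefik.nomad.hcl")
    }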

Ideally, I should be creating the Oracle VM with Terraform as well. But the entire journey has been a series of trial-and-error experiments, so I haven’t done that yet. I think I will migrate to a Terraform-defined VM once I figure out how to get Nomad installed and set up without manually SSHing into the VM.

Typical Setup Workflow

Since most of what I am dealing with are web services, I haven’t run into a situation where I had to handle a non-Docker deployment yet. But I rest easy knowing that when the day comes, I can rely on Nomad’s “exec” and “raw_exec” drivers to run them using pretty much the same workflow.
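
For instance, a non-Docker task would only differ in its driver and config block. A hypothetical sketch:

// Sketch: running a plain script with the raw_exec driver (hypothetical task)

    task "backup" {
      driver = "raw_exec"

      config {
        command = "/usr/local/bin/backup.sh"
        args    = ["--target", "local/backups"]
      }
    }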

Dealing with Secrets

One of the biggest concerns in all of this is dealing with secrets like DB credentials, API keys, etc. For example, how do I supply the DB username and password to the Nomad job running my application without storing them in the job configuration files, which are in version control?

There are many ways to do it, from defining them as Terraform variables stored in a git-ignored file within the repo, to deploying HashiCorp Vault and using the Vault–Nomad integration (which I tried and found to be overkill).

I have chosen the simpler solution of storing them as Nomad Variables. I create them by hand using the Nomad UI, and they are scoped to specific tasks (they can also be created from the CLI, as sketched below).

An example set of secrets
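
The same variables can also be created from the CLI. A sketch with placeholder values – the path mirrors the job/group/task it is scoped to:

// Sketch: creating a Nomad Variable from the CLI (placeholder values)

    $ nomad var put nomad/jobs/owncloud/owncloud/owncloud \
        db_name=owncloud db_username=owncloud db_password=s3cret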

These are then injected into the service’s container as environment variables using a template block with the nomadVar lookup.

// Nomad config for getting secrets from Nomad Variables

      template {
        data = <<EOH
{{ with nomadVar "nomad/jobs/owncloud/owncloud/owncloud" }}
OWNCLOUD_DB_NAME={{.db_name}}
OWNCLOUD_DB_USERNAME={{.db_username}}
OWNCLOUD_DB_PASSWORD={{.db_password}}
OWNCLOUD_ADMIN_USERNAME={{.admin_username}}
OWNCLOUD_ADMIN_PASSWORD={{.admin_password}}
{{ end }}

{{ range nomadService "db" }}
OWNCLOUD_DB_HOST={{.Address}}
{{ end }}

{{ range nomadService "redis" }}
OWNCLOUD_REDIS_HOST={{.Address}}
{{ end }}
EOH
        env = true
        destination = "local/env"
      }

Accessing Private Services

While this entire project began with trying to self-host a Misskey instance as my personal Mastodon alternative, I have started using the setup to run some private services as well – like Node-RED, which runs automations such as my RSS-to-ActivityPub bots.

I don’t expose these to the internet, or even to the Tailscale network, at the moment. They run on my home network on a dynamic port assigned by Nomad, and I access them through the IP:PORT that Nomad reports (as sketched below).

Node-RED running at local IP and Dynamic (random) port
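
The relevant part of such a job is just a dynamic port plus a Nomad-registered service. A trimmed sketch with illustrative names:

// Sketch: dynamic port and Nomad service registration (illustrative names)

    network {
      # no "static" value, so Nomad picks a random available host port;
      # "to" maps it to Node-RED's default container port
      port "http" {
        to = 1880
      }
    }

    service {
      name     = "nodered"
      port     = "http"
      provider = "nomad"
    }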

I will probably migrate these to the Tailscale network if I start travelling and still want access to them. But for now, they are restricted to my home network.

Conclusion

It has been a wonderful journey figuring all of this out over the last couple of weeks, and running the home server has been a great source of satisfaction. With Docker and Nomad, it has been really easy to try out new services and set them up quickly.

I woke up one Sunday wanting to set up Pi-hole for blocking ads, and I had the house running through Pi-hole in 30 minutes. I have found a new kind of freedom with this setup. I see it as a small digital balcony garden on the internet.

References

  1. Understanding Networking in Nomad by Karan Sharma
  2. Translating Docker Compose & Kubernetes to Nomad by Luiz Aoqui
  3. Nomad Past Present and Future by Luiz Aoqui

Setting up my HomeServer – Part 2

In Part 1, I wrote about the hardware, operating system, and the software deployment method and tools used in my home server. In this one, I am going to cover the topic I am least experienced in – networking.

Problem: Exposing the server to the internet

At the heart of it, this is the only thing in networking I really wanted to achieve. Ideally, the following is all I should have needed:

  1. Login to my router
  2. Make the IP of my Intel NUC static, so that DHCP doesn’t assign a different value every time it reboots
  3. Set up port forwarding for 80 (HTTP) and 443 (HTTPS) on the router to the NUC.
  4. Use something like DuckDNS to update my domain records to point to my public address.

I did the first 3 steps and tried to hit my IP. Nothing. After some searching on the internet, I came to realize that my ISP doesn’t provide public IPs for home connections anymore and my router sits behind some kind of NAT (I don’t fully know what it is).

Before I outline how everything is setup, I want to highlight 2 people and their self-hosting related blogs:

  1. Abhay Rana aka Nemo’s Setup
  2. Karan Sharma’s setup

Their blogs provided a huge amount of inspiration and a number of ideas.

Solution: Tailscale + OCI Free VM

After asking around, I settled on using Tailscale and a free VM on Oracle Cloud Infrastructure to route traffic from the internet to my home server.

Here is how Tailscale helps:

  1. Every device on which I install Tailscale and log in becomes part of my private network.
  2. I added my Intel NUC and the Oracle VM to the Tailscale network and added the public IP of the Oracle VM to the DNS records of my domain.
  3. Now requests to my domain go to the OCI VM, which forwards them to my NUC via the Tailscale network.

Some Tips

  1. Tailscale has something called MagicDNS which, once turned on, allows accessing devices by their names instead of their IPs. This makes configuring things much easier.
  2. Oracle VMs have all of their ports blocked by default except for 22 (SSH). So after installing a web server like Nginx or Apache, two things need to be done (see the commands below):
    • Add ingress rules allowing traffic to ports 80 and 443 for the Virtual Cloud Network that the VM is configured with
    • Configure the firewall to open ports 80 and 443 (ufw allow <port>)
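
On the VM itself, opening the ports looks something like this (a sketch, assuming ufw is the active firewall):

$ sudo ufw allow 80/tcp
$ sudo ufw allow 443/tcp
$ sudo ufw status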

I think there are many ways to forward requests from one machine to another (the OCI instance to the home server in this case). I have currently settled on the most familiar one – using Nginx as a reverse proxy.

There are 2 types of applications I plan to host:

  1. Public applications like Misskey which are going to be accessed by anyone who wants to.
  2. Private applications like Node-RED, which I am going to access 99.99% of the time from my laptop connected to my home network.

I deploy my public applications to the Tailscale network IP on my home server and make them listen on a specific port. Then, in the Nginx reverse-proxy configuration on the OCI VM, I set proxy_pass to something like http://intel-nuc:<app-port>. Since I am using Tailscale’s MagicDNS, I don’t have to hard-code IP values here.
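
A trimmed sketch of one such server block – the domain and port are placeholders:

    server {
        listen 443 ssl;
        server_name misskey.example.com;

        # ssl_certificate / ssl_certificate_key lines omitted here

        location / {
            # "intel-nuc" resolves via Tailscale MagicDNS
            proxy_pass http://intel-nuc:3000;
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }
    }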

For private applications, I simply run them on the Intel NUC’s local network – the one my router and everything connected to it (including my laptop) are on – and access them locally.

Up next…

Now that connectivity is sorted out, the next part is deploying the actual applications and related services. I use Nomad for that. In the next post I will share my “architecture”, how I use Nomad and Terraform to do deployments, and some tricks I learnt along the way.

Setting up my HomeServer – Part 1

Preamble

I had an old Intel NUC lying around that I always wanted to put to good use. I had set it up as a home media server using Plex a couple of years back, but streaming services and their newer content, coupled with 100-300 Mbps fiber internet connections, have kind of made it redundant.

So, when I finally wanted to switch from Twitter to Mastodon a couple of weeks back, I got the idea to just host an instance for myself. But Mastodon is written in Ruby, and at the time I was reading through the ActivityPub protocol specification to see if I could create an account using just a couple of scripts and a bunch of JSON files. I didn’t get far with that experiment, but in the process someone pointed me to Misskey. I set up a Misskey instance on Digital Ocean and set out to prepare the NUC as a home server.

Hardware

The NUC I have has an Intel i3 processor with 4 cores running at 2.1 GHz, 16 GB of RAM, and a 128 GB SSD. That’s roughly the equivalent of an AWS EC2 compute-optimized c7g.2xlarge instance, which costs about 0.289 USD/hour – around $200/month.

Sidenote: I know it’s not an apples-to-apples comparison between a hardware device and a cloud instance, but for my intended purposes, it totally works.

OS

I used to have this as an alternate desktop system, so it runs Ubuntu 22.04 Desktop Version with the default Gnome interface. I removed almost all the desktop software and left only the bare minimum necessary to run the OS and the desktop.

I thought about installing the Ubuntu Server edition but couldn’t find a pen drive. So a stripped-down desktop OS will have to do.

Software Deployment

Now this I am very particular about.

  1. I want to use Infrastructure as Code tools like Terraform as much as possible. This allows storing all the necessary configuration in a repo, so if and when my disk fails, I can redeploy everything without having to set each service up by hand again.
  2. I want some form of web UI that can show the running services and, if possible, their resource consumption.

First up was YUNOHost – an OS dedicated to self-hosting that supports a huge catalogue of software and has a nice UI to manage it. But the app I want to host (Misskey) isn’t in the catalogue, and I don’t think storing config as code is an option here.

Next I looked at docker-compose – I am very familiar with it, the config can be stored as files and reused for redeployment, and a lot of software is distributed with docker-compose files anyway. But there is no web UI by default, and I also don’t want to run multiple copies of the same supporting services.[1]

I have some experience with Kubernetes – there are lightweight alternatives like K3s that might be suitable for a single-node system, and it fulfills the config-as-code and web UI requirements. But the complexity is a bit daunting, the YAML can get unwieldy, and everything needs to be in a container.

Finally I settled on Nomad – it seemed to strike a fine balance.

  • It can handle Docker containers, execute shell scripts, run Java programs, etc.
  • The config is written in HCL, which is JSON-like and feels better than YAML.
  • It has a web UI to see running services and resource utilization.

Footnotes

[1] – When software is distributed using docker-compose files, they tend to define all the necessary services. That usually means that, along with the core application, they also include things like a database (Postgres/MariaDB), a web server (Nginx/Apache), a cache (Redis/Memcached), etc. So when multiple applications are deployed with vendor-supplied docker-compose files, you end up running multiple copies of the same services, using up CPU and memory unnecessarily.

Personal Bookmarking using YACY & yacy-it

A recent post on HackerNews titled Ask HN: Does anybody still use bookmarking services? caught my attention – specifically the top response, which mentioned YaCy, a distributed search engine.

The author of the post mentions how, he has configured it to be a standalone personal search engine. The workflow is something like this:

  1. Browse the web
  2. Come across an interesting link that you need to bookmark
  3. Add the URL to the YaCy crawler with depth=0, which crawls just that page and indexes it.
  4. Next time you need it, just search for any word that might be present on the page.

This is brilliant because I don’t have to spend time putting it in the right folder (as with browser bookmarks) or tagging it with the right keywords (as I do in Notion collections). The full-text indexing takes care of this automatically.

But I do have to spend time adding the URL to the YaCy crawler. The user actions are:

  • I have to open http://localhost:8090/CrawlStartSite.html
  • Copy the URL of the page I need to bookmark
  • Paste it in the Crawling page and start the crawler
  • Close the Tab.

Now, this is high friction. The mental load saved by not tagging a bookmark is easily eaten away by doing all of the above.

yacy-it

Since I like the YaCy functionality so much, I decided to reduce the friction by writing a Firefox browser extension – https://github.com/tecoholic/yacy-it

This extension uses the YaCy API to start crawling the current tab’s URL when I click the extension’s icon next to the address bar.

Note: If you notice error messages when using the addon, you might have to configure YaCy for CORS headers as described here https://github.com/tecoholic/yacy-it#configuring-yacy

Add pages to YaCy index directly from the address bar
Right-click on a Link and add it to YaCy Index
If you running YaCy on the cloud or in a different computer on the network, set it in the Extension Preferences

Tip – Search your bookmarks directly from the address bar

You can search through YaCy-indexed links from your address bar by adding YaCy as a search engine in Firefox, as described here => https://community.searchlab.eu/t/adding-yacy-to-firefox-search-menue/95

  1. Go to Settings/Preferences => Search and select “Add search bar in toolbar”
  2. Now go to the YaCy homepage at http://localhost:8090
  3. Click the “lens” icon to open the list of search engines
  4. This should now show the YaCy icon with a tiny + icon. Click that to add it as a search engine.
  5. Go back to the search settings and select “Use the address bar for search and navigation” to hide the search box
  6. Scroll down to Search shortcuts -> double-click the Keyword column next to YaCy and enter a keyword, e.g. @yacy or @bm
  7. Now you can search YaCy from the address bar like @yacy <keyword> or @bm <keyword> to look through your bookmarks.

Podcast Notes: Darren Palmer on the EV Revolution

Preamble

I listen to podcasts while doing chores like cooking, travelling on public transport, doing dishes, etc. In most cases, I just absorb what I can with the partial attention I give the podcast. Recently I thought I would listen to some of them with a bit more attention, like a lecture, and take notes. I learnt a lot in this one – from how EV range anxiety is a non-problem to how wineries hedge the quality of their wine by mixing grapes from different fields.

Introduction

This one is a conversation between Bloomberg Opinion columnist Barry Ritholtz and Darren Palmer, Ford Motor Co.’s vice president for electric vehicle programs. They talk about Darren Palmer’s professional journey at Ford, Ford’s electric vehicle lineup, and the future of Ford and electric vehicles.

Ford Electric Vehicles

  • $11 billion starting investment
  • $50 billion in current investment
  • used lessons from startups for velocity of execution
  • travelled to EV-rich countries, from Norway to China, to learn
  • Interesting case where an EV user declined a 100% refund of the EV plus a BMW in exchange for giving up the EV and going back to a petrol car, because he felt he was holding the future and didn’t want to go back
  • focus on the millennial market + the Mustang brand
  • lack of an operating system was the limiting factor once the design was completed
  • using web technologies for the interface gave the team great velocity
  • the UI was hosted by a developer sitting at home and was live-updated during a test run based on a customer complaint
  • from day one, social media has been watched and feedback followed up
  • Team Edison – has a very flat structure
  • ask what they want and move out of their way
  • fast-track approvals and let them do their jobs

Converting Petrol Heads

  • the Mustang club wasn’t comfortable and said they wouldn’t be endorsing the EV Mustang
  • by the end of the launch, the presidents of both Mustang clubs had purchased multiple vehicles
  • realisation – EVs provide a complementary experience to petrol vehicles
  • petrol heads are not really petrol heads but performance heads, and electric vehicles can deliver better performance
  • the EV Mustang performs so well that the only thing needed to convert a person to EVs is getting them into the vehicle
  • tests for the vehicles are no longer done with humans because the acceleration limits are too high
  • torture tests for Ford vehicles on YouTube

E Transit – Vans

  • Focus on commercial customers
  • Payback starts from day 1
  • running costs are about half
  • unplanned extra mileage is limited because routes are predictable – so EV range anxiety is not an issue
  • trips are planned and executed with right charging window
  • only 10% of our market
  • everyday seeing new use-cases in the commercial sector

Use-case #1: French Winery

  • wineries collect grapes from different parcels of land and mix them to average out the wine quality – if 1 of 3 parcels is bad, they still get a decent wine because of the 2 good ones
  • but they don’t know if they have had 2 of 3 bad until a year later when they pull samples from the vats
  • a winery wanted to get the best quality and decided to use 1 vat per parcel, so at t
  • vat sizes are big and there are restrictions on building above ground, so the winery built the vats underground by a mountainside
  • used diesel vehicles for transportation
  • converted the entire fleet to electric after seeing the performance
  • talks to American winemaker groups
  • now there is a huge demand for electric vehicles from wineries

USE-CASE #2: Mobile kitchens

  • popup kitchens with electric equipment
  • uses the battery to drive to people’s homes, plug in, and start cooking
  • battery operated electrical vehicles can use the battery to power the kitchen

BlueOval Charging Network

  • have the resources to setup independent network
  • but it’s poor value for customers if each manufacturer sets up its own network – so they created a coalition
  • there are regional networks
  • remote monitoring of all stations
  • payment systems, terminals, ..etc all have to play along
  • they have instrumented vehicles, roaming the network, just to test the charging stations
  • problems identified are communicated to the CEOs of the station operators in the network
  • problematic stations can be removed from the network if the quality problems aren’t addressed

Long charge times

  • perception is very different
  • battery life on an iPhone is bad compared to a flip phone, yet no one wants to switch back to a flip phone after using an iPhone
  • the process of going to get petrol is obsolete
  • it actually takes 30 seconds to charge the car – because all you do is plug in
  • daily charging is almost never conscious – go home, plug in and walk away; it’s charged and ready to go when you come back to it
  • tech to charge faster is still developing
  • but it makes less difference than we think because the human element plays a very big role in when/how we charge
  • with 300 mile cars (current range of Ford EVs), it’s not actually a requirement
  • users will take a break after driving for a couple of hours (~200 miles), and those pit stops are more than enough to replenish the charge
  • very long trips, like 800-1000 miles

Recycling and reusability

  • materials are really in demand
  • if they don’t have access to minerals for batteries now, they are going to be in trouble
  • BlueOval City – a vertically integrated site for battery production
  • collection of leftovers and recycling to be more efficient in production

Buttons in the Cars

  • hardware switches/buttons are difficult to modify once installed
  • software provides the ability to change and modify
  • context level buttons – adaptable interface based on the context
  • parking camera buttons are not needed at 60 miles/hr – UI can hide them
  • smart vehicles can remember the way you park, auto-launch the camera in parking mode, remember which camera is used most often, and launch that one automatically
  • smart UI with sensible human overrides
  • we think we need a button until we are presented with a better experience
  • no more cycling through the cameras by pressing the same button again and again

Customer education

  • basic problems because customers completely ignore the instructions
  • they just buy the car and use it without any research
  • so gamification-style notifications are used to teach customers about the nuances of an EV

Goodbye! Brave

Final Update (moved old updates to bottom)

Brave supports Chrome extensions. The problem was with the author’s version of Brave; it was roughly a year old. Very old versions of Brave didn’t include service keys (necessary for interacting with Brave’s privacy-preserving proxy-service), whereas modern versions do (which is why you and I are able to install extensions without any issue)

Sampson from Brave

To explain – the source from which I installed Brave hasn’t made a newer version of the browser available since December 2020, so the keys it shipped with have become outdated. Since no update was available, I didn’t see the usual orange “Update” button on the taskbar.

⚠️ NOTICE: Closing comments as they have moved from discussing the issue to attacking me for not being crypto friendly.

Original Post

I have been using the Brave browser for almost 2 years, I think. @logic introduced it to me at some point, and it has been my primary browser on desktop and mobile, on both home and office computers, since then.

I got my first heads-up when I came across a post on HackerNews about Brave misbehaving because the “Brave backend servers” were unreachable. It struck me as strange when a comment on the GitHub ticket mentioned that Brave’s servers need to be up for Brave to function.

This is a big design no-no for something as essential as a web browser. But then the inertia of it being my daily driver, its amazing ad blocking and tracker protection, Chrome extension compatibility, and the fact that I hadn’t faced any such issues myself kept me from making any changes.

Today I was looking to install an extension to manage the browser tabs and I ran into this

Can’t install any extension

I thought maybe the extension was buggy, so I tried a couple more – the same result for every one of them. Searching for the error led me to this GitHub ticket, which again describes it as a “server-side” issue that has been fixed.

Well, it is not fixed for me. But that’s beside the point. This amount of dependency on “backend servers” is ridiculous for a browser. For software as important as a browser – through which I have come to access almost everything digital – it is unacceptable. So, with this post being the last thing I do on Brave, I bid it goodbye.

Exploring options…

  1. An interesting alternative is Vivaldi – it is trying to do what Opera did pre-Chrome, rolling email, calendar, RSS reader and browser all into one, and it also provides built-in ad blocking.
  2. Open-source Chrome, aka Chromium – this used to be my primary driver before, so I am thinking of going back to it with the usual extensions like Ghostery, AdBlock Plus, etc. Not sure how much things have changed there.

Update:

Not sure who posted this in HackerNews. Thanks for all the feedback.

  1. I will be trying Firefox. So many people have recommended it. It’s something I had forgotten about over the last couple of years; before that, it frequently caused issues and was only my secondary browser for testing.
  2. There is nothing sinister about the decision or PR at work. I tried installing extensions, it didn’t work, I uninstalled and made a note of why I am doing it. Interpretations are all yours.

Update 2:

This is for people suggesting I jumped the gun and probably didn’t take the time to understand the real problem. I am a Chrome extension author myself; I had published a new version of my extension only 8 hours earlier and tested installation on both Brave and Chrome. So I understand the issue, and I have linked to the GitHub issues where it has been discussed.

Data Science – First Impressions

After some thought on what to learn next, I enrolled in the IBM Data Science Professional Certificate program. It is a beginner-level program that provides the necessary foundations for data science. I have completed 3 courses so far:

  1. What is Data Science?
  2. Tools for Data Science
  3. Data Science Methodology

I have liked all the courses so far. Despite the big presence of IBM Cloud tools, the courses cover the concepts in a fairly generic way.

This post is NOT a review of the course or certification.

This post is mostly about my first impressions on the domain of Data Science.

1. It’s not strictly a science

Data Science doesn’t have a clear definition. Everyone defines it the way they want. Murtaza Haider, author of Getting Started with Data Science, puts it simply as:

Data Science is what data scientists do

It is not statistics either, so I guess instead of inventing a new word for experimental statistics, data analysis, visualisation and model building, someone called it science.

2. Engineers will probably hate it

If there is one underlying principle that defines engineers, it is their love of certainty. Engineers strive to build systems governed by a set of rules that produce predictable outcomes. Data science is quite the other way around: you start with a question and move towards an answer, which could be anything, valid or invalid. It is a frustrating experience to go all the way without knowing where you will end up. Sure, we get some hints here and there throughout the process. If one has a “the journey is more important than the destination” attitude, it is a good ride. But if you yearn for certainty or get frustrated midway through the process, you will probably end up torturing the data until it confesses whatever you want it to.

3. The Jargon

It is a field full of jargon. Everything has a specific word or phrase. If you open the data in Excel and look at it, you would probably call it inspecting the data. As a data scientist, you do the same with a couple of lines of Python code in a Jupyter Notebook and call it “Exploratory Data Analysis” or EDA. If you change values or format them to a specific standard, it is “Data Manipulation” or DML (did that even need a three-letter acronym?). If you query some data and do some manipulations, it is an Extract-Transform-Load or ETL pipeline.

While some of them are valid terminologies (like ETL) which exist to communicate effectively, some of them are just marketing jargon which exists to make stuff seem bigger than they are.

4. The Ah..ha moments

This is the best thing about data science, according to me. It is those moments when your perception is altered by the data, or when the output is altered by your perception. Engineering, by its nature, is the application of established principles; we get our ah-ha moments when we learn a new concept invented by a scientist. But in data science, the work being exploratory by nature, the data itself can produce such moments. Since we start with a question and follow the data to the answer, it usually ends in a light-bulb moment.

5. Practicality

This is the area that is most conflicting for me. Things like “data is the new oil” have been said for close to a decade now, and the general consensus seems to be that the more data, the better. But both my personal experience as a teacher and the case study about a health insurance provider in the course have made me wary of the practicality of this approach. I could write about my personal experience as anecdotal evidence, but I think I would rather give a more data-oriented reason (the post being about data science and all).

Explosion of healthcare administrators in the US

While studying the data science methodology, it becomes acutely clear how this happened. I am not saying data science is the root cause of the problem, but it is definitely a contributing factor, as no model built by a data scientist can be static. It is an iterative process that keeps the cycle of data collection, modelling and output going. So when overdone, it feels like more of a problem than a solution.

6. The machines are coming for the data scientists

The hype for data scientists went mainstream, I think, with the McKinsey report predicting a shortage of 140,000 to 190,000 data professionals in the US alone by 2018, followed by data scientists becoming the highest paid among technical jobs. Going by the IBM tools I have used during the course, I don’t think the hype will age well.

Money is in automation – an estimated 70 to 90% of a data scientist’s time is spent collecting and preparing data for analysis. If there is anything computers were invented for, it is automating such mundane processes. Even if a company saves 50% of that time, it is a great cost saving. So automation tools will cut down the work, and thus the demand for data scientists.

Side Note: You can create a free account at cloud.ibm.com and checkout their tools to get an idea of the direction and sophistication of the tools that are being developed.

Uncertainty of the outcome – as mentioned earlier, not all organisations will gain from having a data science team. Apart from the data, a variety of things – organisation size, the scope for data-driven decision making, and the talent of the data science team – all impact the outcome of the exercise. Combined with off-the-shelf offerings from IT companies, the data scientist’s role might shrink to that of a computer operator in some cases.

No, I am not saying data scientists will become obsolete. Just that it is not going to live up to the hype.

Before you throw brickbats

I have completed only 3 of 9 courses in the program. Once I complete another 3 courses, I think I will have a better understanding of things and will revisit the topic again.

text/plain MIME Type and Python

When you do echo "x" > my_file and then check its MIME type using file --mime-type my_file, it would say text/plain. But when you do the same in Python with

with open("my_file_2", "w") as fp:
    fp.write("x")

and then check the MIME type, it would say application/octet-stream. What’s the difference?

For the impatient

echo adds a newline to the file, which tells the file utility that it is a text file.

For the curious

When I saw this question on StackOverflow, I was really stumped due to the following reasons:

  1. I didn’t know the file utility could be used to get the MIME type of a file. I thought MIME types were only relevant in the context of web servers and clients. After all, MIME stands for Multipurpose Internet Mail Extensions.
  2. I thought operating systems usually use the file extension to decide the file type and, by extension, the MIME type. Don’t OSes warn us every time we touch the extension part of a file while renaming it? So how does the file utility do this on files without any extension?

Adding extensions

Let’s try adding extensions:

$ echo "x" > some_file.txt
$ file --mime-type some_file.txt
some_file.txt: text/plain

Okay, that’s all good. Now to the Python side:

with open("some_file_2.txt", "w") as fp:
fp.write("x")
$ file --mime-type some_file_2.txt
some_file_2.txt: application/octet-stream

What? file doesn’t recognise file extensions?

The OS conspiracy theory

Maybe echo writes the MIME type as metadata onto the disk, because echo is a system utility and knows how to do that, while in Python the user (me) doesn’t? Clearly the operating system utilities are a cabal holding some forbidden knowledge. And I am going to uncover it today, starting with the file utility, which seems to give different answers to different programs.

How does ‘file’ determine MIME Type?

The answers to this question have some useful information:

How do you change the MIME type of a file from the terminal?

  1. MIME Type is a fictional value. There is no inherent metadata field that stores MIME Types of files.
  2. Each operating system uses a different technique to decide file type. Windows uses file extension, Mac OS uses type creator & type codes and Unix uses magic numbers.
  3. The file command guesses file type by reading the content and looking for magic numbers and strings.

Time to reveal the magic

Let us peer into the souls of these files in their purest forms where there is no magic but only 1s and 0s. So, I printed the binary representation of the two files.

$ xxd -b my_file
00000000: 01111000 00001010 x.

$ xxd -b my_file_2
00000000: 01111000 x

The file generated by echo has two bytes (notice the . after the x) whereas the file I created with Python only has one byte. What is that second byte?

>>> number = int('00001010', 2)
>>> chr(number)
'\n'

And it turns out that, like in every movie about magic, there is no such thing as magic – just clever people adding newlines to tell file it is a text file.
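
Which also means the Python snippet from the beginning only needs a newline to be classified as text. A quick check (the file name is arbitrary):

with open("my_file_3", "w") as fp:
    fp.write("x\n")

$ file --mime-type my_file_3
my_file_3: text/plain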

Creating a trick

Now that the trick is revealed, let’s create our own magic trick:

$ echo "<file></file>" > xml_file
$ file --mime-type xml_file
xml_file: text/plain

$ echo "<?xml version="1.0"?><file></file>" > xml_file
$ file --mime-type xml_file
xml_file: text/xml

Useful Links

  1. https://www.baeldung.com/linux/file-mime-types
  2. https://unix.stackexchange.com/questions/185216/file-command-apparently-returning-wrong-mime-type
  3. https://stackoverflow.com/questions/29017725/how-do-you-change-the-mime-type-of-a-file-from-the-terminal

Building a quick and dirty data collection app with React, Google Sheets and AWS S3

Covid-19 has created a number of challenges for society that people are trying to solve with the tools they have. One such challenge was creating an app to collect data from volunteers about the food supply requirements of their communities.

This needed a form with the following inputs:

  1. Some text inputs like the volunteer’s name, vehicle number, delivery address, etc.
  2. The location as geographic coordinates, so that the delivery person can launch Google Maps and drive to the place
  3. A couple of photos of the closest landmark and the delivery building.

Multiple ready-made solutions like Google Forms and Zoho Forms were attempted, but we hit a block when it came to a map widget that lets the location be picked manually, and to uploading photos. After an insightful experience with CovidCrowd, we were in no mood to build another CRUD app with a database, servers, etc. So the hunt for a low-to-zero-maintenance solution resulted in a collection of pieces that work together like an app.

Piece 1: Google Script Triggers

Someone has successfully converted a Google Sheet into a writable database (sort of) with some Google Apps Script magic. This allows a form to be submitted to the Google Sheet, with the information stored in columns like in a database. This solved two issues: no need for a database, and no need for a back-end interface to access the data.
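
The Apps Script side boils down to a small doPost handler bound to the sheet. A rough sketch – the sheet and field names are placeholders, not the exact script we used:

// Google Apps Script web app – rough sketch with placeholder sheet/field names
function doPost(e) {
  var data = JSON.parse(e.postData.contents);
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("Responses");
  // One row per form submission
  sheet.appendRow([
    new Date(),
    data.volunteerName,
    data.vehicleNumber,
    data.address,
    data.mapsUrl,
    (data.photoUrls || []).join(", ")
  ]);
  return ContentService.createTextOutput(JSON.stringify({ status: "ok" }))
    .setMimeType(ContentService.MimeType.JSON);
}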

Piece 2: AWS S3 Uploads from Browser

The AWS JavaScript SDK allows direct upload of files into buckets from the browser using Cognito credentials and a pool ID. Now we can upload the images to the S3 bucket and send the URLs of the images to the Google Sheet.
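
The upload itself is only a few lines in the browser. A sketch assuming the v2 JavaScript SDK, with placeholder region, identity pool ID and bucket name:

// Browser-side S3 upload – sketch using AWS SDK for JavaScript v2 (placeholder IDs)
AWS.config.update({
  region: "ap-south-1",
  credentials: new AWS.CognitoIdentityCredentials({
    IdentityPoolId: "ap-south-1:00000000-0000-0000-0000-000000000000"
  })
});

var s3 = new AWS.S3({ params: { Bucket: "resource-form-photos" } });

function uploadPhoto(file) {
  // Resolves with { Location: "https://..." } – that URL goes into the Google Sheet
  return s3.upload({
    Key: "photos/" + Date.now() + "-" + file.name,
    Body: file,
    ContentType: file.type
  }).promise();
}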

Piece 3: HTML 5 Geolocation API and Leaflet JS

Almost 100% of this data collection is going to happen via a mobile phone, so we have a high chance of getting the location directly from the browser using its native Geolocation API. In scenarios where the device location is not available or the user has denied location access, a LeafletJS widget embedded in the form shows a marker that the user can move to the right location manually. Either way, the location is sent to the Google Sheet as a Google Maps URL with the lat-long.
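
The fallback logic is roughly: try the Geolocation API first, and only show the draggable Leaflet marker when that fails. A sketch – the element id and default coordinates are made up:

// Location capture – sketch; element id and default coordinates are made up
function captureLocation(onLocation) {
  navigator.geolocation.getCurrentPosition(
    function (pos) {
      onLocation(pos.coords.latitude, pos.coords.longitude);
    },
    function () {
      // Location denied/unavailable: let the user drop a marker manually
      var map = L.map("map-picker").setView([13.0827, 80.2707], 13);
      L.tileLayer("https://tile.openstreetmap.org/{z}/{x}/{y}.png").addTo(map);
      var marker = L.marker([13.0827, 80.2707], { draggable: true }).addTo(map);
      marker.on("dragend", function () {
        var p = marker.getLatLng();
        onLocation(p.lat, p.lng);
      });
    }
  );
}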

Piece 4: Tying it all together – React

All of this was tied together into a React app using React Hook Form, with data validation and custom logic that orchestrates the location capture, file uploads, etc. When the app is built, it produces an index.html file and a bunch of static CSS and JS files, which can be hosted for free on GitHub Pages or as a subdirectory on an existing server – maybe even served as gzipped files over a CDN, because there is nothing to be done on the server side.

We even added things like an image preview in the form, so the user can see the photos they are uploading.


Architecture Diagram


Caveats

  1. Google Script trigger limits – there is a limit to how many times the Google Script can be triggered.
  2. AWS pool ID exposed – the pool ID, with write capabilities, is exposed to the world. Someone smart enough could turn the S3 bucket into their free storage, or, if DELETE access is enabled, delete your data as well.
  3. DDoS and spam – there are also other considerations, like spam submissions through the Google Script trigger, or a DDoS of random requests to the Google Script URL that exhausts the limits.

All of these are overlooked for now, as the volunteers involved are trusted and the URL is not publicly shared. Moreover, the entire lifetime of this app might be just a couple of weeks. For now, this zero-maintenance architecture lets us collect the custom data we want.

Conclusion

Building this solution showed me how problems can be solved without having to write a CRUD app with an admin dashboard every time. Sometimes a Google Sheet might be all we need.

Source Code: https://github.com/tecoholic/ResourceForm

PS: Did you know Covid19India.org is just a single Google Sheet and a collection of static files on GitHub Pages? It serves 150,000 to 300,000 visitors at any given time.

Lottie – Amazing Animations for the Web


Modern websites come with some amazing animations. I remember Sentry.io used to have an animation that showed packets of information going through a system and getting processed, etc. If you browse Dribbble, you will see a number of landing-page animations that just blow your mind. The most mainstream brand that employs animations is Apple – their web page was a playground when they launched Apple Arcade.

Sidenote: Sadly, all these animations vanish once the pages are updated. It would be cool if they could be saved in some gallery where we could view them at a later point in time.

We were left wondering: how do they do it?


I might have found the answer to this. The answer could be Lottie.

What is Lottie? The website says

A Lottie is a JSON-based animation file format that enables designers to ship animations on any platform as easily as shipping static assets. They are small files that work on any device and can scale up or down without pixelation.

Go to their official page here to learn more. It is quite interesting.

Take a peek at the gallery as well; there are some interesting animations that can be downloaded and used on websites for free.