Personal Bookmarking using YaCy & yacy-it

A recent post on HackerNews titled Ask HN: Does anybody still use bookmarking services? caught my attention, specifically the top response, which mentioned the distributed search engine YaCy.

The author of that response mentions how he has configured it as a standalone personal search engine. The workflow is something like this:

  1. Browse the web
  2. Come across an interesting link that you need to bookmark
  3. Add the URL to the YaCy Crawler and crawl with depth=0, which crawls just that page and indexes it.
  4. Next time you need it, just search for any word that might be present on the page.

This is brilliant, because I don’t have to spend time putting it in the right folder (like in browser bookmarks) or tagging it with the right keywords (as I do in Notion collections). The full-text indexing takes care of this automatically.

But I do have to spend time adding the URL to the YaCy Crawler. The user actions are:

  • I have to open http://localhost:8090/CrawlStartSite.html
  • Copy the URL of the page I need to bookmark
  • Paste it in the Crawling page and start the crawler
  • Close the Tab.

Now, this is high friction. The mental load saved by not tagging a bookmark is easily eaten away by doing all of the above.


Since I like the YaCy functionality so much, I decided to reduce the friction by writing a Firefox browser extension.

This extension uses the YaCy API to start crawling the current tab’s URL when I click the extension’s icon next to the address bar.
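Under the hood, the crawl-start call the extension makes can be sketched in Python with only the standard library. The parameter names follow YaCy’s Crawler_p.html servlet as I understand it; treat the endpoint, the parameters and the omitted authentication as assumptions to verify against your YaCy instance:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

YACY_BASE = "http://localhost:8090"  # your YaCy peer

def build_crawl_url(page_url, depth=0, base=YACY_BASE):
    """Build the YaCy crawl-start request for a single page."""
    params = urlencode({
        "crawlingMode": "url",
        "crawlingURL": page_url,
        "crawlingDepth": depth,   # depth=0 indexes just this page
        "indexText": "on",
    })
    return f"{base}/Crawler_p.html?{params}"

def bookmark(page_url):
    # Fires the crawl. The admin API may require authentication,
    # which this sketch omits.
    with urlopen(build_crawl_url(page_url)) as resp:
        return resp.status

# Example:
# bookmark("https://example.com/interesting-article")
```

This is essentially all the extension has to do: one HTTP request per bookmark.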

Note: If you notice error messages when using the addon, you might have to configure YaCy to send CORS headers as described here.

  • Add pages to the YaCy index directly from the address bar
  • Right-click on a link and add it to the YaCy index
  • If you are running YaCy in the cloud or on a different computer on the network, set its address in the Extension Preferences

Tip – Search your bookmarks directly from the address bar

You can search through YaCy-indexed links from your address bar by adding YaCy as a search engine in Firefox, as described here:

  1. Go to Settings/Preferences => Search and select “Add search bar in toolbar”
  2. Now Go to the YaCy homepage at http://localhost:8090
  3. Click the “Lens” icon to open the list of search engines
  4. This should now show the YaCy icon with a tiny + icon. Click that to add it as a search engine.
  5. Go back to search settings and select “Use the address bar for search and navigation” to hide the search box
  6. Scroll down to Search shortcuts -> double-click the Keyword column next to YaCy and enter a keyword, e.g., @yacy or @bm
  7. Now you can search YaCy from the address bar like @yacy <keyword> or @bm <keyword> to look through your bookmarks.

NER Annotator / NER Tagger for Spacy

NER Annotator is now available to use directly from the browser


As with most things, this started with a problem. Dr. K. Mathan is an epidemiologist tracking Covid-19. He wanted to automate the extraction of details from government bulletins for data collection. It was a tedious manual process of reading the bulletins and entering the data by hand. Since the bulletins have paragraphs of text with the data buried in them, I was looking to see if I could leverage any NLP (Natural Language Processing) tools to automate parts of it.

Named Entity Recognition

The search led to the discovery of Named Entity Recognition (NER) using spaCy and the simplicity of the code required to tag the information and automate the extraction. It kind of blew away my worries of doing Parts of Speech (POS) tagging and then writing a custom extraction algorithm. So I copied some text from Tamil Nadu Government Covid bulletins and set out to test the effectiveness of the solution. It worked pretty well for the small amount of training data (3 lines) vs extracted data (26 lines).
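For context, spaCy’s (v2-era) NER training data is just text paired with character offsets of the entities. A minimal example in the spirit of the bulletins; the sentence, numbers and labels below are made-up placeholders, not the actual bulletin data:

```python
# spaCy v2-style NER training data:
# (text, {"entities": [(start_char, end_char, label)]})
TRAIN_DATA = [
    (
        "Chennai reported 1523 new cases on 21st June",
        {"entities": [(0, 7, "DISTRICT"), (17, 21, "CASES"), (35, 44, "DATE")]},
    ),
]

def check_offsets(example):
    """Return the surface strings the character offsets point at.

    Getting these start/end indices exactly right is the tedious,
    error-prone part of producing training data by hand.
    """
    text, ann = example
    return [text[s:e] for s, e, _ in ann["entities"]]
```

Running check_offsets over hand-made data is a quick sanity check before feeding it to spaCy.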

Trying out NER based extraction in Google Colab Notebook using spaCy

But it had one serious issue: generating training data. Generating training data for NER annotation is a pain; it is in fact the most difficult task in the entire process. The library itself is so simple and friendly to use that producing the training data is the hard part.

Creating NER Annotator

NER annotation is a fairly common use case and there are multiple tagging tools available for the purpose. But the problem is that they are either paid, too complex to set up, require you to create an account or sign up, or sometimes don’t generate the output in spaCy’s format. The one that seemed dead simple was Manivannan Murugavel’s spacy-ner-annotator. That’s what I used for generating test data for the above example. But it is kind of buggy: the indices were out of place and I had to manually change a number of them before I could successfully use it.

After a painfully long weekend, I decided it was time to just build one of my own. Manivannan’s tagger just uses JavaScript to create the training-data JSON and then requires a conversion using a Python script before it can be used. I decided to make mine a little more bug-proof.

This version of NER Annotator would:

  1. Use a Python backend to tokenize and detokenize text for tagging and generating training data.
  2. Provide a UI that lets me select tokens (an idea copied from Prodigy from the spaCy team), so I don’t have to be pixel-perfect in my selections.
  3. Generate JSON which can be directly loaded, instead of having to post-process it with a Python script.
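The token-selection idea in point 2 can be sketched in plain Python: a simplified whitespace tokenizer (a stand-in for the real backend’s tokenizer) plus a helper that turns a selection of whole tokens into the character span spaCy expects:

```python
import re

def tokenize(text):
    """Split text into (token, start, end) triples."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def tokens_to_entity(tokens, first, last, label):
    """Convert a selection of whole tokens (by index) into a
    spaCy-style (start, end, label) character span, so the user
    never has to be pixel-perfect when dragging over text."""
    return (tokens[first][1], tokens[last][2], label)

# Selecting the first two tokens tags "Erode district" as a DISTRICT:
toks = tokenize("Erode district reported 96 cases")
entity = tokens_to_entity(toks, 0, 1, "DISTRICT")
```

Because spans always snap to token boundaries, the generated offsets can’t end up mid-word, which is exactly the class of bug the older tagger suffered from.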

The Project

I created the NER Annotator with the above goals as a Free and Open Source project and released it under the MIT License.

Github Link:


Thanks to Philip Vollet noticing it and sharing it on LinkedIn and Twitter, the project has gotten about 107 stars on GitHub and 14 forks, which is much more reach than I had hoped for.

Thanks to @1littlecoder for making a YouTube video of the tool, showing the full process of tagging some text, training a model and performing extractions.

Featured on TheNextWeb & Lifehacker

Something really cool happened this week. I will let the tweets take over.

… and that’s how I made it to the homepage of TheNextWeb.

… and Lifehacker

Source code of the extension:

For Chrome: Chrome Webstore

For Firefox:

What did I learn from this?

The most important thing I learnt while doing this is probably that the extension architecture is standardised across Chrome and Firefox. Thanks to Shrinivasan for asking me to port it to Firefox.

But I think the relationship is one-sided. Firefox can work with extensions written for Chrome, but Chrome won’t work with extensions written for Firefox. This is due to the nature of Firefox’s API and the fallbacks it offers.

For example, the storage API on Firefox is browser.storage.* whereas on Chrome it is chrome.storage.*. Since Firefox has fallbacks for all the chrome.* APIs, code primarily written for Chrome works without modification on Firefox. But if a developer writes the plugin first for Firefox using the browser.* namespace, Chrome wouldn’t recognise that namespacing and the extension won’t work.

More technical details here at MDN web docs: Building a cross-browser extension

Special thanks to @tshrinivasan for pushing me to build it for Firefox, to @SuryaCEG for the UX advice and to @IndianIdle for writing the article.

Building a quick and dirty data collection app with React, Google Sheets and AWS S3

Covid-19 has created a number of challenges for society that people are trying to solve with the tools they have. One such challenge was to create an app for collecting data from volunteers on the food supply requirements of their communities.

This needed a form with the following inputs:

  1. Some text inputs like the volunteer’s name, vehicle number, delivery address, etc.
  2. The location in geographic coordinates, so that the delivery person can launch Google Maps and drive to the place
  3. A couple of photos of the closest landmark and the delivery building.

Multiple ready-made solutions like Google Forms and Zoho Forms were attempted, but we hit a block when it came to having a map widget that would let the location be picked manually, and to uploading photos. After an insightful experience with CovidCrowd, we were no longer in a mood to build a CRUD app with a database, servers, etc. So the hunt for a low-to-zero-maintenance solution resulted in a collection of pieces that work together like an app.

Piece 1: Google Script Triggers

Someone has successfully converted a Google Sheet into a writable database (sort of) with some Google Script magic. This allows any form to be submitted to the Google Sheet, with the information stored in columns like in a database. This solved two issues: no need to have a database, and no need for a back-end interface to access the data.
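From the client’s point of view, a submission to such a Google Script web app is just a form POST. A sketch in Python for brevity (the real app posts from the browser; the deployment URL and column names below are hypothetical):

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical web-app URL of the deployed Google Script; the real
# deployment ID differs.
SCRIPT_URL = "https://script.google.com/macros/s/DEPLOYMENT_ID/exec"

def build_row(name, vehicle, address, maps_url, photo_urls):
    """Shape one form submission as the flat key/value dict the
    Google Script appends to the sheet as a row. Field names are
    illustrative."""
    return {
        "name": name,
        "vehicle": vehicle,
        "address": address,
        "location": maps_url,
        "photos": ",".join(photo_urls),
    }

def submit(row, url=SCRIPT_URL):
    # POST as form data; the Apps Script doPost(e) handler reads
    # e.parameter and appends a row to the sheet.
    req = Request(url, data=urlencode(row).encode())
    with urlopen(req) as resp:
        return json.load(resp)
```

Each key becomes a column, so the sheet ends up looking like a table in a database.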

Piece 2: AWS S3 Uploads from Browser

The AWS JavaScript SDK allows direct upload of files into buckets from the browser using the Cognito credentials and Pool ID. Now we can upload the images to the S3 bucket and send the URLs of the images to the Google Sheet.
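What the browser-side SDK does can be mirrored in Python with boto3: trade the identity pool ID for temporary credentials, then upload. This is an illustrative sketch, not the app’s actual code; the bucket, pool ID and key scheme are placeholders:

```python
import mimetypes
import uuid

def object_url(bucket, region, key):
    """Public URL of an uploaded object, which is what gets written
    into the Google Sheet."""
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"

def upload_photo(path, bucket, region, identity_pool_id):
    """Exchange the Cognito identity pool ID for temporary
    credentials, then PUT the file into the bucket."""
    import boto3  # imported here so the sketch loads without boto3 installed

    cognito = boto3.client("cognito-identity", region_name=region)
    identity = cognito.get_id(IdentityPoolId=identity_pool_id)
    creds = cognito.get_credentials_for_identity(IdentityId=identity["IdentityId"])
    c = creds["Credentials"]

    s3 = boto3.client(
        "s3",
        region_name=region,
        aws_access_key_id=c["AccessKeyId"],
        aws_secret_access_key=c["SecretKey"],
        aws_session_token=c["SessionToken"],
    )
    # Random prefix avoids collisions between volunteers' uploads.
    key = f"uploads/{uuid.uuid4()}-{path.split('/')[-1]}"
    content_type = mimetypes.guess_type(path)[0] or "application/octet-stream"
    s3.upload_file(path, bucket, key, ExtraArgs={"ContentType": content_type})
    return object_url(bucket, region, key)
```

The returned URL is what goes into the sheet alongside the rest of the form fields.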

Piece 3: HTML 5 Geolocation API and Leaflet JS

Almost 100% of this data collection is going to happen via a mobile phone, so we have a high chance of getting the location directly from the browser using the browser’s native Geolocation API. In a scenario where the device location is not available or the user has denied location access, a LeafletJS widget embedded in the form shows a marker which the user can move to the right location manually. This is also sent to the Google Sheet as a Google Maps URL with the lat-long.

Piece 4: Tying it all together – React

All of this was tied together into a React app using React Hook Form, with data validation and custom logic which orchestrates the location capture, file uploads, etc. When the app is built, it results in an index.html file and a bunch of static CSS and JS files which can be hosted freely on GitHub Pages or in an existing server as a subdirectory. The files could even be gzipped and served over a CDN, because there is nothing to be done on the server side.

We even added things like image preview in the form so the user can see the photos he is uploading on the form.


Architecture Diagram



  1. Google Script trigger limits – There is a limit to how many times the Google Script can be triggered
  2. AWS Pool ID exposed – The Pool ID with write capabilities is exposed to the world. Someone smart enough could turn your S3 bucket into their free storage, or, if you have enabled DELETE access, make you lose your data as well.
  3. DDoS and spam – There are also other considerations like spamming by watching the Google Script trigger, or DDoS by hitting the Google Script URL with random requests until the limits are exhausted.

All of these are overlooked for now, as the volunteers involved are trusted and the URL is not publicly shared. Moreover, the entire lifetime of this app might be just a couple of weeks. For now this zero-maintenance architecture allows us to collect the custom data we want.


Building this solution showed me how problems can be solved without having to write a CRUD app with an admin dashboard every time. Sometimes a Google Sheet might be all that we need.

Source Code:

PS: Do you know is just a single Google Sheet and a collection of static files on GitHub Pages? It serves 150,000 to 300,000 visitors at any given time.


The Mastodon ecosystem is really nice. The concept of the Fediverse bringing in a decentralized social network is a breath of fresh air. I created the NiftyBot account in – a dedicated server for Mastodon bots.

What is NiftyBot?

  • It is a Mastodon Bot Account

What does it do?

  • It posts updates about Indian Markets
  • Currently it posts the NSE closing report at 4:01 PM every day. Sample post below


How does it work?

It is a Python script running in AWS Lambda.


A scheduler triggers the Lambda function at 4:01 PM every Monday to Friday. The Lambda function is a Python script that fetches the necessary details from NSE’s output data and posts to Mastodon.
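A minimal sketch of such a scheduled Lambda, assuming the Mastodon.py package and a hypothetical fetch_nifty_closing() helper; this is not the bot’s actual source, and the instance URL, token and field names are placeholders:

```python
def format_closing_report(nifty):
    """Turn fetched index data into the toot text. The keys in
    `nifty` are assumptions, not NSE's actual schema."""
    return (
        "NSE Closing Report\n"
        f"NIFTY 50: {nifty['close']:,.2f} "
        f"({nifty['change']:+.2f}, {nifty['pct_change']:+.2f}%)"
    )

def lambda_handler(event, context):
    """Entry point invoked by the CloudWatch schedule at 4:01 PM."""
    from mastodon import Mastodon  # needs the Mastodon.py package

    nifty = fetch_nifty_closing()  # hypothetical fetch helper
    client = Mastodon(
        access_token="YOUR-TOKEN",
        api_base_url="https://example.social",  # your Mastodon instance
    )
    client.status_post(format_closing_report(nifty))
    return {"status": "posted"}
```

Keeping the formatting in its own pure function makes the message easy to test without touching the network.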

Source Code:

Some asked if this bot is open source. Obviously, you can see the source right here. 🙂 Still, I will add the license here.

The above source code is released into the Public Domain. You can do whatever you want with it.

How much does it cost to run this Bot?


Numbers Please:

The AWS Lambda free tier comes with 1 million requests and 400,000 GB-seconds, which is a combination of how much memory we use and the time taken by our process. Since I have used a CloudWatch scheduled event as the trigger, I am using 20-22 requests a month. The Python function takes about 60 MB to run, so it runs in the lowest 128 MB memory block, and usually completes in around 2600-2700 msec. The metrics say my worst billed event so far is about 0.3375 GB-seconds. With about 20-22 trading days in a month, I might use a total of 8-10 GB-seconds, leaving enough room to run many more bots like this one 🙂
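The back-of-the-envelope maths above can be checked in a few lines:

```python
# Free-tier arithmetic for the bot's worst case.
MEMORY_GB = 128 / 1024          # lowest Lambda memory block, in GB
DURATION_SEC = 2.7              # ~2700 ms per run (worst observed)
FREE_TIER_GBS = 400_000         # GB-seconds free per month

per_run = MEMORY_GB * DURATION_SEC   # GB-seconds billed per run
monthly = per_run * 22               # ~22 trading days a month

print(f"{per_run:.4f} GB-s per run")     # matches the 0.3375 figure
print(f"{monthly:.1f} GB-s per month")
print(f"~{int(FREE_TIER_GBS / monthly):,} such bots fit in the free tier")
```

The 0.3375 GB-seconds per run comes straight out of 0.125 GB × 2.7 s, so the free tier is barely dented.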

Districts of Tamil Nadu – Hexagonal Maps

Hexagonal maps are useful for visualising data points that represent quantities requiring equal-sized polygons, for example election maps. While a geographic map of the assembly and parliamentary constituencies might be used for visualising election results, it is a misleading one. The constituencies are created based on the number of people in a region, under the principle of one representative for every X number of people. In such a scenario, using a geographic representation for a place like Chennai, which is made of 3 parliamentary constituencies, doesn’t give us the same perception as the 3 constituencies Kanniyakumari, Thoothukudi and Thirunelveli put together. But in reality that’s what it actually means.


  • Geographic representation – Skewed representation of the base data. Unequal real estate for an (approximately) equal number of people
  • Hexagonal representation – Correct representation of the base data. Equal real estate for an equal number of people.

Now that we have the parliamentary constituencies in hex map form, why not have the districts as well?


If you need the base data to create your own maps, the GeoJSON files for the same are available here.

Map of PM Modi’s Domestic Visits

PM Modi visited Tamil Nadu on 27th January 2019 for the AIIMS hospital groundbreaking ceremony. Twitter was trending with #GoBackModi and #TNWelcomesModi, and I was curious about the number of times PM Modi had visited Tamil Nadu before.

The PM India site has a neat list of all the visits

So, I created a map out of it.



This map was replaced after some errors were discovered in the base data.

Literacy Gap of SC community in TN districts

I was going through the Census 2011 data once again and Erode district’s low Scheduled Caste (SC) literacy rate caught my eye. It is not a very lagging district in overall literacy, but its SC literacy was less than that of Dharmapuri, the least literate district. So I added the data to the TN districts shapefile and visualised it to see how much the SC community lags across the districts.

Here are the maps



In an earlier map, the gap for Thoothukudi was mentioned as -14% due to a typo during data processing, while the actual gap is around 6%. The map has been updated to reflect the change.

My observations

  1. Kongu Belt (Coimbatore, Tiruppur, Erode) is the worst. ~~The Gounder (land-owning) community has secured its position at the top of the social ladder and ensured the peasantry remained uneducated and illiterate.~~

Update: While there might be an element of truth to it, the maps alone are not indicative of the inference. I had made the above observation based on the number of issues that have appeared in the media, like the mid-day meal staff harassment, the two-tumbler system, etc.

  2. Dharmapuri is a peculiar case: it has the lowest overall literacy in TN, but it is also the only district where the SC community is more literate than the general population.
  3. Kanniyakumari, which tops the overall literacy rates, also tops SC literacy. In fact, the SC community of Kanniyakumari is more literate than the general population of almost all other districts. I think it would be an interesting place for humanities research in the areas of literacy, education and caste.

Data Source:

India Literacy Map with a How-To

I published the Tamil Nadu district-wise literacy map some days ago and @tshrinivasan asked if I could write a blog post on how to do it, and here it is now.

What are we going to do?

We are going to create India’s state-wise literacy map. It will be a choropleth map, just like the Tamil Nadu one.

Things we need

  1. QGIS – An Open Source software package that will be used to process the geographic data and create the map. Download and install it for your operating system.
  2. Base map – The digital map of India with its state boundaries as a shapefile. You can search the internet for “India states shapefile”; there are a number of sources where you can find this. I am going to use the one from the Hindustan Times public repository. [shapefiles/india/state_ut/] ⬇️ Download, unzip the file and keep it ready. I am choosing the pre-Telangana map because the literacy data is from 2011, which is pre-Telangana.
  3. Data on the literacy levels of the Indian states. An internet search for “India states literacy csv” would give a number of results. I am going to use the one from the Census 2011 website. ⬇️ Download

Get the data ready

We have 2 sources of data:

  1. Geographic data which we downloaded from the Hindustan Times
  2. The Literacy data from the Census 2011 website

Both the datasets need to be joined to create the map. Let us do that:

  1. Open QGIS and create a new project: from the menu select Project -> New Project.
  2. Add the map using Layer -> Add Layer -> Add Vector Layer. Browse to the location of the downloaded shapefile, select the india_2000-2014_state.shp file and click Add.
  3. You will be asked to select the coordinate system. Select WGS84 and click OK. Once the layer is added, close the Add Layer dialog.
  4. Now you should have the map loaded in the main area, and should see the legend entry for the data layer.
  5. Now right-click on the layer and select Open Attribute Table.
  6. You will notice it has only two columns – the id and the state name. We are going to create a new column and add the literacy rates from the census data. In the Attribute Table, click the yellow pencil icon (first one in the icon bar) to start editing.
  7. Click the Add Column button and add the literacy column with type decimal.
  8. Now enter the literacy rates from the spreadsheet into the newly added column. Sidenote: there is an automated way to combine the data, without having to enter it manually, if you have the data in a delimited text file like CSV. It involves adding something called a delimited text layer. We will take the manual route to keep it simple.
  9. Once you have added the literacy values, click the Save Edits icon (Ctrl+S). Now click the yellow pencil button again to stop editing. This is very important; otherwise, you might unknowingly click somewhere and change the geometry of the state boundaries.
  10. Now you should have the data in the attribute table.
  11. Close the Attribute Table.
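Whether you type the values in manually (steps 6-9) or go the delimited-text route from the sidenote, the operation is a join on the state name. A plain-Python sketch of that join, with placeholder literacy values rather than the actual 2011 figures:

```python
import csv
import io

# Literacy rates as they might appear in the Census CSV
# (values are placeholders).
census_csv = io.StringIO(
    "State,Literacy\n"
    "Kerala,94.0\n"
    "Tamil Nadu,80.1\n"
    "Bihar,61.8\n"
)

# The shapefile's attribute table: one record per state polygon.
attributes = [
    {"id": 1, "state": "Kerala"},
    {"id": 2, "state": "Tamil Nadu"},
    {"id": 3, "state": "Bihar"},
]

rates = {row["State"]: float(row["Literacy"]) for row in csv.DictReader(census_csv)}

# The join: add a 'literacy' column keyed on the state name, which is
# what QGIS does for you when you attach the CSV as a layer.
for rec in attributes:
    rec["literacy"] = rates.get(rec["state"])
```

The only thing to watch for is that the state names spell exactly alike in both datasets; mismatches leave holes in the joined column.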

Styling the map

  1. In the Layers sidebar right-click on the map layer and select Properties.
  2. In the Properties window, select Symbology from the side menu.
  3. In the Styling window make the following changes:
    1. A – Change the style from “Single Symbol” to Graduated
    2. B – Select “literacy” as the column
    3. C – Set Precision to your liking (it denotes the decimal points of the values to be shown in the map legend). I prefer 0 or 1 usually.
    4. D – Choose a Color Ramp to your liking. I am choosing the one suitable for Wikipedia based on the Wikipedia Conventions.
    5. E – Set the mode to “Pretty Breaks”. Now as soon as you select this, the “Classes” tab right above it should be populated automatically. If not, use F.
    6. F – If your classes didn’t appear automatically, click the “Classify” button.
  4. Once you are satisfied with the legend precision and the color ramp, click OK to see your styled choropleth map.

Note: The properties dialog provides a huge number of options to do a number of things including labels. Refer to a QGIS manual or tutorials on the web for related information.

Exporting the map

Now we have the styled map according to our liking ready. We need to export it to an image so that we can share it across.

  1. Click the “New Print Layout” button and enter a name; I named mine “export”.
  2. You will get the Layout window with an empty page.
  3. From the menu, select Add Item -> Add Map. Click and drag the cursor to the required size.
  4. (Optional) There is a lot of white space around the map inside the box. We can make the map a little bigger by reducing the scale. On the right side, switch to the Item Properties tab and reduce the value for Scale. (Mine was 17485874 and I changed it to 12500000.)
  5. Click Add Item -> Add Legend. Click and drag the cursor to create the legend. India’s maps usually place the legend in the Bay of Bengal, so I am going to do the same.
  6. You will notice that the legend title says the layer name. But what we really want it to say is “Literacy Rate”. There are two ways to fix that. Choose the one that appeals to you.
    1. On the right in the Item Properties tab, under Main Properties, you can enter a title as “Literacy Rate”
    2. On the right in the Item Properties tab, under  Legend Items, double-click on the layer name and enter “Literacy Rate”
  7. Now there is some extra white space on the right. Let us clean that up. On the right side select the Layout tab, scroll down to Resize Layout to content and click Resize Layout. Now the page should have been resized to fit only the map.
  8. From menu click Layout -> Export as Image. Enter the filename in your desired location and save it. You could also export as PDF if you want to print.

Note: Apart from just the map and legend you can do a lot more complex things with the layout manager. Again, refer to a QGIS manual and other tutorials on the internet to fully learn about them.

Final Product