So this really happened (and is still happening).
Last year (2017), I started finding coins on my cycle rides very frequently, almost every ride on average. It became quite amusing and perplexing. Treasure-hunt romantic that I am, I started tracking them like this.
As the number of coins grew, this became tedious, so I moved to the more pragmatic Excel sheet below.
I shared these findings on Facebook along with many tongue-in-cheek theories about them. Soon my own and crowdsourced theories started floating around:
- Coded messages coming to me from my future self. (Interstellar-esque)
- The Universe rewarding me for each ride, and hence asking me to ride more.
- A neurological abnormality that makes me think I am finding them.
- Enhanced optical capability due to the sprouting of new neurons.
- Faking the findings. (Obviously many didn't believe the frequency of this occurrence.)
There was a joking reference to using AI to solve this mystery. After 60+ findings, I decided to talk to my data scientist friend Chetan, and we started discussing what intelligence could be gleaned by running various algorithms on these absurd happenings.
Data is Everything.
First, a disclaimer: the dataset we have is very tiny, and using AI on it is like using a sledgehammer to crack a nut. However, the story was too interesting to pass up.
Step 0 was to sanitise the data, which is non-trivial and a very painstaking activity. In this case I had captured a lot of notes in natural language, which needed to be converted into a machine-understandable format.
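To give a flavour of that sanitisation step, here is a minimal sketch. The note format and field names are assumptions for illustration, not the actual log entries; the real notes (and code) live in the git repository.

```python
import re

import pandas as pd

# Hypothetical free-text notes, in the spirit of the original ride log.
notes = [
    "2017-06-04 Devanahalli ride, found Rs 1 coin near the railway bridge",
    "2017-06-11 Nandi Hills ride, no coins today",
    "2017-06-18 Devanahalli ride, found Rs 2 coin",
]

def parse_note(note):
    """Pull date, route, and coin value out of one free-text note."""
    date = re.match(r"\d{4}-\d{2}-\d{2}", note).group()
    value = re.search(r"Rs (\d+)", note)
    route = "Devanahalli" if "Devanahalli" in note else "Other"
    return {
        "date": pd.to_datetime(date),
        "route": route,
        "found": value is not None,
        "value": int(value.group(1)) if value else 0,
    }

# One tidy row per ride: this is the machine-understandable format.
df = pd.DataFrame(parse_note(n) for n in notes)
print(df)
```

Real notes are messier than this, of course; a good chunk of the painstaking work is handling all the ways a human writes the same thing differently.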
Then we applied basic stats, which yielded pretty simple and obvious graphs.
Devanahalli clearly seems like a coin-rich area. But this was intuitively known to me, or to anyone else looking at the Excel sheets. Here the value of AI or stats is just automation.
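The "basic stats" amount to little more than a group-by and a count. A toy sketch (column names and rows are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the Excel sheet: one row per ride.
rides = pd.DataFrame({
    "route": ["Devanahalli", "Devanahalli", "Nandi Hills",
              "Hennur", "Devanahalli"],
    "found_coin": [True, True, False, False, True],
})

# Count finds per route; the coin-rich area falls out immediately.
finds_per_route = (
    rides.groupby("route")["found_coin"].sum().sort_values(ascending=False)
)
print(finds_per_route)
```

A bar plot of `finds_per_route` is essentially the graph above.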
As we dug deeper, it became apparent that in order to really gain some intelligence, we needed to define the problems first. So what questions can we throw at this tiny data universe of coins?
So what are the problems?
- Which areas produce more results? Are there specific days?
- Is there a correlation with public happenings (such as festivals, processions)?
- Why does the Devanahalli route (10 km between the railway bridge and Nandi Hills) have so many coins lying around? (I have many theories on this.)
- Given Devanahalli-route-like characteristics, which other routes have a high probability of a find?
Problems in Machine Learning can be of two types. In one, the answer takes on a large number of possible values (e.g. the value of Dow Jones after 1 week). In the other, the output takes on one of a few values (e.g. whether an incoming email is SPAM). Regression methods are suitable for the first type of problem, while the second type lends itself naturally to classification methods.
Our problem statement — “Will I find a coin on a trip to XXX” — has a Yes/No answer. So let us try to use classification methods on this. A basic but important classification technique is called a Decision Tree. For Yes/No problems, this algorithm divides the entire range of inputs into Yes and No regions. However, the risk is that the algorithm fits the available training data too closely, and will fail for new inputs that are even slightly different from the training data. This is akin to the problem of bias if we ask a single person for a book recommendation. This issue (called overfitting in ML parlance) can be mitigated by randomly mixing up a large number of decision trees, somewhat like asking 100 experts for a book recommendation. This collection of trees is evocatively called Random Forest, and is a powerful classification algorithm.
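In scikit-learn, the "100 experts" idea is a few lines. The features and labels below are synthetic stand-ins (the real ones came from the ride log), so this is a sketch of the technique, not the actual analysis:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in features: route id, day of week, distance (km).
n = 200
X = np.column_stack([
    rng.integers(0, 5, n),    # route id (0 plays "Devanahalli")
    rng.integers(0, 7, n),    # day of week
    rng.uniform(5, 40, n),    # ride distance in km
])
# Synthetic label: route 0 is coin-rich, elsewhere finds are rare.
y = (X[:, 0] == 0) | (rng.random(n) < 0.1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees voting -- the ensemble analogue of asking 100 experts.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.2f}")
```

Each tree sees a random resample of the data and a random subset of features, which is what washes out the single-tree overfitting.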
When we throw Random Forest at our finding-a-coin problem, we get some very nice results. How do we know? Intuitively, if my classifier gives me a Yes (No) when the real answer is Yes (No) most of the time, then it must be good. The Confusion Matrix measures exactly this. As you can see, our Random Forest gives us the right answer (33 + 8 =) 41 times and the wrong answer (2 + 2 =) 4 times. Almost 90%! Another related method is to look at the ROC curve (more specifically, the area under the ROC curve). The ideal ROC curve would be a box with an area of 1. As can be seen from the graph, our ROC has an area of 0.87.
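The arithmetic behind that confusion matrix is easy to check. Here we reconstruct a set of predictions matching the counts quoted above (33 correct No, 8 correct Yes, 2 misses each way) and let scikit-learn do the scoring:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Labels and predictions reconstructed to match the quoted counts.
y_true = np.array([0] * 33 + [1] * 8 + [1] * 2 + [0] * 2)
y_pred = np.array([0] * 33 + [1] * 8 + [0] * 2 + [1] * 2)

print(confusion_matrix(y_true, y_pred))   # rows: true No/Yes, cols: predicted
print(f"accuracy: {accuracy_score(y_true, y_pred):.1%}")
```

That works out to 41 / 45 ≈ 91%, which is the "almost 90%" above.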
Random Forest, Decision Tree
We decided to throw Random Forest at it. Random Forest is one of the most important machine learning classifiers. Its main objective is to eliminate the bias of basing a decision on a single sliver of data (that overfitting problem again), which leads to not-so-robust decisions. Simply put, it is akin to talking to a few people with various perspectives before making a major decision in life.
As you can see, the predictions are not 100% accurate, but they are tending towards it. The key, again, is the data.
As anyone who has actually done something in AI can tell you, it isn't all it is made out to be, at least not yet. A lot of what happens in the human brain has to be translated into data first; the automation comes only after that.
And finally, I leave you with this image.
(Thanks to Chetan Vinchhi for the code, graphs and editing part of the content)
Code Repository: For folks who want to play around more, here is the git link.