Feb 17, 2018

A shelter during information apocalypse

“What happens when anyone can make it appear as if anything has happened, regardless of whether or not it did?” -- Aviv Ovadya

I think the problem is very real, unlike many other apocalypse warnings. This post is about coursework I did as a proof of concept for an idea that could be built into a shelter for information.

I am not covering the implementation part here since it was a really rough PoC and the real thing is beyond the scope of a side project.


Overview

Time and location provide a lot of context in a search. If I am searching for a restaurant, say in Google Maps, it gives me a list based on my location. There is another dimension, time, that is generally ignored. For example, if I am searching for some event in history, then not only are documents related to that event important, but other events from the same time and location are also important in understanding that event.

Another example is if I am researching a person, say 'Elon Musk'. When doing this search on Google, it only highlights the newsworthy and current articles on Elon Musk. But I might be interested in reading about Elon Musk in the context of, say, the Bay Area or Los Angeles and, say, 10 years back when he started SpaceX or 15 years back when he started X. Given that context, the relevant information seems completely inaccessible, and there is no good interface for this kind of search either.

Yet another use could be in tourism, where people can go back in time and see the chatter happening around interesting events at locations of historical significance. Twitter feeds from the time and location of a certain event would be of interest many years later.

The project focuses on categorizing text by location and time and feeding that into search.

Google PageRank problem

Google organizes content using its PageRank algorithm. The main problem with this approach is that SEO-optimized content rises to the top and pretty much hides all other content.

Figure: Page Rank Problem


What we end up consuming because of this approach is very specific content that is built for sharing and is generally not very informative.

In general, any algorithm that tries to categorize and optimize ranking will eventually lead to such results.

As a result, the internet has become non-discoverable and non-explorable.

One good way to organize information is by adding physical constraints to its spread. This can typically be done by adding physical bounds to where it can be accessed. For this proof-of-concept, those physical bounds are time and space. All content is created somewhere. It might be a document on mathematics, but the author still has to write it at some physical location. Also, all content is written at some time. Thus a document on the same topic can be quite different depending on whether it was written today or, say, 100 years ago.

Exploring content within these two dimensions creates really interesting results, as seen in the proof-of-concept of this project. It allows exploring and analyzing content in a way that, in my view, matters. It strongly discourages SEO and pushing content to users, and it encourages sharing context along with content.

Problem definition

Geo-temporal search involves tagging documents with their appropriate time and location and providing this information in the context of a geo-temporal browser.

The traditional search approach uses the construct of a query to retrieve relevant documents. Here, the query term can be entirely optional, and the user might just want to view the relevant documents in the context of time and location.

Thus the problem involves placing each document in the time-space dimension and determining relevant documents based on the appropriate use context.

Figure 1: Geo temporal vector space of all documents
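
To make the "query is optional" idea concrete, here is a minimal sketch in Python. It is not the PoC implementation (which is not covered here); the names Doc, geo_temporal_search, and the sample documents are all hypothetical, and it assumes each document has already been tagged with a single point location and a year.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Doc:
    title: str
    text: str
    lat: float  # latitude of the document's location tag
    lon: float  # longitude of the document's location tag
    year: int   # year of the document's time tag


def geo_temporal_search(
    docs: List[Doc],
    years: Tuple[int, int],
    bbox: Tuple[float, float, float, float],
    query: Optional[str] = None,
) -> List[Doc]:
    """Return docs inside the year range and bounding box.

    bbox is (min_lat, min_lon, max_lat, max_lon). The query is optional:
    when it is omitted, every document in the window is returned.
    """
    min_lat, min_lon, max_lat, max_lon = bbox
    hits = []
    for d in docs:
        in_time = years[0] <= d.year <= years[1]
        in_space = min_lat <= d.lat <= max_lat and min_lon <= d.lon <= max_lon
        matches = query is None or query.lower() in d.text.lower()
        if in_time and in_space and matches:
            hits.append(d)
    return hits


# Made-up sample documents, for illustration only.
DOCS = [
    Doc("Rocket startup opens", "A new rocket company opens its doors.", 33.92, -118.41, 2002),
    Doc("Local bakery review", "Great sourdough near the beach.", 33.90, -118.40, 2003),
    Doc("Old harbor notes", "Shipping schedule for the harbor.", 33.74, -118.27, 1925),
]

# Rough bounding box around the Los Angeles area (an assumption).
LA_AREA = (33.5, -118.7, 34.4, -117.9)

# Browse without a query, then narrow the same window with one.
everything = geo_temporal_search(DOCS, (2000, 2005), LA_AREA)
rockets = geo_temporal_search(DOCS, (2000, 2005), LA_AREA, query="rocket")
```

The window of time and space does the main filtering; the query only narrows what is already inside the window.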

Classifying geo-temporal context

A document can have a geo-temporal context in one of the following two ways -

  • The document was created in a certain geo-temporal vector space 
  • The document refers to a certain geo-temporal vector space 
All documents would have the first context above but not all would have the second. For example, though this document refers to Abraham Lincoln and 1920 somewhere, it does not refer to any geo-temporal context.

Defining time dimension

The time dimension for documents is generally not absolute. A document could be related to an era, century, decade, year, month, day or absolute time. Thus the time dimension is entirely contextual. The same applies to the query term: the query needs to provide the time in the context of its scale.

Thus I might be interested in one of a few things:
  • What happened on 20th January 1920, say
  • What happened in 1920 AD
  • What happened in the 19th century
The scale of time thus has to be captured and the context determined based on the content of the document.
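
One way to capture the scale of time, sketched below, is to turn every (scale, value) pair into an inclusive date interval so that a day, a year, and a century can be compared with the same overlap test. The function names time_window and overlaps are hypothetical, and the century convention (the 19th century runs from 1801 to 1900) is an assumption.

```python
import calendar
from datetime import date
from typing import Tuple

Interval = Tuple[date, date]


def time_window(scale: str, value: Tuple[int, ...]) -> Interval:
    """Turn a (scale, value) pair into an inclusive date interval.

    scale is one of "day", "month", "year", "decade", "century".
    """
    if scale == "day":
        y, m, d = value
        return date(y, m, d), date(y, m, d)
    if scale == "month":
        y, m = value
        last_day = calendar.monthrange(y, m)[1]
        return date(y, m, 1), date(y, m, last_day)
    if scale == "year":
        (y,) = value
        return date(y, 1, 1), date(y, 12, 31)
    if scale == "decade":
        (y,) = value  # first year of the decade, e.g. 1920
        return date(y, 1, 1), date(y + 9, 12, 31)
    if scale == "century":
        (c,) = value  # ordinal, e.g. 19 for the 19th century
        return date((c - 1) * 100 + 1, 1, 1), date(c * 100, 12, 31)
    raise ValueError(f"unknown scale: {scale}")


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two inclusive intervals share at least one day."""
    return a[0] <= b[1] and b[0] <= a[1]


# The three example questions from the list above, as intervals.
one_day = time_window("day", (1920, 1, 20))
one_year = time_window("year", (1920,))
one_century = time_window("century", (19,))
```

With this representation, a document tagged with a single day in 1920 matches a query for the year 1920, but not a query for the 19th century.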

Defining location dimension

The location dimension similarly spans a large range of scales, though it is much better defined than time. For example, I as a user might be interested in one of the following:

  • What happened right here, at the intersection of this cross street 
  • What happened nearby, say in 5 mile radius 
  • What happened in this city
  • What happened on this side of the world
  • What happened in this country, or whatever entity existed at the corresponding time 
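
The smaller scales in this list ("right here", "nearby") can be approximated with a radius around a point. The sketch below uses the standard haversine great-circle formula; the function names are hypothetical, and the larger scales (city, country, or whatever entity existed at the time) would need boundary data rather than a radius, which this sketch does not attempt.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_radius(center, point, radius_miles: float) -> bool:
    """True when point lies within radius_miles of center; both are (lat, lon)."""
    return haversine_miles(center[0], center[1], point[0], point[1]) <= radius_miles


# "What happened nearby, say in 5 mile radius" around a made-up center point.
here = (34.05, -118.25)
close_by = within_radius(here, (34.06, -118.25), 5.0)
too_far = within_radius(here, (34.20, -118.25), 5.0)
```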

Extracting time and location dimension from documents

The hard part is reliably extracting the time and location dimensions, with their relevant scale, from the documents. Some background knowledge of history would be required to make sense of the context. For example, if a document says, "Abraham Lincoln gave the following speech today as part of his swearing-in ceremony", some knowledge of "Abraham Lincoln" would be required to determine the dimension data.
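
As a rough illustration of how background knowledge could resolve such a sentence, here is a toy lookup against a hand-written knowledge base. Everything here (KnownEvent, KNOWLEDGE_BASE, candidate_contexts, and the keyword matching itself) is a hypothetical simplification; a real system would need proper entity recognition and a much larger knowledge source.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class KnownEvent:
    keywords: Tuple[str, ...]  # every keyword must appear in the text (case-insensitive)
    date: str                  # ISO date of the event
    place: str


# Tiny hand-written knowledge base: Lincoln's two inaugurations.
KNOWLEDGE_BASE = [
    KnownEvent(("abraham lincoln", "swear"), "1861-03-04", "Washington, D.C."),
    KnownEvent(("abraham lincoln", "swear"), "1865-03-04", "Washington, D.C."),
]


def candidate_contexts(text: str) -> List[KnownEvent]:
    """Return every known event whose keywords all occur in the text."""
    lowered = text.lower()
    return [e for e in KNOWLEDGE_BASE if all(k in lowered for k in e.keywords)]


sentence = ("Abraham Lincoln gave the following speech today "
            "as part of his swearing-in ceremony")
candidates = candidate_contexts(sentence)
```

Note that the lookup returns two candidates, one per inauguration, so the knowledge base narrows the context without fully resolving it. Picking between them would need more evidence from the document.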

Validating Geo-Temporal Vector

It would also be valuable to determine the dimensions from more than the content of the document alone. This would help prevent historical spamming, where someone writes articles with backdated creation dates and keywords to engineer their position in the geo-temporal vector space.

A few ideas for implementing such a validation:

  • If a document from the past refers to a document from the future, it is clearly not consistent
  • Use crawl history to determine whether a document was updated to shift its geo-temporal position
  • Use an available knowledge base to validate the vector space
  • Use a feedback mechanism from users to validate or invalidate the metadata
  • Use a source reliability score, based on the validations above, to establish appropriate trust
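
The first idea in this list is the easiest to sketch. The code below flags any document that cites another document with a later claimed date. The names DatedDoc and anachronistic_references and the sample data are hypothetical; a flagged pair only shows that the two dates are inconsistent, not which of the two documents is at fault.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Tuple


@dataclass
class DatedDoc:
    doc_id: str
    claimed_date: date
    references: List[str] = field(default_factory=list)  # ids of cited documents


def anachronistic_references(docs: Dict[str, DatedDoc]) -> List[Tuple[str, str]]:
    """Return (doc_id, ref_id) pairs where a document cites one dated after it."""
    problems = []
    for doc in docs.values():
        for ref_id in doc.references:
            ref = docs.get(ref_id)  # references to unknown documents are skipped
            if ref is not None and ref.claimed_date > doc.claimed_date:
                problems.append((doc.doc_id, ref_id))
    return problems


# Made-up sample: "a" claims to be from 1920 but cites a 2018 document.
SAMPLE = {
    "a": DatedDoc("a", date(1920, 1, 20), ["b"]),
    "b": DatedDoc("b", date(2018, 2, 17)),
    "c": DatedDoc("c", date(2018, 2, 18), ["a", "missing"]),
}

flagged = anachronistic_references(SAMPLE)
```

The count of flagged pairs per source could then feed into the source reliability score mentioned in the last bullet.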

Jan 28, 2018

Predicting using Idiot's Bayes

There is a lot of magic that goes into Machine Learning and AI. So it feels great the first time you work out all the math by hand and make a prediction from some data.

I set out to implement Naive Bayes on the pima-indians-diabetes dataset (Google it and you can find it). There is a lot of Greek that goes into understanding the formulae, but for those like me whose dictionary defines "Greek" as "I don't get it", here are all the gory formulae worked out in a spreadsheet.
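
For readers who prefer code to spreadsheets, here is a minimal sketch of the same kind of calculation in plain Python. It assumes the Gaussian variant of Naive Bayes (a common choice for numeric features, though the spreadsheet may differ), and the rows below are made-up toy numbers, not the actual dataset.

```python
import math
from collections import defaultdict


def fit(rows, labels):
    """Per class: the prior, and a (mean, sample variance) pair per feature."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, group in by_class.items():
        stats = []
        for col in zip(*group):  # iterate over feature columns
            mean = sum(col) / len(col)
            var = sum((x - mean) ** 2 for x in col) / (len(col) - 1)
            stats.append((mean, var))
        model[label] = (len(group) / len(rows), stats)
    return model


def gaussian_log_pdf(x, mean, var):
    """Log of the normal density, computed directly to avoid underflow."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predict(model, row):
    """Pick the class with the largest log prior plus summed log likelihoods."""
    scores = {}
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for x, (mean, var) in zip(row, stats):
            score += gaussian_log_pdf(x, mean, var)
        scores[label] = score
    return max(scores, key=scores.get)


# Toy data: two numeric features per row, class 0 or 1. Illustrative only.
ROWS = [(85, 22.0), (90, 24.5), (95, 23.0), (150, 33.0), (160, 35.5), (170, 31.0)]
LABELS = [0, 0, 0, 1, 1, 1]

model = fit(ROWS, LABELS)
low = predict(model, (88, 23.0))
high = predict(model, (165, 34.0))
```

The "naive" part is the loop in predict: each feature contributes its own likelihood independently, and the contributions are simply added in log space.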





Jul 5, 2017

Scaling organizations


To scale yourself, hire people.
To scale with people, build teams.
To scale with teams, build processes.
Above all, build the right culture. It scales everything best.