First Day, First Show. Last Word. Full Stop.

Movie Reviews. Mostly Tamil. Occasional Telugu. Rare Hindi. English Episodic. Foreign Seldom. Cricket Trivia In The Gaps.



Monday, January 04, 2010

A reference to my work in Cricinfo

One of my hobbies these days is putting together my own statistical articles. I wrote a simple parser to extract information from the wonderful Cricinfo tool Statsguru and build up my own little database. I then use this set of text files to generate my own statistics.

One piece of work that was born out of this effort was this article. Thanks to Ananth, the editor of Cricinfo's It Figures blog, who acknowledged my effort in the article.

The article addresses a simple question: how often has exactly the same eleven played a Test match together? The answer tells us something about the stability of a team. Now we extend that notion to n=10, meaning how often have the same 10 players appeared together. This shows which teams had a solid core of 10 players over a large number of games. We can keep extending the idea all the way down to n=2 for more insights. Of course, the answer for n=1 is trivial and simply boils down to which player has played the most matches.

The question is interesting, but the answer is tough to find given the large amount of data we need to parse.
There had been 1944 Test matches at the time of writing the article. Each Test match contributes 2 teams, so 1944*2 = 3888 teams. For n=11 the most inefficient method is to compare team #1 with every other team up to team #3888, incrementing a count for that team whenever there is a match, and then repeat for team #2 and so on. Basically a brute-force algorithm where you compare each team with every other team. This is time consuming and does not extend easily to other values of n.
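To make the brute-force idea concrete, here is a rough sketch of what such a comparison could look like in Java. This is only an illustration and not the code I actually ran: the class name `BruteForceLineups`, the method `countIdentical` and the three-player "teams" in `main` are all made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class BruteForceLineups {

    // For every line-up, count how many line-ups in the list (itself included)
    // contain exactly the same players. Every team is compared with every
    // other team, so the work grows with the square of the number of teams.
    public static Map<Set<String>, Integer> countIdentical(List<Set<String>> lineups) {
        Map<Set<String>, Integer> counts = new HashMap<>();
        for (Set<String> a : lineups) {
            int matches = 0;
            for (Set<String> b : lineups) {
                if (a.equals(b)) {
                    matches++;
                }
            }
            counts.put(a, matches);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical data: teams shortened to 3 players to keep the demo small.
        List<Set<String>> lineups = new ArrayList<>();
        lineups.add(new HashSet<>(Arrays.asList("A", "B", "C")));
        lineups.add(new HashSet<>(Arrays.asList("C", "B", "A"))); // same players, different order
        lineups.add(new HashSet<>(Arrays.asList("A", "B", "D")));

        for (Map.Entry<Set<String>, Integer> e : countIdentical(lineups).entrySet()) {
            System.out.println(e.getKey() + " played together " + e.getValue() + " time(s)");
        }
    }
}
```

With 3888 teams this means roughly 3888*3888, or about 15 million, team comparisons for n=11 alone, and the same loop does not help at all for smaller n.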

So I adopted the following generic algorithm.

Consider the solution for a particular n, say n=6.
How many sets of size 6 are there in a particular team of 11? The answer is 11 C 6 (11 Combination 6), i.e. 462. So we construct 462 such sets for each team.
462*3888 is about 1.8 million sets. Now simply count how many times each distinct set appears among these 1.8 million and you have your answer.
So the solution boils down to this: given a set of 11 elements, construct all possible subsets of size 2, size 3 and so on up to size 11, and count the occurrences of each subset across all teams. I wrote a simple Java program for this and we are done. The whole thing took 2 minutes to run. Once the results are out, a simple Unix script to sort them by country gives me the formatted results.
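For the curious, the subset-counting idea could be sketched in Java along the following lines. This is a reconstruction for illustration, not my original program: the class name `SubsetCounter`, the method names and the placeholder players `P1` to `P12` are assumptions, and joining the sorted names with `|` as the map key is just one convenient choice.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SubsetCounter {

    // Count how often each group of n players appears together across all line-ups.
    public static Map<String, Integer> countSubsets(List<List<String>> lineups, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> lineup : lineups) {
            // Sort first so the same group of players always produces the same key.
            List<String> sorted = new ArrayList<>(lineup);
            Collections.sort(sorted);
            collect(sorted, n, 0, new ArrayList<>(), counts);
        }
        return counts;
    }

    // Recursively build every subset of size n and bump its counter.
    private static void collect(List<String> players, int n, int start,
                                List<String> chosen, Map<String, Integer> counts) {
        if (chosen.size() == n) {
            counts.merge(String.join("|", chosen), 1, Integer::sum);
            return;
        }
        // Stop early once too few players remain to complete a subset.
        int lastStart = players.size() - (n - chosen.size());
        for (int i = start; i <= lastStart; i++) {
            chosen.add(players.get(i));
            collect(players, n, i + 1, chosen, counts);
            chosen.remove(chosen.size() - 1);
        }
    }

    public static void main(String[] args) {
        // Hypothetical data: two elevens that differ by a single player.
        List<String> first = new ArrayList<>();
        for (int i = 1; i <= 11; i++) {
            first.add("P" + i);
        }
        List<String> second = new ArrayList<>(first.subList(0, 10));
        second.add("P12");

        List<List<String>> lineups = new ArrayList<>();
        lineups.add(first);
        lineups.add(second);

        Map<String, Integer> counts = countSubsets(lineups, 6);
        System.out.println("Distinct groups of 6: " + counts.size());
        System.out.println("Most appearances together: " + Collections.max(counts.values()));
    }
}
```

Running `countSubsets` once for each n from 2 to 11 covers the whole range, and a hash map lookup per subset replaces the team-by-team comparison of the brute-force method.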

That is the long and short of the algorithm I used.