Posted by: Sameer Agarwal | June 26, 2009

Warning: This blog post is a direct consequence of extreme joblessness (If you haven’t noticed, I am a graduate now!) and reading a bit too many articles on arbit topics from maths to game theory to love and  statistics on my super awesome electronic book reader \m/. First of all, the mathematical reasoning applied here may all be flawed,  at best, I am pretty much only a math enthusiast and if MA-101 is/was one of your favorite courses, please don’t bleed your eyes out by reading this further!

Well, now let us get back to the topic at hand. A couple of days back, I was cribbing as to “Why I will Never Have a Girlfriend” after reading this thought provoking article. The author used pure maths, statistics and proper census data to conclude that there are exactly 18,726 non-committed girls who are beautiful, intelligent and might also like me (yeah, the last point is debatable, but a safe 50% chance of liking me is taken into consideration… this is my blog… sue me). Though this number doesn’t appear to be very small,  statistically the search for 1 in 18,726 would take 67 years to complete. Hence the title of the article was well justified.

However, just when I was trying to overcome my disappointment I got to know that the beautiful girls are not in a much good position either!  However, the analysis isn’t as simple as the guys. It involves Game Theory, Nash Equilibrium and the famous NP-complete Interactive Decision Problem! (Yes, women are _really_ hard to understand!). To give a brief idea, let me remind you of the movie The Beautiful Mind. It dealt with a situation as follows:  Suppose you are hanging out in a bar with a couple of your friends and there is a group of beautiful girls all of whom are brunettes except one who is a blonde. Now, all the boys would like to approach the blonde first, however this doesnt quite seem to be a good strategy. Here is how Nash argued:

If we all go for the blonde, we block each other and not a single one of us is going to get her. So then we go for her friends, but they will all give us the cold shoulder because nobody likes to be second choice. But what if no one goes to the blonde? We don’t get in each other’s way and we don’ t insult the other girls. That’s the only way we win“.

Things are turning interesting now! This means that the prettiest girl in a group will always be available for potential guys to ask. But there is a catch! Consider a guy who is attracted to our beautiful blonde girl Carol and would like to talk to her. Realizing that Carol is shy, he is also a bit skeptical about his decision. But there are 3 possible outcomes as highlighted by Dr. José-Manuel Rey:

1. He talks to Carol and she responds in a friendly manner and he gets her phone number. (‘a’ reward points)
2. He does not approach Carol. Instead enjoys doing something else little less rewarding. (‘b’ reward points)
3. He talks to Carol and she proves uninterested and he feels miserable for a week. (‘0’ reward points)

Let us associate reward points with each of these actions. First outcome has reward points ‘a’, second has b reward points and third has zero reward points. Clearly, a > b > 0. Now, if there are N guys taking independent decisions, we will have possible set of actions (aah, NP-complete!). However, for a particular guy, success means that guy approaches Carol while all others donot. By symmetry, let us assume that the proabability of each guy approaching Carol is p, so for our man to approach Carol, the award associated with it is (He approaches Carol, while other N-1 guys don’t!). Compared to this, the reward he gets doing something else is b. So, for so called pareto optimal solution/ Nash Equilibrium,

i.e.

This result is pretty interesting. Some of the few points that it highlights is:

1. As N increases, p tends to zero! Which implies no one will go for the most beautiful girl around as everyone will be too afraid to ask her out. Waah!

2. If you are a beautiful girl (with a large value of N) reading this , don’t despair, make a move with any one out of those N guys

3. If you are a guy reading this, don’t despair if the girl has a large fan following! In fact, the more the better are your chances!

4. If you are neither, then stop pointing out Mathematical flaws in this article and get a life you loser! (yeah, look who is talking)

PS: On a serious note, Game Theory is a really really interesting topic! If anyone of you is interested, I recommend an excellent book by Roger B. Myerson titled Game Theory: Analysis of Conflict.

PS 2: The analysis is still incomplete due to an error pointed out by an ingenious personality (read point 4). I am still not working on it because I don’t give a damn!

Posted by: Sameer Agarwal | July 2, 2008

## Fun with GTalk/Pidgin: Changing your status messages programatically

We have always seen people trying to come up with catchy one-liners and cool slogans as their gtalk status messages. On being unable to come up with cool, funky and _original_ status messages to impress friends and colleagues, I decided to put on the programmer’s hat and came up with ways to set these status messages through a simple program. Advantages? Well, enormous. Once you can control the status message string through a program, you can basically rotate it (Rotating status messages do look cool ), update it automatically every 1 minute, write a scripts that ‘copies’ your friends status messages periodically… Seriously, the opportunities are limitless and only await your imagination! GTalk status messages can be thought on the lines of a Publish-Subscribe Messaging. To understand this, think of a simple newspaper. It publishes some information everyday and its reader in a way subscribe to this information. In the same way your messenger status message is published by youand it is “subscibed” by your friends. This can be thought of as a kind of Multicasting group. So, as part of this blog, I will basically discuss the Google Talk XMPP protocol and will focus on 2 popular IM that use this: GTalk for Windows and Pidgin for Linux (actually pidgin is multi platform)

Configuring Pidgin (in Linux):

For this you will need libpurple-bin to be installed on your system. You can find this in all common linux repositories. For eg. in Ubuntu, this can be done by a simple command:

$sudo apt-get install libpurple-bin Once this is done, we need a simple shell script to change our status message. For eg. the script below, makes the current date/time being set up in the status message in case you are “available” in the format : Wed Jul 2 13:46:53 EDT 2008 and updates it every 30 seconds. sleep_time=30 while test 1 = 1; do msg=date if test "purple-remote getstatus" = "available"; then purple-remote "setstatus?status=available&message=${msg}"
oldmsg="$msg" fi sleep "${sleep_time}s"
done

Let us call it status.sh

Now what remains is just to execute this status.sh with pidgin running and you are done! The status messages will keep on changing every 30 seconds and I guarantee it will be fun! Please don’t tamper much with the sleep time because while you are chatting, your status message is displayed on your friend’s side whenever it is changed. So you can imagine that it becomes quite irksome for your friend to continuously notice your status message change.

Configuring GTalk (in Windows):

Configrung GTalk messages programatically in Windows is a little tricky (another reason to switch to open source linux ) . For this, you need to have a Java Development environment in your system as we need to write XMPP application code using a well known API called Smack API which is in Java.

Now, we get down on writing the code!

The java file will look like:

// Import appropriate headers and packages
XMPPConnection connection = new XMPPConnection("gmail.com");
try
{
connection.connect();
Presence presence = new Presence(Presence.Type.available);
presence.setStatus("Hello Friends!");
presence.setPriority(24);
presence.setMode(Presence.Mode.available);
connection.sendPacket(presence);
Thread.sleep(30000);  // Sleeps for 30 seconds
}
catch (XMPPException e)
{
System.out.println(e.toString());
}

The thing to notice here is that the argument of setStatus is a String. So, to rotate your status message, just use the following function:

    private static String rotate(String input)
{
return input.substring(1) + input.charAt(0);
}

which takes in a simple string as input and rotates it one character at a time:

eg. CORNELL

ORNELLC

RNELLCO etc.

So, get down to it and Enjoy!

Posted by: Sameer Agarwal | March 2, 2008

## Bloom Filters: Designing a Spellchecker

Bloom filters are highly space/time efficient probabilistic data structures that are used to solve the membership problem of a set, that is, given an element, it is used to find out whether the particular element belongs to the set or not. False positives can occur in the results (though we can really minimize its probability) but false negatives are not allowed.

Ever wondered how the browser/text editor spell checkers work? They tend to take some bare minimum space and are very time efficient. You guessed it right : Bloom Filters is the key! I have designed a Spellchecker in C language using bloom filters and would like to explain how it works, thereby explaining bloom filters in general.

Now, the data structure of the spell checker is basically a m-bit array. Moreover we have k hash functions which act on strings and return a number between 1 to m-1. For inserting each word in the dictionary, we calculate its hash and set the corresponding bit it generated to be 1. So for each word, it is acted upon by k hash functions and correspondingly k bits are set to 1. Similarly, other words are also added in a similar fashion. Now, for testing membership, we calculate the k hashes of the word to be tested and if either of the bit corresponding to the generated position is found to be zero, we reject it. Clearly, false positives are possible (the bit being set to 1 by some other string) but false negatives can never be possible.

And now, the coolest part. Let us mathematically play around with the probability of false positives :

Assume that a hash function selects each array position with equal probability. The probability that a certain bit is not set to one by a certain hash function during the insertion of an element is then

$1-frac{1}{m}$.

The probability that it is not set by any of the hash functions is

$left(1-frac{1}{m}right)^k$.

If we have inserted n elements, the probability that a certain bit is still 0 is

$left(1-frac{1}{m}right)^{kn}$;

the probability that it is 1 is therefore

$1-left(1-frac{1}{m}right)^{kn}$.

Now test membership of an element that is not in the set. Each of the k array positions computed by the hash functions is 1 with a probability as above. The probability of all of them being 1, which would cause the algorithm to erroneously claim that the element is in the set, is then

$left(1-left(1-frac{1}{m}right)^{kn}right)^k approx left(1-e^{-kn/m}right)^k$.

Obviously, the probability of false positives decreases as m (the number of bits in the array) increases, and increases as n (the number of inserted elements) increases. For a given m and n, the value of k (the number of hash functions) that minimizes the probability is

$frac{m}{n}ln 2 approx frac{9m}{13n} approx 0.7frac{m}{n}$,

which gives the false positive probability of

$left(frac{1}{2}right)^{k} approx {0.6185}^{m/n}$.

OK, there was quite a bit of catch in assuming that the “hash function selects each array position with equal probability”. Theoretically this is not quite possible. Randomness is not to random in the world of computer science, but it is nevertheless a good approximation (or at least I would LIKE to believe that)

Implementation

Source code of the spell checker can be obtained here

The spell checker presently has 3 options :

1. Add a new word : This basically requires making the word go through k hash functions (k = 1 1 in my implementation) and the setting of corresponding bits in the bit array. The standard hash functions I used are:

• RSHash
• JSHash
• PJWHash
• ELFHash
• BKDRHash
• SDBMHash
• DJBHash
• DEKHash
• BPHash
• FNVHash
• APHash

These functions are a beauty in themselves, I will will also blog about them soon
2. Query a word : Requires the word to be operated on by k hash functions are returned true if all the corresponding bits are set.

3. Recreate optimized bit array : Now, as shown above, the value of k that minimizes the probability of false positives

$frac{m}{n}ln 2 approx frac{9m}{13n} approx 0.7frac{m}{n}$

So, here we have k fixed and n depends on the number of words in the dictionary. So, we can only edit m, i.e. the number of bits in the bit array. Therefore, whenever we needs to recreate the optimized bit array, we simply re-calculate m based on this equation and re-hash the words.

Operation 1 and 2 are constant time operations while operation 3 is quite expensive and must be used once in a while. Moreover, just an interesting fact is that a Bloom filter with 1% error and an optimal value of k, on the other hand, requires only about 9.6 bits per element — regardless of the size of the elements! This advantage comes partly from its compactness, inherited from arrays, and partly from its probabilistic nature. If a 1% false positive rate seems too high, each time we add about 4.8 bits per element we decrease it by ten times.

References :