Google, Synonyms, and Coca-Cola

A few days ago I was googling for “security CCE-263” (I was looking for MITRE stuff) and I got back results that showed “Coca-Cola” in bold, as if I had searched for that term. Weird, I thought. I soon realized that “security” had nothing to do with it, and so I searched for sugar CCE-foo, getting results like this:

Free coca cola sugar packet Download

Free coca cola sugar packet Download at WareSeeker.com – Colasoft Packet Player foo packet decoder ac3 is a lightweight and useful add-on for foobar2000

Why have i stopped drinking coca cola? « Foo's blog

11 Oct 2009 Firstly it was never good for the health, it has a ton of sugar and caffeine. Companies like Coca cola do not operate democratically,

Jones Soda – Wikipedia, the free encyclopedia

By April 2007, all of the company's products switched to cane sugar, ….. The Seahawks previously sold soft drinks from The Coca-Cola Company;

After a few seconds I realized that the actual name of the Coca-Cola company is “Coca-Cola Enterprises”, or “CCE” for short. So it appears that Google is seeing “CCE” in my query and searching for “CCE” and “Coca-Cola” at the same time. Now, my question is: is Google doing this for *everything*, or is it only doing it for named entities (i.e. persons, organizations, geographic locations, etc.)? Moreover, is it doing this with acronyms only or also with generic highly-correlated words?

To answer the first question, I searched for other organization acronyms that I thought would be pretty common, checking the results to see if the full name of the entity came back as a keyword, i.e. in bold. I tried “SEC”, “FBI”, “CIA”, and “EPA”, and in no case did I get back the full name of the entity as a keyword. Check it out – compare “EPA offices” with “CCE offices” and see how “Coca-Cola” is the only full company name that is returned as a keyword.

To answer the last question, I searched for “event viewer microsoft”, hoping that, since “microsoft” and “windows” are probably highly correlated, Google would return entries containing “windows” in lieu of “microsoft”. That is not the case though (as one would expect!): the search does not return “windows” keywords. Moreover, searching for “msft redmond” does not return “microsoft” keywords, suggesting that the link between “Coca-Cola” and “CCE” is based neither on a high correlation between the occurrences of the two words nor on acronyms alone.

So, I’m now left with one possibility only: something’s going on between Google and Coca-Cola :-)

Gabe Is Picky

I’ve been sent to London for “two months” (58 working days) to stay at these self-proclaimed “corporate studios” at XXX (*), together with a few other colleagues of mine. Mind you, it’s not that I was expecting Versailles, but last time I was sent to “corporate studios” I stayed here, which looks like this:
This is NOT the Chelsea Cloisters

Of course I was not expecting the same – this is Europe at the end of the day, and we love to be crammed in tiny places, don’t we? – but at least I was hoping I could cook a little and feel “at home” during my “two-month” stay.

Guess what? When I arrived at XXX (*) I found this:
Door / Room
What you see above is the whole “studio”, from door to window.

Note the details of the flooring and the nicotine-stained curtains:
Cracks on floor / Nicotine-stained Curtains

After some bitching, and after being told by a number of people that I’m picky, I got relocated to this other room in the same complex:
New Room
Not bad, right? Still not a “studio”, looks to me more like a “small hotel room”, but at least I got a sofa. I guess in Europe a sofa makes a studio. Good to know, when I’m back in Amsterdam I’ll cram a sofa in the elevator and rent it out as a corporate studio.

It took me a few seconds to start enjoying the details of my new abode:
Kitchen / Bathroom
Notice the tape on the kitchen floor? About that kitchen, I have to remove the garbage bin in order to use the washer. Seriously, the washer door can’t swing open otherwise. And do you see the two taps on the bathroom sink? One is hot water and the other one is cold water; my left hand gets scalded while the other one is getting frostbite. And people say I’m picky. Gabe, what’s wrong with these corporate studios? You’re the only one complaining.

I’ve also found out that no one here eats salads: the “apartment” has no bowl to mix a salad in, and when I asked my colleagues – who didn’t have it either – they looked at me suspiciously. I had to buy a bowl myself. One of my colleagues found dirty sheets, and another one had his kitchen floor flooded by a faulty washing machine. But everyone is wondering why I’m bitching while the other colleagues at the same place are not bitching. Gabe, you are picky. 

Some more details from my “corporate studio”:
Details from the Bathroom Sink / Stains on the Walls - Hey I'm Picky / More Stains on the Wall

Oh, these are supposed to be “serviced apartments” (check their Web site), do you think they’d refill the toilet paper? Think again. They’re supposed to – they did once, but then one evening I came back and was welcomed by this:
Refilled Toilet Paper Welcoming Me in the Evening
Apparently the service maid thought this roll would be enough for the evening and for the morning after. Either she’s anorexic or my ass is too big. I had to run to the petrol station and buy toilet paper, but hey, Gabe is picky – or his ass is too big.

Another useful piece of information in case you plan to spend a romantic weekend in this cozy alcove. When you use the in-room phone to call toll-free numbers (and I mean UK toll-free, not US toll-free or Jupiter toll-free) they charge you 1.18 pounds plus a 10% administrative fee, regardless of the duration of the call (I have a receipt showing charges for calls that lasted 3 seconds when I found the other number busy). When I inquired with the reception, they said it’s for “connection costs”. Wow, switchboards in November 2009, I’m paying fees for the cost of…doing what again? There must be a midget-in-a-box somewhere that’s getting rich at quickly swapping phone sockets with tiny little fingers. 

And after a few days of bitching left and right (hey, how picky is Gabe!) I realized that Trip Advisor features reviews on this place, with titles like “Filthy Dive” and “Astoundingly Bad Experience” and gems like this:  

If ever an apartment block is due for refurbishment, this is it! Getting out of the lift, the smell in the corridors took me back 20 years when i was living in digs as a student! You’ll have to be a midget to get around the studio apartments! kitchen and bathroom looked really old, the bed mattress was way beyond its use by date…….we booked for a week, I ran out after one night……….shame as the location is great. 

Well, apparently there are other bastards as picky as me. Real assholes they must be!

And at last, the crown jewel feature. This place is crowded with prostitutes and transsexuals; I was originally told about it by a cab driver, but I didn’t think much of it until I saw a scene at 8am in the lobby. And in fact, it’s not a secret. Just search for “XXX (*)” and interviews with prostitutes come up, together with this nice excerpt from an article on “spoiled russians abroad” (sic):

Another sign that Sergey’s dad was a good guy is the fact that he didn’t have a Belgravia mansion and the best Sergey could afford was a moldy bedsit in XXX (*), a grotty prison-like complex populated mostly by students, prostitutes, silverfish and enslaved Arab wives who are only allowed out for a walk at around 5.30 am.
“The prostitutes… they are actually not so bad,” he told me.
“Both of my neighbors are prostitutes, they all speak Russian, so I can hear what’s going on. They have a Kazakh pimp who once came around and broke into their room with an axe. They must have owed him some money or something. He made a giant hole in the door, I never saw them since.”
“Are you serious? Isn’t there supposed to be security?” I asked.
“Yeah, but they only care about what’s going on in the nicer apartments above. For those in the hellhole below… it’s just you and the rats.”
“It doesn’t sound like you’re very happy there, with the rodents and hookers and all.”

But hey, what the heck – I’m picky. 

November 12 Update
This is the “skylight” in my bathroom – you know, I belong to the “privileged” class of people, so I’m staying on the top floor and can enjoy some light through this luxury fixture:
Skylight in the Bathroom
 
These days it’s been raining quite a lot in London, and I get water and wind gusts from that hole in the bottom right corner, detailed here:
Details of the Skylight
But hey, I’m picky. 

November 15 Update
After these rainy days, water began dripping from the ceiling:
Leak on the Ceiling
Hey, what would you expect? It’s been raining quite a lot lately, it’s normal that some of that pouring rain drips into your room (sorry, “corporate studio”), isn’t it? Well, it’s not normal to me, but you know – I’m picky.

So I went to the management, and here’s their fix:
The Fix for the Leak
Hey, shit, Gabe is really picky. 

(*) Yes, I’m a coward, and I will not disclose the name of this place until I have left for good. I’ll update the post with the name of the place in February.

After the Jump – aka October Results

I’ve finally made the jump. The next logical big step in the HMM development was in fact to “put it all together”: the Person model, the GeoLocation model, and the Organization model. Each model separately was “conflicting” with the others (what’s “Charles de Gaulle” in “Charles de Gaulle airport“? The Person model thinks it’s a person while the GeoLocation model thinks it’s a geographical location) and the only way to make it work was to put everything in the same big-honking model.

The number of states in the new “unified” model jumped from about 30 to about 300, mostly due to the sheer number of different “morphological” structures of Organization names. At the same time, I underwent a lengthy process of re-labeling the input data (quite boring: more than 5,000 sentences) for reasons that at this moment I’ll categorize as “industrial secrets” :-)

The performance of this new “unified” model is quite what I was expecting: individually, each of the three entity types performs better, mostly because precision has improved due to the “collaboration” among the entities (the new model does not say anymore that “Charles de Gaulle” in “Charles de Gaulle airport” is a person since the GeoLocation “part” of the model wins), but unfortunately, the endemic poor performance of the Organization model brings down the total performance. Here are the learning curves:

Multi-Entity HMM Performance

The Person “part” of the model has finally hit 0.9 F (at between 80% and 90% of training data), the GeoLocation “part” is almost there, but the Organization part sucks big time :-( However, the global (“All”) performance at 60% is 0.826, a bit less than the performance of the Person model alone in August. I’ve got 0.074 to go before I reach my goal.

At this very moment, I am planning future development in the following two directions:

  1. Improve precision: as of now, the model thinks that “U.S.” in “U.S. troops” is a GeoLocation (it’s not: it’s an adjectival form, a synonym for “American”). The training data is good, but the model gets confused because the “U.S.” tokens that do not belong to a GeoLocation entity get thrown into the Background uber-state together with all other background tokens, so the model has no context from which to learn that “U.S.” followed by “troops” is not a GeoLocation. The way I’ll (hope to) achieve this has to do with the “industrial secret” mentioned above :-)
  2. Provide more training data: the slope of the learning curves above leaves hope that more training data will result in better performance. I’ve already started labeling new data, just need to find time.

As a final note, thanks again to JetBrains for their DotTrace and ReSharper: the running time of my tests is now a fraction of what it used to be, allowing me to experiment more and get results faster. The largest improvement came when I replaced the dictionary lookups for state names in my Viterbi implementation with integer indexes (duh).

I Hate People Who Make Assumptions

That’s it. I’ve been Googling for a while to see whether I’ve got some personality disorder that could explain why I get enraged when I even imagine people making assumptions about me.

As an example, I get mad each and every time I recall that my company assumed that I would have no trouble working from Irving, Texas for a few weeks. Yeah right, what difference would it make to me to work and live in Amsterdam or in Irving? Two examples: I don’t drive and you can’t survive in Irving without an effing car (I was the only MoFo walking along the street – and you should know that most streets in Irving have no sidewalks, as two-legged beings are not supposed to roam that land), and my cell phone is the cheapest piece of junk in the world and does not work in the U.S., so don’t be surprised if I come back with hundreds of dollars of phone bills for calling home. My face is getting red just from typing this.

To my relief, I’ve found that there are other people with the same problem. And this article showed me there’s light at the end of the tunnel. All I need to do now is make sure that every assuming person I know reads it and stops telling me how much time I need to get ready for a project, or when it’s the best time for me to catch a 10-hour flight to go back home, or how much sleep is enough for me when I’m jet-lagged, or how much information is enough for me to do my job well.

Taking Shortcuts

I’ve been working on my HMM’s quite a lot lately to create what I call “composite HMM’s” (or “composite extractors”, more on this later) with the hope that “Organization” HMM’s will yield improved performance (hint: what’s common between “Bank of Japan” and “Bank of Scotland” ?).

The problem I’m having, however, is that calculating the performance of the HMM’s is now becoming a tedious task – my home laptop takes a couple of days to crunch the numbers. So I’ve decided to start attacking performance – something I was postponing for later but which has now become essential. I’ve downloaded JetBrains’ “dotTrace” and wow!, it’s an awesome product. It contains all the features I was accustomed to using with the internal MS profilers (say, LOP), and in a few secs I was able to pinpoint the bottlenecks and turn things around so that what used to be the bottleneck now has a negligible impact on the overall runtime of the tests (case in point: Dictionary<string, int> lookups were killing me, and I’ve now replaced these with indexed arrays and proxied the string keys with int indexes).

Thanks JetBrains! Ron also told me about your ReSharper, I’m gonna look at it soon.
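To illustrate the string-key-to-int-index swap, here is a minimal Python sketch (not the actual implementation, which is not shown in this post). The two states, the probabilities and the names STATES, STATE_INDEX, TRANS and START are all made up for the example. The point is only that state names are resolved to integers once, so the inner Viterbi loop does nothing but list indexing.

```python
# Hypothetical two-state tagger; names and numbers are invented for illustration.
STATES = ["Background", "Person"]
STATE_INDEX = {name: i for i, name in enumerate(STATES)}  # resolved once, up front

# Dense tables indexed by int: TRANS[i][j] = P(state j | state i).
TRANS = [
    [0.8, 0.2],
    [0.4, 0.6],
]
START = [0.9, 0.1]


def viterbi(emissions):
    """emissions[t][j] = P(token t | state j), already looked up per token.

    Returns the most likely state sequence as a list of int indexes.
    """
    n = len(STATES)
    score = [START[j] * emissions[0][j] for j in range(n)]
    back = []
    for t in range(1, len(emissions)):
        prev, score, pointers = score, [], []
        for j in range(n):
            # Only integer indexing here: no string hashing in the hot loop.
            best_i = max(range(n), key=lambda i: prev[i] * TRANS[i][j])
            pointers.append(best_i)
            score.append(prev[best_i] * TRANS[best_i][j] * emissions[t][j])
        back.append(pointers)
    # Walk the back-pointers from the best final state.
    path = [max(range(n), key=lambda j: score[j])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


emissions = [[0.9, 0.1], [0.1, 0.9], [0.2, 0.8]]
path = viterbi(emissions)
names = [STATES[i] for i in path]  # strings reappear only at the boundary
```

The string names are still available through STATES and STATE_INDEX for input and output, but the decoder itself never touches them.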

A Very Secure Operating System

From an undisclosed Web site:

What makes CentOS a popular choice for web hosting providers is that it is frequently updated meaning that it is very secure and unlikely to be compromised any time soon…

Yeah, “very secure and unlikely to be compromised any time soon”, tell that to the Apache.Org guys, they’re still trying to figure out how much damage this hacker has caused to their servers :-)

August Results: Gabe 1, Good-Turing-Witten-Bell 0

Haven’t posted much in a while, as I’ve been focusing on a number of problems.

First problem: I added a few more labeled sentences to the training data, and poof! The performance went down. That sucked big time. I had to figure out what was wrong, so I built tools and wrote a lot of code to show the difference in predicted tags when more training data is added to the baseline training data. The strategy worked: I figured out that the new training data caused the HMM to label “-” as an entity, because the ‘-’ token in the “Pinkett-Smith” entity was now being emitted by an entity state (to be precise, by the second entity state in the chain), and the “entity” group emission (the hierarchically-higher distribution in my implementation of shrinkage) would then cause arbitrary entity states to emit ‘-’ with a high probability. The fix was obviously to have punctuation symbols emitted by new, special “entity-punctuation” states that do not interpolate with the normal “entity” group, leaving the “entity” group clear of punctuation symbols.

Second problem: the unknown symbol probability, again. I decided one day to check the effect of an arbitrary “unknown symbol factor” – a fractional constant that is multiplied with the unknown symbol probability – only to find that arbitrary values of this constant sway the performance numbers “quite much”. This was a clear signal that my current calculation of the unknown symbol probability was not optimal. So I began a quest researching Good-Turing and Witten-Bell smoothing, implemented them, and got (slightly) worse results than I got with *my* old unknown symbol probability and with an arbitrary value of the “unknown symbol factor”. So I added a new “vocabulary size” parameter to the HMM, used that to calculate the unknown symbol probability, and performed better than Good-Turing and Witten-Bell together :-)
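For reference, this is roughly what the textbook Witten-Bell estimate looks like for a single state's emission distribution. It is a sketch of the standard formula only, not of my own “vocabulary size” calculation (which is not described here), and the function names and the toy token list are made up. With N tokens and T distinct types seen in training, and an assumed vocabulary of V types, each of the Z = V - T unseen types gets T / (Z * (N + T)).

```python
from collections import Counter


def witten_bell_seen(tokens, token):
    """Smoothed probability of a token that was observed in training."""
    counts = Counter(tokens)
    n = sum(counts.values())   # N: total tokens observed
    t = len(counts)            # T: distinct token types observed
    return counts[token] / (n + t)


def witten_bell_unknown(tokens, vocabulary_size):
    """Probability of one specific unseen token.

    vocabulary_size is an assumption: the total number of distinct
    tokens (seen + unseen) the state could ever emit.
    """
    counts = Counter(tokens)
    n = sum(counts.values())
    t = len(counts)
    z = vocabulary_size - t    # Z: types never observed
    if z <= 0:
        raise ValueError("vocabulary_size must exceed the number of seen types")
    return t / (z * (n + t))


# Toy training tokens for a hypothetical entity state.
tokens = ["john", "smith", "john", "mary"]
p_unknown = witten_bell_unknown(tokens, vocabulary_size=10)
p_john = witten_bell_seen(tokens, "john")
```

The mass reserved for unseen tokens, T / (N + T), depends only on the training counts; the vocabulary size decides how thinly that mass is spread over the unseen types.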

Finally, I’ve done a complete re-arch of the HMM framework in order to support a “finalization” step between the phase in which the model is built and the phase in which the model is used for predictions. This re-arch now allows me to perform long calculations on the model, and the first thing I did was to add an expectation-maximization step to calculate the optimal interpolation lambdas in my implementation of shrinkage. I even tested different start values for the lambdas, and to my surprise they all end up converging to the same values, some of which are way off from the values I had determined with experiments. This extra EM step improved performance quite a bit!
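The EM step for the lambdas can be sketched as follows. This is a generic mixture-weight EM over held-out tokens, written in Python under my own assumptions: three hypothetical hierarchy levels (state, group, uniform), invented probabilities, and made-up function names. It is not the framework's actual code.

```python
import math


def em_lambdas(component_probs, start, iterations=50):
    """Estimate interpolation weights for shrinkage-style emissions.

    component_probs: one row per held-out token; row[k] is that token's
        probability under level k of the hierarchy.
    start: initial lambdas, one per level, summing to 1.
    """
    lambdas = list(start)
    for _ in range(iterations):
        totals = [0.0] * len(lambdas)
        for row in component_probs:
            weighted = [lam * p for lam, p in zip(lambdas, row)]
            norm = sum(weighted)
            if norm == 0.0:
                continue  # token impossible under every level; skip it
            for k, w in enumerate(weighted):
                totals[k] += w / norm          # E-step: responsibility of level k
        total = sum(totals)
        lambdas = [x / total for x in totals]  # M-step: renormalise
    return lambdas


def log_likelihood(component_probs, lambdas):
    """Held-out log-likelihood of the interpolated distribution."""
    return sum(
        math.log(sum(lam * p for lam, p in zip(lambdas, row)))
        for row in component_probs
    )


# Invented held-out data: columns are (state, group, uniform) probabilities.
held_out = [
    [0.5, 0.1, 0.01],
    [0.0, 0.2, 0.01],
    [0.3, 0.3, 0.01],
    [0.0, 0.0, 0.01],
]
start = [0.8, 0.1, 0.1]
fitted = em_lambdas(held_out, start)
```

Since only the weights are being fitted, the held-out log-likelihood is a concave function of the lambdas, which would be consistent with different start values ending up in the same place.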

So, here is the current performance, calculated with 4758 sentences, 1659 of which contain person names:

Person HMM Performance on August 11 2009

The average F at 60% is now 0.840 – 0.060 left to go, one third less than last time!

My TOP 10 Sad Songs

I keep shuffling the preference order of the songs I like the most, so I decided to commit in writing my TOP 10 lists and see how much they’ll change over time.
This is my first try – the TOP 10 sad songs that I listen to when I’m in a melancholic mood.

  • The Lightning Strike part I – Snow Patrol
    A masterpiece. Thanks to Sarah for making me discover it.
  • How to Save a Life – The Fray
    Miserable. Apparently, appeared on (some episode of) “Grey’s Anatomy”.
  • Auto Rock – Mogwai
    Awesome. From “Miami Vice” (the movie).
  • Sleeping Satellite – Tasmin Archer
    Don’t know why she entered oblivion after publishing such a wonderful song.
  • Stop Crying Your Heart Out – Oasis
    Heartbreaking. From the final scenes of some version of “the Butterfly Effect”.
  • Mad World – Gary Jules
    Suicidal. From “Donnie Darko”. If you watched the movie, you’re probably addicted to this song by now.
  • Glory Box – Portishead
  • Beautiful – Christina Aguilera
    Crap, Christina Aguilera?!? I’m really becoming an old dude.
  • Calling You – Jevetta Steele
    From Bagdad Cafe. Wonderful use of her voice.
  • Teardrop – Massive Attack
  • Special Needs – Placebo
    Again, thanks to Sarah for making me discover it.

Ok, not really “ten”, but getting there…

The Joys of Kicking Marketing A**es

My love for marketing people gets stronger every day.

Brought to my attention by Radi: basically, the CEO of this company was convinced by his smart marketing people (hope for him it was not his idea) to pay out a $10K prize to anyone who would defeat their (quasi-) two-factor authentication. Well, he had to pay.

How about having a “LinkedOut” site where we post the names of people that should never be hired?

June Results, Lexicon Training, and Discrimination

Newest results, this time improved by just adding some more training data:

Person HMM Performance on June 6, 2009

Person HMM Performance on June 3 2009

Compare with the previous results. The F score at 60% is now 0.803, so I only have [Run->Calc…] 0.097 to go before starting my own company.

A few days ago I mentioned “hacked vocabularies”, which I’d better rename as “training with lexicons”. I basically have these huge lists of first names, last names, and just names, and I was wondering how I could best incorporate these in the HMM training so as to improve results. Of course I was mostly shooting at improving recall, since the lists give no indication of word context and so should not impact precision too much.

My idea was to “enhance” the vocabulary at each of the entity states with these lexicons, so as to assign a reasonable probability (larger than the probability of the unknown symbol, at least) to tokens that appear in these lists (the lexicon). Results: disaster! Precision fell down the drain. I broke my head for days on this, and I think I now know why precision went down.

No matter how much I clean the lists, there’s always gonna be some noise in them, for example with names like May or last names like Rain. Quoting Stan, I’ve learnt something today, and it has nothing to do with Cartman: training data must really be “real world”, that is, balanced. If you add lexicon words to the entity states, you’re improving the chances that “rain” is recognized as a person name; at the same time, since you are not also adding words to the background states, the probability of “rain” being background stays constant. There you go: precision goes down.

So I’ve nuked the idea of training with the lists, but I’m still keeping them around – you never know, one day I might come up with better ideas.

On a side note, I’ve finally graphed discrimination, which is a measure that I came up with to indicate how much I can rely on a tag’s confidence to improve precision without impacting recall. Basically, discrimination is the percentage of predicted tags whose confidence lies outside of what I call the “DMZ”. The DMZ is the range of confidence values between the highest confidence of all wrong predictions and the lowest confidence of all correct predictions. Basically, everything below the DMZ – that is, every predicted tag with a confidence below the DMZ – can be safely dismissed as “garbage”. At the same time, everything above the DMZ – that is, every predicted tag with a confidence above the DMZ – can be deemed correct. Everything in the DMZ is uncertain – the confidence value alone won’t help in discriminating between false or true predictions. Of course, the ideal case is discrimination=1.0, that is, no tags in the DMZ; in practical terms, this would mean that there’s a precise confidence threshold that can be used to separate garbage from gold.
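The definition above translates into a few lines of Python. This is a sketch with a few assumptions of mine: each prediction is a (confidence, is_correct) pair, the DMZ boundaries are treated as inclusive, and the function name and sample numbers are invented.

```python
def discrimination(predictions):
    """predictions: list of (confidence, is_correct) pairs, one per predicted tag.

    Returns (discrimination, dmz_low, dmz_high).
    """
    correct = [c for c, ok in predictions if ok]
    wrong = [c for c, ok in predictions if not ok]
    if not correct or not wrong:
        return 1.0, None, None   # nothing to separate, so no DMZ at all
    dmz_low = min(correct)       # below this, every prediction is wrong
    dmz_high = max(wrong)        # above this, every prediction is correct
    if dmz_low > dmz_high:
        return 1.0, None, None   # a clean threshold exists between the two
    inside = sum(1 for c, _ in predictions if dmz_low <= c <= dmz_high)
    return 1.0 - inside / len(predictions), dmz_low, dmz_high


# Invented sample: four correct and four wrong predictions.
sample = [
    (0.95, True), (0.90, True), (0.85, True), (0.60, True),
    (0.70, False), (0.40, False), (0.20, False), (0.10, False),
]
score, low, high = discrimination(sample)
```

In this sample the DMZ runs from 0.60 to 0.70 and holds two of the eight tags, so six of them can be settled by confidence alone.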

Here’s the chart of discrimination plotted against training data percentages:

HMM Discrimination

Person HMM Discrimination on June 3 2009

The guy looks suspiciously exponential. Good for me, as it means the more labeled data, the more I can use confidence to filter results after prediction.

The discrimination measure is of course only meaningful in the context of the specific measure chosen for the confidence score; in my case, the confidence of a tag is the probability of the HMM path that covers the whole entity (that is, the transition and emission probabilities for the entity states only). This measure of course penalizes longer entities and favors shorter ones, simply because there are more transitions and emissions in longer entities, and these transition and emission probabilities will always be < 1.0. But then, the confidence of an extracted “John Smith” could be greater than the confidence of an extracted “Titsiana”. Well, maybe one day I should revisit my confidence.
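A tiny Python sketch of the length penalty, with invented probabilities. The geometric-mean variant is only one possible way of normalizing for length, shown here as an assumption and not as something the model currently does.

```python
import math


def path_confidence(step_probs):
    """Product of the transition and emission probabilities along an entity."""
    return math.prod(step_probs)


def per_step_confidence(step_probs):
    """Geometric mean of the same probabilities (length-normalised)."""
    return math.prod(step_probs) ** (1.0 / len(step_probs))


short_entity = [0.9, 0.8]            # e.g. one transition and one emission
long_entity = [0.9, 0.8, 0.9, 0.8]   # same per-step quality, twice the steps

short_conf = path_confidence(short_entity)
long_conf = path_confidence(long_entity)
```

With the plain product the longer entity scores lower even though every step is just as likely; with the geometric mean the two come out the same.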
