Some More Marketing BS in Security

…this time from Cisco. I’ve stumbled upon a 2008 slide deck from them that introduces the new “Cisco Self-defending Network V3.0”. In particular, there’s a section about the “Cisco ASA 5500 Series” that shows how much effort Cisco marketing people (and, in this case, engineers as well) put into coming up with their bulls**t. Two notable examples:

Any Cisco voice/video communications encrypted with SRTP/TLS can now be inspected by Cisco ASA 5500 Adaptive Security Appliances: TLS signaling is terminated and inspected, then re-encrypted for connection to destination (leveraging integrated hardware encryption services for scalable performance).

So now what I thought was a secure tunnel between me and the other person is, in fact, cut in half at a potential interception point liable to be controlled by hackers, all of this thanks to the “Adaptive Security Appliance”. And as if this were not enough, it is obvious that for this scenario to work the Cisco people had to make their appliances open to man-in-the-middle attacks. Wow.

Advanced Web Traffic Security: Protects Networks from Web-based Threats.

  • Provides powerful regular expression (regex) matching capabilities to detect administrator customizable strings and optionally block, rate limit, and/or log traffic.
  • Deep inspection services provide businesses control over what actions users can perform when accessing websites.
  • Performs RFC compliance checking for protocol anomaly detection.
  • Provides MIME type filtering and content validation capabilities.

This sounds so much like an IDS from the nineties. Using (powerful) regexes, RFC compliance enforcement, and MIME filtering does not sound like “advanced Web traffic security” to me. It’s what we were doing 20 years ago.

Then, there’s a section about the “Cisco Security Agent”, a software agent that you install on each box:

Intercepting Actions on the Endpoint: Application calls to the operating system are intercepted in real-time, and dynamic decisions are made to allow/deny actions.

Guys, it’s called “behavioral blocking”, it’s been tried for at least 10 years by a number of companies (including Symantec and Microsoft), and I have never seen it work. You might pull it off, though.

And finally, Cisco wants you to know that the agent has been…

Validated by PCI auditor to address PCI 1.1 DSS Requirements

…which apparently for Cisco is supposed to be a cool thing to say.

After the Jump – aka October Results

I’ve finally made the jump. The next logical big step in the HMM development was in fact to “put it all together”: the Person model, the GeoLocation model, and the Organization model. Each model separately was “conflicting” with the others (what’s “Charles de Gaulle” in “Charles de Gaulle airport”? The Person model thinks it’s a person while the GeoLocation model thinks it’s a geographical location) and the only way to make it work was to put everything in the same big-honking model.

The number of states in the new “unified” model jumped from about 30 to about 300, mostly due to the sheer number of different “morphological” structures of Organization names. At the same time, I underwent a lengthy process of re-labeling the input data (quite boring: more than 5,000 sentences) for reasons that at this moment I’ll categorize as “industrial secrets” :-)

The performance of this new “unified” model is quite what I was expecting: individually, each of the three entity types performs better, mostly because precision has improved due to the “collaboration” among the entities (the new model no longer says that “Charles de Gaulle” in “Charles de Gaulle airport” is a person, since the GeoLocation “part” of the model wins), but unfortunately, the endemic poor performance of the Organization model brings down the total performance. Here are the learning curves:

Multi-Entity HMM Performance

The Person “part” of the model has finally hit 0.9 F (at between 80% and 90% of training data), the GeoLocation “part” is almost there, but the Organization part sucks big time :-( However, the global (“All”) performance at 60% is 0.826, a bit less than the performance of the Person model alone in August. I’ve got 0.074 to go before I reach my goal.

At this very moment, I am planning future development in the following two directions:

  1. Improve precision: as of now, the model thinks that “U.S.” in “U.S. troops” is a GeoLocation (while it’s not, it’s an adjectival form, a synonym for “American”). The training data is good, but the model gets confused since the “U.S.” tokens that do not belong to a GeoLocation entity get thrown in the Background uber-state of background tokens and the model doesn’t have any context to learn that when “U.S.” is followed by “troops” then it’s not a GeoLocation. The way I’ll (hope to) achieve this has to do with the “industrial secret” mentioned above :-)
  2. Provide more training data: the slope of the learning curves above leaves hope that more training data will result in better performance. I’ve already started labeling new data, just need to find time.

As a final note, thanks again to JetBrains for their DotTrace and ReSharper: the running time of my tests is now a fraction of what it used to be, allowing me to experiment more and get results faster. The largest improvement came when I substituted dictionary lookups for state names in my Viterbi implementation with integer indexes (duh).

I Hate People Who Make Assumptions

That’s it. I’ve been Googling for a while to see whether I’ve got some personality disorder that could explain why I get enraged when I even imagine people making assumptions about me.

As an example, I get mad each and every time I recall that my company assumed that I would have no trouble working from Irving, Texas for a few weeks. Yeah right, what difference would it make to me to work and live in Amsterdam or in Irving? Two examples: I don’t drive and you can’t survive in Irving without an effing car (I was the only MoFo walking along the street – and you should know that most streets in Irving have no sidewalks, as two-legged beings are not supposed to roam that land), and my cell phone is the cheapest piece of junk in the world and does not work in the U.S., so don’t be surprised if I come back with hundreds of dollars of phone bills to call home. My face is getting red just from typing this.

To my relief, I’ve found that there are other people with the same problem. And this article showed me there’s light at the end of the tunnel. All I need to do now is make sure that every assuming person I know reads it and stops telling me how much time I need to get ready for a project, or when it’s the best time for me to catch a 10-hour flight to go back home, or how much sleep is enough for me when I’m jet-lagged, or how much information is enough for me to do my job well.

Taking Shortcuts

I’ve been working on my HMM’s quite a lot lately to create what I call “composite HMM’s” (or “composite extractors”, more on this later) with the hope that “Organization” HMM’s will yield improved performance (hint: what’s common between “Bank of Japan” and “Bank of Scotland”?).

The problem I’m having, however, is that calculating the performance of the HMM’s is now becoming a tedious task – my home laptop takes a couple of days to crunch the numbers. So I’ve decided to start attacking performance – something I was postponing for later but which has now become essential. I’ve downloaded JetBrains’ “dotTrace” and wow!, it’s an awesome product. It contains all the features I was accustomed to using with the internal MS profilers (say, LOP), and in a few secs I was able to pinpoint the bottlenecks and turn things around so that what used to be the bottleneck now has negligible impact on the overall runtime of the tests (case in point: Dictionary<string, int> lookups were killing me, and I’ve now substituted these with indexed arrays and proxied the string keys with int indexes).
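The index trick can be sketched as follows. This is a minimal Python illustration, not the original .NET code: the state names, symbols, and probabilities are all made up for the example. Strings are mapped to dense integers once, up front, and the Viterbi inner loop then touches only lists indexed by int.

```python
import math

def build_index(names):
    """Map each string name to a dense integer index, once, up front."""
    return {name: i for i, name in enumerate(names)}

def viterbi(obs, start, trans, emit):
    """Most likely state path for a sequence of integer symbol ids.

    start[s], trans[p][s] and emit[s][o] hold log-probabilities in plain
    lists, so the inner loop does no string hashing at all.
    """
    n = len(start)
    score = [start[s] + emit[s][obs[0]] for s in range(n)]
    back = []
    for o in obs[1:]:
        prev, score, ptr = score, [], []
        for s in range(n):
            # Best predecessor state for landing in state s at this step.
            best = max(range(n), key=lambda p: prev[p] + trans[p][s])
            score.append(prev[best] + trans[best][s] + emit[s][o])
            ptr.append(best)
        back.append(ptr)
    # Follow the back-pointers from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

# Hypothetical toy model: two states, three symbols.
states = ["Background", "Person"]
symbol_index = build_index(["said", "john", "smith"])
start = [math.log(p) for p in (0.8, 0.2)]
trans = [[math.log(p) for p in row] for row in ((0.7, 0.3), (0.4, 0.6))]
emit = [[math.log(p) for p in row]
        for row in ((0.8, 0.1, 0.1), (0.02, 0.49, 0.49))]

# Strings are translated to ints once, outside the hot loop.
obs = [symbol_index[w] for w in ("john", "smith", "said")]
path = viterbi(obs, start, trans, emit)
labels = [states[s] for s in path]
```

The string-to-int dictionary is still there, but it is consulted once per token before decoding instead of inside every transition and emission lookup.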

Thanks JetBrains! Ron also told me about your ReSharper, I’m gonna look at it soon.

A Very Secure Operating System

From an undisclosed Web site:

What makes CentOS a popular choice for web hosting providers is that it is frequently updated meaning that it is very secure and unlikely to be compromised any time soon…

Yeah, “very secure and unlikely to be compromised any time soon”, tell that to the Apache.Org guys, they’re still trying to figure out how much damage this hacker has caused to their servers :-)

August Results: Gabe 1, Good-Turing-Witten-Bell 0

Haven’t posted much in a while, as I’ve been focusing on a number of problems.

First problem: I have added a few more labeled sentences to the training data, and poof! The performance went down. That sucked big time. I had to figure out what was wrong, so I built tools and wrote a lot of code to show the difference in predicted tags when more training data is added to the baseline training data. The strategy worked: I figured out that the addition of the new training data caused the HMM to label “-” as an entity, due to the fact that the ‘-’ token in the “Pinkett-Smith” entity was now being emitted by an entity state (to be precise, by the second entity state in the chain), and the “entity” group emission (the hierarchically-higher distribution in my implementation of shrinkage) would then cause arbitrary entity states to emit ‘-’ with a high probability. The fix was obviously to have punctuation symbols emitted by new, special “entity-punctuation” states that do not interpolate with the normal “entity” group, leaving the “entity” group clear of punctuation symbols.

Second problem: the unknown symbol probability, again. I decided one day to check the effect of an arbitrary “unknown symbol factor” – a fractional constant that is multiplied with the unknown symbol probability – only to find that arbitrary values of this constant sway the performance numbers “quite much”. This was a clear signal that my current calculation of the unknown symbol probability was not optimal. So I began a quest researching Good-Turing and Witten-Bell smoothing, implemented them, and got (slightly) worse results than I got with *my* old unknown symbol probability and with an arbitrary value of the “unknown symbol factor”. So I added a new “vocabulary size” parameter to the HMM, used that to calculate the unknown symbol probability, and performed better than Good-Turing and Witten-Bell together :-)
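For reference, here is a minimal sketch of how textbook Witten-Bell smoothing assigns the unknown symbol probability for a single emission distribution. It is not the “vocabulary size” formula from the post, which is not spelled out; the function name and the toy counts are assumptions for illustration. With N observed tokens and T distinct observed types, Witten-Bell reserves a total mass of T / (N + T) for unseen symbols and splits it evenly over the Z = V - T types that an assumed vocabulary of size V still leaves unseen.

```python
from collections import Counter

def witten_bell(tokens, vocabulary_size):
    """Return a function giving the Witten-Bell probability of a symbol.

    vocabulary_size is an assumed total number of symbol types (seen plus
    unseen); it decides how thinly the reserved mass is spread.
    """
    counts = Counter(tokens)
    n = sum(counts.values())      # N: observed tokens
    t = len(counts)               # T: distinct observed types
    z = vocabulary_size - t       # Z: types never observed
    if z <= 0:
        raise ValueError("vocabulary_size must exceed the number of seen types")
    unknown = t / (z * (n + t))   # probability of each unseen symbol

    def prob(symbol):
        c = counts.get(symbol, 0)
        return c / (n + t) if c else unknown

    return prob

# Hypothetical emissions observed at one state.
prob = witten_bell(["john", "john", "mary", "smith"], vocabulary_size=10)
p_known = prob("john")       # 2 / (4 + 3)
p_unknown = prob("titsiana") # 3 / (7 * (4 + 3))
```

The point of the sketch is that the per-symbol unknown probability depends directly on the assumed vocabulary size, so that size is a natural knob to tune.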

Finally, I’ve done a complete re-arch of the HMM framework in order to support a “finalization” step between the phase in which the model is built and the phase in which the model is being used for predictions. This re-arch now allows me to perform long calculations on the model, and the first thing I did was to add an expectation-maximization step to calculate the optimal interpolation lambda’s in my implementation of shrinkage. I even tested different start values for the lambda’s, and to my surprise they all end up converging to the same values, some of which are way off from the values I determined with experiments. This extra EM step improved performance quite a bit!
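A minimal sketch of such an EM step, assuming the usual setup for interpolation weights: each held-out token comes with the probability that every component distribution (say state-specific, group-level, uniform) assigns to it, and EM re-estimates one lambda per component. The function name and the numbers are hypothetical, not taken from the actual framework.

```python
def em_lambdas(component_probs, start, iterations=500):
    """Estimate mixture weights for interpolated distributions with EM.

    component_probs: one row per held-out token; row[j] is the probability
    that component j assigns to that token.
    start: initial weights, all positive, summing to 1.
    """
    lambdas = list(start)
    for _ in range(iterations):
        expected = [0.0] * len(lambdas)
        for row in component_probs:
            # E-step: share of this token credited to each component.
            mix = sum(l * p for l, p in zip(lambdas, row))
            for j, p in enumerate(row):
                expected[j] += lambdas[j] * p / mix
        # M-step: normalize the expected counts into new weights.
        total = sum(expected)
        lambdas = [e / total for e in expected]
    return lambdas

# Hypothetical held-out probabilities: (state, group, uniform).
held_out = [
    (0.40, 0.05, 0.01),
    (0.30, 0.05, 0.01),
    (0.00, 0.20, 0.01),
    (0.00, 0.10, 0.01),
    (0.00, 0.00, 0.01),
    (0.20, 0.10, 0.01),
]
from_state_heavy = em_lambdas(held_out, (0.8, 0.1, 0.1))
from_uniform_heavy = em_lambdas(held_out, (0.1, 0.1, 0.8))
```

The held-out log-likelihood is a concave function of the lambdas, so it is not that surprising that different positive start values end up in the same place.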

So, here is the current performance, calculated with 4758 sentences, 1659 of which contain person names:

Person HMM Performance on August 11 2009


The average F at 60% is now 0.840 – 0.060 left to go, one third less than last time!

My TOP 10 Sad Songs

I keep shuffling the preference order of the songs I like the most, so I decided to commit my TOP 10 lists to writing and see how much they’ll change over time.
This is my first try – the TOP 10 sad songs that I listen to when I’m in a melancholic mood.

  • The Lightning Strike part I – Snow Patrol
    A masterpiece. Thanks to Sarah for making me discover it.
  • How to Save a Life – The Fray
    Miserable. Apparently, appeared on (some episode of) “Grey’s Anatomy”.
  • Auto Rock – Mogwai
    Awesome. From “Miami Vice” (the movie).
  • Sleeping Satellite – Tasmin Archer
    Don’t know why she entered oblivion after publishing such a wonderful song.
  • Stop Crying Your Heart Out – Oasis
    Heartbreaking. From the final scenes of some version of “the Butterfly Effect”.
  • Mad World – Gary Jules
    Suicidal. From “Donnie Darko”. If you watched the movie, you’re probably addicted to this song by now.
  • Glory Box – Portishead
  • Beautiful – Christina Aguilera
    Crap, Christina Aguilera?!? I’m really becoming an old dude.
  • Calling You – Jevetta Steele
    From Bagdad Cafe. Wonderful use of her voice.
  • Teardrop – Massive Attack
  • Special Needs – Placebo
    Again, thanks to Sarah for making me discover it.

Ok, not really “ten”, but getting there…

The Joys of Kicking Marketing A**es

My love for marketing people gets stronger every day.

Brought to my attention by Radi: basically, the CEO of this company was convinced by his smart marketing people (hope for him it was not his idea) to pay out a $10K prize to anyone who would defeat their (quasi-) two-factor authentication. Well, he had to pay.

How about having a “LinkedOut” site where we post the names of people that should never be hired?

June Results, Lexicon Training, and Discrimination

Newest results, this time improved by just adding some more training data:

Person HMM Performance on June 6, 2009

Person HMM Performance on June 3 2009

Compare with the previous results. The F score at 60% is now 0.803, so I only have [Run->Calc…] 0.097 to go before starting my own company.

A few days ago I mentioned “hacked vocabularies”, which I’d better rename as “training with lexicons”. I basically have these huge lists of first names, last names, and just names, and I was wondering how I could best incorporate these in the HMM training so as to improve results. Of course I was mostly shooting at improving recall, since the lists give no indication of word context, thus not impacting precision too much.

My idea was to “enhance” the vocabulary at each of the entity states with these lexicons, so as to assign a reasonable probability (larger than the probability of the unknown symbol, at least) to tokens that appear in these lists (the lexicon). Results: disaster! Precision went down the drain. I racked my brain for days on this, and I think I now know why precision went down.

No matter how much I clean the lists, there’s always gonna be some noise in them, for example with names like May or last names like Rain. Quoting Stan, I’ve learnt something today, and it has nothing to do with Cartman: training data must really be “real world”, that is, balanced. If you add lexicon words to the entity states, you’re improving the chances that “rain” is recognized as a person name; at the same time, since you are not also adding words to the background states, the probability of “rain” being background stays constant. There you go: precision goes down.

So I’ve nuked the idea of training with the lists, but I’m still keeping them around – you never know, one day I might come up with better ideas.

On a side note, I’ve finally graphed discrimination, which is a measure that I came up with to indicate how much I can rely on a tag’s confidence to improve precision without impacting recall. Basically, discrimination is the percentage of predicted tags whose confidence lies outside of what I call the “DMZ”. The DMZ is the range of confidence values between the highest confidence of all wrong predictions and the lowest confidence of all correct predictions. Everything below the DMZ – that is, every predicted tag with a confidence below the DMZ – can be safely dismissed as “garbage”. At the same time, everything above the DMZ – that is, every predicted tag with a confidence above the DMZ – can be deemed correct. Everything in the DMZ is uncertain – the confidence value alone won’t help in discriminating between false and true predictions. Of course, the ideal case is discrimination=1.0, that is, no tags in the DMZ; in practical terms, this would mean that there’s a precise confidence threshold that can be used to separate garbage from gold.
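The definition translates into a few lines of code. This is a minimal sketch under two assumptions that the definition leaves open: the DMZ boundaries are treated as inclusive, and a set of predictions with no overlap between wrong and correct confidences (or with only one kind of prediction) scores 1.0. The function name and sample values are made up.

```python
def discrimination(predictions):
    """Fraction of predictions whose confidence lies outside the DMZ.

    predictions: iterable of (confidence, is_correct) pairs.
    """
    predictions = list(predictions)
    if not predictions:
        raise ValueError("need at least one prediction")
    correct = [c for c, ok in predictions if ok]
    wrong = [c for c, ok in predictions if not ok]
    if not correct or not wrong:
        return 1.0  # nothing to confuse: no DMZ at all
    low = min(correct)   # lowest confidence of a correct prediction
    high = max(wrong)    # highest confidence of a wrong prediction
    if low > high:
        return 1.0       # a clean threshold exists between wrong and correct
    in_dmz = sum(1 for c, _ in predictions if low <= c <= high)
    return 1.0 - in_dmz / len(predictions)

# Hypothetical predictions: the DMZ here spans 0.3 to 0.4.
sample = [(0.1, False), (0.2, False), (0.3, True),
          (0.4, False), (0.6, True), (0.9, True)]
score = discrimination(sample)
```

In this sample two of the six predictions fall inside the DMZ, so only the other four can be accepted or dismissed on confidence alone.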

Here’s the chart of discrimination plotted against training data percentages:

HMM Discrimination

Person HMM Discrimination on June 3 2009

The guy looks suspiciously exponential. Good for me, as it means the more labeled data, the more I can use confidence to filter results after prediction.

The discrimination measure is of course only meaningful in the context of the specific measure chosen for the confidence score; in my case, the confidence of a tag is the probability of the HMM path that covers the whole entity (that is, transition and emission probabilities for the entity states only). This measure of course penalizes longer entities and favors shorter ones, simply because there are more transitions and emissions in longer entities, and these transitions and emissions will always be < 1.0. But then, the confidence of an extracted “John Smith” could be greater than the confidence of an extracted “Titsiana”. Well, maybe one day I should revisit my confidence.
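A small worked example of the length penalty, with made-up probabilities. The first function is the confidence as described above (the product of the transition and emission probabilities along the entity). The second, a per-step geometric mean, is only one possible normalization shown for contrast; it is an assumption of this sketch, not something the model uses.

```python
import math

def path_confidence(step_probs):
    """Product of the transition and emission probabilities along an entity."""
    return math.prod(step_probs)

def per_step_confidence(step_probs):
    """Geometric mean of the same probabilities (hypothetical alternative)."""
    return math.prod(step_probs) ** (1.0 / len(step_probs))

# Same quality per step (0.8), different entity lengths.
one_token = [0.8, 0.8]             # one transition, one emission
two_tokens = [0.8, 0.8, 0.8, 0.8]  # two transitions, two emissions

short_conf = path_confidence(one_token)
long_conf = path_confidence(two_tokens)
long_normalized = per_step_confidence(two_tokens)
```

With identical per-step probabilities, the two-token entity gets a lower product than the one-token entity purely because of its length, while the geometric mean scores both the same.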

google.navigation.opendns.com

Just noticed that tracert for google.com shows Google being served by google.navigation.opendns.com. After some search I’ve found this heart-breaking news. Crap, can’t trust Google anymore…….
