Archive for December, 2008

Experimenting with Chain Lengths

I’ve spent the last few days doing a complete makeover of my HMM structure. I’ve fixed many stupid mistakes I had done in the past, and I’ve cleaned up the infrastructure around HMMs.

Most importantly, I’ve tested the impact of the length of prefix and suffix chains – that is, of the chains of states leading to and leaving from the set of states “implementing” the “person name” entity. Here’s a chart depicting the average F for chains of length 2, 3, and 4:

Person HMM Performance with Varying Prefix and Suffix Chain Lengths

Person HMM Performance with Varying Prefix and Suffix Chain Lengths

It definitely looks like shorter chains yield better performance…that sounds odd, but probably makes sense, given the fact that, for example, a prefix state at distance 2 from the entity has more “entropy” than a prefix state at distance 1 from the entity. As an example, compare “A gift by George” and “As told by Mike”: “by” is a strong indicator that an entity might follow, while “gift” and “told” are unrelated (noun and verb) and less indicative of an imminent entity.

In any case, as you can tell, the overall performance has improved, mostly thanks to the elimination of bad choices from the past. Here’s the current baseline:

Person HMM Performance on 24-12-2008

Person HMM Performance on 24-12-2008

Although recall is awesome, precision sucks…on average, 2 out of 3 predictions are totally garbage :-(

Here’s how I am planning to improve performance:

  • Investigate the effect of training with positive and negative examples, as opposed to training with positive examples only (as of now). I’m running tests as of now and will follow up “shortly” (s**t, it takes forever to run the tests…I need a faster box!)
  • Investigate the effect of using absolute discounting to smooth symbol emission probabilities (rather than my homegrown unknown lower/upper case symbol thing).
  • Investigate the effect of using separate fixed-length chains for entity states. I’m currently using one single chain of varying length; for example, “Gordon Sumner” is modelled with the path “Entity-1″ -> “Entity-2″, while “Gordon Matthew Sumner” is modelled with the path “Entity-1″ -> “Entity-2″ -> “Entity-3″. If I went for separate chains, I’d have a chain of states for each entity length; for example, “Gordon Sumner” – an entity made up of 2 tokens - would be modeled with the path “Entity2-1″ -> “Entity2-2″ (the number after “Entity” indicating the number of tokens in the entity), while “Gordon Matthew Sumner” would be modeled as “Entity3-1″ -> ”Entity3-2″ -> “Entity3-3″.
  • Investigate the use of shrinkage (see Freitag & McCallum) to “uniformize” the emission probabilities of related states; this is particularly important if I went for separate fixed-length entity chains, as I’d have the problem of fragmenting my training data across multiple related states. As an example, with the single varying-length entity chain of today, the state called “Entity-1″ would emit all the first names that I currently have in my labeled data, which is highly desirable. On the other hand, if I went with separate fixed-length entity chains, all my first names would be divided between states “Entity1-1″, “Entity2-1″, “Entity3-1″, and so on. Shrinkage would kinda “merge” the emission distributions of all the “EntityN-1″ states, so that fragmentation of training data would be less of an issue.

In conclusion, the current baseline is an improvement over the past one, but I’m still FAR away from my commercial goal…..sigh :-(

How I Calculate the HMM Performance

All the charts showing the performance of my HMMs are created using the holdout method. The entire corpus of tagged text is divided into two subsets: I train my HMMs on one subset (the “training” set), and then I used the HMMs to auto-tag the text in the second subset (the “test” set), comparing the HMMs’ predictions with the tags in the test set. The percentage of corpus text divided between the training set and the test set varies, and for each percentage value I run a number of times. In pseudo-code:

for(trainPercentage from 90 to 10):
{
	for(i from 1 to 100):
	{
		DivideCorpusRandomly() -> TrainingSet, TestSet;
		HMM hmm = new HMM();
		hmm.Train(TrainingSet);
		hmm.Predict(TestSet) -> predictions;
		Compare(predictions, TestSet.actualValues) -> PerformanceMetrics;
	}
	Show(PerformanceMetrics);
}

I can also vary the values of a confidenceThreshold parameter, which cuts off predictions having a low confidence, but I’m currently not using this parameter. I’ll leave its investigation for later, when my F score gets better :-)

Regarding counting “right” and “wrong” tags, I am being a Nazi with myself. Some papers I’ve read count the number of tokens correctly tagged by the HMM (so, tagging “Jada Pinkett” in “Jada Pinkett Smith” would score 2 on 3), while I count the whole tag (in the example above I’d have scored 0). Should I count tokens as well – and be nicer with myself along the way? :-)

My HMM Goal…

…here I state it: when I get the average F score at 90% for 60% of the test data, I’ll commercialize the thing. Promised.

Patterns and Flu

This is incredibly cool. A few days ago I was cursing the Web for not having a “flu trend” service that allows one to see whether or not there is a flu epidemic around. This would be extremely useful, for example, to figure out whether or not those cramps are due to yesterday night’s crappy restaurant or to a stomach flu :-)

Turns out that my heroes thought about it already: Google trends can be used to chart flu epidemics! This is freakin’ awesome. I’m looking forward to having my own search engine to play with this stuff :-)

Are Emission Probabilities so Important?

A ton of papers I’ve been reading on HMMs are centered around the need to choose one or another emission probability smoothing technique to deal with the nefarious “unknown symbol” (that is, a symbol that has never been seen emitted by a state during training but which we’d like to account for in production). Without smoothing, the probability of such a symbol would be zero and consequently any observation containing an unknown symbol would have zero probability of being generated by the model.

One smoothing technique – “shrinking” – is explained here (again, from Freitag and McCallum). The technique looks cool since it takes into account the structure of the HMM, which makes a lot of sense. Other classic techniques include Laplace and discounting.

My current HMM uses a couple of dirty tricks for the unknown symbol:

  • First off, I have two unknown symbols – LowerCase unknown and InitialCap unknown;
  • At state Si, the two unknown symbols share the probability of one single extra symbol (i.e. \frac{1}{\mbox{count}(\mbox{all emissions at }S_i) + 1}), distributed among them according to the relative percentage of lower-case and initial-case symbols.

My trick seemed smart enough to work fine. However, since so many papers stress the importance of smoothing emission probabilities, I’m wondering whether or not the adoption of another smoothing technique could make my HMM perform drastically better.

December Plan for HMM Improvement

It looks like the main routes that I have to work on are:

  • More training data: I need to keep labeling text. 216 tags in 167 sentences is too little.
  • Work on the HMM structure. Structure is everything. My current structure – a bit simplistic, I agree (more on it later) – could be improved.  I have two routes to explore:
    • First, my idea of “recursive” HMMs (basically, an HMM whose states can “call” arbitrary sub-HMMs which will “return” when done to the original calling state). More on this later.
    • Second, explore the work done by Freitag & McCallum (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.26.3665), who seem to have tackled the concept of machine-learning the best HMM structure for a given task. Have to look into this, seems awesome stuff :-)

Results of Auto-tagging

In order to check whether or not my HMM needs a more substantial corpus to train on, I’ve found a way to cheat and get performance results from a “larger” corpus. What I’ve done is use my existing HMM (the one trained on 216 tags in 167 sentences) in a “different” way to find “sure” people names in untagged text (had to use some heuristics in order to make sure I was tagging actual people names). This yielded 1392 tags in 1292 sentences. I’ve then used this “cheaty” corpus to base-train my HMM in all the performance test scenarios.

The chart below shows the results. The x-axis indicates the percentage of original corpus (the one with 216 tags in 167 sentences) on top of the dirty corpus. For example, “20%” indicates the HMM trained on the whole cheaty corpus plus 20% of the original corpus.

Person HMM Performance with Cheaty Corpus

Person HMM Performance with Cheaty Corpus

Weird, isn’t it? I wonder whether the negative slope is due to the fact that there’s a high correlation between the “base” corpus and the “training” corpus. But at least these results are better than the previous ones, strongly suggesting that I need to do more labeling (:-() and increase my training corpus.

On Precision and Recall

Precision and Recall are two measures that will be haunting me in the next weeks, while I’m working hard to model better HMMs.

Precision is a measure of bulls**t; less precision means more garbage (wrong entities) that my HMM produces. Recall, on the other hand, is a measure of how good is my HMM at spotting the right stuff; less recall means more entities go undetected.

Ideally, both measures should be 1.0 – no garbage at all, and all entities spotted.

Finally, the F-score is the harmonic mean of the two.

Introducing HiSam

Two years ago, when I left the awesome Delphi project in Redmond for Amsterdam, I began tinkering with my own text analytics framework, code-named “HiSam”. More on the framework in later posts.

The effort suffered a major hit when my laptop, containing the latest version of the code, had been stolen in October 2007. However, one year later, thanks to the economic crisis, I decided to revive the project, hoping to build some commercial services around the most promising components of the framework. One of these components is the extraction of named entities (such as people names, geographical locations, etc.), implemented through Hidden Markov Models.

Since I’ve currently shifted to 5th gear, I’ve decided to keep a blog with the developments of the framework and, in particular, with the developments of my HMM research.

As of now I have a primitive HMM built after processing random Wikipedia pages that I’ve labeled manually through an ad-hoc UI app. The infrastructure built around the HMM begins to look impressive – the labeling UI app itself is extremely handy, and a bunch of tools allow me to analyze and slice the HMM any way I want.

I’m publishing here the current results, which are pretty sad, I’d say :-(

The graph below shows min, max, and average values of the precision, recall, and F-score metrics out of my HMM for extracting people names, trained on different percentages of my current labeled text corpus (167 sentences containing 216 tags).

Person HMM Performance on 2008-12-13

Person HMM Performance on 2008-12-13

Needless to say, I’ve got a long way to go :-) But I’ve got a strategy that I’m working on now and I will get back here with results.



Follow

Get every new post delivered to your Inbox.