I’ve spent the last few days doing a complete makeover of my HMM structure. I’ve fixed many stupid mistakes I had made in the past, and I’ve cleaned up the infrastructure around HMMs.
Most importantly, I’ve tested the impact of the length of prefix and suffix chains – that is, of the chains of states leading to and leaving from the set of states “implementing” the “person name” entity. Here’s a chart depicting the average F for chains of length 2, 3, and 4:

[Chart: Person HMM Performance with Varying Prefix and Suffix Chain Lengths]
It definitely looks like shorter chains yield better performance. That sounds odd, but it probably makes sense: a prefix state at distance 2 from the entity has more “entropy” than a prefix state at distance 1. As an example, compare “A gift by George” and “As told by Mike”: “by”, at distance 1, is a strong indicator that an entity might follow, while “gift” and “told”, at distance 2, are unrelated to each other (a noun and a verb) and less indicative of an imminent entity.
In any case, as you can tell, the overall performance has improved, mostly thanks to the elimination of bad choices from the past. Here’s the current baseline:

[Chart: Person HMM Performance on 24-12-2008]
Although recall is awesome, precision sucks…on average, 2 out of 3 predictions are total garbage.
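To make the arithmetic concrete, here is a minimal sketch of how precision, recall, and F relate. The counts are made up purely for illustration (they are not the measured results), chosen so that 2 out of 3 predictions are wrong; the function name is hypothetical, and the balanced F1 variant is an assumption, since the post only says “F”.

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Compute precision, recall, and F-beta from raw counts.

    tp: predicted entities that are correct
    fp: predicted entities that are wrong (the "garbage")
    fn: true entities that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical counts: 300 predictions, of which 200 are wrong,
# and only 5 true entities missed.
p, r, f = precision_recall_f(tp=100, fp=200, fn=5)
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

With counts like these, even near-perfect recall cannot pull F much above 0.5, because the harmonic mean is dominated by the weaker of the two numbers.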
Here’s how I am planning to improve performance:
- Investigate the effect of training with positive and negative examples, as opposed to training with positive examples only (which is what I do today). I’m running tests right now and will follow up “shortly” (s**t, it takes forever to run the tests…I need a faster box!)
- Investigate the effect of using absolute discounting to smooth symbol emission probabilities (rather than my homegrown unknown lower/upper case symbol thing).
- Investigate the effect of using separate fixed-length chains for entity states. I’m currently using one single chain of varying length; for example, “Gordon Sumner” is modeled with the path “Entity-1” -> “Entity-2”, while “Gordon Matthew Sumner” is modeled with the path “Entity-1” -> “Entity-2” -> “Entity-3”. If I went for separate chains, I’d have a chain of states for each entity length; for example, “Gordon Sumner”, an entity made up of 2 tokens, would be modeled with the path “Entity2-1” -> “Entity2-2” (the number after “Entity” indicating the number of tokens in the entity), while “Gordon Matthew Sumner” would be modeled as “Entity3-1” -> “Entity3-2” -> “Entity3-3”.
- Investigate the use of shrinkage (see Freitag & McCallum) to “uniformize” the emission probabilities of related states; this is particularly important if I go for separate fixed-length entity chains, as I’d have the problem of fragmenting my training data across multiple related states. As an example, with the single varying-length entity chain of today, the state called “Entity-1” emits all the first names that I currently have in my labeled data, which is highly desirable. On the other hand, if I went with separate fixed-length entity chains, all my first names would be divided between states “Entity1-1”, “Entity2-1”, “Entity3-1”, and so on. Shrinkage would kinda “merge” the emission distributions of all the “EntityN-1” states, so that fragmentation of training data would be less of an issue.
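For reference, absolute discounting for a single state’s emission distribution can be sketched roughly as follows. This is an illustration of the general technique, not the code behind the results above: the function name, the discount of 0.5, the toy counts, and the choice to spread the reserved mass uniformly over unseen vocabulary symbols are all assumptions.

```python
from collections import Counter


def absolute_discounting(counts, vocab_size, discount=0.5):
    """Return a function mapping a symbol to its smoothed emission probability.

    A fixed discount is subtracted from every seen symbol's count, and the
    probability mass freed this way is shared equally by the unseen symbols.
    """
    if not 0.0 < discount < 1.0:
        raise ValueError("discount must be strictly between 0 and 1")
    counts = {s: c for s, c in counts.items() if c > 0}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("state has no observed emissions")
    seen = len(counts)
    unseen = vocab_size - seen
    if unseen < 0:
        raise ValueError("vocab_size is smaller than the number of seen symbols")
    if unseen == 0:
        discount = 0.0  # nothing to give the mass to, so do not discount
    reserved = discount * seen / total  # total mass moved to unseen symbols

    def prob(symbol):
        if symbol in counts:
            return (counts[symbol] - discount) / total
        return reserved / unseen if unseen else 0.0

    return prob


# Toy counts for a hypothetical prefix state; vocabulary of 6 symbols.
emit = absolute_discounting(Counter({"by": 3, "from": 1}), vocab_size=6)
print(emit("by"), emit("from"), emit("gift"))
```

The seen and unseen probabilities still sum to 1 over the vocabulary, which is the property that makes the smoothed distribution usable directly as HMM emission probabilities.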
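The fragmentation problem with separate fixed-length chains, and the way shrinkage softens it, can be illustrated with a small sketch. The state labels follow the naming used above; everything else is an assumption: the helper names are hypothetical, the second name in the toy data is invented, and the mixture weight is a fixed constant here purely for simplicity (a real shrinkage setup would estimate the weights from data rather than hard-code them).

```python
from collections import Counter, defaultdict


def single_chain_path(tokens):
    """One varying-length chain: Entity-1 -> Entity-2 -> ..."""
    return [f"Entity-{i}" for i in range(1, len(tokens) + 1)]


def separate_chain_path(tokens):
    """One chain per entity length: EntityN-1 -> ... -> EntityN-N."""
    n = len(tokens)
    return [f"Entity{n}-{i}" for i in range(1, n + 1)]


def emission_counts(names, path_fn):
    """Count which tokens each entity state emits under a labeling scheme."""
    counts = defaultdict(Counter)
    for name in names:
        tokens = name.split()
        for state, token in zip(path_fn(tokens), tokens):
            counts[state][token] += 1
    return counts


def shrink(counts, position, lam=0.5):
    """Mix each EntityN-<position> distribution with the pooled distribution
    of all states at that position. lam is a fixed, hypothetical weight."""
    siblings = [s for s in counts if s.endswith(f"-{position}")]
    pooled = Counter()
    for s in siblings:
        pooled.update(counts[s])
    pooled_total = sum(pooled.values())
    shrunk = {}
    for s in siblings:
        total = sum(counts[s].values())
        shrunk[s] = {
            w: lam * counts[s][w] / total + (1 - lam) * pooled[w] / pooled_total
            for w in pooled
        }
    return shrunk


names = ["Gordon Sumner", "George Matthew Sumner"]  # toy data
print(dict(emission_counts(names, single_chain_path)))    # Entity-1 sees both first names
separate = emission_counts(names, separate_chain_path)
print(dict(separate))                                     # first names split across two states
print(shrink(separate, position=1))                       # each state regains some of the other's mass
```

With the single chain, “Entity-1” sees both first names; with separate chains each “EntityN-1” state sees only one, and the shrunk distributions give every such state some probability for first names it never observed itself.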
In conclusion, the current baseline is an improvement over the past one, but I’m still FAR away from my commercial goal…sigh

