A few days ago I finished the exhausting re-labeling effort that I had talked about previously, and I started running tests to see where I finally stand against my stated goal.
At first the results were a bit discouraging. With the final set of labeled documents (2,000 sentences added to the previous 2,500), the overall F score went down slightly. I was half expecting this, though: the more training data from heterogeneous domains you add to the mix, the more variance there is in the thing being learned, which means both more potential for errors and more difficulty in learning. This is nicely explained by the following excerpt from this paper on static code analysis tools:
The result of summing many independent random variables? A Gaussian distribution, most of it not on the points you saw and adapted to in the lab. Furthermore, Gaussian distributions have tails. As the number of samples grows, so, too, does the absolute number of points several standard deviations from the mean. The unusual starts to occur with increasing frequency.
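To put rough numbers on the excerpt's point about tails, here is a small illustrative sketch. It assumes an idealized standard normal distribution, which is a simplification and not a claim about how my labeled sentences are actually distributed; the function names are made up for the example. It only shows that the expected number of points more than 3 standard deviations from the mean grows in proportion to the sample size.

```python
import math


def two_sided_tail_probability(k: float) -> float:
    """P(|Z| > k) for a standard normal variable Z."""
    return math.erfc(k / math.sqrt(2))


def expected_outliers(n_samples: int, k: float) -> float:
    """Expected number of samples lying more than k standard deviations from the mean."""
    return n_samples * two_sided_tail_probability(k)


# 2,500 and 4,500 mirror the sentence counts before and after the re-labeling effort;
# using them as "sample counts" is purely illustrative.
for n in (2500, 4500):
    print(n, expected_outliers(n, 3.0))
```

The tail probability stays fixed at roughly 0.27%, so nearly doubling the data nearly doubles the absolute number of unusual cases the model has to cope with.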
The final overall F score (0.864 at 90%) is still short of my goal of 0.900 at 60%, and the entity-specific F scores (e.g. 0.900 for GeoLocation entities and 0.915 for Person entities) are below the F scores reported by research projects in entity extraction, which are all around 0.93.
So, this very morning I decided to do a real-world test: I took a few articles from CNN, fed these to my HMM, and observed the results. I was astonished! The little guy did extremely well with these pieces of text it had never seen before. Here are a few examples; each extracted entity is listed with its entity type and its probability:
Example 1:
Greene: “This is about the limitless capacity of the human heart.” Bob Greene says a small town in Ohio is one of the most inspiring places in the United States.
- Greene (Person: 5.99677679935634E-05)
- Bob Greene (Person: 1.25848620925595E-07)
- Ohio (GeoLocation: 0.001397929451232)
- United States (GeoLocation: 0.00286421623850843)
Example 2:
Until, on July 20, 1969, Neil Armstrong, of Wapakoneta, walked on the moon.
- July 20 , 1969 (Time: 0.000150556184495453)
- Neil Armstrong (Person: 1.33506658714912E-07)
- Wapakoneta (GeoLocation: 6.91960283284816E-07)
- moon (AstronomicalPlace: 0.351350422734393)
Example 3:
A soldier mans a weapon at the rear of a U.S. Army helicopter over Afghanistan in May.
- U.S. Army (Organization: 3.40883387762237E-06)
- Afghanistan (GeoLocation: 0.000349482362808)
- May (Time: 0.00299625468107284)
Example 4:
Senate Judiciary Committee considers Sotomayor nomination on Tuesday.
- Senate Judiciary Committee (Organization: 3.66993148614553E-10)
- Sotomayor (Person: 5.60905081933395E-07)
- Tuesday (Time: 0.0389513108539469)
So, why the poor F score and the good results? Well, I think I’ve found the explanation. As I said here, when I calculate the performance of my HMMs I’m being very strict with myself: all the papers I’ve read, in fact, count the number of tokens correctly tagged by their systems, while I count the number of correct tags, that is, whole entities extracted exactly as labeled. This means that when my HMM extracts “Ohio” from “I’m going to Northern Ohio”, I count that as zero recall: the expected tag is “Northern Ohio” and my guy hasn’t found it. Research papers, on the other hand, would count that as one token out of two, which yields a recall of 0.5.
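The difference between the two counting methods can be sketched in a few lines. This is a minimal illustration and not my actual evaluation code: the span representation (start token, end token, entity type) and the function names are assumptions made for the example.

```python
from typing import List, Set, Tuple

# (start token index, end token index exclusive, entity type) -- hypothetical representation
Span = Tuple[int, int, str]


def exact_match_recall(gold: List[Span], predicted: List[Span]) -> float:
    """Strict recall: a gold entity counts only if an identical span was predicted."""
    if not gold:
        return 0.0
    predicted_set = set(predicted)
    hits = sum(1 for span in gold if span in predicted_set)
    return hits / len(gold)


def token_tags(spans: List[Span]) -> Set[Tuple[int, str]]:
    """Expand spans into a set of (token index, entity type) pairs."""
    return {(i, label) for start, end, label in spans for i in range(start, end)}


def token_level_recall(gold: List[Span], predicted: List[Span]) -> float:
    """Lenient recall: every correctly tagged token earns partial credit."""
    gold_tokens = token_tags(gold)
    if not gold_tokens:
        return 0.0
    return len(gold_tokens & token_tags(predicted)) / len(gold_tokens)


# Tokens: I'm(0) going(1) to(2) Northern(3) Ohio(4)
gold = [(3, 5, "GeoLocation")]       # expected: "Northern Ohio"
predicted = [(4, 5, "GeoLocation")]  # extracted: "Ohio"

print(exact_match_recall(gold, predicted))  # 0.0
print(token_level_recall(gold, predicted))  # 0.5
```

Precision can be computed the same way by swapping the roles of the gold and predicted spans, and the F score follows from the two.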
With this in mind, the results are so good that I’ve decided to set in motion the “release” machine. It took me a couple of years but the first piece of HiSam will finally be live soon!!!!
These are the last TODO items before I start working on the commercial offering:
- Add an option to calculate the F score using the research papers’ method, and compare this score to their score;
- Label a few more documents in order to reach better stability and see whether the learning curve shifts up;
- Compress the XML serialization of the model: the current XML takes up 800 MB of disk space and takes forever to load…
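For the last item, one possible approach is plain gzip compression of the serialized file. The sketch below is in Python and only illustrates the idea; the file names and the tiny stand-in XML are hypothetical, and it says nothing about how the model is actually serialized. Note that this addresses disk space; whether it also helps loading time depends on where the time is really spent (disk I/O versus XML parsing), which I have not measured.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def compress_file(source: Path, target: Path) -> None:
    """Stream-compress source into a gzip file without loading it all into memory."""
    with open(source, "rb") as raw, gzip.open(target, "wb") as packed:
        shutil.copyfileobj(raw, packed)


def read_compressed(path: Path) -> bytes:
    """Read back the decompressed bytes of a gzip file."""
    with gzip.open(path, "rb") as packed:
        return packed.read()


# Small stand-in for the real model file (hypothetical content and names).
with tempfile.TemporaryDirectory() as workdir:
    xml_path = Path(workdir) / "model.xml"
    gz_path = Path(workdir) / "model.xml.gz"
    xml_path.write_text(
        "<model>" + "<state name='Person'/>" * 1000 + "</model>", encoding="utf-8"
    )
    compress_file(xml_path, gz_path)
    original_size = xml_path.stat().st_size
    compressed_size = gz_path.stat().st_size
    round_trip_ok = read_compressed(gz_path) == xml_path.read_bytes()

print(original_size, compressed_size, round_trip_ok)
```

XML tends to be highly repetitive, so it usually compresses well, but the actual ratio for the real model would have to be measured.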