For the past couple of months I’ve paused further development of the model itself (the “MultiEntity” model) in order to focus on a last round of re-labeling of the training data with three goals:
- Label new types of entities (monetary values and time expressions) together with the three “classic” ones (person names, geographical places, and organizations);
- Be more strict with my labeling and make sure to adhere to a set of guidelines that I’ve implemented with the goal of ensuring consistency of the training data;
- Enable a new mechanism that takes advantage of the inner structure of certain entity names – which is proving to be the key difference in reaching very good performance and on which I’m not yet ready to publicly elaborate.
At the same time I have also re-tokenized the training text being labeled, in order to take advantage of some improvements I had made to the tokenization module over the past few years.
On December 22 I reached a milestone: 60% of the “old” training data (about 2,900 sentences) re-tokenized and re-labeled. I decided it was time to pause the re-labeling effort and see whether I was heading in the right direction.
The first performance numbers were encouraging, but the gain over the numbers from before the re-labeling effort was smaller than I had hoped. To investigate the reason for the modest improvement, I employed a very useful technique that I had used earlier with a similar goal, one that I call “self-testing” and that I believe I first came across in the machine learning literature.
Ideally, when you test your trained model on the same data that you have used to train it, you should see no errors in the model’s predictions. It’s kinda like asking a student to repeat the pages of the schoolbook she has just studied. In reality, however, the model does make some errors, exactly like the human student does, and these errors can be attributed to one of two different causes:
- Noisy Training: the training data is not consistent because of some mistakes that took place during the manual labeling, and the learning algorithm is confused by these mistakes. Think of a student being told on Monday that 2 + 2 is 4, and then on Tuesday that 2 + 2 is 5. In my case, it could be that “China” has been tagged as a GeoLocation in one training sentence and erroneously as a Person – or not tagged at all – in another sentence.
- Limited Learning Capability: the model is unable to learn from the training data due to limitations inherent to its design. Think of a primary school student being told that she can integrate Schrödinger’s wave function to get the probability that a particle is at X, Y, Z and has momentum m. In my case, “Capitol Hill” might have been labeled as a GeoLocation in “The teacher lives on Capitol Hill” and as an Organization in “Last week Capitol Hill passed the bill”, and the model might not be considering enough context (prefix and suffix) to discern these two different meanings of “Capitol Hill”.
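A toy sketch of the self-test idea may help here. The majority-label tagger below is only a stand-in, not the MultiEntity model, and the sentences and the `train` and `self_test` names are made up for illustration.

```python
from collections import Counter, defaultdict

# Toy labeled data: each sentence is a list of (token, label) pairs.
# "O" marks tokens outside any entity. All of this is illustrative.
SENTENCES = [
    [("China", "GeoLocation"), ("exports", "O"), ("rose", "O")],
    [("China", "Person"), ("signed", "O"), ("the", "O"), ("deal", "O")],  # labeling mistake
    [("Trade", "O"), ("with", "O"), ("China", "GeoLocation"), ("rose", "O")],
]


def train(sentences):
    """Stand-in model: remember the most frequent label seen for each token."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for token, label in sentence:
            counts[token][label] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}


def self_test(sentences):
    """Train and test on the same data; return every disagreement found."""
    model = train(sentences)
    errors = []
    for index, sentence in enumerate(sentences):
        for token, gold in sentence:
            predicted = model.get(token, "O")
            if predicted != gold:
                errors.append((index, token, gold, predicted))
    return errors


ERRORS = self_test(SENTENCES)
for error in ERRORS:
    print(error)  # (1, 'China', 'Person', 'GeoLocation')
```

The single reported error points straight at the sentence where “China” was tagged as a Person. Note that a per-token tagger like this one cannot tell a labeling mistake from a genuine ambiguity such as “Capitol Hill”; deciding which of the two causes applies still takes a human look at each error.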
When I originally ran the self-test before Christmas (on the 2,900 sentences re-labeled so far) the model came back with hundreds of errors. I spent most of the holidays’ development time analyzing the errors and improving the model, with results that grew more encouraging by the day. This chart shows the daily changes in the average F score of the model when trained with 90% of the re-labeled data and tested on the remaining 10%:
The improvements shown in the graph are due to a combination of interventions.
First of all, the self-test pointed me to a number of labeling errors, which I promptly fixed. When the training data became error-free, I attacked the problem of dealing with tokens containing numbers (e.g. “340”), which I never had reason to worry about before. That work led to the huge improvement in the performance of the newly introduced Currency entity.
Finally, my “secret” recipe kicked in. Leveraging the flexible configurability of the model and fine-tuning it based on analyses of the errors allowed me to obtain the improvements shown for the Person, GeoLocation, and Organization entities. This novel technique exploits the internal structure of certain entities and takes advantage of the fact that these different entities all “live” in the same model.
As an example, consider Organization entities like “Bank of Japan” and “Bank of England”. If the model is capable of understanding that the third token of these entities is always a GeoLocation, then it will be more inclined to flag “Bank of Italy” as an Organization when it sees it for the first time, provided that its vocabulary of GeoLocation literals is comprehensive enough to flag Italy as a GeoLocation. Similarly, the model has been trained to discern, for example, between one-word GeoLocation entities (like “China” and “Italy”) and two-word GeoLocation entities (like “South Korea” and “Northern Ireland”). By being able to make this distinction, the model will be less inclined to flag “Northern” alone as a GeoLocation, which is exactly what used to happen before my intervention.
As one would expect, the number of states has exploded from about 300 to 1,280 after all the fine-tuning, but thanks to Freitag and McCallum’s interpolation of emission probabilities, I haven’t experienced any penalty from the fragmentation of states.
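To give a rough idea of why interpolating emission probabilities softens the cost of having many small states, here is a minimal sketch. It assumes a three-way mix of a state’s own estimate, the estimate of a broader class of related states, and a uniform distribution. The fixed weights, the grouping of states into a class, and all names below are my assumptions for illustration; in Freitag and McCallum’s method the mixture weights are estimated from data rather than set by hand, and this is not the actual MultiEntity code.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Maximum-likelihood emission estimate from a list of observed tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}


def interpolated_emission(token, state_tokens, class_tokens, vocabulary_size,
                          weights=(0.6, 0.3, 0.1)):
    """Mix the state's own estimate with a broader class and a uniform fallback.

    The weights are hypothetical constants that sum to 1.
    """
    w_state, w_class, w_uniform = weights
    p_state = relative_frequencies(state_tokens).get(token, 0.0)
    p_class = relative_frequencies(class_tokens).get(token, 0.0)
    p_uniform = 1.0 / vocabulary_size
    return w_state * p_state + w_class * p_class + w_uniform * p_uniform


# Hypothetical fine-grained state: the third token of "Bank of X" entities.
STATE_TOKENS = ["Japan", "England"]
# Hypothetical broader class: tokens seen by all GeoLocation-like states.
CLASS_TOKENS = ["Japan", "England", "Italy", "China"]

P_ITALY = interpolated_emission("Italy", STATE_TOKENS, CLASS_TOKENS,
                                vocabulary_size=1000)
print(P_ITALY)
```

Even though the narrow state never saw “Italy”, it still assigns it a usable probability because the broader class did. That is the intuition behind being able to split states finely without starving each one of training data.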
All of this contributed to an overall improvement of the F score from 0.795 to 0.869, with Organization entities alone improving to 0.824, well above the best score of 0.755 that I was able to obtain last October before the re-labeling effort and before the finalization of my “secret recipe”. Moreover, the current performance of Person entities exactly matches what the old model achieved when trained with 2,000 more training sentences.
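For readers less familiar with the F score, here is a small sketch of how a per-entity and an average score can be computed on held-out sentences. It assumes exact-match scoring of entity spans and a plain (macro) average over entity types; those choices, the span format, and the toy data are assumptions for illustration, not necessarily how the numbers above were produced.

```python
def f_score(gold, predicted):
    """Harmonic mean of precision and recall over two sets of entity spans."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def average_f_score(gold, predicted, entity_types):
    """Per-type F scores plus their unweighted average."""
    per_type = {}
    for entity_type in entity_types:
        gold_of_type = {span for span in gold if span[-1] == entity_type}
        predicted_of_type = {span for span in predicted if span[-1] == entity_type}
        per_type[entity_type] = f_score(gold_of_type, predicted_of_type)
    return per_type, sum(per_type.values()) / len(per_type)


# Spans are (sentence index, start token, end token, entity type); toy values.
GOLD = {
    (0, 0, 3, "Organization"),
    (0, 5, 6, "GeoLocation"),
    (1, 2, 3, "Person"),
    (1, 4, 5, "GeoLocation"),
}
PREDICTED = {
    (0, 0, 3, "Organization"),
    (0, 5, 6, "GeoLocation"),
    (1, 2, 3, "GeoLocation"),  # a Person mistaken for a GeoLocation
}

PER_TYPE, AVERAGE = average_f_score(
    GOLD, PREDICTED, ["Person", "GeoLocation", "Organization"]
)
print(PER_TYPE, AVERAGE)
```

In this toy case a single mislabeled Person costs twice: it is a miss for Person and a false alarm for GeoLocation, which is why fixing one kind of confusion can lift the scores of two entity types at once.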
At this moment all I need to do is complete the re-labeling (about 2,000 sentences left) and hope that the new training data raises the F score towards my goal.
