-------------------------------------------------------------------------------
FROM MIT PRESS COMPUTATIONAL LINGUISTICS on 03/06/2014
-------------------------------------------------------------------------------

-------------------------------------------------------------------
Reviewer B:

2 What is this paper about?

The paper discusses a variation on a typical hybrid system architecture for time expression recognition and normalization, in which a machine-learning-based time expression recognizer is followed by a rule-based time value normalizer. The variation consists of post-processing rules applied to the output of the machine-learning-based recognizer, and this addition significantly improves recognition performance on the TempEval-3 task.

3 Strengths and Weaknesses

Strengths:

Temporal information extraction is an important problem area. Within that area, the task of time expression recognition and normalization has been studied quite thoroughly. The use of post-processing rules on top of machine learning is shown to have a real benefit for time expression recognition in this particular task.

Weaknesses:

The basic problem with this paper is its extremely narrow scope. The post-processing rules are engineered for a particular task (TempEval-3), and results are provided for that task alone. The rules include switching the classifier's labels based on particular feature counts in the training data. No attempt is made to see whether the methods work on other tasks (including prior TempEvals), on other sorts of data (such as conversations, where language is more informal), or on other languages (which have been addressed in a number of competitions, including the TempEvals). Would the post-processing rules need to be adjusted for different tasks? If so, what is the precise contribution?

Section 1 (Introduction) is not the sort of introduction expected of a journal article. The paper should discuss why temporal information extraction is important, what the interesting challenges are, the different subproblems in the area (event anchoring, event ordering, time expression tagging and normalization), broader applications such as question answering, chronologies, and data mining, and why this particular subproblem is of interest.

Section 2 (Related Work) misses a number of key references on time expression tagging. It also fails to mention work on time expression tagging and normalization in other languages.

In Section 3, Table 3 appears to be tested on data that is both gold and not gold (i.e., machine-annotated "silver"). It is acceptable to train on silver data, but testing on it without separately testing on gold is not very insightful. It might also be worth trying an ensemble method based on the two separate resources (gold and silver) to see whether using the silver data improves performance.

As for normalization, it is not clear what is really new here. How does it compare with the approach of Ahn et al. (2005)? Does machine learning in part of that task yield any benefits? This seems the harder part of the task, and historically accuracies for normalization have never approached those of recognition. Is there any hope of improving performance on this subtask further? More generally, these normalization rules would be language-specific, right? Is the architecture flexible enough to plug in language-specific dictionaries to mitigate the extent of the language dependency?
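To make this last question concrete, here is a minimal sketch of the kind of design it points at: a rule-based normaliser whose lexical triggers are read from a per-language dictionary, so that porting to a new language mostly means swapping dictionaries. The names, entries and values below are purely illustrative and are not taken from ManTIME.

```python
# Illustrative sketch only; none of these names or entries come from ManTIME.
from typing import Dict, Optional

# Hypothetical per-language trigger dictionaries mapping frozen duration
# expressions to TIMEX-style values.
TRIGGERS: Dict[str, Dict[str, str]] = {
    "en": {"two weeks": "P2W", "a month": "P1M"},
    "it": {"due settimane": "P2W", "un mese": "P1M"},
}

def normalise(expression: str, lang: str = "en") -> Optional[str]:
    """Return a value for a recognised expression, or None if no trigger matches."""
    for trigger, value in TRIGGERS.get(lang, {}).items():
        if trigger in expression.lower():
            return value
    return None  # fall back to the language-independent rules

print(normalise("over the next two weeks"))             # P2W
print(normalise("nelle prossime due settimane", "it"))  # P2W
```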
Missing references:

Alexandersson, Jan, Norbert Reithinger, and Elisabeth Maier (1997). Insights into the Dialogue Processing of VERBMOBIL. Proceedings of the Fifth Conference on Applied Natural Language Processing (ANLP'1997) (Washington), 33-40.

Busemann, Stephan, Thierry Declerck, Abdel Diagne, Luca Dini, Judith Klein, and Sven Schmeier (1997). Natural Language Dialogue Service for Appointment Scheduling Agents. Proceedings of the Fifth Conference on Applied Natural Language Processing (ANLP'1997) (Washington), 25-32.

Mani, Inderjeet, Ben Wellner, Marc Verhagen, Chong Min Lee, and James Pustejovsky (2006). Machine Learning of Temporal Relations. Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL'2006) (Sydney), 753-760.

Wiebe, Janyce, Tom O'Hara, Thorsten Ohrstrom-Sandgren, and Kenneth McKeever (1998). An Empirical Approach to Temporal Reference Resolution. Journal of Artificial Intelligence Research, 9, 247-293.

-------------------------------------------------------------------
Reviewer C:

2 What is this paper about?

This paper presents a description of the ManTIME system as used in TempEval 2013. It includes an exhaustive list of ManTIME's features, a careful statistical analysis used for model selection, a description of several post-processing heuristics that were applied, and a brief description of the normalizer. The evaluation repeats the TempEval 2013 results and performs a brief error analysis.

3 Strengths and Weaknesses

Strengths:

This paper is a clear description of the ManTIME system, and at least the time expression extraction component should be quite replicable from the description. The paper is also quite careful in its evaluation, and uses statistical significance testing appropriately in several places.

Weaknesses:

The biggest weakness of this paper is that it does not really shed any new light on the field of temporal expression extraction and normalization. It simply presents the results of a single system. What does this system do that other systems do not? What do the experiments here reveal that was not revealed in the TempEval 2013 paper, which already compared across systems? The paper as it stands would make a great technical report on the ManTIME system, but I just do not see how it adds any new knowledge to the field, which is something I would expect of a journal article.

There is also a significant issue with one major claim of the paper: "It [ManTIME] is also the best performing machine learning-based system in the temporal expression extraction task." ManTIME had the highest lenient F1, but ClearTK-TimeML had the highest strict F1. ClearTK-TimeML was also a feature-based BIO-classification model, and therefore almost directly comparable in architecture to ManTIME, but it was not mentioned in the article. I think there needs to be some significant discussion of the similarities and differences between the two systems, and if/where one system is better than the other.

There are also a variety of more minor issues:

* For normalization, it may be worth mentioning the paper "A Synchronous Context Free Grammar for Time Normalization" (2013) in your review of time normalizers, as that system outperforms Llorens et al. (2012), which you also mention.

* The error analysis was a bit too minimal. It should at least categorize the errors and provide counts of the number of times each category of error appeared. Doing so might be one step toward identifying issues beyond what the TempEval 2013 paper has already discussed.

* "CRFs... leads to possibly inconsistent sequences of labels": please clarify that this is a limitation of the CRF++ implementation you have chosen (if indeed it is such a limitation). CRFs in general can disallow inappropriate transitions; for example, Mallet's CRF allows you to disable transitions like O-I using the "--forbidden" flag.

* It is unclear what ManTIME does with an O-I sequence without the BIO fixer (see the sketch after this list). Does it treat it as O-B or O-O? That is, does it create a time expression including that I, or throw it away entirely?

* "true positive, false positive and false negative": please define what these terms mean in this particular task (i.e., in terms of the temporal expressions predicted by the system).

* "for the expression '20th Century', a value '19' was provided instead of '19XX'": "19" is the correct annotation here per the TimeML/TIDES guidelines. The many "19XX" values for this kind of expression in the AQUAINT corpus are incorrect.

* "the expression 'a decade' was normalised with 'P10Y' instead of the more correct 'P1E'": the correct annotation here is "P1DE" per the TimeML/TIDES guidelines. Again, the "E" notation is an error in the AQUAINT corpus. (Agreed, though, that "P10Y" is clearly wrong here.)
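To make the O-I question above concrete, here is a minimal sketch of one plausible post-hoc repair (promoting a dangling I to B, i.e., the O-B reading). Whether ManTIME's BIO fixer actually does this, drops the token, or does something else is exactly what should be clarified; note also that, as mentioned above, a CRF decoder can simply forbid the O-I transition in the first place.

```python
# Illustrative only: one plausible post-hoc "BIO fixer". This is not claimed to
# be ManTIME's behaviour; it simply shows the O-B repair option.
from typing import List

def fix_bio(labels: List[str]) -> List[str]:
    """Repair a label sequence so that every I tag is preceded by B or I."""
    fixed: List[str] = []
    for i, label in enumerate(labels):
        if label.startswith("I") and (i == 0 or fixed[i - 1] == "O"):
            fixed.append("B")   # promote a dangling I to B (the O-B reading)
        else:
            fixed.append(label)
    return fixed

print(fix_bio(["O", "I", "I", "O"]))  # ['O', 'B', 'I', 'O']
```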
* "These results suggests” => "These results suggest” * "consists in” => “consists of" ------------------------------------------------------------------------------- FROM ELSEVIER DATA & KNOWLEDGE ENGINEERING on 24/06/2015 ---------------------- ------------------------------------------------------------------------------- ------------------------------------------------------------------ Reviewer #1: The paper presents a system for temporal expression identification and normalization (ManTIME). The paper is well written and well structured. The authors present a system to solve temporal expression identification and normalization. The paper includes a complete evaluation using the appropriate datasets and metrics and the results are comparable to those obtained in the state-of-the-art. The authors conclude that the use of some WordNet-based features are negative for the identification phase, and there is no statistically significant difference in the results based on gazetteers, shallow parsing and prepositional noun phrase labels used on top of a morpho-lexical features. Furthermore, they re-confirm the TempEval-3 conclusion that the use of silver annotated data to train the identification models does not improve the performance. The strong points of the paper include: - a good description of the related work - the review of a considerable amount of features in the identification phase - the comparison of different combinations of features in the ML and also in the post-processing phase - a detailed evaluation - the availability of the presented system The following points could be improved: - The reason why some of the features are used/tested for temporal expression identification is not clear: - Why using gazetteers about US cities and female names? - Why using the number of senses in WN? and antonyms? - In general, a motivation for the use of that features would contribute to a better understanding of the system features - The post-processing modules could be better understood if the authors include some examples. For example, include, in probabilistic correction, the probability of a token in a sentence before and after the application of this module. The threshold-based label switcher seems to cause some false positives, what is the contribution of this module alone? - In the normalization step, some post-manipulation rules (i.e., the frozen expressions or named timexes) are not clearly differentiated from extension rules. - In the evaluation, why the BIO fixer provided a negative contribution? Why things like "of flu" and "and" are wrongly identified as timexes? - In the evaluation, in the normalization errors section 4.4.2, in addition to indicate the number of cases e.g., 18 cases it would be great to indicate the percentage it represents The novelty and the contribution of the paper are not clear or not stressed enough and that makes the contribution weaker. - The authors use common techniques for both identification (ML: CRF) and normalization (rule-based). Both methods have already been used in the field with similar feature sets and system architecture. Furthermore, the features used are not new in the field either. - The performance rate obtained is very similar to state-of-the-art systems' performance. ------------------------------------------------------------------ Reviewer #2: This paper describes ManTIME, a system for recognizing temporal expressions and assigning normalized values to those expressions. 
------------------------------------------------------------------
Reviewer #2:

This paper describes ManTIME, a system for recognizing temporal expressions and assigning normalized values to those expressions. ManTIME is an existing system that has been described before elsewhere, for example in the context of the TempEval challenge. The system performs well and is freely available. The paper focuses on experiments exploring features, data sets and processing modules in the system.

In general the paper fits the DKE journal, and it is well written and mostly reasonably easy for me to follow. But it does not significantly add insights that were not previously expressed in prior work, most notably the TempEval-3 paper (which was included by the authors). Some notable additions are more details on the features and some error analysis. Major revisions would be needed to add more content.

Other comments.

It is not technically correct to say that the approach ranked 3rd out of 21 participants in TempEval-3. There were about ten participants, and together they submitted 21 runs of their systems. ManTIME submitted six runs, and the best of them ranked 5th. It is true, though, that only two other systems had better results.

I do not know much about CRFs, so I could not follow the prose on the factor graph.

The motivation for the post-processing is not clear. Early in Section 3.1 it says "Because of this trade-off, we also developed a post-processing pipeline", and later it is intimated that the silver data motivated the post-processing. This I do not understand. While on the topic, it is claimed that the optimal sequence is "Probabilistic correction module, BIO fixer, Threshold-based label switcher, BIO fixer", but other sequences are not presented. My hunch is that the BIO fixer alone would get similar results.

Just under the list of the four models (page 7) it says "Rather than making an educated and ad-hoc informed guess on the training data, we performed an extensive statistical evaluation." I am not sure what is being said by the first part. Also, the evaluation covers those four models only, not any of the other possible permutations.

The last paragraph of page 7 is a bit muddy. Why is opting for Model 1 done to mitigate overfitting? Is a smaller feature set less likely to lead to overfitting? The use of a cross-validation set and a development set is confusing, especially since the development set is really used as a test set.

Table 3 contains the first mention of the ANOVA analysis. I am pleased to see an indication of whether results are statistically significant. It might be good to mention that small p-values indicate significance.

The explanation of normalization could be better. It is clear what the pre-processing rules do, but I can only guess that the extension rules generate ISO 8601 values for some expressions that TRIOS does not handle.

I do not think that Table 6 shows that "The normalisation phase benefits of more precisely identified temporal expressions".

Why does the relatively low type accuracy suggest "the normaliser's inability to recognize new lexical patterns"? (page 15)

Why is it surprising that the silver data did not improve system performance? I wish I remembered the reference, but I recently read a paper describing experiments which showed that systems trained on small amounts of gold data often outperform systems trained on large amounts of silver data.

In the conclusion it says "This conclusion, although statistically significant, is necessarily limited by two factors: (I) the temporal information domain, and (II) the way WordNet has been used to generate features." This I do not understand.

Minor remarks.
Page 4: "performance is sensibly lower" ==> "performance is noticeably lower"

First sentence of 3.1: "their right boundary" ==> right as in "correct" or as in "not left"?

Footnote 1 on page 6 does not seem to be in the right spot.

Page 7: "In virtue of this analysis" ==> "In light of this analysis" or "By virtue of this analysis"

Page 12: "Two attributes are particularly important with this respect" ==> "Two attributes are particularly important in this respect"

Page 12: "our previous open-source rule-based normaliser already been proved to provide state-of-the-art performance" ==> rephrase; some words are missing.

Table 5, page 14: It does not seem right that TimeBank has only 700 sentences. That would mean that the average TimeBank sentence has almost 90 words (see the quick arithmetic check at the end of these remarks).

Page 16, Figure 6: this is hard to read in print.
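The quick arithmetic check behind the Table 5 remark above. The token count is an assumption here (TimeBank 1.2 is usually reported at roughly 61,000 tokens); it is not stated in the remark itself.

```python
# Sanity check of the reported sentence count, under an assumed corpus size.
timebank_tokens = 61_000     # assumed approximate size of TimeBank 1.2
reported_sentences = 700     # figure questioned in the remark above
print(timebank_tokens / reported_sentences)  # about 87 words per sentence, implausibly long for newswire
```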