-------------------------------------------------------------------------------
FROM MIT PRESS COMPUTATIONAL LINGUISTICS on 03/06/2014
-------------------------------------------------------------------------------

-------------------------------------------------------------------
Reviewer B:

2 What is this paper about?

The paper discusses a variation on a typical hybrid system architecture for time expression recognition and normalization, in which a machine-learning-based time expression recognizer is followed by a rule-based time value normalizer. The variation consists of post-processing rules applied to the output of the machine-learning-based recognizer, and this addition significantly improves recognition performance on the TempEval-3 task.

3 Strengths and Weaknesses

Strengths:

Temporal information extraction is an important problem area. Within that area, the task of time expression recognition and normalization has been studied quite thoroughly. The use of post-processing rules on top of machine learning is shown to have a real benefit for time expression recognition in this particular task.

Weaknesses:

The basic problem with this paper is its extremely narrow scope. The post-processing rules are engineered for a particular task (TempEval-3), and results are provided for that task alone. The rules include switching the classifier's labels based on particular feature counts in the training data. No attempt is made to see whether the methods work on other tasks (including prior TempEvals), on other sorts of data (such as conversations, where language is more informal), or on other languages (which have been addressed in a number of competitions, including the TempEvals). Would the post-processing rules need to be adjusted for different tasks? If so, what is the precise contribution?

Section 1 (Introduction) is not the sort of introduction expected of a journal article. The paper should discuss why temporal information extraction is important, what the interesting challenges are, the different subproblems in the area (event anchoring, event ordering, time expression tagging and normalization), broader applications such as question answering, chronologies, and data mining, and why this particular subproblem is of interest.

Section 2 (Related Work) misses a number of key references on time expression tagging. It also fails to mention work on time expression tagging and normalization in other languages.

In Section 3, Table 3 appears to be tested on data that is both gold and not gold (i.e., machine-annotated "silver"). It is acceptable to train on silver data, but testing on it without separately testing on gold is not very insightful. It might also be worth trying an ensemble method based on the two separate resources (gold and silver) to see whether using the silver data improves performance.

As for normalization, it is not clear what is really new here. How does it compare with the approach of Ahn et al. (2005)? Does machine learning in part of that task yield any benefits? This seems the harder part of the task, and historically accuracies for normalization have never approached those of recognition. Is there any hope of improving performance on this subtask further? More generally, these normalization rules would be language-specific, right? Is the architecture flexible enough to plug in language-specific dictionaries to mitigate the extent of the language dependency?
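To make this last question concrete, here is a minimal sketch of the kind of design it points at: a rule-based normaliser whose lexical triggers are read from a per-language dictionary, so that porting to a new language mostly means swapping dictionaries. The names, entries and values below are purely illustrative and are not taken from ManTIME.

```python
# Illustrative sketch only; none of these names or entries come from ManTIME.
from typing import Dict, Optional

# Hypothetical per-language trigger dictionaries mapping frozen duration
# expressions to TIMEX-style values.
TRIGGERS: Dict[str, Dict[str, str]] = {
    "en": {"two weeks": "P2W", "a month": "P1M"},
    "it": {"due settimane": "P2W", "un mese": "P1M"},
}

def normalise(expression: str, lang: str = "en") -> Optional[str]:
    """Return a value for a recognised expression, or None if no trigger matches."""
    for trigger, value in TRIGGERS.get(lang, {}).items():
        if trigger in expression.lower():
            return value
    return None  # fall back to the language-independent rules

print(normalise("over the next two weeks"))             # P2W
print(normalise("nelle prossime due settimane", "it"))  # P2W
```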
Missing references:

Alexandersson, Jan, Norbert Reithinger, and Elisabeth Maier (1997). Insights into the Dialogue Processing of VERBMOBIL. Proceedings of the Fifth Conference on Applied Natural Language Processing (ANLP'1997) (Washington), 33-40.

Busemann, Stephan, Thierry Declerck, Abdel Diagne, Luca Dini, Judith Klein, and Sven Schmeier (1997). Natural Language Dialogue Service for Appointment Scheduling Agents. Proceedings of the Fifth Conference on Applied Natural Language Processing (ANLP'1997) (Washington), 25-32.

Mani, Inderjeet, Ben Wellner, Marc Verhagen, Chong Min Lee, and James Pustejovsky (2006). Machine Learning of Temporal Relations. Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL'2006) (Sydney), 753-760.

Wiebe, Janyce, Tom O'Hara, Thorsten Ohrstrom-Sandgren, and Kenneth McKeever (1998). An Empirical Approach to Temporal Reference Resolution. Journal of Artificial Intelligence Research, 9, 247-293.

-------------------------------------------------------------------
Reviewer C:

2 What is this paper about?

This paper presents a description of the ManTIME system as used in TempEval 2013. It includes an exhaustive list of ManTIME's features, a careful statistical analysis used for model selection, a description of several post-processing heuristics that were applied, and a brief description of the normalizer. The evaluation repeats the TempEval 2013 results and performs a brief error analysis.

3 Strengths and Weaknesses

Strengths:

This paper is a clear description of the ManTIME system, and at least the time expression extraction component should be quite replicable from the description. The paper is also quite careful in its evaluation, and uses statistical significance testing appropriately in several places.

Weaknesses:

The biggest weakness of this paper is that it does not really shed any new light on the field of temporal expression extraction and normalization. It simply presents the results of a single system. What does this system do that other systems do not? What do the experiments here reveal that was not revealed in the TempEval 2013 paper, which already compared across systems? The paper as it stands would make a great technical report on the ManTIME system, but I just do not see how it adds any new knowledge to the field, which is something I would expect of a journal article.

There is also a significant issue with one major claim of the paper: "It [ManTIME] is also the best performing machine learning-based system in the temporal expression extraction task." ManTIME had the highest lenient F1, but ClearTK-TimeML had the highest strict F1. ClearTK-TimeML was also a feature-based BIO-classification model, and therefore almost directly comparable in architecture to ManTIME, but it was not mentioned in the article. I think there needs to be some significant discussion of the similarities and differences between the two systems, and if/where one system is better than the other.

There are also a variety of more minor issues:

* For normalization, it may be worth mentioning the paper "A Synchronous Context Free Grammar for Time Normalization" (2013) in your review of time normalizers, as that system outperforms Llorens et al. (2012), which you also mention.

* The error analysis was a bit too minimal. It should at least categorize the errors and provide counts of the number of times each category of error appeared. Doing so might be one step toward identifying issues beyond what the TempEval 2013 paper has already discussed.

* "CRFs... leads to possibly inconsistent sequences of labels": please clarify that this is a limitation of the CRF++ implementation you have chosen (if indeed it is such a limitation). CRFs in general can disallow inappropriate transitions; for example, Mallet's CRF allows you to disable transitions like O-I using the "--forbidden" flag.

* It is unclear what ManTIME does with an O-I sequence without the BIO fixer (see the sketch after this list). Does it treat it as O-B or O-O? That is, does it create a time expression including that I, or throw it away entirely?

* "true positive, false positive and false negative": please define what these terms mean in this particular task (i.e., in terms of the temporal expressions predicted by the system).

* "for the expression '20th Century', a value '19' was provided instead of '19XX'": "19" is the correct annotation here per the TimeML/TIDES guidelines. The many "19XX" values for this kind of expression in the AQUAINT corpus are incorrect.

* "the expression 'a decade' was normalised with 'P10Y' instead of the more correct 'P1E'": the correct annotation here is "P1DE" per the TimeML/TIDES guidelines. Again, the "E" notation is an error in the AQUAINT corpus. (Agreed, though, that "P10Y" is clearly wrong here.)
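To make the O-I question above concrete, here is a minimal sketch of one plausible post-hoc repair (promoting a dangling I to B, i.e., the O-B reading). Whether ManTIME's BIO fixer actually does this, drops the token, or does something else is exactly what should be clarified; note also that, as mentioned above, a CRF decoder can simply forbid the O-I transition in the first place.

```python
# Illustrative only: one plausible post-hoc "BIO fixer". This is not claimed to
# be ManTIME's behaviour; it simply shows the O-B repair option.
from typing import List

def fix_bio(labels: List[str]) -> List[str]:
    """Repair a label sequence so that every I tag is preceded by B or I."""
    fixed: List[str] = []
    for i, label in enumerate(labels):
        if label.startswith("I") and (i == 0 or fixed[i - 1] == "O"):
            fixed.append("B")   # promote a dangling I to B (the O-B reading)
        else:
            fixed.append(label)
    return fixed

print(fix_bio(["O", "I", "I", "O"]))  # ['O', 'B', 'I', 'O']
```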
* "These results suggests” => "These results suggest” * "consists in” => “consists of" ------------------------------------------------------------------------------- FROM ELSEVIER DATA & KNOWLEDGE ENGINEERING on 24/06/2015 ---------------------- ------------------------------------------------------------------------------- ------------------------------------------------------------------ Reviewer #1: The paper presents a system for temporal expression identification and normalization (ManTIME). The paper is well written and well structured. The authors present a system to solve temporal expression identification and normalization. The paper includes a complete evaluation using the appropriate datasets and metrics and the results are comparable to those obtained in the state-of-the-art. The authors conclude that the use of some WordNet-based features are negative for the identification phase, and there is no statistically significant difference in the results based on gazetteers, shallow parsing and prepositional noun phrase labels used on top of a morpho-lexical features. Furthermore, they re-confirm the TempEval-3 conclusion that the use of silver annotated data to train the identification models does not improve the performance. The strong points of the paper include: - a good description of the related work - the review of a considerable amount of features in the identification phase - the comparison of different combinations of features in the ML and also in the post-processing phase - a detailed evaluation - the availability of the presented system The following points could be improved: - The reason why some of the features are used/tested for temporal expression identification is not clear: - Why using gazetteers about US cities and female names? - Why using the number of senses in WN? and antonyms? - In general, a motivation for the use of that features would contribute to a better understanding of the system features - The post-processing modules could be better understood if the authors include some examples. For example, include, in probabilistic correction, the probability of a token in a sentence before and after the application of this module. The threshold-based label switcher seems to cause some false positives, what is the contribution of this module alone? - In the normalization step, some post-manipulation rules (i.e., the frozen expressions or named timexes) are not clearly differentiated from extension rules. - In the evaluation, why the BIO fixer provided a negative contribution? Why things like "of flu" and "and" are wrongly identified as timexes? - In the evaluation, in the normalization errors section 4.4.2, in addition to indicate the number of cases e.g., 18 cases it would be great to indicate the percentage it represents The novelty and the contribution of the paper are not clear or not stressed enough and that makes the contribution weaker. - The authors use common techniques for both identification (ML: CRF) and normalization (rule-based). Both methods have already been used in the field with similar feature sets and system architecture. Furthermore, the features used are not new in the field either. - The performance rate obtained is very similar to state-of-the-art systems' performance. ------------------------------------------------------------------ Reviewer #2: This paper describes ManTIME, a system for recognizing temporal expressions and assigning normalized values to those expressions. 
------------------------------------------------------------------
Reviewer #2:

This paper describes ManTIME, a system for recognizing temporal expressions and assigning normalized values to those expressions. ManTIME is an existing system that has been described before elsewhere, for example in the context of the TempEval challenge. The system performs well and is freely available. The paper focuses on experiments exploring features, data sets and processing modules in the system.

In general the paper fits the DKE journal, and it is well written and mostly reasonably easy for me to follow. But it does not significantly add insights that were not previously expressed in prior work, most notably the TempEval-3 paper (which was included by the authors). Some notable additions are more details on the features and some error analysis. Major revisions would be needed to add more content.

Other comments.

It is not technically correct to say that the approach ranked 3rd out of 21 participants in TempEval-3. There were about ten participants, and together they submitted 21 runs of their systems. ManTIME submitted six runs, and the best of them ranked 5th. It is true, though, that only two other systems had better results.

I do not know much about CRFs, so I could not follow the prose on the factor graph.

The motivation for the post-processing is not clear. Early in Section 3.1 it says "Because of this trade-off, we also developed a post-processing pipeline", and later it is intimated that the silver data motivated the post-processing. This I do not understand. While on the topic, it is claimed that the optimal sequence is "Probabilistic correction module, BIO fixer, Threshold-based label switcher, BIO fixer", but other sequences are not presented. My hunch is that the BIO fixer alone would get similar results.

Just under the list of the four models (page 7) it says "Rather than making an educated and ad-hoc informed guess on the training data, we performed an extensive statistical evaluation." I am not sure what is being said by the first part. Also, the evaluation covers those four models only, not any of the other possible permutations.

The last paragraph of page 7 is a bit muddy. Why is opting for Model 1 done to mitigate overfitting? Is a smaller feature set less likely to lead to overfitting? The use of a cross-validation set and a development set is confusing, especially since the development set is really used as a test set.

Table 3 contains the first mention of the ANOVA analysis. I am pleased to see an indication of whether results are statistically significant. It might be good to mention that small p-values indicate significance.

The explanation of normalization could be better. It is clear what the pre-processing rules do, but I can only guess that the extension rules generate ISO 8601 values for some expressions that TRIOS does not handle.

I do not think that Table 6 shows that "The normalisation phase benefits of more precisely identified temporal expressions".

Why does the relatively low type accuracy suggest "the normaliser's inability to recognize new lexical patterns"? (page 15)

Why is it surprising that the silver data did not improve system performance? I wish I remembered the reference, but I recently read a paper describing experiments which showed that systems trained on small amounts of gold data often outperform systems trained on large amounts of silver data.

In the conclusion it says "This conclusion, although statistically significant, is necessarily limited by two factors: (I) the temporal information domain, and (II) the way WordNet has been used to generate features." This I do not understand.

Minor remarks.
Page 4: "performance is sensibly lower" ==> "performance is noticeably lower"

First sentence of 3.1: "their right boundary" ==> right as in "correct" or as in "not left"?

Footnote 1 on page 6 does not seem to be in the right spot.

Page 7: "In virtue of this analysis" ==> "In light of this analysis" or "By virtue of this analysis"

Page 12: "Two attributes are particularly important with this respect" ==> "Two attributes are particularly important in this respect"

Page 12: "our previous open-source rule-based normaliser already been proved to provide state-of-the-art performance" ==> rephrase; some words are missing.

Table 5, page 14: It does not seem right that TimeBank has only 700 sentences. That would mean that the average TimeBank sentence has almost 90 words (see the quick arithmetic check at the end of these remarks).

Page 16, Figure 6: this is hard to read in print.
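The quick arithmetic check behind the Table 5 remark above. The token count is an assumption here (TimeBank 1.2 is usually reported at roughly 61,000 tokens); it is not stated in the remark itself.

```python
# Sanity check of the reported sentence count, under an assumed corpus size.
timebank_tokens = 61_000     # assumed approximate size of TimeBank 1.2
reported_sentences = 700     # figure questioned in the remark above
print(timebank_tokens / reported_sentences)  # about 87 words per sentence, implausibly long for newswire
```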