-------------------------------------------------------------------------------
FROM JBI on 23/03/2017 (1st round)
-------------------------------------------------------------------------------

Reviewer #1: The paper giving an overview of Track 2 of the 2016 CEGS N-GRID shared tasks is well written and easy to understand. Here are some comments:

1. In the introduction, the authors should discuss more of the innovation and challenges of the dataset and task. Many similar clinical corpora are available, as the authors mention, and many multi-label classification tasks have been discussed, so why do we need an additional one? A similar issue arises in the "2 Related work" section, where the authors claim that the task "sets new directions of research." Please state what these directions are.

2. "Positive valence domain" is not well defined or summarized in the introduction. Although Figure 1 lists the details of the positive valence domain, the relation between the listed elements and the task itself is unclear. Were these elements used in this task/track, used only to guide the annotation, used to select the notes, or something else?

3. Figure 2 uses "Track 1" and "Track 2"; however, these terms are not addressed or discussed in the document. Please consider using other terms.

4. The authors discuss the annotators' consensus using the official ranking score. Usually a Kappa score is reported (see the illustrative kappa sketch appended after Reviewer #2's comments).

5. The authors propose a customized MAE. Please explain why the customization is needed. The formula has some errors: j should start from 0 instead of 1 and run to |C|-1 instead of |C|. For T_j, the set of test documents labeled with class j, is the label the gold label or the predicted label? Is y_i the value of j? (See the illustrative macro-averaged MAE sketch appended after Reviewer #2's comments.)

6. In Table 1, last row, the values 02, 01, 04, and 03 are confusing. They should be 2, 1, 4, and 3, right?

7. The results section summarizing the methods used by the different teams should add references and citations to their work.

8. Figure 7 and the error analysis section: how are "easy", "medium", and "difficult" cases defined?

Reviewer #2: In this paper, the authors describe the results of the second track of the 2016 CEGS N-GRID shared tasks. This task focuses on predicting symptom severity from neuropsychiatric clinical records, an important task in the precision medicine field. As a preliminary study on symptom severity prediction in neuropsychiatric clinical records, the results obtained by every team participating in the competition can inspire other researchers to conduct further research. It is important to introduce the data annotation process, the evaluation methods, the summary of the competition results, and the detailed analysis of the results, and the authors cover this content in the paper. However, the reviewer thinks that this paper needs a major revision for the reasons described below.

(1) In the related work, the authors note that this task sets new directions of NLP research. However, they only say that "those records present structural and linguistic peculiarities not yet explored in the literature", with only a general illustration. Could the authors show some detailed examples and a more specific explanation? It is very important for other researchers to clearly understand the importance of this work.

(2) There is a serious mistake in the second paragraph of Section 3. "216 out of 451 records" should be changed to "216 out of 541 (325 + 216) records". The reviewer suggests that the authors carefully check all results reported in this paper.
(3) The data annotation process is very important, because it determines the confidence in the results of this shared task. The authors realize this and give a description, but the description makes the reviewer wonder how the organizers ensured the stability of the results, because the annotation work was done under a "time constrained" condition, and the process described in paragraph 3 of Section 4 lacks persuasiveness. Moreover, the authors conclude that annotators' years of experience correlate with the annotators' measured performance with respect to the gold standard. The reviewer thinks that the annotations of annotator 3 are the most similar to the gold standard mainly because most of the gold-standard annotations were determined by annotator 3, as described in paragraph 3 of Section 4.

(4) Figures 2, 3 and 4 add little. The reviewer suggests deleting these figures, because the descriptions in the corresponding sections already express the content clearly.

(5) Equation (1), which is used to compute NMAE^M, has some problems. First, what is the value of the variable y_i? According to the description, y_i is 0 or 1. Assuming the reviewer's understanding is correct, max(y_i, y_i - |C| - 1) makes no sense when |C| is always 4. Moreover, based on the results shown in Figure 4, the reviewer doubts the correctness of the reported scores for the "agreement of each annotator against the gold standard by using the official ranking score".

(6) It is enough to list the summarized results in the last line of Table 1; the detailed information about each team in the rest of Table 1 (e.g., affiliations, countries, number of submissions) is not very important, and removing it would save a lot of space for other, more important content.

(7) The results shown in Table 2 and Figure 5 are duplicated. The reviewer suggests deleting one of them.

(8) Although many detailed analysis results are shown in Figures 6 and 7, the reviewer thinks that the explanations in Section 8 are not careful enough. The reviewer suggests that the authors provide a more careful discussion in order to "set new directions of NLP research", as stated in Section 2.

(9) A more detailed description of Figure 7 is needed. How are the groups in Figure 7 defined, and what is the meaning of the values on the horizontal axis?
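Illustrative sketch for the agreement discussion (Reviewer #1, point 4). This is a minimal example, not the paper's method, of the weighted Cohen's kappa that reviewers typically expect for ordinal severity labels; it uses scikit-learn's cohen_kappa_score, and the annotator label lists are invented toy data, not values from the paper.

    # Pairwise weighted Cohen's kappa between two annotators on a 0-3 severity scale.
    from sklearn.metrics import cohen_kappa_score

    annotator_a = [0, 1, 2, 3, 2, 1, 0, 3]   # hypothetical severity labels (0-3)
    annotator_b = [0, 1, 3, 3, 2, 0, 0, 2]   # hypothetical severity labels (0-3)

    # 'quadratic' weights penalise large ordinal disagreements more heavily,
    # which suits a 4-point severity scale better than unweighted kappa.
    kappa = cohen_kappa_score(annotator_a, annotator_b, weights="quadratic")
    print(f"Quadratically weighted kappa: {kappa:.3f}")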
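Illustrative sketch for the macro-averaged MAE discussion (Reviewer #1, point 5; Reviewer #2, point (5)). This is not the paper's equation (1); it is a minimal Python sketch of the standard macro-averaged MAE, assuming classes are indexed j = 0, ..., |C|-1 and that T_j groups test documents by their gold label. The function name macro_mae and the toy gold/pred lists are invented for illustration only.

    def macro_mae(gold, pred, num_classes=4):
        """Average the per-class MAE over the gold classes j = 0 .. num_classes-1,
        so each severity level contributes equally regardless of its frequency."""
        per_class = []
        for j in range(num_classes):                       # j = 0, ..., |C|-1
            errors = [abs(p - g) for g, p in zip(gold, pred) if g == j]
            if errors:                                     # skip classes absent from the gold standard
                per_class.append(sum(errors) / len(errors))
        return sum(per_class) / len(per_class)

    # Toy example with 4 ordinal severity levels (0 = absent .. 3 = severe);
    # the labels are invented, not taken from the shared-task data.
    gold = [0, 0, 1, 2, 3, 3]
    pred = [0, 1, 1, 3, 3, 2]
    print(macro_mae(gold, pred))   # 0.5 = mean of per-class MAEs 0.5, 0.0, 1.0, 0.5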