Diffstat (limited to 'aied2017')
-rw-r--r--  aied2017/evaluation.tex | 59 ++++++++++++-----------
 1 file changed, 23 insertions(+), 36 deletions(-)
diff --git a/aied2017/evaluation.tex b/aied2017/evaluation.tex
index bd2f168..247c08f 100644
--- a/aied2017/evaluation.tex
+++ b/aied2017/evaluation.tex
@@ -12,14 +12,16 @@ and to retrospectively evaluate quality of given hints.
\hline
& Rules & Maj & RF & All & Imp & All & Imp & Alt & \\
\hline
- sister & 0.988 & 0.663 & 0.983 & 128 & 128 & 127 & 84 & 26 & 34 \\
- del & 0.948 & 0.669 & 0.974 & 136 & 136 & 39 & 25 & 10 & 15 \\
- sum & 0.945 & 0.510 & 0.956 & 59 & 53 & 24 & 22 & 1 & 6 \\
+	sister & 0.988 & 0.719 & 0.983 & 128 & 128 & 127 & 84 & 26 & 34 \\
+ del & 0.948 & 0.645 & 0.974 & 136 & 136 & 39 & 25 & 10 & 15 \\
+ sum & 0.945 & 0.511 & 0.956 & 59 & 53 & 24 & 22 & 1 & 6 \\
is\_sorted & 0.765 & 0.765 & 0.831 & 119 & 119 & 0 & 0 & 0 & 93 \\
- union & 0.785 & 0.766 & 0.813 & 106 & 106 & 182 & 66 & 7 & 6 \\
+ union & 0.785 & 0.783 & 0.813 & 106 & 106 & 182 & 66 & 7 & 6 \\
...& & & & & & & & & \\
\hline
- Average & 0.857 & 0.666 & 0.908 & 79.73 & 77.34 & 46.75 & 26.36 & 5.55 & 23.75 \\
+ Total & & & & 3613 & 3508 & 2057 & 1160 & 244 & 1045 \\
+ \hline
+ Average & 0.857 & 0.663 & 0.908 & 79.73 & 77.34 & 46.75 & 26.36 & 5.55 & 23.75 \\
\hline
\end{tabular}
\caption{Results on 5 selected domains and averaged results over 44 domains.
@@ -29,51 +31,36 @@ and the number of hints that were actually implemented by students. The followin
contain the number of all generated intent hints (All),
the number of implemented hints (Imp) and the number of implemented alternative hints (Alt).
The numbers in the last column are student submissions for which hints could not be generated.
-The last row averages results over all 44 domains.}
+The bottom two rows give aggregated results (total and average) over all 44 domains.}
\label{table:eval}
\end{table}
Table~\ref{table:eval} contains results on five selected problems (each problem represents a
different group of problems) and averaged results over all problems.\footnote{We report only a subset of results due to space restrictions. Full results and source code can be found at \url{https://ailab.si/ast-patterns/}. } The second, third, and fourth columns provide classification accuracies (CA) of the rule-based classifier,
the majority classifier, and random forests on the test data. The majority classifier and the random forests method,
-which was the best performing machine learning algorithm over all problems, serve as lower and upper bound
-for CA. For example, in the case of ``sister'' problem,
-our rules correctly classified 99\% testing submissions, the majority was 66\%, and the random
-forests achieved 98\% - slightly worse than the rules on this particular problem. The CA
-is also good for problems ``del'' and ``sum'', however it is lower in cases of ``is\_sorted'' and ``union'',
-suggesting that the proposed set of AST patterns is insufficient in certain problems. Indeed, after analysing the problem ``is\_sorted''
-we observed that our pattern set does not enclose patterns containing empty list ``[]'', an important pattern
-for a predicate that tests whether a list is sorted. For this reason, the rule learning
-algorithm failed to learn any C-rules and therefore all programs were classified as incorrect. Furthermore, many
-solutions of the problem ``union'' include the use of the cut operator (exclamation mark), which
-is, again, ignored by out pattern generation algorithm.
+which was the best-performing machine learning algorithm over all problems, serve as lower and upper reference points for CA on a particular data set. For example, on the \texttt{sister} problem,
+our rules correctly classified 99\% of testing instances, the majority classifier achieved 72\%, and random forests 98\%. The CA of rules is also high for the problems \texttt{del} and \texttt{sum}; however, it is lower for \texttt{is\_sorted} and \texttt{union}, suggesting that the proposed set of AST patterns is insufficient for certain problems. Indeed, after analysing the problem \texttt{is\_sorted},
+we observed that our pattern set does not include patterns containing only the empty list ``[]'' in a predicate, an important pattern occurring in the base case of this predicate. For this reason, the rule-learning
+algorithm failed to learn any C-rules and therefore all programs were classified as incorrect. In the case of \texttt{union}, many solutions use the cut operator (exclamation mark), which
+is also ignored by our pattern generation algorithm.
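The role of the two reference classifiers can be illustrated with a short sketch. This is not the paper's implementation; the counts below are invented for illustration. The majority classifier simply predicts the most frequent class in the data, which is why its accuracy serves as the lower reference point for CA:

```python
# Sketch: majority-class baseline as a lower reference for classification
# accuracy (CA). The label counts here are illustrative, not the paper's data.
from collections import Counter

def majority_accuracy(labels):
    """CA of a classifier that always predicts the most frequent class."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# e.g. a test set with 72 'correct' and 28 'incorrect' submissions
labels = ["correct"] * 72 + ["incorrect"] * 28
print(majority_accuracy(labels))  # 0.72
```

Any learned model worth keeping should beat this baseline; a rule set whose CA sits near it (as with \texttt{is\_sorted}) signals that the available patterns carry little discriminative information.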
-To evaluate the quality of hints we selected all incorrect submission from student traces
-that resulted in a correct program (solution). For example, from 403 submissions in the ``sister'' testing set,
-there were 289 relevant incorrect solutions.
+We evaluated the quality of hints on incorrect submissions from student traces
+that resulted in a correct program. In the case of the \texttt{sister} data set, there were 289 such incorrect submissions out of 403 submissions in total.
-The columns under ``Buggy hints'' contain evaluation of hints generated from rules
-for incorrect class (I-rules). For each generated buggy hint we check whether
-it was implemented by the student in the final submission (the solution). The column ``All'' is
+The columns under ``Buggy hints'' in Table~\ref{table:eval} contain the evaluation of hints generated from rules
+for the incorrect class (I-rules). For each generated buggy hint, we checked whether
+it was implemented by the student in the final submission. The column ``All'' is
the number of all generated buggy hints and the column ``Imp'' is the number of
-implemented hints. The results suggest that almost all buggy hints are implemented in the final solution.
+implemented hints. The results show the high relevance of the generated buggy hints: 97\% of them (3508 out of 3613) were implemented in the final solution.
-The intent hints are generated when algorithm fails to find any buggy hints. The column ``All'' contains the number of
-generated intent hints, ``Imp'' the number of implemented main intent hints, while ``Alt'' is the number
+The intent hints are generated when the algorithm fails to find any buggy hints. The column ``All'' contains the number of generated intent hints, ``Imp'' the number of implemented main intent hints, while ``Alt'' is the number
of implemented alternative hints. Notice that the percentage of implemented intent hints is significantly lower
-when compared to buggy hints: in the case
-of problem ``sister'' 84 out of 127 (66\%) hints were implemented, while in case of problem ``union'' only 66 out of 182 (36\%) hints were implemented.
+when compared to buggy hints: in the case of problem \texttt{sister} 84 out of 127 (66\%) hints were implemented, whereas in the case of problem \texttt{union} only 66 out of 182 (36\%) hints were implemented. On average, 56\% of main intent hints were implemented.
The last column shows the number of submissions where no hints could be generated. This value is relatively high
-for the ``is\_sorted'' problem, because the algorithm could not learn any C-rules and consequently no intent hints where generated.
+for the \texttt{is\_sorted} problem, because the algorithm could not learn any C-rules and consequently no intent hints were generated.
-To sum up, buggy hints seem to be good and reliable, since they are always implemented when presented, even when we tested
-them in retrospective (the students did not see these hints). The percentage of implemented intent hints is,
-on average, lower (56\%; 26.36 of 46.75), which is still not a bad result, providing that it is difficult to determine intent
-of a programer. Last but not least, high classification accuracies in many problems imply that it is possible
-to correctly determine the correctness of a program by simply verifying the presence of a small number of patterns.
-Our hypothesis is that these patterns represent the parts of the problem solution that the students have
-difficulties with.
+To sum up, buggy hints seem to be good and reliable: they were almost always implemented when generated, even though we evaluated them retrospectively on past data, so the students' decisions could not have been influenced by these hints. The percentage of implemented intent hints is, on average, lower (56\%), which is still a solid result, given that it is difficult to determine the intent of a programmer. In a further 12\% of cases (244 out of 2057), the main intent hint was not implemented, but students implemented an alternative hint identified by our algorithm. Overall, we were able to generate a hint in 84.5\% of cases, and in 73\% of all cases the generated hint was also implemented. Last but not least, the high classification accuracies on many problems imply that the correctness of a program can be determined by simply verifying the presence of a small number of patterns. Our hypothesis is that there exist certain crucial patterns in programs that students need to resolve; once they figure out these patterns, implementing the rest of the program is often straightforward.
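The aggregate percentages in this paragraph all follow from the ``Total'' row of Table~\ref{table:eval}. A minimal check of that arithmetic, using only the totals printed in the table:

```python
# Sanity check on the aggregate percentages, from the "Total" row of the
# table: buggy hints (all / implemented), intent hints (all / implemented /
# alternative), and submissions where no hint could be generated.
buggy_all, buggy_imp = 3613, 3508
intent_all, intent_imp, intent_alt = 2057, 1160, 244
no_hint = 1045

total = buggy_all + intent_all + no_hint           # all relevant submissions
implemented = buggy_imp + intent_imp + intent_alt  # any hint implemented

print(round(buggy_imp / buggy_all, 2))      # 0.97: buggy hints implemented
print(round(intent_imp / intent_all, 2))    # 0.56: main intent hints
print(round(intent_alt / intent_all, 2))    # 0.12: alternative intent hints
print(round((total - no_hint) / total, 3))  # 0.844: hint coverage
print(round(implemented / total, 2))        # 0.73: any hint implemented
```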
%%% Local Variables: