author     Martin Možina <martin.mozina@fri.uni-lj.si>  2017-02-03 18:38:27 +0100
committer  Martin Možina <martin.mozina@fri.uni-lj.si>  2017-02-03 18:38:27 +0100
commit     90ea92ad4c42a5fa980d55576cc96aa4a8fbf3a8 (patch)
tree       ea4a85901cd02f793d5e71e6b2a23467f41e4e4b
parent     bf57e6aa17bbcb8deac650c807b5f494a6488ba6 (diff)
First draft of evaluation section.
-rw-r--r--  aied2017/evaluation.tex  82
1 files changed, 81 insertions, 1 deletions
diff --git a/aied2017/evaluation.tex b/aied2017/evaluation.tex
index 22e4253..50e9eb4 100644
--- a/aied2017/evaluation.tex
+++ b/aied2017/evaluation.tex
@@ -1 +1,81 @@
-\section{Evaluation} \ No newline at end of file
+\section{Evaluation and discussion}
+We evaluated our approach on 44 programming assignments. We selected 70\% of the students and
+used their submissions as learning data for rule learning. The submissions of the
+remaining 30\% of students were used as testing data to evaluate the classification accuracy of the learned rules
+and to retrospectively evaluate the quality of the generated hints.
+
+\begin{table}[t]
+\centering
+ \begin{tabular}{|l|rrr|rr|rrr|r|}
+ \hline
+ Problem& \multicolumn{3}{c|}{CA} & \multicolumn{2}{c|}{Buggy hints} & \multicolumn{3}{c|}{Intent hints} & No hint \\
+ \hline
+ & Rules & Maj & RF & All & Imp & All & Imp & Alt & \\
+ \hline
+ sister & 0.988 & 0.663 & 0.983 & 128 & 128 & 127 & 84 & 26 & 34 \\
+ del & 0.948 & 0.669 & 0.974 & 136 & 136 & 39 & 25 & 10 & 15 \\
+ sum & 0.945 & 0.510 & 0.956 & 59 & 53 & 24 & 22 & 1 & 6 \\
+ is\_sorted & 0.765 & 0.765 & 0.831 & 119 & 119 & 0 & 0 & 0 & 93 \\
+ union & 0.785 & 0.766 & 0.813 & 106 & 106 & 182 & 66 & 7 & 6 \\
+ ...& & & & & & & & & \\
+ \hline
+ Average & 0.857 & 0.666 & 0.908 & 79.73 & 77.34 & 46.75 & 26.36 & 5.55 & 23.75 \\
+ \hline
+ \end{tabular}
+\caption{Results on five selected problems and averaged results over all 44 problems.
+Columns 2, 3, and 4 contain the classification accuracies of our rule-learning method, the majority classifier,
+and random forests, respectively. Columns 5 and 6 report the number of all generated buggy hints
+and the number of those hints that were actually implemented by students. The following three columns
+contain the number of all generated intent hints (All),
+the number of implemented main intent hints (Imp), and the number of implemented alternative hints (Alt).
+The last column gives the number of student submissions for which no hint could be generated.
+The last row averages the results over all 44 problems.}
+\label{table:eval}
+\end{table}
+
+Table~\ref{table:eval} contains the results on five selected problems (each representing a
+different group of problems) and the averaged results over all problems.\footnote{We report only a subset of results due to space
+restrictions. The table with results for all 44 problems can be found at
+\url{www.ailab.si/aied2017}.} The second, third, and fourth columns give the classification accuracies (CA) of the rule-based classifier,
+the majority classifier, and random forests on the testing data. The majority classifier and the random forests method,
+which was the best-performing machine-learning algorithm over all problems, serve as a lower and an upper bound
+for CA. For example, in the case of the ``sister'' problem,
+our rules correctly classified 99\% of the testing submissions, the majority classifier achieved 66\%, and random
+forests reached 98\%, slightly worse than the rules on this particular problem. The CA
+is also good for the problems ``del'' and ``sum''; however, it is lower for ``is\_sorted'' and ``union'',
+suggesting that the proposed set of AST patterns is insufficient for certain problems. Indeed, after analysing the problem ``is\_sorted'',
+we observed that our pattern set does not include patterns containing the empty list ``[]'', an important pattern
+for a predicate that tests whether a list is sorted. For this reason, the rule-learning
+algorithm failed to learn any C-rules, and therefore all programs were classified as incorrect. Furthermore, many
+solutions of the problem ``union'' use the cut operator (exclamation mark), which
+is, again, ignored by our pattern-generation algorithm.
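+For illustration, consider typical solutions of these two problems (shown here in one common
+form; actual student programs vary):
+\begin{verbatim}
+% A list is sorted if every element is no greater than its successor.
+is_sorted([]).                % the empty list is sorted
+is_sorted([_]).               % a single-element list is sorted
+is_sorted([X,Y|T]) :-
+    X =< Y,
+    is_sorted([Y|T]).
+
+% Union of two lists, skipping elements of the first list that
+% already occur in the second (member/2 is from library(lists)).
+union([], L, L).
+union([H|T], L, U) :-
+    member(H, L), !,          % H already in L: the cut commits to skipping it
+    union(T, L, U).
+union([H|T], L, [H|U]) :-
+    union(T, L, U).
+\end{verbatim}
+Both the ``[]'' atom and the cut fall outside our current pattern set.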
+
+To evaluate the quality of hints, we selected all incorrect submissions from those student traces
+that ended with a correct program (the solution). For example, of the 403 submissions in the ``sister'' testing set,
+289 were relevant incorrect submissions.
+
+The columns under ``Buggy hints'' evaluate the hints generated from rules
+for the incorrect class (I-rules). For each generated buggy hint we checked whether
+it was implemented by the student in the final submission (the solution). The column ``All'' gives
+the number of all generated buggy hints and the column ``Imp'' the number of
+implemented hints. The results suggest that almost all buggy hints are implemented in the final solution.
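+As a hypothetical illustration of this check (the error and the fact base are invented for
+exposition, and the two versions are renamed so that both can be shown together), consider a
+student solving ``sister'' who first tests the gender of the wrong person and later corrects it:
+\begin{verbatim}
+% Hypothetical fact base, used only to make the example executable.
+parent(tom, ann).   parent(tom, bob).
+female(ann).
+
+% Incorrect submission: checks the gender of Y instead of X.
+% A buggy hint would flag the erroneous pattern (female applied to Y).
+sister_buggy(X, Y) :-
+    parent(P, X), parent(P, Y),
+    female(Y),
+    X \== Y.
+
+% Final submission: the flagged pattern no longer occurs, so the
+% buggy hint counts as implemented in our retrospective evaluation.
+sister_final(X, Y) :-
+    parent(P, X), parent(P, Y),
+    female(X),
+    X \== Y.
+\end{verbatim}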
+
+Intent hints are generated when the algorithm fails to find any buggy hints. The column ``All'' contains the number of
+generated intent hints, ``Imp'' the number of implemented main intent hints, and ``Alt'' the number
+of implemented alternative hints. Notice that the percentage of implemented intent hints is significantly lower
+than for buggy hints: in the case
+of the ``sister'' problem, 84 out of 127 (66\%) hints were implemented, while in the case of the ``union'' problem only 66 out of 182 (36\%) hints were implemented.
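+A hypothetical example of an intent hint (again with an invented program, and assuming, for this
+illustration, that an intent hint points to a pattern from the rules for correct programs that the
+submission lacks): an incomplete ``sister'' program may contain no buggy pattern and still be
+wrong because it omits the test that the two arguments differ.
+\begin{verbatim}
+% Hypothetical incomplete submission (uses the fact base from the
+% previous example): no erroneous pattern is present, yet the program
+% also succeeds for a goal such as sister(ann, ann).
+sister(X, Y) :-
+    parent(P, X), parent(P, Y),
+    female(X).
+
+% An intent hint would point to the missing part; it counts as
+% implemented if the final submission contains it, for example:
+%   sister(X, Y) :- parent(P, X), parent(P, Y), female(X), X \== Y.
+\end{verbatim}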
+
+The last column shows the number of submissions for which no hints could be generated. This value is relatively high
+for the ``is\_sorted'' problem, because the algorithm could not learn any C-rules and consequently no intent hints were generated.
+
+To sum up, buggy hints seem to be good and reliable, since they were almost always implemented, even though we
+evaluated them retrospectively (the students never saw these hints). The percentage of implemented intent hints is,
+on average, lower (56\%; 26.36 of 46.75), which is still a reasonable result, considering how difficult it is to determine the intent
+of a programmer. Last but not least, the high classification accuracies on many problems imply that it is possible
+to determine the correctness of a program simply by verifying the presence of a small number of patterns.
+Our hypothesis is that these patterns represent the parts of the problem solution that students have
+difficulties with.
+
+
+