-rw-r--r-- | aied2017/evaluation.tex | 82 |
1 files changed, 81 insertions, 1 deletions
diff --git a/aied2017/evaluation.tex b/aied2017/evaluation.tex
index 22e4253..50e9eb4 100644
--- a/aied2017/evaluation.tex
+++ b/aied2017/evaluation.tex
@@ -1 +1,81 @@
-\section{Evaluation}
\ No newline at end of file
+\section{Evaluation and discussion}
+We evaluated our approach on 44 programming assignments. We preselected 70\% of the students,
+whose submissions were used as learning data for rule learning. The submissions from
+the remaining 30\% of students were used as testing data to evaluate the classification accuracy of the learned rules
+and to retrospectively evaluate the quality of the generated hints.
+
+\begin{table}[t]
+\centering
+ \begin{tabular}{|l|rrr|rr|rrr|r|}
+ \hline
+ Problem& \multicolumn{3}{c|}{CA} & \multicolumn{2}{c|}{Buggy hints} & \multicolumn{3}{c|}{Intent hints} & No hint \\
+ \hline
+ & Rules & Maj & RF & All & Imp & All & Imp & Alt & \\
+ \hline
+ sister & 0.988 & 0.663 & 0.983 & 128 & 128 & 127 & 84 & 26 & 34 \\
+ del & 0.948 & 0.669 & 0.974 & 136 & 136 & 39 & 25 & 10 & 15 \\
+ sum & 0.945 & 0.510 & 0.956 & 59 & 53 & 24 & 22 & 1 & 6 \\
+ is\_sorted & 0.765 & 0.765 & 0.831 & 119 & 119 & 0 & 0 & 0 & 93 \\
+ union & 0.785 & 0.766 & 0.813 & 106 & 106 & 182 & 66 & 7 & 6 \\
+ ...& & & & & & & & & \\
+ \hline
+ Average & 0.857 & 0.666 & 0.908 & 79.73 & 77.34 & 46.75 & 26.36 & 5.55 & 23.75 \\
+ \hline
+ \end{tabular}
+\caption{Results on 5 selected domains and averaged results over 44 domains.
+Columns 2, 3, and 4 contain classification accuracies of our rule learning method, majority classifier,
+and random forest, respectively. Columns 5 and 6 report the number of all generated buggy hints
+and the number of hints that were actually implemented by students. The following three columns
+contain the number of all generated intent hints (All),
+the number of implemented hints (Imp) and the number of implemented alternative hints (Alt).
+The numbers in the last column are the numbers of student submissions for which hints could not be generated.
+The last row averages results over all 44 domains.}
+\label{table:eval}
+\end{table}
+
+Table~\ref{table:eval} contains results on five selected problems (each problem represents a
+different group of problems) and averaged results over all problems.\footnote{We report only a subset of results due to space
+restrictions. The table with results for all 44 problems can be found at
+\url{www.ailab.si/aied2017}.} The second, third, and fourth columns provide classification accuracies (CA) of the rule-based classifier,
+the majority classifier, and random forest on testing data. The majority classifier and the random forest method,
+which was the best performing machine learning algorithm over all problems, serve as lower and upper bounds
+for CA. For example, in the case of the ``sister'' problem,
+our rules correctly classified 99\% of testing submissions, the majority classifier 66\%, and the random
+forest 98\%, slightly worse than the rules on this particular problem. The CA
+is also good for problems ``del'' and ``sum''; however, it is lower for ``is\_sorted'' and ``union'',
+suggesting that the proposed set of AST patterns is insufficient for certain problems. Indeed, after analysing the problem ``is\_sorted''
+we observed that our pattern set does not include patterns containing the empty list ``[]'', an important pattern
+for a predicate that tests whether a list is sorted. For this reason, the rule learning
+algorithm failed to learn any C-rules and therefore all programs were classified as incorrect. Furthermore, many
+solutions of the problem ``union'' include the use of the cut operator (exclamation mark), which
+is, again, ignored by our pattern generation algorithm.
+
+To evaluate the quality of hints we selected all incorrect submissions from student traces
+that resulted in a correct program (solution). For example, from 403 submissions in the ``sister'' testing set,
+there were 289 relevant incorrect submissions.
+
+The columns under ``Buggy hints'' contain the evaluation of hints generated from rules
+for the incorrect class (I-rules). For each generated buggy hint we check whether
+it was implemented by the student in the final submission (the solution). The column ``All'' is
+the number of all generated buggy hints and the column ``Imp'' is the number of
+implemented hints. The results suggest that almost all buggy hints are implemented in the final solution.
+
+The intent hints are generated when the algorithm fails to find any buggy hints. The column ``All'' contains the number of
+generated intent hints, ``Imp'' the number of implemented main intent hints, while ``Alt'' is the number
+of implemented alternative hints. Notice that the percentage of implemented intent hints is considerably lower
+than for buggy hints: in the case
+of problem ``sister'' 84 out of 127 (66\%) hints were implemented, while in the case of problem ``union'' only 66 out of 182 (36\%) hints were implemented.
+
+The last column shows the number of submissions for which no hints could be generated. This value is relatively high
+for the ``is\_sorted'' problem, because the algorithm could not learn any C-rules and consequently no intent hints were generated.
+
+To sum up, buggy hints seem to be good and reliable, since they are almost always implemented, even though we tested
+them retrospectively (the students did not see these hints). The percentage of implemented intent hints is,
+on average, lower (56\%; 26.36 of 46.75), which is still not a bad result, given that it is difficult to determine the intent
+of a programmer. Last but not least, high classification accuracies in many problems imply that it is possible
+to correctly determine the correctness of a program by simply verifying the presence of a small number of patterns.
+Our hypothesis is that these patterns represent the parts of the problem solution that the students have
+difficulties with.
+
+
+
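
The implementation percentages quoted in the added text (66\%, 36\%, and the 56\% average for intent hints) can be re-derived from the counts in the evaluation table. The following is only a minimal Python sketch for checking that arithmetic: the counts are copied by hand from the table, and the names `BUGGY`, `INTENT`, `implemented_share`, and `as_percent` are illustrative and not part of the paper's code.

```python
# (generated, implemented) counts copied by hand from the evaluation table.
# These dictionaries and helper names are hypothetical, for illustration only.
BUGGY = {
    "sister": (128, 128),
    "del": (136, 136),
    "sum": (59, 53),
    "is_sorted": (119, 119),
    "union": (106, 106),
    "Average": (79.73, 77.34),
}
INTENT = {
    "sister": (127, 84),
    "del": (39, 25),
    "sum": (24, 22),
    "is_sorted": (0, 0),
    "union": (182, 66),
    "Average": (46.75, 26.36),
}


def implemented_share(generated, implemented):
    """Return implemented / generated, or None when no hints were generated."""
    if generated == 0:
        return None
    return implemented / generated


def as_percent(share):
    """Format a share as a whole-number percentage, or 'n/a' for None."""
    return "n/a" if share is None else f"{round(100 * share)}%"


if __name__ == "__main__":
    for problem in BUGGY:
        buggy = as_percent(implemented_share(*BUGGY[problem]))
        intent = as_percent(implemented_share(*INTENT[problem]))
        print(f"{problem:10s} buggy: {buggy:>4s}  intent: {intent:>4s}")
```

The `is_sorted` row has no generated intent hints, so the sketch reports `n/a` there instead of dividing by zero.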