path: root/paper/evaluation.tex
Diffstat (limited to 'paper/evaluation.tex')
-rw-r--r--  paper/evaluation.tex  |  7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/paper/evaluation.tex b/paper/evaluation.tex
index 6b6c7a8..3b7f9fb 100644
--- a/paper/evaluation.tex
+++ b/paper/evaluation.tex
@@ -2,7 +2,7 @@
We evaluated our approach on 44 programming assignments. We preselected 70\% of students
whose submissions were used as learning data for rule learning. The submissions from
the remaining 30\% of students were used as testing data to evaluate classification accuracy of learned rules,
-and to retrospectively evaluate quality of given hints.
+and to retrospectively evaluate the quality of the given hints. The problems analyzed in this paper constitute a complete introductory Prolog course, covering the basics of the language.
\begin{table}[t]
\caption{Results on five selected domains and averaged results over 44 domains.
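For reference, CA in the table stands for classification accuracy on the held-out test set (the submissions of the remaining 30\% of students), i.e. the fraction of test submissions whose correctness is predicted correctly:

\[
\mathrm{CA} \;=\; \frac{\text{number of test submissions classified correctly}}{\text{total number of test submissions}}
\]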
@@ -39,7 +39,7 @@ Table~\ref{table:eval} contains results on five selected problems (each represen
which had the best overall performance, serve as references for bad and good CA on particular data sets.
For example, our rules correctly classified 99\% of testing instances for the \code{sister} problem,
-the accuracy of the majority classifier was 66\%, and random forests achieved 98\%. CA of rules is also high for problems \code{del} and \code{sum}. It is lower, however, for \code{is\_sorted} and \code{union}, suggesting that the proposed set of AST patterns is insufficient for certain problems. Indeed, after analyzing the problem \code{is\_sorted},
+the accuracy of the majority classifier was 66\%, and random forests achieved 98\%. The CA of rules is also high for the problems \code{del} and \code{sum}. It is lower, however, for \code{is\_sorted} and \code{union}, suggesting that the proposed AST patterns are insufficient for certain problems. Indeed, after analyzing the problem \code{is\_sorted},
we observed that our patterns do not cover predicates with a single empty-list (\code{[]}) argument, which occurs as the base case in this problem. For this reason, the rule learning
algorithm failed to learn any positive rules and therefore all programs were classified as incorrect. In the case of \code{union}, many solutions use the cut (\code{!}) operator, which
is also ignored by our pattern generation algorithm.
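To make these two gaps concrete, the Prolog sketch below shows the kind of solutions the text refers to: an \code{is\_sorted} base case whose only argument is the empty list, and a \code{union} clause that commits with the cut. The exact clause bodies are illustrative assumptions, not students' actual submissions; only the presence of the empty-list base case and of the cut is taken from the text.

    % Typical is_sorted/1: the base cases take a single list argument,
    % one of them the empty list ([]), which the AST patterns do not cover.
    is_sorted([]).
    is_sorted([_]).
    is_sorted([X,Y|T]) :-
        X =< Y,
        is_sorted([Y|T]).

    % Common union/3 solution using the cut (!), which the pattern
    % generation algorithm ignores.
    union([], L, L).
    union([H|T], L, R) :-
        member(H, L), !,
        union(T, L, R).
    union([H|T], L, [H|R]) :-
        union(T, L, R).

Extending pattern generation to clauses like these (single empty-list arguments and goals containing the cut) would presumably close most of the accuracy gap on these two problems.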
@@ -61,8 +61,7 @@ for the \code{is\_sorted} problem, because the algorithm could not learn any pos
To sum up, buggy hints seem to be good and reliable, since they were always implemented when presented, even though we evaluated them on past data, where the decisions of students could not have been influenced by these hints. The percentage of implemented intent hints is, on average, lower (56\%), which is still a reasonable result, given that it is difficult to determine the programmer's intent. In 12\% (244 out of 2057) of generated intent hints, students implemented an alternative hint that was identified by our algorithm. Overall, we were able to generate hints for 84.5\% of incorrect submissions. Of those hints, 86\% were implemented (73\% of all incorrect submissions).
-Last but not least, high classification accuracies in many problems imply that it is possible to correctly determine the correctness of a program by simply checking for the presence of a small number of patterns. Our hypothesis is that there exist some crucial patterns in programs that students have difficulties with. When they figure out these patterns, implementing the rest of the program is usually straightforward.
-
+High classification accuracies for many problems imply that it is possible to determine program correctness simply by checking for the presence of a small number of patterns. Our hypothesis is that each program contains certain crucial patterns that students have difficulties with. Once they figure out these patterns, implementing the rest of the program is usually straightforward.
%%% Local Variables:
%%% mode: latex
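A quick consistency check on the percentages in the closing paragraph, using only the numbers reported there:

\[
\frac{244}{2057} \approx 0.119 \approx 12\%,
\qquad
0.845 \times 0.86 \approx 0.727 \approx 73\%~\text{of all incorrect submissions}.
\]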