Diffstat (limited to 'paper/evaluation.tex')
-rw-r--r--  paper/evaluation.tex  |  7
1 file changed, 3 insertions, 4 deletions
diff --git a/paper/evaluation.tex b/paper/evaluation.tex
index 32cb03c..6b6c7a8 100644
--- a/paper/evaluation.tex
+++ b/paper/evaluation.tex
@@ -41,14 +41,13 @@ which had the best overall performance, serve as references for bad and good CA
For example, our rules correctly classified 99\% of testing instances for the \code{sister} problem,
the accuracy of the majority classifier was 66\%, and random forests achieved 98\%. CA of rules is also high for problems \code{del} and \code{sum}. It is lower, however, for \code{is\_sorted} and \code{union}, suggesting that the proposed set of AST patterns is insufficient for certain problems. Indeed, after analyzing the problem \code{is\_sorted},
we observed that our patterns do not cover predicates with a single empty-list (\code{[]}) argument, which occurs as the base case in this problem. For this reason, the rule learning
-algorithm failed to learn any C-rules and therefore all programs were classified as incorrect. In the case of \code{union}, many solutions use the cut (\code{!}) operator, which
+algorithm failed to learn any positive rules and therefore all programs were classified as incorrect. In the case of \code{union}, many solutions use the cut (\code{!}) operator, which
is also ignored by our pattern generation algorithm.
We evaluated the quality of hints on incorrect submissions from those student traces
that resulted in a correct program. In the case of the \code{sister} data set, there were 289 such incorrect submissions out of 403 submissions in total.
-The columns captioned “Buggy hints” in Table~\ref{table:eval} contain evaluation of hints generated from rules
-for incorrect programs (I-rules). For each generated buggy hint we checked whether
+The columns captioned “Buggy hints” in Table~\ref{table:eval} contain evaluation of buggy hints generated from negative rules. For each generated buggy hint we checked whether
it was implemented by the student in the final submission. The column “All” is
the number of all generated buggy hints, while the column “Imp” is the number of
implemented hints. The results show high relevance of generated buggy hints, as 97\% (3508 out of 3613) of them were implemented in the final solution; in other words, the buggy pattern was removed.
@@ -58,7 +57,7 @@ of implemented alternative hints. Notice that the percentage of implemented inte
when compared to buggy hints: in the case of problem \code{sister} 84 out of 127 (66\%) hints were implemented, whereas in the case of problem \code{union} only 66 out of 182 (36\%) hints were implemented. On average, 56\% of main intent hints were implemented.
The last column shows the number of submissions where no hints could be generated. This value is relatively high
-for the \code{is\_sorted} problem, because the algorithm could not learn any C-rules and thus no intent hints were generated.
+for the \code{is\_sorted} problem, because the algorithm could not learn any positive rules and thus no intent hints were generated.
To sum up, buggy hints seem to be good and reliable, since they are almost always implemented when presented, even though we tested them on past data -- the decisions of students were not influenced by these hints. The percentage of implemented intent hints is, on average, lower (56\%), which is still a reasonable result, given that it is difficult to determine the programmer’s intent. In 12\% (244 out of 2057) of generated intent hints, students implemented an alternative hint that was identified by our algorithm. Overall, we were able to generate hints for 84.5\% of incorrect submissions. Of those hints, 86\% were implemented (73\% of all incorrect submissions).
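
For illustration, the two problem cases noted in the diff above (the empty-list base case of is_sorted and the cut in union) can be seen in typical textbook-style Prolog solutions. The sketch below is hypothetical -- it is not taken from the student data set -- and only shows the kind of clauses that fall outside the generated AST patterns.

% Typical base case with a single empty-list argument; the
% generated AST patterns do not cover such predicates.
is_sorted([]).
is_sorted([_]).
is_sorted([X,Y|T]) :-
    X =< Y,
    is_sorted([Y|T]).

% Typical union/3 solution relying on the cut (!), which the
% pattern generation algorithm ignores.
union([], L, L).
union([H|T], L, R) :-
    member(H, L), !,
    union(T, L, R).
union([H|T], L, [H|R]) :-
    union(T, L, R).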