From ad732e88887a72485ca03fae288bb9dc39911d15 Mon Sep 17 00:00:00 2001 From: Timotej Lazar Date: Tue, 7 Feb 2017 18:49:34 +0100 Subject: Final version for review --- aied2017/aied2017.tex | 2 +- aied2017/evaluation.tex | 30 +++++++++++++++------------- aied2017/method.tex | 53 ++++++++++++++++++------------------------------- aied2017/patterns.tex | 36 ++++++++++++++++----------------- 4 files changed, 54 insertions(+), 67 deletions(-) diff --git a/aied2017/aied2017.tex b/aied2017/aied2017.tex index a7ed5e1..3faaaf8 100644 --- a/aied2017/aied2017.tex +++ b/aied2017/aied2017.tex @@ -41,7 +41,7 @@ % motivation When implementing a programming tutor, it is often difficult to manually consider all possible errors encountered by students. An alternative is to automatically learn a bug library of erroneous patterns from students’ programs. % learning -We propose using abstract-syntax-tree (AST) patterns as features for learning rules to distinguish between correct and incorrect programs. These rules can be used for debugging student programs: rules for incorrect programs (buggy rules) contain patterns indicating mistakes, whereas each rule for correct programs covers a subset of submissions sharing the same solution strategy. +We propose using abstract-syntax-tree (AST) patterns as features for learning rules to distinguish between correct and incorrect programs. These rules can be used for debugging student programs: rules for incorrect programs (buggy rules) contain patterns indicating mistakes, whereas rules for correct programs cover subsets of submissions sharing the same solution strategy. % generating hints To generate hints, we first check all buggy rules and point out incorrect patterns. If no buggy rule matches, rules for correct programs are used to recognize the student’s intent and suggest patterns that still need to be implemented. 
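The two-stage hint policy described in the abstract (check buggy rules first; if none match, use rules for correct programs to recognize intent and suggest missing patterns) can be sketched as follows. This is a hypothetical simplification, not the authors' implementation: it assumes a program is represented as the set of AST patterns it contains, and a rule as the set of patterns its conditions require.

```python
# Hedged sketch of the two-stage hint policy (hypothetical data model,
# not the authors' code): a program is the set of AST patterns it
# contains; a rule is the set of patterns its conditions require.

def generate_hints(program, i_rules, c_rules):
    """Return buggy hints if any I-rule matches, else intent hints."""
    # Stage 1: every matching buggy rule points out incorrect patterns.
    buggy = {p for rule in i_rules if rule <= program for p in rule}
    if buggy:
        return ("buggy", sorted(buggy))
    # Stage 2: recognize intent via the C-rule sharing the most
    # patterns with the program, and suggest the patterns it lacks.
    best = max(c_rules, key=lambda rule: len(rule & program))
    return ("intent", sorted(best - program))
```

For instance, with `i_rules = [{"bug1"}]` and `c_rules = [{"a", "b"}, {"c"}]`, a program containing `bug1` gets a buggy hint, while a program containing only `a` gets the intent hint `b`.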
% evaluation diff --git a/aied2017/evaluation.tex b/aied2017/evaluation.tex index bb3f4c6..32cb03c 100644 --- a/aied2017/evaluation.tex +++ b/aied2017/evaluation.tex @@ -1,10 +1,18 @@ -\section{Evaluation and discussion} +\section{Evaluation} We evaluated our approach on 44 programming assignments. We preselected 70\% of students whose submissions were used as learning data for rule learning. The submissions from the remaining 30\% of students were used as testing data to evaluate classification accuracy of learned rules, and to retrospectively evaluate quality of given hints. \begin{table}[t] +\caption{Results on five selected domains and averaged results over 44 domains. +Columns 2, 3, and 4 contain classification accuracies of our rule learning method, majority classifier, +and random forest, respectively. Columns 5 and 6 report the number of all generated buggy hints +and the number of hints that were actually implemented by students. The following three columns +contain the number of all generated intent hints (All), +the number of implemented hints (Imp) and the number of implemented alternative hints (Alt). +The numbers in the last column are counts of student submissions where hints could not be generated. +The bottom two rows give aggregated results (total and average) over all 44 domains.} \centering \begin{tabular}{|l|rrr|rr|rrr|r|} \hline @@ -24,22 +32,14 @@ and to retrospectively evaluate quality of given hints. Average & 0.857 & 0.663 & 0.908 & 79.73 & 77.34 & 46.75 & 26.36 & 5.55 & 23.75 \\ \hline \end{tabular} -\caption{Results on five selected domains and averaged results over 44 domains. -Columns 2, 3, and 4 contain classification accuracies of our rule learning method, majority classifier, -and random forest, respectively. Columns 5 and 6 report the number of all generated buggy hints -and the number of hints that were actually implemented by students.
The following three columns -contain the number of all generated intent hints (All), -the number of implemented hints (Imp) and the number of implemented alternative hints (Alt). -The numbers in the last column are student submission where hints could not be generated. -The bottom two rows give aggregated results (total and average) over all 44 domains.} \label{table:eval} \end{table} Table~\ref{table:eval} contains results on five selected problems (each representing a group of problems from one lab session), and averaged results over all 44 problems.\footnote{We report only a subset of results due to space restrictions. Full results and source code can be found at \url{https://ailab.si/ast-patterns/}. } The second, third, and fourth columns provide classification accuracies (CA) of the rule-based, majority, and random-forest classifiers on testing data. The majority classifier and the random forests method, which had the best overall performance, serve as references for bad and good CA on particular data sets. -For example, our rules correctly classified 99\% of testing instance for the \code{sister} problem, -the accuracy of the majority classifier was 66\%, and random forests achieved 98\%. Rule CA is also high for problems \code{del} and \code{sum}. It is lower, however, for \code{is\_sorted} and \code{union}, suggesting that the proposed set of AST patterns is insufficient for certain problems. Indeed, after analyzing the problem \code{is\_sorted}, +For example, our rules correctly classified 99\% of testing instances for the \code{sister} problem, +the accuracy of the majority classifier was 66\%, and random forests achieved 98\%. CA of rules is also high for problems \code{del} and \code{sum}. It is lower, however, for \code{is\_sorted} and \code{union}, suggesting that the proposed set of AST patterns is insufficient for certain problems. 
Indeed, after analyzing the problem \code{is\_sorted}, we observed that our patterns do not cover predicates with a single empty-list (\code{[]}) argument, which occurs as the base case in this problem. For this reason, the rule learning algorithm failed to learn any C-rules and therefore all programs were classified as incorrect. In the case of \code{union}, many solutions use the cut (\code{!}) operator, which is also ignored by our pattern generation algorithm. @@ -47,20 +47,22 @@ is also ignored by our pattern generation algorithm. We evaluated the quality of hints on incorrect submissions from those student traces that resulted in a correct program. In the case of the \code{sister} data set, there were 289 such incorrect submissions out of 403 submissions in total. -The columns capped by “Buggy hints” in Table~\ref{table:eval} contain evaluation of hints generated from rules +The columns captioned “Buggy hints” in Table~\ref{table:eval} contain evaluation of hints generated from rules for incorrect programs (I-rules). For each generated buggy hint we checked whether it was implemented by the student in the final submission. The column “All” is the number of all generated buggy hints, while the column “Imp” is the number of implemented hints. The results show high relevance of generated buggy hints, as 97\% (3508 out of 3613) of them were implemented in the final solution; in other words, the buggy pattern was removed. -The intent hints are generated when the algorithm fails to find any buggy hints. The column “All” contains the number of generated intent hints, “Imp”' the number of implemented main intent hints, and “Alt” is the number +The intent hints are generated when the algorithm fails to find any buggy hints. The column “All” contains the number of generated intent hints, “Imp” the number of implemented main intent hints, and “Alt” is the number of implemented alternative hints.
Notice that the percentage of implemented intent hints is significantly lower when compared to buggy hints: in the case of problem \code{sister} 84 out of 127 (66\%) hints were implemented, whereas in the case of problem \code{union} only 66 out of 182 (36\%) hints were implemented. On average, 56\% of main intent hints were implemented. The last column shows the number of submissions where no hints could be generated. This value is relatively high for the \code{is\_sorted} problem, because the algorithm could not learn any C-rules and thus no intent hints were generated. -To sum up, buggy hints seem to be good and reliable, since they are always implemented when presented, even when we tested them on past data -- the decisions of students were not influenced by these hints. The percentage of implemented intent hints is, on average, lower (56\%), which is still not a bad result, providing that it is difficult to determine the programmer’s intent. In 12\% (244 out 2057) where main intent hints were not implemented, students implemented an alternative hint that was identified by our algorithm. Overall we were able to generate a hint in 84.5\% of cases. In 73\% of all cases, the hints were also implemented. Last but not least, high classification accuracies in many problems imply that it is possible to correctly determine the correctness of a program by simply verifying the presence of a small number of patterns. Our hypothesis is that there exist some crucial patterns in programs that students need to resolve. When they figure out these patterns, the implementation of the remaining of the program is often straightforward. +To sum up, buggy hints seem to be good and reliable, since they were almost always (97\%) implemented when presented, even though we tested them on past data, where the decisions of students could not have been influenced by these hints.
The percentage of implemented intent hints is, on average, lower (56\%), which is still not a bad result, given that it is difficult to determine the programmer’s intent. In 12\% (244 out of 2057) of generated intent hints, students implemented an alternative hint that was identified by our algorithm. Overall we were able to generate hints for 84.5\% of incorrect submissions. Of those hints, 86\% were implemented (73\% of all incorrect submissions). + +Last but not least, high classification accuracies in many problems imply that it is possible to determine the correctness of a program simply by checking for the presence of a small number of patterns. Our hypothesis is that there exist some crucial patterns in programs that students have difficulties with. When they figure out these patterns, implementing the rest of the program is usually straightforward. %%% Local Variables: diff --git a/aied2017/method.tex b/aied2017/method.tex index 55537ab..5e27229 100644 --- a/aied2017/method.tex +++ b/aied2017/method.tex @@ -5,60 +5,42 @@ This section explains the three main components of our approach: automatically e \subsection{Extracting patterns} \label{sec:extracting-patterns} -We construct patterns by connecting pairs of leaf nodes in a program’s AST. For this paper we always select a pair of nodes from the same clause: either two nodes referring to the same variable (like the examples in Fig.~\ref{fig:sister}), or a value (such as the empty list \code{[]} or \code{0}) and another variable or value in the same \textsf{compound} or \textsf{binop} (like the blue dotted pattern in Fig.~\ref{fig:sum}). For example, in the clause\footnote{The second occurrence of variables \code{A}, \code{B} and \code{C} is marked with ’ for disambiguation.} +We construct patterns by connecting pairs of leaf nodes in a program’s AST.
For this paper we always select a pair of nodes from the same clause: either two nodes referring to the same variable (like the examples in Fig.~\ref{fig:sister}), or a value (such as the empty list \code{[]} or the number \code{0}) and another variable or value in the same \textsf{compound} or \textsf{binop} (like the blue dotted pattern in Fig.~\ref{fig:sum}). For example, in the clause (the second occurrence of each variable -- \code{A}, \code{B} and \code{C} -- is marked with~’ for disambiguation) \begin{Verbatim} a(A,\textsf{ }B):- b(A\textsf{'},\textsf{ }C), - B\textsf{'} is C\textsf{'}\textsf{ }+\textsf{ }18. + B\textsf{'} is C\textsf{'}\textsf{ }+\textsf{ }1. \end{Verbatim} \noindent -we select the following pairs of nodes: \{\code{A},\,\code{A\textsf{'}}\}, \{\code{B},\,\code{B\textsf{'}}\}, \{\code{C},\,\code{C\textsf{'}}\}, \{\code{B\textsf{'}},\,\code{18}\} and \{\code{C\textsf{'}},\,\code{18}\}. +we select the following pairs of nodes: \{\code{A},\,\code{A\textsf{'}}\}, \{\code{B},\,\code{B\textsf{'}}\}, \{\code{C},\,\code{C\textsf{'}}\}, \{\code{B\textsf{'}},\,\code{1}\} and \{\code{C\textsf{'}},\,\code{1}\}. For each selected pair of leaf nodes $(a,b)$ we build a pattern by walking the AST in depth-first order, and recording nodes that lie on the paths to $a$ and $b$. We omit \textsf{and} nodes, as explained in the previous section. We also include certain nodes that do not lie on a path to any selected leaf. Specifically, we include the functor or operator of all \textsf{compound}, \textsf{binop} or \textsf{unop} nodes containing $a$ or $b$. -Patterns constructed in this way form the set of features for rule learning. To keep this set at a reasonable size, we only use patterns that have appeared in programs submitted at least five times. +Patterns constructed in this way form the set of features for rule learning. To keep this set at a reasonable size, we only use patterns that have occurred at least five times in submitted programs. 
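To make the depth-first walk concrete, here is a minimal sketch of path-based extraction on a toy AST of nested tuples. This is a hypothetical encoding, not the paper's implementation (the actual AST and node set are richer, and this toy clause has no "and" nodes to omit): nodes on the path to either selected leaf are kept in depth-first order, and the functor or operator of a kept compound/binop node is retained, as described above.

```python
# Minimal sketch of path-based pattern extraction on a toy AST of
# nested tuples (hypothetical encoding, not the paper's implementation).
# An inner node is (label, child, ...); a leaf is a string. Variables
# are pre-disambiguated ("A" vs "A'"), so each selected leaf is unique.

def contains(node, leaf):
    if isinstance(node, str):
        return node == leaf
    return any(contains(child, leaf) for child in node[1:])

def pattern(node, a, b):
    """Depth-first walk keeping nodes that lie on the path to a or b."""
    if isinstance(node, str):
        return [node] if node in (a, b) else []
    label, *children = node
    if not (contains(node, a) or contains(node, b)):
        return []
    out = [label]
    if label == "compound":
        out.append(children[0])      # keep the functor, as in the paper
    elif label == "binop":
        out.append(children[1])      # keep the operator
    return out + [x for child in children for x in pattern(child, a, b)]

# Toy AST of the clause  a(A, B) :- b(A', C), B' is C' + 1.
ast = ("clause",
       ("compound", "a", "A", "B"),
       ("compound", "b", "A'", "C"),
       ("binop", "B'", "is", ("binop", "C'", "+", "1")))
```

On this AST, the pair {A, A'} yields the node sequence `clause, compound, a, A, compound, b, A'`, and the pair {B', 1} yields `clause, binop, is, B', binop, +, 1`.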
\subsection{Learning rules} -%\begin{figure}[t] -%\noindent \textsc{Procedure $learn\_rules(D, sig, acc)$} -%\\ -%\\ -%\noindent Given: data \emph{D}, threshold of significance \emph{sig}, threshold of accuracy \emph{acc}, and a -%procedure for learning rules $learn\_target\_rules(target, D, D_{acc}, sig, acc)$ that learns a set of rules for -%class \emph{target} from data \emph{D}, where: a) each attribute-value pair in the condition part -%needs to be significant using the likelihood-ratio test with significance level $p
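The pseudocode removed above describes a procedure learn_rules(D, sig, acc) that learns rules per class subject to significance and accuracy thresholds. As a rough illustration of the covering (separate-and-conquer) scheme it names, here is a hypothetical simplification; it replaces the likelihood-ratio significance test with a plain accuracy threshold and is not the authors' implementation.

```python
# Rough sketch of a separate-and-conquer rule learner in the spirit of
# learn_rules(D, sig, acc) (hypothetical simplification: the
# likelihood-ratio significance test is replaced by an accuracy
# threshold). An example is a dict of binary pattern features plus a
# "class" label; a rule is a list of features that must all be present.

def covers(rule, example):
    return all(example.get(f) for f in rule)

def accuracy(rule, target, examples):
    covered = [e for e in examples if covers(rule, e)]
    if not covered:
        return 0.0
    return sum(e["class"] == target for e in covered) / len(covered)

def grow_rule(target, examples, features, acc):
    """Greedily add the single feature that most improves accuracy."""
    rule = []
    while accuracy(rule, target, examples) < acc:
        scored = [(accuracy(rule + [f], target, examples), f)
                  for f in features if f not in rule]
        if not scored:
            return None
        best_score, best_f = max(scored)
        if best_score <= accuracy(rule, target, examples):
            return None              # no condition improves the rule
        rule.append(best_f)
    return rule

def learn_rules(target, examples, acc=0.9):
    """Learn rules for `target`; after each rule, drop the target-class
    examples it covers and continue on the rest."""
    features = sorted({f for e in examples for f in e if f != "class"})
    data, rules = list(examples), []
    while any(e["class"] == target for e in data):
        rule = grow_rule(target, data, features, acc)
        if rule is None:
            break
        rules.append(rule)
        remaining = [e for e in data
                     if e["class"] != target or not covers(rule, e)]
        if len(remaining) == len(data):
            break                    # rule covers no new target examples
        data = remaining
    return rules
```

Running this for the "incorrect" class would produce I-rules (buggy patterns) and for the "correct" class C-rules, mirroring the two rule sets used for hint generation.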