Diffstat (limited to 'paper')
-rw-r--r-- | paper/evaluation.tex |  7
-rw-r--r-- | paper/method.tex     | 40
2 files changed, 23 insertions, 24 deletions
diff --git a/paper/evaluation.tex b/paper/evaluation.tex
index 32cb03c..6b6c7a8 100644
--- a/paper/evaluation.tex
+++ b/paper/evaluation.tex
@@ -41,14 +41,13 @@ which had the best overall performance, serve as references for bad and good CA
 For example, our rules correctly classified 99\% of testing instances for the \code{sister} problem, the accuracy of the majority classifier was 66\%, and random forests achieved 98\%. CA of rules is also high for problems \code{del} and \code{sum}. It is lower, however, for \code{is\_sorted} and \code{union}, suggesting that the proposed set of AST patterns is insufficient for certain problems. Indeed, after analyzing the problem \code{is\_sorted}, we observed that our patterns do not cover predicates with a single empty-list (\code{[]}) argument, which occurs as the base case in this problem. For this reason, the rule learning
-algorithm failed to learn any C-rules and therefore all programs were classified as incorrect. In the case of \code{union}, many solutions use the cut (\code{!}) operator, which
+algorithm failed to learn any positive rules and therefore all programs were classified as incorrect. In the case of \code{union}, many solutions use the cut (\code{!}) operator, which
 is also ignored by our pattern generation algorithm.
 We evaluated the quality of hints on incorrect submissions from those student traces that resulted in a correct program. In the case of the \code{sister} data set, there were 289 such incorrect submission out of 403 submissions in total.
-The columns captioned “Buggy hints” in Table~\ref{table:eval} contain evaluation of hints generated from rules
-for incorrect programs (I-rules). For each generated buggy hint we checked whether
+The columns captioned “Buggy hints” in Table~\ref{table:eval} contain evaluation of buggy hints generated from negative rules. For each generated buggy hint we checked whether
 it was implemented by the student in the final submission. The column “All” is the number of all generated buggy hints, while the column “Imp” is the number of implemented hints. The results show high relevance of generated buggy hints, as 97\% (3508 out of 3613) of them were implemented in the final solution; in other words, the buggy pattern was removed.
@@ -58,7 +57,7 @@ of implemented alternative hints. Notice that the percentage of implemented inte
 when compared to buggy hints: in the case of problem \code{sister} 84 out of 127 (66\%) hints were implemented, whereas in the case of problem \code{union} only 66 out of 182 (36\%) hints were implemented. On average, 56\% of main intent hints were implemented. The last column shows the number of submissions where no hints could be generated. This value is relatively high
-for the \code{is\_sorted} problem, because the algorithm could not learn any C-rules and thus no intent hints were generated.
+for the \code{is\_sorted} problem, because the algorithm could not learn any positive rules and thus no intent hints were generated.
 To sum up, buggy hints seem to be good and reliable, since they are always implemented when presented, even when we tested them on past data -- the decisions of students were not influenced by these hints. The percentage of implemented intent hints is, on average, lower (56\%), which is still not a bad result, providing that it is difficult to determine the programmer’s intent. In 12\% (244 out 2057) of generated intent hints, students implemented an alternative hint that was identified by our algorithm.
 Overall we were able to generate hints for 84.5\% of incorrect submissions. Of those hints, 86\% were implemented (73\% of all incorrect submissions).
diff --git a/paper/method.tex b/paper/method.tex
index a4f7ae7..43362da 100644
--- a/paper/method.tex
+++ b/paper/method.tex
@@ -22,38 +22,38 @@ For each selected pair of leaf nodes $(a,b)$ we construct a pattern by walking t
 Patterns are extracted automatically given above constraints (each pattern connecting a pair of variables or values). We find that such patterns work well for Prolog. Other languages, however, will likely require different kinds of patterns to achieve good performance.
-Finally, to avoid learning rules specific to a particular program (covering typos and other idiosyncratic mistakes), we ignore rare patterns. In this study we used patterns that occurred in at least five submissions. These patterns form the feature space for rule learning.
+Finally, to avoid inducing rules specific to a particular program (covering typos and other idiosyncratic mistakes), we ignore rare patterns. In this study we used patterns that occurred in at least five submissions. These patterns form the feature space for rule learning.
 \subsection{Learning rules}
-We represent students’ programs in the feature space of AST patterns described above. Each pattern corresponds to one binary feature with value \textsf{true} when the pattern is present and \textsf{false} when it is absent. We use unit testing to classify each program as correct if it passes all test cases, and incorrect otherwise. We use these labels for machine learning.
+We represent students’ programs in the feature space of AST patterns described above. Each pattern corresponds to one binary feature with value \textsf{true} when the pattern is present and \textsf{false} when it is absent. We classify each program as correct if it passes a predefined set of test cases, and incorrect otherwise. We use these labels for machine learning.
-Since we can establish program correctness using appropriate unit tests, our goal here is not classifying new submissions. Instead, we wish to discover patterns associated with program correctness. This approach to machine learning is called \emph{descriptive induction} -- the automatic discovery of patterns describing regularities in data. We use rule learning for this task, because conditions of rules can be easily translated to hints.
+Since we can already establish program correctness using appropriate test cases, our goal here is not classifying new submissions. Instead, we wish to discover patterns associated with correct and incorrect programs. This approach to machine learning is called \emph{descriptive induction} -- the automatic discovery of patterns describing regularities in data. We use rule learning for this task, because rule conditions can be easily translated to hints.
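To make the feature construction described in the hunk above concrete, the following is a minimal sketch, not taken from the paper or its codebase; the input format (a list of pattern sets paired with a test-suite verdict) and all names are assumptions made for illustration.

from typing import List, Set, Tuple

def build_dataset(submissions: List[Tuple[Set[str], bool]], min_occurrences: int = 5):
    # Count in how many submissions each AST pattern occurs.
    counts = {}
    for patterns, _ in submissions:
        for p in patterns:
            counts[p] = counts.get(p, 0) + 1

    # Keep only patterns seen in at least `min_occurrences` submissions,
    # mirroring the rare-pattern filter described above.
    features = sorted(p for p, c in counts.items() if c >= min_occurrences)

    # One binary feature per pattern; the label is "correct" iff all tests pass.
    X = [[p in patterns for p in features] for patterns, _ in submissions]
    y = ["correct" if passed else "incorrect" for _, passed in submissions]
    return features, X, y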
-Before explaining the algorithm, let us discuss the reasons why a program can be incorrect. Our experience indicates that bugs in student programs can often be described by 1) some incorrect or \emph{buggy} pattern, which needs to be removed, or 2) some missing relation (pattern) that should be included before the program can be correct. We shall now explain how both types of errors can be identified with rules.
+Before explaining the algorithm, let us discuss the reasons why a program can be incorrect.
+Our experience indicates that bugs in student programs can often be described by 1) some incorrect or \emph{buggy} pattern, which needs to be removed, or 2) some missing relation (pattern) between objects that should be included before the program can be correct. We shall now explain how both types of errors can be identified with rules.
-To discover buggy patterns, the algorithm first learns rules that describe incorrect programs (I-rules). We use a variant of the CN2 algorithm~\cite{clark1991rule} implemented within the Orange data-mining toolbox~\cite{demsar2013orange}. Since we use rules to generate hints, and since hints should not be presented to students unless they are likely to be correct, we impose additional constraints on the rule learner:
+To discover buggy patterns, the algorithm first learns \emph{negative rules} that describe incorrect programs. We use a variant of the CN2 algorithm~\cite{clark1991rule} implemented within the Orange data-mining toolbox~\cite{demsar2013orange}. Since we use rules to generate hints, and since hints should not be presented to students unless they are likely to be correct, we impose additional constraints on the rule learner:
-\begin{enumerate}
-  \item The classification accuracy of each learned rule must exceed a threshold (we selected 90\%, as 10\% error seems acceptable for our application).
-  \item Each conjunct in a condition must be significant with respect to the likelihood-ratio test (in our experiments significance threshold was set to $p=0.05$).
-  \item A conjunct can only specify the presence of a pattern: we allow feature-value pairs with only \textsf{true} as value.
-\end{enumerate}
+\begin{itemize}
+  \item classification accuracy of each learned rule must exceed a threshold (we selected 90\%, as 10\% error seems acceptable for our application);
+  \item each conjunct in a condition must be significant with respect to the likelihood-ratio test (in our experiments we set significance threshold to $p=0.05$);
+  \item a conjunct can only specify the presence of a pattern (in other words, we only allow feature-value pairs with the value \textsf{true}).
+\end{itemize}
-\noindent The former two constraints are needed to induce good rules with significant patterns, while the latter constraint assures that rules mention only presence (and not absence) of patterns as reasons for a program to be incorrect. This is important, since conditions of I-rules ought to contain patterns symptomatic of incorrect programs.
+The first two constraints ensure good rules with only significant patterns, while the last constraint ensures rules only mention the presence (and not absence) of patterns as reasons for a program to be incorrect. This is important, since conditions in negative rules should contain patterns symptomatic of incorrect programs.
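A minimal sketch of the three rule-learner constraints listed above, for illustration only: the paper's implementation uses a CN2 variant from the Orange toolbox, whereas the Rule container and its fields below are assumptions (conjunct p-values are taken as precomputed by the learner).

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Rule:
    conjuncts: List[Tuple[str, bool]]   # (pattern, required value)
    accuracy: float                     # classification accuracy of the rule
    conjunct_p_values: List[float]      # likelihood-ratio p-value per conjunct

def acceptable(rule: Rule, min_accuracy: float = 0.90, alpha: float = 0.05) -> bool:
    # 1) the rule's classification accuracy must exceed the threshold
    if rule.accuracy < min_accuracy:
        return False
    # 2) every conjunct must be significant w.r.t. the likelihood-ratio test
    if any(p > alpha for p in rule.conjunct_p_values):
        return False
    # 3) conjuncts may only require the *presence* of a pattern
    return all(value is True for _, value in rule.conjuncts)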
-With respect to the second type of error, we could try the same approach and learn rules using the above algorithm for the class of correct programs (C-rules). Having accurate rules for correct programs, the conditional part of these rules would define sufficient combinations of patterns that render a program correct.
-It turns out that it is difficult to learn accurate rules for correct programs, because there are many programs that are incorrect despite having all important patterns, because they include also incorrect patterns.
+With respect to the second type of error, we could try the same approach and use the above algorithm to learn \emph{positive rules} for the class of correct programs.
+The conditional part of positive rules should define sufficient combinations of patterns that render a program correct.
+It turns out that it is difficult to learn accurate positive rules, because there are many programs that are incorrect despite having all important patterns, because they also include incorrect patterns.
-A possible way to solve this problem is to remove programs that are covered by rules for incorrect class. This way all known buggy patterns are removed from the data and will not be included in C-rules. However, removing incorrect patterns also removes the need for relevant patterns. For example, if all incorrect programs were removed, the single C-rule “$\mathsf{true} ⇒ \mathsf{correct}$” would suffice, which cannot be used to generate hints. We achieved the best results by learning from the complete data set, whereas the accuracy of rules was estimated on data without programs covered by I-rules.
+A possible way to solve this problem is to remove programs that are covered by some negative rule. This way all known buggy patterns are removed from the data, and will not be included in positive rules. However, removing incorrect patterns also removes the need for specifying relevant patterns in positive rules. For example, if all incorrect programs were removed, the single rule “$\mathsf{true} ⇒ \mathsf{correct}$” would suffice, which cannot be used to generate hints. We achieved the best results by learning positive rules from the complete data set, but estimating their accuracy only on programs not covered by some negative rule.
-Even though our main interest is discovery of patterns, we can still use induced rules to classify new programs, for example to evaluate the quality of rules. The classification procedure has three steps: 1) if an I-rule covers the program, classify it as incorrect; 2) else if a C-rule covers the program, classify it as correct; 3) otherwise, if no rule covers the program, classify it as incorrect.
+While our main interest is discovering important patterns, induced rules can still be used to classify new programs, for example to evaluate rule quality. Classification proceeds in three steps: 1) if a negative rule covers the program, classify it as incorrect; 2) else if a positive rule covers the program, classify it as correct; 3) otherwise, if no rule covers the program, classify it as incorrect.
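The three-step classification procedure in the last changed paragraph above could be sketched as follows; this is not the paper's code, and representing each rule as a plain set of required AST patterns (a rule covers a program when all of its patterns occur) is an assumption made to keep the example self-contained.

def classify(program_patterns, negative_rules, positive_rules):
    def covers(rule):
        return rule <= program_patterns                 # all rule patterns present
    if any(covers(rule) for rule in negative_rules):    # 1) negative rule -> incorrect
        return "incorrect"
    if any(covers(rule) for rule in positive_rules):    # 2) positive rule -> correct
        return "correct"
    return "incorrect"                                  # 3) no rule covers it -> incorrect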
 \subsection{Generating hints}
-Once we have induced the rules for a given problem, we can use them to provide hints based on buggy or missing patterns. To generate a hint for an incorrect program, each rule is considered in turn. We consider two types of feedback: \emph{buggy} hints based on I-rules, and \emph{intent} hints based on C-rules.
+Once we have induced the rules for a given problem, we can use them to provide hints based on buggy or missing patterns. To generate a hint for an incorrect program, each rule is considered in turn. We consider two types of feedback: \emph{buggy hints} based on negative rules, and \emph{intent hints} based on positive rules.
-First, all I-rules are checked to find any known incorrect patterns in the program. To find the most likely incorrect patterns, the rules are considered in the order of decreasing quality. If all patterns in the rule “$p_1 ∧ ⋯ ∧ p_k ⇒ \mathsf{incorrect}$” match, we highlight the relevant leaf nodes. As an aside, we found that most I-rules are based on the presence of a single pattern.
-For the incorrect \code{sum} program from the previous section, our method produces the following highlight
+First, all negative rules are checked to find any known incorrect patterns in the program. To find the most likely incorrect patterns, the rules are considered in the order of decreasing quality. If all patterns in the rule “$p_1 ∧ ⋯ ∧ p_k ⇒ \mathsf{incorrect}$” match, we highlight the corresponding leaf nodes. As a side note, we found that most negative rules are based on the presence of a single pattern. For the incorrect \code{sum} program from the previous section, our method produces the following highlight
 \begin{Verbatim}
 sum([],0). % \textit{base case:} the empty list sums to zero
@@ -63,11 +63,11 @@ sum([H|T],\red{\underline{Sum}}):- % \textit{recursive case:}
 \end{Verbatim}
 \noindent
-based on the rule “$p ⇒ \mathsf{incorrect}$”, where $p$ corresponds to the solid red pattern in Fig.~\ref{fig:sum}. This rule covers 36 incorrect programs, and one correct program using an unusual solution strategy.
+based on the rule “$p ⇒ \mathsf{incorrect}$”, where $p$ is the solid red pattern in Fig.~\ref{fig:sum}. This rule covers 36 incorrect programs, and one correct program using an unusual solution strategy.
-If no I-rule matches the program, we use C-rules to determine the student’s intent. C-rules group patterns that together indicate a high likelihood that the program is correct. Each C-rule thus defines a particular “solution strategy” in terms of AST patterns. We reason that alerting the student to a missing pattern could help them complete the program without revealing the whole solution.
+If no negative rule matches the program, we use positive rules to determine the student’s intent. Positive rules group patterns that together indicate a high likelihood that the program is correct. Each positive rule thus defines a particular “solution strategy” in terms of AST patterns. We reason that alerting the student to a missing pattern could help them complete the program without revealing the whole solution.
-When generating a hint from C-rules, we consider all \emph{partially matching} rules “$p_1 ∧ ⋯ ∧ p_k ⇒ \mathsf{correct}$”, where the student’s program matches some (but not all) patterns $p_i$. For each such rule we store the number of matching patterns, and the set of missing patterns. We then return the most common missing pattern among the rules with most matching patterns.
+When generating a hint from positive rules, we consider all \emph{partially matching} rules “$p_1 ∧ ⋯ ∧ p_k ⇒ \mathsf{correct}$”, where the student’s program matches some (but not all) patterns $p_i$. For each such rule we store the number of matching patterns, and the set of missing patterns. We then return the most common missing pattern among the rules with most matching patterns.
 For example, if we find the following missing pattern for an incorrect program implementing the \code{sister} predicate:
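The intent-hint selection described in the last hunk above could be sketched as follows; this is an illustration under the same assumptions as before (positive rules as plain sets of AST patterns), not the paper's implementation.

from collections import Counter

def intent_hint(program_patterns, positive_rules):
    partial = []
    for rule in positive_rules:
        matched = rule & program_patterns
        missing = rule - program_patterns
        if matched and missing:                      # partially matching rule
            partial.append((len(matched), missing))
    if not partial:
        return None                                  # no intent hint can be generated
    best = max(n for n, _ in partial)                # rules with most matching patterns
    counts = Counter(p for n, miss in partial if n == best for p in miss)
    return counts.most_common(1)[0][0]               # most common missing pattern

With the rules kept sorted by quality, the same representation also accommodates the buggy-hint step described earlier: the first negative rule whose patterns all match the program determines which leaf nodes to highlight.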