Diffstat (limited to 'aied2017')
-rw-r--r--  aied2017/aied2017.tex   |  1
-rw-r--r--  aied2017/conclusion.tex |  8
-rw-r--r--  aied2017/evaluation.tex | 46
-rw-r--r--  aied2017/method.tex     |  6
-rw-r--r--  aied2017/patterns.tex   | 11
5 files changed, 42 insertions, 30 deletions
diff --git a/aied2017/aied2017.tex b/aied2017/aied2017.tex
index a253632..a7ed5e1 100644
--- a/aied2017/aied2017.tex
+++ b/aied2017/aied2017.tex
@@ -27,6 +27,7 @@
 \newcommand\code[1]{\texttt{#1}}
 \newcommand\red[1]{{\begingroup\color[rgb]{0.8,0.15,0.15}#1\endgroup}}
+\newcommand\green[1]{{\begingroup\color[rgb]{0.2,0.7,0.2}#1\endgroup}}
 \newcommand\hl[1]{\textbf{#1}}
 
 \begin{document}
diff --git a/aied2017/conclusion.tex b/aied2017/conclusion.tex
index e2cd71b..331424f 100644
--- a/aied2017/conclusion.tex
+++ b/aied2017/conclusion.tex
@@ -1,14 +1,14 @@
 \section{Conclusion}
 
-AST patterns can be used to describe structural information in programs’ ASTs. By encoding only relations between selected nodes, each pattern can match many programs. AST patterns thus function as “regular expressions for trees”. We have used patterns as features for machine learning.
+AST patterns can be used to describe the structure of a program’s AST. By encoding only relations between selected nodes, each pattern can match many programs. AST patterns thus function as “regular expressions for trees”.
 
 We presented a method for automatically extracting AST patterns from student programs. Our patterns encode relations between data objects in a program, with each pattern connecting either two instances of the same variable, a variable and a value, or two values. We consider such patterns as the atomic syntactic relations in a program, and use them as machine-learning features.
 
-Evaluation shows that our patterns are useful for classifying Prolog programs. It is likely, however, that other programming languages will require different kinds of patterns. In Python, for example, variables can take on different values and appear in many places. This will likely require patterns relating more than two instances of a variable, or multiple variables.
+We explained how patterns can be used as features to induce rules for classifying correct and incorrect programs. Since the goal of our research is to generate hints, we adapted the CN2 algorithm to produce rules useful for that purpose. We induce rules in two passes: we first learn the rules for incorrect programs, and then use the programs not covered by any such rule to learn the rules for correct programs.
 
-We explained how patterns can be used as features to induced rules for classifying correct and incorrect programs. Since the goal of our research is to generate hints, we adapted the CN2 algorithm to produce rules useful for that purpose. We induce rules in two passes: we first learn the I-rules for incorrect programs, and then use the programs not covered by any I-rule to learn the C-rules for correct programs.
+Evaluation shows that our patterns are useful for classifying Prolog programs. It is likely, however, that other programming languages will require different kinds of patterns. In Python, for example, variables can take on different values and appear in many places. This will likely require patterns relating more than two instances of a variable, or multiple variables.
 
-We showed how to generate hints based on I- and C-rules by highlighting buggy patterns or pointing out what patterns are missing. Evaluation on past student data shows that useful hints can be provided for many incorrect submissions this way. The quality of feedback could be improved by annotating rules with explanations in natural language. Furthermore, since patterns and rules are easily interpretable, they can also help when manually authoring tutoring systems, with I-rules indicating the common errors and C-rules the typical solution strategies for each problem.
+We showed how to generate hints based on rules by highlighting buggy patterns or pointing out what patterns are missing. Evaluation on past student data shows that useful hints can be provided for many incorrect submissions this way. The quality of feedback could be improved by annotating rules with explanations in natural language. Furthermore, since patterns and rules are easily interpretable, they can also help when manually authoring tutoring systems, by indicating the common errors and the typical solution strategies for each problem.
 
 In the future we will attempt to improve rule accuracy for certain problems, such as \code{union}. This will likely necessitate new kinds of patterns, for example to handle the cut operator. Adapting our methods to handle Python programs will give us some insight into what kinds of patterns could be useful in different situations. Finally, we will implement hint generation in an online programming tutor CodeQ, and evaluate the effect of automatic feedback on students’ problem-solving.
diff --git a/aied2017/evaluation.tex b/aied2017/evaluation.tex
index 247c08f..f81c2fd 100644
--- a/aied2017/evaluation.tex
+++ b/aied2017/evaluation.tex
@@ -1,8 +1,8 @@
 \section{Evaluation and discussion}
 
 We evaluated our approach on 44 programming assignments. We preselected 70\% of students whose submissions were used as learning data for rule learning. The submissions from
-the remaining 30\% of students were used as testing data to evaluate classification accuracy of learned rules
-and to retrospectively evaluate quality of given hints.
+the remaining 30\% of students were used as testing data to evaluate classification accuracy of learned rules,
+and to retrospectively evaluate quality of given hints.
 
 \begin{table}[t]
 \centering
@@ -12,7 +12,7 @@ and to retrospectively evaluate quality of given hints.
 \hline
  & Rules & Maj & RF & All & Imp & All & Imp & Alt & \\
 \hline
- sister & 0.988 & 0719 & 0.983 & 128 & 128 & 127 & 84 & 26 & 34 \\
+ sister & 0.988 & 0.719 & 0.983 & 128 & 128 & 127 & 84 & 26 & 34 \\
 del & 0.948 & 0.645 & 0.974 & 136 & 136 & 39 & 25 & 10 & 15 \\
 sum & 0.945 & 0.511 & 0.956 & 59 & 53 & 24 & 22 & 1 & 6 \\
 is\_sorted & 0.765 & 0.765 & 0.831 & 119 & 119 & 0 & 0 & 0 & 93 \\
@@ -24,7 +24,7 @@ and to retrospectively evaluate quality of given hints.
 Average & 0.857 & 0.663 & 0.908 & 79.73 & 77.34 & 46.75 & 26.36 & 5.55 & 23.75 \\
 \hline
 \end{tabular}
-\caption{Results on 5 selected domains and averaged results over 44 domains.
+\caption{Results on five selected domains and averaged results over 44 domains.
 Columns 2, 3, and 4 contain classification accuracies of our rule learning method, majority classifier, and random forest, respectively.
 Columns 5 and 6 report the number of all generated buggy hints and the number of hints that were actually implemented by students. The following three columns
@@ -35,32 +35,32 @@ The bottom two rows give aggregated results (total and average) over all 44 doma
 \label{table:eval}
 \end{table}
 
-Table~\ref{table:eval} contains results on five selected problems (each problem represents a
-different group of problems) and averaged results over all problems.\footnote{We report only a subset of results due to space restrictions. Full results and source code can be found at \url{https://ailab.si/ast-patterns/}. } The second, third, and fourth columns provide classification accuracies (CA) of rule-based classifier,
-majority classifier, and random forest on testing data. The majority classifier and the random forests method,
-which was the best performing machine learning algorithm over all problems, serve as references for bad and good CA on particular data sets. For example, in the case of the \texttt{sister} problem,
-our rules correctly classified 99\% of testing instances, the accuracy of majority classifier was 66\%, and the random forests achieved 98\%. The CA of rules is also high for problems \texttt{del} and \texttt{sum}, however it is lower in cases of \texttt{is\_sorted} and \texttt{union}, suggesting that the proposed set of AST patterns is insufficient in certain problems. Indeed, after analysing the problem \texttt{is\_sorted},
-we observed that our pattern set does not enclose patterns containing only empty list ``[]'' in a predicate, an important pattern occuring as a base case for this predicate. For this reason, the rule learning
-algorithm failed to learn any C-rules and therefore all programs were classified as incorrect. In the case of \texttt{union}, many solutions use the cut operator (exclamation mark), which
-is also ignored by our pattern generation algorithm.
+Table~\ref{table:eval} contains results on five selected problems (each representing a group of problems from one lab session), and averaged results over all 44 problems.\footnote{We report only a subset of results due to space restrictions. Full results and source code can be found at \url{https://ailab.si/ast-patterns/}. } The second, third, and fourth columns provide classification accuracies (CA) of the rule-based, majority, and random-forest classifiers on testing data. The majority classifier and the random forests method,
+which had the best overall performance, serve as references for bad and good CA on particular data sets.
 
-We evaluated the quality of hints on incorrect submissions from student traces
-that resulted in a correct program. In the case of the \texttt{sister} data set, there were 289 such incorrect submission out of all 403 submissions.
+For example, our rules correctly classified 99\% of testing instances for the \code{sister} problem,
+the accuracy of the majority classifier was 66\%, and random forests achieved 98\%. Rule CA is also high for problems \code{del} and \code{sum}. It is lower, however, for \code{is\_sorted} and \code{union}, suggesting that the proposed set of AST patterns is insufficient for certain problems. Indeed, after analyzing the problem \code{is\_sorted},
+we observed that our patterns do not cover predicates with a single empty-list (\code{[]}) argument, which occurs as the base case in this problem. For this reason, the rule learning
+algorithm failed to learn any C-rules and therefore all programs were classified as incorrect. In the case of \code{union}, many solutions use the cut (\code{!}) operator, which
+is also ignored by our pattern generation algorithm.
 
-The columns capped by ``Buggy hints'' in Table~\ref{table:eval} contain evaluation of hints generated from rules
-for incorrect class (I-rules). For each generated buggy hint we checked whether
-it was implemented by the student in the final submission. The column ``All'' is
-the number of all generated buggy hints and the column ``Imp'' is the number of
-implemented hints. The results show high relevance of generated buggy hints, as 97\% (3508 out of 3613) of them were implemented in the final solution.
+We evaluated the quality of hints on incorrect submissions from those student traces
+that resulted in a correct program. In the case of the \code{sister} data set, there were 289 such incorrect submissions out of 403 submissions in total.
 
-The intent hints are generated when the algorithm fails to find any buggy hints. The column ``All'' contains the number of generated intent hints, ``Imp'' the number of implemented main intent hints, while ``Alt'' is the number
+The columns capped by “Buggy hints” in Table~\ref{table:eval} contain evaluation of hints generated from rules
+for incorrect programs (I-rules). For each generated buggy hint we checked whether
+it was implemented by the student in the final submission. The column “All” is
+the number of all generated buggy hints, while the column “Imp” is the number of
+implemented hints. The results show high relevance of generated buggy hints, as 97\% (3508 out of 3613) of them were implemented in the final solution; in other words, the buggy pattern was removed.
+
+The intent hints are generated when the algorithm fails to find any buggy hints. The column “All” contains the number of generated intent hints, “Imp” the number of implemented main intent hints, and “Alt” is the number
 of implemented alternative hints. Notice that the percentage of implemented intent hints is significantly lower
-when compared to buggy hints: in the case of problem \texttt{sister} 84 out of 127 (66\%) hints were implemented, whereas in the case of problem \texttt{union} only 66 out of 182 (36\%) hints were implemented. On average, 56\% of main intent hints were implemented.
+when compared to buggy hints: in the case of problem \code{sister} 84 out of 127 (66\%) hints were implemented, whereas in the case of problem \code{union} only 66 out of 182 (36\%) hints were implemented. On average, 56\% of main intent hints were implemented.
 
 The last column shows the number of submissions where no hints could be generated. This value is relatively high
-for the ``is\_sorted'' problem, because the algorithm could not learn any C-rules and consequently no intent hints were generated.
+for the \code{is\_sorted} problem, because the algorithm could not learn any C-rules and consequently no intent hints were generated.
 
-To sum up, buggy hints seem to be good and reliable, since they are always implemented when presented, even when we tested them on past data - the decisions of students were not influenced by these hints. The percentage of implemented intent hints is, on average, lower (56\%), which is still not a bad result, providing that it is difficult to determine the intent of a programer. In 12\% (244 out 2057) where main intent hints were not implemented, students implemented an alternative hint that was identified by our algorithm. Generally, we were able to generate a hint in 84.5\% of cases. In 73\% of all cases, the hints were also implemented. Last but not least, high classification accuracies in many problems imply that it is possible to correctly determine the correctness of a program by simply verifying the presence of a small number of patterns. Our hypothesis is that there exist some crucial patterns in programs that students need to resolve. When they figure out these patterns, the implementation of the remaining of the program is often straightforward.
+To sum up, buggy hints seem to be good and reliable, since they are almost always implemented when presented, even when we tested them on past data -- the decisions of students were not influenced by these hints. The percentage of implemented intent hints is, on average, lower (56\%), which is still not a bad result, given that it is difficult to determine the programmer’s intent. In 12\% of cases (244 out of 2057) where main intent hints were not implemented, students implemented an alternative hint that was identified by our algorithm. Overall, we were able to generate a hint in 84.5\% of cases. In 73\% of all cases, the hints were also implemented. Last but not least, high classification accuracies in many problems imply that it is possible to correctly determine the correctness of a program by simply verifying the presence of a small number of patterns. Our hypothesis is that there exist some crucial patterns in programs that students need to resolve. When they figure out these patterns, implementing the rest of the program is often straightforward.
 
 %%% Local Variables:
diff --git a/aied2017/method.tex b/aied2017/method.tex
index f3606c0..55537ab 100644
--- a/aied2017/method.tex
+++ b/aied2017/method.tex
@@ -42,9 +42,9 @@ Patterns constructed in this way form the set of features for rule learning. To
 % \label{figure:algorithm}
 %\end{figure}
 
-We represent submitted programs in the feature space of AST patterns described above. One pattern corresponds to one boolean feature with value set to $true$ when pattern is present and $false$ when pattern is absent. We classify programs as correct or incorrect based on predefined unit tests for each problem, and use these labels for machine learning.
+We represent students’ programs in the feature space of AST patterns described above. Each pattern corresponds to one binary feature with value $true$ when the pattern is present and $false$ when it is absent. We classify programs as correct or incorrect based on predefined unit tests for each problem, and use these labels for machine learning.
 
-Since programs can be validated with appropriate unit tests, our goal is not classifying new submissions, but to discover patterns related to program correctness. This machine-learning approach is called \emph{descriptive induction} -- the automatic discovery of patterns describing regularities in data. We use rule learning for this task because conditions of rules can be easily translated to hints.
+Since programs can be validated with appropriate unit tests, our goal is not to classify new submissions, but to discover patterns related to program correctness. This approach to machine learning is called \emph{descriptive induction} -- the automatic discovery of patterns describing regularities in data. We use rule learning for this task because conditions of rules can be easily translated to hints.
 
 Before explaining the algorithm, let us discuss the reasons why a program can be incorrect. Our teaching experience indicates that bugs in student programs can often be described by 1) some incorrect or \emph{buggy} pattern, which needs to be removed, or 2) some missing relation (pattern) that should be included before the program can be correct. We shall now explain how both types of errors can be identified with rules.
@@ -88,6 +88,8 @@ For example, if we find the following missing pattern for an incorrect program i
 \noindent
 we could display a message to the student saying “you are missing a comparison between \code{X} and some other value, with the form \code{X} \code{\textbackslash{}=} \code{?}”.
 
+This method can find more than one missing pattern for a given partial program. In such cases we return the most commonly occurring pattern as the main hint, and other candidate patterns as alternative hints. We use main and alternative intent hints to establish the upper and lower bounds when evaluating automatic hints.
+
 %%% Local Variables:
 %%% mode: latex
diff --git a/aied2017/patterns.tex b/aied2017/patterns.tex
index 47fc9e2..11909bf 100644
--- a/aied2017/patterns.tex
+++ b/aied2017/patterns.tex
@@ -81,7 +81,16 @@ The second pattern in Fig.~\ref{fig:sister}, drawn with solid red arrows, encode
     (compound (functor \code{parent}) (args var)))
 \end{Verbatim}
 
-Patterns describe relations between nodes in a program’s AST. Specifically, the pattern ($a$ $b$ $c$) means that the nodes $b$ and $c$ are descended from $a$, and that $b$ precedes $c$ in a depth-first tree walk. In general, an AST matches the pattern (\textsf{name} $p_1$ … $p_k$) if it contains a node $n$ labeled \textsf{name}; the subtree rooted at $n$ must also contain, in depth-first order, distinct nodes $n_1$ to $n_k$ matching subpatterns $p_1$ to $p_k$.
+Patterns describe relations between nodes in a program’s AST. Specifically, the pattern ($a$ $b$ $c$) means that the nodes $b$ and $c$ are descended from $a$, and that $b$ precedes $c$ in a depth-first tree walk. In general, an AST matches the pattern (\textsf{name} $p_1$ … $p_k$) if it contains a node $n$ labeled \textsf{name}; the subtree rooted at $n$ must also contain, in depth-first order, distinct nodes $n_1$ to $n_k$ matching subpatterns $p_1$ to $p_k$. The above pattern, for example, matches only the last of the following programs (the first program is missing one call to \code{parent}, and the second has different variables in positions encoded by the pattern):
+
+\begin{Verbatim}
+\textit{\red{% nonmatching}}      \textit{\red{% nonmatching}}      \textit{\green{% matching}}
+sister(X,Y):-                     sister(X,Y):-                     sister(X,Y):-
+  female(X),                        female(X),                        parent(A,X),
+  parent(P,X),                      parent(A,X),                      female(X),
+  X \textbackslash{}= Y.            parent(B,Y),                      parent(A,Y),
+                                    X \textbackslash{}= Y.            X \textbackslash{}= Y.
+\end{Verbatim}
 
 A relation between any two objects in a program is insufficient to reason about the program’s behavior on the whole. In the tutoring context, however, there are patterns that strongly indicate the presence of certain bugs. Take for example the following incorrect program to sum a list:
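The matching relation defined in the patterns.tex hunk above can be made concrete with a short sketch. The following Prolog fragment is an illustration only, not code from this repository; it assumes ASTs are written as node(Label, Children) terms and patterns as pat(Label, Subpatterns) terms, names chosen here purely for the example.

% Sketch of the pattern-matching relation described in patterns.tex.
% Assumed (illustrative) encodings: an AST node is node(Label, Children),
% a pattern is pat(Label, Subpatterns); e.g. the pattern
% (compound (functor parent) (args var)) becomes
%   pat(compound, [pat(functor, [pat(parent, [])]), pat(args, [pat(var, [])])]).

% ast_match(+AST, +Pattern): Pattern matches at some node of AST.
ast_match(AST, Pattern) :-
    subtree(AST, Node),
    node_match(Node, Pattern).

% subtree(+AST, -Node): Node is AST itself or any of its descendants.
subtree(Node, Node).
subtree(node(_, Children), Node) :-
    member(Child, Children),
    subtree(Child, Node).

% node_match(+Node, +Pattern): Node carries the pattern's root label, and its
% descendants contain, in depth-first order, distinct nodes matching the subpatterns.
node_match(node(Label, Children), pat(Label, Subpats)) :-
    descendants(Children, Nodes),
    match_in_order(Nodes, Subpats).

% descendants(+Forest, -Nodes): all nodes of Forest in depth-first (preorder) order.
descendants([], []).
descendants([node(L, Cs)|Siblings], [node(L, Cs)|Nodes]) :-
    descendants(Cs, Below),
    descendants(Siblings, Rest),
    append(Below, Rest, Nodes).

% match_in_order(+Nodes, +Patterns): each pattern matches a distinct node,
% in the order the nodes appear; intervening nodes may be skipped.
match_in_order(_, []).
match_in_order([N|Ns], [P|Ps]) :-
    node_match(N, P),
    match_in_order(Ns, Ps).
match_in_order([_|Ns], Ps) :-
    Ps = [_|_],
    match_in_order(Ns, Ps).

% Example query (succeeds): the pattern quoted in the text matches an AST of a
% parent goal with two variable arguments.
% ?- ast_match(node(compound, [node(functor, [node(parent, [])]),
%                              node(args, [node(var, []), node(var, [])])]),
%              pat(compound, [pat(functor, [pat(parent, [])]),
%                             pat(args, [pat(var, [])])])).

Under these assumptions the relation enumerates, on backtracking, every way a pattern can be embedded in the AST; since the features described in method.tex are binary (pattern present or absent), a single solution is sufficient.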