summaryrefslogtreecommitdiff
path: root/paper
diff options
context:
space:
mode:
Diffstat (limited to 'paper')
-rw-r--r--paper/background.tex2
-rw-r--r--paper/conclusion.tex4
-rw-r--r--paper/introduction.tex4
-rw-r--r--paper/method.tex23
-rw-r--r--paper/patterns.tex6
5 files changed, 22 insertions, 17 deletions
diff --git a/paper/background.tex b/paper/background.tex
index 9457b35..08b30ca 100644
--- a/paper/background.tex
+++ b/paper/background.tex
@@ -7,7 +7,7 @@ Several programming tutors generate hints from differences between the student
Rivers and Koedinger compare students’ programs directly to a database of previous correct submissions~\cite{rivers2015data-driven}. They reduce program variability using equivalence-preserving transformations, such as inlining functions and reordering binary expressions. Hints are generated by suggesting a minimal correct step leading from the current submission to the closest correct program.
% unit tests
-Another option is to compare program behavior. Nguyen et al. classify programs according to results on a preselected set of test inputs~\cite{nguyen2014codewebs}. Li et al. generate test cases to distinguish between programs by selecting inputs that exercise different code paths in the program~\cite{li2016measuring}. Such tutors can point out pertinent failing test cases, which can be very helpful. However, to do any sort of conceptual analysis of student programs, we need to define some language for describing invariant properties of programs.
+Another option is to compare program behavior. Nguyen et al. classify programming mistakes according to results on a preselected set of test inputs~\cite{nguyen2014codewebs}. Li et al. generate test cases to distinguish between programs by selecting inputs that exercise different code paths in the program~\cite{li2016measuring}. Such tutors can point out pertinent failing test cases, which can be very helpful.
% constraints
\emph{Constraints}~\cite{mitrovic2012fifteen} encode domain principles using if-then rules with relevance and satisfaction conditions, e.g. “if a function has a non-void return type, then it must have a return statement”~\cite{holland2009j-latte}. If a program violates a constraint, the tutor displays a predefined message. Le’s Prolog tutor improves constraint-based diagnosis by assigning weights to different types of constraints~\cite{le2009using}.
diff --git a/paper/conclusion.tex b/paper/conclusion.tex
index 331424f..2bccbd3 100644
--- a/paper/conclusion.tex
+++ b/paper/conclusion.tex
@@ -1,12 +1,12 @@
\section{Conclusion}
-AST patterns can be used to describe the structure of a program’s AST. By encoding only relations between selected nodes, each pattern can match many programs. AST patterns thus function as “regular expressions for trees”.
+We have used AST patterns as features to describe program structure. By encoding only relations between particular nodes, each pattern can match many programs. AST patterns thus function as a sort of “regular expressions” for trees.
We presented a method for automatically extracting AST patterns from student programs. Our patterns encode relations between data objects in a program, with each pattern connecting either two instances of the same variable, a variable and a value, or two values. We consider such patterns as the atomic syntactic relations in a program, and use them as machine-learning features.
We explained how patterns can be used as features to induce rules for classifying correct and incorrect programs. Since the goal of our research is to generate hints, we adapted the CN2 algorithm to produce rules useful for that purpose. We induce rules in two passes: we first learn the rules for incorrect programs, and then use the programs not covered by any such rule to learn the rules for correct programs.
-Evaluation shows that our patterns are useful for classifying Prolog programs. It is likely, however, that other programming languages will require different kinds of patterns. In Python, for example, variables can take on different values and appear in many places. This will likely require patterns relating more than two instances of a variable, or multiple variables.
+Evaluation shows that our patterns are useful for classifying Prolog programs. Other programming languages will likely require different patterns. For example, in commonly taught imperative languages such as Python or Java, each variable can take on different values and appear in many places. Further research is needed to determine the kinds of patterns useful in such situations.
We showed how to generate hints based on rules by highlighting buggy patterns or pointing out what patterns are missing. Evaluation on past student data shows that useful hints can be provided for many incorrect submissions this way. The quality of feedback could be improved by annotating rules with explanations in natural language. Furthermore, since patterns and rules are easily interpretable, they can also help when manually authoring tutoring systems, by indicating the common errors and the typical solution strategies for each problem.
diff --git a/paper/introduction.tex b/paper/introduction.tex
index 13dc9ec..baf8ba0 100644
--- a/paper/introduction.tex
+++ b/paper/introduction.tex
@@ -1,7 +1,7 @@
\section{Introduction}
% why automatic feedback
-Programming education is becoming increasingly accessible with massive online courses. Since thousands of students can attend such courses, it is impossible for teachers to individually evaluate each participant’s work. On the other hand, in-time feedback that directly addresses students’ mistakes can aid the learning process. Providing feedback automatically could thus greatly enhance these courses.
+Programming education is becoming increasingly accessible with massive online courses. Since thousands of students can attend such courses, it is impossible for teachers to individually evaluate each participant’s work. On the other hand, in-time feedback directly addressing students’ mistakes can aid the learning process. Providing feedback automatically could thus greatly enhance these courses.
% ITS background
Traditional programming tutors use manually constructed domain models to generate feedback. Model-tracing tutors simulate the problem-solving \emph{process}: how students program. This is challenging because there are no well-defined steps when writing a program. Many tutors instead only analyze individual programs submitted by students, and disregard how a program evolved. They often use models coded in terms of constraints or bug libraries~\cite{keuning2016towards}.
@@ -10,7 +10,7 @@ Traditional programming tutors use manually constructed domain models to generat
Developing a domain model requires significant knowledge-engineering effort~\cite{folsom-kovarik2010plan}. This is particularly true for programming tutors, where most problems have several alternative solutions with many possible implementations~\cite{le2013operationalizing}. Data-driven tutors reduce the necessary effort by mining educational data -- often from online courses -- to learn common errors and generate feedback~\cite{rivers2015data-driven,nguyen2014codewebs,jin2012program}.
% problem statement
-This paper addresses the problem of finding useful features to support data mining in programming tutors. Features should be robust against irrelevant variations in program code (such as renaming a variable) and relatable to the knowledge components of the target skill (programming), so as to support hint generation.
+In this paper we address the problem of finding useful features to support data mining in programming tutors. To support hint generation, these features must be robust against irrelevant code variations (such as renaming a variable) and relatable to knowledge components of the target skill (programming).
% our approach: patterns + rules
We describe features with \emph{abstract-syntax-tree patterns} that encode relations between nodes in a program’s abstract syntax tree. We use patterns that describe a path between pairs of leaf nodes referring to variables or values. By omitting some nodes on these paths, patterns can match different programs containing the same relation. We then induce rules to predict program correctness from AST patterns, allowing us to generate hints based on missing or incorrect patterns.
diff --git a/paper/method.tex b/paper/method.tex
index 1cd3497..a4f7ae7 100644
--- a/paper/method.tex
+++ b/paper/method.tex
@@ -1,11 +1,13 @@
\section{Method}
-This section explains the three main components of our approach: automatically extracting patterns from student submissions, learning classification rules for correct and incorrect programs, and using those rules to generate hints.
+This section explains the three steps in our approach: discovering AST patterns, learning classification rules for correct and incorrect programs, and using those rules to generate hints.
\subsection{Extracting patterns}
\label{sec:extracting-patterns}
-We construct patterns by connecting pairs of leaf nodes in a program’s AST. For this paper we always select a pair of nodes from the same clause: either two nodes referring to the same variable (like the examples in Fig.~\ref{fig:sister}), or a value (such as the empty list \code{[]} or the number \code{0}) and another variable or value in the same \textsf{compound} or \textsf{binop} (like the blue dotted pattern in Fig.~\ref{fig:sum}). For example, in the clause (the second occurrence of each variable -- \code{A}, \code{B} and \code{C} -- is marked with~’ for disambiguation)
+We extract patterns from student submissions. As described above, we are only interested in patterns connecting pairs of leaf nodes in an AST: either two nodes referring to the same variable (like the examples in Fig.~\ref{fig:sister}), or a value (such as the empty list \code{[]} or the number \code{0}) and another variable/value occurring within the same \textsf{compound} or \textsf{binop} (like the blue dotted pattern in Fig.~\ref{fig:sum}).
+
+We induce patterns from such node pairs. Given the clause (the second occurrence of each variable -- \code{A}, \code{B} and \code{C} -- is marked with ’ for disambiguation)
\begin{Verbatim}
a(A,\textsf{ }B):-
@@ -16,31 +18,34 @@ a(A,\textsf{ }B):-
\noindent
we select the following pairs of nodes: \{\code{A},\,\code{A\textsf{'}}\}, \{\code{B},\,\code{B\textsf{'}}\}, \{\code{C},\,\code{C\textsf{'}}\}, \{\code{B\textsf{'}},\,\code{1}\} and \{\code{C\textsf{'}},\,\code{1}\}.
-For each selected pair of leaf nodes $(a,b)$ we build a pattern by walking the AST in depth-first order, and recording nodes that lie on the paths to $a$ and $b$. We omit \textsf{and} nodes, as explained in the previous section. We also include certain nodes that do not lie on a path to any selected leaf. Specifically, we include the functor or operator of all \textsf{compound}, \textsf{binop} or \textsf{unop} nodes containing $a$ or $b$.
+For each selected pair of leaf nodes $(a,b)$ we construct a pattern by walking the AST in depth-first order and recording nodes that lie on the paths to $a$ and $b$. We omit \textsf{and} nodes, as explained in the previous section. We also include certain nodes that lie near the paths to selected leaves. Specifically, we include the functor/operator of all \textsf{compound}, \textsf{binop} and \textsf{unop} nodes containing $a$ or $b$.
+
+Patterns are extracted automatically given above constraints (each pattern connecting a pair of variables or values). We find that such patterns work well for Prolog. Other languages, however, will likely require different kinds of patterns to achieve good performance.
-Patterns constructed in this way form the set of features for rule learning. To keep this set at a reasonable size, we only use patterns that have occurred at least five times in submitted programs.
+Finally, to avoid learning rules specific to a particular program (covering typos and other idiosyncratic mistakes), we ignore rare patterns. In this study we used patterns that occurred in at least five submissions. These patterns form the feature space for rule learning.
\subsection{Learning rules}
-We represent students’ programs in the feature space of AST patterns described above. Each pattern corresponds to one binary feature with value $true$ when the pattern is present and $false$ when it is absent. We classify programs as correct or incorrect based on predefined unit tests for each problem, and use these labels for machine learning.
+We represent students’ programs in the feature space of AST patterns described above. Each pattern corresponds to one binary feature with value \textsf{true} when the pattern is present and \textsf{false} when it is absent. We use unit testing to classify each program as correct if it passes all test cases, and incorrect otherwise. We use these labels for machine learning.
-Since programs can be validated with appropriate unit tests, our goal is not classifying new submissions, but rather to discover patterns associated with program correctness. This approach to machine learning is called \emph{descriptive induction} -- the automatic discovery of patterns describing regularities in data. We use rule learning for this task, because conditions of rules can be easily translated to hints.
+Since we can establish program correctness using appropriate unit tests, our goal here is not classifying new submissions. Instead, we wish to discover patterns associated with program correctness. This approach to machine learning is called \emph{descriptive induction} -- the automatic discovery of patterns describing regularities in data. We use rule learning for this task, because conditions of rules can be easily translated to hints.
Before explaining the algorithm, let us discuss the reasons why a program can be incorrect. Our experience indicates that bugs in student programs can often be described by 1) some incorrect or \emph{buggy} pattern, which needs to be removed, or 2) some missing relation (pattern) that should be included before the program can be correct. We shall now explain how both types of errors can be identified with rules.
To discover buggy patterns, the algorithm first learns rules that describe incorrect programs (I-rules). We use a variant of the CN2 algorithm~\cite{clark1991rule} implemented within the Orange data-mining toolbox~\cite{demsar2013orange}. Since we use rules to generate hints, and since hints should not be presented to students unless they are likely to be correct, we impose additional constraints on the rule learner:
+
\begin{enumerate}
\item The classification accuracy of each learned rule must exceed a threshold (we selected 90\%, as 10\% error seems acceptable for our application).
\item Each conjunct in a condition must be significant with respect to the likelihood-ratio test (in our experiments significance threshold was set to $p=0.05$).
- \item A conjunct can only specify the presence of a pattern: we allow feature-value pairs with only $true$ as value.
+ \item A conjunct can only specify the presence of a pattern: we allow feature-value pairs with only \textsf{true} as value.
\end{enumerate}
\noindent The former two constraints are needed to induce good rules with significant patterns, while the latter constraint assures that rules mention only presence (and not absence) of patterns as reasons for a program to be incorrect. This is important, since conditions of I-rules ought to contain patterns symptomatic of incorrect programs.
With respect to the second type of error, we could try the same approach and learn rules using the above algorithm for the class of correct programs (C-rules). Having accurate rules for correct programs, the conditional part of these rules would define sufficient combinations of patterns that render a program correct.
-It turns out that it is difficult to learn accurate rules for correct programs, because there are many programs that are incorrect despite having all important patterns, because they include also incorrect patterns.
+It turns out that it is difficult to learn accurate rules for correct programs, because there are many programs that are incorrect despite having all important patterns, because they include also incorrect patterns.
-A possible way to solve this problem is to remove programs that are covered by rules for incorrect class. This way all known buggy patterns are removed from the data and will not be included in C-rules. However, removing incorrect patterns also removes the need for relevant patterns. For example, if all incorrect programs were removed, the single C-rule “$\mathsf{true} ⇒ \mathsf{correct}$” would suffice, which cannot be used to generate hints. We achieved the best results by learning from the complete data set, whereas the accuracy of rules was estimated on data without programs covered by I-rules.
+A possible way to solve this problem is to remove programs that are covered by rules for incorrect class. This way all known buggy patterns are removed from the data and will not be included in C-rules. However, removing incorrect patterns also removes the need for relevant patterns. For example, if all incorrect programs were removed, the single C-rule “$\mathsf{true} ⇒ \mathsf{correct}$” would suffice, which cannot be used to generate hints. We achieved the best results by learning from the complete data set, whereas the accuracy of rules was estimated on data without programs covered by I-rules.
Even though our main interest is discovery of patterns, we can still use induced rules to classify new programs, for example to evaluate the quality of rules. The classification procedure has three steps: 1) if an I-rule covers the program, classify it as incorrect; 2) else if a C-rule covers the program, classify it as correct; 3) otherwise, if no rule covers the program, classify it as incorrect.
diff --git a/paper/patterns.tex b/paper/patterns.tex
index e1aa767..2937fd0 100644
--- a/paper/patterns.tex
+++ b/paper/patterns.tex
@@ -10,7 +10,7 @@ sister(X,Y):- % X is Y’s sister when:
X \textbackslash{}= Y. % X and Y are not the same person.
\end{Verbatim}
-Figure~\ref{fig:sister} shows the program’s AST with two patterns. The pattern drawn with blue dotted arrows encodes the fact that the first argument to the \code{sister} predicate also appears in the call to \code{female}. In other words, this pattern states that \code{X} must be female to be a sister. We write this pattern as the s-expression
+Figure~\ref{fig:sister} shows the program’s AST with two patterns. The pattern drawn with blue dotted arrows encodes the fact that the first argument to the \code{sister} predicate also appears as the first argument in the call to \code{female}. In other words, this pattern states that \code{X} must be female to be a sister. We write this pattern as the s-expression
\begin{Verbatim}[fontfamily=sf]
(clause (head (compound (functor ‘\code{sister}’) (args var)))
@@ -68,9 +68,9 @@ Figure~\ref{fig:sister} shows the program’s AST with two patterns. The pattern
\label{fig:sister}
\end{figure}
-Every pattern used in this paper has the same basic structure, and describes paths from a \textsf{clause} node to one or two leaf nodes containing variables or values. All patterns in Figs.~\ref{fig:sister} and~\ref{fig:sum} are induced from such node pairs. For each leaf we also include some local context, for example the predicate name (e.g. \texttt{parent}).
+Every pattern used in this paper has the same basic structure, and describes paths from a \textsf{clause} node to one or two leaf nodes containing variables or values. All patterns in Figs.~\ref{fig:sister} and~\ref{fig:sum} are induced from such pairs of nodes. For each leaf we also include some local context, such as the name of the predicate (e.g. \texttt{parent}) and the operators used in \textsf{unop} and \textsf{binop} nodes.
-We regard these patterns as the smallest units of meaning in Prolog programs: each pattern encodes some interaction between two parts of the program. Including more than two leaf nodes in a pattern could make it difficult to pinpoint the exact error when generating hints. Each pattern contains at most two \textsf{var} nodes, so we require they both refer to the same variable -- relating two nodes with different variables would not tell us much about the program. This allows us to omit variable names from patterns.
+We regard these patterns as the smallest units of meaning in Prolog programs: each pattern encodes some interaction between two objects (variable or value) in the program. Including more than two leaf nodes in a pattern could make it difficult to pinpoint the exact error when generating hints. Each pattern contains at most two \textsf{var} nodes, so we require they both refer to the same variable -- relating two nodes with different variables would not tell us much about the program. We can thus omit actual variable names from patterns.
We handle syntactic variations in programs by omitting certain nodes from patterns. For example, by not including \textsf{and} nodes, the above pattern can match a clause regardless of the presence (or order) of other goals in its body (i.e., with any arrangement of \textsf{and} nodes in the AST). Order \emph{is} important for the nodes specified in the pattern; this is explained below.