1 files changed, 15 insertions, 23 deletions
diff --git a/aied2017/method.tex b/aied2017/method.tex
index 797f625..57643db 100644
--- a/aied2017/method.tex
+++ b/aied2017/method.tex
@@ -1,26 +1,26 @@
 \section{Method}
 
-The following subsections explain the three main components of our approach: extracting patterns from student submissions, learning classification rules for correct and incorrect programs, and using those rules to generate hints.
+This section explains the three main components of our approach: extracting patterns from student submissions, learning classification rules for correct and incorrect programs, and using those rules to generate hints.
 
 \subsection{Extracting patterns}
 \label{sec:extracting-patterns}
 
-We extract patterns from student programs by selecting certain subsets of leaves in a program’s AST, and building up patterns that match nodes in those subsets. For this paper we always select pairs of nodes from the same clause: either two nodes referring to the same variable (like the examples above), or a value (such as \code{0} or the empty list \code{[]}) and another variable or value that occurrs in the same \code{compound} or \code{binop}. For example, in the clause\footnote{Occurrences of the three variables \code{A}, \code{B} and \code{C} are subscripted for disambiguation.}
+We construct patterns by connecting pairs of leaf nodes in a program’s AST. For this paper we always select a pair of nodes from the same clause: either two nodes referring to the same variable (like the examples in Fig.~\ref{fig:sister}), or a value (such as the empty list \code{[]} or \code{0}) and another variable or value in the same \textsf{compound} or \textsf{binop} (like the blue dotted pattern in Fig.~\ref{fig:sum}). For example, in the clause\footnote{The second occurrence of variables \code{A}, \code{B} and \code{C} is marked with ’ for disambiguation.}
 
 \begin{Verbatim}
-a(A\textsubscript{1},B\textsubscript{1}):-
-  b(A\textsubscript{2},C\textsubscript{1}),
-  B\textsubscript{2} is C\textsubscript{2} + 18.
+a(A,\textsf{ }B):-
+  b(A\textsf{'},\textsf{ }C),
+  B\textsf{'} is C\textsf{'}\textsf{ }+\textsf{ }18.
 \end{Verbatim}
 
 \noindent
-we would select the following sets of leaf nodes: \{\code{A\textsubscript{1}},\code{A\textsubscript{2}}\}, \{\code{B\textsubscript{1}},\code{B\textsubscript{2}}\}, \{\code{C\textsubscript{1}},\code{C\textsubscript{2}}\}, \{\code{B\textsubscript{2}},\code{18}\}, and \{\code{C\textsubscript{2}},\code{18}\}.
+we select the following pairs of nodes: \{\code{A},\,\code{A\textsf{'}}\}, \{\code{B},\,\code{B\textsf{'}}\}, \{\code{C},\,\code{C\textsf{'}}\}, \{\code{B\textsf{'}},\,\code{18}\} and \{\code{C\textsf{'}},\,\code{18}\}.
 
-We build a pattern for each set $S$ of selected leaf nodes by walking the AST in depth-first order, and recording nodes that lie on paths to elements of $S$. As explained above, we omit \code{and} nodes, allowing the pattern to generalize to more programs. Patterns also include certain nodes that do not lie on a path to any selected leaf. Specifically, for each included \code{compound} node we also include the corresponding \code{functor} with the predicate name. We also include the operator names (like \code{+} and \code{is}) for all unary and binary (\code{binop}) nodes in the pattern.
+For each selected pair of leaf nodes $(a,b)$ we build a pattern by walking the AST in depth-first order, and recording nodes that lie on the paths to $a$ and $b$. We omit \textsf{and} nodes, as explained in the previous section. We also include certain nodes that do not lie on a path to any selected leaf. Specifically, we include the functor or operator of all \textsf{compound}, \textsf{binop} or \textsf{unop} nodes containing $a$ or $b$.
 
 Patterns constructed in this way form the set of features for rule learning. To keep this set at a reasonable size, we only use patterns that have appeared in programs submitted by at least five students.
 
-\subsection{Learning rules for correct and incorrect programs}
+\subsection{Learning rules}
 
 \begin{figure}[t]
 \centering
@@ -37,10 +37,7 @@ Patterns constructed in this way form the set of features for rule learning. To
   \item Let $C-rules = learn\_rules(correct, P, P_c, 0.05, 0.9)$
   \item Return $I-rules$ and $C-rules$
  \end{enumerate}
- \caption{An outline of the algorithm for learning rules. The method $learn\_rules$,
- which induces rules  for a specific class, is a variant of the 
- CN2 algorithm~\cite{YYY} implemented within the Orange data-mining suite~\cite{XXX}. 
- In all our experiments, $sig$ was set to 0.05 and $acc$ was set to 0.9. }
+ \caption{An outline of the algorithm for learning rules. The method $learn\_rules$, which induces rules  for a specific class, is a variant of the CN2 algorithm~\cite{x} implemented within the Orange data-mining suite~\cite{demsar2013orange}. In all our experiments, $sig$ was set to 0.05 and $acc$ was set to 0.9.}
  \label{figure:algorithm}
 \end{figure}
 
@@ -48,21 +45,16 @@ Submitted programs are represented in the feature space of AST patterns describe
 
 Since programs can already be validated with appropriate unit tests, our goal is not classifying new submissions, but rather to discover patterns correlated with program correctness. This machine-learning approach is called \emph{descriptive induction} -- the automatic discovery of patterns describing regularities in data. We use rule learning for this task because rule-based models are easily comprehensible.
 
-Before explaining the algorithm, let us discuss the reasons why a program can be incorrect. Our teaching experience indicates that bugs in student programs can often be described by 1) some incorrect pattern, which needs to be removed, or 2) some missing relation (pattern) that should be included before the program can be correct. We shall now explain how both types of errors can be identified with rules.
+Before explaining the algorithm, let us discuss the reasons why a program can be incorrect. Our teaching experience indicates that bugs in student programs can often be described by 1) some incorrect or \emph{buggy} pattern, which needs to be removed, or 2) some missing relation (pattern) that should be included before the program can be correct. We shall now explain how both types of errors can be identified with rules.
 
-To discover patterns related to the first point, the algorithm first learns rules that describe incorrect programs. The conditions of these rules contain frequent patterns symptomatic of incorrect programs. Since rules are used to generate hints, and since hints should not be presented to students unless they are probably correct, we require that each learned rule's classification accuracy exceeds a certain threshold (in our experiments we used 90\%), each conjunct in a condition is significant with respect to the likelihood-ratio test (with $p=0.05$ in our experiments), and a conjunct can only specify the presence of a pattern. The former two constraints are needed to induce good rules with significant patterns, while the latter constraint assures that rules mention only presence (and not absence) of patterns as reasons for a program to be incorrect.
+To discover buggy patterns, the algorithm first learns rules that describe incorrect programs. The conditions of these rules contain frequent patterns symptomatic of incorrect programs. Since rules are used to generate hints, and since hints should not be presented to students unless they are probably correct, we require that each learned rule's classification accuracy exceeds a certain threshold (in our experiments we used 90\%), each conjunct in a condition is significant with respect to the likelihood-ratio test (with $p=0.05$ in our experiments), and a conjunct can only specify the presence of a pattern. The former two constraints are needed to induce good rules with significant patterns, while the latter constraint assures that rules mention only presence (and not absence) of patterns as reasons for a program to be incorrect.
 
-With respect to the second type of error, we could try the same approach and learn rules for the class of correct programs. Having accurate rules for correct programs, the conditional part of these rules would define sufficient groups of patterns that render a program correct. However, it turns out that it is difficult to learn accurate rules for correct programs, since these rules should contain all relevant patterns and prevent incorrect patterns, yet a conjunct can only specify the presence of a pattern. If specifying absence of patterns was allowed in rules' condition, the learning problem would still be difficult, since usually there are many incorrect patterns. A possible way to solve this problem is to learn from data set not containing programs that are covered by rules for incorrect class. This way all known incorrect patterns are removed from the data and are not needed anymore in conditions of rules. However, removing incorrect patterns also removes the need for relevant patterns. For example, if all incorrect programs were removed, a rule “$\mathsf{true} ⇒ \mathsf{correct}$” would suffice. Such rule does not contain any relevant patterns and could not be used to generate hints. We achieved the best results by learning from both data sets. The original data set (with all programs) is used to learn rules, while the filtered data are used to test whether a rule achieves the required classification accuracy (90\%).
+With respect to the second type of error, we could try the same approach and learn rules for the class of correct programs. Having accurate rules for correct programs, the conditional part of these rules would define sufficient groups of patterns that render a program correct. However, it turns out that it is difficult to learn accurate rules for correct programs, since these rules should contain all relevant patterns and prevent incorrect patterns, yet a conjunct can only specify the presence of a pattern. If specifying absence of patterns was allowed in rules' condition, the learning problem would still be difficult, since usually there are many incorrect patterns. A possible way to solve this problem is to learn from data set not containing programs that are covered by rules for incorrect class. This way all known incorrect patterns are removed from the data and are not needed anymore in conditions of rules. However, removing incorrect patterns also removes the need for relevant patterns. For example, if all incorrect programs were removed, the rule “$\mathsf{true} ⇒ \mathsf{correct}$” would suffice. Such rule does not contain any relevant patterns and could not be used to generate hints. We achieved the best results by learning from both data sets: the original data set (with all programs) is used to learn rules, while only programs without buggy patterns are used to test whether a rule achieves the required classification accuracy (90\%).
 
 Figure~\ref{figure:algorithm} contains an outline of the algorithm. The rules describing
-incorrect programs are called $I-rules$ and the rules for correct programs are called $C-rules$. 
+incorrect programs are called I-rules and the rules for correct programs are called C-rules.
 
-Even though our main interest is discovery of patterns, we can still use induced rules to classify
-new programs, for example to evaluate the quality of rules. The classification procedure has three steps: 
-first check whether a $I-rule$ covers the program that needs to be classified, and if it 
-does, classify it as incorrect. Then, check whether a $C-rule$ covers the program and classify 
-it as correct if it does. Otherwise, if none of the induced rules cover this program, classify it as 
-incorrect. 
+Even though our main interest is discovery of patterns, we can still use induced rules to classify new programs, for example to evaluate the quality of rules. The classification procedure has three steps: if an I-rule covers the program, classify it as incorrect; if a C-rule covers the program, classify it as correct; otherwise, if no rule covers the program, classify it as incorrect.
 
 \subsection{Generating hints}
 
@@ -79,7 +71,7 @@ sum([H|T],\red{\underline{Sum}}):-    % \textit{recursive case:}
 
 If no I-rule matches the program, we use C-rules to determine the student’s intent. C-rules group patterns that together indicate a high likelihood that the program is correct. Each C-rule thus defines a particular “solution strategy” defined in terms of AST patterns. We reason that a hint alerting the student to a missing pattern could help them complete the program without revealing the whole solution.
 
-When generating a hint from C-rules, we consider all \emph{partially} matching rules “$p_1 ∧ ⋯ ∧ p_k ⇒ \mathsf{correct}$”, where the student’s program matches some (but not all) patterns $p_i$. For each such rule we store the number of matching patterns, and the set of missing patterns. We then return the most common missing pattern among the rules with most matching patterns.
+When generating a hint from C-rules, we consider all \emph{partially matching} rules “$p_1 ∧ ⋯ ∧ p_k ⇒ \mathsf{correct}$”, where the student’s program matches some (but not all) patterns $p_i$. For each such rule we store the number of matching patterns, and the set of missing patterns. We then return the most common missing pattern among the rules with most matching patterns.
 
 For example, if we find the following missing pattern for an incorrect program implementing the \code{sister} predicate: