6 files changed, 46 insertions, 115 deletions
diff --git a/aied2017/aied2017.tex b/aied2017/aied2017.tex
index b9a9803..c8a6e9c 100644
--- a/aied2017/aied2017.tex
+++ b/aied2017/aied2017.tex
@@ -2,15 +2,15 @@
 
 \usepackage[utf8]{inputenc}
 
+% reclaim some plain-text sanity
 \usepackage{newunicodechar}
 \newunicodechar{∧}{\ensuremath{\land}}
 \newunicodechar{⇒}{\ensuremath{\Rightarrow}}
 \newunicodechar{⋯}{\ensuremath{\cdots}}
-% latex ∧ unicode ⇒ …
 
 \usepackage{fancyvrb}
 \fvset{commandchars=\\\{\},baselinestretch=0.98,samepage=true,xleftmargin=2.5mm}
-
+% WAT — I don’t even…
 \makeatletter
 \begingroup
 \catcode`\`=\active
@@ -23,6 +23,8 @@
 \usepackage{forest}
 \usetikzlibrary{arrows.meta}
 
+\usepackage{hyperref}
+
 \newcommand\code[1]{\texttt{#1}}
 \newcommand\red[1]{{\begingroup\color[rgb]{0.8,0.15,0.15}#1\endgroup}}
 \newcommand\hl[1]{\textbf{#1}}
diff --git a/aied2017/background.tex b/aied2017/background.tex
index 715267b..bb6bed3 100644
--- a/aied2017/background.tex
+++ b/aied2017/background.tex
@@ -1,89 +1,24 @@
 \section{Background}
 
-ITSs perform two main functions: suggesting problems for the student to solve, and providing immediate feedback as the student works on each problem~\cite{vanlehn2006behavior}. In this paper we focus on the latter: how to provide guidance to a student solving a programming exercise? Furthermore, we wish to do so in a data-driven fashion -- the domain model underlying our analysis should be learned from student data, without manual intervention for each problem.
-
-In order to mine data from past student submissions, we need to define code features or metrics that a) can easily be extracted from the code, b) carry information allowing us to distinguish programs, and c) are insensitive to superficial differences (such as different identifiers) between otherwise equivalent programs. This section reviews some of the existing approaches to defining such code features.
-
-% program-level features
-%Considering only entire programs, and not their individual parts, avoids this issue as we always have a “full picture” of a program. The main challenge then is to account for the superficial differences between programs.
- 
-Rivers and Koedinger reduce syntactic and semantic variations by \emph{canonicalizing} programs into normal form~\cite{rivers2015data-driven}. Among other transformations they inline functions, compute constant expressions and reorder commutative expressions. To generate hints, they find the correct program closest to the student’s submission, and construct a token-by-token edit path between those programs.
-Jin et al. do not compare programs directly, but construct a \emph{linkage matrix} for each program, describing data dependencies between individual statements in a program~\cite{jin2012program}. This yields a simplified representation that is robust against variations and can be used to extract crucial differences between programs.
-
 % solution strategies
-Several authors manually define \emph{plans} or \emph{strategies} to describe possible solutions for each problem~\cite{gerdes2016ask-elle,johnson1990understanding,hong2004guided}. These specifications are usually written in a domain-specific language and allow a tutor to discover intent from partial programs.
-
-% test-case outputs
-Another possibility is to disregard the program’s structure and instead focus on dynamic analysis. Nguyen et al. consider two programs equivalent when they return the same outputs for a large battery of test cases~\cite{nguyen2014codewebs}. Li et al. generate informative test cases dynamically by selecting inputs that exercise different program paths~\cite{li2016measuring}.
+Several programming tutors generate hints from differences between the student’s program and a predefined set of possible solutions. The possible solution strategies for each problem can be given as a set of programs, or specified in a domain-specific language. Both Johnson’s Pascal tutor~\cite{johnson1990understanding} and Hong’s Prolog tutor~\cite{hong2004guided} perform hierarchical goal decomposition based on predefined programming \emph{plans} or \emph{techniques} to determine the student’s intent. Gerdes et al. use a small set of annotated model programs to derive solution strategies, which function in a similar way~\cite{gerdes2016ask-elle}.
 
-% proper features
-In constraint-based modeling pioneered by Ohlsson and Mitrovic, student submission are analyzed in terms of invariant domain principles~\cite{mitrovic2012fifteen}. Constraints typically take the form of if-then rules, such as “if a function has a non-void return type, then it must contain a return statement”~\cite{holland2009j-latte}. Tutors using this approach look for unsatisfied constraints in submitted programs to generate feedback.
-% TODO note pattern hierarchy used in Le’s tutor
-Le’s Prolog tutor improves constraint-based diagnosis by assigning weights to different types of constraints~\cite{le2009using}.
+% canonicalization
+Rivers and Koedinger compare students’ programs directly to a database of previous correct submissions~\cite{rivers2015data-driven}. They reduce program variability using equivalence-preserving transformations, such as inlining functions and reordering binary expressions. Hints are generated by suggesting a minimal correct step leading from the current submission to the closest correct program.
 
-% code- / syntax-based features
-The methods discussed so far only consider entire programs (after normalization), or focus on epiphenomenal features such as run-time behavior. Defining program features in terms of the program’s code or structure is difficult, as each part of a program can potentially affect -- or be affected by -- any other part. In other words, a statement that is correct in the context of one submission might be erroneous in another.
+% unit tests
+Above approaches directly compare ASTs, so they avoid the need for defining program features. Another option is to compare program behavior. Nguyen et al. use results from on a battery of unit tests to cluster incorrect programs~\cite{nguyen2014codewebs}. Li et al. generate test cases to distinguish between programs by selecting inputs that exercise different code paths in the program~\cite{li2016measuring}. Such tutors can point out pertinent failing test cases, which can be very helpful. However, to do any sort of conceptual analysis of student programs, we need to define some language for describing invariant properties of programs.
 
-% text / token runs
-Several attempts have been made at defining such features. The simplest option is to consider fragments of program text; the MOSS system for detecting plagiarism is based on $k$-grams~\cite{schleimer2003winnowing}. CCFinder looks for similar token sequences in different programs to discover code clones~\cite{kamiya2002ccfinder}. OverCode uses individual lines of code as features, allowing a teacher to visualize the similarities between many student submissions~\cite{glassman2015overcode}.
+% constraints
+\emph{Constraints}~\cite{mitrovic2012fifteen} encode domain principles using if-then rules with relevance and satisfaction conditions, e.g. “if a function has a non-void return type, then it must have a return statement”~\cite{holland2009j-latte}. If a program violates a constraint, the tutor displays a predefined message. Le’s Prolog tutor improves constraint-based diagnosis by assigning weights to different types of constraints~\cite{le2009using}.
 
-% features in programming ITS should be meaningful
-While features defined in terms of strings or tokens are easy to implement, they lack the structural information that would allow us to reason about the program’s behavior. Our goal is to define features that capture individual concepts or recurring patterns in student submissions. Nguyen et al. used the Codewebs tool to discover the different AST subtrees that have the same function in a given context~\cite{nguyen2014codewebs}.
+% data-driven tutors
+Jin et al. use \emph{linkage graphs} to describe data dependencies between the program’s statements~\cite{jin2012program}; we use AST patterns in a similar way.
+Nguyen et al. analyzed submissions in a large machine-learning course to learn a vocabulary of \emph{code phrases}: subtrees of submissions’ abstract syntax trees that perform the same function in a given context. By swapping parts between different programs, they built up a search library of functionally equivalent AST subtrees within a given context.
 
 % tree-regular expressions
-The features used in the present paper are inspired by \emph{tree regular expressions} used in the Tregex tool by Levy and Andrew~\cite{levy2006tregex}, and the \emph{trx} system by Bagrak and Shivers~\cite{bagrak2004trx}. Extending ordinary regular expressions to trees allows us to define structural patterns with varying levels of specificity.
-
-
-%% dynamic / temporal programming model
-%The first major ITS paradigm employs a \emph{dynamic} domain model to support hint generation. Here, the problem-solving process is typically modeled as a search through a state space of partial and complete solutions. Solving a problem is represented by a sequence of well-defined steps from the initial state to one of the possible goal states. Cognitive and other model-tracing tutors use this approach~\cite{anderson1995cognitive,vanlehn2005andes}.
-%
-%% dynamic model: data-driven
-%The Hint Factory uses student data to model the solution space as a Markov decision process~\cite{barnes2008toward}. Several data-driven tutors~\cite{jin2012program,rivers2013automatic,piech2015autonomously} adapt and extend this approach to the programming domain. This allows them to generate hints for errors that students have made (and fixed) in the past.
-%
-%% dynamic model: issues
-%One of the issues when developing a dynamic domain model for programming is the lack of directly observable, interpretable problem-solving steps. Some tutors like the Lisp-ITS deal with this by restricting student input: students must write programs in top-to-bottom order, and fix any erroneous symbols before continuing. Data-driven tutors typically only include submissions -- versions of the program the student submitted for testing -- in the solution space, and do not ascribe any meaning (in domain concepts) to transitions between submissions.
-%
-%%E.g. model-tracing tutors. Not well-suited for programming tutors because:
-%%  - no directly interpretable problem-solving steps
-%%  - student traces are basically insane, so learning from them is not a good idea anyway (Piech 2015)
-%%  - human tutors are usually able to comment without referring to previous program versions
-%%Since there are no meaningful observable actions in a free-form programming tutor, the “programming steps” in these models are simply arrows connecting programs in the solution space of incorrect and correct submissions. Commonly occurring steps are arrows spanning pairs of submissions that were often observed in sequence.
-%
-%% static programming model
-%The other major tutoring paradigm uses a \emph{static} domain model. Instead of representing the entire problem-solving process, this approach only models individual solution states. Since a) programming steps are difficult to define and observe, and b) human teachers are often able to resolve student errors based only on the current program, many programming tutors opt for a static domain model.
-%%Avoid the need to understand student actions by only considering each submission individually.
-%
-%% static model: constraint-based modeling
-%Probably the most commonly used approach is constraint-based modeling~\cite{mitrovic2012fifteen}, where programs are checked against a set of constraints. Constraints typically take the form of rules like “if a function has a non-void return type, then it must contain a return statement”~\cite{holland2009j-latte}. Tutors using this approach base hints on unsatisfied constraints in a submitted program. Le’s Prolog tutor improves constraint-based diagnosis by assigning different weights to constraints~\cite{le2009using}.
-%
-%% static model: ad hoc
-%While model-tracing and CBM are established paradigms with a somewhat standardized programming models, many tutors use ad hoc models. These can be defined manually or learned using machine-learning techniques.
-%% - plan-based
-%Johnson identified the key concepts when writing a Pascal program for a single exercise, encoding them into programming plans that allow discovering student intentions and diagnosing errors~\cite{johnson1990understanding}.
-%%  - grammar-based (techniques) (Brna/Huang/Le)
-%Hong’s Prolog tutor uses a similar approach to analyze programs, based on programming techniques~\cite{brna1991prolog}.
-%%  - strategies (Gerdes)
-%Similarly, Ask-Elle by Gerdes et al. uses manually defined programming strategies to automatically derive possible solutions for a given Haskell problem~\cite{gerdes2016ask-elle}.
-%
-%% data-driven models
-%Data-driven methods have also been used for constructing static programming models. The idea is to extract information from a large set of correct and incorrect submissions; the main challenge is how to distinguish superficial variations between programs from semantically important differences.
-%% canonicalization (Rivers 2015)
-%Rivers et al. reduce syntactic and semantic variations by \emph{canonicalizing} programs into a normal form: inlining functions, computing constant expressions, reordering commutative expressions, and so on~\cite{rivers2015data-driven}. To generate hints, they find the correct program closest to the student’s submission, and construct a token-by-token edit path between those programs.
-%%  AST subtrees (Nguyen 2014)
-%Nguyen et al. analyzed submissions in a machine-learning course to learn a vocabulary of \emph{code phrases}: subtrees of submissions’ abstract syntax trees that perform the same function in a given context.
-
+The idea for AST patterns comes from Tregex -- \emph{tree regular expressions}, mainly used in the field of natural-language processing~\cite{levy2006tregex}. Tregex patterns can encode complex relations between nodes, but can become unwieldy; in this paper we use a simpler s-expression syntax. Another language for describing tree patterns using s-expressions is \emph{trx}, which additionally supports choice, repetition and other operators~\cite{bagrak2004trx}.
 
-%Generate hints based on the differences between incorrect and correct programs. Must distinguish between syntactic (irrelevant) and semantic (relevant) variations.
-%  - n-grams or other purely textual attributes
-%    - mostly used for plagiarism (MOSS) / code clone detection 
-%      
-%** Authoring tools (10%)
-%An interpretable model can help with classifying errors and discovering concepts.
-%  - reference?
-%
-%** Regex for trees (20%)
-%Syntactic n-grams (Sidorov).
-%Tgrep / trx for structural patterns in NLP - use the same for describing code!
 
 %%% Local Variables:
 %%% mode: latex
diff --git a/aied2017/evaluation.tex b/aied2017/evaluation.tex
index 50e9eb4..bd2f168 100644
--- a/aied2017/evaluation.tex
+++ b/aied2017/evaluation.tex
@@ -34,9 +34,7 @@ The last row averages results over all 44 domains.}
 \end{table} 
 
 Table~\ref{table:eval} contains results on five selected problems (each problem represents a 
-different group of problems) and averaged results over all problems.\footnote{We report only a subset of results due to space
-restrictions. The table with results for all 44 problems can be found at 
-\url{www.ailab.si/aied2017}. } The second, third, and fourth columns provide classification accuracies (CA) of rule-based classifier,
+different group of problems) and averaged results over all problems.\footnote{We report only a subset of results due to space restrictions. Full results and source code can be found at \url{https://ailab.si/ast-patterns/}. } The second, third, and fourth columns provide classification accuracies (CA) of rule-based classifier,
 majority classifier, and random forest on testing data. The majority classifier and the random forests method,
 which was the best performing machine learning algorithm over all problems, serve as lower and upper bound 
 for CA. For example, in the case of ``sister'' problem, 
@@ -78,4 +76,7 @@ Our hypothesis is that these patterns represent the parts of the problem solutio
 difficulties with. 
 
 
-
+%%% Local Variables:
+%%% mode: latex
+%%% TeX-master: "aied2017"
+%%% End:
+\ No newline at end of file
diff --git a/aied2017/introduction.tex b/aied2017/introduction.tex
index 44fc8df..790d1e7 100644
--- a/aied2017/introduction.tex
+++ b/aied2017/introduction.tex
@@ -13,7 +13,7 @@ Developing the domain model requires significant knowledge-engineering effort~\c
 This paper addresses the problem of finding useful features to support data mining in programming tutors. Features should be robust against superficial or irrelevant variations in program code, and relatable to the knowledge components of the target skill (programming), so as to support hint generation.
 
 % our approach: patterns + rules
-We describe features with \emph{patterns} that encode relations between variables in a program’s abstract syntax tree (AST). Patterns describe paths between certain “interesting” leaf nodes. By omitting some nodes on these paths, patterns match different programs containing the same relation. We then induce rules to predict program correctness based on AST patterns, allowing us to generate hints based on missing or incorrect patterns.
+We describe features with \emph{AST patterns} that encode relations between nodes in a program’s abstract syntax tree. We use patterns that describe a path between pairs of leaf nodes referring to variables or values. By omitting some nodes on these paths, patterns can match different programs containing the same relation. We then induce rules to predict program correctness based on AST patterns, allowing us to generate hints based on missing or incorrect patterns.
 
 % evaluation
 We evaluated our approach on existing Prolog programs submitted by students during past lab sessions of a second-year university course. Rule classification accuracy ranged between 85\% and 99\% for most problems. For 75\% of incorrect submissions we are able to suggest potentially useful patterns -- those that the student had actually implemented in the final, correct program.
diff --git a/aied2017/method.tex b/aied2017/method.tex
index 797f625..57643db 100644
--- a/aied2017/method.tex
+++ b/aied2017/method.tex
@@ -1,26 +1,26 @@
 \section{Method}
 
-The following subsections explain the three main components of our approach: extracting patterns from student submissions, learning classification rules for correct and incorrect programs, and using those rules to generate hints.
+This section explains the three main components of our approach: extracting patterns from student submissions, learning classification rules for correct and incorrect programs, and using those rules to generate hints.
 
 \subsection{Extracting patterns}
 \label{sec:extracting-patterns}
 
-We extract patterns from student programs by selecting certain subsets of leaves in a program’s AST, and building up patterns that match nodes in those subsets. For this paper we always select pairs of nodes from the same clause: either two nodes referring to the same variable (like the examples above), or a value (such as \code{0} or the empty list \code{[]}) and another variable or value that occurrs in the same \code{compound} or \code{binop}. For example, in the clause\footnote{Occurrences of the three variables \code{A}, \code{B} and \code{C} are subscripted for disambiguation.}
+We construct patterns by connecting pairs of leaf nodes in a program’s AST. For this paper we always select a pair of nodes from the same clause: either two nodes referring to the same variable (like the examples in Fig.~\ref{fig:sister}), or a value (such as the empty list \code{[]} or \code{0}) and another variable or value in the same \textsf{compound} or \textsf{binop} (like the blue dotted pattern in Fig.~\ref{fig:sum}). For example, in the clause\footnote{The second occurrence of variables \code{A}, \code{B} and \code{C} is marked with ’ for disambiguation.}
 
 \begin{Verbatim}
-a(A\textsubscript{1},B\textsubscript{1}):-
-  b(A\textsubscript{2},C\textsubscript{1}),
-  B\textsubscript{2} is C\textsubscript{2} + 18.
+a(A,\textsf{ }B):-
+  b(A\textsf{'},\textsf{ }C),
+  B\textsf{'} is C\textsf{'}\textsf{ }+\textsf{ }18.
 \end{Verbatim}
 
 \noindent
-we would select the following sets of leaf nodes: \{\code{A\textsubscript{1}},\code{A\textsubscript{2}}\}, \{\code{B\textsubscript{1}},\code{B\textsubscript{2}}\}, \{\code{C\textsubscript{1}},\code{C\textsubscript{2}}\}, \{\code{B\textsubscript{2}},\code{18}\}, and \{\code{C\textsubscript{2}},\code{18}\}.
+we select the following pairs of nodes: \{\code{A},\,\code{A\textsf{'}}\}, \{\code{B},\,\code{B\textsf{'}}\}, \{\code{C},\,\code{C\textsf{'}}\}, \{\code{B\textsf{'}},\,\code{18}\} and \{\code{C\textsf{'}},\,\code{18}\}.
 
-We build a pattern for each set $S$ of selected leaf nodes by walking the AST in depth-first order, and recording nodes that lie on paths to elements of $S$. As explained above, we omit \code{and} nodes, allowing the pattern to generalize to more programs. Patterns also include certain nodes that do not lie on a path to any selected leaf. Specifically, for each included \code{compound} node we also include the corresponding \code{functor} with the predicate name. We also include the operator names (like \code{+} and \code{is}) for all unary and binary (\code{binop}) nodes in the pattern.
+For each selected pair of leaf nodes $(a,b)$ we build a pattern by walking the AST in depth-first order, and recording nodes that lie on the paths to $a$ and $b$. We omit \textsf{and} nodes, as explained in the previous section. We also include certain nodes that do not lie on a path to any selected leaf. Specifically, we include the functor or operator of all \textsf{compound}, \textsf{binop} or \textsf{unop} nodes containing $a$ or $b$.
 
 Patterns constructed in this way form the set of features for rule learning. To keep this set at a reasonable size, we only use patterns that have appeared in programs submitted by at least five students.
 
-\subsection{Learning rules for correct and incorrect programs}
+\subsection{Learning rules}
 
 \begin{figure}[t]
 \centering
@@ -37,10 +37,7 @@ Patterns constructed in this way form the set of features for rule learning. To
   \item Let $C-rules = learn\_rules(correct, P, P_c, 0.05, 0.9)$
   \item Return $I-rules$ and $C-rules$
  \end{enumerate}
- \caption{An outline of the algorithm for learning rules. The method $learn\_rules$,
- which induces rules  for a specific class, is a variant of the 
- CN2 algorithm~\cite{YYY} implemented within the Orange data-mining suite~\cite{XXX}. 
- In all our experiments, $sig$ was set to 0.05 and $acc$ was set to 0.9. }
+ \caption{An outline of the algorithm for learning rules. The method $learn\_rules$, which induces rules  for a specific class, is a variant of the CN2 algorithm~\cite{x} implemented within the Orange data-mining suite~\cite{demsar2013orange}. In all our experiments, $sig$ was set to 0.05 and $acc$ was set to 0.9.}
  \label{figure:algorithm}
 \end{figure}
 
@@ -48,21 +45,16 @@ Submitted programs are represented in the feature space of AST patterns describe
 
 Since programs can already be validated with appropriate unit tests, our goal is not classifying new submissions, but rather to discover patterns correlated with program correctness. This machine-learning approach is called \emph{descriptive induction} -- the automatic discovery of patterns describing regularities in data. We use rule learning for this task because rule-based models are easily comprehensible.
 
-Before explaining the algorithm, let us discuss the reasons why a program can be incorrect. Our teaching experience indicates that bugs in student programs can often be described by 1) some incorrect pattern, which needs to be removed, or 2) some missing relation (pattern) that should be included before the program can be correct. We shall now explain how both types of errors can be identified with rules.
+Before explaining the algorithm, let us discuss the reasons why a program can be incorrect. Our teaching experience indicates that bugs in student programs can often be described by 1) some incorrect or \emph{buggy} pattern, which needs to be removed, or 2) some missing relation (pattern) that should be included before the program can be correct. We shall now explain how both types of errors can be identified with rules.
 
-To discover patterns related to the first point, the algorithm first learns rules that describe incorrect programs. The conditions of these rules contain frequent patterns symptomatic of incorrect programs. Since rules are used to generate hints, and since hints should not be presented to students unless they are probably correct, we require that each learned rule's classification accuracy exceeds a certain threshold (in our experiments we used 90\%), each conjunct in a condition is significant with respect to the likelihood-ratio test (with $p=0.05$ in our experiments), and a conjunct can only specify the presence of a pattern. The former two constraints are needed to induce good rules with significant patterns, while the latter constraint assures that rules mention only presence (and not absence) of patterns as reasons for a program to be incorrect.
+To discover buggy patterns, the algorithm first learns rules that describe incorrect programs. The conditions of these rules contain frequent patterns symptomatic of incorrect programs. Since rules are used to generate hints, and since hints should not be presented to students unless they are probably correct, we require that each learned rule's classification accuracy exceeds a certain threshold (in our experiments we used 90\%), each conjunct in a condition is significant with respect to the likelihood-ratio test (with $p=0.05$ in our experiments), and a conjunct can only specify the presence of a pattern. The former two constraints are needed to induce good rules with significant patterns, while the latter constraint assures that rules mention only presence (and not absence) of patterns as reasons for a program to be incorrect.
 
-With respect to the second type of error, we could try the same approach and learn rules for the class of correct programs. Having accurate rules for correct programs, the conditional part of these rules would define sufficient groups of patterns that render a program correct. However, it turns out that it is difficult to learn accurate rules for correct programs, since these rules should contain all relevant patterns and prevent incorrect patterns, yet a conjunct can only specify the presence of a pattern. If specifying absence of patterns was allowed in rules' condition, the learning problem would still be difficult, since usually there are many incorrect patterns. A possible way to solve this problem is to learn from data set not containing programs that are covered by rules for incorrect class. This way all known incorrect patterns are removed from the data and are not needed anymore in conditions of rules. However, removing incorrect patterns also removes the need for relevant patterns. For example, if all incorrect programs were removed, a rule “$\mathsf{true} ⇒ \mathsf{correct}$” would suffice. Such rule does not contain any relevant patterns and could not be used to generate hints. We achieved the best results by learning from both data sets. The original data set (with all programs) is used to learn rules, while the filtered data are used to test whether a rule achieves the required classification accuracy (90\%).
+With respect to the second type of error, we could try the same approach and learn rules for the class of correct programs. Having accurate rules for correct programs, the conditional part of these rules would define sufficient groups of patterns that render a program correct. However, it turns out that it is difficult to learn accurate rules for correct programs, since these rules should contain all relevant patterns and prevent incorrect patterns, yet a conjunct can only specify the presence of a pattern. If specifying absence of patterns was allowed in rules' condition, the learning problem would still be difficult, since usually there are many incorrect patterns. A possible way to solve this problem is to learn from data set not containing programs that are covered by rules for incorrect class. This way all known incorrect patterns are removed from the data and are not needed anymore in conditions of rules. However, removing incorrect patterns also removes the need for relevant patterns. For example, if all incorrect programs were removed, the rule “$\mathsf{true} ⇒ \mathsf{correct}$” would suffice. Such rule does not contain any relevant patterns and could not be used to generate hints. We achieved the best results by learning from both data sets: the original data set (with all programs) is used to learn rules, while only programs without buggy patterns are used to test whether a rule achieves the required classification accuracy (90\%).
 
 Figure~\ref{figure:algorithm} contains an outline of the algorithm. The rules describing
-incorrect programs are called $I-rules$ and the rules for correct programs are called $C-rules$. 
+incorrect programs are called I-rules and the rules for correct programs are called C-rules.
 
-Even though our main interest is discovery of patterns, we can still use induced rules to classify
-new programs, for example to evaluate the quality of rules. The classification procedure has three steps: 
-first check whether a $I-rule$ covers the program that needs to be classified, and if it 
-does, classify it as incorrect. Then, check whether a $C-rule$ covers the program and classify 
-it as correct if it does. Otherwise, if none of the induced rules cover this program, classify it as 
-incorrect. 
+Even though our main interest is discovery of patterns, we can still use induced rules to classify new programs, for example to evaluate the quality of rules. The classification procedure has three steps: if an I-rule covers the program, classify it as incorrect; if a C-rule covers the program, classify it as correct; otherwise, if no rule covers the program, classify it as incorrect.
 
 \subsection{Generating hints}
 
@@ -79,7 +71,7 @@ sum([H|T],\red{\underline{Sum}}):-    % \textit{recursive case:}
 
 If no I-rule matches the program, we use C-rules to determine the student’s intent. C-rules group patterns that together indicate a high likelihood that the program is correct. Each C-rule thus defines a particular “solution strategy” defined in terms of AST patterns. We reason that a hint alerting the student to a missing pattern could help them complete the program without revealing the whole solution.
 
-When generating a hint from C-rules, we consider all \emph{partially} matching rules “$p_1 ∧ ⋯ ∧ p_k ⇒ \mathsf{correct}$”, where the student’s program matches some (but not all) patterns $p_i$. For each such rule we store the number of matching patterns, and the set of missing patterns. We then return the most common missing pattern among the rules with most matching patterns.
+When generating a hint from C-rules, we consider all \emph{partially matching} rules “$p_1 ∧ ⋯ ∧ p_k ⇒ \mathsf{correct}$”, where the student’s program matches some (but not all) patterns $p_i$. For each such rule we store the number of matching patterns, and the set of missing patterns. We then return the most common missing pattern among the rules with most matching patterns.
 
 For example, if we find the following missing pattern for an incorrect program implementing the \code{sister} predicate:
 
diff --git a/aied2017/patterns.tex b/aied2017/patterns.tex
index 32796ac..25a75e8 100644
--- a/aied2017/patterns.tex
+++ b/aied2017/patterns.tex
@@ -52,8 +52,8 @@ Figure~\ref{fig:sister} shows the program’s AST with two patterns. The pattern
 \path[-{Latex[length=1.5mm,width=1mm]},thick,relative,red] (g2) edge[transform canvas={xshift=1.5mm}] (g2a);
 \path[-{Latex[length=1.5mm,width=1mm]},thick,relative,red] (g2a) edge[transform canvas={xshift=-1.2mm}] (g2av);
 % second pattern
-\path[-{Latex[length=1.5mm,width=1mm]},thick,dotted,relative,blue] (top) edge[out=-15,in=-155] (h);
-\path[-{Latex[length=1.5mm,width=1mm]},thick,dotted,relative,blue] (h) edge[transform canvas={xshift=-1.5mm}] (hc);
+\path[-{Latex[length=1.5mm,width=1mm]},thick,dotted,relative,blue] (top) edge[transform canvas={xshift=-0.8mm,yshift=0.8mm}] (h);
+\path[-{Latex[length=1.5mm,width=1mm]},thick,dotted,relative,blue] (h) edge[transform canvas={xshift=-1mm}] (hc);
 \path[-{Latex[length=1.5mm,width=1mm]},thick,dotted,relative,blue] (hc) edge[transform canvas={xshift=-1.5mm}] (hf);
 \path[-{Latex[length=1.5mm,width=1mm]},thick,dotted,relative,blue] (hf) edge[transform canvas={xshift=-1.1mm}] (hfl);
 \path[-{Latex[length=1.5mm,width=1mm]},thick,dotted,relative,blue] (hc) edge[transform canvas={xshift=1.5mm}] (ha);
@@ -64,7 +64,7 @@ Figure~\ref{fig:sister} shows the program’s AST with two patterns. The pattern
 \path[-{Latex[length=1.5mm,width=1mm]},thick,dotted,relative,blue] (g3) edge[transform canvas={xshift=1.5mm}] (g3a);
 \path[-{Latex[length=1.5mm,width=1mm]},thick,dotted,relative,blue] (g3a) edge[transform canvas={xshift=-1.2mm}] (g3av);
   \end{forest}
-  \caption{Abstract syntax tree for the \code{sister(X,Y)} program, showing two patterns and the leaf nodes that induce them. Solid red pattern relates the first argument to the two calls to \code{parent(X,Y)}. Dotted blue pattern encodes the necessary condition that \code{X} must be female to be a sister.}
+  \caption{The AST for the \code{sister} program, showing two patterns and the leaf nodes inducing them. Solid red arrows equate the first arguments in the two calls to \code{parent}. Dotted blue arrows encode the necessary condition that \code{X} must be female to be a sister.}
   \label{fig:sister}
 \end{figure}
 
@@ -90,7 +90,7 @@ sum([H|T],Sum):-    % \textit{recursive case:}
   Sum is Sum + H.   %  add first element (\textit{bug:} reused variable)
 \end{Verbatim}
 
-This error is fairly common with Prolog novices: the variable \code{Sum} is used to represent both the sum of the whole list, and the sum of only the tail elements. The last line fails since Prolog cannot unify \code{Sum} with a (usually) different value of \code{Sum + H}. The program’s AST is displayed in Fig.~\ref{fig:sum}.
+This error is fairly common with Prolog novices: the variable \code{Sum} is used to represent both the sum of the whole list, and the sum of only the tail elements. The last line fails since Prolog cannot unify \code{Sum} with a (usually) different value of \code{Sum\,+\,H}. The program’s AST is displayed in Fig.~\ref{fig:sum}.
 
 \begin{figure}[htbp]
   \centering
@@ -114,7 +114,7 @@ for tree={
       [compound,name=c2c [functor,name=c2cf [sum,name=c2cfv]]
         [args,name=c2ca1 [var [T]] [args,name=c2ca2 [var,name=c2ca2v [Sum,draw,rectangle,thick,red]]]]]
       [binop,name=c2b [var,name=c2bv [Sum,draw,rectangle,thick,dashed,orange]] [is,name=c2bo]
-        [binop,name=c2bb [var,name=c2bbv [Sum,draw,rectangle,thick,dashed,orange]] [+,name=c2bbo] [var [H]]]]]]]]
+        [binop,name=c2bb [var,name=c2bbv [Sum,draw,rectangle,thick,dashed,orange]] [+,name=c2bbo] [var [H]]]]]]]
 % first pattern
 \path[-{Latex[length=1.5mm,width=1mm]},thick,relative,dotted,line width=0.4mm,blue] (c1) edge[transform canvas={xshift=-1mm}] (c1h);
 \path[-{Latex[length=1.5mm,width=1mm]},thick,relative,dotted,line width=0.4mm,blue] (c1h) edge[transform canvas={xshift=-1mm}] (c1hc);
@@ -128,11 +128,11 @@ for tree={
 \path[-{Latex[length=1.5mm,width=1mm]},thick,relative,red] (c2) edge[transform canvas={xshift=-0.8mm,yshift=0.8mm}] (c2h);
 \path[-{Latex[length=1.5mm,width=1mm]},thick,relative,red] (c2h) edge[transform canvas={xshift=-1mm}] (c2hc);
 \path[-{Latex[length=1.5mm,width=1mm]},thick,relative,red] (c2hc) edge[transform canvas={xshift=-1.2mm}] (c2hf);
-\path[-{Latex[length=1.5mm,width=1mm]},thick,relative,red] (c2hf) edge[transform canvas={xshift=-1mm}] (c2hfv);
+\path[-{Latex[length=1.5mm,width=1mm]},thick,relative,red] (c2hf) edge[transform canvas={xshift=-0.8mm,yshift=0.8mm}] (c2hfv);
 \path[-{Latex[length=1.5mm,width=1mm]},thick,relative,red] (c2hc) edge[transform canvas={xshift=1.2mm}] (c2ha1);
 \path[-{Latex[length=1.5mm,width=1mm]},thick,relative,red] (c2ha1) edge[transform canvas={xshift=1.2mm}] (c2ha2);
 \path[-{Latex[length=1.5mm,width=1mm]},thick,relative,red] (c2ha2) edge[transform canvas={xshift=1mm}] (c2ha2v);
-\path[-{Latex[length=1.5mm,width=1mm]},thick,relative,red] (c2) edge[transform canvas={xshift=-1mm}] (c2c);
+\path[-{Latex[length=1.5mm,width=1mm]},thick,relative,red] (c2) edge[out=-15,in=-170] (c2c);
 \path[-{Latex[length=1.5mm,width=1mm]},thick,relative,red] (c2c) edge[transform canvas={xshift=-1.2mm}] (c2cf);
 \path[-{Latex[length=1.5mm,width=1mm]},thick,relative,red] (c2cf) edge[transform canvas={xshift=-1mm}] (c2cfv);
 \path[-{Latex[length=1.5mm,width=1mm]},thick,relative,red] (c2c) edge[transform canvas={xshift=1.2mm}] (c2ca1);
@@ -146,7 +146,7 @@ for tree={
 \path[-{Latex[length=1.5mm,width=1mm]},thick,relative,dashed,orange] (c2bb) edge[transform canvas={xshift=-1mm}] (c2bbv);
 \path[-{Latex[length=1.5mm,width=1mm]},thick,relative,dashed,orange] (c2bb) edge[transform canvas={xshift=0.9mm}] (c2bbo);
   \end{forest}
-  \caption{AST for the \texttt{sum} program.}
+  \caption{AST for the buggy \texttt{sum} program. Dotted arrows relate the correct values in the base case. Solid and dashed arrows denote two patterns describing incorrect reuse of the \code{Sum} variable.}
   \label{fig:sum}
 \end{figure}
 
@@ -164,7 +164,7 @@ The second pattern, drawn with dashed orange arrows in the figure, indicates the
 (clause (binop (var \code{Sum}) \code{is} (binop (var \code{Sum}) \code{+})))
 \end{Verbatim}
 
-The leftmost pattern, drawn with dotted blue arrows in Fig.~\ref{fig:sum}, describes the correct relation between the two constants in the base-case rule:
+The leftmost pattern in Fig.~\ref{fig:sum}, drawn with dotted blue arrows, describes the correct relation between the two constants in the base-case rule:
 
 \begin{Verbatim}[fontfamily=sf]
 (clause (head (compound (functor \code{sum}) (args \code{[]} (args \code{0})))))
@@ -173,6 +173,7 @@ The leftmost pattern, drawn with dotted blue arrows in Fig.~\ref{fig:sum}, descr
 \noindent
 We include such patterns in our feature set to cover the base-case clauses in recursive programs, which often include no variables.
 
+While the patterns used in this paper appear to be useful for analyzing Prolog programs, it is likely that other kinds of patterns will be needed for other programming languages. In Python, for example, variables can take on different values and be accessed from many places. This will likely require patterns relating more than two instances of a variable, or multiple variables.
 
 %%% Local Variables:
 %%% mode: latex