Include remaining reviewer comments

author: Timotej Lazar <timotej.lazar@fri.uni-lj.si> 2017-04-16 15:47:07 +0200
committer: Timotej Lazar <timotej.lazar@fri.uni-lj.si> 2017-04-16 15:47:07 +0200
commit: 4d161f0c8307a803ea023fb566dc1c772aa99d6b (patch)
tree: 3c5fa48fa6b983a620521d025afec93411aced0d
parent: 5bfbc0853c016e325ad5f85400487ffc32c0dc1a (diff)
3 files changed, 16 insertions, 8 deletions
diff --git a/paper/conclusion.tex b/paper/conclusion.tex
index 2bccbd3..eefa25b 100644
--- a/paper/conclusion.tex
+++ b/paper/conclusion.tex
@@ -4,11 +4,11 @@ We have used AST patterns as features to describe program structure. By encoding
 
 We presented a method for automatically extracting AST patterns from student programs. Our patterns encode relations between data objects in a program, with each pattern connecting either two instances of the same variable, a variable and a value, or two values. We consider such patterns as the atomic syntactic relations in a program, and use them as machine-learning features.
 
-We explained how patterns can be used as features to induce rules for classifying correct and incorrect programs. Since the goal of our research is to generate hints, we adapted the CN2 algorithm to produce rules useful for that purpose. We induce rules in two passes: we first learn the rules for incorrect programs, and then use the programs not covered by any such rule to learn the rules for correct programs.
+We explained how to induce rules for classifying correct and incorrect programs based on AST patterns. Since the goal of our research is to generate hints, we adapted the CN2 algorithm to produce rules useful for that purpose. We induce rules in two passes: we first learn the rules for incorrect programs, and then use programs not covered by any such rule to learn the rules for correct programs.
 
 Evaluation shows that our patterns are useful for classifying Prolog programs. Other programming languages will likely require different patterns. For example, in commonly taught imperative languages such as Python or Java, each variable can take on different values and appear in many places. Further research is needed to determine the kinds of patterns useful in such situations.
 
-We showed how to generate hints based on rules by highlighting buggy patterns or pointing out what patterns are missing. Evaluation on past student data shows that useful hints can be provided for many incorrect submissions this way. The quality of feedback could be improved by annotating rules with explanations in natural language. Furthermore, since patterns and rules are easily interpretable, they can also help when manually authoring tutoring systems, by indicating the common errors and the typical solution strategies for each problem.
+We showed how to generate hints based on rules by highlighting buggy patterns or pointing out what patterns are missing. Evaluation on past student data shows that useful hints can be provided for many incorrect submissions this way. The quality of feedback could be improved by annotating rules with explanations in natural language. Since patterns and rules are easily interpretable, they can also help when manually authoring tutoring systems, by indicating the common errors and the typical solution strategies for each problem.
 
 In the future we will attempt to improve rule accuracy for certain problems, such as \code{union}. This will likely necessitate new kinds of patterns, for example to handle the cut operator. Adapting our methods to handle Python programs will give us some insight into what kinds of patterns could be useful in different situations. Finally, we will implement hint generation in an online programming tutor CodeQ, and evaluate the effect of automatic feedback on students’ problem-solving.
 
diff --git a/paper/evaluation.tex b/paper/evaluation.tex
index 6b6c7a8..3b7f9fb 100644
--- a/paper/evaluation.tex
+++ b/paper/evaluation.tex
@@ -2,7 +2,7 @@
 We evaluated our approach on 44 programming assignments. We preselected 70\% of students
 whose submissions were used as learning data for rule learning. The submissions from
 the remaining 30\% of students were used as testing data to evaluate classification accuracy of learned rules,
-and to retrospectively evaluate quality of given hints.
+and to retrospectively evaluate quality of given hints. Problems analyzed in this paper constitute a complete introductory course in Prolog, covering the basics of the language.
 
 \begin{table}[t]
 \caption{Results on five selected domains and averaged results over 44 domains.
@@ -39,7 +39,7 @@ Table~\ref{table:eval} contains results on five selected problems (each represen
 which had the best overall performance, serve as references for bad and good CA on particular data sets.
 
 For example, our rules correctly classified 99\% of testing instances for the \code{sister} problem,
-the accuracy of the majority classifier was 66\%, and random forests achieved 98\%. CA of rules is also high for problems \code{del} and \code{sum}. It is lower, however, for \code{is\_sorted} and \code{union}, suggesting that the proposed set of AST patterns is insufficient for certain problems. Indeed, after analyzing the problem \code{is\_sorted},
+the accuracy of the majority classifier was 66\%, and random forests achieved 98\%. CA of rules is also high for problems \code{del} and \code{sum}. It is lower, however, for \code{is\_sorted} and \code{union}, suggesting that the proposed AST patterns are insufficient for certain problems. Indeed, after analyzing the problem \code{is\_sorted},
 we observed that our patterns do not cover predicates with a single empty-list (\code{[]}) argument, which occurs as the base case in this problem. For this reason, the rule learning
 algorithm failed to learn any positive rules and therefore all programs were classified as incorrect. In the case of \code{union}, many solutions use the cut (\code{!}) operator, which
 is also ignored by our pattern generation algorithm.
@@ -61,8 +61,7 @@ for the \code{is\_sorted} problem, because the algorithm could not learn any pos
 
 To sum up, buggy hints seem to be good and reliable, since they are always implemented when presented, even when we tested them on past data -- the decisions of students were not influenced by these hints. The percentage of implemented intent hints is, on average, lower (56\%), which is still not a bad result, providing that it is difficult to determine the programmer’s intent. In 12\% (244 out 2057) of generated intent hints, students implemented an alternative hint that was identified by our algorithm. Overall we were able to generate hints for 84.5\% of incorrect submissions. Of those hints, 86\% were implemented (73\% of all incorrect submissions).
 
-Last but not least, high classification accuracies in many problems imply that it is possible to correctly determine the correctness of a program by simply checking for the presence of a small number of patterns. Our hypothesis is that there exist some crucial patterns in programs that students have difficulties with. When they figure out these patterns, implementing the rest of the program is usually straightforward.
-
+High classification accuracies for many problems imply that it is possible to determine program correctness simply by checking for the presence of a small number of patterns. Our hypothesis is that for each program certain crucial patterns exist that students have difficulties with. When they figure out these patterns, implementing the rest of the program is usually straightforward.
 
 %%% Local Variables:
 %%% mode: latex
diff --git a/paper/method.tex b/paper/method.tex
index 43362da..0c3fdff 100644
--- a/paper/method.tex
+++ b/paper/method.tex
@@ -22,7 +22,7 @@ For each selected pair of leaf nodes $(a,b)$ we construct a pattern by walking t
 
 Patterns are extracted automatically given above constraints (each pattern connecting a pair of variables or values). We find that such patterns work well for Prolog. Other languages, however, will likely require different kinds of patterns to achieve good performance.
 
-Finally, to avoid inducing rules specific to a particular program (covering typos and other idiosyncratic mistakes), we ignore rare patterns. In this study we used patterns that occurred in at least five submissions. These patterns form the feature space for rule learning.
+In order to avoid inducing rules specific to a particular program (covering typos and other idiosyncratic mistakes), we ignore rare patterns. In this study we used patterns that occurred in at least five submissions. These patterns form the feature space for rule learning.
 
 \subsection{Learning rules}
 
@@ -49,6 +49,15 @@ A possible way to solve this problem is to remove programs that are covered by s
 
 While our main interest is discovering important patterns, induced rules can still be used to classify new programs, for example to evaluate rule quality. Classification proceeds in three steps: 1) if a negative rule covers the program, classify it as incorrect; 2) else if a positive rule covers the program, classify it as correct; 3) otherwise, if no rule covers the program, classify it as incorrect.
 
+We note that Prolog clauses can often be written in various ways. For example, the clause “\code{sum([],0).}” can also be written as
+
+\begin{Verbatim}
+sum(List,Sum):- List = [], Sum = 0.
+\end{Verbatim}
+
+\noindent
+Our method covers such variations by including additional patterns and rules. Another option would be to use rules in conjunction with program canonicalization, by transforming each submission into a semantically equivalent normalized form before extracting patterns~\cite{rivers2015data-driven}.
+
 \subsection{Generating hints}
 
 Once we have induced the rules for a given problem, we can use them to provide hints based on buggy or missing patterns. To generate a hint for an incorrect program, each rule is considered in turn. We consider two types of feedback: \emph{buggy hints} based on negative rules, and \emph{intent hints} based on positive rules.
@@ -78,7 +87,7 @@ For example, if we find the following missing pattern for an incorrect program i
 \noindent
 we could display a message to the student saying “comparison between \code{X} and some other value is missing”, or “your program is missing the goal \code{X} \code{\textbackslash{}=} \code{?}”.
 
-This method can find more than one missing pattern for a given partial program. In such cases we return the most commonly occurring pattern as the main hint, and other candidate patterns as alternative hints. We use main and alternative intent hints to establish the upper and lower bounds when evaluating automatic hints.
+This method can find several missing patterns for a given partial program. In such cases we return the most commonly occurring pattern as the main hint, and other candidate patterns as alternative hints. We use main and alternative intent hints to establish the upper and lower bounds when evaluating hints.
 
 
 %%% Local Variables:
author	Timotej Lazar <timotej.lazar@fri.uni-lj.si>	2017-04-16 15:47:07 +0200
committer	Timotej Lazar <timotej.lazar@fri.uni-lj.si>	2017-04-16 15:47:07 +0200
commit	4d161f0c8307a803ea023fb566dc1c772aa99d6b (patch)
tree	3c5fa48fa6b983a620521d025afec93411aced0d
parent	5bfbc0853c016e325ad5f85400487ffc32c0dc1a (diff)