diff options
-rw-r--r--   aied2017/introduction.tex | 11
-rw-r--r--   aied2017/method.tex       | 54
2 files changed, 18 insertions, 47 deletions
diff --git a/aied2017/introduction.tex b/aied2017/introduction.tex
index 2b2362d..44fc8df 100644
--- a/aied2017/introduction.tex
+++ b/aied2017/introduction.tex
@@ -1,13 +1,13 @@
 \section{Introduction}
 
 % why automatic feedback
-Programming education is becoming increasingly accessible with massive online courses. Since thousands of students can attend such courses, it is impossible for teachers to individually evaluate each participant’s work. On the other hand, in-time feedback directly addressing student’s errors can aid the learning process. Providing feedback automatically could thus greatly enhance these courses.
+Programming education is becoming increasingly accessible with massive online courses. Since thousands of students can attend such courses, it is impossible for teachers to individually evaluate each participant’s work. On the other hand, in-time feedback that directly addresses students’ mistakes can aid the learning process. Providing feedback automatically could thus greatly enhance these courses.
 
 % ITS background
-Traditional intelligent tutoring systems use manually constructed domain models to generate feedback. Cognitive tutors model the problem-solving \emph{process}: how students write programs. This is challenging because there are no well-defined steps when writing a program (as there are in chess, for example). Many tutors instead only analyze individual versions of the student’s program, and disregard its evolution. These models are often coded in terms of constraints or bug libraries~\cite{keuning2016towards}.
+Traditional programming tutors use manually constructed domain models to generate feedback. Model-tracing tutors simulate the problem-solving \emph{process}: how students write programs. This is challenging because there are no well-defined steps when programming (as there are, for example, in chess). Many tutors instead only analyze individual programs submitted by students, and disregard how a program evolved. They often use models coded in terms of constraints or bug libraries~\cite{keuning2016towards}.
 
 % data-driven domain modeling
-Developing the domain model requires significant knowledge-engineering effort~\cite{folsom-kovarik2010plan}. This is particularly true for programming tutors, where most problems have several alternative solutions and many possible implementations~\cite{le2013operationalizing}. Data-driven tutors reduce the necessary effort by mining educational data -- often from online courses -- to learn how to fix common errors and generate feedback.
+Developing the domain model requires significant knowledge-engineering effort~\cite{folsom-kovarik2010plan}. This is particularly true for programming tutors, where most problems have several alternative solutions with many possible implementations~\cite{le2013operationalizing}. Data-driven tutors reduce the necessary effort by mining educational data -- often from online courses -- to learn common errors and generate feedback~\cite{rivers2015data-driven,nguyen2014codewebs,jin2012program}.
 
 % problem statement
 This paper addresses the problem of finding useful features to support data mining in programming tutors. Features should be robust against superficial or irrelevant variations in program code, and relatable to the knowledge components of the target skill (programming), so as to support hint generation.
@@ -15,8 +15,11 @@ This paper addresses the problem of finding useful features to support data mini
 % our approach: patterns + rules
 We describe features with \emph{patterns} that encode relations between variables in a program’s abstract syntax tree (AST). Patterns describe paths between certain “interesting” leaf nodes. By omitting some nodes on these paths, patterns match different programs containing the same relation. We then induce rules to predict program correctness based on AST patterns, allowing us to generate hints based on missing or incorrect patterns.
 
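The pattern mechanism described in the hunk above -- a path between selected leaf nodes, with some intermediate nodes omitted so that one pattern matches many concrete programs -- can be sketched in a few lines. This is only an illustration under assumed names: the tuple-based AST and the node labels are invented here, while the paper works with Prolog ASTs.

```python
# A minimal sketch of AST patterns: a pattern is a sequence of node labels
# taken from a root-to-leaf path, with some intermediate nodes omitted.
# A pattern matches a tree if its labels appear, in order, on some path.
# (Hypothetical representation -- the paper's patterns are over Prolog ASTs.)

def paths(tree, prefix=()):
    """Yield every root-to-leaf path as a tuple of node labels."""
    label, children = tree[0], tree[1:]
    if not children:
        yield prefix + (label,)
    for child in children:
        yield from paths(child, prefix + (label,))

def is_subsequence(pattern, path):
    """True if `pattern` appears in `path` in order (gaps allowed)."""
    it = iter(path)
    return all(label in it for label in pattern)

def matches(pattern, tree):
    return any(is_subsequence(pattern, p) for p in paths(tree))

# Toy AST for the Prolog clause  a(X) :- b(X).
ast = ("clause",
       ("head", ("functor", ("a",)), ("args", ("var", ("X",)))),
       ("body", ("functor", ("b",)), ("args", ("var", ("X",)))))

print(matches(("clause", "head", "var", "X"), ast))    # True
print(matches(("clause", "body", "functor", "c"), ast))  # False
```

Omitting the `args` node in the first pattern is what lets the same pattern match programs whose intermediate structure differs.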
+% evaluation
+We evaluated our approach on existing Prolog programs submitted by students during past lab sessions of a second-year university course. Rule classification accuracy ranged between 85\% and 99\% for most problems. For 75\% of incorrect submissions we are able to suggest potentially useful patterns -- those that the student had actually implemented in the final, correct program.
+
 % contributions
-The main contributions presented in this paper are: AST patterns as features for machine-learning, a rule-based model for predicting program correctness, and hints generated from incorrect or missing patterns in student programs.
+The main contributions presented in this paper are: AST patterns as features for machine learning, a rule-based model for predicting program correctness, and hints generated from incorrect or missing patterns in student programs.
 
 %%% Local Variables:
 %%% mode: latex
diff --git a/aied2017/method.tex b/aied2017/method.tex
index e70d18a..797f625 100644
--- a/aied2017/method.tex
+++ b/aied2017/method.tex
@@ -21,6 +21,7 @@ We build a pattern for each set $S$ of selected leaf nodes by walking the AST in
 Patterns constructed in this way form the set of features for rule learning. To keep this set at a reasonable size, we only use patterns that have appeared in programs submitted by at least five students.
 
 \subsection{Learning rules for correct and incorrect programs}
+
 \begin{figure}[t]
 \centering
 \begin{enumerate}
@@ -43,53 +44,20 @@ Patterns constructed in this way form the set of features for rule learning. To
 \label{figure:algorithm}
 \end{figure}
 
-As described in the previous section, the feature space contains all frequent AST patterns.
-The submitted programs are then represented in this feature space and classified as either
-correct, if they pass all prespecified tests, or incorrect otherwise. Such data
-serves as the data set for machine learning.
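The feature-space construction described in the hunk above (one binary attribute per pattern, keeping only patterns that appear in programs of at least five distinct students) might look roughly as follows. The data layout and names are illustrative assumptions, not the authors' implementation.

```python
# Sketch: build a binary feature space from frequent AST patterns.
# `submissions` holds (student_id, set_of_patterns, is_correct) triples;
# a pattern becomes a feature only if programs of at least `min_students`
# distinct students contain it (the paper uses five).
from collections import defaultdict

def build_feature_space(submissions, min_students=5):
    students_with = defaultdict(set)
    for student, patterns, _ in submissions:
        for p in patterns:
            students_with[p].add(student)
    # Keep only sufficiently frequent patterns, in a fixed order.
    features = sorted(p for p, s in students_with.items()
                      if len(s) >= min_students)
    # Each program becomes a binary vector plus its correctness label.
    rows = [([int(p in patterns) for p in features], correct)
            for _, patterns, correct in submissions]
    return features, rows

# Toy data: 6 students, threshold lowered to 3 for the example.
subs = [(i, {"p1", "p2"} if i % 2 else {"p1"}, i % 2 == 0) for i in range(6)]
features, rows = build_feature_space(subs, min_students=3)
print(features)  # ['p1', 'p2']
```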
-
-However, since we are always able to validate a program with unit tests,
-the goal of machine learning is not to classify new programs, but to discover patterns
-that are correlated with correctness of programs. This approach of machine learning is
-referred to as descriptive induction, that is the automatic discovery of patterns describing
-regularities in data. We use rule learning for this task since rule-based models are easy to comprehend.
-
-Before we can explain the algorithm, we need to discuss the reasons why a program
-can be incorrect. From our previous pedagogical experience, a student program is
-incorrect 1) if some incorrect pattern is present, which needs to be removed, or 2) if the
-program lacks a certain relation (pattern) that should be included before the program can be correct.
-We shall now explain how both types of errors can be identified with rules.
-
-To discover patterns related to the first point, the algorithm first learns rules that describe
-incorrect programs. The conditions of these rules contain frequent patterns that symptom
-incorrect programs. Since rules are used to generate hints and since hints should not be
-presented to students unless they are probably correct, we require that each learned rule's
-classification accuracy exceeds a certain threshould (in our experiments we used 90\%),
-each conjunct in a condition is significant with respect to the likelihood-ratio test (with $p=0.05$ in our experiments),
-and a conjunct can only specify the presence of a pattern. The former two constraints are needed
-to induce good rules with significant patterns, while the latter constraint assures that rules mention
-only presence (and not absence) of patterns as reasons for a program to be incorrect.
-
-With respect to the second type of error, we could try the same approach and learn rules for the class of
-correct programs. Having accurate rules for correct programs, the conditional part of these rules would
-define sufficient groups of patterns that render a program correct. However, it turns out that
-it is difficult to learn accurate rules for correct programs, since these rules should contain all relevant patterns
-and prevent incorrect patterns, yet a conjunct can only
-specify the presence of a pattern. If specifying absence of patterns was allowed in rules' condition,
-the learning problem would still be difficult, since usually there are many incorrect patterns.
-A possible way to solve this problem is to learn from data set not containing programs that are covered by rules for incorrect class.
-This way all known incorrect patterns are removed from the data and are not needed anymore in conditions of rules.
-However, removing incorrect patterns also removes the need for relevant patterns. For example, if all incorrect programs were removed,
-a rule ``IF True then class=correct'' would suffice.
-Such rule does not contain any relevant patterns and could not be used to generate hints.
-We achieved the best results by learning from both data sets. The original data set (with all programs)
-is used to learn rules, while the filtered data are used to test whether a rule achieves the
-required classification accuracy (90\%).
+Submitted programs are represented in the feature space of AST patterns described above. We classify programs as correct or incorrect based on predefined unit tests for each problem, and use these labels for machine learning.
+
+Since programs can already be validated with appropriate unit tests, our goal is not to classify new submissions but to discover patterns correlated with program correctness. This machine-learning approach is called \emph{descriptive induction} -- the automatic discovery of patterns describing regularities in data. We use rule learning for this task because rule-based models are easily comprehensible.
+
+Before explaining the algorithm, let us discuss the reasons why a program can be incorrect. Our teaching experience indicates that bugs in student programs can often be described by 1) some incorrect pattern, which needs to be removed, or 2) some missing relation (pattern) that should be included before the program can be correct. We shall now explain how both types of errors can be identified with rules.
+
+To discover patterns related to the first point, the algorithm first learns rules that describe incorrect programs. The conditions of these rules contain frequent patterns symptomatic of incorrect programs. Since rules are used to generate hints, and since hints should not be presented to students unless they are probably correct, we require that each learned rule's classification accuracy exceeds a certain threshold (in our experiments we used 90\%), each conjunct in a condition is significant with respect to the likelihood-ratio test (with $p=0.05$ in our experiments), and a conjunct can only specify the presence of a pattern. The former two constraints are needed to induce good rules with significant patterns, while the latter constraint ensures that rules mention only the presence (and not the absence) of patterns as reasons for a program to be incorrect.
+
+With respect to the second type of error, we could try the same approach and learn rules for the class of correct programs. Having accurate rules for correct programs, the conditional part of these rules would define sufficient groups of patterns that render a program correct. However, it turns out that it is difficult to learn accurate rules for correct programs, since these rules should contain all relevant patterns and prevent incorrect patterns, yet a conjunct can only specify the presence of a pattern. If specifying the absence of patterns were allowed in rule conditions, the learning problem would still be difficult, since there are usually many incorrect patterns. A possible way to solve this problem is to learn from a data set that excludes programs covered by rules for the incorrect class. This way, all known incorrect patterns are removed from the data and are no longer needed in rule conditions. However, removing incorrect patterns also removes the need for relevant patterns. For example, if all incorrect programs were removed, a rule ``$\mathsf{true} \Rightarrow \mathsf{correct}$'' would suffice. Such a rule does not contain any relevant patterns and could not be used to generate hints. We achieved the best results by learning from both data sets. The original data set (with all programs) is used to learn rules, while the filtered data are used to test whether a rule achieves the required classification accuracy (90\%).
 
 Figure~\ref{figure:algorithm} contains an outline of the algorithm. The rules describing
 incorrect programs are called $I$-rules and the rules for correct programs are called $C$-rules.
 
-Even though our main interest is discovery of patterns, we could still use induced rules to classify
+Even though our main interest is discovery of patterns, we can still use induced rules to classify
 new programs, for example to evaluate the quality of rules. The classification procedure has three
 steps: first check whether an $I$-rule covers the program that needs to be classified, and if it
 does, classify it as incorrect. Then, check whether a $C$-rule covers the program and classify
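The accuracy constraint on learned rules described above can be sketched as a simple filter: a candidate rule, i.e. a conjunction of required patterns, is kept only if at least 90% of the programs it covers carry the target label. The likelihood-ratio significance test on individual conjuncts is omitted here for brevity, and all names are illustrative rather than the authors' code.

```python
# Sketch of the accuracy filter for candidate rules. A rule is a set of
# patterns that must all be *present* in a program (presence-only conjuncts,
# as required above). The per-conjunct likelihood-ratio test is omitted.

def covers(rule, program_patterns):
    return all(p in program_patterns for p in rule)

def rule_accuracy(rule, data, target):
    """Share of programs covered by `rule` whose label equals `target`.
    `data` is a list of (set_of_patterns, label) pairs."""
    covered = [label for patterns, label in data if covers(rule, patterns)]
    return (sum(lab == target for lab in covered) / len(covered)
            if covered else 0.0)

def keep_rule(rule, data, target, threshold=0.9):
    return rule_accuracy(rule, data, target) >= threshold

# Toy data: the rule {"bad_pattern"} covers 10 programs, 9 of them incorrect.
data = ([({"bad_pattern"}, "incorrect")] * 9
        + [({"bad_pattern"}, "correct"), ({"good_pattern"}, "correct")])
print(keep_rule({"bad_pattern"}, data, "incorrect"))  # True (9/10 = 0.9)
```

In the paper's setup, the filtered data set (with programs covered by I-rules removed) would be passed as `data` when testing candidate C-rules against the same threshold.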
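The three-step classification procedure (the source text is truncated above) can be sketched as follows. The fall-back for programs covered by neither rule set is an assumption on our part, since the text breaks off before the third step, and the pattern names are invented for illustration.

```python
# Sketch of the three-step classification with I-rules and C-rules:
# 1) if any I-rule covers the program, classify it as incorrect;
# 2) otherwise, if any C-rule covers it, classify it as correct;
# 3) otherwise fall back to a default label (ASSUMED here -- the source
#    is cut off before describing the third step).

def classify(program_patterns, i_rules, c_rules, default="incorrect"):
    def covered_by(rule):
        return all(p in program_patterns for p in rule)
    if any(covered_by(r) for r in i_rules):
        return "incorrect"
    if any(covered_by(r) for r in c_rules):
        return "correct"
    return default

# Hypothetical rules over invented pattern names.
i_rules = [{"bad_base_case"}]
c_rules = [{"recursive_call", "base_case"}]
print(classify({"bad_base_case", "recursive_call"}, i_rules, c_rules))  # incorrect
print(classify({"recursive_call", "base_case"}, i_rules, c_rules))      # correct
```

Checking I-rules first matches the stated order of the procedure: a program exhibiting a known incorrect pattern is flagged even if it also contains patterns typical of correct solutions.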