diff options
Diffstat (limited to 'aied2018')
-rw-r--r-- | aied2018/evaluation.tex | 4 | ||||
-rw-r--r-- | aied2018/patterns.tex | 2 | ||||
-rw-r--r-- | aied2018/rules.tex | 69 |
3 files changed, 37 insertions, 38 deletions
diff --git a/aied2018/evaluation.tex b/aied2018/evaluation.tex index 72dca8e..7d5ee54 100644 --- a/aied2018/evaluation.tex +++ b/aied2018/evaluation.tex @@ -32,8 +32,8 @@ Evaluation was performed on a subset of exercises from an introductory Python course, implemented in the online programming environment CodeQ\footnote{Available at \url{https://codeq.si}. Source under AGPL3+ at \url{https://codeq.si/code}.}. Table~\ref{tab:stats} shows the number of users attempting each problem, the number of all / correct submissions, the performance of majority and random-forest classifiers for predicting program correctness based on patterns, and percentages of covered programs by rules: $n_p$ is the percentage of covered incorrect programs with n-rules that contain only presence of patterns, $n$ is the percentage of covered incorrect programs with all n-rules, and $p$ is the coverage of correct programs with p-rules. Classification accuracies were obtained using 10-fold cross validation. -Our primary interest in this paper is finding rules to help manual analysis of student submissions. The accuracy of automatic classification thus plays a secondary role to the interpretability of our model, however it is a good measure to estimate the expressiveness of patterns. We see that AST patterns increase classification accuracy for about 17\% overall. This result indicates that a significant amount of information can be gleaned from simple syntax-oriented analysis. Exceptions are the \textsf{ballistics} and the \textsf{minimax} problems, where there was no visible improvement. In both exercises, the set of used patterns was insufficient to distinguish between incorrect and correct programs. To further improve the quality of patterns, we will analyze misclassified programs in those two exercises and derive new formats of patterns, which will enable better learning. +Our primary interest in this paper is finding rules to help manual analysis of student submissions. The accuracy of automatic classification thus plays a secondary role to the interpretability of our model, however it is a good measure to estimate the expressiveness of patterns. We see that AST patterns increase classification accuracy for about 17\% overall. This result indicates that a significant amount of information can be gleaned from simple syntax-oriented analysis. Exceptions are the \textsf{ballistics} and the \textsf{minimax} problems, where there was no visible improvement. In both exercises, the set of used patterns was insufficient to distinguish between incorrect and correct programs. To further improve the quality of patterns, we intend to analyze misclassified programs in those two exercises and derive new formats of patterns, which should enable better learning. -The coverages of particular types of rules are also promising. On average, we can explain 43.2\% of incorrect programs with the presence of patterns. Considering all n-rules, we can explain over 70\% of incorrect submissions, however these rules are somewhat harder to understand. Similarly, we can explain 62\% of correct programs with p-rules. The remaining 38\% of solutions tackle the problems in ways that cannot be veiled by out current patterns or there are not enough similar solutions to form a rule. This requires further analysis. For example, in the \textsf{checking\_account} exercise the coverage of p-rules is only 11.5\%. +The coverages of particular types of rules are also promising. On average, we can explain 43.2\% of incorrect programs with the presence of patterns. Considering all n-rules, we can explain over 70\% of incorrect submissions, however these rules are somewhat harder to understand. Similarly, we can explain 62\% of correct programs with p-rules. The remaining 38\% of solutions tackle the problems in ways that cannot be veiled by our current patterns or there are not enough similar solutions to form a rule. This requires further analysis. For example, in the \textsf{checking\_account} exercise the coverage of p-rules is only 11.5\%. We have demonstrated how AST patterns can be described with TREs, and how such patterns can be combined to discover important concepts and errors in student programs. Currently, analyzing patterns and rules is quite cumbersome. We plan on developing a tool to allow teachers to easily construct and refine patterns and rules based on example programs. Ideally we would integrate our approach into an existing analysis tool such as OverCode~\cite{glassman2015overcode}. diff --git a/aied2018/patterns.tex b/aied2018/patterns.tex index 28a5ed8..e2da4fa 100644 --- a/aied2018/patterns.tex +++ b/aied2018/patterns.tex @@ -12,7 +12,7 @@ In this work we used TREs to encode (only) child and sibling relations in ASTs. \noindent Analogous to ordinary regular expressions, caret (\texttt{\textbf{\textasciicircum}}) and dollar sign (\texttt{\$}) anchor a node to be respectively the first or last child of its parent. A period (\texttt{\textbf{.}}) is a wildcard that matches any node. -\begin{figure}[htbp] +\begin{figure}[tbp] \centering \begin{forest} for tree={ diff --git a/aied2018/rules.tex b/aied2018/rules.tex index ff47939..acdd8a6 100644 --- a/aied2018/rules.tex +++ b/aied2018/rules.tex @@ -1,24 +1,26 @@ \section{Learning rules} \label{sec:rules} -The goal of learning rules in this paper is to discover and explain common approaches and mistakes in student programs. We use a rule learner called ABCN2e, implemented within the Orange data mining library~\cite{demsar2013orange}. ABCN2e modifies the original CN2 algorithm~\cite{clarkECML1991} to learn unordered rules; modifications are described in a technical report at \url{https://ailab.si/abml}. +The goal of learning rules in this paper is to discover and explain common approaches and mistakes in student programs. We use a rule learner called ABCN2e, implemented within the Orange data mining library~\cite{demsar2013orange}. ABCN2e modifies\footnote{Modifications are described in a technical report available at \url{https://ailab.si/abml}.} the original CN2 algorithm~\cite{clarkECML1991}. General rule-learning algorithms, such as CN2, tend to generate many specific rules. This produces more accurate results but makes rules harder to explain. This section describes the problem-specific configuration of the rule-learning algorithm for extracting relevant and explainable patterns from student programs. -Each program is represented in the feature space of AST patterns described in the previous section. Based on test results each program is classified either as \emph{correct} or \emph{incorrect}. A program can be incorrect for one of two reasons: either a) it contains some incorrect pattern (a buggy pattern) that should be removed or modified, or b) it is missing one or more programing constructs (patterns) that should be present for the program to be correct. +Each program is represented in the feature space of AST patterns described in the previous section. Based on test results each program is classified either as \emph{correct} or \emph{incorrect}. Our main motivation for learning rules bases on assumption that the patterns highly correlated with incorrect programs represent mistakes. -Classification rules can express both reasons. For buggy patterns we learn rules for incorrect programs, where each condition in the rule must express the presence of a pattern. The condition of such a rule therefore contains a set of patterns that imply a bug in the program. For missing patterns, we learn another set of rules covering programs that are not covered by above rules. These rules may contain missing patterns within their conditions, and describe the missing constructs in a program that have to be implemented. All rules explaining incorrect programs are called \emph{n-rules}. +A program can be incorrect for one of two reasons: either a) it contains some incorrect pattern (a buggy pattern) that should be removed or modified, or b) it is missing one or more programing constructs (patterns) that should be present for the program to be correct. + +Classification rules can express both reasons. To identify buggy patterns we learn a set of rules covering incorrect programs, where each condition in the rule must express the presence of a pattern. For missing patterns, we learn another set of rules, which may contain missing patterns within their conditions, and describe the missing constructs in a program that have to be implemented. All rules explaining incorrect programs are called \emph{n-rules}. To learn explainable, meaningful and non-redundant rules, we impose the following additional constraints on the rule learner: \begin{itemize} - \item classification accuracy of each rule must exceed 90\%, because we accept a 10\% false-positive error as acceptable; - \item each conjunct in the condition of a rule must be significant according to the likelihood test, meaning that each pattern in the condition part is indeed relevant (we set the significance threshold to p=0.05); + \item classification accuracy of each rule must exceed 90\%, because we deem a 10\% false-positive error as acceptable; + \item each term in the condition of a rule must be significant according to the likelihood test, meaning that each pattern in the condition part is indeed relevant (we set the significance threshold to p=0.05); \item a condition can have at most 3 patterns; and - \item each rule must cover at least 5 distinct programs -- this avoids redundant rules that represent the same error with a different combination of patterns. + \item each rule must cover at least 5 distinct programs -- to avoid learning redundant rules representing the same error. \end{itemize} -Different approaches can be represented with rules explaining correct programs. A program is correct when it implements all required patterns and no buggy patterns. There may be several possible sets of required patterns for each exercise, with each set corresponding to a different approach to solving it. +To identify different approaches to solving an exercise, we learn rules that explain correct programs. A program is correct when it implements all required patterns and no buggy patterns. There may be several possible sets of required patterns for each exercise, with each set corresponding to a different approach. We use the same constraints as in the case of n-rules and learn rules for correct programs called \emph{p-rules}. In this case, we always require that conditions mention the presence of patterns, since it is easier to explain different approaches of students with something they have written and not with something they have not. To account for possible buggy patterns, the requirement to achieve 90\% classification accuracy was not evaluated on full data, but only on data not covered by n-rules. Hence, a rule can cover an example with a specific approach even though it contains a buggy pattern. @@ -63,8 +65,8 @@ P5 ∧ P35 ⇒ incorrect [72, 0] Pattern \textsf{P5} matches programs where the result of the \texttt{input} call is not cast to \texttt{float} but stored as a string. Pattern \textsf{P35} matches programs where the value 32 is subtracted from a variable on the left-hand side of a multiplication. Sample programs matching the first rule (left) and the second rule (right) are: \begin{Verbatim} -g2 = input() g2 = \blue{input}('Temperature [F]? ') -g1 = \blue{int}(g2) g1 = (\red{(g2 - 32) *} (5 / 9)) +g2 = input() g2 = \blue{\underline{input}}('Temperature [F]? ') +g1 = \blue{\underline{int}}(g2) g1 = (\red{\dashuline{(g2 - 32) *}} (5 / 9)) print(((g1-32)*(5/9))) print(g2, 'F equals', g1, 'C') \end{Verbatim} @@ -93,11 +95,11 @@ P80 ⇒ correct [0, 38] \noindent Patterns in the condition of the first rule, \textsf{P2} and \textsf{P8}, correspond respectively to expressions of the form \texttt{float(input(?))} and \texttt{print((?-32)*?)}. Programs matching both patterns wrap the function \texttt{float} around \texttt{input}, and have an expression that subtracts 32 and then uses multiplication within the \texttt{print}. -This first rule demonstrates an important property of p-rules: although patterns \textsf{P2} and \textsf{P8} are in general not sufficient for a correct program (it is trivial to implement a matching but incorrect program), only one out of 201 student submissions matching these patterns was incorrect. This suggests that the conditions of p-rules represent the critical elements of the solution. Once a student has figured out these patterns, they are almost certain to have a correct solution. A sample program matching the first rule is: +This first rule demonstrates an important property of p-rules: although patterns \textsf{P2} and \textsf{P8} are in general not sufficient for a correct program (it is trivial to implement a matching but incorrect program), only one out of 201 student submissions matching these patterns was incorrect. This suggests that the conditions of p-rules represent the critical elements of the solution. Once students have figured out these patterns, they are almost certain to have a correct solution. A sample program matching the first rule is: \begin{Verbatim} -g1 = \blue{float(input(}'Temperature [F]: ')) -print(((g1 \red{- 32) *} (5 / 9))) +g1 = \blue{\underline{float(input(}}'Temperature [F]: ')) +print(((g1 \red{\dashuline{- 32) *}} (5 / 9))) \end{Verbatim} \noindent @@ -105,7 +107,7 @@ The second and third p-rules are variations of the first. For instance, the seco \begin{Verbatim} g1 = input('Fahrenheit?') -g0 = (\blue{(float(g1) - 32)} * (5 / 9)) +g0 = (\blue{\underline{(float(g1) - 32)}} * (5 / 9)) print(g0) \end{Verbatim} @@ -122,10 +124,9 @@ def max_abs(l): return vmax \end{Verbatim} -We have received 155 submissions (57 correct, 98 incorrect) for this exercise. Due to its higher complexity and since the solutions are much more diverse, we obtained 8298 patterns to be used as attributes in learning. High number of patterns together with a low number of learning examples could present a problem for rule learning: since the space of possible rules is large, some of the learned rules might be a result of statistical anomalies. One needs to apply a certain amount of caution when interpreting these rules. - -The rule-learning algorithm learned 15 n-rules (7 mentioning only presence of patterns) and 6 p-rules. Below we can see the two best n-rules referring to the presence of patterns and two programs; the left one is covered by the first rule, and the right one by the second rule: +We have received 155 submissions (57 correct, 98 incorrect) for this exercise. Due to its higher complexity and since the solutions contain definitions of lists, we obtained 8298 patterns to be used as attributes in learning. High number of patterns together with a low number of learning examples could present a problem for rule learning, since the large number of candidate rules could lead to statistical anomalies. Still, most of the learned rules seem valuable. +The rule-learning algorithm learned 15 n-rules (7 mentioning only presence of patterns) and 6 p-rules. Below we can see the two best n-rules referring to the presence of patterns and two programs (left for first rule, right for second): \begin{Verbatim}[fontfamily=sf] P64 ⇒ incorrect [22, 0] P2 ∧ P70 ⇒ incorrect [17, 0] @@ -135,14 +136,14 @@ P2 ∧ P70 ⇒ incorrect [17, 0] def max_abs(l): def max_abs(l): vmax = 0 vmax = None for i in range(len(l)): for v in l: - if \blue{vmax} < abs(l[i]): if vmax==None or vmax<v: - vmax = l[i] \red{vmax} = \blue{abs}(v) - return \blue{vmax} return \red{vmax} + if \blue{\underline{vmax}} < abs(l[i]): if vmax==None or vmax<v: + vmax = l[i] \red{\dashuline{vmax}} = \blue{\underline{abs}}(v) + return \blue{\underline{vmax}} return \red{\dashuline{vmax}} \end{Verbatim} -The pattern from the first rule, \textsf{P64}, matches functions returning the variable that is used in the condition of an if clause without an application of another function (such as \texttt{abs}). The left program demonstrates this pattern, where the value \texttt{vmax} is compared in the if clause and then returned. According to the teachers of the Python class, this error is common, because students forget that they need to compare the absolute value of \texttt{vmax}. +The pattern from the first rule, \textsf{P64}, matches functions returning the variable that is used in the condition of an if clause without an application of another function (such as \texttt{abs}). The left program contains this pattern, where the value \texttt{vmax} is compared in the if clause and then returned. This error is quite common among students, because they forget to compare the absolute value of \texttt{vmax}. -The second rule contains two patterns. \textsf{P70} (blue) matches functions containing the call to \texttt{abs} in an assignment statement nested within a for loop and an if clause. \textsf{P2} (red) matches functions that return the variable used in an assignment statement within a for-if block. Such programs are incorrect because they do not store the original list element. For example, if -7 has the largest absolute value in the list, then the function should return -7 and not 7. +The second rule contains two patterns. \textsf{P70} (blue underlined) matches functions containing the call to \texttt{abs} in an assignment statement nested within a for loop and an if clause. \textsf{P2} (red dash) matches functions that return the variable used in an assignment statement within a for-if block. Such programs are incorrect since they do not store the original list element. For example, if -7 has the largest absolute value in the list, then the function should return -7 and not 7. The best two n-rules with absence of patterns in condition are: @@ -154,9 +155,9 @@ P36 ∧ ¬P162 ⇒ incorrect [26, 0] \noindent The first rule covers programs matching \textsf{P1} (checks for a function definition in the program) but missing \textsf{P11} (if the iteration variable in a for loop is directly assigned to another variable within an if clause) and \textsf{P131} (whether the return statement uses indexing, i.e. \texttt{return l[?]}). One such example is the above right program: it contains a function definition, it does not directly assign the value of \texttt{v} but uses its absolute value, and does not use indexing in the return statement. -This rule specifies two missing patterns, which makes is quite difficult to understand. It does not directly state the issue with a given program: if one of the two missing patterns were implemented, the rule would not cover this program any more. Therefore, the questions is, which of these two reasons is really missing? Different missing patterns could be understood as different options to finalize the program. +This rule specifies two missing patterns, which makes is harder to understand. If one of the two missing patterns were implemented, the rule would not cover this program any more. Therefore, the questions is, which of these two reasons is really missing? Different missing patterns could be understood as different options to finalize the program. -The second rule identifies only one missing pattern. \textsf{P36} matches a call to \texttt{max} in the return statement, whereas \textsf{P162} matches a call to \texttt{max} with the given list as the argument. This rule covers, for example, the following program: +The second rule identifies one missing pattern. \textsf{P36} matches a call to function \texttt{max} in the return statement, whereas \textsf{P162} matches a call to \texttt{max} with the given list as the argument. This rule covers, for example, the following program: \begin{Verbatim} def max_abs(l): @@ -164,7 +165,7 @@ def max_abs(l): \end{Verbatim} \noindent -It uses \texttt{max} in the return statement, but does not apply it directly to the input list \texttt{l}. Note that this program would fail because the function \texttt{abs} does not accept a list argument. +It uses \texttt{max} in the return statement, but does not apply it directly to the list \texttt{l}. The four most accurate p-rules induced by our rule learner were: @@ -179,22 +180,20 @@ P27 ⇒ correct [ 6 38] Sample programs covered by the first and second rules are: \begin{Verbatim}[fontfamily=tt] -def max_abs(l): def max_abs(\green{l}): - \green{vmax = 0} vmax = l[0] - for \blue{v} in l: for \blue{v} in \green{l}: - if abs(\red{vmax}) < abs(v): if abs(\red{vmax}) < abs(v): - \red{vmax} = \blue{v} vmax = \blue{v} - return vmax return \red{vmax} +def max_abs(l): def max_abs(\green{\dotuline{l}}): + \green{\dotuline{vmax = 0}} vmax = l[0] + for \blue{\underline{v}} in l: for \blue{\underline{v}} in \green{\dotuline{l}}: + if abs(\red{\dashuline{vmax}}) < abs(v): if abs(\red{\dashuline{vmax}}) < abs(v): + \red{\dashuline{vmax}} = \blue{\underline{v}} vmax = \blue{\underline{v}} + return vmax return \red{\dashuline{vmax}} \end{Verbatim} \noindent -The first two rules and the above programs are similar. Both rules share a common reason, \textsf{P11} (blue in both programs), describing a pattern, where the variable from the for loop is used in the right side of an assignment within the if clause. \textsf{P17} and \textsf{P27} are also similar (red in both programs). The former links the occurrence of a variable within the \texttt{abs} function in an if condition with the variable from an assignment, whereas the latter links the same variable from an if condition with the variable from the return statement. \textsf{P35} matches variable assignments to 0, hence the first rule covers solutions initializing \texttt{vmax} to zero. \textsf{P3} matches for-looping over the input list. +The first two rules and the above programs are similar. Both rules share a common reason, \textsf{P11} (blue underlined in both programs), describing a pattern, where the variable from the for loop is used in the right side of an assignment within the if clause. \textsf{P17} and \textsf{P27} are also similar (red dash underline). The former links the occurrence of a variable within the \texttt{abs} function in an if condition with the variable from an assignment, whereas the latter links the same variable from an if condition with the variable from the return statement. \textsf{P35} matches variable assignments to 0, hence the first rule covers solutions initializing \texttt{vmax} to zero. \textsf{P3} matches for-looping over the input list. -After inspecting all covered examples of the first and the second rule, we found out that the first rule is only a more strict version of the second rule, since all examples covered by the first rule are also covered by the second rule. These two rules therefore do not describe two different approaches, but two different representations of the same approach. Similarly, the fourth rule is a generalization of the first two rules, containing only \textsf{P27} within conditions. This pattern seem to be particularly important. Of 44 programs, where students used the absolute value of \texttt{vmax} in comparison and returned \texttt{vmax} at the end, 38 were evaluated as correct. +After inspecting all covered examples of the first and the second rule, we discovered that the first rule is only a more strict version of the second rule, since all examples covered by the first rule are also covered by the second rule. These two rules therefore do not describe two different approaches, but a more general and a more specific representation of the same approach. Similarly, the fourth rule is a generalization of the first two rules, containing only \textsf{P27} within conditions. It appears that the pattern \textsf{P27} is the most important; of 44 programs, where students used the absolute value of \texttt{vmax} in comparison and returned \texttt{vmax} at the end, 38 were evaluated as correct. The third rule describes a different pattern. It covers programs that define a list containing values 2, 1, and -6. Defining such a list is evidently not necessary for the solution of this exercise. Why would it then correlate with the correctness of the solution? -To explain this rule we first have to describe how students test their programs. One option is to simply use the \textit{Test} button, which submits the program to a server, where it is tested against a predefined set of test cases. The other option is to click the \textit{Run} button, which runs the program and outputs the results. Those students who defined a list with values 2, 1, and -6 in their programs are most likely using the second option. They create their own test cases and then submit a program only when they are certain that it is correct. Since the description of the exercise includes a single test case with values 2, 1, and -6, most students use this list as the testing case. - -On the other hand, given that the rule covers only nine programs, the probability that the rule is a statistical artifact is not negligible. +To explain this rule we first have to describe how students test their programs. One option is to simply use the \textit{Test} button, which submits the program to a server, where it is tested against a predefined set of test cases. The other option is to click the \textit{Run} button, which runs the program and outputs the results. Those students who defined a list with values 2, 1, and -6 in their programs are most likely using the second option. They create their own test cases and then submit a program only when they are certain that it is correct. Since the description of the exercise includes a single test case with values 2, 1, and -6, most students use this list as the testing case. On the other hand, given that the rule covers only nine programs, the probability that the rule is a statistical artifact is not negligible. |