\section{Learning rules} \label{sec:rules} The goal of learning rules in this paper is to discover and explain common approaches and mistakes in student programs. We use a rule learner called ABCN2e, implemented within the Orange data mining library~\cite{demsar2013orange}. ABCN2e modifies the original CN2 algorithm~\cite{clarkECML1991} to learn unordered rules; modifications are described in a technical report at \url{https://ailab.si/abml}. General rule-learning algorithms, such as CN2, tend to generate many specific rules. This produces more accurate results but makes rules harder to explain. This section describes the problem-specific configuration of the rule-learning algorithm for extracting relevant and explainable patterns from student programs. Each program is represented in the feature space of AST patterns described in the previous section. Based on test results each program is classified either as \emph{correct} or \emph{incorrect}. A program can be incorrect for one of two reasons: either a) it contains some incorrect pattern (a buggy pattern) that should be removed or modified, or b) it is missing one or more programing constructs (patterns) that should be present for the program to be correct. Classification rules can express both reasons. For buggy patterns we learn rules for incorrect programs, where each condition in the rule must express the presence of a pattern. The condition of such a rule therefore contains a set of patterns that imply a bug in the program. For missing patterns, we learn another set of rules covering programs that are not covered by above rules. These rules may contain missing patterns within their conditions, and describe the missing constructs in a program that have to be implemented. All rules explaining incorrect programs are called \emph{n-rules}. To learn explainable, meaningful and non-redundant rules, we impose the following additional constraints on the rule learner: \begin{itemize} \item classification accuracy of each rule must exceed 90\%, because we accept a 10\% false-positive error as acceptable; \item each conjunct in the condition of a rule must be significant according to the likelihood test, meaning that each pattern in the condition part is indeed relevant (we set the significance threshold to p=0.05); \item a condition can have at most 3 patterns; and \item each rule must cover at least 5 distinct programs -- this avoids redundant rules that represent the same error with a different combination of patterns. \end{itemize} Different approaches can be represented with rules explaining correct programs. A program is correct when it implements all required patterns and no buggy patterns. There may be several possible sets of required patterns for each exercise, with each set corresponding to a different approach to solving it. We use the same constraints as in the case of n-rules and learn rules for correct programs called \emph{p-rules}. In this case, we always require that conditions mention the presence of patterns, since it is easier to explain different approaches of students with something they have written and not with something they have not. To account for possible buggy patterns, the requirement to achieve 90\% classification accuracy was not evaluated on full data, but only on data not covered by n-rules. Hence, a rule can cover an example with a specific approach even though it contains a buggy pattern. \section{Interpreting rules} \label{sec:interpreting-rules} Learned rules can be used to analyze student programing. This section describes several rules induced for two Python exercises: \textsf{Fahrenheit to Celsius}, which reads a value from standard input and calculates the result, and \textsf{Greatest Absolutist}, one of the introductory exercises for functions. \subsection{Fahrenheit to Celsius} The first problem in CodeQ Python course is to write a program converting from degrees Fahrenheit to degrees Celsius. The program should ask the user to input a temperature, and print the result. A sample correct program is: \begin{Verbatim} F = float(input("Fahrenheit: ")) C = 5 / 9 * (F - 32) print("Celsius: ", C) \end{Verbatim} Students have submitted 1177 programs for this problem, with 495 correct and 682 incorrect programs. Our systems extracted 891 relevant AST patterns, which were used as attributes in rule learning. The rule learner induced 24 n-rules, 14 of which mention only presence of patterns, and 16 p-rules. We first take a look at n-rules that mention only presence of patterns in their conditions. The most accurate rule according to the rule learner was: \begin{Verbatim}[fontfamily=sf] P20 ⇒ incorrect [208, 1] \end{Verbatim} \noindent This rule covers programs where the pattern \textsf{P20} is present. It implies an incorrect program, and covers 208 incorrect and one correct program. \textsf{P20} is the AST pattern describing a call to the \texttt{int} function: \begin{Verbatim}[fontfamily=sf] (Module (body (Assign (value (Call (func (Name (id int) (ctx Load)))))))) \end{Verbatim} The second best n-rule covers 72 incorrect and no correct programs: \begin{Verbatim}[fontfamily=sf] P5 ∧ P35 ⇒ incorrect [72, 0] \end{Verbatim} \noindent Pattern \textsf{P5} matches programs where the result of the \texttt{input} call is not cast to \texttt{float} but stored as a string. Pattern \textsf{P35} matches programs where the value 32 is subtracted from a variable on the left-hand side of a multiplication. Sample programs matching the first rule (left) and the second rule (right) are: \begin{Verbatim} g2 = input() g2 = \blue{input}('Temperature [F]? ') g1 = \blue{int}(g2) g1 = (\red{(g2 - 32) *} (5 / 9)) print(((g1-32)*(5/9))) print(g2, 'F equals', g1, 'C') \end{Verbatim} These rules describe two common student errors. The left program is incorrect, since it fails when the user inputs a decimal. The right program is incorrect because the input string must be cast to a number. Not casting it (pattern \textsf{P5}) and then using it in an expression (pattern \textsf{P35}) will raise an exception. The two most accurate n-rules with missing patterns in their conditions are: \begin{Verbatim}[fontfamily=sf] ¬P0 ⇒ incorrect [106, 0] ¬P1 ∧ P16 ⇒ incorrect [100, 0] \end{Verbatim} \noindent Pattern \textsf{P0} matches programs with a call to function \texttt{print}. A program without a \texttt{print} is always incorrect, since it will not output anything. The second rule covers programs with \textsf{P1} missing and \textsf{P16} present. \textsf{P16} matches programs with a call to the \texttt{print} function, where the argument contains a formula which subtracts 32 from a variable and then further multiplies the result. \textsf{P1} describes a call to the function \texttt{float} as the first item in an expression, i.e. \texttt{= float(...)}. This rule therefore represents programs that have the formula in the \texttt{print} function (\textsf{P16} is present), however fail to cast input from string to float (\textsf{P1} is missing). Let us now examine the other type of rules. The best four p-rules are: \begin{Verbatim}[fontfamily=sf] P2 ∧ P8 ⇒ correct [1, 200] P1 ∧ P42 ⇒ correct [0, 68] P1 ∧ P8 ⇒ correct [3, 217] P80 ⇒ correct [0, 38] \end{Verbatim} \noindent Patterns in the condition of the first rule, \textsf{P2} and \textsf{P8}, correspond respectively to expressions of the form \texttt{float(input(?))} and \texttt{print((?-32)*?)}. Programs matching both patterns wrap the function \texttt{float} around \texttt{input}, and have an expression that subtracts 32 and then uses multiplication within the \texttt{print}. This first rule demonstrates an important property of p-rules: although patterns \textsf{P2} and \textsf{P8} are in general not sufficient for a correct program (it is trivial to implement a matching but incorrect program), only one out of 201 student submissions matching these patterns was incorrect. This suggests that the conditions of p-rules represent the critical elements of the solution. Once a student has figured out these patterns, they are almost certain to have a correct solution. A sample program matching the first rule is: \begin{Verbatim} g1 = \blue{float(input(}'Temperature [F]: ')) print(((g1 \red{- 32) *} (5 / 9))) \end{Verbatim} \noindent The second and third p-rules are variations of the first. For instance, the second rule describes programs that have the formula in the argument to the \texttt{print} function. The fourth rule, however, is different. \textsf{P80} describes programs that subtract 32 from a variable cast to float. The following program matches \textsf{P80}: \begin{Verbatim} g1 = input('Fahrenheit?') g0 = (\blue{(float(g1) - 32)} * (5 / 9)) print(g0) \end{Verbatim} \subsection{Greatest Absolutist} In this exercise students must implement a function that accepts a list of numbers and returns the element with the largest absolute value. One solution is \begin{Verbatim}[fontfamily=tt] def max_abs(l): vmax = l[0] for v in l: if (abs(v) > abs(vmax)): vmax = v return vmax \end{Verbatim} We have received 155 submissions (57 correct, 98 incorrect) for this exercise. Due to its higher complexity and since the solutions are much more diverse, we obtained 8298 patterns to be used as attributes in learning. High number of patterns together with a low number of learning examples could present a problem for rule learning: since the space of possible rules is large, some of the learned rules might be a result of statistical anomalies. One needs to apply a certain amount of caution when interpreting these rules. The rule-learning algorithm learned 15 n-rules (7 mentioning only presence of patterns) and 6 p-rules. Below we can see the two best n-rules referring to the presence of patterns and two programs; the left one is covered by the first rule, and the right one by the second rule: \begin{Verbatim}[fontfamily=sf] P64 ⇒ incorrect [22, 0] P2 ∧ P70 ⇒ incorrect [17, 0] \end{Verbatim} \begin{Verbatim} def max_abs(l): def max_abs(l): vmax = 0 vmax = None for i in range(len(l)): for v in l: if \blue{vmax} < abs(l[i]): if vmax==None or vmax