\documentclass{llncs} \usepackage[utf8]{inputenc} \usepackage{newunicodechar} \newunicodechar{∧}{\ensuremath{\land}} \newunicodechar{¬}{\ensuremath{\lnot}} \newunicodechar{⇒}{\ensuremath{\Rightarrow}} \newunicodechar{→}{\ensuremath{\rightarrow}} \newunicodechar{⋯}{\ensuremath{\cdots}} \usepackage{bold-extra} \usepackage{bm} \usepackage{hyperref} \usepackage[normalem]{ulem} \usepackage{color} \newcommand\red[1]{{\begingroup\color[rgb]{0.8,0.15,0.15}#1\endgroup}} \newcommand\blue[1]{{\begingroup\color[rgb]{0.15,0.15,0.8}#1\endgroup}} \newcommand\green[1]{{\begingroup\color[rgb]{0.15,0.8,0.15}#1\endgroup}} \usepackage{fancyvrb} \fvset{commandchars=\\\{\},baselinestretch=0.98,samepage=true,xleftmargin=2.5mm} % WAT — I don’t even… \makeatletter \begingroup \catcode`\`=\active \gdef\FV@fontfamily@sf{% \def\FV@FontScanPrep{\FV@MakeActive\`}% \def\FV@FontFamily{\sffamily\edef`{{\string`}}}} \endgroup \makeatother \usepackage{tikz} \usepackage{forest} \usetikzlibrary{arrows.meta,calc} \newcommand\code[1]{\texttt{#1}} \newcommand\pattern[1]{\textsf{#1}} \begin{document} \title{Syntax-based analysis of programming concepts in Python} \author{Martin Možina, Timotej Lazar} \institute{University of Ljubljana, Faculty of Computer and Information Science, Slovenia} \maketitle \begin{abstract} % background / problem Writing programs is essential to learning programming. Most programming courses encourage students to practice with lab and homework assignments. By analyzing solutions to these exercises teachers can discover mistakes and concepts students are struggling with, and use that knowledge to improve the course. Students however tend to submit many different programs even for simple exercises, making such analysis difficult. % solution We propose using tree regular expressions to encode common patterns in programs. Based on these patterns we induce rules describing common approaches and mistakes for a given assignment. 
In this paper we present a case study of rule-based analysis for an introductory Python exercise. We show that our rules are easy to interpret, and can be learned from a relatively small set of programs. \\\\ \textbf{Keywords:} Learning programming · Educational data analysis · Error diagnosis · Abstract syntax tree · Tree regular expressions \end{abstract} \section{Introduction} Providing feedback to students is among the most time-consuming tasks when teaching programming. In large courses with hundreds of students, feedback is therefore often limited to automated program testing. While test cases can reliably determine whether a program is correct or not, they cannot easily be associated with specific errors in the code. Several attempts have been made to automatically discover common errors in student programs~\cite{jin2012program,rivers2015data-driven,nguyen2014codewebs,hovemeyer2016control}. This would allow a teacher to annotate a representative subset of submissions with feedback messages, which could then be automatically propagated to similar programs. These techniques are used for instance by the OverCode tool to visualize variations in student programs~\cite{glassman2015overcode}. We use \emph{tree regular expressions} (TREs) to specify important patterns in a program’s abstract syntax tree (AST) while disregarding irrelevant parts. We have previously demonstrated this approach with Prolog programs~\cite{lazar2017automatic}. Here we refine the AST patterns and show that they can be applied to Python -- a different programming paradigm -- with only a few modifications. \section{AST patterns} We encode structural patterns in ASTs using tree regular expressions (TREs). In this work we consider (only) child and sibling relations in ASTs. We write them as S-expressions, such as \pattern{(a (b \textbf{\textasciicircum}~d~\textbf{.}~e~\$) c)}. 
This expression matches any tree satisfying the following constraints: \begin{itemize} \item the root \pattern{a} has at least two children, \pattern{b} and \pattern{c}, adjacent and in that order; and \item the node \pattern{b} has three children: \pattern{d}, followed by any node, followed by \pattern{e}. \end{itemize} \noindent As in ordinary regular expressions, caret (\texttt{\textbf{\textasciicircum}}) and dollar sign (\texttt{\$}) anchor a node to be respectively the first or last child, and a period (\texttt{\textbf{.}}) matches any node. With TREs we encode interesting patterns in a program while disregarding irrelevant parts. Take for example the following, nearly correct Python function that prints the divisors of its argument $n$: \begin{Verbatim} \red{\dashuline{\textbf{def}}} divisors(\blue{\underline{\textbf{n}}}): \red{\dashuline{\textbf{for}}} d in range(1, \blue{\underline{\textbf{n}}}): \red{\dashuline{\textbf{if}}} n % d == 0: print(d) \end{Verbatim} The highlighted fragments correspond to the following two patterns: \begin{enumerate} \item \pattern{(Function (body (For (body If))))} and \item \pattern{(Function (name divisors) (args \textbf{\textasciicircum}~Var \$)} \item[] ~~~~\pattern{(body (For (iter (Call (func range) (args \textbf{\textasciicircum}~\textbf{.} Var \$))))))}. \end{enumerate} The first TRE encodes a single path in the AST and describes the program’s structure: \textsf{Function}$\boldsymbol{-}$\textsf{For}$\boldsymbol{-}$\textsf{If}. The second TRE relates the argument in the definition of \code{divisors} to the last argument to \code{range}. The second pattern shows a common mistake for this problem: \code{range(1,n)} will only generate values up to \code{n-1}, so \code{n} will not be printed as its own divisor. A correct pattern would include the binary operator \code{+} on the path to \code{n}, indicating a call to \code{range(1,n+1)}. Patterns are extracted automatically from student programs. 
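The matching semantics above can be made concrete with a small recursive matcher. The sketch below is our own minimal illustration (not the system's actual matcher): trees and TREs are written as nested tuples mirroring the S-expression notation, and we read ``adjacent and in that order'' as requiring the pattern's children to match a contiguous run of the tree node's children, unless anchored by \texttt{\textasciicircum} or \texttt{\$}.

```python
def label(node):
    # A node is either a bare label ('a') or a tuple ('a', child, ...).
    return node[0] if isinstance(node, tuple) else node

def children(node):
    return list(node[1:]) if isinstance(node, tuple) else []

def match(pat, tree):
    """Does TRE `pat` match `tree`? The label '.' matches any node."""
    if label(pat) not in ('.', label(tree)):
        return False
    return match_children(children(pat), children(tree))

def match_children(pats, nodes):
    at_start = bool(pats) and pats[0] == '^'
    at_end = bool(pats) and pats[-1] == '$'
    pats = pats[1 if at_start else 0:len(pats) - (1 if at_end else 0)]
    if not pats:
        return True
    if len(pats) > len(nodes):
        return False
    # Pattern children must match a contiguous run of tree children;
    # '^' pins the run to the first child, '$' to the last.
    starts = [0] if at_start else range(len(nodes) - len(pats) + 1)
    return any(
        (not at_end or s + len(pats) == len(nodes))
        and all(match(p, n) for p, n in zip(pats, nodes[s:s + len(pats)]))
        for s in starts)

# The example from the text: (a (b ^ d . e $) c)
PAT = ('a', ('b', '^', 'd', '.', 'e', '$'), 'c')
print(match(PAT, ('a', ('b', 'd', 'x', 'e'), 'c')))  # True
print(match(PAT, ('a', ('b', 'd', 'e'), 'c')))       # False: b needs 3 children
```

The second call fails because the anchored pattern \pattern{(b \textbf{\textasciicircum}~d~\textbf{.}~e~\$)} requires \pattern{b} to have exactly three children.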
We first canonicalize each program~\cite{rivers2015data-driven} using code from ITAP\footnote{Available at \url{https://github.com/krivers/ITAP-django}.}. For each pattern we select a subset of nodes in the AST, then walk the tree from each selected node to the root and construct the TRE by including all nodes along those paths. While pattern extraction is completely automated, we have manually defined the kinds of node subsets that are selected. After analyzing solutions to several programming problems, we decided to use the following kinds of patterns:

\begin{enumerate}
\item We select each pair of leaf nodes referring to the same variable.
\item For each control-flow node $n$ we construct a pattern from the set $\{n\}$; we do the same for each \textsf{Call} node representing a function call.
\item For each expression -- such as \code{(F-32)*5/9} -- we select the different combinations of literal and variable nodes in the expression. In these patterns we include at most one node referring to a variable.
\end{enumerate}

Note that in every constructed pattern, all \textsf{Var} nodes refer to the same variable. We found that patterns constructed from such node sets are useful for discriminating between programs. As we show in the next section, they are also easily interpreted in terms of bugs and strategies for a given problem.

\section{Interpreting learned rules: a case study}
\label{sec:rules}

The goal of learning rules in this paper is to discover and explain common approaches and mistakes in student programs. We use a rule learner similar to the one described in~\cite{lazar2017automatic}, implemented within the Orange data mining library~\cite{demsar2013orange}. Each program is represented in the feature space of AST patterns described in the previous section. Based on test results, each program is classified either as \emph{correct} or \emph{incorrect}. Rules are then learned to explain why a program is correct or incorrect.
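Before turning to the learned rules, we illustrate extraction for patterns of kind (2) above. The single-path control-flow patterns can be obtained by walking from each control-flow node back to the root. The sketch below is a simplification for illustration (not the actual ITAP-based pipeline): it uses Python's own \code{ast} module and records only node types, omitting the field names (\textsf{body}, \textsf{iter}, \ldots) that appear in the full patterns.

```python
import ast

# Control-flow node types considered by pattern kind (2).
CONTROL = (ast.FunctionDef, ast.For, ast.While, ast.If)

def control_flow_paths(tree):
    """Yield the chain of node types from the root down to each
    control-flow node -- single-path TREs like (Function (For (If)))."""
    def walk(node, path):
        path = path + [type(node).__name__]
        if isinstance(node, CONTROL):
            yield tuple(path)
        for child in ast.iter_child_nodes(node):
            yield from walk(child, path)
    return walk(tree, [])

src = '''
def divisors(n):
    for d in range(1, n):
        if n % d == 0:
            print(d)
'''
for p in control_flow_paths(ast.parse(src)):
    print(' -> '.join(p))
```

For the \code{divisors} example this prints three paths, the longest being \textsf{Module} $\rightarrow$ \textsf{FunctionDef} $\rightarrow$ \textsf{For} $\rightarrow$ \textsf{If}, i.e.\ the structure pattern from Section~2.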
Rules for incorrect programs are called \emph{n-rules} and rules for correct programs are called \emph{p-rules}. The patterns in the condition part of a rule can be used to analyze student programs. For example, the patterns from a rule for incorrect programs occur more often in incorrect than in correct programs, and therefore likely correspond to an error. This section describes several rules induced for the \textsf{Fahrenheit to Celsius} Python exercise: the program should read a temperature in degrees Fahrenheit from standard input, convert it to Celsius, and print the result. A sample correct program is:
\begin{Verbatim}
F = float(input("Fahrenheit: "))
C = 5 / 9 * (F - 32)
print("Celsius: ", C)
\end{Verbatim}
In the CodeQ environment, students have submitted 1177 programs for this problem: 495 correct and 682 incorrect. Our system extracted 891 relevant AST patterns, which were used as attributes in rule learning. The rule learner induced 24 n-rules and 16 p-rules. Two examples of highly accurate n-rules are:
\begin{Verbatim}[fontfamily=sf]
P20 ⇒ incorrect [208, 1]
P5 ∧ P35 ⇒ incorrect [72, 0]
\end{Verbatim}
\noindent
The first rule covers programs where the pattern \textsf{P20} is present. The rule implies an incorrect program, and covers 208 incorrect and one correct program. \textsf{P20} is the AST pattern describing a call to the \texttt{int} function:
\begin{Verbatim}[fontfamily=sf]
(Module (body (Assign (value (Call (func (Name (id int) (ctx Load))))))))
\end{Verbatim}
The pattern \textsf{P5} in the second rule matches programs where the result of the \texttt{input} call is not cast to \texttt{float} but stored as a string. Pattern \textsf{P35} matches programs where the value 32 is subtracted from a variable on the left-hand side of a multiplication.
Sample programs matching the first rule (left) and the second rule (right) are: \begin{Verbatim} g2 = input() g2 = \blue{\underline{input}}('Temperature [F]? ') g1 = \blue{\underline{int}}(g2) g1 = (\red{\dashuline{(g2 - 32) *}} (5 / 9)) print(((g1-32)*(5/9))) print(g2, 'F equals', g1, 'C') \end{Verbatim} These rules describe two common student errors. The left program is incorrect, since it fails when the user inputs a decimal. The right program is incorrect because the input string must be cast to a number. Not casting it (pattern \textsf{P5}) and then using it in an expression (pattern \textsf{P35}) will raise an exception. In some cases, n-rules imply a missing pattern. For example: \begin{Verbatim}[fontfamily=sf] ¬P0 ⇒ incorrect [106, 0] \end{Verbatim} \noindent Pattern \textsf{P0} matches programs with a call to function \texttt{print}. A program without a \texttt{print} is always incorrect, since it will not output anything. Let us now examine the other type of rules. A sample p-rule was: \begin{Verbatim}[fontfamily=sf] P2 ∧ P8 ⇒ correct [1, 200] \end{Verbatim} \noindent Patterns in the condition of the rule, \textsf{P2} and \textsf{P8}, correspond respectively to expressions of the form \texttt{float(input(?))} and \texttt{print((?-32)*?)}. Programs matching both patterns wrap the function \texttt{float} around \texttt{input}, and have an expression that subtracts 32 and then uses multiplication within the \texttt{print}. This rule demonstrates an important property of p-rules: although patterns \textsf{P2} and \textsf{P8} are in general not sufficient for a correct program (it is trivial to implement a matching but incorrect program), only one out of 201 student submissions matching these patterns was incorrect. This suggests that the conditions of p-rules represent the critical elements of the solution. Once students have figured out these patterns, they are almost certain to have a correct solution. 
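For reference, the coverage pair reported with each rule, such as \textsf{[1, 200]}, simply counts the incorrect and correct programs whose pattern sets satisfy the rule's condition. A toy sketch with hypothetical data (the actual learner operates on the 891-pattern feature space described above):

```python
def coverage(condition, programs):
    """Return [#incorrect, #correct] programs covered by a rule whose
    condition requires every pattern in `condition` to be present."""
    counts = [0, 0]
    for patterns, is_correct in programs:
        if condition <= patterns:          # all required patterns present
            counts[1 if is_correct else 0] += 1
    return counts

# Hypothetical data: each program is (set of matched patterns, correct?).
programs = [
    ({'P2', 'P8', 'P0'}, True),
    ({'P2', 'P0'},       True),
    ({'P2', 'P8'},       False),
    ({'P5', 'P35'},      False),
]

print(coverage({'P2', 'P8'}, programs))    # -> [1, 1]
print(coverage({'P5', 'P35'}, programs))   # -> [1, 0]
```

On the real data, the condition \{\textsf{P2}, \textsf{P8}\} yields the pair \textsf{[1, 200]} reported for the p-rule above.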
A sample program matching the p-rule above is:
\begin{Verbatim}
g1 = \blue{\underline{float(input(}}'Temperature [F]: '))
print(((g1 \red{\dashuline{- 32) *}} (5 / 9)))
\end{Verbatim}

\section{Discussion and further work}

Our primary interest in this paper is to support manual analysis of student submissions. We proposed first representing submitted programs with patterns extracted from their abstract syntax trees, and then learning classification rules that distinguish between correct and incorrect programs. We showed that both rules and patterns are easy to interpret and can be used to explain typical mistakes and approaches. The accuracy of automatic classification plays a secondary role; however, it is a useful estimate of the expressiveness of the patterns. Over 12 exercises, a random forest model achieved about 17\% higher overall accuracy than the majority classifier. This result indicates that a significant amount of information can be gleaned from simple syntax-oriented analysis. To further improve the quality of patterns, we intend to analyze misclassified programs and derive new kinds of patterns, which should enable better learning.

We have demonstrated how AST patterns can be described with TREs, and how such patterns can be combined to discover important concepts and errors in student programs. Currently, analyzing patterns and rules is quite cumbersome. We plan to develop a tool that allows teachers to easily construct and refine patterns and rules based on example programs. Ideally, we would integrate our approach into an existing analysis tool such as OverCode~\cite{glassman2015overcode}.

\bibliographystyle{splncs}
\bibliography{aied2018}
\end{document}