From 46502e354696f04d6d63182b1b0577d9e036a697 Mon Sep 17 00:00:00 2001 From: Timotej Lazar Date: Thu, 1 Feb 2018 20:47:39 +0100 Subject: Add section on patterns Evaluation with RFs is still missing. --- aied2018/aied2018.tex | 32 +++++++++- aied2018/patterns.tex | 160 ++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 189 insertions(+), 3 deletions(-) create mode 100644 aied2018/patterns.tex (limited to 'aied2018') diff --git a/aied2018/aied2018.tex b/aied2018/aied2018.tex index 32bd728..64f66a8 100644 --- a/aied2018/aied2018.tex +++ b/aied2018/aied2018.tex @@ -1,13 +1,39 @@ \documentclass{llncs} \usepackage[utf8]{inputenc} -\usepackage{hyperref} - \usepackage{newunicodechar} \newunicodechar{∧}{\ensuremath{\land}} \newunicodechar{⇒}{\ensuremath{\Rightarrow}} +\newunicodechar{→}{\ensuremath{\rightarrow}} \newunicodechar{⋯}{\ensuremath{\cdots}} +\usepackage{bold-extra} +\usepackage{hyperref} +\usepackage[normalem]{ulem} + +\usepackage{color} +\newcommand\red[1]{{\begingroup\color[rgb]{0.8,0.15,0.15}#1\endgroup}} +\newcommand\blue[1]{{\begingroup\color[rgb]{0.15,0.15,0.8}#1\endgroup}} + +\usepackage{fancyvrb} +\fvset{commandchars=\\\{\},baselinestretch=0.98,samepage=true,xleftmargin=2.5mm} +% WAT — I don’t even… +\makeatletter +\begingroup +\catcode`\`=\active +\gdef\FV@fontfamily@sf{% + \def\FV@FontScanPrep{\FV@MakeActive\`}% + \def\FV@FontFamily{\sffamily\edef`{{\string`}}}} +\endgroup +\makeatother + +\usepackage{tikz} +\usepackage{forest} +\usetikzlibrary{arrows.meta,calc} + +\newcommand\code[1]{\texttt{#1}} +\newcommand\pattern[1]{\textsf{#1}} + \begin{document} \title{Syntax-based analysis of programming concepts in Python} \author{Martin Možina, Timotej Lazar} @@ -25,7 +51,7 @@ We propose using tree regular expressions to encode common patterns in programs. %\input{introduction} %\input{background} -%\input{patterns} +\input{patterns} %\input{rules} %\input{evaluation} %\input{conclusion} diff --git a/aied2018/patterns.tex b/aied2018/patterns.tex new file mode 100644 index 0000000..4edd981 --- /dev/null +++ b/aied2018/patterns.tex @@ -0,0 +1,160 @@ +\section{AST patterns} + +We encode structural patterns in ASTs using tree regular expressions (TREs). An ordinary regular expression describes the set of strings matching certain constraints; similarly, a TRE describes the set of trees containing certain nodes and relations. Since TREs describe structure, they are themselves represented as trees. More specifically, both ASTs and TREs are ordered rooted trees. + +In this work we used TREs to encode (only) child and sibling relations in ASTs. We write them as S-expressions, such as \pattern{(a (b \textbf{\textasciicircum}~d~\textbf{.}~e~\$) c)}. This expression matches any tree satisfying the following constraints: + +\begin{itemize} +\item the root node \pattern{a} has at least two children, \pattern{b} and \pattern{c}, in that order; and +\item the node \pattern{b} has three children: \pattern{d}, followed by any node, followed by \pattern{e}. +\end{itemize} + +Figure \ref{fig:tre-example} shows one instance of a matching tree, with arrows indicating the pattern. Analogous to ordinary regular expressions, caret (\texttt{\textbf{\textasciicircum}}) and dollar sign (\texttt{\$}) anchor a node to be respectively the first or last child of its parent. A period (\texttt{\textbf{.}}) represents a wildcard that matches any one node. + +\begin{figure}[htbp] + \centering + \begin{forest} + for tree={ + font=\sf, + edge=darkgray, + l sep=0, + l=0.9cm, + } +[a,name=a + [b,name=b + [d,name=d] [g,name=g] [e,name=e]] + [f] + [c,name=c [j] [k]] +] +\path[-{Latex[length=1.5mm,width=1mm]},thick,relative,blue] (a) edge[transform canvas={xshift=-1.1mm,yshift=0.2mm}] (b); +\path[-{Latex[length=1.5mm,width=1mm]},thick,relative,blue] (a) edge[transform canvas={xshift=1.1mm,yshift=0.2mm}] (c); +\path[-{Latex[length=1.5mm,width=1mm]},thick,relative,blue] (b) edge[transform canvas={xshift=-1.1mm,yshift=0.2mm}] (d); +\draw[opacity=0] (b) -- node[anchor=east,thick,blue,opacity=1,font=\large] {\texttt{\textasciicircum}} (d); +\path[draw,thick,relative,blue,transform canvas={xshift=-0.7mm}] (b) -- ($ (b)!0.6!(g) $); +\path[-{Latex[length=1.5mm,width=1mm]},thick,relative,blue] (b) edge[transform canvas={xshift=1.1mm,yshift=0.2mm}] (e); +\draw[opacity=0] (b) -- node[anchor=west,thick,blue,opacity=1,font=\scriptsize,transform canvas={xshift=0mm,yshift=1mm}] {\texttt{\$}} (e); + \end{forest} + \caption{A tree with a pattern (in blue besides the edges). An arrow $x→y$ indicates that node $x$ has a child $y$. The shorter line \pattern{b}—\pattern{g} means that the child can be any node. Anchors \texttt{\textbf{\textasciicircum}} and \texttt{\$} mean that the pattern will match only the first or last child.} + \label{fig:tre-example} +\end{figure} + +With TREs we encode interesting patterns in a program while disregarding irrelevant parts. Take for example the following Python function that prints the divisors of its argument $n$: + +\begin{Verbatim} +def divisors(n): + for d in range(1, n): + if n % d == 0: + print(d) +\end{Verbatim} + +Figure \ref{fig:patterns-example} shows the simplified AST for this program, with two patterns overlaid. As described below, we construct our TREs automatically from certain subsets of nodes in the AST. The dashed pattern in Fig. \ref{fig:patterns-example} was induced from the control-flow nodes in the program \cite{hovemeyer2016control}, while the solid pattern was induced from a pair of leaf nodes that represent the same variable. + + +\begin{figure}[htbp] + \centering + \begin{forest} + for tree={ + font=\sf, + edge=darkgray, + %s sep=0.05cm, + l sep=0, + l=0.9cm, + %s sep=0, + %s=1.5cm, + } +[Function, name=def, draw,rectangle,red,dashed + [name, name=name1 [divisors, name=divisors, font=\bfseries\sffamily]] + [args, name=args1 + [Var, name=args1-1, draw,rectangle,blue [n, font=\bfseries\sffamily, l=0.7cm]]] + [body, name=body1, before computing xy={s=2cm} + [For, name=for, l=1.2cm, draw,rectangle,red,dashed + [target + [Var, l=0.9cm [d, font=\bfseries\sffamily, l=0.7cm]]] + [iter, name=iter + [Call, name=call, l=0.9cm + [func, name=func, l=0.8cm [range, name=range, font=\bfseries\sffamily, l=0.7cm]] + [args, name=args2, l=0.8cm + [Num, name=args2-1, l=1cm [1, font=\bfseries\sffamily, l=0.7cm]] + [Var, name=args2-2, l=1cm,draw,rectangle,blue [n, font=\bfseries\sffamily, l=0.7cm]]]]] + [body, name=body2 + [If, name=if, draw,rectangle,red,dashed + [test, l=0.7cm [Compare, l=0.8cm + [BinOp + [Var [n, font=\bfseries\sffamily, l=0.6cm]] + [\%, font=\bfseries\sffamily] + [Var [d, font=\bfseries\sffamily, l=0.6cm]]] + [{==}, font=\bfseries\sffamily, l=0.7cm] + [Num [0, font=\bfseries\sffamily, l=0.7cm]]]] + [body, l=0.7cm [Call, l=0.8cm + [func [print, font=\bfseries\sffamily, l=0.7cm]] + [args [Var, l=0.7cm [d, font=\bfseries\sffamily, l=0.6cm]]]]]]]]]] +% first pattern +\path[-{Latex[length=1.5mm,width=1mm]},thick,relative,red,dashed] (def) edge[transform canvas={xshift=1.1mm,yshift=0.2mm}] (body1); +\path[-{Latex[length=1.5mm,width=1mm]},thick,relative,red,dashed] (body1) edge[transform canvas={xshift=0.8mm}] (for); +\path[-{Latex[length=1.5mm,width=1mm]},thick,relative,red,dashed] (for) edge[transform canvas={xshift=1.1mm,yshift=0.5mm}] (body2); +\path[-{Latex[length=1.5mm,width=1mm]},thick,relative,red,dashed] (body2) edge[transform canvas={xshift=0.8mm}] (if); +% second pattern +\path[-{Latex[length=1.5mm,width=1mm]},thick,relative,blue] (def) edge[transform canvas={xshift=-1.1mm,yshift=0.2mm}] (name1); +\path[-{Latex[length=1.5mm,width=1mm]},thick,relative,blue] (def) edge[transform canvas={xshift=-0.8mm}] (args1); +\path[-{Latex[length=1.5mm,width=1mm]},thick,relative,blue] (args1) edge[transform canvas={xshift=-0.8mm}] (args1-1); +\draw[opacity=0] (args1) -- node[anchor=east,thick,blue,opacity=1,font=\large] {\texttt{\textasciicircum}} (args1-1); +\draw[opacity=0] (args1) -- node[anchor=west,thick,blue,opacity=1,font=\scriptsize,transform canvas={xshift=-0.5mm,yshift=0.6mm}] {\texttt{\$}} (args1-1); +\path[-{Latex[length=1.5mm,width=1mm]},thick,relative,blue] (def) edge[transform canvas={xshift=-1mm,yshift=-0.3mm}] (body1); +\path[-{Latex[length=1.5mm,width=1mm]},thick,relative,blue] (body1) edge[transform canvas={xshift=-0.8mm}] (for); +\path[-{Latex[length=1.5mm,width=1mm]},thick,relative,blue] (for) edge[transform canvas={xshift=1mm,yshift=-0.5mm}] (iter); +\path[-{Latex[length=1.5mm,width=1mm]},thick,relative,blue] (iter) edge[transform canvas={xshift=0.8mm}] (call); +\path[-{Latex[length=1.5mm,width=1mm]},thick,relative,blue] (call) edge[transform canvas={xshift=-1.2mm}] (func); +\path[-{Latex[length=1.5mm,width=1mm]},thick,relative,blue] (call) edge[transform canvas={xshift=1.2mm}] (args2); +\path[draw,thick,relative,blue,transform canvas={xshift=-0.7mm}] (args2) -- ($ (args2)!0.55!(args2-1) $); +\draw[opacity=0] (args2) -- node[anchor=east,thick,blue,opacity=1,font=\large,transform canvas={xshift=0.6mm,yshift=0.4mm}] {\texttt{\textasciicircum}} (args2-1); +\path[-{Latex[length=1.5mm,width=1mm]},thick,relative,blue] (args2) edge[transform canvas={xshift=1mm}] (args2-2); +\draw[opacity=0] (args2) -- node[anchor=west,thick,blue,opacity=1,font=\scriptsize,transform canvas={xshift=0mm,yshift=1mm}] {\texttt{\$}} (args2-2); + \end{forest} + \caption{The AST for the \code{divisors} program with two patterns. Leaf nodes (in bold) correspond to terminals in the program, i.e. names and values. Dashed red arrows represent the pattern describing the control structure of the program. Solid blue arrows encode the incorrect second argument to the \code{range} function.} + \label{fig:patterns-example} +\end{figure} + +These two patterns are represented by the respective S-expressions + +\begin{enumerate} +\item \pattern{(Function (body (For (body If))))} and +\item \pattern{(Function (name divisors) (args \textbf{\textasciicircum}~Var \$)} +\item[] ~~~~\pattern{(body (For (iter (Call (func range) (args \textbf{\textasciicircum}~\textbf{.} Var \$))))))}. +\end{enumerate} + +\noindent +Since this representation is not easy to read, we will instead represent TREs by highlighting relevant text in examples of matching programs. For instance + +\begin{Verbatim} +\red{\dashuline{\textbf{def}}} divisors(\blue{\underline{\textbf{n}}}): + \red{\dashuline{\textbf{for}}} d in range(1, \blue{\underline{\textbf{n}}}): + \red{\dashuline{\textbf{if}}} n % d == 0: + print(d) +\end{Verbatim} + +While the pattern encoding control flow is correct, the second pattern describes a common mistake for this problem: \code{range(1,n)} will only generate values up to $n-1$, so $n$ will not be printed as its own divisor. A correct pattern would include the binary operator \code{+} on the path to $n$, indicating a call to \code{range(1,n+1)}. + +\subsection{Constructing patterns} + +Patterns are extracted automatically from student programs. To construct each pattern, we select a subset of nodes in the AST. Then we walk the tree from each selected node to the root and include all nodes along those paths. + +Depending on node type we also include some nodes adjacent to such paths. For each comparison and unary/binary expression on the path we include the corresponding operator. For function definitions and calls we include the function name. Finally, in all argument lists we include the anchors (\textbf{\texttt{\textasciicircum}}~and~\code{\$}) and a wildcard (\textbf{\texttt{.}}) for each argument not on the path. This allows our TREs to discriminate between e.g. the first and second argument to a function. + +While pattern extraction is completely automated, we have manually defined the kinds of node subsets that are selected. After analyzing solutions to several programming problems, we decided to use the following kinds of patterns. Figure \ref{fig:patterns-example} shows patterns based on the first two items. + +\begin{enumerate} +\item +We select each pair of leaf nodes referring to the same variable. +\item +For each control-flow node $n$ we construct a pattern from the set $\{n\}$; we do the same for each \textsf{Call} node. +\item +For each expression (such as \code{(F-32)*5/9}) we select the different combinations of literal and variable nodes in the expression. In these patterns we include at most one node referring to a variable. +\end{enumerate} + +We found that patterns constructed from such nodesets are useful for discriminating between programs. As we show in Sect. \ref{sec:interpreting-rules}, they are also easily interpreted in terms of bugs and strategies for a given problem. + +\subsection{Evaluating patterns} + +Our primary interest in this paper is finding rules to help manual analysis of student submissions. The accuracy of automatic classification thus plays a secondary role to the interpretability of our model. Here we give only a rudimentary evaluation of patterns as attributes for distinguishing between correct and incorrect programs. + +% TODO -- cgit v1.2.1