From 9a44bdd0aa93e9957e0c917b6ac72c68c9da21c6 Mon Sep 17 00:00:00 2001
From: Martin Mozina
Date: Tue, 6 Feb 2018 14:24:56 +0100
Subject: Added evaluation.

---
 aied2018/patterns.tex | 39 ---------------------------------------
 1 file changed, 39 deletions(-)

(limited to 'aied2018/patterns.tex')

diff --git a/aied2018/patterns.tex b/aied2018/patterns.tex
index 0b121f7..f6ef46e 100644
--- a/aied2018/patterns.tex
+++ b/aied2018/patterns.tex
@@ -152,42 +152,3 @@ For each expression (such as \code{(F-32)*5/9}) we select the different combinat
 We found that patterns constructed from such nodesets are useful for discriminating between programs. As we show in Sect.~\ref{sec:interpreting-rules}, they are also easily interpreted in terms of bugs and strategies for a given problem.
 Note that in any pattern constructed in this way, all \textsf{Var} nodes refer to the same variable, simplifying interpretation.

-\subsection{Evaluating patterns}
-
-Our primary interest in this paper is finding rules that support manual analysis of student submissions. The accuracy of automatic classification is thus secondary to the interpretability of our model. Here we give only a rudimentary evaluation of patterns as attributes for distinguishing between correct and incorrect programs.
-
-Evaluation was performed on a subset of exercises from an introductory Python course. The course ran in the online programming environment CodeQ\footnote{Available at \url{https://codeq.si}. Source under AGPL3+ at \url{https://codeq.si/code}.}, which was used to collect correct and incorrect submissions. Table~\ref{tab:stats} shows the number of users attempting each problem, the numbers of all and correct submissions, and the accuracy of the majority and random-forest classifiers for predicting the correctness of a program based on the patterns it contains. We tested these classifiers using 10-fold cross-validation.
-
-\setlength{\tabcolsep}{6pt}
-\def\arraystretch{1.1}
-\begin{table}[htb]
-\caption{Solving statistics and classification accuracy for several introductory Python problems. The second column shows the number of users who attempted each problem. Columns 3 and 4 show the numbers of all and of correct submissions. The last two columns show the classification accuracy (CA) of the majority and random-forest classifiers.}
-\centering
- \begin{tabular}{r|c|cc|cc}
- & & \multicolumn{2}{c|}{\textbf{Submissions}} & \multicolumn{2}{c}{\textbf{CA}} \\
- \textbf{Problem} & \textbf{Users} & Total & Correct & Majority & RF \\
- \hline
-\textsf{fahrenheit\_to\_celsius}& 521 & 1177 & 495 & 0.579 & 0.933 \\
-%\textsf{pythagorean\_theorem}& 349 & 669 & 317 & 0.499 & 0.809 \\
-\textsf{ballistics}& 248 & 873 & 209 & 0.761 & 0.802 \\
-\textsf{average}& 209 & 482 & 186 & 0.614 & 0.830 \\
-\hline
-\textsf{buy\_five}& 294 & 476 & 292 & 0.613 & 0.828 \\
-\textsf{competition}& 227 & 327 & 230 & 0.703 & 0.847 \\
-\textsf{top\_shop}& 142 & 476 & 133 & 0.721 & 0.758 \\
-\textsf{minimax}& 74 & 163 & 57 & 0.650 & 0.644 \\
-\textsf{checking\_account}& 132 & 234 & 112 & 0.521 & 0.744 \\
-\textsf{consumers\_anonymous}& 65 & 170 & 53 & 0.688 & 0.800 \\
-\hline
-\textsf{greatest}& 70 & 142 & 83 & 0.585 & 0.859 \\
-\textsf{greatest\_absolutist}& 58 & 155 & 57 & 0.632 & 0.845 \\
-\textsf{greatest\_negative}& 62 & 195 & 71 & 0.636 & 0.815 \\
-\hline
-Total / average & 2102 & 4811 & 1978 & 0.642 & 0.809 \\
- \end{tabular}
-\label{tab:stats}
-\end{table}
-
-We see that AST patterns increase classification accuracy by about 17 percentage points overall. This result indicates that a significant amount of information can be gleaned from simple syntax-oriented analysis. One exception is the \textsf{ballistics} problem, where the improvement was quite small. This is an artifact of our testing framework: the program is expected to read two values from standard input in a certain order, but none of the used patterns can encode this requirement.
-
-% TODO: why does the random forest score below the majority baseline on minimax?
--
cgit v1.2.1
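
A note on reproducing the evaluation in the removed subsection: it uses pattern presence as binary attributes, a majority-class baseline, a random forest, and 10-fold cross-validation. The sketch below shows one way to set that up. It is a minimal illustration, not the authors' code: the paper names no toolkit, so scikit-learn stands in for it here, and the feature matrix X and labels y are random placeholders for the real pattern-occurrence and correctness data.

    # Minimal sketch of the evaluation described above, assuming scikit-learn.
    # X and y are placeholders: X[i, j] == 1 iff pattern j appears in the AST
    # of submission i; y[i] == 1 iff submission i is correct.
    import numpy as np
    from sklearn.dummy import DummyClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(482, 120))  # e.g. 482 submissions, 120 patterns
    y = rng.integers(0, 2, size=482)         # correctness labels (placeholder)

    majority = DummyClassifier(strategy="most_frequent")  # majority-class baseline
    forest = RandomForestClassifier(n_estimators=100, random_state=0)

    ca_majority = cross_val_score(majority, X, y, cv=10, scoring="accuracy").mean()
    ca_forest = cross_val_score(forest, X, y, cv=10, scoring="accuracy").mean()
    print(f"majority CA: {ca_majority:.3f}, random-forest CA: {ca_forest:.3f}")

With real data, X and y would be built from the CodeQ submission logs instead of the random placeholders; the two printed means then correspond to the Majority and RF columns of the table.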