From 9a44bdd0aa93e9957e0c917b6ac72c68c9da21c6 Mon Sep 17 00:00:00 2001
From: Martin Mozina
Date: Tue, 6 Feb 2018 14:24:56 +0100
Subject: Added evaluation.

---
 aied2018/patterns.tex | 39 ---------------------------------------
 1 file changed, 39 deletions(-)

(limited to 'aied2018/patterns.tex')

diff --git a/aied2018/patterns.tex b/aied2018/patterns.tex
index 0b121f7..f6ef46e 100644
--- a/aied2018/patterns.tex
+++ b/aied2018/patterns.tex
@@ -152,42 +152,3 @@ For each expression (such as \code{(F-32)*5/9}) we select the different combinat
 We found that patterns constructed from such nodesets are useful for discriminating between programs. As we show in Sect.~\ref{sec:interpreting-rules}, they are also easily interpreted in terms of bugs and strategies for a given problem.
 Note that in any pattern constructed in this way, all \textsf{Var} nodes refer to the same variable, simplifying interpretation.

-\subsection{Evaluating patterns}
-
-Our primary interest in this paper is finding rules that support manual analysis of student submissions. The accuracy of automatic classification is thus secondary to the interpretability of our model. Here we give only a rudimentary evaluation of patterns as attributes for distinguishing between correct and incorrect programs.
-
-Evaluation was performed on a subset of exercises from an introductory Python course. The course ran in the online programming environment CodeQ\footnote{Available at \url{https://codeq.si}. Source under AGPL3+ at \url{https://codeq.si/code}.}, which was used to collect correct and incorrect submissions. Table~\ref{tab:stats} shows the number of users attempting each problem, the numbers of all and correct submissions, and the accuracy of the majority and random-forest classifiers for predicting the correctness of a program based on the patterns it contains. We tested these classifiers using 10-fold cross-validation.
-
-\setlength{\tabcolsep}{6pt}
-\def\arraystretch{1.1}
-\begin{table}[htb]
-\caption{Solving statistics and classification accuracy for several introductory Python problems. The second column shows the number of users who attempted each problem. Columns 3 and 4 show the numbers of all and of correct submissions. The last two columns show the classification accuracy (CA) of the majority and random-forest classifiers.}
-\centering
- \begin{tabular}{r|c|cc|cc}
- & & \multicolumn{2}{c|}{\textbf{Submissions}} & \multicolumn{2}{c}{\textbf{CA}} \\
- \textbf{Problem} & \textbf{Users} & Total & Correct & Majority & RF \\
- \hline
-\textsf{fahrenheit\_to\_celsius}& 521 & 1177 & 495 & 0.579 & 0.933 \\
-%\textsf{pythagorean\_theorem}& 349 & 669 & 317 & 0.499 & 0.809 \\
-\textsf{ballistics}& 248 & 873 & 209 & 0.761 & 0.802 \\
-\textsf{average}& 209 & 482 & 186 & 0.614 & 0.830 \\
-\hline
-\textsf{buy\_five}& 294 & 476 & 292 & 0.613 & 0.828 \\
-\textsf{competition}& 227 & 327 & 230 & 0.703 & 0.847 \\
-\textsf{top\_shop}& 142 & 476 & 133 & 0.721 & 0.758 \\
-\textsf{minimax}& 74 & 163 & 57 & 0.650 & 0.644 \\
-\textsf{checking\_account}& 132 & 234 & 112 & 0.521 & 0.744 \\
-\textsf{consumers\_anonymous}& 65 & 170 & 53 & 0.688 & 0.800 \\
-\hline
-\textsf{greatest}& 70 & 142 & 83 & 0.585 & 0.859 \\
-\textsf{greatest\_absolutist}& 58 & 155 & 57 & 0.632 & 0.845 \\
-\textsf{greatest\_negative}& 62 & 195 & 71 & 0.636 & 0.815 \\
-\hline
-Total / average & 2102 & 4811 & 1978 & 0.642 & 0.809 \\
- \end{tabular}
-\label{tab:stats}
-\end{table}
-
-We see that AST patterns increase classification accuracy by about 17 percentage points overall. This result indicates that a significant amount of information can be gleaned from simple syntax-oriented analysis. One exception is the \textsf{ballistics} problem, where the improvement was quite small. This is an artifact of our testing framework: the program is expected to read two values from standard input in a certain order, but none of the used patterns can encode this requirement.
-
-% TODO: why does the random forest score below the majority baseline on minimax?
--
cgit v1.2.1
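
A note on reproducing the evaluation in the removed subsection: it uses pattern presence as binary attributes, a majority-class baseline, a random forest, and 10-fold cross-validation. The sketch below shows one way to set that up. It is a minimal illustration, not the authors' code: the paper names no toolkit, so scikit-learn stands in for it here, and the feature matrix X and labels y are random placeholders for the real pattern-occurrence and correctness data.

    # Minimal sketch of the evaluation described above, assuming scikit-learn.
    # X and y are placeholders: X[i, j] == 1 iff pattern j appears in the AST
    # of submission i; y[i] == 1 iff submission i is correct.
    import numpy as np
    from sklearn.dummy import DummyClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(482, 120))  # e.g. 482 submissions, 120 patterns
    y = rng.integers(0, 2, size=482)         # correctness labels (placeholder)

    majority = DummyClassifier(strategy="most_frequent")  # majority-class baseline
    forest = RandomForestClassifier(n_estimators=100, random_state=0)

    ca_majority = cross_val_score(majority, X, y, cv=10, scoring="accuracy").mean()
    ca_forest = cross_val_score(forest, X, y, cv=10, scoring="accuracy").mean()
    print(f"majority CA: {ca_majority:.3f}, random-forest CA: {ca_forest:.3f}")

With real data, X and y would be built from the CodeQ submission logs instead of the random placeholders; the two printed means then correspond to the Majority and RF columns of the table.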