From dcda0e6145351553132addf6ba59f2eeff2ce1e1 Mon Sep 17 00:00:00 2001
From: Timotej Lazar <timotej.lazar@fri.uni-lj.si>
Date: Sun, 4 Feb 2018 16:59:34 +0100
Subject: Add evaluation for patterns

---
 aied2018/patterns.tex | 36 +++++++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/aied2018/patterns.tex b/aied2018/patterns.tex
index b273f8d..6517ec6 100644
--- a/aied2018/patterns.tex
+++ b/aied2018/patterns.tex
@@ -156,4 +156,38 @@ We found that patterns constructed from such nodesets are useful for discriminat
 
 Our primary interest in this paper is finding rules to help manual analysis of student submissions. The accuracy of automatic classification thus plays a secondary role to the interpretability of our model. Here we give only a rudimentary evaluation of patterns as attributes for distinguishing between correct and incorrect programs.
 
-% TODO
+Evaluation was performed on a subset of exercises from an introductory Python course. The course was implemented in the online programming environment CodeQ\footnote{Available at \url{https://codeq.si}. Source under AGPL3+ at \url{https://codeq.si/code}.}, which was used to collect correct and incorrect submissions. Table~\ref{tab:stats} shows the number of users attempting each problem, the total number of submissions and the number of correct ones, and the performance of majority and random-forest classifiers for predicting the correctness of a program based on the patterns it contains. We tested these classifiers using 10-fold cross-validation.
+
+\setlength{\tabcolsep}{6pt}
+\def\arraystretch{1.1}
+\begin{table}[htb]
+\caption{Solving statistics and classification accuracy for several introductory Python problems. The second column shows the number of users attempting the problem. Columns 3 and 4 show the total number of submissions and the number of correct submissions. The last two columns show the classification accuracy (CA) of the majority and random-forest (RF) classifiers.}
+\centering
+ \begin{tabular}{r|c|cc|cc}
+ & & \multicolumn{2}{c|}{\textbf{Submissions}} & \multicolumn{2}{c}{\textbf{CA}} \\
+ \textbf{Problem} & \textbf{Users} & Total & Correct & Majority & RF \\
+ \hline
+\textsf{fahrenheit\_to\_celsius}& 521 & 1044 & 384 & 0.579 & 0.933 \\
+\textsf{pythagorean\_theorem}& 349 & 669 & 317 & 0.499 & 0.809 \\
+\textsf{ballistics}& 248 & 833 & 197 & 0.761 & 0.802 \\
+\textsf{average}& 209 & 473 & 178 & 0.614 & 0.830 \\
+\hline
+\textsf{buy\_five}& 294 & 419 & 208 & 0.613 & 0.828 \\
+\textsf{competition}& 227 & 316 & 207 & 0.703 & 0.847 \\
+\textsf{top\_shop}& 142 & 471 & 129 & 0.721 & 0.758 \\
+\textsf{minimax}& 74 & 164 & 57 & 0.650 & 0.644 \\
+\textsf{checking\_account}& 132 & 221 & 101 & 0.521 & 0.744 \\
+\textsf{consumers\_anonymous}& 65 & 170 & 51 & 0.688 & 0.800 \\
+\hline
+\textsf{greatest}& 70 & 86 & 25 & 0.585 & 0.859 \\
+\textsf{greatest\_absolutist}& 58 & 132 & 32 & 0.632 & 0.845 \\
+\textsf{greatest\_negative}& 62 & 154 & 37 & 0.636 & 0.815 \\
+\hline
+Total / average & 2451 & 5152 & 1923 & 0.631 & 0.809 \\
+ \end{tabular}
+\label{tab:stats}
+\end{table}
+
+We see that AST patterns increase classification accuracy by about 18 percentage points overall (from 0.631 to 0.809). This result indicates that a significant amount of information can be gleaned from simple syntax-oriented analysis. One exception is the \textsf{ballistics} problem, where the improvement was quite small. This is an artifact of our testing framework: the program is expected to read two values from standard input in a certain order, but none of the patterns used could encode this information.
+
+% TODO why worse-than-bad results for minimax?
-- 
cgit v1.2.1