aied2018/evaluation.tex


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41

\section{Evaluation, discussion, and further work}

\setlength{\tabcolsep}{4.5pt}
\def\arraystretch{1.1}
\begin{table}[t]
	\caption{Solving statistics, classification accuracy, and coverage of rules for several introductory Python problems. The second column shows the number of users attempting the problem. Columns 3 and 4 show the number of all / correct submissions. The next two columns show the classification accuracy for the majority and random-forest classifiers. The last three rows show percentages of covered examples: columns $n_p$ and $n$ contain covered incorrect programs (n-rules with presence of patterns and all n-rules), column $p$ contains the percentage of covered correct programs by p-rules.}
	\centering
	\begin{tabular}{l|c|cc|cc|ccc}
		& & \multicolumn{2}{c|}{\textbf{Submissions}} & \multicolumn{2}{c|}{\textbf{CA}} & \multicolumn{3}{c}{\textbf{Coverage}} \\
		\textbf{Problem} & \textbf{Users} & Total & Correct & Maj & RF & $n_p$ & $n$ & $p$ \\
		\hline
		\textsf{fahrenheit\_to\_celsius}& 521 & 1177 & 495 & 0.579 & 0.933 & 0.708 & 0.935 & 0.867 \\
		%\textsf{pythagorean\_theorem}& 349 & 669 & 317 & 0.499 & 0.809 \\
		\textsf{ballistics}& 248 & 873 & 209 & 0.761 & 0.802 & 0.551 & 0.666 & 0.478 \\
		\textsf{average}& 209 & 482 & 186 & 0.614 & 0.830 & 0.230 & 0.338 & 0.618 \\
		\hline
		\textsf{buy\_five}& 294 & 476 & 292 & 0.613 & 0.828 & 0.234 & 0.489 & 0.719 \\
		\textsf{competition}& 227 & 327 & 230 & 0.703 & 0.847 & 0.361 & 0.515 & 0.896 \\
		\textsf{top\_shop}& 142 & 476 & 133 & 0.721 & 0.758 & 0.399 & 0.802 & 0.444 \\
		\textsf{minimax}& 74 & 163 & 57 & 0.650 & 0.644 & 0.462 & 0.745 & 0.298 \\
		\textsf{checking\_account}& 132 & 234 & 112 & 0.521 & 0.744 & 0.143 & 0.491 & 0.115\\
		\textsf{consumers\_anon}& 65 & 170 & 53 & 0.688 & 0.800 & 0.376 & 0.880 & 0.623 \\
		\hline
		\textsf{greatest}& 70 & 142 & 83 & 0.585 & 0.859 & 0.492 & 0.746 & 0.880\\
		\textsf{greatest\_abs}& 58 & 155 & 57 & 0.632 & 0.845 & 0.612 & 0.878 & 0.789 \\
		\textsf{greatest\_neg}& 62 & 195 & 71 & 0.636 & 0.815 & 0.621 & 0.960 & 0.718 \\
		\hline
		Total / average & 2102 & 4811 & 1978 & 0.642 & 0.809 & 0.432 & 0.704 & 0.620 \\
	\end{tabular}
	\label{tab:stats}
\end{table}


Evaluation was performed on a subset of exercises from an introductory Python course. The course was implemented in the online programming environment CodeQ\footnote{Available at \url{https://codeq.si}. Source under AGPL3+ at \url{https://codeq.si/code}.}, which was used to collect correct and incorrect submissions. Table~\ref{tab:stats} shows the number of users attempting each problem, the number of all and correct submissions, the performance of majority and random-forest classifiers for predicting the correctness of a program based on the patterns it contains, and percentages of covered programs by rules: $n_p$ is the percentage of covered incorrect programs with n-rules that contain only presence of patterns, $n$ is the percentage of covered incorrect programs with all n-rules, and $p$ is the coverage of correct programs with p-rules. Classification accuracies were obtained using 10-fold cross validation.

Our primary interest in this paper is finding rules to help manual analysis of student submissions. The accuracy of automatic classification thus plays a secondary role to the interpretability of our model, however it is a good measure to estimate the expressiveness of patterns. We see that AST patterns increase classification accuracy for about 17\% overall. This result indicates that a significant amount of information can be gleaned from simple syntax-oriented analysis. Exceptions are the \textsf{ballistics} and the \textsf{minimax} problems, where there was no visible improvement. In both exercises, the set of used patterns was insufficient to distinguish between incorrect and correct programs. To further improve the quality of patterns, we will analyze misclassified programs in those two exercises and derive new formats of patterns, which will enable better learning. 

The coverages of particular types of rules are also promising. On average, we can explain 43.2\% of incorrect programs with the presence of patterns. Considering all n-rules, we can explain over 70\% of incorrect submissions, however these rules are somewhat harder to understand. Similarly, we can explain 62\% of correct programs with p-rules. The remaining 38\% of solutions tackle the problems in ways that cannot be veiled by out current patterns or there are not enough similar solutions to form a rule. This requires further analysis. For example, in the \textsf{checking\_account} exercise the coverage of p-rules is only 11.5\%. 

% TODO why worse-than-bad results for minimax?