aied2018/evaluation.tex


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39

\section{Evaluation and discussion}

\setlength{\tabcolsep}{4.5pt}
\def\arraystretch{1.07}
\begin{table}[t]
	\caption{Solving statistics, classification accuracy, and coverage of rules for several introductory Python problems. The second column shows the number of users attempting the problem. Columns 3 and 4 show the number of all / correct submissions. The next two columns show the classification accuracy for the majority and random-forest classifiers. The last three columns show percentages of covered examples: columns $n_p$ and $n$ give covered incorrect programs (n-rules with presence of patterns and all n-rules), and column $p$ gives the percentage of correct programs covered by p-rules.}
	\centering
	\begin{tabular}{l|c|cc|cc|ccc}
		& & \multicolumn{2}{c|}{\textbf{Submissions}} & \multicolumn{2}{c|}{\textbf{CA}} & \multicolumn{3}{c}{\textbf{Coverage}} \\
		\textbf{Problem} & \textbf{Users} & Total & Correct & Maj & RF & $n_p$ & $n$ & $p$ \\
		\hline
		\textsf{fahrenheit\_to\_celsius}& 521 & 1177 & 495 & 0.579 & 0.933 & 0.708 & 0.935 & 0.867 \\
		%\textsf{pythagorean\_theorem}& 349 & 669 & 317 & 0.499 & 0.809 \\
		\textsf{ballistics}& 248 & 873 & 209 & 0.761 & 0.802 & 0.551 & 0.666 & 0.478 \\
		\textsf{average}& 209 & 482 & 186 & 0.614 & 0.830 & 0.230 & 0.338 & 0.618 \\
		\hline
		\textsf{buy\_five}& 294 & 476 & 292 & 0.613 & 0.828 & 0.234 & 0.489 & 0.719 \\
		\textsf{competition}& 227 & 327 & 230 & 0.703 & 0.847 & 0.361 & 0.515 & 0.896 \\
		\textsf{top\_shop}& 142 & 476 & 133 & 0.721 & 0.758 & 0.399 & 0.802 & 0.444 \\
		\textsf{minimax}& 74 & 163 & 57 & 0.650 & 0.644 & 0.462 & 0.745 & 0.298 \\
		\textsf{checking\_account}& 132 & 234 & 112 & 0.521 & 0.744 & 0.143 & 0.491 & 0.115\\
		\textsf{consumers\_anon}& 65 & 170 & 53 & 0.688 & 0.800 & 0.376 & 0.880 & 0.623 \\
		\hline
		\textsf{greatest}& 70 & 142 & 83 & 0.585 & 0.859 & 0.492 & 0.746 & 0.880\\
		\textsf{greatest\_abs}& 58 & 155 & 57 & 0.632 & 0.845 & 0.612 & 0.878 & 0.789 \\
		\textsf{greatest\_neg}& 62 & 195 & 71 & 0.636 & 0.815 & 0.621 & 0.960 & 0.718 \\
		\hline
		Total / average & 2102 & 4811 & 1978 & 0.642 & 0.809 & 0.432 & 0.704 & 0.620 \\
	\end{tabular}
	\label{tab:stats}
\end{table}

Evaluation was performed on a subset of exercises from an introductory Python course, implemented in the online programming environment CodeQ\footnote{Available at \url{https://codeq.si}. Source under AGPL3+ at \url{https://codeq.si/code}.}. Table~\ref{tab:stats} shows the number of users attempting each problem, the number of all / correct submissions, the performance of majority and random-forest classifiers for predicting program correctness based on patterns, and percentages of covered programs by rules: $n_p$ is the percentage of covered incorrect programs with n-rules that contain only presence of patterns, $n$ is the percentage of covered incorrect programs with all n-rules, and $p$ is the coverage of correct programs with p-rules. Classification accuracies were obtained using 10-fold cross validation.

Our primary interest in this paper is finding rules to help manual analysis of student submissions. The accuracy of automatic classification thus plays a secondary role to the interpretability of our model, however it is a good measure to estimate the expressiveness of patterns. We see that AST patterns increase classification accuracy for about 17\% overall. This result indicates that a significant amount of information can be gleaned from simple syntax-oriented analysis. Exceptions are the \textsf{ballistics} and the \textsf{minimax} problems, where there was no visible improvement. In both exercises, the set of used patterns was insufficient to distinguish between incorrect and correct programs. To further improve the quality of patterns, we will analyze misclassified programs in those two exercises and derive new formats of patterns, which will enable better learning. 

The coverages of particular types of rules are also promising. On average, we can explain 43.2\% of incorrect programs with the presence of patterns. Considering all n-rules, we can explain over 70\% of incorrect submissions, however these rules are somewhat harder to understand. Similarly, we can explain 62\% of correct programs with p-rules. The remaining 38\% of solutions tackle the problems in ways that cannot be veiled by out current patterns or there are not enough similar solutions to form a rule. This requires further analysis. For example, in the \textsf{checking\_account} exercise the coverage of p-rules is only 11.5\%. 

We have demonstrated how AST patterns can be described with TREs, and how such patterns can be combined to discover important concepts and errors in student programs. Currently, analyzing patterns and rules is quite cumbersome. We plan on developing a tool to allow teachers to easily construct and refine patterns and rules based on example programs. Ideally we would integrate our approach into an existing analysis tool such as OverCode~\cite{glassman2015overcode}.