From 9a44bdd0aa93e9957e0c917b6ac72c68c9da21c6 Mon Sep 17 00:00:00 2001 From: Martin Mozina Date: Tue, 6 Feb 2018 14:24:56 +0100 Subject: Added evaluation. --- aied2018/aied2018.tex | 4 ++-- aied2018/evaluation.tex | 42 ++++++++++++++++++++++++++++++++++++++++++ aied2018/patterns.tex | 39 --------------------------------------- aied2018/rules.tex | 3 --- 4 files changed, 44 insertions(+), 44 deletions(-) create mode 100644 aied2018/evaluation.tex diff --git a/aied2018/aied2018.tex b/aied2018/aied2018.tex index 17f0e7a..26cfe6e 100644 --- a/aied2018/aied2018.tex +++ b/aied2018/aied2018.tex @@ -46,7 +46,7 @@ % background / problem Writing programs is essential to learning programming. Most programming courses encourage students to practice with lab and homework assignments. By analyzing solutions to these exercises teachers can discover mistakes and concepts students are struggling with, and use that knowledge to improve the content and presentation of the course. Students however tend to submit many different programs even for simple exercises, making such analysis difficult. % solution -We propose using tree regular expressions to encode common patterns in programs. Based on these patterns we induce rules describing common approaches and mistakes for a given assignment. In this paper we describe the rule-learning algorithm and present case studies of rule-based analysis for several introductory Python problems. We show that our rules are easy to interpret, and can be learned from a relatively small set of programs. +We propose using tree regular expressions to encode common patterns in programs. Based on these patterns we induce rules describing common approaches and mistakes for a given assignment. In this paper we describe the rule-learning algorithm and present two case studies of rule-based analysis for introductory Python problems. We show that our rules are easy to interpret, and can be learned from a relatively small set of programs. \\\\ \textbf{Keywords:} Learning programming · Educational data analysis · Error diagnosis · Abstract syntax tree · Tree regular expressions \end{abstract} @@ -55,7 +55,7 @@ We propose using tree regular expressions to encode common patterns in programs. %\input{background} \input{patterns} \input{rules} -%\input{evaluation} +\input{evaluation} %\input{conclusion} \bibliographystyle{splncs} diff --git a/aied2018/evaluation.tex b/aied2018/evaluation.tex new file mode 100644 index 0000000..d74a4f5 --- /dev/null +++ b/aied2018/evaluation.tex @@ -0,0 +1,42 @@ + +\section{Evaluation, discussion, and further work} + +\setlength{\tabcolsep}{6pt} +\def\arraystretch{1.1} +\begin{table}[t] + \caption{Solving statistics, classification accuracy, and coverage of rules for several introductory Python problems. The second column shows the number of users attempting the problem. Columns 3 and 4 show the number of all / correct submissions. The next two columns show the classification accuracy for the majority and random-forest classifiers. The last three rows show percentages of covered examples: columns $n_p$ and $n$ contain covered incorrect programs (n-rules with presence of patterns and all n-rules), column $p$ contains the percentage of covered correct programs by p-rules.} + \centering + \begin{tabular}{l|c|cc|cc|ccc} + & & \multicolumn{2}{c|}{\textbf{Submissions}} & \multicolumn{2}{c|}{\textbf{CA}} & \multicolumn{3}{c}{\textbf{Coverage}} \\ + \textbf{Problem} & \textbf{Users} & Total & Correct & Maj & RF & $n_p$ & $n$ & $p$ \\ + \hline + \textsf{fahrenheit\_to\_cel}& 521 & 1177 & 495 & 0.579 & 0.933 & 0.708 & 0.935 & 0.867 \\ + %\textsf{pythagorean\_theorem}& 349 & 669 & 317 & 0.499 & 0.809 \\ + \textsf{ballistics}& 248 & 873 & 209 & 0.761 & 0.802 & 0.551 & 0.666 & 0.478 \\ + \textsf{average}& 209 & 482 & 186 & 0.614 & 0.830 & 0.230 & 0.338 & 0.618 \\ + \hline + \textsf{buy\_five}& 294 & 476 & 292 & 0.613 & 0.828 & 0.234 & 0.489 & 0.719 \\ + \textsf{competition}& 227 & 327 & 230 & 0.703 & 0.847 & 0.361 & 0.515 & 0.896 \\ + \textsf{top\_shop}& 142 & 476 & 133 & 0.721 & 0.758 & 0.399 & 0.802 & 0.444 \\ + \textsf{minimax}& 74 & 163 & 57 & 0.650 & 0.644 & 0.462 & 0.745 & 0.298 \\ + \textsf{checking\_account}& 132 & 234 & 112 & 0.521 & 0.744 & 0.143 & 0.491 & 0.115\\ + \textsf{consumers\_anon}& 65 & 170 & 53 & 0.688 & 0.800 & 0.376 & 0.880 & 0.623 \\ + \hline + \textsf{greatest}& 70 & 142 & 83 & 0.585 & 0.859 & 0.492 & 0.746 & 0.880\\ + \textsf{greatest\_abs}& 58 & 155 & 57 & 0.632 & 0.845 & 0.612 & 0.878 & 0.789 \\ + \textsf{greatest\_neg}& 62 & 195 & 71 & 0.636 & 0.815 & 0.621 & 0.960 & 0.718 \\ + \hline + Total / average & 2102 & 4811 & 1978 & 0.642 & 0.809 & 0.432 & 0.704 & 0.620 \\ + \end{tabular} + \label{tab:stats} +\end{table} + + +Evaluation was performed on a subset of exercises from an introductory Python course. The course was implemented in the online programming environment CodeQ\footnote{Available at \url{https://codeq.si}. Source under AGPL3+ at \url{https://codeq.si/code}.}, which was used to collect correct and incorrect submissions. Table~\ref{tab:stats} shows the number of users attempting each problem, the number of all and correct submissions, the performance of majority and random-forest classifiers for predicting the correctness of a program based on the patterns it contains, and percentages of covered programs by rules: $n_p$ is the percentage of covered incorrect programs with n-rules that contain only presence of patterns, $n$ is the percentage of covered incorrect programs with all n-rules, and $p$ is the coverage of correct programs with p-rules. Classification accuracies were obtained using 10-fold cross validation. + +Our primary interest in this paper is finding rules to help manual analysis of student submissions. The accuracy of automatic classification thus plays a secondary role to the interpretability of our model, however it is a good measure to estimate the expressiveness of patterns. We see that AST patterns increase classification accuracy for about 17\% overall. This result indicates that a significant amount of information can be gleaned from simple syntax-oriented analysis. Exceptions are the \textsf{ballistics} and the \textsf{minimax} problems, where there was no visible improvement. In both exercises, the set of used patterns was insufficient to distinguish between incorrect and correct programs. To further improve the quality of patterns, we will analyze misclassified programs in those two exercises and derive new formats of patterns, which will enable better learning. + +The coverages of particular types of rules are also promising. On average, we can explain 43.2\% of incorrect programs with the presence of patterns. Considering all n-rules, we can explain over 70\% of incorrect submissions, however these rules are somewhat harder to understand. Similarly, we can explain 62\% of correct programs with p-rules. The remaining 38\% of solutions tackle the problems in ways that cannot be veiled by out current patterns or there are not enough similar solutions to form a rule. This requires further analysis. For example, in the \textsf{checking\_account} exercise the coverage of p-rules is only 11.5\%. + +% TODO why worse-than-bad results for minimax? + diff --git a/aied2018/patterns.tex b/aied2018/patterns.tex index 0b121f7..f6ef46e 100644 --- a/aied2018/patterns.tex +++ b/aied2018/patterns.tex @@ -152,42 +152,3 @@ For each expression (such as \code{(F-32)*5/9}) we select the different combinat We found that patterns constructed from such nodesets are useful for discriminating between programs. As we show in Sect.~\ref{sec:interpreting-rules}, they are also easily interpreted in terms of bugs and strategies for a given problem. Note that in any pattern constructed in this way, all \textsf{Var} nodes refer to the same variable, simplifying interpretation. -\subsection{Evaluating patterns} - -Our primary interest in this paper is finding rules to help manual analysis of student submissions. The accuracy of automatic classification thus plays a secondary role to the interpretability of our model. Here we give only a rudimentary evaluation of patterns as attributes for distinguishing between correct and incorrect programs. - -Evaluation was performed on a subset of exercises from an introductory Python course. The course was implemented in the online programming environment CodeQ\footnote{Available at \url{https://codeq.si}. Source under AGPL3+ at \url{https://codeq.si/code}.}, which was used to collect correct and incorrect submissions. Table~\ref{tab:stats} shows the number of users attempting each problem, the number of all and correct submissions, and the performance of majority and random-forest classifiers for predicting the correctness of a program based on the patterns it contains. We tested these classifiers using 10-fold cross validation. - -\setlength{\tabcolsep}{6pt} -\def\arraystretch{1.1} -\begin{table}[htb] -\caption{Solving statistics and classification accuracy for several introductory Python problems. The second column shows the number of users attempting the problem. Columns 3 and 4 show the number of all / correct submissions. The last two columns show the classification accuracy for the majority and random-forest classifiers.} -\centering - \begin{tabular}{r|c|cc|cc} - & & \multicolumn{2}{c|}{\textbf{Submissions}} & \multicolumn{2}{c}{\textbf{CA}} \\ - \textbf{Problem} & \textbf{Users} & Total & Correct & Majority & RF \\ - \hline -\textsf{fahrenheit\_to\_celsius}& 521 & 1177 & 495 & 0.579 & 0.933 \\ -%\textsf{pythagorean\_theorem}& 349 & 669 & 317 & 0.499 & 0.809 \\ -\textsf{ballistics}& 248 & 873 & 209 & 0.761 & 0.802 \\ -\textsf{average}& 209 & 482 & 186 & 0.614 & 0.830 \\ -\hline -\textsf{buy\_five}& 294 & 476 & 292 & 0.613 & 0.828 \\ -\textsf{competition}& 227 & 327 & 230 & 0.703 & 0.847 \\ -\textsf{top\_shop}& 142 & 476 & 133 & 0.721 & 0.758 \\ -\textsf{minimax}& 74 & 163 & 57 & 0.650 & 0.644 \\ -\textsf{checking\_account}& 132 & 234 & 112 & 0.521 & 0.744 \\ -\textsf{consumers\_anonymous}& 65 & 170 & 53 & 0.688 & 0.800 \\ -\hline -\textsf{greatest}& 70 & 142 & 83 & 0.585 & 0.859 \\ -\textsf{greatest\_absolutist}& 58 & 155 & 57 & 0.632 & 0.845 \\ -\textsf{greatest\_negative}& 62 & 195 & 71 & 0.636 & 0.815 \\ -\hline -Total / average & 2102 & 4811 & 1978 & 0.642 & 0.809 \\ - \end{tabular} -\label{tab:stats} -\end{table} - -We see that AST patterns increase classification accuracy for about 17\% overall. This result indicates that a significant amount of information can be gleaned from simple syntax-oriented analysis. One exception is the \textsf{ballistics} problem, where the improvement was quite small. This is an artifact of our testing framework: the program is expected to read two values from standard input in a certain order, but none of the used patterns could encode this information. - -% TODO why worse-than-bad results for minimax? diff --git a/aied2018/rules.tex b/aied2018/rules.tex index 855b875..3da87ea 100644 --- a/aied2018/rules.tex +++ b/aied2018/rules.tex @@ -197,6 +197,3 @@ To explain this rule we first have to describe how students test their programs. On the other hand, given that the rule covers only nine programs, the probability that the rule is a statistical artifact is not negligible. - -\section{Evaluation and discussion} - -- cgit v1.2.1