diff options
-rw-r--r-- | aied2018/evaluation.tex | 12 | ||||
-rw-r--r-- | aied2018/introduction.tex | 2 | ||||
-rw-r--r-- | aied2018/patterns.tex | 2 |
3 files changed, 7 insertions, 9 deletions
diff --git a/aied2018/evaluation.tex b/aied2018/evaluation.tex index 205a521..72dca8e 100644 --- a/aied2018/evaluation.tex +++ b/aied2018/evaluation.tex @@ -1,9 +1,9 @@ -\section{Evaluation, discussion, and further work} +\section{Evaluation and discussion} \setlength{\tabcolsep}{4.5pt} -\def\arraystretch{1.1} +\def\arraystretch{1.07} \begin{table}[t] - \caption{Solving statistics, classification accuracy, and coverage of rules for several introductory Python problems. The second column shows the number of users attempting the problem. Columns 3 and 4 show the number of all / correct submissions. The next two columns show the classification accuracy for the majority and random-forest classifiers. The last three rows show percentages of covered examples: columns $n_p$ and $n$ contain covered incorrect programs (n-rules with presence of patterns and all n-rules), column $p$ contains the percentage of covered correct programs by p-rules.} + \caption{Solving statistics, classification accuracy, and coverage of rules for several introductory Python problems. The second column shows the number of users attempting the problem. Columns 3 and 4 show the number of all / correct submissions. The next two columns show the classification accuracy for the majority and random-forest classifiers. The last three columns show percentages of covered examples: columns $n_p$ and $n$ give covered incorrect programs (n-rules with presence of patterns and all n-rules), and column $p$ gives the percentage of correct programs covered by p-rules.} \centering \begin{tabular}{l|c|cc|cc|ccc} & & \multicolumn{2}{c|}{\textbf{Submissions}} & \multicolumn{2}{c|}{\textbf{CA}} & \multicolumn{3}{c}{\textbf{Coverage}} \\ @@ -30,12 +30,10 @@ \label{tab:stats} \end{table} - -Evaluation was performed on a subset of exercises from an introductory Python course. The course was implemented in the online programming environment CodeQ\footnote{Available at \url{https://codeq.si}. Source under AGPL3+ at \url{https://codeq.si/code}.}, which was used to collect correct and incorrect submissions. Table~\ref{tab:stats} shows the number of users attempting each problem, the number of all and correct submissions, the performance of majority and random-forest classifiers for predicting the correctness of a program based on the patterns it contains, and percentages of covered programs by rules: $n_p$ is the percentage of covered incorrect programs with n-rules that contain only presence of patterns, $n$ is the percentage of covered incorrect programs with all n-rules, and $p$ is the coverage of correct programs with p-rules. Classification accuracies were obtained using 10-fold cross validation. +Evaluation was performed on a subset of exercises from an introductory Python course, implemented in the online programming environment CodeQ\footnote{Available at \url{https://codeq.si}. Source under AGPL3+ at \url{https://codeq.si/code}.}. Table~\ref{tab:stats} shows the number of users attempting each problem, the number of all / correct submissions, the performance of majority and random-forest classifiers for predicting program correctness based on patterns, and percentages of covered programs by rules: $n_p$ is the percentage of covered incorrect programs with n-rules that contain only presence of patterns, $n$ is the percentage of covered incorrect programs with all n-rules, and $p$ is the coverage of correct programs with p-rules. Classification accuracies were obtained using 10-fold cross validation. Our primary interest in this paper is finding rules to help manual analysis of student submissions. The accuracy of automatic classification thus plays a secondary role to the interpretability of our model, however it is a good measure to estimate the expressiveness of patterns. We see that AST patterns increase classification accuracy for about 17\% overall. This result indicates that a significant amount of information can be gleaned from simple syntax-oriented analysis. Exceptions are the \textsf{ballistics} and the \textsf{minimax} problems, where there was no visible improvement. In both exercises, the set of used patterns was insufficient to distinguish between incorrect and correct programs. To further improve the quality of patterns, we will analyze misclassified programs in those two exercises and derive new formats of patterns, which will enable better learning. The coverages of particular types of rules are also promising. On average, we can explain 43.2\% of incorrect programs with the presence of patterns. Considering all n-rules, we can explain over 70\% of incorrect submissions, however these rules are somewhat harder to understand. Similarly, we can explain 62\% of correct programs with p-rules. The remaining 38\% of solutions tackle the problems in ways that cannot be veiled by out current patterns or there are not enough similar solutions to form a rule. This requires further analysis. For example, in the \textsf{checking\_account} exercise the coverage of p-rules is only 11.5\%. -% TODO why worse-than-bad results for minimax? - +We have demonstrated how AST patterns can be described with TREs, and how such patterns can be combined to discover important concepts and errors in student programs. Currently, analyzing patterns and rules is quite cumbersome. We plan on developing a tool to allow teachers to easily construct and refine patterns and rules based on example programs. Ideally we would integrate our approach into an existing analysis tool such as OverCode~\cite{glassman2015overcode}. diff --git a/aied2018/introduction.tex b/aied2018/introduction.tex index ab69147..ebe86f5 100644 --- a/aied2018/introduction.tex +++ b/aied2018/introduction.tex @@ -16,4 +16,4 @@ Jin et al. use linkage graphs to represent dependencies between individual progr When comparing programs we must account for superficial differences such as different naming schemes. Rivers et al. canonicalize student programs using equivalency-preserving transformations (renaming variables, reordering binary expressions, inlining functions and so on)~\cite{rivers2015data-driven}. We use their canonicalization as a preprocessing step before extracting patterns. -The next section describes TREs and AST patterns, and gives a brief quantitative evaluation to show that patterns can be used to discriminate between correct and incorrect programs. Section~\ref{sec:rules} describes our modified version of the CN2 rule-learning algorithm, and analyzes student solutions to two programming exercises in terms of such rules. The last section outlines the directions for our future work. +The next section describes TREs and AST patterns, and gives a brief evaluation to show that patterns can discriminate between correct and incorrect programs. Section~\ref{sec:rules} describes our modified version of the CN2 rule-learning algorithm, and analyzes student solutions to two programming exercises in terms of such rules. The last section outlines the directions for our future work. diff --git a/aied2018/patterns.tex b/aied2018/patterns.tex index d50a62e..28a5ed8 100644 --- a/aied2018/patterns.tex +++ b/aied2018/patterns.tex @@ -127,7 +127,7 @@ The first TRE encodes a single path in the AST and describes the program’s blo print(d) \end{Verbatim} -The second pattern describes a common mistake for this problem: \code{range(1,n)} will only generate values up to \code{n-1}, so \code{n} will not be printed as its own divisor. A correct pattern would include the binary operator \code{+} on the path to \code{n}, indicating a call to \code{range(1,n+1)}. +The second pattern shows a common mistake for this problem: \code{range(1,n)} will only generate values up to \code{n-1}, so \code{n} will not be printed as its own divisor. A correct pattern would include the binary operator \code{+} on the path to \code{n}, indicating a call to \code{range(1,n+1)}. \subsection{Constructing patterns} |