author Martin Možina <martin.mozina@fri.uni-lj.si> 2017-02-03 18:38:27 +0100
committer Martin Možina <martin.mozina@fri.uni-lj.si> 2017-02-03 18:38:27 +0100
commit 90ea92ad4c42a5fa980d55576cc96aa4a8fbf3a8
tree ea4a85901cd02f793d5e71e6b2a23467f41e4e4b /aied2017
parent bf57e6aa17bbcb8deac650c807b5f494a6488ba6
First draft of evaluation section.
Diffstat (limited to 'aied2017')
-rw-r--r-- aied2017/evaluation.tex 82
1 files changed, 81 insertions, 1 deletions
diff --git a/aied2017/evaluation.tex b/aied2017/evaluation.tex
index 22e4253..50e9eb4 100644
--- a/aied2017/evaluation.tex
+++ b/aied2017/evaluation.tex
@@ -1 +1,81 @@
-\section{Evaluation}
\ No newline at end of file
+\section{Evaluation and discussion}
+We evaluated our approach on 44 programming assignments. We preselected 70\% of the students,
+whose submissions were used as learning data for rule learning. The submissions from
+the remaining 30\% of the students were used as testing data to evaluate the classification accuracy of the learned rules
+and to retrospectively evaluate the quality of the generated hints.
+
+\begin{table}[t]
+\centering
+ \begin{tabular}{|l|rrr|rr|rrr|r|}
+ \hline
+ Problem& \multicolumn{3}{c|}{CA} & \multicolumn{2}{c|}{Buggy hints} & \multicolumn{3}{c|}{Intent hints} & No hint \\
+ \hline
+ & Rules & Maj & RF & All & Imp & All & Imp & Alt & \\
+ \hline
+ sister & 0.988 & 0.663 & 0.983 & 128 & 128 & 127 & 84 & 26 & 34 \\
+ del & 0.948 & 0.669 & 0.974 & 136 & 136 & 39 & 25 & 10 & 15 \\
+ sum & 0.945 & 0.510 & 0.956 & 59 & 53 & 24 & 22 & 1 & 6 \\
+ is\_sorted & 0.765 & 0.765 & 0.831 & 119 & 119 & 0 & 0 & 0 & 93 \\
+ union & 0.785 & 0.766 & 0.813 & 106 & 106 & 182 & 66 & 7 & 6 \\
+ ...& & & & & & & & & \\
+ \hline
+ Average & 0.857 & 0.666 & 0.908 & 79.73 & 77.34 & 46.75 & 26.36 & 5.55 & 23.75 \\
+ \hline
+ \end{tabular}
+\caption{Results on five selected problems and averaged results over all 44 problems.
+Columns 2, 3, and 4 contain the classification accuracies (CA) of our rule learning method (Rules), the majority classifier (Maj),
+and random forest (RF), respectively. Columns 5 and 6 report the number of all generated buggy hints (All)
+and the number of hints that were actually implemented by students (Imp). The following three columns
+contain the number of all generated intent hints (All),
+the number of implemented hints (Imp) and the number of implemented alternative hints (Alt).
+The last column contains the number of student submissions for which no hint could be generated.
+The last row averages the results over all 44 problems.}
+\label{table:eval}
+\end{table}
+
+Table~\ref{table:eval} contains results on five selected problems (each problem represents a
+different group of problems) and averaged results over all problems.\footnote{We report only a subset of results due to space
+restrictions. The table with results for all 44 problems can be found at
+\url{www.ailab.si/aied2017}.} The second, third, and fourth columns provide the classification accuracies (CA) of the rule-based classifier,
+the majority classifier, and random forest on testing data. The majority classifier and the random forest method,
+which was the best-performing machine learning algorithm over all problems, serve as approximate lower and upper bounds
+for CA. For example, in the case of the ``sister'' problem,
+our rules correctly classified 99\% of the testing submissions, the majority classifier 66\%, and random
+forest 98\%, slightly worse than the rules on this particular problem. The CA
+is also high for the problems ``del'' and ``sum''; however, it is lower for ``is\_sorted'' and ``union'',
+suggesting that the proposed set of AST patterns is insufficient for certain problems. Indeed, after analysing the problem ``is\_sorted'',
+we observed that our pattern set does not include patterns containing the empty list ``[]'', an important pattern
+for a predicate that tests whether a list is sorted. For this reason, the rule learning
+algorithm failed to learn any C-rules, and therefore all programs were classified as incorrect. Furthermore, many
+solutions of the problem ``union'' use the cut operator (exclamation mark), which
+is, again, ignored by our pattern generation algorithm.
+
+To evaluate the quality of hints, we selected all incorrect submissions from student traces
+that resulted in a correct program (solution). For example, of the 403 submissions in the ``sister'' testing set,
+289 were relevant incorrect submissions.
+
+The columns under ``Buggy hints'' evaluate the hints generated from rules
+for the incorrect class (I-rules). For each generated buggy hint, we checked whether
+it was implemented by the student in the final submission (the solution). The column ``All'' is
+the number of all generated buggy hints and the column ``Imp'' is the number of
+implemented hints. The results suggest that almost all buggy hints are implemented in the final solution.
+
+Intent hints are generated when the algorithm fails to find any buggy hints. The column ``All'' contains the number of
+generated intent hints, ``Imp'' the number of implemented main intent hints, while ``Alt'' is the number
+of implemented alternative hints. Notice that the percentage of implemented intent hints is considerably lower
+than that of buggy hints: in the case
+of the problem ``sister'', 84 out of 127 (66\%) hints were implemented, while in the case of the problem ``union'' only 66 out of 182 (36\%) hints were implemented.
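+
+As a worked check of these percentages, the ratios below are recomputed directly from the ``Imp'' and ``All'' counts
+in Table~\ref{table:eval}; this is plain arithmetic on the reported values, not an additional measurement:
+\[
+  \frac{84}{127} \approx 0.66 \quad (\textrm{sister}), \qquad \frac{66}{182} \approx 0.36 \quad (\textrm{union}).
+\]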
+
+The last column shows the number of submissions for which no hints could be generated. This value is relatively high
+for the ``is\_sorted'' problem, because the algorithm could not learn any C-rules and consequently no intent hints were generated.
+
+To sum up, buggy hints seem to be good and reliable, since they are almost always implemented, even though we tested
+them retrospectively (the students did not see these hints). The percentage of implemented intent hints is,
+on average, lower (56\%; 26.36 of 46.75), which is still not a bad result, given that it is difficult to determine the intent
+of a programmer. Last but not least, high classification accuracies on many problems imply that it is possible
+to determine the correctness of a program by simply verifying the presence of a small number of patterns.
+Our hypothesis is that these patterns represent the parts of the problem solution that the students have
+difficulties with.
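+
+For reference, the average implementation rates follow from the last row of Table~\ref{table:eval}. The ratios
+below are our own arithmetic on the averaged counts (a ratio of averages, which need not equal the average of
+the per-problem ratios):
+\[
+  \frac{77.34}{79.73} \approx 0.97 \quad (\textrm{buggy hints}), \qquad \frac{26.36}{46.75} \approx 0.56 \quad (\textrm{intent hints}).
+\]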
+
+
+