
FACULTY OF SCIENCE

UTILIZATION OF LEARNING ANALYTICS TO OBTAIN PEDAGOGICALLY MEANINGFUL INFORMATION FROM DATA AND PRESENTING THE INFORMATION TO STUDENTS

Author: Gerben van der Huizen (10460748)
Supervisor: Bert Bredeweg
June 26, 2015
BSc Kunstmatige Intelligentie (Artificial Intelligence), graduation project BSc KI
Science Park 904, 1098 XH Amsterdam, The Netherlands


Contents

1 Abstract
2 Introduction
3 Related work
4 Research methods
   4.1 Resources
   4.2 Data set 1
   4.3 Data processing (Data set 1)
      4.3.1 Description of the activity and grade data set
      4.3.2 Extracted features from Data set 1
   4.4 Data set 2
   4.5 Data processing (Data set 2)
      4.5.1 Description of the exported Data set 2
   4.6 Prediction model
      4.6.1 Clustering
      4.6.2 Feature selection
      4.6.3 Classification
   4.7 Feature analysis
   4.8 Visualization
5 Results (Data set 1)
   5.1 Clustering results
   5.2 Feature selection results
   5.3 Classification results
   5.4 Feature analysis results
6 Results (Data set 2)
   6.1 Prediction model results
   6.2 Feature analysis results
7 Visualization
   7.1 Assign visualizations based on predictions
   7.2 Alternative approaches to data visualization
   7.3 GUI/Dashboard
8 Conclusion
9 Discussion and future work


1 Abstract

Finding ways to extract pedagogically meaningful information from student data is a common issue in the fields of learning analytics and educational data mining. One way to use this information is to present the data in such a way that students can use it to improve their learning strategies. In this paper, different methods for extracting information from student data and methods for presenting this information to students are discussed. A model for predicting student performance and feature analysis methods were used to extract information from the data, and different ideas based on previous research were investigated for presenting data visualizations to students. The methods were tested on two data sets: one containing student data from carrying out assignments in Coach, the other consisting of Blackboard data of assignment scores and clicks on Blackboard. Firstly, from testing the prediction model on both data sets it was concluded that only the Blackboard data showed potential for making accurate predictions of student performance. Secondly, feature analysis resulted in finding correlations between features from which interesting information could be derived. Finally, the findings of this study showed that the investigated ideas for visualizations should first be evaluated by students before they can be implemented in a dashboard application for presenting visual representations of student data.

2 Introduction

It is a challenge for universities and other institutions to determine how to understand and make use of educational data generated by students (Siemens, 2013). Both learning analytics (LA) and educational data mining (EDM) investigate how to use the large amount of data students supply to their institution. LA can be described as a set of methods for collecting and analyzing data to find ways to improve the learning environment of the students from which the data was collected, whereas EDM is generally regarded as applying data mining techniques to large sets of student data to detect patterns which could otherwise have been missed (Papamitsiou and Economides, 2014). One of the issues related to LA and EDM is discovering ways of analysing student data to gain pedagogically meaningful information regarding student behaviour and learning strategies (Papamitsiou and Economides, 2014).

In this paper we use data analysis methods to find out how meaningful information can be extracted from student data and reported effectively to students. The research involves gaining insight into which kinds of features or attributes can be extracted from the data and, with data analysis, discovering which of these features are important for improving student learning strategies. Students will also be categorized by a clustering algorithm to investigate whether this technique can help with the identification of hidden patterns and relationships in the student data. Furthermore, a prediction model based on student performance categories assigned via clustering will be created using the extracted features, allowing the predictive ability of both the utilized classification algorithms and the extracted features to be investigated. Finally, the research will also involve a review of how student data can be presented in unique and insightful ways with visualizations. Since these ideas for visualizations cannot be tested on students and their data, the results of presenting useful visualizations to students will be based on experience with the data and on previous research performed on visualizations. Through the LA and EDM methods mentioned above, and with the available student data, it will be possible to design an LA application that creates the possibility for improving the learning strategies of students. This platform can then later be evaluated by students or simply serve as a guideline for designing learning environment systems with student data.

The project report is organised as follows: Firstly, the related work is reviewed. This section contains short summaries of papers and elaborates on why these papers are relevant to this project. Then, the resources and research methods are discussed. The resources section describes which data, programming languages and analysis tools were used in this project; the methods section describes how these resources were used to perform experiments. Next, the results obtained from the experiments described in the research methods are presented. In the conclusion the general results are summarized, as well as possible extensions of this research.

3 Related work

The review by Papamitsiou and Economides (2014) provides a systematic comparison of recent LA and EDM research. The paper classifies the research by analysis method, learning setting and research objective. The goal is to capture the strengths and weaknesses of LA research and to help identify which problems should be addressed in future research. One of the issues discussed in the review, and directly related to this project, is finding ways to extract information from student data. The paper does not explain this issue in much detail, but provides references to research papers which do provide more information on the issue.

Márquez-Vera et al. (2013) is an example of a paper which provides insight into how to (pre-)process an LA data set. The paper describes how to perform data cleaning, how to use WEKA to test feature selection algorithms for reducing dimensionality and how to use the SMOTE algorithm to balance imbalanced data (when the number of instances of certain classes is small compared to others). The pre-processed data is then used to predict student failure by testing a number of supervised learning algorithms for classification. The clustering of students on their performance with the k-means algorithm is discussed in Shovon et al. (2012). A small set of training data from students is used to divide the students into four performance-based classes. This allows a teacher, based on the results of the clustering algorithm, to decide in which category each student belongs and which kind of learning approach is suitable for the chosen category. In our research, classification with the SMOTE algorithm, feature selection and k-means clustering were applied to student data as parts of a prediction model. The different techniques from the papers on classification (Márquez-Vera et al., 2013) and clustering (Shovon et al., 2012) were implemented in the prediction model, which enabled it to classify students on their performance category. The predictions made with the model provide information on the predictive ability of the data and its features.

Klerkx et al. (2014) shows how visualization techniques are becoming increasingly important in the field of LA. The main goal of the paper is to describe which of the existing visualization techniques can be used to enhance the learning environments of students. Santos et al. (2013) demonstrates experiments with an LA application called StepUp!, which students can use to view their own learning activity through different visualizations of data. The paper describes different brainstorming sessions where students helped to identify the issues they had with the application, which assisted the researchers with adding new functionality to StepUp!. An example of an LA dashboard for document recommendations and visualizing student data is given in Govaerts et al. (2010). The dashboard can be used by both teachers and students to improve self-monitoring for learners, awareness for teachers and students, time tracking and the creation of learning resource recommendations (documents containing information on a certain subject). The dashboard was evaluated in two different testing sessions with students. At the end of each testing session, user satisfaction was analyzed by asking students to give their opinion in a survey. It was concluded that students found the dashboard to be useful, but it was difficult to determine whether the dashboard actually improved the learning of students. The two papers about visualization dashboards provided ideas for designing such a dashboard and emphasized the necessity of including student evaluations to test the effectiveness of the visualizations and dashboards. To develop ideas for presenting data visualizations to students, the methods referenced in Klerkx et al. (2014) were utilized to provide insight into which of these methods have already been implemented in recent years. The results from the research papers informed the choices made about visualization methods in section 7.

4 Research methods

The research methods section is organised as follows: To begin with, the tools for programming and performing experiments are described in the resources section. Next, the sources from which the data was acquired are described in sections 4.2 and 4.4, including the problems encountered during the process of acquiring the data. Furthermore, the methods for processing the available data sets by extracting useful features and cleaning up the data are explained in sections 4.3 and 4.5. The prediction model and its components for determining the predictive ability of student data are discussed in section 4.6, followed by a description of the methods used for finding interesting correlations between features. The last section of the research methods discusses how the different ideas for visualizations were obtained.

4.1 Resources

The Python programming language (version 2.7), which contains support for reading data files, was used for applying clustering to data with Scikit-learn, creating visualizations and application development. The WEKA machine learning environment was used to test the performance of machine learning algorithms that were unavailable in Scikit-learn (supervised learning and feature selection algorithms), but also to evaluate the results from applying clustering. WEKA was also utilized to perform data analysis and find correlations in the data by making use of its visualizations, correlation matrices and feature selection algorithms. Matplotlib and Pandas were used to create visualizations in Python, because they allow for a large amount of customization, e.g. different kinds of graphs, custom colors and text positioning. Finally, the PySide/PyQt libraries were used to create a dashboard as a demo for showing visualizations and other features which could be included in such an application.

4.2 Data set 1

Data set 1 consists of student data from an application called Coach. Students use Coach to carry out assignments online that are mandatory for passing a certain course, e.g. mathematics or biology courses. The received data belonged to a course from the psychobiology curriculum which required students to carry out basic mathematics assignments in Coach. Performing experiments on Data set 1 first presented the opportunity to become familiar with working on student data sets and to test how well certain algorithms perform on different sets of student data.

The courses followed in Coach had certain rules and a structure that was presented to the students while working with the application. First of all, students could log in at home or at the university to practice using questions relevant to the part of the course they were working on. Students could attempt to score on these questions or simply accept the lowest score in exchange for all the answers. The questions provided different examples of the same terms, which gave students the opportunity to get tips and direct feedback on their answers. The assessments required students to carry out exercises similar to those presented in the questions, but these assessments were obligatory for passing the course. The assessments consisted of randomized exercises on the same topic. Once an exercise had been answered correctly the student would not have to make another attempt at that specific exercise. Students could make as many attempts at an exercise as they wanted, which means that their assessment score could only go up. Students were not obligated to first practice and then carry out assessments. Students were, however, required to pass each of the assessments with a minimum score of 80% before the deadlines.

Data set 1 consists of two separate sets: one containing the grades that students achieved and the other containing about 80,000 entries of student activity with features such as type of activity, student ID and timestamps (the full list of features is described in the data processing section 4.3).

4.3 Data processing (Data set 1)

Data set 1 was received as two Pickle files which both contain a Python list of dictionaries. In the activity data set the dictionary items each represent a statement that was recorded by Coach, and in the second data set the dictionaries map from an anonymous student identification number to the course grades. The features from the Pickle files are described below, followed by the section explaining the data processing that was applied.

4.3.1 Description of the activity and grade data set

The grade data set contains the grades and IDs of the 303 students that participated in the psychobiology course for which assignments were carried out in Coach. All the features from the grade data set are listed in table 1.


Table 1: A table containing all the features from the grade data set.

Table 2: A table containing all the features from the activity data set.

The activity data set contains the recorded activity of each student for every exercise or assessment that was carried out in Coach by the student during the psychobiology course. Assessments only contain a launched and a completed entry, which indicate when an assessment was launched and when it was completed. Questions contain a launched entry and can contain multiple completed entries, which indicate when certain exercises within a question were completed. The data also contains a small number of media item entries (these could be videos or interactive images), but they do not contain a score. The student IDs were made anonymous by using a randomly generated identification number instead of the students' email addresses. By cross-matching the student IDs in both data sets it was possible to create a data set containing new features for every student. Each of these features has to be calculated from the 80,000 entries, which can result in long waiting times depending on the complexity of the calculations. Data processing of Data set 1 was implemented in Python, which included reading the data sets, the calculations and the creation of a new data set. All the features from the activity data set are listed in table 2.
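As a rough illustration of this step, the sketch below shows how the two Pickle files could be loaded and cross-matched on student ID. The file names and the 'student_id' key are assumptions for illustration; the actual key names in the data set are not listed in this report.

```python
# Hypothetical sketch: load both Pickle files and cross-match on student ID.
# File names and dictionary keys are assumed, not taken from the actual data.
import pickle

with open("activity.pkl", "rb") as f:
    activity = pickle.load(f)       # list of activity statement dicts
with open("grades.pkl", "rb") as f:
    grade_dicts = pickle.load(f)    # list of dicts mapping student ID to grades

# Flatten the grade dictionaries into one ID -> grades lookup table.
grades = {}
for d in grade_dicts:
    grades.update(d)

# Group the ~80,000 activity entries per student, so per-student features
# (activity counts, average scores, time spent) can be computed in one pass.
per_student = {}
for entry in activity:
    per_student.setdefault(entry["student_id"], []).append(entry)

# Cross-match: keep only students that appear in both data sets.
matched = {sid: rows for sid, rows in per_student.items() if sid in grades}
```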

4.3.2 Extracted features from Data set 1

In total the data set consists of 278 features which can be used for data analysis and for creating a prediction model. These features can be generated either from the entire data set, covering one month of data, or for a selected time period (a week, two weeks or a month), which results in the same features calculated over that period. The generated data sets were saved in CSV format, which allows for efficient read and write operations. The extracted features included in the created data set are listed in table 3.
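A minimal sketch of this per-period feature generation is given below, assuming the activity entries have been loaded into a pandas DataFrame; the 'student_id' and 'timestamp' columns, the dates and the toy rows are assumptions for illustration, and only the activity count (amountOfLogs) is computed here.

```python
# Hypothetical sketch of generating one count-style feature for a time window.
import pandas as pd

# Toy stand-in for the real activity data set (~80,000 rows).
activity_df = pd.DataFrame({
    "student_id": [1, 1, 2],
    "timestamp": ["2015-01-05", "2015-01-06", "2015-01-20"],
})

def features_for_period(activity, start, end):
    """Compute per-student features for activity entries inside [start, end)."""
    window = activity[(activity["timestamp"] >= start) &
                      (activity["timestamp"] < end)]
    # amountOfLogs: the number of recorded activity entries per student.
    return window.groupby("student_id").size().rename("amountOfLogs")

# e.g. only the first five days of the course, written to CSV for later use.
first_days = features_for_period(activity_df, "2015-01-05", "2015-01-10")
first_days.to_csv("features_5days.csv")
```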

While extracting features from Data set 1 some problems were encountered. One problem is that a student can enter assignments and obtain all the answers without making a legitimate attempt at a decent score. This makes it difficult to filter the fake attempts from the real attempts when trying to calculate the average score of students. Since Coach can register an attempt where students receive the answers directly from the application, it would be ideal if this were somehow reflected in the data set. Another problem is that some data seemed to be missing, as some students had grades but no available activity data. Furthermore, activity entries from the same assignment cannot be linked to each other with any sort of ID, because the ID supplied in the data is randomized for every entry. This randomized activity ID makes it difficult to determine how much time a student spent on an assignment.


Table 3: A table containing all the extracted features from Data set 1 which were used in experiments with the prediction model, feature analysis and visualization.

4.4 Data set 2

The Blackboard data (Data set 2) became available at a later stage of the project. Data set 2 was supposed to be extracted directly from the UvA database by using SQL commands. An effort was made to team up with Blackboard ICTS (ICT Services) of the UvA to get Data set 2, but due to several problems encountered during the process of extracting data it was not possible to get a complete data set (2013-2015) of the three requested courses on time. The Blackboard database uses the Oracle SQL language to manage and update all of its data, so to extract this data specific SQL queries had to be created and tested. The Blackboard ICTS section utilizes three separate databases: production (the version Blackboard uses), development (runs the newest version of the database with test data) and acceptance (has less data functionality and is used for testing queries). A query would first be tested in acceptance to clean any bugs or errors from it. To retrieve the desired data the tested query had to be run on the production set, but the query could still end up retrieving the wrong data because of internal differences between acceptance and production.

Progress on extracting data was best when direct interaction with Blackboard ICTS was possible, but this could not be achieved until later in the project. One of the main problems while extracting the data was that the names of data columns and rows in the database (back-end) did not correspond with the names visible on Blackboard itself (front-end). The people at ICTS called this problem 'bootstrapping', and it made it difficult to retrieve or find entries in the SQL database given what is shown on Blackboard. There were also some smaller problems which caused delay, such as unexpected names for data features and the fact that retrieving data from the production set took 20 minutes to complete. Near the end of the project a bug was found which indicated that Blackboard assigned the wrong timestamps to clicks, making the timestamps unreliable when a sequence of clicks has to be determined. These problems caused the retrieval of Data set 2 to be delayed.

Even with the aid of Blackboard ICTS, who work directly with the Blackboard database, retrieving the required data took nearly four weeks. In that period of time another method for extracting the data was found. This method used export tools from Blackboard to extract data directly from the database to create a statistical overview of student data. The new method was time consuming and did not provide data from previous years, but allowed for some experimentation with actual Blackboard data while the work on retrieving the entire data set continued. Data set 2 consists of the recorded clicks which a student performed on Blackboard during a course, and results (grades and scores) from assignments that he/she was required to turn in. Since students could make multiple attempts to pass an assignment, data containing information on those attempts was also available. The final grades that the students achieved for a course were also made available (the full list of features is described in the data processing section 4.5).

4.5 Data processing (Data set 2)

As mentioned previously, an alternative method (Blackboard export tools) was used to extract data from Blackboard and manually copy the results of the extraction into an Excel sheet. This data only contained pre-processed information of approximately 50 students from the three available courses, with no separate recordings of individual clicks or activity, which could normally be obtained directly from Data set 2. The three courses had a similar structure for grading, but each had different assignments, which caused the data to be split into three smaller sections for analysis. It should also be noted that the same students could follow multiple courses.


Table 4: A table containing all the extracted features from Data set 2 which were used in experiments with the prediction model and feature analysis.

4.5.1 Description of the exported Data set 2

The exported data from Data set 2 contained information on 50 students from three different courses which took place during two months: Course 1 (26 students), Course 2 (17 students) and Course 3 (11 students). In total 21 features were extracted, but this number can vary based on the number of assignments a course contains. The features extracted from Data set 2 are listed in table 4.

During the process of manually extracting the data from the exported files some problems were encountered. One of these problems is that data loss through human error is likely when manually extracting the data from the export files. Another problem is that the pre-processed calculations which the Blackboard tool uses to create its data reports cannot be reviewed, so it is not possible to determine whether the generated data is reliable or correct. Generating the data and manually using Excel commands to retrieve the useful data is also a time-consuming process.

4.6 Prediction model

The prediction model applies machine learning techniques to the student data for three different purposes: to discover in which way student categories can be created, to find the features which contribute most to predicting student performance, and to demonstrate the performance of supervised learning techniques on the UvA student data. The prediction model can be created for different time periods of student data, to analyse whether the model can still produce the same accuracy with a few days of student data instead of the entire available month. If the model performs well on smaller time periods it could potentially be used in an LA application for predicting student performance at any point during a course. The data from the first five days of the course (just before the entry exam was held) and the data from the entire month were used in the experiments.

The prediction model consists of clustering, feature selection and classification. Clustering with k-means is discussed in section 4.6.1, a description of the feature selection process can be found in section 4.6.2 and the classification process is explained in section 4.6.3.

4.6.1 Clustering

The k-means algorithm (with the Euclidean distance metric) was used to categorize students into different groups based on their data from one of the previously described data sets. Student categories or clusters can be based on different features, e.g. amount of activity, number of hints used, achieved grades and scores. In this project the students were divided into clusters based on their overall performance on both the entry exam and the final exam. The experiments were performed by splitting the students into two categories (low and high performance) and three categories (low, average and high performance). More student categories can also be acquired by using a higher variety of features when clustering; for example, students could also be categorized based on their performance and their amount of activity. The WEKA learning environment was used to preview how k-means would perform and to investigate whether these groups give any information about the data when clustering was applied with more features. The Scikit-learn module of Python was used to recreate the results from WEKA and add the labels for each category to the student data. Students with fake (for testing) or missing data were removed from the data set to prevent them from adding noise. Once the clustering process is finished and a data set with labeled students is acquired, the feature selection process can be applied.
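A minimal sketch of this labeling step with Scikit-learn is shown below; the CSV file name and the grade column names are assumptions for illustration, and the fixed random_state matches the setup described in section 5.1.

```python
# Hypothetical sketch: cluster students into performance categories with
# k-means (Euclidean distance) and write a labeled data set for WEKA.
import pandas as pd
from sklearn.cluster import KMeans

data = pd.read_csv("features.csv")                   # assumed file name
X = data[["entry_exam_grade", "final_exam_grade"]]   # assumed column names

# random_state=0 fixes the initial cluster centers, so reruns give the
# same clusters (see section 5.1).
kmeans = KMeans(n_clusters=3, random_state=0)        # three categories
data["category"] = kmeans.fit_predict(X)

data.to_csv("labeled_students.csv", index=False)
```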

4.6.2 Feature selection

Both WEKA and Scikit-learn provide support for a number of different feature selection algorithms. WEKA provides more options for narrowing down the specific features which contribute most to improving the classification process (ranking), so WEKA was utilized for performing feature selection on the data. The results of WEKA's CfsSubsetEval feature selection algorithm were used to determine which features would be selected for classification. The CfsSubsetEval option in WEKA evaluates the predictive ability and degree of redundancy of a subset of features, from which the algorithm can determine which subsets of features have the highest correlation with the class (categories). Using CfsSubsetEval it is possible to find the best available subset of features for predicting the performance category of a student. InfoGainAttributeEval was used to evaluate and rank the worth of individual features by measuring the information gain with respect to the class/category. Information gain is the reduction in uncertainty about the value of the class when the value of a feature is known. Feature selection was performed on both two and three performance categories of students, because a change in the number of categories (two or three) can also change the correlation of the subset with the classes (the information gain changes as well). Data set 1 contains two exam grades on which the categories are based, so feature selection was tested for both these grades as well. Once the ideal subset of features was found for both two and three categories, the performance of different classification algorithms with these subsets could be tested.
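The ranking itself was done in WEKA; as an aside, an analogous ranking can be sketched in Python with mutual information (closely related to information gain), using mutual_info_classif from newer Scikit-learn versions. The labeled CSV file and the 'category' column are assumptions carried over from the clustering sketch above.

```python
# Hypothetical sketch: rank features by mutual information with the class,
# analogous to WEKA's InfoGainAttributeEval (Ranker).
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

data = pd.read_csv("labeled_students.csv")
X = data.drop(columns=["category"])
y = data["category"]

scores = mutual_info_classif(X, y, random_state=0)
ranking = sorted(zip(X.columns, scores), key=lambda p: p[1], reverse=True)
for name, score in ranking[:5]:        # top five, as in table 8
    print("%-40s %.3f" % (name, score))
```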

4.6.3 Classification

Classification was performed with different supervised learning algorithms on both two and three performance categories of students, to analyse which of the algorithms performed best with the selected subset of features. WEKA includes a wide range of classification algorithms, so a choice had to be made on which algorithms would be used. Based on research from Ramaswami (2014), which includes an analysis of the predictive performance of classification algorithms for educational data, neural networks and decision trees (MultilayerPerceptron and J48 in WEKA) were chosen for the experiments. Other algorithms that were included for classification are multiclass logistic regression (Logistic in WEKA) and Naive Bayes (NaiveBayes in WEKA). 10-fold cross-validation was applied to make sure the results of each classification report (precision and recall calculations) would also apply to independent data. As mentioned previously, the classification algorithms were tested on both two and three classes with the appropriate subset of features. If the number of instances of a certain class is low compared to other classes, the SMOTE algorithm is applied to balance the number of class instances before applying feature selection (SMOTE: the minority class is oversampled by creating synthetic examples). The accuracy of the different algorithms was not only tested for the official period of time allocated for the course, but also with only 5 days of student data, to test the performance of the algorithms with less data (data for different time periods was only available in Data set 1).
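The experiments were run in WEKA; purely as an illustration of the procedure, the sketch below shows the same combination of oversampling, a decision tree and 10-fold cross-validation in Python, using the imbalanced-learn library (an assumption: this library was not part of the thesis toolchain). Placing SMOTE inside the pipeline ensures the synthetic examples are generated only from the training folds.

```python
# Hypothetical sketch: SMOTE oversampling + decision tree (counterpart of
# WEKA's J48) evaluated with 10-fold cross-validation.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

data = pd.read_csv("labeled_students.csv")     # assumed labeled data set
X = data.drop(columns=["category"])
y = data["category"]

model = Pipeline([("smote", SMOTE(random_state=0)),
                  ("tree", DecisionTreeClassifier(random_state=0))])
accuracies = cross_val_score(model, X, y, cv=10)
print("mean accuracy: %.2f" % accuracies.mean())
```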


4.7 Feature analysis

Feature analysis was carried out to find correlations in the data which can help identify student learning strategies, or to identify features which have almost no relation to other features. WEKA allows for feature analysis with its visualization section, in which all the features of a supplied data set can be plotted against each other in 2D graphs. Applying principal component analysis (PCA) to the labeled data set also produces an unsupervised correlation/covariance matrix, which can be used to detect correlation between features (instead of between features and a class). The goal of the data analysis part of this project is to find interesting correlations using the WEKA visualization and correlation matrix functionality.
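Since Pandas was already part of the toolchain (section 4.1), an equivalent unsupervised correlation matrix could also be computed directly, as sketched below with an assumed feature CSV.

```python
# Hypothetical sketch: pairwise Pearson correlations between all features.
import pandas as pd

data = pd.read_csv("features.csv")     # assumed file name
corr = data.corr()

# e.g. list feature pairs with a noticeable correlation (|r| >= 0.5),
# skipping the trivial diagonal entries and duplicate pairs.
for a in corr.columns:
    for b in corr.columns:
        if a < b and abs(corr.loc[a, b]) >= 0.5:
            print("%s ~ %s: %.2f" % (a, b, corr.loc[a, b]))
```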

4.8 Visualization

Since it was not possible to test visualizations on students with their own data, only ideas for visualization were conceived, based on the results of this project and previous research on visualizations. The visualizations should enable or facilitate the process for a student of evaluating whether his/her performance has improved. Visualization can help improve the efficiency of certain tasks by showing what kind of progress was achieved and comparing it with the progress of others. Furthermore, a dashboard was designed and created based on previous research and the available data. This dashboard provides a demonstration of how such an application could look and of the functionality it would need to provide.


5 Results (Data set 1)

5.1 Clustering results

The k-means clustering algorithm was used to split the students into different categories based on their performance on either the entry exam or the final exam. The Scikit-learn Python implementation of k-means was used to divide the students into categories and create a labeled data set file in CSV format. Scikit-learn's k-means uses a random state to initialize the centers of the clusters; this value was set to zero to make sure the clusters have the same starting point. The results from k-means can be found in table 5.

From the clustering example in table 6 the following information about the average feature values of the created categories can be extracted: cluster or group 0 contains students with above-average scores and final grade, but below-average activity and time spent in Coach. Group 1 contains the largest group of students and has below-average values for every feature. Cluster 2 consists of students who on average spend a lot of time on the assignments, but this is not reflected in the final grade or scores. In cluster 3 students have above-average activity and high performance on scores and grades. Cluster 4 contains students who have extremely low performance on the final grade as well as below-average time spent and amount of activity. Finally, students in group 5 have a lot of activity in Coach, but perform below average.

Based on the results from clustering on Data set 1 it can be concluded that k-means is a solid option for categorizing students for data analysis. Table 6 demonstrates that k-means clustering clearly reveals the student categories. The features which distinguish categories from each other, e.g. performance and activity, can be discovered as well. Features that do not differentiate between categories stay relatively the same for each category, e.g. the average assessment score. The results from clustering students by performance into 2 and 3 categories, shown in table 5, were used for experimenting with feature selection (section 5.2) and performance classification (section 5.3). In addition to clustering students on their performance, using k-means with several other features can create different groups that can provide valuable information on the data. Table 6 shows an example of how other features can be used to create different student categories.


Table 5: The results of clustering the students into different performance categories. The number of students in every cluster and the average grade of each cluster are listed.


5.2 Feature selection results

The best subsets of features for performance prediction can be found in table 7, and the ranking of features by information gain in table 8.

The selected feature subsets for Data set 1 varied in which kinds of features they contained. Overall, it is difficult to conclude anything from the subsets alone, which is why it was important to rank features by their information gain with respect to the categories. Ranking features with WEKA revealed that features selected for predicting the final exam performance have very low information gain, whereas features selected for the entry exam have relatively high information gain. From these results it can already be concluded that it will be difficult to predict the performance of students on their final exam with the available features.

Table 7: Five or fewer features from the selected subsets for two (top table) and three (bottom table) student performance categories. The total number of features in each subset is listed as well. The WEKA function CfsSubsetEval (BestFirst) was used to create these subsets.


Table 8: Top five ranked features for two (top table) and three (bottom table) student performance categories. The WEKA function InfoGainAttributeEval (Ranker) was used to create these rankings.

5.3 Classification results

The results of classification with Data set 1 are listed in table 9. Feature selection and SMOTE were applied to the data before performing these tests. Applying feature selection resulted in the subset of features which performed best for predicting the performance class; with the help of feature selection the performance of the supervised learning algorithms increased. When splitting the students into three categories, the low performance category has a relatively low number of instances, which could cause problems when performing classification. Oversampling with the SMOTE algorithm helped create more instances of low-performance students. The effects of both feature selection and oversampling on the performance of one of the supervised learning algorithms can be found in table 10. The full performance test for all the experiments can be found in table 9.


Table 9: Performance ranking to test the predictive ability of the supervised learning algorithms for different time periods and different numbers of performance categories (Data set 1).

The results from classification (table 9) show that some of the features from Data set 1 can be used to predict student performance on the entry exam with relatively high accuracy (70%-90%) for both two and three categories with the tested algorithms. The algorithms perform poorly (40%-60%) when trying to predict the performance of students on the final exam with both two and three categories. This was expected, given the low information gain recorded in the feature selection experiments. Having the average score of all attempts and the amount of activity from the entry exam in the data (the features associated with Entreetoets in the table) helps to achieve high accuracy for predicting performance on the entry exam, so removing these features results in a significant drop in accuracy, with values close to those of the final exam predictions. The difference in performance could also be associated with the final exam being in a different format or being more difficult than the assignments from Coach. Accuracy for predicting performance based on the final grade is lower when using data from 5 days instead of the entire month, which is likely due to the decrease in data when utilizing shorter time periods. The neural network and decision tree algorithms showed the highest performance on most of the tests that were performed.

Table 10: A demonstration of how the performance of a supervised learning algorithm can improve by utilizing SMOTE (oversampling) before applying feature selection and classification. The experiment was performed with the decision tree algorithm, on 5 days of data, with three performance categories based on the entry exam.


5.4 Feature analysis results

A correlation matrix, shown in table 11, was created to discover correlations between the features from Data set 1. Based on the correlations from the matrix, some visualizations were found in WEKA which emphasize these findings (example in figure 12).

Table 11: Correlation matrix of the features from Data set 1 (excluding the other 254 features: the amount of activity and the average score for every individual assignment).


Figure 12: Example of a WEKA visualization: demonstrates the negative correlation between average assessment score (x-axis) and amount of time spent on assignments (y-axis) for Data set 1. Blue indicates above-average performance on the final exam and red indicates below-average performance.

Feature analysis on Data set 1 did result in some unexpected findings, e.g. the low correlation of total time spent with average score. The noticeable correlations from the correlation/covariance matrix (table 11), created in WEKA from the features of 303 students, are listed below:

• The days between the first and last recorded activity, and the total time spent on Coach assignments, have low correlation with the other features (under 0.30 and above -0.10).

• The number of questions not completed has a relatively high positive correlation with the average question score (if the number of not-completed questions increases, the average question score increases as well).

• The positive correlation between scores for assignments and time spent on assignments is low. If students spent a lot of time on their assignments their average question and assessment scores will be high, but students who spent little time on the assignments can still have a high average score (figure 12).

• As the number of assessments carried out increases, the average assessment score decreases (negative linear correlation).


• Other correlations are present as well, e.g. positive correlations between the amount of activity (amountOfLogs) and the numbers of launched exercises, questions and media items.

6 Results (Data set 2)

6.1 Prediction model results

The same prediction model as for Data set 1 was applied to Data set 2. The model was tested on the data from Course 1 and Course 2, since these had a moderate number of instances compared to Course 3. Clustering resulted in two and three performance categories based on the final grade students had obtained. The performance categories were used for feature selection (table 13) and classification (table 14).

Table 13: The top table shows the subsets of features selected by WEKA CfsSubsetEval (BestFirst) for Course 1; the bottom table shows the top five ranked features based on their correlation with the class, calculated with InfoGainAttributeEval (Ranker) for Course 1.


Since the courses all contain similar content and grading mechanics on Blackboard, the prediction model tested on Course 1 and Course 2 can also be applied to Course 3. Only data from the entire course was available, which meant that it was not possible to test the prediction model on a smaller period of time, e.g. a week instead of the entire two months of the course. The feature selection results from testing the prediction model on the Course 1 data demonstrate that its features have a high information gain and correlation with the performance class, whereas the information gain with the final grade performance class for the features from Data set 1 was significantly lower. Features such as the number of clicks made on Blackboard, the total assignment score and the total number of attempts at assignments show high information gain with the performance class. The high information gain of the Course 1 features is reflected in the performance of the classification algorithms, which performed more accurately than in the tests on Data set 1.

Table 14: The performance ranking for the predictive ability of the supervised learning algorithms on Course 1 and Course 2 (Data set 2). 10-fold cross-validation was applied to generate more reliable results, as well as performance-enhancing techniques such as feature selection and SMOTE.


6.2 Feature analysis results

A correlation matrix was created for Course 1 and Course 2 of Data set 2 to discover correlations between features. Some of the interesting correlations resulted in insightful images from WEKA (figure 15 and figure 16). The correlations are discussed below.

No unexpected correlations were found during the analysis, but confirming some of the correlations can be important for presenting meaningful information to students. The following correlations were found:

• High positive correlation of the total assignment score with time spent on Blackboard and the total number of clicks.

• All features have positive correlations with each other (if the value of one feature increases, the value of the other feature increases as well).

• Other correlations are present as well, e.g. positive correlations between the number of clicks, time spent on Blackboard and the number of attempts at assignments.

Figure 15: Example of a WEKA visualization with Course 1 data: demonstrates the correlation between total time spent on the Blackboard part of the course (y-axis) and the total score achieved for the assignments (x-axis). Blue indicates students with above-average performance for the entire course (final grade) and red indicates students with below-average performance.


Figure 16: Example of a WEKA visualization with Course 1 data: demonstrates the correlation between the total number of assignment attempts (y-axis) and the total score achieved for the assignments (x-axis). Blue indicates students with above-average performance for the entire course (final grade) and red indicates students with below-average performance.

7 Visualization

7.1 Assign visualizations based on predictions

Based on the different student categories created by the prediction model it should be possible to assign visualizations based on the selected categories. Some ideas for visualization based on student categories were investigated, but could not be tested on students, since the resources (e.g. access to Data set 2) and students for such experiments were not available. This means that the visualizations are based on previous research in LA and on experience from extracting information from student data with data analysis and machine learning. Further research on assigning visualizations based on performance would require students to evaluate each of the created images, to investigate whether the images help students improve their learning strategies. Based on previous research, it is essential to have a group of students evaluate each of the created visualizations, to investigate what kinds of visuals and interactions students would prefer (see Govaerts et al., 2010, p. 7-9). In general it is important to provide students with visualizations with which they can put themselves in context with other students. Improving student performance by raising performance awareness and encouraging better performance is a viable option for students who are underperforming (Fritz, 2011). Showing where a student resides in some kind of distribution of the data, e.g. by performance and time spent, or showing what students are actively working on compared to other students, are examples of how students could compare themselves to others. Another approach could be to show student performance rankings among other students, which might increase student motivation through competition. Showing a student a top 5 or top 10 of student performances, or predictions of performances in a course, next to his/her own data could also give the student an idea of where there is room for improvement. In addition, it could also be beneficial if students were provided with visual indicators of when their activity or other data fell below a certain threshold associated with earning a grade they desired or needed for passing the course. Figure 17 and figure 18 are two examples of possible visualizations.

Figure 17: Visualization example of visual indicators of when a student's activity falls below a certain threshold. The amount of activity (y-axis) is plotted against the dates on which the activity took place (x-axis). The green line shows the average activity of all students and the blue line shows the activity of one selected student. (Data set 1 was used to create this visualization.)
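A minimal Matplotlib sketch of the kind of plot figure 17 describes is given below; the activity numbers are made-up placeholder values, not data from the study.

```python
# Hypothetical sketch of an activity-threshold visualization (figure 17 style).
import matplotlib.pyplot as plt

days = range(1, 11)                         # days of the course
student = [4, 2, 0, 1, 5, 3, 0, 0, 2, 6]    # placeholder: one student's activity
average = [3, 3, 4, 3, 4, 4, 3, 3, 4, 4]    # placeholder: class average

plt.plot(days, average, color="green", label="average of all students")
plt.plot(days, student, color="blue", label="selected student")
plt.axhline(y=2, color="red", linestyle="--", label="activity threshold")
plt.xlabel("date")
plt.ylabel("amount of activity")
plt.legend()
plt.show()
```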


Figure 18: Visualization example which shows the assignments a student has been working on (green) and the assignments that all the other students worked on during the same dates (red). The different assignments (y-axis) are plotted against the dates on which they were performed (x-axis). (Data set 1 was used to create this visualization.)


7.2 Alternative approaches to data visualization

In the visualization results some ideas for visualizations for specific student categories were described, but other approaches for providing students with useful visualizations are also possible. Just like assigning visualizations with a prediction model, these approaches will also need students for evaluation.

Allowing students to find visualizations on their own is an example of an alternative approach. Instead of assigning visualizations based on student performance, this approach provides visual support to students in the form of multiple visual representations of the data. Providing students with many visualizations to choose from enables them to find correlations in the data by themselves and, in the process, improve their learning strategy and learn about the strategies of their fellow students (Tory and Moller, 2004, p. 74-75).

Another idea for future research is to create visualizations based on the type, amount or domain of the data. Klerkx et al. (2014) give an overview of research papers that have investigated appropriate visualization techniques for a number of different data types. Future work could research whether this approach can be automated for LA dashboards, to provide students with appropriate visualizations based on their data.

7.3 GUI/Dashboard

A GUI/dashboard was created in Python using the PySide/PyQt library. This dashboard is meant to give a demonstration of how an LA application could look and what functionality such an application could possess. The current implementation only works with Data set 1, but it would be possible to implement a version that works with Data set 2. When the dashboard is started, a pop-up becomes visible that asks for a student number. Once the student number is selected, the dashboard shows a window with the data (visualizations) of the selected student. At the top of the window there are options for selecting another student or a different time period (1 week, 2 weeks, 1 month etc.). The application does not use the prediction model that was described earlier, but this could be implemented using Scikit-learn. At the bottom left the student can see a graph of his/her activity and the average activity of all students on any date, with a slider for selecting either a time segment or the entire set of available dates (the minimum is a week). A few bar graphs can be selected at the bottom right, containing information about the grades, activity and exercises of a student. This enables the student to compare his/her values to the average of all the students. The top right contains the available information of a student. The top left segment currently holds some placeholder images, but in the future it could hold the recommended images for a student. The dashboard gives a simple example of an application on which ideas for visualizations could be evaluated by students in future research (figure 19).

Figure 19: The GUI of the dashboard as described in section 7.3.
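The dashboard code itself is not included in this report; the skeleton below is a hypothetical sketch of how such a four-segment window could be set up with PySide (the Qt4-era API, matching the toolchain in section 4.1).

```python
# Hypothetical PySide skeleton of the dashboard layout described above.
import sys
from PySide import QtGui

class Dashboard(QtGui.QWidget):
    def __init__(self):
        super(Dashboard, self).__init__()
        grid = QtGui.QGridLayout(self)
        # Placeholders for the four segments described in section 7.3.
        grid.addWidget(QtGui.QLabel("recommended visualizations"), 0, 0)
        grid.addWidget(QtGui.QLabel("student information"), 0, 1)
        grid.addWidget(QtGui.QLabel("activity graph with time slider"), 1, 0)
        grid.addWidget(QtGui.QLabel("bar graphs: grades/activity/exercises"), 1, 1)
        self.setWindowTitle("LA dashboard demo")

app = QtGui.QApplication(sys.argv)
# Pop-up asking for a student number, as described above.
student_number, ok = QtGui.QInputDialog.getText(None, "Login", "Student number:")
window = Dashboard()
window.show()
sys.exit(app.exec_())
```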

8 Conclusion

In this paper we proposed methods for extracting meaningful information from student data by using LA and EDM techniques. By testing these methods on two different data sets, it was attempted to show what kind of information could be extracted and how this information could be presented to students. In the first stage of the research, important data features were extracted from Data set 1 and Data set 2 to create data sets on which the data analysis and machine learning techniques could be applied.

By testing a predictive model for classifying student performance on Data set 1 and Data set 2, it was found that the model only showed promising performance on Data set 2. The predictive ability of the available features from Data set 1 was too low to predict which kind of grade a student would achieve at the end of a course. This result is in contrast with the results obtained from testing on Data set 2, which contained features with high information gain with respect to the class. Both clustering and feature selection in the prediction model showed promising results, for categorizing students and for investigating which features are important in the data. Overall, it can be concluded that the predictive ability of Data set 1 and its features is low and that the prediction model demonstrated better performance when tested on Data set 2.

The results of feature analysis showed that with the use of correlation matrices and visualizations interesting correlations can be identified in the data. Most of the correlations found in Data set 1 and Data set 2 were not unexpected (e.g. an increase in activity/clicks also shows an increase in time spent on Blackboard/Coach), but it was important to confirm that these kinds of correlations can be extracted from the data, because they can each provide information about the learning strategies of students.

Different ideas were presented for providing students with visualizations from which they can extract information that helps improve their learning. It was explained that students could potentially learn a lot from seeing a visual representation of their data in context with other students, and also from receiving visual feedback when they are performing below average. Although many ideas for visualizations were listed, most of these ideas will have to be evaluated by students in future experiments to confirm that they can help improve learning. The findings show that this research was successful in providing methods for extracting meaningful information from student data and in demonstrating ideas for presenting this information to students. However, further research will have to be performed into applying these findings to an LA application for assisting students in improving their learning strategies.

9 Discussion and future work

The problems with extracting Data set 2 described in section 4.4 caused the data retrieval and analysis to be delayed. The data which was eventually extracted from the Blackboard database was incomplete, which is why the research methods were applied to the exported data set instead. Although not all the desired Blackboard data was available at the time of testing, the problems encountered during the data extraction process have been documented in this report, and by the people working at Blackboard ICTS as well, which will likely facilitate the process for future research. The exported data set allowed for performing experiments on Data set 2, but future research is advised not to use this data set, because extracting it is time consuming and prone to human error.

Once all the desired Blackboard data is included in Data set 2 it will be possible to make predictions with the data from previous years (not just 2015). It will also become possible to apply the prediction model to a smaller time period, as was done with Data set 1. For feature analysis it will be important to use the complete Blackboard data to evaluate whether the discovered correlations still apply to data from previous years. Furthermore, a more detailed analysis could be performed in the future, which could include t-tests or other tests of significance on the data. In addition to the existing student data currently saved from Blackboard, it could be beneficial to add the attendance of students, since this supplies additional information which could be useful for predicting a student's performance or for other analyses.
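As a sketch of the kind of significance test meant here, the snippet below runs a two-sample t-test on, for example, the activity counts of high- versus low-performing students; the file name, column names and category labels are assumptions for illustration.

```python
# Hypothetical sketch: Welch's two-sample t-test on activity counts of
# two performance categories.
import pandas as pd
from scipy import stats

data = pd.read_csv("labeled_students.csv")     # assumed labeled data set
high = data[data["category"] == "high"]["amountOfLogs"]
low = data[data["category"] == "low"]["amountOfLogs"]

t, p = stats.ttest_ind(high, low, equal_var=False)
print("t = %.2f, p = %.3f" % (t, p))
```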

When the complete version of Data set 2 is acquired, the correlations and other information extracted from this data could be used to create informative visualizations for improving the learning of students. However, it is also important to determine what kinds of visualizations can actually provide the student with meaningful information. Therefore, it will be necessary for future research to test the ideas for visualizations on students and to use the resulting feedback to create a dashboard application.


References

Fritz, J. (2011). Classroom walls that talk: Using online course activity data of successful students to raise self-awareness of underperforming peers. The Internet and Higher Education, 14(2):89–97.

Govaerts, S., Verbert, K., Klerkx, J., and Duval, E. (2010). Visualizing activities for self-reflection and awareness. In Advances in Web-Based Learning–ICWL 2010, pages 91–100. Springer.

Klerkx, J., Verbert, K., and Duval, E. (2014). Enhancing learning with visualization techniques. In Handbook of Research on Educational Communications and Technology, pages 791–807. Springer.

Márquez-Vera, C., Cano, A., Romero, C., and Ventura, S. (2013). Predicting student failure at school using genetic programming and different data mining approaches with high dimensional and imbalanced data. Applied Intelligence, 38(3):315–330.

Papamitsiou, Z. and Economides, A. A. (2014). Learning analytics and educational data mining in practice: A systematic literature review of empirical evidence. Journal of Educational Technology & Society, 17(4):49–64.

Ramaswami, M. (2014). Validating predictive performance of classifier models for multiclass problem in educational data mining. International Journal of Computer Science Issues (IJCSI), 11(5):86–90.

Santos, J. L., Verbert, K., Govaerts, S., and Duval, E. (2013). Addressing learner issues with stepup!: an evaluation. In Proceedings of the Third International Conference on Learning Analytics and Knowledge, pages 14–22. ACM.

Shovon, M., Islam, H., and Haque, M. (2012). An approach of improving students' academic performance by using k-means clustering algorithm and decision tree. (IJACSA) International Journal of Advanced Computer Science and Applications, 3(8):146–149.

Siemens, G. (2013). Learning analytics: The emergence of a discipline. American Behavioral Scientist, pages 1–21.

Tory, M. and Moller, T. (2004). Human factors in visualization research. IEEE Transactions on Visualization and Computer Graphics, 10(1):72–84.
