Price prediction from unstructured data


Faculty of Economics and Business

Amsterdam School of Economics


Samuel Kožuch

11690534

MSc in Econometrics

Track: Big Data Business Analytics

Date of final version: 15th August 2018

Supervisor: prof. dr. Marcel Worring

Second reader: Sanna Stephan MPhil

Abstract

In this thesis, we compare the performance of two of the best-known text mining algorithms (LDA and word2vec) in predicting price from item description and name. Moreover, three different prediction algorithms are trained to find


Statement of Originality

This document is written by Student Samuel Maroš Kožuch, who declares to take full responsibility for the contents of this document.

I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it.

The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Acknowledgement

I would like to thank all the people who made the writing of this thesis possible. Firstly, I would like to thank my parents, sister and rest of the


1 Introduction

With the emergence of the internet, a large amount of data became available, ranging from sales and bank transaction records through various media files to e-mails, user manuals and many other text sources. While many algorithms were already in place to deal with structured data, a framework for applying similar techniques to unstructured data was missing. The focus shifted to text data in the mid-1990s, when several methods were proposed to extract relevant information from it. The most notable ones include latent semantic analysis and latent Dirichlet allocation. To this day, new techniques are introduced regularly, each working on a different basis and extracting different kinds of information from textual data.

Text mining techniques were quickly adopted across the machine learning field, e.g. in detecting spam mail and classifying sentiment. Price modelling is one of the fields that benefited from the evolution of text mining as well. The majority of the work in price modelling with text mining is, however, focused on predicting price movements in the stock market, where a vast amount of data is available in the form of news articles. In contrast, little work has been done on predicting independent prices, such as those in real estate markets or on online selling platforms, where data points are not time dependent. As a result, there is a lack of publications comparing text mining techniques in such predictions.

This thesis deals with the problem of price prediction on online selling platforms from user-filled description fields. In the thesis, I would like to answer two questions:


The thesis starts with a summary of known work in the field and continues with an overview of the dataset used in the analysis. Afterwards, the methodology for pre-processing and for applying various text mining techniques and prediction algorithms is described. The results of the analysis, an in-depth review of outcomes and a comparison are presented in the following section. The discussion covers problems encountered during the analysis, and the conclusion summarizes the thesis. Finally, tables with relevant results are included in the appendix.


2 Literature review

2.1 Text mining

Text mining is a process of knowledge discovery in textual data. It focuses on discovering and understanding structural and useful non-trivial patterns in text (Rajman and Besançon, 1998). The origins of the data can be very diverse, ranging from the ordinary, such as social media posts and articles, to the rather rare, such as government documents and corporate e-mails. Some documents do not have a given structure; their layout is not well defined. This type of data is referred to as unstructured. In previous years, it has been estimated that more than 80 percent of all data is unstructured (Grimes, 2008). Even though the data is unstructured, grammatical structures, patterns and relationships can still be found (Rajman and Besançon, 1998). Text mining is very closely linked to natural language processing. With the development of machine learning and artificial intelligence, the area of text mining is becoming increasingly popular. Patterns extracted from the data make it possible to generalize to new data examples and make predictions (Witten et al., 2016).

Text mining is considered to be an extension of data mining. Although the goal of data mining is very similar to that of text mining, namely to discover and understand non-trivial patterns, the nature of the data is very different (Tan et al., 1999). A plethora of data mining algorithms can be used to learn patterns from the data. A few examples include decision trees, probabilistic dependency models, and linear and nonlinear classification. Each of the aforementioned algorithms has its pros and cons; each technique is better suited for some problems than others.


unobserved information from the data. The most notable ones are topic modeling and word or document embedding, both of which have become some of the most widely used techniques in text mining analysis.

2.2.1 Latent Dirichlet allocation

Latent Dirichlet allocation (LDA) is able to explain why some parts of the data are similar by grouping the observations into unobserved groups. It was first presented in 2003 as a graphical model for topic discovery (Blei et al., 2003). The model assumes that each document is a mixture of topics. Each of the words in the document can be assigned to one of the topics present in that document.

LDA has been used successfully in several prediction-oriented studies. Wang et al. used sentiment analysis in combination with LDA to identify events from posts on Twitter on a day-to-day basis. The information was used in crime prediction and achieved far better results than a simple linear predictor predicting incidents uniformly across days (Wang et al., 2012).

2.2.2 Word2vec

In 2013, Mikolov et al. introduced the skip-gram model. The goal of the model was to find word representations in vector space. The vectors are the weights of a hidden layer of a neural network, which is trained using pairs of words: one of the words in a pair is used as a one-hot encoded input to the network and the other as a one-hot encoded target, i.e. the input word is used to predict the target word. This allows the model to learn relationships between different words, where similar words have similar vector representations, since they are often paired with analogous words. The model is able to process 100 billion words in a day and learn relationships between them (Mikolov et al., 2013b).


The model has been used in various research cases. Spotify successfully integrated an extension of word2vec into its recommender engine and achieved better-suited recommendations as a result (Boycheva, 2015). Other major companies such as Airbnb and Yahoo use a word2vec model as the backbone for recommending listings.

2.3 Applications of text mining

In the era of the World Wide Web, there is a vast amount of unstructured data available at all times. Social networks, e-mails, news articles and government documents are just a few examples of the types of data used in recent text mining research. Text mining techniques have been used in areas such as insurance cost modeling, stock price modeling and even the prediction of movie grosses.

A team of researchers was able to improve a predictive model of insurance pay-off following an accident. The benchmark model used only personal data of the insured, e.g. age, sex and other demographics, while the model proposed by the researchers also used a textual description of the incident and the injuries caused. The team used text mining techniques (including LDA) to extract 896 important concepts from the data, which, in combination with the demographics, were used in a regression tree price prediction. The proposed model achieved about a 10 percent improvement over the original model (Kolyshkina and van Rooyen, 2006).

Wüthrich et al. built a model predicting the movement of a stock price. The forecast was based on news articles released prior to the opening of trading (Wüthrich et al., 1998). 423 unique features in the form of a dictionary were manually created and used as features in the training of a neural network. The


In 2000, the Ænalyst prototype was introduced with several improvements over the previous model. One of the main differences was the full automation of the prototype, so no manual choice of features was needed. TFxIDF (term frequency times inverse document frequency) was used as the feature selection technique (Lavrenko et al., 2000). Moreover, the analysis window was shortened from the previously used 24 hours to only one hour, i.e. articles were analyzed every hour and every hour a decision was made whether to buy, sell or do nothing. During the testing period of 40 days, the prototype made over 12,000 transactions and turned the initial investment of USD 10,000 into USD 280,000.

Li et al. used sentiment analysis of textual news to predict change in the trend of oil prices. Using Granger causality analysis, the researchers were able to explore the relationship between oil prices and the sentiment of news. The analysis was then added as an extension to already existing price trend models such as logit regression, support vector machine, decision tree and neural network. In all cases, models with sentiment analysis had significantly better prediction power (Li et al., 2016).

Doshi et al. used data from social media and movie-dedicated sites to predict the amount a movie will gross after four weeks. The concept of betweenness centrality was used to measure the influence of movie titles on conversation on social media, i.e. to track the importance of a concept across social media. Moreover, sentiment analysis of forums focusing on film was used to determine the general mood about each movie. The extracted information was used to predict the gross over the four weeks after the movie's release. The ratio of predicted earnings to projected total costs classified a movie as either a flop (ratio ≤ 1), decent (1 < ratio ≤ 2) or a success (2 < ratio). The system predicted the correct class in 80 percent of cases (Doshi et al., 2010).


Jain et al. analyzed the main drivers of Bitcoin's price evolution. Twitter data were used to find different concepts in Bitcoin-themed tweets, which were later used in the price forecast. Moreover, lagged values of web searches were found to strongly influence the volatility of Bitcoin. Their model achieved a mean absolute percentage error of 1.07 percent (Jain et al., 2018).

Stevens used feature extraction from text to predict the selling price of real estate. He used the descriptions of various properties as the main explanatory variables in his price prediction. The descriptions were converted to uni- and bi-grams and only the top n were used in the prediction, with n ranging from 10 to 250. The vectors of n-grams were converted into TFxIDF scores, which were then used in predicting the property's price class. The best classifier, multinomial Naive Bayes, achieved an F1 score of 0.652 (Stevens, 2014).

2.4 Conclusion

Text mining has been around for decades and has the same goal as data mining, although the techniques differ due to the different nature of the analyzed data. In the thesis, I will use two well-known text mining techniques (word2vec and latent Dirichlet allocation) to extract information from the textual data. The goal is to determine which of the two techniques is better suited for price prediction of merchandise. This is somewhat different from other published papers, which use the aforementioned techniques to improve recommender systems or stock price prediction. To compare the performance of LDA and word2vec, three different prediction methods will be used to predict the price.


3 Dataset description

In this section, I will introduce the dataset used in the thesis as well as perform some exploratory data analysis to uncover interesting observations about the dataset.

3.1 Introduction

The dataset used during the analysis was part of the Kaggle competition Mercari Price Suggestion Challenge, which took place in February 2018. The data was taken from Mercari, a Japanese online shopping application with the biggest community in the country.

With increasing online sales and the variety of products sold, product pricing can be very diverse. The users of the application can list single items or bundles, items from well-known brands or even hand-made products, which makes price suggestion even harder. To address the problem, the entrants were asked to use the textual user-filled data to predict product prices, in order to suggest a suitable price to a seller and help them trade the offered goods. The challenge drew 2384 teams with over 2782 competitors and over 36,746 entries.

The evaluation of the algorithms was done using root mean squared logarithmic error (RMSLE), given by

$$ \mathrm{RMSLE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \log(p_i + 1) - \log(a_i + 1) \right)^2}, \tag{1} $$

where the lower the score, the better the fit, $n$ is the number of observations in the dataset, $p_i$ is the predicted price of the $i$-th item and $a_i$ is the actual price of the $i$-th item. The actual prices were not provided to entrants; each entrant's submission was privately evaluated by the organizer on two test datasets, the first and second stage datasets.
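Equation (1) can be sketched in a few lines of Python. This is only an illustration of the metric, not the organizer's evaluation code; the function name `rmsle` and the example prices are made up for the purpose.

```python
import math


def rmsle(predicted, actual):
    """Root mean squared logarithmic error, following equation (1)."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2  # log1p(x) = log(x + 1)
        for p, a in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Hypothetical prices, for illustration only.
perfect = rmsle([10.0, 25.0, 3.0], [10.0, 25.0, 3.0])
worse = rmsle([20.0, 25.0, 3.0], [10.0, 25.0, 3.0])
```

Because the error is taken on the log scale, overestimating a price of 10 by 10 is penalized far more than overestimating a price of 1000 by the same amount.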


3.2 Data

At the beginning of the competition, two datasets were provided to each entrant: training data and first stage test data. The datasets contained the same columns except for one, the target column price, which was left out of the test data for entrants to predict. After the competition ended, the second stage dataset used for the final evaluation was made available; it contained roughly five times as many rows as the first stage test data (∼3.5M compared to ∼700k). Since the actual prices were not provided for the test datasets, the analysis in the thesis will focus on the training data, which will be split into a training set (for training purposes), a validation set (for tuning purposes) and a test set (for evaluation purposes).
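A three-way split of this kind could be sketched as follows. The split proportions are not stated in this section, so the 80/10/10 shares, the seed and the function name `split_indices` are assumptions chosen purely for illustration.

```python
import random


def split_indices(n_rows, validation_share=0.1, test_share=0.1, seed=0):
    """Shuffle row indices and cut them into train/validation/test parts.

    The shares are illustrative assumptions, not values from the thesis.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_share)
    n_validation = int(n_rows * validation_share)
    test = indices[:n_test]
    validation = indices[n_test:n_test + n_validation]
    train = indices[n_test + n_validation:]
    return train, validation, test


train_idx, validation_idx, test_idx = split_indices(1000)
```

Shuffling before cutting keeps the three sets disjoint while avoiding any ordering effects present in the raw file.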

The training data was structured and contained 1,482,535 rows and 9 columns. Each row represented an item posted for sale in the Mercari application and had a unique identifier, present in the train_id column. Six of the columns were focused on categorical attributes of items and the remaining two contained the name and description of an item.

The item_condition_id column captures the status of an item. Each item could be ranked by the seller on a scale from 1 to 5 representing its condition, with 1 being brand new or almost new and 5 representing an item in poor condition. There is a strong class imbalance between item conditions, with classes 1-3 being the most frequent and classes 4 and 5 lagging far behind.


Figure 1: Frequency of item condition classes

For better orientation, each item being sold is classified into one of the main categories, subcategories etc. This feature is represented by the column category_name, which includes information on the related classification of an object. The dataset comprises 11 main categories, 114 subcategories, 871 sub-subcategories and 8 cases of 4th-level categories. The latter are very rare and are therefore of little interest. The most represented main category is the "women" category, with over 600k registered items. The least represented is the "Sports & Outdoors" category with around 25k items. Around 6,500 items were not listed in any of the categories. The most populated categories and their children can be seen in the plot below.


Figure 2: 1st and 2nd hierarchical category levels (Zientala, 2018)

The shipping and brand_name columns contain seller-entered information on whether the seller pays for shipping of the merchandise and on the brand of the sold item. The former is a simple binary indicator, where 1 indicates that the seller covers shipping and 0 that the buyer does. The latter is a textual representation of the item's brand and contains about 4200 unique values.


Figure 3: Most popular brands by category

In Figure 3, the breakdown of the most popular brands by category can be seen, as well as each brand's total occurrence in the dataset. Clear patterns can be observed: 7 out of the top ten brands have major representation in women-related categories. PINK holds the overall top spot in number of appearances in the dataset and dominates the category for women. The runner-up, Nike, is a close second in total occurrences in the data. As the brand focuses on clothing and shoes, it is strongly represented in the women's, men's and kids' categories. The top non-clothing brand is Apple, with the majority of advertised items being mobile phones and tablets.

The price column contains the seller's asking price. The distinct feature of price is its strong right skew: most of the mass lies at low prices and the right tail is very long, i.e. there is a plethora of prices below 250 and few prices above 250. The lowest price is 0, i.e. items given away for free. The maximum price is 2009. To correct for skewness, a log(p + 1) transformation is used. Histograms of both transformed and untransformed price can be seen in Figure 4.
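The log(p + 1) transformation and its inverse can be sketched with the standard library. The helper names and the sample prices (chosen within the 0 to 2009 range mentioned above) are illustrative only.

```python
import math


def to_log_price(price):
    """Apply the log(p + 1) transformation; a price of 0 maps to 0."""
    return math.log1p(price)


def from_log_price(log_price):
    """Invert the transformation to recover a price on the original scale."""
    return math.expm1(log_price)


# Illustrative prices spanning the observed range.
prices = [0.0, 10.0, 250.0, 2009.0]
log_prices = [to_log_price(p) for p in prices]
recovered = [from_log_price(v) for v in log_prices]
```

Adding 1 before taking the logarithm keeps free items (price 0) well defined, and a model trained on the transformed target needs the inverse step before its predictions are reported as prices.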


Figure 4: Histogram of price and log(p + 1) transformed price

The high skewness of price was the main reason behind using root mean squared logarithmic error for evaluation, so that lower and higher prices have equal weight in training. One can think of the skewness of price as a highly unbalanced dataset with two classes. If a machine learning algorithm were trained on such a dataset without any correction for imbalance, it would most likely predict nothing but the prevalent class, as this would minimize the error. The exception would be a state-of-the-art algorithm able to distinguish very well between the two classes, which would be very hard to build.

Textual information about items is located in the name and item_description columns. While name is shorter and mostly just contains the name of the sold item, item_description is longer and more elaborate, describing the sold item further. Since both fields are user-filled, both may contain prices. This was taken care of by the competition organizer, who replaced each price by [rm] to make inference harder. The two aforementioned columns are the most important in the whole dataset, as most of the features will be extracted from them.


Figure 5: The frequency of most used n-grams in the description column.

In Figure 5 the most common n-grams in the description column can be seen. In all three cases, words or combinations of words such as "brand" and "new" are present. This is consistent with the previous analysis, which found that the most commonly sold items are new or almost new. A strong representation of the words "shipping", "price" and "free" is observed as well, which is not surprising given the nature of the data. Among 1-grams, the word "pink" is present, which may be related to the brand PINK, the most frequent brand in the dataset. Among 3-grams, one can see combinations like "rm for rm" or "for rm for", which indicates that users often try to include prices in their descriptions of sold merchandise.


4 Methodology

In this section, various methods and algorithms used in the thesis will be introduced. The section will be split into three subsections. The first one will be concerned with preprocessing of the dataset; the second one will introduce feature extraction techniques and the theory behind them; the third one will touch on predictors used in the thesis and their tuning.

4.1 Preprocessing

Preprocessing is the most crucial part of any analysis as it can greatly improve results. The main drive of the thesis is extraction of features from unstructured data, hence the major part of preprocessing is focused on text columns, but preparing data from other columns is just as important. The end result of preprocessing is a dataset, which is ready to be further analyzed.

4.1.1 Text preprocessing

The first step in preprocessing is removing the characters and words which are of little use in further analysis or contain barely any information, e.g. stop words, punctuation, etc. Since the thesis focuses on extracting the "meaning" from words, the punctuation contains little-to-no information about this and can be freely removed. All commas, dots, exclamation points, double spaces, etc. were removed. Furthermore, there are special characters present in the dataset that are highly popular in today's internet age: emojis. Emojis might contain some information, since they can be replaced by their meaning in words, e.g. ":-)" can be replaced by "happy face" (Novak et al., 2015). The main issue with emojis is the user's subjective use of them. Certain emojis are used in situations, in which they are not needed or their original


move all of the emojis and focus the analysis only on the written words. The aforementioned subjectivity of user-filled information does not only apply to emojis, but to regular text as well.

To obtain the best result, some algorithms require stemming. Latent Dirichlet allocation is one such technique, hence for this technique specifically, word stemming will have to be performed in the thesis. Stemming removes suffixes such as -s, -es, -d, -ed, etc. Since English uses various suffixes in the majority of sentences, stemming a word and leaving just the root greatly improves the performance of topic-modelling techniques (Raschka and Mirjalili, 2017).
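The cleaning steps described in this subsection can be sketched as below. This is a toy stand-in, not the pipeline used in the thesis: the stop-word list is a tiny illustrative sample, and `naive_stem` is a hypothetical suffix stripper that is far cruder than an established stemmer such as Porter's.

```python
import re

# A tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "for", "in", "of", "to", "with"}
# A few of the suffixes named in the text; longer ones are tried first.
SUFFIXES = ("es", "ed", "s")


def naive_stem(word):
    """Strip one known suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, stem=False):
    """Lowercase, drop punctuation and emoticons, remove stop words, stem."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # also removes ":-)" and "[ ]"
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    return tokens


tokens = preprocess("Brand new boots!! Washed once, ships for free :-)", stem=True)
```

The `stem` flag is off by default because, as noted above, stemming is only needed for some techniques (here LDA), while the raw tokens can be kept for the others.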

4.1.2 Categorical variables

Most of the predictive techniques used in the thesis do not accept categorical variables at all, or their computational time increases exponentially when such variables are present in the dataset. Hence all of these variables were converted to one-hot encoded variables. One-hot encoding takes all the unique values present in a column and converts them into a sparse matrix, with 1 indicating that the label is present in the row and 0 otherwise (Pedregosa et al., 2011). The one-hot encoding was applied to all categorical columns except category and brand_name.

Figure 6: One-hot encoding process.
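A minimal sketch of the one-hot encoding process follows. It builds a dense list of rows for readability, whereas the text describes a sparse matrix, so the storage format here is a simplification; the function name and the sample condition values are illustrative.

```python
def one_hot_encode(values):
    """Return (labels, rows) with one 0/1 indicator per unique label."""
    labels = sorted(set(values))
    index = {label: i for i, label in enumerate(labels)}
    rows = []
    for value in values:
        row = [0] * len(labels)
        row[index[value]] = 1  # mark the label present in this row
        rows.append(row)
    return labels, rows


# Illustrative item conditions on the 1-5 scale.
conditions = [1, 3, 2, 1, 5]
labels, encoded = one_hot_encode(conditions)
```

Each row contains exactly one 1, so the width of the encoded matrix equals the number of unique labels, which is why columns with thousands of unique values become expensive to encode this way.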

The category column contains 1005 unique categories. Performing a one-hot encoding on so many unique values would create a huge sparse matrix, which would drastically increase memory consumption as well as computational


kept and encoded.

The column containing brands has over 4200 unique values, which would create a matrix of even higher dimensionality than was the case with categories. A possible solution would be to keep only the brands which meet a certain frequency threshold in the dataset. However, this approach would eliminate information about rare and expensive items, as these would be clustered together with lesser-known brands. To avoid sparsity and loss of data, the brands were encoded numerically, i.e. each brand is given a unique numerical ID.
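Numerical encoding of brands could look like the sketch below. How missing brands are treated is not specified here, so reserving ID 0 for them is an assumption, as are the function name and the sample values.

```python
def encode_as_ids(values, missing_id=0):
    """Map each distinct value to an integer ID in order of first appearance.

    Assumption: missing brands (None or empty string) share `missing_id`.
    """
    ids = {}
    encoded = []
    for value in values:
        if value is None or value == "":
            encoded.append(missing_id)
            continue
        if value not in ids:
            ids[value] = len(ids) + 1  # IDs start at 1; 0 is kept for missing
        encoded.append(ids[value])
    return ids, encoded


brands = ["Nike", "PINK", None, "Apple", "Nike"]
brand_ids, brand_column = encode_as_ids(brands)
```

The result is a single integer column instead of thousands of indicator columns. The IDs carry no natural ordering, which tree-based predictors tolerate more readily than linear ones.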

4.2 Predictors

The comparison of performance between word2vec and LDA will be evaluated based on results from three different prediction techniques: linear regression, neural network and gradient boosting regression trees. All of the techniques are very well known, hence they will be covered only briefly.

4.2.1 Linear regression with stochastic gradient descent

Linear regression is the simplest of the predictive algorithms and is used in most analyses as the basic tool to determine predicted values. In the thesis, a linear regression with stochastic gradient descent (SGD) is used due to the size of the dataset. Weights in linear regression with SGD are adjusted based on the gradient of the loss function. The adjustment of weights can be characterized by

$$ w^{(t+1)} = w^{(t)} + \lambda \Delta g, \tag{2} $$


4.2.2 Neural network

A neural network is a network of connected layers. Each layer contains neurons, which are activated based on an activation function and the value of activation. The information is first passed forward through the network until the output layer is reached. The outputs of an output layer are compared to the real value and similarly, as in the case of linear regression with SGD, the error is back propagated through the network, adjusting the weights.
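The forward pass and the backward adjustment of weights can be sketched with a very small network. Everything below is an illustrative assumption rather than the architecture used in the thesis: a single tanh hidden layer without bias terms, a squared-error loss, per-sample gradient steps, and toy data.

```python
import math
import random


def train_tiny_network(samples, hidden=3, rate=0.1, epochs=300, seed=0):
    """Train a one-hidden-layer network and return the loss per epoch."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]

    def forward(x):
        # Hidden activations, then a linear output neuron.
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
        return h, sum(w * hi for w, hi in zip(w2, h))

    def mean_squared_error():
        return sum((forward(x)[1] - y) ** 2 for x, y in samples) / len(samples)

    history = [mean_squared_error()]
    for _ in range(epochs):
        for x, y in samples:
            h, out = forward(x)
            error = out - y  # derivative of 0.5 * (out - y)^2 w.r.t. out
            for j in range(hidden):
                # Back-propagate through the output weight and tanh.
                grad_hidden = error * w2[j] * (1.0 - h[j] ** 2)
                w2[j] -= rate * error * h[j]
                for i in range(n_in):
                    w1[j][i] -= rate * grad_hidden * x[i]
        history.append(mean_squared_error())
    return history


# Toy data following y = 0.5 * (x1 - x2).
samples = [([1.0, 0.0], 0.5), ([0.0, 1.0], -0.5), ([1.0, 1.0], 0.0), ([0.5, 0.5], 0.0)]
history = train_tiny_network(samples)
```

The returned `history` records the mean squared error before training and after each epoch, which makes it easy to check that the back-propagated updates are actually reducing the error.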

4.2.3 Gradient boosting regression trees

Gradient boosting regression trees work by building n weak prediction models and constructing an ensemble of these predictors. The model is built in a step-wise fashion, like other boosting methods, and allows for an arbitrary differentiable loss function.
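The step-wise construction can be illustrated with depth-one trees (stumps) on a single feature. This sketch assumes a squared-error loss, for which the negative gradient is simply the residual; the learning rate, number of trees and data are arbitrary illustrative choices, not the configuration used in the thesis.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on one feature (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_trees=20, learning_rate=0.3):
    """Fit stumps one at a time to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(ys)
    for _ in range(n_trees):
        # For squared error, the negative gradient equals the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9]
predict = gradient_boost(xs, ys)
baseline_mse = sum((y - sum(ys) / len(ys)) ** 2 for y in ys) / len(ys)
boosted_mse = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
```

Swapping in a different differentiable loss only changes how `residuals` is computed, which is what makes the approach general.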

4.3 Feature extraction

In this subsection, the feature extraction techniques will be introduced. In the first part, I will introduce word2vec, a recently introduced natural language processing technique. In the second part, latent Dirichlet allocation will be discussed, a well-known technique with over 23k citations. In the last part, differences between the methods will be discussed.

4.3.1 word2vec

Distributed representations of words greatly improve the accuracy of learning algorithms in natural language processing tasks by grouping similar words together in a vector space. The notion of using word representations in vector space is not new; the oldest use dates back to 1986, and many techniques have been proposed to efficiently estimate vectors. In 2013, the skip-gram model (commonly referred to as word2vec) was introduced as an efficient method to learn vector representations from an enormous amount of text data. The significant advantage over similar techniques was


vectors to obtain other words. The model is trained using a neural network with the task to find word representations, that are useful for predicting the surrounding words in a sentence (Mikolov et al., 2013a).

4.3.1.1 Theory

The architecture of the word2vec model is rather simple: an input layer, one hidden layer and an output layer. Both the input and output layers have as many neurons as there are words in the vocabulary; the number of neurons in the hidden layer determines the number of dimensions of the resulting word vectors and is set by the user. The hidden layer has no activation function, while the output layer uses softmax in order to determine the probability of neighbouring words. All aspects of the model will be discussed in the coming pages.

Figure 7: An example of an architecture of the skip-gram model with vocabulary size of 10,000 and 300 hidden layer neurons (McCormick, 2016).


words before or behind the input word. A very intuitive explanation of the window size and how the input pairs are created can be seen in Figure 8.

Figure 8: Creation of input pairs with c = 2

From each pair, the model will learn statistics based on how many times the pair shows up, i.e. related words appear together more often, hence their relationship is stronger. The pairs are fed into the network as one-hot vectors: the one-hot vector of the input word is fed into the input layer, while the one-hot vector of the other word in the pair is used as the target, based on which the network calculates the loss and adjusts the weights accordingly.
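The creation of (input, target) pairs for a window size of c = 2 can be sketched as follows. The function name and the example sentence are illustrative; the pairs are kept as word strings here, with the one-hot conversion left out for brevity.

```python
def skipgram_pairs(tokens, window=2):
    """Pair every word with each neighbour at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is never paired with itself
                pairs.append((center, tokens[j]))
    return pairs


sentence = ["brand", "new", "with", "tags"]
pairs = skipgram_pairs(sentence, window=2)
```

Words near the edges of a sentence produce fewer pairs than words in the middle, and pairs that recur across many descriptions are the ones the network sees most often during training.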

The objective of the network is to maximize the average log probability

$$ \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t), \tag{3} $$

where $p(w_{t+j} \mid w_t)$ is defined by

$$ p(w_O \mid w_I) = \frac{\exp\left( {v'_{w_O}}^{\top} v_{w_I} \right)}{\sum_{w=1}^{W} \exp\left( {v'_{w}}^{\top} v_{w_I} \right)}, \tag{4} $$

where $v_w$ and $v'_w$ are the vector representations of "input" and "output" and $W$ is the number of words in the vocabulary (Mikolov et al., 2013b). The output of the network after training is a single vector containing, for every word in the vocabulary, the probability that a randomly selected word


neuron has a corresponding weight vector from the hidden layer, which is used to calculate the output probability using the softmax function. The probabilities over all output neurons add up to 1.

The corresponding word representations in vector space are taken from the hidden layer, more specifically the weights of the hidden layer are the learned word vectors. The number of weights is the product of the number of total words in vocabulary (the size of input/output layer) and the number of neurons in the hidden layer. For each word, its word vector can be obtained by matrix multiplication of its one-hot representation and the hidden layer weights.
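The lookup by matrix multiplication and the softmax output of equation (4) can be sketched with a three-word vocabulary and two hidden neurons. All weights below are made-up toy numbers, and the function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax; the outputs sum to 1."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def lookup(one_hot, hidden_weights):
    """One-hot row vector times the weight matrix selects a single row."""
    n_dims = len(hidden_weights[0])
    return [
        sum(one_hot[i] * hidden_weights[i][d] for i in range(len(one_hot)))
        for d in range(n_dims)
    ]


def output_probabilities(word_vector, output_weights):
    """Score every vocabulary word against the input vector, as in (4)."""
    scores = [sum(o * v for o, v in zip(row, word_vector)) for row in output_weights]
    return softmax(scores)


vocabulary = ["brand", "new", "shoes"]
hidden_weights = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1]]  # input vectors v_w
output_weights = [[0.1, 0.0], [0.3, 0.2], [-0.2, 0.4]]   # output vectors v'_w

v_new = lookup([0, 1, 0], hidden_weights)  # one-hot vector for "new"
probs = output_probabilities(v_new, output_weights)
```

Since the one-hot vector has a single non-zero entry, the multiplication amounts to reading one row of the weight matrix, so in practice the word vector is usually fetched by index rather than by an actual matrix product.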

Word2vec has several advantages, which result from its architecture. Firstly, since the relationships between words do not change across documents, a trained skip-gram model may be used on different corpuses and will achieve the same results. This allows the use of word embeddings that were trained on large datasets with much more computational power, without the need to train the model locally. Secondly, word2vec can be used across languages, as the same word in two different languages will have a very similar vector space representation in both of them (Press and Wolf, 2016). Thirdly, linear translations can be applied to word vectors, i.e. existing word vectors can be used to create different word vectors. Therefore, the following holds:

$$ v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}} $$

$$ v_{\text{Soviet}} + v_{\text{Union}} \approx v_{\text{Soviet Union}}, $$


4.3.1.2 Word2vec application in text mining

In the thesis, a pre-trained skip-gram model will be used (denoted by w). The model, originally published by Google, was trained on Google News articles and has a vast vocabulary of 3M words with 300 neurons in the hidden layer. For each row, two separate vocabularies will be created: one for the name column (V1) and one representing the words contained in item_description (V2). For each of the vocabularies, the vector representations of every word in the respective vocabulary will be summed to obtain a vector of length 300. The overall result for each row will be two vectors of length 300, one containing the vector representation of the vocabulary in the name column and the other containing the vector representation of the vocabulary in the item_description column. Note that the vocabulary of both columns changes with every row, hence each row will have its own two unique word vectors built from the textual data found in the record.
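The per-row summation can be sketched with toy three-dimensional embeddings standing in for the 300-dimensional pre-trained model. The embedding values are made up, and skipping words that are missing from the embedding vocabulary is an assumption, since the text does not say how such words are handled.

```python
def sum_word_vectors(tokens, embeddings, dimensions):
    """Sum the vectors of all tokens found in the embedding vocabulary."""
    total = [0.0] * dimensions
    for token in tokens:
        vector = embeddings.get(token)
        if vector is None:
            continue  # assumption: out-of-vocabulary words are skipped
        for d in range(dimensions):
            total[d] += vector[d]
    return total


# Toy 3-dimensional embeddings standing in for the 300-dimensional model.
toy_embeddings = {
    "brand": [0.5, 0.0, 1.0],
    "new": [0.25, 1.0, 0.0],
    "shoes": [0.0, 0.5, 0.5],
}

name_vector = sum_word_vectors(["new", "shoes"], toy_embeddings, 3)
description_vector = sum_word_vectors(["brand", "new", "rm"], toy_embeddings, 3)
row_features = name_vector + description_vector  # concatenation, not addition
```

With the real model each row would yield 2 × 300 = 600 features, one block per text column, regardless of how many words the seller typed.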

4.3.2 Latent Dirichlet allocation

Latent Dirichlet allocation (LDA) was introduced in 2003 (Blei et al., 2003) and quickly became a very popular method in natural language processing. The goal of LDA is to explain sets of observations by unobserved groups ("topics"), i.e. to explain why some parts of the data are similar to each other. LDA assumes that each document exhibits multiple topics and tries to construct the distribution of each topic over a fixed vocabulary (Blei, 2012). There are two outputs of an LDA model:

1. a document to topic matrix, which describes each document’s involvement in each topic,

2. a word to topic matrix, which describes the probability of each word being in a certain topic.

4.3.2.1 Theory


generative process. In LDA, the observed variables are the words in a document; the hidden ones are the K topics.

LDA can be described by the following relations. The topics are represented by $\beta_{1:K}$, where each $\beta_k$ denotes a distribution over a fixed vocabulary. Each document $d$ has topic proportions $\theta_d$, where $\theta_{d,k}$ denotes the proportion of topic $k$ in document $d$. Each word in document $d$ is assigned a topic; the topic assignment of the $n$-th word in document $d$ is denoted by $z_{d,n}$, and the vector of all topic assignments in document $d$ by $z_d$. The observed words in document $d$ are $w_d$, where the $n$-th observed word in document $d$, an element of the fixed vocabulary, is $w_{d,n}$.

The joint distribution of observed and hidden variables corresponds to the following generative process,

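$$
p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K} p(\beta_i) \prod_{d=1}^{D} p(\theta_d) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right). \tag{5}
$$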

Notice that many of the variables are dependent on other variables present in the model. These dependencies define LDA.
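To make the dependencies concrete, the generative process can be simulated with a short sketch. The sizes K, D, N, V and the symmetric Dirichlet hyper-parameters `eta` and `alpha` are toy assumptions; they are not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, N, V = 3, 4, 6, 8  # topics, documents, words per document, vocabulary size (toy values)
eta, alpha = 0.5, 0.5    # assumed symmetric Dirichlet hyper-parameters

# beta[k]: distribution of topic k over the fixed vocabulary.
beta = rng.dirichlet(np.full(V, eta), size=K)
# theta[d]: topic proportions of document d.
theta = rng.dirichlet(np.full(K, alpha), size=D)

z = np.empty((D, N), dtype=int)  # topic assignments z[d, n]
w = np.empty((D, N), dtype=int)  # observed words w[d, n], as vocabulary indices
for d in range(D):
    for n in range(N):
        z[d, n] = rng.choice(K, p=theta[d])       # the assignment depends on theta_d
        w[d, n] = rng.choice(V, p=beta[z[d, n]])  # the word depends on beta and z[d, n]
```

Each sampling line mirrors one factor of the joint distribution: a word is drawn only after its topic assignment, which in turn is drawn from the topic proportions of its document.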

The conditional distribution of the topic structure given the observed documents is defined by

$$
p(\beta_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D}) = \frac{p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{p(w_{1:D})}, \tag{6}
$$

where the numerator is the joint distribution of all random variables and the denominator is the marginal probability of the observations.


the vocabulary of the name column consists of fewer entries than that of the description, hence fewer topics will result in a better fit. From the name column, 25 topics will be estimated. From the second text column, 50 topics will be estimated. The entries will be converted to a document to topic matrix and used as new features. Textual columns in the testing data will be transformed to document to topic matrices as well, using the already trained LDA model.
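The fit-on-training, transform-on-testing step could look roughly like the sketch below. It assumes scikit-learn's `LatentDirichletAllocation` and `CountVectorizer`; the example texts are made up, and only 2 topics are used so that the toy corpus stays meaningful (the text above uses 25 and 50).

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-ins for the name column of the training and testing data.
train_names = ["red nike shoes", "blue nike shirt", "iphone case black", "black phone case"]
test_names = ["nike shoes black"]

# Word counts; the vocabulary is learned on the training data only.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_names)
X_test = vectorizer.transform(test_names)

# n_components is the number of topics (2 here for the toy corpus).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
train_topics = lda.fit_transform(X_train)  # document to topic matrix for training data
test_topics = lda.transform(X_test)        # same trained model applied to testing data
```

Each row of `train_topics` and `test_topics` holds the topic proportions of one document, and these rows are what would be used as new features.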

4.3.3 Word2vec vs. LDA

While both methods are very popular, they are very different models, each with advantages and disadvantages over the other. Their differences are summarized in this subsection.

As was already mentioned, the goal of word2vec is to learn word to word relationships. The output of the word2vec model is a mapping of words into an N-dimensional vector space, with similar words lying close to each other in that space. The vector space is dense, flexible and hard for a human to interpret. LDA, on the other hand, tries to learn relationships between words and the topics in a document. Each word is represented by a probability distribution of belonging to a certain topic. LDA creates a sparse matrix, which is not very flexible but is human-interpretable, i.e. the topics learned by LDA can be easily interpreted.

The meaning of words and the relationships between words hardly change, which allows a trained word2vec model to be used across different documents and achieve similar results. This is a clear advantage, as there are many pre-trained skip-gram models available online, trained on multi-billion word datasets. LDA, on the other hand, has to be trained for each corpus separately, since the model uses relationships learned from document to document, which are not required to stay the same across different datasets. This poses a problem with large corpora as training of the LDA model can


learning LDA method, which drastically shrinks the computational time needed to fit an LDA model. However, the on-line method is still suited to relatively small datasets when compared to the word2vec model, which is able to process 100 billion words a day.

With LDA, the number of topics has to be specified. To achieve the best performance, the number of topics should be chosen such that the negative log-likelihood is minimized. This number differs across corpora and determining it can be slow. Word2vec, on the other hand, has been shown to perform better the more neurons are used (Mikolov et al., 2013a).

LDA is able to capture higher correlation between two elements due to its dependence on Bayesian statistics. Word2vec, on the other hand, allows linear translations of vectors to be used to obtain vector representations of higher-order n-grams.

4.4 Predictors and hyper-parameter tuning

All of the prediction techniques suffer from complex loss functions. Many parameters affect the loss function in different ways, and to find the best result, i.e. obtain the lowest loss, hyper-parameter tuning needs to be performed. Grid search, i.e. looping over all possible combinations of pre-defined parameter values, is often used. However, all of the algorithms used in the thesis have a long learning phase and such an approach would be time-inefficient. Hence, a random search will be used for all of the prediction methods. This approach is able to explore a wider variety of combinations in less time. In random search, all of the parameter values are drawn randomly from user-defined distributions.


enormous error.

In the thesis, the learning rate is drawn from an exponential distribution with a specific mean for each of the methods used. For each method, 50 learning rates are drawn randomly and used in the tuning.
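A random search over the learning rate can be sketched as follows. The mean of 0.01 and the function `validation_error` are hypothetical: the latter is a toy stand-in for training a model and scoring it on validation data, not the procedure used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
MEAN_LR = 0.01  # hypothetical mean; the thesis uses a method-specific mean
N_DRAWS = 50    # number of learning rates drawn per method

# Draw candidate learning rates from an exponential distribution.
learning_rates = rng.exponential(scale=MEAN_LR, size=N_DRAWS)

def validation_error(lr):
    """Toy stand-in for 'train with this learning rate, return validation error'."""
    return (np.log10(lr) + 2.0) ** 2 + 0.5

# Evaluate every candidate and keep the one with the lowest error.
errors = np.array([validation_error(lr) for lr in learning_rates])
best = int(np.argmin(errors))
best_lr, best_error = learning_rates[best], errors[best]
```

Unlike a grid search, the candidates are not restricted to pre-defined values, so small learning rates are sampled densely while larger ones still appear occasionally.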

4.4.1.1 Number of layers and neurons in a neural network

Deciding on the number of neurons and layers is a very important part of building a neural network. Regarding the number of layers, a neural network with just one hidden layer is able to approximate any function that contains a continuous mapping from one finite space to another. A network with two hidden layers can represent an arbitrary decision boundary to arbitrary accuracy with rational activation functions and can approximate any smooth mapping to any accuracy (Heaton, 2008). In the tuning phase, networks with one and two hidden layers are trialed.

Regarding the number of neurons, there are many rule-of-thumb methods to determine the best number of neurons in the network. The overall goal of tuning the number of neurons is to select a number such that the network is able to generalize and perform well on previously unseen data. If the number of neurons is chosen very high, the network will be able to recall the training set to perfection but will not perform well on the testing set. In the thesis, the number of neurons will be drawn from a uniform distribution, with the lower boundary being the size of the output layer (1) and the upper boundary being the size of the input layer (which varies across methods).
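Drawing a candidate architecture can be sketched as below. The input size of 100 is a hypothetical value, and drawing the neuron count independently for each hidden layer is an assumption; the text above only states the bounds of the uniform distribution.

```python
import random

random.seed(0)
INPUT_SIZE = 100  # hypothetical input layer size; it varies across methods
OUTPUT_SIZE = 1   # a single predicted price

def sample_architecture():
    """Draw one candidate: 1 or 2 hidden layers with uniformly drawn neuron counts."""
    n_layers = random.choice([1, 2])
    return [random.randint(OUTPUT_SIZE, INPUT_SIZE) for _ in range(n_layers)]

candidates = [sample_architecture() for _ in range(5)]
# Sum of neurons over all hidden layers of each candidate network.
total_neurons = [sum(layers) for layers in candidates]
```

The sum in `total_neurons` corresponds to the total number of neurons, which is the quantity later reported alongside the tuning results.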

4.4.1.2 Tree depth in gradient boosting regression trees

The depth of the regression tree is the length of the longest path from the root of the tree to a leaf. The deeper the tree, the more complex decision boundaries can be correctly modelled. In the thesis, the depth of the tree will be randomly chosen between 1 and 7 (the upper boundary was chosen


upper boundary would provide more insight into the performance of the tree method).

5 Results

In this section, I will report the results of the research. The section is split into three subsections: the LDA method, the word2vec method and a comparison of the results. In each of the subsections, the outcome of the respective prediction techniques will be reported. Moreover, each method’s loss function will be visualized. The visualization is a result of point interpolation, which had to be performed due to the relatively small number of iterations used during parameter tuning as well as the random search method that was used. Because of this, the visualized loss functions may not be identical to the real loss functions and serve only as a rough estimate. Moreover, the loss functions of neural networks reported in this section use the total number of neurons as one of the variables. This variable is the sum of all neurons in the respective network. This step was taken due to the difficulty of visualizing loss functions of higher dimensionality.

5.1 Latent Dirichlet allocation

In this section, I will present results of predictive analysis using topic representations as features.

5.1.1 Linear regression

The simplest of the methods did not yield strong results, nor was it expected to. Hyper-parameter tuning with 50 iterations took only a couple of minutes, with the best outcome of RMSLE = 1.006 on the validation dataset. The best results of hyper-parameter tuning are in table 3. Approximated


Figure 9: Estimated loss function of linear regression. Variable learning rate is portrayed on x-axis, while error on y-axis.

Observing the loss function, it is clear that linear regression prefers higher learning rate, which allows it to converge faster to a point of lower error within the given number of weight adjustments. Although higher learning rates were not used in the study (due to random nature of tuning), they might improve results by a tiny margin if set reasonably low. A learning rate with a very high value would lead to an increase in error due to exploding gradient.

When the model was run on previously unseen data, an RMSLE of 1.002 was reached. This is not the best of the results, as will be seen in the coming subsections.
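For reference, the error measure reported throughout this section can be computed as in the sketch below. It assumes the usual definition of RMSLE based on log(1 + x); the prices and predictions are made-up numbers.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x) on both inputs."""
    squared = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

prices = [10.0, 25.0, 100.0]      # made-up true prices
predictions = [12.0, 20.0, 90.0]  # made-up predicted prices
score = rmsle(prices, predictions)
```

Because the error is computed on logarithms, it penalizes relative rather than absolute deviations, so a mistake of 2 on a price of 10 weighs more than a mistake of 10 on a price of 100.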

A predictive analysis using a neural network performed much better than its simpler counterpart, although at a cost of time. While tuning of linear regression took a couple of minutes, the neural network needed almost 2 days to complete only a couple of iterations of hyper-parameter tuning. However, the obtained results were much better in almost every iteration. During tuning, the lowest error seen on the validation set was an RMSLE of 0.650. On test data, the error was slightly lower with a value of 0.649. Results of


In figure 10, an approximate loss function of the model is presented. Contrary to linear regression, the neural network prefers a lower learning rate to achieve the best results. Moreover, it can be observed that the total number of neurons does not have a significant impact on the observed loss as long as it is set above 20. In fact, in some cases a higher number of neurons hurt the overall performance due to overfitting¹.

Figure 10: Estimated loss function of the neural network method. Learning rate and total number of neurons are on horizontal axes; error is on the vertical axis.

Out of the three methods used in predicting the price, gradient boosting regression trees performed the best. Moreover, training and tuning of the trees did not take as much time as in the case of neural networks; the forest was trained and tuned within a few hours. During the tuning, an error of 0.626 was achieved. When tested on unseen data, the error was 0.623. The best results of the tuning can be seen in table 5.

¹ This realization is rather hard to observe on a static image. If required, an interactive plot can


Figure 11: Approximate loss function of gradient boosting regression trees. Learning rate and depth are on horizontal axes, error is on the vertical axis.

As can be seen in figure 11, the more complex a tree is, the better the results. Four out of ten best results have the maximum depth a tree was allowed to attain. Regarding the learning rate, although consistently better results were achieved with larger learning rates, best results had a learning rate of around 0.03.

5.2 Word2vec

In this section, results of price prediction with word2vec features will be presented.

5.2.1 Linear regression

Compared to its LDA counterpart, linear regression with word2vec features resulted in a much lower error, at a cost of longer training and tuning time. The process took a couple of hours to complete, compared to a couple of minutes in the previous case. During the tuning stage, several learning rates were tried out and rather high learning rates of around 0.02 produced the lowest errors on the validation set, with the lowest achieved error of 0.899. The results of this process are presented in table 6. On the


Figure 12: Estimated loss function of linear regression. Changing learning rate is oriented on the horizontal axis, error is on the vertical axis.

Judging from figure 12, when the learning rate was set too low for the linear regression model, it failed to converge to the minimum of the loss function. As the learning rate was set higher, the loss became smaller until a minimum was reached at a learning rate of 0.02336711.

A neural network performed much better than the much simpler linear regression. Compared to the neural network used with LDA features, the networks used with word2vec features were much more complicated, with a significantly higher number of neurons, due to the higher dimensionality of the dataset. The process of hyper-parameter tuning took more than 2 days to explore the prescribed number of random combinations. On the validation set, the lowest root mean squared logarithmic error had a value of 0.509; on test data the error was very similar: 0.510. Hyper-parameter combinations with the lowest error during tuning are provided in table 7.


Figure 13: Estimated loss function of the neural network method. Learning rate and total number of neurons are on horizontal axes; error is on the vertical axis.

Chosen learning rate had a significant effect on recorded error. The neural network performed well with lower learning rates, close to 0.0005. The total number of neurons, as in the previous case, affected the error only slightly and seemed to only fine tune the error of the test.

Gradient boosting regression trees again performed very well and provided a good compromise between error and training time. Whole training and tuning process took only about 10 hours to complete and the results were competitive with that of neural network. During tuning, the lowest RMSLE achieved had a value of 0.527 with a maximum depth of 7 and learning rate of 0.24019185. Once again, best results from the tuning phase are presented in table 8. On the testing dataset, the best performing model from tuning phase achieved an error of 0.528.


Figure 14: Estimated loss function of the gradient boosting regression trees. Learning rate and maximum depth of a tree are located on horizontal axes, error is on the vertical axis.

In figure 14, the loss function dependent on hyper-parameters is presented. As was the case in the previous encounter with gradient boosting regression trees, the loss function attains smaller values with more complex trees, i.e. higher d. A slight negative slope can be observed with increasing maximum depth (from left to right). Moreover, if the tree is set to be very simple, i.e. 1 layer, the loss starts to increase, regardless of the learning rate.

5.3 Comparison of the results

For all methods and predictors, their lowest achieved error is recorded in table 1 below.

method/predictor   SGD linear regression   neural network   boosted trees
LDA                1.00187                 0.64908          0.62259
                   benchmark               -35.21%          -37.85%


When compared across methods, it is clear that predictors with word2vec features outperformed those that used LDA features in all cases. This came at a cost of higher computational time, as all of the predictors took longer to train and tune. Linear regression with SGD, neural network and gradient boosting regression trees achieved a decrease in loss of 10.45%, 21.51% and 15.10% respectively.
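The percentage changes can be reproduced with a short calculation. The sketch below uses only errors quoted in this thesis (the LDA errors from table 1 and the word2vec neural network error of 0.50949 from the conclusion); the helper name `relative_change` is introduced here for illustration.

```python
def relative_change(new, old):
    """Percentage change of `new` relative to `old` (negative means a lower error)."""
    return (new - old) / old * 100.0

lda_linear = 1.00187  # LDA, SGD linear regression (the benchmark in table 1)
lda_nn = 0.64908      # LDA, neural network
w2v_nn = 0.50949      # word2vec, neural network

nn_vs_benchmark = relative_change(lda_nn, lda_linear)  # about -35.21 %
w2v_vs_lda_nn = relative_change(w2v_nn, lda_nn)        # about -21.51 %
```

The first value matches the change reported against the benchmark in table 1, and the second matches the 21.51% decrease quoted for the neural network.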

For LDA features, the choice of predictor is clear: gradient boosting regression trees. The predictor offers a lower error, lower computational time and, most importantly, it is not a black box, i.e. it is easy to interpret, which might be a crucial feature when working with LDA features. It might be of high value to see which topic representations play a role in predicting the price.

Linear regression with stochastic gradient descent performed rather poorly when compared to the other two predictors, as it was not able to separate the data linearly. The neural network performed only slightly worse than the boosted trees, so it might be considered a viable predictor. However, it does not offer any advantages over the boosted trees, as it takes longer to train and tune and it is significantly harder to understand what is going on inside.

For word2vec, in my opinion, there is no right option, just a wrong one. As in the case of LDA, linear regression performed rather poorly, as was expected. The nonlinear features of word2vec were linearly inseparable. In contrast, the other two methods achieved about a 40 percent reduction in error. The two methods offer a compromise between low error and low computational time.

The difference in errors is only slight with a difference of RMSLE being about 0.02. The main difference is the computational time with boosted trees only taking a few hours to compute, while different variations of a neural


the two is therefore solely a matter of user preference based on various factors, one of which is the available hardware, e.g. a powerful GPU could drastically shrink the time needed to compute both methods, decreasing the difference to a level where the user would be indifferent. Interpretability is not a deciding factor here. Although, as was mentioned before, gradient boosting regression trees offer a possibility to see what is going on under the hood, the majority of features used in this part of the work are weights of a neural network, which, again, are very hard to interpret. Hence, this perk of boosted trees does not tip the balance in their favor.

To sum up, the word2vec method outperformed LDA. As a prediction method, both the neural network and gradient boosting regression trees are viable options, as both offer great performance but come with trade-offs.


6 Discussion

In this section, I would like to discuss the difficulties faced during the thesis and offer several proposals which might improve the analysis.

Starting with the difficulties, the main limitation of my thesis is hardware. As I do not own, and do not have access to, a state-of-the-art machine, much of the analysis took a long time to complete, mainly the hyper-parameter tuning part. If a machine with a very capable GPU were used, the computational time could be shrunk dramatically. This, in turn, would mean that more time could be spent on trying more combinations of hyper-parameters or even on using a completely different method of search. It is highly likely that there is a combination of hyper-parameters which offers a lower error than that presented in the thesis.

Another hardware limitation was felt during the training of gradient boosting regression trees, where all of the trees had to be limited to a maximum depth of 7. I am quite confident that with an even higher depth, the prediction method would yield even better results. Moreover, with more capable hardware, an unrestricted tree could be trained as well, which grows until all features are exhausted and might provide a further performance improvement.

As was already mentioned, improvements can be made. One improvement would be to increase the number of combinations trialed during hyper-parameter tuning to see whether a better solution is present. As also mentioned, increasing the maximum depth of the trees would improve the work. Other improvements might be made by expanding the set of prediction techniques used. Some of the proposed techniques are ridge regression, a recurrent neural network or even a combination of the methods into an ensemble model.


approach of LDA with that of word2vec, or fasttext, which is similar to word2vec, but rather than treating each word as a whole, it treats it as a collection of n-grams.


7 Conclusion

In this work, I reported the performance of various techniques in predicting the price of sold merchandise. The data were taken from the Mercari Price Suggestion Challenge competition on Kaggle and the main goal was to predict a correct price using unstructured data. In the thesis, I used two very well-known techniques, LDA and word2vec, to extract features from the pre-processed textual data. The features were then used in the prediction of price. Three different prediction algorithms were used: a linear regression with stochastic gradient descent, a neural network and gradient boosting regression trees.

All models with word2vec features outperformed their counterparts that used LDA features. For LDA, the best performing predictor was the boosted trees model with the lowest RMSLE of 0.62259. For word2vec, the lowest error was achieved by a neural network with an RMSLE of 0.50949. This, however, came at the price of a long computational time.


References

Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4):77–84.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022.

Boycheva, N. (2015). Distributional similarity music recommendations versus spotify: A comparison based on user evaluation.

Doshi, L., Krauss, J., Nann, S., and Gloor, P. (2010). Predicting movie prices through dynamic social network analysis. Procedia-Social and Behavioral Sciences, 2(4):6423–6433.

Grimes, S. (2008). Unstructured data and the 80 percent rule. Carabridge Bridgepoints, page 10.

Heaton, J. (2008). Introduction to neural networks with Java. Heaton Research, Inc.

Jain, R., Nguyen, R., Miller, T., and Tang, L. (2018). Bitcoin price forecasting using web search and social media data.

Kolyshkina, I. and van Rooyen, M. (2006). Text mining for insurance claim cost prediction. In Data Mining, pages 192–202. Springer.

Lavrenko, V., Schmill, M., Lawrie, D., Ogilvie, P., Jensen, D., and Allan, J. (2000). Mining of concurrent text and time series. In KDD-2000 Workshop on Text Mining, volume 2000, pages 37–44.

Li, J., Xu, Z., Yu, L., and Tang, L. (2016). Forecasting oil price trends with sentiment of online news articles. Procedia Computer Science, 91:1081–


Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.

Novak, P. K., Smailović, J., Sluban, B., and Mozetič, I. (2015). Sentiment of emojis. PloS one, 10(12):e0144296.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in python. Journal of machine learning research, 12(Oct):2825–2830.

Press, O. and Wolf, L. (2016). Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859.

Rajman, M. and Besançon, R. (1998). Text mining: natural language techniques and text mining applications. In Data mining and reverse engineering, pages 50–64. Springer.

Raschka, S. and Mirjalili, V. (2017). Python machine learning. Packt Publishing Ltd.

Stevens, D. (2014). Predicting real estate price using text mining. Automated Real Estate Description Analysis.

Tan, A.-H. et al. (1999). Text mining: The state of the art and the challenges. In Proceedings of the PAKDD 1999 Workshop on Knowledge Discovery from Advanced Databases, volume 8, pages 65–70.

Wang, X., Gerber, M. S., and Brown, D. E. (2012). Automatic crime prediction using events extracted from twitter posts. In International conference


Witten, I. H., Frank, E., Hall, M. A., and Pal, C. J. (2016). Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann.

Wüthrich, B., Permunetilleke, D., Leung, S., Lam, W., Cho, V., and Zhang, J. (1998). Daily prediction of major stock indices from textual www data. HKIE Transactions, 5(3):151–156.

Zientala, P. (2017 (accessed 01.07.2018)). Mercari Price Suggestion - Exploratory Data Analysis.


9 Appendix

9.1 List of figures

1 Frequency of item condition classes
2 1st and 2nd hierarchical category levels (Zientala, 2018)
3 Most popular brands by category
4 Histogram of price and log(p + 1) transformed price
5 The frequency of most used n-grams in the description column.
6 One-hot encoding process.
7 An example of an architecture of the skip-gram model with vocabulary size of 10,000 and 300 hidden layer neurons (McCormick, 2016).
8 Creation of input pairs with c = 2
9 Estimated loss function of linear regression. Variable learning rate is portrayed on x-axis, while error on y-axis.
10 Estimated loss function of the neural network method. Learning rate and total number of neurons are on horizontal axes; error is on the vertical axis
11 Approximate loss function of gradient boosting regression trees. Learning rate and depth are on horizontal axes, error is on the vertical axis.
12 Estimated loss function of linear regression. Changing learning rate is oriented on the horizontal axis, error is on the vertical axis.
13 Estimated loss function of the neural network method. Learning rate and total number of neurons are on horizontal axes; error is on the vertical axis
14 Estimated loss function of the gradient boosting regression trees. Learning rate and maximum depth of a tree are located on horizontal axes, error is on the vertical axis


9.2 List of tables

1 Root mean squared logarithmic error of all methods and classifiers used in the thesis and their percentage change below
2 List of shortcuts and their meaning
3 Results of hyper-parameter tuning of linear regression with LDA features
4 Results of hyper-parameter tuning of neural network with LDA features
5 Results of hyper-parameter tuning of gradient boosting regression trees with LDA features
6 Results of hyper-parameter tuning of linear regression with word2vec features
7 Results of hyper-parameter tuning of neural network with word2vec features
8 Results of hyper-parameter tuning of gradient boosting regression trees with word2vec features


9.3 Results of hyper-parameter tuning

Abbreviation   Meaning
lr             learning rate
rmsle          root mean squared logarithmic error
tn             total number of neurons present in a network
batch          batch size used during the training
d              maximum depth to which a tree can grow

Table 2: List of shortcuts and their meaning

lr           rmsle
0.01492069   1.00633
0.01115367   1.01434
0.00625475   1.02063
0.01000346   1.02944
0.00737537   1.03613
0.00684066   1.03807
0.00899708   1.04037
0.00597488   1.04360
0.00485409   1.04515
0.00341282   1.05136


batch   tn    lr           rmsle
989     158   0.00285348   0.65088
997     116   0.00032620   0.65091
526     120   0.00407130   0.65113
721     79    0.00300375   0.65157
1017    114   0.00064243   0.65181
246     166   0.00615086   0.65186
65      131   0.00558855   0.65200
222     116   0.00166145   0.65216
422     137   0.00141355   0.65229
160     87    0.00585450   0.65247

Table 4: Results of hyper-parameter tuning of neural network with LDA features

lr           d   rmsle
0.03522222   7   0.62581
0.03297885   7   0.62582
0.03036866   7   0.62592
0.12037473   7   0.62602
0.02437607   7   0.62607
0.07064272   6   0.62670
0.06821768   6   0.62670
0.07418414   6   0.62692
0.04826818   6   0.62693


lr           rmsle
0.02336711   0.89882
0.02688977   0.90095
0.01830759   0.90656
0.01246978   0.91156
0.00416787   0.91557
0.00624824   0.91657
0.02348575   0.91771
0.02378419   0.91817
0.00798999   0.91915
0.00888997   0.91965

Table 6: Results of hyper-parameter tuning of linear regression with word2vec features

batch   tn     lr           rmsle
436     886    0.00057218   0.50942
735     909    0.00012267   0.51111
811     495    0.00052960   0.51477
508     856    0.00192106   0.51573
363     1126   0.00176479   0.51603
889     336    0.00079878   0.52058
724     382    0.00320994   0.52181
88      342    0.00130568   0.52713
869     184    0.00288858   0.53173
850     635    0.00653130   0.53205


lr           d   rmsle
0.24019185   7   0.52743
0.21890960   7   0.52836
0.17711830   7   0.52877
0.13864738   7   0.53107
0.26052340   6   0.53386
0.10432205   7   0.53613
0.37259123   5   0.54068
0.07915049   7   0.54094
0.07336917   7   0.54278
0.06773149   7   0.54417

Table 8: Results of hyper-parameter tuning of gradient boosting regression trees with word2vec features

