Strategies to Increase Accuracy in Text Classification

Master Thesis

Dennis Blommesteijn

University of Amsterdam, The Netherlands dennis@blommesteijn.com

ABSTRACT

Text classification via supervised learning involves various steps, from processing raw data and extracting features to training and validating classifiers. Within these steps, implementation decisions are critical to the resulting classifier accuracy. This paper reports on a study performed to determine the optimal parameter setup for reaching the highest possible accuracy when classifying multilingual (Dutch and English) user profiles, collected from social media, with job titles, with the goal of improving the matches between job vacancies and user profiles in a case for HR recruitment. The study includes experiments with eleven labels (job titles), a shifting pivot between test- and training-datasets, the use of combined n-grams, the feature extraction methods bag of words (BOW), word frequency or count (WC) and word importance (via TF-IDF), the use of word corpora enriched with POS tags, and the use of seven well-known classification algorithms: two Support Vector Machine (SVM) systems, two Naive-Bayes (NB) approaches, two Maximum-Entropy (ME) classifiers, and one Decision Tree (DT). Seven experiments were performed, with a combined total of about 1900 training and test runs. The dataset used consists of 95,000 profiles that were annotated with eleven job title labels, using a tool specially developed for this purpose.

We conclude that classifiers based on the Support Vector Machine (SVM) achieved the highest classification accuracy (up to 93% with 7 labels). The feature extraction methods of combined (1,2,3)-grams and word frequency/importance showed the highest accuracy gain among all classifiers. The most profound accuracy gain was achieved by excluding labels that contained features that were too generic. The SVM classifiers had already reached their accuracy ceiling after two thirds of the experiments. We believe this accuracy figure can be increased even further by additional study into annotating and removing non-specific information.

1. INTRODUCTION

Natural languages are ambiguous: a word can have multiple meanings and a meaning can be expressed by multiple words. Words in a specific order, in relation to other words, give context or meaning to a text. Programming languages have no such ambiguity, because they are built up from a strict syntax and grammar. They have evolved according to compiler/interpreter specifications derived from standards and agreements. A machine can directly interpret the commands and produce behavior accordingly. Natural language, however, has evolved by the laws of human biology, without formal representation or specification. Machines are unable to interpret natural language, because they cannot disambiguate words without tools.

In this paper we investigate the case of SocialReferral's job vacancy match engine. This engine is responsible for matching job vacancies to user profiles. These profiles are written by humans, collected from social media networks, and contain job resumes and curricula vitae [5]. The currently used match engine classifies user profiles with a list of predefined job titles and keywords. The absence or presence of these keywords in a user profile defines whether a profile belongs to a job title. The vacancies are pre-classified with job titles. The process of tweaking the set of keywords, and implementing the logic behind it, is time consuming, and its accuracy has never been measured. By introducing machine learning to classify user profiles with job titles, it is believed that more accurate, or at least more stable, matches can be made. Larger datasets can be used to identify relations. Moreover, there is no longer a need to manually tweak the keyword set every time new user profiles become available.

In order to use machine learning to classify user profiles with job titles, we require tools. The toolkit we have selected for this case is the Natural Language Toolkit (NLTK) [1, 7]. NLTK provides a complete package for text classification: it supports a complete pipeline from raw text parsing of documents to classifier metrics (so we can determine performance). NLTK is built primarily for supervised learning, a type of machine learning where a classifier or algorithm is trained with a training dataset to distinguish between concepts. In our case this dataset contains user profiles and the concepts are the job titles. To determine a starting point for classification performance, we reflected on examples from NLTK [1, 7]. These examples show it is possible to classify (annotated) texts or documents into two categories with up to 80% accuracy. The case of matching job titles to user profiles is suited for supervised learning, because the client has domain knowledge, and supervised learning enables the transfer of this knowledge from a human specialist to the user profile dataset. To facilitate this transfer, we have developed a tool [19]. This tool makes assigning job titles to user profiles at scale easier for users, because it visualizes user profiles in a web browser. The available user profiles are preloaded into a database. The specialists can search the database on keywords, identify the profiles to which a job title belongs, and annotate them accordingly. This process results in an annotated training- and test-set (a set where user profiles are labelled with job titles). Both datasets contain the same common denominators; however, one is used only for training, and the other for validating the correctness (accuracy) of the classification after training.

The pipeline, from collecting raw user profiles to training classifiers, contains many implementation decisions, which have a direct effect on accuracy. To determine strategies for making these decisions, we have to answer the following questions. Is it possible to distill an annotated dataset from user profiles for text classification, and is this dataset still usable after sanitizing and annotating, or is there not enough data left for training afterwards? Does the training-set discriminate well enough to classify unseen profiles that fit the trained labels? Is there a change in accuracy when limiting the training-set? What is the effect on accuracy of using different extraction methods to process raw data? What is the impact on accuracy of using different classifiers (and models)? How does the number of trained labels impact accuracy; do more labels automatically result in an accuracy decrease? And does additional information improve accuracy [4]? Besides the accuracy scores of classifiers, there is an aspect of practicality. Due to the size of the available dataset (600,000 user profiles), a classifier should have a manageable memory- and processor-utilization footprint. A single job title contains up to 10,000 user profiles, so training multiple job titles can rapidly increase memory usage by the classifier. It is our understanding that some classifiers will require too much memory to be usable.

To answer these questions, we need to perform numerous training and validation experiments. By changing the configuration parameters of NLTK [7] we can validate the hypotheses. To help run the experiments we use the NLTK-trainer [3]. The NLTK-trainer wraps around the NLTK interface and removes the need to program complex structures. The experiments first run a default configuration set, and then change parameter values to suit the hypotheses. The values used are described in section 4 (Research Methods), and section 5 covers the results, finishing with an overview of all decisions compared to the classification accuracy (section 5.7).


1.1 Background

In order to reason about natural language, one has to be able to answer questions like who, what, where, when and why for a given piece of text. Typically, when a person writes text, all his or her experience is used for text composition. This results in a text where commonalities (information that is commonly known when communicating to a target audience) are left out. When humans reason about written texts, they use their knowledge to place the text into their frame of reference or context. For example, the sentence below (Ex. 1) contains a sample of raw user profile data. When we read the text we can interpret the sentence and give meaning to the words. We are able to identify the job title software architect [1, pp. 27]; however, this is not the case with machines. Machines have no context, and therefore cannot interpret the words. Tools like NLTK provide the ability for machines to build up such a context (via training).

Example 1 profile headline: “Software Architect at the University of Amsterdam, Information Technology and Services” [5].

Since the early work of A. Turing [2] and the inception of natural language processing (NLP), the study of natural language between humans and machines, many techniques have been developed, among which new classifiers and classifier models. Before the classification process there is a phase called 'pre-processing'. In this phase raw text is processed into usable data. Excess information like style, markup and unprocessable characters is stripped from the raw data (sanitized). Next, this cleaned-up data is broken down into pieces (tokens); often words are used as tokens. These subsections become the input for classifiers, where they are individually paired (P) with the annotated information (labels). This step is known as feature extraction. Next, the extracted features are inserted into the classifier (via training). Each classifier measures the extracted features according to its internal logic, P(input, label) [1, pp. 243-253], resulting in a feature score for a given label. Consider the previous example (Ex. 1) again. The resulting structure after annotating, tokenizing and feature extraction is shown below (Ex. 2).

Example 2 sentence after pre-processing, feature extraction words: “(“Software”, “SA”), (“Architect”, “SA”), (“at”, “SA”), (“the”, “SA”), (“University”, “SA”), (“of”, “SA”), (“Amsterdam”, “SA”), (“Information”, “SA”), (“Technology”, “SA”), (“and”, “SA”), (“Services”, “SA”)” [5].

NOTE: SA is the label used in this example for Software Architect, and is paired with each tokenized word (the extracted features from the sentence).

After training, the classifier contains measurements of the features that are most common to a label. When performing text classification, the text to be classified is pre-processed (as described above), and per feature (word) the most likely candidate label is collected; the label with the highest score (depending on the internal logic of the classifier) is the label to which the text most likely belongs.
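To make this pipeline concrete, the sketch below pairs word-presence features with labels and trains NLTK's Naive-Bayes classifier on two toy profiles. The helper function and the miniature training data are illustrative assumptions, not the extraction code used in this study.

```python
# Minimal sketch of the train/classify cycle described above (toy data, not the thesis pipeline).
import nltk

def extract_features(text):
    # Bag-of-words presence features: each token is paired with the label during training.
    return {word.strip(",.").lower(): True for word in text.split()}

train_set = [
    (extract_features("Software Architect at the University of Amsterdam"), "SA"),
    (extract_features("Corporate Recruiter, Talent Sourcing Specialist"), "RC"),
]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(classifier.classify(extract_features("software architect and designer")))
```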

2. RELATED WORK

We are not aware of any studies of text classification targeted at the human resources industry for the purpose of matching vacancies. Many of the human resource studies we have found are aimed at advertisement, measuring employee satisfaction, and company performance. However, more general text classification studies have been done. M. Rogati et al. [27] studied feature extraction and measured the effectiveness (accuracy) of several classification models. Among the popular algorithms, Naive-Bayes (NB), support vector machine (SVM) and k-nearest neighbor (kNN) were used. In our experiment we have selected among the same classifiers.

A. McCallum et al. [25] describe text classification targeted at hierarchical categories. They performed experiments on shrinkage with the goal of achieving higher classification accuracy. Another study into hierarchical text classification, by A. Sun et al. [26], identified a new measure based on category distance. There are two types of relations. In parent-to-child relations, a common ancestor sub-categorizes children that belong to the same branch: software engineers, for example, can be sub-categorized into programmers and web developers. In parent-to-parent relations, software engineers are in a different branch than business analysts. Both articles [25, 26] describe a hierarchical category structure, which seems to have merit. Arguably job titles can be placed in such hierarchies: it should be possible to classify tech-related jobs like software engineer and web developer into a single branch with a common ancestor. We have not pursued the use of hierarchies; however, we have identified the base branches of a few selected job titles within the available user profiles [5].

Other studies have used external knowledge from sources like WordNet [15, 16, 17] and Wikipedia to build up a knowledge base that can be used for classification. L. Ratinov [4] argues that learning via (semi-)supervised learning is an established approach, but proposes the use of external resources. In our interpretation this notion is carried by G.A. Miller [21] as well, in his studies of WordNet's lexical inheritance system. Both studies are about resolving information with external resources and trying to understand context by formalizing concepts (WordNet). We can establish that external knowledge can enrich the classification process; however, how this affects classification accuracy is unclear. In our study we have investigated adding external information: we use word tags (noun, verb, adjective, etc.) to enrich the tokenized words in the extracted features.

3. DATASET

The data used, the extracted user profiles, is available via CSV export [5]. This CSV file contains 600,000 user profiles. Prior to export, all markup and metadata were removed from all profiles. From a privacy viewpoint, all names, companies and specific locations in the examples are obscured (not in the data itself). Each profile contains two fields of data, a title and a data field. The title field contains the profile headline and gives a general description of the information contained in the record. It typically contains job titles, fields of interest or education, as shown below (Ex. 3). Moreover, the profiles contain a mixture of both English and Dutch words. These mixes go deeper than the profile level: some profiles vary in language per paragraph and per sentence.

Example 3 profiles title field: “Corporate Recruiter, Talent Sourcing Specialist”, “Owner”, “Eigenaresse”, “Managing Director, recruits marketing, online marketing & e-commerce professionals, managers”, “<empty>”, “Regional HR Manager”, “Sales representative”, “Eigenaar, Senior Corporate Recruiter”, “Owner”.

The data field contains the actual resume or curriculum vitae, again without markup and metadata markers. Due to the large size of the content, some fields are shown truncated in the example below (Ex. 4).

Example 4 profiles data field: “Founder and Owner Architect architect designer internship internship architect at ABC-DEF company Architecture & Planning”, “Regional HR Manager Regional HR Advisor Graduate Recruitment Officer Responsible for Graduate Recruitment strategy and employer branding initiatives for...”, “Consultant ABC-DEF |Adviseur| Coach|Sportief|Bourgondisch ...”

4. RESEARCH METHODS

NLTK [7], the Natural Language Toolkit, is written in Python and provides tools for automated processing of raw and annotated texts. NLTK is a framework that requires some programming to link all internal methods into a fully working pipeline for text classification. Because this programming can be cumbersome, we opted to use the NLTK-trainer [3] (as described in section 1). This trainer wraps the internal abstraction and is controlled via a command line interface (CLI). To increase the total number of available classifiers, external NLTK hooks into Scikit-learn [6] were used.


To improve the data processing pipeline from raw user profiles to useable data for classification, we have developed a tool [19]. This tool includes a database where the raw data is stored. Pre-processing (described in section 4.1), uses the information from this database and produces processed profiles, which are stored into the same database. The parsing of this data is visualized to the user, so the correctness of the parse can be evaluated. Moreover, the processed profiles need to be annotated before classifiers can be trained with the data (described in section 4.2). Visualization helps here to filter and search the available user profiles, for annotation.

NLTK collects data via its corpus readers. In our approach we have used a separate database to store user profiles. NLTK has no means of extracting data from databases, therefore the data needs to be extracted and formatted (described in section 4.3) manually. Thereafter, the use of the extracted features from the corpus readers and determining classification parameters for the experiments is described in sections 4.3 onwards.

4.1 PRE-PROCESSING

The exported user profiles (from the CSV file) [5] are first imported into a database [19]. This enables more data flexibility and easier annotating from the user's viewpoint. The following sections describe the resolution strategy (in order of process execution). At each step the methods used are described and reasoned about.

During the process there is mention of taggers. These mechanisms place certain requirements on the various stages within the process, which need explanation. At each stage these requirements are explained; the purpose of the taggers themselves is explained in line with the execution order, at the end.

4.1.1 Sanitizing Input

Sanitizing is the parsing of raw data from user profiles, correcting encoding and markup. Removing markup can potentially strip sentence markers and special characters. This will result in an incorrect identification of sentences, and will have an impact on taggers.

All fields within the user profiles are encoded as Unicode strings. Unicode contains many unusable and unnecessary characters when processing English and Dutch words. Transliteration has been used to transform characters to known ASCII (English). All untranslatable characters are transformed into token delimiters (a tab, carriage return or other stray separator is automatically corrected). Characters with accents are translated into their vowel equivalent (without the accents). Dutch contains slightly more characters than fit within standard ASCII; it contains additional accented characters, which are stripped.
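A minimal transliteration sketch using only the Python standard library; the thesis uses a dedicated Transliterate library, so this is only an approximation of the same idea.

```python
# Decompose accented characters and drop the combining marks, keeping plain ASCII.
import unicodedata

def to_ascii(text):
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("Sales Coördinator, categorieën"))  # -> "Sales Coordinator, categorieen"
```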

4.1.2 Tokenization of Words and Sentences

In the tokenization step, words are often used as tokens; however, words are not the most atomic element in a sentence or text. Words can be broken down further into characters (computer science perspective) or lemmas (psycholinguistics), among others [1, pp. 67-70]. However, words [15, 16, 17] carry more context than individual characters or lemmas, and are more specific than whole sentences; therefore we consider tokenization beyond the word level out of the scope of this study. Section 4.3 has more detail about the available extraction options and the recombining of words, which can potentially improve accuracy.

Tokenizing sentences is done by splitting raw profile data on sentence delimiters (dot, exclamation mark and question mark). Care must be taken not to split sentences prematurely: abbreviations and named entities (NE) can contain the same delimiters. The example sentence (Ex. 5) shows such delimiters.

Example 5 sentence delimiter: “ABC Product Manager ProdA, ProdB, ProdC and ProdD Sales Coördinator I'm responsible for all of the sales reporting, dealer bonus payments and (Sales)forecasts for ABC D.E.F. GHI ABC Professional 12 and JKL.”

The example sentence (Ex. 5) reveals 'D.E.F.', which can be classified as an NE and abbreviation, and can potentially split the sentence prematurely. Typical sentence splits are made with the regular expression (1) described below, which will result in a premature split because it cannot distinguish between the two.

REGEX 1 Sentence split: `/[\.\?\!][\s]+/`

Next step is to tokenize a sentence into words. Again this is done via regular expressions (2), shown below.

REGEX 2 Word split: `/['-][\W]+|[^\w'-][\W|^0-9_]*/`

Typically, sentences are split on space delimiters to extract words. Sanitizing should have replaced all special characters; however, there are some characters that fit ASCII and are more difficult to process. Consider contractions: these should not be split, but (curly) braces, brackets, parentheses etc. should. An example sentence tokenized into words is shown below (Ex. 6).

Example 6 Tokenized sentences and words: “ABC, Product, Manager, ProdA, ProdB, ProdC, and, ProdD, Sales, Coordinator, I'm, responsible, for, all, of, the, sales, reporting, dealer, bonus, payments, and, Sales, forecasts, for, ABC, DEF, GHI, ABC, Professional, and, JKL”

For now we have opted not to resolve contractions but to treat them as single words. Building a special case for contractions could lead to incorrect assumptions through ambiguity later on.
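A small sketch of the two-step split, reusing REGEX 1 and REGEX 2 from above with Python's re module; as the note below warns, these expressions are approximations and mishandle cases such as 'D.E.F.'.

```python
import re

SENTENCE_SPLIT = re.compile(r"[\.\?\!][\s]+")             # REGEX 1
WORD_SPLIT = re.compile(r"['-][\W]+|[^\w'-][\W|^0-9_]*")  # REGEX 2

def tokenize(profile_text):
    sentences = [s for s in SENTENCE_SPLIT.split(profile_text) if s]
    # Contractions such as I'm are kept as single tokens, as chosen above.
    return [[w for w in WORD_SPLIT.split(s) if w] for s in sentences]

print(tokenize("I'm responsible for sales reporting, dealer bonus payments and (Sales)forecasts."))
```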

NOTE: tokenization turns out to be more difficult than first expected; no single solution works well in all cases [1, pp. 111].

4.1.3 Language Detection

Language detection can be helpful for information extraction. In our case we use language detection because it is required by the taggers. However, having this extra control measure in place requires some extra effort. Revisiting the sanitizing step (section 4.1.1): Dutch words were transliterated at sanitation. If there were no other operations in the process, standard ASCII would have worked fine; for tagging, however, we require language identification per sentence, and ultimately a dictionary that can look up each word per language. Most dictionaries cannot handle words stripped of their accents, because these are, strictly speaking, misspelled words. By using the word suggestion feature of our selected dictionary (Hunspell [8, 11]), we were able to restore single vowel characters to their original form, and thus solve some small spelling mistakes in the process (Ex. 7).

Example 7 Automatic spelling correction: the Dutch word for categories, `categorieën`, is transliterated into `categorieen`. Hunspell's word suggestions give (among others) `categorieën` as an option. Transliterating that option yields the input again, so we can argue that the option is correct.

NOTE: we found that most Dutch words in the user profiles were already missing their accents [5], arguably because people in The Netherlands use a US-International keyboard layout, and placing accents requires extra effort.

From a licensing and availability viewpoint, we have selected the Hunspell library [8], which is used in many public projects (among them the OpenOffice.org text editor). We are therefore able to use all dictionaries available to those projects. For our purposes we have used the American English, British English and Dutch dictionaries [9, 10, 11]. Those dictionaries contain only single words. Detecting the language of a sentence is done by counting dictionary hits per language for that sentence; the most frequent language is taken as the sentence language. This way of looking up language allows us to handle language mixes at sentence, paragraph and profile level. Unfortunately, the user profiles show that such mixes occur very frequently. Automatically detecting spelling mistakes, however, is impossible: no distinction can be made between foreign words (outside dictionary scope), undetected NEs, or truly misspelled words.
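A sketch of the dictionary-voting approach to sentence language detection, assuming the pyhunspell binding and locally installed dictionary files; the paths and the tie-breaking are assumptions, not the exact setup of the tool [19].

```python
import hunspell  # pyhunspell binding around the Hunspell library [8]

DICTS = {
    "en": hunspell.HunSpell("/usr/share/hunspell/en_US.dic", "/usr/share/hunspell/en_US.aff"),
    "nl": hunspell.HunSpell("/usr/share/hunspell/nl_NL.dic", "/usr/share/hunspell/nl_NL.aff"),
}

def detect_language(words):
    # Count, per language, how many tokens the dictionary accepts; the most hits wins.
    votes = {lang: sum(1 for w in words if dic.spell(w)) for lang, dic in DICTS.items()}
    return max(votes, key=votes.get)

print(detect_language(["De", "Nederlandse", "tagger", "gebruikt", "de", "corpus"]))  # likely "nl"
```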

4.1.4 Sentence and Word Tagging

A tagger adds extra information to a sentence of tokenized words. Sentences are required to determine context of the words in it, and tag accordingly. The part of speech (POS) tagger adds word types, listed below (List 1) [1, pp. 187]. This list shows simplified tags.


List 1 POS tags: • ADJ – Adjective; • ADV – Adverb; • CNJ – Conjunction; • DET – Determiner; • EX – Existential; • FW – Foreign word; • MOD – Modal verb; • N – Noun; • NP – Proper noun; • NUM – Number; • PRO – Pronoun; • P – Preposition; • TO – the word 'to'; • UH – Interjection; • V – Verb; • VD – Verb past tense; • VG – Verb present participle; • VN – Verb past participle; • WH – Wh determiner; • NN – Default (assumed noun).

Every language has its own words and grammar; therefore taggers must be targeted at a specific language. The tagging mechanism operates on sentences, which is why the sentence language is required (section 4.1.3, language detection). To create a tagger for each language, POS-annotated corpora are required first. A tagger is trained with the annotated corpora [6], using the sequential classification (or joint classification) strategy [3, 7]. When a non-fitting word is found, the tagger strategy backs off to a less specific tagger (Strategy 1, shown below; a small training sketch follows the list). When a word is eventually unknown to all taggers, it receives the default tag NN. Arguably, a noun is the most common unknown word type in a language [21], because nouns describe objects in a given space, and new ones are frequently invented to describe new things.

Strategy 1 Sequential tagger strategy:

• DefaultTagger (fallback tagger with basic grammar rules);
• AffixTagger (backs off to DefaultTagger, tag=NN);
• UnigramTagger (backs off to AffixTagger);
• BigramTagger (backs off to UnigramTagger);
• TrigramTagger (backs off to BigramTagger);
• BrillTagger.
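A sketch of the back-off chain in Strategy 1, trained on the Treebank sample that ships with NLTK (the thesis trains on the full Treebank and Alpino corpora, and adds a Brill tagger on top, which is omitted here).

```python
import nltk
from nltk.corpus import treebank  # requires nltk.download('treebank')

train_sents = treebank.tagged_sents()

default = nltk.DefaultTagger("NN")                        # unknown words default to noun
affix   = nltk.AffixTagger(train_sents, backoff=default)  # guesses from word affixes
uni     = nltk.UnigramTagger(train_sents, backoff=affix)
bi      = nltk.BigramTagger(train_sents, backoff=uni)
tri     = nltk.TrigramTagger(train_sents, backoff=bi)

# Note: this sample yields Penn Treebank tags (DT, NNP, ...), not the simplified tags of List 1.
print(tri.tag("The English tagger uses the treebank corpus".split()))
```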

The English tagger was trained with the Treebank corpus [1, pp. 46; 13] and yields 92% accuracy. We generalize English to one language: there are no separate corpora available for American and British English, only a single generic one. The Dutch tagger was trained with the Alpino corpus [14] and yields 89% accuracy. As with English, if other variants of Dutch were used, they would all be generalized to a single generic one as well.

The examples below (Ex. 8) show results of both taggers, tagging a sentence with tokenized words.

Example 8 POS tag English: “The English tagger uses the treebank corpus.” -> "The, DET, English, NP, tagger, N, uses, V, the, DET, treebank, N, corpus, NP".

Example 9 POS tag Dutch: “De Nederlandse tagger gebruikt de Alpino corpus.” -> “De, DET, Nederlandse, ADJ, tagger, N, gebruikt, V, de, DET, Alpino, N, corpus, N”.

4.2 ANNOTATING

The processed user profiles are stored in the database and are visualized via a web page, in an attempt to improve usability. Typically, user profiles contain multiple job titles, and the words describing these jobs are ambiguous; human intervention is required to resolve this ambiguity. For example, searching on the title “developer” can return user profiles for software, real estate or project developers. By excluding more keywords from the search, the profile selection is narrowed to capture only profiles that are specific to a single job title. When the selected profiles are satisfactory for the job title, the domain expert (recruiter) can link them together, annotating them.

By using this tool, a recruiter has annotated the eleven most common job titles for the available user profiles. This produced about 95,000 annotated profiles. The labelled job titles are the following.

• Project Manager (3): 13,683;
• Account Manager (4): 14,966;
• Product Manager (5): 2,817;
• Software Developer (6): 4,797;
• Business Developer (7): 13,815;
• Program Manager (9): 9,962;
• Recruitment Consultant (12): 8,377;
• Sales Representative (24): 6,306;
• Financial Controller (29): 1,784;
• Corporate Account Manager (36): 9,765;
• Marketeer (70): 8,356.

The number between parentheses after each label is derived from the database ID, and helps shorten the notation in the rest of this paper. The trailing number is the number of profiles per job title.

4.3 FEATURE EXTRACTION

The information that is gathered from pre-processing, and annotating is stored into the database of the tool [19]. NLTK provides many features to automate feature extraction from plaintext, and annotated corpora [7]. In order to use the NLTK extraction methods, processed profiles need to be extracted from the database, and stored in an NLTK readable corpus format.

Three fields need to be extracted from the database: job title, tokenized words and POS tags. NLTK has two types of categorized corpus reader (mechanisms that read corpus files per category).

The 'CategorizedPlaintextCorpusReader' can read plaintext corpora and extracts words split by a word-spacing delimiter; the 'CategorizedTaggedCorpusReader' can read and extract plaintext words linked to tags, using a word-spacing and a word-tag delimiter. The examples below (Ex. 10, Ex. 11) show the file encoding of words and of words with tags.

Example 10 plaintext corpus: “This is an example. The corpus with words delimited by space and sentences by dots.”

Example 11 tagged corpus: “This:DET is:V an:DT example:N. The:DET corpus:NP with:P words:N delimited:VN by:P space:N and:CNJ sentences:N by:P dots:N.”

NOTE: the list of tags can be found in section 4.1.4.

Each profile is exported according to one of these criteria (only words, or words with tags), and sentence delimiters are placed after each sentence. The profiles are saved as individual text files in the category (job title) directory they belong to, as listed below; a small reader sketch follows the list.

• Project Manager (3) → /path/to/nltk/corpora/user_profiles/3/;
• Account Manager (4) → /path/to/nltk/corpora/user_profiles/4/.
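A sketch of reading the exported profiles back with NLTK's categorized corpus reader; the corpus root and file pattern are assumptions that match the directory layout shown above.

```python
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

reader = CategorizedPlaintextCorpusReader(
    "/path/to/nltk/corpora/user_profiles",  # corpus root
    r"\d+/.*\.txt",                         # one plaintext file per profile
    cat_pattern=r"(\d+)/.*",                # directory name (job-title id) is the category
)

print(reader.categories())                  # e.g. ['3', '4', ...]
print(reader.words(categories=["3"])[:20])  # tokens from Project Manager (3) profiles
```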

4.3.1 NLTK Corpus Feature Extraction

NLTK reads the corpora (user profiles) with the categorized plaintext and tagged corpus readers from the directories where they are stored. We have selected three ways of scoring extracted features. The first is the bag of words (BOW) approach, where the words in a given category are read and stored; only the presence or absence of these words per category is exposed to the classifier, so if a word is used more often in a category, this count is missed. Word frequency or count (WC) is the second method used: this measure counts the occurrence of words for each category, which reduces the influence of idiosyncrasies. However, frequently used words and intentional abuse could still result in incorrect scores. To counteract this, a word importance (WI) measure is used. WI assigns a score of importance to each word depending on its occurrence and the size of the document (profile) it is in. We have used term frequency-inverse document frequency (TF-IDF) [12] to measure word importance. A term (t) in our case is a word that occurs with frequency (f) in a document (d); the inverse document frequency relates the number of documents in which the term occurs to the total number of documents. The result is normalized and stored with the pair P(input, label).
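For reference, a common TF-IDF formulation that matches the description above; the exact variant used by the implementation (smoothing, normalization) may differ.

```latex
\mathrm{tf\mbox{-}idf}(t, d) \;=\; \mathrm{tf}(t, d)\cdot \log\frac{N}{\lvert\{\, d' \in D : t \in d' \,\}\rvert}
```

Here N is the total number of documents (profiles) and D is the document collection.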

4.3.2 Maximum Features

The maximum features limit caps the number of high-scoring features that are trained into the classifier. Setting this value to ten means that only the ten best-scoring features are trained and the others are ignored. By default a maximum of 1000 features is used, which keeps memory consumption during classifier training low. Changing this number can potentially influence classifier accuracy.
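One simple way to apply such a cut-off is to keep only the N best-scoring features before training; the frequency-based scoring below is an illustrative assumption, as nltk-trainer scores candidate features with its own measures.

```python
import nltk

def top_features(all_words, max_feats=1000):
    # Rank candidate word features and keep only the max_feats best-scoring ones.
    freq = nltk.FreqDist(w.lower() for w in all_words)
    return {word for word, _ in freq.most_common(max_feats)}

vocab = top_features(["software", "architect", "software", "sales", "recruiter"], max_feats=3)
print(vocab)  # only these words would be exposed to the classifier as features
```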

4.3.3 (n) Grams

An n-gram is a sequence of n items (words, in our case) that is grouped as one feature [1, pp. 55]; classifiers treat an extracted n-gram feature as a single feature. The combined words can be more significant for a given category than the individual words. Moreover, it is possible to use a combination of, for example, uni-grams and bi-grams as input, although this costs more processing cycles. N-gram examples (Ex. 12) are shown below, followed by a small extraction sketch.

Example 12 N-gram extracted word features: Original sentence: “software engineering specialist”; Unigram: ( 'software' );

Bigram: ( ('software', 'engineering') );

Trigram: ( ('software', 'engineering', 'specialist') ).
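The sketch below extracts combined (1,2,3)-gram features with nltk.util.ngrams; each joined n-gram becomes one entry in the feature dictionary, so the classifier treats it as a single feature.

```python
from nltk.util import ngrams

def ngram_features(tokens, sizes=(1, 2, 3)):
    feats = {}
    for n in sizes:
        for gram in ngrams(tokens, n):
            feats[" ".join(gram)] = True  # presence of the n-gram as one feature
    return feats

print(ngram_features(["software", "engineering", "specialist"]))
```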

4.4 DATA FRAGMENTING

The features extracted from the categorized user profiles result in a dataset. In the NLTK-trainer a fraction pivot is assigned to split the extracted data into two sets, a training and a test dataset. A fraction of 0.2 uses 20% of the data for training and 80% for testing. Using 100% of the data for training will result in higher classification scores, because the classifier is then evaluated on data it has already seen; the true capabilities of such a classifier cannot be determined when testing on training data [1, pp. 218].
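A minimal sketch of the fraction pivot (here 0.7 keeps 70% of the labelled feature sets for training and evaluates on the remaining 30%); as described above, the NLTK-trainer performs this split itself via its fraction parameter.

```python
def split_by_fraction(featuresets, fraction=0.7):
    # The first `fraction` of the labelled feature sets is used for training, the rest for testing.
    cutoff = int(len(featuresets) * fraction)
    return featuresets[:cutoff], featuresets[cutoff:]

# Hypothetical usage:
# train_set, test_set = split_by_fraction(labelled_featuresets, fraction=0.7)
# classifier = nltk.NaiveBayesClassifier.train(train_set)
# print(nltk.classify.accuracy(classifier, test_set))
```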

4.5 HIERARCHICAL CATEGORIZATION

While annotating user profiles (section 4.2), overlap appeared in the job titles assigned to profiles. This happened when one profile was annotated with multiple job titles. From the specialist's (recruiter's) perspective this grouping was correct: the job titles in question were closely related to each other, and some user profiles could describe both. To determine the extent of the overlap, a database query was used to sum all groups of multiply-assigned profiles. Two groups with high overlap were clearly identified (out of 95,000 annotated profiles).

• Group A – Project Manager (3), Program Manager (9), about 9500 profiles;

• Group B – Account Manager (4), Business Developer (7), Sales Representative (24), Corporate Account Manager (36), about 6000 profiles.

When training with extracted features that are too generic, no distinctions can be made between two or more categories. Groups A and B have plenty of overlap, while the remaining job titles 3, 4, 70, 6, 5, 29 and 12 do not. Reflecting on the related work (section 2) by A. Sun et al. [26], hierarchies in categories are clearly present in our groups A and B: these groups show job title branches with a common ancestor. Distinguishing between these sub-categories requires other strategies than those currently in place.

4.6 CLASSIFICATION ALGORITHMS

The classifier in NLP decides which inputs are significant for a label; in our case, which words (with or without POS tags) are significant for which job titles. Each classifier algorithm has its own method of predicting the pair (P) of input and label [1, pp. 243-253] (as stated in section 1.1). We started by compiling a list of classifiers that are available in, and through, NLTK [3, 7]. These classifiers were given the task of training on a given dataset with a default parameter configuration, as a test experiment. The experiment was set up to get a bearing on classifier training runtime, memory and processor consumption. The results are listed per classifier in Table 1 below. The abbreviations are used later on in this paper.

NOTE: For our experiment, and further experiments we have used an Intel Xeon 3.4GHz Hyper-threaded Quad-core (E3-1245V2) with 16Gb internal memory.

Table 1: Classifiers running a generic test-set.

Classifier | Detail | Model | Abbreviation
BernoulliNB | Memory usage too high | NB | BNB
BFGS | Memory usage too high | QN | BFGS
DecisionTree | Working | ET | DT
ExtraTreesClassifier | Memory usage too high | ET | ETC
GaussianNB | Memory usage too high | NB | GNB
GradientBoostingClassifier | Memory usage too high | ET | GBC
KNeighborsClassifier | Memory usage too high | LM | KNN
LBFGSB | Memory usage too high | QN | LBFGSB
LinearSVC | Working | SVM | LSVC
LogisticRegression | Working | LM | LR
Maxent CG | Unstable | ME | CG
Maxent GIS | Working | ME | GIS
Maxent IIS | Working | ME | IIS
MultinomialNB | Working | NB | MNB
NaiveBayes | Working | NB | NB
Nelder-Mead | Memory usage too high | NM | NM
NuSVC | Memory usage too high | SVM | NSVC
Powell | Memory usage too high | P | P
RandomForestClassifier | Unstable | ET | RFC
SVC | Unstable | SVM | SVC

Only the working classifiers are used for the experiments later in the process. The list below shows the classifier models' full names.

• ET – Extra Tree (or Decision Tree) [22];
• LM – Linear Model [23];
• ME – Maximum Entropy;
• NB – Naive-Bayes [24];
• NM – Nelder-Mead;
• P – Powell;
• QN – Quasi-Newton Method;
• SVM – Support Vector Machine [20].

4.6.1 Determining Accuracy Score

One of the simplest measures of classifier performance is accuracy. Accuracy is determined by the fraction of correctly labeled items in the test dataset. With a total of 500 items of which 430 are labeled correctly, this results in 430/500 = 0.86, i.e. 86% accuracy (or 0.86 out of 1).

Other methods for measuring document relevance exist. Precision and recall are used in situations where search tasks are performed: precision indicates how many of the identified items were relevant, and recall indicates how many of the relevant items were identified. A combined precision and recall score is known as the F-measure (F-score) [1, pp. 239]. By default NLTK [3, 7] calculates the accuracy over all used labels, which we found more practical to use.
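For completeness, the standard definitions of the measures named above, with TP, FP and FN denoting true positives, false positives and false negatives.

```latex
\mathrm{accuracy} = \frac{\text{correctly labeled}}{\text{total}}, \qquad
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F = \frac{2PR}{P + R}
```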

4.7 EXPERIMENT SETUP

With the results from the test experiment (section 4.6) we can devise a plan to test other parameter configurations. The parameters and their use have been discussed in the previous sections. The available parameters with their option values are listed below.

• Labels: 3, 4, 5, 6, 7, 9, 12, 24, 29, 36 and 70;
• Fractions: float from 0.2 to 0.8;
• (n) grams: 1, 2, 3, (1,2), (2,3), (1,3) and (1,2,3);
• Value types: BOW, WC and WI;
• Corpus: words, words+POS;
• Classifier: DT, LSVC, LR, NB, GIS, IIS and MNB;
• Maximum features: integer ranging from 0 to 1000*.

*= maximum features are limited to 1000, because of memory utilization.

Training all classifiers with the combined list of parameters would result in 11 (label combinations) x 7 (fractions) x 7 (n-grams) x 3 (value types) x 7 (classifiers) = 10290 experiments. The typical runtime for a single experiment ranges between 20 and 60 minutes. It is therefore not practical to run all combinations; we need a divide-and-conquer approach, while keeping track of possible local maxima [18].

4.7.1 Parameter Reduction Strategies

We target the parameter option values one at a time, reducing in turn the labels, fractions, n-grams, value types, corpus and maximum features. When running an experiment, the other parameter options are given fixed values rather than their whole range. For example: if fractions show the highest accuracy at option value 0.5, we can fix that value in the rest of the experiments. However, we need to make sure we are not targeting a local maximum. As we progress through the parameter options, we analyze the results and adjust the fixed option values accordingly.
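A sketch of this one-parameter-at-a-time reduction: sweep a single parameter while the others stay fixed, keep the best value, and move on to the next parameter. The function names and the parameter grid are placeholders, not the actual experiment harness.

```python
def train_and_evaluate(**params):
    """Placeholder: train a classifier with these parameters and return its accuracy."""
    raise NotImplementedError

def reduce_parameters(grid, fixed):
    best = dict(fixed)
    for name, options in grid.items():
        # Evaluate only this parameter's options; all other parameters keep their current best value.
        scores = {value: train_and_evaluate(**{**best, name: value}) for value in options}
        best[name] = max(scores, key=scores.get)  # note: this may still settle on a local maximum
    return best

# Hypothetical usage:
# grid = {"fraction": [0.2, 0.5, 0.7], "ngrams": [(1,), (1, 2), (1, 2, 3)]}
# best = reduce_parameters(grid, fixed={"classifier": "LinearSVC", "value_type": "bow"})
```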

5. RESULTS

In this section we describe the performed experiments in detail. The experiments resulted in seven result-sets, which are presented as well, chronologically according to the reduction strategy (section 4.7.1). The selection of usable classifiers is not covered here, since we cannot determine the actual cause of most failures (those results can be found in Table 1, section 4.6).

5.1 EXPERIMENT 1,2: (UN)GROUPED LABELS

In section 4.5 (hierarchical categorization) it appeared the total group of eleven labels or categories could be reduced by abstracting two groups of job titles. Group A job labels: (3, 9) and Group B (4, 7, 24, 36). However, we need to determine first if this theory holds.

Experiment 1 contains only a single label from group A (3) and from group B (4), plus the remaining labels; this should result in a higher accuracy because there is less overlap in features shared between the labels. Experiment 2 contains all categories from groups A and B, which have more overlap in shared features; therefore an overall lower accuracy should be achieved compared to experiment 1. Furthermore, the parameter values for fractions have been given a larger range: because the effect of fractions is unknown, we want to make sure no local maximum is reached this early in the process. This holds for the number of labels as well; the label group is increased in each iteration with the next label. The parameters for feature extraction (corpus, n-grams, value types) are fixed to our best estimated average; we cover them at a later stage.

Experiment 1 specific grouped labels
• Labels: 3, 4, 70, 6, 5, 29, 12;
• Fractions: 0.5, 0.6, 0.7, 0.8;
• (n) grams: 2-grams;
• Value types: BOW;
• Corpus: words;
• Classifier: DT, LSVC, LR, NB, GIS, IIS and MNB;
• Maximum features: 1000.

Experiment 2 generic ungrouped labels
• Labels: 3, 9, 4, 7, 24, 36;
• Fractions: 0.5, 0.6, 0.7, 0.8;
• (n) grams: 2-grams;
• Value types: BOW;
• Corpus: words;
• Classifier: DT, LSVC, LR, NB, GIS, IIS and MNB;
• Maximum features: 1000.

The results of experiments 1 and 2 are plotted in the graph (Fig. 1) below. SPC-Average is the result of experiment 1, the specific label set. GEN-Average is the average result of experiment 2, the generic label set. Each accuracy point plotted per n-labels per experiment is calculated by averaging accuracy of (4 x 7) = 28 classification and evaluation runs. More detail can be found in the table (Table 2) below.

Figure 1: Experiment 1 and 2 grouped vs. ungrouped labels.

Table 2: experiment 1 and 2 grouped vs. ungrouped labels.

(n)labels | GEN-Average | GEN-Best | GEN-Worst | SPC-Average | SPC-Best | SPC-Worst
7 | n/a | n/a | n/a | 0.8286 | 0.7889 | 0.8636
6 | 0.3513 | 0.2905 | 0.3774 | 0.8514 | 0.7983 | 0.8887
5 | 0.4148 | 0.3491 | 0.4411 | 0.8515 | 0.8019 | 0.8884
4 | 0.4672 | 0.3959 | 0.4946 | 0.8722 | 0.8400 | 0.8994
3 | 0.6618 | 0.6302 | 0.6760 | 0.8793 | 0.8509 | 0.9066
2 | 0.5384 | 0.4894 | 0.5692 | 0.9289 | 0.9066 | 0.9433

5.2 EXPERIMENT 3: FRACTIONS

As described in section 4.4, the fraction determines the split pivot between the percentages of training and test data. Some classifiers require more training data, others less. Experiment 3 should resolve how the classifiers respond to a shift in training data.

NOTE: we have selected the grouped labels from experiment 1, because they showed the highest accuracy scores.

Experiment 3 fractions
• Labels: 3, 4, 70, 6, 5, 29, 12;
• Fractions: 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8;
• (n) grams: 2-grams;
• Value types: BOW;
• Corpus: words;
• Classifier: DT, LSVC, LR, NB, GIS, IIS and MNB;
• Maximum features: 1000.

The results for experiment 3 are shown below (Table 3). The maximum difference between fractions is about 1.8%, which seems quite large. After examining all results per classifier and fraction (Table 4), it appears there is a large difference between the low end (0.2) and the high end (0.8). Fraction value 0.7 gives the highest accuracy across the classifiers, except for the Decision Tree classifier, where the offset is 0.0008 (less than 0.1%), which is statistically insignificant.

Table 3: experiment 3 fraction accuracy delta. Classifiers and the accuracy difference (delta) between fractions.

(n) labels | DecisionTree | LinearSVC | LogisticRegression | Maxent GIS | Maxent IIS | MultinomialNB | NaiveBayes | Max-Delta
2 | 0.0092 | 0.0084 | 0.0058 | 0.0086 | 0.0076 | 0.0066 | 0.0079 | 0.0092
3 | 0.0102 | 0.0139 | 0.0106 | 0.0092 | 0.0106 | 0.0050 | 0.0078 | 0.0139
4 | 0.0135 | 0.0141 | 0.0106 | 0.0092 | 0.0113 | 0.0050 | 0.0072 | 0.0141
5 | 0.0141 | 0.0139 | 0.0121 | 0.0078 | 0.0065 | 0.0041 | 0.0075 | 0.0141
6 | 0.0144 | 0.0153 | 0.0125 | 0.0085 | 0.0136 | 0.0038 | 0.0076 | 0.0153
7 | 0.0144 | 0.0182 | 0.0130 | 0.0058 | 0.0133 | 0.0038 | 0.0038 | 0.0182


Table 4: experiment 3 fractions full result-set (7 labels, 2-grams, BOW): accuracy per classifier and fraction.

Fraction | DecisionTree | LinearSVC | LogisticRegression | Maxent GIS | Maxent IIS | MultinomialNB | NaiveBayes
0.2 | 0.8128 | 0.8406 | 0.8507 | 0.7842 | 0.8305 | 0.8171 | 0.8050
0.3 | 0.8169 | 0.8487 | 0.8558 | 0.7863 | 0.8343 | 0.8168 | 0.8058
0.4 | 0.8234 | 0.8543 | 0.8595 | 0.7887 | 0.8387 | 0.8182 | 0.8083
0.5 | 0.8272 | 0.8550 | 0.8595 | 0.7895 | 0.8396 | 0.8187 | 0.8082
0.6 | 0.8265 | 0.8562 | 0.8618 | 0.7889 | 0.8429 | 0.8178 | 0.8071
0.7 | 0.8264 | 0.8587 | 0.8636 | 0.7900 | 0.8438 | 0.8177 | 0.8088
0.8 | 0.8228 | 0.8564 | 0.8616 | 0.7893 | 0.8415 | 0.8149 | 0.8071

5.3 EXPERIMENT 4: N-GRAMS

Extracting corpus words with n-grams groups the extracted features, which could result in a higher accuracy (as described in section 4.3.3). For n-grams the same strategy applies as for labels and fractions: test the optimal n-gram combination for all given classifiers. The parameter options for labels and fractions are limited according to the results from the previous experiments.

NOTE: n-grams are not bound to a single size; the combination (1,2) collects features with unigrams and bigrams, (2,3) collects features with bigrams and trigrams, etc.

Experiment 4 n-grams
• Labels: 3, 4, 70, 6, 5, 29, 12;
• Fractions: 0.7;
• (n) grams: 1, 2, 3, (1,2), (2,3), (1,3), (1,2,3);
• Value types: BOW;
• Corpus: words;
• Classifier: DT, LSVC, LR, NB, GIS, IIS and MNB;
• Maximum features: 1000.

The influence of n-grams on accuracy is shown in the graph below (Fig. 2); the table (Table 5) shows more detail about the graph. (1,2,3)- and (1,2)-grams produce the highest accuracy for the best scoring classifiers. Compared to (2)-grams (used in the previous experiment) there is a 3-4% increase in accuracy. An overview of the classifier results for 7 labels is shown in Table 6. Complete results can be found in Appendix 1.

Figure 2: experiment 4 n-grams influence on accuracy per n-labels.

Table 5: experiment 4 n-grams influence on accuracy per n-labels (best scoring classifier per n-gram setting).

(n)labels | Best (1) gram | Best (2) gram | Best (3) gram | Best (12) gram | Best (13) gram | Best (23) gram | Best (123) gram
7 | 0.8567 | 0.8636 | 0.6935 | 0.9029 | 0.8934 | 0.8678 | 0.9098
6 | 0.8718 | 0.8887 | 0.7251 | 0.9222 | 0.9098 | 0.8944 | 0.9285
5 | 0.8714 | 0.8884 | 0.7263 | 0.9218 | 0.9102 | 0.8937 | 0.9280
4 | 0.8972 | 0.8994 | 0.7335 | 0.9342 | 0.9245 | 0.8992 | 0.9397
3 | 0.9005 | 0.9066 | 0.7509 | 0.9383 | 0.9276 | 0.9057 | 0.9437
2 | 0.9391 | 0.9426 | 0.8423 | 0.9655 | 0.9594 | 0.9469 | 0.9727

Table 6: experiment 4 n-grams influence on accuracy, 7 labels, per classifier.

(n)grams | (n)labels | DecisionTree | LinearSVC | LogisticRegression | Maxent GIS | Maxent IIS | MultinomialNB | NaiveBayes
Unigram | 7 | 0.8042 | 0.8514 | 0.8567 | 0.5858 | 0.4878 | 0.7138 | 0.6434
Bigram | 7 | 0.8264 | 0.8588 | 0.8636 | 0.7900 | 0.8438 | 0.8177 | 0.8088
Trigram | 7 | 0.5967 | 0.6935 | 0.6928 | 0.6726 | 0.6860 | 0.6708 | 0.6746
(1,2)gram | 7 | 0.8753 | 0.9011 | 0.9029 | 0.6836 | 0.8523 | 0.8089 | 0.7693
(1,3)gram | 7 | 0.8484 | 0.8925 | 0.8934 | 0.6370 | 0.8390 | 0.7818 | 0.7397
(2,3)gram | 7 | 0.8369 | 0.8627 | 0.8678 | 0.8031 | 0.8447 | 0.8182 | 0.8172
(1,2,3)gram | 7 | 0.8818 | 0.9077 | 0.9098 | 0.7060 | 0.8730 | 0.8269 | 0.7978

5.4 EXPERIMENT 5: VALUE TYPES

The feature extraction methods WC and WI (described in section 4.3.1) can increase classifier accuracy, because they extract more information from the word corpus than BOW. With this experiment we try to find the best scoring value types, to reduce the parameter set even more. Again, the parameter options for labels, fractions and n-grams are limited according to the results of the previous experiment. Because of the close scores between (1,2)-grams and (1,2,3)-grams, both are used, again trying not to settle on a local maximum.

Experiment 5 value types
• Labels: 3, 4, 70, 6, 5, 29, 12;
• Fractions: 0.7;
• (n) grams: (1,2,3) and (1,2);
• Value types: BOW, WC, WI;
• Corpus: words;
• Classifier: DT, LSVC, LR, NB, GIS, IIS and MNB;
• Maximum features: 1000.

Changing the feature extraction method to WC or WI significantly increases classifier accuracy, as shown in the graph below (Fig. 3) and in detail (Table 7). This increase is especially well observable from 5 labels onwards. The effect on the individual classifiers is listed in Table 8. Scores for fewer than 7 labels, etc., can be found in Appendix 2.


Figure 3: experiment 5 BOW, WC & WI best scoring accuracies for n-labels.

Table 7: experiment 5 BOW, WC & WI best scoring accuracies for n-labels.

(n)labels | Best (12) BOW | Best (123) BOW | Best (12) WC | Best (123) WC | Best (12) WI | Best (123) WI
7 | 0.9029 | 0.9098 | 0.9223 | 0.9239 | 0.9271 | 0.9287
6 | 0.9222 | 0.9285 | 0.9391 | 0.9414 | 0.9422 | 0.9436
5 | 0.9218 | 0.9280 | 0.9390 | 0.9409 | 0.9422 | 0.9439
4 | 0.9444 | 0.9524 | 0.9475 | 0.9499 | 0.9512 | 0.9475
3 | 0.9620 | 0.9437 | 0.9525 | 0.9545 | 0.9558 | 0.9561
2 | 0.9655 | 0.9727 | 0.9734 | 0.9763 | 0.9779 | 0.9792

Table 8: experiment 5 BOW, WC & WI best scoring classifiers accuracy for 7 labels.

Classifiers | (n)labels | (12) BOW | (123) BOW | (12) WC | (123) WC | (12) WI | (123) WI
DecisionTree | 7 | 0.8753 | 0.8818 | 0.8359 | 0.8377 | 0.8359 | 0.8377
LinearSVC | 7 | 0.9011 | 0.9076 | 0.9171 | 0.9209 | 0.9271 | 0.9287
LogisticRegression | 7 | 0.9029 | 0.9098 | 0.9223 | 0.9239 | 0.9199 | 0.9213
Maxent GIS | 7 | 0.6885 | 0.7060 | 0.6836 | 0.7104 | 0.6885 | 0.7104
Maxent IIS | 7 | 0.8523 | 0.8730 | 0.8618 | 0.8736 | 0.8618 | 0.8736
MultinomialNB | 7 | 0.8089 | 0.8269 | 0.8216 | 0.8318 | 0.8450 | 0.8543
NaiveBayes | 7 | 0.7693 | 0.7978 | 0.7858 | 0.8082 | 0.7858 | 0.8082

5.5 EXPERIMENT 6: ADDING INFORMATION (POS-TAGS)

Adding extra information from external sources (described in section 1) can result in higher classification accuracy, as argued in the study of L. Ratinov [4] (described in section 2). To determine whether these claims have merit and are usable for our case, we have processed the user profiles with the tagging capabilities described in section 4.1. From our perspective, the taggers provide extra information. By using the tagged corpora for the user profiles we try to determine whether this extra information (the tags) can increase accuracy. The parameter options for labels, fractions and n-grams have been fixed to the optimum of the previous experiment (experiment 5). However, to be extra thorough, all extraction methods (value types) have been tested.

Experiment 6 tagged corpora
• Labels: 3, 4, 70, 6, 5, 29, 12;
• Fractions: 0.7;
• (n) grams: (1,2,3) and (1,2);
• Value types: BOW, WC, WI;
• Corpus: words+POS;
• Classifier: DT, LSVC, LR, NB, GIS, IIS and MNB;
• Maximum features: 1000.

Adding extra information to the words during feature extraction has led to an insignificant change for the best scoring classifier at 7 labels with (1,2,3)-grams, as shown in the graph below (Fig. 4) and in detail (Table 9). However, the overall performance of some classifiers has improved compared to the words-only corpus (Table 10). Scores for fewer than 7 labels, etc., can be found in Appendix 3.

Figure 4: experiment 6 BOW, WC & WI best scoring classifiers accuracy for n-labels POS tagged corpus.

Table 9: experiment 6 BOW, WC & WI best scoring classifiers accuracy for n-labels, POS tagged corpus.

(n)labels | Best (12) BOW | Best (123) BOW | Best (12) WC | Best (123) WC | Best (12) WI | Best (123) WI
7 | 0.9054 | 0.9114 | 0.9272 | 0.9295 | 0.9296 | 0.9313
6 | 0.9244 | 0.9317 | 0.9446 | 0.9465 | 0.9446 | 0.9467
5 | 0.9245 | 0.9313 | 0.9452 | 0.9477 | 0.9449 | 0.9473
4 | 0.9372 | 0.9417 | 0.9536 | 0.9548 | 0.9534 | 0.9551
3 | 0.9392 | 0.9435 | 0.9584 | 0.9596 | 0.9556 | 0.9579
2 | 0.9681 | 0.9738 | 0.9928 | 0.9935 | 0.9928 | 0.9935

Table 10: experiment 6 BOW, WC & WI best scoring classifiers accuracy for 7 labels, POS tagged corpus.

Classifiers | (n)labels | (12) BOW | (123) BOW | (12) WC | (123) WC | (12) WI | (123) WI
DecisionTree | 7 | 0.8747 | 0.8854 | 0.8606 | 0.8592 | 0.8606 | 0.8592
LinearSVC | 7 | 0.9039 | 0.9099 | 0.9172 | 0.9221 | 0.9296 | 0.9313
LogisticRegression | 7 | 0.9054 | 0.9114 | 0.9272 | 0.9295 | 0.9205 | 0.9222
Maxent GIS | 7 | 0.6798 | 0.7045 | 0.7516 | 0.7666 | 0.7516 | 0.7666
Maxent IIS | 7 | 0.8530 | 0.8725 | 0.8920 | 0.9032 | 0.8920 | 0.9032
MultinomialNB | 7 | 0.8112 | 0.8317 | 0.8228 | 0.8339 | 0.8442 | 0.8542
NaiveBayes | 7 | 0.7666 | 0.7971 | 0.8722 | 0.8850 | 0.8722 | 0.8850

5.6 EXPERIMENT 7: MAXIMUM FEATURES

Across the experiments, the default value of one thousand maximum features has kept memory consumption during training down (described in section 4.3.2). However, some classifiers can react differently to the number of available features. This experiment determines whether another value for maximum features changes the accuracy output. To keep a broad perspective, the parameter options for labels, fractions, n-grams and value types are limited according to the results of the previous experiments. We have opted to use the POS tagged corpus, because it shows slight (although inconclusive) improvements. By halving the maximum features we hope to see whether a significant change in accuracy appears. Scores for fewer than 7 labels, etc., can be found in Appendix 4.

Experiment 7 maximum features
• Labels: 3, 4, 70, 6, 5, 29, 12;
• Fractions: 0.7;
• (n) grams: (1,2,3) and (1,2);
• Value types: BOW, WC, WI;
• Corpus: words+POS;
• Classifier: DT, LSVC, LR, NB, GIS, IIS and MNB;
• Maximum features: 500.

According to the results shown in the graph below (Fig. 5) and in detail (Table 11), the accuracy scores shift slightly between the best scoring classifier, n-gram and value-type combinations, but the shifts are again statistically insignificant. We could continue the experiments with divisions of 250 and 750, but reducing the maximum features will ultimately lead to a reduction in accuracy. With this experiment we wanted to make sure the limit of 1000 features did not introduce anomalies.


Figure 5: experiment 7 BOW, WC & WI best scoring classifiers accuracy for n-labels POS tagged corpus, maximum-features 500.

Table 11: experiment 7 BOW, WC & WI best scoring classifiers accuracy for n-labels, POS tagged corpus, maximum features 500.

(n)labels | Best (12) BOW | Best (123) BOW | Best (12) WC | Best (123) WC | Best (12) WI | Best (123) WI
7 | 0.9033 | 0.9091 | 0.9303 | 0.9311 | 0.9259 | 0.9279
6 | 0.9221 | 0.9273 | 0.9452 | 0.9479 | 0.9416 | 0.9427
5 | 0.9217 | 0.9277 | 0.9463 | 0.9482 | 0.9422 | 0.9432
4 | 0.9348 | 0.9384 | 0.9534 | 0.9557 | 0.9509 | 0.9515
3 | 0.9386 | 0.9425 | 0.9587 | 0.9622 | 0.9559 | 0.9577
2 | 0.9796 | 0.9750 | 0.9928 | 0.9934 | 0.9928 | 0.9934

Table 12: experiment 7 BOW, WC & WI best scoring classifiers accuracy for 7 labels, POS tagged corpus, maximum features 500.

Classifiers | (n)labels | (12) BOW | (123) BOW | (12) WC | (123) WC | (12) WI | (123) WI
DecisionTree | 7 | 0.8762 | 0.8842 | 0.8577 | 0.8558 | 0.8577 | 0.8558
LinearSVC | 7 | 0.9025 | 0.9089 | 0.9233 | 0.9278 | 0.9259 | 0.9279
LogisticRegression | 7 | 0.9033 | 0.9091 | 0.9303 | 0.9311 | 0.9179 | 0.9200
Maxent GIS | 7 | 0.7076 | 0.7282 | 0.7705 | 0.7835 | 0.7705 | 0.7835
Maxent IIS | 7 | 0.8601 | 0.8706 | 0.8997 | 0.9055 | 0.8997 | 0.9055
MultinomialNB | 7 | 0.8222 | 0.8340 | 0.8282 | 0.8338 | 0.8516 | 0.8582
NaiveBayes | 7 | 0.7895 | 0.8105 | 0.8847 | 0.8926 | 0.8847 | 0.8926

5.7 EXPERIMENT OVERVIEW

During the experiments, accuracy has been improved step by step. To give a better understanding of how the parameters have influenced accuracy, the graphs (Fig. 6 and 7) show the accuracy increase per experiment; the details are shown in Table 13. All the collected data can be found in appendices 1 through 4.

Figure 6: experiment overview per classifier.

Figure 7: experiment overview per classifier (zoomed at experiments 2 to 7).

Table 13: experiments overview per classifier.

Experiment | DecisionTree | LinearSVC | LogisticRegression | Maxent GIS | Maxent IIS | MultinomialNB | NaiveBayes
7 | 0.8842 | 0.9279 | 0.9311 | 0.7835 | 0.9055 | 0.8582 | 0.8926
6 | 0.8854 | 0.9313 | 0.9295 | 0.7666 | 0.9032 | 0.8542 | 0.8850
5 | 0.8818 | 0.9287 | 0.9239 | 0.7104 | 0.8736 | 0.8543 | 0.8082
4 | 0.8818 | 0.9077 | 0.8678 | 0.8031 | 0.8730 | 0.8269 | 0.8172
3 | 0.8272 | 0.8587 | 0.8636 | 0.7900 | 0.8438 | 0.8187 | 0.8088
2 | 0.8272 | 0.8587 | 0.8636 | 0.7900 | 0.8438 | 0.8187 | 0.8088
1 | 0.3298 | 0.3495 | 0.3529 | 0.3774 | 0.3696 | 0.3604 | 0.3588

6. CONCLUSION

During this study we have identified that text classification involves many steps, from raw data parsing and feature extraction to training and validating classifiers. Within these steps various implementation decisions are possible, which can potentially reduce the amount of information in the training-set (described in section 4). The results show that the accuracy of the classifiers is determined by how specific the dataset is: more specific data resulted in higher accuracy than generic data (section 5.1). Moreover, the most significant secondary accuracy improvements were achieved using combined n-grams and the word count (WC) and word importance (WI) feature extraction methods.

The development of a machine learning match engine for SocialReferral's job vacancies has led to a number of questions, for which we have sought answers. These answers can be beneficial to other applications of text classification as well. Using an intermediary tool backed by a database [19] for the collection and processing of raw data has made annotating, verifying and exporting user profiles easier. The tool enables data visualization in each stage of the process, resulting in a better understanding of the data mutations. On the processed profiles, domain specialists can search, filter and annotate using the same visualization techniques. Moreover, the secondary extraction layer provides extra flexibility when it comes to data export. In the case of extracting POS tags along with word features, it was simply a matter of collecting the required information and outputting it to a file; with a plaintext file this would have meant additional reading, parsing and tokenizing steps.

NOTE: the tagged corpus readers of NLTK are able to filter tagged features, however the tags need to be in the corpus for this to work.

Classifiers (n)labels (12) BOW (123) BOW (12) WC (123) WC (12) WI (123) WI

DecisionTree 7 0.8762 0.8842 0.8577 0.8558 0.8577 0.8558 LinearSVC 7 0.9025 0.9089 0.9233 0.9278 0.9259 0.9279 LogisticRegression 7 0.9033 0.9091 0.9303 0.9311 0.9179 0.9200 Maxent GIS 7 0.7076 0.7282 0.7705 0.7835 0.7705 0.7835 Maxent IIS 7 0.8601 0.8706 0.8997 0.9055 0.8997 0.9055 MultinomialNB 7 0.8222 0.8340 0.8282 0.8338 0.8516 0.8582 NaiveBayes 7 0.7895 0.8105 0.8847 0.8926 0.8847 0.8926

(n)labels Best (12) BOW Best (123) BOW Best (12) WC Best (123) WC Best (12) WI Best (123) WI

7 0.9033 0.9091 0.9303 0.9311 0.9259 0.9279 6 0.9221 0.9273 0.9452 0.9479 0.9416 0.9427 5 0.9217 0.9277 0.9463 0.9482 0.9422 0.9432 4 0.9348 0.9384 0.9534 0.9557 0.9509 0.9515 3 0.9386 0.9425 0.9587 0.9622 0.9559 0.9577 2 0.9796 0.9750 0.9928 0.9934 0.9928 0.9934

Experiment DecisionTree LinearSVC LogisticRegression MultinomialNB

7 0.8842 0.9279 0.9311 0.7835 0.9055 0.8582 0.8926 6 0.8854 0.9313 0.9295 0.7666 0.9032 0.8542 0.8850 5 0.8818 0.9287 0.9239 0.7104 0.8736 0.8543 0.8082 4 0.8818 0.9077 0.8678 0.8031 0.8730 0.8269 0.8172 3 0.8272 0.8587 0.8636 0.7900 0.8438 0.8187 0.8088 2 0.8272 0.8587 0.8636 0.7900 0.8438 0.8187 0.8088 1 0.3298 0.3495 0.3529 0.3774 0.3696 0.3604 0.3588


In the design of the database we opted to normalize tokenized words into a single table. Every word (feature) is isolated, which opens up a whole range of possibilities. Features can be manipulated before they are extracted into the final corpus. Words can be replaced when a spelling mistake has been made and identified. External resources can be used to add information. Moreover, because features exist in a finite list, compression is possible when extracting them to a corpus file.
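A minimal sketch of such a normalized design, using SQLite and illustrative table and column names (the actual schema of the Cortex tool [19] differs in detail):

import sqlite3

connection = sqlite3.connect(':memory:')  # illustrative; the real database is persistent
connection.executescript("""
    -- every distinct word (feature) is stored exactly once
    CREATE TABLE words (
        id      INTEGER PRIMARY KEY,
        text    TEXT UNIQUE NOT NULL,  -- a spelling mistake can be corrected in one place
        pos_tag TEXT                   -- room for external information such as a POS tag
    );

    -- profiles reference words by id, so a corpus export can be compressed to id sequences
    CREATE TABLE profile_words (
        profile_id INTEGER NOT NULL,
        word_id    INTEGER NOT NULL REFERENCES words(id),
        position   INTEGER NOT NULL
    );
""")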

After reflecting on our results (section 5), we can conclude that classifiers based on the support vector machine (SVM) model are best suited for text classification, compared to Naive-Bayes, Maximum Entropy and Decision Tree. This result is confirmed by the studies of M. Rogati et al. [27] and S. Chakrabarti et al. [28]. Analyzing the accuracy results of both Logistic Regression and Linear SVC (SVM models) shows no further accuracy improvement after the fifth experiment (section 5.4). We believe this to be the result of reaching a dataset ceiling, where the data in the set does not provide enough specific markers for labeling. This ceiling could potentially affect all classifiers. The remaining five classifiers do not achieve the same accuracies, and therefore could still benefit from the extra information (not yet having reached this ceiling). Effects of a data ceiling are also observed when comparing specific and generic data (section 5.1): after removing the labels with generic features, accuracy increased by 50%. In our view, the specificity of the dataset could be improved further via more accurate annotations.
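For illustration, a sketch of how the two linear Scikit-learn [6] classifiers can be compared on held-out data; the toy documents and labels below stand in for the exported profile corpus and the train/test pivot used in the experiments:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Toy placeholder data, not the thesis dataset.
train_docs = ["senior java developer", "registered nurse", "sales account manager"]
train_labels = ["developer", "nurse", "sales"]
test_docs = ["java software engineer", "intensive care nurse"]
test_labels = ["developer", "nurse"]

vectorizer = TfidfVectorizer(ngram_range=(1, 3))
train_features = vectorizer.fit_transform(train_docs)
test_features = vectorizer.transform(test_docs)

for classifier in (LinearSVC(), LogisticRegression()):
    classifier.fit(train_features, train_labels)
    accuracy = classifier.score(test_features, test_labels)  # accuracy on held-out data
    print(type(classifier).__name__, accuracy)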

6.1 ADDITIONAL REMARKS

As described in the previous section, the effects of information loss during processing can be significant. During pre-processing we identified various tokenized words and sentences that were not delimited correctly; however, we have not verified to what extent this affects accuracy. When running the experiments, the Decision Tree classifier utilized the most memory and took the longest time to train compared to the others (especially with (1,2,3)-grams). Depending on the number of labels, training takes up to two hours. The results show that its accuracy is average (Table 12), except for the fourth experiment (section 5.3), but even there it is still outperformed by LinearSVC. For future work we will consider not using Decision Trees because of this behavior. The fastest classifiers to train were the linear models Logistic Regression and Linear SVC, and these also provided the highest accuracy, which further removes the need to use Decision Trees for text classification.
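A simple way to reproduce the training-time comparison is to time the fit calls directly, as sketched below on toy data. Note that this sketch substitutes Scikit-learn's DecisionTreeClassifier for the NLTK decision tree used in the experiments, so the numbers it prints are only indicative:

import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Toy placeholder corpus, repeated to give the timers something to measure.
documents = ["senior java developer", "registered nurse", "sales account manager"] * 500
labels = ["developer", "nurse", "sales"] * 500
features = TfidfVectorizer(ngram_range=(1, 3)).fit_transform(documents).toarray()

for classifier in (LinearSVC(), LogisticRegression(), DecisionTreeClassifier()):
    start = time.perf_counter()
    classifier.fit(features, labels)
    print(type(classifier).__name__, round(time.perf_counter() - start, 4), "seconds")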

NOTE: these findings conflict with the study of S. Chakrabarti et al. [28], which raises concerns about the large memory footprint and slow training times of SVMs. The results we have observed may be the result of an improved SVM implementation within Scikit-learn [6] for Logistic Regression and Linear SVC.

After evaluating the effects of feature extraction (BOW, WC and WI), the NLTK [7] implemented classifiers Naive-Bayes, Decision Tree and Maximum-Entropy GIS/IIS showed the same results for WC as for WI. This could be the result of an implementation issue with NLTK. The remaining classifiers, which do not show identical results, are implemented via Scikit-learn [6]. However, because word importance shows only an insignificant increase over word frequency overall, this does not affect the conclusion.
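To make the WC/WI distinction for the NLTK classifiers concrete: they are trained on (feature dictionary, label) pairs, so word count and word importance only differ in the values stored in those dictionaries. The sketch below uses made-up feature values, not the weights computed in the experiments:

import nltk

# Word count (WC) feature sets: values are raw frequencies.
wc_train = [({"java": 3, "developer": 2}, "developer"),
            ({"nurse": 4, "care": 1}, "nurse")]

# Word importance (WI) feature sets: the same features with TF-IDF-style weights.
wi_train = [({"java": 0.62, "developer": 0.41}, "developer"),
            ({"nurse": 0.83, "care": 0.12}, "nurse")]

wc_classifier = nltk.NaiveBayesClassifier.train(wc_train)
wi_classifier = nltk.NaiveBayesClassifier.train(wi_train)

print(wc_classifier.classify({"java": 3, "developer": 2}))
print(wi_classifier.classify({"java": 0.62, "developer": 0.41}))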

ACKNOWLEDGEMENTS

Writing this thesis would have been impossible without the help of A. H. Meij, CTO and co-founder of SocialReferral B.V. He helped out a lot in investigating text classification strategies and provided the user profiles dataset (among other things). Furthermore, G. Nieuwkamp, HR expert and co-founder of SocialReferral B.V., annotated the dataset of user profiles with job titles, which proved to be a big help. And lastly, T. van der Storm, senior researcher at the Centrum Wiskunde & Informatica and tutor at the University of Amsterdam, provided academic steering towards graduation.

REFERENCES

[1] S. Bird, E. Klein, E. Loper, Natural Language Processing with Python, O'Reilly, United States of America, 2009;
[2] A. M. Turing, Computing Machinery and Intelligence, pages 433-460, Mind, United Kingdom, 1950;
[3] J. Perkins, NLTK-Trainer (version 1.0) [software], Available from: http://nltk-trainer.readthedocs.org/, 2011;
[4] L. Ratinov, Exploiting Knowledge in NLP, Ph.D. Thesis, University of Illinois, Urbana, Illinois, United States of America, 2010;
[5] SocialReferral B.V., Anonymous profile export (version 1.0) [dataset], The Netherlands, 2013;
[6] Scikit-learn developers, Scikit-learn (version 0.14) [software], Available from: http://scikit-learn.org/, 2013;
[7] D. Garrette, P. Ljunglof, J. Nothman, NLTK (version 2.0) [software], Available from: http://nltk.org/, 2011;
[8] L. Nemeth, Hunspell (version 0.14) [software], Available from: http://hunspell.sourceforge.net/, 2013;
[9] K. Atkinson, G. Keunning, English US dictionary, 2006;
[10] K. Atkinson, D. Bartlett, B. Kelk, A. Brown, English GB dictionary, 2005;
[11] OpenTaal, S. Brouwer et al., Nederlandstalige TeX Gebruikersgroep, Dutch dictionary, 2010;
[12] T. Roelleke, J. Wang, TF-IDF Uncovered: A Study of Theories and Probabilities, Queen Mary, University of London, United Kingdom, 2008;
[13] M. P. Marcus, B. Santorini, M. A. Marcinkiewicz, Building a Large Annotated Corpus of English: The Penn Treebank, University of Pennsylvania, Northwestern University, United States of America, 1993;
[14] L. van der Beek, G. Bouma, J. Daciuk, T. Gaustad, R. Malouf, G. van Noord, R. Prins, B. Villada, Algorithms for Linguistic Processing NWO PIONIER Progress Report, Chapter 5: The Alpino Dependency Treebank, Groningen, The Netherlands, 2002;
[15] C. Fellbaum, D. Gross, K. Miller, Adjectives in WordNet, Princeton University, United States of America, 1993;
[16] C. Fellbaum, English Verbs as a Semantic Net, Princeton University, United States of America, 1993;
[17] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, K. Miller, Introduction to WordNet: An On-line Lexical Database, Princeton University, United States of America, 1993;
[18] I. P. Gent, T. Walsh, Towards an Understanding of Hill-climbing Procedures for SAT, AAAI, United States of America, 1993;
[19] D. M. N. Blommesteijn, A. H. Meij, Cortex (version 0.0.1) [software], Available on request, The Netherlands, 2013;
[20] K. P. Bennett, C. Campbell, Support Vector Machines: Hype or Hallelujah?, SIGKDD Explorations, United States of America, 2000;
[21] G. A. Miller, Nouns in WordNet: A Lexical Inheritance System, Princeton University, United States of America, 1993;
[22] U. Manber, M. Tompa, The Complexity of Problems on Probabilistic, Nondeterministic, and Alternating Decision Trees, University of Washington, Seattle, United States of America, 1985;
[23] S. R. Searle, Linear Models Volume 1, John Wiley & Sons, United States of America, 1971;
[24] Y. Yang, Discretization for Naive-Bayes Learning, Monash University, Australia, 2003;
[25] A. McCallum, R. Rosenfeld, T. M. Mitchell, A. Y. Ng, Improving Text Classification by Shrinkage in a Hierarchy of Classes, ICML, United States of America, 1998;


[26] A. Sun, E. P. Lim, Hierarchical Text Classification and Evaluation, Nanyang Technological University, Singapore, 2001;

[27] M. Rogati, Y. Yang, High-Performing Feature Selection for Text Classification, Carnegie Mellon University, United States of America, 2002;

[28] S. Chakrabarti, S. Roy, M.V. Soundalgekar, Fast and accurate text classification via multiple linear discriminant projections, IIT Bombay, India, 2003;


APPENDIX 1: EXPERIMENT 4 N-GRAMS

Table 14: experiment 4, n-gram influence on accuracy per classifier and n-label.

(n)grams (n)labels DecisionTree LinearSVC LogisticRegression Maxent GIS Maxent IIS MultinomialNB NaiveBayes
Unigram 7 0.8042 0.8514 0.8567 0.5858 0.4878 0.7138 0.6434
Bigram 7 0.8264 0.8588 0.8636 0.7900 0.8438 0.8177 0.8088
Trigram 7 0.5967 0.6935 0.6928 0.6726 0.6860 0.6708 0.6746
(1,2)gram 7 0.8753 0.9011 0.9029 0.6836 0.8523 0.8089 0.7693
(1,3)gram 7 0.8484 0.8925 0.8934 0.6370 0.8390 0.7818 0.7397
(2,3)gram 7 0.8369 0.8627 0.8678 0.8031 0.8447 0.8182 0.8172
(1,2,3)gram 7 0.8818 0.9077 0.9098 0.7060 0.8730 0.8269 0.7978
Unigram 6 0.8306 0.8663 0.8718 0.5740 0.4874 0.7267 0.6986
Bigram 6 0.8662 0.8814 0.8887 0.8002 0.8670 0.8362 0.8307
Trigram 6 0.6401 0.7240 0.7251 0.7039 0.7191 0.6991 0.7029
(1,2)gram 6 0.8965 0.9202 0.9222 0.6722 0.8534 0.8235 0.7978
(1,3)gram 6 0.8679 0.9063 0.9098 0.6301 0.8401 0.8007 0.7798
(2,3)gram 6 0.8745 0.8884 0.8944 0.8201 0.8695 0.8413 0.8409
(1,2,3)gram 6 0.9049 0.9278 0.9285 0.6968 0.8781 0.8422 0.8256
Unigram 5 0.8309 0.8676 0.8714 0.5966 0.5070 0.7261 0.6948
Bigram 5 0.8691 0.8803 0.8884 0.8042 0.8632 0.8347 0.8295
Trigram 5 0.6545 0.7246 0.7263 0.7041 0.7209 0.6966 0.7017
(1,2)gram 5 0.8970 0.9203 0.9218 0.6886 0.8565 0.8234 0.7966
(1,3)gram 5 0.8696 0.9085 0.9102 0.6512 0.8463 0.8008 0.7775
(2,3)gram 5 0.8762 0.8883 0.8937 0.8223 0.8728 0.8406 0.8402
(1,2,3)gram 5 0.9054 0.9271 0.9280 0.7118 0.8808 0.8427 0.8251
Unigram 4 0.8668 0.8948 0.8972 0.6395 0.5537 0.7919 0.7648
Bigram 4 0.8829 0.8934 0.8994 0.8427 0.8875 0.8542 0.8549
Trigram 4 0.6666 0.7289 0.7335 0.7131 0.7298 0.7023 0.7068
(1,2)gram 4 0.9138 0.9334 0.9342 0.7351 0.8952 0.8579 0.8486
(1,3)gram 4 0.8827 0.9228 0.9245 0.6934 0.8779 0.8370 0.8208
(2,3)gram 4 0.8865 0.8953 0.8992 0.8501 0.8894 0.8527 0.8551
(1,2,3)gram 4 0.9185 0.9375 0.9397 0.7541 0.9064 0.8657 0.8613
Unigram 3 0.8809 0.9000 0.9005 0.7039 0.7797 0.7996 0.7673
Bigram 3 0.8896 0.8999 0.9066 0.8534 0.8965 0.8588 0.8595
Trigram 3 0.6988 0.7468 0.7509 0.7316 0.7508 0.7167 0.7215
(1,2)gram 3 0.9219 0.9382 0.9383 0.7748 0.8948 0.8619 0.8496
(1,3)gram 3 0.8929 0.9273 0.9276 0.7429 0.8747 0.8418 0.8226
(2,3)gram 3 0.8940 0.9015 0.9057 0.8610 0.9003 0.8574 0.8603
(1,2,3)gram 3 0.9277 0.9421 0.9437 0.7888 0.9072 0.8692 0.8627
Unigram 2 0.9310 0.9372 0.9391 0.8735 0.9141 0.8737 0.8484
Bigram 3 0.9280 0.9390 0.9426 0.9214 0.9373 0.9187 0.9209
Trigram 2 0.8181 0.8319 0.8346 0.8351 0.8423 0.8359 0.8355
(1,2)gram 2 0.9620 0.9645 0.9655 0.9168 0.9490 0.9184 0.9115
(1,3)gram 2 0.9505 0.9584 0.9594 0.9143 0.9508 0.9111 0.9063
(2,3)gram 2 0.9351 0.9427 0.9469 0.9286 0.9426 0.9275 0.9288
(1,2,3)gram 2 0.9694 0.9699 0.9727 0.9305 0.9593 0.9309 0.9288


APPENDIX 2: EXPERIMENT 5 VALUE-TYPES

Table 15: experiment 5, BOW, WC & WI accuracy per n-label and classifier.

Classifiers (n)labels (1,2) BOW (1,2,3) BOW (1,2) WC (1,2,3) WC (1,2) WI (1,2,3) WI
DecisionTree 7 0.8753 0.8818 0.8359 0.8377 0.8359 0.8377
LinearSVC 7 0.9011 0.9076 0.9171 0.9209 0.9271 0.9287
LogisticRegression 7 0.9029 0.9098 0.9223 0.9239 0.9199 0.9213
Maxent GIS 7 0.6885 0.7060 0.6836 0.7104 0.6885 0.7104
Maxent IIS 7 0.8523 0.8730 0.8618 0.8736 0.8618 0.8736
MultinomialNB 7 0.8089 0.8269 0.8216 0.8318 0.8450 0.8543
NaiveBayes 7 0.7693 0.7978 0.7858 0.8082 0.7858 0.8082
DecisionTree 6 0.8965 0.9049 0.8785 0.8776 0.8785 0.8776
LinearSVC 6 0.9204 0.9276 0.9357 0.9368 0.9422 0.9436
LogisticRegression 6 0.9222 0.9285 0.9391 0.9414 0.9342 0.9348
Maxent GIS 6 0.6790 0.6968 0.6722 0.7044 0.6790 0.7044
Maxent IIS 6 0.8534 0.8781 0.8683 0.8848 0.8683 0.8848
MultinomialNB 6 0.8235 0.8422 0.8383 0.8488 0.8584 0.8685
NaiveBayes 6 0.7978 0.8256 0.8160 0.8345 0.8160 0.8345
DecisionTree 5 0.8970 0.9054 0.8810 0.8800 0.8810 0.8800
LinearSVC 5 0.9203 0.9271 0.9360 0.9372 0.9422 0.9439
LogisticRegression 5 0.9218 0.9280 0.9390 0.9409 0.9355 0.9355
Maxent GIS 5 0.6886 0.7118 0.6949 0.7193 0.6949 0.7193
Maxent IIS 5 0.8565 0.8808 0.8724 0.8877 0.8724 0.8877
MultinomialNB 5 0.8379 0.8427 0.8379 0.8490 0.8572 0.8676
NaiveBayes 5 0.7966 0.8251 0.8155 0.8339 0.8155 0.8339
DecisionTree 4 0.9138 0.9185 0.8953 0.8950 0.8953 0.8950
LinearSVC 4 0.9444 0.9524 0.9374 0.9444 0.9512 0.9475
LogisticRegression 4 0.9342 0.9397 0.9475 0.9499 0.9452 0.9453
Maxent GIS 4 0.7420 0.7541 0.7351 0.7623 0.7420 0.7623
Maxent IIS 4 0.8952 0.9064 0.9095 0.9135 0.9095 0.9135
MultinomialNB 4 0.8579 0.8657 0.8790 0.8831 0.8743 0.8826
NaiveBayes 4 0.8486 0.8613 0.8648 0.8718 0.8648 0.8718
DecisionTree 3 0.9219 0.9277 0.9130 0.9138 0.9130 0.9138
LinearSVC 3 0.9381 0.9419 0.9490 0.9524 0.9558 0.9561
LogisticRegression 3 0.9525 0.9437 0.9525 0.9545 0.9507 0.9502
Maxent GIS 3 0.7748 0.7888 0.7810 0.7971 0.7810 0.7971
Maxent IIS 3 0.8948 0.9072 0.9095 0.9169 0.9095 0.9169
MultinomialNB 3 0.8619 0.8692 0.8833 0.8868 0.8770 0.8863
NaiveBayes 3 0.8496 0.8627 0.8674 0.8743 0.8674 0.8743
DecisionTree 2 0.9620 0.9694 0.9645 0.9643 0.9645 0.9643
LinearSVC 2 0.9643 0.9699 0.9696 0.9728 0.9779 0.9792
LogisticRegression 2 0.9655 0.9727 0.9734 0.9763 0.9734 0.9730
Maxent GIS 2 0.9168 0.9305 0.9230 0.9359 0.9230 0.9359
Maxent IIS 2 0.9490 0.9593 0.9616 0.9650 0.9616 0.9650
MultinomialNB 2 0.9184 0.9309 0.9302 0.9356 0.9184 0.9273
NaiveBayes 2 0.9115 0.9288 0.9248 0.9353 0.9248 0.9353
