Data mining scenarios for the discovery of subtypes and the comparison of algorithms Colas, F.P.R.


Data mining scenarios for the discovery of subtypes and the comparison of algorithms

Colas, F.P.R.

Citation

Colas, F. P. R. (2009, March 4). Data mining scenarios for the discovery of subtypes and the comparison of algorithms. Retrieved from https://hdl.handle.net/1887/13575

Version: Corrected Publisher’s Version

License: Licence agreement concerning inclusion of doctoral thesis in the Institutional Repository of the University of Leiden

Downloaded from: https://hdl.handle.net/1887/13575

Note: To cite this publication please use the final published version (if applicable).


Additional Results in Text Classification

For the nearest neighbors algorithm, a number of feature space transformations are possible. The k nearest neighbors classifier implemented in the libbow library [McC96] defines these transformations by two sets of three letters; for a detailed description of the letters, see Table B.1. In Figures B.1, B.2 and B.3 we report histograms of the count of pairwise wins for each combination of the feature space transformations.
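The naming scheme can be made concrete with a short sketch. It assumes the letters listed in Table B.1 (five term frequency variants, two inverse document frequency variants, two normalizations) and enumerates every label of the form used in the figures, two three-letter codes joined by a dot. The function names `single_codes` and `paired_labels` are hypothetical helpers for illustration and are not part of libbow.

```python
from itertools import product

# Letters as listed in Table B.1, in alphabetical order.
TF_LETTERS = "ablmn"    # augmented, binary, log, max-norm, none
IDF_LETTERS = "nt"      # none, idf
NORM_LETTERS = "cn"     # cosine, none


def single_codes():
    """Return every three-letter code (tf letter, idf letter, normalization letter)."""
    return ["".join(letters) for letters in product(TF_LETTERS, IDF_LETTERS, NORM_LETTERS)]


def paired_labels():
    """Return every label made of two three-letter codes joined by a dot."""
    return [f"{first}.{second}" for first, second in product(single_codes(), repeat=2)]


if __name__ == "__main__":
    labels = paired_labels()
    print(len(single_codes()), len(labels))
    print(labels[0], labels[-1])
```

Under these assumptions there are 20 three-letter codes and 400 paired labels, running from anc.anc to ntn.ntn.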

We remark that binary transformations (b) tend to perform worse. Moreover, the inverse document frequency (t) does not appear as crucial as we might expect given its wide use in information retrieval. Further, normalizing the scores (c) did not show any improvement. The remaining transformations seem to perform equally well, with the exception of the tc-transformation: when a tc-transformation is applied to the training set, we generally observe underperforming nearest neighbor classifiers. A possible explanation would be a software issue while normalizing the scores of the training set. In our analyses, we avoided this type of transformation.
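As an illustration of how a count of pairwise wins could be tallied, the sketch below assumes that a win is recorded each time one configuration scores strictly higher than another on the same experiment. This definition, the function name `count_pairwise_wins`, and the toy scores are assumptions made for the example; they are not taken from the experiments reported here.

```python
from itertools import permutations


def count_pairwise_wins(scores):
    """Count, for each configuration, how often it strictly beats another one.

    `scores` maps a configuration label to a list of scores, one per experiment;
    all lists are assumed to follow the same experiment order.
    """
    wins = {label: 0 for label in scores}
    # Compare every ordered pair of distinct configurations, experiment by experiment.
    for first, second in permutations(scores, 2):
        wins[first] += sum(
            1 for a, b in zip(scores[first], scores[second]) if a > b
        )
    return wins


if __name__ == "__main__":
    # Toy scores for three hypothetical configurations on two experiments.
    toy_scores = {"A": [0.80, 0.75], "B": [0.70, 0.75], "C": [0.60, 0.50]}
    print(count_pairwise_wins(toy_scores))
```

Ties count for neither side in this sketch; a different tie rule would change the totals.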


Table B.1: The feature space transformations are defined in the libbow library by combinations of three letters that refer to the term frequency, the inverse document frequency and the normalization. Recall that $x_{ij}$ is the frequency of the word j in the document i. This table summarizes the different combinations.

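Term frequency (tf)

| Letter | Name | Description | Formula |
|---|---|---|---|
| n | none | Raw frequencies | $tf(x_{ij}) = x_{ij}$ |
| b | binary | Binarize the raw frequencies | $tf(x_{ij}) = \begin{cases} 1 & \text{if } x_{ij} \geq 1 \\ 0 & \text{otherwise} \end{cases}$ |
| m | max-norm | Normalize $x_{ij}$ relative to the maximum term frequency observed in document i | $tf(x_{ij}) = \dfrac{x_{ij}}{\max_j x_{ij}}$ |
| a | augmented norm | Similar to the max-norm but with $\frac{1}{2}$ added | $tf(x_{ij}) = \dfrac{1}{2} + \dfrac{x_{ij}}{2 \max_j x_{ij}}$ |
| l | log | Logarithm of the term frequency | $tf(x_{ij}) = 1 + \log(x_{ij})$ |

Inverse document frequency (idf)

| Letter | Name | Description | Formula |
|---|---|---|---|
| n | none | idf is not used | $idf(x_{ij}) = 1$ |
| t | idf | Inverse of the frequency of the term $x_{ij}$ in the database, which has $N$ documents | $idf(x_{ij}) = \log\left(\dfrac{N}{df(x_{ij})}\right)$ |

Normalization

| Letter | Name | Description | Formula |
|---|---|---|---|
| n | none | Normalization is not used | $\varphi(x_{ij}) = tf(x_{ij})\, idf(x_{ij})$ |
| c | cosine | Apply a cosine normalization | $\varphi(x_{ij}) = \dfrac{tf(x_{ij})\, idf(x_{ij})}{\sqrt{\sum_j \left(tf(x_{ij})\, idf(x_{ij})\right)^2}}$ |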
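The transformations of Table B.1 can be sketched in a few lines of Python. This is a minimal illustration and not the libbow implementation: the function names are hypothetical, counts are given as one list of raw frequencies per document, the cosine step divides by the Euclidean length of the weighted vector, and three edge cases are handled by assumption (the log variant maps a zero count to 0, an unseen term gets an idf of 0, and an all-zero document stays all zero).

```python
import math


def term_frequency(row, letter):
    """Apply one tf variant to the raw counts of a single document."""
    peak = max(row) if row else 0
    out = []
    for x in row:
        if letter == "n":      # raw frequency
            out.append(float(x))
        elif letter == "b":    # binary
            out.append(1.0 if x >= 1 else 0.0)
        elif letter == "m":    # max-norm
            out.append(x / peak if peak else 0.0)
        elif letter == "a":    # augmented norm
            out.append(0.5 + x / (2 * peak) if peak else 0.5)
        elif letter == "l":    # log; zero counts map to 0 (assumption)
            out.append(1.0 + math.log(x) if x > 0 else 0.0)
        else:
            raise ValueError(f"unknown tf letter: {letter!r}")
    return out


def inverse_document_frequency(counts, letter):
    """Return one idf weight per term, computed over all documents."""
    n_docs = len(counts)
    n_terms = len(counts[0]) if counts else 0
    if letter == "n":
        return [1.0] * n_terms
    if letter != "t":
        raise ValueError(f"unknown idf letter: {letter!r}")
    weights = []
    for j in range(n_terms):
        df = sum(1 for row in counts if row[j] > 0)  # documents containing term j
        weights.append(math.log(n_docs / df) if df else 0.0)
    return weights


def transform(counts, code):
    """Apply a three-letter code such as 'ntc' to a list of count vectors."""
    tf_letter, idf_letter, norm_letter = code
    if norm_letter not in "nc":
        raise ValueError(f"unknown normalization letter: {norm_letter!r}")
    idf = inverse_document_frequency(counts, idf_letter)
    result = []
    for row in counts:
        weights = [t * w for t, w in zip(term_frequency(row, tf_letter), idf)]
        if norm_letter == "c":  # cosine normalization
            length = math.sqrt(sum(v * v for v in weights))
            weights = [v / length if length else 0.0 for v in weights]
        result.append(weights)
    return result
```

For example, `transform([[2, 0, 1], [0, 4, 4]], "mnn")` returns `[[1.0, 0.0, 0.5], [0.0, 1.0, 1.0]]`, since each count is divided by the largest count in its own document.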


[Histograms omitted: counts of pairwise wins (axis 0 to 300) for the combinations anc.anc to atn.ntn and bnc.anc to btn.ntn.]

Figure B.1: Counts of pairwise wins for each transformation, from anc.anc to btn.ntn.


[Histograms omitted: counts of pairwise wins (axis 0 to 300) for the combinations lnc.anc to ltn.ntn and mnc.anc to mtn.ntn.]

Figure B.2: Counts of pairwise wins for each transformation, from lnc.anc to mtn.ntn.


[Histogram omitted: counts of pairwise wins (axis 0 to 300) for the combinations nnc.anc to ntn.ntn.]

Figure B.3: Counts of pairwise wins for each transformation, from nnc.anc to ntn.ntn.
