Data mining scenarios for the discovery of subtypes and the comparison of algorithms
Colas, F.P.R.
Citation
Colas, F. P. R. (2009, March 4). Data mining scenarios for the discovery of subtypes and the comparison of algorithms. Retrieved from
https://hdl.handle.net/1887/13575
Version: Corrected Publisher’s Version
License: Licence agreement concerning inclusion of doctoral thesis in the Institutional Repository of the University of Leiden
Downloaded from: https://hdl.handle.net/1887/13575
Note: To cite this publication please use the final published version (if
applicable).
Additional Results in Text Classification
For the nearest neighbors algorithm, a number of feature space transformations are possible. The k nearest neighbors classifier implemented in the libbow library [McC96] defines these transformations by two sets of three letters; for a detailed description of the letters, see Table B.1. In Figures B.1, B.2 and B.3 we report histograms of the count of pairwise wins for each combination of feature space transformations.
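To make the size of this search space concrete, the following minimal sketch enumerates the three-letter codes and their pairings, assuming the letters listed in Table B.1 (five term-frequency letters, two idf letters, two normalization letters). The function names are hypothetical and are not part of libbow; the sketch only reproduces the labels used in the figures.

```python
from itertools import product

# Letters as listed in Table B.1 (alphabetical, matching the figure labels).
TF_LETTERS = "ablmn"    # augmented norm, binary, log, max-norm, none
IDF_LETTERS = "nt"      # none, idf
NORM_LETTERS = "cn"     # cosine, none


def transformation_codes():
    """Return every three-letter code, e.g. 'anc', 'ltc', 'ntn'."""
    return ["".join(letters)
            for letters in product(TF_LETTERS, IDF_LETTERS, NORM_LETTERS)]


def transformation_pairs():
    """Return every 'first.second' label built from two three-letter codes."""
    codes = transformation_codes()
    return [f"{first}.{second}" for first, second in product(codes, codes)]


codes = transformation_codes()
pairs = transformation_pairs()
print(len(codes), len(pairs))   # 20 400
print(pairs[0], pairs[-1])      # anc.anc ntn.ntn
```

Under these assumptions there are 20 codes per set and 400 pairings, which is why the histograms are split over several figures.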
We remark that binary transformations (b) tend to perform worse than the others. Likewise, the inverse document frequency (t) does not appear to be as crucial as we might expect given its wide use in information retrieval. Further, normalizing the scores (c) did not show any improvement. The remaining transformations seem to perform equally well, with the exception of the tc-transformation: when a tc-transformation is applied to the training set, we generally observe underperforming nearest neighbor classifiers. A possible explanation is a software issue in the normalization of the training-set scores. In our analyses, we avoided this type of transformation.
128 Appendices
Table B.1: The feature space transformations are defined in the libbow library by combinations of three letters that refer to the term frequency, the inverse document frequency and the normalization. Recall that $x_{ij}$ is the frequency of the word $j$ in the document $i$. This table summarizes the different combinations.

Letter | Name | Description | Formula

Term frequency (tf)
n | none | Raw frequencies | $tf(x_{ij}) = x_{ij}$
b | binary | Binarize the raw frequencies | $tf(x_{ij}) = 1$ if $x_{ij} \geq 1$, $0$ otherwise
m | max-norm | Normalize $x_{ij}$ relative to the maximum term frequency observed in document $i$ | $tf(x_{ij}) = \frac{x_{ij}}{\max_j x_{ij}}$
a | augmented norm | Similar to the max-norm but with $\frac{1}{2}$ added | $tf(x_{ij}) = \frac{1}{2} + \frac{x_{ij}}{2 \max_j x_{ij}}$
l | log | Logarithm of the term frequency | $tf(x_{ij}) = 1 + \log(x_{ij})$

Inverse document frequency (idf)
n | none | idf is not used | $idf(x_{ij}) = 1$
t | idf | Inverse of the document frequency of the term in the database, which has $N$ documents | $idf(x_{ij}) = \log\left(\frac{N}{df(x_{ij})}\right)$

Normalization
n | none | Normalization is not used | $\phi(x_{ij}) = tf(x_{ij})\, idf(x_{ij})$
c | cosine | Apply a cosine normalization | $\phi(x_{ij}) = \frac{tf(x_{ij})\, idf(x_{ij})}{\sqrt{\sum_j \left(tf(x_{ij})\, idf(x_{ij})\right)^2}}$
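The formulas of Table B.1 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the libbow implementation: the function name `weight_documents` is hypothetical, the document frequencies are computed on the same collection that is being weighted, and terms with $x_{ij} = 0$ are given a weight of 0 in every scheme (the table does not say how the log and augmented-norm cases treat absent terms).

```python
import math


def weight_documents(counts, code):
    """Apply a three-letter transformation (e.g. 'ltc') to raw counts.

    counts: list of documents, each a list of raw term frequencies x_ij.
    code:   tf letter (n, b, m, a, l) + idf letter (n, t) + norm letter (n, c).
    """
    tf_letter, idf_letter, norm_letter = code
    if tf_letter not in "nbmal" or idf_letter not in "nt" or norm_letter not in "nc":
        raise ValueError(f"unknown transformation code: {code!r}")

    n_docs = len(counts)
    n_terms = len(counts[0])
    # df[j]: number of documents in which term j occurs at least once.
    df = [sum(1 for doc in counts if doc[j] > 0) for j in range(n_terms)]

    def tf(x, x_max):
        if x == 0:                      # assumption: absent terms weigh 0
            return 0.0
        if tf_letter == "n":
            return float(x)
        if tf_letter == "b":
            return 1.0
        if tf_letter == "m":
            return x / x_max
        if tf_letter == "a":
            return 0.5 + x / (2.0 * x_max)
        return 1.0 + math.log(x)        # tf_letter == "l"

    def idf(j):
        if idf_letter == "n" or df[j] == 0:
            return 1.0 if idf_letter == "n" else 0.0
        return math.log(n_docs / df[j])

    weighted = []
    for doc in counts:
        x_max = max(doc)
        row = [tf(x, x_max) * idf(j) for j, x in enumerate(doc)]
        if norm_letter == "c":
            length = math.sqrt(sum(w * w for w in row))
            if length > 0:              # leave all-zero documents untouched
                row = [w / length for w in row]
        weighted.append(row)
    return weighted


counts = [[2, 0, 1], [0, 1, 1]]           # two toy documents, three terms
print(weight_documents(counts, "mnn"))    # [[1.0, 0.0, 0.5], [0.0, 1.0, 1.0]]
```

In this toy collection the third term occurs in both documents, so with the t option its idf is $\log(2/2) = 0$ and it drops out of the weighted vectors entirely.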
[Two histogram panels, one bar per transformation pair, with a count axis running from 0 to 300: the first panel covers anc.anc to atn.ntn, the second bnc.anc to btn.ntn.]
Figure B.1: Counts of pairwise wins for each transformation, from anc.anc to btn.ntn.
[Two histogram panels, one bar per transformation pair, with a count axis running from 0 to 300: the first panel covers lnc.anc to ltn.ntn, the second mnc.anc to mtn.ntn.]
Figure B.2: Counts of pairwise wins for each transformation, from lnc.anc to mtn.ntn.
[Histogram panel, one bar per transformation pair, with a count axis running from 0 to 300: the panel covers nnc.anc to ntn.nnc.]