Combatting illegal timber trade using chemical fingerprints: the power of mathematics and mass spectrometry


word count: 25.826

Bianca De Saedeleer

Student ID: 01409888

Supervisors: prof. dr. ir. Jan Van den Bulcke, prof. dr. Willem Waegeman

Tutors: dr. ir. Victor Deklerck, ir. Thomas Mortier

A dissertation submitted to Ghent University in partial fulfilment of the requirements for the degree of Master of Science in Bioinformatics: Bioscience Engineering


PREFACE

This dissertation was carried out as part of a collaboration between the Laboratory of Wood Technology of the Department of Environment and the KERMIT laboratory of the Department of Data Analysis and Mathematical Modelling. The project builds on previous work by Victor Deklerck and Nathalie Goeders. The non-confidential data were provided by the Laboratory of Wood Technology of the Department of Environment.

A couple of days before the start of the academic year, I impulsively changed course and opted for the Master of Science in Bioinformatics. It was quite a journey and I discovered a whole new world, especially during the Predictive Modelling course taught by prof. dr. Willem Waegeman. The well-guided practical sessions by dr. Peter Rubbens and ir. Jim Clauwaert really sparked my interest and led me to choose this dissertation topic.

I was able to complete this dissertation successfully thanks to a number of people. First, my promoters prof. dr. ir. Jan Van den Bulcke and prof. dr. Willem Waegeman, who offered me this opportunity and always kept up the good spirit and enthusiasm. I am also grateful to my tutors dr. Victor Deklerck and Thomas Mortier, who were always very responsive and provided me with useful and extensive answers, guidance and opinions.

On a more personal level, I thank my parents for allowing me to broaden my horizons at university. They have always emphasized the importance of education and have never stopped believing in me. I would also like to thank my friends for the good laughs, dinners and occasional parties, which kept my motivation up. And of course, thank you Fréderic, I cannot imagine how this year would have been without your unconditional love and support. You are frankly my biggest fan sitting in the front row.

Bianca De Saedeleer June, 2020

COVID-19 PANDEMIC

All data were available from the start of this study. Since this dissertation focuses on the machine learning aspects and no additional lab work was involved, the COVID-19 pandemic did not affect its further progress. All meetings with the promoters and tutors, as well as the final defense, were held online. Given the ongoing circumstances, this dissertation will only be submitted digitally.

CONTENTS

Preface

COVID-19 pandemic

Contents

Summary

Dutch summary (Nederlandse samenvatting)

1 Introduction

1.1 Timber trade

1.1.1 Tropical wood, widely coveted

1.1.2 Overexploitation and international efforts

1.2 Wood identification techniques

1.2.1 Visual methods

1.2.2 Genetic methods

1.2.3 Chemical methods

1.3 Summary of wood identification in practice

1.4 Objective and outline of this dissertation

2 Machine learning

2.1 Introduction

2.2 Supervised learning

2.2.1 Parametric and non-parametric models

2.2.2 Bias-variance trade-off

2.2.3 Resampling methods

2.2.4 Class imbalance problem

2.3 Classifiers for supervised learning

2.3.2 k-Nearest Neighbors

2.4 Deep learning

2.4.1 Structure of a feed-forward neural network

2.4.2 Convolutional Neural Networks

2.4.3 Types of layers

2.4.4 Training of a neural network

2.4.5 Deep learning for wood identification

2.5 Model performance assessment

2.5.1 Precision, recall, F1-score and accuracy

2.5.2 Area Under the ROC Curve

2.6 Hierarchical classification

2.6.1 Local hierarchical classification

2.6.2 Global hierarchical classification

2.6.3 Evaluation metrics for hierarchical classification

3 Data and methods

3.1 Data collection and description

3.2 Post-processing of the spectra

3.3 Flat classification

3.3.1 Random forests

3.3.2 1D Convolutional Neural Networks

3.4.1 Random forests

3.4.2 1D Convolutional Neural Networks

3.5 Reflection of the taxonomy within spectra

4 Results and discussion

4.1 Flat classification

4.3 General comparison of the performances

4.4 Interpretation of the confusion matrices

4.5 Reflection of the taxonomy within spectra

4.5.1 Evaluation of the distance matrices

Bibliography

Appendix A Software and hardware specifications

Appendix B Additional results – Classification

B.1 Adjustment on decision threshold flat RF model

B.2 Confusion matrices flat 1D CNN model

B.3 Confusion matrices hierarchical 1D CNN model

Appendix C Additional results – Taxonomy

C.1 Hamming distance matrices

C.2 Cosine distance matrices

C.3 Confusion matrices of k-NN classification


1.1 Proportion and distribution of global forest area by climatic domain in 2020. Forests cover 4.06 billion hectares globally, of which 45% is tropical forest (FAO, 2020).

1.2 DART TOFMS spectra of Pericopsis angolensis (Baker) Meeuwen (left) and Pericopsis elata (Harms) Meeuwen (right). The m/z ratios of the ion fragments present in the sample are shown on the x-axis, with their corresponding relative intensities on the y-axis.

1.3 NIRS spectrum of Douglas fir, showing the absorbance (y-axis) at each wavenumber (cm⁻¹) (x-axis) (Tsuchikawa and Kobori, 2015).

2.1 Value of the training and test error (y-axis) depending on the model complexity (x-axis). Less complex models tend to underfit, which results in a high training and test error. More complex models might overfit and have a low training but high test error (Hastie et al., 2009).

2.2 Overview of the nested CV strategy (Waegeman, 2019).

2.3 Feed-forward neural network consisting of an input layer, one hidden layer and one output layer. Neurons are visualized as nodes, weights as connections and the bias terms by the links originating from the filled nodes (Bishop, 2006).

2.4 Logistic sigmoid (left), hyperbolic tangent (middle) and ReLU (right) activation functions.

2.5 Visualization of the structure of a CNN for image recognition (Saha, 2018).

2.6 Convolution operation with a kernel size of three and a stride of one on one-dimensional data, followed by a pooling operation with a pooling size of three (Verma, 2020).


2.7 Confusion matrix at family level by a certain model. For example, 65% (first row, first column) of the observations of Irvingiaceae are correctly predicted (TPs), 35% (first row, tenth column) is incorrectly assigned to Fabaceae (FNs) and 5% (tenth row, first column) of the samples belonging to Sapotaceae are misclassified as Irvingiaceae (FPs).

2.8 ROC curve showing an AUC of 0.842, with the TPR on the y-axis and FPR on the x-axis. For an FPR of 0.1, thus maintaining 90% specificity, a TPR of 0.646 is obtained.

2.9 Visualization of a part of the hierarchy present in the dataset, as a tree (left) and as a DAG (right). Node X is the root node. The internal nodes labeled as Moraceae, Olaceae and Meliaceae represent the families. The corresponding genera are Antiaris, Milicia, Ongokea and Khaya. The leaf nodes are labeled by the species names.

2.10 Conceptual visualization of the LCL approach (left) and the LCN approach (right), where classifiers are incorporated at each level or at each node, respectively (dashed boxes) (Silla and Freitas, 2011).

2.11 LCPN approach, where each parent node holds a classifier (dashed boxes) (Silla and Freitas, 2011).

3.1 Distribution of the samples at family (left) and genus (right) level. The families are ordered according to the phylogenetic tree, for example, members belonging to Irvingiaceae are more closely related to the Ochnaceae than to the Fabaceae family (NCBI, 2020). Genera are grouped alphabetically with respect to their family membership, which is indicated by the corresponding colors of the labels.

3.2 Distribution of the samples at species level. Species are grouped alphabetically according to their genus. The colors of their labels correspond to their family membership, visualized in Figure 3.1.

4.1 Confusion matrix at family level, obtained by using the finetuned RF model. The y-axis represents the true labels and on the x-axis the predictions made by the model are displayed. Labels are ordered based on their taxonomic proximity.


y-axis represents the true labels and on the x-axis the predictions made by the model are displayed. Labels are ordered based on their taxonomic proximity. Their colors correspond to the family to which they belong, as visualized in Figure 4.1.

4.3 Confusion matrix at species level, obtained by using the flat RF model. The y-axis represents the true labels and on the x-axis the predictions made by the model are displayed. Labels are ordered based on their taxonomic proximity. Their colors correspond to the family to which they belong, as visualized in Figure 4.1.

4.4 Confusion matrix at family level, obtained by using the finetuned RF model. The y-axis represents the true labels and on the x-axis the predictions made by the model are displayed. Labels are ordered based on their taxonomic proximity.

4.5 Confusion matrix at genus level, obtained by using the finetuned hierarchical RF model. The y-axis represents the true labels and on the x-axis the predictions made by the model are displayed. Labels are ordered based on their taxonomic proximity. Their colors correspond to the family to which they belong, as visualized in Figure 4.4.

4.6 Confusion matrix at species level, obtained by using the finetuned hierarchical RF model. The y-axis represents the true labels and on the x-axis the predictions made by the model are displayed. Labels are ordered based on their taxonomic proximity. Their colors correspond to the family to which they belong, as visualized in Figure 4.4.

4.7 Heatmap visualizing the distance matrix between every two samples based on the Euclidean distance. The obtained distances are mapped to the interval [0, 1] for ease of interpretation.

4.8 Heatmap visualizing the distance matrix between every two samples based on the Manhattan distance. The obtained distances are mapped to the interval [0, 1] for ease of interpretation.

4.9 Heatmap visualizing the distance matrix between every two samples based on the Hamming distance. Before calculating the Hamming distance, the spectra were binarized as explained in Chapter 3. For this figure, a cut-off of 40 was considered. Afterwards, distances are mapped within the range from zero to one.


4.10 Heatmap visualizing the distance matrix between every two samples based on the Cosine distance.

B.1 Confusion matrix at family level, obtained by using the 1D CNN model. The y-axis represents the true labels and on the x-axis the predictions made by the model are displayed. Labels are ordered based on their taxonomic proximity.

B.2 Confusion matrix at genus level, obtained by using the flat 1D CNN model. The y-axis represents the true labels and on the x-axis the predictions made by the model are displayed. Labels are ordered based on their taxonomic proximity. Their colors correspond to the family to which they belong, as visualized in Figure B.1.

B.3 Confusion matrix of the flat classification on the species level, obtained by using the flat 1D CNN model. The y-axis represents the true labels and on the x-axis the predictions made by the model are displayed. Labels are ordered based on their taxonomic proximity. Their colors correspond to the family to which they belong, as visualized in Figure B.1.

B.4 Confusion matrix at family level, obtained by using the 1D CNN model. The y-axis represents the true labels and on the x-axis the predictions made by the model are displayed. Labels are ordered based on their taxonomic proximity.

B.5 Confusion matrix at genus level, obtained by using the hierarchical 1D CNN model. The y-axis represents the true labels and on the x-axis the predictions made by the model are displayed. Labels are ordered based on their taxonomic proximity. Their colors correspond to the family to which they belong, as visualized in Figure B.4.

B.6 Confusion matrix at species level, obtained by using the hierarchical 1D CNN model. The y-axis represents the true labels and on the x-axis the predictions made by the model are displayed. Labels are ordered based on their taxonomic proximity. Their colors correspond to the family to which they belong, as visualized in Figure B.4.

C.1 Heatmap visualizing the distance matrix between every two samples based on the Hamming distance. Before calculating the Hamming distance, the spectra were binarized as explained in Chapter 3. For this figure, a cut-off of 0 was considered. Afterwards, distances are mapped within the range from zero to one.


based on the Hamming distance. Before calculating the Hamming distance, the spectra were binarized as explained in Chapter 3. For this figure, a cut-off of 50 was considered. Afterwards, distances are mapped within the range from zero to one.

C.3 Heatmap visualizing the distance matrix between every two samples based on the Hamming distance. Before calculating the Hamming distance, the spectra were binarized as explained in Chapter 3. For this figure, a cut-off of 60 was considered. Afterwards, distances are mapped within the range from zero to one.

C.4 Heatmap visualizing the distance matrix between every two samples based on the Cosine distance. The files used were binned at 5 mmu with a 5% threshold.

C.5 Heatmap visualizing the distance matrix between every two samples based on the Cosine distance. The files used were binned at 50 mmu with a 5% threshold.

C.8 Confusion matrix at family level, obtained by using a k-NN classifier based on the Cosine distance. The y-axis has the true labels and on the x-axis the predictions made by the model are displayed. Labels are ordered based on their taxonomic proximity.

C.9 Confusion matrix at genus level, obtained by using a k-NN classifier based on the Cosine distance. The y-axis has the true labels and on the x-axis the predictions made by the model are displayed. Labels are ordered based on their taxonomic proximity. Their colors correspond to the family to which they belong, as visualized in Figure C.8.


C.10 Confusion matrix at species level, obtained by using a k-NN classifier based on the Cosine distance. The y-axis has the true labels and on the x-axis the predictions made by the model are displayed. Labels are ordered based on their taxonomic proximity. Their colors correspond to the family to which they belong, as visualized in Figure C.8.

D.1 Distribution of the samples at family level. The families are ordered according to their phylogeny.

D.2 Distribution of samples at genus level. Genera are ordered alphabetically and based on their phylogeny. The colors of their labels correspond to their family membership, visualized in Figure D.1.

D.3 Distribution of samples at species level. Species are grouped alphabetically with respect to their genus. The colors of their labels correspond to their family membership, visualized in Figure D.1.


1.1 Overview of GC-MS, MALDI-TOFMS and Raman spectroscopy.

2.1 Confusion matrix for a binary case and overview of several metrics.

3.1 Overview of all families, genera and species included in the dataset.

3.2 Specifications of the used RF classifiers.

3.3 Specifications of the used RF classifiers in the hierarchical models.

4.1 Accuracies ± standard deviation (%) obtained by several models in flat classification, specified at family, genus and species level. Results reported by Goeders (2019) (Table 3.4 in her work) are shown in the first row.

4.2 F1-scores ± standard deviation (%) obtained by several models in flat classification, specified at family, genus and species level.

4.3 Accuracies ± standard deviation (%) obtained by several models in hierarchical classification, specified at family, genus and species level.

4.4 F1-scores ± standard deviation (%) obtained by several models in hierarchical classification, specified at family, genus and species level.

4.5 Accuracies ± standard deviation (%) obtained by k-NN based on either Euclidean, Manhattan or Cosine distance, specified at family, genus and species level.

4.6 F1-scores ± standard deviation (%) obtained by k-NN based on either Euclidean, Manhattan or Cosine distance, specified at family, genus and species level.

SUMMARY

Over the past 30 years, 420 million ha of forest has disappeared through deforestation. Tropical timbers in particular, originating from the Amazon, Central Africa and Southeast Asia, are widely coveted. Despite several international efforts imposing laws and regulations, 30 to 90% of tropical timber is illegally harvested. Reliable wood identification plays a key role in combatting illegal timber trade and is mainly performed by means of wood anatomy, genomic approaches or chemical fingerprints from Direct Analysis in Real Time Time-Of-Flight Mass Spectrometry (DART TOFMS). The latter is an interesting tool for obtaining reproducible classification in a fast and cost-effective manner. In this dissertation, DART TOFMS spectra from 49 timber species are classified using Random Forests (RF) and One-Dimensional Convolutional Neural Networks (1D CNNs). Subsequently, the idea of incorporating their taxonomic dependencies is explored by means of hierarchical classification. The best predictive performance is obtained using RF in a flat classification setting, with accuracies of 87.34%, 83.41% and 78.76% at family, genus and species level, respectively. Further analysis of several distance metrics (Euclidean, Manhattan, Hamming and Cosine distance), and of k-Nearest Neighbors (k-NN) classification based on the same distance measures, reveals that the phylogenetic differences between tree species are less clearly pronounced and captured in DART TOFMS spectra than initially expected. Therefore, hierarchical classification does not lead to performance improvement.

Keywords: timber identification, DART TOFMS, hierarchical classification, machine

DUTCH SUMMARY (NEDERLANDSE SAMENVATTING)

Over the past 30 years, 420 million hectares of forest have disappeared through deforestation. Tropical timber in particular, originating from the Amazon, Central Africa and Southeast Asia, is highly sought after. Despite various international efforts imposing laws and regulations, 30 to 90% of this timber is illegally harvested. Reliable wood identification is key in combatting the illegal timber trade and is mainly performed by means of wood anatomy, genetic techniques or chemical fingerprints obtained via Direct Analysis in Real Time Time-Of-Flight Mass Spectrometry (DART TOFMS). The latter is an interesting tool for obtaining reproducible classification in a fast and cost-effective manner. In this thesis, DART TOFMS spectra of 49 timber species are classified using Random Forests (RF) and One-Dimensional Convolutional Neural Networks (1D CNNs). Subsequently, the idea of taking their taxonomic dependencies into account is investigated by means of hierarchical classification. RF in the conventional flat classification setting gives the best performance. Accuracies of 87.34%, 83.41% and 78.76% are obtained at family, genus and species level, respectively. Further analysis based on several distance measures (Euclidean, Manhattan, Hamming and Cosine distance) and k-Nearest Neighbors (k-NN) classification based on those same distances shows that the phylogenetic differences between tree species are less clearly pronounced and represented in DART TOFMS spectra than initially expected. Therefore, hierarchical classification does not lead to improved performance.

Keywords: timber identification, DART TOFMS, hierarchical classification, machine


ANN Artificial Neural Network.

ANNs Artificial Neural Networks.

AUC Area Under the ROC Curve.

BOLD Barcode of Life Data System.

BSLL Binarized Structured Label Learning.

CBOL Consortium for the Barcode of Life.

CITES Convention on International Trade in Endangered Species of Wild Fauna and Flora.

CNN Convolutional Neural Network.

CNNs Convolutional Neural Networks.

CV cross-validation.

DAG Directed Acyclic Graph.

DART-HRMS Direct Analysis in Real Time High-Resolution Mass Spectrometry.

DART TOFMS Direct Analysis in Real Time Time-Of-Flight Mass Spectrometry.

DL deep learning.

DTW dynamic time warping.

EU European Union.

EUTR EU Timber Regulation.

FLEGT Forest Law Enforcement, Governance and Trade.

FN False Negative.

FP False Positive.

FPR False Positive Rate.

FPs False Positives.

GC-MS Gas Chromatography Mass Spectrometry.

GMNB Global Model Naive Bayes.

GTTN Global Timber Tracking Network.

hF hierarchical F-measure.

hP hierarchical Precision.

hR hierarchical Recall.

IAWA International Association of Wood Anatomists.

ITS Internal Transcribed Spacer.

IUCN International Union for Conservation of Nature.

k-NN k-Nearest Neighbors.

KDA Kernel Discriminant Analysis.

LCL Local Classifier per Level.

LCN Local Classifier per Node.

LCPN Local Classifier per Parent Node.

LDA Linear Discriminant Analysis.

LR Logistic Regression.

MALDI-TOFMS Matrix-Assisted Laser Desorption/Ionization Time-Of-Flight Mass Spectrometry.

ML machine learning.

MLP Multi-Layer Perceptron.

MS Mass Spectrometry.

NB Naive Bayes.

NLP Natural Language Processing.

PCA Principal Component Analysis.

PCR Principal Component Regression.

PLS Partial Least Squares.

PLS-DA Partial Least Squares Discriminant Analysis.

ReLU Rectified Linear Unit.

RF Random Forests.

ROC Receiver Operating Characteristic.

SGD Stochastic Gradient Descent.

SIMCA Soft Independent Modeling of Class Analogy.

SNPs Single Nucleotide Polymorphisms.

SVM Support Vector Machine.

SVMs Support Vector Machines.

TN True Negative.

TNs True Negatives.

TOF Time-Of-Flight.

TP True Positive.

TPR True Positive Rate.

TPs True Positives.

U.S. United States.

UNODC United Nations Office on Drugs and Crime.

VOCs Volatile Organic Compounds.


INTRODUCTION

This first chapter introduces the reader to the ongoing illegal timber trade, its accompanying consequences and the current legal actions. Subsequently, an overview of the possible wood identification techniques is provided, mainly focussing on the use of chemical spectra.

1.1 Timber trade

1.1.1 Tropical wood, widely coveted

Tropical timbers, originating from the Amazon, Central Africa and Southeast Asia (dark green areas in Figure 1.1), are popular for all kinds of purposes, for example musical instruments, furniture, household items and parquetry. The increasing demand for these timbers has led to overexploitation of several tree species worldwide (Lancaster and Espinoza, 2012). Over the past 30 years, 420 million ha of forest has disappeared through deforestation (FAO, 2020). Commonly traded timbers are hardwoods such as Pericopsis elata (Harms) Meeuwen, Tectona grandis L.f., Swietenia spp., Diospyros spp., and Dalbergia spp. (UNEP-WCMC, 2013). The latter, also called rosewoods, include economically important species such as African blackwood (D. melanoxylon Guill. & Perr.), Brazilian rosewood (D. nigra (Vell.) Benth.) and Thailand rosewood (D. cochinchinensis Pierre) (Hartvig et al., 2015). When the globally imported and exported volumes of wood are compared, discrepancies are observed. These can be due to unintentional factors, such as unit conversions, different product classification strategies or insufficient management and communication systems. They can also be caused by intentional factors, which are considered illegal logging. Various practices occur, such as declarations with fraudulent paperwork claiming an incorrect geographic provenance, misreporting of product volumes, taxonomic misclassification, mixing of multiple lookalike timbers and/or timbers sourced from different origins, misdeclaration of the product type (e.g. solid wood vs. particleboard), etc. (Vlam et al., 2018; Liu et al., 2020). Illegal trade that bypasses legal checkpoints generally remains unreported. All of this leads to a distorted view of the timber trade, making it difficult to adjust regulations and policies accordingly (Liu et al., 2020). It is estimated that 30 to 90% of timber from the tropics is illegally harvested (Deklerck, 2019; Vlam et al., 2018; Lowe et al., 2016; Hirschberger, 2008; Hoare, 2015; INTERPOL, 2006; Nellemann, 2014). Wood sourcing can be legal and sustainable if it does not exert pressure on the ecosystem. The sustainable yield will be surpassed if no precautions are taken and the amount of wood logged exceeds growth (Deklerck, 2019). This leads to detrimental effects in the exporting countries at an economic, social and ecological level, such as tax evasion, depletion, endangerment and eventually (near-)extinction of highly desired species, aridification of the landscape, declining soil fertility and a decrease in global biodiversity (invasion of non-native and loss of native species) (Deklerck et al., 2019; Dormontt et al., 2015; Vlam et al., 2018; Barrett et al., 2010; Lowe et al., 2016; Deklerck, 2019; Jansen and Zuidema, 2001). It also counteracts timber stock management and leads to unrealistic market prices (Dormontt et al., 2015; Eberhardt, 2013; Barrett et al., 2010; Liu et al., 2020). In 2020, 726 million ha of forest is protected globally (FAO, 2020), and several laws and regulations have been established to support the prosecution of illegal trade; these are discussed in the next section.

Figure 1.1: Proportion and distribution of global forest area by climatic domain in 2020. Forests cover 4.06 billion hectares globally, of which 45% is tropical forest (FAO, 2020).

1.1.2 Overexploitation and international efforts

As seen in Figure 1.1, tropical forests are spread all over the world. Timber trade is therefore a global concern that demands international supervision. Several efforts have been, and should continue to be, made to combat unsustainable deforestation and control timber trade. The Convention on International Trade in Endangered Species of Wild Fauna and Flora (CITES) is an international legal system, officially implemented in 1975, which lists endangered plant and animal species in three appendices. Appendix I lists species that are almost extinct and for which international trade for commercial purposes is illegal. Exceptions can be made, for example for scientific research. Appendix II concerns taxa that are threatened with future extinction and their lookalikes. Trade in these species is permitted, but should not compromise their existence. Appendix III contains species for which controlled trade is requested by at least one country (CITES, 1975a). Although 183 parties had already signed this agreement by 2019 (CITES, 1975b), more than 10% of the global timber trade is still illegal and 80% of the wood from the Brazilian Amazon is illegally logged (Lancaster and Espinoza, 2012; Eberhardt, 2013). In order to keep track of the endangerment of species (animals as well as plants), the International Union for Conservation of Nature (IUCN) has compiled a Red List that assesses information about the range, population size, habitat and ecology, use and/or trade, threats, and conservation actions (IUCN, 2019). This list serves as a tool for further conservation and policy decisions concerning threatened species and the areas that should be protected. However, the listing is not always up to date and underestimates the risk of extinction (Deklerck, 2019; Ocampo-Peñuela et al., 2016). For example, Pterocarpus tinctorius Welw. was assessed as "least concern" in the IUCN Red List on 29/09/2017 and has not been updated since, while this species has been listed in CITES Appendix II since 26/11/2019 (Barstow, 2018).

Importing countries can enforce laws themselves and impose penalties to enforce forest legislation. In 2003, the Forest Law Enforcement, Governance and Trade (FLEGT) Action Plan was established by the EU, which binds its members to only import timber from countries that agree to the bilateral Voluntary Partnership Agreements (VPAs). In addition, the EU Timber Regulation (EUTR) was implemented in 2013 and applies to all EU Member States. This regulation enforces the import of legally sourced timber and timber products, and holds importers and suppliers responsible (Lowe et al., 2016; Institute, 2020). Other acts are, for example, the Canadian Wild Animal and Plant Protection and Regulation of International and Interprovincial Trade Act (1992), U.S. Lacey Act (2008), Australian Illegal Logging Prohibition Act (2012) and Japan’s Act on Promotion of Distribution and Use of Legally Logged Wood Products (2016) (Dormontt et al., 2015; Ravindran et al., 2018; Lowe et al., 2016; Eberhardt, 2013; Deklerck, 2019). Despite all efforts, the effectiveness of these separate laws should be questioned (Eberhardt, 2013). First of all, each legislation asks for a different declaration of every imported wood product. Since most laws are specified to the species level, it would be beneficial if all parties requested the same taxonomic identification (Lowe et al., 2016). This also applies to the specification of the geographic origin: the U.S. and Canada demand the country of harvest, while the EU and Australia require the concession and region of origin (Lowe et al., 2016). Furthermore, as mentioned before, some exporters tend to misreport or misclassify their goods in order to circumvent the regulations (Liu et al., 2020). From another point of view, there is a gap between conservation science and conservation action. The numerous research reports and monitoring efforts do not lead to changes in practical management or policies. In addition, problems discussed in conservation science do not always coincide with those in practice, because economics and politics are involved as well (Habel et al., 2013; Cunningham et al., 2016; Knight et al., 2008). Therefore, an international law system based on a transdisciplinary approach is needed in order to effectively combat the illegal timber trade on a global scale (Eberhardt, 2013; Habel et al., 2013).

1.2 Wood identification techniques

In order to uncover corrupt export permits, wood products should be correctly taxonomically identified and compared to their accompanying paperwork (Deklerck et al., 2019). This can be done by means of their inherent wood characteristics, such as anatomy, genetics and chemical composition. Possible identification techniques are categorized according to these three groups (Dormontt et al., 2015). In addition, one can combine these methods with machine learning (ML) techniques and extensive databases containing qualitative reference data (Fan et al., 2019), in order to classify samples in a correct and consistent manner. Geographical identification of imported items is also crucial, especially because the global trade market makes the traceability of products more difficult. For instance, countries with cheap production costs, e.g. China, use and combine imported wood from different origins for the manufacturing of items. The finished products are exported globally, without mentioning the provenances of the materials (Gasson, 2011; Liu et al., 2020). However, techniques for provenance determination will not be discussed, as these are out of scope for this dissertation.

1.2.1 Visual methods

The first group of identification techniques involves visual identification methods, including wood anatomy. Macroscopic features are examined using hand lenses, specialized identification keys, atlases of woods, and field manuals. Microscopic anatomy is studied by means of anatomical slides and optical light microscopes (Dormontt et al., 2015). Several microscopic characteristics are listed in the International Association of Wood Anatomists (IAWA) list (Wheeler et al., 1989), which are used in the extensive online database InsideWood (InsideWood, 2004) in order to obtain a description of a species. However, taxonomic identification by means of wood anatomy mostly allows identification up to genus level, while it is more difficult to properly identify the species as well, which is problematic for lookalikes (Deklerck et al., 2017; Espinoza et al., 2015). Moreover, performing such classifications is time-consuming (it can take up to several days) and requires highly trained staff (Dormontt et al., 2015). Shifting from manual identification to computer vision speeds up the process and provides more reliable and reproducible results, as it reduces the influence of human error and subjectivity (Hermanson and Wiedenhoeft, 2011). Many studies have focused on automated identification of images from wood samples; these are discussed further in Chapter 2, Section 2.4.5. Anatomical features can also be quantitatively measured. He et al. (2020) identified Swietenia macrophylla King, Swietenia mahagoni (L.) Jacq., and Swietenia humilis Zucc. based on nine characteristics (e.g. vessel element length, fiber length, vessel frequency, ray height and width). Several models were considered in the study, such as decision trees, the Naive Bayes (NB) classifier, Support Vector Machines (SVMs), and Artificial Neural Networks (ANNs). Optimal performance was obtained using an SVM, with 91.4% accuracy.

1.2.2 Genetic methods

Genetic methods for timber identification include DNA barcoding and DNA fingerprinting. Due to the inherited genetic code, closely related individuals show similarities in their genetics and gene frequency (Dormontt et al., 2015). In DNA barcoding, certain short universal DNA sequences in individuals are compared. Differences between these specific gene regions enable distinction between species. For trees, the highly variable Internal Transcribed Spacer (ITS) regions are mostly used. Other coding regions can be used as barcodes as well; these are listed by the Plant Working Group of the Consortium for the Barcode of Life (CBOL). The Barcode of Life Data System (BOLD) is a database containing 7.68 million barcodes, of which 68 000 originate from plant species (BOLDSYSTEMS, 2019). With solid, comprehensive reference databases and the appropriate combination of barcodes, this technique can be fast and reproducible (Dormontt et al., 2015; Hartvig et al., 2015). Hartvig et al. (2015) managed to distinguish 31 Dalbergia species using the rbcL and matK regions and ITS markers at a reasonable accuracy rate of 86%. However, for plants, several specific regions need to be combined in order to obtain sufficient results. Moreover, this technique lacks discriminative power for recently diverged species, making DNA barcoding rather a supporting tool (Hartvig et al., 2015). In addition, reference databases are still incomplete and too confined, and an international effort is needed to obtain the desired database structure (Gasson, 2011).

DNA fingerprinting can be used to verify whether a sample originates from a certain individual. It makes use of genetic markers such as microsatellites and SNPs. These markers differ between individuals, while showing small population variations, and enable the discrimination between individual samples. A reference is matched to a certain sample based on the probability that the sample relates to the reference. Fingerprints of samples originating from the same specimen will correspond to each other. When multiple samples from the same product are taken at different time points, the global supply chain could be tracked and visualized by linking the obtained fingerprints (Dormontt et al., 2015). Reliable and specific population databases, containing all possible individuals, are needed in order to obtain justifiable evidence in forensic cases (Tnah et al., 2010). DNA profiles are already extensively used for paternity testing, at crime scenes to identify offenders and victims, and for the identification of endangered and bred animal species. Fingerprints have been developed for several tree species, for example, Neobalanocarpus heimii (King) P.S. Ashton (Tnah et al., 2010), Erythrophleum suaveolens (Guill. & Perr.) Brenan, E. ivorense A.Chev. (Vlam et al., 2018), and the Pterocarpus species (Jiao et al., 2018).

Although these techniques give accurate results, it is sometimes not feasible to extract a sufficient amount of qualitative DNA in order to perform an accurate identification (Dormontt et al., 2015). Both the yield of DNA extraction and the fragment length that can be sequenced decay over time, which is problematic in the context of heartwood samples (Hartvig et al., 2015). Moreover, for forensic cases, all markers should be characterized and validated, and standard quality checks on the datasets are required as well (Vlam et al., 2018). In addition, sequencing is time-consuming and expensive (Deklerck et al., 2019; Dormontt et al., 2015; Hermanson and Wiedenhoeft, 2011).

1.2.3 Chemical methods

Wood samples can also be taxonomically identified by means of their chemical properties. Information on these characteristics is obtained using Mass Spectrometry (MS) and other spectroscopy methods. MS is based on the ionization of molecules present in a sample, resulting in chemical profiles (Dormontt et al., 2015). Oliver et al. (1998) introduced the term metabolome, referring to the set of metabolites that are synthesized by an organism. Spectroscopy techniques retrieve information from the absorption and emission of photons by the specimen (Tsuchikawa and Kobori, 2015). The chemical composition can vary between different tree species, which results in diverse chemical fingerprints. These dissimilarities can be used as discriminating features for taxonomic identification (Deklerck et al., 2019). Furthermore, the chemical properties can also provide complementary information to the observed anatomical characteristics or genetic information, in order to distinguish closely related species (Dormontt et al., 2015; Espinoza et al., 2015).

Exploiting the chemical characteristics of living organisms for identification purposes is not a new concept and has been extensively carried out in diverse fields (Musah et al., 2015). Several techniques exist that provide valuable spectral data, such as Gas Chromatography Mass Spectrometry (GC-MS), Matrix-Assisted Laser Desorption/Ionization Time-Of-Flight Mass Spectrometry (MALDI-TOFMS) and Raman spectroscopy. An overview of these methods is presented in Table 1.1. In this dissertation, Direct Analysis in Real Time Time-Of-Flight Mass Spectrometry (DART TOFMS) spectra of wood slivers will be considered as chemical fingerprints in order to identify the taxonomy of each sample. Further discussion on this technique and Near Infrared Spectroscopy (NIRS) is found in the remaining part of this section.

Table 1.1: Overview of GC-MS, MALDI-TOFMS and Raman spectroscopy. Numbers in square brackets refer to the references listed below the table.

Procedure
    GC-MS: Ionized components elute at different time points.
    MALDI-TOFMS: Cocrystallization and absorption (UV, 266 or 337 nm); acceleration of ions proportionate to m/z value.
    Raman spectroscopy: Based on vibrational information; monochromatic light (IR) from laser; Raman or inelastic scattering of ion; molecule-specific frequency difference.

Obtained spectrum
    GC-MS: Relative intensities in function of the m/z values of ion fragments.
    MALDI-TOFMS: Relative intensities in function of the m/z values of ion fragments.
    Raman spectroscopy: Relative intensities in function of the Raman shift (cm⁻¹).

Time per sample
    GC-MS: 90 minutes.
    MALDI-TOFMS: Few minutes.
    Raman spectroscopy: Few minutes.

Characteristics
    GC-MS: Time-consuming; labour-intensive sample preparation.
    MALDI-TOFMS: Soft ionization technique.
    Raman spectroscopy: Weak intensities (one per 10⁸ to 10¹⁰ incident photons); non-invasive; no sample preparation; can suffer from fluorescence background.

Applications
    GC-MS: Gold standard for biomarker discovery [1]; Volatile Organic Compounds (VOCs) analysis in exhaled breath, as indication for several pathologies and early stage cancers [2].
    MALDI-TOFMS: Identification of bacteria, yeasts and pathogens [3, 4]; DNA analyses [5]; determination of biomolecules [4]; development of new specific markers for tumors, rheumatoid arthritis, Alzheimer’s disease, and allergies [4].
    Raman spectroscopy: Identification of compounds in pharmaceutical mixtures [6]; nucleic acid analysis [7]; classification of minerals [8]; discrimination within bacteria (e.g. S. aureus [9], E. coli [10]); virus detection (e.g. hepatitis B) [11]; composition analysis of essential oils from Mentha species [12]; studies on biological tissues such as bone, lung, brain, etc. [13].

References
    [1] Xi et al. (2014); [2] Armitage and Barbas (2014), Skarysz et al. (2018); [3] Papagiannopoulou et al. (2019); [4] Wieser et al. (2012); [5] Jurinke et al. (2004); [6] Fan et al. (2019); [7] Efremov et al. (2008); [8] Liu et al. (2017); [9] Jarvis and Goodacre (2004); [10] Ho et al. (2019); [11] Tong et al. (2019); [12] Rösch et al. (2002); [13] Movasaghi et al. (2007).

Direct Analysis in Real Time Time-Of-Flight Mass Spectrometry (DART TOFMS)

DART TOFMS is a technique in which a DART ion source is coupled to a TOF mass spectrometer. The ion source uses a stream of excited gas (nitrogen or helium) to ionize molecules present in a sample without the need for an electrosprayed liquid solvent (Cody et al., 2005). Electrospray ionization suffers from ion suppression, where species with less abundant ions get suppressed by highly abundant ions, leading to unreliable metabolite profiling. Without the need for any sample preparation, solids, liquids, and gases are directly analyzed at ambient pressure and ground potential. The TOF mass spectrometer then accelerates the ionized molecules in an electric field, proportionally to their energy. By introducing a pulse, ions with the same mass-to-charge ratio (m/z, with m the mass and z the charge of the ion) group together and arrive at the detector at the same time (Mamyrin, 2001). The m/z ratio of every ion fragment is measured, so that a mass spectrum can be constructed for each sample (Figure 1.2). The analysis takes a few seconds and provides reproducible spectra with high resolution, specificity and accuracy (Lancaster and Espinoza, 2012; Zhou et al., 2011). The absence of any real sample preparation gives DART its real-time characteristic and makes it a more versatile source than radioisotope-based ion sources. In contrast to other existing ion sources, alkali metal cation attachment is avoided, which provides more interpretable spectra without decreasing the signal (Cody et al., 2005). Furthermore, it only requires small sample sizes (Deklerck et al., 2019) and is able to detect small quantities (in the order of nanograms) of pharmaceutics, drugs and explosives in bodily fluids (Cody et al., 2005). Despite all abovementioned advantages, many factors influence the results as well. The sensitivity is affected by the electric field conditions and sample positioning. While lower grid and orifice voltages are preferred, samples should be placed in such a way that maximum particle flow from the surface to the mass spectrometer inlet orifice is ensured. In addition, direct placement in the gas stream should be avoided, because this leads to blockage and/or deflection of the particles (Harris and Fernández, 2009). The obtained spectra and resolution depend on the gas flow rate and temperature (Cajka et al., 2011; Zhou et al., 2011). A higher flow rate increases the number of identified metabolites; however, if the rate becomes too high (≥ 3 L/min), solvent droplets tend to adhere to the inlet orifice of the mass spectrometer, which leads to contamination (Cajka et al., 2011). Moreover, spectra become irreproducible at high flow rates. Similarly, the number of detected metabolites increases with increasing temperature. The m/z ratios of the metabolites present determine the optimum temperature, which usually varies between 150 and 200 °C. Temperatures above 250 °C cause signal loss due to quick desorption and irreversible sample degradation (Zhou et al., 2011).
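The grouping of ions by m/z described above can be illustrated with the textbook relation for an idealized linear TOF analyzer, t = L · sqrt(m / (2 · z · e · U)). The sketch below is not a model of the instrument used in this dissertation: the drift length and acceleration voltage are hypothetical values chosen purely for illustration, and effects such as the reflectron or the initial energy spread are ignored.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge in coulomb
DALTON = 1.66053906660e-27   # unified atomic mass unit in kg


def flight_time(mz, drift_length_m=1.0, accel_voltage_v=5000.0):
    """Flight time (s) of an ion in an idealized linear TOF analyzer.

    mz is the mass-to-charge ratio in Da per elementary charge.
    drift_length_m and accel_voltage_v are hypothetical instrument settings.
    """
    # Energy conservation: z*e*U = 0.5*m*v^2, so v = sqrt(2*e*U / (m/z)).
    velocity = math.sqrt(2.0 * E_CHARGE * accel_voltage_v / (mz * DALTON))
    return drift_length_m / velocity


for mz in (100.0, 400.0):
    print(f"m/z {mz:6.1f}: {flight_time(mz) * 1e6:.2f} microseconds")
```

Under these assumptions, ions with the same m/z share the same flight time, and quadrupling m/z doubles the flight time, which is what allows the detector arrival times to be converted into a mass spectrum.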

Figure 1.2: DART TOFMS spectra of Pericopsis angolensis (Baker) Meeuwen (left) and Pericopsis elata (Harms) Meeuwen (right). The m/z ratios of the ion fragments present in the sample are shown on the x-axis, with their corresponding relative intensities on the y-axis.

DART TOFMS enables direct detection of several polar and nonpolar chemical compounds on various surfaces such as concrete, fruit, wood, clothing, etc. Because of this, it is widely used for quality control and detection of several compounds in food, such as pesticides, fatty acids and cholesterol (Rajchl et al., 2015). Other purposes are the assessment of the geographic origin of biodiesel feedstocks, identification of insect species (Musah et al., 2015), design of (phyto)chemical markers in medicinal plants and herbal drugs, as well as determination of their provenance (Bajpai et al., 2017; Kim et al., 2010, 2015). DART TOFMS spectra have been successfully used for the taxonomical identification and geographical localization of several (CITES-listed) tree species (Lancaster and Espinoza, 2012; Cody et al., 2012; Evans et al., 2017; Finch et al., 2017; Deklerck et al., 2017, 2019). Ionization of the metabolites present in a wood sample results in a unique metabolomic profile for that particular specimen (Deklerck et al., 2019; Zacharias et al., 2018). Since the molecules present in timber depend on the genes of the wood itself, the genetic differences between species are integrated in the metabolic fingerprint (Musah et al., 2015). This is especially useful for discrimination of lookalike species. For example, wood from Dalbergia nigra (Vell.) Benth. (CITES App. I) and Dalbergia spruceana (Benth.) Benth. (CITES App. II (UNEP-WCMC, 2013)) cannot be distinguished using wood anatomy (Lancaster and Espinoza, 2012) and can be mixed with Swartzia tomentosa DC. and Aniba rosaeodora Ducke in trade. They are then declared as "rosewood", which introduces ambiguity. The similar appearance of their timber is likely due to a similar distribution of neoflavonoids (dalbergiquinols, dalbergiones, neoflavenes, and dalbergins) (Lancaster and Espinoza, 2012). In Dalbergia and Machaerium species, the distribution of neo- and isoflavonoids differs significantly and can be used as a chemotaxonomic marker (De Oliveira et al., 1971). Lancaster and Espinoza (2012) distinguished 13 Dalbergia species using Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) based on the most abundant ions present in the samples. Cody et al. (2012) discriminated White Oak (Quercus alba L.) and Red Oak (Quercus rubra L.) using PCA and LDA as well. Araucaria angustifolia (Bertol.) Kuntze, Araucaria heterophylla (Salisb.) Franco, Agathis australis (D.Don) Lindl. and Wollemia nobilis W.G.Jones, K.D.Hill & J.M.Allen look macroscopically and microscopically similar to A. araucana (Molina) K.Koch (CITES App. I). Thus, determination is only possible when based on the physico-chemical characteristics (Evans et al., 2017). Evans et al. (2017) successfully discriminated the previously mentioned species by performing Kernel Discriminant Analysis (KDA). KDA was also successful on a dataset covering eight Dalbergia species and six lookalike species (Espinoza et al., 2015). More advanced ML models are also used for wood identification. Deklerck et al. (2019) discriminated ten Meliaceae species using Random Forests (RF) with an accuracy of 82.2%. Most misclassifications occurred for species with similar chemical profiles: Swietenia species were confused with each other and Entandrophragma angolense (Welw.) C.DC. was often misclassified as E. candollei Harms or Khaya anthotheca (Welw.) C.DC.

Near Infrared Spectroscopy (NIRS)

NIRS is a non-destructive technique based on absorption and emission of electromagnetic waves ranging between 780 and 2500 nm (Figure 1.3). This region mainly contains absorption bands from overtones (780-2000 nm) and combinations of vibrations (1900-2500 nm), originating from polyatomic molecules. The absorbance of NIR light is weak, because overtones are less likely to occur (Blanco and Villarroya, 2002; Tsuchikawa and Kobori, 2015; Schimleck et al., 2001). The core building blocks of a NIR spectrophotometer are a light source, a stage with sample, a lens, additional filters for wavelength selection, and a detector, which is coupled to a computer for the collection of the images/data (Skirtach, 2019). The specific equipment and settings are chosen according to the characteristics of the sample and the prevailing analytical conditions. Portable NIRS spectrophotometers also exist, which can be used on-site and make NIRS a flexible technique. Samples can be solid, liquid or gaseous and minimal to no sample preparation is needed, which makes NIRS a simple and cheap technique. Depending on the type of spectrophotometer, a discrete wavelength or a whole spectrum is obtained (Blanco and Villarroya, 2002). Analysis is fast and provides spectra with similar accuracy and precision as other analytical techniques, but they are less sensitive and yield lower resolution. Therefore, NIRS can only be used to study the abundant components (Blanco and Villarroya, 2002; Defoirdt et al., 2017; Schimleck et al., 2001). NIRS is widely applied in several fields, including agricultural foods, process control, textiles, petrochemicals, the pharmaceutical and clinical sector and environmental issues (Tsuchikawa and Kobori, 2015; Blanco and Villarroya, 2002). It has also found its way into the pulp and paper industry, as well as into wood science and technology (Tsuchikawa and Kobori, 2015; Kelley et al., 2004; Schimleck et al., 2001; Adedipe et al., 2008).

Figure 1.3: NIRS spectrum of Douglas fir, showing the absorbance (y-axis) at each wavenumber (cm⁻¹) (x-axis) (Tsuchikawa and Kobori, 2015).
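Since the NIR region is given in wavelengths (nm) in the text, while the spectrum in Figure 1.3 is plotted against wavenumbers (cm⁻¹), a small conversion sketch may help when comparing both. It only relies on the definition of the wavenumber as the reciprocal of the wavelength; the function name is a hypothetical helper, not part of any library.

```python
def wavelength_nm_to_wavenumber_cm(wavelength_nm):
    """Convert a wavelength in nanometres to a wavenumber in cm^-1."""
    # 1 cm equals 1e7 nm, so the wavenumber is 1e7 divided by the wavelength.
    return 1.0e7 / wavelength_nm


for nm in (780, 2500):
    print(f"{nm} nm -> {wavelength_nm_to_wavenumber_cm(nm):.0f} cm^-1")
```

The NIR boundaries of 780 and 2500 nm thus correspond to roughly 12 821 and 4000 cm⁻¹, respectively, so longer wavelengths appear at lower wavenumbers.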

NIRS spectra contain a lot of information. For wood samples, the chemical composition as well as physical and anatomical properties are covered (Blanco and Villarroya, 2002). These include, for example, density, wood tension, viscosity, moisture content, and the content of lignin, cellulose and extractives, such as ethanol and phenol (Defoirdt et al., 2017; Tsuchikawa and Kobori, 2015; Kelley et al., 2004; Schimleck et al., 2001; Blanco and Villarroya, 2002). Therefore, interpreting NIRS spectra is quite challenging (Russ et al., 2011) and requires several pre-processing steps. In order to obtain insight and find correlations between spectra and specific properties, multivariate analysis techniques, such as Principal Component Analysis (PCA), Partial Least Squares Discriminant Analysis (PLS-DA), Principal Component Regression (PCR), and Partial Least-Squares (PLS) regression are usually used (Blanco and Villarroya, 2002). In the context of NIRS, Soft Independent Modeling of Class Analogy (SIMCA) and Artificial Neural Networks (ANNs) are frequently utilized for the development of linear and non-linear calibration models, respectively. In order to obtain accurate and robust calibration models, a large amount of data is needed (Blanco and Villarroya, 2002; Lazarescu et al., 2017).

Estimation of wood properties is done using PCR and PLS regression (Schimleck et al., 2001; Fang et al., 2011). NIRS data on moisture content from tree species have successfully been classified using PCA (Russ et al., 2011) or using a combination of SIMCA with PCA (Adedipe et al., 2008), PLS-DA or ANN (Lazarescu et al., 2017). Differences in wood density also enable discrimination. Shou et al. (2014) utilized SIMCA-based models to determine from which species Chinese furniture was made. They discriminated Pterocarpus santalinus L.f. (red sanders), a valuable timber in the Chinese market, from Dalbergia louvelii R.Vig. (bois de rose) and Pterocarpus soyauxii Taub. (African padauk). Six other species were discriminated as well, namely D. oliveri Prain, D. bariensis Pierre, D. cochinchinensis, D. retusa Hemsl., P. erinaceus Poir. and P. macrocarpus Kurz (accuracies obtained between 80 and 100%). Thus, NIRS can help with the distinction between tree species that look alike macroscopically. Swietenia macrophylla, also known as Mahogany (CITES App. II (UNEP-WCMC, 2013)), is a highly treasured wood and has several lookalikes, for example Carapa guianensis Aubl., Cedrela odorata L., and Micropholis melinoniana Pierre. Discrimination is possible using PLS-DA models (Braga et al., 2011; Pastore et al., 2011; Bergo et al., 2016). The discriminative power depends on the surface that is scanned, i.e. transverse, radial or tangential; all samples subjected to a model should therefore always be scanned on the same face (Braga et al., 2011).

1.3 Summary of wood identification in practice

Nowadays, taxonomic identification of timber is mainly based on wood anatomy, requiring highly trained experts, which makes it difficult to keep up with demand. Forensic entities are lacking feasible and cost-effective tools to screen and identify illegally sourced timber in a fast, reproducible and accurate manner (Dormontt et al., 2015; Ravindran et al., 2018). Since a reliable state-of-the-art procedure for timber identification is not established yet, the United Nations Office on Drugs and Crime (UNODC) founded a specific research group that published the Practice Guide for Forensic Timber Identification (in 2016) in consultation with law enforcement. However, the guide does not narrow down to a specific procedure, but bundles all possible identification techniques. It operates as a first stepping stone in the challenge to develop a robust screening tool and a standard routine that can be implemented globally (Dormontt et al., 2015). Furthermore, the Global Timber Tracking Network (GTTN) composed the Timber Tracking Tool Infogram, which provides an overview of the capabilities of the tracking tools currently available, i.e. (i) wood anatomy, (ii) genetics, (iii) stable isotopes, (iv) DART TOFMS, (v) NIRS and (vi) machine vision. This overview shows which techniques are most suited for a certain question of interest (Schmitz et al., 2019). In order to develop a forensic screening tool that will be applicable on a global scale, further collaboration and coordination on an international level is desired, together with financial support (Dormontt et al., 2015).

Considering the three groups of identification techniques, chemical fingerprints seem to be promising as a forensic screening tool. The desired spectra can be obtained by performing mass spectrometry or spectroscopy techniques. Chemical spectra are obtained significantly faster and at a lower cost than genetic fingerprints, while still providing sufficient discriminative power. In addition, profiles based on the chemical properties of wood are interesting for discrimination problems between lookalike tree species (Dormontt et al., 2015). However, regarding DART TOFMS, further research is required, because the influence of added glues in wood-based panels on the obtained spectra is not clear yet (Deklerck, 2019). Although NIRS is used less, this technique is also promising due to its low cost, speed (it takes a few minutes), and its ability to differentiate species and to determine the geographic provenance (Dormontt et al., 2015; Braga et al., 2011). Nevertheless, its applicability as a forensic tool still needs to be validated (Deklerck, 2019). In addition, the use of machine vision provides a more consistent and reproducible identification, which is also cost-effective as it no longer depends only on human resources (Ravindran et al., 2018; Figueroa-Mata et al., 2018; Hermanson and Wiedenhoeft, 2011). However, this requires comprehensive databases containing reference material that captures enough of the biological variability between samples (Dormontt et al., 2015; Ravindran et al., 2018; He et al., 2020).

1.4 Objective and outline of this dissertation

The objective of this dissertation is to further explore the possibilities of DART TOFMS spectra for wood identification, in continuation of the work of Goeders (2019). First, the automated post-processing step on the data is implemented in Python instead of R. Goeders (2019) distinguished heartwood samples from different species by means of their DART TOFMS spectra, using Random Forests (RF) and k-Nearest Neighbors (k-NN) based on dynamic time warping (DTW). While k-NN performed poorly, optimal predictive performance was achieved using RF, with accuracies of 86.7%, 85.5% and 81.7% at the family, genus and species level, respectively. In this dissertation, the same dataset will be used, containing 858 samples that are unequally distributed across the species. The predictive performance of RF and One-Dimensional Convolutional Neural Networks (1D CNNs), applied in both the conventional flat classification and hierarchical classification settings, will be compared.
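To make the notion of accuracy at the family, genus and species level concrete, the sketch below scores species-level predictions at each taxonomic level. It is only one possible way of computing such scores and is not necessarily the procedure followed by Goeders (2019); the lookup table, the labels and the predictions are hypothetical toy values, and the sketch does not implement hierarchical classification itself.

```python
# Hypothetical lookup from species to (genus, family); a real pipeline would
# derive this from the metadata of the reference collection.
TAXONOMY = {
    "Swietenia macrophylla": ("Swietenia", "Meliaceae"),
    "Swietenia mahagoni": ("Swietenia", "Meliaceae"),
    "Khaya anthotheca": ("Khaya", "Meliaceae"),
    "Dalbergia nigra": ("Dalbergia", "Fabaceae"),
}


def accuracy_per_level(true_species, predicted_species):
    """Fraction of correct predictions at species, genus and family level."""
    n = len(true_species)
    correct = {"species": 0, "genus": 0, "family": 0}
    for truth, pred in zip(true_species, predicted_species):
        true_genus, true_family = TAXONOMY[truth]
        pred_genus, pred_family = TAXONOMY[pred]
        correct["species"] += truth == pred
        correct["genus"] += true_genus == pred_genus
        correct["family"] += true_family == pred_family
    return {level: count / n for level, count in correct.items()}


# Toy labels and predictions, made up for illustration only.
y_true = ["Swietenia macrophylla", "Swietenia mahagoni", "Khaya anthotheca", "Dalbergia nigra"]
y_pred = ["Swietenia mahagoni", "Swietenia mahagoni", "Swietenia macrophylla", "Dalbergia nigra"]
print(accuracy_per_level(y_true, y_pred))
```

In this toy example, a confusion within the genus Swietenia is an error at species level only, whereas confusing Khaya with Swietenia is an error at both species and genus level, which is why accuracy typically increases towards the higher taxonomic levels.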


• Chapter 2 – Machine learning

Overview of the most important concepts of machine learning (ML) and its subfield, deep learning (DL), for this dissertation.

• Chapter 3 – Data and methods

Description of the dataset, the post-processing procedure and the performed analyses.

• Chapter 4 – Results and discussion

Presentation, discussion and further interpretation of the obtained results.

• Chapter 5 – Conclusion and future work


MACHINE LEARNING

After gaining some insights into timber trade and wood identification, the most important machine learning (ML) and deep learning (DL) concepts are introduced in this chapter. In addition, some research on the use of ML for timber identification is summarized. Furthermore, several metrics to assess a model’s predictive performance are discussed, as well as the difference between flat and hierarchical classification.

2.1 Introduction

A lot of data is being produced in the scientific and medical field as well as in marketing, finance, and other business disciplines. Over the last decades, machine learning (ML) has gained a lot of interest as a tool to make sense of these vast amounts of "Big Data". It is a subfield of computer science that uses computational and statistical models to automatically search for patterns present in the data without being explicitly programmed. This learning from rather complex datasets can be performed in four different settings, i.e. supervised, unsupervised, semi-supervised, and reinforcement learning, depending on the type of available data. Supervised learning involves the prediction of a certain outcome $\hat{y}$ based on a given labeled dataset $\mathcal{D}$ containing $N$ observations, $\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \dots, (\mathbf{x}_N, y_N)\}$, with $\mathbf{x}_i = (x_{i1}, x_{i2}, \dots, x_{iP})^T$ a vector containing the observed values for the $P$ features of the $i$-th observation and $y_i$ the corresponding label, which can be quantitative or qualitative. The former case is known as a regression problem, where the output variable is continuous: $y \in \mathbb{R}^D$. An example is the estimation of the price of a bottle of wine, based on the measures of its ingredients. Predicting whether a bottle of wine will be sold as cheap, mid range or expensive, given its ingredients, is a classification problem. Here, the output is categorical or discrete: $y \in \mathcal{C}$, with $\mathcal{C}$ containing $K$ classes, $\mathcal{C} = \{c_1, c_2, \dots, c_K\}$. To obtain the final label, the probability of a certain instance belonging to a certain class, given the observed data, is computed. The class with the highest probability will be the final outcome. In general, classification problems are described as being binary ($K = 2$) or multi-class ($K > 2$). Popular supervised approaches are k-Nearest Neighbors (k-NN), Logistic Regression (LR), Random Forests (RF) and Support Vector Machines (SVMs).
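The final step of a classifier, selecting the class with the highest probability, can be sketched as follows. The probabilities are hypothetical model outputs for the wine example with K = 3 classes; how they are estimated depends on the chosen model and is not shown here.

```python
def predict_label(class_probabilities):
    """Return the class with the highest predicted probability."""
    return max(class_probabilities, key=class_probabilities.get)


# Hypothetical model output for one bottle of wine (K = 3 classes).
probabilities = {"cheap": 0.15, "mid range": 0.60, "expensive": 0.25}
print(predict_label(probabilities))
```

For these made-up probabilities, the predicted label is "mid range".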


If a dataset contains unlabeled data, i.e. the inputs do not have a corresponding output, unsupervised learning is performed. The aim here is to discover relationships and structures present in the dataset, without the prediction of an output measure. This involves clustering methods, such as k-means and hierarchical clustering, which group observations together based on a similarity measure. These techniques are mainly used in exploratory data analyses. Semi-supervised learning problems also exist, where datasets contain response variables for only a part of the observations (James et al., 2013; Hastie et al., 2009; Bishop, 2006). A fourth and last branch is reinforcement learning, which learns from data in a totally different way. An algorithm searches for the sequence of states and actions that will result in the highest reward. This is achieved by interacting with the environment through trial and error (Bishop, 2006).

In a high-dimensional space, the volume of the neighborhood increases exponentially with the degree of dimensionality. Because of this, observations tend to be spread sparsely in space, not providing enough examples to obtain sufficient insight into the relationship between an input and output, and eventually leading to lower predictive performance. This is known as the curse of dimensionality. By reducing the dimensionality, but keeping the most important information, the data will become more interpretable (Thas, 2019; Hastie et al., 2009; Bishop, 2006; James et al., 2013). Frequently used techniques for dimensionality reduction are Principal Component Analysis (PCA) (unsupervised) and Linear Discriminant Analysis (LDA) (supervised). In practice, it is common to reduce the feature space prior to further looking into the data and making predictions (Ho et al., 2019; Lancaster and Espinoza, 2012).
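The sparsity caused by the curse of dimensionality can be illustrated with a well-known example from Hastie et al. (2009): for data spread uniformly in a unit hypercube, a sub-cube that captures a fraction r of the observations needs an edge length of r^(1/p) in p dimensions. The sketch below simply evaluates this expression; the uniform distribution is an assumption of the illustration, not a property of the spectra considered in this dissertation.

```python
def edge_length(fraction, n_dims):
    """Edge length of a sub-cube holding `fraction` of uniformly spread data."""
    return fraction ** (1.0 / n_dims)


for p in (1, 2, 10, 100):
    print(f"p = {p:3d}: edge length = {edge_length(0.01, p):.2f}")
```

To capture only 1% of the data, the sub-cube must already span about 63% of the range of every feature when p = 10, so a "local" neighborhood is no longer local in high dimensions.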

2.2 Supervised learning

2.2.1 Parametric and non-parametric models

In supervised learning, an input $x$ relates to its output $y$ by a function $f(\cdot)$. The aim is to find a function $\hat{f}(\cdot)$ that approximates $f(\cdot)$, such that it can predict a value $\hat{y}$ as close as possible to the real, but unknown output $y$, given the observation $x$. In order to estimate $\hat{f}(\cdot)$, the dataset of interest will be split into a training, validation (or testing) and tuning set. The model will try to identify useful relationships between given inputs and their known outputs from the training set. To perform this task, parametric as well as non-parametric approaches exist. Parametric models make assumptions about the functional form of $f$ beforehand, for example that $f$ is linear:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_P x_P + \varepsilon, \tag{2.1}$$

with $x_1, x_2, \dots, x_P$ the features that hold the most useful information in order to describe the relationship between the input and output, $\beta_0, \beta_1, \dots, \beta_P$ the corresponding parameters of the model and $\varepsilon$ an error term that is normally distributed with zero mean and variance $\sigma^2$. By training the model, an optimal set of parameters or weights is searched for. The features that contribute more to the relationship will get assigned a higher weight compared to the variables that contribute less. Thus, parametric models reduce the problem of finding $\hat{f}$ to the estimation of a set of $P + 1$ parameters for a $P$-dimensional function. A possible risk is that the prior assumptions regarding the form of $\hat{f}$ are far from the real $f$, so that the predictions are inaccurate. Non-parametric models do not make any assumptions about the functional form of $f$. They circumvent the risk that a form of $\hat{f}$ chosen in advance does not fit the data at all. However, as this approach does not reduce the problem to the estimation of a small number of parameters, a vast amount of training data is required in order to cover the feature space and thus to obtain accurate predictions (James et al., 2013).
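The contrast between the two approaches can be sketched on a single feature. The parametric model below assumes a linear form and estimates its two parameters by least squares. The non-parametric model is a k-nearest-neighbors average that assumes no form at all. The toy data, the noise level and the function names are assumptions chosen for illustration.

```python
import random

rng = random.Random(0)

# Hypothetical toy data: the true f is linear, f(x) = 2 + 3x, plus Gaussian noise.
x_train = [i / 10 for i in range(30)]
y_train = [2.0 + 3.0 * x + rng.gauss(0.0, 0.3) for x in x_train]

def fit_linear(xs, ys):
    """Parametric: assume f(x) = b0 + b1 * x and estimate both parameters by least squares."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b1 * mx, b1

def predict_knn(xs, ys, x_new, k=3):
    """Non-parametric: average the outputs of the k nearest training inputs."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k

b0, b1 = fit_linear(x_train, y_train)
x_new = 1.55
linear_prediction = b0 + b1 * x_new
knn_prediction = predict_knn(x_train, y_train, x_new)
print("estimated parameters:", round(b0, 2), round(b1, 2))
print("linear prediction:", round(linear_prediction, 2), "k-NN prediction:", round(knn_prediction, 2))
```

After training, the linear model is fully described by two numbers, whereas the k-NN model has to keep the complete training set in order to make a prediction.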

In addition to the parameters discussed above, a model also contains hyperparameters, e.g. the learning rate, the number of trees in RF or the number of considered neighbors in k-NN (see Section 2.3). To obtain a satisfactory performance, these should be optimized as well. Unlike parameters, hyperparameters are not learned from the data, but are defined prior to training. Resampling methods (see Section 2.2.3) are used to obtain a separate validation set in order to determine the optimal values.
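A minimal sketch of hyperparameter selection is given below for the number of neighbors k in a k-NN regression. As a simplifying assumption, a single hold-out split stands in for the resampling methods of Section 2.2.3. The target function, the noise level and the candidate grid are hypothetical.

```python
import math
import random

rng = random.Random(1)

# Hypothetical data: noisy observations of sin(x) on [0, 6].
xs = [rng.uniform(0.0, 6.0) for _ in range(120)]
data = [(x, math.sin(x) + rng.gauss(0.0, 0.3)) for x in xs]
train, valid = data[:80], data[80:]  # hold-out split: 80 training, 40 validation

def knn_predict(train_set, x_new, k):
    """Average the outputs of the k training observations closest to x_new."""
    nearest = sorted(train_set, key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k

def mse(train_set, eval_set, k):
    """Mean squared error on eval_set of a k-NN model that uses train_set."""
    return sum((y - knn_predict(train_set, x, k)) ** 2 for x, y in eval_set) / len(eval_set)

# The hyperparameter is fixed before each fit and judged on the validation set only.
candidates = [1, 3, 5, 9, 15, 25, 40, 80]
valid_errors = {k: mse(train, valid, k) for k in candidates}
best_k = min(valid_errors, key=valid_errors.get)
print("validation MSE per k:", {k: round(e, 3) for k, e in valid_errors.items()})
print("selected k:", best_k)
```

The selected value depends on the particular split, which is one reason why repeated resampling is generally preferred over a single hold-out set.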

2.2.2 Bias-variance trade-off

As already explained, a model consists of several parameters. A trade-off exists between including too few or too many parameters in the model. The ability of a model to make reliable predictions for unseen data is known as generalization and can be determined by evaluating the performance of the model on a test set, which contains observations that the model has not seen yet. The difference between the obtained prediction $\hat{f}(x)$ and the true output $y$ of a certain observation $x$ is the test error. This error can be explained in terms of bias and variance. A model containing few parameters will show low variance, as it depends on fewer features. The model will be less complex, more interpretable and more robust. However, there is a risk that not all important information present in the data is considered during predictions, leading to underfitting. The predictions might be far from the ground truth, thus showing high bias and resulting in a high training and test error. Including more parameters will result in a more complex and flexible model, which will improve the predictive performance and lower the bias. Nevertheless, when too many parameters are incorporated, predictions will be based on some parameters that actually introduce more noise than additional useful information. A high variance will be obtained and the model will overfit, which is recognized by a low training but high test error. Both cases result in unsatisfactory predictive performance. The compromise between these two extremes is described as the bias-variance trade-off and is illustrated in Figure 2.1.

Figure 2.1: Value of the training and test error (y-axis) depending on the model complexity (x-axis). Less complex models tend to underfit, which results in a high training and test error. More complex models might overfit and have a low training but high test error (Hastie et al., 2009).
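The behaviour sketched in the figure can be reproduced on simulated data. The example below is an illustration under stated assumptions, not the model used in this work. It fits a piecewise-constant model with one parameter per bin, so that the number of bins acts as the model complexity. The target function, the noise level and the sample sizes are hypothetical.

```python
import math
import random

rng = random.Random(2)

def sample(n):
    """Draw n noisy observations of the hypothetical target sin(2*pi*x) on [0, 1)."""
    xs = [rng.random() for _ in range(n)]
    return [(x, math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.3)) for x in xs]

def fit_bins(train, n_bins):
    """Piecewise-constant model: one parameter (the mean output) per bin."""
    overall = sum(y for _, y in train) / len(train)
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in train:
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    # Bins without training observations fall back to the overall mean.
    return [s / c if c else overall for s, c in zip(sums, counts)]

def mse(params, data):
    """Mean squared error of the fitted bin means on a dataset."""
    n_bins = len(params)
    return sum((y - params[min(int(x * n_bins), n_bins - 1)]) ** 2 for x, y in data) / len(data)

train, test = sample(60), sample(2000)
train_err, test_err = {}, {}
for n_bins in (1, 2, 4, 8, 16, 32, 64):
    params = fit_bins(train, n_bins)
    train_err[n_bins] = mse(params, train)
    test_err[n_bins] = mse(params, test)
    print(n_bins, round(train_err[n_bins], 3), round(test_err[n_bins], 3))
```

Because each finer partition refines the previous one, the training error can only decrease as bins are added. The test error first drops and then rises again once most bins contain a single observation or none.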

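To reformulate, the expected test error can be decomposed into the sum of three fundamental quantities, being the variance of $\hat{f}(x)$, the squared bias of $\hat{f}(x)$ and the variance of the error term $\varepsilon$:

$$E\left[\left(y - \hat{f}(x)\right)^2\right] = \mathrm{Var}\left(\hat{f}(x)\right) + \left[\mathrm{Bias}\left(\hat{f}(x)\right)\right]^2 + \mathrm{Var}(\varepsilon), \tag{2.2}$$

where

$$\mathrm{Var}\left(\hat{f}(x)\right) = E\left[\left(\hat{f}(x) - E\left[\hat{f}(x)\right]\right)^2\right], \tag{2.3}$$

$$\mathrm{Bias}\left(\hat{f}(x)\right) = E\left[\hat{f}(x)\right] - f(x), \tag{2.4}$$

$$\mathrm{Var}(\varepsilon) = \sigma^2, \quad \varepsilon \sim N(0, \sigma^2), \tag{2.5}$$

with $\varepsilon$ being normally distributed with zero mean and variance $\sigma^2$.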
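The decomposition can be checked numerically. The sketch below repeatedly draws a new training set, fits a k-NN model and predicts at one fixed observation, so that the expectations over training sets can be approximated by averages. The target function, the noise level, the choice of k-NN and all names are assumptions made for illustration.

```python
import math
import random

rng = random.Random(3)
SIGMA = 0.3   # standard deviation of the error term
X0 = 0.25     # observation at which the test error is decomposed

def f(x):
    """Hypothetical true function."""
    return math.sin(2 * math.pi * x)

def knn_fit_predict(n_train=30, k=5):
    """Draw a fresh training set and return the k-NN prediction at X0."""
    xs = [rng.random() for _ in range(n_train)]
    train = [(x, f(x) + rng.gauss(0.0, SIGMA)) for x in xs]
    nearest = sorted(train, key=lambda pair: abs(pair[0] - X0))[:k]
    return sum(y for _, y in nearest) / k

n_rep = 5000
preds = [knn_fit_predict() for _ in range(n_rep)]
ys = [f(X0) + rng.gauss(0.0, SIGMA) for _ in range(n_rep)]  # independent test outputs at X0

mean_pred = sum(preds) / n_rep
variance = sum((p - mean_pred) ** 2 for p in preds) / n_rep   # variance of the prediction
bias = mean_pred - f(X0)                                      # bias of the prediction
lhs = sum((y - p) ** 2 for y, p in zip(ys, preds)) / n_rep    # expected test error
rhs = variance + bias ** 2 + SIGMA ** 2                       # sum of the three terms

print("expected test error        :", round(lhs, 4))
print("variance + bias^2 + sigma^2:", round(rhs, 4))
```

The two printed values agree up to simulation noise. The term $\sigma^2$ is present regardless of the model, which is why it is often called the irreducible error.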

The optimal model provides low variance and low bias, and results in a minimal error. Depending on the goal, many performance measures exist, which are discussed