
Lifelong 3D object recognition: A comparison of deep features and handcrafted descriptors

Bachelor’s Project Thesis

Jing Wu, s2655012, J.J.Wu.1@student.rug.nl, Supervisor: Dr. S.H. Mohades Kasaei

Abstract: Over the past few decades we have been finding more and more uses for service robots. While the early robots employed by humans were often stationary and operated in static environments, the demand for service robots capable of more complex behaviour in dynamic environments has been increasing. Object recognition has come a long way in the past decade, with improvements in software and hardware allowing widespread implementation using Convolutional Neural Networks (CNNs). Unfortunately, the batch training that CNNs require is hard to employ in a system that has to be capable of lifelong learning while operating in a dynamic, open-ended domain. Using OrthographicNet, a system designed specifically for lifelong 3D object recognition in open-ended domains, we compare the performance of deep features (MobileNetV2, VGG19 fc1) and handcrafted descriptors (GOOD, ESF). We performed offline experiments for our preliminary comparisons and found during the online testing phase that the deep feature descriptor MobileNetV2 is capable of learning the most object categories while providing decent accuracy and learning speed.

1 Introduction

With the rapid developments in software and hardware over the past few decades, the possibilities for creating sophisticated service robots have also increased. Industrial robots, unlike service robots, have been widely used since the 1980s in, for example, manufacturing processes. While industrial robots are generally stationary and work in static environments, service robots generally have to deal with dynamic environments and often move around by themselves as well.

This means that, ideally, a service robot should have the ability to recognize and identify new elements and interact with them in the appropriate and desired fashion. An obstacle might have to be manoeuvred around, whilst a bottle might have to be picked up. Computer vision deals with tasks such as scene reconstruction, object recognition, image segmentation, 3D pose estimation, motion tracking, and object classification [1]. In this paper, however, we focus only on the object recognition component of computer vision, and

not on the robot's abilities to manipulate its environment. Since our focus is on lifelong 3D object recognition for service robots, we will be working with open-ended object category learning. This entails that the robot at hand has to be able to learn and recognize new object categories while it is active. One of the problems here is that the robot does not know what kinds of objects it may encounter in the future, making it difficult to use an approach that trains the robot beforehand, which is the approach taken by the widely used Convolutional Neural Networks (CNNs) [2], [3], [4], [5], [6], [7]. While deep CNNs are very effective when the application has a pre-defined set of object categories and thousands of examples per category, they are not suitable for open-ended object recognition. Furthermore, CNNs are not open-ended because each newly learned category requires a restructuring of the network's topology [8]. They also generally require vast amounts of data in order to learn new categories, meaning that it will be difficult for the robot to accurately learn and distinguish between newly encountered categories due to the lack


of data. Even if the robot did have access to all the necessary data, the issue of long computation times still remains. It is important for service robots to be able to do the computations rapidly so that they may continue carrying out their tasks without delay in the real world.

In light of the aforementioned issues, Kasaei et al. [8] have developed a CNN-based model for 3D object recognition in open-ended domains called OrthographicNet, which has been shown to outperform state-of-the-art approaches in both object recognition performance and scalability in open-ended scenarios. Since the original paper focuses on how its best results compare to other state-of-the-art approaches, less is known about the differences among the various configurations available to OrthographicNet itself. We will perform experiments with the hand-crafted descriptors GOOD [9] and ESF [10], and the deep feature descriptors MobileNetV2 and VGG19 fc1 mentioned in [8], to find out how different configurations within each approach compare to each other and how certain design choices affect the overall performance of the model. We will first compare all descriptors with each other in an offline test setting.

The best performing handcrafted and deep feature descriptors will be used in a subsequent round of testing in an open-ended scenario in order to obtain a more conclusive result on the general performance of these descriptors.

2 Related Works

In order to classify an object's category, we must first be able to detect and represent it. Since we will be using the Washington RGB-D Object dataset for all of our tests, we will not have to work with an object detection module. Regarding object representation: it is important that the chosen approach contains neither too little information about the object nor too much. Excessive information will increase noise and likely slow down the model, while too little information will likely lead to inaccurate predictions.

2.1 Handcrafted Features

Three classes of 3D point cloud object descriptors can be distinguished: local-based, global-based, and

hybrid-based. Global descriptors describe the image as a whole in order to generalize the entire object; features that may be included by global descriptors are shape, contour, and texture. Local descriptors, on the other hand, describe distinct patches of the image. Han et al. [11] have conducted a series of experiments in which they compare various descriptors against each other. SHOT-Color [12], a local descriptor, showed decent results but could not compare against the global descriptors ESF (Ensemble of Shape Functions) [10], VFH (Viewpoint Feature Histogram) [13], and GOOD [9]. The latter three are well suited for open-ended domains due to their good performance in both computation speed and accuracy. For these reasons, we will be using both GOOD and ESF in this work.

2.2 Deep Features

In the past decade, progress has been made in the development of learning approaches that support online and incremental object category learning, albeit without the use of CNNs [14], [15], [16], [17], [18]. While it is true that CNNs perform well when it comes to object recognition and classification in general, it is significantly more difficult to do so in an open-ended fashion due to the necessity of retraining the entire model upon introducing new objects.

An attempt has been made by [19] to solve this issue by pre-training a CNN and transferring the pre-learned features to be fine-tuned on the target dataset. Although the network's robustness does indeed improve, this increase in performance is still insufficient to make it work in an open-ended domain. Since OrthographicNet does not need to be retrained upon encountering new object categories that have to be learned, it is possible to pursue lifelong learning for the robot.

We can divide the different deep learning approaches for 3D object recognition into three categories depending on their types of input: volume-based, view-based, and pointset-based methods.

OrthographicNet employs a view-based approach.

Volume-based approaches represent an object as a 3D voxel grid, whereas pointset-based approaches work directly on 3D point clouds. View-based approaches construct 2D images of an object by projecting the object's points from a 3D representation onto 2D planes, and have shown the best results


in object recognition tasks so far [3], [20]. Various other works, such as [21] and [22], also make use of multiple views of an object but are ultimately not suitable for real-world scenarios due to the difficulty of obtaining a perfect image view of the object at every angle, a problem from which OrthographicNet does not suffer. This is important because in the real world objects are often partially obscured.

There are a number of additional aspects that separate OrthographicNet from other CNN-based approaches. First of all, OrthographicNet assumes that each training instance is extracted from on-site experience, as opposed to having access to training data from the start. The robot is thus gradually exposed to training data, which emulates a real-time learning experience similar to that of humans.

At the start of the training process the number of known object classes is therefore zero, and it increases incrementally over time. The fact that OrthographicNet does not require large amounts of training data or pre-training of the model is an advantage when it comes to learning in an open-ended domain in the real world. Second of all, OrthographicNet does not have a pre-defined set of object classes and has the potential to continuously grow and learn new object categories that are distinct enough from each other. It is important to mention that OrthographicNet is suitable for both instance-level and category-level object recognition. Lastly, OrthographicNet is both scale and rotation invariant.

To the best of our knowledge, there are currently no other deep learning based approaches besides OrthographicNet that tackle both 3D object pose estimation and recognition in an open-ended fashion.

3 Methods

3.1 Object Representation

In order to classify an object, a robot must first be able to recognize it, which requires a representation of the object. One way to represent an object is as a point cloud, i.e. a set of points pi: i = {1, ..., n}. Since the points lie in 3D space, each point contains the coordinates [x, y, z]; the points may also contain color information (R, G, B). Starting with the calculation of the geometric center of the object, we construct the three principal axes of the object based on eigenvector analysis. Using the eigenvectors [v1, v2, v3] that we found, we determine the axes [x, y, z].

Figure 3.1: A point cloud and orthographic views of a bathtub
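As an illustration of this step, the sketch below estimates an object-centred reference frame by eigen-decomposition of the point cloud's covariance matrix. This is not OrthographicNet's implementation; the function name and the use of NumPy are our own assumptions.

```python
import numpy as np

def object_reference_frame(points):
    """Estimate an object-centred reference frame from an (n, 3) point cloud.

    Returns the geometric center and the three principal axes [v1, v2, v3],
    ordered from largest to smallest eigenvalue.
    """
    center = points.mean(axis=0)            # geometric center of the object
    centred = points - center               # translate the center to the origin
    cov = np.cov(centred, rowvar=False)     # 3x3 covariance matrix of the points
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvector analysis (symmetric matrix)
    order = np.argsort(eigvals)[::-1]       # sort axes by decreasing variance
    axes = eigvecs[:, order].T              # rows are v1, v2, v3
    return center, axes
```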

Afterwards, we generate three views of the object: the frontal, top, and right-side view (Figures 3.1 and 3.2 show some examples). Each view is divided into n × n bins, which are used to compute a normalized distribution matrix; the result is a histogram that counts how many points fall in each bin. Each of the three orthographic views is then individually fed to a CNN in order to extract view-wise features. Finally, an element-wise pooling function is used to merge the obtained view-wise features into a single global deep representation of the object at hand. The three other primary views (rear, bottom, and left-side) are not taken into consideration because they are mirrors of the aforementioned ones.

Figure 3.2: Orthographic views of a shampoo bottle
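A minimal sketch of the projection and pooling steps follows, assuming a centred, axis-aligned point cloud. The bin count n, the choice of projection plane, the placeholder CNN feature extractor, and the function names are illustrative assumptions rather than the actual OrthographicNet code.

```python
import numpy as np

def orthographic_histogram(points, n=50):
    """Project a centred, axis-aligned (m, 3) point cloud onto one plane and
    bin it into an n x n normalized distribution matrix (here: the frontal
    view, taken as the x-y plane)."""
    hist, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=n)
    return hist / max(hist.sum(), 1)        # normalize so the bins sum to 1

def global_representation(view_features, pooling="AVG"):
    """Merge view-wise feature vectors (one per orthographic view) into a
    single global representation with element-wise pooling."""
    stacked = np.stack(view_features)       # shape: (3, feature_dim)
    return stacked.mean(axis=0) if pooling == "AVG" else stacked.max(axis=0)

# Usage sketch: three views -> CNN features (hypothetical extractor) -> pooled descriptor
# views = [frontal, top, right]                  # each an n x n histogram image
# features = [cnn_extract(v) for v in views]     # cnn_extract is a placeholder
# descriptor = global_representation(features, pooling="AVG")
```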

3.2 Offline Experiments

Our offline experiment configurations differ in the type of object descriptor and the type of distance function that we use. Furthermore, we apply a K-fold cross-validation procedure in order to examine the performance of our system more accurately. K-fold cross-validation is widely used for estimating a learning algorithm's generalization performance. While frequently used for parameter tuning, this method also allows the performance of different approaches to be compared against each other, which is exactly what we are doing. This evaluation protocol creates K folds by dividing the full data set into K equal-sized subsets.

Each subset contains examples from all categories.

During each iteration only one fold is used for testing, while the remaining folds are used as training data. In our work we set the value of K to 10.
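The evaluation loop can be sketched as follows using scikit-learn's StratifiedKFold. The feature matrix X, the labels y, and the classify callback standing in for the descriptor-plus-distance pipeline are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def kfold_accuracy(X, y, classify, k=10):
    """Average accuracy over K folds; stratification keeps examples from all
    categories in both the training and the test split of each fold."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        # classify is a hypothetical callback: fit on the training fold,
        # return predicted labels for the test fold.
        preds = classify(X[train_idx], y[train_idx], X[test_idx])
        scores.append(np.mean(preds == y[test_idx]))
    return np.mean(scores)
```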

As for the different types of object descriptors, we have chosen two object descriptors using handcrafted features (GOOD and ESF), and two object descriptors using deep transfer learning (MobileNetV2 and VGG19 fc1). Our aim is to find the best configuration of each type and compare the winners with each other in the online evaluation.

The dataset that we will be using for our offline experiments is the Restaurant RGB-D Object Dataset developed by [23]. The Restaurant RGB-D dataset consists of 305 instances of household objects distributed over ten classes; Figure 3.3 shows some example objects. Despite the small size of the data set, we are still able to run a large number of experiments with it because it offers significant intra-class variation and because we apply K-fold cross-validation.

We will primarily be using an instance-based learning (IBL) approach to form new categories, which can generally be viewed as a combination of the representation of an object, a similarity measure, and a classification rule.

Figure 3.3: Example object views of the Restaurant RGB-D dataset.

Table 3.1: An overview of the 14 different distance functions used in our offline evaluations

Euclidean        Manhattan
χ2               Pearson
Neyman           Canberra
KL divergence    Symmetric KL divergence
Motyka           Cosine
Dice             Bhattacharyya
Gower            Sorensen

Hence, the similarity measure that we choose will affect the system's performance. Given that we represent objects as normalized histograms, as explained in the previous subsection, the similarity between histograms can be computed with different distance functions.

Taking the above into account, we have chosen 14 mutually dissimilar distance functions to test in our configurations. Table 3.1 shows the functions that we have chosen for our work. Cha [24] offers a comprehensive survey of the distance functions that we used.
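To illustrate, the sketch below implements two of the listed functions (χ2 and Sorensen) together with a simple nearest-neighbour classification rule over normalized histograms; the formulas follow Cha [24], while the function names and the 1-NN wrapper are our own.

```python
import numpy as np

EPS = 1e-12  # avoid division by zero on empty bins

def chi_square(p, q):
    """Chi-square distance between two normalized histograms."""
    return np.sum((p - q) ** 2 / (p + q + EPS))

def sorensen(p, q):
    """Sorensen distance between two normalized histograms."""
    return np.sum(np.abs(p - q)) / (np.sum(p + q) + EPS)

def nearest_neighbour(query, instances, labels, distance=chi_square):
    """1-NN classification rule: the query is assigned the category of the
    stored instance with the smallest histogram distance."""
    distances = [distance(query, inst) for inst in instances]
    return labels[int(np.argmin(distances))]
```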

3.3 Online Experiments

After finding the best configurations for both the handcrafted and deep transfer learning descriptors we must test them again in an open-ended scenario.

The problem with offline evaluation methodologies is that they do not respect the simultaneous nature of learning and recognition. Furthermore, they imply that the set of categories for the system to learn is predefined. OrthographicNet, on the other hand, aims to achieve lifelong learning.

We will be using an online evaluation methodology based on a teaching protocol proposed by [23]. The idea is to emulate the interactions of a recognition system with its surrounding environment over long periods of time in a single-context scenario. The teacher follows a protocol and may interact with the system in three ways:

• Teach: used for introducing a new object category to the agent.

• Ask: used to ask the agent the category of a given object view.

• Correct: used for providing corrective feedback in case of misclassification.


While it is possible to use a human teacher, it is faster, more consistent, and more efficient to use a simulated teacher. This teacher determines which examples are used for training the algorithm and which are used for testing. The simulated teacher presents the algorithm with three unseen object views of an object from the categories that have already been introduced, and tests the system. Using these three views, the system constructs a model as described in the "Object Representation"

subsection. The simulated teacher then presents another object view that has not yet been seen by the system in order to check whether it can correctly classify the object ("Ask"). If the system misclassifies the object, the simulated teacher provides corrective feedback ("Correct"). This allows the system to update its current model of the category to which the test input belonged.

A parameter τ is used as a threshold that the system's success rate has to pass. The default value used in our experiments was τ = 0.67, which implies that the success rate had to be twice as large as the error rate. We also test the model with different threshold values in order to see if and how this parameter influences the system's performance. Once the system passes this threshold, the simulated teacher introduces object views of a new object category ("Teach"). Since the system has not had any exposure to this new object category yet, this mimics the simultaneous nature of recognition and learning in humans. The system begins learning without any knowledge of the new category and is only exposed to it gradually.

The teaching protocol is stopped either when the system has successfully managed to learn all categories and has run out of data, or when it turns out that the system is unable to pass the threshold τ. It has to be noted, however, that in the former case it would still be possible for the system to continue learning more unique categories. Regarding the latter case, the system is allowed to go through 100 unsuccessful iterations before the protocol is terminated; the simulated teacher then essentially assumes that the system is simply unable to learn the newest object category. A minimal sketch of this protocol is given below.
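The sketch is a simplified rendering of the protocol, not the exact implementation of [23]: the agent interface (teach, classify, correct), the way unsuccessful iterations are counted, and the sliding window used to estimate the success rate are our own assumptions.

```python
import random

def simulated_teacher(agent, dataset, categories, tau=0.67, patience=100):
    """Rough sketch of the teach/ask/correct protocol (our own simplification).

    `agent` is assumed to expose teach(category, views), classify(view) and
    correct(view, category); `dataset` maps each category to a list of
    object views that have not yet been shown to the agent.
    """
    learned, iterations, answers = [], 0, []
    for category in categories:
        # Teach: introduce a new category with three unseen object views.
        agent.teach(category, [dataset[category].pop() for _ in range(3)])
        learned.append(category)
        failures = 0
        while failures < patience:
            target = random.choice(learned)
            if not dataset[target]:
                return learned, iterations               # ran out of data
            view = dataset[target].pop()
            iterations += 1
            correct = agent.classify(view) == target     # Ask
            answers.append(correct)
            if not correct:
                agent.correct(view, target)              # Correct
                failures += 1
            # The recent success rate must exceed tau before the next
            # category is introduced (window size is an assumption).
            recent = answers[-3 * len(learned):]
            if sum(recent) / len(recent) >= tau:
                break
        else:
            # Threshold never reached: assume the newest category
            # cannot be learned and stop the protocol.
            return learned[:-1], iterations
    return learned, iterations
```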

The data set that we will be using for this part of our testing differs from the dataset used during the offline part. Instead, we will be using the Washington RGB-D Object Dataset, which consists

Figure 3.4: Example object views of the Washington RGB-D dataset.

of 250,000 views of 300 objects. The objects are categorized into a total of 51 categories; see Figure 3.4.

Since the order in which objects are presented to the system may affect its performance, we decided to randomize the sequence in which objects are presented for the handcrafted descriptors, which are tested first. For instance, the system will behave differently if the first 10 object categories are all very similar compared to a scenario in which it receives distinctly different object categories. The deep feature descriptors then use the exact same sequence, allowing us to compare the performance of the two types of descriptors. We repeat this setup 10 times and take the average of the results. Furthermore, we also run one set of experiments per descriptor for different values of the threshold parameter τ. Finally, we assess the system's performance based on the following metrics (a sketch of how they can be computed from a protocol log follows the list):

• Question/Correction Iterations (QCI): the total number of iterations the system has gone through before terminating the protocol.

• Average Learned Categories (ALC): the average number of categories the system has learned.

• Average Instances per Category (AIC): the average number of instances the system required to learn a new object category.

• Global Classification Accuracy (GCA): the accuracy of the agent over the whole run.

• Average Protocol Accuracy (APA): the average accuracy of the system throughout the entire teaching protocol.


4 Results

4.1 Offline Results

For the handcrafted descriptors, the highest average class accuracy (ACA) was achieved by a configuration using the ESF descriptor, the χ2 distance function, and a K-value of 1.

The top 5 ACA results for ESF and GOOD can be found in tables 4.1 and 4.2 respectively.

As for the deep learned features, the highest ACA was achieved by the configuration using the MobileNetV2 descriptor, an orthographic image resolution of 50, AVG pooling, the χ2 distance function, and a K-value of 1. The top 5 results for these two object descriptors can be found in tables 4.3 and 4.4.

For the GOOD object descriptor we observed multiple configurations, differing only in their distance functions, that performed identically in terms of accuracy, with the exception of the top result. For ESF, on the other hand, the values for both average instance accuracy (AIA) and ACA differ throughout the table. The computation times also appear to be longer on average for ESF. For both descriptors, the vast majority of the best performing configurations had a K-value of 1. The top 17 results for the GOOD descriptor used 15 bins instead of 90 (a parameter exclusive to GOOD).

Ultimately, we can see from tables 4.1 and 4.2 that the best performing ESF configuration

Table 4.1: Offline evaluation results for the ESF object descriptor. The top 5 configurations in terms of ACA are displayed. A total of 75 setups were tested.

K  Dist. Func.     AIA     ACA     Time (s)
1  χ2              0.9707  0.9717  5.8
1  Neyman          0.9642  0.9668  7.191
1  KL divergence   0.9674  0.9655  5.47
1  Sorensen        0.9707  0.9645  6.482
1  Bhattacharyya   0.9609  0.9626  6.203

Table 4.2: Offline evaluation results for the GOOD object descriptor. The top 5 configurations in terms of ACA are displayed. A total of 150 setups were tested.

Bins  K  Dist. Func.  AIA     ACA     Time (s)
15    1  Motyka       0.9674  0.9597  1.682
15    1  Euclidean    0.9577  0.9506  1.727
15    1  Manhattan    0.9577  0.9506  1.623
15    1  Sorensen     0.9577  0.9506  1.711
15    1  Gower        0.9577  0.9506  1.591

Table 4.3: Offline evaluation results for the MobileNetV2 object descriptor. The top 5 configurations in terms of ACA are displayed. A total of 900 setups were tested.

Res.  Pooling  K  Dist. Func.  AIA     ACA     Time (s)
50    AVG      1  χ2           0.9511  0.9481  1.117
100   MAX      1  Motyka       0.9511  0.9419  0.9417
100   MAX      1  Sorensen     0.9479  0.9388  0.7708
50    AVG      1  Motyka       0.9446  0.9371  0.7847
50    AVG      1  Sorensen     0.9446  0.9371  0.7787

Table 4.4: Offline evaluation results for the VGG19 fc1 object descriptor. The top 5 configurations in terms of ACA are displayed. A total of 900 setups were tested.

Res.  Pooling  K  Dist. Func.   AIA     ACA     Time (s)
50    APP      3  SymmetricKL   0.9609  0.946   17.97
50    APP      1  χ2            0.9609  0.9454  5.079
50    APP      1  SymmetricKL   0.9609  0.9454  18.16
50    AVG      1  Canberra      0.9609  0.9449  1.72
50    AVG      1  SymmetricKL   0.9609  0.9444  6.507

outperforms the GOOD configuration in both AIA and ACA, but not in computation time. Depending on whether we look purely at accuracy scores or also take computation time into account, the winner could be either configuration. Unlike their deep feature counterparts, the difference between the average computation times of ESF and GOOD is much smaller. Since we are only working with two handcrafted descriptors, we believe it is reasonable to simply test both descriptors in the open-ended scenario, because it may provide some insight into whether we should sacrifice a small amount of accuracy for possibly faster computation. After all, our final goal in the bigger picture is to figure out what works best for real-time applications. It would be in the interest of such applications to be able to perform up to three times faster at a cost of roughly one percentage point of accuracy, assuming that a similar difference in results appears in the online testing phase.

Among the top 5 performing VGG19 fc1 configurations we can see that they all have identical AIA values. Whilst table 4.2 shows a similar trend for the GOOD descriptor, the AIA and ACA values for ESF and MobileNetV2 seem to contain more variance.

We would like to note, however, that configurations with identical AIA and/or ACA values were found throughout the results of all four descriptors. Although VGG19 fc1 outperformed MobileNetV2 in terms of AIA, it loses in ACA and especially in computation time.

As can be seen in the aforementioned tables, the variance between computation times among the top performing configurations of each descriptor type


seems to be low, with the exception of VGG19 fc1.

Two instances using SymmetricKL took around 18 seconds each, versus the three other instances averaging only 4.4353 seconds each. Upon further inspection, we found that among the 90 slowest setups for VGG19 fc1, 55 used SymmetricKL, averaging a computation time of 10.8285 seconds. In the case of MobileNetV2, 41 of the 90 slowest setups belonged to SymmetricKL, averaging a computation time of 4.5713 seconds. Since each distance function is tested a total of 60 times, it is safe to say that SymmetricKL predominantly scores among the slowest 10% of all tests when it comes to deep feature descriptors. Taking all this into account, it appears that MobileNetV2 is generally faster than VGG19 fc1 regardless of distance function.

Table 4.5: Average accuracy and computation time results for the top 10 configurations for all four descriptor types.

Descriptor   AIA     ACA     Time (s)
ESF          0.9651  0.9623  5.9073
GOOD         0.9574  0.9496  1.5003
VGG19 fc1    0.9599  0.9438  30.0716
MobileNetV2  0.9443  0.9358  4.7338

Although we observed that the vast majority of the top performing setups among all four descriptors used a K-value of 1, the rest of the results did not show a strong negative correlation between K-value and accuracy; lower values of K were found among the worse performing configurations as often as higher values. However, it is also impossible to ignore the fact that of the 20 setups shown in tables 4.1 to 4.4, only one setup used a K-value other than 1. This may be caused by the fact that the Restaurant RGB-D dataset has relatively few classes, causing less stable performance, so that in our particular case it is more beneficial to work with lower K-values. The experiments also produced confusion matrices showing the success rates for each object category. The best confusion matrix of each descriptor has been added to the appendix of this paper.

4.2 Online Results

Based on the data collected during the offline evaluation stage we have decided to test MobileNetV2 against both ESF and GOOD. Tables 4.6 and 4.7

Table 4.6: Average results over 10 runs for handcrafted features (ESF, χ2) and deep features (MobileNetV2, χ2) in an online environment.

Model  QCI   ALC  AIC    GCA     APA
DTL    1346  51   392.3  0.7903  0.8014
HC     1355  51   396    0.7891  0.8086

Table 4.7: Average results over 10 runs for handcrafted features (GOOD, Motyka) and deep features (MobileNetV2, χ2) in an online environment.

Model  QCI   ALC  AIC    GCA     APA
DTL    1348  51   399.4  0.7855  0.8096
HC     1374  51   432.9  0.7627  0.7797

respectively show our results. The deep transfer learning (DTL) model appears to outperform the ESF model in every metric aside from APA.

When compared to the GOOD model, it outclasses it in every metric. This means that while GOOD was a few seconds faster than ESF during the offline testing, in an online learning environment it falls short of ESF in every metric that we measure. Note that in the online evaluation stage we do not measure computation time in seconds, but in terms of iterations instead.

Therefore, we decided to use ESF for the remainder of the online evaluation stage.

During both comparisons, the random sequence generator was enabled for the handcrafted descriptor, with the DTL model using the exact same object sequence for a fair comparison. However, this does mean that the sequence of objects differs between GOOD and ESF. Since we ran each experiment 10 times, we believe that the average result is sufficiently representative despite the fact that the sequence of objects was not entirely identical for the two experiment setups.

Figure 4.1 shows the five graphs that are generated for every online experiment. The first graph shows how difficult each category was for the agent to learn; the dashed grey line represents the protocol threshold, which was 0.67 for this experiment.

The second graph shows a generally decreasing GCA as the number of learned categories increases.

The third graph shows how quickly the agent learns each new category. The fourth graph shows the protocol accuracy as a function of the number of learned categories. Lastly, the fifth graph shows how many instances were stored for each category.


Figure 4.1: Evaluation graphs for the MobileNetV2 setup used in table 4.6 (Protocol Accuracy, Global Classification Accuracy, Number of Learned Categories, and Number of Stored Instances per category, plotted against Question/Correction Iterations and Number of Learned Categories).

After obtaining our results from the initial experiments with a threshold value of 0.67, we decided to observe what would happen if we were to increase the value of τ. Since τ determines the protocol accuracy that the model has to achieve within

100 iterations before it is allowed to be presented with another category, it seems intuitive that this parameter is positively correlated with accuracy. Tables 4.8 and 4.9 show the results of these follow-up experiments. It is important to note here


that we only ran one experiment for each threshold value and did not use random sequencing. This ensured that all 8 experiments ended up with the same sequence of objects, so that the effects of changing the threshold were not confounded by other factors.

As expected, the accuracy scores of both descriptor types increased as τ increased. For ESF, we can also observe a decrease in both QCI and ALC.

While this model is still able to learn all 51 categories at τ = 0.7, it appears to be unable to do so for higher values. MobileNetV2, on the other hand, is still able to learn all 51 categories at τ = 0.8, while ESF only managed to learn 30 out of 51 categories at the same threshold. For every change in threshold that resulted in a lower ALC score, the QCI decreases for both descriptors.

The only case in which an increase in τ did not lead to a decrease in QCI was τ = 0.8 for MobileNetV2. The key difference here is that the ALC did not decrease. This likely implies that changing τ from 0.7 to 0.8 made it more difficult for the model to learn all 51 categories, so that it ended up requiring more iterations. The decrease of the QCI score in the other cases is therefore likely caused by the fact that the systems terminated earlier, since they did not manage to learn all 51 categories. We therefore conclude that increasing the threshold improves model accuracy, but past a certain point this comes at the cost of the ability to learn new object categories.

Table 4.8: Results of ESF during online evaluation with varying values for the threshold parameter.

τ     QCI   ALC  AIC    GS      ACS
0.70  1334  51   8.275  0.7984  0.818
0.80  832   30   7.833  0.8257  0.8759
0.90  636   20   7      0.8742  0.9393
0.95  369   12   6.083  0.8997  0.9752

Table 4.9: Results of MobileNetV2 with varying values for the threshold parameter.

τ     QCI   ALC  AIC    GS      ACS
0.70  1375  51   8.392  0.8     0.8353
0.80  1911  51   9.725  0.8205  0.8584
0.90  1729  30   10.2   0.8751  0.9392
0.95  697   19   6.158  0.9139  0.972

5 Conclusions

Using OrthographicNet, we compared two handcrafted feature descriptors (GOOD and ESF) against two deep feature descriptors (MobileNetV2 and VGG19 fc1) in order to find out how their performances differ. During offline testing we found that ESF obtained the highest accuracy scores, whereas VGG19 fc1 was by far the slowest of the four. The vast majority of the top 10 configurations used a K-value of 1 in a KNN algorithm.

We then tested ESF, GOOD, and MobileNetV2 in an open-ended scenario. GOOD was outperformed by its fellow handcrafted descriptor ESF in every single metric, while ESF in turn was outperformed by MobileNetV2 in 4 out of 5 metrics. Next, the importance of the threshold value τ was investigated by conducting another set of experiments with ESF and MobileNetV2. Here we observed that an increase in τ makes the learning process more difficult for the system, resulting in a decrease in ALC: beyond a certain threshold the systems were no longer able to learn all 51 categories. Another likely consequence is an increase in QCI. More important, however, are the increased scores on the accuracy metrics. In this set of experiments we found that MobileNetV2 was able to achieve the highest accuracy scores while still being able to learn all 51 object categories.

This brings us to the final conclusion that deep transfer learning can be a viable approach to pursuing lifelong 3D object recognition in open-ended domains and may be worthwhile to explore further. As for future work, we could explore the possibility of using colour and texture information on top of shape information. Lastly, we could also consider methods other than orthographic projection to create object representations [25].

References

[1] S. H. Kasaei, J. Melsen, F. van Beers, C. Steenkist, and K. Voncina, “The state of service robots: Current bottlenecks in object perception and manipulation,” arXiv preprint arXiv:2003.08151, 2020.

[2] D. Tang, Q. Fang, L. Shen, and T. Hu, “Onboard detection-tracking-localization,” vol. 25, pp. 1555–1565, 2020.

[3] A. Kanezaki, Y. Matsushita, and Y. Nishida, “RotationNet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5010–5019, 2018.

[4] A. Garcia-Garcia, F. Gomez-Donoso, J. Garcia-Rodriguez, S. Orts-Escolano, M. Cazorla, and J. Azorin-Lopez, “PointNet: A 3D convolutional neural network for real-time object class recognition,” in 2016 International Joint Conference on Neural Networks (IJCNN), pp. 1578–1584, IEEE, 2016.

[5] S. Zhi, Y. Liu, X. Li, and Y. Guo, “LightNet: A lightweight 3D convolutional neural network for real-time 3D object recognition,” in 3DOR, 2017.

[6] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3D ShapeNets: A deep representation for volumetric shapes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1912–1920, 2015.

[7] B. Shi, S. Bai, Z. Zhou, and X. Bai, “DeepPano: Deep panoramic representation for 3-D shape recognition,” IEEE Signal Processing Letters, vol. 22, no. 12, pp. 2339–2343, 2015.

[8] H. Kasaei, “OrthographicNet: A deep learning approach for 3D object recognition in open-ended domains,” arXiv preprint arXiv:1902.03057, 2019.

[9] S. H. Kasaei, A. M. Tomé, L. S. Lopes, and M. Oliveira, “GOOD: A global orthographic object descriptor for 3D object recognition and manipulation,” Pattern Recognition Letters, vol. 83, pp. 312–320, 2016.

[10] W. Wohlkinger and M. Vincze, “Ensemble of shape functions for 3D object classification,” in 2011 IEEE International Conference on Robotics and Biomimetics, pp. 2987–2992, IEEE, 2011.

[11] X.-F. Han, J. S. Jin, J. Xie, M.-J. Wang, and W. Jiang, “A comprehensive review of 3D point cloud descriptors,” arXiv preprint arXiv:1802.02297, 2018.

[12] F. Tombari, S. Salti, and L. Di Stefano, “A combined texture-shape descriptor for enhanced 3D feature matching,” in 2011 18th IEEE International Conference on Image Processing, pp. 809–812, IEEE, 2011.

[13] R. B. Rusu, G. Bradski, R. Thibaux, and J. Hsu, “Fast 3D recognition and pose using the viewpoint feature histogram,” in 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2155–2162, IEEE, 2010.

[14] S. Kasaei, J. Sock, L. S. Lopes, A. M. Tomé, and T.-K. Kim, “Perceiving, learning, and recognizing 3D objects: An approach to cognitive service robots,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, 2018.

[15] S. H. Kasaei, M. Oliveira, G. H. Lim, L. S. Lopes, and A. M. Tomé, “Towards lifelong assistive robotics: A tight coupling between object perception and manipulation,” Neurocomputing, vol. 291, pp. 151–166, 2018.

[16] T. Fäulhammer, R. Ambruş, C. Burbridge, M. Zillich, J. Folkesson, N. Hawes, P. Jensfelt, and M. Vincze, “Autonomous learning of object models on a mobile robot,” IEEE Robotics and Automation Letters, vol. 2, no. 1, pp. 26–33, 2016.

[17] S. H. Kasaei, A. M. Tomé, and L. S. Lopes, “Hierarchical object representation for open-ended object category learning and recognition,” Advances in Neural Information Processing Systems, vol. 29, pp. 1948–1956, 2016.

[18] M. Oliveira, L. S. Lopes, G. H. Lim, S. H. Kasaei, A. M. Tomé, and A. Chauhan, “3D object perception and perceptual learning in the RACE project,” Robotics and Autonomous Systems, vol. 75, pp. 614–626, 2016.

[19] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?,” in Advances in Neural Information Processing Systems, pp. 3320–3328, 2014.

[20] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas, “Volumetric and multi-view CNNs for object classification on 3D data,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5648–5656, 2016.

[21] A. Sinha, J. Bai, and K. Ramani, “Deep learning 3D shape surfaces using geometry images,” in European Conference on Computer Vision, pp. 223–240, Springer, 2016.

[22] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, “Multi-view convolutional neural networks for 3D shape recognition,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 945–953, 2015.

[23] S. H. Kasaei, M. Oliveira, G. H. Lim, L. S. Lopes, and A. M. Tomé, “Interactive open-ended learning for 3D object recognition: An approach and experiments,” Journal of Intelligent & Robotic Systems, vol. 80, no. 3-4, pp. 537–553, 2015.

[24] S.-H. Cha, “Comprehensive survey on distance/similarity measures between probability density functions,” City, vol. 1, no. 2, p. 1, 2007.

[25] T. Parisotto and H. Kasaei, “MORE: Simultaneous multi-view 3D object recognition and pose estimation,” in 30th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), 2021.


A Appendix

The heat-maps below are the confusion matrices for the best performing configurations for each descriptor during the offline training phase. Each coloured square along the diagonal depicts how often a certain object class was correctly classified. All other squares show the mistakes that were made, and what the object classifier mistook the actual object for. For instance, we can observe that each descriptor scored the lowest accuracy for target category 4.

Accuracy: 96.09% (ESF)    Accuracy: 96.74% (GOOD)
Accuracy: 93.49% (MobileNetV2)    Accuracy: 93.81% (VGG19 fc1)

Figure A.1: The best confusion matrices of the proposed approaches, based on (top row) hand-crafted descriptors and (bottom row) deep features. Each matrix plots predicted category against target category for the ten classes of the Restaurant RGB-D dataset.


B Appendices

As explained in the Online Results section, we conducted several experiments while varying the threshold value. Similarly to figure 4.1, we generated three different graphs for each descriptor. In this case, however, we did so for every threshold value, allowing us to combine the graphs into a clearer visual representation of how changing the threshold value affected the experiment results.

Figure B.1: Deep features: merged graphs for varying τ values (τ = 0.70, 0.80, 0.90, 0.95). (Top) Global Classification Accuracy (GCA) for each number of categories learned by the model. An inverse correlation between the two can be observed, although there appear to be diminishing returns. A decrease in GCA as the total number of learned categories increases makes sense, because even from a purely statistical point of view, regardless of the object classes, it should become more difficult to pick the correct object class as more classes are introduced. (Middle) Each point in the graph marks when a new category has been learned. At the beginning of the experiments, relatively few question/correction iterations (QCI) are needed for each new category; as the number of learned classes increases, so does the number of QCI needed to learn a new category. (Bottom) An overview of the number of stored instances per object category. Larger numbers generally indicate more difficulty learning an object, since each time the model fails it stores the new object views and has to try again. It appears that bell-pepper, dry-battery, pear, camera, and soda-can were generally the most difficult classes.
