
Transfer Learning in Joint Neural Embeddings for Cooking Recipes and Food Images

Submitted in partial fulfillment of the requirements for the degree of Master of Science

Mariel van Staveren

11773952

Master Information Studies: Data Science

Faculty of Science, University of Amsterdam
Date of defence: 2018-06-22

              Internal Supervisor      External Supervisor
Title, Name   Dr Thomas Mensink        Dr Vladimir Nedović
Affiliation   UvA, FNWI                Flavourspace
Email         thomas.mensink@uva.nl    v@flavourspace.com


Contents

Abstract
1 Introduction
2 Related Work
  2.1 Transfer Learning
  2.2 Nutritional value
3 The joint neural embedding model
  3.1 Representation of ingredients
  3.2 Representation of cooking instructions
  3.3 Representation of images
  3.4 Joint neural embedding
4 The recipe collections
  4.1 Recipe1M (R1M)
  4.2 Jamie Oliver (JO)
  4.3 Allerhande (AH)
5 Performance of the pre-trained model
  5.1 Pre-processing JO and AH
  5.2 Preparing the test-sets
  5.3 Intra-collection retrieval
  5.4 Inter-collection retrieval
6 Experiments
  6.1 Fine-tune pre-trained model on Jamie Oliver
  6.2 Fine-tune pre-trained model on Allerhande
  6.3 Nutritional value as a feature
7 Fine-Tuning versus Training from scratch
8 Conclusions
  8.1 Acknowledgements
References


Transfer Learning in Joint Neural Embeddings for Cooking Recipes and Food Images

ABSTRACT

The research focus of this paper is two-fold. First, we address the performance of the joint neural embedding model for cooking recipes and food images [2] over different recipe collections. The model is trained on a large recipe collection that contains many user-submitted recipes and food images. Performance on professional recipes is therefore expected to be low. To enhance the usability of the model, our aim is to produce one model that performs well on both professional and amateur recipes via transfer learning. Second, given the increased interest in and access to nutritional value information, we assess the benefit of adding nutritional value information as a new feature in the model. Two small professional recipe collections are used in this project: the Jamie Oliver (JO) collection and the Dutch Allerhande (AH) collection. Transfer learning is deployed to increase the model’s performance on these professional recipe collections via multiple fine-tuning methods. The results suggest that the JO collection is too small to achieve an increase in performance through fine-tuning. We found that the best method to increase the model’s performance on the AH collection is to fine-tune the pre-trained model on the translated AH collection, using the pre-trained text representation models. Interestingly, this method results in increased performance on amateur recipes as well. This means that the benefits of transfer learning are not restricted to the target task (i.e. professional recipes), but serve the base task (i.e. amateur recipes) as well. Finally, qualitative and quantitative experiments show that the model’s performance can be increased by adding nutritional value as a new feature.

1 INTRODUCTION

People’s lives are increasingly intertwined with the World Wide Web, including one of our most fundamental needs: food. A large corpus of cooking recipes and food images is currently available on the Web. The largest publicly available recipe collection has been constructed by Salvador et al. [2]. This collection contains over 1 million cooking recipes and 800k food images, and is referred to as the Recipe1M (R1M) collection. Using this collection, Salvador et al. created a joint neural embedding model to embed recipes and food images in a common high-dimensional vector space. The model is trained to minimize the distance in space between a recipe and its corresponding food image. The model yields impressive results on image-recipe and recipe-image retrieval tasks. Based on this model, an application could be developed where, for example, users can input a picture of a delicious lunch and receive a corresponding recipe so they can recreate the dish.

However, the R1M collection contains many user-submitted recipes and food images. Amateur recipes generally differ from professional recipes in the type and number of ingredients and the style of the instructions. Additionally, amateur food images differ from professional images in aspects such as composition, resolution, lighting, clarity, and color distribution. Consequently, the joint neural embedding model’s performance on professional recipe collections is expected to be low. This limits the model’s usability in the sense that application developers have to restrict the possible scope of inputs and outputs to amateur recipes and food images. This means that users can only correctly retrieve recipes and food images from fellow amateur cooks, while they may actually be interested in suggestions from a professional chef.

Transfer learning refers to improved learning of a target task through the transfer of knowledge from a base task [14]. Amateur and professional recipes can be considered as two separate tasks. In this case, the joint neural embedding model, pre-trained on the amateur recipes and food images of the R1M collection, serves as the base model. This pre-trained model’s knowledge on amateur recipes can be exploited to train a new model that learns to embed professional recipes and food images. However, our aim is to produce one model that performs well on both professional and amateur recipes. Therefore, our approach diverges from traditional transfer learning in the sense that we aim for improved learning of a target task (i.e. professional recipes) without compromising the performance on the base task (i.e. amateur recipes).

In this project, two small professional recipe collections are used: the Jamie Oliver (JO) collection and the Allerhande (AH) collection. The AH collection has two interesting properties: the recipes are in Dutch instead of English, and the recipes contain nutritional value information. First, we assess how well the model performs on these professional recipe collections compared to amateur recipes from the R1M collection. Next, transfer learning is applied by fine-tuning the pre-trained model on the professional recipe collections. Additionally, we want to assess the benefit of adding nutritional value information as a new feature in the model. Consequently, the research questions posed in this work are as follows:

Question 1: Performance of pre-trained model Does the joint neural embedding model, pre-trained on amateur recipes and food images, perform equally well on professional and amateur recipe collections? To answer this question, the pre-trained model is tested on image-recipe and recipe-image retrieval tasks for the R1M, JO, and AH recipe collections separately. This is referred to as intra-collection retrieval, because the query item and the retrieved items originate from the same collection. The pre-trained model is also tested on inter-collection retrieval. For example, recipes from the JO collection are retrieved for a query image from the R1M collection. Low performance on inter-collection retrieval would indicate a discrepancy between amateur and professional recipes.

(4)

Question 2: Fine-tune pre-trained model on Jamie Oliver What is the best fine-tuning method to enhance the pre-trained model’s performance on the JO collection? To answer this question, multiple fine-tuning methods are applied. Learning is deemed to be correct when no over- or under-fitting is apparent. All methods that result in correct learning are further evaluated by testing the new models on the intra- and inter-collection retrieval tasks. The goal is to increase the model’s performance on the JO collection without compromising the performance on the R1M collection.

Question 3: Fine-tune pre-trained model on Allerhande What is the best method to enhance the pre-trained model’s performance on a recipe collection that is in Dutch instead of English (i.e. the AH collection)? Retrieval performance of the pre-trained model is used as a baseline. Multiple methods are applied to increase the model’s performance on the AH collection. The goal is to increase the model’s performance on the AH collection without compromising the performance on the R1M collection.

Question 4: Nutritional value as a feature Does adding nutritional value as a feature increase the model’s performance on the AH collection? Retrieval performance of the best performing model of the previous section is used as a baseline. The AH collection contains nutritional value information for each recipe. A qualitative assessment is designed to investigate if nutritional value has any meaningful discriminative power. Next, nutritional value is incorporated as a feature in the model. Retrieval performance of the new model is compared to the baseline.

Overview of thesis. The Related Work section reviews relevant academic work. Next, the joint neural embedding model and the text-representation models are described. The next section contains information on the content of the three recipe collections (R1M, JO, and AH). The fifth section ("Performance of the pre-trained model") describes the experiments that are used to test the model’s performance on all three recipe collections, and reviews the results. The Experiments section first explains the experiments that are designed to increase the model’s performance on the JO and AH recipe collections through transfer learning. Additionally, the last experiment assesses the added value of utilizing nutritional value information as a feature. The next section ("Fine-Tuning versus Training from scratch") discusses an additional observation concerning large performance differences between training from scratch and fine-tuning methods. Finally, the outcomes are summarized in the Conclusions section.

2 RELATED WORK

2.1 Transfer Learning

Transfer learning is a method where a trained model is used as a starting point for another model on a related task [15]. Transfer learning is a popular method in deep learning because using a pre-trained model saves time and computational resources [17].

In research by [16], a pre-trained deep convolutional neural network (CNN) was fine-tuned on medical images to perform tasks such as classification, detection, and segmentation. Such a pre-trained CNN is trained, for example, on a large set of labeled natural images. They compared the fine-tuned CNN model to a CNN model that was trained from scratch. The results showed that the fine-tuned model outperformed the model trained from scratch. Importantly, they analyzed how the size of the training set influenced the performance of both models. A reduced training set size led to a larger decrease in performance for the model trained from scratch than for the fine-tuned model. This means that the fine-tuned CNN is more robust to training set size.

The fine-tuning approach used by [16] is similar to our approach. In our approach, the joint neural embedding model by Salvador et al. [2] is fine-tuned on small professional recipe collections. Our approach diverges from the approach used by [16] because we aim to preserve the model’s performance on the base task (i.e. amateur recipes), while increasing the model’s performance on the target task (i.e. professional recipes). This project will contribute to the field of transfer learning by testing the feasibility of this approach within the joint neural embedding model.

2.2 Nutritional value

People increasingly take nutritional value into account when making food choices. Recent research has been focused on algorithmic nutritional estimation from text [10] or images [11]. Interestingly, research by [9] showed that simple models outperform human raters on nutritional estimation tasks. This means that nutritional value information can be obtained for all recipes, even for recipes that do not explicitly contain nutritional value information. Due to the increased interest in and easy access to nutritional value information, incorporating it into image-recipe embeddings is a meaningful contribution.

3 THE JOINT NEURAL EMBEDDING MODEL

Recipes consist of three features: the ingredients, the cooking instructions, and the food image. The representations of these features are discussed first. Next, the joint neural embedding model is described.

3.1 Representation of ingredients

Ingredient names are extracted from the ingredient-text. For example, "olive oil" is extracted from "2 tbsp of olive oil". Each ingredient is represented by a word2vec representation [3]. The skip-gram word2vec model represents each word as a vector. Two vectors are close in vector space when the corresponding words are placed in similar context. The word2vec model has been pre-trained on the cooking instructions of the R1M collection, and returns vectors with a dimensionality of 300. The pre-trained word2vec model has been made publicly available by Salvador et al. [2].
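To make this concrete, the following is a minimal sketch of querying such a word2vec model with gensim; the file name "vocab.bin" and the token "olive_oil" are hypothetical, and this is not the authors’ pipeline.

    # Minimal sketch (not the authors' code): querying a word2vec model for
    # ingredient vectors with gensim. "vocab.bin" and "olive_oil" are hypothetical.
    from gensim.models import KeyedVectors

    w2v = KeyedVectors.load_word2vec_format("vocab.bin", binary=True)

    vec = w2v["olive_oil"]                         # 300-dimensional ingredient vector
    print(vec.shape)                               # (300,)
    print(w2v.most_similar("olive_oil", topn=3))   # ingredients used in similar contexts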

3.2 Representation of cooking instructions

Cooking instructions are represented through a two-stage LSTM method. An LSTM is a recurrent neural network that can learn long-term word dependencies. LSTMs are suitable for language modeling because the probability of a word sequence can be modeled. In the first stage, a sequence-to-sequence LSTM model is applied to each single cooking instruction to obtain a so-called skip-instructions vector representation [5]. This first LSTM model is referred to as the skip-instructions model, and it has been trained on the R1M collection. The second stage of the two-stage LSTM method is integrated in the joint neural embedding model, and is discussed in section 3.4.
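As an illustration of the first stage, the sketch below shows a single-layer LSTM that maps one tokenized instruction to a fixed-size vector, which is the role the skip-instructions model plays here. It is written in PyTorch rather than the Torch7 used in this thesis, and the dimensions are assumptions.

    # Illustrative sketch (PyTorch, not the thesis' Torch7 code): an LSTM encoder
    # mapping one tokenized cooking instruction to a fixed-size vector.
    import torch
    import torch.nn as nn

    class InstructionEncoder(nn.Module):
        def __init__(self, vocab_size=30000, emb_dim=300, hidden_dim=1024):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

        def forward(self, token_ids):        # token_ids: (batch, seq_len)
            x = self.embed(token_ids)        # (batch, seq_len, emb_dim)
            _, (h_n, _) = self.lstm(x)       # final hidden state of the LSTM
            return h_n[-1]                   # (batch, hidden_dim): one vector per instruction

    encoder = InstructionEncoder()
    instruction = torch.randint(0, 30000, (1, 12))   # one instruction of 12 tokens
    print(encoder(instruction).shape)                # torch.Size([1, 1024])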

The word2vec model and the skip-instructions model together are referred to as the text-representation models.

3.3 Representation of images

All food images are resized and center-cropped to 256 x 256 images. The images are represented by adopting the deep convolutional neural network (CNN) Resnet-50 [6]. The Resnet-50 model is integrated in the joint neural embedding model, and is discussed in section 3.4.
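The sketch below shows this image pathway with torchvision; the exact crop and normalization details of the original implementation may differ, and the file name is hypothetical. Replacing the final classification layer with an identity makes the network return a 2048-dimensional feature vector.

    # Sketch of the image pathway (PyTorch/torchvision; exact preprocessing
    # details of the original implementation may differ).
    import torch
    import torch.nn as nn
    from torchvision import models, transforms
    from PIL import Image

    preprocess = transforms.Compose([
        transforms.Resize(256),          # resize the shorter side to 256 pixels
        transforms.CenterCrop(256),      # center-crop to 256 x 256
        transforms.ToTensor(),
    ])

    resnet = models.resnet50(pretrained=True)
    resnet.fc = nn.Identity()            # drop the final softmax classification layer
    resnet.eval()

    img = preprocess(Image.open("food.jpg")).unsqueeze(0)   # "food.jpg" is hypothetical
    with torch.no_grad():
        features = resnet(img)           # (1, 2048) image representation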

3.4 Joint neural embedding

The joint neural embedding model is implemented in Torch7 [13]. The model is visualized in Figure 1 (adopted from Salvador et al. [2]). It contains two encoders: one for ingredients, and one for cooking instructions. The ingredients-encoder combines the word2vec vectors of all ingredients through the use of a bidirectional LSTM model. The instructions-encoder forms the second stage of the two-stage LSTM method discussed in section 3.2. This second LSTM model represents all skip-instructions vectors of a recipe as one vector. The encoder outputs are concatenated to obtain the recipe representation. The recipe representation is embedded into the joint neural embedding space. As discussed before, the image representations are obtained through the Resnet-50 model. The Resnet-50 model is incorporated into the joint neural embedding model by removing the final softmax classification layer and projecting the image representation into the embedding space through a linear transformation. The joint neural embedding model is trained to learn transformations that minimize the distance in space between a recipe and its corresponding image.
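The following condensed sketch shows how these pieces fit together, under assumed encoder output sizes and a 1024-dimensional embedding space: the recipe side concatenates the two encoder outputs, and both sides are projected linearly into a shared space where matching pairs are pulled together. The semantic regularization branch used by Salvador et al. [2] is omitted.

    # Condensed sketch of the joint embedding (PyTorch; dimensions are
    # assumptions, semantic regularization omitted).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class JointEmbedding(nn.Module):
        def __init__(self, ingr_dim=600, instr_dim=1024, img_dim=2048, emb_dim=1024):
            super().__init__()
            self.recipe_proj = nn.Linear(ingr_dim + instr_dim, emb_dim)
            self.image_proj = nn.Linear(img_dim, emb_dim)

        def forward(self, ingr_vec, instr_vec, img_vec):
            recipe = torch.cat([ingr_vec, instr_vec], dim=1)  # concatenated encoder outputs
            r = F.normalize(self.recipe_proj(recipe), dim=1)  # recipe embedding
            v = F.normalize(self.image_proj(img_vec), dim=1)  # image embedding
            return r, v

    # Training pulls matching pairs together in the embedding space, e.g. with a
    # cosine embedding loss (target = 1 for matching pairs, -1 otherwise):
    # loss = F.cosine_embedding_loss(r, v, target)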

4 THE RECIPE COLLECTIONS

Three recipe collections are used in this project: R1M, JO, and AH. An overview of collection characteristics is depicted in Table 1. Complete example recipes from each recipe collection are added to the Appendix (see Figures 11, 12, and 13).

Figure 2: Examples of food images from the Recipe1M collection

4.1 Recipe1M (R1M)

The Recipe1M collection has been made publicly available by Salvador et al. [2]. This dataset was collected by scraping over two dozen cooking websites, extracting and cleaning relevant text from the raw HTML, and downloading associated images. The features that are stored for each recipe are: ID, title, instructions, ingredient names, partition (i.e. train, test, or validation), the URL, and the names of the images that the recipe is associated with. Examples of R1M food images are shown in Figure 2.

4.2 Jamie Oliver (JO)

The Jamie Oliver (JO) collection was scraped from the website jamieoliver.com. Compared to the R1M collection, the JO collection is much smaller, and on average contains more ingredients and cooking instructions (see Table 1). Food images in the JO collection are of higher quality (with respect to composition, resolution, etc.) than the food images from the R1M collection (see Figure 3).

Figure 3: Examples of food images from the Jamie Oliver collection

4.3 Allerhande (AH)

The Allerhande (AH) collection was scraped from the website allerhande.nl. The AH collection is much smaller than the R1M collection, but larger than the JO collection. Compared to the R1M collection, food image quality is high (with respect to composition, resolution, etc.) (see Figure 4). Additionally, food images in the AH collection are much wider and larger than food images from the other two collections.

Figure 4: Examples of food images from the Allerhande collection


Figure 1: Overview of the joint neural embedding model. Figure adopted from Salvador et al. [2].

                                        Recipe1M (R1M)              Jamie Oliver (JO)   Allerhande (AH)
Website of origin                       Various well-known recipe   jamieoliver.com     allerhande.nl
                                        collections (e.g. food.com,
                                        kraftrecipes.com,
                                        allrecipes.com,
                                        tastykitchen.com)
Language                                English                     English             Dutch
Total number of recipes                 1,029,720                   1,097               13,179
Train | Test | Val                      n/a | 3,480 | n/a           571 | 142 | 77      8,645 | 2,463 | 1,233
Average number of ingredients           9.3 ± 4.3                   11.7 ± 5.6          7.6 ± 2.2
Average number of instructions          10.4 ± 6.9                  15.6 ± 8.2          11.8 ± 4.9
Average instruction length (in words)   60.2 ± 36.8                 92.3 ± 66.6         51.8 ± 33.6
Average image size (height x width)     562 x 646                   689 x 513           1600 x 550

Table 1: Overview of collection characteristics. The total number of recipes includes recipes that are removed during pre-processing.

5 PERFORMANCE OF THE PRE-TRAINED MODEL

In this section, we assess the performance of the pre-trained model for all three recipe collections.

5.1 Pre-processing JO and AH

After scraping the JO and AH datasets from their corresponding websites of origin, relevant text is extracted from the raw HTML. The text is cleaned by removing excessive whitespace, HTML entities, and non-ASCII characters (method adopted from Salvador et al. [2]). Next, all recipes are assigned a unique 10-digit hexadecimal ID. The recipes are segmented into training, test, and validation sets (ratio: 0.7, 0.2, 0.1, respectively).
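A minimal sketch of this cleaning step is shown below; the rules are illustrative rather than the exact ones used in this project.

    # Minimal sketch of the cleaning step (illustrative rules, not the exact
    # ones used in this project).
    import html
    import re

    def clean_text(raw):
        text = html.unescape(raw)                       # resolve HTML entities
        text = re.sub(r"\s+", " ", text)                # collapse excessive whitespace
        text = text.encode("ascii", "ignore").decode()  # drop non-ASCII characters
        return text.strip()

    print(clean_text("2&nbsp;tbsp  of  olive oil "))    # -> "2 tbsp of olive oil"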

5.2 Preparing the test-sets

Test-sets are prepared for the R1M, JO, and AH collections. Since the original R1M collection is very big, a subset of the R1M test-recipes suffices. The sizes of the test-sets are depicted in Table 1. To be able to apply the pre-trained model to the AH collection (originally in Dutch), the AH test-recipes are translated into English through Google Translate.


                    im2recipe                          recipe2im
                MedR    R@1     R@5     R@10       MedR    R@1     R@5     R@10
R1M             5.75    0.229   0.495   0.621      6.9     0.217   0.462   0.587
JO              9.95    0.129   0.383   0.514      14.1    0.098   0.292   0.438
(Translated) AH 18.55   0.093   0.287   0.389      21.8    0.059   0.206   0.342

Table 2: Performance of the pre-trained model on the R1M, JO, and AH collections

The JO test-set consists of all JO test-recipes. For each of the three test-sets, recipe representations are extracted using the text-representation models discussed in section 3.

The JO test-set is small because any recipes that contain more than 20 instructions or ingredients are excluded. Recipes that do not contain any known ingredients (i.e. ingredients that are in the vocabulary of the word2vec model) are excluded as well. From the JO collection, 307 recipes were excluded, often because the number of instructions exceeded 20.

Finally, the pre-trained joint neural embedding model is applied to all three test-sets. For each recipe, the model returns two vectors that represent the recipe and the corresponding image in the embedding space. These vector representations are used in the subsequent retrieval experiments.

5.3 Intra-collection retrieval

The pre-trained model is tested, for each test-set, on two retrieval tasks: the im2recipe and the recipe2im task. In the im2recipe task, recipes are retrieved for a query food image. In the recipe2im task, food images are retrieved for a query recipe. The im2recipe task is performed by randomly selecting a subset of 100 test recipes and their corresponding images. Each recipe and food image is represented by a vector in the embedding space. The similarity of two vectors x and y is determined by their cosine similarity according to the equation:

    cos(x, y) = (x · y) / (||x|| ||y||)    (1)

For each image in the subset, all recipes are ranked on the basis of their cosine similarity to the image. The rank signifies the position of the ground truth recipe in the list of ranked recipes. When all images in the subset have been queried, the median rank (MedR) and the recall rates at top 1, 5, and 10 (R@1, R@5, and R@10) are calculated (adopted from Salvador et al. [2]). This experiment is repeated 10 times. Mean performances are reported. The recipe2im task is evaluated in the same manner.
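The sketch below shows how MedR and R@K can be computed from the embeddings for one subset. The variable names are illustrative, and the embedding rows are assumed to be L2-normalized so that the dot product equals the cosine similarity of Equation 1.

    # Sketch of the retrieval evaluation for one subset (NumPy; illustrative
    # variable names). img_emb and rec_emb are (N, d) arrays where row i of
    # each is a matching pair; rows are assumed L2-normalized.
    import numpy as np

    def im2recipe_scores(img_emb, rec_emb, k=(1, 5, 10)):
        sims = img_emb @ rec_emb.T                             # (N, N) similarity matrix
        ranks = []
        for i in range(sims.shape[0]):
            order = np.argsort(-sims[i])                       # recipes, most similar first
            ranks.append(int(np.where(order == i)[0][0]) + 1)  # rank of the ground truth
        ranks = np.array(ranks)
        medr = np.median(ranks)
        recall = {j: float(np.mean(ranks <= j)) for j in k}    # R@1, R@5, R@10
        return medr, recall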

Mean performances are displayed in Table 2. As expected, performance on the R1M test-set is higher than on the JO and AH test-sets, for both retrieval tasks and all performance measures. Interestingly, performance on the JO test-set is much higher than on the (translated) AH test-set. This signifies that the model’s performance is collection-specific. This collection-specificity most likely depends on how similar the specific collection is to the R1M collection with respect to image and recipe features and the co-occurrence of these features. The results imply that the R1M collection is more similar to the JO collection than to the AH collection. Another possibility is that the low performance on the AH test-set is due to translation errors. Overall, the results suggest that the pre-trained joint neural embedding model does not perform equally well on professional and amateur recipes.

Figure 5: Inter-collection (R1M & JO) ranking results. These plots show the sorted reported ranks (on the y-axis) for 10 randomly chosen query items (on the x-axis). When no relevant recipe was found, the rank was set to 16.

Figure 6: Inter-collection (R1M & AH) ranking results. These plots show the sorted reported ranks (on the y-axis) for 10 randomly chosen query items (on the x-axis). When no relevant recipe was found, the rank was set to 16.


Figure 7: Example of a ranking and the subsequent relevance judgment. In this example, JO recipes are retrieved for a R1M image query (i.e., im2recipe). Only the first six recipes (excluding cooking instructions) are shown due to limited space. The recipe that has been judged as "relevant" is encircled by the green box. In this case, rank = 1.

5.4 Inter-collection retrieval

Inter-collection retrieval refers to retrieving items from collection A for a query item from collection B. This experiment is designed to assess the model’s ability to directly match items from different recipe collections. The R1M collection will be matched with both the AH and JO collections. The experiment is performed in both directions (i.e. from R1M to JO/AH, and from JO/AH to R1M). The method will be explained by walking through an example where recipes from the JO collection are retrieved for food image queries from the R1M collection (i.e. im2recipe).

The JO test-set does not contain the ground truth recipe that belongs to the R1M query image. Therefore, the relevance of the retrieved recipes has to be determined qualitatively. A recipe is deemed to be relevant to the query image if it describes a similar dish-type, with similar ingredients.

First, one R1M query image is randomly selected from the R1M test-set. For any R1M query image, there is a possibility that the JO test-set by chance does not contain any relevant items. To diminish the probability of this happening, a recipe-subset of 130 (instead of 100) recipes is randomly selected from the JO test-set. The size of this subset is limited by the size of the JO test-set. Similar to the intra-collection experiment, all JO test-recipes are ranked on the basis of their cosine similarity to the R1M query image.

Finally, the first 15 retrieved recipes are manually inspected, and the rank of the first relevant recipe is reported. When no relevant recipe is found, the rank is set to 16. This experiment is repeated 10 times. An example is shown in Figure 7.

The reported ranks for the R1M and JO combination are sorted and shown in Figure 5. In the im2recipe task, six out of the ten queries resulted in a relevant recipe in the top-15. Performance is lower for the recipe2im task, which corresponds to the results of intra-collection retrieval (see Table 2). The reported ranks for the R1M and AH combination are shown in Figure 6. In the im2recipe task, only three out of the ten queries resulted in a relevant recipe in the top-15. Overall, these results suggest that the pre-trained model’s ability to directly match items from amateur and professional collections is limited. This emphasizes the discrepancy between amateur and professional recipes.

6 EXPERIMENTS

6.1 Fine-tune pre-trained model on Jamie Oliver

This section describes the experiments that have been designed to answer the second research question: What is the best method to enhance the pre-trained model’s performance on the JO collection?


Method   Text representation models   Fixed weights
1        Trained on R1M               No
2        Trained on R1M               Yes
3        Trained on JO                No
4        Trained on JO                Yes

Table 3: Fine-tuning methods for the Jamie Oliver collection

Four different fine-tuning methods are proposed. The approaches differ in weight fixation and the specific text-representation models that are used. These variables are described below. An overview of all four approaches is shown in Table 3.

Preparing the dataset. To maximize the number of training recipes, the JO recipe collection is re-segmented into a training and validation set (ratio: 0.9, 0.1, respectively). The number of instructions and ingredients is limited to 20, to prevent recipes from being excluded. The training-set contains 985 recipes, and the validation-set contains 110 recipes.

Text-representation models. As discussed before, the word2vec and the skip-instructions model are together referred to as the text-representation models. These models are used to extract recipe representations when preparing the dataset for training and testing. There are two possibilities: either the pre-trained text-representation models are used (i.e. trained on the R1M collection), or the text-representation models are completely re-trained on the JO collection.

Weight-fixation. Weight fixation refers to freezing model parameters during training. This can be used to restrict the learning to a specific part of the model. The amount of weight fixation depends on which text-representation models are used. If the pre-trained text-representation models are used, either all model parameters are fine-tuned (i.e. no weights are fixed) or only the parameters of the last two layers are fine-tuned. These are the layers that project the recipe and image representations onto the embedding space. If the text-representation models are trained on the JO collection, fine-tuning only the last two layers is insufficient, because the ingredients and instructions encoders have to be adjusted to incorporate the new text-representation models. Therefore, the layers representing the two encoders are fine-tuned in addition to the last two layers.
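In PyTorch terms (the thesis implementation is in Torch7), fixing weights amounts to switching off gradients, as in the sketch below. The attribute names follow the JointEmbedding sketch from section 3.4 and are not the original code.

    # Sketch of weight fixation in PyTorch terms (the thesis uses Torch7). The
    # attribute names follow the JointEmbedding sketch from section 3.4.
    model = JointEmbedding()

    for param in model.parameters():
        param.requires_grad = False        # fix all weights

    for layer in (model.recipe_proj, model.image_proj):
        for param in layer.parameters():
            param.requires_grad = True     # fine-tune only the last two layers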

The loss curves for each fine-tuning method are shown in Figure 8. For all plots, the values of the hyper-parameters are fixed to allow for comparison. All plots show a decreasing training loss yet an unchanging validation loss. This suggests that the model is not learning any new trends. Transferability of features is limited when the distance between the base task (i.e. R1M) and the target task (i.e. JO) is large [18]. However, this is an unlikely explanation given the relatively small performance difference of the pre-trained model on the R1M and JO recipe collections (see Table 2). The unchanging validation loss could be an indication of over-fitting. This means that instead of learning to match JO recipes to JO food images, the model "memorizes" the correct recipe-image mappings from the JO training-set. This implies that the JO training set is too small and the model too complex.

Figure 8: Training (blue) and validation (red) loss curves. For all plots, the hyper-parameters are fixed: batch size = 15, learning rate = 0.00008, number of iterations = 15,000, running time = 3 hours.

The unchanging validation loss could also be an indication of an imbalance between the training and validation sets. In that case, the model correctly learns the underlying trends in the training set, but fails to perform well on the validation set due to the differences between the sets. Adjusting the hyper-parameters or the amount of weight fixation in any of the fine-tuning methods did not improve the results.

Given that none of the four methods resulted in correct learning, no model evaluation is performed. These results suggest that it is difficult to increase the pre-trained model’s performance on the JO collection. This might be due to the small size of the JO collection and the high complexity of the joint neural embedding model. The model’s performance on the JO collection can possibly be increased by either increasing the size of the JO collection, or decreasing the complexity of the model.

6.2 Fine-tune pre-trained model on Allerhande

This section describes the experiments that have been designed to answer the third research question: What is the best method to enhance the pre-trained model’s performance on a recipe collection that is in Dutch instead of English (i.e. the AH collection)?


Method   Language   Text representation models   Fixed weights
1        Dutch      Trained on Dutch AH          No
2        Dutch      Trained on Dutch AH          Yes
3        English    Trained on R1M               No
4        English    Trained on R1M               Yes
5        English    Trained on English AH        No
6        English    Trained on English AH        Yes

Table 4: Fine-tuning methods for the Allerhande collection

            AH                      R1M
            im2rec    rec2im       im2rec    rec2im
Baseline    36.0      39.55        5.75      6.9
1           7.55      7.45         50.7      49.1
2           6.05      5.8          48.25     46.05
3*          3.5       3.4          3.25      3.4
4           5.15      5.35         3.35      3.45
5           6.35      6.55         21.1      22.7
6           13.95     14.9         29.15     34.15

Table 5: Performance (MedR) on the retrieval tasks for each fine-tuning method, on both R1M and AH. The best method (3) is marked with an asterisk.

Six different fine-tuning methods are proposed. A new variable is introduced in addition to weight fixation and the specific text-representation models: language. The joint neural embedding model is fine-tuned either on the original Dutch AH recipe collection, or on the AH recipe collection that has been translated to English. An overview of all six methods is depicted in Table 4.

The baseline is the performance of the pre-trained model on the intra-collection retrieval experiment for both the R1M and Dutch AH collections (see Table 5). Only the median rank (MedR) measures are reported, for clarity. As expected, baseline performance for the Dutch AH collection is low. This is due to the fact that the text-representation models have been trained on the English R1M dataset; the vocabulary of the word2vec model therefore does not contain any Dutch words.

The optimized training and validation loss curves and hyper-parameters are shown in the Appendix (Figure 14 and Table 7, respectively). The evaluation results are depicted in Table 5. The third method results both in the highest performance (for AH and R1M, separately) and in the smallest performance difference (between AH and R1M). In this method, the pre-trained model was fine-tuned on the English AH collection, using the pre-trained text-representation models.

An interesting observation is that fine-tuning the pre-trained model on the English AH collection increases performance on the R1M collection as well.

Figure 9: Inter-collection ranking results for method 3. These plots show the sorted reported ranks (on the y-axis) for 10 randomly chosen query items (on the x-axis).

This indicates that the AH and R1M collections share a certain pattern that the pre-trained model did not sufficiently detect when training on the R1M collection. This corresponds to the assumption that, in transfer learning, the factors that explain the variations in one setting are needed to capture the variations in the other setting [8]. In this case, factors that explain the variations in the AH collection are used to capture the variations in the R1M collection, and vice versa. These results suggest that the benefit of transfer learning is not restricted to one direction (i.e. from R1M to AH), but can manifest itself bi-directionally (i.e. from R1M to AH and vice versa).

The third method has also been tested on inter-collection retrieval. The reported ranks are shown in Figure 9. These ranks are generally lower than the reported ranks using the pre-trained model (see Figure 6). This means that fine-tuning the model on the English AH collection increased the model’s ability to directly match items from the R1M and AH collections. Overall, the best method to enhance the model’s performance on the AH collection is the third method, where the pre-trained model is fine-tuned on the translated AH dataset, using the pre-trained text-representation models.


Figure 10: First ranking based on nutritional value. The query recipe (on the left) and the first five retrieved recipes (on the right), ranked on the basis of Euclidean distance between the normalized nutritional value vectors.

                          im2recipe                           recipe2im
                      MedR     R@1      R@5      R@10     MedR     R@1      R@5      R@10
Excl. nutr. value     3.29     0.299    0.624    0.757    3.255    0.303    0.629    0.755
Incl. nutr. value     3.05     0.312    0.648    0.773    3.13     0.309    0.646    0.773
t-value               2.560    -2.172   -3.448   -2.574   1.234    -0.977   -2.388   -2.882
p-value               0.011*   0.031*   0.0006*  0.010*   0.218    0.329    0.017*   0.004*

Table 6: Effect of the nutritional value feature on performance on the (translated) AH collection

6.3 Nutritional value as a feature

This section describes the experiments that have been designed to answer the fourth research question: Does adding nutritional value as a feature increase the model’s performance on the AH collection?

Pre-processing nutritional information. The AH collection contains information on nutritional value for each recipe. There are six nutritional categories: fat, protein, fibers, energy, sodium, and carbohydrates. The values of each category are normalized across all recipes to bring all values into the range between 0 and 1 (i.e. unity-based normalization), following the equation:

    X_norm = (X − X_min) / (X_max − X_min)    (2)
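Applied per nutritional category, Equation 2 looks as follows in a NumPy sketch; `nutrition` is a hypothetical (num_recipes, 6) array.

    # Sketch of Equation 2 with NumPy; `nutrition` is a hypothetical
    # (num_recipes, 6) array with columns fat, protein, fibers, energy,
    # sodium, and carbohydrates.
    import numpy as np

    def normalize_nutrition(nutrition):
        x_min = nutrition.min(axis=0)                  # per-category minimum
        x_max = nutrition.max(axis=0)                  # per-category maximum
        return (nutrition - x_min) / (x_max - x_min)   # each column scaled to [0, 1]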

For each recipe, all six nutritional values are stored in one vector.

Qualitative assessment of discriminative power. In this experiment, a query recipe is randomly selected. Next, all other recipes are ranked on the basis of the Euclidean distance between the vectors that represent nutritional value. The top-5 retrieved recipes are inspected manually. This is repeated three times. Nutritional value information is deemed to have meaningful discriminative power if the top-5 retrieved recipes are of a similar dish-type as the query recipe (e.g. all desserts).
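A sketch of this ranking step, under the assumption that the normalized vectors are stored row-wise in a NumPy array:

    # Sketch of the qualitative ranking step: sort all other recipes by
    # Euclidean distance to the query's normalized nutritional vector
    # (`nutrition_norm` is the hypothetical output of normalize_nutrition above).
    import numpy as np

    def top5_by_nutrition(query_idx, nutrition_norm):
        dists = np.linalg.norm(nutrition_norm - nutrition_norm[query_idx], axis=1)
        order = np.argsort(dists)                 # closest recipes first
        return order[order != query_idx][:5]      # indices of the five nearest recipes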

One of the rankings is shown in Figure 10. This figure shows that the top-5 retrieved recipes are of a similar dish-type as the query recipe; the retrieved recipes and the query recipe are all sweet desserts. These recipes all contain relatively large amounts of sugar (i.e. carbohydrates) and energy. The other two rankings reveal a similar pattern, and are shown in Figures 15 and 16 in the Appendix. These results show that nutritional value has meaningful discriminative power, in the sense that it can be used to distinguish between different types of dishes.

Incorporation of the nutritional feature into the model. The 6-dimensional nutritional value vector is incorporated into the joint neural embedding model through a single linear layer with 6 input nodes. This linear layer represents the nutritional-encoder within the joint neural embedding model. The encoder returns a 4-dimensional vector that is concatenated to the recipe representation. The new model is fine-tuned on the translated AH collection (including the nutritional value vector representations), using the pre-trained text-representation models. In the baseline model, the nutritional value feature is excluded ("Excl. nutritional value" in Table 6).
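A sketch of this encoder in PyTorch terms, with the layer sizes taken from the text and the surrounding model simplified away:

    # Sketch of the nutritional-encoder: a single linear layer maps the six
    # normalized nutritional values to a 4-dimensional vector, which is then
    # concatenated to the recipe representation.
    import torch
    import torch.nn as nn

    nutrition_encoder = nn.Linear(6, 4)

    def extend_recipe(recipe_repr, nutrition_vec):
        nutr = nutrition_encoder(nutrition_vec)        # (batch, 4)
        return torch.cat([recipe_repr, nutr], dim=1)   # extended recipe representation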

The new model and the baseline model are evaluated using the intra-collection retrieval experiment. The im2recipe and recipe2im retrieval tasks are repeated 100 times to increase statistical power. Two-sided independent t-tests are performed to test the difference in performance for all performance measures (i.e. MedR, R@1, R@5, R@10). The results are shown in Table 6. All p-values below 0.05 are assumed to signify a significant difference, and are denoted by an asterisk. For the im2recipe task, all performance measures are significantly different from the baseline. For the recipe2im task, only R@5 and R@10 are significantly different. These results suggest that nutritional value contributes new information to the recipe representation, in addition to the ingredients and cooking instructions, and increases the model’s performance on the AH collection.
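The significance test itself is a standard SciPy call, sketched below with hypothetical per-run scores standing in for the 100 repetitions of one measure.

    # Sketch of the significance test (SciPy; the per-run scores below are
    # hypothetical stand-ins for the 100 repetitions of one retrieval measure).
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    medr_incl = rng.normal(3.05, 0.6, 100)   # hypothetical MedR scores, with nutrition
    medr_excl = rng.normal(3.29, 0.6, 100)   # hypothetical MedR scores, baseline

    t_value, p_value = ttest_ind(medr_incl, medr_excl)   # two-sided independent t-test
    print(t_value, p_value)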

7 FINE-TUNING VERSUS TRAINING FROM SCRATCH

In this project, transfer learning has been exploited by fine-tuning the pre-trained joint neural embedding model on professional recipe collections. We also tried training the joint neural embedding model from scratch on the JO and AH collections. To increase the probability of success, we experimented with the model’s complexity. Model complexity is related to the number of learnable parameters in the model. Decreasing model complexity can be beneficial for training, especially when using a relatively small training set. The complexity of the joint neural embedding model has been decreased by, for example, decreasing the dimensionality of the embedding space. Irrespective of model complexity or hyper-parameter settings, training on the JO or AH collections did not result in any learning. This corresponds to the findings of [16], where fine-tuning outperformed training from scratch.

The failure to train the model from scratch is most likely due to the small training set sizes of the JO and AH collections. Training deep neural networks requires a large amount of training data [17]. Even though training from scratch did not work, fine-tuning the pre-trained model on the AH collection resulted in an increase of performance for both the R1M and AH collections. These results have two implications: 1) the model’s learning of the AH collection greatly benefited from transfer learning; and 2) learning the target task (i.e. AH) can even improve performance on the base task (i.e. R1M). This project demonstrates the large advantage of transfer learning via fine-tuning over training from scratch.

8 CONCLUSIONS

In this paper we focused on the performance of the joint neural embedding model for amateur and professional recipes, and the benefit of utilizing nutritional value information within this model. We showed that the pre-trained model does not perform equally well on amateur and professional recipes. As expected, performance is higher for amateur than for professional recipes. Fine-tuning the model on the Jamie Oliver collection was not successful. This is probably due to the small size of the JO collection. This inference is supported by the fact that fine-tuning did work for the larger AH collection. The best method to enhance the pre-trained model’s performance on the AH collection is to fine-tune the pre-trained model on the translated AH collection, using the pre-trained text-representation models. Surprisingly, this method resulted in an increase of performance for both the AH and the R1M collections. This suggests that the benefit of transfer learning is not restricted to the target task (i.e. professional recipes), but also serves the base task (i.e. amateur recipes). Finally, we found that nutritional value has meaningful discriminative power, in the sense that it can be used to distinguish between different types of dishes. We showed that adding nutritional value as a feature through a simple linear encoder increases the model’s performance on the AH collection.

8.1 Acknowledgements

I want to thank my two supervisors Thomas Mensink and Vladimir Nedović for their enthusiasm and good advice.

REFERENCES

[1] P. Domingos. A few useful things to know about machine learning. Communications of the ACM, 55(10):78-87, 2012.

[2] A. Salvador, N. Hynes, Y. Aytar, J. Marin, F. Ofli, I. Weber, and A. Torralba. Learning cross-modal embeddings for cooking recipes and food images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[3] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[4] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler. Skip-thought vectors. In Advances in Neural Information Processing Systems, pp. 3294-3302, 2015.

[5] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104-3112, 2014.

[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

[7] H. Bal, D. Epema, C. de Laat, R. van Nieuwpoort, J. Romein, F. Seinstra, C. Snoek, and H. Wijshoff. A medium-scale distributed system for computer science research: Infrastructure for the long term. IEEE Computer, 49(5):54-63, 2016.

[8] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, http://www.deeplearningbook.org, 2016.

[9] M. Rokicki, C. Trattner, and E. Herder. The impact of recipe features, social cues and demographics on estimating the healthiness of online recipes. 2018.

[10] T. Kusmierczyk and K. Nørvåg. Online food recipe title semantics: Combining nutrient facts and topics. In Proceedings of CIKM, pp. 2013-2016, 2016.

[11] M. Chokr and S. Elbassuoni. Calories prediction from food images. In AAAI, pp. 4664-4669, 2017.

[12] Z. Zheng, L. Zheng, M. Garrett, Y. Yang, and Y. D. Shen. Dual-path convolutional image-text embedding. arXiv preprint arXiv:1711.05535, 2017.

[13] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.

[14] L. Torrey and J. Shavlik. Transfer learning. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, pp. 242-264, 2010.

[15] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345-1359, 2010.

[16] N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall, M. B. Gotway, and J. Liang. Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE Transactions on Medical Imaging, 35(5):1299-1312, 2016.

[17] D. Erhan, P. A. Manzagol, Y. Bengio, S. Bengio, and P. Vincent. The difficulty of training deep architectures and the effect of unsupervised pre-training. In Artificial Intelligence and Statistics, pp. 153-160, 2009.

[18] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pp. 3320-3328, 2014.

9 APPENDIX

Figure 11: Example recipes of the Recipe1M collection

Figure 12: Example recipes of the Jamie Oliver collection


Figure 13: Example recipes of the Allerhande collection

Approach   Learning rate   Number of iterations   Running time (hours)   Snapshot at iteration
1          0.00011         30,000                 7                      21,500
2          0.0001          25,500                 6                      21,500
3          0.000015        30,000                 7                      21,500
4          0.00013         45,000                 10                     33,000
5          0.00007         25,500                 6                      21,500
6          0.00002         18,000                 4                      17,000

Table 7: Hyper-parameter settings for the six Allerhande fine-tuning methods


Figure 14: Training (blue) and validation (red) loss curves for fine-tuning the pre-trained model on the AH collection


Figure 15: Second ranking based on nutritional value. The query recipe (on the left) and the first five retrieved recipes (on the right), ranked on the basis of Euclidean distance between the normalized nutritional value vectors.


Figure 16: Third ranking based on nutritional value. The query recipe (on the left) and the first five retrieved recipes (on the right), ranked on the basis of Euclidean distance between the normalized nutritional value vectors.
