
In this section, we present the results of the experiments.

5.4.1 Seed effect

The seed value initializes the random number generator that is used to shuffle the data during training and to randomly initialize the model's parameters. Setting a seed ensures that the model's outcomes can be replicated, because the same seed consistently produces the same sequence of random numbers. This is especially helpful for designing repeatable experiments.

Additionally, in some circumstances, changing the seed value can improve the model's generalization performance. Table 10 shows how the seed value influences the performance of the ZSL model on the RED development set.

Seed value    F1 score on RED
0             0.8933
42            0.8929
80            0.8942

Table 10: Effect of the seed value on performance

Therefore, we report the average over the three seed values (0, 42, 80) for all experiments.
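For illustration, the following is a minimal sketch of this seed handling, assuming a PyTorch-based setup; set_seed is an illustrative helper (not the thesis' actual code) and the training step itself is elided.

```python
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Fix all relevant RNGs so a training run can be replicated exactly."""
    random.seed(seed)                 # Python's built-in RNG (e.g. data shuffling)
    np.random.seed(seed)              # NumPy RNG (e.g. sampling)
    torch.manual_seed(seed)           # PyTorch CPU RNG (e.g. weight initialization)
    torch.cuda.manual_seed_all(seed)  # PyTorch GPU RNGs, if CUDA is used

# Average the F1 score over the three seeds, as done for all experiments.
SEEDS = (0, 42, 80)
f1_per_seed = {0: 0.8933, 42: 0.8929, 80: 0.8942}  # dev-set scores from Table 10
for seed in SEEDS:
    set_seed(seed)
    # ... train and evaluate the model here ...
print(f"mean F1: {np.mean([f1_per_seed[s] for s in SEEDS]):.4f}")  # 0.8935
```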

5.4.2 Event detection results

In this part, we present the main results of the project: the performance of the models under the different settings. As the data is imbalanced, we compute Precision, Recall, and F1 score on event tokens only; we also report the model's accuracy over all tokens. Table 11 shows the performance on the RED development and test sets. Note that the test set contains only two classes ("O" and "B-EVENT") and no "I-EVENT" class; therefore, we cannot conclude how the model would perform on I-EVENT test examples.
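A minimal sketch of how such event-token metrics can be computed is given below. It assumes flat BIO tag sequences and counts every non-"O" token as an event token; event_token_scores is an illustrative helper, not the thesis' actual evaluation code.

```python
def event_token_scores(gold, pred, outside="O"):
    """Token-level precision/recall/F1 on event tokens, plus overall accuracy.

    gold, pred: flat lists of BIO tags, e.g. ["O", "B-EVENT", "I-EVENT", ...].
    A token counts as an event token if its tag is anything other than "O".
    """
    tp = sum(1 for g, p in zip(gold, pred) if g != outside and p != outside)
    fp = sum(1 for g, p in zip(gold, pred) if g == outside and p != outside)
    fn = sum(1 for g, p in zip(gold, pred) if g != outside and p == outside)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return precision, recall, f1, accuracy

gold = ["O", "B-EVENT", "I-EVENT", "O"]
pred = ["O", "B-EVENT", "O", "O"]
print(event_token_scores(gold, pred))  # (1.0, 0.5, 0.667, 0.75), approximately
```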

                 RED development set            RED test set
Model            precision  recall  f1-score   precision  recall  f1-score
CRF (Baseline)   0.804      0.728   0.764      0.816      0.728   0.769
ZSL              0.888      0.899   0.893      0.881      0.897   0.889

Table 11: Experiment results on the RED corpus. ZSL: a standard BERT model fine-tuned on the RED corpus; we use it later to test the movie subtitles in a zero-shot setup.

Besides, to see how the models perform on the movie subtitles test set, we evaluate the annotated data with each pretrained and fine-tuned model; the results are given in Table 12. Although we report these results separately from RED, the models were trained and fine-tuned in the same way, and the whole evaluation was carried out for the RED corpus and the movie subtitles at the same time, with the same model and seed value.

Model        precision  recall  f1-score
CRF          0.756      0.732   0.744
ZSL          0.757      0.800   0.778
DAPT1 + FT   0.752      0.745   0.748
DAPT2 + FT   0.746      0.824   0.783
DAPT3 + FT   0.752      0.820   0.785

Table 12: Experiment results on the movie subtitles test set. DAPT + FT: domain-adaptive pretrained models fine-tuned on the RED data.
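For concreteness, the following is a minimal sketch of the two-stage DAPT + FT pipeline, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (the thesis' exact base model and training settings may differ). The two-sentence scripts list and red_train_dataset are hypothetical stand-ins for the real corpora.

```python
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          AutoModelForTokenClassification,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

base = "bert-base-uncased"  # assumption: the thesis' checkpoint may differ
tokenizer = AutoTokenizer.from_pretrained(base)

# --- Stage 1: domain-adaptive pretraining (masked LM) on raw movie scripts ---
scripts = ["He grabs the gun and runs.", "They never spoke again."]  # stand-in corpus
mlm_data = [{"input_ids": ids} for ids in tokenizer(scripts)["input_ids"]]
mlm_model = AutoModelForMaskedLM.from_pretrained(base)
Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="dapt", num_train_epochs=1),
    train_dataset=mlm_data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
).train()
mlm_model.save_pretrained("dapt")
tokenizer.save_pretrained("dapt")

# --- Stage 2: fine-tune the adapted encoder for BIO event tagging on RED ---
labels = ["O", "B-EVENT", "I-EVENT"]
ft_model = AutoModelForTokenClassification.from_pretrained("dapt", num_labels=len(labels))
# red_train_dataset: hypothetical tokenized RED data with aligned BIO label ids
# Trainer(model=ft_model, args=TrainingArguments(output_dir="dapt_ft"),
#         train_dataset=red_train_dataset).train()
```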

5.4.3 Jensen Shannon Divergence

The Jensen-Shannon distance between the RED corpus and the annotated movie subtitles corpus is 0.504.

Besides, a divergence score is calculated for every token in both corpora using JSD, and the general tokens and the event tokens with the highest divergence scores are shown in Figures 5 and 6, respectively. Higher ranks correspond to more distinguishing tokens. The side of the chart indicates the corpus in which a token is more prevalent: the left side corresponds to the RED corpus and the right side to the movie subtitles.
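The corpus-level distance and the per-token rankings behind Figures 5 and 6 can be computed as sketched below. The thesis does not spell out its exact formulation, so this assumes unigram token distributions over the union vocabulary and SciPy's jensenshannon; js_analysis and the tiny stand-in corpora are illustrative only.

```python
import numpy as np
from collections import Counter
from scipy.spatial.distance import jensenshannon

def js_analysis(tokens_a, tokens_b, top_k=50):
    """JS distance between two token distributions plus the most divergent tokens."""
    vocab = sorted(set(tokens_a) | set(tokens_b))
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    p = np.array([ca[t] for t in vocab], dtype=float)
    q = np.array([cb[t] for t in vocab], dtype=float)
    p, q = p / p.sum(), q / q.sum()

    distance = jensenshannon(p, q, base=2)  # 0 = identical, 1 = disjoint

    # Per-token contribution to the divergence; the sign of (p - q) indicates
    # in which corpus the token is more prevalent (positive: corpus A).
    m = (p + q) / 2
    with np.errstate(divide="ignore", invalid="ignore"):
        contrib = 0.5 * (np.where(p > 0, p * np.log2(p / m), 0.0)
                         + np.where(q > 0, q * np.log2(q / m), 0.0))
    ranked = sorted(zip(vocab, contrib, p - q), key=lambda x: -x[1])
    return distance, ranked[:top_k]

# Tiny stand-in corpora; in the thesis these are the full token streams.
red = "the council stated that the attack was condemned".split()
subs = "you know i do n't think we should go".split()
dist, top = js_analysis(red, subs, top_k=5)
print(f"JS distance: {dist:.3f}")
```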

Figure 5: Results of the corpus comparison between RED and the movie subtitles. The 50 most divergent tokens are shown. Orange: movie subtitles (right side); purple: RED corpus (left side).

Figure 6: Results of the corpus comparison between RED and the movie subtitles. The 50 most divergent events are shown. Orange: movie subtitles (right side); purple: RED corpus (left side).

6 Discussion

The results show that DAPT3+FT, the model pretrained on 100% of the movie scripts and fine-tuned on the RED corpus, outperforms the other models. In terms of precision, DAPT2+FT has the lowest value among the proposed models, whereas ZSL has the highest. For recall, DAPT1+FT has the lowest value and DAPT2+FT the highest. For F1 score, DAPT1+FT has the minimum value and DAPT3+FT the maximum.

We can conclude that the parameter values changed substantially during pretraining, and that further tuning on the event detection task could not affect all of them. In other words, fine-tuning on RED could not modify all of the parameters learned during pretraining; therefore, not all of the learned representations are altered by fine-tuning.

Furthermore, from the JS distance it can be interpreted that the two corpora are roughly 50% similar. Looking at the top event tokens, it is clear that the movie subtitles focus on informal conversation and day-to-day life, with words such as "know", "think", "do", "go", "let", "get", and "tell"; this makes sense for movie scripts, as they consist of dialogues between individuals. On the other side, the RED corpus focuses on formal words such as "STATED", "council", "SAID", and "attack". Note that this does not mean these words are absent from the other corpus, only that they contribute mostly to the indicated side. Another notable observation is that the distribution of the top 30 words is dense on the movie subtitles side, whereas it is sparse on the RED side. This may indicate that the model works well even when events are underrepresented in the source text, especially given that the RED corpus is about three times larger than the movie subtitles. There is also a noticeable divergence for tokens such as "'s", "n't", and "'m", which appear among the top divergent tokens in the movie subtitles corpus; since they are not present in RED, they might affect representation learning.

Apart from the above, 2,709 events appear in the RED corpus but not in the movie subtitles, which might contribute to the divergence between the corpora as well as to the weaker performance of the ZSL model when it sees new data.

6.1 Comparison to baseline

We present the results in Tables 11 and 12. The proposed methods surpass the baseline CRF model in all experiments. The F1 score of the best model (DAPT3+FT) outperforms the CRF by 4.1 points (0.785 vs. 0.744).

The other domain-adaptive pretraining models, DAPT1+FT and DAPT2+FT, improve over the source-only baseline by 0.4 and 3.9 points, respectively.
