Query abandonment in the legal search domain

Menno van Leeuwen 10280588

Bachelor thesis Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor
Dr. Evangelos Kanoulas
Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam

June 24th, 2016


Abstract

This thesis measures search engine performance using search query satisfaction as the key performance indicator. The search engine studied is a commercially available product developed by Legal Intelligence. A total of 100,000 queries from users at the University of Leiden were analysed and divided into two groups: a query is either satisfying (SAT) or dissatisfying (DSAT). This division was made by implementing four discriminating methods: click only, click plus dwell time, reformulation, and a combination of reformulation and clicks. After a list of DSAT queries had been created, the applied filter path of each query in the list was analysed to see whether there are law areas in which the system performs worse than in others. With the simplest discrimination method, roughly fifty-nine percent of the queries were classified as DSAT; this percentage increased further when dwell time was introduced. The reformulation method showed a drop compared to the earlier methods, which can be explained by its definition differing substantially from the first two. The last method, an extension of reformulation, again showed an increase in DSAT. Looking at the applied filter paths, there seemed to be no big difference between the methods. The DSAT lists did show, however, that people more often search for unstructured data such as literature or magazines than for structured data such as laws.


1 Introduction

Search engines play a central role in the way people discover information available on the Internet. Practically every person with access to the Internet uses one of the generic search providers, such as Google, Bing or Yahoo. These engines share the goal of indexing the web as a whole and cover a broad spectrum of knowledge, e.g. sites about celebrities at one end and sports at the other. In contrast, proprietary search engines exist that focus on a specific information domain. An accountant searching the intranet of a big firm such as Deloitte and a doctor accessing the medical records of a patient both use such a proprietary search engine. Proprietary variants are available in both public and commercial form; this thesis focuses on the latter.

One company that develops a proprietary engine is Legal Intelligence (LI) (Legal Intelligence, 2016), a Rotterdam-based company whose main product is a web application for searching content within the Dutch legal domain. It searches through publicly available legal data, such as wetten.nl, as well as commercial documents provided by publishers like Wolters Kluwer.

For LI, the main question for this thesis was how and why people give up on a particular search. Such abandonment plays an important role as a key performance indicator (KPI), since failing to fulfil the user's need for information with relevant documents can lead to customer frustration. This irritation may in turn lead to a switch to an alternative or, in the case of a commercial product, to cancelled subscriptions. Looking back at the examples of the accountant and the doctor, we can see that this KPI is relevant to other sectors as well, which makes it a measure of broader interest than the legal field alone. In order to measure performance we ask the main question: in which law areas does the search engine perform less well? Two sub-questions, each with its own experiment, were pursued to answer it. The first is: which queries do not address the information need of the user? Once these have been found, the second arises: what are the characteristics of the queries that fail to address the information need?

Compared to previous research on query abandonment, two aspects of the current thesis are of particular interest. The first is the domain-specific knowledge that the chosen engine tries to cover. As the literature review will show, research on abandonment tends to focus on generic search engines rather than proprietary ones (Diriye et al., 2012; Dupret and Lalmas, 2013). Secondly, the Dutch language poses a challenge of its own, as many models in natural language processing (NLP) are built for English rather than for a less widely used language.


2 Related work

Search query abandonment can be seen as the outcome of a failure to discover related document(s) on the search engine result page (SERP) (Thuma et al., 2013). Several studies investigating query abandonment have been carried out on general-purpose search engines such as Bing and Yahoo! (Diriye et al., 2012; Dupret and Lalmas, 2013). To gain insight into abandonment, which can lead to user frustration, being able to observe and quantify user behaviour is key. Both quantitative and qualitative research has been conducted on the topic. Qualitative research most often consists of surveys or interviews in which users explicitly state the intent of their search task. Researchers can thus evaluate whether the documents accessed from the SERP fulfil the intended task (Diriye et al., 2012; Stamou and Efthimiadis, 2010). For the current research, quantitative methods such as those of Hassan et al. (2013) are more relevant: only the search query logs of the partner company are available, and the absence of permission to survey or interview users makes qualitative research impractical.

User interaction with search engines can be described at three levels: the query (Diriye et al., 2012; Dupret and Lalmas, 2013), the session (Kanoulas et al., 2011) and the task (Buscher et al., 2012). Each level can be seen as multiple instances of the previous one. The first and most straightforward level focuses on single search queries and the resulting interaction with the SERP. Search sessions are an extension, defined as multiple succeeding queries that are reformulations of each other and that may be separated by interaction with the SERP. Several definitions of where a session ends exist. A generally accepted one is a thirty-minute period of inactivity (Radlinski and Joachims, 2005); any activity beyond this period is considered part of a new session. User tasks are much broader and harder to define, and focus on describing the user's information need. A task may be one session long, e.g. finding a hotel in Amsterdam. A task can also be more extensive, for example comparing hotel prices across the Netherlands, in which case a first session could cover hotels around Amsterdam, followed by a later search for hotels in the Rotterdam area. Single-query analysis is a logical starting point for any research addressing query abandonment. If time permits, the later levels can be analysed.

In order to describe success or failure in meeting user intent, two terms are widely used across papers. Query satisfaction (SAT) can be defined as the successful fulfilment of the information need of the user. In contrast, query dissatisfaction (DSAT) occurs when the user's need is not addressed as expected. These definitions leave room for debate on what is considered satisfaction, and there seems to be no generally accepted definition. Papers describe satisfaction in numerous ways, mostly as the researchers see fit. Hassan et al. (2013) use five systems to describe user satisfaction at different levels. The first system starts with the assumption that satisfaction occurs whenever a document is clicked as the result of a search query. This very simple approach ignores the possibility that a clicked document is a false positive. To tackle this bias a second system is proposed, in which document dwell time is taken into account. Dwell time is the time a user spends on a particular document, measured as the time that elapses between two user actions (document clicks or a reformulated query). If a user spends more than a certain time on a particular document, then it must be informative and satisfactory. Reformulation is the central idea behind the third system. It assumes that a user who reformulates his search must be dissatisfied with the previous query. This system ignores the information gained from the previous two systems. The fourth approach combines the three preceding variants and takes both clicks and reformulations into account: whenever a reformulation occurs, it checks whether the follow-up action is a click. The fifth and last variant trains a classifier that takes the reformulations of queries into account at the same time, rather than in the predefined if-this-then-that (IFTTT) way described for the fourth. In this thesis the five-system approach to describing satisfaction is used as a foundation, which helps to keep focus and to sidestep the debate about what is and is not satisfaction.


3 Experimental design

In order to analyse search query behaviour, two experiments were conducted. The first experiment answers the first sub-question and labels queries as either satisfactory (SAT) or unsatisfactory (DSAT). As previous work shows, there are multiple approaches to making this distinction. In order not to get lost in an endless debate about what is or is not a SAT or DSAT query, this thesis follows four of the five methods for separating queries described by Hassan et al. (2013). The second experiment looks at the characteristics of the failing queries produced by the first experiment. The applied filter path of these queries is analysed: are certain filter paths, which represent certain law areas within the search engine, more present in the DSAT query list?

All experiments are implemented in the Python 3 programming language. Within this language several packages are used to analyse the data further. All scripts were executed on a MacBook Pro (Retina, 13-inch, Early 2015) with Mac OS X El Capitan as the operating system. All code has been made publicly available on the GitHub platform and is accessible through the following URL: https://github.com/10280588/LI-Scriptie.

3.1 Data

Data was gathered from search logs, which hold records of all user interactions with the system. The cooperating company, Legal Intelligence (LI), provided these search logs, which were pulled from its search engine database. LI uses a list of observed events to follow these interactions. For this thesis two events were of interest: one contains all data attributes of an entered query and the other holds information about accessed documents. They are labelled event 164 and event 27 respectively. A major limitation of the available data was the exclusion of paying clients, since analysing their data without explicit consent could breach attorney-client privilege. Therefore, search logs that track the behaviour of students and academics from the University of Leiden were used. They have free access to the web application, and their behaviour may be tracked for purposes such as the current research.

Data was exported from the database to comma-separated values (CSV) files with the help of a bash script executing multiple curl commands. Three sizes of recorded entries were used: 10, 100 and 100,000 entries. This produced a total of six files, three per event. The two smaller sets were used for testing whether the algorithm worked correctly, while the largest was used as the final data set to be analysed.


[...] stamp were selected. For queries, the search text, applied filter path and time stamp were saved, while for documents the web address (URL) and time stamp were stored.

3.2 Building search sessions

In order to see how users interact with the search engine, time series of search queries and document accesses were constructed. To create these sessions, a merged version of the two event data frames was used. This merged frame was sorted by userID and time stamp, giving a chronological timeline per user. These records were then split up into sessions. A session either starts as the first search event for a new user, or alternatively starts.

These sessions have certain characteristics, such as the session length over the two observed event types combined, and the number of queries and documents per session. Graphs in the appendix show the distribution of the length, the number of queries and the number of documents. All three graphs show a classical long-tail distribution, in which short sessions occur more often than longer ones. This matches expectations. An interesting observation is that almost 4000 sessions do not have a single document click at all. This must be taken into account when analysing the results.
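The session construction described in this section can be sketched as follows. This is a minimal illustration, not the thesis code: the record layout `(user_id, timestamp, event_id)` and the function name `build_sessions` are hypothetical, and the thirty-minute inactivity cut-off is an assumption borrowed from the definition cited in the related work.

```python
from datetime import datetime, timedelta
from itertools import groupby

QUERY_EVENT = 164     # query event id used in the LI logs
DOCUMENT_EVENT = 27   # document access event id used in the LI logs
SESSION_GAP = timedelta(minutes=30)  # assumed inactivity cut-off


def build_sessions(events, gap=SESSION_GAP):
    """Split (user_id, timestamp, event_id) records into per-user sessions."""
    # Sort by user and then by time to get a chronological timeline per user.
    ordered = sorted(events, key=lambda event: (event[0], event[1]))
    sessions = []
    for _, user_events in groupby(ordered, key=lambda event: event[0]):
        current = []
        for event in user_events:
            # A pause longer than the gap closes the running session.
            if current and event[1] - current[-1][1] > gap:
                sessions.append(current)
                current = []
            current.append(event)
        sessions.append(current)
    return sessions


t0 = datetime(2016, 5, 1, 9, 0)
events = [
    ("u1", t0, QUERY_EVENT),
    ("u1", t0 + timedelta(minutes=2), DOCUMENT_EVENT),
    ("u1", t0 + timedelta(hours=2), QUERY_EVENT),
    ("u2", t0, QUERY_EVENT),
]
sessions = build_sessions(events)
print([len(session) for session in sessions])
```

In this toy log the first user produces two sessions, because two hours pass between the document click and the next query, and the second user produces one.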

3.3 Identifying queries

The first and simplest method classifies a query (Q) as SAT if it receives at least one document (D) click on the SERP. In the defined sessions such a query is successful if an event 164 is followed by a document click, event 27. Sessions that started with a document access instead of a search were discarded.
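A minimal sketch of the click-only rule, assuming a session is a chronologically ordered list of `(event_id, text)` pairs; this layout and the function name `label_click_only` are hypothetical and only serve the illustration.

```python
QUERY_EVENT = 164     # query event id used in the LI logs
DOCUMENT_EVENT = 27   # document access event id used in the LI logs


def label_click_only(session):
    """Return (query_text, label) pairs for one ordered session."""
    labels = []
    for index, (event_id, text) in enumerate(session):
        if event_id != QUERY_EVENT:
            continue
        # SAT when the event directly after the query is a document click.
        followed_by_click = (
            index + 1 < len(session)
            and session[index + 1][0] == DOCUMENT_EVENT
        )
        labels.append((text, "SAT" if followed_by_click else "DSAT"))
    return labels


session = [
    (QUERY_EVENT, "huurrecht"),
    (QUERY_EVENT, "huurrecht opzegging"),
    (DOCUMENT_EVENT, "doc-1"),
    (DOCUMENT_EVENT, "doc-2"),
]
print(label_click_only(session))
```

Here the first query is labelled DSAT because it is followed by another query, and the second is labelled SAT because a document click follows it.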

As an extension of the first method, the second method also classifies a query as satisfactory when a document click occurs, but adds dwell time as an additional constraint for a query to be successful. This constraint follows from the idea that when a document is relevant to the user, and thus satisfies the user's need, at least several seconds of dwell time must elapse. During this time the user reads the document, which leads to a period of inactivity in the log; this period is called the dwell time. Once the dwell time exceeds a certain threshold, the document is taken to have satisfied (part of) the user's need for information. Experiments were conducted with thirty, sixty and ninety seconds as the dwell time threshold. Furthermore, method one is equivalent to method two if the dwell time threshold is set to zero seconds.
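The dwell time rule can be sketched in the same style. The `(event_id, timestamp)` layout and the name `label_click_dwell` are hypothetical. One point is an explicit assumption of this sketch: a click that is the last event of a session has no following action to measure against, and is counted here as satisfying.

```python
from datetime import datetime, timedelta

QUERY_EVENT = 164     # query event id used in the LI logs
DOCUMENT_EVENT = 27   # document access event id used in the LI logs


def label_click_dwell(session, threshold=timedelta(seconds=30)):
    """Label each query in an ordered list of (event_id, timestamp) pairs."""
    labels = []
    for index, (event_id, _) in enumerate(session):
        if event_id != QUERY_EVENT:
            continue
        label = "DSAT"
        position = index + 1
        # Walk over the document clicks that directly follow this query.
        while position < len(session) and session[position][0] == DOCUMENT_EVENT:
            is_last_event = position + 1 == len(session)
            # Assumption: a click that ends the session counts as satisfying.
            if is_last_event:
                label = "SAT"
            # Dwell time is the gap between the click and the next action.
            elif session[position + 1][1] - session[position][1] >= threshold:
                label = "SAT"
            position += 1
        labels.append(label)
    return labels


t0 = datetime(2016, 5, 1, 9, 0, 0)
session = [
    (QUERY_EVENT, t0),
    (DOCUMENT_EVENT, t0 + timedelta(seconds=5)),
    (QUERY_EVENT, t0 + timedelta(seconds=15)),
    (DOCUMENT_EVENT, t0 + timedelta(seconds=20)),
    (DOCUMENT_EVENT, t0 + timedelta(seconds=80)),
]
print(label_click_dwell(session))
print(label_click_dwell(session, threshold=timedelta(0)))
```

With the thirty-second threshold the first query is DSAT (ten seconds of dwell time) and the second is SAT. With a zero threshold both queries are SAT, which matches the click-only rule.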


The third method focuses on reformulation. Its base argument is that if a user is not satisfied with the results shown on the SERP, he will reformulate his query. A SAT query is therefore one that is not followed by a reformulation. Reformulation can be measured in multiple ways, for example by looking at the stemming of words or at the [...]. This thesis solely uses the Jaccard index to decide whether a query is a reformulation of the previous one. The Jaccard index is the size of the intersection divided by the size of the union of two sets, J(A, B) = |A ∩ B| / |A ∪ B|, where the elements of each set are the words within a search query. As long as the Jaccard index was above the 35% threshold, queries were seen as reformulations of each other; if the similarity dropped below this threshold, the query was seen as a new search.
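The Jaccard comparison of two queries can be sketched as below. The function names are hypothetical, and lowercasing and whitespace tokenisation are assumptions of this sketch; the text only states that the set elements are the words of a query. The strict comparison follows the wording "above the 35% threshold".

```python
REFORMULATION_THRESHOLD = 0.35  # the 35% threshold named in the text


def jaccard_index(query_a, query_b):
    """Size of the word intersection divided by the size of the word union."""
    words_a = set(query_a.lower().split())
    words_b = set(query_b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0  # two empty queries share nothing
    return len(words_a & words_b) / len(union)


def is_reformulation(previous_query, query, threshold=REFORMULATION_THRESHOLD):
    """True when the word overlap of two queries is above the threshold."""
    return jaccard_index(previous_query, query) > threshold


print(jaccard_index("huurrecht opzegging", "huurrecht opzegging termijn"))
print(is_reformulation("huurrecht opzegging", "huurrecht opzegging termijn"))
print(is_reformulation("huurrecht", "arbeidsrecht ontslag"))
```

The first pair shares two of three distinct words, so the index is 2/3 and the second query counts as a reformulation. The last pair shares no words and is treated as a new search.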

The last implemented method combines reformulations and clicks in a cascading way. It checks whether a follow-up query can be seen as a reformulation and, if so, applies the definition of the third method. If no reformulation is found, the query is treated as a new query and the click check described by the first two methods is applied. Both a combination with click only and combinations with click plus dwell time (thirty, sixty and ninety seconds) were used.

Each of the four methods focuses on identifying SAT queries rather than classifying DSAT queries. However, every query within a collection must be either SAT or DSAT. Hence, once a method for detecting SAT is defined, every observed search that does not meet its requirement is a DSAT query.

3.4 Analysing filter paths of DSAT queries

Applying the four methods above yields a list of the queries that are classified as DSAT. This thesis introduces a methodology to analyse these queries by looking at the filter path, which allows us to answer the second sub-question. The applied filter paths say something about the law area the user was looking for with a particular query. Occurrences of applied filter paths in the list of queries are counted and ordered, so we can see whether there are any big differences between the applied filters.
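Counting and ordering the filter paths of a DSAT list can be sketched with a `Counter`. The `(search_text, filter_path)` layout and the function name `count_filter_paths` are hypothetical, and mapping an empty filter path to "/" is an assumption made for this illustration.

```python
from collections import Counter


def count_filter_paths(dsat_queries, top=20):
    """Return the most frequently applied filter paths with their counts."""
    # Assumption: a query without an applied filter is recorded as "/".
    counts = Counter(path if path else "/" for _, path in dsat_queries)
    return counts.most_common(top)


dsat_queries = [
    ("huurrecht", "/Literatuur/Tijdschriften"),
    ("ontslag", "/"),
    ("ontslag termijn", ""),
    ("erfrecht", "/Literatuur/Tijdschriften"),
    ("cao", "/Nederland/Rechtspraak"),
    ("cao horeca", "/"),
]
for path, count in count_filter_paths(dsat_queries, top=2):
    print(path, count)
```

The toy queries are made up; only the filter path names are taken from the thesis data.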


4 Results

4.1 Query abandonment

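Query log results from users of the University of Leiden (running total of 89,875 queries)

Method                                               Nr. of SAT   Nr. of DSAT   DSAT percentage
Click only                                           36896        52979         58.95%
Clicks with dwell time (∆t = 30s)                    26638        63237         70.36%
Clicks with dwell time (∆t = 60s)                    22448        67427         75.02%
Clicks with dwell time (∆t = 90s)                    19976        69899         77.77%
===============================================================================================
Reformulation only                                   78967        10908         12.14%
Reformulation with clicks only (no dwell time)       64856        25019         27.84%
Reformulation with clicks and dwell time (∆t = 30s)  54598        35277         39.25%
Reformulation with clicks and dwell time (∆t = 60s)  50408        39467         43.91%
Reformulation with clicks and dwell time (∆t = 90s)  47936        41939         46.66%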

The table above shows the results of the four methods. First, note that of the original 100,000 queries, 89,875 are left. This difference is explained by the fact that certain data points were cleaned out of the sessions. The biggest change in DSAT percentage is found between methods two and three. There is a logical explanation for this, since method three focuses on reformulation rather than on document clicks. A division can therefore be made between methods one and two on the one hand and methods three and four on the other (in the table divided by the double line). The DSAT percentage increases as soon as a narrower definition of SAT is introduced. The increase in DSAT percentage in method two is caused by the introduction of dwell time and, as expected, a longer dwell time threshold leads to higher DSAT percentages. The DSAT percentages remain quite high. For methods one and two, part of this can be explained by the fact that around 4000 sessions do not have any document clicks at all. Another reason could be that many false positives are included as DSAT, but without asking the users it remains an open question whether this is the case. With the reformulation-only method, the percentage of queries classified as DSAT drops dramatically, although the same trend of increasing percentages seen between methods one and two can be found between the third and fourth methods.
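The DSAT percentages in the table follow directly from the SAT and DSAT counts. As a small check, using the counts reported for two of the methods (the function name `dsat_percentage` is hypothetical):

```python
TOTAL_QUERIES = 89875  # queries left after cleaning, as reported above


def dsat_percentage(sat, dsat):
    """Share of DSAT queries among all labelled queries, in percent."""
    return 100 * dsat / (sat + dsat)


click_only = dsat_percentage(36896, 52979)
reformulation_only = dsat_percentage(78967, 10908)
print(round(click_only, 2), round(reformulation_only, 2))
```

For both methods the SAT and DSAT counts add up to the running total of 89,875 queries.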


4.2 DSAT Filter paths

After seeing the differences between the percentages, it is time to look at how the filter paths of the different methods compare. To do this, the click-only method is compared with the reformulation-only method. The top twenty applied filter paths of both methods are displayed in the two tables below. The same five filter paths make up the top five for both methods, although the second and third positions are swapped. The largest share of DSAT queries has no associated filter path (/): 32396 and 8149 queries respectively. Further results are difficult to obtain without knowing more about the user intent.

Click only - Top 20 applied filters (over 52979 queries)

Filter path                                                   Nr. of times applied
/                                                             32396
/Literatuur/Tijdschriften                                     4068
/Nederland/Rechtspraak                                        2806
/Literatuur/Naslagwerken                                      1739
/Literatuur/Tijdschriften/Artikelen                           829
/Nederland/Officiële publicaties                              517
/Europa/Rechtspraak                                           388
/Nederland/Wet- en regelgeving                                347
/2010-heden                                                   345
/2016                                                         237
/Literatuur/Naslagwerken/Asser Serie                          182
/2015                                                         180
/Nederland/Rechtspraak/Hoge Raad                              176
/Nederland/Overig                                             132
/Nederland/Rechtspraak/Centrale Nederlandse rechtscolleges    105
/Literatuur/Naslagwerken/Tekst & Commentaar                   102
/2010-heden/2016                                              96
/Literatuur/Tijdschriften/2010-heden                          86
/Literatuur/Tijdschriften/AA                                  86


Reformulation only - Top 20 applied filters (over 10908 queries)

Filter path                               Nr. of times applied
/                                         8149
/Nederland/Rechtspraak                    559
/Literatuur/Tijdschriften                 513
/Literatuur/Naslagwerken                  230
/Literatuur/Tijdschriften/Artikelen       123
/Europa/Rechtspraak                       59
/Nederland/Officiële publicaties          58
/Nederland/Wet- en regelgeving            48
/Nederland/Rechtspraak/Hoge Raad          32
/2010-heden                               28
/2015                                     28
/2016                                     23
/Literatuur/Naslagwerken/Asser Serie      22
/Nederland/Rechtspraak/Geannoteerd        20
/Nederland/Rechtspraak/Bestuursrecht, Bestuursrecht; Belastingrecht, ..., 2011, 2010  18
/Nederland/Rechtspraak/Bestuursrecht, Bestuursrecht; Ambtenarenrecht,, ..., 2011, 2010  17
/Europa/Officiële publicaties             17
/Nederland/Rechtspraak/Bestuursrecht      15
/2012                                     14


5 Conclusion

The main conclusion to be drawn from the results is that different methods lead to varying DSAT percentages. The low percentage of DSAT in the reformulation-only method suggests a high precision, but at the risk of a low recall of DSAT queries. As mentioned in the results section, however, it is difficult to evaluate these findings without user surveys or interviews. What we can conclude, as one would expect, is that the percentage of DSAT increases when the definition of SAT is narrowed down.

When looking at the filter paths we see no big differences between the click-only and reformulation-only methods. So, independent of the size of the DSAT list (recall the large difference in percentage points between these methods), the same kinds of paths seem to be selected by the user. This is especially visible in the top five, which consists of the same filter paths for both methods. The most frequently applied filter paths seem to be associated with unstructured documents.

If we look at our main question, in which law areas the search engine performs less well, it is still difficult to provide a clear answer. From the results we can draw the conclusion that unstructured data is more difficult to index; this follows from the applied filter paths. Generalising these paths to specific law areas is more difficult. Although some paths are associated with certain fields, such a mapping cannot be found for every filter. On top of that, it is hard to verify whether these paths really represent actions of a user who was not satisfied with the outcome, or whether false positives were included in the created DSAT lists as well. In the next section a few reflections and suggestions are made to tackle this problem.


6 Discussion & future work

As the results show, considerable differences exist between the percentages of queries that fail to satisfy the user's information need. One explanation for these varying results is that different methods were used for separating SAT and DSAT queries. The existence of these different methods shows that researchers hold different opinions about which queries lead to satisfaction. Perhaps a query is only satisfied when official documents are found instead of articles from magazines or blogs, for instance when the user is looking for a specific law article. Apart from the multiple opinions that exist about the discrimination of queries, this thesis focuses on single queries only and does not take into account the broader search session or task the user tries to fulfil. It can be the case that a query fails today, but that the user reformulates it tomorrow and finds a relevant document, so that the query is in fact one that satisfies the user.

In order to learn more about user intent and the associated search task, future work could consist of qualitative explanations for query abandonment. This would also give more insight into what the user sees as a satisfying result for a query. Such research could be built around pop-up windows that ask users whether they found what they were looking for, so that accuracy and recall can be calculated between the queries predicted as DSAT and those reported by the user. More intensive research would comprise surveys or interviews, which can provide extended information about what the user's search intent is in the first place, and secondly about how good he or she finds the outcomes suggested by the system.

Future research could also compare the results obtained with different input data. At this moment only academic users are present in the data; legal professionals working at law firms have not been included. It would therefore be interesting to see whether certain assumptions, such as the things users are looking for and the types of queries used, hold up when they are cross-validated with professional users. Another distinction that may be of interest is age: a younger lawyer may be using different search tactics than his senior colleague.


7 References

Buscher, G., White, R. W., Dumais, S., and Huang, J. (2012). Large-scale analysis of individual and task differences in search result page examination strategies. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, WSDM ’12, pages 373–382, New York, NY, USA. ACM.

Diriye, A., White, R., Buscher, G., and Dumais, S. (2012). Leaving so soon?: Understanding and predicting web search abandonment rationales. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM ’12, pages 1025–1034, New York, NY, USA. ACM.

Dupret, G. and Lalmas, M. (2013). Absence time and user engagement: Evaluating ranking functions. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM ’13, pages 173–182, New York, NY, USA. ACM.

Hassan, A., Shi, X., Craswell, N., and Ramsey, B. (2013). Beyond clicks: query reformulation as a predictor of search satisfaction. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, CIKM ’13, pages 2019–2028, New York, NY, USA. ACM.

Kanoulas, E., Carterette, B., Clough, P. D., and Sanderson, M. (2011). Evaluating multi-query sessions. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’11, pages 1053–1062, New York, NY, USA. ACM.

Legal Intelligence (2016). Legal intelligence.

Radlinski, F. and Joachims, T. (2005). Query chains: learning to rank from implicit feedback. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 239–248. ACM.

Stamou, S. and Efthimiadis, E. N. (2010). Advances in Information Retrieval: 32nd European Conference on IR Research, ECIR 2010, Milton Keynes, UK, March 28-31, 2010. Proceedings, chapter Interpreting User Inactivity on Search Results, pages 100–113. Springer Berlin Heidelberg, Berlin, Heidelberg.

Thuma, E., Rogers, S., and Ounis, I. (2013). Evaluating bad query abandonment in an iterative sms-based faq retrieval system. In Proceedings of the 10th conference on open research areas in information retrieval, pages 117–120. LE CENTRE DE HAUTES ETUDES INTERNATIONALES D’INFORMATIQUE DOCUMENTAIRE.


8 Appendix

Figures one and two display the length per session, which forms a traditional long-tail distribution.

Figures three and four display the number of documents per session, which again forms a long tail. An important observation is that nearly 4000 sessions have no recorded document clicks at all; this can be seen in the zoomed-in version, in the bar where the number of documents is zero.

Figures five and six display the number of queries per session; as can be seen, a typical long-tail graph forms. The first graph provides a broad overview, whilst the second is a zoomed-in version.


Figure 2: Session length - Top 25


Figure 4: Queries per session - Top 25
