
Financial Statement Network Comparison

MSc Thesis

written by Ali Reza Farid Amin

Under the supervision of Prof Dr Drona Kandhai, and submitted to the Board of Examiners in partial fulfillment of the requirements for the degree of

MSc in Computational Science

at the Universiteit van Amsterdam.

Date of the public defense: 17/07/2020

Members of the Thesis Committee: Marcel Boersma MSc, Dr Sumit Sourabh, Prof Dr Drona Kandhai


Disclaimer

The authors report no conflicts of interest, and declare that they have no relevant or material financial interests related to the research in this paper. The authors alone are responsible for the content and writing of the paper, and the views expressed here are their personal views and do not necessarily reflect the position of their employer.


Acknowledgments

I would like to express my gratitude to my daily supervisor, Marcel Boersma. Your guidance helped me to improve my research skills. Thank you for putting me in contact with other professionals when I needed extra resources. I am also thankful for your constructive criticism and feedback while conducting this research.

I truly appreciate the learning opportunities provided by my first and second examiners, Dr Sumit Sourabh and Prof Dr Drona Kandhai. Thank you for your advice and guidance on how to proceed with this research. I am also thankful for your helpful feedback and criticism.

It was with the help and trust of my daily supervisor and examiners that I was able to transform my ideas into useful research results.

I would like to thank Mansour Tirgar and Ioannis Anagnostou for giving me their valuable opinions during this research.

Finally, I must express my profound gratitude to my caring and loving parents. Your constant support and encouragement when times got rough are truly appreciated. The completion of this thesis project could not have been accomplished without your support.


Contents

1 Introduction
  1.1 Problem Statement and Motivation
  1.2 Scientific Question and Hypothesis
  1.3 Research Contribution

2 Theoretical Background
  2.1 Preliminaries on Accounting
    2.1.1 Transaction Cycles and Business Processes
    2.1.2 Journal Entries
    2.1.3 Financial Statements
  2.2 Financial Statement Networks
  2.3 Graph Analytics
    2.3.1 Graph Application in Science
    2.3.2 Bipartite Graphs
    2.3.3 Graph Measures
  2.4 Data Clustering
    2.4.1 K-means and K-medoids Algorithms
    2.4.2 Distance Measures in Clustering

3 State Of The Art
  3.1 Machine Learning in Audit
  3.2 Graph Comparison
    3.2.1 Local and Neighborhood Similarity
    3.2.2 Score Function for Graph Comparison
    3.2.3 Graph Comparison by Entropy
  3.3 Bag of Words Model and K-means Clustering

4 Methods
  4.1 Data Preprocessing and Transformation
  4.2 Extracting Substructures
  4.3 Feature Selection
    4.3.1 Density Based Features
    4.3.2 Degree Based Features
  4.4 Clustering and Graph Comparison

5 Results and Evaluation
  5.1 Experiments with Synthetic Graphs
    5.1.1 Graph Density
    5.1.2 Degree Distribution
    5.1.3 Graph Size
    5.1.4 Graph Extension
    5.1.5 One-mode Projection
  5.2 Experiments with Companies' Journal Entries
    5.2.1 Experiment 1
    5.2.2 Experiment 2
    5.2.3 Experiment 3

6 Conclusion

A Appendix
  A.1 Preprocessing Journal Entries
  A.2 Network Statistics (Experiment 1)
  A.3 Number of Transactions Per Account Title and Neighbor Connectivity Histograms (Experiment 1)


List of Figures

2.1 The Abstract Relation between Business Processes and Accounts
2.2 Graphical Representation of a Business Process
2.3 Representation of the Abstract Model as a Bipartite Graph
4.1 Workflow Diagram of our Proposed Solution
4.2 An Example of Bipartite Network Representation of Journal Entries
4.3 Subgraph Extraction up to Distance 2
4.4 An Example of an Account Subgraph (Extracted up to Distance 2)
4.5 Clustering the Feature Vectors
4.6 Histogram of Clusters for Graph i
5.1 Cost Function (Graph Density Experiment)
5.2 Cosine Similarity, K = 6 (Graph Density Experiment)
5.3 Cost Function (Degree Distribution Experiment)
5.4 Cosine Similarity, K = 5 (Degree Distribution Experiment)
5.5 Cost Function (Graph Size Experiment)
5.6 Cosine Similarity, K = 5 (Graph Size Experiment)
5.7 Degree Distribution (Graph Size Experiment)
5.8 Cost Function (Graph Extension Experiment)
5.9 Cosine Similarity, K = 6 (Graph Extension Experiment)
5.10 Euclidean Similarity, K = 6 (Graph Extension Experiment)
5.11 Graph Density (One-mode Projection Experiment)
5.12 Degree Distribution (One-mode Projection Experiment)
5.13 Graph Extension (One-mode Projection Experiment)
5.14 Cost Function (Experiment 1)
5.15 Cosine Heatmap and t-SNE, K = 20 (Experiment 1)
5.16 Distribution of the Feature Vectors over K = 20 Clusters (Experiment 1)
5.17 Distribution of the Account Titles (Experiment 1)
5.18 Cost Function (Experiment 2)
5.19 Cosine Heatmap and t-SNE, K = 30 (Experiment 2)
5.20 Cost Function (Experiment 3)


List of Tables

2.1 Example of a Journal Entry (Simple Transaction)
2.2 Example of a Journal Entry (Compound Transaction)
2.3 Examples of Debit and Credit Ratios in Transactions
5.1 Experiment Settings (Graph Density Experiment)
5.2 Experiment Settings (Degree Distribution Experiment)
5.3 Experiment Settings (Graph Size Experiment)
5.4 Experiment Settings (Graph Extension Experiment)
5.5 Ground Truth (Density Experiment)
5.6 Ground Truth (Degree Distribution Experiment)


Abstract

In this thesis, we aim to compare companies' journal entries to understand more about their characteristics. Comparison of journal entry data can be of interest to auditors, as it provides insight into the patterns of financial transactions and accounts. Through our proposed solution, auditors can group similar sets of journal entries and use this information in audit risk assessment.

To compare sets of journal entries, we transformed the financial transactions and accounts of the journal entries into bipartite graphs, and we quantized their local features with the K-medoids algorithm. Then, we used cosine and Euclidean distance metrics to measure the similarity of the bipartite graphs based on their shared local features.

Through different settings of synthetic bipartite graphs, we studied the effectiveness of our solution in detecting similar and dissimilar bipartite graphs. Our method was able to detect subtle changes in the graph density and degree distribution of bipartite graphs. Additionally, we showed that the cosine distance can perform better than the Euclidean distance in computing the pairwise similarity of the bipartite graphs through their quantized local features. We also performed the same experiments on the one-mode projections of the bipartite graphs, and we observed that extracting local features on the bipartite graphs performed better than extracting local features on their one-mode projections.

In the next step, we performed three experiments on the bipartite network representation of journal entries, which we refer to as financial statement networks. The purpose of these experiments was to learn about the characteristics of journal entries and to evaluate the correctness of our hypotheses about the similarity of the financial statement networks. We saw that the financial statement networks of a single company were more similar to each other, and that they were distinguishable from the financial statement networks of other companies. We also compared financial statement networks from the school, insurance, and manufacturing industries. Among these three industries, we observed that only the school industry had similar financial statement networks.


Introduction

In recent years, firms have been seeking smart ways of completing their accounting tasks in order to reduce time and labor costs. By using AI techniques in accounting and audit, the required time to process accounting data has been reduced significantly (Kokina and Davenport, 2017). Now, instead of sampling high-risk accounting data for analysis, it is possible to analyze the whole accounting data set for different audit purposes. Moreover, accounting data from different sources within a company can be compared more easily to gain more insight into the company's accounting and business activities (Fay and Negangard, 2017).

One area of study in the application of AI to accounting and audit concerns companies' journal entries. In a nutshell, "a journal is a chronological record of transactions entered into by a business" (Porter and Norton, 2012).

In fact, there have been various data mining and machine learning studies on the content of journal entries. In one study, unusual combinations of accounts in transactions were detected by analyzing the journal entries for unusual activity patterns (Debreceny and Gray, 2010). In another study, the journal entries were treated as a transactional database where association rule mining was used to find interesting relations between accounts and the exchanged monetary values in the transactions (Arvaniti, 2016). In addition, anomalous journal entries were detected by learning the content of regular journal entries through training a neural network autoencoder (Schreyer et al., 2017). In yet another study, the relations between financial accounts and transactions in journal entries were studied as networks (Boersma et al., 2018). Such a network can be imagined as money flowing between financial accounts, and by expressing the journal entries as networks, it was possible to benefit from graph analytics techniques to study the characteristics of the journal entries.

However, no studies have been done to explicitly compare sets of journal entries from the same or different companies. In fact, this can be an interesting task for auditors1. Auditors would be able to detect groups of similar and dissimilar sets of journal entries and use that information in performing risk assessment. In this research, we propose a technique to explicitly compare sets of journal entries.

In a nutshell, the content of a book of journal entries is a collection of transactions, where each transaction contains two or more financial accounts, and each financial account is involved in at least one transaction. We are specifically interested in comparing

1“An auditor is a person authorized to review and verify the accuracy of financial records and ensure


sets of journal entries based on the relation between transactions and accounts. This relation has previously been represented as a network for a single set of journal entries (Boersma et al., 2018). Given that treating journal entries as networks enables us to benefit from a plethora of graph analytics techniques, we are going to compare the sets of journal entries through their network representations. Throughout this research, we refer to these network representations as financial statement networks, and for conciseness, we use the term "financial network" instead of "financial statement network".

As mentioned earlier, this study can be of interest to auditors, as it can potentially equip them with a new risk assessment method in the analysis of companies' journal entries. Specifically, auditors can discover patterns of similarity in different sets of journal entries. Such patterns are not distinguishable by the human eye, but they provide valuable information about the characteristics of the journal entries.

The content of this thesis is structured as follows. In the remaining sections of Chapter 1, the problem statement and the motivation for conducting this research are explained. Chapter 2 explains the preliminaries and the background knowledge required to understand this thesis. Chapter 3 reviews the state of the art; in this chapter, the relevant machine learning techniques in audit and graph similarity studies are briefly reviewed. In Chapter 4, our proposed methodology to compare journal entries is explained. Chapter 5 documents the experimental results. A conclusion and a discussion of future work, in Chapter 6, close this work.

1.1 Problem Statement and Motivation

This research aims to compare sets of journal entries based on their network representations. Our inspiration for performing this research comes from state of the art methods that apply machine learning and network science to financial data. These methods have led to the discovery of hidden patterns in financial data that are not identifiable by the human eye. For instance, revealing the hidden relations in the network of transactions and accounts has been an important task in detecting financial vulnerabilities such as money laundering (Gao and Ye, 2007; Ngai et al., 2011). Inspired by these state of the art methods, we are motivated to analyze and compare journal entries based on their network representation.

Concretely, when a company is economically active, journal entries are generated throughout the year. This is ultimately reflected in the network structure that we derive from these journal entries. We want to study the network representation of companies' journal entries to detect whether the relation between their underlying financial transactions and accounts is similar.

1.2 Scientific Question and Hypothesis

Research Questions

Question 1 Are companies distinguishable based on the network representation of their journal entries?


Question 2 Do companies that share similar business activities also have similar network representations of their journal entries?

Hypothesis

Hypothesis 1 While every company has to follow the rules of accounting, each company develops its own patterns in recording its business activities. These patterns depend on the type of business, the size of the company, and the company's way of accounting (bookkeeping preferences). Therefore, we expect a company's financial networks to be more similar to each other than to the financial networks of other companies.

We also expect this behavior to occur in the comparison of monthly financial networks, where we expect the monthly financial networks of the same company to be more similar to each other.

Hypothesis 2 When comparing the similarity of financial networks among companies of different industries, the financial networks of companies within the same industry tend to have more similar network structures.

1.3 Research Contribution

To the best of our knowledge, there have been no previous studies that explicitly compare companies’ journal entry data. We propose a methodology that benefits from machine learning and graph analytics to compare sets of journal entries.


Theoretical Background

In this chapter, we explain the theoretical background of this research. In the Preliminaries on Accounting section, we review how business activities are recorded in accounting, and in the Financial Statement Networks section, we review how a network can be generated from journal entries. Next, in the Graph Analytics section, we review the graph-related concepts and metrics that are used for describing the characteristics of financial networks in our methodology.

2.1 Preliminaries on Accounting

In this part, we provide some background knowledge in the area of accounting and audit which is necessary for understanding this thesis. It includes an introduction to transaction cycles, journal entries, and financial statements.

2.1.1 Transaction Cycles and Business Processes

In this research, we use the concept of business processes as the blueprints of financial transactions. In this subsection, we review the Business Processes section in Part 1 of Accounting Information Systems by Romney et al. (Romney et al., 2000).

Whenever an exchange of goods or services that is measurable in accounting terms occurs between two entities of an organization, a transaction is recorded. Examples of such transactions are making payments to employees and buying raw materials from suppliers as inventory.

Transactions that occur many times, such as salary payments or inventory purchases, are called give-get exchanges. In fact, most companies involve a small number of give-get exchange types, but each type of exchange is repeated many times. These repetitive exchanges are categorized into five transaction cycles: revenue, expenditure, production, human resource and payroll, and financing. In brief,

1. Revenue is for selling goods in exchange for receiving cash (or an agreement of receiving cash in the future).

2. Expenditure is for purchasing raw materials (or resale products) as inventory in exchange for cash (or an agreement of payment in the future).

3. Production or conversion is for using resources in order to produce a ready-to-sell product. These resources could be in the form of labor and raw materials.


4. Payroll is for the payment of employee benefits in return for labor.

5. Financing is for selling shares and borrowing money to increase funds. This usually includes the payment of dividends to investors and the payment of interest on loans.

In fact, not every organization includes all these five transaction cycles. For example, retail stores do not have a production cycle. Also, each of these cycles can differ based on the type of the company, because each company has its own unique requirements. For example, a service company such as a law firm does not include purchasing materials as inventories.

Each of the five transaction cycles includes a set of business activities aimed at accomplishing the goal of that transaction cycle. We refer to these activities as "business processes" throughout this research. Examples of such business processes are sales and sales dispatch, collections, tax disbursements, payroll disbursements, payroll, purchase of inventory, depreciation, etc. In fact, a business process is the blueprint of a business activity, and a transaction is an instance of a business process.

In this research, we only work with the business processes that impact financial accounts. Financial accounts continuously record a company's debts, assets, revenue, cash, etc. throughout a financial year. For instance, a cash account records every transaction that increases or decreases the company's cash. Throughout this research, we use the term "account" instead of "financial account" for conciseness.

The abstract relation between some of the main business processes and accounts is illustrated in Figure 2.1. The stated business processes and accounts are common in the manufacturing, wholesale, and retail industries.

Figure 2.1: The Abstract Relation between Business Processes and Accounts

2.1.2 Journal Entries

In accounting, every single financial transaction needs to be recorded as a journal entry. A journal entry has a standard format: it includes the names of the accounts that are involved in the transaction, as well as the monetary values that are exchanged between those accounts. A journal entry also includes a transaction ID and the date of the transaction.

In each transaction, the total debited amount of money must be equal to the total credited amount of money. Based on the type of the account, the act of debiting (crediting) causes an increase or decrease in the account balance. Specifically, accounts that usually carry a debit balance increase by the act of debiting, and accounts that usually carry a credit balance increase by the act of crediting.

Furthermore, journal entries are divided into two groups based on their complexity: simple and compound transactions. Every simple transaction includes only two accounts: exactly one account on the debit side and one account on the credit side. On the other hand, every compound transaction involves at least two accounts on the debit side or at least two accounts on the credit side. Examples of simple and compound transactions are shown in Tables 2.1 and 2.2.

Table 2.1: Example of a Journal Entry (Simple Transaction)

ID    Journal      Date        Account              Debit Amount  Credit Amount
1002  Collections  02.01.2020  Cash                 50$
1002  Collections  02.01.2020  Accounts Receivable                50$

Table 2.2: Example of a Journal Entry (Compound Transaction)

ID    Journal  Date        Account              Debit Amount  Credit Amount
1003  Sales    02.01.2020  Accounts Receivable  110$
1003  Sales    02.01.2020  Revenue                            100$
1003  Sales    02.01.2020  Tax Payable                        10$

A company's financial accounts are usually organized based on the Chart of Accounts. "Charts of Accounts are lists of account titles used to record financial data such as revenues and expenses as well as to describe assets and liabilities. Using a chart of accounts, financial data can be aggregated into operating classifications which are specific to particular organization and industry needs" (Gans et al., 2007).

In Experiment 1, we use the Chart of Accounts to map financial accounts to their general account titles when analyzing the journal entries.

2.1.3 Financial Statements

The information on financial transactions and accounts is aggregated in the form of financial statements. Financial statements are classified into three main types: the balance sheet, the income statement, and the cash flow statement (Barker, 2001).

1. The balance sheet reports the company's assets, liabilities, and equity. This financial statement is usually updated at the end of financial quarters.


2. The income statement reports the company's profit during a financial period.

3. The cash flow statement reports the amount of in-flow and out-flow of cash over a financial period.

In fact, each financial statement is a product of data aggregation on fine-grained financial data such as journal entries and general ledger accounts. However, the focus of this research is only on the content of journal entries.

2.2 Financial Statement Networks

It was explained that transaction cycles and their underlying business processes usually get repeated many times during a fiscal year. Moreover, every instance of a business process is recorded as a transaction in the book of journal entries. Boersma et al. analyzed journal entries to extract the recorded business processes and financial accounts. Then, based on the relation between the identified business processes and accounts, the authors built a network representation of journal entries (Boersma et al., 2018). In this research, we use the same methodology for generating the network representation of journal entries. In this section, we review this approach in detail.

To describe how this process is performed, let us consider the following journal entries.

Table 2.3: Examples of Debit and Credit Ratios in Transactions

ID    Journal       Date        Account           Debit Amount  Credit Amount  Credit %  Debit %
1003  Sales ledger  02.01.2020  Trade Receivable  110$                                   100
1003  Sales ledger  02.01.2020  Revenue                         100$           91
1003  Sales ledger  02.01.2020  Tax Payable                     10$            9

ID    Journal       Date        Account           Debit Amount  Credit Amount  Credit %  Debit %
1007  Sales ledger  04.01.2020  Trade Receivable  1100$                                  100
1007  Sales ledger  04.01.2020  Revenue                         1000$          91
1007  Sales ledger  04.01.2020  Tax Payable                     100$           9

Two extra columns, "Debit %" and "Credit %", are added to each entry compared with the standard format of journal entries. For each account, the value of "Debit %" denotes the ratio of the account's debited amount to the whole debited amount in the transaction. Similarly, the value of "Credit %" denotes the ratio of the account's credited amount to the whole credited amount in the transaction.

In these two transactions, we can see the presence of the same accounts, but with different amounts of money. However, the accounts are associated with the same debit and credit ratios in both transactions. This shows that these two transactions belong to the same business process. Concretely, a group of transactions belongs to the same business process if they contain the same accounts and those accounts have the same debit and credit ratios. Based on this definition, the business process of these two transactions can be graphically represented as in Figure 2.2.


Figure 2.2: Graphical Representation of a Business Process

In this graph, the edge labels are 'credit' and 'debit', and the edge weights show the impact of each account in the business process.

The visual representation of all the extracted business processes and accounts results in a network of business processes and accounts. This network representation allows the application of graph analytics for studying the behavior of journal entries.
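To make the grouping rule concrete, the sketch below (an illustrative assumption, not the implementation used in this thesis) derives a debit/credit-ratio signature per transaction from a small pandas DataFrame with hypothetical columns ("tx_id", "account", "debit", "credit") and groups transactions that share a signature into one business process.

```python
# A minimal sketch of grouping journal entry rows into business processes by
# their debit/credit ratios; column names and rounding are assumptions.
import pandas as pd

def business_process_signature(tx: pd.DataFrame, decimals: int = 2) -> tuple:
    """Return a hashable signature: accounts with their debit/credit ratios."""
    total_debit = tx["debit"].sum()
    total_credit = tx["credit"].sum()
    rows = []
    for _, r in tx.sort_values("account").iterrows():
        debit_pct = round(r["debit"] / total_debit, decimals) if total_debit else 0.0
        credit_pct = round(r["credit"] / total_credit, decimals) if total_credit else 0.0
        rows.append((r["account"], debit_pct, credit_pct))
    return tuple(rows)

def group_into_business_processes(journal: pd.DataFrame) -> dict:
    """Map each signature to the list of transaction IDs that share it."""
    processes = {}
    for tx_id, tx in journal.groupby("tx_id"):
        processes.setdefault(business_process_signature(tx), []).append(tx_id)
    return processes

# Example: the two sales transactions from Table 2.3 collapse into one process.
journal = pd.DataFrame([
    {"tx_id": 1003, "account": "Trade Receivable", "debit": 110.0,  "credit": 0.0},
    {"tx_id": 1003, "account": "Revenue",          "debit": 0.0,    "credit": 100.0},
    {"tx_id": 1003, "account": "Tax Payable",      "debit": 0.0,    "credit": 10.0},
    {"tx_id": 1007, "account": "Trade Receivable", "debit": 1100.0, "credit": 0.0},
    {"tx_id": 1007, "account": "Revenue",          "debit": 0.0,    "credit": 1000.0},
    {"tx_id": 1007, "account": "Tax Payable",      "debit": 0.0,    "credit": 100.0},
])
print(group_into_business_processes(journal))  # one signature -> [1003, 1007]
```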

In this research, we use the same approach to generate the network representation of journal entries. In the next section, we explain that the network representation of journal entries can be defined as a bipartite network. To compare these networks, we use their local features, and we define these local features through a set of well-known graph metrics. Therefore, we dedicate the next section of this chapter to reviewing the graph studies relevant to this research.

2.3 Graph Analytics

As mentioned earlier, the main purpose of this research is to compare sets of journal entries. We make this comparison by transforming the journal entries into their network representation. This transformation provides the opportunity to benefit from graph analytics in our study.

In this part, we motivate our use of graphs by providing a short review of the application of graphs in different fields of science. Then, in the Bipartite Graphs subsection, we explain the type of graph that we use for representing the journal entries. In the final subsection, we review some graph metrics that we use for defining the local features of graphs in our methodology.

2.3.1 Graph Application in Science

In recent years, the study of network science and graph theory has grown dramatically. Network science is used as a tool to analyze the relations that exist in data. It was introduced for the first time by Leonhard Euler in 1736 (Melnikov, 1994) and has continued to develop as the field of graph theory.

Networks are an intuitive representation of many phenomena. They describe a collection of entities and the direct relations between them. The entities could be any objects, such as countries, human organs, or individuals. Depending on the type of the entities, the direct relations between them could be the pathways between countries, the connection of organs through neurons, or the friendship relation between individuals. Such


networks can be studied as either dynamic or static. Static networks hold a snapshot of data while dynamic networks can show the changes that occur in the data over time (Ranshous et al., 2015).

Networks have applications in many fields of science. In biology, networks are applied to biological systems. In such systems, networks are represented in different layers, such as the cell and organ layers, and they are studied as interconnected networks (Bernsdorf et al., 2011).

In neuroscience, the application of networks has enabled a better understanding of human cognition by analyzing how interacting processes generate behavioral phenomena (Medaglia et al., 2015). In addition, networks are used to study cognitive processes through the functional connections between different areas of the brain (Bassett and Bullmore, 2017).

Networks have also played an important role in the social sciences. The main focus of these studies is on the interactions between groups of people (Newman, 2001). An example of such a network is the collaboration network, where people share common properties such as the movies they are interested in.

Another field of science that uses graph theory to tackle some of its complex problems is finance. In fact, there are various relations in financial data that can be represented as networks (Allen and Babus, 2009). Specifically, a financial network is a collection of financial entities and the connections between them. These financial entities can be firms, banks, or stocks that are connected to each other through various financial interactions. For example, firms' ownership of stocks can be represented as a network. In this network, the nodes denote the firms and stocks, and the edges indicate stock ownership.

2.3.2 Bipartite Graphs

As we explained in Section 2.2, there are two types of nodes in the network representation of journal entries: business process nodes and account nodes. There is an edge between an account node and a business process node if the corresponding business process causes debiting or crediting of that account. Therefore, no pair of account nodes can be directly connected to each other, and no pair of business process nodes can be directly connected to each other. Such a network is formally called a bipartite graph. As an example, Figure 2.3 illustrates the bipartite graph representation of the abstract model.


Figure 2.3: Representation of the Abstract Model as a Bipartite Graph

Formally, a bipartite graph $G$ is defined as $G = (V, E)$, where $V = V_1 \cup V_2$. $V_1$ and $V_2$ are two disjoint node sets of the bipartite graph, and $E$ is the set of all edges connecting nodes from $V_1$ to $V_2$ (Arunkumar and Komala, 2015).

A common step in analyzing bipartite graphs is one-mode projection. The one-mode projection onto the $V_1$ nodes results in a graph that contains only $V_1$ nodes. In this graph, two $V_1$ nodes are connected when they have at least one common neighboring $V_2$ node in their parent bipartite graph. The one-mode projection onto $V_2$ is defined similarly.

However, one-mode projection of a bipartite graph usually causes a loss of information and an explosion of edges in the projected graph (Zhou et al., 2007). Therefore, it is more suitable to perform the analysis directly on the bipartite graphs.

In this research, we define our proposed local features directly on the bipartite graphs instead of their one-mode projections.
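As a small illustration of this point, the following hedged sketch (not the thesis code) builds a toy bipartite graph with networkx and computes its one-mode projection onto the account nodes; the projection keeps only account-account co-occurrence and discards which business process linked them.

```python
# Illustrative sketch: a bipartite graph versus its one-mode projection.
import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
accounts = ["Cash", "Revenue", "Tax Payable", "Trade Receivable"]
processes = ["BP1", "BP2", "BP3"]
B.add_nodes_from(accounts, bipartite=0)    # V1: account nodes
B.add_nodes_from(processes, bipartite=1)   # V2: business process nodes
B.add_edges_from([
    ("BP1", "Trade Receivable"), ("BP1", "Revenue"), ("BP1", "Tax Payable"),
    ("BP2", "Cash"), ("BP2", "Trade Receivable"),
    ("BP3", "Cash"), ("BP3", "Revenue"),
])

# One-mode projection onto the account nodes: two accounts are connected
# whenever they share at least one business process in the bipartite graph.
P = bipartite.projected_graph(B, accounts)

print(B.number_of_edges())  # 7 edges in the bipartite graph
print(P.edges())            # denser account-account graph; process identity is lost
```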

2.3.3 Graph Measures

As we mentioned earlier, we want to compare the network representations of journal entries. To achieve this goal, we use the networks' local features as part of our methodology. These local features should be able to describe the properties of nodes and their neighborhoods. In particular, two nodes with similar network properties should have similar feature values. We explain the process of feature selection in the Methods chapter. In this part, we review the graph metrics that we use in defining the networks' local features.

• Degree of a node is the number of direct neighbors of that node. For an account node, the degree is the number of business process nodes that are connected to that account node. For a business process node, the degree is the number of account nodes that are connected to that business process node.

– The average degree gives insight into the average number of neighbors, and it is defined separately for account and business process nodes. Mathematically, the average degree for the node sets $V_1$ and $V_2$ of an undirected bipartite graph is defined as

$$\text{avg degree}_{V_1} = \frac{\sum_{v \in V_1} \deg(v)}{|V_1|}, \qquad \text{avg degree}_{V_2} = \frac{\sum_{v \in V_2} \deg(v)}{|V_2|}$$

where $\deg(v)$ is the degree of node $v$, and $|V_1|$ and $|V_2|$ are the sizes of the $V_1$ and $V_2$ node sets. Due to the bipartite structure:

$$|V_1| \cdot \text{avg degree}_{V_1} = |V_2| \cdot \text{avg degree}_{V_2}$$

For a directed graph, the same definition can be used by computing the degree as $\deg(v) = \text{in-degree}(v) + \text{out-degree}(v)$.

– Degree centrality of nodes $v \in V_1$ and $u \in V_2$ is defined as follows (Borgatti and Halgin, 2011):

$$\text{deg centrality}_v = \frac{\deg(v)}{|V_1|}, \qquad \text{deg centrality}_u = \frac{\deg(u)}{|V_2|}$$

• In our method, network fill and the clustering coefficient are used to describe the graph's connectivity level.

– Network fill (also known as network density) is the number of existing edges over the number of possible edges:

$$f = \frac{|E|}{|V_1| \cdot |V_2|}$$

– The clustering coefficient of a node $v \in V_1 \cup V_2$ in a bipartite graph is defined as (Latapy et al., 2008):

$$C_v = \frac{\sum_{u \in N(N(v))} \text{coef}(v, u)}{|N(N(v))|}$$

where $N(v)$ is the set of node $v$'s neighbors and $\text{coef}(v, u)$ is specified by the Jaccard or Overlap coefficient:

$$\text{Jaccard Coefficient} = \frac{|N(v) \cap N(u)|}{|N(v) \cup N(u)|}$$

$$\text{Overlap Coefficient(min)} = \frac{|N(v) \cap N(u)|}{\min(|N(v)|, |N(u)|)}, \qquad \text{Overlap Coefficient(max)} = \frac{|N(v) \cap N(u)|}{\max(|N(v)|, |N(u)|)}$$

Using this formula, the average clustering coefficient can be calculated for $V_1$ and $V_2$.

• For graph nodes with degree $k$, neighbor connectivity is defined as the average degree of the nodes' neighbors (Maslov and Sneppen, 2002). It is mathematically defined as

$$k_{nn} = \sum_{k'} k' \cdot p(k'|k)$$

where $p(k'|k)$ is the conditional probability that a node with degree $k$ is connected to another node with degree $k'$, formally defined as

$$p(k'|k) = \frac{|\{u \mid u \in N(v) \wedge \deg(u) = k'\}|}{|\{v \mid \deg(v) = k\}|}$$

In the first experiment with real data, we use neighbor connectivity to examine the relation between the degrees of business process nodes and the degrees of their neighboring account nodes.

We use the above metrics to define the networks' local features. This process is explained in the Feature Selection part of the Methods chapter. In fact, our designed feature set is abstract: it is independent of the underlying problem that we study, and it can be applied to any bipartite graph.
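The following sketch shows, under the assumption that networkx is used, how the measures of this section could be computed on a toy bipartite graph; it is illustrative only and not the feature pipeline of this thesis.

```python
# A hedged sketch of the graph measures on a toy bipartite graph.
import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
V1 = ["a1", "a2", "a3"]   # e.g. account nodes
V2 = ["bp1", "bp2"]       # e.g. business process nodes
B.add_nodes_from(V1, bipartite=0)
B.add_nodes_from(V2, bipartite=1)
B.add_edges_from([("a1", "bp1"), ("a2", "bp1"), ("a2", "bp2"), ("a3", "bp2")])

# Average degree per node set (|V1| * avg_deg_V1 == |V2| * avg_deg_V2).
avg_deg_V1 = sum(B.degree(v) for v in V1) / len(V1)
avg_deg_V2 = sum(B.degree(v) for v in V2) / len(V2)

# Network fill: existing edges over possible edges |V1| * |V2|.
fill = B.number_of_edges() / (len(V1) * len(V2))

# Bipartite clustering coefficient (Latapy et al.); networkx supports
# 'dot', 'min', and 'max' neighbor-overlap modes.
cc = bipartite.clustering(B, nodes=V1, mode="min")

# Neighbor connectivity: average degree of each node's neighbors.
knn = nx.average_neighbor_degree(B, nodes=V2)

print(avg_deg_V1, avg_deg_V2, fill, cc, knn)
```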

2.4 Data Clustering

To compare the financial networks, we define a set of features for each node of the financial networks. Then, we measure the similarity of financial networks based on the similarity of their nodes' feature values. To define the node features, we take each node and its surroundings as an induced subgraph. Then, we use a set of properties that can best describe the structure of those subgraphs.

However, the values of local features are in the form of real and integer numbers. For instance, let us assume that we have the feature values of three nodes: node 1 = (0.95, 0.94, 0.2), node 2 = (0.92, 0.9, 0.22), and node 3 = (0.65, 0.5, 0.45). We can clearly see that no pair of nodes has exactly equal feature values; however, the feature values of node 1 and node 2 are very close to each other. Therefore, we need a mechanism to identify which nodes have similar feature values. For this purpose, we use the K-medoids algorithm to group the nodes based on the similarity of their feature values.

To understand the K-medoids algorithm, it is best to know the K-means algorithm first. In the following, we review these two well-known clustering algorithms.

2.4.1 K-means and K-medoids Algorithms

The K-means algorithm groups feature vectors into K clusters. K is a user-specified value that defines the number of partitions for clustering the feature vectors (Arora et al., 2016). The pseudocode of the K-means algorithm is stated below.


1. Place K feature vectors as initial centroids (the center of each cluster) into the clustering space.

2. Assign each feature vector to the cluster with the closest centroid.

3. Update the positions of the K centroids with the formula:

$$\mu_i^{t+1} = \frac{\sum_{x_j \in S_i^t} x_j}{|S_i^t|}$$

where $S_i^t$ is the set of feature vectors that are in cluster $i$ at iteration $t$, and $\mu_i^{t+1}$ is the updated position of centroid $i$.

4. Repeat steps 2 and 3 until the centroids converge to fixed values or a certain number of iterations is reached.

The cost function of this algorithm is defined as:

$$J(\mu_1^t, \ldots, \mu_K^t) = \sum_{k=1}^{K} \sum_{x_i \in S_k^t} \|x_i - \mu_k^t\|^2$$

where $\mu_k^t$ is the centroid of cluster $k$ at iteration $t$.

The time complexity of the K-means algorithm is $O(t \cdot K \cdot n \cdot d)$, where $t$ is the number of iterations, $K$ is the number of clusters, $n$ is the number of feature vectors, and $d$ is the dimension of the feature vectors.

To find the best value of K, we look at the slope of the cost function for different values of K (the number of clusters). In general, the value of the cost function decreases as we increase K. In the extreme case, when K equals the number of feature vectors, the cost function equals zero. At some point along the cost curve, increasing K causes only a small reduction in the cost, while decreasing K raises the cost sharply. This sweet spot is called the "elbow point", and the values of K in this area are good candidates for the number of clusters.

A disadvantage of the K-means algorithm is that the clustering is sensitive to noise and outliers.

The K-medoids algorithm is also a partitioning algorithm, similar to K-means. In comparison with K-means, it is more robust to outliers (Arora et al., 2016).

In this algorithm, the medoid (center) of a cluster is selected from among the feature vectors of that cluster. For each cluster, the medoid is the feature vector with the minimum average pairwise dissimilarity to all the other feature vectors of that cluster. The pairwise dissimilarity is a user-defined function, and it can be defined as the Euclidean or cosine distance. Below is the pseudocode of the K-medoids algorithm.

1. Place K feature vectors as initial medoids.

2. Assign each feature vector to the cluster with the closest medoid.

3. For each cluster i, compute the average dissimilarity of every feature vector to the other feature vectors of that cluster, and select the feature vector with the lowest average dissimilarity as the new medoid.

4. Repeat steps 2 and 3 until there is no change in the assignment of medoids or a certain number of iterations is reached.

The time complexity of the K-medoids algorithm is $O(K \cdot (n - K)^2)$, where $K$ is the number of clusters and $n$ is the number of feature vectors.
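For concreteness, below is a minimal numpy sketch of the K-medoids pseudocode above (random initialization, Euclidean dissimilarities); it is an illustrative implementation, not the one used in this thesis, and sweeping k while recording the returned cost reproduces the elbow curve discussed for K-means.

```python
# A minimal, illustrative K-medoids sketch following the pseudocode above.
import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # Pairwise Euclidean dissimilarities between all feature vectors.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        # Step 2: assign each vector to the cluster of the closest medoid.
        labels = np.argmin(D[:, medoids], axis=1)
        # Step 3: per cluster, pick the member with the lowest total dissimilarity.
        new_medoids = medoids.copy()
        for i in range(k):
            members = np.where(labels == i)[0]
            if len(members):
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[i] = members[np.argmin(within)]
        # Step 4: stop when the medoid assignment no longer changes.
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    cost = D[np.arange(n), medoids[labels]].sum()
    return medoids, labels, cost

# Sweeping k and plotting the cost yields the elbow curve described above.
X = np.array([[0.95, 0.94, 0.20], [0.92, 0.90, 0.22],
              [0.65, 0.50, 0.45], [0.60, 0.52, 0.40]])
print(k_medoids(X, k=2))
```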

Other Clustering Algorithms

Two other popular unsupervised clustering algorithms are DBSCAN and OPTICS. Both are density-based clustering algorithms (Kanagala and Krishnaiah, 2016): feature vectors that are densely located in the feature space are grouped into the same cluster, while feature vectors that are located alone or in low-density areas are detected as outliers.

However, we want to group feature vectors based entirely on their values. Hence, the density and shape of the feature vectors in the feature space are irrelevant for our purpose, which is in contrast with the intent of applying DBSCAN and OPTICS.

2.4.2 Distance Measures in Clustering

In our solution, distance measures are used in performing the K-medoids algorithm. Also, in the last step of our methodology, clustering and graph comparison, we measure the pairwise distances of a set of vectors. In this part, we review the two distance measures that we use in this thesis.

• In the K-medoids algorithm, we use the Euclidean distance as the dissimilarity function to compute the pairwise dissimilarity between the feature vectors in each cluster. The Euclidean distance is an instance of the Minkowski distance (Van de Geer, 1995), which is defined as:

$$\left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{\frac{1}{p}}$$

For $p = 1$ this formula gives the Manhattan distance, for $p = 2$ the Euclidean distance, and for $p = \infty$ the Chebyshev distance.

• We also use cosine similarity to measure the pairwise distances for a set of vectors. Cosine similarity measures the angle between two vectors (Muflikhah and Baharudin, 2009). Importantly, this measure is invariant to the magnitude of the vectors' values. The cosine similarity is computed as

$$\cos\theta = \frac{\bar{a} \cdot \bar{b}}{\|\bar{a}\| \, \|\bar{b}\|}$$

When the angle between two vectors is 180° ($\cos 180° = -1$), they are completely dissimilar, and when the two vectors point in the same direction ($\cos 0° = 1$), they are completely similar.
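As a brief illustration, the sketch below computes the Minkowski distance for several values of p and the cosine similarity with numpy; the example vectors are the hypothetical feature values from Section 2.4.

```python
# A short sketch of the two measures used in this thesis: Minkowski distance
# (p = 2 gives the Euclidean distance) and cosine similarity.
import numpy as np

def minkowski(x, y, p=2):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

x = np.array([0.95, 0.94, 0.20])
y = np.array([0.92, 0.90, 0.22])

print(minkowski(x, y, p=1))     # Manhattan distance
print(minkowski(x, y, p=2))     # Euclidean distance
print(cosine_similarity(x, y))  # close to 1.0: nearly identical direction
```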


State Of The Art

3.1 Machine Learning in Audit

With the advance of digital technologies and big data, auditors now have access to huge amounts of financial data for investigation. In addition, the growth of machine learning and its popularity have led to the development of tools and software that make implementing these methods easier and less error-prone than before.

CPA (Certified Public Accountant) firms are improving their audit quality by applying machine learning methods to financial data. Specifically, in the area of risk assessment, there has been ongoing research on finding hidden relations and patterns within financial data (Dickey et al., 2019). For example, Deloitte works with Argus, a machine learning software system that is capable of reading and understanding the contents of contract documents. In addition, this software can flag anomalous terms and unusual trends as outliers. Argus treats each part of a contract as a variable, and it is able to discover various patterns in contract data through those variables (Kokina and Davenport, 2017; Rychlik, 2018).

Another example is the application of the Halo software at PricewaterhouseCoopers (PwC). Halo is used in particular to analyze the content of journal entries. It can identify anomalous terms with questionable meaning or format. Halo can also detect data trends, such as an unusual increase in authorized postings of journal entries within a period of time. By using this software, auditors can test every single journal entry of a company over the whole financial year, and they are able to filter out the journal entries with higher risks for further investigation (Kokina and Davenport, 2017; Rychlik, 2018).

Additionally, KPMG uses IBM's Watson software to benefit from cognitive technologies in auditing credit rating tasks. Watson has a powerful natural language processing component, which enables auditors to pose queries in natural language. Watson can also analyze high volumes of unstructured financial data and generate hypotheses based on the data characteristics (High, 2012; Kokina and Davenport, 2017; Rychlik, 2018).

There have also been a number of studies in the literature that apply machine learning techniques in the area of audit. In one study, Hoogduin analyzed financial transactions through unsupervised clustering of journal entries. This clustering grouped transactions based on the type of business process, including purchases, sales, payments, receipts, and payroll (Hoogduin, 2019).


In another study, an autoencoder neural network was trained to detect unusual entries in large volumes of journal entry data. The aim of the autoencoder was to reconstruct the content of journal entries. The content of each journal entry was represented as a set of attributes, including posting time, amount, transaction code, and so on. The autoencoder consists of three layers: an input layer, a coding layer, and an output layer. The coding layer is a constrained layer, in that it has fewer nodes than the input and output layers. This setting forces the neural network to learn a compressed representation of the journal entries. This compressed representation captures the most important attribute values, and it takes into account the frequency of attribute values and the relations between attributes. In this study, anomalies were divided into two categories: global and local. Global anomalies were unusual attribute values, and local anomalies were unusual combinations of attribute values (Schreyer et al., 2017).

In a similar application of autoencoders, 99 percent of the journal entries in the test data set could be reconstructed. In addition to journal entry reconstruction, the test data was visualized with t-SNE (t-Distributed Stochastic Neighbor Embedding). The visualization showed that journal entries were grouped into different clusters; however, the meaning of the clusters was not further studied by the authors (Župan et al., 2018).

In another study, the K-means algorithm was used to cluster companies in the stock market. The selected features were based on a set of financial ratios: "return on assets", "return on equity", "profit to sales ratio", "earnings per share", and "operating profit margins". In this study, companies from the automotive, cement, and metal industries were selected, and the result of the clustering was used for decision making on the companies' financial performance (Momeni et al., 2015).

In yet another study, anomalous transactions were detected by supervised classification of journal entries. A classifier was trained on a large number of labeled transactions to build a mathematical function that could predict the label of new journal entries. These labels were defined for different purposes, such as the detection of high-risk journal entries (Bay et al., 2006).

Machine learning has played a major role in the development of the aforementioned techniques. It is predicted that auditors will be equipped with new means of risk assessment, and that these new tools will help them discover hidden patterns in financial data (Kokina and Davenport, 2017).

3.2 Graph Comparison

There are various graph comparison studies in the literature. In this part, we review some of the state of the art approaches to comparing graphs. This includes local and neighborhood similarity, which is the basis of graph comparison in this research. We also briefly review two other graph comparison techniques: score functions for graph comparison and graph comparison by entropy.


3.2.1 Local and Neighborhood Similarity

A number of well-established techniques that measure graph similarity are based on the idea that two nodes are similar if their neighborhoods are similar (Porter and Smith, 2010; Grover and Leskovec, 2016; Cherifi et al., 2017). Nodes of a graph, or nodes of different graphs, are similar if they have local and neighborhood similarity. Local similarity of two nodes measures the similarity of their attribute values, and neighborhood similarity of two nodes measures the similarity of their neighbors' attribute values within a certain distance. In fact, the attributes and labels of a node's neighbors can reveal the characteristics of that node (Henderson et al., 2011).

In the context of the financial networks, we can measure the local and neighborhood similarity for account and business process nodes. When we study the neighborhood similarity of the account nodes, the relations between account nodes and business process nodes are taken into account. Likewise, when we study the neighborhood similarity of business process nodes, the relations between business process nodes and account nodes are taken into account.

3.2.2 Score Function for Graph Comparison

Graph comparison has also found application in the area of anomaly detection. To detect anomalous behavior, a score function $f$ is defined for each graph attribute. When the distance between the score of a particular graph element and the aggregate score of all elements $\hat{f}$ exceeds a predefined threshold $T$ ($|f - \hat{f}| > T$), that graph element is flagged as anomalous (Ranshous et al., 2015).

For nodes, f can be the degree of nodes. For edges, f can be the weights of the edges. And, for subgraphs, f can be the number of nodes in the subgraphs (Ranshous et al., 2015).

The comparison of dynamic graphs is also studied as a method of event and change detection. When an event occurs in a time step, the graph undergoes changes that are not seen in the previous and following time steps. When a change occurs, however, the state of the graph changes and does not return to its previous states. Concretely, for a graph with score function $f$ ($f: G_t \rightarrow \mathbb{R}$), an event occurs at time step $t$ if $|f(G_t) - f(G_{t-1})| > T$ and $|f(G_t) - f(G_{t+1})| > T$, and a change occurs at time step $t$ when $|f(G_t) - f(G_{t-1})| > T$ and $|f(G_t) - f(G_{t+1})| \leq T$. Event detection has been applied to molecular dynamics simulations and the detection of unusual video frames, and change detection has found application in communication and human interaction networks (Ranshous et al., 2015).
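The event and change rules can be stated in a few lines of code. The sketch below is an illustrative reading of the definitions above, with a placeholder score sequence and threshold rather than values from the cited work.

```python
# Illustrative event/change detection over a sequence of graph scores f(G_t).
def detect(scores, T):
    """Label each interior time step as 'event', 'change', or None."""
    labels = [None] * len(scores)
    for t in range(1, len(scores) - 1):
        jump_prev = abs(scores[t] - scores[t - 1]) > T
        jump_next = abs(scores[t] - scores[t + 1]) > T
        if jump_prev and jump_next:
            labels[t] = "event"   # spike that reverts at t + 1
        elif jump_prev and not jump_next:
            labels[t] = "change"  # shift that persists after t
    return labels

# e.g. scores could be the edge counts of successive monthly financial networks
scores = [10, 11, 25, 11, 10, 22, 23, 22]
# index 2 -> 'event'; index 5 -> 'change' (index 3 also matches the change rule)
print(detect(scores, T=5))
```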

3.2.3 Graph Comparison by Entropy

Graphs have also been studied as mathematical objects with the aim of measuring their structural complexity. For this purpose, several graph entropy measures are introduced.

Shannon entropy is used to measure the amount of information in a graph attribute (Bonchev, 2009). For each graph attribute, the total and mean entropy are computed as

$$I_{total} = N \cdot \log N - \sum_{i \in A} N_i \cdot \log N_i \qquad \bar{I} = -\sum_{i \in A} p_i \cdot \log p_i$$

where $A$ is the set of attribute values in the graph, and $p_i$ is the probability that an element of the graph has the attribute value $i$. This probability is computed as $p_i = \frac{N_i}{N}$, where $N_i$ is the number of graph elements with attribute value $i$ and $N$ is the total number of graph elements (Bonchev, 2009).

However, better performance is achieved when the probability is defined per node. In this new definition, $p_i$ is computed for node $i$ as the ratio of node $i$'s attribute value to the sum of all nodes' attribute values.

For the degree attribute of a graph with $n$ nodes, $p_i = \frac{\deg(i)}{A}$, where $\deg(i)$ is the degree of node $i$ and $A = \sum_{j=1}^{n} \deg(j)$.

For the shortest path distance attribute, $p_i = \frac{d(i)}{D}$, where $d(i)$ is the sum of the shortest path distances from node $i$ to every other node and $D = \sum_{j=1}^{n} d(j)$ (Bonchev, 2009).
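As an illustration, the sketch below computes the attribute-value (mean) entropy and the node-wise degree entropy described above for an arbitrary networkx graph; it is a hedged example, not code from the cited work.

```python
# A brief sketch of the degree-based Shannon entropy measures.
import math
import networkx as nx

def degree_entropy(G: nx.Graph):
    degrees = [d for _, d in G.degree()]
    N = len(degrees)
    # Mean entropy over the distinct degree values (attribute-value form).
    counts = {}
    for d in degrees:
        counts[d] = counts.get(d, 0) + 1
    mean_I = -sum((c / N) * math.log(c / N) for c in counts.values())
    # Node-wise form: p_i = deg(i) / sum_j deg(j).
    A = sum(degrees)
    node_I = -sum((d / A) * math.log(d / A) for d in degrees if d > 0)
    return mean_I, node_I

G = nx.path_graph(5)
print(degree_entropy(G))
```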

3.3 Bag of Words Model and K-means Clustering

The concept of bag of words has been used in various applications of natural language and image processing. In this approach, a data item is expressed through its most representative feature values, and the frequencies of these representatives are stated in the form of a histogram.

In image processing, the bag of words model is used in the areas of object and texture recognition. The standard pipeline for implementing bag of visual words includes feature extraction, learning a visual vocabulary, feature quantization based on the visual vocabulary, and image representation by the frequency of visual words.

Specifically, each image is divided into a set of patches, and each patch is described as a feature vector. An image is therefore expressed as a collection of local features. These local features capture image-specific properties such as occlusion, scale invariance, and rotation invariance; however, the spatial information of these properties is ignored. In the process of learning the visual vocabulary, a clustering algorithm such as K-means is used to find the most representative feature vectors (also known as code words). Next, in the vector quantization phase, each feature vector is mapped to the index of its nearest code word (Yang and Newsam, 2010).

Given that different classes of images have different code word frequencies, a classifier can be trained to distinguish between image classes (Tirilly et al., 2008).

In natural language processing, the concept of bag of words is applied to express a document $d$ as a vector of word frequencies, $d = [n_{w_1}, n_{w_2}, \ldots, n_{w_t}]$. Document vectors are then used to measure the similarity between documents.

However, the words of a document usually differ in importance, and highly frequent words are not always valuable in comparing documents. As a remedy to this problem, TF-IDF, a numerical statistic, is used to weight words based on their frequency and on the number of documents that include each word (Ramos et al., 2003).

In TF-IDF, TF stands for term frequency, which is the number of times term $t$ occurs in document $d$, and IDF stands for inverse document frequency, which measures the rareness of a term in a given document collection. TF and IDF are computed as follows:

$$tf(\text{term}_i, \text{document}_j) = \frac{N_{ij}}{N_j}$$

$$idf(\text{term}_i, \text{documents}) = \log \frac{|\text{documents}|}{|\{d \mid (d \text{ contains } \text{term}_i) \wedge d \in \text{documents}\}|}$$

where $N_{ij}$ is the frequency of term $i$ in document $j$, and $N_j$ is the total number of terms in document $j$.

In our method, we use the bag of words model to compare financial networks. We apply this concept by representing the financial networks through their local features. Then, we cluster the local features to find the most representative feature values (code words). After that, we map each local feature to the index of its nearest code word.
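The quantization step can be illustrated as follows. This hedged sketch assumes a shared set of code words is already available (in our method they would come from clustering the local features); it maps each graph's feature vectors to their nearest code words, builds normalized histograms, and compares the histograms with cosine similarity.

```python
# A hedged sketch of the bag-of-words step: quantize local feature vectors
# against shared code words and compare the resulting histograms.
import numpy as np

def to_histogram(features: np.ndarray, code_words: np.ndarray) -> np.ndarray:
    """Map each feature vector to its nearest code word and count frequencies."""
    d = np.linalg.norm(features[:, None, :] - code_words[None, :, :], axis=-1)
    nearest = np.argmin(d, axis=1)
    hist = np.bincount(nearest, minlength=len(code_words)).astype(float)
    return hist / hist.sum()

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hard-coded code words and per-graph features, for brevity.
code_words = np.array([[0.9, 0.9], [0.2, 0.3], [0.5, 0.5]])
graph_i = np.array([[0.95, 0.88], [0.21, 0.35], [0.90, 0.92]])
graph_j = np.array([[0.93, 0.91], [0.18, 0.28], [0.88, 0.90]])

h_i, h_j = to_histogram(graph_i, code_words), to_histogram(graph_j, code_words)
print(cosine(h_i, h_j))  # close to 1.0: the two graphs look similar
```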

So far, we have explained our goal and motivation in conducting this research, reviewed the preliminaries and background required for understanding our proposed method, and gone through some of the state of the art studies that are most relevant to our research.


Methods

Figure 4.1 illustrates a high-level view of our proposed solution. The main steps of this solution are generating graphs from journal entry data, extracting the graphs' local features, and clustering the feature values. Specifically,

Step 1: We construct a bipartite graph for each set of journal entries. We explain this process in the Data Preprocessing and Transformation section.

Step 2: Next, we divide the bipartite graphs into sets of subgraphs. The purpose of these subgraphs is to specify the surrounding structure of each node. We explain this process in the Extracting Substructures section.

Step 3: We represent each subgraph as a feature vector. We explain this process in the Feature Selection section.

Step 4: We cluster the feature vectors with the K-medoids algorithm, and we compare the graphs based on the clusters of feature vectors. We explain this process in the Clustering and Graph Comparison section.

Below is the workflow diagram of our solution.

Figure 4.1: Workflow Diagram of our Proposed Solution

In this chapter, we explain the steps of this methodology in detail.


4.1 Data Preprocessing and Transformation

In the first step, we process each set of journal entries and generate its network representation. We explained in the Bipartite Graphs subsection that the network representation of journal entries is in the form of a bipartite graph, with account nodes and business process nodes forming the two disjoint node sets. As explained, an account is impacted by one or more business processes, and each business process impacts two or more accounts.

We reviewed the process of generating the bipartite network representation of journal entries in the Financial Statement Networks section. Figure 4.2 illustrates an example of this network, generated from one financial month of a company's journal entries.

Figure 4.2: An Example of Bipartite Network Representation of Journal Entries

In this graph, the edges are labeled as either debit or credit. This can also be interpreted as directed edges, where the direction of an edge indicates a debit or a credit. A debit edge between a business process node and an account node means that the business process activation causes debiting of that account. Likewise, a credit edge means that the business process activation causes crediting of some amount to that account. Depending on the type of the account, the actions of debiting and crediting increase or decrease the account balance.

In this thesis, we only study the relation between the account and business process nodes. Working with the edge labels is outside the scope of this thesis, but we are interested in expanding our solution to analyze the edge labels in the future.
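A minimal sketch of this construction, under the assumption that networkx is used and with a hypothetical input format (each business process mapped to its accounts with a debit/credit label and weight), could look as follows; it is illustrative rather than the exact thesis implementation.

```python
# A minimal sketch of building the bipartite financial statement network:
# business process nodes on one side, account nodes on the other, with edges
# labeled 'debit' or 'credit'. Input format and names are assumptions.
import networkx as nx

def build_financial_network(business_processes: dict) -> nx.Graph:
    """business_processes maps a process id to (account, side, weight) tuples."""
    G = nx.Graph()
    for bp_id, postings in business_processes.items():
        G.add_node(bp_id, bipartite=1, kind="business_process")
        for account, side, weight in postings:
            G.add_node(account, bipartite=0, kind="account")
            G.add_edge(bp_id, account, label=side, weight=weight)
    return G

# Hypothetical example: the sales process from Table 2.3.
processes = {
    "BP_sales": [
        ("Trade Receivable", "debit", 1.00),
        ("Revenue", "credit", 0.91),
        ("Tax Payable", "credit", 0.09),
    ],
}
G = build_financial_network(processes)
print(G.edges(data=True))
```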


4.2 Extracting Substructures

To compare the bipartite graphs, we compare the similarity of their local features. For this purpose, we express the local and neighborhood properties of each node by a set of features. In this part, we specify the neighborhood structure of each node through an induced subgraph, and in the next section, we will define a set of features that can best describe the structure of those subgraphs.

For a bipartite graph $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges, we define the induced subgraph of node $u \in V$ as $g = (V', E')$, where

$$V' = \{u\} \cup \bigcup_{i=1}^{k} N(u)_i$$

$N(u)_k$ is the set of the node’s distance-$k$ neighbors, defined as

$$N(u)_k = \{v \mid v \in V \wedge \mathrm{distance}(u, v) = k\} \quad \text{for } k \in \{1, 2, \ldots, \mathrm{diameter}(G)\}$$

where distance(u, v) is the number of edges in the shortest path between node u and node v, and diameter(G) is the maximum distance between any pair of nodes in graph G.

And, we define the edges of this subgraph as

$$E' = \{(v_i, v_j) \mid (v_i, v_j) \in E \wedge v_i \in V' \wedge v_j \in V'\}$$

In fact, this induced subgraph is analogous to ego networks in social networks. “Ego is an individual focal node. Egos can be persons, groups, organizations, or whole societies. And Neighborhood is the collection of ego and all nodes to whom ego has a connection at some path length” (Hanneman, 2005).
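A minimal sketch of this extraction, assuming the networkx package: nx.ego_graph returns the subgraph induced by a node and all of its neighbors up to a given radius, which matches the definition of $g = (V', E')$ above. The small example graph and the helper name induced_subgraph are ours.

import networkx as nx

# A small example bipartite graph (2 nodes in one set, 3 in the other).
G = nx.complete_bipartite_graph(2, 3)

def induced_subgraph(G, u, k=2):
    # Returns the subgraph g = (V', E') induced by node u and
    # all nodes within distance k of u.
    return nx.ego_graph(G, u, radius=k)

g_u = induced_subgraph(G, 0, k=2)
print(sorted(g_u.nodes()), sorted(g_u.edges()))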

Since the graphs in our comparison are bipartite, each node $u \in V$ belongs to one of the two disjoint node sets of the bipartite graph ($u \in V_1 \cup V_2$). Therefore, node $u$ and its distance-$k$ neighbors have the same node type if $k$ is even and the opposite node type if $k$ is odd. Specifically,

If $u \in V_1$: $N(u)_{k\ \mathrm{even}} \subseteq V_1$ and $N(u)_{k\ \mathrm{odd}} \subseteq V_2$.
If $u \in V_2$: $N(u)_{k\ \mathrm{even}} \subseteq V_2$ and $N(u)_{k\ \mathrm{odd}} \subseteq V_1$.

In Figure 4.3, we illustrate the process of substructure extraction for a single node of a bipartite graph. In this figure, $V_1$ and $V_2$ nodes are illustrated as circles and rectangles, respectively. We extract the substructure of a $V_1$ node, which is colored in red. The distance-1 neighbors of the red node are members of $V_2$, and the distance-2 neighbors of the red node are again members of $V_1$.



Figure 4.3: Subgraph Extraction up to Distance 2

If we extract the subgraphs of other $V_1$ nodes, we see that there are many repeated nodes and edges among the subgraphs. However, these subgraphs provide useful information about the vicinity of each node.

In a financial network, the subgraph of an account has an intuitive interpretation. Let us consider the induced subgraph of account u in Figure 4.4. In this figure, the circles denote the accounts and the rectangles denote the business processes.

Figure 4.4: An Example of an Account Subgraph (Extracted up to Distance 2)

The induced subgraph of account u represents the structure of possible in-flow and out-flow pathways of monetary values. These pathways connect account u with other accounts that have at least one business process in common with account u.

4.3 Feature Selection

To group similar subgraphs into the same cluster, we need to represent the subgraphs by a set of features. Since there are two types of nodes in the neighborhood of each node, we need a mechanism to distinguish between these two types. In other words, each feature should always refer to the properties of one node type. Generally speaking, a feature is “a distinctive attribute or aspect of something” (Stevenson, 2010). Therefore, it is not possible to describe the attributes of two node types with a single feature.

Additionally, we should cluster the subgraphs of each node type separately. Otherwise, all the nodes are treated as the same type, which contradicts the bipartite nature of the graphs. For this purpose, we divide the task of graph comparison into the comparison of the accounts’ features and the comparison of the business processes’ features. We state the result of each comparison in our experiments. However, combining the results of these two comparisons is out of the scope of this research.

Therefore, we select a set of features that can best describe the structure of the subgraphs and can also distinguish between the two node types. To achieve this goal, we introduce the Density Based and Degree Based feature sets.

4.3.1 Density Based Features

As described in the review section, the clustering coefficient of a node u in a bipartite graph indicates the number of common neighbors between node u and the other nodes that have the same type as node u and are exactly two hops away from node u. Hence, in our feature set we can use the clustering coefficient to describe the level of density between the first and second neighbors of node u. Specifically, if we limit the neighbor exploration to two hops, we can get a good approximation of the subgraph structure by knowing the number of first and second neighbors in addition to the clustering coefficient.

We take these three properties into one feature set to describe the induced subgraph of node u. We define this feature set as the following.

Density Based Feature Set of Node u:
Feature 1: $|N(u)_1|$
Feature 2: $|N(u)_2|$
Feature 3: clustering coefficient(u)

For an account node, the type of distance 1 neighbors is business process and the type of distance 2 neighbors is account. Similarly, for a business process node, the type of distance 1 neighbors is account and the type of distance 2 neighbors is business process.
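A minimal sketch of computing this feature set for one node, assuming networkx and its bipartite clustering coefficient; the helper name density_based_features and the example graph are ours.

import networkx as nx
from networkx.algorithms import bipartite

def density_based_features(G, u):
    # Features 1 and 2: sizes of the distance-1 and distance-2 neighborhoods.
    n1 = set(G.neighbors(u))
    n2 = {w for v in n1 for w in G.neighbors(v)} - n1 - {u}
    # Feature 3: bipartite clustering coefficient of node u.
    cc = bipartite.clustering(G, nodes=[u])[u]
    return [len(n1), len(n2), cc]

# Example on a small bipartite graph.
G = nx.complete_bipartite_graph(3, 4)
print(density_based_features(G, 0))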

4.3.2 Degree Based Features

As an alternative to the Density Based features, we introduce another feature set. In this feature set, we also explore the neighborhood of each node up to distance 2. We define this feature set as the following.

Degree Based Feature Set of Node u:
Feature 1: $\deg(u)$
Feature 2: $\sum_{v \in N(u)_1} \deg(v)$
Feature 3: $\sum_{v \in N(u)_2} \deg(v)$

In this feature set, the degree function, deg, returns the degree of nodes in the original bipartite graph. Also, we are certain that each feature refers to exactly one node type.



If node u is an account node, the type of nodes in $N(u)_1$ is always business process and the type of nodes in $N(u)_2$ is always account. Likewise, if node u is a business process node, the type of nodes in $N(u)_1$ is always account and the type of nodes in $N(u)_2$ is always business process. Hence, if we compare the accounts and business processes separately, each feature always refers to the same type of node.
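A corresponding sketch for the Degree Based feature set, under the same assumptions (networkx, a hypothetical helper name, degrees taken in the original bipartite graph):

import networkx as nx

def degree_based_features(G, u):
    # Feature 1: degree of u; Features 2 and 3: summed degrees of the
    # distance-1 and distance-2 neighbors, all in the original graph G.
    n1 = set(G.neighbors(u))
    n2 = {w for v in n1 for w in G.neighbors(v)} - n1 - {u}
    return [G.degree(u),
            sum(G.degree(v) for v in n1),
            sum(G.degree(v) for v in n2)]

G = nx.complete_bipartite_graph(3, 4)
print(degree_based_features(G, 0))  # expected: [4, 12, 8]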

We generalize this feature set for higher neighborhood distances and other network properties. This generalized feature set is stated in the following.

The Generalized Feature Set:
Feature 1: property(u)
Feature 2: $\mathrm{op}(\{\mathrm{property}(v) \mid v \in N(u)_1\})$
Feature 3: $\mathrm{op}(\{\mathrm{property}(v) \mid v \in N(u)_2\})$
...
Feature k: $\mathrm{op}(\{\mathrm{property}(v) \mid v \in N(u)_{k-1}\})$

In the above definition, op ∈ {sum, average, max}, and the property function can be selected from the table below.

Graph Properties:
Graphs with identical nodes: degree, clustering coefficient, degree centrality, closeness centrality, betweenness centrality, integrated centrality
Graphs with different nodes: degree, clustering coefficient

If we compare graphs with identical nodes, we can use the centrality measures in addition to the degree and clustering coefficient metrics, because the centrality measures are normalized and depend on the global network characteristics.
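To illustrate the generalized feature set, the following sketch (our own helper, assuming networkx) collects the property values per neighborhood shell up to distance k−1 and aggregates them with a chosen op; the property here is restricted to degree or the bipartite clustering coefficient, matching the right-hand column of the table.

import statistics
import networkx as nx
from networkx.algorithms import bipartite

def generalized_features(G, u, k=3, prop="degree", op=sum):
    # prop in {"degree", "clustering"}; op in {sum, max, statistics.mean}.
    if prop == "degree":
        values = dict(G.degree())
    else:
        values = bipartite.clustering(G)
    features = [values[u]]
    # Neighborhood shells N(u)_1, ..., N(u)_{k-1} by shortest-path distance.
    dist = nx.single_source_shortest_path_length(G, u, cutoff=k - 1)
    for d in range(1, k):
        shell = [values[v] for v, dv in dist.items() if dv == d]
        features.append(op(shell) if shell else 0)
    return features

G = nx.complete_bipartite_graph(3, 4)
print(generalized_features(G, 0, k=3, prop="degree", op=max))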

4.4 Clustering and Graph Comparison

Although we describe the local and neighborhood properties of each node with a finite set of features, each feature can take infinitely many values. Therefore, we need a mechanism to detect similar feature values. For this purpose, we cluster the feature vectors with the K-medoids algorithm. Figure 4.5 provides an illustration of this process, where similar feature vectors are illustrated with the same color.
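A minimal sketch of this quantization step, assuming the scikit-learn and scikit-learn-extra packages; the KMedoids estimator and its "k-medoids++" seeding stand in for the k-means++ initialization mentioned in the Implementation Details section, and the feature matrix here is random placeholder data rather than real node features.

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn_extra.cluster import KMedoids

# Placeholder feature vectors: one row per node (array B of Section 4.5).
B = np.random.rand(1000, 3)
B_scaled = MinMaxScaler().fit_transform(B)   # min-max scaling (cf. Section 4.5, Step 4)

K = 6
kmedoids = KMedoids(n_clusters=K, init="k-medoids++", random_state=0)
labels = kmedoids.fit_predict(B_scaled)      # cluster index per feature vector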



Next, for each graph i, we count the number of feature vectors of graph i in each cluster, and we generate a k-dimensional vector as

$$g_i = \big(|\text{feature vectors of graph } i \text{ in cluster } 1|,\ |\text{feature vectors of graph } i \text{ in cluster } 2|,\ \ldots,\ |\text{feature vectors of graph } i \text{ in cluster } k|\big)$$

As illustrated in Figure 4.6, a graph vector indicates the distribution of the graph’s feature vectors in each cluster.

Figure 4.6: Histogram of Clusters for Graph i

In fact, we can compare a set of graphs based on the distribution of their feature vectors in each cluster. In this project, we use the cosine and euclidean distance metrics to measure the pairwise similarity between the graph vectors.
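The following sketch, assuming numpy and scipy and using made-up cluster labels for three graphs, builds the graph vectors and their pairwise cosine and euclidean distance matrices; the helper name graph_vector is ours.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def graph_vector(cluster_labels, k):
    # Histogram g_i: how many of the graph's feature vectors fall in each cluster.
    return np.bincount(np.asarray(cluster_labels), minlength=k)

# Hypothetical cluster labels (k = 4) for three graphs.
labels_per_graph = [[0, 0, 1, 3], [0, 1, 1, 2], [3, 3, 3, 2]]
vectors = np.array([graph_vector(labels, 4) for labels in labels_per_graph])

cosine_distances = squareform(pdist(vectors, metric="cosine"))
euclidean_distances = squareform(pdist(vectors, metric="euclidean"))
print(cosine_distances.round(2))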

4.5 Implementation Details

We use the Python language to build a pipeline that follows the schematic of our methodology in Figure 4.1. Here, we briefly explain the different processes of this pipeline.

Step 1: For each set of journal entries, perform the following steps:

(a) Read the journal entry data from the data source (a database or files) and filter out entries that do not comply with the rules of double-entry bookkeeping (sanity checks). Next, extract the accounts and business processes.

(b) Generate a bipartite graph from the extracted accounts and business processes. In addition, compute some statistics of the journal entries. This includes the number of transactions, the number of account and business process nodes, the maximum and minimum degree of the account and business process nodes, and the average clustering coefficient of the account and business process nodes.

(c) Store a binary serialization of the generated graph and its statistics. This avoids creating the graphs from scratch in subsequent runs of the experiments.

Step 2: For each node of each graph, compute the local and neighborhood properties, and store them as a feature vector. The output of this step is in the form

A = {graph name : {node name : [feature 1, feature 2, feature 3], ...}, ...}

Step 3: Take the feature vectors of all graphs and stack them into a single 2D array B, in which each row is one node’s feature vector.



(It is possible to transform this 2D array back into A, because the order of the indices is preserved.)

Step 4: In array B, scale the values of each feature using the min-max scaler, $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$.

Step 5: Quantize the feature vectors of B into K classes using the K-medoids algorithm. When executing the K-medoids algorithm, use the K-means++¹ algorithm to select the initial cluster centers. Then, replace each feature vector of B with the label of its quantized class (class 1 to K).

Step 6: Transform the array of quantized feature vectors into the form C = {graph name : {node name : class i, ...}, ...} where i ∈ {1, ..., K}.

Step 7: For each graph:

(a) Create a K-dimensional vector.

(b) At each index i of that vector, assign the number of the graph’s feature vectors that are quantized as class i.

Step 8: Perform a pairwise comparison of the graph vectors with the cosine and euclidean distance metrics, and present the results through heatmap and t-SNE² visualizations.
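As a rough illustration of Step 8 (not the exact plotting code used in this thesis), assuming matplotlib, scipy, and scikit-learn, and a placeholder set of graph vectors:

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import TSNE

# Placeholder graph vectors: 12 graphs, 6 clusters.
vectors = np.random.randint(1, 20, size=(12, 6)).astype(float)
D = squareform(pdist(vectors, metric="cosine"))

# Heatmap of the pairwise distance matrix.
plt.imshow(D, cmap="viridis")
plt.colorbar(label="cosine distance")
plt.show()

# t-SNE embedding of the graphs from the precomputed distance matrix.
embedding = TSNE(metric="precomputed", init="random",
                 perplexity=5, random_state=0).fit_transform(D)
plt.scatter(embedding[:, 0], embedding[:, 1])
plt.show()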

¹ “K-means++ method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster” (Arthur and Vassilvitskii, 2006).

² “t-SNE visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map” (van der Maaten and Hinton, 2008).

Results and Evaluation

We use two sources of input data in our experiments. These are synthetic bipartite graphs and the bipartite graph representation of journal entries from companies.

We perform experiments on the synthetic bipartite graphs to see if our method can properly detect the known differences in those graphs.

And, by experimenting on the bipartite graph representation of journal entries, we test the correctness of our hypotheses about the similarity of those graphs.

In each experiment, we state the result of comparing the bipartite graphs based on the feature vectors of the account nodes. If the comparison based on the feature vectors of the business process nodes gives a different result, we report the results for the account features and the business process features separately. Also, we experiment with both feature sets that we defined in Section 4.3. If the two feature sets do not give the same result, we explain the result for each feature set separately.

5.1 Experiments with Synthetic Graphs

To evaluate the effectiveness of our solution, we conduct the following experiments on synthetic graphs.

1. Most graphs that represent real-world entities are not complete graphs. As a result, the graph density is used to measure the ratio of the number of edges to the number of possible edges in a graph (Wilks and Meara, 2002). In the Graph Density experiment, we evaluate the effectiveness of our method by comparing graphs of different density.

2. An important characteristic of any graph is the degree distribution of its nodes (Fornito et al., 2016). In the Degree Distribution experiment, we evaluate the effectiveness of our method by comparing graphs of different degree distributions.

3. Graph scaling has been studied in different applications of graphs (Newman and Watts, 1999; Rafelski et al., 2012). In the Graph Size experiment, we evaluate the effectiveness of our method by comparing graphs of different scale.

4. In the Graph Extension experiment, we intend to study the effect of repeated substructures in comparing the bipartite graphs. For this purpose, we generate a set of graphs based on the abstract model in Figure 2.1.



5. Finally, in the One-mode Projection experiment, we want to see if selecting features on the bipartite graphs performs better than selecting features on their one-mode projections. For this purpose, we repeat the three experiments of Graph Density, Degree Distribution, and Graph Extension, but instead of using the bipartite graphs in feature selection, we use their one-mode projected graphs.

5.1.1 Graph Density

In this experiment, we study the effect of graph density on the comparison of bipartite graphs. We experiment with 12 bipartite graphs of the same size: each graph has 40 account nodes and 400 business process nodes, but the number of edges differs per graph. We want to see if our method can detect these subtle changes in the graph density.
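A sketch of how graphs of this kind can be generated with the networkx bipartite random-graph generator, in which the edge probability p equals the expected graph density; the exact generator and seeds used in this thesis may differ.

from networkx.algorithms import bipartite

densities = [0.4, 0.42, 0.45, 0.46, 0.47, 0.8, 0.85, 0.88, 0.89, 0.9, 0.95, 1.0]
graphs = []
for p in densities:
    # 40 account nodes and 400 business process nodes; each of the
    # 40 * 400 possible edges is present with probability p.
    graphs.append(bipartite.random_graph(40, 400, p, seed=0))

print([g.number_of_edges() for g in graphs])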

Table 5.1 lists the graph density of each bipartite graph.

Table 5.1: Experiment Settings (Graph Density Experiment)

Graph ID   Density
1          0.4
2          0.42
3          0.45
4          0.46
5          0.47
6          0.8
7          0.85
8          0.88
9          0.89
10         0.9
11         0.95
12         1

Based on the graph density values in Table 5.1, we can group graphs 3, 4, and 5 together and graphs 8, 9, and 10 together.

To compare these 12 graphs, we follow the steps of our solution, as explained in the Methods chapter. To find a proper value for the cluster size, we use the elbow method. Figure 5.1 shows the value of the cost function for different cluster sizes.

Figure 5.1: Cost Function (Graph Density Experiment)

The cost function decreases sharply up to approximately k = 6 and is reduced only slowly thereafter. Hence, we select the cluster size k = 6.
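A minimal sketch of this elbow computation, assuming scikit-learn-extra and matplotlib and random placeholder feature vectors in place of the real scaled features; the inertia_ attribute of the fitted KMedoids model is used as the cost (the sum of distances of the samples to their closest medoid).

import numpy as np
import matplotlib.pyplot as plt
from sklearn_extra.cluster import KMedoids

B_scaled = np.random.rand(500, 3)   # placeholder scaled feature vectors

ks = range(2, 15)
costs = []
for k in ks:
    model = KMedoids(n_clusters=k, init="k-medoids++", random_state=0).fit(B_scaled)
    costs.append(model.inertia_)    # cost: total distance to the closest medoid

plt.plot(list(ks), costs, marker="o")
plt.xlabel("cluster size k")
plt.ylabel("cost")
plt.show()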
