
SARS risk profile indicator model evaluation

HN Theron

orcid.org/0000-0001-5314-0929

Dissertation submitted in fulfilment of the requirements for the degree Master of Engineering in Computer and Electronic Engineering at the North-West University

Supervisor: Prof WC Venter

Graduation ceremony: May 2019

Student number: 24150843


ACKNOWLEDGEMENTS

I would like to express my deepest appreciation to:

• My supervisor, Prof. W.C. Venter for his advice, support and guidance.

• Prof A.J. Hoffman for sponsorship and guidance on this project.

• My employer, Xpitec (Pty) Ltd, for their support and patience in allowing me to pursue this degree.

• All my friends and family, who supported and encouraged me throughout the duration of this project.


ABSTRACT

Managing trade efficiently and in a cost-effective manner can greatly benefit the economy of a country. In South Africa, customs processes lead to many unnecessary delays, seemingly without any underlying reason. This research report aims to show how this conclusion was reached, along with a range of other statements about the efficiency and state of customs processing in this country.

The first chapter gives some background information and states the main problem to be addressed, namely developing an application that can analyse raw customs data provided by SAAFF in order to reach meaningful conclusions from the data. The objectives for the project as well as the methodology are discussed in detail. Furthermore, a literature study showing the motivation for the research and background on the more important topics is provided.

The conceptual design broadly explains the design phase for the complete project, including all of the functional units and a complete functional analysis of all the units and the interfaces. The detail design then zooms in on the unit responsible for creating the analyses and provides flowcharts to explain the design of the analyses.

The next chapter provides results after the implementation of the design and evaluates the results in detail. There are two main categories for the analyses, namely the statistical and the modelling categories. The former uses descriptive statistics to process the data in a meaningful way and to ensure easy interpretation for users of the software. The results are displayed in the form of histograms. The second category, modelling, makes use of statistical principles to estimate the probability of future occurrences. This could be of great help to SAAFF members when making decisions about their trades and can help them make more informed decisions regarding the commodities they trade.

The last chapter draws conclusions about the results and gives further recommendations on how the project can be improved upon. The attached appendices give more detail about the statistical tables used as well as descriptions for some of the customs codes used in the analyses. The user manual for the software is also included, along with the source code and full digital archive.

Keywords: South African Revenue Service (SARS), South African Association of Freight Forwarders (SAAFF), Cross-border operations, Customs procedures, Descriptive statistics, Predictive modelling, Analyses, Web application, Customs risk models, Data visualizations


EXECUTIVE SUMMARY

Customs inspections can be intrusive and detrimental to customs administration and management, since they can lead to long delay times at border posts. By limiting intrusive inspections, the overall efficiency of a border post will increase. However, in order to achieve this, SARS’ current risk indicator model must first be evaluated.

By analysing customs declarations data collected from several service providers and border posts and provided as input to the dissertation by SAAFF, one can determine whether these inspections were justifiable. Furthermore, the data can be analysed via statistical principles to determine a range of other results that, when utilised correctly, may be used to improve the overall time and costs of operations and enable South Africa to be a more favourable trading partner.

The approach followed for this project included:

• Developing a C# application capable of collecting the raw data from XML files provided by SAAFF and converting them into SQL data tables in a MS SQL database. The application took 3 minutes and 18 seconds to convert all 743 of the available XML files into a 4.94 GB SQL database;

• Developing a C# library with different functions to be used for the different analyses. At the time of writing there were 126 different analyses; these can, however, be used in billions of different ways, depending on the needs of the user;

• Displaying the results from the analyses visually in the form of interactive histograms;

• Using historical data to make predictions about the expected delay times for future consignments. The user can even select their own input variables to refine the model to their specific needs.

The results show that SARS seems to be biased when it comes to certain consignors. SARS’ stop and inspect model also seems seriously flawed, since 85-90% of the consignments stopped and inspected do not yield any infractions. The design describes the steps followed in order to develop the applications in great detail and the reader can


DECLARATION

I, Hanli Nicola Theron, declare that this report is a presentation of my own original work.

Whenever contributions of others are involved, every effort was made to indicate this clearly, with due reference to the literature.

No part of this work has been submitted in the past, or is being submitted, for a degree or examination at any other university or course.

Signed on this 7th day of November 2018, in Potchefstroom.

______________ HN Theron


TABLE OF CONTENTS

ACKNOWLEDGEMENTS ... II

ABSTRACT ... III

EXECUTIVE SUMMARY ... IV

DECLARATION ... V

LIST OF FIGURES ...IX

LIST OF TABLES ... XIV

LIST OF ABBREVIATIONS ... XVI

1. INTRODUCTION ... 1

1.1 Background ... 1

1.2 Problem Statement ... 2

1.3 Project Aim ... 2

1.4 Project Objectives ... 3

1.4.1 Main Objectives ... 3

1.4.2 Secondary Objectives ... 3

1.4.3 Critical Success Factors ... 3

1.4.4 Deliverables ... 4

1.5 Design Methodology ... 4

1.6 Existing Research ... 7

2. LITERATURE REVIEW ... 8

2.1 South African Revenue Service (SARS) ... 8

2.2 South African Association of Freight Forwarders (SAAFF) ... 9

2.3 Transport Corridors, Cross-border Operations and the Road Sector ... 10

2.4 Confidence Interval (CI) ... 11

2.5 Prediction Interval (PI) ... 13

2.6 The Boxplot and Outliers ... 14

2.7 Website Wireframes ... 16

2.8 Extensible Markup Language (XML) ... 17

2.9 SARS Customs Office Locations ... 19

2.10 Chapter Summary ... 20

3. CONCEPTUAL DESIGN ... 21

3.1 Overview ... 21

3.1.1 Back-End ... 21


3.1.3 Complete Conceptual Solution... 23

3.2 Functional Analysis ... 24

3.2.1 Unit 1 - Administrator ... 24

3.2.2 Unit 2 - Computer ... 25

3.2.3 Unit 3 – Operator/s ... 25

3.2.4 Unit 4 – Data Items ... 25

3.3 Functional Interfaces ... 27

3.3.1 Administrator to Computer (I/F 1) ... 27

3.3.2 C# ReadXML Application to MS SQL Database (I/F 2) ... 27

3.3.3 MS SQL Database to Data Processor (I/F 3) ... 27

3.3.4 Data Processor to C# Web Application (I/F 4) ... 27

3.3.5 C# Web Application to Internet (I/F 5) ... 27

3.3.6 Operator to Internet (I/F 6) ... 28

3.3.7 Computer to Environment (I/F7) ... 28

3.4 Operational Flow ... 28

3.5 Development Environment ... 32

3.5.1 Decision Matrices ... 32

3.5.2 MS SQL Server Express 2016 ... 33

3.5.3 Visual Studio Community 2015 ... 33

3.6 Chapter Summary ... 34

4. DETAIL DESIGN ... 35

4.1 Overview of utilised technologies for the design ... 35

4.2 Functional Unit 2.1 C# ReadXML Application ... 36

4.3 Functional Unit 2.2 MS SQL Database ... 39

4.4 Functional Unit 2.3 Data Processor ... 40

4.4.1 Data Processing Algorithms (SQL) ... 40

4.4.2 Data Processing Algorithms (C#) ... 41

4.5 Functional Unit 2.4 C# Web Application ... 42

4.6 Chapter Summary ... 43

5. IMPLEMENTATION AND EVALUATION ... 45

5.1 General Web Application Overview ... 45

5.2 Statistical Analyses ... 53

5.2.1 Overview ... 53

5.2.2 Analyses per Customs Office ... 57

5.2.3 Analyses per HS Code ... 60


5.2.5 Analyses per Previous CPC Code ... 71

5.2.6 Analyses per Transport Code ... 73

5.2.7 Analyses per SP Source Code ... 75

5.2.8 Analyses per Country of Export ... 77

5.2.9 Analyses per Country of Import ... 79

5.2.10 Analyses per Country of Origin ... 81

5.3 Modelling Analyses ... 83

5.3.1 Single Input Variable ... 83

5.3.2 Two Input Variables ... 88

5.3.3 Three Input Variables ... 95

5.3.4 Four Input Variables ... 99

5.3.5 Five Input Variables ... 102

5.3.6 Six Input Variables ... 106

5.3.7 Seven Input Variables ... 111

5.3.8 Eight Input Variables ... 115

5.3.9 Nine Input Variables ... 120

5.4 General Time Trends ... 125

5.4.1 Requests for Additional Documents ... 125

5.4.2 Stopped, Inspected and Amended ... 127

5.5 Chapter Summary ... 129

6. CONCLUSION AND RECOMMENDATIONS ... 130

6.1 Project Overview and Conclusion... 130

6.2 Recommendations ... 132

REFERENCES ... 134

APPENDIX A – STATISTICAL TABLES... 137

APPENDIX B – CUSTOM CODES ... 141


LIST OF FIGURES

Figure 1: The Engineering Design Process ... 5

Figure 2: The iterative waterfall methodology as applied to this project ... 7

Figure 3: General case for capturing central z curve area [18] ... 12

Figure 4: Specific case for capturing the central z curve area for a 95% CI [18] ... 13

Figure 5: The anatomy of a simple boxplot [20] ... 15

Figure 6: SARS Customs Office Locations [25] ... 19

Figure 7: Proposed back-end architecture ... 21

Figure 8: Proposed front-end architecture ... 22

Figure 9: Overview of the complete proposed conceptual solution ... 23

Figure 10: Functional Analysis ... 24

Figure 11: Operational Flow as seen by Operator ... 28

Figure 12: Breakdown of the select option segment ... 29

Figure 13: Breakdown of the select parameters process (statistics) ... 29

Figure 14: Breakdown of the select parameters process (modelling) ... 30

Figure 15: Overview of technologies utilised ... 35

Figure 16: Snippet from one of the XML files ... 36

Figure 17: Flowchart showing the design of the XMLRead Application ... 37

Figure 18: Entity Relationship Diagram (ERD) ... 39

Figure 19: Summary of the data-processor's functionality ... 40


Figure 21: Basic Design for the modelling web page ... 43

Figure 22: Home page ... 45

Figure 23: Statistics page - selection panel ... 46

Figure 24: Statistics page - results panel ... 47

Figure 25: Modelling page - selection panel ... 49

Figure 26: Modelling page - results panel ... 50

Figure 27: Modelling page - results panel with outliers activated ... 51

Figure 28: Number of Observations per Specific Outcomes ... 54

Figure 29: Average Duration for Observations per Specific Outcomes ... 55

Figure 30: Average Duration per Customs Office ... 57

Figure 31: T-Statistic for Average Duration per Customs Office ... 59

Figure 32: Average Duration per HS Code ... 60

Figure 33: Average Duration per Customs Offices with HS Codes ... 61

Figure 34: Zoomed in version of Average Duration per Customs Office with HS Codes ... 62

Figure 35: Average Duration per CPC Code ... 68

Figure 36: Number of Consignments with Infractions per CPC code ... 69

Figure 37: Fraction of Consignments with Infractions per CPC code ... 70

Figure 38: Average Duration per Previous CPC Code... 71

Figure 39: Average Duration per Transport Code ... 73

Figure 40: Average Duration per SP Source Code ... 75

Figure 41: Average Duration per Country of Export ... 77


Figure 43: Average Duration per Country of Origin ... 81

Figure 44: Single Input Modelling Example (BBR) ... 83

Figure 45: Single Input Modelling Example (BBR with outliers) ... 84

Figure 46: Single Input Modelling - Comparison of customs office entries ... 85

Figure 47: Single Input Modelling - Comparison of customs office entries (with outliers) ... 86

Figure 48: Single Input Modelling - Comparison of country of export entries ... 87

Figure 49: Single Input Modelling - Comparison country of export entries (with outliers) ... 87

Figure 50: Two Inputs Modelling Example (Brazil and Chemicals) ... 89

Figure 51: Two Inputs Modelling Example with outliers (Brazil and Chemicals) ... 89

Figure 52: Comparison one of delay times for goods originating in Brazil... 90

Figure 53: Comparison one of delay times for goods originating in Brazil (outliers) ... 91

Figure 54: Comparison two of delay times for goods originating in Brazil ... 92

Figure 55: Comparison two of delay times for goods originating in Brazil (outliers) ... 92

Figure 56: Comparison of delays for chemical imports from different countries ... 93

Figure 57: Comparison of delays for chemical imports from different countries (outliers) .... 94

Figure 58: Three Inputs Modelling Example (Beit Bridge, Congo and Road) ... 95

Figure 59: Three Inputs Modelling Example (Beit Bridge, Congo and Road) with outliers ... 96

Figure 60: 3-Input model comparison of delays for goods originating in different countries ... 97

Figure 61: 3-Input model comparison (outliers) ... 98

Figure 62: Four Inputs Modelling Example ... 99

Figure 63: Four Inputs Modelling Example (outliers) ... 100


Figure 65: 4-Input model comparison (outliers) ... 102

Figure 66: Five Inputs Modelling Example ... 103

Figure 67: Five Inputs Modelling Example (outliers) ... 104

Figure 68: 5-Input model comparison of delays for goods originating in different countries ... 105

Figure 69: 5-Input model comparison (outliers) ... 106

Figure 70: Six Inputs Modelling Example ... 107

Figure 71: Six Inputs Modelling Example (outliers) ... 108

Figure 72: 6-Input model comparison of delays for a variety of goods of different origin .... 109

Figure 73: 6-Input model comparison (outliers) ... 110

Figure 74: Seven Inputs Modelling Example ... 112

Figure 75: Seven Inputs Modelling Example (outliers)... 113

Figure 76: 7-Input model comparison of delays for goods arriving via different countries .. 114

Figure 77: 7-Input model comparison (outliers) ... 115

Figure 78: Eight Inputs Modelling Example ... 116

Figure 79: Eight Inputs Modelling Example (outliers) ... 117

Figure 80: 8-Input model comparison of delays for different service providers ... 118

Figure 81: 8-Input model comparison (outliers) ... 119

Figure 82: Nine Inputs Modelling Example ... 121

Figure 83: Nine Inputs Modelling Example (outliers) ... 122

Figure 84: 9-Input model comparison of delays for different service providers ... 123

Figure 85: 9-Input model comparison (outliers) ... 124


Figure 87: Request Additional Documents and Amended per Month ... 126

Figure 88: Request Additional Documents and Not Amended per Month ... 126

Figure 89: Fraction of Consignments Stopped and Inspected per Month ... 127

Figure 90: Fraction of Consignments Stopped, Inspected and Amended per Month ... 128

Figure 91: Fraction Justified Cases for Stopped and Inspected per Month ... 128

Figure 92: Cumulative Normal Probability Table 1 [37] ... 137

Figure 93: Cumulative Normal Probability Table 2 [37] ... 138

Figure 94: T-Distribution Table 1 [37] ... 139


LIST OF TABLES

Table 1: Decision matrix for RDBMS (Relational Database Management System) ... 32

Table 2: Decision matrix for choice of programming language ... 32

Table 3: Number of entries in each SQL table ... 38

Table 4: Summary of Data Processing Algorithms ... 41

Table 5: Customs response code descriptions ... 46

Table 6: Descriptions of Specific Outcomes ... 47

Table 7: Number of Entries per Input factor ... 53

Table 8: Longest and shortest average delay times per customs office ... 58

Table 9: HS Code Category Entries ... 60

Table 10: Longest and shortest average delay times per HS Code ... 61

Table 11: Summary Average Duration per Customs Office per HS Codes ... 66

Table 12: Longest and shortest average delay times per CPC Code ... 68

Table 13: Longest and shortest average delay times per Previous CPC Code ... 72

Table 14: Transport Codes ... 73

Table 15: Longest and shortest average delay times per Transport Code ... 74

Table 16: Longest and shortest average delay times per SP Source ... 75

Table 17: Longest and shortest average delay times per Country of Export ... 77

Table 18: Longest and shortest average delay times per Country of Import ... 79

Table 19: Longest and shortest average delay times per Country of Origin ... 81

Table 20: Customs Office Abbreviation Descriptions [38] ... 141


Table 22: Previous CPC Code Procedure Definitions [39] ... 145


LIST OF ABBREVIATIONS

ANSI – American National Standards Institute

CI – Confidence Interval

CPC – Customs Procedure Code

CSS – Cascading Style Sheets

C# – C Sharp (Programming Language)

DLL – Dynamic-link Library

FIATA – International Federation of Freight Forwarders Associations

GUI – Graphical User Interface

HS Code – Harmonized Commodity Description and Coding System

HTML – Hypertext Markup Language

ICT – Information and Communication Technology

IDE – Integrated Development Environment

I/F – Interface

IQR – Interquartile Range

MS – Microsoft

OGA – Other Governmental Agency

PI – Prediction Interval

RDBMS – Relational Database Management System

SAAFF – South African Association of Freight Forwarders

SADC – Southern African Development Community

SP – Stored Procedure

SP Source Code – Service Provider Source Code

SQL – Structured Query Language

UI – User Interface

UPS – United Parcel Service

UX – User Experience

VOC – Voucher of Correction

W3C – World Wide Web Consortium


1. INTRODUCTION

This chapter serves to provide some background to the origin of the problem, as well as how the problem will be addressed. This is set out clearly in the problem statement and objectives.

1.1 Background

International and domestic trade is an important part of any healthy economy and, although countries can strive to become independent of others, in today’s society there are no true closed economies [1]. Regulating domestic trade is important, but regulating international trade is crucial, since unregulated trade allows contraband to enter a country easily, which can lead to a range of problems. Illegal goods entering and exiting a country can have devastating and unforeseen effects on the economy [2].

Regulating trade can be a cumbersome and costly process, but international trade as a whole is advantageous to the economy of both developing and developed countries. Some of these advantages include [1]:

• International trade leads to global competitiveness, which should in turn ensure that a country’s local businesses are run as efficiently as possible in order to stay relevant in the market.

• Even developing countries can reap the benefits of international trade since these countries usually have a steady supply of resources like raw materials in the mining, textile and even agricultural field. This generates revenue that can be traded for better technology or even knowledge and skills in the form of scholars studying abroad.

• It helps mitigate the risk of domestic monopolies taking over a certain part of the market in a country, since foreign competition will have to be considered.

• It can lead to great technological innovation in order to stay relevant in the market.

• The effects of national disasters can be mitigated if a shortage in any product can be solved by importing it from another country.

• Most countries simply do not have all the necessary resources to stay self-sufficient and relevant in the modernized world and will most likely lag behind if they choose to close their economies.


These are just some of the advantages of international trade and, since no country in the modern world can truly be considered a closed economy [1], it is easy to conclude that international trade is mutually beneficial to all who participate. But, in order to stay relevant in the market, countries need to keep striving to deliver quality products in the shortest time possible. However, no matter how efficiently a business is run, unpredictable and unfavourable delay times at border posts can greatly increase the time it takes to export a product, lead to great losses and cause potential investors to lose interest. Studies have shown that this is especially prevalent in Africa, where up to 70% of containers are stopped and opened for inspection [3].

1.2 Problem Statement

Managing international trade in an efficient and cost-effective manner is of great importance for the economy of any country. Too many unnecessary delays can hamper the process and can lead to great losses. By analysing historical customs data gathered from SAAFF (South African Association of Freight Forwarders) in the form of raw XML (Extensible Markup Language) files, one might be able to better understand the risk indicator model used by SARS (South African Revenue Service). This information can then be used to improve the current model, if adopted by SARS. Service Providers can also use it to mitigate the risk of being flagged unnecessarily. This in turn will reduce delay times and increase revenue.

1.3 Project Aim

The current hypothesis is that too many trailers are stopped and inspected without solid foundation. In order to test this hypothesis, the relevant customs data provided by SAAFF needs to be analysed. This will be done by making use of a range of statistical and analytical tools.

Furthermore, a web application will be developed where the processed data can be viewed and analysed in a more user-friendly graphical form. This will also give Service Providers a window into the wealth of data they have at their disposal. The web application will enable them to choose which statistics they wish to see over a time period of their choosing. By analysing these results, they will be able to make more informed decisions regarding their trades. Lastly, a model will be designed to determine the expected duration per consignment based on the available historical data.

1.4 Project Objectives

The objectives for this project will be discussed in terms of the main and secondary objectives as well as the critical success factors and applicable deliverables.

1.4.1 Main Objectives

The main objective of this project is to design and implement a web application capable of allowing users to evaluate and analyse available SAAFF customs data. This entails making use of statistical data analytical techniques in order to process the large amount of raw data.

1.4.2 Secondary Objectives

The secondary objectives for this project include the following:

✓ Build an application that can collect the data from multitudes of raw XML files and convert it into a Microsoft SQL Server database with all of the applicable tables;

✓ Develop a library capable of executing different analyses on the data by making use of descriptive statistics;

✓ Develop delay duration predictions by making use of the historical data;

✓ Some form of visual aid (such as graphs) should be used in order to exhibit the results to the user;

✓ The results should show some indication of the behaviour of the current risk indicator model as used by SARS.

1.4.3 Critical Success Factors

In order to successfully complete this project, the main objective must be realized. The success factors for the main objective include:

✓ Creation of a database that contains the data gathered from the XML files;

✓ Stored Procedures and C# (C Sharp) classes should be used to interpret the raw data in order to draw useful conclusions;


Success will be measured by the delivery of the web application software. The web application will need a connection to the created database and will make use of the stored procedures and C# classes in order to analyse the data. The conclusions will be made in the dissertation.

1.4.4 Deliverables

Since this project requires no hardware design, the main deliverable of this project is a functional web application capable of the aforementioned objectives. The front end of the web application will be developed using CSS, HTML and JavaScript while the backend application will be developed using C#. Furthermore, a MS SQL server database will be designed which will be used to store all the data gathered from the multitudes of XML files provided by SAAFF.

The statistical calculations will be completed in the form of MS SQL stored procedures as well as C# classes with applicable functions. A DLL will be provided with the means to port all these functions to any other .NET application without needing to re-compile the code. A user guide will be provided explaining how to use the DLL as well as how to utilise each function in order to calculate each statistic. This will especially be helpful if future work is to be considered.

All the applicable software, documents and results will be delivered to the supervisor, Prof W.C. Venter.

1.5 Design Methodology

The research and design methodology should be able to reflect on all aspects of the project. Since this is an engineering problem, the engineering design process will always lay the foundation and highlight the important steps that should be followed. Figure 1 [4] shows these steps.


Figure 1: The Engineering Design Process

The engineering design process is a good starting point, but since this is a software project some type of software development methodology must also be considered. Software development methodologies are essential in ensuring that a project meets the client’s requirements, stays on schedule and stays on budget.

With regard to this project, two methodologies were considered, namely the rapid prototyping methodology and the waterfall methodology. The waterfall model is a structured process where each of the project’s phases follows in sequence and the client’s input is only considered before the design phase. It is easy to implement and is one of the most-used models where the quality of the end product is of greater concern than schedule and budget. This is also one of the best methodologies to use for database-related software, as is the case with this project [5]. The rapid prototyping methodology is revolutionary in that a working test module can be completed in a very short time. The software is written to ensure a functional prototype as soon as possible. This prototype is especially helpful when the requirements are very vague, since the prototype can be improved upon after each iteration or change in requirements. While this is a good methodology to follow when the design of the project is bound to change drastically, it can also be very costly to the developers. Too many iterations and changes to the specifications can lead to disorder in the design phase [6].

A modified version of the iterative waterfall methodology is proposed here, since the orderly fashion in which each milestone in this model is completed fits the scope of this project. The client’s requirements are also fairly straightforward and minor changes will not affect the design itself that much; however, it must be possible to alter the design without rendering all work void. It should be easy to go back to a previous version of the project’s lifecycle and change the design if needed, and therefore careful version control of the project needs to be maintained. Bitbucket’s free version will be used to help achieve this goal.

Since multiple statistical measures will be used to analyse the data, the calculations for each measure will be kept in separate modules to ensure that only the code that needs to be adjusted is altered during an adaptation of the design. This will help ensure that the final product is kept as modular as possible, allowing easy implementation and merging into future products.

Since the project will also be web based it is important to have a user-friendly GUI. This is where the client’s requirements also tend to change a lot, therefore being able to incorporate feedback into each iteration is imperative. Each iteration on the design should in essence improve upon it and ensure that the designer and end consumer agree.

The simplified flow chart in Figure 2 aims to describe this methodology as applied to the project.


Figure 2: The iterative waterfall methodology as applied to this project

1.6 Existing Research

As far as the student is aware, very limited analysis has been done on SAAFF’s data and this is the first time the subject has been researched in this much detail; however, inspiration was drawn from work completed by both Hoffman et al. [7] and Laporte [8]. Laporte also proposed statistical analysis of customs data with the aim of creating a model that can be used in a risk management system for customs administrations.


2. LITERATURE REVIEW

This literature review aims to provide a better framework of reference on the following subjects:

• The South African Revenue Service (SARS)

• The South African Association of Freight Forwarders (SAAFF)

• Transport Corridors, Cross-border Operations and the Road Sector

• Confidence and Prediction Intervals

• The Boxplot and Outliers

• Website Wireframes

• Extensible Markup Language (XML)

2.1 South African Revenue Service (SARS)

The South African Revenue Service is South Africa’s governmental tax collecting agency and in this regard is also responsible for regulating both domestic and international trade.

SARS was established in accordance with the South African Revenue Service Act 34 of 1997 and is an autonomous part of the state [9], meaning that the National Treasury dictates the rules and regulations, but it is managed by SARS itself.

SARS aims to serve by adhering to the following outcomes, as stated on their official website [9]:

Increased customs and excise compliance;

Increased tax compliance;

Increased ease and fairness of doing business with SARS;

Increased cost effectiveness and internal efficiencies;

Increased public trust and credibility.

This means that all goods that enter and exit the country must be regulated by SARS. This is done to limit and prohibit contraband from entering the country as well as to ensure that the products are taxed appropriately in accordance with tax and customs law. SARS aims to do this without obstructing trade.


The funds collected by SARS are mainly conveyed from the private sector to the public sector. This means that they utilise said funds to pay for a range of goods, services and governmental operations that are required by the general public. Some of these include education, road construction and maintenance as well as public safety and welfare.

The South African public generally has a tax-compliant attitude and, according to a study done by Merida et al., South Africa has a tax-compliant attitude rating of over 70% [10]. The study analysed data from 10 African countries and South Africa had the highest tax-compliant attitude rating, followed by Mali.

2.2 South African Association of Freight Forwarders (SAAFF)

The South African Association of Freight Forwarders is a non-profit national authority in the Freight Forwarding industry of South Africa, founded in 1921 [11]. SAAFF operates in association with government departments and parastatal agencies and mainly concerns itself with the smooth facilitation of trade. This ranges from customs documentation and road freight legislation to export control and port health inspections [11]. Ultimately SAAFF aims to reduce costs over the entire supply chain by making use of the collective knowledge of its members. SAAFF enables members from different freight forwarding companies to collaborate and brainstorm on ways to face challenges in the industry and find solutions that are mutually beneficial to all of the members involved.

The organisation’s members consist of experienced executives from freight forwarding companies from around the country. Some of their platinum members include Savino Del Bene, Aramex, UPS and Hellman. The association can be divided into five regions which enables each region to deal with local issues. The five regions are [11]:

• Gauteng

• KwaZulu-Natal

• Eastern Cape

• Western Cape


Furthermore, SAAFF is a member of FIATA – the International Federation of Freight Forwarders Associations. FIATA consists of almost 40 000 individual members and employs approximately 9 million people in 150 countries [12]. This means that SAAFF members also benefit from the collective knowledge obtained from thousands of other freight forwarding companies. SAAFF members oversee more than 80% of the country’s international trade and the services provided by the members can greatly impact the direction and efficiency of the supply chain [11].

2.3 Transport Corridors, Cross-border Operations and the Road Sector

Transport corridors provide important areas to facilitate trade between different regions. They usually consist of trade routes that are clearly defined with satisfactory rail and road infrastructure.

According to the annual state of cross-border operations report from March 2017, these transport corridors serve a dual purpose, namely [13]:

• Serve as transport and communication networks between regional countries;

• Encourage and develop investment opportunities since resources, markets and ports between countries can be more readily accessed.

Furthermore, transport corridors help to connect landlocked countries to the trading network. In the SADC there are six countries with no access to coastal ports (Botswana, Zimbabwe, Zambia, Malawi, Lesotho and Swaziland) [13], which makes them dependent on trade corridors: a transport network with countries such as South Africa, which do have coastal ports, ensures a steady supply of resources and access to global markets.

Transport corridors in the SADC carry more than 80% of trade in the region [14]. Thus, these corridors can be of great strategic value to all of the regions involved, and maintaining and enhancing them is essential. However, literature shows that this is not the case and that transport corridors and cross-border operations are usually very inefficient, with long delay times [13]. Some of the causes of these inefficiencies include inadequate funding to improve infrastructure, ineffective practices involving legislation and other regulations, ineffective border management and ICT structures, as well as substandard road infrastructure [13], [15].


These are all problems that will need to be addressed in order to make South Africa more relevant in the global market as well as open up investment opportunities. The change of legislation can take years and improving road and other infrastructure at border posts and trade corridors is ultimately very expensive. Logically, one of the easiest and cheapest steps to help facilitate smoother cross-border operations is to improve upon the border management and ICT systems.

2.4 Confidence Interval (CI)

Descriptive statistics are used to summarise data by making use of measures such as the median, mode, mean, variance and standard deviation. These measures serve to describe the data and act as the foundation of any statistical model or analysis. Although these measures are powerful on their own, it is important to remember that a single data point is not always a reliable estimation [16]. For example, one can take the population mean or arithmetic average, μ, of the marks of students taking a course. Suppose μ = 87%. This does not mean most students have 87%; it is just a single representative value and can be heavily skewed by outliers. If half of the students got 10% and the other half 90%, the average would be 50%, which is a very bad representation of the data. This is also why more than one measure is usually applied to data. Furthermore, if data is sampled, the sample mean x̄ would rarely be equal to the population mean and we would not know how close an estimate it is. This is where the confidence interval comes into play.

The confidence interval or CI is used as a more accurate estimate and reports the most likely range of plausible values [16]. In order to calculate a confidence interval, a confidence level must be selected first. The confidence level can be anything from 1-100%, but remember that the higher the confidence level, the wider the range of plausible values that will be presented. The confidence levels most frequently used are 90%, 95% and 99% [16].

In order to calculate the confidence interval on the mean of a population, a normal distribution is required. According to the Central Limit Theorem, the CI will still be valid even if the population is not normal as long as n is reasonably large (n > 40) [17]. Therefore, it is usually assumed a distribution is normal and even if it is not, the CI will still be an accurate estimation.

The confidence interval on the population mean is then calculated as follows:

\bar{x} - (z \text{ critical value}) \frac{s}{\sqrt{n}} \leq \mu \leq \bar{x} + (z \text{ critical value}) \frac{s}{\sqrt{n}} \qquad (1)

where s is the standard deviation, n the sample size and x̄ the sample mean. The z critical value represents the area under the standard normal curve, depending on the chosen confidence level. The standard deviation measures the spread of the data with regard to the mean and is calculated as follows:

s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}} \qquad (2)

while the sample mean is simply:

\bar{x} = \frac{\sum x_i}{n} \qquad (3)

The z critical value can be captured from a central z curve in terms of the area, depending on the chosen confidence level, and read from a cumulative normal probability table as seen in Appendix A. Figure 3 shows the general case for a normal distribution, where 1 − α represents the chosen confidence level.

Figure 3: General case for capturing central z curve area [18]


Figure 4: Specific case for capturing the central z curve area for a 95% CI [18]

The confidence interval on a population mean can also be calculated as follows:

\bar{x} - t_{\alpha/2,\,n-1} \frac{s}{\sqrt{n}} \leq \mu \leq \bar{x} + t_{\alpha/2,\,n-1} \frac{s}{\sqrt{n}} \qquad (4)

where s is the standard deviation, n the sample size, x̄ the sample mean, and t_{α/2, n−1} the upper 100α/2 percentage point of the t-distribution with n − 1 degrees of freedom. The appropriate value can be read from the t-distribution table as seen in Appendix A. This method can calculate the CI even with the variance unknown [17].
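To make the calculation concrete, the sketch below is a minimal, hypothetical C# helper (not taken from the dissertation's code base) that computes a 95% confidence interval on the mean of a set of sample delay times, using the z-based form of equation (1) with the 95% critical value of 1.96.

using System;
using System.Linq;

public static class ConfidenceIntervalSketch
{
    // Returns { lower bound, upper bound } of a 95% CI on the mean,
    // using the z critical value of 1.96 from the standard normal table.
    public static double[] For95Percent(double[] samples)
    {
        int n = samples.Length;
        double mean = samples.Average();
        // Standard deviation as defined in equation (2), dividing by n.
        double s = Math.Sqrt(samples.Sum(x => (x - mean) * (x - mean)) / n);
        double margin = 1.96 * s / Math.Sqrt(n);
        return new[] { mean - margin, mean + margin };
    }
}

Calling ConfidenceIntervalSketch.For95Percent(delayHours) on a sufficiently large sample (n > 40, per the Central Limit Theorem argument above) would return the interval within which the true mean delay is expected to lie with 95% confidence.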

2.5 Prediction Interval (PI)

The confidence interval is an important indicator when trying to predict the population mean μ based on discrete samples, but there is also value in predicting a future value x_{n+1} by making use of the available data. A prediction interval (PI) can do just this.

In order to obtain a prediction interval for a future value, we must first know the value of the prediction error. A point prediction for x_{n+1} is just the sample mean x̄, which can also be used as a point prediction for the population mean μ. Therefore, the expected prediction error is [16]:

E(x_{n+1} - \bar{x}) = \mu - \mu = 0 \qquad (5)

Furthermore, since the sample values are presumed to be independent of the future value x_{n+1}, the variance of the prediction error can be given as:

V(x_{n+1} - \bar{x}) = \frac{\sigma^2}{n} + \sigma^2 = \sigma^2\left(1 + \frac{1}{n}\right) \qquad (6)


The variance is thus equal to the sum of the variance of the future value x_{n+1} and the variance of the sample mean x̄. Since the future value is independent of the mean of the current values, the prediction error is normally distributed and the following holds true [17]:

T = \frac{x_{n+1} - \bar{x}}{s\sqrt{1 + \frac{1}{n}}} \qquad (7)

T has a t-distribution with n − 1 degrees of freedom. Therefore, the prediction interval for the future value x_{n+1} is calculated as follows:

\bar{x} - (t \text{ critical value})\, s\sqrt{1 + \frac{1}{n}} \leq x_{n+1} \leq \bar{x} + (t \text{ critical value})\, s\sqrt{1 + \frac{1}{n}} \qquad (8)

where the t critical value can be read from the t-distribution table (Appendix A) based on the degrees of freedom and the chosen prediction level. The prediction interval for the estimated observation will always be wider than the confidence interval at the same confidence or prediction level, since there is more variation associated with the prediction error than with the error of estimation [16]. The error of estimation for the CI is the difference between one variable and a constant (x̄ − μ), whereas the prediction error is the difference between two random variables (x_{n+1} − x̄) [17].
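As a purely illustrative worked example (the numbers are invented and do not come from the SAAFF data): suppose n = 25 delay observations give x̄ = 40 hours and s = 10 hours. For a 95% prediction level, the t critical value with 24 degrees of freedom is approximately 2.064, so equation (8) gives 40 ± 2.064 × 10 × √(1 + 1/25) ≈ 40 ± 21.1 hours, i.e. a predicted delay of roughly 19 to 61 hours for the next consignment. The corresponding 95% CI on the mean, by equation (4), is only 40 ± 2.064 × 10/√25 ≈ 40 ± 4.1 hours, which illustrates how much wider the prediction interval is.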

2.6 The Boxplot and Outliers

The boxplot, also known as a box-and-whiskers plot, is a visual display of data used to conveniently convey the five-number summary, which includes the following [19]:

• Lower Extreme (Minimum)

• Lower Quartile (1st or 25% Quartile)

• Median

• Upper Quartile (3rd or 75% Quartile)

• Upper Extreme (Maximum)


Figure 5: The anatomy of a simple boxplot [20]

A boxplot can be used to make the following observations [20]:

• What the values of the five-number summary are;

• If the data is symmetrical or skewed;

• If the data is tightly grouped;

• If there are any outliers;

By inspecting Figure 5, an outlier can also be seen to be present. In this example there is only a single lower outlier, but a dataset can have multiple outliers, both in the upper and lower ranges.

Literature is divided about the most effective way to calculate outliers and whether the calculation of outliers is even necessary [21]. The method that will be used for the purpose of this project is the robust IQR (interquartile range) method, developed by John Tukey. Tukey developed this method for the boxplot (also known as a Tukey contribution) and it seems to be the standard method for calculating outliers for graphical representations [21].

The steps for calculating outliers are as follows:

Step 1: Sort the data from lowest to highest values
Step 2: Calculate the median of the dataset
Step 3: Calculate the lower quartile (Q1)
Step 4: Calculate the upper quartile (Q3)

Step 5: Determine the interquartile range as follows:

IQR = Q_3 - Q_1 \qquad (9)

Step 6: Determine the upper and lower boundaries for the outliers:

\text{Upper Boundary} = Q_3 + 1.5(IQR) \qquad (10)

\text{Lower Boundary} = Q_1 - 1.5(IQR) \qquad (11)

Any value smaller than the lower boundary and any value larger than the upper boundary are considered outliers. The 1.5 times IQR rule was created by John Tukey since it gives a good estimation of outliers for Gaussian distributions [21].
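The following C# sketch illustrates Steps 1 to 6 for a set of delay times. It is a hypothetical helper, not the dissertation's actual implementation, and the linear-interpolation quartile method used here is an assumption, since several quartile conventions exist.

using System;
using System.Linq;

public static class OutlierSketch
{
    // Quartile by linear interpolation between sorted values; other conventions
    // (e.g. Tukey's hinges) give slightly different results.
    private static double Quantile(double[] sorted, double p)
    {
        double position = p * (sorted.Length - 1);
        int lower = (int)Math.Floor(position);
        int upper = (int)Math.Ceiling(position);
        return sorted[lower] + (position - lower) * (sorted[upper] - sorted[lower]);
    }

    // Returns { lower boundary, upper boundary } from equations (10) and (11);
    // values outside these boundaries are flagged as outliers.
    public static double[] TukeyBoundaries(double[] data)
    {
        double[] sorted = data.OrderBy(x => x).ToArray(); // Step 1
        double q1 = Quantile(sorted, 0.25);               // Step 3
        double q3 = Quantile(sorted, 0.75);               // Step 4
        double iqr = q3 - q1;                             // Step 5 (equation 9)
        return new[] { q1 - 1.5 * iqr, q3 + 1.5 * iqr };  // Step 6
    }
}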

2.7 Website Wireframes

When creating a new website or web application, it is important to keep the user experience (UX) and the user interface (UI) in mind. Website wireframes or “mock-ups” are used as a means to prototype how the website will “look and feel” for the user.

Wireframes can be simple drawings made by hand, or more extensive layouts created in software. Using software can save a lot of time, but can also be expensive, depending on the software used. Some software also has a steep learning curve. Popular software packages used to create custom wireframes include Adobe XD, Balsamiq, Visio, Photoshop CC, FluidUI and UXPin [22].

Some of the advantages of using wireframe software include [23]:

• Time saving since layouts can easily be modified;

• Can be used to review with the client ensuring the client, designer and developer understand the problem domain and intended solution;

• Ensures that all content, such as the navigation bar, images etc. are positioned correctly;

A key disadvantage of wireframes is that they do not replace the concept design and this may lead to confusion between the client and designer. However, this is easily fixed by communicating with the client and presenting the concept design and the wireframe together.

2.8 Extensible Markup Language (XML)

Extensible Markup Language or XML is a simple, yet versatile text format. It was originally designed by the World Wide Web Consortium (W3C) for applications in large-scale electronic publishing, but has since branched out and is also used for documents and for representing self-defined data structures [24]. XML is used to store and transport data and the tags should be designed to be self-descriptive. The creator of the XML file determines the structure of the data and defines the tags.

Furthermore, since XML is extensible, new data can be added and removed without corrupting the file or changing the meaning of the data. XML data is stored in plain text format, which means that both machines and humans can read and interpret the data. SAAFF provided XML files with raw data for this project. The XML tags and data structure of these files are as follows:

<BOEHeaderList>
  <BOEHeader>
    <MRN>Data</MRN>
    <LRN>Data</LRN>
    <ISVOC>Data</ISVOC>
    <UniqueNumber>Data</UniqueNumber>
    <CustomsOffice>Data</CustomsOffice>
    <CpcCode>Data</CpcCode>
    <PreviousCpc>Data</PreviousCpc>
    <TransportCode>Data</TransportCode>
    <CountryExport>Data</CountryExport>
    <CountryImport>Data</CountryImport>
    <TotalEntryLines>Data</TotalEntryLines>
    <TotalWeight>Data</TotalWeight>
    <TotalPackages>Data</TotalPackages>
    <ServiceProvider>Data</ServiceProvider>
    <SPSource>Data</SPSource>
    <Declarations>
      <Declaration>
        <DateSubmitted>Data</DateSubmitted>
        <DateControlReceived>Data</DateControlReceived>
        <CustomsResponses>
          <CustomsResponse>
            <StatusCode>Data</StatusCode>
            <MessageText>Data</MessageText>
            <DateTimeReceived>Data</DateTimeReceived>
          </CustomsResponse>
        </CustomsResponses>
      </Declaration>
    </Declarations>
    <BOELines>
      <BOELine>
        <HSCode>Data</HSCode>
        <HasPermit>Data</HasPermit>
        <CustomsValue>Data</CustomsValue>
        <TradeAgreement />
        <StatsQuantity>Data</StatsQuantity>
        <StatsUOM>Data</StatsUOM>
        <CountryOfOrigin>Data</CountryOfOrigin>
      </BOELine>
    </BOELines>
  </BOEHeader>
</BOEHeaderList>

From this structure, with some human interpretation, it can be seen that the data can be divided into seven tables with headings:

• BOEHeader

• BOELine

• BOELines

• CustomsResponse

• CustomsResponses

• Declarations

• Declaration

Each of the tags under these tables represents a column of that table, and the actual data of the file is embedded in these column tags. The BOEHeader contains data such as the customs office where the consignment was processed, the country of import, the country of export and the weight of the goods. The BOELine table contains the codes describing the type of goods in the consignment, the quantity of the goods, as well as where the goods originated. The BOELines table is just a link table between BOELine and BOEHeader, implying a many-to-many relation between these tables. For example, a consignment can contain goods from many countries and these countries can have similar goods in many consignments.


The CustomsResponse table contains the customs response message as well as the timestamp of this message. The Declaration table contains the timestamp for when the declaration documents were submitted, as well as the timestamp for when the control was received. The CustomsResponses table links CustomsResponse and Declaration, since a single declaration can have many response messages such as “Detain Order” and “Release”.

Lastly Declarations is also a link table, linking the Declaration table with the BOEHeader table. This implies that a single consignment can have many declarations.
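To illustrate how such a file might be traversed (this is a minimal sketch and not the dissertation's actual ReadXML application; the file path is a placeholder), LINQ to XML can walk the BOEHeader elements and their nested BOELine entries:

using System;
using System.Xml.Linq;

public static class BoeXmlSketch
{
    public static void PrintLines(string path)
    {
        XDocument document = XDocument.Load(path);

        foreach (XElement header in document.Descendants("BOEHeader"))
        {
            string customsOffice = (string)header.Element("CustomsOffice");
            string countryImport = (string)header.Element("CountryImport");

            // Each BOELine describes one commodity line within the consignment.
            foreach (XElement line in header.Descendants("BOELine"))
            {
                string hsCode = (string)line.Element("HSCode");
                string origin = (string)line.Element("CountryOfOrigin");
                Console.WriteLine("{0} {1} {2} {3}", customsOffice, countryImport, hsCode, origin);
            }
        }
    }
}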

2.9 SARS Customs Office Locations

The data collected from SAAFF referenced 35 different customs offices. The locations of these customs offices can be seen in Figure 6. These customs office locations include border posts, airports, contact centres and harbours.

Figure 6: SARS Customs Office Locations [25]

A full list of the customs office names and abbreviations can be found in Appendix B. For the purpose of this study these are the only customs offices of concern since SAAFF only provided data on them.

2.10 Chapter Summary

This chapter discussed the key concepts required to help gain an understanding of the nature of the project domain. Managing trade effectively is essential for any country, but international trade is also a very complex process and the South African Revenue Service (SARS) needs to adhere to international laws. SARS must ensure that both customs and excise compliance is met, as well as tax compliance.

SARS aims to do this with ease and fairness and seeks to be cost-effective and efficient. However, this is a difficult endeavour and long delays at customs offices are not rare. The South African Association of Freight Forwarders (SAAFF) wishes to gain a deeper understanding of the current process and gain insight into SARS’ risk model. Any insight into what causes the delays, or how to mitigate the risk of being delayed, will be of great value and will help SAAFF members to provide better service to their customers.

By making use of different statistical methods, including confidence and prediction intervals as well as boxplots and outliers, models will be built to help SAAFF members analyse the data they provided. The provided data is in a raw XML format. These XML files are readable by humans, but their structure makes it difficult to keep track of all the data and the relationships between different fields. Therefore, the data will be ported to a relational database, where the analyses can be done.

Wireframes will be used to draw prototypes on how the web application will look and aid in the user interface design process.


3. CONCEPTUAL DESIGN

This chapter aims to provide the reader with an overview of the product to be developed, as well as the resources required to build the system. Not only will the complete functional analysis and architecture of the software be stated but the applicable technologies and tools to be utilised will also be explored.

3.1 Overview

3.1.1 Back-End

The back-end, also known as the “server side”, is generally where most of the application logic is applied. This is the data access layer and it consists of both software and hardware: the database where the data is stored (software) and the servers on which the database software is installed (hardware). This project will make use of a MS SQL database containing all the relevant data extracted from the multitudes of XML files provided by SAAFF. The proposed back-end architecture can be seen in Figure 7. It can be seen from the figure that a C# console application will be used to transfer the XML files to the applicable tables in the database.
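As a hedged illustration of this transfer step (a sketch only, not the dissertation's actual console application; the connection string, table name and column mapping are assumptions), parsed rows can be pushed into SQL Server in bulk:

using System.Data;
using System.Data.SqlClient;

public static class BulkLoadSketch
{
    // Pushes a DataTable of parsed BOEHeader rows into an (assumed) dbo.BOEHeader table.
    public static void InsertHeaders(DataTable headerRows, string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var bulkCopy = new SqlBulkCopy(connection))
            {
                bulkCopy.DestinationTableName = "dbo.BOEHeader";
                // Map source columns to destination columns by name.
                foreach (DataColumn column in headerRows.Columns)
                {
                    bulkCopy.ColumnMappings.Add(column.ColumnName, column.ColumnName);
                }
                bulkCopy.WriteToServer(headerRows);
            }
        }
    }
}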

3.1.2 Front-End

The front-end, also known as the “client side” handles the visual aspect of a web page or application. This is what the users see when they open and interact with the application. This includes the design of the application as well as the actual development of it. Coding and scripting languages used for the development of the front-end include HTML, CSS and JavaScript. Furthermore, .NET software will be developed using C# to communicate with the server. This will enable the browser to collect the data from the database. The proposed front-end architecture can be seen in Figure 8. The data will be retrieved by making use of MS SQL stored procedures.

Figure 8: Proposed front-end architecture

It can be seen from Figure 8 that the users will be able to access the results of the analysed data by means of a web application.


The users will choose the applicable input parameters, which include but are not limited to:

• The starting date

• The ending date

• The calculation they wish to see (e.g. Average Duration per Customs Office)

• The specific entry they wish to see the results for (e.g. for which Customs Offices)

The C# web application will then establish a connection with the database and run the applicable stored procedures based on the presented input parameters. The retrieved data will then be analysed accordingly. Lastly the results will be formatted and displayed graphically for the user to see.
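The sketch below shows one way the web application could execute such a stored procedure over ADO.NET; the stored procedure and parameter names are hypothetical, chosen only to illustrate the pattern of passing the user's selections to the database.

using System;
using System.Data;
using System.Data.SqlClient;

public static class AnalysisQuerySketch
{
    // Runs a hypothetical stored procedure that returns the average delay
    // per customs office for the user-selected date range.
    public static DataTable AverageDurationPerCustomsOffice(
        string connectionString, DateTime startDate, DateTime endDate)
    {
        var results = new DataTable();
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand("dbo.uspAverageDurationPerCustomsOffice", connection))
        {
            command.CommandType = CommandType.StoredProcedure;
            command.Parameters.AddWithValue("@StartDate", startDate);
            command.Parameters.AddWithValue("@EndDate", endDate);

            using (var adapter = new SqlDataAdapter(command))
            {
                adapter.Fill(results); // Opens and closes the connection as needed.
            }
        }
        return results;
    }
}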

3.1.3 Complete Conceptual Solution

An overview of the complete conceptual solution can be seen in Figure 9.

3.2 Functional Analysis

The functional analysis, also known as the system architecture, is used to discern the functional units as well as to show the relations between these units. The functional analysis for this project can be seen in Figure 10.

Figure 10: Functional Analysis

3.2.1 Unit 1 - Administrator

The administrator is the person responsible for maintaining not only the servers and database, but also the C# applications. This can range from importing new data into the database to expanding the database to accept other formats of data if the need arises. This also holds true for the web application, and the administrator will have to ensure that the web application fulfils the requirements as requested by the operators. Any issues that might occur, or requests by the operators, will also be handled by the administrator.

3.2.2 Unit 2 - Computer

This unit can be further divided into five sub-units namely:

• C# ReadXML Application

• SQL Database

• Data Processor

• C# Web Application

• The Internet

The SQL database will be maintained on the server. This is where all the applicable data will be stored, but in order to get the data from the XML files into the database an intermediary application is required. This is where the “ReadXML” C# application comes in. The data processor will make use of the pre-processed data from the database and perform the necessary analyses on the data. The results will then be passed to the front-end of the web application, which the operator can access through the internet from any device with applicable browser support.

3.2.3 Unit 3 – Operator/s

The operator is simply the umbrella term for any person or persons who wish to make use of the functionalities of the web application. Thus, from the functional unit perspective, they will only need a device capable of accessing the internet through an applicable browser and a stable internet connection, and they will then be able to run a range of different analyses based on their preferences. This will be achieved through I/F 6.

3.2.4 Unit 4 – Data Items

This unit can also be divided into three sub-categories, namely:

• Design Documentation

• User Guide and Maintenance Documentation

• Operational Documentation

The design documentation will feature the finer details of the design including the applicable technologies and programming environments used, algorithm implementations, different data structures and the database access methods. This report in itself will serve as design documentation.

The user guide and maintenance documentation will be available to help future developers to make use of the software library. This will help ensure that the methods in the libraries can be used effortlessly in other software implementations. This will also help simplify any maintenance requirements or necessary changes to the source code.

Lastly the operational documentation is of importance for the operator. This document will serve to explain how to access and make use of the web application. It, of course, cannot cover the billions of different combinations available to the operator, but will aid in guiding the operator in getting results tailored to their personal needs.

3.3 Functional Interfaces

3.3.1 Administrator to Computer (I/F 1)

This is the primary interface to be used by the administrator or person/s responsible for developing and maintaining both the front- and back-end. The administrator will not only be able to change the database structures but will also be able to access and edit all of the available source code as required.

3.3.2 C# ReadXML Application to MS SQL Database (I/F 2)

This is a secondary interface that will be used to transfer all available customs data from the raw XML files to the SQL database. New XML files can be added to the database at any given time and future work can include the expansion of this application to include the real-time updating of data. However, a real-time supply of XML files from SAAFF is currently not available.

3.3.3 MS SQL Database to Data Processor (I/F 3)

This interface describes the task of processing the data into a more useful format. For the present project, this is done by making use of SQL stored procedures. A range of different stored procedures will be written and stored in the database that will format and filter the data for specific analyses. The final formatting will be done in the web application itself and this interface is purely used to pre-process the data.

3.3.4 Data Processor to C# Web Application (I/F 4)

This interface manages the flow of data between the web application and the database. The web application will establish a connection with the database and send instructions to run the required stored procedure along with the applicable parameters as provided by the user of the web application. The stored procedure will be executed and the results returned to the web application where the final processing of the data will be done.

3.3.5 C# Web Application to Internet (I/F 5)

This interface describes the connection between the web application and the internet. Since it is a web application, it can be accessed from any device with the correct browser support and a reliable internet connection.


3.3.6 Operator to Internet (I/F 6)

The operator/s or users of the web application will need to interface with the internet in order to gain access to the web application.

3.3.7 Computer to Environment (I/F 7)

In this design, the computer for the operator/s is simply the umbrella term for any personal computer or smart device capable of accessing the web application. The administrator, however, will primarily develop on a personal computer and only use other devices where necessary for testing purposes. This personal computer will be either a laptop or a desktop, must make use of a standard power supply, and should in no way harm the environment or the person making use of the device.

3.4 Operational Flow

The operational flow, as seen from the viewpoint of the operator/s, is shown in Figure 11.


It can be seen from Figure 11 that the operator will first need access to any computer or smart device with a reliable internet connection. By opening any of the applicable web browsers and going to the correct web address, the user should gain access to the web application. Authorization and authentication are not part of the scope for this project.

The user can then navigate through the application as they wish, but the primary focus of the web application will be the "statistics" and the "predictive modelling" sections. The statistics section will apply basic descriptive statistics to the customs data. This includes averages, totals and fractions taken over the data, as well as other ways of classifying the raw data so that it holds meaning for the operators. The predictive modelling section will use statistics not only to describe outcomes, but to predict the probability of an outcome. The primary outcome focused on in this study is the expected delay time per consignment, based on input parameters such as the country of origin. Figure 12 gives a breakdown of this selection process.

Figure 12: Breakdown of the select option segment
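As a concrete illustration of the statistics branch shown in Figure 12, the sketch below computes the average clearance duration per customs office from a list of consignment records. The Consignment type and its fields are hypothetical placeholders for the pre-processed rows returned by the database, not the project's actual data model:

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory representation of a pre-processed consignment row.
record Consignment(string CustomsOffice, double DurationDays);

class StatisticsSketch
{
    // Average clearance duration per customs office.
    static Dictionary<string, double> AverageDurationPerOffice(IEnumerable<Consignment> rows)
    {
        return rows
            .GroupBy(c => c.CustomsOffice)
            .ToDictionary(g => g.Key, g => g.Average(c => c.DurationDays));
    }

    static void Main()
    {
        var sample = new List<Consignment>
        {
            new Consignment("Durban", 3.0),
            new Consignment("Durban", 5.0),
            new Consignment("Cape Town", 2.0)
        };
        foreach (var kv in AverageDurationPerOffice(sample))
            Console.WriteLine($"{kv.Key}: {kv.Value:F1} days");
    }
}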

After selecting either of the two options, the operator must provide the application with the applicable parameters so that the correct results can be calculated. The expected parameters differ depending on the choice made by the operator in the previous step. If the operator chose "statistics", the following will hold:

Figure 13: Breakdown of the select parameters process (statistics)

The operator will need to choose the type of statistic or analysis they wish to see, e.g. "Average Duration per Customs Office". Then, the operator must select the period over which the data must be analysed: either all of the data, or only the data for a specific time period. The operator can then choose the specific outcomes. For the example used here, a list of customs offices will appear and the operator can choose to run the analysis for all of the customs offices, or only a select few. For some calculations the operator may also have to specify an entry, but this will not always be the case. An entry can be seen as a secondary input, only required for some of the analyses. If, however, the operator chose "predictive modelling", slightly different parameters are required. The required parameters can be seen in Figure 14.

Figure 14: Breakdown of the select parameters process (modelling)

The operator will need to specify the number of input parameters they want to use for the modelling of the data. There is a total of nine input or explanatory variables used to model the data. The user may select all of them (for a more accurate model) or only a select few. After the user selects the number of input variables to be used, he/she will need to select the specific input variables he/she wishes to use.

The nine input variables are:

• Customs Office
• Country of Origin
• Country of Export
• Country of Import
• Transport Code
• SP (Service Provider) Source Code
• CPC (Customs Procedure) Code
• Previous CPC Code

Furthermore, the user will need to select the specific entry for each input parameter. E.g. if customs office is selected, a list of all the applicable customs offices will appear and the user can select a specific one. After completion of the parameter and entry selection the user can select the range of dates over which to perform the analysis. The user will also have the option to select a specific confidence and prediction interval.
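To make the prediction interval option concrete, the sketch below computes a two-sided prediction interval for the delay of a single future consignment from a sample mean and standard deviation, using the usual normal-theory form x̄ ± t·s·√(1 + 1/n). This is a generic textbook construction offered only as an illustration, not necessarily the exact interval computed in the modelling section; the critical t value is passed in rather than looked up:

using System;

class PredictionIntervalSketch
{
    // Two-sided prediction interval for one future observation, given the
    // sample mean, sample standard deviation, sample size n and the critical
    // t value t(alpha/2, n-1) taken from a Student's t table.
    static (double Lower, double Upper) PredictionInterval(
        double mean, double stdDev, int n, double tCritical)
    {
        double margin = tCritical * stdDev * Math.Sqrt(1.0 + 1.0 / n);
        return (mean - margin, mean + margin);
    }

    static void Main()
    {
        // Example: mean delay 4.2 days, s = 1.8 days, n = 25, t(0.025, 24) ≈ 2.064.
        var (lo, hi) = PredictionInterval(4.2, 1.8, 25, 2.064);
        Console.WriteLine($"95% prediction interval: [{lo:F2}, {hi:F2}] days");
    }
}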

After completion of the parameter selection, the process continues as shown in Figure 11. The operator can press the calculate button and wait for the results of the analysis to be returned. If statistics was chosen, the results are displayed on a graph; for modelling, a table and a boxplot of the results are returned instead. The operator may continue to use the web application and analyse different data, or he/she can close the application.
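The boxplot mentioned above is typically built from a five-number summary (minimum, lower quartile, median, upper quartile, maximum); a minimal sketch of such a summary, computed with simple linear interpolation, is shown below. The quartile convention is an assumption, since several definitions exist and the report does not prescribe one here:

using System;
using System.Linq;

class BoxplotSketch
{
    // Five-number summary used to draw a boxplot of delay times.
    static (double Min, double Q1, double Median, double Q3, double Max) Summary(double[] values)
    {
        double[] sorted = values.OrderBy(v => v).ToArray();
        return (sorted.First(), Quantile(sorted, 0.25), Quantile(sorted, 0.5),
                Quantile(sorted, 0.75), sorted.Last());
    }

    // Linear interpolation between the two closest order statistics.
    static double Quantile(double[] sorted, double p)
    {
        double position = p * (sorted.Length - 1);
        int lower = (int)Math.Floor(position);
        int upper = (int)Math.Ceiling(position);
        return sorted[lower] + (position - lower) * (sorted[upper] - sorted[lower]);
    }
}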


3.5 Development Environment

3.5.1 Decision Matrices

The decision matrix seen in Table 1 was constructed to help determine the most suitable relational database to use for this project. Note that it was decided to use either MS SQL Server Express or MySQL since the student had prior experience working with both of them.

Table 1: Decision matrix for RDBMS (Relational Database Management System)

RDBMS          | Distribution ease | Learning Curve | Functionality | Familiarity and Prior Experience | Support | Total
MS SQL Express | 7                 | 6              | 10            | 8                                | 8       | 8.2
MySQL          | 10                | 8              | 7             | 6                                | 7       | 7.5
Weight         | 0.2               | 0.1            | 0.3           | 0.2                              | 0.2     | 1

From Table 1 it is clear that MS SQL Server Express is the preferable choice; the total for each option is the weighted sum of its criterion scores, using the weights in the last row. The decision matrix for the choice of programming languages can be seen in Table 2.

Table 2: Decision matrix for choice of programming language

Programming Language | Distribution ease | Learning Curve | Functionality | Familiarity and Prior Experience | Support | Total
VB.net               | 7                 | 6              | 7             | 7                                | 9       | 7.3
C# and ASP.net       | 8                 | 8              | 9             | 10                               | 9       | 8.9
PHP                  | 9                 | 6              | 8             | 6                                | 9       | 7.8
JavaScript           | 10                | 2              | 8             | 5                                | 9       | 7.4
Weight               | 0.2               | 0.1            | 0.3           | 0.2                              | 0.2     | 1

It can be seen from Table 2 that C# is the preferred choice for this project. Note that although there are far more programming languages than those listed in the table, these were chosen because the student has prior experience utilising them. C# is a very versatile language, and libraries written in C# can be ported to other .NET frameworks with ease. JavaScript will also be utilised to add interactive features to the web application.
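The totals in Tables 1 and 2 follow the weighted-sum calculation described above; a minimal sketch of that scoring, using the MS SQL Express row from Table 1 as the worked example:

using System;
using System.Linq;

class DecisionMatrixSketch
{
    // Weighted-sum score: each criterion score multiplied by its weight, then summed.
    static double WeightedTotal(double[] scores, double[] weights) =>
        scores.Zip(weights, (s, w) => s * w).Sum();

    static void Main()
    {
        // Weights: distribution ease, learning curve, functionality, familiarity, support.
        double[] weights = { 0.2, 0.1, 0.3, 0.2, 0.2 };
        double[] msSqlExpress = { 7, 6, 10, 8, 8 };
        Console.WriteLine($"{WeightedTotal(msSqlExpress, weights):F1}"); // 8.2, as in Table 1
    }
}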
