
The use of guidelines for automatic quality assessment

University of Amsterdam

Master thesis – Software Engineering
Maurits Henneke (10295240)
Supervisor: Paul Klint

January 2014


Table of Contents

1 Introduction
1.1 Background
1.2 Goal & Purpose
1.3 Outline
2 Theory
2.1 Software Engineering Process
2.1.1 Software specification
2.1.2 Software design and implementation
2.1.3 Software validation
2.1.4 Software evolution
2.2 Software measurement
2.2.1 The Goal/Question/Metric paradigm
2.2.2 Goal definition
2.2.3 Questions
2.2.4 Metrics
2.2.5 Measurements
2.2.6 Assessment
3 Context
3.1 Case study
3.2 Intent of the software product
4 Methodology
4.1 Research approach
5 Software guidelines and metrics
5.1 In-company guidelines for application development
5.2 From guidelines to metrics
5.3 Weighing of the metrics
5.3.1 Validation of the metrics' no risk level
5.3.2 Validation of the acceptable risk level
5.3.3 Validation of the high risk level
5.3.4 Validation of the unacceptable risk level
5.3.5 The not assessable level
5.4 Omission of metrics
5.5 GQM design
5.5.1 GQM goal
5.5.2 GQM questions
6 Analysis
6.1 Analyzer
6.1.1 Setup
6.1.2 Summary
6.1.3 Metric
6.1.4 Reporting and analyzing
6.2 Generating the reports
6.2.1 Bundle report
6.2.2 Vendor report
6.2.3 Overall report
7 Evaluation of the measurements
7.1 Verification
7.2 Reports
7.3 Were these results to be expected
8 Conclusion
9 Future work

Chapter 1

Introduction


1 Introduction

In 2008, in a basement in Utrecht, the company IPPZ was founded. The aim of the company was to integrate the possibilities of web-based technologies into the daily practice of mental health-care practitioners and their clients. This so-called 'blended care design' has gradually found its way into health care centers and general practices over the last couple of years. The company created a simple but effective software system aimed at facilitating the treatment of and communication between clients and practitioners.

At the beginning of 2012 the plan arose to expand the focus of the company to the general health care domain. This shift also meant that a new software system needed to be built: a software system which was not only pleasant for the users to work with but which also has a high internal quality to ensure future maintainability. When starting out on creating this new software product the development team constructed guidelines which should be followed to ensure a certain degree of consistency throughout the code-base. These guidelines are based on the Proposed Standards Recommendation (PSR) created by the PHP Framework Interoperability Group and inspired by the book “Clean Code” by Robert C. Martin [Martin 2008].

1.1 Background

Software quality, and the means to improve it, has been a topic of scientific interest since the nineteen-seventies. However, it was not until 1991 that an ISO/IEC standard model for software quality was introduced [ISO 2001]. This standard quality model and its later revised versions cover more than just the quality characteristics of the produced software product. The part of the standard which deals with the quality of the actual written source code is captured in the product quality model of this ISO/IEC standard (see section 4 paragraph 2 of the ISO/IEC 25010 standard for a complete overview of the characteristics which make up this model [ISO 2011]). Even this model still mainly focuses on the functional and semantic correctness of the code and all the characteristics which result therefrom. The introduction of coding standards is not included in the model.

A software product will have to be maintained and understood by people who often did not construct the source code initially. The software engineering process is complex enough without also having to worry about coming across code which is hard to understand. As part of the maintainability characteristic another sub-characteristic called “understandability” should be introduced. This is a well-known concept in software development but apparently very hard to capture in a model. This lack of standardization has left people and companies making up their own standards and guidelines to which the produced software has to adhere. But measuring understandability is hard because it is subjective.

The Software Engineering Institute designed a Maintainability Index to quantify the ease with which a software product can be maintained during its life-cycle. In recent years this model has been complemented and revised. Part of that research has been done by the Software Improvement Group, in collaboration with a variety of universities and research groups, on the topic of assessing the quality of software systems. Many of the ideas behind, and the validation of, the metrics used in this research are based on work done by the Software Improvement Group.

1.2 Goal & Purpose

The goal of this Master Thesis is to research whether it is useful to create a language-specific software model that can help a company's development team to monitor and improve the quality of a software product.

Based on the in-company software guidelines, metrics will have to be created and combined into a workable software model which can be used to determine the product quality of a software system as a whole and of the different components which make up this system.


From the main research question several sub-questions can be formulated:

• Which metrics follow from the in-company guidelines?

• How to derive a software model from the proposed software metrics?

• How to create a tool which collects the measurements of the software model?

• How to validate the result of the measurements?

1.3 Outline

[Figure: outline of the thesis]

Chapter 1 Introduction: Background, Goal & Purpose
Chapter 2 Theory: Software Engineering Process, Software Measurement
Chapter 3 Context: Case Study, Intent
Chapter 4 Methodology: Research Approach
Chapter 5 Guidelines & Metrics: In-company Guidelines, Guidelines to Metrics, Weighing of the Metrics, Omission of Metrics, GQM Design
Chapter 6 Analysis: Analyzer, Generating Reports
Chapter 7 Evaluation: Verification, Reports, Were these results to be expected
Chapter 8 Conclusion

Chapter 2

Theory

Contents: Software Engineering Process, Software Measurement

2 Theory

In this chapter the activity of developing software is discussed. Developing software is done as part of the software engineering process and will be explained in that context.

2.1 Software Engineering Process

The Software Engineering Process describes the different activities of creating a complete software product. Various models for describing this process exist, of which the waterfall process model and the iterative development process model are the best known.

Despite the existence of various process models, some fundamental activities are shared by all software processes: Software specification, Software design and implementation, Software validation, and Software evolution [Sommerville 2011].

2.1.1 Software specification

The software specification process, which is also known as requirements engineering, is the process of gathering requirements to understand why a software product, or part of a software product, should be created and what that product should do.

During the initial study an estimate is made to examine the necessity and feasibility of the software product. This feasibility study will be used to identify whether the product or proposed features of a product merit the work which it will take to produce it.

The next stage of the process consists of determining the prerequisites and assumptions which lie at the foundation of the ideas behind a software product. When gathering this domain specific knowledge an understanding of the desired functionality is created. Based on the described domain knowledge a set of requirements can be drafted which specifies the most fundamental needs of the software product or feature. These requirements will have to be validated to ensure correctness and completeness to prevent errors in the specifications.

The process of gathering domain knowledge, specifying requirements and validating these requirements is not as sequential as might be suggested. During the gathering of domain knowledge, requirements can be established or even invalidated even when they have not yet been explicitly written down. In a similar way new requirements or changes to the requirements can occur during each activity of the software engineering process.

Once all the requirements are specified, collected and validated you will have a specification on which the software product can be constructed.


2.1.2 Software design and implementation

The Software design and implementation stage deals with how requirements are implemented and translated into an executable system. This part of the software engineering process is about turning the requirements specifications into architectural designs and constructing the software in accordance with those designs. During the process of creating architectural designs, the architectural design decisions can provide new insights into the requirements on which they are based and thus influence these same requirements [Nuseibeh 2001]. The software design and implementation stage can be split into several distinct phases, during each of which the requirements can be influenced: the design of the structure of the software, the design of the interfaces between the different components of the system, the devising of the different components, and the writing of the actual source code.

Structure of the system

Designing the structure of the system means decomposing the system into sub-systems and refining the requirements. For each sub-system an abstract specification is created which explains what services it must provide and under which constraints it must operate.

Interfaces

Once the sub-systems are designed the interfaces between these systems can be created. An interface specification is an unambiguous description of how a sub-system's interface is to be used without knowing how it is implemented. When you know how the sub-systems need to behave the different components can be designed.

Components

A component can be described as a specific modular part of a system which exposes a set of interfaces while its contents are hidden from the environment it is used in. Once you know what the different components need to do and all the specifications of the system have been described, the software can be constructed.

Source code

Constructing software is about writing source code. In its most basic form source code can be viewed as a list of instructions for the computer to execute. However, constructing source code is about more than just writing computer instructions. During the development of the source code the developer needs to create the data structures, design algorithms and translate the system specifications in a manner which is not only logical and structured for the computer but also understandable for other developers and maintainers. Writing code which is understandable for other developers, or even for your future self, can be done in several ways. One way to communicate what your code is supposed to do is by simply adding comments in a natural language explaining what the code does and why. However, people should also be able to read the source code itself: the intention of what the code does and why should also be reflected in the way the code is presented. Drafting coding style guidelines is one way of creating uniformity in source code.

2.1.3 Software validation

A system needs to be validated to check whether it does what the specifications of the software product say it should do. Different types of testing can be identified all occurring at different stages of the software process.

Development testing is the first testing stage of the software product. As the name suggests testing will be done during the development of the source code and will be done by the software developers themselves. This type of testing can be divided into unit testing and component testing. With unit testing, a unit is a testable piece of source code in which the functional behavior of the code is verified. When the correctness of the individual units has been tested the specific implementation of these units in a component will be verified.

During the next stage, integration testing, the functional correctness of the program is validated and verified. Integration testing can be subdivided into component testing, sub-system testing, and system testing. When a component is integrated into the system it should function as described in the design specification. Groups of components combined make up a sub-system, which should exhibit the properties set out by the requirements specifications. During sub-system testing the focus lies on testing the interfaces between the components to verify the integration of the different components into the system. Finally the entire system will be tested. The different sub-systems and components making up the system can produce unanticipated interactions once combined, which could lead to failures in the software product.

2.1.4 Software evolution

Finally, the software product will have to be maintained. During this maintenance phase the software product will evolve and will continuously be developed on. Reasons for altering the system can be found in changing business requirements, detected errors, or changes to other systems in the software system's environment.

Maintaining control over the software and controlling the alterations are continuous activities which need to be performed on an existing software product. Together with perfecting the existing code and preventing software performance from degrading to unacceptable levels, these are the main activities which need to be performed during this stage of the process [Pfleeger 2001].

During the software construction phase and the software evolution phase attention must be given to developing understandable code which conveys its intentions to the maintainer. Having understandable source code will benefit the maintainability during this final phase of the software engineering process.

2.2 Software measurement

People tend to think of software engineering as being an exact science, which is certainly true for the designing and the production of algorithms in source code. However, unlike most other engineering disciplines, software engineering has an extensive human-intensive component. Therefore parts of software engineering tend to be more similar to the social sciences than to the hard sciences like physics, chemistry or mathematics. Before measurements are done on a software product or on software processes we will first need to identify, characterize, and measure the characteristics that are believed to be of importance and which will need to be studied. Compared with other engineering fields this is a very different approach: in those fields the relevant characteristics of a system, what their properties are, and how these need to be measured are known. In software engineering this lack of standard characteristics makes it important to do a theoretical validation for each of the proposed measurements. The broad scientific base of software engineering makes it possible to use ideas and theories from a wide range of other scientific disciplines. For instance, during the software specification phase of the software engineering process, ideas and techniques from the field of psychology can be borrowed to measure the validity of the provided system requirements questionnaires. Secondly, we need to verify that measuring the chosen characteristic is useful, which is usually done empirically. To aid in describing the measurements used in this research the Goal/Question/Metric paradigm of Basili et al. is used.

2.2.1 The Goal/Question/Metric paradigm

The GQM paradigm is a goal-oriented measurement framework which is used to organize software measurements. This approach guides the choice of metrics by making use of defined measurement goals. GQM takes a top-down approach to defining metrics: first a certain goal is defined, this goal is refined into questions, and based on these questions metrics are devised.


When data is gathered through measurements the interpretation of this data is done bottom-up. The interpretation of the data should be done within the context of the goal, to conclude whether or not the goal is attained.

2.2.2 Goal definition

A measurement goal consists of five parts (a hypothetical example follows the list):

• Purpose: The reason for wanting to specify a goal. You will have to answer the question of what you want to accomplish. Examples could be to improve, to evaluate or to make predictions about the Object of study.

• Object of study: The entity which the measurements apply to. In the context of the software engineering process this will be one, or part of one, of the four fundamental activities of which the software engineering process is comprised.

• Issue: The quality attribute or attributes of the Object of study that should be studied.

• Viewpoint: The person, people or organization who should benefit from the measurements.

• Environment: The context in which the Object of study can be found. This can for instance be a specific product or an organization.
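For illustration, a hypothetical goal phrased with this template (not the actual goal of this research, which is defined in section 5.5.1) could read:

Purpose:          evaluate
Object of study:  the source code produced during software design and implementation
Issue:            understandability
Viewpoint:        the development team
Environment:      the company's PHP code base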

2.2.3 Questions

When a goal has been set the questions of how this goal should be researched can be constructed. Constructing these questions will be done with the help and knowledge of the people for whom this research is being performed. Questions should be clear, unambiguous, and have an answer which can be expressed using a dimension (e.g. weight, size, time) which shows if a goal has or has not been achieved. Adopting knowledge acquisition techniques to capture this knowledge and experience of the domain experts can also provide a greater involvement and participation on their behalf, which in turn should be beneficial in accepting the outcome of the research.

2.2.4 Metrics

When questions have been defined, the metrics can be identified, which provides a traceable dependency to the goal. Metrics have the purpose of providing information about any part of the software engineering process where the subject of the metric can be measured on a standard scale. By using the GQM approach each metric has a clearly defined purpose, and the creation of metrics which are not linked to the goal is prevented.


According to L. Ejiogu software metrics should have the following features [Ejiogu 1991]:

• Simple and easy to use: The adoption and understanding of a metric should not involve a great effort or much time.

• Convincing: A metric should not go against one's intuition.

• Reliable: When a metric is used for a measurement the result should be consistent.

• Adhere to the principles of measurement theory: The resulting measurement should be mathematically valid. For instance, the mathematical operations performed should correspond with the measurement scale.

2.2.5 Measurements

A measurement is a way to associate a value with an attribute or property of a system. In the software engineering process measurements can be taken at every stage of the process, although the measurements done on source code are the best known and are the ones used in this research. A measurement should have a consistent and intuitive interpretation of what is measured and should be understandable regardless of the unit of measure used. Thus, for example, a tall person should be assigned a higher value when measuring height than a short person, although the values can differ depending on whether we use inches, meters or feet. Hence the conclusions based on the measurements should be the same. The previous example makes it clear that before a measurement can be performed a model should be defined. This model represents a particular viewpoint for looking at an attribute or property of a system.

Some measurements consist of other measurements. If you want to know the speed of an object you will first have to measure the distance traveled and next you will have to measure the time it took to travel that distance. A compound measurement like speed is referred to as an indirect measurement. In comparison to a direct measurement, which does not depend on other measurements or system attributes, an indirect measurement involves the measurement of one or more other attributes.

2.2.6 Assessment

Checking for consistency and verification of the measurements is a prerequisite for any measurement system. Consistency refers to the ability to reproduce the measurements using the same measurement methods, in which case the same result should be obtained. Verification of the measurements is obtained when another measurement method produces the same result.

[Figure: GQM hierarchy, showing a single goal refined into questions Q1-Q3, which are in turn answered by metrics M1-M7]

For consistency the repeatability of the measurement system needs to be assessed. This is important because you need to prove that the person doing the measuring is not distorting and influencing the measurements and thus altering the result.

Verification is probably the most indispensable part of any scientific research. When a measurement has taken place you will need to check whether you have really measured what you set out to measure.


Chapter 3

Context

Contents: Case Study, Intent of the Software Product

3 Context

In collaboration with the software developers at IPPZ an outline of their internal quality guidelines was created; more detail on these guidelines can be found in section 5.1. These guidelines reflect the quality requirements to which the internally created and accepted software should adhere.

3.1 Case study

This research is confined to one software product which has approximately 60 individual namespace packages containing 2089 PHP files of in-house developed code and 46 namespace packages with supporting functional code containing 10K+ PHP files of externally produced code. Part of this externally produced code is what makes up the Symfony2 framework. For this research only the PHP files in the maintainable area of the application will be measured. This part of the application contains the in-house developed bundles and third-party bundles, also known as the source. The vendor, or core, part of the framework, which comprises the majority of the application, is outside of the maintainable area of the application.

The source of the project follows the PSR-0* auto-loading naming convention and directory path. A file, and the namespace which is derived from its location, have the following structure:

\VENDOR_NAME\BUNDLENAME\(NAMESPACE\)*CLASS_NAME

For this research the entire source path will be measured. Each PHP file in every bundle of every vendor directory will be measured and analyzed.
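As an illustration of this mapping, with hypothetical vendor and bundle names that are not taken from the actual product, a class and its expected location under PSR-0 could look as follows:

<?php
// Hypothetical bundle class, used only to illustrate the PSR-0 convention.
namespace Acme\TreatmentBundle\Controller;

class DossierController
{
    // Under PSR-0 the fully qualified name \Acme\TreatmentBundle\Controller\DossierController
    // maps one-to-one onto the path src/Acme/TreatmentBundle/Controller/DossierController.php,
    // i.e. \VENDOR_NAME\BUNDLENAME\(NAMESPACE\)*CLASS_NAME.
}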

3.2 Intent of the software product

The intent of the software product is to provide a platform for clients, patients, practitioners and professionals in the general health care domain to communicate, share information and offer treatment. Systems like these are better known as Patient Relationship Management (PRM) systems.

Three different kinds of users can be identified, for each of which the system acts differently. First there are the clients, who receive the care and treatments that are being reported on. Secondly there are the (medical) professionals who deliver and provide care to the clients. And finally a special role is reserved for the owner of a medical organization.

The owner of an organization has the ability to create so-called health-care spaces. A space can be seen as a department in a hospital or as a small treatment center. Professionals can be asked, or request an invitation, to join such a space and provide treatment for clients. Clients can also be invited to join such a space or join of their own accord, depending on whether the space is publicly accessible.

The main focus of this product lies with the clients or patients, around whom a complete dossier is being created. They can communicate and share information with a professional, who will also be granted access to this dossier and will be able to add information and devise treatments. A connection can only be established with the approval of the client, so the information is always in the hands of the client.

The platform offers a private messaging system which can be used to facilitate secure communication between the different parties. All the communication is encrypted and stored on a separate server. Information can also be shared using a community-like feature of the system where people can follow and keep track of each other. Files which clients or patients deem important can be included in their personal dossier, which can also be complemented by the professionals who provide health care. In a health-care space a library is included which can be filled with information relevant to the connected clients. An important feature for the professionals is an overview of active patients whose dossiers can be viewed. The secure nature of the messaging system also allows for simple sharing of medical information and data between professionals. For the owner of an organization the platform offers an administrative section which can be used to handle health-space dependent settings.

* In chapter 1 the PSR guidelines were mentioned. These guidelines are introduced by the PHP Framework Interoperability Group. This group consists of representatives of various application and framework developers, many of whom are the initial or main contributors to these frameworks and applications. The proposed guidelines are to be considered best practice for application and framework development. The PSR is divided into four distinct standards: PSR-0 deals with the autoloading of classes, PSR-1 describes a basic coding standard, PSR-2 covers the coding style guide, and PSR-3 deals with logging. Only the first three influence the understandability of code and could be taken into account when deciding which guidelines to use for this research. [PHP-FIG 2009]


Chapter 4

Methodology

Contents: Research Approach

4 Methodology

The idea of trying to assess the quality of software is not new and can be studied from various angles. The different factors which affect the quality of a software product are diverse, and from the viewpoint of a development team reasoning about these factors will lead you into the realm of the social sciences. The GQM method will be used to provide a structured manner of analyzing the gathered information and of reasoning about the choices which have been made. The main research question is formulated as a GQM goal and will form the basis of this research.

4.1 Research approach

The intent of this Master Thesis is to translate the in-company guidelines into a measurable system and to check whether these guidelines are useful. The GQM method is used in this research to structure the report but not to guide the research since there are basic differences between a standard GQM study and this research.

Inventory

For this research an inventory of the software development guidelines present in the company needed to be taken. Since these guidelines already existed, the metrics resulting from these guidelines needed to be derived. This is where the difference lies in contrast to a standard GQM study. By making use of the GQM approach the reasoning behind these guidelines will be uncovered.

Construct measuring system

To construct a measuring system a scale for the metrics has to be created. Defining a scale for a metric is done by giving meaning to the value of the measurement of a metric. These value ranges need to be explainable and will need to make sense in the context of the provided goal.

Construct analysis program

A program to do the statistical analysis on the source code needed to be created. Such a program will be able to analyze the source code and perform the measurements for the provided metrics. The results of executing this analysis will be collected and used to analyze the GQM goal.

Analyze results

The result of the analysis program will be analyzed to answer the questions posed in this research.


Present conclusion

Finally the findings of this research were presented to the development team. During this presentation a discussion followed about whether these results were to be expected and which lessons could be learned.


Chapter 5

Software guidelines and metrics

Contents: In-company Guidelines, From Guidelines to Metrics, Weighing of the Metrics, Omission of Metrics, GQM Design

5 Software guidelines and metrics

This section is about the in-company guidelines and how they relate to the metrics which will be measured. The in-company guidelines are created in conjunction with the development team and are the starting point which will lead to the actual construction of a GQM model. The in-company guidelines form the selection criteria for the various metrics. Once the metrics are defined, the validation of the weighing of the different risk categories of each metric is given.

5.1 In-company guidelines for application development

Guidelines are the product of a specific intention which needs to be satisfied but which is hidden from view and has not been documented as such. Creating a GQM design for automated testing of these guidelines will make the intent of these guidelines clear. These guidelines imply certain metrics. These metrics answer questions which have not yet been explicitly stated but which will aid in achieving the goal of the GQM design.

During the development of the software product the following in-company guidelines should be taken into account.

Guidelines

G1 Classes should be about one thing.

A class in software construction is a template of the properties, attributes and methods of an object. All these aspects of a class should be concerned with the objective of what a specific object is about.

G2 Classes should not contain too many attributes. Classes should be kept small and simple.

A class which is comprised of too many attributes is believed to be hard to understand. Objects with too many attributes are often the result of poor design or inexperience of the developer. Such classes will become too lengthy and hard to understand.

G3 Methods should be easy to understand.

A method is a function of a class which can be executed. When a method becomes too complex for a person to understand, the software code will be less maintainable and errors in the code are less likely to be identified.

G4 Do not use too much nesting in a method.

This guideline is a specific implementation of the previous one. Through firsthand experience every developer of the development team has had to deal with methods which were overly complex due to too much nesting.

G5 Only object orientation is allowed.

This is not so much a guideline as a rule which needs to be followed. PHP did not start out as an object-oriented programming language, which means that it still allows various syntax styles. This guideline will not have to be checked but does ensure that all other guidelines are possible.


5.2 From guidelines to metrics

In collaboration with the development team, appropriate metrics are sought which should be indicative of the in-company guidelines.

In the following tables the metrics for each guideline are provided, together with a brief description of each metric.

Guideline G1: Classes should be about one thing.
Metrics:
LCOM: Lack of COhesion in Methods

The basic idea behind this metric is to establish the coherence of a class. Like the name suggests, it deals with individual objects. When an object facilitates functionality which is not logical for that object to exhibit, or it has attributes not relevant to that object, it is most likely that this object is actually comprised of multiple objects. Think of an object which can manipulate two distinctly different other objects, where the methods to manipulate these objects do not have any overlap. It is clear that this object is actually two rolled into one. The methods have no relationship with each other and it is this lack of cohesion which is measured.

The basic LCOM(C) evaluation as described by Mäkelä [Mäkelä 2006] will be used. In the calculation all instance variables and only public methods will be used and the constructor of a class will be ignored.

Citation of the explanation of the measurement by Mäkelä: “The LCOM value for a class C is defined as:

LCOM(C) = 1 - |E(C)| / (|V(C)| x |M(C)|)

where V(C) is the set of instance variables, M(C) is the set of instance methods, and E(C) is the set of pairs (v,m) for each instance variable v in V(C) that is used by method m in M(C). In the following, we consider 'use' to also include indirect use via other methods.”

This metric was included during the theoretical phase of this research. The reasoning was that the understandability of a class can be diminished when the cohesion in that class is low. This metric might aid in constructing a more complete software model for the development team.

The value will range from 0 to 1, where 1 indicates no cohesion and 0 indicates a completely coherent class.
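A minimal, hypothetical PHP class (not taken from the measured code base) can serve to illustrate the calculation:

<?php
// Hypothetical class used only to illustrate the LCOM(C) calculation.
class ReportMailer
{
    private $report;   // instance variable 1
    private $mailer;   // instance variable 2

    public function __construct($report, $mailer)   // the constructor is ignored
    {
        $this->report = $report;
        $this->mailer = $mailer;
    }

    public function render()    // uses only $report
    {
        return $this->report->toHtml();
    }

    public function notify($to) // uses only $mailer
    {
        $this->mailer->deliver($to, 'Your report is ready.');
    }
}
// V(C) = {report, mailer}, M(C) = {render, notify},
// E(C) = {(report, render), (mailer, notify)}, so
// LCOM(C) = 1 - 2 / (2 x 2) = 0.5: the two halves of the class share nothing,
// a hint that the rendering and the mailing concern could be separated.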


Guideline G2: Classes should not contain too many attributes. Classes should be kept small and simple.
Metrics:
NIV: Number of Instance Variables

A very simple metric to define, which is easily measurable. This metric counts the number of distinct attributes of a single class definition. A class which has many different attributes can be harder to understand due to the short-term memory in which a person has to hold these values when trying to understand the functionality of the code. The maximum number of items varies from person to person, so a strict limit is used which should be feasible for everyone to recall. This metric was also chosen as an indication of the amount of complexity in a class.

The value will range from 0 to N where N is a positive integer.

cLOC: Lines Of Code of a class

The summation of the metLOC measurements of a single class. This metric is chosen to quantify the understandability of a class. When a class is made up of too many lines of code a person will not be able to understand the full scope of what the class is all about. The same preconditions hold for this metric as for the metLOC metric.

The value will range from 0 to N where N is a positive integer.

An abstract class with protected abstract methods is an example of a class which has a value of zero.

LLEN: Length of each line of code

The total number of characters of a single line of code, excluding leading and trailing white space characters and not counting inline comments. The comprehensibility of a single line of code depends on the length of that line. Robert C. Martin explains in the section about horizontal formatting why short lines are preferable over long lines. The reasoning he provides is empirical and shows that developers strive to keep lines of code short. [Martin 2008]

PSR-2 code formatting will have to be executed before measurement. The value will range from 0 to N where N is a positive integer.

Having a length of zero is more a mathematical convenience than an actual valid measurement. Can you speak of a line of code when there is no code in that line?

cCC: Cyclomatic complexity of a class

The summation of the metCC measurements of a single class. This metric quantifies the overall complexity and thus the understandability of the class. When each method in a class is created with the open-close principle in mind and each method only has a single responsibility the likelihood of having methods with a high cyclomatic complexity is small although the combined complexity might be large. Including this metric falls within the same reasoning as including the cyclomatic complexity of each method. The understandability of the code can be negatively influenced by having a high cyclomatic complexity for a class.

The value will range from 1 to N where N is a positive integer.

A value of 0 could be conceivable since having no lines of code is also possible. When a value of zero occurs this metric will be deemed not assessable.

CBO: Coupling Between Object classes

This metric was first introduced by Chidamber and Kemerer in 1994 [Chidamber 1994]. The use of one object in another object is common practice in Object Oriented programming languages. As Chidamber and Kemerer clearly point out: “Excessive coupling between object classes is detrimental to modular design and prevents reuse. The more independent a class is, the easier it is to reuse it in another application”.

Coupling is easily explained with an example: A CarFactory object will have to have knowledge of the concept of a Car object. So there is coupling between the Car object and the CarFactory object. However when this CarFactory object has knowledge of many more objects which have nothing to do with creating a Car you are dealing with an object which is too smart and can probably do too many other things like creating a Bike.

This metric was also included during the theoretical phase of the research. The choice to include this metric was to point out possible design problems during software construction. A poorly constructed class will diminish the understandability of that class.
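A small, hypothetical PHP sketch of the CarFactory example illustrates the idea; the exact counting rules (whether type hints, instantiations and inheritance all contribute) are an assumption here, as they are defined by the analyzer rather than in this description:

<?php
// Hypothetical classes illustrating coupling between object classes (CBO).
class Car {}
class Bike {}
class Invoice {}

class CarFactory
{
    public function build(): Car            // couples CarFactory to Car
    {
        return new Car();
    }

    public function buildBike(): Bike       // extra, unrelated coupling to Bike
    {
        return new Bike();
    }

    public function bill(Car $car): Invoice // and to Invoice
    {
        return new Invoice();
    }
}
// Counting one per distinct class referenced, CarFactory has a CBO of 3.
// Only the coupling to Car follows from the factory's purpose; the couplings
// to Bike and Invoice are the kind of "too smart" design the metric flags.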


Guideline G3: Methods should be easy to understand.
Metrics:
metLOC: Lines Of Code of a method

As the name says, it is the number of lines of code which make up a method. A method needs to be quickly understood when a developer is confronted with it. A method which is too lengthy is harder to fathom. Just like with the Indentation Level metric, the code will have to be formatted according to the PSR-2 guideline before this metric can be measured. The value will range from 0 to N where N is a positive integer.

A method can indeed have zero length. This is usually a stub method. It is a matter of taste whether this constitutes bad design; however, an empty method does not diminish the understandability.

metCC: Cyclomatic complexity of a method

McCabe's cyclomatic complexity is a well-known metric in software engineering [McCabe 1976]. Like the name suggests, you measure complexity with this metric; in this context, the functional complexity of a piece of software. Simply explained, you count the number of logical paths which can be taken through a selected piece of code. This is one of those metrics which you read about in books and learn about in college but do not actively consider every time you write code. Having a high cyclomatic complexity can be an indication of poor understandability of the code. The value will range from 1 to N where N is a positive integer.

The control flow of a method will always have at least 1 logical path. Even when nothing happens in a method it has still followed this one path of doing nothing.
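A small, hypothetical PHP method illustrates the count; the convention assumed here (one plus the number of decision points, with boolean operators counted as well) is one common way of computing McCabe's metric, the exact rules being defined by the analyzer described in chapter 6:

<?php
// Hypothetical method illustrating the cyclomatic complexity (metCC) count.
class DossierFilter
{
    public function visibleEntries(array $entries, bool $isProfessional): array
    {
        $visible = [];                                   // base path: 1
        foreach ($entries as $entry) {                   // +1 (loop)
            if ($entry->isShared() || $isProfessional) { // +1 (if) and +1 (||)
                $visible[] = $entry;
            }
        }
        return $visible;
    }
}
// 1 + 3 decision points = a cyclomatic complexity of 4 for this method.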


Guideline G4: Do not use too much nesting in a method.
Metrics:
metIL: Indentation Level of a method

During visual inspection of source code the indentation of the code is one of the first things that provides a clue about the design. When a method looks like a mountain range on its side it is likely that there is too much going on in that method. But indentation is only introduced to make code more readable, and if you want to you can write all of your code on one line. For this metric to be of any use, the code will have to be formatted in a uniform manner. This uniform manner is captured in the PSR-2 guideline, and before investigation the source code will have to be formatted according to this guideline. The consensus for choosing this metric was based on the notion that a function should not do more than one thing, and indentation can be one of the indications that too much is going on. This finds its origin in Robert C. Martin's work [Martin 2008], where he explains, and I quote: “Functions should do one thing. They should do it well. They should do it only.” The value will range from 0 to N where N is a positive integer.

SND: Statement Nesting Depth

The nesting depth of a statement is a measurement of the number of logical operators in that statement. In a single check, like "does A equal B", there is only one operator. When you do an extra check, like "does A equal B or does A equal C", you increase the nesting depth by 2. Statements can become as long and difficult as you would like, but it is apparent that the understandability will decrease when the nesting depth increases. It is often a sign of bad design when statements have too large a nesting depth.

Experience in the development team was the reason to include this metric in the proposed software model, together with the nature of the PHP programming language and the developers who use this language.

The value will range from 1 to N where N is a positive integer. When a method does not contain a statement which accepts logical operators the metric will be judged as not assessable.
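A short, hypothetical PHP method (assuming PSR-2 formatting and the counting conventions described above) illustrates both metrics of this guideline:

<?php
// Hypothetical method illustrating the indentation level (metIL) and the
// statement nesting depth (SND) as described above.
class SpaceGuard
{
    public function canJoin($client, $space): bool
    {
        foreach ($space->members() as $member) {                                           // indentation level 1
            if ($member === $client && ($space->isPublic() || $space->hasInvited($client))) { // indentation level 2
                return true;
            }
        }
        return false;
    }
}
// metIL: the deepest statement sits two levels inside the method body, so the
// indentation level of this method is 2 (an if-statement inside a loop).
// SND: the if-condition combines four operands with three comparison/logical
// operators (===, &&, ||), giving a statement nesting depth of 3.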


5.3 Weighing of the metrics

In the development team of the company where this research is being conducted there is a general agreement on what constitutes good software construction. The experience of every developer helps to form opinions of what quality is in a software system. These opinions are based on ideas and paradigms found in literature combined with personal taste. And it is personal taste that makes automating quality checks based on these rules challenging. The value of most metrics will be instantly observable when looking at the source code. To provide a weighing of the measurements on the source code a scale is introduced which serves as an indication of the risk factor. This risk factor expresses to what extent a metric value is assumed to be detrimental to the understandability. Four levels of severity are introduced, which could have been named ++, +, - and - -; however, more descriptive naming has been chosen, resulting in the levels no risk, acceptable risk, high risk and unacceptable risk. The risk level for a metric is a quantitative value or value range which provides the weighing and serves as a judgment of the evaluated metric.

The no risk level indication describes the value or range in which no decrease in understandability is to be expected.

The acceptable risk level is at the high end of what is acceptable for the source code to exhibit as a property described by the provided metric.

The high risk level is an indication that the specification or property of the provided metric has a negative influence on the understandability of the source code.

The unacceptable risk level is the value or value range at which a decrease of understandability will be obvious to observe in the source code.

A fifth level can be identified which indicates that the metric is not present in the provided source code. This level will be indicated as the not assessable level although technically this is not really an indication level for a metric.

5.3.1 Validation of the metrics' no risk level

Personal taste is hard to quantify, but averaging common traits of information processing in human beings is feasible. In 1956 George Miller [Miller 1956] researched the capacity of people to process information. When the information which a person needs to process consists of only one dimension, which in the case of the NIV metric would be the attributes of a single object class, the number of items which a person should be able to cope with is 7 with a standard deviation of 2. That is why the NIV no risk value is set to have a maximum value of 5, as can be seen in Table 2. This is at the lower end of what a person should be able to retain in memory.

This idea also lies behind the values chosen for the SND metric. Since a single statement compares two items, having a nesting depth of four corresponds to a maximum of 5 items compared, which is at the low end of the 7 ± 2 scope. This can only be achieved with a comparison of the form (A op B) op (C op D) op E, where 'op' is a logical operator.

When dealing with more than one dimension a person's capability to process information increases. Having 2 dimensions does not double the capacity but increases it by about half of the average value. The Coupling Between Objects (CBO) metric's value is based on this idea. The object class will have its own methods and attributes and will use the methods of other classes. When other objects are referenced, these other objects are the second dimension and thus increase the information processing capacity.

The metLOC metric should have been set to 5 for the no risk threshold for the same reason as described above for the NIV metric; however, this met with resistance from the development team. The argument for increasing this value was that a method should read like a little story, and stories are easier to recall than just memorizing items. This was a compelling argument which is also affirmed by mnemonics [O'Brien 2011]. The same goes for the cLOC metric: a class can be seen as a collection of small stories.

The metIL metric is one that sparked a lot of discussion in the development team. A high indentation level is not so much a design flaw by itself but merely an indication, as a side effect, of bad design as described in the previous paragraph. The strict values for no risk are based on the idea that every extra indentation level above 1 can always be brought back down to 1 by mere refactoring [Fowler 1999].

For the LCOM metric risk values an even distribution was chosen, where the no risk level is indicative of a coherent class. An even division of the LCOM scale would yield a no risk level end value of 0.25; however, since the scale is kept to one significant decimal, the value is rounded to 0.2.

The cyclomatic complexity metric of a class has been well documented and the risk level values for this metric are taken from literature. The SIG maintainability model formed the base for these values [Heitlager 2007]. The no risk level value ranges from 1 to 10. These values were constructed at the SIG by comparing many different software projects.

For the cyclomatic complexity of a method the class cyclomatic complexity forms the basis of the values which will be chosen. It was decided that having three methods at the maximum of the no risk or acceptable risk level should still fall within the corresponding range of the cCC metric. Dividing the cCC values by three therefore delivers the values chosen for the metCC risk levels. In case of the no risk level this leads to values ranging from 1 to 3.

The no risk value range for the line length metric (LLen) is taken from the default setting of the IDE used in the company. Robert C. Martin describes choosing an acceptable value ranging from 80 to 120 characters [Martin 2008].

5.3.2 Validation of the acceptable risk level

As explained in the previous paragraph the no risk level is at the low end of what is considered an acceptable value or value range. The acceptable risk level starts where the no risk level ends. For the NIV metric this means being in the normal range of values of the 7 ± 2 scope described by George Miller: starting from 6, because the no risk value ends at 5, and ending at 7, which is the mean of the 7 ± 2 scope.

The SND metric again follows this same reasoning. With n operands you will have n-1 logical operators. The values will correspond with the NIV values minus one. This will also be the case for the high risk level and the unacceptable risk level.

Risk level values per metric:

Metric    No risk        Acceptable risk   High risk      Unacceptable risk
NIV       0-5            6-7               8-9            >9
metIL     1              2                 3              >3
metLOC    <=15           16-25             26-50          >50
cLOC      <=100          101-150           150-300        >300
LLen      <=120          -                 -              >120
SND       1-4            5-6               7-8            >8
metCC     1-3            4-7               8-10           >10
cCC       1-10           11-20             21-50          >50
LCOM      >0.0, <=0.2    >0.2, <=0.5       >0.5, <=0.7    >0.7, <=1.0
CBO       0-7            8-10              11-14          >14


For the CBO metric the acceptable risk level needs to start where the no risk level ends and the range will need to end based on the same idea as described in the previous paragraph. Starting at 8 because the no risk level ended at 7 and finishing at 10 because this is 7 times 1.5, where 1.5 is the correction value for having two dimensions as described above.

The metLOC metric's acceptable risk level value range was chosen just as arbitrarily, but with the same reasoning in mind as the no risk level value range: a story with the largest number of lines which is still considered acceptable is judged to be understandable for most people.

Since the cLOC metric tries to follow the same reasoning as was provided for the metLOC values, its acceptable risk level values are based on experience from the development team.

The metric which describes the indentation level of a method (metIL) was set to 2 for the acceptable risk level. In the explanation of the no risk level it was mentioned that with mere refactoring the level can always be brought down to 1. If one statement resides in the body of another statement this code should still be understandable; a common example of this is an if-statement inside a for-loop.

Following the even distribution of risk level values for the LCOM metric, the acceptable risk level starts where the no risk level ends and, since the scale is evenly distributed from 0.0 to 1.0, the acceptable level ends at 0.5. So measured metric values will be in the range from above 0.2 up to and including 0.5.

The SIG describes a cyclomatic complexity of a class between 11 and 20 as being of a moderate risk level. This level corresponds with the acceptable risk level discussed in this research.

For the cyclomatic complexity of a method the class values will be divided by three resulting in values ranging from 4 up to 7.

The line length metric has no acceptable risk level since this metric is either not a risk at all or falls in the unacceptable risk level.

5.3.3 Validation of the high risk level

The high risk level is to the acceptable risk level what the acceptable risk level is to the no risk level. The high risk level will start where the acceptable risk level ends.

For the NIV metric this means being at the high end of George Miller's 7 ± 2 scope: starting at 8, because the acceptable risk level ended at 7, and running up to 9, which comes from 7 + 2.

The reasoning for the values of the SND metric's high risk level was already provided in the previous paragraph: starting at 7, as the acceptable risk level ended at 6, and, following the described formula n - 1 for counting operators with n = 7 + 2, ending at 8.

The CBO metric follows the same reasoning for the high risk level as was described for the acceptable risk level: starting at 11 and ending at 9 times 1.5, which is rounded to 14.

For the metLOC metric the high risk level value range is thought to represent source code which will probably benefit from being refactored; however, some framework components have a lengthy syntax and need a lot of statements, and thus lines, to do something. Like with the previous levels this range starts where the acceptable risk level ends and the end of the range is just as arbitrarily chosen. The same goes for the cLOC metric.

The metIL metric high risk level value of 3 follows from the choice of having an acceptable risk level of 2 and corresponds with the general perception in the development team of what complex code for this metric will look like.

As mentioned above, the LCOM risk levels follow an even distribution from 0.0 to 1.0. The high risk level will thus range from above 0.5 up to 0.7. Dividing this scale evenly would provide an end value of 0.75; as with the no risk value, this figure is rounded because of the number of significant digits. The cyclomatic complexity for a class was provided by the SIG, and for the high risk level it ranges from 21 up to 50.

For the high risk level and the unacceptable risk level a decision was made to not divide the cyclomatic complexity of a class by three but by five. This decision makes the high risk level range shorter, resulting in more methods falling within the unacceptable risk level.

Just like with the acceptable risk level value the high risk level for the line length does not exist.

5.3.4 Validation of the unacceptable risk level

The unacceptable risk level is to be considered as every value greater than the highest high risk level value.

For the NIV metric this translates into all values greater than 9. The average person should be able to memorize up to 9 attributes at a time but every higher amount will most likely be problematic for most.

The SND metric poses this same problem for most people. Having more than 8 operators will mean having more than 9 operands, and this is an amount with which most people will not be able to work.

Having a coupling of more than 14 objects in another object should be considered unacceptable. A class which needs this many other objects will most probably try to do too much. This is a clear indication of a design flaw and refactoring will most likely be needed.

Whatever the reason is for exceeding the acceptable risk level of the metLOC metric, or even the high risk level, code which falls within the unacceptable risk level for this metric will need to be split into smaller pieces to increase understandability. This is a clear indication of wanting to do too much in a single procedure.

For the amount of Lines Of Code in a Class the same reasoning applies. A class which wants to do too much will be very hard to understand. All this functionality will have to be grasped by a person for whom this will possibly be too much to handle. As a side effect people will become hesitant to change large classes or methods fearing they could break code that might be used in places which they are not aware of.

A method which has an indentation of 4 or more will benefit from refactoring. The complexity of such code will make it hard to fathom and also makes testing the code difficult. What was said for the acceptable risk level and the high risk level applies even more for this risk level: bring methods which fall in this category down to a lower risk level by extracting methods from the different indentations. Each indentation block will probably be a separate method all by itself.

The LCOM metric has the only finite unacceptable risk level range: every value greater than 0.7, up to 1.0, will be considered unacceptable. A class whose attributes do not interact with each other in the methods of the class to which they belong has no functional relationship between its parts and could probably be separated into different classes.

Having too high a cyclomatic complexity for either a method or a complete class will make that method or class hard to understand. The number of decision paths which can be taken is too large, and code which falls in the unacceptable risk level for these metrics should be redesigned.

A measurement of the LLEN metric which does not fall within the no risk level value range will fall within the unacceptable risk level category. This is an all or nothing risk level metric: a line of code is either too long or it is not.

5.3.5 The not assessable level

When analyzing source code it is possible that a metric is not present in a file. This applies to the following metrics. For the statement nesting depth (SND) metric it can occur that, when no statements are present in the source code, the value of this metric cannot be established. The same goes for the amount of indentation which is found in the source code: if no expressions are used which cause the indentation level to increase, this metric is not assessable. Another interesting metric is the lack of cohesion of methods metric. When a class does not contain class attributes or does not contain methods the cohesion cannot be calculated. The lack of methods will also make the cyclomatic complexity not measurable.

And then there are some metrics which are always assessable. The maximum line length is understandably always present: even when a file is completely empty it still has at least one line, which may have a length of zero, and zero is still a valid length. The coupling between objects is another metric which always has a value. When no coupling is present the coupling is zero, which is a perfectly reasonable amount to have.
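
To make this concrete, the sketch below shows one way such an always-present measurement could be obtained. It is written in the same language as the analyzer (Rascal), but it is an illustrative assumption and not the analyzer's actual code.

module examples::MaxLineLength

import List;
import String;

// Illustrative sketch: the maximum line length of a source file is always
// defined, even for a completely empty file, which still consists of a
// single line of length zero.
public int maxLineLength(str source) {
    return max([size(line) | str line <- split("\n", source)] + [0]);
}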

Determining whether a metric is assessable is very straightforward. You start by assessing the no risk value range. If the measurement does not fall into that category, you move on and evaluate the acceptable risk level. If the measurement does not satisfy that risk level either, you continue with the high risk level and eventually end at the unacceptable risk level. If none of these levels is satisfied, the metric is not assessable. How this translates into Rascal code is illustrated in Figure 5.3.1.
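
As a rough illustration of this cascade (and not a reproduction of Figure 5.3.1), the Rascal sketch below tries each risk level in turn and falls back to the not assessable level; the Range type and the function names are assumptions made for this example only.

module examples::RiskLevels

// One value range per risk level; the concrete bounds per metric are
// defined elsewhere and are not assumed here.
data Range = range(real lo, real hi);

private bool inRange(real v, Range r) = v >= r.lo && v <= r.hi;

// Try the four risk levels in order; a measurement which matches none of
// them is reported as not assessable.
public str riskLevel(real measured, Range noRisk, Range acceptable,
                     Range high, Range unacceptable) {
    if (inRange(measured, noRisk)) {
        return "no risk";
    }
    if (inRange(measured, acceptable)) {
        return "acceptable risk";
    }
    if (inRange(measured, high)) {
        return "high risk";
    }
    if (inRange(measured, unacceptable)) {
        return "unacceptable risk";
    }
    return "not assessable";
}

For the LCOM metric, for example, riskLevel(0.85, range(0.0, 0.2), range(0.2, 0.5), range(0.5, 0.7), range(0.7, 1.0)) yields "unacceptable risk"; only the upper range of 0.7 up to 1.0 comes from the thresholds discussed above, the lower three ranges are placeholders.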

For the LCOM metric the not assessable level is also used when an object contains attributes which are only accessed through their own public accessor methods, whilst none of these attributes interact with any of the others. These kinds of objects are common in object-oriented design and should not be taken into account, since this is a design choice and not a matter of decreased understandability.
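
A hedged sketch of these preconditions is given below; the way a class is modelled here (sets of attribute and method names plus a uses relation) is an assumption made for the example, not the analyzer's actual representation.

module examples::LcomAssessable

import Set;

// LCOM is only assessed when a class has both attributes and methods, and
// when at least one attribute is used by a method other than its own public
// accessors; accessor-only attributes are treated as a design choice.
// uses: which method reads or writes which attribute.
public bool lcomAssessable(set[str] attributes, set[str] methods,
                           rel[str, str] uses, set[str] accessors) {
    if (isEmpty(attributes) || isEmpty(methods)) {
        return false;
    }
    return !isEmpty({ m | <str m, str _> <- uses, m notin accessors });
}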

5.4 Omission of metrics

In the last forty years many different metrics have been introduced to measure software systems. Some are applicable to programming languages in general whilst others focus on just Object-Oriented or Procedural programming languages.

The most widely used Object-Oriented metrics suite is the one defined by Chidamber and Kemerer [Chidamber 1994]. The CBO and the LCOM metrics which are used in this research stem from this suite. So why not include the other four metrics which they describe?

The first of these is the Weighted Methods per Class (WMC) metric. This metric is actually part of the proposed metrics, but here it is called cCC; the name Weighted Methods per Class is poorly chosen, since no weighting is applied at all, which would make the name confusing to use. The second is the Depth of the Inheritance Tree (DIT) metric, which indicates to what extent a class is influenced by the attributes and methods of its ancestors. The greater complexity caused by a deep inheritance tree can impair readability and understandability, but this impairment is no longer an issue thanks to the advanced development aid of modern IDEs. The same goes for the Number of Children metric and the Response For a Class metric: the usage of classes and methods can easily be investigated with modern IDEs. The problems described by these metrics are therefore no longer an indication of decreased readability or understandability.

The original Maintainability Index (MI) proposed by the SEI is composed of four different metrics, of which only two are partly used for this research [Foreman 1997]. The MI is formulated as follows:

171 - 5.2 * ln(aveV) - 0.23 * aveV(g’) - 16.2 * ln(aveLOC) + 50 * sin(sqrt(2.4 * perCM))


The Cyclomatic Complexity (aveV(g’)) and the Lines Of Code (aveLOC) metrics have already been mentioned. The Halstead Volume (aveV) is based on the number of operators and operands, from which a relative size for the entire system can be calculated. The use of size metrics for problem detection in frameworks has been questioned and refuted [Demeyer 1999], based on the observation that the parts of a software system which violate these metrics do not cause problems in the natural evolution of that system.

The average percentage of lines of comments per module (perCM) is optional in the calculation. It is only used when the provided comments actually relate to the source code. The idea behind this metric is that commented code is easier to understand than code with no comments at all. However, as a fellow software developer once wrote in a blog post: “Don't explain bad code, fix it!” [Dohms 2012].
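
Purely to make the quoted formula concrete (the MI itself is omitted from this research), a small Rascal sketch is given below. It assumes that Rascal's util::Math library provides ln, sin and sqrt, and that perCM is supplied as a percentage; none of this is part of the analyzer built for this thesis.

module examples::MaintainabilityIndex

import util::Math;

// Illustrative only: the classic four-part MI as quoted above, with the
// per-module averages (Halstead volume, cyclomatic complexity, lines of
// code and percentage of comment lines) supplied by the caller.
public real maintainabilityIndex(real aveV, real aveVg, real aveLOC, real perCM) {
    return 171.0 - 5.2 * ln(aveV) - 0.23 * aveVg
                 - 16.2 * ln(aveLOC) + 50.0 * sin(sqrt(2.4 * perCM));
}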

The usefulness of the MI has been analyzed and reevaluated by the Software Improvement Group in the paper “A Practical Model for Measuring Maintainability” [Heitlager 2007]. In their findings they conclude that the arbitrary nature of each part of the MI makes it unfit to support any valid conclusion about a software system's maintainability. The maintainability model which they propose contains many of the same metrics that are used in this research. They do include one metric in their model which is omitted in this research: the duplication of code metric. In their paper they describe what duplication is and state that it is detrimental to maintainability; however, they never provide any proof of why it is so hazardous to maintainability.


5.5 GQM design

From the guidelines and the resulting metrics the actual intent behind these guidelines can be resolved. To aid in this task a GQM design is drawn up.

5.5.1 GQM goal

The goal in this research needs to consist of the five parts described in paragraph 2.2.2 Goal definition.

Goal Purpose

The purpose of this research is to determine whether it is useful to create a specific software model which is based on the in-company guidelines on software development.

Goal Object of Study

The object which is studied is the result from the endeavors made during the software construction phase.

Goal Issue

The issue addressed is the monitoring of the guidelines which are introduced to improve the understandability.

Goal Environment

The environment in this context is the actual software product which is being produced.

Goal Viewpoint

The only people who are affected by this research are the members of the development team.

5.5.2 GQM questions

The questions for the GQM design are the result of finding the commonalities between the described metrics within the context of the defined GQM goal. From the GQM goal we take the Goal Issue, which is centered around the understandability of source code and which will be the main focal point of the GQM questions. From the metrics two distinctly different components of understandability are derived: information processing/overload and logical reasoning/complexity.

Information processing/overload

This deals with the degree to which source code can be understood by a person given its sheer volume: does the amount of source code, and thus the amount of information a person has to process, affect the understandability?

Logical reasoning/complexity

This deals with the degree to which source code can become too complex for a person to understand: does this complexity diminish the understandability of the source code and a person's ability to reason about it?

This leads to the following two questions being formulated:

• How does the ability to process information influence the understandability of the source code?

• How does complex source code influence a person's ability to understand it?

The complete GQM design is summarized below.

Goal: Is it useful to create a language specific software model based on the in-company guidelines (purpose) to monitor and improve the understandability (issue) of a company's software product (environment) during software development (object) from the viewpoint of a company's development team (viewpoint)?

Questions:

Q1 (Information processing/overload): How does the ability to process information influence the understandability of the source code?

Q2 (Logical reasoning/complexity): How does complex source code influence a person's ability to understand it?

Metrics:

M1.1 NIV (Number of Instance Variables): How many attributes in a class are acceptable for the understandability of that class? By determining the number of attributes in a class, the comprehensibility of that class can be determined.

M1.2 CBO (Coupling Between Objects): How many other classes are allowed to be used in a class without diminishing understandability, aside from diminishing reusability? Making use of different classes creates a more complex story of what is going on inside a class.

M1.3 metLOC (method Lines Of Code): How many lines of code in a method are acceptable with regard to understandability? A method in a class is a tiny story which needs to be understood. When this story becomes larger, the readability is reduced.

M1.4 cLOC (class Lines Of Code): How many lines of code in a class are acceptable with regard to understandability? A class needs to be quickly understood. Having lots of lines of code reduces the ease with which a class can be understood.

M1.5 LLEN (Line LENgth): What is an acceptable length for a single line of code? In practice an editor has a maximum page width after which sentences continue on the next line or fall off the screen. Preventing this maximum width from being exceeded should increase the readability.

M2.1 SND (Statement Nesting Depth): Is the complexity of a statement a useful indication for understandability? A statement consists of multiple operands combined by logical operators. Knowing the amount of logical operators is a measure of the complexity of that statement.

M2.2 metCC (method Cyclomatic Complexity): Is measuring a method's cyclomatic complexity a useful indication for understandability? By answering this question the cyclomatic complexity can be used to determine the understandability of a method.

M2.3 cCC (class Cyclomatic Complexity): Is the cyclomatic complexity of a class a useful indication for understandability? By answering this question the cyclomatic complexity can be used to determine the understandability of a class.

M2.4 LCOM (Lack of Cohesion of Methods): Can the cohesion between methods be indicative of good understandability? Knowing the cohesion between the methods in a class can be an indication of having an understandable class.

M2.5 metIL (method Indentation Level): Can the amount of indentation in a method be used to ascertain the readability of a method? Answering this question provides insight into the way software code is presented to a person and the ease with which it can be understood.


Chapter 6

Analysis

Contents:
