
UvA-DARE (Digital Academic Repository)

A fault tolerance framework in a concurrent programming environment

Fu, J.

Publication date: 2014
Document version: Final published version

Citation for published version (APA):
Fu, J. (2014). A fault tolerance framework in a concurrent programming environment.

UvA-DARE is a service provided by the library of the University of Amsterdam (https://dare.uva.nl).



This research received full funding from the China Scholarship Council. This research was also supported by the European Space Agency under contract 4000106331.

Copyright © 2014 by Jian Fu

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without permission of the author.

Cover design by Jinli Qiu
Image courtesy of http://naviretlav.deviantart.com/art/Processor-199114947
Typeset with LaTeX
Printed and bound by Off Page, Amsterdam
ISBN: 978-94-6182-500-1

A Fault Tolerance Framework in a Concurrent Programming Environment

Academic Dissertation

to obtain the degree of doctor at the University of Amsterdam, on the authority of the Rector Magnificus, Prof. dr. D.C. van den Boom, before a committee appointed by the Board for Doctorates, to be defended in public in the Aula of the University on Wednesday, 29 October 2014, at 11:00

by

Jian Fu

Doctoral committee:

Promotor: Prof. dr. C.R. Jesshope
Other members: Prof. dr. ir. C.T.A.M. de Laat, Prof. dr. R.J. Meijer, Prof. dr. C. Zhang, Dr. A.D. Pimentel, Dr. L. Fossati, Dr. C.U. Grelck

It has often proved true that the dream of yesterday is the hope of today and the reality of tomorrow.

Contents

List of Figures
List of Tables
Listings
Summary
Samenvatting in het Nederlands
Acknowledgements

I Background
1 Introduction
2 The Microgrids

II Fault Detection
3 On-demand Thread-level Redundancy
4 Data-flow Scheduled RMT
5 Asynchronous Output Comparison

III Fault Recovery
6 Rethread

IV Fault Injection
7 Dependability Analysis

V Conclusion
8 Conclusion

VI Appendices
A Contents of a Family Table Entry
B Contents of a Thread Table Entry

General bibliography

List of Figures

1.1 The structure of the dissertation.
2.1 Concurrency tree [19].
2.2 Core microarchitecture.
2.3 The Microgrids of 32 cores.
2.4 Entity-relationship diagram of MGSim.
3.1 Intelligent redundant thread creation and synchronization.
3.2 Redundant thread distribution.
3.3 Master and redundant distribution network in the Microgrids of 8 cores.
3.4 The lifetime of master and redundant threads.
3.5 The Microgrids of 128 cores.
3.6 Performance penalty of the complete redundancy scope in the single-core scenario.
3.7 Performance penalty of the complete redundancy scope in the many-core scenario.
3.8 Performance penalty of the selective redundancy scope in the single-core scenario.
3.9 Performance penalty of the selective redundancy scope in the 2-core scenario.
3.10 Performance penalty of the selective redundancy scope in the many-core scenario.
3.11 Execution cycles of the selective redundancy scope in the 2-core scenario.
4.1 Sphere of replication.
4.2 The normalized execution time of multithreaded benchmarks using 2 cores.
4.3 The normalized execution time of multithreaded benchmarks using 4 cores.
4.4 The normalized execution time vs. relative system efficiency.
4.5 The normalized execution time of single-threaded benchmarks using 2 cores.
5.1 The structure of a comparison buffer.
5.2 The implementation of asynchronous output comparison.
5.3 The structure of the memory sub-system in the Microgrids.
5.4 The inconsistency that may be caused by write broadcast on the CB bus.
5.5 The inconsistency that may be caused by write broadcast on the memory bus.
5.6 The inconsistency that may be caused by read response broadcast on the memory bus.
5.7 The relation between normalized execution time and comparison buffer size for multithreaded benchmarks.
5.8 The relation between normalized execution time and comparison buffer size for single-threaded benchmarks.
5.9 The average lifetime of stores in the comparison buffer.
5.10 The performance of benchmarks in homogeneous and heterogeneous situations.
5.11 The lifetime distribution of stores in the comparison buffer for Livermore Loop 1.
6.1 Thread instruction analysis of some multithreaded kernels.
6.2 Rethread recovery process.
6.3 The relative performance of the Rethread-based system.
6.4 The relation between performance overhead and small thread re-execution ratios (<0.1).
6.5 The relation between performance overhead and large thread re-execution ratios (0.1∼1).
6.6 The breakdown of read requests to L2 with thread window sizes of 1 and 32.
6.7 The overview of the complete recovery process in the Microgrids.
7.1 The results of fault injection into different structures.
7.2 The comparison of fault injection results.
7.3 The distribution of fault detection latency.

List of Tables

1.1 Fault injection techniques classification.
1.2 The relation between the chapters of the fault detection strategy.
2.1 Logical sub-units in the TMU.
2.2 Control events handled by the TMU.
3.1 New instructions and their operations.
3.2 An example of master and redundant thread distribution.
3.3 The specification of the Microgrids chip with 128 cores.
3.4 Description of benchmarks.
4.1 The specification of the Microgrids chip with 4 cores.
4.2 The benchmarks.
4.3 DRMT vs. SMT-based RMTs.
5.1 The features of output comparison in two fault tolerance techniques.
5.2 Length of an entry in the comparison buffer.
5.3 An example of thread distribution.
5.4 Ratio of stores to total instructions per thread for the benchmarks.
6.1 The benchmarks.
6.2 The number of L2 read requests between full and no re-execution ratio.
7.1 Fault properties.
7.2 The number of faults injected into each structure.
7.3 The overall results of the fault injection experiment.
7.4 The utilization of the integer register file for each benchmark.
A.1 Contents of a family table entry.
B.1 Contents of a thread table entry.

Listings

2.1 Inner product in SL.
2.2 Family creation and synchronization construct in SL.
3.1 A summation function with the fault tolerance related parameter ftmode.
3.2 The assembly without fault tolerance attributes.
3.3 The assembly at the redundancy's start point.
3.4 The assembly within the redundancy's scope.
3.5 The format of the pair instruction.
5.1 Pseudo-code of output comparison.
5.2 Assembly code that may cause inconsistency between master and redundant threads.
6.1 An example code with software recovery.

Summary

As CMOS technology scales ever further, multi-core processors are becoming mainstream both in research and in industry. However, system vulnerability is increasing due to tighter design margins and greater susceptibility to interference, both caused by smaller feature sizes, lower power supply voltages, higher frequencies, greater hardware complexity and more transistors per processor. Meanwhile, the concurrent programming environment has emerged: a general designation for the norms in the exploitation of systems with multi-core processors, widely believed to be the main approach for gaining scalable performance improvements from multi-core systems through parallelism exploitation and resource scheduling. In this dissertation, we specifically explore the construction of a fault tolerance framework in a concurrent programming environment. In this process, we investigate the features of a concurrent programming environment. With this knowledge, we design a cross-layer, flexible, low-overhead fault tolerance framework including fault detection and recovery, as well as fault injection for its evaluation. The proposed fault tolerance framework targets the general paradigm of concurrent programming environments, and is evaluated and implemented on a specific platform, i.e., the Microgrids.

This dissertation includes three main parts: fault detection, recovery and injection in a concurrent programming environment. In the fault detection part, we present on-demand, thread-level redundancy in the context of a concurrent programming environment. When and where a program should be duplicated to give high reliability can be specified by programmers or by the run-time environment. This makes the system more efficient and flexible by providing affordable fault tolerance based on the system's current situation. Meanwhile, we also evaluate the redundant multithreading (RMT) technique in a data-flow scheduled multithreaded (DMT) environment, motivated by the evaluation of on-demand redundancy and the DMT implementation in the Microgrids. The results confirm that RMT can benefit from DMT. Furthermore, we provide an asynchronous output comparison mechanism for core-level redundant execution techniques, which uses a shared buffer for output comparison between a fixed pair of cores. It avoids inter-core data transmission for comparison and reduces the required buffer capacity.

In the fault recovery part, we uncover an opportunity to reduce the overheads of fault recovery dramatically in modern processors, which appears as a side-effect of introducing hardware multithreading to improve performance. This is because, in our assumed model, threads are usually short code sequences with no branches and few memory side-effects, which means that the number of checkpoints is small and constant. In addition, the state structures of a thread already present in hardware can be reused to provide checkpointing. We demonstrate this principle using a hardware/software co-design called Rethread, which features compiler-generated code annotations and automatic recovery in hardware by re-executing threads. Rethread is a novel strategy with extremely low hardware and performance overheads, since it makes full use of the features of the multithreaded environment and of the fault detection mechanism.

In the fault injection part, we propose a microarchitecture-level, simulator-implemented fault injection technique for dependability analysis. Based on this fault injection technique, we analyze the vulnerability of different microarchitectural structures in the Microgrids, especially the structures used to support hardware concurrency management. We also quantify the level of dependability that the fault tolerance approaches proposed in the previous parts provide. Finally, we provide the correlation between microarchitecture-level faults and their impact at the application level. The dependability analysis is not only a valuable reference for designing a reliable concurrent programming environment, but also an important argument in deciding whether to implement concurrency management in hardware or in software.

Overall, the fault tolerance framework proposed in this dissertation is an enhancement of concurrent programming environments. The work proceeds in two directions: in one, the environment is augmented with fault injection; in the other, it is progressively enriched with fault detection and fault recovery. The fault injection enhancement is then used to analyze the dependability of the concurrent programming environment and to measure the capability of the proposed fault detection and recovery strategies. Furthermore, the fault tolerance framework is comprehensive and universal, as it includes fault detection, fault recovery and fault injection. It is also flexible and efficient, making full use of several features of concurrent programming environments, such as the thread-level concurrency expression.


Samenvatting

The continuing development of CMOS technology means that multi-core processors are increasingly used for research and industrial applications. At the same time, computer systems are becoming more vulnerable due to constraints in the design space and a higher susceptibility to interference, caused by smaller transistors, lower supply voltages, higher frequencies, more hardware complexity and more transistors per processor. Meanwhile, programming environments with support for simultaneity (threading) have been developed, henceforth called "concurrent" programming environments, a generalization of the common concepts behind the exploitation of systems with multi-core processors. It is assumed that the use of these new environments can effectively yield additional performance from multi-core processors, because they exploit the parallelism of multi-cores and can effectively schedule its use by software components. This dissertation presents how fault tolerance can be built into such concurrent programming environments. To do so, the characteristics of an existing environment are investigated, and a fault tolerance framework is proposed for it. This framework cuts across multiple abstraction layers of the system, remains flexible for software developers, and offers low overhead for fault detection and recovery. For the evaluation, a method has been developed to artificially inject faults into the framework. The framework is described in general terms and evaluated on the Microgrid architecture.

The dissertation contains three parts: fault detection, recovery and injection in a concurrent programming environment. The part on fault detection presents a technique that makes per-thread redundancy possible on demand. The time and location at which a thread must be duplicated are specified by programmers or by the execution system. This makes the system more efficient and flexible, by making fault tolerance cheaper and more sensitive to current operating conditions. Our evaluation results indicate that processor designs in which the execution of threads is scheduled according to the data flows between threads can be favorable for simple redundant threads running on the same core. This is also confirmed in this dissertation by evaluating an earlier technique, called redundant multithreading (RMT), alongside ours. In addition, a mechanism is presented for asynchronously comparing the output of duplicated threads, using a buffer shared between adjacent cores. This buffer reduces the communication pressure between cores and makes more efficient use of storage capacity.

The part on recovery reveals an opportunity to reduce the overhead of recovery drastically, thanks to the presence of support for threads in modern processors. This opportunity arises because our programming model only uses very short threads with few branches and visible side effects, so that few checkpoints are needed for recovery after fault detection. Moreover, the hardware components that support threads can also be used to implement checkpoints. This principle is demonstrated in this dissertation by a hardware/software co-design, called Rethread, in which program code is annotated by compilers and the effects of faults are repaired by automatically restarting threads. Rethread is a newly invented recovery strategy with very low overhead in system performance and hardware cost, because it makes full use of the environment's facilities for threads and of the previously designed fault detection mechanism.

The final part, on fault injection, presents an implementation of a fault injection technique within an architecture simulator, intended for dependability analysis. It is used to analyze the vulnerability of several microarchitectural structures within the Microgrid, mainly the structures used to support thread execution. The dependability of the techniques presented in the first two parts is also quantified with it. Finally, this part presents the correlation between faults at the architectural level and effects at the application level. Dependability analysis is not only of key importance when designing concurrent programming environments; it also plays an important role in deciding whether the management of concurrent activities should be implemented in software or in hardware.

As a whole, the fault tolerance framework presented in this dissertation is an improvement of concurrent programming environments. This work is presented in two directions: one in which the environment is augmented with fault injection, and another in which detection and recovery are progressively extended. The fault injection extension is then used to measure the dependability of the concurrent programming environment and the suitability of the proposed detection and recovery techniques. Moreover, the framework is quite comprehensive and universal, with support for fault detection, recovery and injection. It is flexible and efficient because it makes full use of several features of the concurrent programming environment, such as the explicit description of concurrency in the program code of threads.


Acknowledgements

A drop of water shall be returned with a burst of spring. I would like to acknowledge the people who helped me, practically or morally, to pursue my PhD at the University of Amsterdam.

First and foremost, I would like to thank my promotor, Prof. Chris Jesshope, for offering me the position and the opportunity to pursue a PhD in the CSA group at the University of Amsterdam. There is no doubt that pursuing my PhD abroad has been a challenging, worthwhile and important experience in my life. Besides the supervision that a PhD student can expect from his promotor, such as scientific support, writing skills, and so on, I really appreciate the freedom that he gave me to explore my own ideas. Moreover, he always inspired me to improve my research through discussions with his deep insight. I truly enjoyed and greatly benefited from the past four years of his supervision.

I would like to express my sincere gratitude to Dr. Raphael Poss. He acted as my daily supervisor during my PhD. He not only inspired and refined the research ideas I proposed, but also helped me implement them step by step. It is no exaggeration to say that he was involved in every part of my PhD research. Without his dedication, my thesis could not have been finished so smoothly. I can never thank him enough for the countless pieces of advice offered whenever I needed them. Also, I learned a lot of wisdom and philosophy from our daily chats. It is my great honor to have him as my friend.

I am also grateful to Prof. Chunyuan Zhang from the National University of Defense Technology in China, for his encouragement and supervision of my research. He enlightened me on research in computer architecture, and gave me every convenience to pursue a PhD abroad.

I would like to thank Dr. Qiang Yang, who picked me up at Schiphol airport and helped me settle down in Amsterdam four years ago. It is worth mentioning that we have nearly identical higher education experiences. During these years of my PhD, we not only exchanged ideas to enrich each other's research, but also worked together to advance our own research.

Furthermore, I would like to thank all the other colleagues of the CSA group, with whom I spent every working day. They are Sebastian Altmeyer, Roy Bakker, Roeland Douma, Clemens Grelck, Mike Lankamp, Andy Pimentel, Roberta Piscitelli, Simon Polstra, Wei Quan, Toktam Tagahvi, Fangyong Tang, Mark Thompson, Irfan Uddin, Peter van Stralen, Michiel van Tol, Merijn Verstraaten.

Meanwhile, I cherish the friendship of the many great Chinese friends I made during my stay in the Netherlands. They are Lingxue Cao, Xiaotang Di, Jianbin Fang, Chao Li, Hairong Li, Hui Li, Ping Li, Xirong Li, Shangsong Liang, Yang Liang, Lei Liu, Liyuan Liu, Jie Liu, Zhongyu Lou, Ping Lu, Zhe Ma, Congsen Meng, Siqi Shen, Yang Song, Anbang Sun, Zhou Tang, Ke Tao, Chang Wang, Xiang Wang, Yao Wang, Hui Xiong, Huiqi Yan, Liying Zhang, Qianqian Zhang, Ling Zhong, Jing Zhou, Hao Zhu, and so on.

Finally and most importantly, I would like to thank my wife and my parents. We are a happy family and will always support each other.

Jian Fu

Part I

Background

Chapter 1

Introduction

Contents

1.1 Motivation and Goal
1.2 Concurrent Programming Environment
1.3 Fault Tolerance Framework
1.4 Research Questions
1.5 Organization

1.1 Motivation and Goal

Technology scaling dictated by Moore's Law [75] has been the key mechanism behind the remarkable performance improvements of the last few decades. It is widely believed that Moore's Law will continue for the next one or two decades despite several challenges, including reliability, power and complexity. With the continuation of technology scaling, there are five trends in modern and future processors: smaller feature size, lower power supply voltage, higher frequency, greater hardware complexity and more transistors per processor. These trends are inextricably linked and influence each other. Nevertheless, all of them lead toward an increasing fault rate in processors.

Reduction in chip feature size and supply voltage decreases the critical charge parameter of a logic circuit, Qcrit, which is the minimum charge disturbance needed to change the logic level (i.e., to flip a signal from a 0 to a 1 or vice versa). A lower Qcrit means a higher soft error rate (SER), and several existing experiments [10, 25, 28, 45, 78] have already shown that the SER becomes higher as both feature size and supply voltage decrease. Meanwhile, the smaller feature size results in increased process, voltage and temperature (PVT) variations, which also lead to increased susceptibility to intermittent and permanent faults [17, 112]. In addition, frequency has a much greater impact on logic SER than supply voltage [47, 70], as a higher frequency gives a lower noise margin. On the other hand, increased processor complexity makes design bugs, which can manifest as permanent faults, more likely; it is thus also a contributor to the increasing fault rate. Furthermore, with more transistors in a processor, there are more chances for faults to occur. Even given a constant fault rate for a single transistor, which is a highly optimistic assumption, the fault rate of a processor increases in proportion to the number of transistors per processor. Therefore, reliability is an inevitable challenge of technology scaling.
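As an illustration of this relation (standard background, not a formula taken from this dissertation), the widely cited empirical model of Hazucha and Svensson expresses the soft error rate of a circuit node roughly as

    \mathrm{SER} \propto F \cdot A \cdot e^{-Q_{\mathrm{crit}}/Q_s}

where F is the particle flux, A is the sensitive circuit area, and Q_s is the charge collection efficiency of the node. The exponential dependence on Qcrit is why even modest reductions in the critical charge raise the SER substantially.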

Besides the reliability challenge, power and complexity have already influenced the traditional monolithic designs of processors. Since the early 2000s, these pressures have turned the chip multiprocessor (CMP) into the mainstream way to deliver performance growth, by increasing the number of cores and exploiting high-level parallelism, such as thread-level parallelism (TLP). Currently, CMPs with 2 to 8 cores are widely used across many application domains, including general-purpose, embedded, network, digital signal processing (DSP) and graphics processing unit (GPU) designs. Moreover, CMPs with more than 60 cores, such as the Xeon Phi, TILE64, and Epiphany-IV, are constantly emerging. There is already a consensus that the core count within a CMP will keep increasing with technology scaling.

The improvement in performance gained by the use of a CMP depends heavily on the software algorithms used and their implementation. In particular, the possible gains are limited by the fraction of the software that can be run in parallel on multiple cores. How to gain scalable speedup on CMPs, especially on many-core processors, is an open research question related to parallelism exploitation and resource scheduling. We summarise it as the concurrent programming environment, a general designation of the norms in the exploitation of systems with CMPs.
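This limit is the classic statement of Amdahl's law, restated here for concreteness (standard background, not a result of this dissertation): if a fraction p of a program's work can be parallelized over n cores, the achievable speedup is

    S(n) = \frac{1}{(1-p) + p/n}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1-p}

so, for example, even with p = 0.95 the speedup can never exceed 20, no matter how many cores the CMP provides.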

As CMPs are the outcome of technology scaling, they will also suffer a high fault rate and will need to be protected from faults. Besides, it is worth mentioning that the traditional high-reliability systems, such as bank, flight and spacecraft control systems, are eager to adopt the latest CMPs for their performance improvement [125], even though these CMPs are more susceptible to faults. These observations motivate the need for a comprehensive study of solutions for improving the reliability of systems with CMPs, which is an essential part of the concurrent programming environment.

In this dissertation, we focus on the construction of a fault tolerance framework in a concurrent programming environment. In the process, we investigate the features of a concurrent programming environment, which shape the design of a cross-layer, flexible, low-overhead fault tolerance framework including fault detection and recovery, as well as fault injection for its evaluation.

1.2 Concurrent Programming Environment

Nowadays, the computer community faces the big challenge of programming the expected progression of CMPs with scalable performance and portability. It is anticipated that tens or even hundreds of thousands of cores will be possible on a single chip at the end of technology scaling. However, exploiting these CMPs requires the exposure of explicit concurrency in the code they execute, as implicit instruction-level parallelism (ILP) is not sufficient and is very expensive, in terms of logic, to scale to CMPs with many cores. This in turn requires applications to expose this concurrency, by either the programmer or the compiler, neither of which is easy [115]. This challenge is well acknowledged [5, 22, 46], but the concurrency revolution is happening now and urgently requires new tools and new ways of thinking [116]. The concurrency revolution affects all levels of the chain from the architecture up to the programming language, including the development tools, which are evolving, but without a general consensus.

We define a concurrent programming environment as a general designation of the norms in the exploitation of CMP systems during the concurrency revolution. Research on concurrent programming environments started in the 1970s and became mainstream in the 2000s. Many concurrent programming environments have already been presented, both in industry and academia, such as CUDA [81], OpenMP [85], OpenCL [84], Pthreads [9], MPI [86], UPC [120], Cilk [15], the Microgrids [74], etc., and new approaches appear frequently.

There are two basic ways of constructing a concurrent programming environment: either top-down or bottom-up. Top-down approaches come from the software side and aim to target general architectures rather than a specific one (e.g., UPC, Cilk). Bottom-up approaches force features from the hardware layer onto the software layer; the resulting software definition is then dedicated to that architecture (e.g., CUDA). Overall, a concurrent programming environment covers two aspects: concurrency expression and concurrency management. Concurrency expression exposes opportunities for parallel execution, and specifies the communication flow and synchronization points among the concurrent parts. Concurrency management spreads concurrent workloads over the hardware's parallel execution resources, referred to as mapping or scheduling; the sketch below illustrates the distinction.
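To make the distinction concrete, here is a minimal C sketch using Pthreads, one of the environments listed above (our illustration, not code from this dissertation). The program only expresses concurrency: which activities may run in parallel and where they synchronize. The management side, i.e., which core runs each thread and when, is decided entirely by the runtime and the operating system, outside the program.

    #include <pthread.h>
    #include <stdio.h>

    #define NWORKERS 4

    /* Concurrency expression: the code states which activities may run
     * in parallel and where they synchronize. */
    static void *worker(void *arg) {
        long index = (long)arg;              /* each worker knows its index */
        printf("worker %ld running\n", index);
        return NULL;
    }

    int main(void) {
        pthread_t tid[NWORKERS];
        for (long i = 0; i < NWORKERS; i++)  /* expose parallel work */
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (long i = 0; i < NWORKERS; i++)  /* synchronization point */
            pthread_join(tid[i], NULL);
        /* Concurrency management, i.e., mapping the workers onto cores
         * and scheduling them, happens below this program, in the
         * runtime and the operating system. */
        return 0;
    }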


In most current approaches the expression and management of concurrency are not decoupled, which means appropriate abstractions in favor of application portability are missing. Thus what currently happens is that applications are either developed targeting specific platforms, or existing applications are retargeted to a platform through a painstaking process of static application mapping and the introduction of platform-specific functionality into the application itself. Moreover, concurrency management is usually implemented in software (i.e., in the run-time system). With a software concurrency manager, each hardware parallel execution resource is controlled by a program that assigns tasks using state taken from main memory, which assumes that the workload per task is always sufficient to compensate for the non-local latencies induced by memory accesses to task state during schedule decision making and task assignment. This assumption traditionally holds for coarse-grained concurrency (e.g., external I/O), or for regular, wide-breadth concurrency patterns extracted from sequential tight loops (e.g., OpenMP). However, the situation is not so clear with fine-grained heterogeneous task concurrency, for example graph transformation and dataflow algorithms, which typically expose a large amount of irregularly structured fine-grained concurrency.

The Microgrids is a quite novel concurrent programming environment developed by the CSA group at the University of Amsterdam. First, it decouples concurrency expression and management, and targets the scalability, programmability and compatibility of multicore systems. The concurrency expression in the Microgrids is formalized as a system virtualization platform (SVP), which is a set of system services and language interfaces for the exploitation of concurrency on multicore processor chips[2]. Secondly, it introduces hardware support for concurrency management to accelerate concurrency management, which is also beneficial to resource fungibility[3] and scheduling cost predictability.

[2] Besides the Microgrids, another implementation based on SVP is a coarse-grained software approach that uses Pthreads on commodity multicore processors [121], which was later extended with support for distributed systems [122].
[3] Fungibility is the property of a good or a commodity whose individual units are capable of mutual substitution. Example fungible commodities include crude oil, wheat, precious metals, and currencies.

1.3 Fault Tolerance Framework

The goal of fault tolerance is to provide safety and liveness of a computer system, despite the possibility of faults [109]. A safe and live computer system never produces an incorrect user-visible result, and continues to make forward progress, even in the presence of faults. Fault detection provides safety: when a fault is detected, the system can at least stop and do nothing, rather than silently corrupt data. Moreover, fault recovery[4] hides the effects of the fault from the user and resumes the operation of the computer system, which means that the computer system remains live. In addition, fault injection is important for the vulnerability analysis of a computer system and for the evaluation of the effectiveness of any fault detection or recovery approach. Therefore, fault detection, recovery and injection together comprise a fault tolerance framework that is sufficient for building a reliable concurrent programming environment, which is the main purpose of this dissertation.

[4] Fault recovery only works for transient faults; it may not be sufficient for permanent faults, because execution after recovery will keep re-encountering the same permanent fault. Solutions to permanent faults are not studied in this dissertation.

1.3.1 Fault Detection

Redundancy is the key solution for fault detection [126]. The practical issue is how, and at what level, to apply what kind of redundancy in the target computer system. Physical (also referred to as "spatial"), temporal and information redundancy are the classic approaches for detecting faults, and all fault detection schemes use one or more of these types of redundancy.

Dual modular redundancy (DMR) with a comparator is the simplest form of physical redundancy. Adding an additional replica and replacing the comparator with a voter leads to triple modular redundancy (TMR). Both DMR and TMR are widely used, for example in IBM's S/390 G5 processor [104], the Tandem S2 [53], Hewlett-Packard's NonStop Advanced Architecture [14], and the Boeing 777's flight computer [139]. In addition, physical redundancy can be implemented at various granularities other than simply replicating an entire processor, e.g. in the arithmetic and logic unit (ALU), or even at the gate level. Nevertheless, heterogeneous physical redundancy is encouraged, as it can overcome two limitations of homogeneity. First, it enables the detection of errors due to design bugs; the Boeing 777 [139] and DIVA [7] are designed on this principle. Second, it reduces the hardware cost compared to homogeneous redundancy; an extreme example is the watchdog timer, since checking a processor's safety is far simpler than duplicating all its operations. However, physical redundancy is not an efficient approach, due to its large hardware cost and its power and energy consumption. Also, the comparator or voter is a single point of failure in a physical redundancy system.
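As a minimal sketch of the logic involved (ours, not taken from the thesis or from any of the cited systems): a DMR comparator can only flag a divergence, while a TMR majority voter masks a single faulty replica bit for bit.

    #include <stdint.h>

    /* DMR: a comparator can detect a divergence but cannot tell which
     * replica is wrong. */
    static inline int dmr_mismatch(uint32_t a, uint32_t b) {
        return a != b;
    }

    /* TMR: bitwise majority vote; every output bit takes the value held
     * by at least two of the three replicas, masking one faulty replica. */
    static inline uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c) {
        return (a & b) | (b & c) | (a & c);
    }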

The basic form of temporal redundancy is running a program twice on the same hardware and comparing the results of the two executions. Hence, there is no extra hardware cost compared to physical redundancy. But the performance of a system equipped with temporal redundancy is halved, as the total execution time is doubled, even ignoring the comparison latency. The active energy consumption is also doubled, as in DMR. Because of the steep performance cost, temporal redundancy usually uses pipelining techniques, or relies on the hardware accumulation induced by technology scaling, to hide the latency of the redundant operation. Simultaneous multithreading (SMT) [119] and multi-core processors are two key techniques accommodating redundant multithreading (a kind of temporal redundancy). SMT-based redundant multithreading has less performance impact than pure temporal redundancy, as it can benefit from higher resource utilization when master and redundant threads co-exist within the processor or across different processors. AR-SMT [97], SRT [94], SRTR [124], CRT [76], CRTR [39], DCC [59] and Reunion [107] are all SMT-based redundant multithreading fault detection schemes in a uniprocessor or CMP. As with heterogeneous physical redundancy, some redundant multithreading techniques (such as Slipstream [114], SlicK [87], and the opportunistic transient fault detection in [40]) provide partial replication of the thread to reduce the performance overhead.
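The basic scheme can be sketched in a few lines of C (a toy illustration of ours, assuming the computation is deterministic and side-effect free; real RMT designs interleave the two executions as threads rather than running them back to back):

    #include <stdio.h>

    static long compute(long x) { return x * x + 1; }   /* protected work */

    int main(void) {
        long first  = compute(21);
        long second = compute(21);          /* redundant execution in time */
        if (first != second) {              /* compare the two results */
            fprintf(stderr, "transient fault detected\n");
            return 1;                       /* fail-stop: safety preserved */
        }
        printf("%ld\n", first);
        return 0;
    }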

Information redundancy adds redundant bits to a datum to detect when it has been affected by an error, i.e., an error-detecting code (EDC). The error detecting capability of an EDC is determined by its Hamming distance (HD), the number of bit positions in which two code words differ. The simplest and most common EDC is parity. More sophisticated codes with larger HDs can detect more errors, and many of those codes can also correct an error, i.e., error-correcting codes (ECC). Information redundancy is well studied and widely used in commercial communication and memory systems. For instance, the L1 cache is either protected with EDC (as in the Pentium 4 [50], UltraSPARC IV [113], and Power4 [18]) or with ECC (as in the AMD K8 [1] and Alpha 21264 [58]).
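For instance, single-bit even parity, the simplest EDC mentioned above, can be computed by folding a word onto itself (a standard trick, shown here purely for illustration); with HD = 2 it detects any single-bit error but misses double-bit errors:

    #include <stdint.h>

    /* Even parity of a 32-bit word: XOR-fold all bits into bit 0.  The
     * stored parity bit makes the total number of 1-bits even; any single
     * bit flip changes the parity and is therefore detected. */
    static inline uint32_t parity32(uint32_t w) {
        w ^= w >> 16;
        w ^= w >> 8;
        w ^= w >> 4;
        w ^= w >> 2;
        w ^= w >> 1;
        return w & 1u;
    }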

1.3.2 Fault Recovery

Fault recovery hides the effects of the fault from the user, and plays a big role in the improvement of the processor’s availability, which is crucial to systems that must provide robust services. Traditionally, there are two primary fault recovery methods: forward fault recovery and backward fault recovery, also known as forward error recovery (FER) and backward error recovery (BER).

FER corrects the fault without reverting to a previous state, and requires more of each type of redundancy than fault detection does [8]. For example, TMR has one more module than DMR, which is required to correct a single error. FER therefore has extremely heavy overheads in area, energy and power consumption, but negligible performance loss. The Stratus [131] computer system is an implementation of a pair-and-spare system, and the Tandem S2 [53] uses TMR. Additionally, it is possible to implement FER in software by running three redundant streams of instructions [95]. The area/energy overhead is very large (roughly 200%), so FER schemes are usually viable only for small or mission-critical systems.

BER has to periodically capture all the necessary known-good state of the system, called checkpoints, which are the snapshots the system can roll back to when a fault has been detected [30]. Obviously, this trades recovery latency for area saving. A BER system needs to either checkpoint periodically, or log the differences when changes happen. With checkpointing, the system has to save its entire state, while logging only needs to record the changes that are made to the system state. BER schemes can be implemented either in hardware or in software.

Usually, hardware-based BER schemes require special optimization in the storage components, and they depend highly on where the fault is detected. For example, if a fault is detected before an instruction's result register values are committed, then the existing branch misprediction recovery mechanism can be used to recover from the error. In other words, BER schemes differ between systems that detect faults before register, memory or I/O output values are committed, as they have different checkpoint contents. IBM mainframes [43] use register checkpoint hardware to recover from processor errors. The CARER [48] technique and its multiprocessor extensions [3, 133] study how to use the cache to assist rapid backward recovery. They use a dedicated cache as a buffer to hold the temporary data for the computation between checkpoint intervals. SafetyNet [110] is more efficient, as it checkpoints the register state but logs the memory changes.

Software-based BER schemes require no additional hardware. Traditionally, they are coarse-grained recovery schemes with enormous checkpoints, used in large-scale enterprise systems such as the Tandem NonStop system [101], Condor [68], and IBM Blue Gene/P [77]. Both the checkpoint time and the recovery latency are quite long in conventional software-based BER schemes. Recently, more and more researchers have recognized that not all applications, or even functions within an application, require the same degree of reliability [27, 32, 67, 102, 105]. So they insert checkpoint code into the application manually or automatically (compiler-based), in order to achieve selective backward recovery. Relax [27] allows the user to choose the recovery block's action between re-executing code, returning default values, or ignoring the faults. Encore [32] tries to find or instrument idempotent regions of the code, which can be recovered by simply re-executing the code. This research makes the granularity of software-based BER smaller than before, i.e., it reduces the recovery latency. However, the performance overhead of these techniques is not trivial in the fault-free situation, as they increase the code size.
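A schematic C sketch of the idea (our illustration; region and fault_detected are hypothetical placeholders for a protected code region and a detection hook): checkpoint the live state, execute the region, and on detection roll back and re-execute. This only works if the region is idempotent or its side effects are contained, which is exactly the property Encore looks for.

    #include <setjmp.h>

    static jmp_buf checkpoint;              /* rollback target */
    static long state, saved_state;         /* the live state to protect */

    extern void region(void);               /* hypothetical protected region */
    extern int  fault_detected(void);       /* hypothetical detection hook */

    void run_with_recovery(void) {
        saved_state = state;                /* take a checkpoint */
        setjmp(checkpoint);
        state = saved_state;                /* (re)store known-good state */
        region();
        if (fault_detected())
            longjmp(checkpoint, 1);         /* backward recovery: re-execute */
    }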

1.3.3 Fault Injection

On the other hand, it is also essential to comprehensively study fault propagation and fault influence in computer systems, and to carefully evaluate the capability of fault tolerance approaches. Fault injection has been recognized as a powerful technique that allows the evaluation of a prototype system under faults, in particular the measurement of the effectiveness of fault tolerance approaches. It can be classified in two dimensions, as shown in table 1.1: faults can be injected into different abstraction layers of computer systems by hardware, software or a simulator.

Table 1.1: Fault injection techniques classification.

Layer \ Injector       Hardware   Software   Simulator
Software                  −          +          ○
Architecture              −          +          ○
Microarchitecture         −          −          ○
Gate*                     +          −          ○

−: unfeasible. +: feasible. ○: dependent on the simulator's level of abstraction.
*Gate-level fault injection techniques are consistent with Register-Transfer-Level (RTL) fault injection techniques [71].

Hardware implemented fault injection uses additional hardware to introduce faults into the target system. Generally, it can only directly produce faults at a very low level, such as the pin or gate level. There are two main categories of hardware implemented fault injection. One alters electrical currents and/or voltages at the circuit pins through a direct connection with probes or sockets, often called pin-level injection, as in Messaline [6]. The other employs external devices to mimic natural physical phenomena, such as heavy-ion radiation [42] and electromagnetic interference [57], causing faults inside the target chip. Radiation-based fault injection is the most effective and accurate technique, as it applies the real source of faults, whereas pin-level fault injection can only model a small fraction (9-12%) of the faults induced by radiation [96]. However, it is difficult to exactly control the time and location of a fault injected by hardware implemented fault injection techniques. Also, the target system may be damaged by the inordinate amount of energy used to introduce the fault.

In software implemented fault injection (SWIFI), additional software (e.g., a process) is invoked at runtime to modify the variables of the target program by time-out, exception, trap or dedicated code. It simulates the occurrence of faults by changing software variables, i.e., the contents of memory or registers. Therefore, it can only inject faults into the software and architecture layers, due to access constraints. SWIFI is attractive because it requires no expensive hardware and does not damage the target hardware. However, it cannot inject faults into locations that are inaccessible to software. Furthermore, the software instrumentation may disturb the execution, and even change the structure, of the target program. Typical SWIFI techniques include FERRARI [55], FTAPE [117], DOCTOR [44], Xception [21] and others.

Simulator implemented fault injection is very important for reliability evaluation at the early design stage of a computer system. The simulator is a model of the system under design, which can be developed at different abstraction levels. The abstraction layer into which faults can be injected depends highly on the abstraction level of the simulator, as it is impossible to inject faults into layers lower than the simulator's level of abstraction. Mostly, simulators are implemented at the level of the architecture (e.g., the TSIM SPARC simulator [118]), the microarchitecture (e.g., Gem5 [36], MGSim [62]), or RTL (the Hardware Description Language (HDL) based, FPGA implemented simulators). Hence, simulator implemented fault injection techniques can inject faults into the abstraction layers from the gate level up to software. In this approach, faults are induced by altering the logical values of the model elements during the simulation. Simulator implemented fault injection has higher observability and controllability than hardware or software implemented fault injection. Its downside is that it must be customized to each specific simulator, so it is difficult to have a general implementation or tool that covers several simulators. DEPEND [41], MEFISTO [51], VERIFY [103], SINJECT [140], SWAT-Sim [64], and FIMSIM [136] are some well-known simulator implemented fault injection tools.
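The core of such a technique is small. The following C sketch (our illustration; regfile and NUM_REGS stand for the model state of a hypothetical cycle-level simulator, which in practice, like MGSim, would be written in C++) flips one randomly chosen bit of one randomly chosen register at a pre-selected cycle, modeling a single-event upset:

    #include <stdint.h>
    #include <stdlib.h>

    #define NUM_REGS 32
    extern uint64_t regfile[NUM_REGS];  /* simulated register file (hypothetical) */

    static uint64_t inject_cycle;       /* drawn at random before the run */

    /* Called once per simulated cycle by the simulator's main loop. */
    void maybe_inject(uint64_t cycle) {
        if (cycle != inject_cycle)
            return;
        int reg = rand() % NUM_REGS;    /* random fault location */
        int bit = rand() % 64;
        regfile[reg] ^= 1ULL << bit;    /* single-bit transient fault */
    }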

On the other hand, the abstraction level of fault injection is the other dimension along which fault injection techniques can be categorized. It has been concluded that results obtained from gate-level fault injection closely match those obtained from radiation experiments [99]. Moreover, high-level fault injection techniques can be highly inaccurate when compared to gate-level fault injection techniques [24, 64]. The inaccuracies generally result from the following two aspects:

• there are more masks for gate-level faults than for high-level faults;
• the coverage of gate-level faults is broader than that of high-level faults.

By contrast, high-level fault injection is faster than gate-level fault injection. Meanwhile, the abstraction layer of fault injection depends on the way the fault injection is implemented, as stated above. It is not feasible to inject faults at the gate level with a SWIFI technique or an architecture-level simulator. However, accuracy is not necessarily a requirement for all fault injection scenarios. For instance, an inaccurate fault injection technique can still be very useful as long as it is effective in driving the correct design decisions for building robust systems; this is especially true for simulator implemented fault injection.

1.4 Research Questions

As mentioned in section 1.1, due to the technology scaling dictated by Moore's Law, the chip size grows relative to the gate size, and so does the latency between on-chip components (cores, caches and scratchpads) relative to the pipeline cycle time; this divergence is the on-chip equivalent of the memory wall [134]. Moreover, inter-component latencies will become increasingly unpredictable, both due to overall usage unpredictability in heterogeneous application scenarios and due to soft errors in circuits. These latencies cannot be easily tolerated using superscalar issue or VLIW, for the reasons outlined in [11]. Therefore, emerging concurrent programming environments accommodate several new characteristics to tolerate unpredictable on-chip long latencies, such as fine-grained concurrency, low-level scheduling and mapping, and more and more architectural support.

However, these new features of concurrent programming environments are also challenges for fault tolerance, since conventional fault tolerance techniques may suffer from the overheads caused by long on-chip latencies. On the other hand, opportunities often come disguised as challenges, and thus the main question of this dissertation arises:

What does an efficient fault tolerance technique look like in the context of these emerging concurrent programming environments?

If we study this primary question further, we can detail it as follows:

• Which features of emerging concurrent programming environments, including concurrent programs and multicore processors, can be exploited to support efficient fault tolerance?

• Is it possible to build the fault tolerance technique as a module, isolated from the other components of the concurrent programming environment?

The questions listed above are motivated by the general concepts of emerging concurrent programming environments, regardless of their concrete implementations. However, it is also interesting to see what benefits a fault tolerance technique can gain from a specific implementation of an emerging concurrent programming environment. To the best of our knowledge, implementations of emerging concurrent programming environments are rare, and fault tolerance frameworks targeting them are few. Fortunately, the Microgrids is one of them (cf. chapter 2); it promotes hardware concurrency management, to make the cost of concurrency management more predictable, and data-flow scheduled hardware multithreading, to tolerate unpredictable on-chip latencies. We investigate the use of the Microgrids with the two following research questions:

• Could fault tolerance techniques also benefit from hardware concurrency management?

• How can data-flow scheduled hardware multithreading be exploited for fault tolerance?

Additionally, another fault-tolerance-related research question arises when considering the main question above. In the current situation, the reliability of a system receives less consideration than it should at its design stage. Most fault tolerance techniques focus on constructing a reliable service or system upon existing components. It would therefore be better to evaluate the reliability of an emerging concurrent programming platform during its design stage, especially for its new features, which raises the following general research question:

How can the reliability of a system be analyzed during its design stage, especially in a simulation environment?

Similarly, we can make the question specific to the Microgrids, or extend it to the fault tolerance part of a system, i.e.,

• How does the reliability of the hardware concurrency management components compare to that of the regular components within a processor in the Microgrids?

• How effective are the fault tolerance approaches we provide for emerging concurrent programming environments?

1.5 Organization

This dissertation is organized as shown in figure 1.1. General background on concurrent programming environments and fault tolerance appears in chapter 1. Chapter 2 introduces the Microgrids, a novel concurrent programming environment with hardware support for concurrency management, which is the experimental platform for the fault tolerance techniques presented in this dissertation. The introduction of the Microgrids in this chapter focuses on the fault tolerance related aspects rather than on others.

The core of this dissertation is organized into three parts: fault detection, recovery and injection in a concurrent programming environment, which span chapters 3-7. Chapters 3-5 focus on thread-level fault detection from the perspectives of demand, system utilization and output comparison, respectively. As these chapters together form the fault detection strategy, they have different highlights and strong connections, as shown in table 1.2. Chapter 3 presents an on-demand thread-level fault detection technique, with support from the programming model, compiler, instruction set and microarchitecture, which is flexible and efficient. To achieve on-demand operation, it proposes intelligent redundant thread creation and synchronization, which manages the synchronization between the redundant threads via the master. Chapter 4 implements the redundant multithreading (RMT) technique in a data-flow scheduled multithreaded multicore processor, which poses significant design challenges; the resulting design is described and evaluated in detail. Chapter 5 presents an asynchronous output comparison mechanism, which can be used in all RMT techniques. It trades asynchronous latency for area saving, since the latency is easily hidden in a multithreaded environment. Chapter 6 presents a novel approach for transient fault recovery in multithreaded processors using thread re-execution. Its low hardware and performance overheads come from making full use of the features of the multithreaded environment. Chapter 7 analyzes the vulnerability of different components in the Microgrids using a simulation-based, microarchitecture-level fault injection scheme. Finally, chapter 8 gives concluding remarks.

Table 1.2: The relation between the chapters of the fault detection strategy.

Contents                            Chapter 3   Chapter 4   Chapter 5
On-demand redundancy                    △           −           −
Redundant multithreading                ◻           △           ◻
Output comparison (ideal)               ◻           ◻           −
Output comparison (asynchronous)        −           −           △
Experimental platform:
  128 cores with distributed
  cache system                          ◻           −           −
  4 cores with a shared L2              −           ◻           ◻

[Figure 1.1: The structure of the dissertation, "A Fault Tolerance Framework in a Concurrent Programming Environment": chapter 2 (The Microgrids) provides the background; the fault detection part comprises chapter 3 (On-demand Redundancy), chapter 4 (Data-flow Scheduled Redundant Multithreading) and chapter 5 (Asynchronous Output Comparison); the fault recovery part comprises chapter 6 (Rethread: A Low-cost Fault Recovery Scheme); the fault injection part comprises chapter 7 (Dependability Analysis: A Fault Injection Approach).]

Chapter 2

The Microgrids

—A Concurrent Programming Environment

This chapter is based on:

• R. Poss, M. Lankamp, Q. Yang, J. Fu, M.W. van Tol, and C. Jesshope, "Apple-CORE: Microgrids of SVP cores", in the Proc. of the 15th Euromicro Conference on Digital System Design (DSD'12), Cesme, Turkey, September 5-8, 2012.

• R. Poss, M. Lankamp, Q. Yang, J. Fu, I. Uddin, and C. Jesshope, "MGSim—A Simulation Environment for Multi-Core Research and Education", in the Proc. of the 13th International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS'13), Samos, Greece, July 15-18, 2013.

Abstract

This chapter introduces a concurrent programming environment, called the Microgrids, developed by the CSA group at the University of Amsterdam. The concurrency expression part of the Microgrids is formalized by the system virtualization platform (SVP) model, which can be programmed through an intermediate language, SL. The concurrency management part is supported in hardware, using ISA extensions, in the concrete implementation of the Microgrids. It is worth mentioning that the Microgrids consists of many data-flow scheduled, hardware multithreaded cores. Meanwhile, MGSim, an extensible C++ framework for discrete-event, component-level simulation of the Microgrids, is also introduced.

Contents

2.1 SVP Model
2.2 SL Language
2.3 Core Microarchitecture
2.4 Hardware Concurrency Management
2.5 Memory Architecture
2.6 MGSim: A Software Simulator
2.7 Summary

2.1 SVP Model

As a concurrency expression model, the SVP model describes concurrent activities and their relations. In SVP, each concurrent activity is called a thread, and all the same-level concurrent activities that are created by one parent are called a family of threads. Any thread can execute a create action to start a new family of threads. It can then use the sync action to wait for the family's termination, implementing a fork-join style of parallelism. However, a detached create is a special case, where a context can use create to start a new family but does not have to execute the sync action. Besides, there is the kill action to asynchronously terminate an execution, and break to abort the family of the thread that executes this action.

Every thread can create families of its own, making the model hierarchical, which we refer to as the concurrency tree, as shown in figure 2.1. Each circle is a thread. Circles positioned at the same level and with the same direct creator are of the same thread family. The concurrency tree depicts all the thread families of a program, including creating threads (i.e., nodes in the tree) and non-creating threads (i.e., leaves in the tree). Furthermore, the tree may contain dependency chains between the threads of a family and with their parent. This concurrency tree evolves dynamically in an application and can capture concurrency at all levels, e.g., at the task level and, due to the threads' blocking nature, even at the instruction level. With a family being an ordered set of identical threads, and each thread within a family knowing its own index in that order, both homogeneous and heterogeneous computation can be defined. The latter can be achieved by using the index value to control the statically defined actions of a thread (e.g., by branching to different control paths on the index). The index values are defined by a sequence specified by a start index, a constant difference between successive index values (step), and an optional maximum index value (limit).

Figure 2.1: Concurrency tree [19].
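To make this index mechanism concrete, the sketch below branches on the thread index so that a single family performs heterogeneous work. This is a minimal illustration, not code from the Microgrids distribution; the helper functions do_setup and do_work are hypothetical.

sl_def(worker, )
{
  sl_index(i);       // logical index of this thread within the family
  if (i == 0)
    do_setup();      // hypothetical: the first thread initializes
  else
    do_work(i);      // hypothetical: the remaining threads do the bulk work
}
sl_enddef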

The communication channels of the dependency chains mentioned above are of two types: global and shared. A global channel is written once by the parent thread and can be read by every thread in the family. A shared channel is defined between every consecutive pair of threads in the family. Note that the parent thread has to initialize the shared channels for its first child thread and can retrieve their final values from the last child thread. Families with shared channels are dependent: a thread's execution is suspended while it synchronizes on a shared channel with its antecedent neighbor. Families without shared channels are independent. To summarize, the communication pattern prescribes explicit and forward-only data flow, which guarantees deadlock freedom (given sufficient resources) and determinism [127].

The hardware execution resource is called a place, which is allocated at run-time prior to the family being created. A place is thus an implementation-independent abstraction for a collection of one or more resources on which families can be executed. Places can be configured statically or dynamically, depending on the implementation. When a thread creates a family, it can specify the place where to create that family; if it specifies a place other than its own, that thread is said to delegate that family to that place. Delegation serves two purposes. First, it allows programs to avoid resource deadlock by creating a unit of work on a different set of resources. Second, since a place also holds one or more mutually exclusive states, families can be delegated to a place in a mutually exclusive manner, which guarantees that of all families delegated to the same place, only one will run at any time, allowing it to operate on shared data.

The SVP model assumes a global, single address space, shared memory. However, the memory is seen as asynchronous: it cannot be used for synchronization, as no explicit memory barriers or atomic operations on memory are provided. It has a restricted consistency model where consistency is only enforced when concurrent activities interact, at the points of create and sync. There are several reasons for this approach to main memory. First of all, it will become increasingly difficult to provide global sequential consistency [60] in future many-core architectures. Secondly, by disallowing synchronization through memory, we only have forward dependencies between threads, expressed by create, sync and the global/shared communication channels, which guarantees freedom from communication deadlock and the ability to trivially sequentialize execution. The memory consistency model is defined as follows:

• upon creation, a child family is guaranteed to see the same memory state as the parent thread saw at the point where it executed create;

• the parent thread is guaranteed to see the changes to memory by a child family only when sync on that family has completed;

• subsequent families created on an exclusive place are guaranteed to see the changes to memory made earlier by other families on that place.

The memory consistency relationship between parent and child threads is implemented by acquire and release operations of location consistency (LC) [35] on synchronization actions such as the creation and termination of threads and writing to and reading from synchronizing objects. However, nothing prevents the parent and the child from modifying the same memory at the same time, which leads to undefined behavior. In addition, there is no way for threads to explicitly synchronize access to any memory location with another, unrelated thread. Specifically, this consistency model makes no guarantees that a thread will see writes made by an unrelated thread at some point in time. This relieves the memory implementation from ensuring global consistency, allowing more experimentation and optimization with implementations.
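To illustrate these guarantees, consider the following sketch, which uses the construct syntax introduced in section 2.2 below; the thread function fill and the array size are hypothetical. The child family is guaranteed to see the parent's initialization, while the parent may only rely on the children's writes after sync has completed.

int data[100];
data[0] = 42;      // visible to the children: consistency is enforced at create
sl_create( , , 0, 100, 1, 0, , fill, sl_glarg(int*, , data));
sl_sync();         // only now is the parent guaranteed to see the children's writes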


2.2 SL Language

This section introduces the intermediate language SL, a common vehicle to program SVP platforms. SL is designed as an extension to the standard C language (ISO C99/C11). It includes primitive constructs to bulk-create threads, bulk-synchronize on the termination of threads, and communicate using word-sized data-flow channels between threads (i.e., global and shared). It is intended for use as a target language for higher-level parallelizing compilers. A full specification of the SL core language has been published in [91]. This section thus complements the specification with a high-level overview based on examples.

Thread functions are defined using the sl_def construct in SL, as depicted in listing 2.1. This construct defines the thread function named "innerprod". The subsequent sl_glparm defines a global channel to receive an array base address, and sl_shparm defines a shared channel for the partial sum. In the body of the function, sl_index defines a variable named "i" which automatically receives, at run-time, the logical index of the thread currently executing the thread function; it has a signed integer type. Moreover, sl_getp reads from the channel endpoint indicated by its name.

#include <stdio.h>

sl_def(innerprod, , sl_glparm(int*, a), sl_glparm(int*, b),
       sl_shparm(int, s))
{
  sl_index(i);
  int *a = sl_getp(a), *b = sl_getp(b);
  sl_setp(s, sl_getp(s) + a[i] * b[i]);
}
sl_enddef

int main(void)
{
  int v1[5] = {1, 2, 3, 4, 5}, v2[5] = {3, 5, 7, 11, 13};
  sl_create(, , , 5, , , , innerprod, sl_glarg(int*, , v1),
            sl_glarg(int*, , v2), sl_sharg(int, s, 0));
  sl_sync();

  printf("%d\n", sl_geta(s)); // prints 143
  return 0;
}

Listing 2.1: Inner product in SL.

Furthermore, forward or external declarations can also be expressed using sl_decl, with the same syntax as sl_def but without the function body. Finally, floating-point (FP) channel endpoints must be declared using sl_glfparm and sl_shfparm. The reason for this separation is that the SL compiler must generate different code for FP channels, and the context-free substitution of SL constructs cannot determine by itself whether a given channel type is actually FP.
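For instance, a thread function that accumulates a floating-point sum would declare its channels with the FP variants. This is a minimal sketch following the conventions of listing 2.1; the function name fsum is hypothetical.

sl_def(fsum, , sl_glfparm(double*, a), sl_shfparm(double, sum))
{
  sl_index(i);
  double *a = sl_getp(a);
  sl_setp(sum, sl_getp(sum) + a[i]);  // the FP shared channel carries the partial sum
}
sl_enddef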


As defined in the SVP model, any thread can itself create whole families of logical threads at once and then synchronize on their termination. A single sl_create ... sl_sync construct expresses both bulk creation and synchronization, given in listing 2.2, with an example in the "main" function of listing 2.1. This construct expresses the creation of a family of logical threads running the thread function named "fexp". The start, limit and step integer expressions define the logical index range introduced in section 2.1, while ws defines the window size, which limits how many hardware threads are allocated to the family on each core (0 indicates no limit), and thus determines how many logical threads execute simultaneously; any excess logical threads are serialized over the allocated hardware threads, i.e., a new logical thread is created only once the previous logical thread has terminated. All four expressions are optional, and default to 0, 1, 1, 0 respectively if left unspecified.

sl_create( , [placement], [start], [limit], [step], [ws],
           [create-specifier], fexp [, channel-specifier]);
...
sl_sync();

Listing 2.2: Family creation and synchronization construct in SL.
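As a concrete instance of this construct (assuming, consistent with the five-thread family of listing 2.1, that the limit bound is exclusive, and that fexp takes no channels), the following creates 50 logical threads with indices 0, 2, ..., 98, allowing at most 8 hardware threads per core:

sl_create( , , 0, 100, 2, 8, , fexp);  // start=0, limit=100, step=2, ws=8
sl_sync();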

As introduced in section 2.1, every family of logical threads executes within a context, which is a specific set of hardware threads and/or cores on chip. This context is allocated prior to family execution and must be released (de-allocated) afterwards. The SL construct sl_create ... sl_sync automatically combines context allocation, family creation, family synchronization and context de-allocation, i.e., it does not let the program code control allocation and de-allocation explicitly. However, the program code can specify where on chip the context must be allocated, using the placement parameter to sl_create.

Any given placement contains two components: a location on chip, and a size which specifies how many cores to use from that location. By default, the following holds:

• any independent family (with no channels or only global channels) is spread automatically using an even distribution over its placement; for example with N logical threads and a placement size P, each core at the location will run approximately N/P logical threads;

• any dependent family (with at least one shared channel) will only run on the first core of its placement, regardless of the placement size.

When the placement is left unspecified in sl_create, it defaults to 0. The value 0 is special and specifies that the context must be allocated at the same resource as the family running the creating thread, i.e., that both location and size are inherited. The value 1 also has a special meaning: it specifies that the context must be allocated at the same core as the creating thread, with size 1.
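For instance, the following sketch (with a hypothetical thread function local_fn) forces a family of four threads onto the creating thread's own core by passing 1 as the placement:

sl_create( , 1, 0, 4, 1, 0, , local_fn);  // placement 1: same core, size 1
sl_sync();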

The create-specifier is used to override the regular behavior per instance of sl_create as follows:

• when sl__exclusive is specified, the exclusive context of each core is used to implement mutual exclusion; because the exclusive context is singular, any creation request to a core where the exclusive context is currently busy will cause the creation to wait until the context becomes available, that is, until the previous family at that context terminates (see the sketch after this list);

• when sl__forcewait is specified, the creation will wait until some context can be allocated at the target placement, no matter what. This implies the possibility of deadlock, but may be desirable for system code;

• when sl__forceseq is specified, the creation is always serialized in the creating thread, even if some context is available at the target placement. This feature is intended for testing or for comparing performance between parallel and sequential execution.
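The sketch below illustrates sl__exclusive; the place identifier svc_place, the thread function update_counter and the variable counter are hypothetical. Of all families delegated this way to svc_place, only one runs at any time, serializing access to the shared counter:

sl_create( , svc_place, , , , , sl__exclusive, update_counter,
           sl_glarg(int*, , counter));
sl_sync();  // returns once the exclusive family has terminated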

The last part of the parameters in sl_create is the channel-specifier, which connects to the channel endpoints of the thread function being created. There are four types of channel-specifier, according to whether the channel is integer or FP, and global or shared: sl_glarg, sl_sharg, sl_glfarg and sl_shfarg. In listing 2.1, the "main" function defines two vectors, then creates a family of 5 threads running "innerprod". It then sends the vector base addresses via the first two global channel endpoints (i.e., sl_glarg), left unnamed, and the value 0 as the source value for the third, shared channel (i.e., sl_sharg), labeled "s". After synchronization on termination, it uses sl_geta to read the value sent to the shared channel by the last thread. Note that the source channel endpoints are not defined prior to sl_create, and the final endpoints of shared channels cannot be used with sl_geta before sl_sync. Finally, it is also possible to create a family without synchronizing on its termination, i.e., to inform the underlying SVP platform that the creating thread and the created family can proceed asynchronously. This is done by using the word sl_detach instead of sl_sync.
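As a final sketch (with a hypothetical thread function background_fn), a detached creation simply replaces sl_sync with sl_detach, letting the creating thread proceed without waiting for the family:

sl_create( , , 0, 8, 1, 0, , background_fn);
sl_detach();  // no barrier: parent and family proceed asynchronously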

2.3 Core Microarchitecture

The hardware of the Microgrids is composed of a simple RISC core design called D-RISC [16]. Each D-RISC core supports hardware multithreading (HMT) using a data-flow scheduler, and is also equipped with a hardware thread management unit (TMU) which can coordinate with neighbouring TMUs for automatic thread and data distribution (cf. section 2.4). The core design, illustrated in figure 2.2, is derived from a 6-stage, single-issue, in-order RISC pipeline:

• the register file is extended as synchronizing storage, where each register has a data-flow state which indicates whether it contains data (full/empty) or is waiting for an asynchronous operation to complete;

• upon issuing an instruction that requires more than one cycle to complete, or whose input operands are not full, the waiting state is written back to the output operand and its value is overwritten with the identity of the issuing thread. This way, the thread can be put back on the schedule queue when its dependency becomes available. Meanwhile, further instructions in the pipeline can continue;

• the L1-D cache is modified so that loads are issued to memory asynchronously, constructing in the line’s storage a list of registers to notify when the load completes;
