An evaluation of machine learning methods for modeling an ORC system

Lars Van Mieghem
Student number: 01500429

Supervisors: Prof. dr. ir. Michel De Paepe, Prof. dr. ir. Steven Lecompte

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Electromechanical Engineering

Academic year 2019-2020


Copyright

The author gives permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use.

In all cases of other use, the copyright terms have to be respected, in particular with regard to the obligation to state explicitly the source when quoting results from this master dissertation.

Lars Van Mieghem
Ghent, 16 August 2020


Foreword

Before presenting the contents of this work, I would like to express my thanks to those people without whom it would not have taken its current form. I am thankful to Prof. Michel De Paepe for giving me the opportunity to work on this thesis. I would like to thank Prof. Steven Lecompte for his advice and for the valuable feedback and insights that I have received from him. Also, I would like to thank my counselors Aditya Pillai and Kenny Couvreur, who have always readily assisted me and answered my questions throughout the past year. Finally, I would like to thank the members of my family who have always supported me in these extraordinary times.

MASTER THESIS 2019-2020
Research group Applied Thermodynamics and Heat Transfer
Department of Flow, Heat and Combustion Mechanics – Ghent University

AN EVALUATION OF MACHINE LEARNING METHODS FOR MODELING AN ORC SYSTEM

Lars Van Mieghem, Steven Lecompte, Michel De Paepe, Aditya Pillai, Kenny Couvreur
Department of Flow, Heat and Combustion Mechanics, Ghent University
Sint-Pietersnieuwstraat 41 – B9000 Gent – Belgium
E-mail: lars.vanmieghem@ugent.be

1 Abstract

The organic Rankine cycle (ORC) is a suitable technology for electricity generation from low to medium grade heat sources with a limited capacity, such as geothermal energy or waste heat recuperation (WHR). For the prediction of the part-load behavior of the cycle, theory based "grey-box" models are available. However, these models may not be able to capture all the present phenomena, require careful calibration and use time-consuming iterative solving calculations. The use of novel machine learning (ML) techniques may offer an interesting alternative for modeling the ORC cycle. In this work, the use of two classes of ML algorithms, tree based methods and artificial neural networks (ANN), was evaluated in order to predict the steady-state expander output power as a function of the cycle parameters. ML models were trained using both a set of experimental data obtained from an experimental 11 kWe ORC setup, and using a larger set of data generated using a theory based model. Finally, the use of hybrid modeling and transfer learning techniques in order to improve the accuracy of the model trained on the experimental data was evaluated.

Keywords: organic Rankine cycle, part-load, machine learning, neural networks, random forests, transfer learning

2 Introduction

In recent years there has been increasing concern over the issue of global warming, caused by the release of large quantities of greenhouse gases such as CO2 into the atmosphere of the earth. The Paris climate agreement, signed at the COP-21 conference in 2015, sets a long-term goal of keeping the average global temperature increase at the end of the century well below 2 degrees Celsius and recognizes that rapid reductions in greenhouse gas emissions will have to be made in accordance with using the best available science [1]. The International Energy Agency estimates that to reach this scenario, the average carbon intensity of electricity generation will have to decline at a rate of -3.9% over the next decade. By 2060, 98% of all electricity production should be obtained from carbon-neutral sources [2]. Goals to limit CO2 emissions are being converted into binding legislation by governments across the world. In Europe, the 2030 climate and energy framework enforces a 40% reduction in greenhouse gas emissions by 2030 compared to the 1990 level [3]. It is clear that these goals can only be accomplished by the large scale use of new forms of renewable and carbon neutral energy production. Among these are a number of technologies based on the use of alternative heat sources. Such technologies include already well established fields such as geothermal power, waste heat recuperation (WHR), biomass, concentrated solar power (CSP), as well as more novel power generation methods such as ocean thermal energy conversion (OTEC) or solar pond power plants (SPPP) [4]. These heat sources typically operate at much lower temperatures (< 400 °C) and lower power ratings (< 10 MW) than traditional coal fired or nuclear reactor power plants. Whereas the latter usually employ a water based Rankine cycle (or a variant thereof) to extract useful work from the released heat, the use of a Rankine cycle with water as the working fluid is inefficient at a low heat source temperature. Instead, the use of an organic Rankine cycle (ORC), which uses an organic compound (or one with similar properties) as its working fluid instead of water, is preferred in these conditions [5].

Although commercial ORC installations have already been around for several decades [6], a lot of interest remains in the improvement of the cycle efficiency. A lot of work has already been done on the selection of an optimal working fluid. Related to this, research has also been performed into the choice of the cycle architecture. Another field has been the development of expander designs, both of turbines and volumetric expanders, with a higher efficiency in ORC operation. Finally, there has been an active interest in the development of control strategies for the ORC system. Knowledge of the control strategy is particularly relevant for applications where heat loads may vary over time, such as with WHR or solar heat conversion, and where the overall inertia of the system is limited. The availability of accurate models is essential for the development of an effective control strategy. Furthermore, computer models can also assist in plant maintenance and supervision, help determine a safe area of operation and give new insight into the behavior of the cycle [7].

A number of tools have been developed in recent years based on a combination of analytically derived equations and experimental correlations. Working fluid properties can be determined using libraries such as CoolProp [8], RefProp [9] or FluidProp [10]. The overall dynamics of the cycle installation can be modeled in the Modelica environment using the ThermoCycle [11] or ThermoPower [12] libraries. However, such models are mathematically complex and still require careful calibration of the correlation parameters.

3 Machine Learning

In the past decade there has been a rapidly rising interest in the application of machine learning (ML) algorithms to a wide variety of modeling problems in different fields. Also in the field of energy engineering an exponential increase in the number of papers describing machine learning techniques has been observed [13]. For the purposes of modeling a physical relationship based on experimental data, supervised learning algorithms are the most logical choice among ML techniques. In supervised learning, the modeling problem is generally described as the selection of a function f relating a set of input features {x_1, x_2, ...} to a certain output variable y, by training the model on a set of previously obtained training samples {(x_1, x_2, ..., y)_i}. In most cases the output variable is either an element of R, in which case we speak of a regression problem, or an element of a discrete set of possible values {y_1, y_2, ...}, in which case we speak of a classification problem. As the studied variables here are of a continuous nature, a regression model will be used.

One of the key concepts in machine learning is the so-called bias-variance tradeoff. Assume that the output variable y follows from the actual relation:

y = f(x) + \epsilon \qquad (1)

with x containing the input features, f an unknown function and ε representing the interference from noise, which we assume is normally distributed with mean equal to zero. To model this relationship, a function of the form h_θ(x) is proposed, with θ the parameters of the model to be learned. Given a set of sample data D, the selected ML algorithm will provide an estimate θ̂ of these parameters. The accuracy of the obtained model can be evaluated using the mean squared error (MSE):

MSE(\hat{\theta}) = \frac{1}{n} \sum_{i=1}^{n} \left( h_{\hat{\theta}}(x^{(i)}) - y^{(i)} \right)^2 \qquad (2)

Most supervised learning algorithms are trained either by searching a value for θ that minimizes the MSE value on the training dataset, or by following another optimization criterion that will yield comparable results. However, the MSE ε_train on the training data will usually be an underestimate of the generalization error ε_gen on an unseen test set, a phenomenon called overfitting. In general, it can be proven from statistical learning theory [14] that the expected value of the test MSE is given by:

E(\epsilon_{gen}) = \mathrm{Bias}^2 + \mathrm{Variance} + \mathrm{Noise} \qquad (3)

where the bias term indicates the ability of the model's architecture to portray the studied relation if an unlimited amount of data were to be available, the variance term indicates the amount of overfitting and the noise term is caused by the actual noise ε. As is illustrated in Figure 1, both the bias and variance are dependent on the complexity of the selected ML model, with a too simple model having a high bias and a model with too many parameters θ having a high variance. This is what is generally known as the bias-variance tradeoff, as the optimal model is usually found in the middle by trading off the two error terms.

Figure 1: An illustration of the bias-variance tradeoff [14].

The most popular supervised learning model at the time of writing is the artificial neural network (ANN). Their popularity mainly stems from their ability to model a large variety of different relation structures without the need for tedious feature engineering, and the intuitiveness of the resulting models. The simplest NN architecture is called the densely connected or feedforward neural network. Such a network consists of an input layer, an output layer and one or more hidden layers in between. These hidden layers consist of a number of so-called neurons. A single such neuron is shown in Figure 2. It receives a number of input values from the previous layer and calculates an output value fed to the next layer according to:

y = f(w_1 x_1 + \ldots + w_n x_n + b) \qquad (4)

with w_i and b the parameters learned during training and f a nonlinear activation function. Once the network's architecture is defined, it is fitted to the training data by finding the combination of parameters that minimizes the MSE of the model on the data points in the training set [15].
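To make Eq. (4) concrete, a minimal Python sketch of such a neuron is given below; the weight values, bias, inputs and the ReLU activation are arbitrary placeholders and are not taken from the thesis models.

```python
# Minimal sketch of the neuron in Eq. (4): y = f(w1*x1 + ... + wn*xn + b).
# Weights, bias, inputs and the ReLU activation are illustrative placeholders.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)            # one common choice for the activation f

w = np.array([0.4, -1.2, 0.7])           # learned weights w_i
b = 0.1                                  # learned bias
x = np.array([1.0, 0.5, -2.0])           # inputs received from the previous layer

y = relu(np.dot(w, x) + b)               # output value fed to the next layer
print(y)
```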

Figure 2: An illustration of a single neuron.

Another class of ML algorithms that is often used when the data is presented in a clear tabular form, as is the case here, are methods based on the decision tree model. The decision tree is a fairly simple model that, as its name implies, uses a decision tree to map its input features to the defined output. While intuitive, on its own the model is very prone to overfitting and highly sensitive to small changes in the training parameters. However, multiple decision tree models can be combined to form a powerful predictor, using a technique known as ensembling. Under certain assumptions, it can be proven in statistical learning theory that by combining a large number of individually strongly overfitting models, an ensemble model can be created that is nonetheless able to accurately model the studied relations, because the variances of the individual models effectively cancel each other out. For tree-based modeling, two ensembling techniques are widely used. With random forests, one of them, a large number of decision trees are trained simultaneously on the dataset. To guarantee that all trees are unique, a small degree of randomness is introduced in a number of ways during training. With boosting, the other approach, a number of trees are trained sequentially, where each tree tries to correct for the points that were badly predicted by the previous tree.
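As a hedged illustration of the two ensembling approaches just described, the sketch below fits a random forest, an extra trees model and a boosted ensemble with scikit-learn on synthetic data; scikit-learn's GradientBoostingRegressor merely stands in for the XGBoost implementation used later in this work, and all settings are placeholder values.

```python
# Sketch of bagging-style (random forest, extra trees) and boosting-style
# ensembles in scikit-learn; the synthetic data stands in for cycle measurements.
from sklearn.datasets import make_regression
from sklearn.ensemble import (RandomForestRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor)
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "random forest": RandomForestRegressor(n_estimators=500, random_state=0),
    "extra trees": ExtraTreesRegressor(n_estimators=500, random_state=0),
    "boosted trees": GradientBoostingRegressor(n_estimators=500, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, round(model.score(X_test, y_test), 3))   # R^2 on held-out points
```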

4 Setup

The datasets used by the machine learning algorithms were obtained from an 11 kWe ORC setup at the Kortrijk campus of Ghent University. The cycle uses R245fa as the working fluid and contains a recuperator unit to improve efficiency. The heating circuit uses 10 Maxxtec 25 kWe electrical heater elements and Therminol 66 as the heating fluid. The cooling circuit uses a water-glycol mixture (33 vol%). The cycle is equipped with a Calpeda MXV 25-214 centrifugal turbopump. For the expander, two designs were evaluated, one being a single screw expander with a nominal speed of 3000 rpm, and the other being a double screw expander with a nominal speed of 5000 rpm. An overview of the installation is given in Figure 3.

Figure 3: Left: a picture of the ORC installation. Right: schematic diagram of the cycle architecture [16].

5 Data gathering

Using the sensors indicated in the schematic on the right hand side of Figure 3, a large number of cycle parameters were measured and subsequently processed using the LabView program. Between 2013 and 2016, a total of 885 LabView .tdms files containing time series of various lengths with a sampling interval of 1 s were gathered. These files were processed in Python using the Pandas library to obtain a set of steady state operating points of the cycle. This was performed according to the methodology described by Lecompte et al. [17], by first calculating a 10 minute moving window standard deviation for the measured parameters, and then comparing the obtained standard deviations to certain threshold values. Neglecting points where the cycle was not active and where the recuperator was bypassed, a total of 95 and 97 steady state data points were found for respectively the single screw and double screw expander.
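A hedged sketch of this detection step is shown below: a rolling 10-minute standard deviation is computed with Pandas and compared against per-parameter thresholds. The column names, threshold values and the synthetic signal are illustrative only; the actual criteria follow Lecompte et al. [17].

```python
# Sketch of steady-state detection via a 10-minute moving-window standard
# deviation compared against thresholds (synthetic data, placeholder limits).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2015-01-01", periods=7200, freq="1s")          # 2 h at 1 s sampling
df = pd.DataFrame({"T_hf_in": 120 + rng.normal(0, 0.2, len(idx)),
                   "P_exp":   6.0 + rng.normal(0, 0.05, len(idx))}, index=idx)

window = 600                                          # 10 minutes of 1 s samples
rolling_std = df.rolling(window).std()
thresholds = pd.Series({"T_hf_in": 0.5, "P_exp": 0.1})               # placeholder limits
steady = (rolling_std < thresholds).all(axis=1)

steady_points = df[steady].resample("10min").mean().dropna()         # averaged operating points
print(len(steady_points), "candidate steady-state points")
```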

6 Experimental modeling

Using these experimentally obtained data points, the use of machine learning methods was studied in order to predict the steady-state expander output power (kW) as a function of the heating and cooling fluid inlet temperatures T_hf,in (°C) and T_cf,in (°C), their flow rates ṁ_hf (kg/s) and V̇_cf (m³/h), and the expander speed n_exp (rpm) and pump speed n_pump (rpm). First a short statistical analysis was performed on the datasets. Correlations between the different parameters were estimated and scatter plots were generated to obtain a clear visualization of the dataset. To evaluate the performance of trained models, three metrics were introduced, namely the mean squared error MSE, the coefficient of determination R² and the mean absolute error MAE. To obtain a baseline against which more advanced ML models could be evaluated, a linear regression analysis was performed on the dataset. This obtained a value R² = 0.6659 for the single-screw expander and R² = 0.0878 for the double-screw expander setup. It thus seems that the degree of linearity present in the relation for the double-screw expander is substantially lower.

Subsequently, the use of the two classes of machine learning algorithms described earlier, namely tree based methods and artificial neural networks, was evaluated on the available data sets. Tables 1 and 2 briefly summarize, for respectively the single-screw and double-screw expander setup, the eventually obtained performance of the different ML algorithms, after an appropriate hyperparameter optimization in each case. It was found that for both setups, a neural network based model achieved the highest accuracy, albeit with a different number of hidden layers, two for the double-screw expander and five for the single-screw expander. For the tree based modeling, it was found that the extremely randomized trees (extra trees) model, which is a modified version of random forests that incorporates a higher degree of variability, performed better for both setups than the standard random forest model. However, it must be noted that even for the best performing models, the accuracy is in all likelihood still too low to be of direct practical use. This may be caused by the limited amount of data points available for training, one or more orders of magnitude less than what is traditionally encountered in a ML setting. The fact that the input features were not independent and identically distributed (IID) may also have contributed to the lower accuracy. However, it was at least shown that the obtained models were able to extract a significant part of the underlying relationship, strongly improving upon the baseline linear regression model.

Model               MSE      R²       MAE
Linear regression   0.1132   0.6659   0.1314
Random forest       0.1324   0.6583   0.2116
Extra trees         0.0785   0.8605   0.1201
Neural network      0.0524   0.8918   0.0814

Table 1: Overview of obtained accuracies on the single-screw expander using the experimental dataset.

Model               MSE      R²       MAE
Linear regression   0.5386   0.0878   0.4992
Random forest       0.4118   0.3518   0.3025
Extra trees         0.2946   0.6150   0.1647
Neural network      0.1756   0.7274   0.1250

Table 2: Overview of obtained accuracies on the double-screw expander using the experimental dataset.
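The sketch below shows one way the three metrics could be estimated with k-fold cross-validation in scikit-learn; the synthetic dataset merely stands in for the 95/97 experimental points, and the linear model shown is the baseline rather than the tuned ML models of Tables 1 and 2.

```python
# Sketch: MSE, R^2 and MAE of a baseline linear regression estimated by
# 10-fold cross-validation (synthetic stand-in data).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=95, n_features=6, noise=10.0, random_state=1)

scores = cross_validate(LinearRegression(), X, y, cv=10,
                        scoring=("neg_mean_squared_error", "r2",
                                 "neg_mean_absolute_error"))

print("MSE:", -scores["test_neg_mean_squared_error"].mean())
print("R2 :", scores["test_r2"].mean())
print("MAE:", -scores["test_neg_mean_absolute_error"].mean())
```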

7 Theory based modeling

As the main issue hampering the performance of the models described in the previous section seemed to be the very limited amount of data points, it was investigated whether a higher accuracy could be obtained if a model was trained on a substantially larger dataset, obtained from an available theory based model describing the same ORC cycle. A model developed in MATLAB at Ghent University by Lecompte et al. [18] was used to generate 1000 data points for the double-screw expander setup with T_hf,in, T_cf,in, n_pump, ṁ_hf and V̇_cf as the input features and P_exp as the output. Values for the input features were randomly generated with a uniform distribution over a typical working range. Similar to the experimental modeling, a short statistical analysis was performed prior to the actual modeling. Table 3 shows the correlation coefficients calculated between the input parameters and the output. It can be seen that the pump speed and inlet temperature of the heating fluid were most strongly correlated with the output.

Parameter       Correlation coefficient
n_pump (rpm)     0.6881
V̇_cf (m³/h)      0.1720
T_cf,in (°C)    -0.3561
T_hf,in (°C)     0.4564
ṁ_hf (kg/s)      0.0568

Table 3: Correlations with the expander power.

Next, a linear regression analysis was performed on the obtained data. It was found that even a simple linear regression model was already able to capture an important part of the present relationship, obtaining a coefficient of determination R² = 0.8309. Neither the use of lasso nor that of ridge regularization was able to improve this accuracy, and increased regularization parameters only led to a decrease in the model's performance, indicating that little overfitting was present. Having established a linear baseline model, the use of random forests was investigated. A hyperparameter optimization was performed using a randomized search with 1000 iterations and 10-fold cross-validation. The eventual model obtained a cross-validation accuracy R² = 0.9677. An analysis of the averaged feature importances confirmed that the pump speed and heating fluid inlet temperature had the largest influence in determining the output power. In a similar manner, an extremely randomized trees model was also trained on the dataset. It was found that the extra trees model slightly outperformed the standard random forest model, with R² = 0.9774, similar to the results from the experimental modeling. Boosted trees were another kind of tree-based model that was investigated. For this, the XGBoost [19] algorithm was used, which has a high computational efficiency and contains a number of built-in regularizations. Having performed an optimization for several of the model's hyperparameters in a randomized search, a value R² = 0.9744 was obtained.

Finally, the use of artificial neural networks to model the ORC cycle was studied. The 1000 data points were randomly split into 900 points used for training and 100 used for validation. A hyperparameter optimization was performed to determine the optimal number of hidden layers, the optimal number of neurons per layer, the L2 regularization factor α and the dropout rate r_dropout. Dropout is a form of regularization specific to neural networks, whereby a certain fraction of the neurons in the hidden layers are randomly dropped from the network at each training iteration. Optimal results were found for a network possessing 10 hidden layers with 15 neurons per layer, α = 0.1 and no dropout present. Using an early stopping callback, which halts training if no improvement in the validation loss can be noted, and the Adam optimizer with a learning rate lr = 0.01, a batch size of 50 and a maximum of 500 training epochs, an accuracy R² = 0.9846 was achieved on the validation dataset. As the lack of data points was suspected as the main reason for the discrepancy in performance between the models trained on the experimental dataset and those described in this section, the above described model was retrained using only a randomly selected portion of the training data. Figure 4 shows that the R² score does indeed quickly start to decrease when less than 200 data points are used to train the model.

Figure 4: Graph showing R² as a function of the number of data points used.
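A hedged Keras sketch of the kind of network described above is given below: densely connected hidden layers with L2 regularization, the Adam optimizer and an early-stopping callback. The synthetic data, the random seed and the layer sizes are placeholders and do not reproduce the thesis model.

```python
# Sketch of a dense network with L2 regularization, Adam and early stopping,
# trained on synthetic stand-in data for the theory-generated dataset.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 5))                        # 5 scaled cycle inputs (placeholder)
y = X @ np.array([3.0, 0.5, -1.5, 2.0, 0.2]) + 0.05 * rng.normal(size=1000)

model = keras.Sequential(
    [layers.Dense(15, activation="relu",
                  kernel_regularizer=regularizers.l2(0.1)) for _ in range(10)]
    + [layers.Dense(1)])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01), loss="mse")

early_stop = keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True)
model.fit(X, y, validation_split=0.1, epochs=500, batch_size=50,
          callbacks=[early_stop], verbose=0)
# The last 10 % of the samples are the points Keras held out via validation_split.
print("validation MSE:", model.evaluate(X[-100:], y[-100:], verbose=0))
```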

Finally, two more advanced deep learning techniques that aim to improve convergence, namely the use of the ELU (Exponential Linear Unit) activation function and the introduction of batch normalization layers, were studied. Using both techniques, the originally obtained accuracy was improved to R² = 0.9945. Table 4 gives a general overview of the obtained accuracy scores for the different ML algorithms evaluated on the dataset.

Model               MSE      R²       MAE
Linear regression   462093   0.8309   492.17
Random forests      87758    0.9677   144.52
Extra trees         61839    0.9774   104.64
Boosted trees       69810    0.9744   167.44
Neural networks     19690    0.9945   89.39

Table 4: Results of the ML modeling on the theoretical data.
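For completeness, a hedged sketch of the randomized hyperparameter search with 10-fold cross-validation used for the tree-based models earlier in this section is shown below; the searched ranges and the reduced number of iterations (50 instead of 1000) are illustrative choices on synthetic stand-in data.

```python
# Sketch of a randomized search with 10-fold CV for a random forest
# (synthetic stand-in data; illustrative parameter ranges).
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=1000, n_features=5, noise=10.0, random_state=0)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": randint(100, 1000),
                         "max_depth": randint(3, 30),
                         "min_samples_leaf": randint(1, 10)},
    n_iter=50, cv=10, scoring="r2", random_state=0)
search.fit(X, y)

print(search.best_params_)
print("cross-validated R^2:", round(search.best_score_, 4))
print(search.best_estimator_.feature_importances_)    # averaged feature importances
```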

8 Hybrid modeling

As was shown in the previous section, given a sufficient amount of reliable data points, machine learning models are able to accurately portray the behavior of the studied ORC installation. On the other hand, the models trained in Section 6 using the much smaller experimentally obtained dataset achieved significantly lower accuracies. However, if prior theoretical knowledge of the cycle could be incorporated into the machine learning algorithm, the need for a large data set may be reduced. This approach is generally known as hybrid modeling. A large variety of hybrid modeling techniques has been proposed in literature [20], based on where in the machine learning workflow the theoretical relations are introduced. Here the use of transfer learning (TL) was studied. Transfer learning is a field closely related to hybrid modeling where the aim is to transfer knowledge from a source model M_s trained on a source domain D_s = (X, P(x)) and source task T_s = (Y, P(y|x)), to a new target model M_t with a different domain D_t and/or task T_t for which only a limited amount of data is available [21]. Cases where the source and target domains are identical and only T_s ≠ T_t are known as inductive transfer learning. When training neural networks, two inductive transfer learning techniques, known as freezing and fine-tuning, are widely used. With freezing, the lower layers of the original source model are "frozen", meaning that their parameters can no longer be changed, while the parameters in the upper layers are refitted to the new target task. With fine-tuning, the whole network is refitted, but only for a limited number of iterations and with a small learning rate.

Here, the possibility of transferring knowledge from the highly accurate theory based model obtained for the double-screw expander setup to a new NN model trained on the experimental dataset for the single-screw expander configuration was investigated. A custom model architecture was generated by removing the last layer from the theory based model, and using its outputs as additional input features for a sequential model consisting of four dense layers. In a first step, the weights of the theory based model were frozen, and the newly added dense layers were fitted to the experimental data set of the single screw expander setup for 1000 epochs with a batch size of 32 and a learning rate lr = 0.001. Next, the entire network was fine-tuned for another 500 epochs with a smaller learning rate lr = 10⁻⁵.
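The sketch below illustrates the freeze-then-fine-tune procedure in Keras under simplifying assumptions: the source network is a small placeholder rather than the actual theory-trained model, and the new head is attached directly on top of the truncated source model instead of receiving its outputs as additional input features as in the thesis architecture.

```python
# Simplified freeze + fine-tune sketch (placeholder source model and data).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder for the network trained on the theory-based dataset.
source_model = keras.Sequential([layers.Dense(15, activation="relu", input_shape=(6,)),
                                 layers.Dense(15, activation="relu"),
                                 layers.Dense(1)])

# Drop the last layer and reuse the rest as a frozen feature extractor.
base = keras.Model(source_model.input, source_model.layers[-2].output)
base.trainable = False

target_model = keras.Sequential([base,
                                 layers.Dense(10, activation="relu"),
                                 layers.Dense(10, activation="relu"),
                                 layers.Dense(1)])

X = np.random.rand(95, 6)                 # stand-in for the experimental dataset
y = np.random.rand(95)

# Step 1 (freezing): only the newly added layers are fitted.
target_model.compile(optimizer=keras.optimizers.Adam(1e-3), loss="mse")
target_model.fit(X, y, epochs=1000, batch_size=32, verbose=0)

# Step 2 (fine-tuning): unfreeze and refit briefly with a much smaller learning rate.
base.trainable = True
target_model.compile(optimizer=keras.optimizers.Adam(1e-5), loss="mse")
target_model.fit(X, y, epochs=500, batch_size=32, verbose=0)
```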

After ensuring the model was able to converge in all cases, a 10-fold cross-validation of the model resulted in the values MSE = 0.0519, MAE = 0.1180 and R² = 0.9490 using only freezing, and MSE = 0.0465, MAE = 0.1347 and R² = 0.9467 after both freezing and fine-tuning.

9 Conclusion

In the presented research, the feasibility of using machine learning methods to predict the steady-state expander power of an ORC setup was evaluated. For the models trained on the experimentally obtained data sets, the achieved accuracy fell short of the likely requirements for practical use. The limited amount of data points was generally deemed the main cause for this lack of accuracy. However, the trained models were able to substantially improve upon the baseline performance of the linear regression model. The models trained on a larger dataset obtained from a theory based model, on the other hand, obtained substantially higher accuracies. Comparing the different modeling techniques, the neural networks seemed to achieve the highest accuracy, with R² = 0.9945 for the best performing model. However, the eventual choice of the ML algorithm may depend on the situation, and other factors such as the available computing power, allowed training time and required fine-tuning of the hyperparameters should also be taken into consideration. While the training of ML models may be computationally intensive, once trained, the calculation time to make a prediction is several orders of magnitude smaller than with theory based models that use iterative solving procedures. The use of machine learning models may thus be an appropriate choice where real-time prediction of the expander power is required, such as in control applications, or in applications where iterative optimization methods are used to determine the preferred working point of the cycle setup. In the hybrid modeling section, it was found that the accuracy on the experimental data set could be improved using transfer learning techniques. However, more research may be required to assess the influence of the used techniques and model architecture. Finally, it should be remarked that machine learning is still very much a developing field. A large amount of the techniques used in this work were only discovered in the past decade [22], and future advances in the field might lead to new insights and better model performances than those achieved here.

References

[1] Conference of the Parties. Twenty-first session. Adoption of the Paris Agreement. Paris, France, 2015.

[2] International Energy Agency. Energy Technology Perspectives 2017. Paris, France, 2017.

[3] General Secretariat of the Council. 2030 climate and energy policy framework. Brussels, Belgium, 2014.

[4] B. F. Tchanche et al. "Low-grade heat conversion into power using organic Rankine cycles - A review of various applications". In: Renewable and Sustainable Energy Reviews 15.8 (2011), pp. 3963–3979.

[5] E. Macchi. "Theoretical basis of the Organic Rankine Cycle". In: Organic Rankine Cycle (ORC) Power Systems. Technologies and Applications. Ed. by E. Macchi and M. Astolfi. Duxford, United Kingdom: Woodhead Publishing, 2017. Chap. 1, pp. 3–24.

[6] M. Astolfi. "Technical options for Organic Rankine Cycle Systems". In: Organic Rankine Cycle (ORC) Power Systems. Technologies and Applications. Ed. by E. Macchi and M. Astolfi. Duxford, United Kingdom: Woodhead Publishing, 2017. Chap. 3, pp. 67–89.

[7] A. Desideri. "Dynamic modeling of organic Rankine cycle power systems". In: Université de Liège 135.4 (2013).

[8] I. H. Bell et al. "Pure and Pseudo-pure Fluid Thermophysical Property Evaluation and the Open-Source Thermophysical Property Library CoolProp". In: Industrial & Engineering Chemistry Research 53.6 (2014), pp. 2498–2508.

[9] E. W. Lemmon et al. NIST Standard Reference Database 23: Reference Fluid Thermodynamic and Transport Properties - REFPROP, Version 10.0. National Institute of Standards and Technology, 2018.

[10] P. Colonna and T. P. Van der Stelt. FluidProp: a program for the estimation of thermophysical properties of fluids. Delft, The Netherlands, 2004.

[11] S. Quoilin et al. "ThermoCycle: A Modelica library for the simulation of thermodynamic systems". In: Proceedings of the 10th International Modelica Conference. Lund, Sweden, 2014.

[12] F. Casella and A. Leva. "Modelica open library for power plant simulation: design and experimental validation". In: Proceedings of the 2003 Modelica Conference. Linköping, Sweden, 2003, pp. 41–50.

[13] A. Mosavi et al. "State of the art of machine learning models in energy systems, a systematic review". In: Energies 12.7 (2019).

[14] P. Mehta et al. "A high-bias, low-variance introduction to Machine Learning for physicists". In: Physics Reports 810 (May 2019), pp. 1–124.

[15] F. Chollet. Deep Learning with Python. Shelter Island, NY: Manning, 2018.

[16] S. Gusev et al. "Experimental Comparison Of Working Fluids For Organic Rankine Cycle With Single-Screw Expander". In: International Refrigeration and Air Conditioning Conference (2014), pp. 2653–2663.

[17] S. Lecompte et al. "Experimental results of a small-scale organic Rankine cycle: Steady state identification and application to off-design model validation". In: Applied Energy 226 (2018), pp. 82–106.

[18] S. Lecompte et al. "Organic Rankine cycle part-load characterization: validated models of an 11 kWe waste heat recovery ORC". In: 12th International Conference on Heat Transfer, Fluid Mechanics and Thermodynamics (2016), pp. 871–878.

[19] T. Chen and C. Guestrin. "XGBoost: A Scalable Tree Boosting System". In: 22nd SIGKDD Conference on Knowledge Discovery and Data Mining. San Francisco, 2016.

[20] A. Karpatne et al. "Theory-guided data science: A new paradigm for scientific discovery from data". In: IEEE Transactions on Knowledge and Data Engineering 29.10 (2017), pp. 2318–2331.

[21] S. J. Pan and Q. Yang. "A survey on transfer learning". In: IEEE Transactions on Knowledge and Data Engineering 22.10 (2010), pp. 1345–1359.

[22] H. Wang and B. Raj. "On the Origin of Deep Learning".

Contents

1 Introduction 1

2 Literature review 2

2.1 The organic Rankine cycle . . . 2

2.1.1 Basic principles . . . 2

2.1.2 Relevance of the ORC . . . 4

2.1.3 Modeling and control of the ORC . . . 5

2.2 Machine learning . . . 6

2.2.1 Types of machine learning . . . 6

2.2.2 Comparison with statistics . . . 7

2.2.3 The bias-variance tradeoff . . . 8

2.2.4 Artificial neural networks . . . 9

2.2.4.1 Feedforward neural networks . . . 10

2.2.4.2 Advanced neural networks . . . 13

2.2.5 Ensembling and tree-based methods . . . 14

2.2.5.1 Decision trees . . . 14

2.2.5.2 Ensembling . . . 15

2.2.5.3 Random forests . . . 15

2.2.5.4 Boosted trees . . . 16

2.3 Applications in literature . . . 16

2.3.1 Steady state models . . . 17

2.3.2 Dynamic models . . . 18

2.3.3 Development of control strategies . . . 18

2.3.4 Machine learning in similar applications . . . 19

2.4 Goals of the thesis . . . 20

3 Experimental setup 21
3.1 Components . . . 21
3.1.1 Heating circuit . . . 21
3.1.2 Cooling circuit . . . 21
3.1.3 Pump . . . 21
3.1.4 Expander . . . 22
3.2 Data gathering . . . 22

4 Experimental modeling 24

4.1 Detection of steady state points . . . 24

4.2 Statistical analysis . . . 26

4.3 Used metrics . . . 32

4.4 Linear regression . . . 32

4.5 Regularized linear regression . . . 34

4.6 Random Forests . . . 36
4.7 Neural networks . . . 38
4.8 Conclusions . . . 40

5 Theoretical modeling 42
5.1 MATLAB model . . . 42
5.2 Statistical analysis . . . 43
5.3 Linear regression . . . 44
5.4 Random forests . . . 46
5.5 Boosted trees . . . 47
5.6 Neural networks . . . 48
5.7 Conclusions . . . 52

6 Hybrid modeling 55
6.1 Transfer learning . . . 56
6.2 Application . . . 56
6.3 Conclusions . . . 59

7 Conclusions 60
7.1 Conclusions . . . 60
7.2 Future Work . . . 61

List of Figures

2.1 Typical layout and T-s diagram for a steam Rankine cycle . . . 2

2.2 Thermal power and temperature ranges for various alternative heat sources . . . 3

2.3 T-s diagrams for a saturated Rankine cycle with respectively water, benzene and MDM as working fluids . . . 4

2.4 Illustration of the bias-variance tradeoff . . . 10

2.5 Simple feedforward neural network with one hidden layer. . . 11

2.6 Illustration of a single neuron. . . 11

2.7 Schematic explanation of the simple-RNN architecture . . . 13

2.8 Example of a simple decision tree. . . 15

3.1 The experimental ORC setup. . . 22

4.1 Heat map for the correlation coefficient with single screw expander. . . 28

4.2 Heat map for the correlation coefficient with double screw expander. . . 29

4.3 Scatter plots for the single screw expander. . . 30

4.4 Scatter plots for the double screw expander. . . 31

4.5 Illustration of the linear regression RFE accurateness. . . 34

4.6 Training and validation loss for one of the cv-folds (double-screw expander, 2 hidden layers of 10 neurons). . . 38

5.1 Scatter plot for the theoretical model. . . 44

5.2 Recursive feature elimination on the theory based model. . . 45

5.3 Example of a decision tree obtained with XGBoost. . . 48

5.4 Graph showing the resulting feature importances of the XGB model. . . 49

5.5 Training and validation MSE and R² for the NN model. . . . 50

5.6 R² as a function of the number of used data points. . . 51

5.7 Plot of P_exp(n_pump) according to the MATLAB model. . . 53

5.8 Plot of P_exp(n_pump) according to the NN model. . . 53

6.1 Visualization of the transfer learning model architecture. . . 58


List of Tables

2.1 Comparison between statistics and machine learning. . . 7

2.2 Overview of the listed studies . . . 20

3.1 Properties of the used sensor devices. . . 23

4.1 Reference standard deviations in steady state. . . 24

4.2 Correlations with the expander power for both cycle configurations. . . 27

4.3 Optimal feature coefficients using recursive feature elimination. . . 33

4.4 MSE for the single screw setup with different regularizations and different values of α. . . 35

4.5 MSE for the double screw setup with different regularizations and different values of α. . . 35

4.6 Hyperparameter optimization for random forest models. . . 36

4.7 Feature importances for the optimal random forest models. . . 37

4.8 Hyperparameter optimization for extra tree models. . . 37

4.9 Influence of the number of hidden layers for the single screw expander. . . 39

4.10 Influence of the number of hidden layers for the double screw expander. . . 39

4.11 Influence of the number of neurons for the single screw expander. . . 39

4.12 Influence of the number of neurons for the double screw expander. . . 40

4.13 Results using the experimental data for the single-screw expander setup. . 41

4.14 Results using the experimental data for the double-screw expander setup. . 41

5.1 Upper and lower bounds for model parameters. . . 43

5.2 Correlations with the expander power. . . 43

5.3 Optimal feature coefficients for linear regression. . . 45

5.4 Hyperparameter optimization for the random forest theoretical model. . . . 46

5.5 Feature importances for the theoretical random forest model. . . 46

5.6 Hyperparameter optimization for extra trees theoretical model. . . 47

5.7 Hyperparameters for the XGBoost model. . . 47

5.8 Hyperparameter optimization for the neural network theoretical model. . . 50

5.9 Dependence of the NN model on the number of data points. . . 51

5.10 Results of the ML modeling on the theoretical data. . . 54


Nomenclature

R² Coefficient of determination
AI Artificial Intelligence
ANN Artificial Neural Network
Bagging Bootstrap Aggregation
CNN Convolutional Neural Network
GRU Gated Recurrent Unit
LSTM Long Short-Term Memory
MAE Mean Absolute Error
ML Machine Learning
MSE Mean Squared Error
ORC Organic Rankine Cycle
OTEC Ocean Thermal Energy Conversion
RF Random Forest
RNN Recurrent Neural Network
RSS Residual Sum of Squares
SPPP Solar Pond Power Plants
SVM Support Vector Machine
TL Transfer Learning
WHR Waste Heat Recovery


Chapter 1

Introduction

In recent decades there has been a growing worldwide concern over the issues of global warming and climate change. The impacts of climate change have already been noticeable and are expected to become more significant in the course of this century. Global annual averaged temperature has increased by 1 °C between 1901 and 2016. Extreme temperatures, both cold and hot, have become more likely. Sea levels are rising faster in recent decades than they have done in the past 2000 years. Average annual Arctic ice has decreased by about 4% per decade since 1985 [1]. There exists a wide consensus in the scientific community [2] that the anthropogenic emission of greenhouse gases such as CO2 into the earth's atmosphere forms the main cause of this observed climate change.

The Intergovernmental Panel on Climate Change (IPCC) has stated that continued emission of these gases will "increase the likelihood of severe, pervasive and irreversible impacts for people and ecosystems" [3]. The worldwide reduction in greenhouse gas emissions will thus have to be one of the primary challenges for society in the near future.

In the Paris climate agreement, signed at the 2015 COP-21 conference, signatories agreed to a long-term goal of keeping the average global temperature increase at the end of the century well below 2 °C and recognized the need for rapid reductions in greenhouse gas emissions in accordance with using the best available science [4]. The International Energy Agency has estimated that to reach this scenario, the average carbon intensity of electricity generation will have to decline at a rate of -3.9% over the next decade. By 2060, 98% of all electricity production will have to be obtained from carbon-neutral sources [5]. Goals to limit CO2 emissions are being converted into binding legislation by governments across the world. In Europe, the 2030 climate and energy framework will enforce a 40% reduction in greenhouse gas emissions compared to the 1990 level [6]. The European climate law proposed by the Commission earlier this year envisions a binding net zero emissions target by 2050 for all member states [7]. In light of all this, there has been a growing interest in the use of renewable and carbon-neutral energy sources. The organic Rankine cycle, which is the object of study in this work, offers a possible solution for harnessing low to medium grade heat sources with a limited capacity and as such could form part of the development of a wider sustainable energy transformation required to resolve these issues.


Chapter 2

Literature review

2.1 The organic Rankine cycle

2.1.1 Basic principles

Figure 2.1: A typical layout and T-s diagram for a Rankine cycle with water as the working fluid [8].

In conventional power plants which possess heat sources operating at high temperatures and large power ratings, e.g. coal or nuclear power plants, the water based Rankine cycle is generally considered to be the preferred means to convert the generated heat into useful work. A typical layout of a simple Rankine cycle system and the accompanying T-s diagram are given in Figure 2.1. As can be deduced from the figure, the water in a Rankine cycle undergoes four distinct processes. First, the feedwater is brought up to a high pressure by the pumping installation, ideally in an isentropic way. Then, the heat generated by the power plant is transferred to the water in the boiler, producing a saturated or superheated vapor at the outlet. This vapor is then led through the turbine, where it delivers the turbine work while expanding isentropically. Finally, the expanded vapor is sent through a condenser which outputs a saturated liquid, completing the cycle [8].
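As a hedged numerical illustration of these four processes (not a calculation from the thesis), the state points of an ideal saturated Rankine cycle with water can be evaluated with the CoolProp library discussed later in this chapter; the chosen pressures are arbitrary example values.

```python
# Illustrative sketch of the four ideal Rankine cycle processes for water,
# evaluated with CoolProp. Pressures are arbitrary example values.
from CoolProp.CoolProp import PropsSI

fluid, p_cond, p_boil = "Water", 0.1e5, 60e5         # condenser and boiler pressure [Pa]

# 1 -> 2: isentropic pumping of saturated liquid from condenser pressure
h1 = PropsSI("H", "P", p_cond, "Q", 0, fluid)
s1 = PropsSI("S", "P", p_cond, "Q", 0, fluid)
h2 = PropsSI("H", "P", p_boil, "S", s1, fluid)

# 2 -> 3: isobaric heat addition up to saturated vapor
h3 = PropsSI("H", "P", p_boil, "Q", 1, fluid)
s3 = PropsSI("S", "P", p_boil, "Q", 1, fluid)

# 3 -> 4: isentropic expansion through the turbine (4 -> 1 closes the cycle)
h4 = PropsSI("H", "P", p_cond, "S", s3, fluid)

w_net = (h3 - h4) - (h2 - h1)                         # turbine work minus pump work [J/kg]
q_in = h3 - h2
print(f"Thermal efficiency of the ideal cycle: {w_net / q_in:.3f}")
```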


Figure 2.2: Thermal power and temperature ranges for various alternative heat sources [12].

In recent decades, more interest has arisen into the possibilities of operating power cycles based on alternative heat sources operating at much lower temperatures and much smaller power ratings than those in traditional plants. The most common of these are biomass combustion, geothermal power and solar energy, which offer a means of electricity generation that is renewable, carbon neutral and without any of the safety concerns associated with nuclear power, and waste heat recuperation (WHR) from industrial processes or internal combustion engines, which allows for an efficiency increase with existing technologies. More innovative applications include the harnessing of temperature gradients in the oceans, known as ocean thermal energy conversion (OTEC), and solar pond power plants (SPPP), where solar radiation is absorbed in saline water ponds and the generated heat is then recuperated by a power cycle system [9]. A typical PWR-type nuclear reactor operates at a temperature of around 300 °C and a thermal power of several GWt [10]. A unit in a coal fired plant often operates around 600 °C and several hundred MWt thermal power [11]. Figure 2.2 shows typical output power and temperature ranges for the above described thermal heat sources. Although, as can be seen from the figure, there are large differences between the various applications, with the output power ranging from less than 1 kW to over 10 MW, and the maximum temperature from about 400 °C to less than 100 °C, a common characteristic is that both quantities are significantly lower than with traditional power plants.

Figure 2.3: T-s diagrams for a saturated Rankine cycle with respectively water, benzene and MDM as working fluids [12].

For these heat sources where the temperature and thermal power are limited, the organic Rankine cycle (ORC) offers, as will be explained in the following section, certain advantages over the classical Rankine cycle with water as the working fluid. An organic Rankine cycle system utilizes the same thermodynamic Rankine cycle that was explained earlier, with the only difference being that the working fluid is no longer water. Figure 2.3 shows the T-s diagrams for the Rankine cycle with respectively water, benzene and octamethyltrisiloxane (MDM) as the working fluid. As can be deduced from the diagrams, the organic working fluids usually exhibit an inclination of the vapor curve, which becomes more pronounced for highly complex molecules such as MDM. This inclination raises the importance of the liquid preheating phase and of the vapor desuperheating phase. For a fixed evaporation and condensation temperature, this means an increase of the irreversibilities in the heat transfer, and thus a decrease in efficiency. This phenomenon can be countered, albeit at an increased cost, by adding a recuperator unit that uses the vapor at the expander outlet to preheat the liquid leaving the condenser [13]. This way a thermodynamic efficiency is obtained similar to that of a saturated water based Rankine cycle, irrespective of the choice of working fluid [12]. In reality most heat sources have a limited heat capacity and therefore do not operate at a constant temperature, complicating the thermodynamic analysis of cycle efficiency. Superheating and more complicated cycle architectures can then be used to obtain higher cycle efficiencies [14, 15].

2.1.2 Relevance of the ORC

As was mentioned earlier, for small scale and low temperature power cycles, an alternative (often organic) working fluid is preferred over water. The reasons for this are multiple. First of all there is the issue of the expander design. Due to the large enthalpy drop that steam encounters during the expansion process, a complex multistage turbine layout is required to keep flow and rotor speeds in each stage at acceptable levels. The liquid formation at the end of the expansion process further complicates the turbine design. All this makes the overall turbine setup expensive to construct and maintenance intensive afterwards. While acceptable for large scale power plants, these designs are prohibitive to the economic operation of a small scale plant. For a working fluid used in the ORC cycle, the enthalpy drop is generally an order of magnitude lower than for steam, which makes it feasible to use a simple single stage expander.


A second advantage of the ORC is that the vapor pressure at typical environment temperatures is significantly higher. This makes it possible to cool the working fluid down to the ambient temperature without the need for subatmospheric pressures in the condenser. As the efficiency of the cycle is already quite low because of the low heat source temperature, being able to take advantage of the full temperature difference with the environment is key for economic operation. Further advantages include the smaller difference between ρ_liquid and ρ_vapor, which makes it possible to use a simpler once-through boiler design, and the large ρ_vapor allowing for a smaller expander design.

Despite all these advantages, the use of an organic fluid also entails a number of disadvantages. The main ones are the higher cost of the working fluid compared to water, possible toxicity or chemical instability and detrimental influences on the environment such as contributing to ozone layer depletion or global warming. Also, the smaller enthalpy drop requires larger flow rates and thus more pump work, leading to a somewhat larger Back Work Ratio (BWR) [12, 16].

2.1.3 Modeling and control of the ORC

Although commercial applications of the ORC are already available, a lot of interest remains in improving the performance of ORC systems to make these systems more economically attractive. Research on improving ORC performance has been focused on several domains. A lot of work has already been put into the problem of selecting the optimal working fluid. An overview of the methodologies for working fluid selection can be found in Badr et al. [17]. Apart from the use of a single working fluid, there is also interest in the use of a mixture of different organic compounds [18]. Related to the selection of the optimal working fluid is the selection of the cycle architecture. A lot of attention has also been given to increasing the efficiency of the expander design. Finally, there is an active interest in the development of effective control strategies for an ORC system [19, 20]. The development of a control strategy is particularly important for the more innovative applications such as solar power generation, small-scale WHR or off-grid operation, where due to the variability of the heat source and the small thermal inertia of the installation the cycle dynamics play an important role, and for other applications where part-load operation occurs.

The availability of reliable computer models to predict the behavior of an ORC system can play a key role in devising an effective and safe control strategy. Furthermore, accurate models are also relevant for plant maintenance and supervision, for evaluating and improving transient behavior and for the testing of dangerous working conditions [21].

Based on analytical methods and correlations obtained from experimental data, a number of tools have been developed in recent years to aid in the modeling of ORC systems. First of all, in order to be able to accurately model the behavior of the system it is important to be able to determine the thermophysical properties of the working fluids. For this purpose, several software libraries have been developed. These include the open-source CoolProp package [22] and the proprietary RefProp [23] and FluidProp [24] packages. For the simulation of steady-state models the open-source ORCmKit library [25] was developed, which is compatible with EES, Matlab and Python. For dynamic modeling of ORC systems, two libraries are currently widely used. One of them is the ThermoCycle library [26], which was specifically developed for the dynamic simulation of thermal systems with organic working fluids. The other is the ThermoPower library [27], which was originally created for the modeling of water based thermal systems, but has been adapted to incorporate the aforementioned fluid property libraries for ORC modeling [28]. Both of these libraries are based on the Modelica language, which is the most widely used equation-based object oriented (EOO) modeling platform. The EOO concept allows for a modular simulation approach, where predefined components retrieved from libraries can be combined to form the desired cycle architecture [29].
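As a small illustration, the sketch below looks up a few R245fa properties with the open-source CoolProp package mentioned above; the chosen state point is an arbitrary example, not an operating point of the studied cycle.

```python
# Property lookup for the working fluid R245fa with CoolProp's PropsSI interface.
from CoolProp.CoolProp import PropsSI

T_evap = 90 + 273.15                                          # example evaporation temperature [K]
p_sat = PropsSI("P", "T", T_evap, "Q", 1, "R245fa")           # saturation pressure [Pa]
h_vap = PropsSI("H", "T", T_evap, "Q", 1, "R245fa")           # saturated vapor enthalpy [J/kg]
rho_vap = PropsSI("D", "T", T_evap, "Q", 1, "R245fa")         # saturated vapor density [kg/m^3]

print(f"R245fa at {T_evap - 273.15:.0f} degC: {p_sat / 1e5:.2f} bar, "
      f"{h_vap / 1e3:.1f} kJ/kg, {rho_vap:.1f} kg/m^3")
```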

2.2 Machine learning

In recent years more interest has arisen in the use of machine learning (ML) as a solution for a wide variety of modeling problems. Also in the field of energy engineering there has been a significant interest recently in the use of machine learning techniques to tackle various modeling and optimization problems. Mosavi et al. [30] found an exponential growth in the number of energy related papers using machine learning, rising from only a few dozen papers published yearly only a decade ago to several hundred in recent years.

Machine learning is a subset of the much broader field of artificial intelligence (AI). Although there exists no single well established definition of what exactly the term machine learning should entail, a good definition is proposed by Murphy [31], who describes machine learning as:

”A set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty.”

2.2.1 Types of machine learning

In general one can distinguish three categories of machine learning algorithms, namely supervised learning, unsupervised learning and reinforcement learning. In supervised learning, it is assumed that an unknown function f maps a set of input parameters (x_1, ..., x_n) to a certain output parameter y. The objective is then to find, using the data available in the training set, a function h, called the hypothesis, that best approximates f [32]. In most cases, the variable y is either an element of R, in which case we speak of a regression problem, or an element of a discrete set {y_1, y_2, y_3, ...}, in which case we are dealing with a classification problem. Unsupervised learning, on the other hand, focuses on problems where there exist no clearly defined input-output relations. The goal is instead to be able to recognize patterns and structure large datasets. Reinforcement learning, ultimately, is the most recent domain in machine learning and focuses on self-learning algorithms that are able to gather data autonomously (e.g. an autonomous car that learns to drive itself) [33]. For the purpose of modeling clearly defined physical relations such as the behavior of an ORC system based on available experimental data, considering the problem as a form of supervised learning seems the most appropriate. Since the output parameters of the ORC system are continuous values, the use of regression techniques will be required.

Table 2.1: Comparison between statistics and machine learning.

Classical statistics                       Machine learning
Strong hypothesis of an idealized model    Highly complex nonlinear models
Good insight in models possible            Only high accuracy matters
Limited computational power required       Can be highly computationally intensive
Based on mathematical reasoning            Largely empirical approach
Tabular data                               More complex data structures allowed
Fixed dataset                              Algorithm may generate own data

2.2.2 Comparison with statistics

As can be deduced from the definition given earlier, machine learning has a lot in common with classical statistical methods. Though the exact distinction between the two fields may be partially subjective, there are some important distinctions to be noted. Classical statistical modeling usually starts out by assuming that the available data is the result of an idealized (ideally linear) stochastic model. Statistical methods derived from probability theory (e.g. maximum likelihood estimators) are then used to make an estimate for the model’s parameters. While these methods have several advantages, such as the limited amount of computational power required, their rigorous mathematical foundations and the fact that they are often relatively intuitive to interpret, the strong idealizations on which they are based limit their usefulness. In many cases, the relations in the data are highly complex and involve a large amount of nonlinearities, and in these cases, machine learning methods are required to obtain accurate results.

Contrary to classical statistical methods, machine learning techniques treat the data mechanism as a "black box". On the condition that sufficient data is available, a complex general-purpose model that can incorporate a high degree of nonlinearity is then trained to fit the dataset. The model architecture and training algorithms may be inspired by statistical methods, but often only have an empirical basis. Machine learning generally requires large datasets and substantial computing power to work, and the resulting models are difficult to interpret intuitively. They can however make accurate predictions for a wide variety of complicated problems that were previously unsolvable using classical techniques. An overview of the differences between classical statistics and machine learning is given in Table 2.1 [34, 35].


2.2.3 The bias-variance tradeoff

One of the key issues that are encountered when bringing machine learning into practice is what is called the bias-variance tradeoff. The bias-variance tradeoff is an elemental principle in the field of statistical learning theory that holds true for all forms of supervised learning, and pertains to the limitations that the availability of data poses to the ability of training more complex models. Assume that the predicted variable y is given by the relation:

y = f (x) +  (2.1)

withx the vector of input parameters, f an unknown function and the ”noise”  a normally distributed variable with mean zero and standard deviation σ. To model the relationship,

a function hθ(x) is proposed, with θ representing the unknown model parameters. Given

a set of sample data, D = (X, y), the machine learning algorithm will fit the model to this sample dataset and produce an estimate ˆθ. To evaluate the accuracy of the eventual model hθˆ(x), a metric such as the mean squared error (MSE) can be introduced:

( ˆθ) = 1 n · n X i=1 (hθˆ(x(i))− y(i))2 (2.2)

Using this formula, two distinct errors can be defined: the training error $\epsilon_{train}$ calculated on the sample dataset $D$, and the generalization error $\epsilon_{gen}$ calculated on a different set of previously unseen data points, often called the validation dataset. It is important to note the distinction between these two error values. Suppose that one starts out with a relatively simple model with few parameters $\theta$; the two error values will then lie close to each other. As the model is subsequently made more complex by increasing the number of learned parameters, $\epsilon_{train}$ will continuously improve, until the model contains so many parameters that it can perfectly predict every point in the training dataset. This phenomenon is known as overfitting: the training error equals zero, but testing the model on unseen data points will nonetheless result in a very significant generalization error $\epsilon_{gen}$.

This raises the question of how many parameters the model should contain in order to obtain a minimum value for $\epsilon_{gen}$ given the available dataset $D$. It can be proven from statistical learning theory [36] that the expected value $E(\epsilon_{gen})$ is given by:

$E(\epsilon_{gen}) = Bias^2 + Variance + Noise$  (2.3)

with:

$Bias^2 = \sum_i \left(f(\mathbf{x}_i) - E(h_{\hat{\theta}}(\mathbf{x}_i))\right)^2$  (2.4)

$Variance = \sum_i E\left[\left(h_{\hat{\theta}}(\mathbf{x}_i) - E(h_{\hat{\theta}}(\mathbf{x}_i))\right)^2\right]$  (2.5)


and:

$Noise = \sum_i \sigma^2$  (2.6)

The values $\mathbf{x}_i$ form the validation dataset. The bias term can be interpreted as the expected generalization error if an infinite amount of training data were to be available. The more complex $h_{\theta}(\mathbf{x})$ becomes, the smaller its bias will be. The variance on the other hand indicates how strongly the model performance fluctuates based on the choice of the training dataset $D$. The higher the complexity of the model, the more it becomes prone to overfitting. While the bias decreases, the variance thus simultaneously increases with the complexity of the model. This is generally known as the bias-variance tradeoff and is illustrated in Figure 2.4; the goal is to find a model that lies close to the optimal bias-variance equilibrium [37].

In statistical learning theory, a common way to describe the complexity of a model is its so-called Vapnik-Chervonenkis (VC) dimension. The VC dimension is defined as the largest possible size of a set of data points $S = \{\mathbf{x}_1, \mathbf{x}_2, ...\}$ ($\mathbf{x}_i \neq \mathbf{x}_j$, $i \neq j$) to which the model can perfectly fit any possible choice of labels $Y = \{y_1, y_2, ...\}$. For example, for a linear regression model the VC dimension is equal to the number of input features plus one. Factors (such as the number of input features) that determine the properties and the VC dimension of the machine learning model are generally known as hyperparameters. Under a number of assumptions, it can be proven that for a supervised learning model there is a probability of at least $1 - \delta$ that:

$\epsilon_{gen} \leq \epsilon_{bias} + O\left(\sqrt{\frac{D}{n}\log\frac{n}{D} + \frac{1}{n}\log\frac{1}{\delta}}\right)$  (2.7)

with $D$ the VC dimension and $n$ the size of the training set [38]. This equation confirms what could be understood intuitively in the previous paragraph, namely that increasing the model complexity leads to overfitting and an increase in variance. It also shows that having a large enough training set can be an effective way to reduce variance in a model [39].
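To make the tradeoff concrete, the short Python sketch below fits models of increasing complexity and prints the training and validation MSE, reproducing the behavior of Figure 2.4. It is purely illustrative and not part of the experimental work in this thesis; the synthetic target function, noise level and polynomial degrees are arbitrary assumptions:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data following y = f(x) + noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * x[:, 0]) + rng.normal(scale=0.2, size=200)

x_train, y_train = x[:100], y[:100]   # sample dataset D
x_val, y_val = x[100:], y[100:]       # unseen validation data

# Increasing the polynomial degree increases the model complexity
for degree in [1, 3, 10, 25]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    err_train = mean_squared_error(y_train, model.predict(x_train))
    err_val = mean_squared_error(y_val, model.predict(x_val))
    print(f"degree {degree:2d}: train MSE = {err_train:.3f}, validation MSE = {err_val:.3f}")

For low degrees both errors are similar (high bias), while for the highest degree the training error keeps shrinking but the validation error grows (high variance), which is exactly the overfitting behavior described above.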

2.2.4 Artificial neural networks

Artificial neural networks (ANN's) are one of the most well known machine learning algorithms. Under neural networks are understood a range of algorithms that model the relation between the input features and the studied output as a network built up out of layers of artificial neurons. Although they are among the oldest machine learning techniques, the real potential of neural networks was only recognized in the early 2010's with the advent of more powerful computing architecture, the availability of larger datasets and the discovery of more efficient optimization techniques. The main advantage of neural networks is their ability to autonomously learn appropriate data representations. Whereas other ML algorithms in many cases require tedious work to preprocess the data and create a set of appropriate features to feed into the model, a procedure known as feature engineering, the multi-layered structure of an ANN is much less sensitive to the exact form in which the input data is presented. This significantly reduces the difficulty of obtaining a performant model. A second reason for the popularity of neural networks is the promising results that have been obtained for the processing of so-called non-tabular data such as images or sequences (see Section 2.2.4.2). On the downside however, neural networks often require a large amount of data to prevent overfitting. Also, the large number of data points and model parameters often leads to a higher computational cost [40].

Figure 2.4: An illustration of the bias-variance tradeoff [36].

2.2.4.1 Feedforward neural networks

The simplest variant of ANN's are feedforward neural networks. Figure 2.6 shows an illustration of a single neuron of such a network. Each neuron in the network has a certain number of inputs $\{x_1, ..., x_n\}$ and produces a single output $y$, which is obtained by first taking a linear combination of the inputs, with weights $\{w_1, ..., w_n\}$ and a bias constant $b$, and then putting the obtained result through a nonlinear activation function $f$. The values of the weights and bias are determined by the learning process; for the activation function, usually a ReLU ("Rectified Linear Unit") function is chosen, for which it holds:

$f(x) = \begin{cases} 0, & x < 0 \\ x, & x \geq 0 \end{cases}$  (2.8)


Figure 2.5: A simple feedforward neural network with one hidden layer.

Figure 2.6: Illustration of a single artificial neuron, with inputs $x_1, x_2, x_3$, weights $w_1, w_2, w_3$, bias $b$, activation function $f$ and output $y$.

The input-output relation of the neuron can thus be formulated mathematically as:

$y = f(w_1 \cdot x_1 + ... + w_n \cdot x_n + b)$  (2.9)
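As a small illustration (a sketch with assumed example values, not code from this work), Equations (2.8) and (2.9) can be implemented in a few lines of Python:

import numpy as np

def relu(z):
    # Eq. (2.8): returns 0 for z < 0 and z otherwise
    return np.maximum(0.0, z)

def neuron_output(x, w, b):
    # Eq. (2.9): y = f(w1*x1 + ... + wn*xn + b)
    return relu(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 2.0])   # example inputs (arbitrary)
w = np.array([0.1, 0.4, -0.3])   # weights, normally obtained by training
b = 0.05                         # bias constant
print(neuron_output(x, w, b))    # prints 0.0 here, since the weighted sum is negative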

These neurons are combined into several layers which together form the neural network, as is illustrated in Figure 2.5. The input layer represents the selected input features, while the output layer consists of a single neuron that generates the output $y$. In between are one or more so-called hidden layers that take as their input the output values of the previous layer and analogously send their output values as input to the next layer. If only a few hidden layers are used, we speak of a shallow learning model; if a substantial number of hidden layers are used, we speak of a deep learning model.

The number of layers, the number of neurons per layer and the way in which the layers are interconnected can be considered as hyperparameters of the model that have to be defined and often fine-tuned by the researcher. Once the architecture of the network is defined, the network is fit to a training dataset in order to obtain optimal estimates for the weights and biases of all the neurons. This is done by defining a loss function $L(\theta)$, with the vector $\theta$ containing all the weights and biases of the network, that indicates the deviation of the model's predicted values from the actual values. For regression purposes, the mean squared error (MSE) introduced in the previous section to evaluate the model's accuracy is usually chosen as the loss function:

$L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left(h_{\theta}(\mathbf{x}^{(i)}) - y^{(i)}\right)^2$  (2.10)

with $h_{\theta}(\mathbf{x}^{(i)})$ the estimated output for data point $\mathbf{x}^{(i)}$ and $y^{(i)}$ the actual output. Optimal estimates of the parameters are then obtained by finding the global minimum of the loss function. Although many different methods exist for finding the minimum of a multivariate function, most methods used in machine learning are based on the gradient descent algorithm. This optimization algorithm starts by assigning random small values to the neuron weights and biases. It then iteratively calculates the gradient $\nabla L(\theta_i)$ and makes a new estimate $\theta_{i+1}$ according to:

$\theta_{i+1} = \theta_i - \alpha \cdot \nabla L(\theta_i)$  (2.11)

where $\alpha$ is defined as the learning rate. The gradient $\nabla L(\theta_i)$ is calculated using the backpropagation algorithm first proposed by Rumelhart et al. [41], which is able to efficiently calculate the gradients of neural network loss functions.
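The following minimal sketch (a toy example with assumed synthetic data, not code from this thesis) applies Equation (2.11) to a one-dimensional linear regression problem; a real neural network would obtain the gradient via backpropagation instead of the closed-form expression used here:

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=100)   # data generated as y = 2x + 1 + noise

theta = np.zeros(2)    # [theta_0, theta_1], initialized with small values
alpha = 0.1            # learning rate
for _ in range(2000):
    y_hat = theta[0] + theta[1] * x
    # Gradient of the MSE loss with respect to theta_0 and theta_1
    grad = np.array([np.mean(2 * (y_hat - y)),
                     np.mean(2 * (y_hat - y) * x)])
    theta = theta - alpha * grad   # step against the gradient, cf. Eq. (2.11)

print(theta)           # converges to approximately [1.0, 2.0]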

Even so, as machine learning is generally used with large datasets, it would in most cases be impractical if the gradient of the loss function had to be calculated over all the data points in the training set at each step. A solution is to partition the training set into a number of batches, with each batch $j$ being associated with its own loss function $L_j$. At each step, the gradient is then approximated by $\nabla L_j(\theta_i)$, iterating through the different batches.


Figure 2.7: A schematic explanation of the simple-RNN architecture [33].

The batch size b can be considered as an additional hyperparameter of the model. One iteration through all the available batches is also known as one ”epoch” of training the network.

In practice, more advanced optimizers such as RMSprop or Adam are used that are based on gradient descent but are generally more accurate, less computationally intensive and faster to converge [33, 36, 42, 43].
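A compact sketch of how such a feedforward network could be set up with the Keras library is given below. The layer sizes, optimizer settings, batch size and number of epochs are arbitrary assumptions made for illustration, and the data are random placeholders rather than actual ORC measurements:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder data: 5 input features, one continuous output
x_train = np.random.rand(1000, 5)
y_train = np.random.rand(1000)

model = keras.Sequential([
    keras.Input(shape=(5,)),
    layers.Dense(32, activation="relu"),   # first hidden layer
    layers.Dense(32, activation="relu"),   # second hidden layer
    layers.Dense(1),                       # single output neuron
])

# MSE loss (Eq. (2.10)) minimized with the Adam optimizer
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3), loss="mse")

# Mini-batch training: batch_size corresponds to b, epochs to the number of passes over D
model.fit(x_train, y_train, batch_size=32, epochs=50, validation_split=0.2)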

2.2.4.2 Advanced neural networks

While a feedforward neural network may be an appropriate choice when the input features form a set of independent variables, as is the case for example when training a steady-state model of the ORC system, other ANN types may deliver better results when the input features are of a more structured nature. For example, in the field of computer vision, convolutional neural networks (CNN's) have delivered very strong performances [44]. For the modeling of the dynamic behavior of an ORC system, another architecture referred to as a recurrent neural network (RNN) seems to be a potentially interesting solution. The RNN model has been specifically developed to deal with input features of a sequential nature, such as time series. This is done by adding the capability to use a form of internal short-term memory. A schematic illustration of the working of a simple-RNN model, one of the most basic types of RNN's, is given in Figure 2.7.

Whereas in a feedforward neural network the inputs of each layer are processed in one step to obtain the output, the simple-RNN model uses an iterative procedure to determine the output of a layer. In the first iteration, a vector $\mathbf{x}_{in}(t = 0)$ representing the first time step for every input feature is fed into the layer. The vector of output features is then stored as the state $\mathbf{s}(t = 0)$. In subsequent iterations, the output is then given by:

$\mathbf{x}_{out}(t = i) = f(\mathbf{W} \cdot \mathbf{x}_{in}(t = i) + \mathbf{U} \cdot \mathbf{s}(t = i - 1) + \mathbf{b})$  (2.13)

The matrices $\mathbf{U}, \mathbf{W} \in \mathbb{R}^{n \times n}$ contain the weights that are determined during learning, when the output of the final iteration is trained to fit the output value of the sequence. The simple-RNN model provides a basic understanding of how recurrent neural networks function. In practice however, more advanced RNN's such as long short-term memory (LSTM) and gated recurrent unit (GRU) networks often achieve better performances because of their improved longer-term memory capabilities [33].

2.2.5 Ensembling and tree-based methods

One of the main classes of algorithms competing with neural networks on supervised learning problems with tabular data is the family of tree-based methods, such as random forests and gradient boosted trees. These methods are based on two concepts, decision tree learning and ensembling, which will be briefly explained first.

2.2.5.1 Decision trees

Decision trees are relatively simple machine learning models that are easy to interpret. However, their high sensitivity to even small changes in the training data and their tendency to overfit make them an unattractive learning model when viewed in their own right. On the other hand, these same properties make decision trees a perfect candidate for the application of ensembling techniques.

A regression decision tree works by partitioning the space of input features $\mathbf{x} \in \mathbb{R}^n$ into $J$ regions $R_1, R_2, ..., R_J$. For each region, a mean output value $\hat{y}_{R_j}$ is estimated based on the training data points in that region. The relationship $y = f(\mathbf{x})$ can then be visualized as a binary tree, where at each internal node one input feature $x_i$ is evaluated, as is shown in Figure 2.8 (where the right branches correspond to $x_i \geq t_i$ and the left branches to $x_i < t_i$). To train the model, the regions $R_j$ ought to be determined such that the residual sum of squares (RSS):

$RSS = \sum_{j=1}^{J} \sum_{\mathbf{x}_i \in R_j} \left(y_i - \hat{y}_{R_j}\right)^2$  (2.14)

is minimized. As it would be impractical to evaluate every possible combination of regions $R_j$, an approximation is usually calculated using the recursive binary splitting algorithm, which partitions the feature space in a top-down approach by at each step splitting one region in such a way that the RSS improves most significantly. The algorithm stops when a certain threshold for the minimum number of data points in a region is met [45].
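A minimal example with scikit-learn's regression tree is shown below (synthetic data and an arbitrary stopping threshold, purely for illustration); the min_samples_leaf parameter plays the role of the stopping criterion described above:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-5, 5, size=(300, 3))
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=300)

# Stop splitting once a region would contain fewer than 5 training points
tree = DecisionTreeRegressor(min_samples_leaf=5)
tree.fit(X, y)
print(tree.predict(X[:5]))   # each prediction is the mean value of the region the point falls into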


Figure 2.8: Example of a simple decision tree.

2.2.5.2 Ensembling

As was explained earlier in Section 2.2.3, one of the fundamental challenges in machine learning is the bias-variance tradeoff. In those cases where the variance term is the main contributor to the generalization error, ensembling is a powerful technique to improve model accuracy. Ensembling refers to the idea that a combination of high-variance, individually overfitting models can together produce reliable predictions. It has been shown in statistical learning theory that, provided the correlation between the training of the constituting models is limited, ensembling can substantially lower the variance while preventing an increase in bias. For this reason, ensemble methods usually introduce an element of randomness in the learning process to reduce the degree of correlation [36].

2.2.5.3 Random forests

Random forests are one of the two best known decision tree ensemble methods. Their working principle is based on a form of ensembling known as bootstrap aggregation or "bagging". With bagging, the original training dataset $D$ is used to produce a large number of new "bootstrapped" datasets $D_i^{BT}$. These datasets are created by random sampling with replacement from the original dataset $D$, a procedure known as empirical bootstrapping. The new datasets $D_i^{BT}$ can then each be used to train a decision tree that will produce different estimates, which makes it possible to use the ensemble principle to obtain a more stable, aggregated estimate by taking the mean over all the existing trees. An ML model trained following the procedure described above is generally known as a "bagged tree" model. What is generally considered a random forest is very similar to a bagged trees model, but combines the bagging with random feature selection to further decrease the correlation between different trees. Whereas, as was explained earlier, decision trees are usually built up by selecting at each step the optimal split out of all $p$ available predictors, with random forests only a randomly selected subset of $m$ predictor features is available to choose from. A similar algorithm that follows the same philosophy is the extremely randomized trees ("extra trees") algorithm. The extra trees algorithm does not use bagging and is trained on the full sample dataset. Instead, the overall variance is reduced by choosing random threshold values for the features in each tree node [36, 46–48].
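The sketch below shows how both ensembles could be trained with scikit-learn (synthetic data and arbitrary hyperparameter values, for illustration only); the max_features parameter controls the size $m$ of the random feature subset considered at each split:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

X, y = make_regression(n_samples=500, n_features=6, noise=0.1, random_state=0)

# Random forest: bootstrap sampling combined with random feature selection per split
rf = RandomForestRegressor(n_estimators=200, max_features="sqrt", bootstrap=True)
rf.fit(X, y)

# Extra trees: no bagging by default, variance reduced via random split thresholds
et = ExtraTreesRegressor(n_estimators=200, max_features="sqrt", bootstrap=False)
et.fit(X, y)

print(rf.predict(X[:3]))
print(et.predict(X[:3]))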

