3. GLM for thermophysical property prediction ⚗️#

Open in Colab

Goals of this exercise 🌟#

  • We will learn how to apply (generalized) linear regression

  • We will review some performance metrics for assessing regression models

A quick reminder ✅#

The process of “learning” in the context of supervised learning can be seen as exploring a hypothesis space \(\mathcal{H}\) looking for the most appropriate hypothesis function \(h\). In the context of linear regression the hypothesis space is of course the space of linear functions.

Let’s imagine our input space is two-dimensional, continuous and non-negative. This could be denoted mathematically as \(\textbf{x} \in \mathbb{R}_+^2\). For example, for an ideal gas, its pressure is a function of the temperature and volume. In this case, our dataset will be a collection of \(N\) points with temperature and volume values as inputs and pressure as output

\[ \{(\textbf{x}^{(1)}, y^{(1)}), (\textbf{x}^{(2)}, y^{(2)}), ..., (\textbf{x}^{(N)}, y^{(N)}) \} \]

where for each point \(\textbf{x} = [x_1, x_2]^T\). Our hypothesis function would be

\[ h(\textbf{x}, \textbf{w}) = w_0 + w_1x_1 + w_2x_2 \]

where \(\textbf{w} = [w_0, w_1, w_2]^T\) is the vector of model parameters that the machine has to learn. You will soon realize that “learn” means solving an optimization problem to arrive at a set of optimal parameters. In this case, we could for example minimize the sum of squared errors to get the optimal parameters \(\textbf{w}^* \)

\[ \textbf{w}^* = \underset{\textbf{w}}{\operatorname{argmin}} ~~ \frac{1}{2} \sum_{n=1}^N \left( y^{(n)} - h(\textbf{x}^{(n)}, \textbf{w}) \right)^2 \]

This turns out to be a convex problem. This means that there exists only one optimum, which is the global optimum.

Attention

Did you remember the difference between local and global optima?

There are many ways in which this optimization problem can be solved. For example, we can use gradient-based methods (e.g., gradient descent), Hessian-based methods (e.g., Newton’s method) or, in this case, even an analytical solution exists!
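For instance, the analytical solution follows from the normal equations, \(\textbf{w}^* = (\textbf{X}^T\textbf{X})^{-1}\textbf{X}^T\textbf{y}\), where \(\textbf{X}\) is the design matrix with a column of ones for the intercept. Here is a minimal sketch on synthetic data (the data and all variable names below are purely illustrative, not part of the exercise dataset):

import numpy as np

# Synthetic "measurements": two inputs and a noisy linear response
rng = np.random.default_rng(0)
N  = 50
x1 = rng.uniform(300.0, 600.0, N)
x2 = rng.uniform(1.0, 5.0, N)
y  = 2.0 + 0.5*x1 - 3.0*x2 + rng.normal(0.0, 1.0, N)

# Design matrix with a bias column; np.linalg.lstsq solves the least-squares
# problem in a numerically stable way (preferred over inverting X^T X explicitly)
X = np.column_stack([np.ones(N), x1, x2])
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_star)   # approximately [2.0, 0.5, -3.0]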

What about non-linear problems? 🤔#

We can extend this idea to non-linear problems by introducing basis functions \(\phi(\textbf{x})\) that map our original inputs to a different space where the problem becomes linear. Then, in this new space we perform linear regression (or classification), and effectively we are performing non-linear regression (classification) in the original space! Nice trick, right?

This gives rise to what we call Generalized Linear Models (GLMs)!
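Concretely, a GLM keeps the model linear in the parameters \(\textbf{w}\) even though it is non-linear in the original inputs:

\[ h(\textbf{x}, \textbf{w}) = w_0 + \sum_{j=1}^{M} w_j \phi_j(\textbf{x}) \]

where \(M\) is the number of basis functions.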

Linear regression 📉#

Let’s now play around with these concepts by looking at a specific example: regressing thermophysical data of saturated and superheated vapor.

The data is taken from Appendix F of [SVNAS04].

Let’s import some libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split

and then import the data

if 'google.colab' in str(get_ipython()):
  df = pd.read_csv("https://raw.githubusercontent.com/edgarsmdn/MLCE_book/main/references/superheated_vapor.csv")
else:
  df = pd.read_csv("references/superheated_vapor.csv")

df.head()
Pressure Property Liq_Sat Vap_Sat 75 100 125 150 175 200 ... 425 450 475 500 525 550 575 600 625 650
0 1.0 V 1.000 129200.0000 160640.0000 172180.0000 183720.0000 195270.0000 206810.0000 218350.0000 ... NaN 333730.00 NaN 356810.0000 NaN 379880.0000 NaN 402960.0000 NaN 426040.0000
1 1.0 U 29.334 2385.2000 2480.8000 2516.4000 2552.3000 2588.5000 2624.9000 2661.7000 ... NaN 3049.90 NaN 3132.4000 NaN 3216.7000 NaN 3302.6000 NaN 3390.3000
2 1.0 H 29.335 2514.4000 2641.5000 2688.6000 2736.0000 2783.7000 2831.7000 2880.1000 ... NaN 3383.60 NaN 3489.2000 NaN 3596.5000 NaN 3705.6000 NaN 3816.4000
3 1.0 S 0.106 8.9767 9.3828 9.5136 9.6365 9.7527 9.8629 9.9679 ... NaN 10.82 NaN 10.9612 NaN 11.0957 NaN 11.2243 NaN 11.3476
4 10.0 V 1.010 14670.0000 16030.0000 17190.0000 18350.0000 19510.0000 20660.0000 21820.0000 ... NaN 33370.00 NaN 35670.0000 NaN 37980.0000 NaN 40290.0000 NaN 42600.0000

5 rows × 37 columns

Many things to notice from observing the data above:

  • We have four different properties, V, U, H and S, referring to the specific volume [cm\(^3\) g\(^{-1}\)], internal energy [kJ kg\(^{-1}\)], enthalpy [kJ kg\(^{-1}\)] and entropy [kJ kg\(^{-1}\) K\(^{-1}\)], respectively.

  • We have values for each property at different pressures [kPa] and temperatures [°C].

  • We also have the value of each property at each pressure for the saturated liquid and the saturated vapor.

Since we plan to build a separate model for each property, let’s split the data by property.

V = df.loc[df['Property'] == 'V']
U = df.loc[df['Property'] == 'U']
H = df.loc[df['Property'] == 'H']
S = df.loc[df['Property'] == 'S']

Plotting the data, whenever possible, is always useful! So, let’s plot, for instance, the saturated liquid data to see what trends it follows

property_to_plot = 'Liq_Sat'

# Plot saturated liquid
plt.figure(figsize=(13, 7))
plt.subplot(221)
plt.plot(V['Pressure'], V[property_to_plot], 'kx', markersize=3)
plt.xlabel('Pressure [kPa]')
plt.ylabel('Specific volume [$cm^3 g^{-1}$]')

plt.subplot(222)
plt.plot(U['Pressure'], U[property_to_plot], 'kx', markersize=3)
plt.xlabel('Pressure [kPa]')
plt.ylabel('Specific internal energy [$kJ kg^{-1}$]')

plt.subplot(223)
plt.plot(H['Pressure'], H[property_to_plot], 'kx', markersize=3)
plt.xlabel('Pressure [kPa]')
plt.ylabel('Specific enthalpy [$kJ kg^{-1}$]')

plt.subplot(224)
plt.plot(S['Pressure'], S[property_to_plot], 'kx', markersize=3)
plt.xlabel('Pressure [kPa]')
plt.ylabel('Specific entropy [$kJ kg^{-1} K^{-1}$]')

plt.suptitle(property_to_plot, size=15)
plt.show()
[Figure: saturated-liquid V, U, H and S as functions of pressure]

We can also get a quick statistical description of the data by calling the method describe() on a DataFrame or Series

V['Liq_Sat'].describe()
count    136.000000
mean       1.215382
std        0.128941
min        1.000000
25%        1.111500
50%        1.191000
75%        1.309750
max        1.504000
Name: Liq_Sat, dtype: float64

Well, the trend is clearly non-linear over the whole range of pressures, right?

In this exercise we will focus on a single property; let’s take, for instance, the specific volume of the saturated liquid.

Our goal is to build a mathematical model (for now using (generalized) linear regression), that predicts the specific volume of a saturated liquid as a function of the pressure. This means that we are dealing with a one-dimensional problem.

It is also quite noticeable that a simple straight line is not going to fit the data very well over the whole range of pressures. Do you see why looking at your data first is important? This process is formally known as Exploratory Data Analysis (EDA) and refers to the general task of analyzing the data using statistical metrics, visualization tools, etc.

Important

Before starting modeling, explore your data!

For now, let’s assume we only know how to do simple linear regression. One idea would be to split the data into sections and approximate each section with a different linear regression. This, of course, will result in a discrete model that is not very handy… But let’s try.

lim_1 = 300
lim_2 = 1500

plt.figure()
plt.plot(V['Pressure'], V['Liq_Sat'], 'kx', markersize=3)
plt.axvline(lim_1, linestyle='--', color='r')
plt.axvline(lim_2, linestyle='--', color='r')
plt.xlabel('Pressure [kPa]')
plt.ylabel('Specific volume [$cm^3 g^{-1}$]')
plt.title('Saturated liquid', size=15)
plt.show()
[Figure: saturated-liquid specific volume vs. pressure, with the two section limits marked]

Let’s extract these sections from our data

First_P  = V['Pressure'].loc[V['Pressure'] < lim_1].to_numpy().reshape(-1, 1)
First_V  = V['Liq_Sat'].loc[V['Pressure'] < lim_1].to_numpy().reshape(-1, 1)

Second_P = V['Pressure'].loc[(V['Pressure'] >= lim_1) & 
                                  (V['Pressure'] < lim_2)].to_numpy().reshape(-1, 1)
Second_V = V['Liq_Sat'].loc[(V['Pressure'] >= lim_1) & 
                                  (V['Pressure'] < lim_2)].to_numpy().reshape(-1, 1)

Third_P  = V['Pressure'].loc[V['Pressure'] >= lim_2].to_numpy().reshape(-1, 1)
Third_V  = V['Liq_Sat'].loc[V['Pressure'] >= lim_2].to_numpy().reshape(-1, 1)

Now, we fit the 3 models

LR_1 = LinearRegression().fit(First_P, First_V)
LR_2 = LinearRegression().fit(Second_P, Second_V)
LR_3 = LinearRegression().fit(Third_P, Third_V)

# - Plot splits and models
plt.figure()
plt.xlabel('Pressure [kPa]')
plt.ylabel('Specific volume [$cm^3 g^{-1}$]')

# -- First split
plt.plot(First_P, First_V, 'rx', markersize=2, label='1st Section')    
plt.plot(np.linspace(0, lim_1 + 1000, 100), 
         LR_1.predict(np.linspace(0,lim_1 + 1000, 100).reshape(-1, 1)),'r', linewidth=1) 

# -- Second split
plt.plot(Second_P, Second_V, 'bx', markersize=2, label='2nd Section')  
plt.plot(np.linspace(lim_1, lim_2+1000, 100), 
         LR_2.predict(np.linspace(lim_1, lim_2+1000, 100).reshape(-1, 1)),'b', linewidth=1)


# -- Third split
plt.plot(Third_P, Third_V, 'kx', markersize=2, label='3rd Section')    
plt.plot(np.linspace(lim_2, max(Third_P), 100), 
         LR_3.predict(np.linspace(lim_2, max(Third_P), 100).reshape(-1, 1)),'k', linewidth=1)

plt.legend()
plt.show()
[Figure: piecewise linear fits of specific volume for the three pressure sections]

We can check how well our model is fitting our data by printing the R\(^2\) coefficient

The coefficient R\(^2\) is defined as \(1 - \frac{u}{v}\), where \(u\) is the residual sum of squares \(\sum (y - \hat{y})^2\) and \(v\) is the total sum of squares \(\sum (y - \mu_y)^2\). Here, \(y\) denotes the true output value, \(\hat{y}\) denotes the predicted output and \(\mu_y\) stands for the mean of the output data. The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the mean of \(y\), disregarding the input features, would get an R\(^2\) score of 0.0.

print('\n R2 for 1st split:', LR_1.score(First_P, First_V))
print('\n R2 for 2nd split:', LR_2.score(Second_P, Second_V))
print('\n R2 for 3rd split:', LR_3.score(Third_P, Third_V))
 R2 for 1st split: 0.9263208134364597

 R2 for 2nd split: 0.9870087187227413

 R2 for 3rd split: 0.9990370407798591
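As a sanity check, we can reproduce these numbers directly from the definition above; here is a minimal sketch for the first section, reusing the arrays and model defined earlier:

# Manual R2 for the first section; should match LR_1.score(First_P, First_V)
y_true = First_V
y_pred = LR_1.predict(First_P)

u = np.sum((y_true - y_pred)**2)           # residual sum of squares
v = np.sum((y_true - np.mean(y_true))**2)  # total sum of squares
print('R2 (manual):', 1 - u/v)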

Question

What about a training and test split? 😟 Can my model extrapolate to other pressures accurately?
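One way to probe this is to hold out part of the data before fitting. A minimal sketch using train_test_split (already imported above) on the first section, with an arbitrary 80/20 split:

# Hold out 20% of the first section as a test set (the split ratio is illustrative)
P_train, P_test, V_train, V_test = train_test_split(
    First_P, First_V, test_size=0.2, random_state=0)

LR_split = LinearRegression().fit(P_train, V_train)
print('R2 on training data:', LR_split.score(P_train, V_train))
print('R2 on test data    :', LR_split.score(P_test, V_test))

Note that a random split only tells us about interpolation within the observed pressure range; extrapolation to unseen pressures is a different (and harder) question.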

To check the model parameters \(\textbf{w}\)

print('\n Slope for 1st split    :', LR_1.coef_)
print(' Intercept for 1st split:', LR_1.intercept_)

print('\n Slope for 2nd split    :', LR_2.coef_)
print(' Intercept for 2nd split:', LR_2.intercept_)

print('\n Slope for 3rd split    :', LR_3.coef_)
print(' Intercept for 3rd split:', LR_3.intercept_)
 Slope for 1st split    : [[0.00023137]]
 Intercept for 1st split: [1.01438786]

 Slope for 2nd split    : [[6.67710359e-05]]
 Intercept for 2nd split: [1.05921557]

 Slope for 3rd split    : [[3.44236231e-05]]
 Intercept for 3rd split: [1.11072464]

Question

What about scaling? 😟 Does scaling matter for linear regression? Why or why not?
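A quick experiment to convince yourself (a sketch using StandardScaler, already imported above, on the first section): standardizing the input changes the learned coefficients but not the quality of the fit, since ordinary least squares with an intercept is unaffected by an affine rescaling of the features.

# Refit the first section on standardized pressures and compare
scaler      = StandardScaler()
First_P_std = scaler.fit_transform(First_P)

LR_std = LinearRegression().fit(First_P_std, First_V)
print('R2 without scaling:', LR_1.score(First_P, First_V))
print('R2 with scaling   :', LR_std.score(First_P_std, First_V))
print('Slope without scaling:', LR_1.coef_)
print('Slope with scaling   :', LR_std.coef_)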

Multivariate linear regression 🤹#

Our dataset also contains plenty of data corresponding to superheated vapor. We can take a look at it to see how we can create a mathematical model for it. This would be useful later for process simulation or process optimization.

Let’s pick just one property for now, the enthalpy.

#%matplotlib qt #Uncomment this line to plot in a separate window (not available in Colab)

fig = plt.figure(figsize=(10, 5))
# Create 3D axes directly on the figure
ax = fig.add_subplot(projection='3d')

x = H['Pressure']
y = np.array([int(col) for col in H.columns[4:]])   # temperature columns start at '75'
X,Y = np.meshgrid(x,y)
Z = H.loc[:, '75':'650']

ax.scatter(X, Y, Z.T)
ax.set_xlabel('Pressure [kPa]')
ax.set_ylabel('Temperature [°C]')
ax.set_zlabel('Specific enthalpy [$kJ kg^{-1}$]')
ax.set_title('Superheated vapor', size=15)
plt.show()
[Figure: 3D scatter plot of superheated-vapor specific enthalpy vs. pressure and temperature]

It seems like a multivariate linear regression would fit our experimental data quite well, right?

You will notice that the enthalpy data contains NaN values (i.e., missing values), which we should remove before fitting our model

Ps   = X.reshape(-1,1)
Ts   = Y.reshape(-1,1)
Hs   = np.array(Z.T).reshape(-1,1)

# -- Clean data to eliminate NaN which cannot be used to fit the LR
H_bool  = np.isnan(Hs)
P_clean = np.zeros(len(Ps)-np.count_nonzero(H_bool))
T_clean = np.zeros(len(P_clean))
H_clean = np.zeros(len(P_clean))

j = 0
for i in range(Ps.shape[0]):
    if H_bool[i] == False:
        P_clean[j] = Ps[i]
        T_clean[j] = Ts[i]
        H_clean[j] = Hs[i]
        j += 1
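As a side note, the same cleaning can be written more compactly with a boolean mask; a minimal sketch that produces the same arrays:

# Keep only the entries where the enthalpy is not NaN
mask    = ~np.isnan(Hs).flatten()
P_clean = Ps.flatten()[mask]
T_clean = Ts.flatten()[mask]
H_clean = Hs.flatten()[mask]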

Exercise - multi-variate linear regression ❗❗#

  • Fit a multi-variate linear regression to the enthalpy data

  • What are the model parameters?

  • What is the R\(^2\)?

  • Plot your model using plot_surface along with the experimental data

# Your code for the linear regression here
# Your code for model parameters here
# Your code for R2 here
# Your code for plotting here

Generalized Linear Regression 📈#

Previously, we sectioned the specific volume data of the saturated liquid into three parts and approximated each part with a linear regression. In this way, we obtained a discrete model for the whole range of pressures. This type of discrete model can sometimes be troublesome when applied to optimization problems (e.g., it has points where the gradient does not exist). A non-linear smooth model would be preferable (e.g., some sort of polynomial).

Ps = V['Pressure'].to_numpy().reshape(-1,1)
Vs = V['Liq_Sat'].to_numpy().reshape(-1,1)

Let’s create some polynomial basis functions. For instance:

  • \(\phi_1(x) = x\)

  • \(\phi_2(x) = x^2\)

  • \(\phi_3(x) = x^3\)

  • \(\phi_4(x) = x^4\)

Notice that by using these basis functions we move the problem from one dimension to four dimensions, where the new input features are given by the corresponding basis functions. We will integrate this series of transformations into a pipeline just to exemplify its use.

pf = PolynomialFeatures(degree=4, include_bias=False)
LR = LinearRegression()

# Define pipeline
GLM = Pipeline([("pf", pf),  ("LR", LR) ])
GLM.fit(Ps, Vs)
Pipeline(steps=[('pf', PolynomialFeatures(degree=4, include_bias=False)),
                ('LR', LinearRegression())])

A pipeline helps us sequentially apply a list of transformations to the data. This prevents us from forgetting to apply one of the transformations when using our model.
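For example, new raw pressure values can be fed straight into the pipeline and the polynomial features are generated internally (the pressures below are arbitrary):

# Predict the specific volume at a few new pressures (values in kPa, chosen arbitrarily)
P_new = np.array([[50.0], [500.0], [5000.0]])
print(GLM.predict(P_new))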

Let’s plot now the GLM

evaluation_points = np.linspace(Ps[0], Ps[-1], 100)

plt.figure()
plt.plot(Ps, Vs, 'kx', markersize=3)
plt.plot(evaluation_points, GLM.predict(evaluation_points), 'b', linewidth=1)
plt.title('Polynomial regression', size=15)
plt.xlabel('Pressure [kPa]')
plt.ylabel('Specific volume [$cm^3 g^{-1}$]')
plt.xlim((0, Ps[-1]))
plt.ylim((Vs[0], Vs[-1]))
plt.show()
[Figure: fourth-order polynomial fit of specific volume vs. pressure]
# Print model parameters
print('Parameters: ', GLM['LR'].coef_)
print('Intercept : ', GLM['LR'].intercept_)
Parameters:  [[ 1.02749632e-04 -1.93838195e-08  2.14712769e-12 -8.15517498e-17]]
Intercept :  [1.03748932]
print('\n R2 for GLM:', GLM.score(Ps, Vs))
 R2 for GLM: 0.9970238146790564

Exercise - polynomial regression ❗❗#

  • Create a function that takes as inputs 1) the pressure data and 2) the model parameters and returns the specific volume.

  • Use the polynomial model parameters obtained with sklearn as 2)

  • Plot the predictions along with the experimental data. Do you obtain the same plot?

def poly_regression(Ps, model_params):
    w_0 = model_params[0]
    w_1 = model_params[1]
    w_2 = model_params[2]
    w_3 = model_params[3]
    w_4 = model_params[4]
    
    # Your code here

# Your code for using the function here
# Your code for plotting here

Using a different basis#

Generalized Linear Models are not restricted to polynomial features. We can, in principle, use any basis functions we want. How can we use, for example, a logarithmic basis function?

class log_feature(BaseEstimator, TransformerMixin):
    """Custom transformer that appends log(x) as an extra feature column."""

    def fit(self, X, y=None):
        # Nothing to learn from the data; returning self keeps the sklearn API happy
        return self

    def transform(self, X):
        # Stack the original feature and its logarithm: [x, log(x)]
        X_log = np.log(X)
        X_new = np.hstack((X, X_log))
        return X_new

log_f = log_feature()

# Use a fresh LinearRegression so the polynomial pipeline above is left untouched
GLM_log = Pipeline([("log_f", log_f), ("LR", LinearRegression())])
GLM_log.fit(Ps, Vs)
Pipeline(steps=[('log_f', log_feature()), ('LR', LinearRegression())])
evaluation_points = np.linspace(Ps[0], Ps[-1], 100)

plt.figure()
plt.plot(Ps, Vs, 'kx', markersize=3)
plt.plot(evaluation_points, GLM_log.predict(evaluation_points), 'b', linewidth=1)
plt.title('Generalized linear regression', size=15)
plt.xlabel('Pressure [kPa]')
plt.ylabel('Specific volume [$cm^3 g^{-1}$]')
plt.xlim((0, Ps[-1]))
plt.ylim((Vs[0], Vs[-1]))
plt.show()
[Figure: logarithmic-basis fit of specific volume vs. pressure]
# Print model parameters
print('Parameters: ', GLM_log['LR'].coef_)
print('Intercept : ', GLM_log['LR'].intercept_)
Parameters:  [[3.13403431e-05 2.00436068e-02]]
Intercept :  [0.95683431]
print('\n R2 for GLM:', GLM_log.score(Ps, Vs))
 R2 for GLM: 0.998201429670517

Regarding physical interpretability, what advantages do you see in using the logarithmic basis function vs. the polynomial one?

Hint

Can the pressure become negative?

Exercise - Generalized linear regression ❗❗#

  • Create a function that takes as inputs 1) the pressure data and 2) the model parameters and returns the specific volume.

  • Use the logarithmic model parameters obtained with sklearn as 2)

  • Plot the predictions along with the experimental data. Do you obtain the same plot?

# Your code here

References#

SVNAS04

Joseph Mauk Smith, Hendrick C Van Ness, Michael M Abbott, and Mark Thomas Swihart. Introduction to chemical engineering thermodynamics. McGraw-Hill Education, 7th edition, 2004.