{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "mltE3GprMezc"
},
"source": [
"# 3. GLM for thermophysical property prediction ⚗️\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JkjRqcpYdrNf"
},
"source": [
"## Goals of this exercise 🌟"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zpHMQNK1dsZn"
},
"source": [
"* We will learn how to apply (generalized) linear regression\n",
"* We will review some performance metrics for assessing regression models"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zlpGaWn6e50X"
},
"source": [
"## A quick reminder ✅"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xZkte0lwfAts"
},
"source": [
"The process of \"learning\" in the context of supervised **learning** can be seen as exploring a hypothesis space $\\mathcal{H}$ looking for the most appropriate hypothesis function $h$. In the context of linear regression the hypothesis space is of course the space of linear functions.\n",
"\n",
"Let's imagine our input space is two-dimensional, continuous and non-negative. This could be denoted mathematically as $\\textbf{x} \\in \\mathbb{R}_+^2$. For example, for an ideal gas, its pressure is a function of the temperature and volume. In this case, our dataset will be a collection of $N$ points with temperature and volume values as inputs and pressure as output\n",
"\n",
"$$\n",
"\\{(\\textbf{x}^{(1)}, y^{(1)}), (\\textbf{x}^{(2)}, y^{(2)}), ..., (\\textbf{x}^{(N)}, y^{(N)}) \\}\n",
"$$\n",
"\n",
"where for each point $\\textbf{x} = [x_1, x_2]^T$. Our hypothesis function would be\n",
"\n",
"$$\n",
"h(\\textbf{x}, \\textbf{w}) = w_0 + w_1x_1 + w_2x_2\n",
"$$\n",
"\n",
"where $\\textbf{w} = [w_0, w_1, w_2]^T$ is the vector of model parameters that the machine has to learn. You will soon realize that \"learn\" means solving an optimization problem to arrive at a set of optimal parameters. In this case, we could, for example, minimize the sum of squared errors to get the optimal parameters $\\textbf{w}^* $\n",
"\n",
"$$\n",
"\\textbf{w}^* = \\underset{\\textbf{w}}{\\mathrm{argmin}} ~~ \\frac{1}{2} \\sum_{n=1}^N \\left( y^{(n)} - h(\\textbf{x}^{(n)}, \\textbf{w}) \\right)^2\n",
"$$\n",
"\n",
"This turns out to be a convex problem, which means that there exists only one optimum: the global optimum.\n",
"\n",
"```{attention}\n",
"Do you remember the difference between local and global optima?\n",
"```\n",
"\n",
"There are many ways in which this optimization problem can be solved. For example, we can use gradient-based methods (e.g., [gradient descent](https://en.wikipedia.org/wiki/Gradient_descent)), Hessian-based methods (e.g., [Newton's method](https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization)) or, in this case, even an [analytical solution](https://math.stackexchange.com/questions/4177039/deriving-the-normal-equation-for-linear-regression) exists!"
]
},
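{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration, here is a minimal sketch of the analytical (normal-equation) solution, using synthetic ideal-gas data generated on the spot (the values below are made up for illustration and are not taken from any table):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Synthetic ideal-gas data: P = nRT/V with n*R = 8.314 (illustrative values)\n",
"rng = np.random.default_rng(0)\n",
"T = rng.uniform(300.0, 600.0, 50)   # temperatures in K\n",
"V = rng.uniform(1.0, 5.0, 50)       # volumes in m^3\n",
"P = 8.314 * T / V                   # pressures in Pa\n",
"\n",
"# Design matrix with a bias column: h(x, w) = w0 + w1*x1 + w2*x2\n",
"X = np.column_stack([np.ones_like(T), T, V])\n",
"\n",
"# Normal equation: w* = (X^T X)^{-1} X^T y, solved without an explicit inverse\n",
"w_star = np.linalg.solve(X.T @ X, X.T @ P)\n",
"sse = 0.5 * np.sum((P - X @ w_star) ** 2)\n",
"print(w_star, sse)\n",
"```\n",
"\n",
"Since the true relation $P = nRT/V$ is non-linear in $V$, the linear fit is only approximate; this is exactly the motivation for the basis functions discussed next."
]
},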
{
"cell_type": "markdown",
"metadata": {
"id": "CK73IYAeq-aa"
},
"source": [
"### What about non-linear problems? 🤔\n",
"\n",
"We can extend this concept to cover non-linear problems by introducing basis functions $\\phi(\\textbf{x})$ that map our original inputs to a different space where the problem becomes linear. Then, by performing linear regression (or classification) in this new space, we are effectively performing non-linear regression (classification) in the original space! Nice trick, right?\n",
"\n",
"This gives rise to what we call Generalized Linear Models (GLMs)!"
]
},
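{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance, a polynomial basis $\\phi(x) = [1, x, x^2]^T$ lets plain linear regression fit a quadratic. Here is a minimal sketch with scikit-learn (the data below are hypothetical, just to show the mechanics):\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import PolynomialFeatures\n",
"\n",
"# Hypothetical non-linear data: y = 1 + 2x^2 on a 1-D input\n",
"x = np.linspace(0.0, 2.0, 40).reshape(-1, 1)\n",
"y = 1.0 + 2.0 * x.ravel() ** 2\n",
"\n",
"# phi(x) = [1, x, x^2]: ordinary linear regression in the feature space\n",
"model = Pipeline([\n",
"    (\"phi\", PolynomialFeatures(degree=2)),\n",
"    (\"lin\", LinearRegression()),\n",
"])\n",
"model.fit(x, y)\n",
"r2 = model.score(x, y)\n",
"print(r2)\n",
"```\n",
"\n",
"The fit is exact (up to numerical precision) because the target is itself quadratic in $x$."
]
},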
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Linear regression 📉"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "O8HLcupPMy5-"
},
"source": [
"Let's now play around with these concepts by looking at a specific example: regressing thermophysical data of saturated and superheated vapor.\n",
"\n",
"The data are taken from Appendix F of {cite}`smith2004introduction`.\n",
"\n",
"Let's import some libraries."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"id": "H-wT_7-7MYwQ"
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from matplotlib import cm\n",
"from mpl_toolkits.mplot3d import Axes3D\n",
"from sklearn.preprocessing import StandardScaler, PolynomialFeatures\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.base import BaseEstimator, TransformerMixin\n",
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Kj6uyrseJA6X"
},
"source": [
"and then import the data"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "OqAckSZ7NWls"
},
"outputs": [
{
"data": {
"text/html": [
"<p>(Preview of the imported DataFrame: columns <code>Pressure</code>, <code>Property</code> (V, U, H, S), <code>Liq_Sat</code>, <code>Vap_Sat</code>, and temperature columns 75&ndash;650; 5 rows &times; 37 columns.)</p>\n", "