%% Cell type:markdown id: tags:
## The CRISP-DM Process
> Cross-industry standard process for data mining, also known as CRISP-DM, is an open standard process model that describes common approaches used by data mining experts. It is the most widely-used analytics model.
>
> -- Source: https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining
<p align="center">
<img src="https://upload.wikimedia.org/wikipedia/commons/b/b9/CRISP-DM_Process_Diagram.png" width="400" />
</p>
CRISP-DM breaks the process of data mining into six major phases:
- Business Understanding
- Data Understanding
- Data Preparation
- Modeling
- Evaluation
- Deployment
The sequence of the phases is not strict and moving back and forth between different phases is usually required. The arrows in the process diagram indicate the most important and frequent dependencies between phases. The outer circle in the diagram symbolizes the cyclic nature of data mining itself. A data mining process continues after a solution has been deployed. The lessons learned during the process can trigger new, often more focused business questions, and subsequent data mining processes will benefit from the experiences of previous ones.
**Disclaimer**: Because we are not solving a real-world data science project, we skip the **Business Understanding** and **Deployment** steps. However, in my experience, these are the most important steps for delivering business value.
%% Cell type:markdown id: tags:
## Task Description: House Prices - Advanced Regression Techniques
This notebook follows the idea of the "House Prices - Advanced Regression Techniques" competition on Kaggle. The dataset for this competition was compiled by Dean De Cock for use in data science education. It was designed as a modernized and expanded alternative to the Boston Housing dataset. More details on this dataset are described in [Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project](http://jse.amstat.org/v19n3/decock.pdf).
>**Goal**: It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable.
>
>**Metric**: Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
>
> -- description taken from [Kaggle](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview/evaluation)
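To make the quoted metric concrete, the following cell sketches how the RMSE between log prices could be computed; the price values used here are made up purely for illustration.
%% Cell type:code id: tags:
``` python
# Minimal sketch of the competition metric: RMSE between the logarithm of the
# predicted and the logarithm of the observed sale price.
# The values below are made up purely for illustration.
import numpy as np

y_true = np.array([200000, 150000, 340000])   # hypothetical observed sale prices
y_pred = np.array([210000, 140000, 310000])   # hypothetical predicted sale prices

rmse_log = np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))
print(f"RMSE on log prices: {rmse_log:.4f}")
```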
%% Cell type:markdown id: tags:
## Install & import packages
%% Cell type:code id: tags:
``` python
!pip install -r requirements.txt
```
%% Cell type:code id: tags:
``` python
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from scipy import stats # statistical functions
import os # access to operating system related functions
# plotting libraries
import seaborn as sns
import matplotlib.pyplot as plt
# ml related libraries
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.model_selection import GridSearchCV
from category_encoders.target_encoder import TargetEncoder
```
%% Cell type:code id: tags:
``` python
# plot inline
%matplotlib inline
```
%% Cell type:markdown id: tags:
## Read data
In the next cell, we download the data from a URL and separate the features X from the target variable y. Then we split the data into a train set and a test set.
%% Cell type:code id: tags:
``` python
# download original data
data = pd.read_csv("http://jse.amstat.org/v19n3/decock/AmesHousing.txt", sep='\t')
# get features and target
X, y = data.drop(['PID', 'Order', 'SalePrice'], axis=1), data['SalePrice']
# split into train and testset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
```
%% Cell type:code id: tags:
``` python
# read more about the data
with open('./data_description.txt', 'r') as file:
    description = file.read()
print(description)
```
%% Cell type:markdown id: tags:
## Data Understanding
### Gathering basic information about our data
%% Cell type:code id: tags:
``` python
# displaying first rows of data set
X_train.head(5)
```
%% Cell type:code id: tags:
``` python
# get information about data types
X_train.info()
```
%% Cell type:code id: tags:
``` python
# check the number of samples and features
print("The X_train data size is: {}".format(X_train.shape))
print("The X_test data size is: {}".format(X_test.shape))
```
%% Cell type:markdown id: tags:
### Plotting target variable
Since we are interested in forecasting the house price, we first have a look at the distribution of the house prices themselves. The first plot shows the distribution of the sale price, while the second plot (a QQ-plot) compares the quantiles of our data against the quantiles of a theoretical normal distribution. If our target variable followed a (perfect) normal distribution, all blue points would lie on the red line. For more information on the second plot, click [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.probplot.html).
%% Cell type:code id: tags:
``` python
# visualize SalePrice (target variable) with a fitted normal distribution
sns.distplot(y_train, fit=stats.norm)
# estimate the parameters of the fitted normal distribution for the legend
(mu, sigma) = stats.norm.fit(y_train)
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f})'.format(mu, sigma)], loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')
# also create the QQ-plot against a theoretical normal distribution
fig = plt.figure()
res = stats.probplot(y_train, plot=plt)
plt.show()
```
%% Cell type:markdown id: tags:
#### Task: What are the conclusions that we can draw from these two plots?
Use the cell below to answer this question
%% Cell type:markdown id: tags:
%% Cell type:markdown id: tags:
### Visualize features
Now that we have a better understanding of what we are looking at, we can explore our features visually.
%% Cell type:code id: tags:
``` python
## visualizing numerical features
X_train.select_dtypes(np.number).hist(bins=50, figsize=(30, 20))
plt.show()
```
%% Cell type:code id: tags:
``` python
## visualizing categorical data
categorical_columns = X_train.select_dtypes('object').columns
n_columns = 5
n_rows = len(categorical_columns) // n_columns + 1
fig = plt.figure(figsize=(20, 30))
for idx, column in enumerate(categorical_columns):
    ax = plt.subplot(n_rows, n_columns, idx + 1)
    X_train[column].value_counts().plot(kind='bar')
    ax.set_title(f'Distribution of {column}')
plt.tight_layout()
plt.show()
```
%% Cell type:code id: tags:
``` python
## visualize missing data ratio
X_train_na = (X_train.isnull().sum() / len(X_train)) * 100
X_train_na = X_train_na.drop(X_train_na[X_train_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio': X_train_na})
print(missing_data.head(20))
f, ax = plt.subplots()
plt.xticks(rotation='vertical')
sns.barplot(x=X_train_na.index, y=X_train_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
plt.show()
```
%% Cell type:code id: tags:
``` python
## visualize correlation
sns.heatmap(X_train.select_dtypes(np.number).corr())
plt.show()
```
%% Cell type:markdown id: tags:
**Disclaimer**: Usually, one would conduct an even more in-depth visual analysis of the dataset. For instance, one would investigate the relationship between each feature and the target variable. The Python package [Seaborn](https://seaborn.pydata.org/index.html) provides some good tutorials on data visualisation.
%% Cell type:markdown id: tags:
## Data Preparation
Below is a brief, non-exhaustive overview of the most common data preparation steps.
%% Cell type:markdown id: tags:
### Imputing missing values
Imputing missing values often requires domain knowledge. In our dataset, for instance, there are many columns in which a missing value has a meaning and can therefore be encoded meaningfully. If a value is missing at random, we can only make assumptions about what it should be encoded as. However, there are some advanced imputation techniques, such as k-nearest neighbors or an iterative imputer, that try to make the best guess for us. If you want to read more about them, check out sklearn's [documentation](https://scikit-learn.org/stable/modules/impute.html#impute).
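As a hint, the next cell sketches one possible (but not the required) approach using `SimpleImputer`: filling numeric columns with the median and categorical columns with a constant placeholder.
%% Cell type:code id: tags:
``` python
# Sketch of one possible imputation strategy (not the required solution):
# numeric columns are filled with the median, categorical (object) columns with a
# constant placeholder string.
from sklearn.impute import SimpleImputer

num_cols = X_train.select_dtypes(np.number).columns
cat_cols = X_train.select_dtypes('object').columns

num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='constant', fill_value='None')

X_train_imp = X_train.copy()
X_train_imp[num_cols] = num_imputer.fit_transform(X_train[num_cols])
X_train_imp[cat_cols] = cat_imputer.fit_transform(X_train[cat_cols])
# the test set would be transformed with the already fitted imputers (transform, not fit_transform)
```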
%% Cell type:markdown id: tags:
#### Task: Please impute your missing values
Use the cells below for your code.
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
### Outlier removal
Some models, such as linear regression, are sensitive to outliers. Hence, depending on your model's requirements, you might want to exclude abnormal data points.
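As a starting point, the cell below sketches how 'Gr Liv Area' could be inspected visually; the 4000 sq ft cut-off is only an illustrative assumption, not a fixed rule.
%% Cell type:code id: tags:
``` python
# Sketch: inspect 'Gr Liv Area' against the sale price and flag unusually large houses.
# The 4000 sq ft threshold is an illustrative assumption, not a fixed rule.
plt.scatter(X_train['Gr Liv Area'], y_train, alpha=0.3)
plt.xlabel('Gr Liv Area (sq ft)')
plt.ylabel('SalePrice')
plt.show()

outlier_idx = X_train[X_train['Gr Liv Area'] > 4000].index
print(f"{len(outlier_idx)} potential outliers above 4000 sq ft")
# if they are dropped, remove them from both X_train and y_train:
# X_train, y_train = X_train.drop(outlier_idx), y_train.drop(outlier_idx)
```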
%% Cell type:markdown id: tags:
#### Task: Please investigate the above grade square feet area ('Gr Liv Area') for outliers
Use the cells below for your code.
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
### Feature engineering
In order to maximize our model's performance, we should also look into creating new features. This usually requires domain knowledge. However, there are also automated tools available. One of these tools is called featuretools. Click [here](https://github.com/alteryx/featuretools) for more information.
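As an illustration, the next cell sketches one possible engineered feature, the age of the house at the time of sale; it assumes the columns 'Yr Sold' and 'Year Built' from the data description, and your own feature may of course look different.
%% Cell type:code id: tags:
``` python
# Sketch of one possible new feature: the age of the house at the time of sale.
# Assumes the columns 'Yr Sold' and 'Year Built' as listed in data_description.txt.
house_age = X_train['Yr Sold'] - X_train['Year Built']

plt.scatter(house_age, y_train, alpha=0.3)
plt.xlabel('House age at sale (years)')
plt.ylabel('SalePrice')
plt.show()

print('Correlation with SalePrice:', house_age.corr(y_train))
```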
%% Cell type:markdown id: tags:
#### Task: Please think of a new feature and visualize if it has any correlation with the target variable
Use the cells below for your code.
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
### Encoding of categorical features
All machine learning algorithms in sklearn expect features to be represented as numbers, so categorical features have to be encoded. This transformation can be done in many ways. Among the most popular are [one-hot encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) and [label encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder). If you are interested in other, less common encodings, check out the [category_encoders package](https://contrib.scikit-learn.org/category_encoders/).
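Since `TargetEncoder` is already imported above, the cell below sketches target encoding of the object columns as one possible approach; one-hot encoding (e.g. `pd.get_dummies`) would be an equally valid alternative.
%% Cell type:code id: tags:
``` python
# Sketch of one possible encoding (not the required solution): target encoding of all
# object columns. One-hot encoding, e.g. via pd.get_dummies, is an equally valid choice.
from category_encoders.target_encoder import TargetEncoder

cat_cols = X_train.select_dtypes('object').columns
encoder = TargetEncoder(cols=list(cat_cols))
X_train_enc = encoder.fit_transform(X_train, y_train)
X_test_enc = encoder.transform(X_test)   # reuse the encoder fitted on the training data
```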
%% Cell type:markdown id: tags:
#### Task: Please encode your categorical features as numbers. Also think about numeric variables that are actually categorical.
Use the cells below for your code.
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
### Perform feature selection / extraction
Usually, one would also perform feature selection or feature extraction. Done well, this will most likely improve model performance. However, since we are still in the exploratory phase, we will skip it here. Feature selection is also often useful if you later want to interpret your model and its results. You can read more about feature selection [here](https://scikit-learn.org/stable/modules/feature_selection.html).
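For reference, the cell below sketches a simple univariate feature selection on the numeric columns; it assumes missing values have already been imputed (a median fill is used here only as a stand-in) and is not used further in this notebook.
%% Cell type:code id: tags:
``` python
# Sketch (not used further in this notebook): univariate feature selection on the
# numeric columns with SelectKBest. The median fill is only a stand-in for a proper
# imputation step.
from sklearn.feature_selection import SelectKBest, f_regression

num_cols = X_train.select_dtypes(np.number).columns
X_num = X_train[num_cols].fillna(X_train[num_cols].median())

selector = SelectKBest(score_func=f_regression, k=10)
selector.fit(X_num, y_train)
print('Selected features:', list(num_cols[selector.get_support()]))
```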
%% Cell type:markdown id: tags:
## Modelling
%% Cell type:markdown id: tags:
#### Task: Please train at least two models.
An easy way to get started is to use models from [sklearn](https://scikit-learn.org/stable/index.html).
Use the cells below for your code.
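As a hint, the cell below sketches how two models from the imports above could be fitted; `X_train_prep` is a placeholder name for your fully imputed and encoded training data, and the hyperparameter values are arbitrary starting points.
%% Cell type:code id: tags:
``` python
# Sketch: fit a regularised linear model and a random forest on prepared data.
# `X_train_prep` is a placeholder for your imputed and encoded feature matrix;
# alpha and n_estimators are arbitrary starting values, not tuned choices.
lasso = make_pipeline(RobustScaler(), Lasso(alpha=0.001, max_iter=10000))
lasso.fit(X_train_prep, np.log1p(y_train))   # train on log prices to match the metric

forest = RandomForestRegressor(n_estimators=200, random_state=42)
forest.fit(X_train_prep, np.log1p(y_train))
```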
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
## Evaluation
%% Cell type:markdown id: tags:
#### Task: Please evaluate your models on the given error metric and use at least one naive benchmark
Use the cells below for your code.
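One possible structure, sketched below, compares each model against a naive benchmark that always predicts the mean training price, using the competition metric (RMSE on log prices); `X_test_prep` and the fitted models are placeholder names from the previous steps.
%% Cell type:code id: tags:
``` python
# Sketch: evaluate on RMSE of log prices and compare against a naive mean predictor.
# `X_test_prep`, `lasso` and `forest` are placeholder names from the previous steps.
def rmse_log(y_true, y_pred):
    return np.sqrt(mean_squared_error(np.log(y_true), np.log(y_pred)))

naive_pred = np.full(len(y_test), y_train.mean())
print('Naive benchmark:', rmse_log(y_test, naive_pred))

for name, model in [('Lasso', lasso), ('Random forest', forest)]:
    pred = np.expm1(model.predict(X_test_prep))   # undo the log1p used during training
    print(name, rmse_log(y_test, pred))
```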
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
#### Task: Perform hyperparameter tuning with cross validation on one of your models. Do the results improve?
Use the cells below for your code.
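A minimal sketch of a grid search over the random forest is shown below; the parameter grid is only an illustrative assumption and `X_train_prep` is again a placeholder for your prepared data.
%% Cell type:code id: tags:
``` python
# Sketch: hyperparameter tuning with cross-validation via GridSearchCV.
# The parameter grid is an illustrative assumption; `X_train_prep` is a placeholder.
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10, 20],
}
# score with RMSE on the (already log-transformed) target; lower is better
rmse_scorer = make_scorer(lambda yt, yp: np.sqrt(mean_squared_error(yt, yp)),
                          greater_is_better=False)
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                      scoring=rmse_scorer, cv=5, n_jobs=-1)
search.fit(X_train_prep, np.log1p(y_train))
print(search.best_params_, -search.best_score_)
```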
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
%% Cell type:markdown id: tags:
#### Task: Please evaluate the residuals of your models. Do you consistently under- or overestimate the house prices?
Use the cells below for your code.
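One way to look at the residuals, sketched below, is a histogram of observed minus predicted prices; `X_test_prep` and `forest` are placeholder names for your prepared data and one of your fitted models.
%% Cell type:code id: tags:
``` python
# Sketch: inspect the residuals (observed minus predicted prices) of one model.
# `X_test_prep` and `forest` are placeholder names from the previous steps.
pred = np.expm1(forest.predict(X_test_prep))
residuals = y_test - pred

sns.histplot(residuals, kde=True)
plt.axvline(0, color='red')
plt.xlabel('Residual (observed - predicted SalePrice)')
plt.title(f'Mean residual: {residuals.mean():.0f}')
plt.show()
```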
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
The residuals approximately follow a normal distribution centered around zero, and hence we do not consistently under- or overestimate the house prices.
%% Cell type:markdown id: tags:
## Next Steps
%% Cell type:markdown id: tags:
#### Task: How can you improve the existing model?
Use the cell below for your answer.
%% Cell type:markdown id: tags:
%% Cell type:code id: tags:
``` python
```