%% Cell type:markdown id: tags:
## The CRISP-DM Process
> Cross-industry standard process for data mining, also known as CRISP-DM, is an open standard process model that describes common approaches used by data mining experts. It is the most widely-used analytics model.
>
> -- Source: https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining
<p align="center">
<img src="https://upload.wikimedia.org/wikipedia/commons/b/b9/CRISP-DM_Process_Diagram.png" width="400" />
</p>
CRISP-DM breaks the process of data mining into six major phases:
- Business Understanding
- Data Understanding
- Data Preparation
- Modeling
- Evaluation
- Deployment
The sequence of the phases is not strict, and moving back and forth between phases is usually required. The arrows in the process diagram indicate the most important and frequent dependencies between phases. The outer circle symbolizes the cyclic nature of data mining itself: the process continues after a solution has been deployed. The lessons learned can trigger new, often more focused business questions, and subsequent data mining processes benefit from the experience of previous ones.
**Disclaimer**: Because we are not solving a real-world data science project, we skip the **Business Understanding** and **Deployment** phases. In my experience, however, these are the most important phases for providing business value.
%% Cell type:markdown id: tags:
## Task Description: House Prices - Advanced Regression Techniques
This notebook follows the idea of the "House Prices - Advanced Regression Techniques" competition on Kaggle. The dataset for this competition was compiled by Dean De Cock for use in data science education. It was designed as a successor to the Boston Housing dataset and is now considered a more modernized and expanded version of it. More details are described in [Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project](http://jse.amstat.org/v19n3/decock.pdf).
>**Goal**: It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable.
>
>**Metric**: Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
>
> -- description taken from [Kaggle](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview/evaluation)
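%% Cell type:markdown id: tags:
Written out, this metric (often called RMSLE) is
$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\big(\log \hat{y}_i - \log y_i\big)^2},$$
where $\hat{y}_i$ is the predicted and $y_i$ the observed sale price of house $i$.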
%% Cell type:markdown id: tags:
## Install & import packages
%% Cell type:code id: tags:
``` python
!pip install -r requirements.txt
```
%% Cell type:code id: tags:
``` python
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from scipy import stats # statistical functions
import os # access to operating system related functions
# plotting libraries
import seaborn as sns
import matplotlib.pyplot as plt
# ml related libraries
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, make_scorer
from category_encoders.target_encoder import TargetEncoder
```
%% Cell type:code id: tags:
``` python
# plot inline
%matplotlib inline
```
%% Cell type:markdown id: tags:
## Read data
In the next cell, we download the data from a URL and split it into the features X and the target variable y. Then we create a training and a test set.
%% Cell type:code id: tags:
``` python
# download original data
data = pd.read_csv("http://jse.amstat.org/v19n3/decock/AmesHousing.txt", sep='\t')
# get features and target
X, y = data.drop(['PID', 'Order', 'SalePrice'], axis=1), data['SalePrice']
# split into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
```
%% Cell type:code id: tags:
``` python
# read more about the data
with open('./data_description.txt', 'r') as file:
    description = file.read()
print(description)
```
%% Cell type:markdown id: tags:
## Data Understanding
### Gathering basic information about our data
%% Cell type:code id: tags:
``` python
# display the first rows of the data set
X_train.head(5)
```
%% Cell type:code id: tags:
``` python
# get information about data types
X_train.info()
```
%% Cell type:code id: tags:
``` python
# check the number of samples and features
print("The X_train data size is: {}".format(X_train.shape))
print("The X_test data size is: {}".format(X_test.shape))
```
%% Cell type:markdown id: tags:
### Plotting the target variable
Since we are interested in forecasting the house price, we first have a look at the distribution of the house prices themselves. The first plot shows the distribution of the sale price, while the second plot shows the probability of our data against the quantiles of a specified theoretical distribution. If our target variable followed a (perfect) normal distribution, all blue points would lie on the red line. For more information on the second plot, click [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.probplot.html).
%% Cell type:code id: tags:
``` python
# visualize SalePrice (target variable)
# (sns.distplot is deprecated, so we build the same plot with histplot)
sns.histplot(y_train, stat='density')
# overlay the fitted normal distribution
(mu, sigma) = stats.norm.fit(y_train)
x = np.linspace(y_train.min(), y_train.max(), 200)
plt.plot(x, stats.norm.pdf(x, mu, sigma), 'r-',
         label=r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f})'.format(mu, sigma))
plt.legend(loc='best')
plt.ylabel('Density')
plt.title('SalePrice distribution')
# QQ-plot against the normal distribution
fig = plt.figure()
res = stats.probplot(y_train, plot=plt)
plt.show()
```
%% Cell type:markdown id: tags:
#### Task: What are the conclusions that we can draw from these two plots?
Use the cell below to answer this question.
%% Cell type:markdown id: tags:
%% Cell type:markdown id: tags:
### Visualize features
Now that we have a better understanding of what we are looking at, we can explore our features visually.
%% Cell type:code id: tags:
``` python
## visualizing numerical features
X_train.select_dtypes(np.number).hist(bins=50, figsize=(30, 20))
plt.show()
```
%% Cell type:code id: tags:
``` python
## visualizing categorical data
categorical_columns = X_train.select_dtypes('object').columns
n_columns = 5
n_rows = len(categorical_columns) // n_columns + 1
fig = plt.figure(figsize=(20, 30))
for idx, column in enumerate(categorical_columns):
    ax = plt.subplot(n_rows, n_columns, idx + 1)
    X_train[column].value_counts().plot(kind='bar')
    ax.set_title(f'Distribution of {column}')
plt.tight_layout()
plt.show()
```
%% Cell type:code id: tags:
``` python
## visualize missing data ratio
X_train_na = (X_train.isnull().sum() / len(X_train)) * 100
X_train_na = X_train_na.drop(X_train_na[X_train_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio': X_train_na})
print(missing_data.head(20))
f, ax = plt.subplots()
plt.xticks(rotation='vertical')
sns.barplot(x=X_train_na.index, y=X_train_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
plt.show()
```
%% Cell type:code id: tags:
``` python
## visualize correlation between numeric features
sns.heatmap(X_train.select_dtypes(np.number).corr())
plt.show()
```
%% Cell type:markdown id: tags:
**Disclaimer**: Usually, one would conduct an even more in-depth visual analysis of the dataset. For instance, one would investigate the relationship between each variable and the target variable. The Python package [Seaborn](https://seaborn.pydata.org/index.html) provides some good tutorials on data visualisation.
%% Cell type:markdown id: tags:
## Data Preparation
Below is a brief, non-exhaustive overview of the most common data preparation steps.
%% Cell type:markdown id: tags:
### Imputing missing values
Imputing missing values often requires domain knowledge. In our dataset, for instance, there are many columns in which a missing value has a meaning and can therefore be encoded meaningfully. If a value is missing at random, we can only make assumptions about what it should be encoded as. However, there are advanced imputation techniques, such as k-nearest neighbors or an iterative imputer, that try to make the best guess for us. If you want to read more about them, check out sklearn's [documentation](https://scikit-learn.org/stable/modules/impute.html#impute).
%% Cell type:markdown id: tags:
#### Task: Please impute your missing values
Use the cells below for your code; a reference sketch follows the empty cell.
%% Cell type:code id: tags:
``` python
```
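%% Cell type:markdown id: tags:
For reference, a minimal sketch of one possible strategy. It assumes (as the data description suggests for columns such as the garage and basement features) that a missing categorical value means the feature is absent, and it falls back to median imputation for numeric columns. This is an assumption, not the only valid encoding.
%% Cell type:code id: tags:
``` python
# assumption: NA in most categorical columns means "feature absent"
cat_cols = X_train.select_dtypes('object').columns
num_cols = X_train.select_dtypes(np.number).columns
# categorical: treat missing as its own category
X_train[cat_cols] = X_train[cat_cols].fillna('None')
X_test[cat_cols] = X_test[cat_cols].fillna('None')
# numeric: median imputation, fitted on the training set only
imputer = SimpleImputer(strategy='median')
X_train[num_cols] = imputer.fit_transform(X_train[num_cols])
X_test[num_cols] = imputer.transform(X_test[num_cols])
```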
%% Cell type:markdown id: tags:
### Outlier removal
Some models, like linear regression, are sensitive to outliers. Hence, depending on your model's requirements, you might want to exclude abnormal data points.
%% Cell type:markdown id: tags:
#### Task: Please investigate the above grade square feet area ('Gr Liv Area') for outliers
Use the cells below for your code; a reference sketch follows the empty cell.
%% Cell type:code id: tags:
``` python
```
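%% Cell type:markdown id: tags:
One possible way to look at this (a sketch): plot living area against price and, following De Cock's recommendation in the paper linked above, drop the few houses with more than 4000 square feet.
%% Cell type:code id: tags:
``` python
# scatter plot of living area against sale price
plt.scatter(X_train['Gr Liv Area'], y_train, alpha=0.3)
plt.xlabel('Gr Liv Area')
plt.ylabel('SalePrice')
plt.show()
# De Cock recommends removing houses above 4000 sq ft;
# filtering with a boolean mask keeps X_train and y_train aligned
mask = X_train['Gr Liv Area'] <= 4000
X_train, y_train = X_train[mask], y_train[mask]
```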
%% Cell type:markdown id: tags:
### Feature engineering
To maximize our model's performance, we should also look into creating new features. This usually requires domain knowledge. However, there are also automated tools available, such as featuretools. Click [here](https://github.com/alteryx/featuretools) for more information.
%% Cell type:markdown id: tags:
#### Task: Please think of a new feature and visualize whether it has any correlation with the target variable
Use the cells below for your code; a reference sketch follows the empty cell.
%% Cell type:code id: tags:
``` python
```
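%% Cell type:markdown id: tags:
As an illustration only: total living space is a plausible engineered feature. The sketch below assumes the Ames column names 'Total Bsmt SF', '1st Flr SF' and '2nd Flr SF' (check the data description) and that missing values have already been imputed.
%% Cell type:code id: tags:
``` python
# hypothetical engineered feature: total square footage
for df in (X_train, X_test):
    df['TotalSF'] = df['Total Bsmt SF'] + df['1st Flr SF'] + df['2nd Flr SF']
# visual check against the target
plt.scatter(X_train['TotalSF'], y_train, alpha=0.3)
plt.xlabel('TotalSF')
plt.ylabel('SalePrice')
plt.title('Correlation: {:.2f}'.format(X_train['TotalSF'].corr(y_train)))
plt.show()
```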
%% Cell type:markdown id: tags:
### Encoding of categorical features
All machine learning algorithms in sklearn assume that categorical features are represented as numbers. This transformation can be done in many ways. Among the most popular are [one-hot encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) and [label encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder). If you are interested in less common encoding possibilities, check out the [category_encoders package](https://contrib.scikit-learn.org/category_encoders/).
%% Cell type:markdown id: tags:
#### Task: Please encode your categorical features as numbers. Also think about numeric variables that are actually categorical.
Use the cells below for your code; a reference sketch follows the empty cells.
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
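%% Cell type:markdown id: tags:
A sketch using the TargetEncoder imported above; one-hot encoding via pd.get_dummies would be just as valid. The 'MS SubClass' column name is an assumption based on the data description and serves as an example of a numeric variable that is really categorical.
%% Cell type:code id: tags:
``` python
# 'MS SubClass' is numeric but encodes a categorical building class
for df in (X_train, X_test):
    df['MS SubClass'] = df['MS SubClass'].astype(str)
# target encoding: replace each category by a smoothed mean of the target;
# fit on the training set only to avoid leaking test information
encoder = TargetEncoder()
X_train_enc = encoder.fit_transform(X_train, y_train)
X_test_enc = encoder.transform(X_test)
```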
%% Cell type:markdown id: tags:
### Perform feature selection / extraction
Usually, one would also perform feature selection or feature extraction. Done well, this will most likely increase model performance. However, since we are still in the exploratory phase, we will skip it here (a small sketch follows for the curious). Feature selection is also often useful if you later want to interpret your model and its results. You can read more about it [here](https://scikit-learn.org/stable/modules/feature_selection.html).
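%% Cell type:markdown id: tags:
If you do want to experiment, a minimal univariate sketch, assuming the fully numeric, encoded matrices from the encoding sketch above:
%% Cell type:code id: tags:
``` python
from sklearn.feature_selection import SelectKBest, f_regression
# keep the 30 features with the strongest univariate relation to the target
selector = SelectKBest(f_regression, k=30)
X_train_sel = selector.fit_transform(X_train_enc, y_train)
X_test_sel = selector.transform(X_test_enc)
```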
%% Cell type:markdown id: tags:
## Modelling
%% Cell type:markdown id: tags:
#### Task: Please train at least two models.
An easy way to get started is to use models from [sklearn](https://scikit-learn.org/stable/index.html).
Use the cells below for your code; a reference sketch follows the empty cells.
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
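%% Cell type:markdown id: tags:
One possible starting point, using only estimators imported above (a sketch that assumes the X_train_enc/X_test_enc matrices from the encoding sketch; the hyperparameter values are illustrative, not recommendations):
%% Cell type:code id: tags:
``` python
# model 1: Lasso on log-transformed prices, with robust scaling
lasso = make_pipeline(RobustScaler(), Lasso(alpha=0.0005, random_state=42))
lasso.fit(X_train_enc, np.log(y_train))
# model 2: random forest on the raw target
forest = RandomForestRegressor(n_estimators=300, random_state=42)
forest.fit(X_train_enc, y_train)
```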
%% Cell type:markdown id: tags:
## Evaluation
%% Cell type:markdown id: tags:
#### Task: Please evaluate your models on the given error metric and use at least one naive benchmark
Use the cells below for your code; a reference sketch follows the empty cells.
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
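%% Cell type:markdown id: tags:
A sketch of the competition metric plus a naive mean benchmark, assuming the fitted models from the modelling sketch:
%% Cell type:code id: tags:
``` python
# RMSE between log predictions and log observations (the competition metric)
def rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_error(np.log(y_true), np.log(y_pred)))

# naive benchmark: always predict the mean training price
naive_pred = np.full(len(y_test), y_train.mean())
print('Naive mean benchmark:', rmsle(y_test, naive_pred))
# the Lasso was trained on log prices, so transform its predictions back
print('Lasso:', rmsle(y_test, np.exp(lasso.predict(X_test_enc))))
print('Random forest:', rmsle(y_test, forest.predict(X_test_enc)))
```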
%% Cell type:markdown id: tags:
#### Task: Perform hyperparameter tuning with cross-validation on one of your models. Do the results improve?
Use the cells below for your code; a reference sketch follows the empty cell.
%% Cell type:code id: tags:
``` python
```
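%% Cell type:markdown id: tags:
A sketch with the GridSearchCV and make_scorer imports from above; the grid values are illustrative, and the rmsle helper from the evaluation sketch is assumed to exist:
%% Cell type:code id: tags:
``` python
# small illustrative grid for the random forest
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10, 20],
}
# the scorer negates rmsle because GridSearchCV maximizes the score
scorer = make_scorer(rmsle, greater_is_better=False)
search = GridSearchCV(RandomForestRegressor(random_state=42),
                      param_grid, scoring=scorer, cv=5)
search.fit(X_train_enc, y_train)
print(search.best_params_, -search.best_score_)
```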
%% Cell type:markdown id: tags:
%% Cell type:markdown id: tags:
#### Task: Please evaluate the residuals of your models. Do you consistently under- or overestimate the house prices?
Use the cells below for your code; a reference sketch follows the empty cell.
%% Cell type:code id: tags:
``` python
```
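%% Cell type:markdown id: tags:
One way to look at the residuals (a sketch, assuming the fitted random forest from the modelling sketch):
%% Cell type:code id: tags:
``` python
# residuals on the test set: positive values mean we underestimated
residuals = y_test - forest.predict(X_test_enc)
sns.histplot(residuals, kde=True)
plt.axvline(0, color='red')
plt.xlabel('Residual (observed - predicted)')
plt.show()
print('Mean residual:', residuals.mean())
```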
%% Cell type:markdown id: tags:
The residuals are approximately normally distributed and centered around zero, hence we do not consistently under- or overestimate the house prices.
%% Cell type:markdown id: tags:
## Next Steps
%% Cell type:markdown id: tags:
#### Task: How can you improve our existing model?
Use the cell below for your answer.
%% Cell type:markdown id: tags:
%% Cell type:code id: tags:
``` python
```
requirements.txt:
```
scikit-learn
numpy
pandas
seaborn
```