%% Cell type:markdown id: tags:
## The CRISP-DM Process
> Cross-industry standard process for data mining, also known as CRISP-DM, is an open standard process model that describes common approaches used by data mining experts. It is the most widely-used analytics model.
>
> -- Source: https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining
<p align="center">
<img src="https://upload.wikimedia.org/wikipedia/commons/b/b9/CRISP-DM_Process_Diagram.png" width="400" />
</p>
CRISP-DM breaks the process of data mining into six major phases:
- Business Understanding
- Data Understanding
- Data Preparation
- Modeling
- Evaluation
- Deployment
The sequence of the phases is not strict and moving back and forth between different phases is usually required. The arrows in the process diagram indicate the most important and frequent dependencies between phases. The outer circle in the diagram symbolizes the cyclic nature of data mining itself. A data mining process continues after a solution has been deployed. The lessons learned during the process can trigger new, often more focused business questions, and subsequent data mining processes will benefit from the experiences of previous ones.
**Disclaimer**: Because we are not solving a real-world data science project, we skip the **Business Understanding** and **Deployment** steps. However, in my experience, these are the most important steps for delivering business value.
%% Cell type:markdown id: tags:
## Task Description: House Prices - Advanced Regression Techniques
This notebook follows the idea of the "House Prices - Advanced Regression Techniques" competition on Kaggle. The dataset for this competition was compiled by Dean De Cock for use in data science education. It was designed as a modernized and expanded alternative to the Boston Housing dataset. More details on this dataset are described in [Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project](http://jse.amstat.org/v19n3/decock.pdf).
>**Goal**: It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable.
>
>**Metric**: Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
>
> -- description taken from [Kaggle](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview/evaluation)
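To make the quoted metric concrete, the following cell sketches how the RMSE between log prices could be computed; the price values used here are made up purely for illustration.
%% Cell type:code id: tags:
``` python
# Minimal sketch of the competition metric: RMSE between the logarithm of the
# predicted and the logarithm of the observed sale price.
# The values below are made up purely for illustration.
import numpy as np

y_true = np.array([200000, 150000, 340000])   # hypothetical observed sale prices
y_pred = np.array([210000, 140000, 310000])   # hypothetical predicted sale prices

rmse_log = np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))
print(f"RMSE on log prices: {rmse_log:.4f}")
```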
%% Cell type:markdown id: tags:
## Install & import packages
%% Cell type:code id: tags:
``` python
!pip install -r requirements.txt
```
%% Cell type:code id: tags:
``` python
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from scipy import stats # statistical functions
import os # access to operating system related functions
# plotting libraries
import seaborn as sns
import matplotlib.pyplot as plt
# ml related libraries
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.model_selection import GridSearchCV
from category_encoders.target_encoder import TargetEncoder
```
%% Cell type:code id: tags:
``` python
# plot inline
%matplotlib inline
```
%% Cell type:markdown id: tags:
## Read data
In the next cell, we download the data from a URL and separate the features X from the target variable y. Then we split the data into a train set and a test set.
%% Cell type:code id: tags:
``` python
# download original data
data = pd.read_csv("http://jse.amstat.org/v19n3/decock/AmesHousing.txt", sep='\t')
# get features and target
X, y = data.drop(['PID', 'Order', 'SalePrice'], axis=1), data['SalePrice']
# split into train and testset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
```
%% Cell type:code id: tags:
``` python
# read more about the data
with open('./data_description.txt', 'r') as file:
    description = file.read()
print(description)
```
%% Cell type:markdown id: tags:
## Data Understanding
### Gathering basic information about our data
%% Cell type:code id: tags:
``` python
# displaying first rows of data set
X_train.head(5)
```
%% Cell type:code id: tags:
``` python
# get information about data types
X_train.info()
```
%% Cell type:code id: tags:
``` python
# check the number of samples and features
print("The X_train data size is: {}".format(X_train.shape))
print("The X_test data size is: {}".format(X_test.shape))
```
%% Cell type:markdown id: tags:
### Plotting target variable
Since we are interested in forecasting the house price, we first have a look at the distribution of the house prices themselves. The first plot shows the distribution of the sale price, while the second plot (a QQ-plot) compares the quantiles of our data against the quantiles of a theoretical normal distribution. If our target variable followed a (perfect) normal distribution, all blue points would lie on the red line. For more information on the second plot, click [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.probplot.html).
%% Cell type:code id: tags:
``` python
# visualize SalePrice (target variable) with a fitted normal distribution
sns.distplot(y_train, fit=stats.norm)
# estimate the parameters of the fitted normal distribution for the legend
(mu, sigma) = stats.norm.fit(y_train)
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f})'.format(mu, sigma)], loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')
# also create the QQ-plot against a theoretical normal distribution
fig = plt.figure()
res = stats.probplot(y_train, plot=plt)
plt.show()
```
%% Cell type:markdown id: tags:
#### Task: What are the conclusions that we can draw from these two plots?
Use the cell below to answer this question
%% Cell type:markdown id: tags:
%% Cell type:markdown id: tags:
### Visualize features
Now that we have a better understanding of what we are looking at, we can explore our features visually.
%% Cell type:code id: tags:
``` python
## visualizing numerical features
X_train.select_dtypes(np.number).hist(bins=50, figsize=(30, 20))
plt.show()
```
%% Cell type:code id: tags:
``` python
## visualizing categorical data
categorical_columns = X_train.select_dtypes('object').columns
n_columns = 5
n_rows = len(categorical_columns) // n_columns + 1
fig = plt.figure(figsize=(20, 30))
for idx, column in enumerate(categorical_columns):
    ax = plt.subplot(n_rows, n_columns, idx + 1)
    X_train[column].value_counts().plot(kind='bar')
    ax.set_title(f'Distribution of {column}')
plt.tight_layout()
plt.show()
```
%% Cell type:code id: tags:
``` python
## visualize missing data ratio
X_train_na = (X_train.isnull().sum() / len(X_train)) * 100
X_train_na = X_train_na.drop(X_train_na[X_train_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio': X_train_na})
print(missing_data.head(20))
f, ax = plt.subplots()
plt.xticks(rotation='vertical')
sns.barplot(x=X_train_na.index, y=X_train_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
plt.show()
```
%% Cell type:code id: tags:
``` python
## visualize correlation
sns.heatmap(X_train.select_dtypes(np.number).corr())
plt.show()
```
%% Cell type:markdown id: tags:
**Disclaimer**: Usually, one would conduct an even more in-depth visual analysis of the dataset. For instance, one would investigate the relationship between each feature and the target variable. The Python package [Seaborn](https://seaborn.pydata.org/index.html) provides some good tutorials on data visualisation.
%% Cell type:markdown id: tags:
## Data Preparation
Below is a brief, non-exhaustive overview of the most common data preparation steps.
%% Cell type:markdown id: tags:
### Imputing missing values
Imputing missing values often requires domain knowledge. In our dataset, for instance, there are many columns in which a missing value has a meaning and can therefore be encoded meaningfully. If a value is missing at random, we can only make assumptions about what it should be encoded as. However, there are some advanced imputation techniques, such as k-nearest neighbors or an iterative imputer, that try to make the best guess for us. If you want to read more about them, check out sklearn's [documentation](https://scikit-learn.org/stable/modules/impute.html#impute).
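As a hint, the next cell sketches one possible (but not the required) approach using `SimpleImputer`: filling numeric columns with the median and categorical columns with a constant placeholder.
%% Cell type:code id: tags:
``` python
# Sketch of one possible imputation strategy (not the required solution):
# numeric columns are filled with the median, categorical (object) columns with a
# constant placeholder string.
from sklearn.impute import SimpleImputer

num_cols = X_train.select_dtypes(np.number).columns
cat_cols = X_train.select_dtypes('object').columns

num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='constant', fill_value='None')

X_train_imp = X_train.copy()
X_train_imp[num_cols] = num_imputer.fit_transform(X_train[num_cols])
X_train_imp[cat_cols] = cat_imputer.fit_transform(X_train[cat_cols])
# the test set would be transformed with the already fitted imputers (transform, not fit_transform)
```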
%% Cell type:markdown id: tags:
#### Task: Please impute your missing values
Use the cells below for your code.
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
### Outlier removal
Some models, such as linear regression, are sensitive to outliers. Hence, depending on your model's requirements, you might want to exclude abnormal data points.
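As a starting point, the cell below sketches how 'Gr Liv Area' could be inspected visually; the 4000 sq ft cut-off is only an illustrative assumption, not a fixed rule.
%% Cell type:code id: tags:
``` python
# Sketch: inspect 'Gr Liv Area' against the sale price and flag unusually large houses.
# The 4000 sq ft threshold is an illustrative assumption, not a fixed rule.
plt.scatter(X_train['Gr Liv Area'], y_train, alpha=0.3)
plt.xlabel('Gr Liv Area (sq ft)')
plt.ylabel('SalePrice')
plt.show()

outlier_idx = X_train[X_train['Gr Liv Area'] > 4000].index
print(f"{len(outlier_idx)} potential outliers above 4000 sq ft")
# if they are dropped, remove them from both X_train and y_train:
# X_train, y_train = X_train.drop(outlier_idx), y_train.drop(outlier_idx)
```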
%% Cell type:markdown id: tags:
#### Task: Please investigate the above grade square feet area ('Gr Liv Area') for outliers
Use the cells below for your code.
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
### Feature engineering
In order to maximize our model's performance, we should also look into creating new features. This usually requires domain knowledge. However, there are also automated tools available. One of these tools is called featuretools. Click [here](https://github.com/alteryx/featuretools) for more information.
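As an illustration, the next cell sketches one possible engineered feature, the age of the house at the time of sale; it assumes the columns 'Yr Sold' and 'Year Built' from the data description, and your own feature may of course look different.
%% Cell type:code id: tags:
``` python
# Sketch of one possible new feature: the age of the house at the time of sale.
# Assumes the columns 'Yr Sold' and 'Year Built' as listed in data_description.txt.
house_age = X_train['Yr Sold'] - X_train['Year Built']

plt.scatter(house_age, y_train, alpha=0.3)
plt.xlabel('House age at sale (years)')
plt.ylabel('SalePrice')
plt.show()

print('Correlation with SalePrice:', house_age.corr(y_train))
```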
%% Cell type:markdown id: tags:
#### Task: Please think of a new feature and visualize if it has any correlation with the target variable
Use the cells below for your code.
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
### Encoding of categorical features
All machine learning algorithms in sklearn expect features to be represented as numbers, so categorical features have to be encoded. This transformation can be done in many ways. Among the most popular are [one-hot encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) and [label encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder). If you are interested in other, less common encodings, check out the [category_encoders package](https://contrib.scikit-learn.org/category_encoders/).
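Since `TargetEncoder` is already imported above, the cell below sketches target encoding of the object columns as one possible approach; one-hot encoding (e.g. `pd.get_dummies`) would be an equally valid alternative.
%% Cell type:code id: tags:
``` python
# Sketch of one possible encoding (not the required solution): target encoding of all
# object columns. One-hot encoding, e.g. via pd.get_dummies, is an equally valid choice.
from category_encoders.target_encoder import TargetEncoder

cat_cols = X_train.select_dtypes('object').columns
encoder = TargetEncoder(cols=list(cat_cols))
X_train_enc = encoder.fit_transform(X_train, y_train)
X_test_enc = encoder.transform(X_test)   # reuse the encoder fitted on the training data
```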
%% Cell type:markdown id: tags:
#### Task: Please encode your categorical features as numbers. Also think about numeric variables that are actually categorical.
Use the cells below for your code.
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
### Perform feature selection / extraction
Usually, one would also perform feature selection or feature extraction. Done well, this will most likely improve model performance. However, since we are still in the exploratory phase, we will skip it here. Feature selection is also often useful if you later want to interpret your model and its results. You can read more about feature selection [here](https://scikit-learn.org/stable/modules/feature_selection.html).
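For reference, the cell below sketches a simple univariate feature selection on the numeric columns; it assumes missing values have already been imputed (a median fill is used here only as a stand-in) and is not used further in this notebook.
%% Cell type:code id: tags:
``` python
# Sketch (not used further in this notebook): univariate feature selection on the
# numeric columns with SelectKBest. The median fill is only a stand-in for a proper
# imputation step.
from sklearn.feature_selection import SelectKBest, f_regression

num_cols = X_train.select_dtypes(np.number).columns
X_num = X_train[num_cols].fillna(X_train[num_cols].median())

selector = SelectKBest(score_func=f_regression, k=10)
selector.fit(X_num, y_train)
print('Selected features:', list(num_cols[selector.get_support()]))
```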
%% Cell type:markdown id: tags:
## Modelling
%% Cell type:markdown id: tags:
#### Task: Please train at least two models.
An easy way to get started is to use models from [sklearn](https://scikit-learn.org/stable/index.html).
Use the cells below for your code.
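As a hint, the cell below sketches how two models from the imports above could be fitted; `X_train_prep` is a placeholder name for your fully imputed and encoded training data, and the hyperparameter values are arbitrary starting points.
%% Cell type:code id: tags:
``` python
# Sketch: fit a regularised linear model and a random forest on prepared data.
# `X_train_prep` is a placeholder for your imputed and encoded feature matrix;
# alpha and n_estimators are arbitrary starting values, not tuned choices.
lasso = make_pipeline(RobustScaler(), Lasso(alpha=0.001, max_iter=10000))
lasso.fit(X_train_prep, np.log1p(y_train))   # train on log prices to match the metric

forest = RandomForestRegressor(n_estimators=200, random_state=42)
forest.fit(X_train_prep, np.log1p(y_train))
```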
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
## Evaluation
%% Cell type:markdown id: tags:
#### Task: Please evaluate your models on the given error metric and use at least one naive benchmark
Use the cells below for your code.
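One possible structure, sketched below, compares each model against a naive benchmark that always predicts the mean training price, using the competition metric (RMSE on log prices); `X_test_prep` and the fitted models are placeholder names from the previous steps.
%% Cell type:code id: tags:
``` python
# Sketch: evaluate on RMSE of log prices and compare against a naive mean predictor.
# `X_test_prep`, `lasso` and `forest` are placeholder names from the previous steps.
def rmse_log(y_true, y_pred):
    return np.sqrt(mean_squared_error(np.log(y_true), np.log(y_pred)))

naive_pred = np.full(len(y_test), y_train.mean())
print('Naive benchmark:', rmse_log(y_test, naive_pred))

for name, model in [('Lasso', lasso), ('Random forest', forest)]:
    pred = np.expm1(model.predict(X_test_prep))   # undo the log1p used during training
    print(name, rmse_log(y_test, pred))
```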
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
#### Task: Perform hyperparameter tuning with cross validation on one of your models. Do the results improve?
Use the cells below for your code.
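A minimal sketch of a grid search over the random forest is shown below; the parameter grid is only an illustrative assumption and `X_train_prep` is again a placeholder for your prepared data.
%% Cell type:code id: tags:
``` python
# Sketch: hyperparameter tuning with cross-validation via GridSearchCV.
# The parameter grid is an illustrative assumption; `X_train_prep` is a placeholder.
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10, 20],
}
# score with RMSE on the (already log-transformed) target; lower is better
rmse_scorer = make_scorer(lambda yt, yp: np.sqrt(mean_squared_error(yt, yp)),
                          greater_is_better=False)
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                      scoring=rmse_scorer, cv=5, n_jobs=-1)
search.fit(X_train_prep, np.log1p(y_train))
print(search.best_params_, -search.best_score_)
```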
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
%% Cell type:markdown id: tags:
#### Task: Please evaluate the residuals of your models. Do you consistently under- or overestimate the house prices?
Use the cells below for your code.
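One way to look at the residuals, sketched below, is a histogram of observed minus predicted prices; `X_test_prep` and `forest` are placeholder names for your prepared data and one of your fitted models.
%% Cell type:code id: tags:
``` python
# Sketch: inspect the residuals (observed minus predicted prices) of one model.
# `X_test_prep` and `forest` are placeholder names from the previous steps.
pred = np.expm1(forest.predict(X_test_prep))
residuals = y_test - pred

sns.histplot(residuals, kde=True)
plt.axvline(0, color='red')
plt.xlabel('Residual (observed - predicted SalePrice)')
plt.title(f'Mean residual: {residuals.mean():.0f}')
plt.show()
```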
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
The residuals approximately follow a normal distribution centered around zero, and hence we do not consistently under- or overestimate the house prices.
%% Cell type:markdown id: tags:
## Next Steps
%% Cell type:markdown id: tags:
#### Task: How can you improve the existing model?
Use the cell below for your answer.
%% Cell type:markdown id: tags:
%% Cell type:code id: tags:
``` python
```