From 53a87a2481117c0bab2fea143ae22faf42b43af4 Mon Sep 17 00:00:00 2001 From: Alexander Grote <grote@fzi.de> Date: Fri, 12 Apr 2024 14:20:19 +0200 Subject: [PATCH] adding regression exercise --- ml_bootcamp/Regression_Exercise.ipynb | 589 ++++++++++++++++++++++++++ ml_bootcamp/data_description.txt | 523 +++++++++++++++++++++++ ml_bootcamp/requirements.txt | 8 + 3 files changed, 1120 insertions(+) create mode 100644 ml_bootcamp/Regression_Exercise.ipynb create mode 100644 ml_bootcamp/data_description.txt create mode 100644 ml_bootcamp/requirements.txt diff --git a/ml_bootcamp/Regression_Exercise.ipynb b/ml_bootcamp/Regression_Exercise.ipynb new file mode 100644 index 0000000..73e6324 --- /dev/null +++ b/ml_bootcamp/Regression_Exercise.ipynb @@ -0,0 +1,589 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## The CRISP-DM Process\n", + "\n", + "> Cross-industry standard process for data mining, also known as CRISP-DM, is an open standard process model that describes common approaches used by data mining experts. It is the most widely-used analytics model.\n", + "> \n", + "> -- Source: https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining\n", + "\n", + "<p align=\"center\">\n", + "<img src=\"https://upload.wikimedia.org/wikipedia/commons/b/b9/CRISP-DM_Process_Diagram.png\" width=\"400\" />\n", + "</p>\n", + "\n", + "CRISP-DM breaks the process of data mining into six major phases:\n", + "\n", + "- Business Understanding\n", + "- Data Understanding\n", + "- Data Preparation\n", + "- Modeling\n", + "- Evaluation\n", + "- Deployment\n", + "\n", + "The sequence of the phases is not strict and moving back and forth between different phases is usually required. The arrows in the process diagram indicate the most important and frequent dependencies between phases. The outer circle in the diagram symbolizes the cyclic nature of data mining itself. A data mining process continues after a solution has been deployed. 
The lessons learned during the process can trigger new, often more focused business questions, and subsequent data mining processes will benefit from the experiences of previous ones.\n", + "\n", + "**Disclaimer**: Because we are not solving a real-world data science project, we are skipping the **Business Understanding** and **Deployment Step**. However, in my experience, these steps are the most important ones to provide business value." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Task Description: House Prices - Advanced Regression Techniques\n", + "\n", + "This notebook follows the idea of the \"House Prices - Advanced Regression Techniques\" competition on Kaggle. However, the dataset for this competition has been compiled by Dean De Cock for use in data science education. It was designed after the Boston Housing dataset and is now considered a more modernized and expanded version of it. More details of this dataset are described in [Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project](http://jse.amstat.org/v19n3/decock.pdf).\n", + "\n", + ">**Goal**: It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable. \n", + ">\n", + ">**Metric**: Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. 
(Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)\n", + ">\n", + "> -- description taken from [Kaggle](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview/evaluation)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Install & import packages" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install -r requirements.txt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np # linear algebra\n", + "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n", + "from scipy import stats # statistical functions\n", + "import os # access to operating system related functions\n", + "\n", + "# plotting libraries\n", + "import seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "\n", + "# ml related libraries\n", + "from sklearn.impute import SimpleImputer\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.linear_model import Lasso\n", + "from sklearn.pipeline import make_pipeline\n", + "from sklearn.preprocessing import RobustScaler\n", + "from sklearn.ensemble import RandomForestRegressor\n", + "from sklearn.metrics import mean_squared_error, make_scorer\n", + "from sklearn.model_selection import GridSearchCV\n", + "from category_encoders.target_encoder import TargetEncoder" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# plot inline\n", + "%matplotlib inline " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Read data\n", + "\n", + "In the next cell, we will download the data from a URL and separate the features X from the target variable y. Then we will split both into a training set and a test set."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# download original data\n", + "data = pd.read_csv(\"http://jse.amstat.org/v19n3/decock/AmesHousing.txt\", sep='\\t')\n", + "\n", + "# get features and target\n", + "X, y = data.drop(['PID', 'Order', 'SalePrice'], axis=1), data['SalePrice']\n", + "\n", + "# split into train and testset\n", + "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "# read more about the data\n", + "with open('./data_description.txt', 'r') as file:\n", + " description = file.read()\n", + " \n", + "print(description)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data Understanding\n", + "\n", + "### Gathering basic information about our data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# displaying first rows of data set\n", + "X_train.head(5)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "# get information about data types\n", + "X_train.info()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#check the numbers of samples and features\n", + "print(\"The X_train data size is : {} \".format(X_train.shape))\n", + "print(\"The X_test data size is : {} \".format(X_test.shape))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Plotting target variable\n", + "\n", + "Since we are interested in forecasting the house price, we will first have a look at the distribution of the house prices themselves. 
The first plot shows the distribution of the sale price, while the second plot (a probability plot) compares the ordered sale prices with the quantiles of a theoretical normal distribution. If our target variable followed a (perfect) normal distribution, all blue points would be on the red line. For more information on the second plot, click [here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.probplot.html)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# visualize SalePrice (target variable); note: sns.distplot is deprecated since seaborn 0.11\n", + "sns.distplot(y_train, fit=stats.norm)\n", + "\n", + "# fit a normal distribution to get mu and sigma for the legend\n", + "(mu, sigma) = stats.norm.fit(y_train)\n", + "plt.legend([r'Normal dist. ($\\mu=$ {:.2f} and $\\sigma=$ {:.2f} )'.format(mu, sigma)], loc='best')\n", + "plt.ylabel('Density')\n", + "plt.title('SalePrice distribution')\n", + "\n", + "# also draw the probability (Q-Q) plot\n", + "fig = plt.figure()\n", + "res = stats.probplot(y_train, plot=plt)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Task: What are the conclusions that we can draw from these two plots?\n", + "Use the cell below to answer this question" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Visualize features\n", + "\n", + "Now that we have a better understanding of what we are looking at, we can explore our features visually."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "## visualizing numerical features\n", + "X_train.select_dtypes(np.number).hist(bins=50, figsize=(30, 20))\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "## visualizing categorical data\n", + "categorical_columns = X_train.select_dtypes('object').columns\n", + "\n", + "n_columns = 5\n", + "n_rows = len(categorical_columns) // n_columns + 1\n", + "\n", + "fig = plt.figure(figsize=(20, 30))\n", + "\n", + "for idx, column in enumerate(categorical_columns):\n", + "    \n", + "    ax = plt.subplot(n_rows, n_columns, idx + 1)\n", + "    X_train[column].value_counts().plot(kind='bar')\n", + "    ax.set_title(f'Distribution of {column}')\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "## visualize missing data ratio\n", + "X_train_na = (X_train.isnull().sum() / len(X_train)) * 100\n", + "X_train_na = X_train_na.drop(X_train_na[X_train_na == 0].index).sort_values(ascending=False)[:30]\n", + "missing_data = pd.DataFrame({'Missing Ratio': X_train_na})\n", + "missing_data.head(20)\n", + "\n", + "f, ax = plt.subplots()\n", + "plt.xticks(rotation=90)\n", + "sns.barplot(x=X_train_na.index, y=X_train_na)\n", + "plt.xlabel('Features', fontsize=15)\n", + "plt.ylabel('Percent of missing values', fontsize=15)\n", + "plt.title('Percent missing data by feature', fontsize=15)\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "## visualize correlation between numerical features (corr() fails on text columns in pandas >= 2.0)\n", + "sns.heatmap(X_train.select_dtypes(np.number).corr())\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Disclaimer**: usually, one would conduct an even more in-depth visual analysis of the dataset. 
For instance, one would investigate the relationship between all variables and the target variable. The Python package [Seaborn](https://seaborn.pydata.org/index.html) provides some good tutorials on data visualisation. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data Preparation\n", + "\n", + "Below is a brief, non-exhaustive overview of the most common data preparation steps." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Imputing missing values\n", + "\n", + "Imputing missing values often requires domain knowledge. In our dataset, for instance, there are many columns in which a missing value has a meaning (e.g. no alley access for `Alley`) and can therefore be encoded meaningfully. If a value is missing at random, we can only make assumptions about what it should be encoded as. However, there are some advanced imputation techniques, like k-nearest neighbors or an iterative imputer, that try to make the best guess for us. If you want to read more about them, check out sklearn's [documentation](https://scikit-learn.org/stable/modules/impute.html#impute)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Task: Please impute your missing values\n", + "Use the cells below for your code" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Outlier removal\n", + "\n", + "Some models, like linear regression, are sensitive to outliers. Hence, depending on your model's requirements, you might want to exclude abnormal data points."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Task: Please investigate the above-grade living area in square feet ('Gr Liv Area') for outliers\n", + "Use the cells below for your code" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Feature engineering\n", + "\n", + "In order to maximize our model's performance, we should also look into creating new features. This usually requires domain knowledge. However, there are also automated tools available. One of these tools is called featuretools. Click [here](https://github.com/alteryx/featuretools) for more information." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Task: Please think of a new feature and visualize whether it is correlated with the target variable\n", + "Use the cells below for your code" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Encoding of categorical features \n", + "\n", + "In sklearn, all machine learning algorithms assume that the categorical features are represented as numbers. This transformation can be done in many ways. Among the most popular approaches are probably [one-hot-encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) and [label encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder). If you are interested in reading more about other, less common encoding possibilities, check out the [category_encoders package](https://contrib.scikit-learn.org/category_encoders/). 
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Task: Please encode your categorical features as numbers. Also think about numeric variables that are actually categorical.\n", + "Use the cells below for your code" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Perform feature selection / extraction\n", + "Usually, one would also perform feature selection or feature extraction. This will most likely increase the model performance if done well. However, since we are still in the exploratory phase, we will skip it. Feature selection is also often useful if you later want to interpret your model and its results. You can read more about feature selection [here](https://scikit-learn.org/stable/modules/feature_selection.html). " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Modelling" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Task: Please train at least two models.\n", + "An easy way to get started is to use models from [sklearn](https://scikit-learn.org/stable/index.html).\n", + "Use the cells below for your code. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Evaluation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Task: Please evaluate your models on the given error metric and use at least one naive benchmark\n", + "Use the cells below for your code" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Task: Perform hyperparameter tuning with cross-validation on one of your models. Do the results improve?\n", + "Use the cells below for your code" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Task: Please evaluate the residuals of your models. 
Do you consistently under- or overestimate the house prices?\n", + "Use the cells below for your code" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The residuals follow an approximately normal distribution centered around zero; hence, we do not consistently under- or overestimate the house prices" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Next Steps" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Task: How can you improve your existing model?\n", + "Use the cell below for your answer" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "interpreter": { + "hash": "2258dfa08c46b77d2bc2b7524b68bff9e582ce15fe6c25ecebf58bb0c8a8f397" + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.13" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/ml_bootcamp/data_description.txt b/ml_bootcamp/data_description.txt new file mode 100644 index 0000000..cba0710 --- /dev/null +++ b/ml_bootcamp/data_description.txt @@ -0,0 +1,523 @@ +MSSubClass: Identifies the type of dwelling involved in the sale. 
+ + 20 1-STORY 1946 & NEWER ALL STYLES + 30 1-STORY 1945 & OLDER + 40 1-STORY W/FINISHED ATTIC ALL AGES + 45 1-1/2 STORY - UNFINISHED ALL AGES + 50 1-1/2 STORY FINISHED ALL AGES + 60 2-STORY 1946 & NEWER + 70 2-STORY 1945 & OLDER + 75 2-1/2 STORY ALL AGES + 80 SPLIT OR MULTI-LEVEL + 85 SPLIT FOYER + 90 DUPLEX - ALL STYLES AND AGES + 120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER + 150 1-1/2 STORY PUD - ALL AGES + 160 2-STORY PUD - 1946 & NEWER + 180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER + 190 2 FAMILY CONVERSION - ALL STYLES AND AGES + +MSZoning: Identifies the general zoning classification of the sale. + + A Agriculture + C Commercial + FV Floating Village Residential + I Industrial + RH Residential High Density + RL Residential Low Density + RP Residential Low Density Park + RM Residential Medium Density + +LotFrontage: Linear feet of street connected to property + +LotArea: Lot size in square feet + +Street: Type of road access to property + + Grvl Gravel + Pave Paved + +Alley: Type of alley access to property + + Grvl Gravel + Pave Paved + NA No alley access + +LotShape: General shape of property + + Reg Regular + IR1 Slightly irregular + IR2 Moderately Irregular + IR3 Irregular + +LandContour: Flatness of the property + + Lvl Near Flat/Level + Bnk Banked - Quick and significant rise from street grade to building + HLS Hillside - Significant slope from side to side + Low Depression + +Utilities: Type of utilities available + + AllPub All public Utilities (E,G,W,& S) + NoSewr Electricity, Gas, and Water (Septic Tank) + NoSeWa Electricity and Gas Only + ELO Electricity only + +LotConfig: Lot configuration + + Inside Inside lot + Corner Corner lot + CulDSac Cul-de-sac + FR2 Frontage on 2 sides of property + FR3 Frontage on 3 sides of property + +LandSlope: Slope of property + + Gtl Gentle slope + Mod Moderate Slope + Sev Severe Slope + +Neighborhood: Physical locations within Ames city limits + + Blmngtn Bloomington Heights + Blueste Bluestem + BrDale 
Briardale + BrkSide Brookside + ClearCr Clear Creek + CollgCr College Creek + Crawfor Crawford + Edwards Edwards + Gilbert Gilbert + IDOTRR Iowa DOT and Rail Road + MeadowV Meadow Village + Mitchel Mitchell + NAmes North Ames + NoRidge Northridge + NPkVill Northpark Villa + NridgHt Northridge Heights + NWAmes Northwest Ames + OldTown Old Town + SWISU South & West of Iowa State University + Sawyer Sawyer + SawyerW Sawyer West + Somerst Somerset + StoneBr Stone Brook + Timber Timberland + Veenker Veenker + +Condition1: Proximity to various conditions + + Artery Adjacent to arterial street + Feedr Adjacent to feeder street + Norm Normal + RRNn Within 200' of North-South Railroad + RRAn Adjacent to North-South Railroad + PosN Near positive off-site feature--park, greenbelt, etc. + PosA Adjacent to positive off-site feature + RRNe Within 200' of East-West Railroad + RRAe Adjacent to East-West Railroad + +Condition2: Proximity to various conditions (if more than one is present) + + Artery Adjacent to arterial street + Feedr Adjacent to feeder street + Norm Normal + RRNn Within 200' of North-South Railroad + RRAn Adjacent to North-South Railroad + PosN Near positive off-site feature--park, greenbelt, etc. 
+ PosA Adjacent to positive off-site feature + RRNe Within 200' of East-West Railroad + RRAe Adjacent to East-West Railroad + +BldgType: Type of dwelling + + 1Fam Single-family Detached + 2FmCon Two-family Conversion; originally built as one-family dwelling + Duplx Duplex + TwnhsE Townhouse End Unit + TwnhsI Townhouse Inside Unit + +HouseStyle: Style of dwelling + + 1Story One story + 1.5Fin One and one-half story: 2nd level finished + 1.5Unf One and one-half story: 2nd level unfinished + 2Story Two story + 2.5Fin Two and one-half story: 2nd level finished + 2.5Unf Two and one-half story: 2nd level unfinished + SFoyer Split Foyer + SLvl Split Level + +OverallQual: Rates the overall material and finish of the house + + 10 Very Excellent + 9 Excellent + 8 Very Good + 7 Good + 6 Above Average + 5 Average + 4 Below Average + 3 Fair + 2 Poor + 1 Very Poor + +OverallCond: Rates the overall condition of the house + + 10 Very Excellent + 9 Excellent + 8 Very Good + 7 Good + 6 Above Average + 5 Average + 4 Below Average + 3 Fair + 2 Poor + 1 Very Poor + +YearBuilt: Original construction date + +YearRemodAdd: Remodel date (same as construction date if no remodeling or additions) + +RoofStyle: Type of roof + + Flat Flat + Gable Gable + Gambrel Gambrel (Barn) + Hip Hip + Mansard Mansard + Shed Shed + +RoofMatl: Roof material + + ClyTile Clay or Tile + CompShg Standard (Composite) Shingle + Membran Membrane + Metal Metal + Roll Roll + Tar&Grv Gravel & Tar + WdShake Wood Shakes + WdShngl Wood Shingles + +Exterior1st: Exterior covering on house + + AsbShng Asbestos Shingles + AsphShn Asphalt Shingles + BrkComm Brick Common + BrkFace Brick Face + CBlock Cinder Block + CemntBd Cement Board + HdBoard Hard Board + ImStucc Imitation Stucco + MetalSd Metal Siding + Other Other + Plywood Plywood + PreCast PreCast + Stone Stone + Stucco Stucco + VinylSd Vinyl Siding + Wd Sdng Wood Siding + WdShing Wood Shingles + +Exterior2nd: Exterior covering on house (if more than one material) + + 
AsbShng Asbestos Shingles + AsphShn Asphalt Shingles + BrkComm Brick Common + BrkFace Brick Face + CBlock Cinder Block + CemntBd Cement Board + HdBoard Hard Board + ImStucc Imitation Stucco + MetalSd Metal Siding + Other Other + Plywood Plywood + PreCast PreCast + Stone Stone + Stucco Stucco + VinylSd Vinyl Siding + Wd Sdng Wood Siding + WdShing Wood Shingles + +MasVnrType: Masonry veneer type + + BrkCmn Brick Common + BrkFace Brick Face + CBlock Cinder Block + None None + Stone Stone + +MasVnrArea: Masonry veneer area in square feet + +ExterQual: Evaluates the quality of the material on the exterior + + Ex Excellent + Gd Good + TA Average/Typical + Fa Fair + Po Poor + +ExterCond: Evaluates the present condition of the material on the exterior + + Ex Excellent + Gd Good + TA Average/Typical + Fa Fair + Po Poor + +Foundation: Type of foundation + + BrkTil Brick & Tile + CBlock Cinder Block + PConc Poured Concrete + Slab Slab + Stone Stone + Wood Wood + +BsmtQual: Evaluates the height of the basement + + Ex Excellent (100+ inches) + Gd Good (90-99 inches) + TA Typical (80-89 inches) + Fa Fair (70-79 inches) + Po Poor (<70 inches) + NA No Basement + +BsmtCond: Evaluates the general condition of the basement + + Ex Excellent + Gd Good + TA Typical - slight dampness allowed + Fa Fair - dampness or some cracking or settling + Po Poor - Severe cracking, settling, or wetness + NA No Basement + +BsmtExposure: Refers to walkout or garden level walls + + Gd Good Exposure + Av Average Exposure (split levels or foyers typically score average or above) + Mn Minimum Exposure + No No Exposure + NA No Basement + +BsmtFinType1: Rating of basement finished area + + GLQ Good Living Quarters + ALQ Average Living Quarters + BLQ Below Average Living Quarters + Rec Average Rec Room + LwQ Low Quality + Unf Unfinished + NA No Basement + +BsmtFinSF1: Type 1 finished square feet + +BsmtFinType2: Rating of basement finished area (if multiple types) + + GLQ Good Living Quarters + ALQ Average 
Living Quarters + BLQ Below Average Living Quarters + Rec Average Rec Room + LwQ Low Quality + Unf Unfinished + NA No Basement + +BsmtFinSF2: Type 2 finished square feet + +BsmtUnfSF: Unfinished square feet of basement area + +TotalBsmtSF: Total square feet of basement area + +Heating: Type of heating + + Floor Floor Furnace + GasA Gas forced warm air furnace + GasW Gas hot water or steam heat + Grav Gravity furnace + OthW Hot water or steam heat other than gas + Wall Wall furnace + +HeatingQC: Heating quality and condition + + Ex Excellent + Gd Good + TA Average/Typical + Fa Fair + Po Poor + +CentralAir: Central air conditioning + + N No + Y Yes + +Electrical: Electrical system + + SBrkr Standard Circuit Breakers & Romex + FuseA Fuse Box over 60 AMP and all Romex wiring (Average) + FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair) + FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor) + Mix Mixed + +1stFlrSF: First Floor square feet + +2ndFlrSF: Second floor square feet + +LowQualFinSF: Low quality finished square feet (all floors) + +GrLivArea: Above grade (ground) living area square feet + +BsmtFullBath: Basement full bathrooms + +BsmtHalfBath: Basement half bathrooms + +FullBath: Full bathrooms above grade + +HalfBath: Half baths above grade + +Bedroom: Bedrooms above grade (does NOT include basement bedrooms) + +Kitchen: Kitchens above grade + +KitchenQual: Kitchen quality + + Ex Excellent + Gd Good + TA Typical/Average + Fa Fair + Po Poor + +TotRmsAbvGrd: Total rooms above grade (does not include bathrooms) + +Functional: Home functionality (Assume typical unless deductions are warranted) + + Typ Typical Functionality + Min1 Minor Deductions 1 + Min2 Minor Deductions 2 + Mod Moderate Deductions + Maj1 Major Deductions 1 + Maj2 Major Deductions 2 + Sev Severely Damaged + Sal Salvage only + +Fireplaces: Number of fireplaces + +FireplaceQu: Fireplace quality + + Ex Excellent - Exceptional Masonry Fireplace + Gd Good - Masonry Fireplace in main level + TA 
Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement + Fa Fair - Prefabricated Fireplace in basement + Po Poor - Ben Franklin Stove + NA No Fireplace + +GarageType: Garage location + + 2Types More than one type of garage + Attchd Attached to home + Basment Basement Garage + BuiltIn Built-In (Garage part of house - typically has room above garage) + CarPort Car Port + Detchd Detached from home + NA No Garage + +GarageYrBlt: Year garage was built + +GarageFinish: Interior finish of the garage + + Fin Finished + RFn Rough Finished + Unf Unfinished + NA No Garage + +GarageCars: Size of garage in car capacity + +GarageArea: Size of garage in square feet + +GarageQual: Garage quality + + Ex Excellent + Gd Good + TA Typical/Average + Fa Fair + Po Poor + NA No Garage + +GarageCond: Garage condition + + Ex Excellent + Gd Good + TA Typical/Average + Fa Fair + Po Poor + NA No Garage + +PavedDrive: Paved driveway + + Y Paved + P Partial Pavement + N Dirt/Gravel + +WoodDeckSF: Wood deck area in square feet + +OpenPorchSF: Open porch area in square feet + +EnclosedPorch: Enclosed porch area in square feet + +3SsnPorch: Three season porch area in square feet + +ScreenPorch: Screen porch area in square feet + +PoolArea: Pool area in square feet + +PoolQC: Pool quality + + Ex Excellent + Gd Good + TA Average/Typical + Fa Fair + NA No Pool + +Fence: Fence quality + + GdPrv Good Privacy + MnPrv Minimum Privacy + GdWo Good Wood + MnWw Minimum Wood/Wire + NA No Fence + +MiscFeature: Miscellaneous feature not covered in other categories + + Elev Elevator + Gar2 2nd Garage (if not described in garage section) + Othr Other + Shed Shed (over 100 SF) + TenC Tennis Court + NA None + +MiscVal: $Value of miscellaneous feature + +MoSold: Month Sold (MM) + +YrSold: Year Sold (YYYY) + +SaleType: Type of sale + + WD Warranty Deed - Conventional + CWD Warranty Deed - Cash + VWD Warranty Deed - VA Loan + New Home just constructed and sold + COD Court Officer 
Deed/Estate + Con Contract 15% Down payment regular terms + ConLw Contract Low Down payment and low interest + ConLI Contract Low Interest + ConLD Contract Low Down + Oth Other + +SaleCondition: Condition of sale + + Normal Normal Sale + Abnorml Abnormal Sale - trade, foreclosure, short sale + AdjLand Adjoining Land Purchase + Alloca Allocation - two linked properties with separate deeds, typically condo with a garage unit + Family Sale between family members + Partial Home was not completed when last assessed (associated with New Homes) diff --git a/ml_bootcamp/requirements.txt b/ml_bootcamp/requirements.txt new file mode 100644 index 0000000..6c6363b --- /dev/null +++ b/ml_bootcamp/requirements.txt @@ -0,0 +1,8 @@ +scikit-learn +numpy +pandas +seaborn +IPython +ipykernel +notebook +category_encoders \ No newline at end of file -- GitLab