# Predicting Rental Prices

### Machine Learning Practical Session
#### Business Analytics Taster Day 2021
Some of you may be thinking about renting a house in Amsterdam. However, high demand means long waitlists for student housing, so you may decide to rent privately and become vulnerable to overpriced contracts. Deciding whether an apartment is worth the asking price is never easy. Therefore, we will analyze house sale data to gain insight into what determines the price. Can we even train a machine learning model to find patterns in the data and predict an accurate price?

In this practical, we will go through an entire Data Science project, from the basics of data exploration and cleaning up to using machine learning to train a model that predicts house prices. The analysis is based on a dataset by Dean De Cock describing the sale of individual residential properties in Ames, Iowa, USA from 2006 to 2010.

#### Coding
You do not need to know anything about coding to follow this practical. However, we will show the code, and you may be able to follow some of it. We will use the programming language Python, as it is widely used in business for machine learning. It is important to work through this file in order and run every piece of code, so that everything is initialized correctly.

First, it is interesting to try to read the code. For that, you need to know some conventions. The following code stores the value 7 in a variable we name *x*. After that, we can use *x* in a calculation. You can run the code by hovering over the cell and clicking the play/run button, or by clicking on the code and pressing *Ctrl+Enter*.

In [None]:
x = 7
x * 6

42

We need to import some software libraries whose functionality we are going to use. Run the next cell.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import files
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
import io

Now, download *Ames_dataset.csv* from our Google Drive folder to your computer. Afterward, running the following code gives a prompt to upload a file, *Choose file*. Choose the *Ames_dataset.csv* file (ensure that the file has exactly this name). We will store its contents in the variable *data*.

In [2]:
uploaded = files.upload()


In [None]:
data = pd.read_csv(io.BytesIO(uploaded['Ames_dataset.csv']))

## Data Exploration
Naturally, you want to see the data. Was it uploaded successfully? The next line requests the top 10 rows of the dataset.

In [None]:
data.head(10)

Unnamed: 0,SalePrice,GrLivArea,TotalBsmtSF,1stFlrSF,2ndFlrSF,Bedroom,TotRmsAbvGrd,Neighborhood,MSZoning,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MSSubClass,BldgType,HouseStyle,RoofStyle,Exterior1st,Exterior2nd,ExterQual,ExterCond,MasVnrType,MasVnrArea,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtUnfSF,CentralAir,Electrical,HeatingQC,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,Kitchen,KitchenQual,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,Fence,WoodDeckSF,OpenPorchSF,EnclosedPorch,ScreenPorch,LotFrontage,LotArea,LotShape,LotConfig,PavedDrive,SaleType
0,208500,1710,856,856,854,3,8,CollgCr,RL,7,5,2003,2003,C60,1Fam,2Story,Gable,VinylSd,VinylSd,Gd,TA,BrkFace,196.0,PConc,Gd,TA,No,GLQ,706,150,Y,SBrkr,Ex,1,0,2,1,1,Gd,0,,Attchd,2003.0,RFn,2,548,,0,61,0,0,65.0,8450,Reg,Inside,Y,WD
1,181500,1262,1262,1262,0,3,6,Veenker,RL,6,8,1976,1976,C20,1Fam,1Story,Gable,MetalSd,MetalSd,TA,TA,,0.0,CBlock,Gd,TA,Gd,ALQ,978,284,Y,SBrkr,Ex,0,1,2,0,1,TA,1,TA,Attchd,1976.0,RFn,2,460,,298,0,0,0,80.0,9600,Reg,FR2,Y,WD
2,223500,1786,920,920,866,3,6,CollgCr,RL,7,5,2001,2002,C60,1Fam,2Story,Gable,VinylSd,VinylSd,Gd,TA,BrkFace,162.0,PConc,Gd,TA,Mn,GLQ,486,434,Y,SBrkr,Ex,1,0,2,1,1,Gd,1,TA,Attchd,2001.0,RFn,2,608,,0,42,0,0,68.0,11250,IR1,Inside,Y,WD
3,250000,2198,1145,1145,1053,4,9,NoRidge,RL,8,5,2000,2000,C60,1Fam,2Story,Gable,VinylSd,VinylSd,Gd,TA,BrkFace,350.0,PConc,Gd,TA,Av,GLQ,655,490,Y,SBrkr,Ex,1,0,2,1,1,Gd,1,TA,Attchd,2000.0,RFn,3,836,,192,84,0,0,84.0,14260,IR1,FR2,Y,WD
4,143000,1362,796,796,566,1,5,Mitchel,RL,5,5,1993,1995,C50,1Fam,1.5Fin,Gable,VinylSd,VinylSd,TA,TA,,0.0,Wood,Gd,TA,No,GLQ,732,64,Y,SBrkr,Ex,1,0,1,1,1,TA,0,,Attchd,1993.0,Unf,2,480,MnPrv,40,30,0,0,85.0,14115,IR1,Inside,Y,WD
5,307000,1694,1686,1694,0,3,7,Somerst,RL,8,5,2004,2005,C20,1Fam,1Story,Gable,VinylSd,VinylSd,Gd,TA,Stone,186.0,PConc,Ex,TA,Av,GLQ,1369,317,Y,SBrkr,Ex,1,0,2,0,1,Gd,1,Gd,Attchd,2004.0,RFn,2,636,,255,57,0,0,75.0,10084,Reg,Inside,Y,WD
6,200000,2090,1107,1107,983,3,7,NWAmes,RL,7,6,1973,1973,C60,1Fam,2Story,Gable,HdBoard,HdBoard,TA,TA,Stone,240.0,CBlock,Gd,TA,Mn,ALQ,859,216,Y,SBrkr,Ex,1,0,2,1,1,TA,2,TA,Attchd,1973.0,RFn,2,484,,235,204,228,0,,10382,IR1,Corner,Y,WD
7,118000,1077,991,1077,0,2,5,BrkSide,RL,5,6,1939,1950,C190,2fmCon,1.5Unf,Gable,MetalSd,MetalSd,TA,TA,,0.0,BrkTil,TA,TA,No,GLQ,851,140,Y,SBrkr,Ex,1,0,1,0,2,TA,2,TA,Attchd,1939.0,RFn,1,205,,0,4,0,0,50.0,7420,Reg,Corner,Y,WD
8,129500,1040,1040,1040,0,3,5,Sawyer,RL,5,5,1965,1965,C20,1Fam,1Story,Hip,HdBoard,HdBoard,TA,TA,,0.0,CBlock,TA,TA,No,Rec,906,134,Y,SBrkr,Ex,1,0,1,0,1,TA,0,,Detchd,1965.0,Unf,1,384,,0,0,0,0,70.0,11200,Reg,Inside,Y,WD
9,144000,912,912,912,0,2,4,Sawyer,RL,5,6,1962,1962,C20,1Fam,1Story,Hip,HdBoard,Plywood,TA,TA,,0.0,CBlock,TA,TA,No,ALQ,737,175,Y,SBrkr,TA,1,0,1,0,1,TA,0,,Detchd,1962.0,Unf,1,352,,140,0,0,176,,12968,IR2,Inside,Y,WD


How many data points (rows) does our data have and how many variables (columns)?

In [None]:
print('# Data points:',len(data),'\n', '# variables:', data.shape[1])

To start our analysis, we investigate our target, the sale price of houses (in US dollars). Run the next piece of code to see its distribution.

In [None]:
sns.displot(data['SalePrice'], kde=True);

Next, the data consists of variables that explain something about the house. Using these explanatory variables (features), we want to predict the house price. In the interest of time, some variables were already removed because they were not informative for the model. One such example is *SaleType*, which is visualized when you run the cell below. First, a table is generated counting the occurrences of every category in the variable. Second, two figures are generated describing the data.

The description of the variables can be found in the shared Google Drive.

#### Question 1, Feature Selection
What is the consequence of not taking such variables into account in your machine learning model? Why would it be a good choice and why not?

#### Question 2, Boxplots
One of the plots shows boxplots, can you explain what a boxplot is?

In [None]:
name = "SaleType"
print(data[name].value_counts())
fig,ax = plt.subplots(1,2,figsize=(13,5))
sns.countplot(x=data[name], ax=ax[0]);
sns.boxplot(x=name, y='SalePrice',data=data, ax=ax[1]);

Now we will delete (drop) the variable *SaleType* with the next line of code. Running the code above will now result in an Error, as *SaleType* is removed.

In [None]:
data.drop('SaleType', axis=1, inplace=True)

One can also group labels. For instance, *Electrical* has the following groups (run code below). 

#### Question 3, Data Manipulation
Why would one want to group labels of, for instance, *Electrical*? Would you advise doing this to improve the machine learning model? If so, which labels would you group and why? If not, why not?

In [None]:
name = "Electrical"
print(data[name].value_counts())
fig,ax = plt.subplots(1,2,figsize=(13,5))
sns.countplot(x=data[name], ax=ax[0]);
sns.boxplot(x=name, y='SalePrice',data=data, ax=ax[1]);

The code that groups the non-standard circuits (the fuse boxes) into a single label is in the *Electrical* cell further below, after the ordinal encoding. After running that cell, one can rerun the code above to see if it worked. If one did not want to group them, the small groups should be removed, as they are too infrequent for the machine learning model to see a pattern.

Some categorical features have some order in them, as they represent qualities, for instance, where one class is better than the other. We will represent this in the model by translating these variables into numeric values. These are called ordinal variables.

In [None]:
# Encode ordinal features as numeric
data = data.replace({'BsmtCond':{'None':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5},
                     'BsmtExposure':{'None':0, 'No':1, 'Mn':2, 'Av':3, 'Gd':4},  # 'No' (no exposure) occurs in the data
                     'BsmtFinType1':{'None':0, 'Unf':1, 'LwQ': 2, 'Rec':3, 'BLQ':4, 'ALQ':5, 'GLQ':6},
                     'BsmtQual':{'None':0, 'Po':1, 'Fa':2, 'TA': 3, 'Gd':4, 'Ex':5},
                     'ExterCond':{'Po':1, 'Fa':2, 'TA': 3, 'Gd': 4, 'Ex':5},
                     'ExterQual':{'Po':1, 'Fa':2, 'TA': 3, 'Gd': 4, 'Ex':5},
                     'FireplaceQu':{'None':0, 'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5},
                     'HeatingQC':{'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5},
                     'KitchenQual':{'Po':1, 'Fa':2, 'TA':3, 'Gd':4, 'Ex':5},
                     'LotShape':{'IR3':1, 'IR2':2, 'IR1':3, 'Reg':4},
                     'PavedDrive':{'N':0, 'P':1, 'Y':2}})
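
As a minimal sketch of what such an encoding does, the toy example below maps a made-up quality column to numbers with `Series.map`. The column values and the name `quality_order` are illustrative assumptions and are not taken from the Ames file.

```python
import pandas as pd

# Toy quality column (made-up values, not taken from the Ames data)
toy = pd.DataFrame({'KitchenQual': ['TA', 'Gd', 'Ex', 'Fa', 'Gd']})

# Hypothetical mapping from worst to best quality
quality_order = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}

# map() looks up every label; labels missing from the mapping would become NaN
toy['KitchenQual_num'] = toy['KitchenQual'].map(quality_order)
print(toy)
```

Because the numbers keep the order of the labels, a model can now ask questions such as "is the kitchen quality at least 4?", which is not possible with plain text labels.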

In [None]:
data = data.replace({"Electrical" : {"SBrkr" : "SBrkr", "FuseA" : "Fuse", "FuseF" : "Fuse", "FuseP" : "Fuse"}})
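
Instead of listing the labels to merge by hand, one could also merge every label that occurs rarely. The sketch below shows this idea on made-up counts; the threshold `min_count` and the group name `'Fuse'` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy column with one frequent and several rare labels (made-up counts)
electrical = pd.Series(['SBrkr'] * 8 + ['FuseA', 'FuseA', 'FuseF', 'FuseP'])

# Count how often each label occurs
counts = electrical.value_counts()

# Labels seen fewer than min_count times are merged (min_count is an assumed threshold)
min_count = 5
rare_labels = counts[counts < min_count].index

# Keep frequent labels as they are and replace the rare ones with a single group
grouped = electrical.where(~electrical.isin(rare_labels), 'Fuse')
print(grouped.value_counts())
```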

Next, we look into correlations between variables, for instance by comparing *YearBuilt* and *GarageYrBlt*. Here, too, one can try different variables by replacing the names between the quotes.

#### Question 4, Correlations in Time
Can you explain the strange pattern in the next figure? Do you think it is necessary to keep both variables or would only keeping one be enough?

In [None]:
sns.scatterplot(x=data['YearBuilt'], y=data['GarageYrBlt']);

#### Question 5, Correlations in Space
Taking it even further, there is also a clear relationship between the number of garage car spots and the garage area. Would you keep them both?

In [None]:
sns.regplot(x=data['GarageCars'], y=data['GarageArea']);
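
Besides looking at a figure, the strength of a linear relationship can be expressed as a single number, the Pearson correlation coefficient, which lies between -1 and 1. The sketch below computes it with `Series.corr` on a small made-up table; the values are illustrative and not taken from the Ames data.

```python
import pandas as pd

# Made-up garage data for six houses
toy = pd.DataFrame({
    'GarageCars': [1, 1, 2, 2, 3, 3],
    'GarageArea': [250, 300, 480, 520, 800, 850],
})

# Pearson correlation between the two columns (the default method of Series.corr)
corr = toy['GarageCars'].corr(toy['GarageArea'])
print(round(corr, 3))
```

On the real dataset, the same call would be `data['GarageCars'].corr(data['GarageArea'])`.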

## Machine Learning
First, we need to determine which variables are taken into account by removing variable names from, or adding them to, the list below. It is initialized to all variables. Next, the code drops rows with missing values, converts categorical variables into dummy (0/1) columns, and splits the data into a training set and a test set.

In [None]:
features = ['GrLivArea','TotalBsmtSF','1stFlrSF','2ndFlrSF',
'Bedroom','TotRmsAbvGrd',
'Neighborhood','MSZoning',
'OverallQual','OverallCond',
'YearBuilt','YearRemodAdd',
'MSSubClass','BldgType','HouseStyle',
'RoofStyle',
'Exterior1st','Exterior2nd','ExterQual','ExterCond',
'MasVnrType','MasVnrArea',
'Foundation',
'BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1',
'BsmtFinSF1','BsmtUnfSF',
'CentralAir','Electrical','HeatingQC',
'BsmtFullBath','BsmtHalfBath',
'FullBath','HalfBath',
'Kitchen','KitchenQual',
'Fireplaces','FireplaceQu',
'GarageType','GarageYrBlt','GarageFinish','GarageCars','GarageArea',
'Fence','WoodDeckSF','OpenPorchSF','EnclosedPorch','ScreenPorch',
'LotFrontage','LotArea','LotShape','LotConfig',
'PavedDrive']

X = data.dropna(axis=0)
y = X['SalePrice']
X = X[features]
X = pd.get_dummies(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
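
To make the last two steps more concrete, the sketch below applies `pd.get_dummies` and `train_test_split` to three columns of the ten rows shown earlier. The variable names ending in `_toy` are only used for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Three columns of the ten rows displayed by data.head(10)
toy = pd.DataFrame({
    'GrLivArea': [1710, 1262, 1786, 2198, 1362, 1694, 2090, 1077, 1040, 912],
    'BldgType': ['1Fam', '1Fam', '1Fam', '1Fam', '1Fam',
                 '1Fam', '1Fam', '2fmCon', '1Fam', '1Fam'],
    'SalePrice': [208500, 181500, 223500, 250000, 143000,
                  307000, 200000, 118000, 129500, 144000],
})

y_toy = toy['SalePrice']
# Text columns are turned into one 0/1 column per label; numeric columns stay as they are
X_toy = pd.get_dummies(toy[['GrLivArea', 'BldgType']])
print(list(X_toy.columns))

# 30% of the rows are set aside for testing; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.30, random_state=42)
print(len(X_tr), 'training rows,', len(X_te), 'test rows')
```

The model only sees the training rows while learning; the test rows are kept apart to check how well it predicts houses it has never seen.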

### Decision Tree Regression 
Now, let's run the first model, a single decision tree. The decision tree is a simple machine learning model for getting started with regression tasks. Each internal (non-leaf) node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf (or terminal) node holds the value that is predicted. In regression, this is a number (here a price) rather than a class label.
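
As a rough illustration of this structure, the sketch below fits a tree with a single test (`max_depth=1`) on six made-up houses and prints its rules with `export_text`. The numbers and the feature name `area` are assumptions for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Made-up living areas and prices for six houses
area = np.array([[800], [900], [1000], [2000], [2100], [2200]])
price = np.array([100, 110, 120, 300, 310, 320])

# A tree of depth 1 has one test and two leaves
stump = DecisionTreeRegressor(max_depth=1, random_state=42)
stump.fit(area, price)

# Print the test and the value stored in each leaf
print(export_text(stump, feature_names=['area']))

# Each prediction is the average price of the training houses in the matching leaf
print(stump.predict([[950], [2050]]))
```

The tree used in this practical works the same way, only with many more tests stacked on top of each other.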

The model outputs the Mean Absolute Error, the Mean Squared Error, and the Root Mean Squared Error; these need to be as low as possible. Next, a figure shows the predicted values versus the actual values. Lastly, the 15 most important features for the model are displayed.

In [None]:
dtreg = DecisionTreeRegressor(random_state = 42)
dtreg.fit(X_train, y_train)

dtr_pred = dtreg.predict(X_test)
dtr_pred= dtr_pred.reshape(-1,1)
print('MAE:', metrics.mean_absolute_error(y_test, dtr_pred))
print('MSE:', metrics.mean_squared_error(y_test, dtr_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, dtr_pred)))

plt.figure(figsize=(15,8))
plt.scatter(y_test,dtr_pred,c='green')
plt.xlabel('Actual price')
plt.ylabel('Predicted price')
plt.show();

# Feature Importance
feats = {}
for feature, importance in zip(X.columns, dtreg.feature_importances_):
    feats[feature] = importance
importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
importances.sort_values(by='Gini-importance',ascending=False).head(15)
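
The three error measures printed by the model can also be computed by hand, which may help to see what they mean. The sketch below does so for three made-up prices; the numbers are illustrative and not results from the dataset.

```python
import numpy as np

# Made-up actual and predicted prices for three houses
actual = np.array([200000, 150000, 300000])
predicted = np.array([210000, 140000, 270000])

# Error per house: positive means the model predicted too high
errors = predicted - actual

mae = np.mean(np.abs(errors))   # average size of the error, in dollars
mse = np.mean(errors ** 2)      # average squared error, punishes large misses
rmse = np.sqrt(mse)             # square root brings the unit back to dollars

print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)
```

In this example the RMSE is larger than the MAE, because the single large miss of 30000 dollars weighs more heavily once the errors are squared.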

### Random Forest
A Random Forest is an ensemble technique using multiple decision trees and a technique called Bootstrap Aggregation, commonly known as bagging. This involves training each decision tree on a different data sample, where sampling is done with replacement. The final prediction is the average of the predictions of all decision trees.
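
The bagging idea can be sketched by hand with a few lines of code. The example below is a simplified illustration on randomly generated data, not the internal implementation of `RandomForestRegressor`, which can additionally restrict each split to a random subset of the features.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Randomly generated toy data: price grows with area, plus noise
rng = np.random.default_rng(42)
X_toy = rng.uniform(500, 3000, size=(200, 1))
y_toy = 100 * X_toy[:, 0] + rng.normal(0, 20000, size=200)

n_trees = 30
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw row numbers with replacement, so some rows repeat
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    tree = DecisionTreeRegressor(random_state=0)
    tree.fit(X_toy[idx], y_toy[idx])
    trees.append(tree)

# Every tree predicts the two new houses; the forest prediction is the average
X_new = np.array([[1000.0], [2000.0]])
all_preds = np.stack([t.predict(X_new) for t in trees])  # shape (n_trees, 2)
forest_pred = all_preds.mean(axis=0)
print(forest_pred)
```

Each tree sees a slightly different sample and therefore makes slightly different mistakes; averaging the trees tends to smooth these out.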

### Question 6, Results
Which model is better, the Random Forest or the Decision tree, and why?

### Question 7, Parameter Tuning
One can try to improve the Random Forest by changing the number at *n_estimators* on the first line (this represents the number of decision trees in the forest). Does adding more trees improve the prediction?

In [None]:
rfr = RandomForestRegressor(n_estimators = 30, random_state = 42)
rfr.fit(X_train, y_train)

rfr_pred = rfr.predict(X_test)
rfr_pred= rfr_pred.reshape(-1,1)
print('MAE:', metrics.mean_absolute_error(y_test, rfr_pred))
print('MSE:', metrics.mean_squared_error(y_test, rfr_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, rfr_pred)))

plt.figure(figsize=(15,8))
plt.scatter(y_test,rfr_pred,c='green')
plt.xlabel('Actual price')
plt.ylabel('Predicted price')
plt.show();

# Feature Importance
feats = {}
for feature, importance in zip(X.columns, rfr.feature_importances_):
    feats[feature] = importance
importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
importances.sort_values(by='Gini-importance',ascending=False).head(15)

### Question 8, Feature Selection
Feel free to try to improve performance by removing features/variables from the model or adding them. One can, for instance, start with an empty model and build it up with reasonable features until performance no longer improves. In the cell just below the header **Machine Learning**, one can determine which features are taken into account. Does leaving out features that are correlated with each other improve performance?

If time allows, you can look at the following parts.

## Extra, Your Analyses
Next, you can analyze any other variable yourself with the following cells. The first cell just lists all variable names. The second performs an analysis for nominal/categorical variables. The third performs an analysis of numerical variables. Just put the name of the variable you want to analyze between the quotes on the line *name = "Name_of_Variable"*.

In [None]:
list(data.columns)

In [None]:
# Nominal Analysis
name = "Neighborhood"
print(data[name].value_counts())
fig, ax = plt.subplots(1, 2, figsize=(13, 5))
sns.countplot(x=data[name], ax=ax[0])
sns.boxplot(x=name, y='SalePrice', data=data, ax=ax[1])
# Rotate the category labels on both plots so they do not overlap
for axis in ax:
    axis.tick_params(axis='x', labelrotation=45)

In [None]:
# Numerical Analysis
name = "GrLivArea"
print(data[name].describe())
sns.displot(data[name], kde=True);
sns.lmplot(x=name, y='SalePrice', data=data);

### Question 9, Feature Engineering
One may think about new features. Can we create new features by adding or multiplying other features together? Or can we create new features by combining features in other ways? The best results are most often obtained not by the best machine learning model, but by the researcher who did the best feature engineering.
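
As a starting point, the sketch below builds two new features from existing columns, using the first three rows shown earlier. The feature names `TotalSF` and `Remodeled` are made up for this illustration, and whether they actually help the model is something to test yourself.

```python
import pandas as pd

# Five columns of the first three rows displayed by data.head(10)
toy = pd.DataFrame({
    'TotalBsmtSF': [856, 1262, 920],
    '1stFlrSF': [856, 1262, 920],
    '2ndFlrSF': [854, 0, 866],
    'YearBuilt': [2003, 1976, 2001],
    'YearRemodAdd': [2003, 1976, 2002],
})

# Adding features: total floor space of basement, first floor, and second floor
toy['TotalSF'] = toy['TotalBsmtSF'] + toy['1stFlrSF'] + toy['2ndFlrSF']

# Comparing features: 1 if the house was remodeled after it was built, else 0
toy['Remodeled'] = (toy['YearRemodAdd'] > toy['YearBuilt']).astype(int)

print(toy[['TotalSF', 'Remodeled']])
```

To use such a feature in the model, create it on *data* and add its name to the *features* list in the **Machine Learning** section.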

## Extra, Missing Values
Almost every dataset contains missing values. A difficult question is what to do with them: leave them as a separate group, remove the data points, or replace the values with a logical alternative?
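
These three options can be sketched on a small made-up table. The values and the label `'Missing'` are assumptions for illustration, and the median is only one of several reasonable replacement values.

```python
import numpy as np
import pandas as pd

# Made-up table with missing values in a numeric and a categorical column
toy = pd.DataFrame({
    'LotFrontage': [65.0, 80.0, np.nan, 50.0],
    'MasVnrType': ['BrkFace', np.nan, 'Stone', np.nan],
})

# Option 1: keep missing values as their own group
as_group = toy['MasVnrType'].fillna('Missing')

# Option 2: remove every row that has at least one missing value
dropped = toy.dropna(axis=0)

# Option 3: replace missing numbers with a typical value, here the median
median_filled = toy['LotFrontage'].fillna(toy['LotFrontage'].median())

print(as_group.tolist())
print(len(dropped), 'row(s) left after dropping')
print(median_filled.tolist())
```

Note how costly the second option is here: only one of the four rows survives, so dropping rows can throw away a lot of otherwise useful data.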

The next cell shows how many missing values are still in the dataset.

In [None]:
total = data.isnull().sum().sort_values(ascending=False)
percent = (data.isnull().sum() /
           data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data[missing_data.Total > 0]

We have numeric variables (LotFrontage, GarageYrBlt, MasVnrArea) and categorical variables (MasVnrType, Electrical) with missing values.

#### Question 10, Data Cleaning
How would you replace the missing values of these variables, or would you remove the corresponding data points?