Data Science in Practice with Python


Sample 1 - REGRESSION

  • WHAT IS A REGRESSION? This is the best definition I found [Source: Wikipedia]: Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning.
  • HOW DOES IT WORK? Regression analysis is also used to understand which of the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. However, this can lead to illusory or spurious relationships, so caution is advisable; for example, correlation does not imply causation. A minimal fitting sketch follows below.
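
To make the idea concrete, here is a minimal, self-contained sketch of fitting a straight line by least squares (the toy data and numbers are made up purely for illustration; the notebook itself uses statsmodels and scikit-learn later):

import numpy as np

# Toy data: y is roughly 2*x plus a little noise
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Fit a degree-1 polynomial (a straight line) by least squares
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)  # slope close to 2, intercept close to 0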

The main goal of this notebook is to show you, in practice, what data science is and how we can use machine learning techniques to make predictions. In this first exercise I'll talk about linear regression. So, let's start...

Step 1 - Acquiring the data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

plt.style.use(style='ggplot')
plt.rcParams['figure.figsize'] = (6, 3)

We create two data frames to load the data: into the data frame "train" we load the data used to train the model, and into the data frame "test" we load the data on which we will run the model to make a prediction.

In [2]:
path = '/home/lserra/Documents/kaggle/sberbank/'

train = pd.read_csv(path + 'train.csv') 
test = pd.read_csv(path + 'test.csv')

How many observations and variables are there in each data frame?

In [3]:
print('Train data shape: ', train.shape)
print('-' * 50)
print('Test data shape: ', test.shape)
Train data shape:  (30471, 292)
--------------------------------------------------
Test data shape:  (7662, 291)

List the first five rows of each data frame for a first analysis.

In [4]:
train.head()
Out[4]:
id timestamp full_sq life_sq floor max_floor material build_year num_room kitch_sq ... cafe_count_5000_price_2500 cafe_count_5000_price_4000 cafe_count_5000_price_high big_church_count_5000 church_count_5000 mosque_count_5000 leisure_count_5000 sport_count_5000 market_count_5000 price_doc
0 1 2011-08-20 43 27.0 4.0 NaN NaN NaN NaN NaN ... 9 4 0 13 22 1 0 52 4 5850000
1 2 2011-08-23 34 19.0 3.0 NaN NaN NaN NaN NaN ... 15 3 0 15 29 1 10 66 14 6000000
2 3 2011-08-27 43 29.0 2.0 NaN NaN NaN NaN NaN ... 10 3 0 11 27 0 4 67 10 5700000
3 4 2011-09-01 89 50.0 9.0 NaN NaN NaN NaN NaN ... 11 2 1 4 4 0 0 26 3 13100000
4 5 2011-09-05 77 77.0 4.0 NaN NaN NaN NaN NaN ... 319 108 17 135 236 2 91 195 14 16331452

5 rows × 292 columns

In [5]:
test.head()
Out[5]:
id timestamp full_sq life_sq floor max_floor material build_year num_room kitch_sq ... cafe_count_5000_price_1500 cafe_count_5000_price_2500 cafe_count_5000_price_4000 cafe_count_5000_price_high big_church_count_5000 church_count_5000 mosque_count_5000 leisure_count_5000 sport_count_5000 market_count_5000
0 30474 2015-07-01 39.0 20.7 2 9 1 1998.0 1 8.9 ... 8 0 0 0 1 10 1 0 14 1
1 30475 2015-07-01 79.2 NaN 8 17 1 0.0 3 1.0 ... 4 1 1 0 2 11 0 1 12 1
2 30476 2015-07-01 40.5 25.1 3 5 2 1960.0 2 4.8 ... 42 11 4 0 10 21 0 10 71 11
3 30477 2015-07-01 62.8 36.0 17 17 1 2016.0 2 62.8 ... 1 1 2 0 0 10 0 0 2 0
4 30478 2015-07-01 40.0 40.0 17 17 1 0.0 1 1.0 ... 5 1 1 0 2 12 0 1 11 1

5 rows × 291 columns

In [6]:
# Which data types exist in the train data frame?
# To find out, you can use the following command
train.dtypes.head()
Out[6]:
id             int64
timestamp     object
full_sq        int64
life_sq      float64
floor        float64
dtype: object
In [7]:
# ftypes shows each column's dtype together with its storage (dense or sparse);
# note: DataFrame.ftypes was deprecated in later versions of pandas in favour of dtypes
train.ftypes.head()
Out[7]:
id             int64:dense
timestamp     object:dense
full_sq        int64:dense
life_sq      float64:dense
floor        float64:dense
dtype: object
In [8]:
# To list all columns (variables) that may contain NaN values, you can use the following command
train.isnull().any().head()
Out[8]:
id           False
timestamp    False
full_sq      False
life_sq       True
floor         True
dtype: bool

Step 2 - Exploring the data

In [9]:
# Generates descriptive statistics that summarize the central tendency, 
# dispersion and shape of a dataset’s distribution, excluding NaN values.
train['price_doc'].describe()
Out[9]:
count    3.047100e+04
mean     7.123035e+06
std      4.780111e+06
min      1.000000e+05
25%      4.740002e+06
50%      6.274411e+06
75%      8.300000e+06
max      1.111111e+08
Name: price_doc, dtype: float64
In [10]:
# Plotting a histogram and analyzing the skewness.
# To know more about that, please visit https://en.wikipedia.org/wiki/Skewness
print (">> Skew is: ", train['price_doc'].skew())
plt.hist(train['price_doc'], color='blue')
plt.show()
>> Skew is:  4.47474487357
In [11]:
# Making the distribution closer to normal (reducing the skew) by log-transforming the target
# np.log() will transform the variable, and np.exp() will reverse the transformation
target = np.log(train['price_doc'])
print(">> Skew is: ", target.skew())
plt.hist(target, color='blue')
plt.show()
>> Skew is:  -0.686715679719
In [12]:
# select_dtypes() returns a subset of the columns matching the specified data types
# here, numeric columns only
nf = train.select_dtypes(include=[np.number])
In [13]:
# Compute pairwise correlation of columns, excluding NA/null values
corr = nf.corr()

print(corr['price_doc'].sort_values(ascending=False)[:10], '\n')
print(corr['price_doc'].sort_values(ascending=False)[-10:])
price_doc           1.000000
num_room            0.476337
full_sq             0.341840
sport_count_5000    0.294864
sport_count_3000    0.290651
trc_count_5000      0.289371
sport_count_2000    0.278056
office_sqm_5000     0.269977
trc_sqm_5000        0.268072
sport_count_1500    0.258376
Name: price_doc, dtype: float64 

detention_facility_km   -0.223061
office_km               -0.223429
basketball_km           -0.223462
stadium_km              -0.236924
nuclear_reactor_km      -0.257946
ttk_km                  -0.272620
bulvar_ring_km          -0.279158
kremlin_km              -0.279249
sadovoe_km              -0.283622
zd_vokzaly_avto_km      -0.284069
Name: price_doc, dtype: float64

It is very useful to know the data values and to verify whether any NaN values exist.

In [14]:
train['num_room'].unique()
Out[14]:
array([ nan,   2.,   1.,   3.,   4.,   5.,   6.,   0.,  19.,  10.,   8.,
         7.,  17.,   9.])
In [15]:
# full_sq means 'Total area in square meters'
train['full_sq'].unique()
Out[15]:
array([  43,   34,   89,   77,   67,   25,   44,   42,   36,   38,   31,
         51,   47,   59,   74,   39,   48,   32,   45,   35,   73,   40,
         81,   37,   27,   33,   54,   46,   56,   75,   64,   96,   52,
         63,   50,   76,   66,   30,   53,  133,   72,   41,   26,   58,
         78,   57,   61,  325,   55,   22,   62,  102,  117,   60,   86,
         85,  115,   98,   80,   68,   49,  104,   84,   17,   93,   99,
        144,   83,   20,  154,   79,   71,   28,   70,   23,   90,  100,
        129,  130,   95,  111,  131,  108,   82,  126,  112,  110,   92,
         29,  125,  106,   87,   12,   97,  167,  114,  118,  136,  166,
        120,  123,   65,  183,  155,  101,  204,   88,   94,  107,   13,
        147,   69,  169,  127,    9,   19,  103,  291,   15,  353,  187,
        142,  137,  394,  113,  148,   18,    5,  134,  173,  412,  105,
        178,  119,  138,  729,  156,  139,  164,   21,  172, 5326,  208,
        150,  157,  388,  159,  158,   10,  170,  165,  109,  193,   24,
        181,  140,  179,   91,  210,  143,  634,  116,  121,    6,  211,
        122,  206,  149,  124,  461,  185,  146,   14,   11,  141,  226,
        160,  132,    1,  202,   16,  135,  215,  174,  153,  303,  189,
        186,  220,  161,  219,    0,  168,  275,  209,  177,  145,  128,
        184,  182,  197,  176,  151,  637,  603,  218,  216,  207,  199,
        407,  195])
In [16]:
# Create a spreadsheet-style pivot table as a DataFrame. 
# The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) 
# on the index and columns of the result DataFrame.
pivot = train.pivot_table(index='num_room', values='price_doc', aggfunc=np.median)
# list the first five rows
pivot.head()
Out[16]:
num_room
0.0     7590001
1.0     5250000
2.0     6824493
3.0     9205505
4.0    14400000
Name: price_doc, dtype: int64
In [17]:
# plotting the data in a bar chart, using the library matplotlib
pivot.plot(kind='bar', color='blue')
plt.xlabel('Num Room')
plt.ylabel('Median Price')
plt.xticks(rotation=0)
plt.show()
In [18]:
# plotting the data in a scatter chart, using the library matplotlib, with the normalized (log-transformed) target
plt.scatter(x=train['num_room'], y=target)
plt.xlabel('Num Room')
plt.ylabel('Log Price')
plt.show()

Here we create a new data frame from the original train data frame, keeping only the columns (variables) whose data type is not numeric. Then we generate descriptive statistics that summarize the central tendency, dispersion and shape of the dataset’s distribution, excluding NaN values.

In [19]:
categoricals = train.select_dtypes(exclude=[np.number])
categoricals.describe()
Out[19]:
timestamp product_type sub_area culture_objects_top_25 thermal_power_plant_raion incineration_raion oil_chemistry_raion radiation_raion railroad_terminal_raion big_market_raion nuclear_reactor_raion detention_facility_raion water_1line big_road1_1line railroad_1line ecology
count 30471 30471 30471 30471 30471 30471 30471 30471 30471 30471 30471 30471 30471 30471 30471 30471
unique 1161 2 146 2 2 2 2 2 2 2 2 2 2 2 2 5
top 2014-12-16 Investment Poselenie Sosenskoe no no no no no no no no no no no no poor
freq 160 19448 1776 28543 28817 28155 30175 19600 29335 27649 29608 27427 28134 29690 29578 8018

Step 3 - Cleaning up the data

In [20]:
# selecting from the train data frame only the columns (variables) with a numeric data type
# fill NaN values with the interpolate() function and eliminate any remaining NaN values with the dropna() function
# create a new data frame (data)
data = train.select_dtypes(include=[np.number]).interpolate().dropna() 
data.shape
Out[20]:
(22415, 276)
In [21]:
# selecting from the data frame 'data', created previously, the columns (variables) full_sq and num_room
# create a new data frame (new_train); adding .copy() here would avoid the SettingWithCopyWarning shown below
new_train = data[['full_sq', 'num_room']]
new_train.shape
Out[21]:
(22415, 2)
In [22]:
# selecting from the train data frame the column (variable) product_type
# replace the values using map(): 'Investment' -> 1, 'OwnerOccupier' -> 2
# note: the 'NaN' key below only matches the literal string 'NaN', not real missing values (np.nan);
# the test-set version in Step 6 uses .fillna(value=1) to handle those
# add a new column (product_type) to the data frame (new_train)
prodtype = {
'Investment': 1,
'OwnerOccupier': 2,
'NaN':1
}
new_train['product_type'] = train['product_type'].map(prodtype)
new_train.shape
/home/lserra/anaconda3/lib/python3.6/site-packages/ipykernel/__main__.py:9: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Out[22]:
(22415, 3)
In [23]:
# this function receives a row (a Series) and processes its build_year value;
# NaN values fail both comparisons and fall through to the else branch, returning 1
def calcbuildyear(series):
    if series['build_year'] <= 1:
        return 1
    elif series['build_year'] > 1:
        return 2017/series['build_year']
    else:
        return 1
In [24]:
# call the function created above to fill and compute the build_year column
# add a new column (build_year) to the data frame (new_train)
new_train['build_year'] = data.apply(calcbuildyear, axis=1)
new_train.shape
/home/lserra/anaconda3/lib/python3.6/site-packages/ipykernel/__main__.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
Out[24]:
(22415, 4)
In [25]:
# add a new column (price_doc) to the data frame (new_train)
# in this case the values were already normalized by the np.log() function applied previously
new_train['price_doc'] = target
new_train.shape
/home/lserra/anaconda3/lib/python3.6/site-packages/ipykernel/__main__.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
Out[25]:
(22415, 5)
In [26]:
# We will create a new data frame, removing the outlier rows where full_sq is 0
X = new_train[new_train['full_sq'] > 0]
X.shape              
Out[26]:
(22413, 5)
In [27]:
# this expression verifies whether any NaN values remain before moving on to the next step (preprocessing the data)
sum(X.isnull().sum() != 0)
Out[27]:
0

Step 4 - Preprocessing the data

In [28]:
# Use the scatter_matrix() function to create pair-wise scatter plots of all attributes.
# (in more recent pandas versions this function lives at pandas.plotting.scatter_matrix)
from pandas.tools.plotting import scatter_matrix
In [29]:
scatter_matrix(X,figsize=[10,10])
plt.show()

Anomaly Detection Using One-Class SVM

Anomaly detection is the problem of identifying data points that don't conform to expected (normal) behaviour. Unexpected data points are also known as outliers, exceptions, etc. Anomaly detection has crucial significance in a wide variety of domains, as it provides critical and actionable information. For example, an anomaly in an MRI scan could be an indication of a malignant tumor, and an anomalous reading from a production plant sensor may indicate a faulty component.

For more information about it, please visit: http://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html

In [30]:
from sklearn import svm
In [31]:
Y = np.array(X[['num_room', 'price_doc']])
print(Y.shape)

clf = svm.OneClassSVM(nu=0.05, kernel="rbf", gamma=0.1)
clf.fit(Y)
(22413, 2)
Out[31]:
OneClassSVM(cache_size=200, coef0=0.0, degree=3, gamma=0.1, kernel='rbf',
      max_iter=-1, nu=0.05, random_state=None, shrinking=True, tol=0.001,
      verbose=False)
In [32]:
pred = clf.predict(Y)
# inliers are labeled 1, outliers are labeled -1
normal = Y[pred == 1]
abnormal = Y[pred == -1]
In [33]:
# column 0 of Y is num_room and column 1 is the (log) price, so label the axes accordingly
plt.plot(normal[:, 0], normal[:, 1], 'bx')
plt.plot(abnormal[:, 0], abnormal[:, 1], 'ro')
plt.xlabel('Num Room')
plt.ylabel('Log Price')
plt.show()

Step 5 - Building the model (Linear Regression)

In [34]:
import statsmodels.formula.api as smf
In [35]:
m = smf.ols(formula='price_doc ~ num_room', data=X).fit()
print (m.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              price_doc   R-squared:                       0.165
Model:                            OLS   Adj. R-squared:                  0.165
Method:                 Least Squares   F-statistic:                     4414.
Date:                Thu, 06 Jul 2017   Prob (F-statistic):               0.00
Time:                        23:15:22   Log-Likelihood:                -17952.
No. Observations:               22413   AIC:                         3.591e+04
Df Residuals:                   22411   BIC:                         3.592e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     15.1075      0.009   1680.699      0.000        15.090    15.125
num_room       0.2858      0.004     66.441      0.000         0.277     0.294
==============================================================================
Omnibus:                     5889.417   Durbin-Watson:                   1.965
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            22491.105
Skew:                          -1.273   Prob(JB):                         0.00
Kurtosis:                       7.196   Cond. No.                         6.25
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [36]:
m = smf.ols(formula='price_doc ~ full_sq', data=X).fit()
print (m.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              price_doc   R-squared:                       0.231
Model:                            OLS   Adj. R-squared:                  0.231
Method:                 Least Squares   F-statistic:                     6744.
Date:                Thu, 06 Jul 2017   Prob (F-statistic):               0.00
Time:                        23:15:22   Log-Likelihood:                -17019.
No. Observations:               22413   AIC:                         3.404e+04
Df Residuals:                   22411   BIC:                         3.406e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     14.9549      0.009   1626.405      0.000        14.937    14.973
full_sq        0.0129      0.000     82.123      0.000         0.013     0.013
==============================================================================
Omnibus:                     9744.064   Durbin-Watson:                   1.986
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            92445.147
Skew:                          -1.844   Prob(JB):                         0.00
Kurtosis:                      12.240   Cond. No.                         156.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [37]:
m = smf.ols(formula='price_doc ~ full_sq + num_room + build_year + product_type', data=X).fit()
print (m.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              price_doc   R-squared:                       0.259
Model:                            OLS   Adj. R-squared:                  0.259
Method:                 Least Squares   F-statistic:                     1963.
Date:                Thu, 06 Jul 2017   Prob (F-statistic):               0.00
Time:                        23:15:23   Log-Likelihood:                -16601.
No. Observations:               22413   AIC:                         3.321e+04
Df Residuals:                   22408   BIC:                         3.325e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
Intercept       15.0911      0.013   1122.944      0.000        15.065    15.117
full_sq          0.0113      0.000     53.166      0.000         0.011     0.012
num_room         0.0822      0.006     14.802      0.000         0.071     0.093
build_year      -0.0002      0.001     -0.436      0.663        -0.001     0.001
product_type    -0.1525      0.007    -20.763      0.000        -0.167    -0.138
==============================================================================
Omnibus:                    10089.173   Durbin-Watson:                   1.988
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            76991.262
Skew:                          -2.007   Prob(JB):                         0.00
Kurtosis:                      11.144   Cond. No.                         255.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [38]:
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

Let’s perform the final steps to prepare our data for modeling. We’ll separate the features and the target variable, assigning the features to Z and the target variable to y (the name X is already taken by our cleaned data frame). The y variable was already transformed with np.log() as explained above. The X.drop([features], axis=1) call tells pandas which columns we want to exclude: we won’t include 'price_doc' for obvious reasons, and here we also drop num_room, product_type and build_year, leaving full_sq as the single feature.

Let’s partition the data and start modeling. We will use the train_test_split() function from scikit-learn to create a training set and a hold-out set. Partitioning the data in this way allows us to evaluate how our model might perform on data that it has never seen before. If we trained the model on all of the data, it would be difficult to tell whether overfitting has taken place.

train_test_split() returns four objects:

  • Z_train is the subset of our features used for training.
  • Z_test is the subset which will be our ‘hold-out’ set - what we’ll use to test the model.
  • y_train is the target variable 'price_doc' which corresponds to Z_train.
  • y_test is the target variable 'price_doc' which corresponds to Z_test.
In [39]:
y = X['price_doc']
Z = X.drop(['num_room', 'product_type', 'build_year','price_doc'], axis=1)

Z_train, Z_test, y_train, y_test = train_test_split(Z, y, random_state=45, test_size=.50)
In [40]:
lr = linear_model.LinearRegression()
model = lr.fit(Z_train, y_train)

Now we want to evaluate the performance of the model. Each competition might evaluate the submissions differently; in this competition, Kaggle will evaluate our submission using root mean squared error (RMSE). We’ll also look at the r-squared value. The r-squared value is a measure of how close the data are to the fitted regression line. It takes a value between 0 and 1, with 1 meaning that all of the variance in the target is explained by the data. In general, a higher r-squared value means a better fit.

The model.score() method returns the r-squared value by default.

In [41]:
print("R² Score => ", model.score(Z_test, y_test))
R² Score =>  0.217740153499
In [42]:
predictions = model.predict(Z_test)
In [43]:
print("RMSE => ", mean_squared_error(y_test, predictions))
RMSE =>  0.272118677563
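
Since mean_squared_error() returns the MSE rather than the RMSE, taking its square root gives the RMSE (a one-line sketch, not part of the original run):

print("RMSE => ", np.sqrt(mean_squared_error(y_test, predictions)))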
In [44]:
actual_values = y_test
plt.scatter(predictions, actual_values, alpha=.75, color='b')  # alpha helps to show overlapping data
plt.xlabel('Predicted price')
plt.ylabel('Actual price')
plt.title('Linear Regression Model')
plt.show()

Ridge Regularization (Linear least squares with L2 regularization)

This model solves a regression model where the loss function is the linear least squares function and regularization is given by the L2-norm. Also known as Ridge Regression or Tikhonov regularization. This estimator has built-in support for multi-variate regression (i.e., when y is a 2d-array of shape [n_samples, n_targets]).

Source: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html

In [45]:
for i in range(-2, 3):
    alpha = 10**i  # try alpha values from 0.01 to 100
    rm = linear_model.Ridge(alpha=alpha)
    ridge_model = rm.fit(Z_train, y_train)
    preds_ridge = ridge_model.predict(Z_test)
    
    plt.scatter(preds_ridge, actual_values, alpha=.75, color='b')
    plt.xlabel('Predicted Price')
    plt.ylabel('Actual Price')
    plt.title('Ridge Regularization with alpha = {}'.format(alpha))
    plt.show()
    print('-' * 50)
    print('R² => ', ridge_model.score(Z_test, y_test))
    print('MSE => ', mean_squared_error(y_test, preds_ridge))
    print('-' * 50)
--------------------------------------------------
R² =>  0.217740153615
MSE =>  0.272118677523
--------------------------------------------------
--------------------------------------------------
R² =>  0.217740154658
MSE =>  0.27211867716
--------------------------------------------------
--------------------------------------------------
R² =>  0.21774016509
MSE =>  0.272118673531
--------------------------------------------------
--------------------------------------------------
R² =>  0.217740269409
MSE =>  0.272118637242
--------------------------------------------------
--------------------------------------------------
R² =>  0.217741312481
MSE =>  0.272118274397
--------------------------------------------------
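
Scanning alpha values by hand works, but scikit-learn also provides RidgeCV, which chooses alpha by cross-validation. A short sketch (the alpha grid below is an arbitrary choice matching the loop above, and this was not part of the original run):

from sklearn.linear_model import RidgeCV

# Let cross-validation pick among the same alpha values tried above
rcv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0])
rcv.fit(Z_train, y_train)
print('Best alpha => ', rcv.alpha_)
print('R² => ', rcv.score(Z_test, y_test))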

Step 6 - Making a submission

In [46]:
data = test.select_dtypes(include=[np.number]).interpolate().dropna()
data.shape
Out[46]:
(7660, 275)
In [47]:
# new_test = data[['id', 'full_sq', 'num_room']].fillna(value=1)
new_test = data[['id', 'full_sq', 'num_room']]
new_test.shape
Out[47]:
(7660, 3)
In [48]:
# map product_type as before; the .fillna(value=1) below handles real missing (NaN) values
prodtype = {
'Investment': 1,
'OwnerOccupier': 2,
'NaN':1
}
new_test['product_type'] = test['product_type'].map(prodtype).fillna(value=1)
new_test.shape
/home/lserra/anaconda3/lib/python3.6/site-packages/ipykernel/__main__.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Out[48]:
(7660, 4)
In [49]:
def calcbuildyear(series):
    if series['build_year'] <= 1:
        return 1
    elif series['build_year'] > 1:
        return 2017/series['build_year']
    else:
        return 1
In [50]:
# new_test['build_year'] = data.apply(calcbuildyear, axis=1).fillna(value=1)
new_test['build_year'] = data.apply(calcbuildyear, axis=1)
new_test.shape
/home/lserra/anaconda3/lib/python3.6/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
Out[50]:
(7660, 5)
In [51]:
# We will create a new dataframe with some outliers removed.
df = new_test[new_test['full_sq'] > 0]
df.shape              
Out[51]:
(7659, 5)
In [52]:
sum(df.isnull().sum() != 0)
Out[52]:
0
In [53]:
submission = pd.DataFrame()
submission['id'] = df['id']
submission.shape
Out[53]:
(7659, 1)
In [54]:
Z = df.drop(['id','num_room', 'product_type', 'build_year'], axis=1)
In [55]:
predictions = model.predict(Z)
In [56]:
final_predictions = np.exp(predictions)
In [57]:
print(">> Original predictions are:\n", predictions[:5], "\n")
print(">> Final predictions are:\n", final_predictions[:5])
>> Original predictions are:
 [ 15.46314328  15.76748615  15.45631945  15.57136924  15.43994225] 

>> Final predictions are:
 [ 5194668.02181842  7042587.14052995  5159341.1530575   5788416.02423529
  5075533.75477477]
In [58]:
submission['price_doc'] = final_predictions
submission.head()
Out[58]:
id price_doc
2 30476 5.194668e+06
3 30477 7.042587e+06
4 30478 5.159341e+06
5 30479 5.788416e+06
6 30480 5.075534e+06
In [59]:
submission.to_csv(path+'submission1.csv', index=False)

Next steps


You can extend this tutorial and improve your results by:

  • Working with and transforming other features in the training set
  • Experimenting with different modeling techniques, such as Random Forest Regressors or Gradient Boosting (see the sketch after this list)
  • Using ensembling models
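
For instance, a random forest can be dropped into the pipeline above with almost no code changes. A hedged sketch reusing Z_train/y_train from Step 5 (n_estimators=100 and random_state=45 are arbitrary starting points; no output is shown because this was not part of the original run):

from sklearn.ensemble import RandomForestRegressor

# Fit an ensemble of decision trees on the same features used before
rf = RandomForestRegressor(n_estimators=100, random_state=45)
rf.fit(Z_train, y_train)
print('R² => ', rf.score(Z_test, y_test))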

We created a set of categorical features called categoricals that were not all included in the final model. Go back and try to include these features. There are other methods that might help with categorical data, notably the pd.get_dummies() method, sketched below. After working on these features, repeat the transformations for the test data and make another submission.
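
As a quick sketch of how pd.get_dummies() could be applied here (using 'ecology', one of the categorical columns we saw in Step 2, purely as an example; this was not part of the original run):

# One-hot encode a categorical column into 0/1 indicator columns
dummies = pd.get_dummies(train['ecology'], prefix='ecology')
train_with_dummies = train.join(dummies)
train_with_dummies.filter(like='ecology_').head()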

Working on models and participating in Kaggle competitions can be an iterative process – it’s important to experiment with new ideas, learn about the data, and test newer models and techniques.

With these tools, you can build upon your work and improve your results.

Good luck!

If you want to contact me to clear up any doubts, make suggestions or criticisms... or to learn about my services, please send me an e-mail: laercio.serra@gmail.com, or visit my website for more details at: http://lserra.datafresh.com.br

In [ ]: