Identifying safe loans with decision trees

The LendingClub is a peer-to-peer lending company that directly connects borrowers and potential lenders/investors. In this notebook, you will build a classification model to predict whether or not a loan provided by LendingClub is likely to default.

In this notebook you will use data from the LendingClub to predict whether a loan will be paid off in full or the loan will be charged off and possibly go into default. In this assignment you will:

  • Use SFrames to do some feature engineering.
  • Train a decision-tree on the LendingClub dataset.
  • Visualize the tree.
  • Predict whether a loan will default along with prediction probabilities (on a validation set).
  • Train a complex tree model and compare it to a simple tree model.

Let's get started!

Fire up GraphLab Create

Make sure you have the latest version of GraphLab Create. If you don't find the decision tree module, then you would need to upgrade GraphLab Create using

   pip install graphlab-create --upgrade
# import graphlab
# graphlab.canvas.set_target('ipynb')
import pandas
from pandas import DataFrame
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from graphviz import Source
from sklearn import tree

Load LendingClub dataset

We will be using a dataset from the LendingClub. A parsed and cleaned form of the dataset is available here. Make sure you download the dataset before running the following command.

loans = pandas.read_csv('D:/ml_data/lending-club-data.csv')
# loans = pandas.read_csv('/home/jo/我的坚果云/lending-club-data.csv')
C:\anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2698: DtypeWarning: Columns (19,47) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
# loans.iloc[47]
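
The DtypeWarning above names its own two remedies: pass a dtype for the affected columns, or set low_memory=False. Below is a minimal sketch of both on a tiny in-memory CSV. The column name 'code' and its values are made up for illustration and are not columns of the LendingClub file.

```python
import io
import pandas

# Tiny in-memory CSV standing in for the real file. The hypothetical 'code'
# column mixes numbers and text, the kind of column the warning refers to.
csv_text = "id,code\n1,10\n2,abc\n3,30\n"

# Option 1: declare the column's type up front so pandas does not have to guess.
as_text = pandas.read_csv(io.StringIO(csv_text), dtype={'code': str})

# Option 2: let pandas process the file as a whole instead of in chunks,
# which keeps type inference consistent within each column.
one_pass = pandas.read_csv(io.StringIO(csv_text), low_memory=False)

print(as_text['code'].tolist())
```

For the real dataset, the same keyword arguments would be passed to the pandas.read_csv call that loads lending-club-data.csv.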

Exploring some features

Let's quickly explore what the dataset looks like. First, let's print out the column names to see what features we have in this dataset.

loans.dtypes.index
Index(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title',
       'emp_length', 'home_ownership', 'annual_inc', 'is_inc_v', 'issue_d',
       'loan_status', 'pymnt_plan', 'url', 'desc', 'purpose', 'title',
       'zip_code', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line',
       'inq_last_6mths', 'mths_since_last_delinq', 'mths_since_last_record',
       'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc',
       'initial_list_status', 'out_prncp', 'out_prncp_inv', 'total_pymnt',
       'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
       'total_rec_late_fee', 'recoveries', 'collection_recovery_fee',
       'last_pymnt_d', 'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d',
       'collections_12_mths_ex_med', 'mths_since_last_major_derog',
       'policy_code', 'not_compliant', 'status', 'inactive_loans', 'bad_loans',
       'emp_length_num', 'grade_num', 'sub_grade_num', 'delinq_2yrs_zero',
       'pub_rec_zero', 'collections_12_mths_zero', 'short_emp',
       'payment_inc_ratio', 'final_d', 'last_delinq_none', 'last_record_none',
       'last_major_derog_none'],
      dtype='object')

Here, we see that we have some feature columns that have to do with grade of the loan, annual income, home ownership status, etc. Let's take a look at the distribution of loan grades in the dataset.

loans['grade'].head()
0    B
1    C
2    C
3    C
4    A
Name: grade, dtype: object
loans['bad_loans'].head()
0    0
1    1
2    0
3    0
4    0
Name: bad_loans, dtype: int64

We can see that over half of the loan grades are assigned values B or C. Each loan is assigned one of these grades, along with a more finely discretized feature called sub_grade (feel free to explore that feature column as well!). These values depend on the loan application and credit report, and determine the interest rate of the loan. More information can be found here.
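
Since head() shows only the first five rows, one way to check a claim about the overall grade distribution is value_counts(normalize=True). The sketch below uses a small made-up Series as a stand-in for loans['grade'], so the numbers it prints are illustrative and not the dataset's.

```python
import pandas

# Toy stand-in for loans['grade']; the values are made up for illustration.
grades = pandas.Series(['B', 'C', 'C', 'C', 'A', 'B', 'D', 'B'], name='grade')

# Fraction of loans in each grade, sorted by grade label.
grade_share = grades.value_counts(normalize=True).sort_index()
print(grade_share)

# Combined share of grades B and C.
b_or_c_share = grade_share.loc[['B', 'C']].sum()
print("Share of B or C: {:.1%}".format(b_or_c_share))
```

Running the same two steps on loans['grade'] (or on loans['sub_grade']) gives the shares for the real data.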

Now, let's look at a different feature.

loans['home_ownership'].head()
0    RENT
1    RENT
2    RENT
3    RENT
4    RENT
Name: home_ownership, dtype: object

This feature describes whether the loanee is mortgaging, renting, or owns a home. We can see that a small percentage of the loanees own a home.

Exploring the target column

The target column (label column) of the dataset that we are interested in is called bad_loans. In this column, 1 means a risky (bad) loan and 0 means a safe loan.

In order to make this more intuitive and consistent with the lectures, we reassign the target to be:

  • +1 as a safe loan,
  • -1 as a risky (bad) loan.

We put this in a new column called safe_loans.

# safe_loans =  1 => safe
# safe_loans = -1 => risky
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)
del loans['bad_loans']
loans.dtypes.index
Index(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title',
       'emp_length', 'home_ownership', 'annual_inc', 'is_inc_v', 'issue_d',
       'loan_status', 'pymnt_plan', 'url', 'desc', 'purpose', 'title',
       'zip_code', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line',
       'inq_last_6mths', 'mths_since_last_delinq', 'mths_since_last_record',
       'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc',
       'initial_list_status', 'out_prncp', 'out_prncp_inv', 'total_pymnt',
       'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
       'total_rec_late_fee', 'recoveries', 'collection_recovery_fee',
       'last_pymnt_d', 'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d',
       'collections_12_mths_ex_med', 'mths_since_last_major_derog',
       'policy_code', 'not_compliant', 'status', 'inactive_loans',
       'emp_length_num', 'grade_num', 'sub_grade_num', 'delinq_2yrs_zero',
       'pub_rec_zero', 'collections_12_mths_zero', 'short_emp',
       'payment_inc_ratio', 'final_d', 'last_delinq_none', 'last_record_none',
       'last_major_derog_none', 'safe_loans'],
      dtype='object')
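
The same remapping can also be sketched without a Python-level lambda, using Series.map with a dictionary. The toy Series below is a stand-in for loans['bad_loans'], not the real column.

```python
import pandas

# Toy stand-in for loans['bad_loans'] (0 = safe, 1 = risky); values are illustrative.
bad_loans = pandas.Series([0, 1, 0, 0, 1], name='bad_loans')

# Map 0 -> +1 (safe) and 1 -> -1 (risky) with a dictionary lookup.
safe_loans_col = bad_loans.map({0: 1, 1: -1})

# The lambda-based version used in this notebook, for comparison.
safe_loans_lambda = bad_loans.apply(lambda x: +1 if x == 0 else -1)

print(safe_loans_col.tolist())
```

One difference to keep in mind: the lambda sends every non-zero value to -1, whereas map produces NaN for any value missing from the dictionary, which makes unexpected labels easier to spot.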

Now, let us explore the distribution of the column safe_loans. This gives us a sense of how many safe and risky loans are present in the dataset.

loans['safe_loans'].value_counts()
 1    99457
-1    23150
Name: safe_loans, dtype: int64
loans['safe_loans'].value_counts().plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x1fc66bd0>

[Figure: bar chart of the safe_loans value counts (+1 vs. -1)]

You should have:

  • Around 81% safe loans
  • Around 19% risky loans

It looks like most of these loans are safe loans (thankfully). But this does make our problem of identifying risky loans challenging.
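
To see why the imbalance makes the problem challenging, consider a classifier that always predicts "safe". The short calculation below uses the two counts printed by value_counts() above; the always-safe classifier is a hypothetical baseline, not a model built in this assignment.

```python
# Counts taken from the value_counts() output above.
n_safe, n_risky = 99457, 23150

# A classifier that always predicts +1 (safe) is right on every safe loan
# and wrong on every risky one.
baseline_accuracy = n_safe / (n_safe + n_risky)
risky_recall = 0 / n_risky  # it never flags a single risky loan

print("Majority-class accuracy: {:.2%}".format(baseline_accuracy))
print("Risky loans caught     : {:.0%}".format(risky_recall))
```

Such a baseline scores about 81% accuracy while catching no risky loans at all, so accuracy on the raw data says little about how well risky loans are identified.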

Features for the classification algorithm

In this assignment, we will be using a subset of features (categorical and numeric). The features we will be using are described in the code comments below. If you are a finance geek, the LendingClub website has a lot more details about these features.

features = ['grade',                     # grade of the loan
            'sub_grade',                 # sub-grade of the loan
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'term',                      # the term of the loan
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to date
           ]

target = 'safe_loans'                   # prediction target (y) (+1 means safe, -1 is risky)

# Extract the feature columns and target column
loans = loans[features + [target]]

What remains now is a subset of features and the target that we will use for the rest of this notebook.

loans.head()
   grade  sub_grade  short_emp  emp_length_num  home_ownership    dti         purpose       term  last_delinq_none  last_major_derog_none  revol_util  total_rec_late_fee  safe_loans
0      B         B2          0              11            RENT  27.65     credit_card  36 months                 1                      1        83.7                0.00           1
1      C         C4          1               1            RENT   1.00             car  60 months                 1                      1         9.4                0.00          -1
2      C         C5          0              11            RENT   8.72  small_business  36 months                 1                      1        98.5                0.00           1
3      C         C1          0              11            RENT  20.00           other  36 months                 0                      1        21.0               16.97           1
4      A         A4          0               4            RENT  11.20         wedding  36 months                 1                      1        28.3                0.00           1
len(loans)
122607

Sample data to balance classes

As we explored above, our data is disproportionately full of safe loans. Let's create two datasets: one with just the safe loans (safe_loans_raw) and one with just the risky loans (risky_loans_raw).

safe_loans_raw = loans[loans[target] == +1]
risky_loans_raw = loans[loans[target] == -1]
print("Number of safe loans  : {}".format(len(safe_loans_raw)))
print("Number of risky loans : {}".format(len(risky_loans_raw)))
Number of safe loans  : 99457
Number of risky loans : 23150

Now, write some code below to compute the percentage of safe and risky loans in the dataset, and validate these numbers against the percentages given earlier in the assignment (around 81% safe and 19% risky):

numofsafe = int(len(safe_loans_raw))
numofrisky = int(len(risky_loans_raw))
total = numofsafe + numofrisky
print ("Percentage of safe loans  : {:.2%}".format(numofsafe/total))
print ("Percentage of risky loans : {:.2%}".format(numofrisky/total))
Percentage of safe loans  : 81.12%
Percentage of risky loans : 18.88%

One way to combat class imbalance is to undersample the larger class until the class distribution is approximately half and half. Here, we will undersample the larger class (safe loans) in order to balance out our dataset. This means we are throwing away many data points. We use random_state=1 as the seed so everyone gets the same results.
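
The undersampling idea described above can be sketched as a small reusable function. The helper name undersample_majority and the toy DataFrame are hypothetical and only for illustration. Note that the helper samples a fixed number of rows per class, which differs slightly from sampling a fraction of the safe loans.

```python
import pandas

def undersample_majority(df, target, random_state=1):
    """Return rows of df sampled so every class has as many rows as the rarest class."""
    smallest = df[target].value_counts().min()
    parts = [group.sample(n=smallest, random_state=random_state)
             for _, group in df.groupby(target)]
    return pandas.concat(parts)

# Toy data: 8 safe loans (+1) and 2 risky loans (-1); values are made up.
toy = pandas.DataFrame({
    'dti': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    'safe_loans': [1, 1, 1, 1, 1, 1, 1, 1, -1, -1],
})

balanced = undersample_majority(toy, 'safe_loans')
print(balanced['safe_loans'].value_counts())
```

Fixing random_state keeps the sampled rows the same from run to run, which is the role the seed plays in this assignment.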

risky_loans_raw
grade sub_grade short_emp emp_length_num home_ownership dti purpose term last_delinq_none last_major_derog_none revol_util total_rec_late_fee safe_loans
1 C C4 1 1 RENT 1.00 car 60 months 1 1 9.4 0.000 -1
6 F F2 0 5 OWN 5.55 small_business 60 months 1 1 32.6 0.000 -1
7 B B5 1 1 RENT 18.08 other 60 months 1 1 36.5 0.000 -1
10 C C1 1 1 RENT 10.08 debt_consolidation 36 months 1 1 91.7 0.000 -1
12 B B2 0 4 RENT 7.06 other 36 months 1 1 55.5 0.000 -1
18 B B4 0 11 RENT 13.22 debt_consolidation 36 months 1 1 90.3 0.000 -1
21 B B3 0 2 RENT 2.40 major_purchase 36 months 1 1 29.7 0.000 -1
23 C C2 0 10 RENT 15.22 debt_consolidation 36 months 1 1 57.6 0.000 -1
24 D D2 0 3 RENT 13.97 other 60 months 0 1 59.5 0.000 -1
41 A A5 0 11 MORTGAGE 16.33 debt_consolidation 36 months 1 1 62.1 0.000 -1
45 B B1 0 9 MORTGAGE 9.12 debt_consolidation 36 months 1 1 63.7 24.170 -1
48 C C5 0 5 RENT 20.88 car 36 months 1 1 90.8 0.000 -1
50 E E4 0 8 RENT 21.58 debt_consolidation 60 months 1 1 97.6 0.000 -1
58 D D3 0 6 RENT 13.16 debt_consolidation 60 months 1 1 70.8 0.000 -1
60 F F2 0 5 RENT 12.48 small_business 60 months 1 1 73.9 0.000 -1
63 D D2 0 6 RENT 20.22 debt_consolidation 36 months 0 1 67.5 0.000 -1
87 D D3 0 8 MORTGAGE 21.31 credit_card 60 months 1 1 86.1 0.000 -1
89 B B1 0 3 RENT 20.64 debt_consolidation 36 months 1 1 47.7 0.000 -1
93 D D2 0 11 RENT 23.18 debt_consolidation 60 months 1 1 79.7 0.000 -1
102 D D5 0 2 RENT 24.14 credit_card 36 months 0 1 96.3 36.247 -1
108 B B2 0 5 RENT 22.80 debt_consolidation 36 months 0 1 54.2 0.000 -1
111 E E4 0 11 RENT 20.70 debt_consolidation 60 months 1 1 87.6 0.000 -1
118 C C5 0 9 MORTGAGE 9.17 home_improvement 36 months 1 1 71.2 0.000 -1
124 A A2 0 11 RENT 29.85 credit_card 36 months 1 1 62.3 0.000 -1
132 B B1 0 3 RENT 7.83 debt_consolidation 36 months 1 1 65.4 0.000 -1
136 B B3 0 9 RENT 22.08 debt_consolidation 36 months 1 1 29.3 0.000 -1
138 C C1 0 11 RENT 21.89 credit_card 60 months 1 1 65.8 0.000 -1
140 C C3 0 2 MORTGAGE 12.24 debt_consolidation 36 months 0 1 90.8 0.000 -1
151 A A3 1 0 OWN 16.30 debt_consolidation 36 months 1 1 42.2 0.000 -1
158 C C1 0 6 RENT 6.92 small_business 36 months 0 1 69.5 0.000 -1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
122470 D D1 0 10 RENT 10.84 credit_card 60 months 1 1 54.2 0.000 -1
122471 C C3 0 11 RENT 21.86 debt_consolidation 36 months 0 1 86.2 0.000 -1
122475 B B2 0 11 MORTGAGE 6.16 debt_consolidation 36 months 0 0 82.7 0.000 -1
122483 C C3 0 11 MORTGAGE 13.22 debt_consolidation 60 months 0 1 40.6 0.000 -1
122488 D D4 0 8 MORTGAGE 13.73 debt_consolidation 60 months 1 1 44.0 0.000 -1
122492 F F3 0 11 MORTGAGE 11.97 debt_consolidation 60 months 1 1 30.7 0.000 -1
122495 D D1 0 10 RENT 14.90 debt_consolidation 36 months 1 1 81.4 0.000 -1
122509 B B1 0 11 MORTGAGE 26.02 debt_consolidation 36 months 0 1 40.5 0.000 -1
122511 C C1 1 0 MORTGAGE 28.06 credit_card 36 months 0 0 69.9 0.000 -1
122527 B B4 0 11 OWN 5.88 home_improvement 36 months 1 1 6.0 0.000 -1
122534 E E5 0 7 OWN 10.90 small_business 36 months 0 1 21.6 0.000 -1
122542 D D1 0 11 RENT 6.83 debt_consolidation 60 months 1 1 64.6 0.000 -1
122543 C C2 0 2 MORTGAGE 6.60 debt_consolidation 60 months 1 1 50.0 0.000 -1
122556 D D4 0 6 MORTGAGE 22.73 debt_consolidation 60 months 1 1 64.5 0.000 -1
122557 D D1 0 3 MORTGAGE 16.16 credit_card 60 months 1 1 63.1 0.000 -1
122563 C C2 0 11 MORTGAGE 7.46 debt_consolidation 36 months 0 0 12.4 0.000 -1
122565 C C1 0 2 RENT 23.76 credit_card 36 months 1 1 39.2 0.000 -1
122566 B B3 1 0 RENT 11.37 debt_consolidation 36 months 1 1 40.0 0.000 -1
122570 A A4 0 7 MORTGAGE 27.68 major_purchase 36 months 0 1 70.6 0.000 -1
122573 B B2 0 11 MORTGAGE 18.37 debt_consolidation 36 months 1 1 55.3 0.000 -1
122580 B B5 0 2 RENT 27.75 debt_consolidation 36 months 1 1 83.6 0.000 -1
122583 C C4 1 1 RENT 18.09 debt_consolidation 36 months 0 1 0.0 0.000 -1
122587 C C3 0 9 MORTGAGE 11.48 debt_consolidation 60 months 1 1 64.7 0.000 -1
122590 B B5 1 1 OWN 19.62 small_business 36 months 1 1 68.5 0.000 -1
122594 C C3 1 1 OWN 26.21 credit_card 36 months 1 1 28.4 0.000 -1
122596 D D1 0 11 MORTGAGE 19.56 debt_consolidation 36 months 0 1 52.5 0.000 -1
122601 B B5 0 5 MORTGAGE 18.69 debt_consolidation 36 months 0 1 29.5 0.000 -1
122602 E E5 1 0 MORTGAGE 1.50 medical 60 months 0 0 14.6 0.000 -1
122604 D D3 0 6 MORTGAGE 12.28 medical 60 months 0 0 10.7 0.000 -1
122605 D D5 0 11 MORTGAGE 18.45 debt_consolidation 60 months 1 1 46.3 0.000 -1

23150 rows × 13 columns

# Since there are fewer risky loans than safe loans, find the ratio of the sizes
# and use that percentage to undersample the safe loans.
percentage = len(risky_loans_raw)/float(len(safe_loans_raw))
percentage
0.2327639080205516
risky_loans = risky_loans_raw
safe_loans = safe_loans_raw.sample(frac=percentage, random_state=1)
print(safe_loans)
# Combine the risky_loans with the downsampled version of safe_loans
# (DataFrame.append was removed in pandas 2.0; pandas.concat is the replacement)
loans_data = pandas.concat([risky_loans, safe_loans])
       grade sub_grade  short_emp  emp_length_num home_ownership    dti  \
14937      B        B5          0               8           RENT  21.97   
104761     A        A1          0              11           RENT   0.92   
77248      B        B4          1               1           RENT   7.73   
120821     A        A5          1               1           RENT   4.12   
104521     D        D4          0              11           RENT   9.34   
89340      B        B4          0              11       MORTGAGE  16.08   
97993      A        A4          0              11       MORTGAGE  18.83   
51510      A        A5          0               5       MORTGAGE   8.19   
90723      B        B3          0              11       MORTGAGE  18.10   
19435      C        C3          0               6           RENT  20.13   
32947      C        C3          0              11           RENT  14.00   
119000     C        C3          1               1           RENT  20.28   
97425      A        A3          0              11       MORTGAGE  23.78   
10733      A        A2          0              11       MORTGAGE  22.55   
68245      D        D4          1               0       MORTGAGE  34.20   
83085      C        C1          0              10       MORTGAGE   9.28   
48666      B        B3          0               8           RENT  13.97   
21646      D        D2          0               2           RENT  11.08   
104040     A        A2          0               2           RENT  16.88   
19715      A        A3          0               4       MORTGAGE   8.14   
106034     C        C4          0              11           RENT   9.29   
54040      C        C2          1               1           RENT  25.62   
32830      B        B4          0               5           RENT  14.68   
120528     F        F3          0              11            OWN  26.82   
45981      F        F5          0               9           RENT  26.69   
112102     C        C4          0              11       MORTGAGE  14.83   
31689      A        A4          0               6           RENT   5.07   
79139      A        A2          0               8           RENT   1.92   
20506      A        A2          0               4           RENT  23.94   
69886      D        D1          1               0       MORTGAGE  22.94   
...      ...       ...        ...             ...            ...    ...   
55027      B        B5          0               7       MORTGAGE   7.28   
100438     C        C1          0               6       MORTGAGE  15.37   
122519     D        D4          1               0            OWN  29.98   
4161       B        B1          0               5       MORTGAGE  17.33   
13520      C        C5          0               4           RENT  15.77   
51558      C        C4          0              11       MORTGAGE  20.88   
110837     A        A2          0               3           RENT   3.00   
83788      C        C2          1               0       MORTGAGE  22.27   
6556       D        D1          0               5       MORTGAGE   5.84   
106731     A        A5          0               3       MORTGAGE  17.59   
111533     B        B2          0               8           RENT  16.09   
115626     C        C5          1               1       MORTGAGE  19.01   
121848     C        C2          0              11       MORTGAGE  18.37   
108846     C        C3          0              10       MORTGAGE  13.12   
111652     C        C5          0              11       MORTGAGE  20.90   
74640      A        A1          0              11       MORTGAGE  10.15   
90254      B        B2          0               4           RENT  14.21   
73781      C        C3          0               7       MORTGAGE  22.36   
46499      B        B4          0              11       MORTGAGE  12.02   
6011       B        B2          0              11       MORTGAGE  10.22   
92462      C        C1          0               4           RENT  26.35   
122138     C        C5          0              11       MORTGAGE   4.41   
98381      B        B5          1               1           RENT   8.37   
104878     C        C1          0               8            OWN   4.22   
38265      C        C3          0               7       MORTGAGE  24.32   
10016      A        A4          0               4       MORTGAGE  21.36   
58367      B        B1          0               8       MORTGAGE  16.29   
90431      B        B1          0               9           RENT  14.40   
115727     F        F5          0               4           RENT  29.45   
105752     D        D5          0               2       MORTGAGE  11.65   

                   purpose        term  last_delinq_none  \
14937   debt_consolidation   60 months                 1   
104761                 car   36 months                 1   
77248   debt_consolidation   36 months                 1   
120821      major_purchase   36 months                 1   
104521         credit_card   36 months                 1   
89340          credit_card   36 months                 0   
97993   debt_consolidation   36 months                 1   
51510   debt_consolidation   60 months                 1   
90723   debt_consolidation   36 months                 1   
19435   debt_consolidation   36 months                 1   
32947   debt_consolidation   36 months                 1   
119000         credit_card   36 months                 0   
97425   debt_consolidation   36 months                 1   
10733                  car   36 months                 1   
68245   debt_consolidation   36 months                 0   
83085          credit_card   36 months                 1   
48666          credit_card   36 months                 1   
21646                  car   36 months                 1   
104040  debt_consolidation   36 months                 1   
19715          credit_card   36 months                 1   
106034  debt_consolidation   36 months                 1   
54040          credit_card   36 months                 1   
32830          credit_card   36 months                 1   
120528  debt_consolidation   36 months                 1   
45981   debt_consolidation   60 months                 0   
112102    home_improvement   36 months                 0   
31689          credit_card   36 months                 0   
79139       small_business   36 months                 1   
20506             vacation   36 months                 1   
69886     home_improvement   36 months                 1   
...                    ...         ...               ...   
55027          credit_card   36 months                 0   
100438  debt_consolidation   36 months                 1   
122519  debt_consolidation   36 months                 1   
4161           credit_card   36 months                 1   
13520   debt_consolidation   60 months                 1   
51558          credit_card   36 months                 1   
110837         credit_card   36 months                 1   
83788   debt_consolidation   36 months                 1   
6556    debt_consolidation   36 months                 1   
106731  debt_consolidation   60 months                 1   
111533  debt_consolidation   36 months                 0   
115626         credit_card   60 months                 1   
121848  debt_consolidation   60 months                 0   
108846    home_improvement   60 months                 1   
111652  debt_consolidation   36 months                 1   
74640   debt_consolidation   36 months                 1   
90254   debt_consolidation   36 months                 0   
73781          credit_card   36 months                 0   
46499   debt_consolidation   36 months                 0   
6011           credit_card   36 months                 0   
92462          credit_card   36 months                 1   
122138  debt_consolidation   60 months                 1   
98381   debt_consolidation   36 months                 1   
104878         credit_card   36 months                 1   
38265          credit_card   60 months                 1   
10016   debt_consolidation   36 months                 1   
58367   debt_consolidation   36 months                 1   
90431   debt_consolidation   36 months                 1   
115727  debt_consolidation   60 months                 0   
105752  debt_consolidation   60 months                 0   

        last_major_derog_none  revol_util  total_rec_late_fee  safe_loans  
14937                       1        77.7                 0.0           1  
104761                      1         0.6                 0.0           1  
77248                       1        40.9                 0.0           1  
120821                      1        15.4                 0.0           1  
104521                      1        66.3                 0.0           1  
89340                       1        46.7                 0.0           1  
97993                       1        27.2                 0.0           1  
51510                       1        10.6                 0.0           1  
90723                       1        61.7                 0.0           1  
19435                       1        75.0                 0.0           1  
32947                       1        33.1                 0.0           1  
119000                      0        64.3                 0.0           1  
97425                       1        66.9                 0.0           1  
10733                       1         3.1                 0.0           1  
68245                       0        59.2                 0.0           1  
83085                       1        82.1                 0.0           1  
48666                       1        72.1                 0.0           1  
21646                       1        85.2                 0.0           1  
104040                      1        63.5                 0.0           1  
19715                       1        52.5                 0.0           1  
106034                      1        75.5                 0.0           1  
54040                       1        87.2                 0.0           1  
32830                       1        34.4                 0.0           1  
120528                      1        86.6                 0.0           1  
45981                       0        97.5                 0.0           1  
112102                      0        87.0                 0.0           1  
31689                       1        17.8                 0.0           1  
79139                       1        22.3                 0.0           1  
20506                       1         0.0                 0.0           1  
69886                       1        77.6                 0.0           1  
...                       ...         ...                 ...         ...  
55027                       1        53.5                 0.0           1  
100438                      1        63.8                 0.0           1  
122519                      1        92.7                 0.0           1  
4161                        1        77.1                 0.0           1  
13520                       1        27.5                 0.0           1  
51558                       1        78.8                 0.0           1  
110837                      1         1.2                 0.0           1  
83788                       1        71.4                 0.0           1  
6556                        1        71.2                 0.0           1  
106731                      1         9.0                 0.0           1  
111533                      0        60.0                 0.0           1  
115626                      1        70.7                 0.0           1  
121848                      1        28.2                 0.0           1  
108846                      1        40.0                 0.0           1  
111652                      1        94.0                 0.0           1  
74640                       1        67.2                 0.0           1  
90254                       1         6.3                 0.0           1  
73781                       1        72.4                 0.0           1  
46499                       0        36.0                 0.0           1  
6011                        1        49.1                 0.0           1  
92462                       1        76.1                 0.0           1  
122138                      0        36.0                 0.0           1  
98381                       1        71.3                 0.0           1  
104878                      1        81.6                 0.0           1  
38265                       1        50.1                 0.0           1  
10016                       1        16.5                 0.0           1  
58367                       1        41.0                 0.0           1  
90431                       1        36.3                 0.0           1  
115727                      0        36.2                 0.0           1  
105752                      1        83.0                 0.0           1  

[23150 rows x 13 columns]
loans_data
grade sub_grade short_emp emp_length_num home_ownership dti purpose term last_delinq_none last_major_derog_none revol_util total_rec_late_fee safe_loans
1 C C4 1 1 RENT 1.00 car 60 months 1 1 9.4 0.000 -1
6 F F2 0 5 OWN 5.55 small_business 60 months 1 1 32.6 0.000 -1
7 B B5 1 1 RENT 18.08 other 60 months 1 1 36.5 0.000 -1
10 C C1 1 1 RENT 10.08 debt_consolidation 36 months 1 1 91.7 0.000 -1
12 B B2 0 4 RENT 7.06 other 36 months 1 1 55.5 0.000 -1
18 B B4 0 11 RENT 13.22 debt_consolidation 36 months 1 1 90.3 0.000 -1
21 B B3 0 2 RENT 2.40 major_purchase 36 months 1 1 29.7 0.000 -1
23 C C2 0 10 RENT 15.22 debt_consolidation 36 months 1 1 57.6 0.000 -1
24 D D2 0 3 RENT 13.97 other 60 months 0 1 59.5 0.000 -1
41 A A5 0 11 MORTGAGE 16.33 debt_consolidation 36 months 1 1 62.1 0.000 -1
45 B B1 0 9 MORTGAGE 9.12 debt_consolidation 36 months 1 1 63.7 24.170 -1
48 C C5 0 5 RENT 20.88 car 36 months 1 1 90.8 0.000 -1
50 E E4 0 8 RENT 21.58 debt_consolidation 60 months 1 1 97.6 0.000 -1
58 D D3 0 6 RENT 13.16 debt_consolidation 60 months 1 1 70.8 0.000 -1
60 F F2 0 5 RENT 12.48 small_business 60 months 1 1 73.9 0.000 -1
63 D D2 0 6 RENT 20.22 debt_consolidation 36 months 0 1 67.5 0.000 -1
87 D D3 0 8 MORTGAGE 21.31 credit_card 60 months 1 1 86.1 0.000 -1
89 B B1 0 3 RENT 20.64 debt_consolidation 36 months 1 1 47.7 0.000 -1
93 D D2 0 11 RENT 23.18 debt_consolidation 60 months 1 1 79.7 0.000 -1
102 D D5 0 2 RENT 24.14 credit_card 36 months 0 1 96.3 36.247 -1
108 B B2 0 5 RENT 22.80 debt_consolidation 36 months 0 1 54.2 0.000 -1
111 E E4 0 11 RENT 20.70 debt_consolidation 60 months 1 1 87.6 0.000 -1
118 C C5 0 9 MORTGAGE 9.17 home_improvement 36 months 1 1 71.2 0.000 -1
124 A A2 0 11 RENT 29.85 credit_card 36 months 1 1 62.3 0.000 -1
132 B B1 0 3 RENT 7.83 debt_consolidation 36 months 1 1 65.4 0.000 -1
136 B B3 0 9 RENT 22.08 debt_consolidation 36 months 1 1 29.3 0.000 -1
138 C C1 0 11 RENT 21.89 credit_card 60 months 1 1 65.8 0.000 -1
140 C C3 0 2 MORTGAGE 12.24 debt_consolidation 36 months 0 1 90.8 0.000 -1
151 A A3 1 0 OWN 16.30 debt_consolidation 36 months 1 1 42.2 0.000 -1
158 C C1 0 6 RENT 6.92 small_business 36 months 0 1 69.5 0.000 -1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
55027 B B5 0 7 MORTGAGE 7.28 credit_card 36 months 0 1 53.5 0.000 1
100438 C C1 0 6 MORTGAGE 15.37 debt_consolidation 36 months 1 1 63.8 0.000 1
122519 D D4 1 0 OWN 29.98 debt_consolidation 36 months 1 1 92.7 0.000 1
4161 B B1 0 5 MORTGAGE 17.33 credit_card 36 months 1 1 77.1 0.000 1
13520 C C5 0 4 RENT 15.77 debt_consolidation 60 months 1 1 27.5 0.000 1
51558 C C4 0 11 MORTGAGE 20.88 credit_card 36 months 1 1 78.8 0.000 1
110837 A A2 0 3 RENT 3.00 credit_card 36 months 1 1 1.2 0.000 1
83788 C C2 1 0 MORTGAGE 22.27 debt_consolidation 36 months 1 1 71.4 0.000 1
6556 D D1 0 5 MORTGAGE 5.84 debt_consolidation 36 months 1 1 71.2 0.000 1
106731 A A5 0 3 MORTGAGE 17.59 debt_consolidation 60 months 1 1 9.0 0.000 1
111533 B B2 0 8 RENT 16.09 debt_consolidation 36 months 0 0 60.0 0.000 1
115626 C C5 1 1 MORTGAGE 19.01 credit_card 60 months 1 1 70.7 0.000 1
121848 C C2 0 11 MORTGAGE 18.37 debt_consolidation 60 months 0 1 28.2 0.000 1
108846 C C3 0 10 MORTGAGE 13.12 home_improvement 60 months 1 1 40.0 0.000 1
111652 C C5 0 11 MORTGAGE 20.90 debt_consolidation 36 months 1 1 94.0 0.000 1
74640 A A1 0 11 MORTGAGE 10.15 debt_consolidation 36 months 1 1 67.2 0.000 1
90254 B B2 0 4 RENT 14.21 debt_consolidation 36 months 0 1 6.3 0.000 1
73781 C C3 0 7 MORTGAGE 22.36 credit_card 36 months 0 1 72.4 0.000 1
46499 B B4 0 11 MORTGAGE 12.02 debt_consolidation 36 months 0 0 36.0 0.000 1
6011 B B2 0 11 MORTGAGE 10.22 credit_card 36 months 0 1 49.1 0.000 1
92462 C C1 0 4 RENT 26.35 credit_card 36 months 1 1 76.1 0.000 1
122138 C C5 0 11 MORTGAGE 4.41 debt_consolidation 60 months 1 0 36.0 0.000 1
98381 B B5 1 1 RENT 8.37 debt_consolidation 36 months 1 1 71.3 0.000 1
104878 C C1 0 8 OWN 4.22 credit_card 36 months 1 1 81.6 0.000 1
38265 C C3 0 7 MORTGAGE 24.32 credit_card 60 months 1 1 50.1 0.000 1
10016 A A4 0 4 MORTGAGE 21.36 debt_consolidation 36 months 1 1 16.5 0.000 1
58367 B B1 0 8 MORTGAGE 16.29 debt_consolidation 36 months 1 1 41.0 0.000 1
90431 B B1 0 9 RENT 14.40 debt_consolidation 36 months 1 1 36.3 0.000 1
115727 F F5 0 4 RENT 29.45 debt_consolidation 60 months 0 0 36.2 0.000 1
105752 D D5 0 2 MORTGAGE 11.65 debt_consolidation 60 months 0 1 83.0 0.000 1

46300 rows × 13 columns

Now, let's verify that safe and risky loans each make up roughly 50% of the resulting dataset (the values below are printed as fractions, so 0.5 means 50%).

print ("Percentage of safe loans                 :", len(safe_loans) / float(len(loans_data)))
print ("Percentage of risky loans                :", len(risky_loans) / float(len(loans_data)))
print ("Total number of loans in our new dataset :", len(loans_data))
Percentage of safe loans                 : 0.5
Percentage of risky loans                : 0.5
Total number of loans in our new dataset : 46300

Note: There are many approaches for dealing with imbalanced data, including some where we modify the learning algorithm. These approaches are beyond the scope of this course, but some of them are reviewed in this paper. For this assignment, we use the simplest possible approach, where we subsample the overly represented class to get a more balanced dataset. In general, and especially when the data is highly imbalanced, we recommend using more advanced methods.
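As a rough illustration of the subsampling approach described in the note, the sketch below balances a toy dataset by randomly keeping only as many rows of each class as the smallest class has. The helper name `undersample`, the toy rows, and the seed are assumptions made for this example; they are not part of the assignment code.

```python
import random

def undersample(rows, label_of, seed=0):
    """Return a shuffled subset of rows with equally many rows per class."""
    # Group the rows by their class label.
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)

    # Every class is cut down to the size of the smallest class.
    smallest = min(len(group) for group in by_label.values())

    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced

# Toy data (hypothetical): 8 safe loans (+1) and 2 risky loans (-1).
toy_loans = [{"id": i, "safe_loans": +1} for i in range(8)]
toy_loans += [{"id": 100 + i, "safe_loans": -1} for i in range(2)]

balanced = undersample(toy_loans, label_of=lambda row: row["safe_loans"])
n_safe = sum(1 for row in balanced if row["safe_loans"] == +1)
n_risky = sum(1 for row in balanced if row["safe_loans"] == -1)
print(n_safe, n_risky)  # 2 2
```

The price of this simplicity is that most rows of the larger class are thrown away, which is why the note recommends more advanced methods for highly imbalanced data.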

Split data into training and validation sets

We split the data into training and validation sets using an 80/20 split and fix the random seed (random_state=0 in the call below) so that the split is reproducible.

Note: In previous assignments, we have called this a train-test split. However, the portion of data that we don't train on will be used to help select model parameters (this is known as model selection). Thus, this portion of data should be called a validation set. Recall that the performance of various candidate models (i.e. models with different parameters) should be compared on the validation set, while the final selected model should always be evaluated on test data. Typically, we would also save a portion of the data (a real test set) to test our final model on, or use cross-validation on the training set to select our final model. But for the learning purposes of this assignment, we won't do that.

(train_data,validation_data) = train_test_split(loans_data, train_size=0.8, random_state=0)
train_data.shape, validation_data.shape
((37040, 13), (9260, 13))
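The note above mentions that a real project would also hold out a test set. A minimal sketch of such a three-way split over row indices follows; the function name `three_way_split` and the 60/20/20 proportions are assumptions chosen for illustration, not something this assignment uses.

```python
import random

def three_way_split(n_rows, train_frac=0.6, validation_frac=0.2, seed=0):
    """Shuffle row indices and cut them into train / validation / test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    n_train = round(n_rows * train_frac)
    n_validation = round(n_rows * validation_frac)

    train = indices[:n_train]
    validation = indices[n_train:n_train + n_validation]
    test = indices[n_train + n_validation:]  # whatever remains
    return train, validation, test

# Same number of rows as the balanced dataset above.
train_idx, validation_idx, test_idx = three_way_split(46300)
print(len(train_idx), len(validation_idx), len(test_idx))
```

Model parameters would then be chosen on the validation rows only, and the test rows would be touched once, for the final model.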

Use decision tree to build a classifier

Now, let's use scikit-learn's DecisionTreeClassifier to create a loan prediction model on the training data; it stands in for the built-in GraphLab Create decision tree learner that the original assignment calls for. (In the next assignment, you will implement your own decision tree learning algorithm.) Our feature columns and target column have already been decided above. The GraphLab instruction to pass validation_set=None has no counterpart here: the model is fit on train_data only.

# from sklearn import preprocessing
# le = preprocessing.LabelEncoder()
# for i in range(13):
#     print(i)
#     print(train_data.values[:,i])
#     train_data.values[:,i] = le.fit_transform(train_data.values[:,i])
#     print(train_data.values[:,i])
train_data.columns
Index(['grade', 'sub_grade', 'short_emp', 'emp_length_num', 'home_ownership',
       'dti', 'purpose', 'term', 'last_delinq_none', 'last_major_derog_none',
       'revol_util', 'total_rec_late_fee', 'safe_loans'],
      dtype='object')
type(train_data)
pandas.core.frame.DataFrame
train_data['grade'].head(3), train_data['safe_loans'].head(3)
(9894     B
 32654    A
 65019    D
 Name: grade, dtype: object, 9894    -1
 32654    1
 65019   -1
 Name: safe_loans, dtype: int64)
# for column_name in train_data.columns:
#     print(train_data[column_name].dtype)
#     if train_data[column_name].dtype == object:
#         print(1)
#     else:
#         print(-1)

Use LabelEncoder to convert every object (string) column into integer codes

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
for column_name in train_data.columns:
    if train_data[column_name].dtype == object:
        train_data[column_name] = le.fit_transform(train_data[column_name])  # learn the codes on the training split
        validation_data[column_name] = le.transform(validation_data[column_name])  # reuse the same codes; raises on unseen categories
C:\anaconda3\lib\site-packages\ipykernel_launcher.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """
C:\anaconda3\lib\site-packages\ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
train_data.head(3)
grade sub_grade short_emp emp_length_num home_ownership dti purpose term last_delinq_none last_major_derog_none revol_util total_rec_late_fee safe_loans
9894 1 7 1 1 3 23.25 7 0 1 1 71.3 14.9917 -1
32654 0 1 0 4 3 15.00 11 0 1 1 8.2 0.0000 1
65019 3 19 0 9 3 28.22 1 1 1 1 83.5 0.0000 -1
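Integer codes are only meaningful if the training and validation splits share one mapping. The sketch below mimics, in plain Python, the sorted-order coding that LabelEncoder applies, and shows how fitting on each split separately can assign different codes to the same category. The helper `fit_codes` and the toy grade lists are assumptions made for this example.

```python
def fit_codes(values):
    """Map each distinct category to an integer, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}

# Toy splits (hypothetical): the validation split happens to lack "A" and "C".
train_grades = ["B", "A", "D", "C"]
validation_grades = ["D", "B", "D"]

# Fitting on the validation split alone renumbers the categories it contains.
separate_codes = fit_codes(validation_grades)
# Fitting once on the training split keeps the numbering consistent.
shared_codes = fit_codes(train_grades)

separate = [separate_codes[g] for g in validation_grades]
shared = [shared_codes[g] for g in validation_grades]
print("separate:", separate)  # [1, 0, 1]
print("shared:  ", shared)    # [3, 1, 3]
```

With the separate mapping, grade "D" becomes 1 in validation but 3 in training, so a tree trained on the training codes would read the validation rows incorrectly.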
# le.fit(train_data['purpose'])
# list(le.classes_)
# le.transform(train_data['purpose'])
# len(le.transform(train_data.values[:,6])), len(train_data.values[:,6]), type(train_data['grade'])
# train_data['purpose'] = le.transform(train_data['purpose'])
# train_data.loc[:,('purpose')] = le.transform(train_data.loc[:,('purpose')])
# train_data['purpose']
# train_data['term'].head(3)
# X = train_data.values[:,0:12]
# Y = train_data.values[:,12].reshape(-1, 1)
# # X = train_data.iloc[:,12]
# # Y = train_data.iloc[:,0:12]
# print(X,Y)
# .values returns the underlying NumPy array (as_matrix() was removed in pandas 1.0)
train_Y = train_data['safe_loans'].values
train_X = train_data.drop('safe_loans', axis=1).values
print(train_X)
print(train_Y)
[[  1.       7.       1.     ...,   1.      71.3     14.9917]
 [  0.       1.       0.     ...,   1.       8.2      0.    ]
 [  3.      19.       0.     ...,   1.      83.5      0.    ]
 ..., 
 [  0.       0.       0.     ...,   1.      10.6      0.    ]
 [  1.       6.       0.     ...,   1.      39.3      0.    ]
 [  6.      32.       0.     ...,   1.      70.8      0.    ]]
[-1  1 -1 ...,  1  1 -1]
decision_tree_model = DecisionTreeClassifier(max_depth=6)
# You can't pass str to your model fit() method.
decision_tree_model.fit(train_X, train_Y)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=6,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

Visualizing a learned model

The model above caps the depth of the tree at 6 (max_depth=6). However, such a tree can be hard to visualize graphically. Here, we instead learn a smaller model with a max depth of 2 to gain some intuition by visualizing the learned tree.

# small_model = graphlab.decision_tree_classifier.create(train_data, validation_set=None,
#                    target = target, features = features, max_depth = 2)
small_model = DecisionTreeClassifier(criterion='gini', max_depth=2)
small_model.fit(train_X, train_Y)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
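Both trees above are fitted with criterion='gini'. As a rough illustration of what that criterion measures, the sketch below computes the Gini impurity of a node from its per-class counts and the weighted impurity of a candidate split. The function names and the toy counts are assumptions for this example; scikit-learn's internal implementation is more involved.

```python
def gini(counts):
    """Gini impurity of a node, given the number of rows in each class."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in counts)

def weighted_gini(left_counts, right_counts):
    """Impurity of a split: child impurities weighted by child size."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    return (n_left / n) * gini(left_counts) + (n_right / n) * gini(right_counts)

# Toy node (hypothetical): 50 risky and 50 safe loans, as in a balanced dataset.
parent_impurity = gini([50, 50])
print(parent_impurity)  # 0.5

# A candidate split that separates the classes fairly well.
split_impurity = weighted_gini([40, 10], [10, 40])
print(round(split_impurity, 4))  # 0.32
```

A split is attractive when its weighted impurity falls well below the parent's impurity; a perfectly pure child has impurity 0.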

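In the Graphviz rendering produced by export_graphviz, you can see each node and the split at each node. This visualization is great for considering what happens when this model predicts the target of a new data point.

Note: To better understand this visual:

  • With filled=True, each node is shaded according to its majority class; a stronger shade means a larger majority.
  • Each internal node shows its split condition; rows that satisfy the condition follow the left branch.
  • Leaf nodes carry no split condition; the class shown there is the prediction.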
# Visualize the depth-2 model. class_names must follow the order of classes_, which is [-1, 1].
dot_data = tree.export_graphviz(small_model, out_file=None, 
                         feature_names=features,  
                         class_names=("-1","+1"),  
                         filled=True, rounded=True,  
                         special_characters=True)  
graph = Source(dot_data)  
graph

[Figure: Graphviz rendering of the decision tree]

# graph = Source(tree.export_graphviz(decision_tree_model, out_file=None, feature_names=features))
# graph
# graph.format = 'png'
# graph.render('dtree_render',view=True)

Making predictions

Let's consider two positive and two negative examples from the validation set and see what the model predicts. We will do the following:

  • Predict whether or not a loan is safe.
  • Predict the probability that a loan is safe.
validation_safe_loans = validation_data[validation_data[target] == 1]
validation_risky_loans = validation_data[validation_data[target] == -1]

sample_validation_data_risky = validation_risky_loans[0:2]
sample_validation_data_safe = validation_safe_loans[0:2]
# pandas.concat replaces DataFrame.append, which was removed in pandas 2.0
sample_validation_data = pandas.concat([sample_validation_data_safe, sample_validation_data_risky])
sample_validation_data
grade sub_grade short_emp emp_length_num home_ownership dti purpose term last_delinq_none last_major_derog_none revol_util total_rec_late_fee safe_loans
113161 1 9 1 1 3 29.64 1 0 1 1 39.8 0.0 1
106326 0 4 0 4 0 18.60 2 0 1 1 8.5 0.0 1
16626 4 20 0 4 2 6.89 2 1 1 1 39.1 0.0 -1
62925 3 17 0 11 0 21.14 3 0 0 0 68.8 0.0 -1
sample_exclude_target = sample_validation_data.iloc[:,0:12]

Explore label predictions

Now, we will use our model to predict whether or not a loan is likely to default. For each row in the sample_validation_data, use the decision_tree_model to predict whether or not the loan is classified as a safe loan.

Hint: Be sure to use the .predict() method.

decision_tree_model.predict(sample_exclude_target)
array([-1,  1, -1, -1], dtype=int64)

Quiz Question: What percentage of the predictions on sample_validation_data did decision_tree_model get correct?

Explore probability predictions

For each row in the sample_validation_data, what is the probability (according to decision_tree_model) of a loan being classified as safe?

Hint: Use the .predict_proba() method of decision_tree_model on sample_validation_data (the scikit-learn counterpart of GraphLab's output_type='probability'). Each output row holds the probabilities of class -1 and class +1, in that order.

decision_tree_model.predict_proba(sample_exclude_target)
array([[ 0.5       ,  0.5       ],
       [ 0.27347196,  0.72652804],
       [ 0.60796325,  0.39203675],
       [ 0.60477941,  0.39522059]])

Quiz Question: Which loan has the highest probability of being classified as a safe loan?

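Checkpoint: Can you verify that for all the predictions with a safe-loan probability above 0.5, the model predicted the label +1? Note that the first row is an exact 0.5/0.5 tie, which scikit-learn resolves in favor of the first class, -1.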
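The check can be done by hand on the four probability rows printed above. The sketch below assumes that the predicted label is the class with the highest probability and that an exact tie goes to the first class in sorted order (-1), which matches the labels shown earlier; the helper name is made up for illustration.

```python
# Column order of the probability rows: class -1 first, then class +1.
classes = [-1, 1]

# The four rows returned by predict_proba above.
probabilities = [
    [0.5, 0.5],
    [0.27347196, 0.72652804],
    [0.60796325, 0.39203675],
    [0.60477941, 0.39522059],
]

def label_from_probabilities(row):
    """Pick the class with the highest probability (first class wins ties)."""
    best_index = max(range(len(row)), key=lambda i: row[i])
    return classes[best_index]

labels = [label_from_probabilities(row) for row in probabilities]
print(labels)  # [-1, 1, -1, -1]
```

Only the second row has a safe-loan probability strictly above 0.5, and it is the only one labeled +1; the tied first row falls to -1.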

Tricky predictions!

Now, we will explore something pretty interesting. For each row in the sample_validation_data, what is the probability (according to small_model) of a loan being classified as safe?

Hint: Use the .predict_proba() method of small_model on sample_validation_data (the scikit-learn counterpart of GraphLab's output_type='probability'):

small_model.predict_proba(sample_exclude_target)
array([[ 0.42313756,  0.57686244],
       [ 0.24494122,  0.75505878],
       [ 0.66362791,  0.33637209],
       [ 0.66362791,  0.33637209]])

Quiz Question: Notice that the probability predictions are exactly the same for the 3rd and 4th loans. Why would this happen?

Visualize the prediction on a tree

Note that you should be able to look at the small tree, traverse it yourself, and visualize the prediction being made. Consider the following point in the sample_validation_data:

sample_validation_data[0:1]
grade sub_grade short_emp emp_length_num home_ownership dti purpose term last_delinq_none last_major_derog_none revol_util total_rec_late_fee safe_loans
113161 1 9 1 1 3 29.64 1 0 1 1 39.8 0.0 1

Let's visualize the small tree here to do the traversing for this data point.

graph = Source(tree.export_graphviz(small_model, out_file=None, feature_names=features))
# graph.format = 'png'
# graph.render('dtree_render',view=True)
graph

[Figure: Graphviz rendering of small_model]

Note: In the tree visualization above, each node lists its Gini impurity (gini), the number of training rows that reach it (samples), and the per-class breakdown of those rows (value), ordered as [-1, +1]. A leaf predicts the class with the larger share, so +1 is predicted when the second entry of value is larger and -1 otherwise.

Quiz Question: Based on the visualized tree, what prediction would you make for this data point?

Now, let's verify your prediction by examining the prediction made by scikit-learn. Use the .predict() method on small_model.

small_model.predict(sample_exclude_target)
array([ 1,  1, -1, -1], dtype=int64)

Evaluating accuracy of the decision tree model

Recall that the accuracy is defined as follows: $$ \mbox{accuracy} = \frac{\mbox{# correctly classified examples}}{\mbox{# total examples}} $$
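As a small worked instance of this formula, the sketch below applies it to the four sample rows used earlier: the labels predicted by decision_tree_model and the true safe_loans values of those rows. The function name is an assumption made for this example.

```python
def accuracy_by_hand(predicted, actual):
    """Fraction of examples whose predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

predicted = [-1, 1, -1, -1]  # decision_tree_model on the four sample rows
actual = [1, 1, -1, -1]      # safe_loans column of the same rows

sample_accuracy = accuracy_by_hand(predicted, actual)
print(sample_accuracy)  # 0.75
```

Three of the four rows are classified correctly, which gives 3/4 = 0.75.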

Let us start by evaluating the accuracy of the small_model and decision_tree_model on the training data

# scikit-learn equivalent of GraphLab's model.evaluate(train_data)['accuracy']
print(small_model.score(train_X, train_Y))
print(decision_tree_model.score(train_X, train_Y))

Checkpoint: You should see that the small_model performs worse than the decision_tree_model on the training data.

Now, let us evaluate the accuracy of the small_model and decision_tree_model on the entire validation_data, not just the subsample considered above.

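def accuracy(model, validation_data):
    # The first 12 columns are the features; the last column is the target.
    predict = model.predict(validation_data.iloc[:,0:12])
    actual = validation_data.iloc[:,12]
    # Fraction of rows where the prediction equals the true label.
    # (value_counts()[0] would return the majority outcome's count,
    # which is wrong whenever accuracy falls below 50%.)
    return (predict == actual).mean()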
accuracy(small_model, sample_validation_data)
1.0
accuracy(decision_tree_model, sample_validation_data)
0.75
accuracy(small_model, validation_data)
0.60421166306695462
accuracy(decision_tree_model, validation_data)
0.62926565874730023

Quiz Question: What is the accuracy of decision_tree_model on the validation set, rounded to the nearest .01?

Evaluating accuracy of a complex decision tree model

Here, we will train a large decision tree with max_depth=10. This will allow the learned tree to become very deep, and result in a very complex model. Recall that in lecture, we prefer simpler models with similar predictive power. This will be an example of a more complicated model which has similar predictive power, i.e. something we don't want.

big_model = DecisionTreeClassifier(max_depth=10)
big_model.fit(train_X, train_Y)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

Now, let us evaluate big_model on the training set and validation set.

# print big_model.evaluate(train_data)['accuracy']
# print big_model.evaluate(validation_data)['accuracy']
accuracy(big_model, sample_validation_data)
1.0
accuracy(big_model, validation_data)
0.61295896328293742

Checkpoint: We should see that big_model has even better performance on the training set than decision_tree_model did on the training set.

Quiz Question: How does the performance of big_model on the validation set compare to decision_tree_model on the validation set? Is this a sign of overfitting?
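One way to look at this question is to put training and validation accuracy side by side for several depths. The sketch below does that on synthetic data from make_classification, so its numbers say nothing about the LendingClub results; the dataset parameters and seeds are assumptions. It imports train_test_split from sklearn.model_selection, the current location of the function that this notebook takes from sklearn.cross_validation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy two-class data (an assumption standing in for the loan data).
X, y = make_classification(n_samples=2000, n_features=12, n_informative=4,
                           flip_y=0.3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.8, random_state=0)

results = {}
for depth in (2, 6, 10):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    # score() returns the mean accuracy on the given data.
    results[depth] = (model.score(X_train, y_train), model.score(X_val, y_val))
    print("max_depth=%2d  train=%.3f  validation=%.3f"
          % (depth, results[depth][0], results[depth][1]))
```

A training accuracy that keeps climbing with depth while validation accuracy stalls or drops is the usual signature of overfitting.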