
[Figure: About-Kiva]

Kiva Loan Lifecycle

The life of a loan generally follows this lifecycle:

  1. Borrowers apply to a local Field Partner, which manages the loan on the ground.

  2. Partner loans are facilitated by local nonprofits or lending institutions, which approve the borrower’s loan request. Kiva does due diligence and ongoing monitoring for each of these Field Partners.

  3. Disbursal refers to when the borrower can access the money; the timing of this can vary. For most Field Partner loans, the money is pre-disbursed, so the borrower can access the funds right away.

  4. Depending on the type of loan, a Field Partner or borrower uploads the loan details into the system. Our worldwide network of volunteers then helps to edit and translate loans before they go live on the website for lenders to crowdfund.

  5. Lenders receive repayments over time, based on the given repayment schedule and the borrower’s ability to repay. The repayments go into the lenders’ Kiva accounts.

  6. Lenders use repayments to fund new loans, donate or withdraw the money.

Business Problem

In 2012, Kiva instituted a policy giving each loan 30 days to be crowdfunded on the site before it expires. If after 30 days the loan is not fully funded, it is flagged as expired and any money raised is returned to the lenders. When this happens, the loan and its risk are borne by the Field Partner. Per Kiva's blog post, this was done primarily for two reasons:

Kiva's mission is to connect people through lending to alleviate poverty. To maximize their impact, they want to ensure that they are efficiently deploying users' capital to as many aspiring entrepreneurs across the globe as possible. If Kiva can decrease the number of loans that go unfunded, it will help them reach more entrepreneurs and better execute on their mission to alleviate poverty.

The goal of this project is to build a model that predicts whether a loan is at risk of expiring unfunded. Armed with this knowledge, Kiva can either feature these loans more prominently on the site or work with Field Partners and borrowers to better present their loans to maximize demand and increase funding.

Exploratory Data Analysis (EDA)

Exploring Funded Kiva Loans

[Figure: PointMap. Click here for the live interactive country map.]

Before jumping into exploring who has trouble getting loans funded on Kiva, I want to take a moment to explore the breakdown of the loans that do get made. All of the figures below are for loans made from November 29th, 2011 through August 22nd, 2016. During this period Kiva made a total of 694,202 loans for $588,538,975. A few things stand out in the plots below. (Note: only the top 25 countries by dollars loaned are displayed.)
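The summary figures above come down to simple count, sum, and group-by aggregations. Here is a minimal sketch using a tiny synthetic frame in place of the real Kiva snapshot; the column names (`country`, `loan_amount`) are assumptions for illustration, not Kiva's actual schema.

```python
import pandas as pd

# Synthetic stand-in for the Kiva loan snapshot
loans = pd.DataFrame({
    "country": ["Kenya", "Kenya", "Peru", "Philippines"],
    "loan_amount": [500, 250, 1000, 300],
})

total_loans = len(loans)                    # total number of loans made
total_dollars = loans["loan_amount"].sum()  # total dollars loaned

# Top countries by total dollars loaned (top 25 on the full data set)
top_countries = (loans.groupby("country")["loan_amount"]
                      .sum()
                      .sort_values(ascending=False)
                      .head(25))
```

On the full data set, `total_loans` and `total_dollars` give the 694,202 and $588,538,975 figures quoted above.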

[Figures: gender_amount, gender_count]

[Figures: sector_amount, region_loaned, country_loaned, income_loaned]

I want to dig further into the breakdown of average loan size by the same demographics. A few things stand out here:

[Figures: sector_gender_avg, region_gender_avg, country_gender_avg, income_gender_avg]

Exploring Kiva Expired Loans

Now that we have a feel for the profile of the loans made, we can start exploring which types of loans are most likely to expire before being funded. In total, 39,975 loans expired, for a total of $57,930,375. These make up ~5.4% of all loans. I will start subsetting the data to see if we can tease out anything across demographics. (For countries, only the top 25 by total dollars loaned are shown.)

In the histograms below, the distribution of expired loans (red) is shifted to the right, with a long right tail. This suggests that larger loans are more difficult to get funded. I will explore this further across demographics.
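The rightward shift can be checked numerically by comparing the two distributions' medians, which are robust to the long tail. A minimal sketch with synthetic data (the `loan_amount` and `status` columns and their values are assumptions for illustration):

```python
import pandas as pd

loans = pd.DataFrame({
    "loan_amount": [200, 300, 400, 500, 350, 2000, 3000, 2500],
    "status":      ["funded", "funded", "funded", "funded", "funded",
                    "expired", "expired", "expired"],
})

# Median loan size per funding status; on the real data the expired
# median sits well to the right of the funded median
by_status = loans.groupby("status")["loan_amount"].median()
```

An overlaid histogram (e.g. two `plt.hist` calls with `alpha=0.5`) would produce the red/blue figure referenced above.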

[Figure: income-dist]

When I look at the overall expiration rate across demographics, a few things stand out.

[Figures: gender_expired, sector_expired, region_expired, country_expired, income_expired]
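The expiration rates in these plots reduce to the mean of a 0/1 expired flag within each group. A hedged sketch with synthetic data (column names are assumptions):

```python
import pandas as pd

loans = pd.DataFrame({
    "gender":  ["female", "female", "female", "male", "male"],
    "expired": [0, 0, 1, 1, 1],  # 1 = loan expired unfunded
})

# Expiration rate per group = mean of the binary expired flag
rate_by_gender = loans.groupby("gender")["expired"].mean()
```

The same pattern (`groupby(col)["expired"].mean()`) produces the sector, region, country, and income-level breakdowns.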

Given that gender has such a large impact, I want to subset each of the demographics by gender to see if anything stands out.

[Figures: sectors, region, country, income-level]

The expiration rate by gender varies quite a bit across each of the demographics. The fact that the rates are not more consistent suggests an interaction effect that I should look to include in the model.

Given that loan size appears to affect the percentage of loans that go unfunded, I next want to look at our cross sections of gender, sector, region, country, and income level to see whether the average loan size for funded vs. unfunded loans is similar across these demographics.

[Figures: gender_avg, sector_avg, region_avg, country_avg, income_avg]

I also want to look at how the supply of loans affects the likelihood of getting a loan funded. The histogram below shows that as the number of loans posted increases, the likelihood of a loan going unfunded (red) goes up.

[Figure: loan-supply]

Lastly, before moving on to building the model, I want to look at the correlation matrix for the numerical data. Any features strongly correlated with funding status will likely be good candidates for the model. I also want to look for features that are strongly correlated with each other; I will need to eliminate some of these to make the model more stable. A couple of observations here:
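A correlation matrix like the one below is a one-liner in pandas. This sketch uses made-up columns (the names and values are assumptions, not the project's actual features):

```python
import pandas as pd

df = pd.DataFrame({
    "loan_amount": [100, 200, 300, 400],
    "funded":      [1, 1, 0, 0],       # 1 = funded, 0 = expired
    "gdp_growth":  [2.0, 4.0, 6.0, 8.0],
})

# Pairwise Pearson correlations between all numeric columns
corr = df.corr()
```

Columns whose pairwise correlation is near ±1 (here `loan_amount` and `gdp_growth`) are candidates for removal, while columns correlated with `funded` are candidate predictors.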

[Figure: corr_matrix]

Modeling

Feature Engineering

Now that I've explored the data visually and have a good sense of the breakdown across the different subsets/demographics, I want to move on to building a model to predict whether a loan will be funded. Four main features go into the model. They include:

From the EDA above, one thing that stood out is that the expiration rates across the demographics (sector, country, and income level) varied considerably by gender. To capture this interaction, I will include features such as gender * sector to capture any variance that is not explained by gender and sector in isolation.
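One simple way to build such an interaction feature is to concatenate the two categorical columns and one-hot encode the result. This is a hedged sketch of the idea, not necessarily the project's exact encoding; the column names are assumptions:

```python
import pandas as pd

loans = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "sector": ["Agriculture", "Retail", "Retail"],
})

# gender x sector interaction: combine the categories, then one-hot
# encode so the model gets a coefficient per (gender, sector) pair
loans["gender_sector"] = loans["gender"] + ":" + loans["sector"]
interaction = pd.get_dummies(loans["gender_sector"])
```

With the main-effect dummies for gender and sector also in the model, these columns absorb only the variance the main effects cannot explain.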

Due to the extremely high correlation between GDP growth and GDP per capita growth, I will drop GDP growth.

Model Selection

My primary goal is to use this model not only to predict whether a loan is at risk of expiring but also to understand what puts it at risk, so that Kiva can take steps to improve the likelihood of the loan being funded, either by helping the borrower restructure their application (online posting) so that it drives more demand or by featuring the loan more prominently on the website.

Given that the why is equally if not more important than the prediction itself, I am going to explore using a logistic regression classification model. This will tell me which features have the greatest impact, as well as whether they are positively or negatively related to the outcome.

Model Tuning

For the tf-idf and LDA components of the model, I want to drop the most common and the most rarely used words to help the model perform better.

I want to get rid of the common words because if every document contains a given word multiple times, it will not help identify what is unique about a given document and will provide very little variance or signal for the model to pick up. Granted, tf-idf already penalizes the most commonly used words in a corpus so they carry very small weight, but to keep the feature set smaller and reduce the potential overfitting that comes with it, I will drop these words from the model altogether.

On the other extreme, I want to remove words that appear in only a few documents. This helps avoid overfitting, where the model could predict specific loans by effectively memorizing unique ones and fail to generalize to loans it has not seen.

After doing some histogram plotting for both the headline topic sentence and the description, I settled on minimum and maximum word counts of 150/1,000 for the headline and 100/1,600 for the description.
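Both cutoffs map directly onto `TfidfVectorizer`'s `min_df` and `max_df` parameters, which accept absolute document counts. A small sketch with toy documents and toy thresholds (the real model used 150/1,000 and 100/1,600 on the headline and description respectively):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "goat farming loan for seed and feed",
    "retail shop loan for stock",
    "loan to buy a sewing machine",
    "loan for farming supplies and seed",
]

# Integer min_df/max_df are absolute document counts: drop words that
# appear in fewer than 2 documents or in more than 3 of them.
vec = TfidfVectorizer(min_df=2, max_df=3)
X = vec.fit_transform(docs)
```

Here "loan" appears in all 4 documents and is dropped by `max_df`, while one-off words like "goat" are dropped by `min_df`.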

The next step was hyperparameter optimization. I used scikit-learn's GridSearchCV to determine whether L1 (lasso) or L2 (ridge) regularization yielded the best model, as well as the number of topics to use for the LDA (5, 10, or 20). L1 regularization with 20 topics yielded the most accurate model.
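A search like this can be run over a text-to-LDA-to-classifier pipeline in one call. This is a minimal sketch under assumptions (tiny toy corpus, a 2-vs-3 topic grid instead of 5/10/20), not the project's actual search:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

docs = ["farm loan seed", "shop loan stock", "farm seed feed",
        "retail stock shop", "farm loan feed", "shop retail loan"]
y = [1, 0, 1, 0, 1, 0]  # toy expired/funded labels

pipe = Pipeline([
    ("counts", CountVectorizer()),                     # LDA needs raw counts
    ("lda", LatentDirichletAllocation(random_state=0)),
    ("clf", LogisticRegression(solver="liblinear")),   # supports l1 and l2
])

# Jointly search the regularization type and the LDA topic count
grid = GridSearchCV(pipe, {
    "lda__n_components": [2, 3],
    "clf__penalty": ["l1", "l2"],
}, cv=2)
grid.fit(docs, y)
best = grid.best_params_
```

Note the `liblinear` solver, since scikit-learn's default solver does not support L1 regularization.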

Model Evaluation

Now that I have the model fitted and have optimized the hyperparameters used I want to start to look at evaluating how well the model performed.

The first classification metric I want to look at is the ROC curve. An area under the curve (AUC) of 1 means the model separates the classes perfectly; 0.5 is what a naive model would achieve, effectively a coin flip. With a score of 0.93, I'm off to a good start.

[Figure: ROC_curve]

Given that this data set is heavily imbalanced, with ~95% of loans being funded, a better metric to focus on is the precision/recall curve, which focuses on the expired loan class. It measures the tradeoff between being correct when you predict a loan will expire (precision) and correctly identifying every loan that expires (recall). As with the ROC curve, the closer the area under the curve is to 1, the better.
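Both curves come straight from scikit-learn given the true labels and the model's predicted probabilities. A sketch with hand-picked toy scores (the values are assumptions, not the model's output):

```python
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc

# 1 = expired (the rare positive class); scores would come from
# model.predict_proba(X_test)[:, 1] on the real data
y_true  = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 0.7, 0.8, 0.6, 0.9]

roc_auc = roc_auc_score(y_true, y_score)
precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)  # area under the precision/recall curve
```

On an imbalanced problem like this one, `pr_auc` is usually much lower than `roc_auc` (54% vs. 93% here), because precision is sensitive to the many negatives.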

[Figure: precision_recall]

Overall the model is performing relatively well.

The accuracy on the test data is 96.2%, slightly better than the baseline accuracy of 95%. The AUC of the ROC curve is 93%, and the AUC of the precision/recall curve is 54%. The point precision and recall for the expired class are 72% and 32% respectively: when the model predicts that a loan will expire, it is correct 72% of the time, and of all the expired loans, the model identifies 32%.

I may decide it is more important to identify a greater percentage of the expired loans, with the tradeoff that I may incorrectly flag more loans as expiring that will actually be funded. To do this, I can decrease the probability threshold for flagging a loan as expiring. By doing so, I could move along the precision/recall curve to a precision of ~63% and a recall of ~50%. This may be preferred if the business cost of missing an expired loan is determined to be far higher than that of incorrectly flagging a good loan as expired.
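Moving along the curve just means thresholding the predicted probabilities at a value other than the default 0.5. A minimal sketch (the probability values are made up for illustration):

```python
import numpy as np

# Predicted probabilities of the 'expired' class, as returned by
# model.predict_proba(X)[:, 1] on the real model
proba = np.array([0.10, 0.35, 0.55, 0.80, 0.45])

default_flags = proba >= 0.5   # scikit-learn's default decision rule
lowered_flags = proba >= 0.35  # lower threshold: more recall, less precision

# Lowering the threshold can only add positive predictions, never remove them
assert lowered_flags.sum() >= default_flags.sum()
```

Sweeping the threshold and recomputing precision and recall at each value traces out the full precision/recall curve shown above.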

[Figures: confusion_matrix, confusion_normalized, feature_importance]

Conclusions and Follow Ups

To help improve and tune the model, a few things I would consider exploring are: