Kiva Life of a Loan:
The life of a loan generally follows this lifecycle:
Borrowers apply to a local Field Partner, which manages the loan on the ground.
Partner loans are facilitated by local nonprofits or lending institutions, which approve the borrower’s loan request. Kiva does due diligence and ongoing monitoring for each of these Field Partners.
Disbursal refers to when the borrower can access the money; the timing of this can vary. For most Field Partner loans, the money is pre-disbursed, so the borrower can access the funds right away.
Depending on the type of loan, a Field Partner or borrower uploads the loan details into the system. Our worldwide network of volunteers then helps to edit and translate loans before they go live on the website for lenders to crowdfund.
Lenders receive repayments over time, based on the given repayment schedule and the borrower’s ability to repay. The repayments go into the lenders’ Kiva accounts.
Lenders use repayments to fund new loans, donate or withdraw the money.
Business Problem:
In 2012 Kiva instituted a policy under which a loan has 30 days to be crowdfunded on the site before it expires. If after 30 days the loan is not fully funded, it is flagged as expired and any money raised is returned to the lenders. When this happens, the loan and its risk are borne by the Field Partner. Per Kiva's blog post, this was done for two primary reasons:
- So that posted loans reflect the reality on the ground. If a loan is on the site for too long a borrower could already be paying back the loan.
- To help provide a feedback loop for field partners. If a field partner's loans are expiring that is a signal to them that they are not posting loans that are in demand by lenders on the site.
Kiva's mission is to connect people through lending to alleviate poverty. To maximize their impact they want to ensure that they are efficiently deploying users' capital to as many aspiring entrepreneurs across the globe as possible. If Kiva can decrease the number of loans that go unfunded, it will help them reach more entrepreneurs and better execute on their mission to alleviate poverty.
The goal of this project is to build a model that will help predict if a loan is at risk of not being funded. Armed with this knowledge, Kiva can either feature these loans more prominently on the site or work with Field Partners and borrowers to better present their loans to maximize demand and increase funding.
Exploratory Data Analysis (EDA)
Exploring Funded Kiva Loans
Click Here For Live Interactive Country Map
Before jumping into who has trouble getting loans funded on Kiva, I want to take a moment to explore the breakdown of the loans that do get made. All of the figures below are for loans made from November 29th, 2011 through August 22nd, 2016. During this period Kiva made a total of 694,202 loans for $588,538,975. A few things stand out from the plots below. (Note that only the top 25 countries by dollars loaned are displayed.)
- Considerably more loans are made to women than to men, almost 3:1.
- Agriculture, Retail, and Food are the most popular sectors, making up more than half of all dollars loaned.
- While Latin America and Sub-Saharan Africa are the two regions receiving the most loans, the top country is the Philippines, which is in Southeast Asia.
- Countries identified by the World Bank as lower middle income receive the most loans, with upper middle income countries coming in second. Both are ahead of countries flagged as low income. My initial guess is that the rule of law and the strength of institutions make it easier to provide loans and start a business in these countries; this would definitely warrant further research.
I want to further dig into the breakdown of the average loan size by the same demographics. A few things that stand out here:
- In most cases the disparity in average loan size by gender is relatively small. In countries where there is a larger discrepancy, with a few exceptions, it is women who receive the larger loans.
- Congo stands out with the 2nd highest average loan size despite being one of the countries identified as low income by the World Bank. Digging in further, retail, food, and clothing were the top sectors for the Congo, though these sectors are toward the middle of the pack for Kiva overall.
- Not surprisingly average loan size is positively correlated with income level.
Exploring Kiva Expired Loans
Now that we have a feel for the profile of the loans made, we want to start exploring which types of loans are the most likely to expire before being funded. In total there were 39,975 loans that expired, for a total of $57,930,375. These make up ~5.4% of all loans. I want to start subsetting the data to see if we can tease out anything across demographics. (For countries, only the top 25 by total dollars loaned are shown.)
If you look at the below histograms you can see that the expired loans distribution (red) is shifted to the right with a long right tail. This tells me that it is more difficult to get a larger $ loan funded. I will explore this further across demographics.
When I look at the overall expiration rate across demographics a few things stand out.
- Men have a much more difficult time getting loans funded. Overall ~12% of men's loans go unfunded, while only ~3.2% of women's loans expire.
- There is a clear trend that higher income regions have a harder time getting funded. I suspect this is because the loans are altruistic in nature; people are generally biased towards helping out those they feel need it most.
- There are very large disparities across both sectors and countries, with a greater than 10% gap between the low end and the high end for both of these demographics.
Given that gender has such a large impact I want to subset each of the demographics by gender to see if there is anything that stands out.
The expiration rate by gender varies quite a bit across each of the demographics. The fact that the rates are not more consistent tells me that there is an interaction effect here that I should look to include in the model.
Given that the loan size appears to make a difference in the percentage of loans that go unfunded, I next want to look at our cross sections of gender, sector, region, country, and income level to see if the average loan size for funded versus unfunded loans is similar across these demographics.
I also want to take a look at how the supply of loans impacts the likelihood of getting a loan funded. From the below histogram you can see that as the number of loans posted increases, the likelihood of a loan going unfunded (red) goes up.
Lastly, before moving on to building the model, I want to take a look at the correlation matrix for the numerical data. Any features strongly correlated with the funding status will likely be good candidates for the model. I also want to look at features that have a strong correlation with each other; I will need to eliminate one of each such pair to make the model more stable. A couple of observations (a sketch of computing this matrix follows the list):
- Repayment term, loan amount, and number of competing loans have the strongest positive correlation with a loan going unfunded.
- Gender and currency risk are the most strongly negatively correlated with a loan going unfunded. Given how these variables were encoded, this implies that being male (consistent with the analysis above) and not having currency exposure (I would have guessed the opposite) make it harder to get a loan funded.
- GDP and GDP per capita growth are very strongly negatively correlated. I'm going to need to remove one of these before I run my model.
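As a rough sketch, this correlation matrix can be computed directly from a pandas DataFrame. The DataFrame name and column names below are assumptions for illustration, not the exact names in the Kiva dataset:

```python
import pandas as pd

# `loans` is assumed to be a DataFrame holding the cleaned loan data;
# the column names below are illustrative, not the exact dataset names.
numeric_cols = [
    "expired",                # 1 if the loan expired, 0 if it was funded
    "loan_amount",
    "repayment_term",
    "num_competing_loans",
    "gender",                 # numerically encoded
    "currency_risk",          # numerically encoded
    "gdp_growth",
    "gdp_per_capita_growth",
]

corr = loans[numeric_cols].corr()

# Correlation of each feature with the funding status, sorted to surface
# the strongest candidate predictors for the model.
print(corr["expired"].drop("expired").sort_values())
```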
Modeling
Feature Engineering
Now that I've explored the data visually and have a good sense of the breakdown across all of the different subsets/demographics, I want to move on to building a model to help predict whether a loan will be funded or not. There are four main groups of features that go into the model:
- Continuous data (Loan Amount, Number of competing loans, country tourism dollars, GNI per capita, GDP growth per capita, and number of lenders)
- Categorical data (region, country, gender, sector, income level, and activity), which will be encoded into dummy variables of 1s and 0s so that they can be included in the model.
- The one-sentence headline description, which will be represented using a statistic called TF-IDF. TF-IDF, or term frequency-inverse document frequency, counts the number of times a word appears in a document and then adjusts that count down based on how many documents in the corpus the word appears in. This gives more weight to words that are common in a document but uncommon across documents, which helps to highlight what is both important and unique about a document.
- Latent Dirichlet allocation (LDA), a statistical model for representing a document by a set of topic weights. Instead of representing a large corpus by 50k+ columns, one for each word, each document is represented by a set of topic weights (I am using 20 topics). This is an unsupervised dimensionality reduction technique for representing large bodies of text. (A sketch of how these text features could be built follows this list.)
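As a minimal sketch, the text features described above could be built with scikit-learn roughly as follows. The column names and vectorizer settings are illustrative assumptions, not the exact ones used in the project:

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# TF-IDF representation of the one-sentence headline
# (the "headline" column name is an assumption).
headline_tfidf = TfidfVectorizer(stop_words="english")
headline_features = headline_tfidf.fit_transform(loans["headline"])

# LDA topic weights for the longer description. LDA is typically fit on raw
# term counts, so a CountVectorizer is used here; each document is then
# reduced from tens of thousands of word columns to 20 topic weights.
desc_counts = CountVectorizer(stop_words="english").fit_transform(loans["description"])
lda = LatentDirichletAllocation(n_components=20, random_state=42)
desc_topics = lda.fit_transform(desc_counts)   # shape: (n_loans, 20)
```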
From the EDA above, one thing that stood out is that the expiration rates across the demographics (sector, country, and income level) varied considerably by gender. To capture this, I will include interaction features such as gender * sector to pick up any variance that is not explained by gender and sector in isolation.
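A quick sketch of how such an interaction term could be built with pandas (column names are again assumptions):

```python
import pandas as pd

# Concatenating the two categoricals yields one dummy column per
# gender/sector combination, capturing variance not explained by
# gender or sector alone.
loans["gender_x_sector"] = loans["gender"].astype(str) + "_" + loans["sector"].astype(str)

interaction_dummies = pd.get_dummies(loans["gender_x_sector"], prefix="gender_x_sector")
```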
Due to the extremely high correlation between GDP Growth and GDP per capita growth I will strip out the GDP growth.
Model Selection
My primary goal is to use this model not only to predict if a loan is at risk of expiring but also to understand what puts it at risk, so that Kiva can take steps to improve the likelihood of the loan being funded, either by helping the borrower restructure their application (online posting) so that it drives more demand or by featuring the loan more prominently on the website.
Given that the why is equally if not more important than the prediction itself, I am going to explore using a logistic regression classification model. This will tell me which features have the greatest impact as well as whether they are positively or negatively related to the outcome.
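A minimal sketch of fitting that model and reading off the signed coefficients, assuming the feature matrix `X`, labels `y`, and list of `feature_names` have been assembled as described above:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# y is assumed to be 1 if the loan expired and 0 if it was funded.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = LogisticRegression(solver="liblinear", max_iter=1000)
clf.fit(X_train, y_train)

# Positive coefficients push a loan toward expiring, negative toward funding.
coefs = pd.Series(clf.coef_[0], index=feature_names).sort_values()
print(coefs.head(10))   # strongest predictors of getting funded
print(coefs.tail(10))   # strongest predictors of expiring
```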
Model Tuning
For the TF-IDF and LDA components of the model I want to drop the most common and the most rarely used words to help the model perform better.
I want to get rid of the common words because if every single document contains a given word multiple times, it will not help identify what is unique about a given document and will provide very little variance or signal for the model to pick up. Granted, TF-IDF already penalizes the most commonly used words in a corpus so they carry very small weights, but to keep the feature set smaller and reduce the overfitting that can result from it, I will drop these words out of the model altogether.
On the other extreme, I want to remove words that only appear in a few documents. This is to avoid overfitting, where the model could predict specific loans by effectively memorizing unique ones and would not generalize to new loans it has not seen.
After doing some histogram plotting for both the headline topic sentence and the description, I settled on minimum and maximum word-frequency cutoffs of 150/1,000 and 100/1,600 respectively.
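A sketch of the kind of histogram used to pick those cutoffs, here for the headline text (the column name and plotting choices are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer

# Count how many documents each word appears in.
doc_term = CountVectorizer(stop_words="english").fit_transform(loans["headline"])
doc_freq = (doc_term > 0).sum(axis=0).A1   # documents containing each word

plt.hist(doc_freq, bins=100, log=True)
plt.xlabel("Number of documents containing the word")
plt.ylabel("Number of words (log scale)")
plt.title("Headline document frequencies")
plt.show()

# The chosen cutoffs then map onto the vectorizer's min_df/max_df arguments,
# e.g. TfidfVectorizer(min_df=150, max_df=1000) for the headline.
```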
The next step was hyperparameter optimization. I used scikit-learn's grid search to identify whether L1 (lasso) or L2 (ridge) regularization yielded the best model, as well as the number of topics to use for the LDA (5, 10, or 20). L1 regularization with 20 topics yielded the most accurate model.
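A simplified sketch of that grid search, covering only the text-topic part of the feature set; the pipeline step names and scoring choice are assumptions:

```python
from sklearn.pipeline import Pipeline
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("lda", LatentDirichletAllocation(random_state=42)),
    ("clf", LogisticRegression(solver="liblinear", max_iter=1000)),
])

param_grid = {
    "lda__n_components": [5, 10, 20],   # number of LDA topics
    "clf__penalty": ["l1", "l2"],       # lasso vs. ridge regularization
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="roc_auc", n_jobs=-1)
search.fit(desc_counts, y)   # document-term counts from the sketch above
print(search.best_params_)
```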
Model Evaluation
Now that I have the model fitted and have optimized the hyperparameters used I want to start to look at evaluating how well the model performed.
The first classification metric I want to look at is the area under the ROC curve (AUC). An AUC of 1 means the model is perfect, with no false positives; an AUC of 0.5 is what a naive model would achieve, effectively a coin flip. With a score of 0.93 I'm off to a good start.
Given that this data set is heavily imbalanced, with ~95% of loans being funded, a better metric to focus on is the precision/recall curve, which focuses on the expired loan class. This measures the tradeoff between being correct when you predict a loan will expire (precision) and correctly identifying every loan that expires (recall). As with the ROC curve, the closer the area under the curve is to 1 the better.
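A short sketch of computing both areas with scikit-learn, reusing the fitted classifier and test split from the sketch above:

```python
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc

# Predicted probability that a loan expires (class 1).
y_scores = clf.predict_proba(X_test)[:, 1]

roc_auc = roc_auc_score(y_test, y_scores)

precision, recall, thresholds = precision_recall_curve(y_test, y_scores)
pr_auc = auc(recall, precision)

print(f"ROC AUC: {roc_auc:.2f}")
print(f"Precision/recall AUC: {pr_auc:.2f}")
```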
Overall the model is performing relatively well.
The accuracy on the test data is 96.2%, slightly better than the baseline accuracy of 95%. The AUC of the ROC curve is 93% and the AUC of the precision/recall curve is 54%. The point precision and recall for the expired class are 72% and 32% respectively: when the model predicts that a loan will expire it is correct 72% of the time, and of all of the loans that actually expired the model identified 32%.
I may decide that it is more important to identify a greater percentage of the expired loans, with the tradeoff that more loans that will actually be funded get incorrectly flagged as expiring. To do this I can decrease the probability threshold for flagging a loan as expiring, moving along the precision/recall curve to a precision of ~63% and a recall of ~50%. This may be preferred if the cost to the business of missing an expired loan is far higher than the cost of incorrectly flagging a good loan as expired.
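A sketch of that tradeoff: lowering the classification threshold from the default 0.5 to a hypothetical lower cutoff flags more loans as at risk.

```python
from sklearn.metrics import precision_score, recall_score

# The 0.3 cutoff is illustrative; in practice it would be chosen by scanning
# the thresholds returned by precision_recall_curve above.
threshold = 0.3
y_pred_low = (y_scores >= threshold).astype(int)

print(f"Precision: {precision_score(y_test, y_pred_low):.2f}")
print(f"Recall:    {recall_score(y_test, y_pred_low):.2f}")
```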
Conclusions and Follow Ups
- It is clear that having a high-dollar loan, posting a loan during times of high supply, and having a longer repayment term all make it more difficult to get a loan funded on Kiva.
- On the other side, lenders have a preference for loans that are from women, have multiple borrowers, come from countries with higher gross national income per capita, and already have more lenders assigned.
- The model picked up the fact that there are large disparities between genders within countries as the interaction between country/gender for both Paraguay and Peru were some of the stronger coefficients.
- The headline text analysis picked up many of the words associated with agriculture. In particular loans that are associated with hybrid seed fertilizer are especially prone to going unfunded.
- The topic modeling did not add much to the model. Only 2 of the topics survived the regularization without being zeroed out, and these two appear to be associated with service-related positions.
In order to help improve and tune the model a few things that I would consider exploring are:
- Explore using more black-box models. Ensemble methods like random forests or gradient boosting may yield a more accurate model.
- Including field partner data. It's possible that certain field partners have a better sense of what loans are in high demand and this could help to improve the model.
- Fine-tune the LDA model further. As I was sampling and testing data I noticed the topic importances change quite a bit from iteration to iteration. I would look to train the LDA model over a larger sample size and include more topics, which could help differentiate between loans better.
- As the old saying goes, a picture is worth a thousand words. It would be interesting to explore image sentiment analysis to see if that could improve the model.
- Kiva also has groups that lenders can belong to. It would be interesting to see some social network analysis to see if certain groups are more successful at driving up lending rates and if they tend to lend to the same loans.
- The data set was large enough that it got pretty unwieldy to run on my laptop. It could have been worthwhile to spin up an AWS instance and explore using Spark to see if I could get better performance, or at the very least let the job run on its own without tying up my laptop.