Probability – Sampling with Replacement vs Without Replacement

probability, statistical-inference, statistics

I'm writing a program in R that simulates bank losses on car loans. Here is the question I'm trying to solve:

You run a bank that has a history of identifying potential homeowners that can be trusted to make payments. In fact, historically, in a given year, only 2% of your customers default. You want to use stochastic models to get an idea of what interest rates you should charge to guarantee a profit this upcoming year.

A. Your bank gives out 1,000 loans this year. Create a sampling model and use the function sample() to simulate the number of foreclosures in a year with the information that 2% of customers default. Also suppose your bank loses $120,000 on each foreclosure. Run the simulation for one year and report your loss.

B. Note that the loss you will incur is a random variable. Use Monte Carlo simulation to estimate the distribution of this random variable. Use summaries and visualization to describe your potential losses to your board of trustees.

C. The 1,000 loans you gave out were for 180,000. The way your bank can give out loans and not lose money is by charging an interest rate. If you charge an interest rate of, say, 2% you would earn 3,600 for each loan that doesn't foreclose. At what percentage should you set the interest rate so that your expected profit totals 100,000. Hint: Create a sampling model with expected value 100 so that when multiplied by the 1,000 loans you get an expectation of 100,000. Corroborate your answer with a Monte Carlo simulation.

I'm confused about how to set this simulation up from a high-level point of view and have the following questions:
1. For part A, should I create a pool of 1000 customers, or should I create a larger pool of customers?
2. For part A, when sampling, do I sample with or without replacement?
3. For part B, I'm confused about how to set up the Monte Carlo simulation. Am I varying the size of the customer pool?
4. For part C, I'm not sure how to set up a sampling model that involves the interest rate. Any advice or guidance would be appreciated. I'm also thinking that if I fully understood the high level concepts for parts A and B, part C might not be such a mystery.

Best Answer

(a) Let $n = 1000$ be the number of loans in a year, and $X$ be the number of foreclosures in a year, assuming a foreclosure rate of $r = 2\% = .02$. The total loss is $L = \$120,000\,X.$

In terms of probabilities, $X \sim \mathrm{Binom}(n, r)$ with $r = .02$. The expected value $E(L)$ is the average loss per year over many years. We have $$E(L) = 120,000\,E(X) = 120,000(n)(r) = 120,000(1000)(.02) = 2,400,000$$ dollars.

The way I read your question, you are asked to simulate the loss for one year. If you repeat this several times, you will get a wide variety of answers, because the standard deviation of $L$ is quite large. (You give no clue how much you know about probability. Can you show that $SD(L) = 120,000\sqrt{nr(1-r)} \approx 531,263\,?$)
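As a quick check on the analytic values, the mean and standard deviation of $L$ follow from the usual binomial formulas $E(X) = nr$ and $SD(X) = \sqrt{nr(1-r)}$. The sketch below is only a minimal illustration; the variable names (loss_per, mean_L, sd_L) are my own and not part of the exercise.

```r
# Analytic mean and SD of the yearly loss L = 120000 * X, with X ~ Binomial(n, r)
n <- 1000            # loans per year
r <- 0.02            # probability that a loan forecloses
loss_per <- 120000   # loss per foreclosure, in dollars (name is illustrative)

mean_X <- n * r                   # E(X)
sd_X   <- sqrt(n * r * (1 - r))   # SD(X)

mean_L <- loss_per * mean_X       # E(L)  = 120000 * E(X)
sd_L   <- loss_per * sd_X         # SD(L) = 120000 * SD(X), not sqrt(120000) * SD(X)

mean_L; sd_L
```

A standard deviation of roughly half a million dollars against a mean of 2.4 million is why single-year simulations differ so much from run to run.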

Simulating a binomial random variable. I think the most convenient way to simulate one realization of $X$ in R is to use the statement rbinom(1, n, r). So here are simulated (and quite different) losses for three successive years. Try several runs for yourself: most of your numbers of foreclosures will likely be between 10 and 30.

 n = 1000;  r = .02
 x = rbinom(1, n, r); L = 120000*x
 x; L
 ## 11            # foreclosures
 ## 1320000       # total loss
 x = rbinom(1, n, r); L = 120000*x
 x; L
 ## 36
 ## 4320000
 x = rbinom(1, n, r); L = 120000*x
 x; L
 ## 20
 ## 2400000

I used the code below to simulate the number $X$ of foreclosures per year over an imaginary 10000-year period, and make a histogram of the results.

 x = rbinom(10000, n, r)
 hist(x, br=(-1:max(x))+.5, prob=T, col="wheat")

[Figure: histogram of the simulated numbers of foreclosures per year]
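The same simulated years can be turned into losses and summarized numerically, which is roughly what part B asks for. This is a sketch under my own assumptions: the seed is arbitrary, and the \$3 million threshold is just an illustrative figure one might show a board, not something from the exercise.

```r
set.seed(2024)                      # arbitrary seed, only for reproducibility
n <- 1000; r <- 0.02
m <- 10000                          # number of simulated years
x <- rbinom(m, n, r)                # foreclosures in each simulated year
L <- 120000 * x                     # loss in each simulated year, in dollars

mean(L)                             # Monte Carlo estimate of E(L)
sd(L)                               # Monte Carlo estimate of SD(L)
quantile(L, c(0.025, 0.5, 0.975))   # median and a central 95% range of losses
mean(L > 3e6)                       # estimated chance of losing more than $3 million
```

Note that the number of loans stays fixed at 1000 in every iteration; the Monte Carlo part is only the repetition of the one-year experiment.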

Simulating with random sampling. However, you have been asked to use the sample function. One way to do this is to use sample with the parameter prob, which gives the probabilities of foreclosure and no foreclosure. We use 0 to stand for no foreclosure and 1 to stand for a foreclosure. We use the parameter rep=T because 0's and 1's can be used repeatedly (sampling with replacement). Then sample gives a vector of length 1000, typically with lots of 0's and a few 1's, and sum counts how many foreclosures there are. Again, you will get a wide variety of answers if you run the code below several times.

 x = sum(sample(1:0, 1000, rep=T, prob=c(.02, .98)))
 L = 120000*x
 x; L
 ## 21
 ## 2520000

This code using sample and sum is not as convenient as the code with rbinom. Maybe you were asked to do the exercise with sample as a preliminary to learning about binomial random variables.
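To repeat the sample-based experiment many times, one option is to wrap the one-year draw in a function and call it with replicate. The sketch below assumes a helper named one_year (my own name, not from the exercise) and an arbitrary seed. It also shows why sampling without replacement from a fixed pool of 1000 customers with exactly 20 defaulters would not be a useful model here.

```r
set.seed(1)   # arbitrary seed, only for reproducibility

# One simulated year: 1 = foreclosure, 0 = no foreclosure.
# replace = TRUE makes each of the n loans an independent draw with P(1) = r.
one_year <- function(n = 1000, r = 0.02) {
  sum(sample(c(1, 0), n, replace = TRUE, prob = c(r, 1 - r)))
}

x_sample <- replicate(10000, one_year())   # foreclosures in 10000 simulated years
L_sample <- 120000 * x_sample              # corresponding yearly losses
mean(x_sample)                             # should be close to n * r = 20

# Without replacement from a pool of exactly 1000 customers, 20 of whom default,
# drawing all 1000 is just a shuffle, so the count never varies.
pool <- c(rep(1, 20), rep(0, 980))
sum(sample(pool, 1000, replace = FALSE))   # always 20
```

The with-replacement version reproduces the binomial model, so its summaries should agree with those from rbinom up to simulation error.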

I hope this is enough to get you started on the entire problem. If not, please edit your Question or leave a Comment showing what you have tried. This is not a 'homework answer' site. Once you show some additional participation, maybe I or someone else can help you with the next step.

Note. There are several reasons for learning to use simulation in such probability problems. Here are two: (a) Not all actuarial problems are as simple as the one posed here, and it may be difficult to get an analytical solution. (b) After you have an analytical solution and you're 'sure' it is right, it may be a good idea to go through the thought process one more time to program a simulation, and thus to check the validity of the analytic solution. This is an especially good idea where losses have as many 0's in them as in the situation described here.
