Tuesday, December 10, 2013

Binomial distribution in R.

I haven't put much emphasis on using R on my blog yet, but since we have a pretty good grasp of some discrete distributions, it's probably time we start using it. I'll start with the binomial distribution. All distributions in R work in a similar way: each one has four commands, made by putting a prefix in front of the distribution's name. The "d" prefix gives you the density, "p" gives the cumulative probability, "q" gives the "inverse cumulative probability" (I'll talk about this later), and "r" generates random numbers from the distribution. I'll start off with d, since it's the easiest to grasp.
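As a quick preview, here's what the four look like for a binomial with 10 trials and a success probability of .4, the same numbers we'll work through below (the arguments will make more sense as we go):

>dbinom(4,10,.4)    # density: probability of exactly 4 successes
>pbinom(4,10,.4)    # cumulative probability: 4 or fewer successes
>qbinom(.633,10,.4) # quantile: smallest count with cumulative probability >= .633
>rbinom(5,10,.4)    # five random draws from this binomial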

In R, placing a "d" before the binomial function binom() gives the density at your chosen point, which for a discrete distribution is just the probability of that value (for continuous distributions, which we'll cover later, it gives the height of the density curve at that point). For instance, let's say we want to know the probability of getting exactly 4 successes in 10 trials with a probability of success of .4. Plugging this in would be:

>dbinom(4,10,.4)

Which is approximately equal to .25.
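If you want to double-check that number against the binomial formula $\binom{10}{4}(.4)^4(.6)^6$, R's choose() function handles the binomial coefficient:

>choose(10,4)*.4^4*.6^6

which gives the same value, about .2508.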

What about the probability of getting 4 or fewer? Well, we could add them up:

>dbinom(0,10,.4)+dbinom(1,10,.4)+dbinom(2,10,.4)+dbinom(3,10,.4)+dbinom(4,10,.4)
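As an aside, dbinom() happily takes a whole vector of values at once, so that sum can be written more compactly:

>sum(dbinom(0:4,10,.4))

which gives the same total.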

Or we can take the much simpler route of using the binomial's cumulative distribution command:

>pbinom(4,10,.4)

For this particular case, either way gives you a probability of approximately .633. Now we also have the inverse cumulative function, also known as the quantile function. I might talk about this in a different post, but as a quick primer I'll give the definition for a continuous, monotonically increasing distribution. If $F(x)=p$ is the cumulative probability at $x$, then we can take the inverse and get $F^{-1}(p)=x$; in words, we work in reverse: we take a probability and find the $x$ that satisfies it. For discrete distributions it's slightly trickier, since the cumulative function is discontinuous (R returns the smallest value whose cumulative probability is at least $p$), but understanding the process intuitively in the continuous case is enough to try it out in R. Since the probability of getting 4 or fewer is approximately .633, putting .633 into the inverse cumulative function:

>qbinom(.633,10,.4)

 Should give us 4. Check this for yourself.
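One way to see the round trip between pbinom() and qbinom() is to feed the output of one straight into the other:

>p=pbinom(4,10,.4)   # about .6331
>qbinom(p,10,.4)     # returns 4

Because the distribution is discrete, any probability between roughly .383 (the cumulative probability at 3) and .633 maps back to 4.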

Now for our last command: random generation with rbinom(). Let's say one experiment has 100 trials, the probability of success is .4, and we run 5 of these experiments. In R we'd put:


>x=rbinom(5,100,.4)

Where x is an arbitrary variable we're assigning the results to, rbinom() is the random-generation command for the binomial distribution, 5 is the total number of experiments we run, 100 is the number of trials in each experiment, and .4 is the probability of success (notice the first argument plays a different role here than it did earlier). If we check the value of x, it gives us the results of these randomly run experiments. It would look something like this:

[console output: the five simulated success counts]

This certainly makes sense: with 100 trials and a .4 probability of success, we should get around 40 successes, and notice the values do scatter around 40. Graphically this would look like:

[histogram of the five simulated counts]

This is simply the histogram of the randomly generated values we had. To make it, you just put:

>hist(x)

Where hist() is the histogram command. What we can see from the graph is that it doesn't look like much: the values cluster together around 38-40, there's a strange gap between 32 and 36, and another value appears in the 30-32 bin. Why does this look so strange? Well, we only have 5 experiments total; we need to run more before we get a distribution that looks reasonable. Let's do the exact same process, but with 100 experiments of 100 trials each instead of 5.
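In R that's the same rbinom() call with just the first argument changed (your counts will differ from run to run, since they're random):

>x=rbinom(100,100,.4)
>x   # prints the 100 simulated success counts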



With a histogram of:

[histogram of the 100 simulated counts]

Now that looks much more reasonable. We can also check how close the sample average is to the theoretical mean by using the mean command, mean(). The mean of the 5 experiments we ran earlier was 38.2, while the mean of the 100 experiments is 39.68, which is closer to the theoretical mean of np = 100(.4) = 40.
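The command itself is just the following, where x holds whichever batch of simulated counts you most recently generated:

>mean(x)   # average number of successes across the experiments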

You should definitely play around with this to get a better feel for it; you'll be using these same commands for the other distributions.

Addendum:

Keep in mind the Bernoulli distribution, which is just a binomial with a single trial. It's easy to use our tools here to make it: if you want randomly generated data from a Bernoulli distribution, just set the number of trials equal to one. It would look like this:

>x=rbinom(100,1,.5)

So a hundred experiments of a single trial. That would look like:

[console output: a string of 100 zeros and ones]

Now here's another good command we can use to play around with this. Let's say we want to count how many of those outcomes came up 1. Put:

>sum(x==1)

This counts all the individual outcomes that equal one, and here it gives 51, close to the 50 you would expect with a success probability of .5. You can also get a tally of all the outcomes at once with

>table(x)

which gives you the total numbers of 0s and 1s.
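Since the outcomes are just 0s and 1s, a couple of other one-liners are worth knowing here (nothing Bernoulli-specific, just base R):

>sum(x)    # same count of 1s, since the 0s contribute nothing
>mean(x)   # proportion of 1s, which estimates the success probability .5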


