Alright, now we have to take the derivative we got last time and differentiate it again. Last time we had:
$$\frac{\partial M_{x}(t)}{\partial t}=r\left( \frac{Pe^{t}}{1-(1-P)e^{t}}\right)^{r-1}\left(\frac{Pe^{t}}{1-(1-P)e^{t}}+\frac{P(1-P)e^{2t}}{(1-(1-P)e^{t})^{2}}\right)$$
The second derivative of this would become:
(yeah I'm not typing that up at all, of course I used Wolfram)
Setting $t=0$ reduces this to:
$$\frac{-r(P-1)+r^{2}}{P^2}$$
Now to use the fact that $\sigma^2=E[x^2]-\mu^2$, we have:
$$\sigma^2=\frac{-r(P-1)+r^{2}}{P^2}-\frac{r^2}{P^2}=\frac{-r(P-1)}{P^2}=\frac{r(1-P)}{P^2}$$
Now that we have the variance of the negative binomial, we can use it to find the variance of the geometric, which is just the special case $r=1$. That is simply:
$$\frac{(1-P)}{P^2}$$
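As a quick sanity check, you can compare this with simulated data in R. Note that R's rnbinom() counts failures rather than total trials, but shifting by the constant $r$ doesn't change the variance, so the comparison still works (the values of $r$ and $P$ here are arbitrary, and the sample variance will only approximately match the formula, since the draws are random):
>r=5; P=.3
>var(rnbinom(100000, size=r, prob=P))   # sample variance of 100,000 simulated counts
>r*(1-P)/P^2                            # the variance formula derived above; the two should be close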
Work is finally done. Now moving on to the next discrete distribution: the hypergeometric. After that, only two of the most useful discrete distributions remain: the Poisson, and the uniform (trivial, but I'll make a post about it anyway).
Tuesday, December 10, 2013
Binomial distribution in R.
I haven't put much emphasis on using R on my blog yet, but since we have a pretty good grasp of some discrete distributions it's probably time we start using it. I'll start with the binomial distribution. All distributions in R work similarly, in the sense that we have four commands for each one: the "d" command gives you the density, the "p" command gives the cumulative probability, "q" gives the quantile, i.e. the inverse cumulative probability (I'll talk about this later), and "r" generates random numbers. I'll use d to start off with, since it's the easiest to grasp.
In R placing a "d" before the binomial function binom() gives the density of your chosen point. (for continuous distributions, which we'll cover later, gives the height at the point). For instance, let's say we want to know the probability of getting exactly 4 successes in 10 trials with a probability of .4 of success. Plugging this in would be:
>dbinom(4,10,.4)
Which is approximately equal to .25. What about the probability of getting 4 or fewer? Well we could add them:
>dbinom(0,10,.4)+dbinom(1,10,.4)+dbinom(2,10,.4)+dbinom(3,10,.4)+dbinom(4,10,.4)
Or we can take the much simpler route of using the cumulative distribution command for the binomial:
>pbinom(4,10,.4)
For this particular case. Either way gives you approximately .633 probability. Now we also have the inverse cumulative function, also known as the quantile function. I might talk about this in a different post, but as a quick primer I'll give the definition for a continuous, monotonically increasing distribution. If $F(x)=p$ is the cumulative probability at $x$, then we can take the inverse and get $F^{-1}(p)=x$; in words, we work in the reverse order: we start with the probability $p$ and find the $x$ that satisfies it. For discrete distributions it's slightly more delicate, since they're discontinuous, but for our purposes an intuitive understanding of the continuous case is enough to try it out in R. Since the probability of getting 4 or fewer is approximately .633, putting .633 into the inverse cumulative function:
>qbinom(.633,10,.4)
Should give us 4. Check this for yourself.
Now for our last command, the randomly generated binomial command. Let's say one experiment has 100 trials, the probability of success is .4, and we run 5 of these experiments. Well, in R we'd put:
>x=rbinom(5,100,.4)
Where x is an arbitrary variable we're assigning this all to, rbinom() is the randomly generated binomial distribution command, 5 is the total number of experiments we run, 100 the number of trials in each experiment, and .4 is the probability of success (notice the first number plays a different role than earlier). If we check the value of x, it would give us the results of these randomly run experiments. It would look something like this:
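To see the actual numbers, just type the variable name at the prompt; the draws are random, so your five counts will differ from anyone else's, but each should land somewhere in the neighborhood of 40:
>x   # prints the five simulated success counts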
This certainly makes sense. If we have 100 trials and a .4 probability of success, we should get around 40 successes in each experiment. Notice the values actually oscillate around 40. Graphically this would look like:
This is simply the histogram of the randomly generated values we had. To make this, you simply put:
>hist(x)
Where hist() is the histogram command. What we can see from the graph is that it doesn't look like much. Obviously they cluster together around 38-40. There's a strange gap between 32-36, with another value appearing in the 30-32 bin. Why does this look so strange? Well, we only have 5 experiments total. What we need to do is run more before we get a distribution that looks more reasonable. Let's do the exact same process, but now instead of 5 experiments of 100 trials, we'll run 100. That would give:
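In code that's the same call with the first argument bumped up to 100 (your exact values will differ, since they're random draws):
>x=rbinom(100,100,.4)   # 100 experiments of 100 trials each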
With a histogram of:
Now that looks much more reasonable. We could also test how close it is to the mean by using the mean command, mean(). The mean of the 5 experiments we ran earlier was 38.2. The mean of the 100 experiments equals 39.68. Closer to the theoretical average of 40.
You should definitely play around with this and get a better feel for it; you'll be using these commands for the other distributions.
Addendum:
Keep in mind the Bernoulli distribution, which has only one trial. It's easy to use our tools here to generate it: if you want randomly created data for a Bernoulli distribution, just set the number of trials equal to one. It would look like this:
>x=rbinom(100,1,.5)
So a hundred experiments of a single trial. That would look like:
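Plotting it the same way as before, the histogram would just show two clumps, one at 0 and one at 1, each containing roughly half of the hundred outcomes (roughly, since the draws are random):
>hist(x)   # 0s pile up on the left, 1s on the right, about 50 each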
Now here's another good command we can use to play around with this. Let's say we want to sum up all those outcomes of 1. Put:
>sum(x==1)
This counts all the individual outcomes equal to one; it gave 51 in this case, which is about what you would expect from a distribution like this with a probability of .5. A tally of all the outcomes can be found by putting >table(x), which gives you the total numbers of 0s and 1s.
In R placing a "d" before the binomial function binom() gives the density of your chosen point. (for continuous distributions, which we'll cover later, gives the height at the point). For instance, let's say we want to know the probability of getting exactly 4 successes in 10 trials with a probability of .4 of success. Plugging this in would be:
>dbinom(4,10,.4)
Which is approximately equal to .25. What about the probability of getting 4 or fewer? Well we could add them:
>dbinom(0,10,.4)+dbinom(1,10,.4)+dbinom(2,10,.4)+dbinom(3,10,.4)+dbinom(4,10,.4)
Or we can take the much simpler route which is using the cumulative distribution of the binomial command, which is:
>pbinom(4,10,.4)
For this particular case. Either way gives you approximately .633 probability. Now we also have the inverse cumulative function, also known as the quantile function. I might talk about this in a different post, but as a quick primer I'll give the definition for a continuous, monotonically increasing distribution. If $F(x)=p$ is the cumulative probability at $x$, then we know we can take the inverse and get $F^{-1}(p)=x$, which in words means we take the reverse order in that we take the probability and try to find an $x$ that satisfies this. For discrete distributions it's slightly more difficult, since they're discontinuous, but for our purposes understanding the process intuitively even in the continuous case will be enough to try it out in R. Since the probability of getting 4 or fewer is approximately .633, the putting in the inverse cumulative function:
>qbinom(.633,10,.4)
Should give us 4. Check this for yourself.
Now for our last command, the randomly generated binomial command. Let's say one experiment has 100 trials, the probability of success is .4, and we run 5 of these experiments. Well, in R we'd put:
>x=rbinom(5,100,.4)
Where x is an arbitrary variable we're assigning this all to, rbinom() is the randomly generated binomial distribution command, 5 is the total number of experiments we run, 100 the number of trials in each experiment, and .4 is the probability of success (notice the first number plays a different role than earlier). If we check the value of x, it would give us the results of these randomly run experiments. It would look something like this:
This certainly makes sense. If we have 100 trials, and we have .4 probability of success, we should get round 40 successes. Notice the values actually oscillate around 40. Graphically this would look like:
This is simply the histogram of the randomly generated values we had. To make this, you simply put:
>hist(x)
Where hist() is the histogram command. What we can see from the graph is that it doesn't look like much. Obviously they cluster together around 38-40. There's a strange gap between 32-36, with another value appearing in the 30-32 bin. Why does this look so strange? Well, we only have 5 experiments total. What we need to do is run more before we get a distribution that looks more reasonable. Let's do the exact same process, but now instead of 5 experiments of 100 trials, we'll run 100. That would give:
With a histogram of:
Now that looks much more reasonable. We could also test how close it is to the mean by using the mean command, mean(). The mean of the 5 experiments we ran earlier was 38.2. The mean of the 100 experiments equals 39.68. Closer to the theoretical average of 40.
You should definitely play around with this and get a better feel for it, you'll be using this for your other distributions.
Addendum:
Keep in mind the Bernoulli distribution that only takes one trial. It's easy to use our tools here to make this. if you want randomly created data for a Bernoulli distribution just set the number of trials equal to one. It would look like this:
>x=rbinom(100,1,.5)
So a hundred experiments of a single trial. That would look like:
Now here's another good command we can use to play around with this. Let's say we want to sum up all those outcomes of 1. Put:
>sum(x==1)
This sums up all the individual data that equals one, and it gets 51 in this case which is exactly what you would expect from a distribution like this with a probability of .5. A collection of all the outcomes can be found by putting >table(x) which finds you the total numbers of 0s and 1s.
Sunday, December 1, 2013
Negative binomial and geometric distribution pt. 2 MGF, variance, and mean.
As I said in the last post, the geometric distribution is a special case of the negative binomial when $r=1$. Therefore, we'll begin by finding the MGF, variance, and mean of the negative binomial, and this will automatically give us the cases for the geometric. Since we can use the MGF to find the variance and the mean, we'll start with that. This gives:
$$E[e^{tx}]=\sum_{x=r}^{\infty}e^{tx}\left( {\begin{array}{*{20}c} x-1 \\ r-1 \\
\end{array}} \right) P^{r}(1-P)^{x-r}$$
Now some explaining. Because the number of successes, $r$, is a set number, it's not our random variable. Instead, our random variable is $x$, the number of trials it takes to get $r$ successes. That's why $e$ is raised to $tx$. Why does the sum run from $x=r$ to $\infty$? Well, those are the possible values of $x$: either there are $0$ failures (so $x=r$), or the number of failures goes on toward $\infty$ (in which case the probabilities get very small). What we have so far is a good start, but we'll need to make a slight change. We'll rewrite $e^{tx}$ as $e^{tx+tr-tr}$, which will become useful soon. Split into parts, that becomes $e^{t(x-r)}e^{tr}$. Rearranging this into our equation changes it to:
$$\sum_{x=r}^{\infty}\left( {\begin{array}{*{20}c}x-1 \\ r-1 \\
\end{array}} \right) (Pe^{t})^{r}((1-P)e^{t})^{x-r}=(Pe^{t})^{r}\sum_{x=r}^{\infty}\left( {\begin{array}{*{20}c}x-1 \\ r-1 \\
\end{array}} \right)((1-P)e^{t})^{x-r}$$
Where I've just pulled out the term that isn't being summed over. What we want to do next is get rid of that ugly binomial coefficient to simplify this. What can we do? Well we know from the binomial theorem that:
$$\sum_{k=0}^{n}\left( {\begin{array}{*{20}c}n \\ k \\
\end{array}} \right)x^{n-k}y^k=(x+y)^n$$
If we can shape part of the equation into that form then we can simplify it greatly. Even better, if we can make it so that $x+y=1$, then that whole part of the equation would become $1$. Let's compare what we have to what we want. In the binomial theorem the exponents add up to $n$: $(n-k)+k=n$. In ours they must add up to $x$: $(x-r)+r=x$, so the other exponent must be $r$. Therefore, the terms to the right of the binomial coefficient in our equation must be of the form $a^{x-r}b^{r}$, and furthermore $a+b$ must equal $1$. We know that our first term, $(1-P)e^{t}$, is raised to $x-r$, which means it fits the position of the $a$ term. Now what we need is a "$b$" term that is raised to $r$. Since we know the $a$ term, we want $a+b=1$, or:
$$(1-P)e^{t}+b=1\Leftrightarrow b=1-(1-P)e^{t}$$
So we need this term on the right side of the binomial coefficient:
$$(1-(1-P)e^{t})^{r}$$
Well, on the right side we can have:
$$\left( {\begin{array}{*{20}c}x-1 \\ r-1 \\ \end{array}} \right)((1-P)e^{t})^{x-r}\frac{(1-(1-P)e^{t})^{r}}{(1-(1-P)e^{t})^{r}}$$
Since that term is just $1$ multiplying it. Well, the full equation would be:
$$(Pe^{t})^{r}\sum_{x=r}^{\infty}\left( {\begin{array}{*{20}c}x-1 \\ r-1 \\
\end{array}} \right)((1-P)e^{t})^{x-r}\frac{(1-(1-P)e^{t})^{r}}{(1-(1-P)e^{t})^{r}}$$
The summation only affects $x$, so we can pull the denominator out:
$$\frac{(Pe^{t})^{r}}{(1-(1-P)e^{t})^{r}}\sum_{x=r}^{\infty}\left( {\begin{array}{*{20}c}x-1 \\ r-1 \\
\end{array}} \right)((1-P)e^{t})^{x-r}(1-(1-P)e^{t})^{r}$$
Now the terms to the right of the binomial coefficient are exactly in the $a^{x-r}b^{r}$ form we wanted, with $a+b=1$, so the binomial coefficient and everything to the right of it sums to $1$, and all we are left with is:
$$\frac{(Pe^{t})^{r}}{(1-(1-P)e^{t})^{r}}=\left( \frac{Pe^{t}}{1-(1-P)e^{t}}\right) ^r$$
Which is the MGF of the negative binomial distribution. Setting $r=1$ would then give us the MGF of the geometric distribution, which is:
$$\frac{Pe^{t}}{1-(1-P)e^{t}}$$
(Fairly easy to check that yourself). Now that we have the MGF, we can focus on the mean and variance. Starting with the mean, we want the first moment. That's equivalent to the first derivative of the MGF with respect to $t$, and then setting $t$ equal to $0$. The derivative is:
$$\frac{\partial M_{x}(t)}{\partial t}=r\left( \frac{Pe^{t}}{1-(1-P)e^{t}}\right)^{r-1}\left(\frac{Pe^{t}}{1-(1-P)e^{t}}+\frac{P(1-P)e^{2t}}{(1-(1-P)e^{t})^{2}}\right)$$
(You have no idea how long that took)
Now setting $t=0$, we get $\frac{r}{P}$. Setting $r=1$ for the geometric case, we get $\frac{1}{P}$. Well, we have the means of both distributions. Now it's time to differentiate twice... I'll do that in another post. Too tired. As for now, there's something I should mention: if you look at the denominator, setting $t=\ln\left(\frac{1}{1-P}\right)$ makes it zero, so we have to keep $t$ below that value. Obviously it's important we avoid something like that, but I didn't deem it necessary to mention earlier. Then a stroke of conscience reminded me I should.
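Going back to that $t=0$ step for a second, here it is spelled out: at $t=0$ we have $e^{0}=1$ and $1-(1-P)e^{0}=P$, so the derivative collapses to
$$\frac{\partial M_{x}(t)}{\partial t}\bigg|_{t=0}=r\left(\frac{P}{P}\right)^{r-1}\left(\frac{P}{P}+\frac{P(1-P)}{P^{2}}\right)=r\left(1+\frac{1-P}{P}\right)=\frac{r}{P}$$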
Friday, November 29, 2013
Moment generating function Pt. 3 multiple variables
I believe this will be the last part of my MGF series, but we'll see. This is the MGF of a joint probability function (multiple variables), or a vector-valued MGF. Originally I was going to use a proof that I put together myself, but it was long and cumbersome compared to a much shorter, more comprehensible one I recently found. First, I'll have to show a very useful result: if we have an expectation $E[XY]$ where $X$ and $Y$ are independent random variables, then it's equivalent to $E[X]E[Y]$. To show this I'll have to explain a bit about independent events, and then move on to that result.
Now from an earlier post we know about the conditional probability, which is written as:
$$P(A|B)=\frac{P(A\cap B)}{P(B)}$$
Rearranged it becomes:
$$P(A|B)P(B)=P(A\cap B)$$
In other words, the probability that $A$ and $B$ happen is equal to the probability that $A$ given that $B$ has happened, multiplied by the probability of $B$. Now here's where we introduce the idea of independence. Let's say that we know $B$ has happened. Now what if $B$ happening doesn't affect $A$ at all? In other words, if we know $B$ happens, it doesn't change what $A$ could be. Let's say we flip a coin and get heads. Now let's say we flip a coin again. Does the fact we got heads on the first flip affect what we get on the second? Absolutely not. So, in mathematical notation, that's saying that $P(A|B)=P(A)$. Plugging that into our earlier function gives:
$$P(A)P(B)=P(A\cap B)$$
So the probability of $A$ and $B$ is the multiplication of both. This is known as the multiplication rule, and should make a lot of intuitive sense. What are the odds of getting two heads in a row? $(\frac{1}{2})(\frac{1}{2})=\frac{1}{4}$.
We can further the discussion by talking about PDFs of independent variables. Since a PDF is the probability of an event, independence should carry over naturally. So if we have two variables $X_{1}$ and $X_{2}$ that are independent, we know how conditional probability is defined:
$$f_{X_{2}|X_{1}}(X_{2}|X_{1})=\frac{f_{X_{2},X_{1}}(X_{2},X_{1})}{f_{X_{1}}(X_{1})}$$
Which, again, becomes:
$$f_{X_{2}|X_{1}}(X_{2}|X_{1})f_{X_{1}}(X_{1})=f_{X_{2},X_{1}}(X_{2},X_{1})$$
Now if like last time we can say that $X_{2}$ does not depend on $X_{1}$, then:
$$f_{X_{2}|X_{1}}(X_{2}|X_{1})=f_{X_2}(X_{2})$$
Which makes our previous equation:
$$f_{X_{2}}(X_{2})f_{X_{1}}(X_{1})=f_{X_{2},X_{1}}(X_{2},X_{1})$$
Which essentially says that the joint PDF of two independent variables is equal to the PDFs of both variables multiplied together. Now, we have enough information to find the result we were originally looking for. Let's start with $E[XY]$. Using the expectation operator this gives:
$$\int\int XYf_{X,Y}(X,Y)dxdy$$
But because the distributions are independent:
$$\int\int XYf_{X}(X)f_{Y}(Y)dxdy$$
Now we can separate the variables and the integrals: since the inner integral is over $x$, we can pull $Y$ and its PDF outside it and integrate over $X$ and its distribution first. Written out mathematically:
$$\int\int XYf_{X}(X)f_{Y}(Y)dxdy=\int Yf_{Y}(Y)\int Xf_{X}(X)dxdy$$
The inside integral becomes the average of $X$. Continuing:
$$ \int Y\mu_{X}f_{Y}(Y)dy=\mu_{X}\int Yf_{Y}(Y)dy=\mu_{X}\mu_{Y}=E[X]E[Y]$$
Which was the desired result. Now that we have this, we can talk about the MGF. The multiple variable MGF, where we take $n$ random variables, $X_{1},X_{2},...,X_{n}$, is defined as:
$$E[e^{t_{1}X_{1}+t_{2}X_{2}+...+t_{n}X_{n}}]=E[e^{\sum_{i=1}^{n}t_{i}X_{i}}]$$
Now we know that $e^{x+y}=e^{x}e^{y}$. Carrying this result gives us:
$$E[\prod_{i=1}^{n}e^{t_{i}X_{i}}]$$
Now we can finally use the result we proved earlier. Since this is the expectation of a product of functions of the individual random variables, and we're assuming the $X_{i}$ are independent, the expectation factors into a product of expectations. This gives us:
$$\prod_{i=1}^{n}E[e^{t_{i}X_{i}}]=\prod_{i=1}^{n}M_{X_{i}}(t_{i})$$
Where $M_{X_{i}}(t_{i})$ is the MGF of the $i$th variable. Hence, the vector-valued MGF of independent variables is simply the product of all the individual MGFs. From here we can use what we know about their individual MGFs to find their respective moments.
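Here's a quick empirical check of both results in R. The particular distributions, the sample size, and the choice of $t_{1}$ and $t_{2}$ below are all arbitrary, and since these are random samples the two sides of each comparison will only agree approximately:
>x = rnorm(100000, mean=2)   # X ~ Normal(2,1)
>y = runif(100000)           # Y ~ Uniform(0,1), drawn independently of X
>mean(x*y); mean(x)*mean(y)  # E[XY] vs E[X]E[Y]; both should be near 2*0.5 = 1
>t1=.2; t2=.3
>mean(exp(t1*x + t2*y)); mean(exp(t1*x))*mean(exp(t2*y))   # joint MGF vs product of the individual MGFs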
Monday, November 4, 2013
Negative Binomial and Geometric Distributions Part 1.
Hey everyone, I know it's been a while since I posted, but I hope to make it a bit more frequent in the future. I'll be making some changes to the whole layout, but for now I'll just continue with my series on discrete distributions. This time I'll cover two distributions, the negative binomial and the geometric, that are very related to each other, as well as to the two we've covered so far. Essentially every distribution thus far is really a slight tweak of the binomial distribution. For instance, here is the original binomial distribution equation:
$$
B(n,p)=\left\{
\begin{array}{ll}
\left( {\begin{array}{*{20}c}
n \\ x \\
\end{array}} \right) P^{x}(1-P)^{n-x} \quad x=0,1,2,...,n\\
0,\quad \qquad \qquad \qquad otherwise
\end{array}
\right.
$$
Where $n$ is the total number of trials, $P$ is the probability of success, $(1-P)$ the probability of failure, and $x$ the total number of successes. In an earlier post I explained the reasoning behind the term in front, but I could go over it again here. Let's say that you want to know the probability of getting two heads in a toss of four coins. Well, one way of doing this would be $P(1-P)P(1-P)$. A head on the first toss, a tail, another head, and then another tail. Is this the only way? Absolutely not. In fact, we could have had $PP(1-P)(1-P)$, or $P(1-P)(1-P)P$, and so on. The "Binomial Coefficient" in front counts all these different ways up, so you can properly measure the probability of getting two heads. Now, let's pose this question: What is the probability of getting $r$ successes on the $n+r$ trial? In other words, in $n+r-1$ trials we had exactly $r-1$ successes, and on the very next trial, $n+r$, we get a success. Let's have a variable, $x$, which denotes this total amount $n+r$. Therefore, total number of trials would be $x-1$ and total number of successes $r-1$ right before the final success. Before the final success, it would look like:
$$\left( {\begin{array}{*{20}c}x-1\\r-1\\ \end{array}}\right)P^{r-1}(1-P)^{x-r}$$
Now from our discussion before, we know that the probabilities can come in a different order when multiplying. However, in this scenario we know that the very last trial must be a success. Therefore, we must count up all the possible ways of getting $r-1$ successes, and then multiply that by the probability of getting the final success, $P$. This is essentially like taking our coin tossing experiment and only counting the terms with $P$ as the last term. This makes sense, since probabilities are multiplicative. If you want to know the probability of getting the $r$th success in $n+r$ trials, you multiply the probability of getting $r-1$ successes with the probability of getting another success, so $P$. That would look something like this:
$$\left( {\begin{array}{*{20}c}x-1\\r-1\\ \end{array}}\right)P^{r-1}(1-P)^{n}P$$
Where as you can see it's the probability of getting $r-1$ successes in $x-1$ trials, and the term on the end is the probability of getting the final success. This can be rearranged to get:
$$\left( {\begin{array}{*{20}c}x-1\\r-1\\ \end{array}}\right)P^{r}(1-P)^{n}$$
This is the negative binomial distribution. Unfortunately, there are different forms of the negative binomial distribution that are all essentially equivalent. This is the one I'll be using. Now what is the geometric distribution? The geometric distribution is the special case when $r=1$. This would become:
$$(1-P)^{n}P$$
I'll leave it to you to show that the binomial coefficient reduces to $1$. Now this makes sense: in the case that $r=1$ we have $n+r-1=n+1-1=n$, so we would have $n$ failures and then the success on trial $n+1$. So, in the sense that the Bernoulli distribution is a special case of the binomial, the geometric is a special case of the negative binomial. Likewise, it would make sense that if we found the mean and variance of the negative binomial, they would include the mean and variance of the geometric distribution. The easiest way to find them is using the MGF; to explain it in more depth I'll make a separate post.
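If you want to check this form against R, note that R's dnbinom() is parameterized by the number of failures rather than the total number of trials, so you hand it $x-r$; with that adjustment the two should match exactly (the particular $r$, $P$, and $x$ below are just an arbitrary example):
>r=3; P=.4; x=7
>choose(x-1, r-1) * P^r * (1-P)^(x-r)   # the formula derived above
>dnbinom(x-r, size=r, prob=P)           # R's built-in negative binomial, counting x-r failures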
Friday, April 26, 2013
MGF of the Bernoulli and Binomial distributions.
Alright, a quick post about MGFs and the Bernoulli and Binomial distributions. Since the Bernoulli distribution is simply the Binomial distribution for the special case of $n=1$, we can see that the proof for the Binomial distribution will automatically include a proof for the Bernoulli distribution. I already showed it to find $E[x^2]$, but I'll restate it here. So, the MGF for discrete functions is:
$$\sum_{x=0}^{n}e^{tx}P_{x}(x)$$
Which for the Binomial distribution would be:
$$\sum_{x=0}^{n}e^{tx}\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right) P^{x}(1-P)^{n-x}$$
Where we can absorb $e^{tx}$ into the term raised to the power of $x$:
$$\sum_{x=0}^{n}\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right) (Pe^{t})^{x}(1-P)^{n-x}$$
Now, because of the Binomial theorem we know that:
$$\sum_{x=0}^{n}\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right) (Pe^{t})^{x}(1-P)^{n-x}=(Pe^{t}+1-P)^{n}$$
This is a much more manageable form. Now we can take derivatives of this and get whatever moment we want. Let's use the general case of $n$, but then check it for $n=1$ (i.e. the Bernoulli distribution).
Well, (setting $(Pe^{t}+1-P)^{n}=M_{x}(t)$) the first derivative is:
$$\frac{d(M_{x})}{dt}=e^{t}Pn(e^{t}P+1-P)^{n-1}$$
By using the chain rule. So this is the first moment. Setting $t=0$ here will give us the mean. Let's check it:
$$e^{0}Pn(e^{0}P+1-P)^{n-1}=nP$$
Well, for the Bernoulli distribution we proved that the mean is $P$, and for the Binomial we showed it was $nP$, which is perfect: we have the Binomial mean, and if we set $n=1$ then we have the Bernoulli. Now for the second moment and the variance. To do this we can use the fact that $\sigma^{2}=E[x^2]-\mu^2$. Well, we know $\mu$, but we need to know $E[x^2]$, or the second moment. Taking the second derivative gives:
$$\frac{d^{2}(M_{x})}{dt^{2}}=e^{t}Pn(e^{t}P+1-P)^{n-1}+e^{t}Pn\left(e^{t}P(n-1)(e^{t}P+1-P)^{n-2}\right)$$
By use of the product rule. Setting $t=0$, taking note of the fact that $(P+1-P)=1$, and then simplifying gives us:
$$Pn+(P)^{2}n(n-1)=Pn+(Pn)^{2}-(P)^{2}n$$
Now, plugging this back into the variance formula, as well as the mean, gives us:
$$\sigma^{2}=Pn+(Pn)^{2}-(P)^{2}n-(Pn)^{2}=Pn-(P)^{2}n=Pn(1-P)$$
Which is the variance of the Binomial distribution. Setting $n=1$ gives us $P(1-P)$, which is exactly the variance for the Bernoulli distribution.
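If you want a rough numerical check of the closed form, you can compare $(Pe^{t}+1-P)^{n}$ against the sample average of $e^{tx}$ over simulated binomial draws in R (arbitrary $n$, $P$, and $t$; the match is only approximate since it's based on random samples):
>n=10; P=.4; t=.3
>mean(exp(t * rbinom(100000, n, P)))   # sample estimate of E[e^{tx}]
>(P*exp(t) + 1 - P)^n                  # the closed-form MGF; the two should be close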
Moment generating function Pt 2.
Alright, not really needed, but I'm going to throw in a few things that may come up. Essentially nifty little fun-facts. Let's start with the scenario that you'd want a moment about the mean. To get that result we just start over with:
$$MGF\equiv\int_{-\infty}^{\infty}e^{tX}f(X)dx,~t\in\mathbb{R}$$
And replace $e^{tX}$ with $e^{t(X-\mu)}$. Now, the rest is pretty straight forward. We could sub in the expression for the expansion of $e$ and get:
$$e^{t(X-\mu)}=1+\frac{t(X-\mu)}{1!}+\frac{t^{2}(X-\mu)^{2}}{2!}+\frac{t^{3}(X-\mu)^{3}}{3!}+...$$
Following the steps of last time would give you essentially the same idea. However, there's another way we could go about it. Since we have $e^{t(X-\mu)}$, we can transform it into $e^{tX}e^{-t\mu}$. You'll notice the former is just what we had earlier, and the latter is a new term. But let's write it out:
$$\int_{-\infty}^{\infty}e^{-t\mu}e^{tX}f(X)dx$$
Well, the integral is just with respect to $X$, so we can treat $e^{-t\mu}$ as a constant and pull it out to get:
$$e^{-t\mu}\int_{-\infty}^{\infty}e^{tX}f(X)dx$$
Where the integral on the right is exactly the same as the MGF we had before. So this now becomes:
$$e^{-t\mu}M_{x}(t)$$
Differentiating and setting $t$ equal to zero like last time will give us the moments we want. To see this, let's try the first ones. We know the variance is defined as $E[(x-\mu)^2]$, and can alternatively be defined as $E[x^2]-\mu^2$. Well, this is the second moment about the mean, so let's try that in our alternate and simpler MGF. We'd have to take the second derivative and set $t$ equal to zero. So let's start by differentiating it. Here:
$$\frac{d(e^{-t\mu}M_{x}(t))}{dt}=M'_{x}(t)e^{-t\mu}-\mu M_{x}(t)e^{-t\mu}$$
Now, this is the first moment about the mean. So it's essentially $E[x-\mu]=E[x]-E[\mu]=\mu-\mu=0$. Well, let's set $t=0$ to test that. We have:
$$M'_{x}(0)e^{0}-\mu M_{x}(0)e^{0}=\mu(1)-\mu(1)(1)=\mu-\mu=0$$
Using the different moments of $x$ and the fact that $e^{0}=1$. The second derivative being:
$$\frac{d^{2}(e^{-t\mu}M_{x}(t))}{dt^2}=M''_{x}(t)e^{-t\mu}-\mu M'_{x}(t)e^{-t\mu}-\mu M'_{x}(t)e^{-t\mu}+\mu^{2} M_{x}(t)e^{-t\mu}$$
Setting $t=0$, and using the moments from the last post, we get:
$$M''_{x}(0)e^{0}-2\mu M'_{x}(0)e^{0}+\mu^{2} M_{x}(0)e^{0}=E[x^2]-2\mu^{2}+\mu^{2}=E[x^2]-\mu^{2}$$
Thus what we were trying to prove. Next post in this series will be about multiple variables. I may also mix the MGF series with the discrete distributions to get the MGFs of them.
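As a concrete example of this machinery, take the Bernoulli distribution from the other posts, with MGF $M_{x}(t)=Pe^{t}+1-P$ and mean $\mu=P$. Writing $g(t)=e^{-tP}(Pe^{t}+1-P)$ and differentiating twice:
$$g'(t)=e^{-tP}\left(Pe^{t}-P(Pe^{t}+1-P)\right)$$
$$g''(t)=e^{-tP}\left(Pe^{t}-P^{2}e^{t}\right)-Pe^{-tP}\left(Pe^{t}-P(Pe^{t}+1-P)\right)$$
$$g''(0)=(P-P^{2})-P(P-P)=P(1-P)$$
Which is exactly the Bernoulli variance, just as the result above says it should be.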
Monday, April 22, 2013
Discrete PMFs part 3. Variance of the Binomial distribution.
Alright, the variance of the binomial distribution. Now, we know that $\sigma^{2}=E[x^{2}]-\mu^{2}$, but we can check it with the binomial distribution if we'd like. So we have:
$$\sum_{x=0}^{n}\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right) P^{x}(1-P)^{n-x}$$
And we also have that the mean, $\mu$, is $nP$. So, let's use our variance formula, which is $E[(x-\mu)^2]$. This becomes:
$$\sum_{x=0}^{n}(x-nP)^{2}\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right) P^{x}(1-P)^{n-x}$$
Expanding the $(x-nP)^{2}$ term gives $x^{2}-2xnP+(nP)^{2}$. Distributing the probability function over this, and using the fact that $\sum$ is a linear operator, gives:
$$\sum_{0}^{n}x^{2}\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right)P^{x}(1-P)^{n-x}-2nP\sum_{0}^{n}x\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right) P^{x}(1-P)^{n-x}+(nP)^{2}\sum_{x=0}^{n}\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right) P^{x}(1-P)^{n-x}$$
Where I was able to pull the $2nP$ out of the middle term since the summation doesn't affect it, as well as the $(nP)^2$ in the third term. This leaves it as $E[x]$, which we know is $nP$. Furthermore, the last term is simply the summation of the probability function, which my last post shows is $1$. So simplifying this down gives:
$$\sum_{0}^{n}x^{2}\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right)P^{x}(1-P)^{n-x}-2(nP)^{2}+(nP)^{2}=\sum_{0}^{n}x^{2}\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right)P^{x}(1-P)^{n-x}-(nP)^{2}$$
Which, as is easy to see, is $E[x^{2}]-\mu^{2}$, thus verifying the variance formula for the binomial distribution. Now our job is to figure out what $E[x^{2}]$ equals. Well, here's where we can use something we learned before: the MGF. This would look like:
$$\sum_{x=0}^{n}e^{tx}\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right) P^{x}(1-P)^{n-x}$$
And we know that we'll need $M''(0)$ to get $E[x^2]$. But before we can worry about the second moment specifically, let's worry about getting the MGF in a form that's more manageable. So let's use a bit of footwork:
$$\sum_{x=0}^{n}\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right) (e^{t}P)^{x}(1-P)^{n-x}=(e^{t}P+1-P)^{n}$$
By using the binomial theorem. Well, this is good, since now we can start taking derivatives. The first derivative is:
$$e^{t}Pn(e^{t}P+1-P)^{n-1}$$
By using the chain rule. The second we must use the product rule, which gives:
$$e^{t}Pn(e^{t}P+1-P)^{n-1}+e^{t}Pn\left(e^{t}P(n-1)(e^{t}P+1-P)^{n-2}\right)$$
Setting $t=0$ simplifies it to:
$$Pn+P^{2}n(n-1)$$
Plugging that back into our formula to find the variance gives:
$$\sigma^{2}=Pn+P^{2}n(n-1)-(nP)^{2}=Pn+P^{2}n^{2}-P^{2}n-P^{2}n^{2}=Pn-P^{2}n=Pn(1-P)$$
And now we have the variance in a manageable form. Next will be the Hypergeometric distribution.
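A quick simulation check of this in R (arbitrary $n$ and $P$; the sample variance will only approximate the formula, since the draws are random):
>n=10; P=.4
>var(rbinom(100000, n, P))   # sample variance of 100,000 binomial draws
>n*P*(1-P)                   # the formula; here 2.4, and the two should be close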
Wednesday, April 17, 2013
Discrete PMFs part 2. Binomial distribution. PMF proof and mean.
Now the binomial distribution. You'll see it's related to the Bernoulli for obvious reasons. It's defined as:
$$
B(n,p)=\left\{
\begin{array}{ll}
\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right) P^{x}(1-P)^{n-x} \quad x=0,1,2,...,n\\
0,\quad \qquad \qquad \qquad otherwise
\end{array}
\right.
$$
Where:
$$\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right)=\frac{{n!}}{{x!\left( {n - x} \right)!}}$$
Is the binomial coefficient. Before I explain its importance, let's worry about the other part. Now, the Bernoulli distribution was really just concerned with a single trial: one success or one failure. But what if we're worried about many successes and failures? Let's say we want to run four trials. Well, that looks like:
$$P^{x}(1-P)^{4-x}$$
Where $x$ is the number of successes, and hence $(4-x)$ is the number of failures. But as you'll notice, this isn't the whole probability of getting $x$ successes. Let's say that there were two successes. This would look like:
$$(P)^{2}(1-P)^2=(P)(P)(1-P)(1-P)$$
But, notice that this is the probability of getting two successes in a row, and then two failures. Another option would be:
$$(P)(1-P)(P)(1-P)$$
This is the same number of successes and failures, but in a completely different order. Since every such ordering counts toward the event we care about, this is where the binomial coefficient comes in: it counts up all the different orders in which we can get a certain number of successes and failures. So the full equation becomes:
$$\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right) P^{x}(1-P)^{n-x}$$
Which is the probability of getting a certain number of successes in one particular order, multiplied by the number of orders in which you can possibly get them. For instance, flipping a coin twice and getting one head and one tail would be:
$$\left( {\begin{array}{*{20}c} 2 \\ 1 \\ \end{array}} \right) (\frac{1}{2})^{1}(1-\frac{1}{2})^{1}$$
Now, the binomial coefficient for these values equals:
$$\frac{{2!}}{{1!\left( {1} \right)!}}=\frac{2}{1}=2$$
And, multiplying this by the probability side gives:
$$(2)(\frac{1}{2})(\frac{1}{2})=(\frac{1}{2})$$
Which makes sense. The probability of getting a head and a tail in one particular order is $\frac{1}{4}$, yet there are two different ways to get it: heads first, or heads second. So it becomes one half.
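If you have R handy, dbinom() confirms this in one line (its arguments are successes, trials, probability of success):
>dbinom(1, 2, .5)   # probability of exactly one head in two fair flips; returns 0.5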
Alright, good stuff. But what about showing it's a probability function? It's obvious that it has $0\leq P$, since it's the same as the Bernoulli distribution (but this time with more successes and failures). But how do we know it adds up to one? Well, let's sum up all of the terms. That gives:
$$\sum_{0}^{n}\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right) P^{x}(1-P)^{n-x}$$
Remember we're summing over the $x$ terms, since the number of trials is set. What varies is the number of successes and failures. So, we have:
$$(1-P)^{n}+nP(1-P)^{n-1}+...+nP^{n-1}(1-P)+P^{n}$$
The first term is the event of failing every trial, the next one is all the events of succeeding once and failing $n-1$ times, all the way up to succeeding $n$ times. Well, now we can use the binomial theorem, which states that:
$$(x+y)^{n}=\left( {\begin{array}{*{20}c} n \\ 0 \\ \end{array}} \right)x^{n}y^{0}+\left( {\begin{array}{*{20}c} n \\ 1 \\ \end{array}} \right)x^{n-1}y^{1}+...+\left( {\begin{array}{*{20}c} n \\ n-1 \\ \end{array}} \right)x^{1}y^{n-1}+\left( {\begin{array}{*{20}c} n \\ n \\ \end{array}} \right)x^{0}y^{n}$$
As you'll notice, if we use $y=1-P$ and $x=P$, this is exactly the same as our old equation. Well, reducing it to $(x+y)^n$ gives:
$$(x+y)^n=(P+1-P)^{n}=1^n=1$$
So this shows its total probability is $1$, which is what we were trying to prove. Now what about the expectation and variance? Let's start with the expectation. That is:
$$\sum_{0}^{n}x\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right)P^{x}(1-P)^{n-x}$$
Writing it out gives:
$$(0)\left( {\begin{array}{*{20}c} n \\ 0 \\ \end{array}} \right)P^{0}(1-P)^{n}+(1)\left( {\begin{array}{*{20}c} n \\ 1 \\ \end{array}} \right)P^{1}(1-P)^{n-1}+...+(n-1)\left( {\begin{array}{*{20}c} n \\ n-1 \\ \end{array}} \right)P^{n-1}(1-P)^{1}+(n)\left( {\begin{array}{*{20}c} n \\ n \\ \end{array}} \right)P^{n}(1-P)^{0}$$
Now, we can use this fact:
$$\frac{x(n)!}{(x)!(n-x)!}=\frac{(n)!}{(x-1)!(n-x)!}=n\left[\frac{(n-1)!}{(x-1)!(n-x)!}\right]$$
Now, the denominator of the fraction can be changed to $(x-1)!([n-1]-[x-1])!$, since simplifying the latter factorial still gives $(n-x)!$. Alright, well using this with our previous equation (since this is the coefficient of all of these), and simplifying a bit, we get:
$$nP\left[(1-P)^{n-1}+(n-1)P(1-P)^{n-2}+...+(n-1)P^{n-2}(1-P)+P^{n-1}\right]$$
Now, we can see the inside is a binomial expansion:
$$\sum_{y=0}^{n-1}\left( {\begin{array}{*{20}c} n-1 \\ y \\ \end{array}} \right) P^{y}(1-P)^{n-1-y}=(P+(1-P))^{n-1}=(1)^{n-1}=1$$
Where $y=x-1$. So the equation reduces to $nP$, which is the mean.
As for the variance, I'll save it for another blog post.
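Both facts from this post are easy to check numerically in R (arbitrary $n$ and $P$):
>n=10; P=.4
>sum(dbinom(0:n, n, P))           # total probability; should come out to 1
>sum((0:n) * dbinom(0:n, n, P))   # the mean; should equal n*P, which is 4 here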
$$
B(n,p)=\left\{
\begin{array}{ll}
\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right) P^{x}(1-P)^{n-x} \quad x=0,1,2,...,n\\
0,\quad \qquad \qquad \qquad otherwise
\end{array}
\right.
$$
Where:
$$\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right)=\frac{{n!}}{{x!\left( {n - x} \right)!}}$$
Is the binomial coefficient. Before I explain the importance of it, let's worry about the other part. Now, the Bernoulli distribution was really just concerned with one success and one failure. However, what if we're worried about many successes and failures? Let's say we want to run four different trials? Well, that looks like:
$$P^{x}(1-P)^{4-x}$$
Where $x$ is the number of success, and hence $(4-x)$ is the number of failures. But as you'll notice, this isn't the actual probability. Let's say that there were two successes. This would look like:
$$(P)^{2}(1-P)^2=(P)(P)(1-P)(1-P)$$
But, notice that this is the probability of getting two successes in a row, and then two failures. Another option would be:
$$(P)(1-P)(P)(1-P)$$
This is the the same amount of successes and failures, but a completely different order. Since order does matter for us, this is where the binomial coefficient comes in. It counts up all the different orders in which we can get a certain number of successes and failures. So the full equation becomes:
$$\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right) P^{x}(1-P)^{n-x}$$
Which is the probability of getting a certain number of successes in any order, multiplied the number of orders you can possibly get. For instance, flipping a coin twice and getting one head and one tail would be:
$$\left( {\begin{array}{*{20}c} 2 \\ 1 \\ \end{array}} \right) (\frac{1}{2})^{1}(1-\frac{1}{2})^{1}$$
Now, the binomial coefficient for these values equals:
$$\frac{{2!}}{{1!\left( {1} \right)!}}=\frac{2}{1}=2$$
And, multiplying this by the probability side gives:
$$(2)(\frac{1}{2})(\frac{1}{2})=(\frac{1}{2})$$
Which makes sense. The probability of getting a head then a tail in one order is $\frac{1}{4}$, yet there are two different ways to get it. Heads first, and heads second. So it becomes one half.
Alright, good stuff. But what about showing it's a probability function? It's obvious that it has $0\leq P$, since it's the same as the Bernoulli distribution (but this time with more successes and failures). But how do we know it adds up to one? Well, let's sum up all of the terms. That gives:
$$\sum_{0}^{n}\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right) P^{x}(1-P)^{n-x}$$
Remember we're summing up the $x$ terms, since the number of trials is set. What varies is the number or successes and failures. So, we have:
$$(1-P)^{n}+nP(1-P)^{n-1}+...+nP^{n-1}(1-P)+P^{n}$$
The first term is the event of failing every trial, the next one is all the events of succeeding once and failing $n-1$ times, all the way up to succeeding $n$ times. Well, now we can use the binomial theorem, which states that:
$$(x+y)^{n}=\left( {\begin{array}{*{20}c} n \\ 0 \\ \end{array}} \right)x^{n}y^{0}+\left( {\begin{array}{*{20}c} n \\ 1 \\ \end{array}} \right)x^{n-1}y^{1}+...+\left( {\begin{array}{*{20}c} n \\ n-1 \\ \end{array}} \right)x^{1}y^{n-1}+\left( {\begin{array}{*{20}c} n \\ n \\ \end{array}} \right)x^{0}y^{n}$$
As you'll notice, if we use $y=1-P$ and $x=P$, this is exactly the same as our old equation. Well, reducing it to $(x+y)^n$ gives:
$$(x+y)^n=(P+1-P)^{n}=1^n=1$$
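Before moving on, a quick numerical spot-check in R, using the arbitrary values $n=10$ and $P=0.4$ just for illustration; summing the binomial probabilities over $x=0,...,10$ should give exactly $1$:
>sum(dbinom(0:10,10,.4))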
So this shows its total probability is $1$, which is what we were trying to prove. Now what about the expectation and variance? Let's start with the expectation. That is:
$$\sum_{x=0}^{n}x\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right)P^{x}(1-P)^{n-x}$$
Writing it out gives:
$$(0)\left( {\begin{array}{*{20}c} n \\ 0 \\ \end{array}} \right)P^{0}(1-P)^{n}+(1)\left( {\begin{array}{*{20}c} n \\ 1 \\ \end{array}} \right)P^{1}(1-P)^{n-1}+...+\left( {\begin{array}{*{20}c} n \\ n-1 \\ \end{array}} \right)P^{n-1}(1-P)^{1}+\left( {\begin{array}{*{20}c} n \\ n \\ \end{array}} \right)P^{n}(1-P)^{0}$$
Now, we can use this fact:
$$x\cdot\frac{n!}{x!(n-x)!}=\frac{n!}{(x-1)!(n-x)!}=n\left[\frac{(n-1)!}{(x-1)!(n-x)!}\right]$$
Now, the denominator of the fraction can be changed to $(x-1)!([n-1]-[x-1])!$, since simplifying the latter factorial still gives $(n-x)!$. Alright, well using this on every term of our previous sum (it's the coefficient of each of them), and simplifying a bit, we get:
$$nP\left[(1-P)^{n-1}+(n-1)P(1-P)^{n-2}+...+(n-1)P^{n-2}(1-P)+P^{n-1}\right]$$
Now, we can see the bracketed part is just a binomial expansion:
$$\sum_{y=0}^{n-1}\left( {\begin{array}{*{20}c} n-1 \\ y \\ \end{array}} \right) P^{y}(1-P)^{n-1-y}=(P+(1-P))^{n-1}=(1)^{n-1}=1$$
Where $y=x-1$. So the equation reduces to $nP$, which is the mean.
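A quick numerical check of this in R, again with the arbitrary values $n=10$ and $P=0.4$; the weighted sum $\sum x\,P(X=x)$ should come out to $nP=4$:
>sum((0:10)*dbinom(0:10,10,.4))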
As for the variance, I'll save it for another blog post.
Tuesday, April 16, 2013
Discrete PMFs part 1. Bernoulli distribution.
Alright, I realize I still have to do that MGF post, but I also realized most of my stuff has dealt with continuous distributions. A lot of the proofs are pretty easy (just switch to summation notation), but I figured I'd do a post or two covering the main discrete distributions.
$$
P_{x}(x)=\left\{
\begin{array}{ll}
P^{x}(1-P)^{1-x} \quad x=0,1\\
0 \qquad \qquad \qquad otherwise
\end{array}
\right.
$$
This can be considered a success and failure distribution. For instance, if there are two mutually exclusive events, success=$1$ and failure=$0$, with a probability $P$ such that $0\leq P\leq 1$, then the probability of success is $P$, and failure is $(1-P)$. To see that, let's say it succeeds:
$$P^{1}(1-P)^{1-1}=P^{1}(1-P)^{0}=P(1)=P$$
Since we use the fact that $(1-P)$ raised to $0$ is $1$. For failure:
$$P^{0}(1-P)^{1-0}=(1)(1-P)=(1-P)$$
Using the fact that $P^0$ equals $1$. So, we've verified that the probability of success is $P$, and failure $(1-P)$. But, this is really just the beginning. How do we know that it's a probability function? Well, there are two conditions it must satisfy. Since $0\leq P$, we know that the probability of success is greater than or equal to zero. Furthermore, since $P\leq1$ we know that $(1-P)$ must be greater than or equal to $0$, so we know that both events have a probability greater than or equal to $0$. Now, since the two outcomes are mutually exclusive and cover every possibility, the total probability is just the sum:
$$\sum_{x}P_{x}(x)$$
Which, in this case is:
$$\sum_{x=0}^{1}P_{x}(x)=P_{x}(0)+P_{x}(1)=(1-P)+P=1$$
So this shows it's a PMF. What about its CDF (the cumulative distribution)? That's defined as:
$$F(x)=P[X\leq x]=\sum_{x_{i}\leq x}P_{x}(x_{i})$$
Where:
$$\lim_{x\rightarrow-\infty}F(x)=0$$
And,
$$\lim_{x\rightarrow\infty}F(x)=1$$
Something else worth covering is the expectation and the variance. For a discrete distribution the expected value is defined as:
$$\sum_{i=1}^{\infty}x_{i}P_{i}$$
For this case, it's:
$$\sum_{x=0}^{1}x P^{x}(1-P)^{1-x}=(0) P^{0}(1-P)^{1}+(1)P^{1}(1-P)^{0}=P$$
And the variance, defined as:
$$\sum_{i=1}^{\infty}(x_{i}-\mu)^{2}P_{i}$$
Which for the Bernoulli distribution is:
$$\sum_{x=0}^{1}(x-P)^{2}P^{x}(1-P)^{1-x}=(P)^{2}(1-P)+(1-P)^{2}P$$
Where we can pull out a $P$ and $(1-P)$ to get:
$$P[P+(1-P)](1-P)=P(1)(1-P)=P(1-P)$$
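If you want to double-check these two results numerically, here's a small R sketch, treating the Bernoulli as a binomial with a single trial and picking an arbitrary $P=0.3$ (so the mean should be $0.3$ and the variance $0.3\times 0.7=0.21$):
>p=.3
>x=0:1
>m=sum(x*dbinom(x,1,p))        # mean, should equal p
>sum((x-m)^2*dbinom(x,1,p))    # variance, should equal p*(1-p)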
That was a lot of the basic stuff for the Bernoulli distribution. I'll cover the binomial one next since it's so closely related.
Wednesday, April 10, 2013
Conditional Probability, Conditional Expectations, and Conditional Variance.
Alright, so I know I still owe you guys a finishing post about the MGF, but I wanted to give a quick one about conditional probability, conditional expectations, etc.
So from beginner probability theory we know that:
$$P(A|B)=\frac{P(A\cap B)}{P(B)}$$
Which is the probability that A happens, given that we already know B has happened. A way to understand this is to look at a simple set graph (specifically a Venn diagram). Take this:
Now, let's say that we already know the event happened in the space of $B$. What's the probability that $A$ happens, or happened? Well, since the event is in the space of $B$, the only way it could be in $A$ as well is if it's in the intersection. These are the only possibilities, since we know for certain that $B$ did happen. That's how the $P(A\cap B)$ term gets in there. Why do we divide it by $P(B)$? Well, the entire event space is now limited to the space of $B$: since $B$ happened, we must be somewhere in the $B$ circle. Dividing by $P(B)$ normalizes things so that, summing up all the different events within $B$, the total probability equals one, as it must by definition. It also keeps every conditional probability between $0$ and $1$, which makes sense. If $A$ completely covers $B$, then the probability that $A$ also happens is $1$, since no matter what, $B$ happening guarantees $A$ happened, or will happen. Similarly, if $A$ is disjoint with $B$, or doesn't intersect at all, then $B$ happening means $A$ is guaranteed not to happen. To use our notation so far, if $A$ completely covers $B$, then $P(A\cap B)=P(B)$, since they intersect on every point in $B$. Therefore:
$$P(A|B)=\frac{P(A\cap B)}{P(B)}=\frac{P(B)}{P(B)}=1$$
Likewise, if they don't intersect, then $P(A\cap B)=0$. Plugging that in, gives:
$$P(A|B)=\frac{P(A\cap B)}{P(B)}=\frac{0}{P(B)}=0$$
So it makes sense that every other way $A$ and $B$ can intersect gives a conditional probability in between $0$ and $1$.
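One way to convince yourself of the formula is a quick simulation in R. Take a made-up example: roll a die, let $B$ be the event "the roll is greater than 3" and $A$ the event "the roll is even"; both lines below estimate $P(A|B)$, and both should come out near $2/3$:
>rolls=sample(1:6,100000,replace=TRUE)
>A=rolls%%2==0
>B=rolls>3
>mean(A[B])             # directly: how often A holds among the rolls where B holds
>mean(A&B)/mean(B)      # the formula P(A and B)/P(B)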
Alright, so we know how to condition on events: knowing that one event happened, we can find the probability that another happens. Well, since we already know about marginal PDFs and joint PDFs, we can do exactly the same thing with continuous distributions. Take:
$$f_{X_{2}|X_{1}}( X_{2}|X_{1})=\frac{f_{X_{2},X_{1}}(X_{2},X_{1})}{f_{X_{1}}(X_{1})}$$
Notice this is the exact same as before, but now we have different ways of showing the probability. The left hand side denotes the conditional density of $X_{2}$, given that $X_{1}$ takes a particular value. The right hand side is exactly what we had before. The numerator is the joint distribution function, which is the density of $X_{1}$ and $X_{2}$ both taking specific values, while the denominator is the marginal PDF of $X_{1}$. So, it's the probability that both happen, divided by the probability that the variable we're conditioning on takes its specific value. Exactly the same as our conditional probability from earlier. Some cool stuff we can do is show that it's a good old fashioned PDF. If we integrate over all the values of $X_{2}$, we get:
$$\int_{A}\frac{f_{X_{2},X_{1}}(X_{2},X_{1})}{f_{X_{1}}(X_{1})}d{X_{2}}=\frac{1}{f_{X_{1}}(X_{1})}\int_{A}f_{X_{2},X_{1}}(X_{2},X_{1})d{X_{2}}$$
Where $A$ is the entire space that $X_{2}$ is defined on. Looking at the last part, the remaining integral is just the marginal PDF of $X_{1}$, so it becomes:
$$\frac{1}{f_{X_{1}}(X_{1})}\int_{A}f_{X_{2},X_{1}}(X_{2},X_{1})d{X_{2}}=\frac{f_{X_{1}}(X_{1})}{f_{X_{1}}(X_{1})}=1$$
So the total probability sums to one. We can also do some cool stuff, like finding the conditional probability over an interval:
$$\int_{a}^{b}\frac{f_{X_{2},X_{1}}(X_{2},X_{1})}{f_{X_{1}}(X_{1})}d{X_{2}}=P(a\leq X_{2}\leq{b}\,|\,X_{1})$$
In other words, the probability that $X_{2}$ is in a certain interval, given the value of $X_{1}$. We can also get the conditional expectation of any function $g(X_{2})$, by doing the expectation we're used to:
$$E[g(X_{2})|X_{1}] =\int_{A}g(X_{2})\frac{f_{X_{2},X_{1}}(X_{2},X_{1})}{f_{X_{1}}(X_{1})}d{X_{2}}=\int_{A}g(X_{2})f_{X_{2}|X_{1}}( X_{2}|X_{1})d{X_{2}}$$
And, likewise, we can do a conditional variance:
$$Var[X_{2}|X_{1}]=E\left[\left(X_{2}-E[X_{2}|X_{1}]\right)^{2}\,\middle|\,X_{1}\right]$$
Which, after expanding, gives us the usual equation for variance, but this time with everything conditional on $X_{1}$:
$$Var[X_{2}|X_{1}]=E[X_{2}^{2}|X_{1}]-\left(E[X_{2}|X_{1}]\right)^{2}$$
Which looks nearly the same as our usual definition for variance, which would be:
$$Var[X]=E[X^2]-\mu_{X}^{2}$$
But with conditional probabilities and expectations.
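To see these conditional quantities in action before looking at the graph, here's a rough simulation sketch in R. The relationship I generate ($X_{2}=2X_{1}+\text{noise}$) is just something made up for illustration, and conditioning on an interval of $X_{1}$ values stands in for conditioning on an exact value:
>x1=rnorm(100000)
>x2=2*x1+rnorm(100000)
>mean(x2[x1>1])        # estimate of E[X2 | X1 in that slice]
>mean(x2[x1<(-1)])     # same thing for low X1, should come out much lower
>var(x2[x1>1])         # conditional variance over the slice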
Now, one last thing. It's time to make sense of this all, and I'll use a convenient graph taken from here. (Great source for beginner econometrics, and it's free! I highly recommend you use this. It's by Bruce Hansen.):
Now I removed the original $X$ and $Y$ values so it wouldn't be confusing. I would highly recommend checking out his original example with wages conditioned on things like race, gender, and so on. It's a very easy to understand real world application. However, let's just use the $X$ axis as $X_{1}$, and the $Y$ axis as $X_{2}$. As you can see there's a contour map on the left graph, which shows where the realizations of these events are concentrated. Notice that at different values of $X_{1}$, there are different values that $X_{2}$ tends to take. The line going through it is the conditional expectation, meaning the expected value of $X_{2}$ given that we choose a particular value of $X_{1}$. For instance, if $X_{1}\leq{10}$, then a quick look at the graph tells us that the expected value of $X_{2}$ should be lower than if ${10}\leq{X_{1}}$. Well, let's check that. Taking specific values of $X_{1}$ and plotting the distributions of $X_{2}$ given those values gives the right hand graph. Turns out our first impression was right: at higher values of $X_{1}$, $X_{2}$ has a higher expected value. So depending on the value of $X_{1}$, we can expect different values of $X_{2}$. Likewise, we could go through the same exercise with the variance. As you can see, different values of $X_{1}$ give different variances of $X_{2}$.
So that's an intuitive way of looking at it. Check out the book.
Friday, March 29, 2013
Marginal pdf.
Alright, I mentioned continuing on with the moment generating function, but I realized that I need to cover something before I can get really in depth with that. What we have to cover now are marginal PDFs. (I rarely ever cover discrete examples, since most problems are meant for continuous cases. It shouldn't matter, since most of the proofs transfer easily to summation with no problem).
Alright, you know the drill with one variable functions. For instance, what's the mean of the PDF $g(X)$? Well, it's:
$$\mu=\int_{S} xg(x)dx$$
Where $S$ is the support of $X$. Well, that's no problem. But what about two variables? Now we have a PDF that looks like $g(X_{1},X_{2})$. Here's where "marginal" PDFs come in. We have two variables to integrate over. So, the new expectation function (for whatever function of $X_{1}$ and $X_{2}$ you'd like, I'll use $f(X_{1},X_{2})$) looks like:
$$\int_{S_{2}}\int_{S_{1}} f(X_{1},X_{2})g(X_{1},X_{2})dx_{1}dx_{2}$$
Where $S_{1}$ and $S_{2}$ are the supports of $X_{1}$ and $X_{2}$, respectively. Now, we can do something cool here. Since $g(X_{1},X_{2})$ is the joint PDF of both variables, what happens when we integrate it over the support of one? Take, for example:
$$\int_{S_{1}} g(X_{1},X_{2})dx_{1}$$
Well, what does that give us? Keep in mind this effectively integrates the variable $X_{1}$ out of the equation, so all that's left is the variable $X_{2}$ in the PDF. Therefore, the density only depends on the value of $X_{2}$. This makes it the marginal PDF of $X_{2}$. So, if we have a function like this:
$$\int_{B}\int_{S_{1}}X_{2}g(X_{1},X_{2})dx_{1}dx_{2}$$
Where $S_{1}$ is the entire support of $X_{1}$, and $B$ is the support of $X_{2}$. Well, we can pull the $X_{2}$ out of the inner integral since it only integrates over $X_{1}$, and that gives us:
$$\int_{B}X_{2}\left[\int_{S_{1}}g(X_{1},X_{2})dx_{1}\right]dx_{2}$$
The integral in the inner brackets is exactly the marginal PDF of $X_{2}$, so we'll denote that by $f_{2}(X_{2})$. Plugging that in gives us:
$$\int_{B}X_{2}f_{2}(X_{2})dx_{2}$$
Which is just the mean of $X_{2}$. You can also play around with more variables, and perhaps prove to yourself that expectation stays linear with more than one variable. For now this should be good enough to finish my MGF post.
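To make the marginal PDF idea concrete, here's a small numerical sketch in R. I'm making up a joint PDF for illustration, $g(x_{1},x_{2})=x_{1}+x_{2}$ on the unit square (it integrates to one there). Integrating out $x_{1}$ numerically gives the marginal PDF of $X_{2}$, and integrating $x_{2}$ against that marginal gives its mean, which should come out near $7/12\approx 0.583$:
>g=function(x1,x2) x1+x2    # a made-up joint PDF on the unit square
>f2=function(x2) sapply(x2,function(b) integrate(function(a) g(a,b),0,1)$value)    # marginal PDF of X2
>integrate(function(x2) x2*f2(x2),0,1)$value    # mean of X2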
Wednesday, March 27, 2013
Moment Generating Function.
As promised, the moment generating function. The moment generating function is generally applied in statistics to save us from very complex and tedious calculations. For instance, functions like these:
$$E[X],~E[X^2],~E[X^3],...$$
And these:
$$E[(X-\mu)],~E[(X-\mu)^2],~E[(X-\mu)^{3}],...$$
The first sequence of functions are known as moments. $E[X]$ is called the "first" moment, $E[X^2]$ the "second" moment, and so on. As for the second sequence, these are known as the moments about the mean, and like last time are known as the first moment about the mean, the second, and so forth. Now, how can we solve for these? Well, the first is pretty easy. If $X$ is just raised to the first power then it becomes $\mu$, or the mean of $X$. For $E[(X-\mu)]$ we know it's zero, since the expectation operator is a linear operator, or $E[(X-\mu)]=E[X]-E[\mu]=\mu-\mu=0$. But what about higher moments? Well, for the second moment we could use the fact that:
$$\sigma^2=E[X^2]-\mu^2$$
And solve for $E[X^2]$, but that assumes we already know $\sigma^2$ as well as the mean. So how can we do better? This is where the moment generating function comes in. I'll give the definition first, and then try to explain the reasoning. Without much further ado:
$$M_{X}(t)\equiv\int_{-\infty}^{\infty}e^{tX}f(X)dx,\quad t\in\mathbb{R}$$
Now, the cool thing about the function $e^{tX}$ is that it has some useful properties that just so happen to be exactly what we need. For instance:
$$e^{tX}=1+\frac{tX}{1!}+\frac{t^{2}X^{2}}{2!}+\frac{t^{3}X^{3}}{3!}+...$$
Notice all those beautiful moments? That's exactly what we're looking for. If we can take advantage of that to find the value of whatever moments we need, then we're set. Let's start by plugging it in:
$$\int_{-\infty}^{\infty}\left(1+\frac{tX}{1!}+\frac{t^{2}X^{2}}{2!}+\frac{t^{3}X^{3}}{3!}+...\right)f(X)dx$$
Now, this looks sort of like a mess as is, but let's start by distributing out the function $f(X)$:
$$\int_{-\infty}^{\infty}\left(f(x)+\frac{tXf(X)}{1!}+\frac{t^{2}X^{2}f(X)}{2!}+...\right)dx$$
Now, remember the integral is a linear operator in the sense that we can do this:
$$\int_{-\infty}^{\infty}(f(x))dx+\int_{-\infty}^{\infty}\frac{tXf(X)}{1!}dx+\int_{-\infty}^{\infty}\frac{t^{2}X^{2}f(X)}{2!}dx+...$$
The integrals themselves are all for X, so that means we can pull out any constants or variables that aren't $X$. So this becomes:
$$1+\frac{t}{1!}\int_{-\infty}^{\infty}Xf(X)dx+\frac{t^2}{2!}\int_{-\infty}^{\infty}X^{2}f(X)dx+\frac{t^3}{3!}\int_{-\infty}^{\infty}X^{3}f(X)dx...$$
The first term became one because it was simply the area under the probability function, which, by definition, equals one. Now, here's the interesting thing. All of the integrals become the different moments. See how this works. $E[X]$ is the first moment, which also happens to be the second term. $E[X^2]$ is the third term, and is the second moment. And so on. What about the first term? Well, it's the same as $E[X^0]$, which is the expected value of a constant $1$, which is obviously one. This is technically the zeroth (zeroth?) moment. Alright, now we have all the moments in one long formula. What now? Now we can differentiate with respect to $t$ to get what we want. Setting the full function up there equal to $\phi$ and taking the first derivative gives us:
$$ \frac{d\phi}{dt}=\int_{-\infty}^{\infty}Xf(X)dx+\frac{t}{1!}\int_{-\infty}^{\infty}X^{2}f(X)dx+\frac{t^2}{2!}\int_{-\infty}^{\infty}X^{3}f(X)dx+...$$
Well, that just got rid of the constant, and shifted the moments down. However, notice the first term. It's the first moment, but there's no $t$ attached to it. What can we do? Since this is a function of $t$, we can set $t$ to zero. All the terms after the first moment are eliminated, and all that's left is the first moment. Here's the incredibly useful part. Taking the derivative again gets rid of the first moment. Now, we have the second moment unattached to $t$, with all other moments still multiplied by $t$, and as we keep differentiating the factorials on the bottom of the fractions disappear. What can we conclude? Whatever moment you're looking for is equal to:
$$E[X^n]=\left.\frac{d^{n}\phi}{dt^{n}}\right|_{t=0}$$
So, let's say you want the nth moment. Well, take the nth derivative of the MGF with respect to $t$, and then set $t$ equal to zero. This gives us all moments out to infinity.
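As a quick worked example, take the standard exponential density $f(x)=\lambda e^{-\lambda x}$ for $x\geq 0$. Its MGF is
$$M_{X}(t)=\int_{0}^{\infty}e^{tx}\lambda e^{-\lambda x}dx=\frac{\lambda}{\lambda-t},\quad t<\lambda$$
Differentiating once gives $\frac{\lambda}{(\lambda-t)^{2}}$, which at $t=0$ is $\frac{1}{\lambda}$, the mean. Differentiating again gives $\frac{2\lambda}{(\lambda-t)^{3}}$, which at $t=0$ is $\frac{2}{\lambda^{2}}$, the second moment. So the variance is $\frac{2}{\lambda^{2}}-\frac{1}{\lambda^{2}}=\frac{1}{\lambda^{2}}$, with only one integral ever computed.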
Next post I'll go into moments about a constant, most notably the mean.
Sunday, March 24, 2013
Regression formula.
Alright, a simple linear regression. I was actually thinking of skipping this (since there is an over-abundance of material on this anyways), but I think it'll be helpful to show the intuitive way I learned this.
Alright, so let's say, hypothetically, that there was a linear equation that could match the data we have above. It would look something like:
$$y=\beta_{0} +\beta_{1}x$$
Well, we know that there's no way the data will actually fit this line exactly. Just look at it and you can tell there's no line that can fit what we have. But that's fine, since we just want the line that fits best. How do we do that? Well, we can use our actual outcomes, $y_{k}$, and compare them to the predicted values $\beta_{0}+\beta_{1}x_{k}$ for the $k$th observation, whatever it may be. So let's take all of the actual and predicted outcomes, find the differences, and square them. That gives:
$$(y_{1}-\beta_{0} -\beta_{1}x_{1})^2+(y_{2}-\beta_{0} -\beta_{1}x_{2})^2+...+(y_{n}-\beta_{0} -\beta_{1}x_{n})^2$$
Which can simplify to:
$$\sum_{k=1}^{n} (y_{k}-\beta_{0} -\beta_{1}x_{k})^2$$
Which is pretty easy to understand. That's the error, or squared difference between the regression line and the data points; call this sum $\phi$. I'll explain later the reason for having it squared (there are other ways of getting the same result where it's needed, but for now let's see the optimization reason). Now, here's a tricky part. We want to minimize these differences. Well, that means we'll have to take the derivative and solve for zero. But how can we do that? Keep in mind that the variables we're minimizing over aren't $x$ and $y$. In fact, those are set. They're data points. What we can control, however, are $\beta_{0}$ and $\beta_{1}$. Which makes sense, because we're trying to fit a line to the data, not the data to a line. So we're essentially moving the intercept and slope around to find the best fit given the data that's already determined. This goes right into partial differentiation, which gives us:
$$\frac{\partial\phi}{\partial\beta_{0}}=-2\sum_{k=1}^{n}(y_{k}-\beta_{0} -\beta_{1}x_{k})=0$$
And:
$$\frac{\partial\phi}{\partial\beta_{1}}=-2\sum_{k=1}^{n}x_{k}(y_{k}-\beta_{0} -\beta_{1}x_{k})=0$$
Now, we set both to zero to get a minimum (we could get a maximum, or perhaps some saddle point, but luckily due to the squared terms we know it's a parabolic function that's always non-negative). Furthermore, note that the squared term allowed us to keep the differences positive, but also is convenient for differentiating. Absolute value signs, although convenient in some cases, would have been a mess here.
Anyways, now we just have two equations, with two unknown variables $\beta_{0}$ and $\beta_{1}$. So let's start with the first. Using a little algebraic footwork, we get:
$$\sum(y)-\sum(\beta_{0})-\sum(x\beta_{1})=0$$
(I dropped some notation to save clutter from this point on). Now, both $\beta$s are constants. Therefore, the $\beta_{0}$ is just the same constant summed up n times, which turns it into $n\beta_{0}$. The second beta can be pulled out, which gives $\beta_{1}\sum(x)$. Solving for $\beta_{0}$ gives:
$$\beta_{0}=\frac{\sum(y)-\beta_{1}\sum(x)}{n}$$
Good. Now we have one unknown in terms of the other. And it only took me the greater part of 20 minutes. Now to solve for $\beta_{1}$. Taking the equation we had earlier, and using some of the same tools we did last time, we can get:
$$\beta_{0}(\sum x)+\beta_{1}(\sum x^2)=\sum xy$$
Well, we know the first $\beta$, so let's plug that in. That gives us:
$$\sum x\left(\frac{\sum(y)-\beta_{1}\sum(x)}{n}\right)+\beta_{1}(\sum x^2)=\sum xy$$
Well, let's separate the $\beta_{1}$ from the rest, and move stuff around. That gives:
$$\beta_{1}\left((\sum x^2)-\frac{(\sum x)^2}{n}\right)=\sum xy -(\frac{\sum x\sum y}{n})$$
Taking the last step gives us:
$$\beta_{1}=\frac{\sum xy -(\frac{\sum x\sum y}{n})}{(\sum x^2)-\frac{(\sum x)^2}{n}}$$
And now we're done. We have the slope in terms of the data, so it's set. And once we plug in the data points, and solve for $\beta_{1}$, we can then plug that into our equation for $\beta_{0}$ and get that. Keep in mind that other explanations might slightly differ due to averages. When the sum of x is divided by n, they sometimes just turn it into the average, or do other things like that. All the same, but I didn't want to make too many changes since it might confuse someone who's just learning it.
Anyways, plugging that into our linear formula and then graphing will give us something like this:
Not bad. Took me forever. But not bad. Moment generating function next? I hope not.
Thursday, March 21, 2013
Testing.
$$\prod_{\alpha}^{\beta-1}f(\Psi)^{1-\Omega}, \forall{\Psi:0\leq\Psi\leq\infty}$$
Well that worked. Now I just have to mess with the font and awful layout.
Wednesday, March 20, 2013
Beginner regression in R.
I guess the best place to start would be a simple linear regression. I won't explain a regression, since I think I'll do that in a later proof. This will be more related to R, which is a statistical program. I highly recommend picking up R, since it's a free and easy way to learn beginner statistical coding. As you can see above I plotted randomly generated data in it. To create the same yourself, plug in:
>x=rnorm(1000,4,5)
Which gives you a random, normally distributed variable "x", which you have 1000 observations for. It has a mean of 4 and a standard deviation of 5 (the third argument to rnorm() is the standard deviation, not the variance). To create a regression out of this, let's make a linear form with a dependent variable "y":
>y=2*x+rnorm(1000,0,10)
So, as you can tell, the first term is just 2 multiplied by the observations we just created, which gives a slope of 2 when y is plotted against x. The added term on the end is the error term, which has a mean of 0 and a standard deviation of 10. You can plot this by putting "plot(x,y)" into the command line. To put a regression line through it, you'd start with this:
>lm(y~x)
Where "lm" stands for "linear model". Entering that in will give you the slope and intercept:
Now, to create the line through it, we put:
>abline(lm(y~x))
Which gives:
I highly recommend playing around with this stuff. For instance, what about no error? Well, then you can perfectly predict what any value of x will give you in terms of y, and vice versa. Besides the fact that we only have discrete data points rather than a continuous line, it's essentially the same as the linear relation y=2x. Take a look:
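If you want to try that yourself, something along these lines should reproduce it (reusing the x from before, just without the error term):
>y2=2*x
>plot(x,y2)
>abline(lm(y2~x))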
Alright, so this was random data, not the real deal. But it does give a relatively easy way of messing around with data, and just get a feel for what playing with a statistics program should be like. Next time I'll try and do a proof.
Tuesday, March 19, 2013
The science of society.
I'm creating this blog to divert the math away from my other blog, found here. Although I do love writing non-rigorous and non-technical posts, I also enjoy the occasional proof and econometric work. So that's what this blog is for. I'll give simple and intuitive proofs, explanations, etc., of anything mathematically related, with an emphasis on econometrics. Since I'm just starting out with statistical software as well, I'll also give some beginner lessons so that I too can get practice. The best way to start is a few quotes, so here goes:
"Science cannot solve the ultimate mystery of nature. And that is
because, in the last analysis, we ourselves are a part of the mystery
that we are trying to solve."
-Max Planck
"Imagine how much harder physics would be if electrons had feelings!"
-Richard Feynman