Friday, April 26, 2013

MGF of the Bernoulli and Binomial distributions.

Alright, a quick post about MGFs and the Bernoulli and Binomial distributions. Since the Bernoulli distribution is simply the Binomial distribution for the special case of $n=1$, the proof for the Binomial distribution will automatically cover the Bernoulli distribution. I already used the MGF to find $E[x^2]$ in an earlier post, but I'll restate it here. The MGF for a discrete distribution is:
$$\sum_{x=0}^{n}e^{tx}P_{x}(x)$$
Which for the Binomial distribution would be:
$$\sum_{x=0}^{n}e^{tx}\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right) P^{x}(1-P)^{n-x}$$
Combining the $e^{tx}$ and $P^{x}$ factors gives:
$$\sum_{x=0}^{n}\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right) (Pe^{t})^{x}(1-P)^{n-x}$$
Now, because of the Binomial theorem we know that:
$$\sum_{x=0}^{n}\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right) (Pe^{t})^{x}(1-P)^{n-x}=(Pe^{t}+1-P)^{n}$$
This is a much more manageable form. Now we can take derivatives of this and get whatever moment we want. Let's use the general case of $n$, but then check it for $n=1$ (i.e. the Bernoulli distribution).
Well, (setting $(Pe^{t}+1-P)^{n}=M_{x}(t)$) the first derivative is:
 $$\frac{d(M_{x})}{dt}=e^{t}Pn(e^{t}P+1-P)^{n-1}$$
By using the chain rule. Evaluating this at $t=0$ gives the first moment, which is the mean. Let's check it:
$$e^{0}Pn(e^{0}P+1-P)^{n-1}=nP$$
Well, for the Bernoulli distribution we showed that the mean is $P$, and for the Binomial we showed it was $nP$, so this matches perfectly: we have the Binomial mean, and setting $n=1$ gives the Bernoulli mean. Now for the second moment and the variance. To do this we can use the fact that $\sigma^{2}=E[x^2]-\mu^2$. We know $\mu$, but we still need $E[x^2]$, the second moment. Taking the second derivative gives:
$$\frac{d^{2}(M_{x})}{dt^{2}}=e^{t}Pn(e^{t}P+1-P)^{n-1}+e^{t}Pn\left(e^{t}P(n-1)(e^{t}P+1-P)^{n-2}\right)$$
By use of the product rule. Setting $t=0$, taking note of the fact that $(P+1-P)=1$, and then simplifying gives us:
 $$Pn+(P)^{2}n(n-1)=Pn+(Pn)^{2}-(P)^{2}n$$
Now, plugging this back into the variance formula, as well as the mean, gives us:
$$\sigma^{2}=Pn+(Pn)^{2}-(P)^{2}n-(Pn)^{2}=Pn-(P)^{2}n=Pn(1-P)$$
Which is the variance of the Binomial distribution. Setting $n=1$ gives us $P(1-P)$, which is exactly the variance for the Bernoulli distribution.
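If you'd like to double-check the algebra rather than take my word for it, here's a quick sympy sketch (assuming you have sympy available; the variable names are just my own choices) that differentiates the MGF symbolically:
```python
# Symbolically check the Binomial MGF moments derived above (assumes sympy is installed).
import sympy as sp

t, P, n = sp.symbols('t P n', positive=True)

# The MGF we derived: (P*e^t + 1 - P)^n
M = (P * sp.exp(t) + 1 - P) ** n

first_moment = sp.simplify(sp.diff(M, t).subs(t, 0))       # E[x], should be n*P
second_moment = sp.simplify(sp.diff(M, t, 2).subs(t, 0))   # E[x^2]
variance = sp.simplify(second_moment - first_moment**2)    # E[x^2] - mu^2

print(first_moment)           # n*P
print(variance)               # n*P*(1 - P), possibly printed in an expanded form
print(variance.subs(n, 1))    # the Bernoulli case, P*(1 - P) (possibly expanded)
```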

Moment generating function Pt 2.

Alright, not really needed, but I'm going to throw in a few things that may come up; essentially nifty little fun facts. Let's start with the scenario where you want a moment about the mean. To get that result we just start over with:

$$MGF\equiv\int_{-\infty}^{\infty}e^{tX}f(X)dx,~t\in\mathbb{R}$$

And replace $e^{tX}$ with $e^{t(X-\mu)}$. The rest is pretty straightforward. We could sub in the series expansion of the exponential and get:

 $$e^{t(X-\mu)}=1+\frac{t(X-\mu)}{1!}+\frac{t^{2}(X-\mu)^{2}}{2!}+\frac{t^{3}(X-\mu)^{3}}{3!}+...$$

Following the steps of last time would give you essentially the same idea. However, there's another way we could go about it. Since we have $e^{t(X-\mu)}$, we can transform it into $e^{tX}e^{-t\mu}$. You'll notice the former is just what we had earlier, and the latter is a new term. But let's write it out:

$$\int_{-\infty}^{\infty}e^{-t\mu}e^{tX}f(X)dx$$

Well, the integral is just with respect to $X$, so we can treat  $e^{-t\mu}$ as a constant and pull it out to get:

$$e^{-t\mu}\int_{-\infty}^{\infty}e^{tX}f(X)dx$$

Where the integral on the right is exactly the same as the MGF we had before. So this now becomes:

 $$e^{-t\mu}M_{x}(t)$$

Differentiating and setting $t$ equal to zero, like last time, will give us the moments we want. To see this, let's try the first couple. We know the variance is defined as $E[(x-\mu)^2]$, and can alternatively be written as $E[x^2]-\mu^2$. Well, the variance is the second moment about the mean, so let's get it from our alternate and simpler MGF. We'll have to take the second derivative and evaluate it at $t=0$. So let's start by differentiating once. Here:
$$\frac{d\left(e^{-t\mu}M_{x}(t)\right)}{dt}=M'_{x}(t)e^{-t\mu}-\mu M_{x}(t)e^{-t\mu}$$
Evaluated at $t=0$, this is the first moment about the mean, which is essentially $E[x-\mu]=E[x]-E[\mu]=\mu-\mu=0$. Well, let's set $t=0$ to test that. We have:
$$M'_{x}(0)e^{0}-\mu M_{x}(0)e^{0}=\mu(1)-\mu(1)(1)=\mu-\mu=0$$
Using the moments of $x$ (namely $M'_{x}(0)=\mu$ and $M_{x}(0)=1$) and the fact that $e^{0}=1$. The second derivative is:
$$\frac{d^{2}\left(e^{-t\mu}M_{x}(t)\right)}{dt^2}=M''_{x}(t)e^{-t\mu}-\mu M'_{x}(t)e^{-t\mu}-\mu M'_{x}(t)e^{-t\mu}+\mu^{2} M_{x}(t)e^{-t\mu}$$
Setting $t=0$, and using the moments from the last post, we get:
$$M''_{x}(0)e^{0}-2\mu M'_{x}(0)e^{0}+\mu^{2} M_{x}(0)e^{0}=E[x^2]-2\mu^{2}+\mu^{2}=E[x^2]-\mu^{2}$$

Which is what we were trying to prove. The next post in this series will be about multiple variables. I may also mix the MGF series with the discrete distributions and derive their MGFs.
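
To see this machinery in action, here's a short sympy sketch (again just my own illustration, reusing the Binomial MGF from the last post) that builds $e^{-t\mu}M_{x}(t)$ and reads off the first two moments about the mean:
```python
# Check that differentiating e^(-t*mu) * M_x(t) at t = 0 gives central moments
# (assumes sympy; the Binomial MGF is used as a concrete example).
import sympy as sp

t, P, n = sp.symbols('t P n', positive=True)

M = (P * sp.exp(t) + 1 - P) ** n        # ordinary MGF of the Binomial
mu = sp.diff(M, t).subs(t, 0)           # the mean, n*P
central_mgf = sp.exp(-t * mu) * M       # the MGF "about the mean"

first_central = sp.simplify(sp.diff(central_mgf, t).subs(t, 0))      # E[x - mu]
second_central = sp.simplify(sp.diff(central_mgf, t, 2).subs(t, 0))  # E[(x - mu)^2]

print(first_central)    # 0
print(second_central)   # the variance, n*P*(1 - P) (possibly in expanded form)
```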

Monday, April 22, 2013

Discrete PMFs part 3. Variance of the Binomial distribution.

Alright, the variance of the binomial distribution. Now, we know that $\sigma^{2}=E[x^{2}]-\mu^{2}$, but we can check it with the binomial distribution if we'd like. So we have:
$$\sum_{x=0}^{n}\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right) P^{x}(1-P)^{n-x}$$
And we also have that the mean, $\mu$, is $nP$. So, let's use our variance formula, which is $E[(x-\mu)^2]$. This becomes:
$$\sum_{x=0}^{n}(x-nP)^{2}\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right) P^{x}(1-P)^{n-x}$$
Expanding the $(x-nP)^{2}$ term gives $(x^{2}-2xnP+[nP]^{2})$. Distributing the probability function through, and using the $\sum$ as a linear operator, gives:
$$\sum_{0}^{n}x^{2}\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right)P^{x}(1-P)^{n-x}-2nP\sum_{0}^{n}x\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right) P^{x}(1-P)^{n-x}+(nP)^{2}\sum_{x=0}^{n}\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right) P^{x}(1-P)^{n-x}$$
Where I was able to pull the $2nP$ out of the middle term since the summation doesn't affect it, as well as the $(nP)^2$ in the third term. This leaves it as $E[x]$, which we know is $nP$. Furthermore, the last term is simply the summation of the probability function, which my last post shows is $1$. So simplifying this down gives:
$$\sum_{0}^{n}x^{2}\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right)P^{x}(1-P)^{n-x}-2(nP)^{2}+(nP)^{2}=\sum_{0}^{n}x^{2}\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right)P^{x}(1-P)^{n-x}-(nP)^{2}$$
Which, as is easy to see, is $E[x^{2}]-\mu^{2}$, thus verifying the variance formula for the binomial distribution. Now our job is to figure out what $E[x^{2}]$ equals. Well, here's where we can use something we learned before: the MGF. This would look like:
$$\sum_{x=0}^{n}e^{tx}\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right) P^{x}(1-P)^{n-x}$$
And we know that we'll need $M''(0)$ to get $E[x^2]$. But before we can worry about the second moment specifically, let's worry about getting the MGF in a form that's more manageable. So let's use a bit of footwork:
$$\sum_{x=0}^{n}\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right) (e^{t}P)^{x}(1-P)^{n-x}=(e^{t}P+1-P)^{n}$$
By using the binomial theorem. Well, this is good, since now we can start taking derivatives. The first derivative is:
$$e^{t}Pn(e^{t}P+1-P)^{n-1}$$
By using the chain rule. For the second we must use the product rule, which gives:
$$e^{t}Pn(e^{t}P+1-P)^{n-1}+e^{t}Pn\left(e^{t}P(n-1)(e^{t}P+1-P)^{n-2}\right)$$
Setting $t=0$ simplifies it to:
$$Pn+P^{2}n(n-1)$$
Plugging that back into our formula to find the variance gives:
$$\sigma^{2}=Pn+P^{2}n(n-1)-(nP)^{2}=Pn+P^{2}n^{2}-P^{2}n-P^{2}n^{2}=Pn-P^{2}n=Pn(1-P)$$
And now we have the variance in a manageable form. Next will be the Hypergeometric distribution.
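As a sanity check on all of that algebra, here's a brute-force numerical version for some example values (the choices $n=10$, $P=0.3$ are arbitrary):
```python
# Brute-force E[x], E[x^2], and the variance by summing the Binomial PMF directly.
from math import comb

n, P = 10, 0.3
pmf = [comb(n, x) * P**x * (1 - P)**(n - x) for x in range(n + 1)]

mean = sum(x * p for x, p in enumerate(pmf))
second_moment = sum(x**2 * p for x, p in enumerate(pmf))
variance = second_moment - mean**2

print(mean, n * P)                # both approximately 3.0
print(variance, n * P * (1 - P))  # both approximately 2.1, up to floating-point error
```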

Wednesday, April 17, 2013

Discrete PMFs part 2. Binomial distribution. PMF proof and mean.

Now the binomial distribution. You'll see it's related to the Bernoulli for obvious reasons. It's defined as:
$$
 B(n,p)=\left\{
\begin{array}{ll}
\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right) P^{x}(1-P)^{n-x} \quad x=0,1,2,...,n\\
0,\quad \qquad \qquad \qquad otherwise
\end{array}
\right.
$$
Where:
$$\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right)=\frac{{n!}}{{x!\left( {n - x} \right)!}}$$
Is the binomial coefficient. Before I explain its importance, let's worry about the other part. The Bernoulli distribution was concerned with a single trial: one success or one failure. However, what if we're worried about many successes and failures? Let's say we want to run four trials. Well, that looks like:
$$P^{x}(1-P)^{4-x}$$
Where $x$ is the number of successes, and hence $(4-x)$ is the number of failures. But as you'll notice, this isn't the full probability yet. Let's say that there were two successes. This would look like:
$$(P)^{2}(1-P)^2=(P)(P)(1-P)(1-P)$$
But, notice that this is the probability of getting two successes in a row, and then two failures. Another option would be:
$$(P)(1-P)(P)(1-P)$$
This is the same number of successes and failures, but in a completely different order. Since every ordering with the same number of successes counts toward that probability, this is where the binomial coefficient comes in: it counts up all the different orders in which we can get a certain number of successes and failures. So the full equation becomes:
$$\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right) P^{x}(1-P)^{n-x}$$
Which is the probability of getting a certain number of successes in one particular order, multiplied by the number of orders in which you can get them. For instance, flipping a coin twice and getting one head and one tail would be:
$$\left( {\begin{array}{*{20}c} 2 \\ 1 \\ \end{array}} \right) (\frac{1}{2})^{1}(1-\frac{1}{2})^{1}$$
Now, the binomial coefficient for these values equals:
$$\frac{{2!}}{{1!\left( {1} \right)!}}=\frac{2}{1}=2$$
And, multiplying this by the probability side gives:
$$(2)(\frac{1}{2})(\frac{1}{2})=(\frac{1}{2})$$
Which makes sense. The probability of getting one head and one tail in a particular order is $\frac{1}{4}$, yet there are two different orders that give it: heads then tails, or tails then heads. So it becomes one half.
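If you have scipy handy, the coin-flip example is a one-liner to verify (this is just a numerical check, not part of the derivation):
```python
# Probability of exactly 1 head in 2 fair flips, via scipy's Binomial PMF.
from scipy.stats import binom

print(binom.pmf(1, 2, 0.5))   # approximately 0.5
```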

Alright, good stuff. But what about showing it's a probability function? It's obvious that every term is nonnegative, since $0\leq P\leq 1$, just as with the Bernoulli distribution (but this time with more successes and failures). But how do we know it adds up to one? Well, let's sum up all of the terms. That gives:
$$\sum_{0}^{n}\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right) P^{x}(1-P)^{n-x}$$
Remember we're summing over $x$, since the number of trials is set; what varies is the number of successes and failures. So we have:
$$(1-P)^{n}+nP(1-P)^{n-1}+...+nP^{n-1}(1-P)+P^{n}$$
The first term is the event of failing every trial, the next one is all the events of succeeding once and failing $n-1$ times, all the way up to succeeding  $n$ times. Well, now we can use the binomial theorem, which states that:
$$(x+y)^{n}=\left( {\begin{array}{*{20}c} n \\ 0 \\ \end{array}} \right)x^{n}y^{0}+\left( {\begin{array}{*{20}c} n \\ 1 \\ \end{array}} \right)x^{n-1}y^{1}+...+\left( {\begin{array}{*{20}c} n \\ n-1 \\ \end{array}} \right)x^{1}y^{n-1}+\left( {\begin{array}{*{20}c} n \\ n \\ \end{array}} \right)x^{0}y^{n}$$
As you'll notice, if we use $y=1-P$ and $x=P$, this is exactly the same as our old equation. Well, reducing it to $(x+y)^n$ gives:
$$(x+y)^n=(P+1-P)^{n}=1^n=1$$
So this shows its total probability is $1$, which is what we were trying to prove. Now what about the expectation and variance? Let's start with the expectation. That is:
$$\sum_{0}^{n}x\left( {\begin{array}{*{20}c} n \\ x \\ \end{array}} \right)P^{x}(1-P)^{n-x}$$
Writing it out gives:
$$(0)\left( {\begin{array}{*{20}c} n \\ 0 \\ \end{array}} \right)P^{0}(1-P)^{n}+(1)\left( {\begin{array}{*{20}c} n \\ 1 \\ \end{array}} \right)P^{1}(1-P)^{n-1}+...+(n-1)\left( {\begin{array}{*{20}c} n \\ n-1 \\ \end{array}} \right)P^{n-1}(1-P)^{1}+(n)\left( {\begin{array}{*{20}c} n \\ n \\ \end{array}} \right)P^{n}(1-P)^{0}$$
Now, we can use this fact:
$$\frac{x(n)!}{(x)!(n-x)!}=\frac{(n)!}{(x-1)!(n-x)!}=n\left[\frac{(n-1)!}{(x-1)!(n-x)!}\right]$$
Now, the denominator of the fraction can be changed to $(x-1)!([n-1]-[x-1])!$, since simplifying the latter factorial still gives $(n-x)!$. Alright, well using this with our previous equation (since this is the coefficient of all of these), and simplifying a bit, we get:
$$nP\left[(1-P)^{n-1}+(n-1)P(1-P)^{n-2}+...+(n-1)P^{n-2}(1-P)+P^{n-1}\right]$$
Now, we can see the bracketed sum is just a binomial expansion:
$$\sum_{y=0}^{n-1}\left( {\begin{array}{*{20}c} n-1 \\ y \\ \end{array}} \right) P^{y}(1-P)^{n-1-y}=(P+(1-P))^{n-1}=(1)^{n-1}=1$$
Where $y=x-1$. So the equation reduces to $nP$, which is the mean.
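Here's a quick numerical check of both results, the PMF summing to one and the mean being $nP$, for some arbitrary example values (I'm using numpy and scipy purely for convenience):
```python
# Verify sum-to-one and E[x] = n*P for an arbitrary Binomial(n, P).
import numpy as np
from scipy.stats import binom

n, P = 7, 0.4
x = np.arange(n + 1)
pmf = binom.pmf(x, n, P)

print(pmf.sum())        # 1.0, up to floating-point error
print((x * pmf).sum())  # 2.8 (i.e. n*P), up to floating-point error
```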

As for the variance, I'll save it for another blog post.

Tuesday, April 16, 2013

Discrete PMFs part 1. Bernoulli distribution.

Alright, I realize I still have to do that MGF post, but I also realized most of my stuff has dealt with continuous distributions. A lot of the proofs are pretty easy (just switch to summation notation), but I figured I'd do a post or two covering the main discrete distributions. Let's start with the Bernoulli distribution, whose PMF is:
$$
 P_{x}(x)=\left\{
\begin{array}{ll}
 P^{x}(1-P)^{1-x} \quad x=0,1\\
0 \qquad \qquad \qquad otherwise
\end{array}
\right.
$$
This can be considered a success and failure distribution. For instance, if there are two mutually exclusive events, success=$1$ and failure=$0$, with a probability $P$ such that $0\leq P\leq 1$, then the probability of success is $P$, and failure is $(1-P)$. To see that, let's say it succeeds:
$$P^{1}(1-P)^{1-1}=P^{1}(1-P)^{0}=P(1)=P$$
Where we use the fact that $(1-P)$ raised to the power $0$ is $1$. For failure:
$$P^{0}(1-P)^{1-0}=(1)(1-P)=(1-P)$$
Using the fact that $P^0$ equals $1$. So, we've verified that the probability of success is $P$, and of failure $(1-P)$. But this is really just the beginning. How do we know that it's a probability function? Well, there are two conditions it must satisfy. Since $0\leq P$, we know that the probability of success is greater than or equal to zero. Furthermore, since $P\leq1$ we know that $(1-P)$ must be greater than or equal to $0$, so both outcomes have a probability greater than or equal to $0$. Now, since the two outcomes are mutually exclusive and exhaustive, their total probability is equal to:
$$\sum_{x}P_{x}(x)$$
Which, in this case is:
$$\sum_{0}^{1}P_{x}(x)=P_{x}(0)+P_{x}(1)=(1-P)+P=1$$
So this shows it's a PMF. What about its CDF? That's defined as:
$$F(x)=P[X\leq x]=\sum_{x_{i}\leq x}P_{x}(x_{i})$$
Where:
$$\lim_{x\rightarrow-\infty}F(x)=0$$
And,
 $$\lim_{x\rightarrow\infty}F(x)=1$$
Something worth covering is the expectation and the variance. For discrete functions the expected value is defined as:
$$\sum_{i=1}^{\infty}x_{i}P_{i}$$
For this case, it's:
$$\sum_{x=0}^{1}x P^{x}(1-P)^{1-x}=(0) P^{0}(1-P)^{1}+(1)P^{1}(1-P)^{0}=P$$
And the variance, defined as:
$$\sum_{i=1}^{\infty}(x_{i}-\mu)^{2}P_{i}$$
Which for the Bernoulli distribution is:
$$\sum_{0}^{1}(x-P)^{2}P^{x}(1-P)^{1-x}=(P)^{2}(1-P)+(1-P)^{2}P$$
Where we can pull out a $P$ and $(1-P)$ to get:
$$P[P+(1-P)](1-P)=P(1)(1-P)=P(1-P)$$
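A quick Monte Carlo check of those two results (the value of $P$ and the sample size are arbitrary choices on my part):
```python
# Simulate Bernoulli(P) draws and compare the sample mean/variance to P and P(1-P).
import numpy as np

rng = np.random.default_rng(0)
P = 0.3
draws = rng.random(1_000_000) < P   # True with probability P, False otherwise

print(draws.mean())   # roughly 0.30, the mean P
print(draws.var())    # roughly 0.21, the variance P*(1 - P)
```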

That was a lot of the basic stuff for the Bernoulli distribution. I'll cover the binomial one next since it's so closely related.


Wednesday, April 10, 2013

Conditional Probability, Conditional Expectations, and Conditional Variance.

Alright, so I know I still owe you guys a finishing post about the MGF, but I wanted to give a quick one about conditional probability, conditional expectations, etc.

So from beginner probability theory we know that:

$$P(A|B)=\frac{P(A\cap B)}{P(B)}$$

Which is the probability that A happens, given that we already know B has happened. A way to understand this is to look at a simple set graph (specifically a Venn diagram). Take this:
(Venn diagram of two overlapping events $A$ and $B$.)

Now, let's say that we already know the event happened in the space of $B$. What's the probability that $A$ happens, or happened? Well, since the event is in the space of $B$, the only way it could be in $A$ as well is if it's in the intersection. Those are the only possibilities, since we know for certain that $B$ did happen. That's how the $P(A\cap B)$ term gets in there. Why do we divide it by $P(B)$? Because the entire event space is now limited to the space of $B$: since $B$ happened, we must be inside the $B$ circle. We then normalize so that everything sums to one, since by definition the probabilities of all possible events must sum to one, and each individual probability must satisfy $0\leq P\leq1$.

This makes sense at the extremes. If $A$ completely covered $B$, then the probability that $A$ also happens would be $1$, since no matter what, $B$ happening guarantees $A$ happened, or will happen. Similarly, if $A$ is disjoint from $B$, not intersecting it at all, then $B$ happening means $A$ is guaranteed not to happen. To use our notation so far: if $A$ completely covers $B$, then $P(A\cap B)=P(B)$, since they intersect on every point in $B$. Therefore:

 $$P(A|B)=\frac{P(A\cap B)}{P(B)}=\frac{P(B)}{P(B)}=1$$

Likewise, if they don't intersect, then $P(A\cap B)=0$. Plugging that in, gives:

 $$P(A|B)=\frac{P(A\cap B)}{P(B)}=\frac{0}{P(B)}=0$$

So it makes sense that all the other ways $A$ and $B$ can intersect fall in between $0$ and $1$.
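
To make the formula concrete, here's a tiny simulation with a made-up example (a fair die, with $A$ being "the roll is even" and $B$ being "the roll is at least 4"); exact arithmetic gives $P(A|B)=\frac{2/6}{3/6}=\frac{2}{3}$, and the simulation should land near that:
```python
# Estimate P(A|B) = P(A and B) / P(B) by simulating a fair six-sided die.
import numpy as np

rng = np.random.default_rng(1)
rolls = rng.integers(1, 7, size=1_000_000)

A = rolls % 2 == 0   # event A: the roll is even
B = rolls >= 4       # event B: the roll is 4, 5, or 6

print((A & B).mean() / B.mean())   # approximately 2/3
```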

Alright, so we know how to condition on events: knowing that one event happened, we can find the probability that another happens. Well, since we know about marginal PDFs and joint PDFs, we can do exactly the same thing with continuous random variables. Take:

$$f_{X_{2}|X_{1}}( X_{2}|X_{1})=\frac{f_{X_{2},X_{1}}(X_{2},X_{1})}{f_{X_{1}}(X_{1})}$$

Notice this is exactly the same as before, but with different notation for the probabilities. The left-hand side denotes the conditional density of $X_{2}$, given that $X_{1}$ takes a particular value. The right-hand side is exactly what we had before: the numerator is the joint density function, which describes the probability that $X_{1}$ and $X_{2}$ both take specific values, while the denominator is the marginal PDF of $X_{1}$. So it's the probability that both happen, divided by the probability that one of them takes a specific value. Exactly the same as our conditional probability from earlier. Some cool stuff we can do is show that it's a good old fashioned PDF. If we integrate over all the values of $X_{2}$, we get:

$$\int_{A}\frac{f_{X_{2},X_{1}}(X_{2},X_{1})}{f_{X_{1}}(X_{1})}d{X_{2}}=\frac{1}{f_{X_{1}}(X_{1})}\int_{A}f_{X_{2},X_{1}}(X_{2},X_{1})d{X_{2}}$$

Where $A$ is the entire space that $X_{2}$ is defined on. Looking at the last part, we can see that the integral is just the marginal PDF of $X_{1}$, so it becomes:


$$\frac{1}{f_{X_{1}}(X_{1})}\int_{A}f_{X_{2},X_{1}}(X_{2},X_{1})d{X_{2}}=\frac{f_{X_{1}}(X_{1})}{f_{X_{1}}(X_{1})}=1$$

So the total probability sums to one. We can also do some other cool stuff, such as finding the conditional probability over an interval:

$$\int_{a}^{b}\frac{f_{X_{2},X_{1}}(X_{2},X_{1})}{f_{X_{1}}(X_{1})}d{X_{2}}=P[a\leq X_{2}\leq{b}\,|\,X_{1}]$$

In other words, the probability that $X_{2}$ is in a certain interval, given that $X_{1}$ happens. We can also get a conditional expectation, by plugging in any function $g(X_{2})$, and doing the expectation we're used to:

$$E[g(X_{2})|X_{1}] =\int_{A}g(X_{2})\frac{f_{X_{2},X_{1}}(X_{2},X_{1})}{f_{X_{1}}(X_{1})}d{X_{2}}=\int_{A}g(X_{2})f_{X_{2}|X_{1}}( X_{2}|X_{1})d{X_{2}}$$
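
To see these formulas in use, here's a worked sympy example with a made-up joint density, $f(X_{1},X_{2})=X_{1}+X_{2}$ on the unit square (this density is purely for illustration; it isn't from anything above):
```python
# Build the conditional density f_{X2|X1} from a joint density and take E[X2 | X1].
import sympy as sp

x1, x2 = sp.symbols('x1 x2', nonnegative=True)
joint = x1 + x2                                # joint PDF on [0,1] x [0,1]

marginal_x1 = sp.integrate(joint, (x2, 0, 1))  # f_{X1}(x1) = x1 + 1/2
conditional = joint / marginal_x1              # f_{X2|X1}(x2 | x1)

# It integrates to 1 over x2, so it's a proper PDF for each fixed x1:
print(sp.simplify(sp.integrate(conditional, (x2, 0, 1))))       # 1
# And the conditional expectation is a function of x1:
print(sp.simplify(sp.integrate(x2 * conditional, (x2, 0, 1))))  # (3*x1 + 2)/(6*x1 + 3), or an equivalent form
```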

And, likewise, we can do a conditional variance:

$$Var[X_{2}|X_{1}]=E\left[(X_{2}-E[X_{2}|X_{1}])^2\,|\,X_{1}\right]$$

Which, after simplifying, gives us the general equation for variance, but this time given that we know $X_{1}$:


$$Var[X_{2}|X_{1}]=E[X_{2}^{2}|X_{1}]-E[X_{2}|X_{1}]^{2}$$

Which looks nearly the same as our usual definition for variance, which would be:

 $$Var[X]=E[X^2]-\mu_{X}^{2}$$

But with conditional probabilities and expectations.

Now, one last thing. It's time to make sense of this all, and I'll use a convenient graph taken from Bruce Hansen's free Econometrics textbook (a great source for beginner econometrics; I highly recommend it):

(Figure from Hansen's text: on the left, a contour plot of the joint density of $X_{1}$ and $X_{2}$ with the conditional expectation line running through it; on the right, the conditional densities of $X_{2}$ at several fixed values of $X_{1}$.)

Now, I removed the original $X$ and $Y$ values so it wouldn't be confusing. I would highly recommend checking out his original example, with wages conditioned on things like race, gender, and so on; it's a very easy-to-understand real-world application. However, let's just use the $X$ axis as $X_{1}$ and the $Y$ axis as $X_{2}$. As you can see, there's a contour map on the left graph, showing the density of realizations of these events. At different values of $X_{1}$, there are different values that $X_{2}$ can take. The line going through it is the conditional expectation, meaning the expected value of $X_{2}$ given that we choose a particular value of $X_{1}$. For instance, if $X_{1}\leq{10}$, then a quick look at the graph tells us that the expected value of $X_{2}$ should be lower than if ${10}\leq{X_{1}}$. Well, let's check that. Taking specific values of $X_{1}$ and plotting the distributions of $X_{2}$ given those values gives the right-hand graph. Turns out our first impression was right: at higher values of $X_{1}$, $X_{2}$ has a higher expected value. So depending on the value of $X_{1}$, we can expect different values of $X_{2}$. Likewise, we could do the same with variance. As you can see, different values of $X_{1}$ give different variances of $X_{2}$.
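
If you want a rough numerical version of that picture without the book in front of you, here's a small simulation (the data-generating process below is my own stand-in for Hansen's wage example, chosen only so that both the conditional mean and the conditional variance of $X_{2}$ change with $X_{1}$):
```python
# Simulate (X1, X2) where both E[X2|X1] and Var[X2|X1] increase with X1,
# then compare the two conditional groups.
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(size=200_000)
x2 = 0.8 * x1 + np.exp(0.3 * x1) * rng.normal(size=200_000)

low, high = x2[x1 < 0], x2[x1 >= 0]

print(low.mean(), high.mean())   # the conditional mean of X2 is higher for larger X1
print(low.var(), high.var())     # and so is the conditional variance, by construction
```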

So that's an intuitive way of looking at it. Check out the book.