Wednesday, March 20, 2013

Beginner regression in R.



I guess the best place to start would be a simple linear regression. I won't explain regression itself, since I think I'll do that in a later proof. This will be more about R, which is a statistical program. I highly recommend picking up R, since it's a free and easy way to learn beginner statistical coding. As you can see above, I plotted randomly generated data in it. To create the same thing yourself, plug in:

> x = rnorm(1000, 4, 5)

This gives you a random, normally distributed variable "x" with 1000 observations, a mean of 4, and a standard deviation of 5 (note that rnorm takes the standard deviation, not the variance, as its third argument). To create a regression out of this, let's make a linear form with a dependent variable "y":
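As a quick sanity check (my own aside, not part of the original walkthrough; your exact numbers will differ a bit since the draws are random), you can confirm where the data landed:

> mean(x)   # should come out close to 4
> sd(x)     # should come out close to 5
> hist(x)   # roughly bell-shaped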

> y = 2*x + rnorm(1000, 0, 10)

So, as you can tell, the first term is just 2 multiplied by the observations we just created, which gives y a slope of 2 when plotted against x. The added term on the end is the error term, which has a mean of 0 and a standard deviation of 10. You can plot this by typing "plot(x,y)" at the command prompt. To put a regression line through it, you'd start with this:

> lm(y ~ x)

Where "lm" stands for "linear model".  Entering that in will give you the slope and intercept:



Now, to draw the line through it, we put:

> abline(lm(y ~ x))

Which gives:
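One caveat worth flagging: abline() only adds the line to a plot that's already on screen, so if you've closed the scatterplot, redraw it first. A compact version of the whole sequence (again using the "fit" shorthand from above) looks like:

> fit = lm(y ~ x)
> plot(x, y)     # scatter of the simulated data
> abline(fit)    # add the fitted line on top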


 

I highly recommend playing around with this stuff. For instance, what about no error at all? Well, then you can perfectly predict y from x, and vice versa. Aside from the data being a set of discrete points rather than a continuous line, it's essentially the same as the linear relation y=2x. Take a look:
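To reproduce that no-error picture yourself, a minimal sketch (the name "y2" is just mine, to avoid overwriting the earlier y) would be:

> y2 = 2*x            # same relationship, with no error term added
> plot(x, y2)         # the points fall exactly on a straight line
> abline(lm(y2 ~ x))  # the fitted line passes through every point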


Alright, so this was random data, not the real deal. But it does give you a relatively easy way of messing around with data and getting a feel for what playing with a statistics program should be like. Next time I'll try to do a proof.


