Stat381 Computer lab (I): Statistical data analysis with R and Excel
Instructor: Xijin Ge, Ph.D. (SDSU Dept. Math/Stat)
This file is
available online at http://learn.sdstate.edu/gex/381/Computerlab1.htm
Descriptive statistics using R
1. Download data file chapter06.txt from our course homepage. http://learn.sdstate.edu/gex/381/chapter06.zip
Save it and unzip it to a folder, such as c:\temp\chapter06
1.
Start R by clicking the
icon on desktop.
2. Change your current working folder into the folder where you saved your downloaded data. Using the menu by selecting “Change dir…” under “File”.
3. Read in battery life span data to a vector called “x”.
x =
scan(“exa6.2.1.txt”)
4. Have a look at the data:
x
5. Produce a summary of the data
summary(x)
6. Produce a stem-and-leaf diagram
stem(x)
7. Produce a histogram
hist(x)
8. You can customerize it:
hist(x,xlim=range(0,4500),br=20,main="Battery
life span", col="red")
9. For fancier graphs, seek help on this function:
? hist
10. Create a box plot
boxplot(x)
11. Read a data matrix with two columns:
y=
read.table("exa6.4.1c.txt")
12. Create a box plot for the two columns at the same figure.
boxplot(y)
What can we say about the number of defective products at the
two sites?
13. Are the differences significant? Student’s T test on the two columns of data:
t.test(y[,1],y[,2])
Using R to solve
statistical problems
Problem #1. As a good basketball player, you can shoot 3 pointers with 77% accuracy from your favorite position. Your friend suggests if you can successfully shot at least 9 times out of 10 trials, he will give you $500; otherwise you owe him $100. Will you take on this challenge?
1. The probability of winning $500: Pwin = f(9)+f(10). Since it is a binomial distribution, we will use the function dbinom() which will give us density function f(x) for binomial distribution.
Pwin = dbinom(9, 10, 0.77)+dbinom(10,10, 0.77)
Pwin
2. Our expected value for benefit=likely reward – likely cost
500*Pwin-100*(1-Pwin)
Will you do it?
3. Generate the sequence of all possible values of the random variable x=0,1,2,…,10.
x=0:10
4. Calculate the density function f(x) for each of the possible values of x. The result is stored in a vector fx:
fx = dbinom(x, 10, 0.77)
5. We can plot the distribution:
plot(x,fx,type=”b”)
6. Cumulative density function F(x)
Fx= pbinom(x,10,0.77)
7. To
get all the numbers, we create our own statistics table.
cbind(x, fx, Fx)
Exercise: what is the probability that you’ll score less than 3 times (thus be punished by a couch to do push-ups)? What is the probability that you’ll score between 3 to 6 times inclusive?
Problem #2. Among diabetics, fasting blood glucose level follow normal distribution with mean 106 in mg/100 ml. Standard deviation is 8. Find the probability the blood glucose level between 90 and 122 mg/100ml.
8. This is solved by the following command, using the cumulative density function of normal distribution pnorm():
pnorm(122,106,8) -
pnorm(90,106,8)
9. How rare is it to observe a level of 130?
1 - pnorm(130,106,8)
10. Find the point where 25% of all diabetics have a fasting glucose level of this value or lower?
qnorm(0.25, 106,8)
11. How does the distribution look like?
x = 70:140
fx = dnorm(x,106,8)
plot(x, fx, type="l")
In addition to dbinom and dnorm we used above, R has the following density functions: dgeom, dhyper, dpois, dnbinom, dbeta, dexp, and dchisq, covering all distributions we discussed in class. Can you guess what distributions there represent?
Analyzing genomics data using Excel
1. Download data from my website for the bioinformatics course by clicking “Microarray” on this page:
http://learn.sdstate.edu/gex/BioInfo-Fall07.htm
2. Save the data and open it using Excel. It contains the activity level of 3295 genes in 9 different biological specimens. The data matrix in the right contains the confidence level in each of the numbers of the left matrix. A smaller P value indicates confident measurement.
3. Copy-and-paste data into R from Excel. Using R to investigate whether the distribution of gene activity level is normally distributed. Select all 3295 data points for sample #1 in Excel, and copy it to clipboard. In R, type the following and hit RETURN:
x = scan()
Paste this data to R by selecting “Paste” under “Edit” in the top menu. We are abusing R by pasting 3295 numbers from the clipboard. After all the data has been copied, hit RETURN.
4. How are the gene activity levels distributed?
summary(x)
hist(x)
boxplot(x)
We can get some rough idea about the numbers that we are working with.
5. Now go back to Excel. Are the numbers for the 9 specimens comparable? Calculate the median of 3295 number for each of the 9 specimens using MEDIAN( ) and plot it.
6. First step we will remove all genes whose confidence level is not signficiant (P>.05) for all 9 specimens. Using function MIN( ) or COUNTIF( ). Delete data using sorting and then delete.
7. We want to remove genes that are constantly expressed across all specimens. Calculate the standard deviation and average for each gene using STDEV( ) and AVERAGE( ).
8. We’ll define a new measure of variability called coefficient of variance (CV) = STDEV/ AVERAGE.
9. Sort according to CV and delete the top ones with small number.
10. Do a T test in EXCEL using function TTEST( ).