R for Statistical Computing: Quick Start & Useful Coding Snippets
Assad Ebrahim, http://www.mathscitech.org/articles/computing-toolkits/r-for-stats
2003-2020
This page provides notes and references for using R in statistical computing.
1. Overview of R
What is R? R is a professional-grade open-source statistical computing platform widely used in academia, bio-tech/pharma, and computational finance/investment banking. It has its origins in S/S+, the proprietary statistical computing platform developed at Bell Labs (John Chambers is the author of both). R is useful for statistics, graphing, data visualization, and various types of analysis, including signal processing, forecasting, cluster analysis, PCA, fuzzy logic, and many of the topic areas discussed in Knowledge Engineering. R has many packages (extensions) for specialized applications. By 2010, R was an academic standard in statistics, plotting, signal processing, and data analysis, and was rapidly gaining mindshare in industry.
– runs on Windows and Linux
– has a command line interface but also has a GUI version
– is fully scriptable, and provides a programming language
– integrates with databases seamlessly and handles large data sets
– has bindings that integrate with Python, Ruby (on Unix so far), and Perl
– has commercially available plug-ins for Excel that allow R code to be run on the back-end.
If there's one thing to remember as you learn R, it is to think in VECTORS, LISTS, and TABLES. The point is that everything you used to do one number at a time in a programming language, you should now be able to do in one line on a whole VECTOR (list). In this sense, R is about list processing.
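A minimal sketch of this vectorized style (the data are made up for illustration):
x <- 1:100      # a vector of 1..100
sum(x^2)        # squares the whole vector element-wise, then sums -- no loop
mean(x > 50)    # fraction of elements above 50: comparisons are vectorized too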
Excellent Sites for R and its use in Statistical Modeling, Statistical Learning, Data Science, and AI:
- R Programming for those coming from other languages (by John Cook)
- Drunks and Lampposts (by Simon Raper)
- Coppelia (by Simon Raper)
- Hyndsight (by Rob J Hyndman)
- Vladimir Anisimov
- One Page R
- R in Action, especially for the graphing
- Cross Validated (the statistics Q&A site on Stack Exchange)
- Functional Programming in R (Assad Ebrahim, StackOverflow)
- Spreadsheet to R Translator
- Using R for Time Series Analysis
- Forecasting
- R Language Reference Card
- Check out The R Journal, a refereed journal relating to R and its applications. First issue: June 2009, roughly 3 issues per year. journal.r-project.org/archive. Best is the single-file archive of R News — before it rolled up into R Journal (2001–2009) — and the single-file bib file of refereed articles (2009–present)
- Data manipulation (PDF)
- R Course materials: PDF (Oxford), R basics (PDF), UCLA: basics with annotated outputs, tutorial with info on plots, lecture notes with R code
- Statistical techniques: Regression, Time Series beginner’s tutorial, advanced time series
- Open Source framework for Financial Analysis.
- A brief guide to R and Economics
- quantitative trading: analysis and visualization framework
- Time series packages
- Data Mining: Data Mining in R (Togaware), online e-book
- Advanced Statistics using R
- Credit Scoring using R
- Graph gallery of R plots and charts with supporting code
- Graphics package Lattice – a tutorial
- Ggplot R graphics
- Ggplot Vs Lattice
- Multiple tutorials for using ggplot2 and Lattice
- Text Mining package in R
- Social Network Analysis
- Web Scraping in R
- Embedding R data frames in Excel
- Using R from Excel
- Databases: Connecting to MySQL from R
- Pulling data from SAS, STATA, SPSS, etc.
- Maps with R, Geographic maps
- Google Charts with R
- Using R and GoogleMaps
- Sweave
- R2HTML
- Poor Man's GUI (PMG) for R
- R Commander is a robust GUI for R
- Java GUI for R (JGR)
- Tinn-R good R editor
- Eclipse plugin for R
- Instructions to install StatET in Eclipse
- Komodo R editor
- Omega Hat has a very interesting list of packages that is worth a look
- Commercial versions of R
- A very informative R blog
- Red R for R tasks
- Screenshots in R: KNIME is worth a serious look.
- R Graph Gallery, with source code
Why R? When not R?
R is a powerful though complicated language, and unfortunately it has a somewhat steep learning curve.
I have been using R since 2003 (v1.7.1, rw1071.exe), initially as a graphical tool for exploratory data analysis (EDA) of large data sets with many factors (comprehensive health data), later as a tool for simulation and modelling. Critically, the release of Excel 2007, with its new larger data set handling capabilities and easy-to-use pivot tools, meant that my use of R shifted toward model building, automation of analytics involving the production of many tables and charts, and computationally intensive tasks on large data sets. Should a beginner start with R? It is worth forcing yourself through the learning curve while you have motivation and time, so that you can already use R easily when you need capabilities that are lacking anywhere else.
Why use R?
– you’d like a freely available software package for your home / personal computer that packs all the power of a proprietary, industrial-strength platform, without the cost
– you need to productionise / automate analysis that will be repeated many times, and that possibly involves producing many tables and charts
– you need analyses that are more sophisticated or involved than would be possible in Excel
– specific analysis that is almost impossible in Excel — segmentation, clustering, etc.
– using / exploring a large number of pre-packaged statistical techniques, e.g. browsing / potentially using the vast statistical library of techniques on CRAN [Footnote: CRAN is a user-submitted archive following the model of CTAN, the Comprehensive TeX Archive Network, which was later also adopted by CPAN for Perl.]
When not to use R?
– you’re already a whiz with some other statistical package (S, S+, Stata, SAS, etc.). In this case you probably have all of the same power, much of the same expressiveness, and (presumably) you already own or can use the proprietary software on your computer
– you’re a whiz with Matlab, and are willing to tolerate some slight awkwardness and inefficiency to avoid having to learn another platform
– you’re doing ad-hoc, one-time data analysis — in this case Excel (specifically version 2007 and newer) is all you really need. The 2007 version makes data filtering, pivot tables, and a host of other analysis and graphical tools readily available and easier to use.
2. Getting & Setting up R
Getting Installer
Homepage. Download page. Select your operating system (Windows, Linux, Mac) and platform (32- or 64-bit). Choose the R version — it is recommended to install a version that has been around a while, to avoid package incompatibility problems (as of May 2020, recommend installing 3.6.3 instead of 4.0.0). Save and run the installer to begin.
Installation
Choose Language. Change installation directory to something without spaces so that when you want to run scripts or incorporate R into a data pipeline, you don’t have to worry about spaces in the pathname. I use: c:\totalprogs\R\. Customize startup:
– choose MDI (one big window) — With MDI there is one big container window, and all windows are within it. This means you can minimise, maximise the app in one click — easier if you’re working with lots of other applications also
– choose Plain Text help — this will bring up the help in a separate window in the GUI, rather than shunting you to a browser.
– UNCHECK: Save version number in registry
On Windows 10, the font in R can look blurry. If this happens to you the fix is easy: Find RGui.exe, right click > Properties > Compatibility tab > Change High DPI setting button > check Override high DPI scaling behaviour. The font shown below is the crisp font after the High DPI scaling fix.
Using an Integrated Development Environment (IDE)
There are a few choices:
- RStudio
- Rattle
- R-Commander
- Sweave, for mixing R & LaTeX
Sweave(), Stangle() and friends are in package tools. Sweave allows mixing LaTeX documentation and R code in a single source file: the R code can be replaced by its output (text, figures) to allow automatic report generation. Sweave files found in the package subdir 'inst/doc' are automatically tested by R CMD check and converted to PDF by R CMD build; see the section on package vignettes in Writing R Extensions.
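A minimal sketch of a Sweave source file (a hypothetical example.Rnw; the <<>>= and @ chunk delimiters are standard Sweave syntax):
\documentclass{article}
\begin{document}
<<echo=TRUE, fig=TRUE>>=
x <- rnorm(100)   # generate sample data
hist(x)           # this figure is inserted into the output automatically
@
The sample mean is \Sexpr{mean(x)}.
\end{document}
Process it with R CMD Sweave example.Rnw, then run pdflatex on the resulting example.tex.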
3. Quickstart
To run the GUI version, link to: c:\totalprogs\R\bin\i386\Rgui.exe
Rgui.exe – this brings up the R console (shell). q() to exit.
To run a command line version, link to: c:\totalprogs\R\bin\R.exe
Rcmd.exe – this is a command line front end with a lot of options
rcmd.exe BATCH script.r – runs a script non-interactively; output is written to script.Rout
Rterm.exe – this brings up the console in a DOS terminal. q() to exit.
Files:
.RData – saved workspace (your objects), loaded automatically at startup
.Rhistory – command history
savehistory("name.Rhistory")
history()
savehistory() — saves to .Rhistory
Navigating between R windows: Use Ctrl+Tab. Close window with Ctrl+F4.
Help:
help(dir)   # load help page
dir()       # shows the files in the CURRENT directory [this command added in the 2.xx series]
Arithmetic:
2+3       # This is a comment
2*3
2**3      # Also 2^3
2^(1/3)   # cube root
2/3       # Float
3%%2      # Remainder (modulus)
Logic:
&, |, ! are AND, OR, NOT operators that work element-wise on lists.
&& and || are SINGLETON AND and OR: they use only the first element of each list.
Example: c(TRUE, FALSE, TRUE) & c(FALSE, FALSE, TRUE)
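The example above evaluates as follows (a quick sketch; note that recent R versions make && on vectors of length > 1 an error):
c(TRUE, FALSE, TRUE) & c(FALSE, FALSE, TRUE)    # FALSE FALSE TRUE  (element-wise)
c(TRUE, FALSE, TRUE) && c(FALSE, FALSE, TRUE)   # FALSE: only the first elements are compared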
String Literals:
print("Hello World!") # prints with newline cat("Hello World!") # no newline cat("hello","world") # concatenate text with space in between cat("hello", 2, " world!") # number can be concatenated sprintf("Hello World! %d:%d:%f", 08,53,29.2) # C-style formatted print sprintf("0x%X = %dd",13,0xD) # C-style number conversion and print
Categorical Variables:
Designating a factor variable and specifying its order with levels means that in all subsequent usage these levels will be used in displaying the data.
cc5 <- c("VERY LOW", "LOW", "MEDIUM", "HIGH", "VERY HIGH") table(cc5) cc5 <- factor(cc5,levels=cc5) table(cc5)
Assignment of Variables and Functions:
r <- 5   # Assignment
square <- function(val) { val**2 }
square(3)
classify <- function(val) {
  if (val < 10) print("LOW")
  else if (val > 90) print("HIGH")
  else print("NORMAL")
}
Working with lists and vectors:
R is powerful because it is easy to work with lists (lists of objects or vectors of data) as well as lists of lists (tables of objects or matrices of data).
Lists can be (1) created by hand using c, seq, or rep, (2) generated by sampling from known distributions, (3) read in from pre-existing datasets or (4) loaded in from a file.
Vectors can be named by assigning them at creation, e.g.
v <- c(a=1, b=2, c=3)
There are several ways to unname: unname(v) or names(v) <- NULL both strip the names.
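A quick sketch:
v <- c(a=1, b=2, c=3)
v["b"]      # access by name
unname(v)   # 1 2 3 -- names stripped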
Working with tables and dataframes:
TABLES / dataframes
thisframe$HasIns # one column
thisframe$HasBPmeds # another column
t = table(thisframe$HasIns, thisframe$HasBPmeds) # 2-way table
addmargins(t) # adds the row and column sums, and total sum
ftable(addmargins(t)) # formatted table
t[1,] # row 1
t[2,] # row 2
t[,1] # column 1
t[,2] # column 2
CREATING A DATAFRAME
build it up using rbind and cbind
e.g.
vars2007 = rbind(vCreatin) # This starts off the table...
vars2007 = rbind(vars2007, vTSH) # This adds a row
CREATING other multi-dimensional data structures
### Form data frame:
ddd <- cbind(d, dd, bb, q, dd.vhigh, dd.high, dd.medium, dd.low, dd.vlow)        # matrix: all columns coerced to a common type
ddd <- data.frame(d, dd, bb, q, dd.vhigh, dd.high, dd.medium, dd.low, dd.vlow)   # data frame: columns keep their own types
ddd <- table(d, dd)                                                              # cross-tabulation (table() needs at least one argument)
The most flexible is the list:
ddd<-list(d=d,dd=dd,bb=bb,q=q,vhigh=dd.vhigh,high=dd.high,med=dd.medium,low=dd.low,vlow=dd.vlow) # list with named sub-lists, can all be different lengths.
Basics:
length(d)
min(d)                                # min of a vector treated as a set
which.min(d)
pmin(vec1, vec2)                      # parallel min of two vectors, element-wise
apply(data.frame(vec1,vec2), 1, min)  # min across the rows of the dataframe, 1=rows
table(d)                              # summarizing in a counts table
table(d)/length(d)                    # frequency table
Suppressing verbose list output:
sink("NUL") ... sink()                # no output for the ... commands
unlist(d)                             # useful if getting a list of lists
Vectors and Lists:
Vectors are homogeneous, i.e. all elements have the same data type (e.g. integer or character).
Lists are general and can have elements of mixed types, including other lists.
So there are two operators for accessing elements: [ ] (indexing) and [[ ]] (subsetting).
Indexing [ ] returns a container of the same type: on a list it returns a (sub-)list, even when that sub-list holds a single element.
Subsetting [[ ]] forces the return of the single element itself, stripping out any named extras; on a list of lists it returns the inner object.
1:4               # vector of length 4; a sequence / iterator; can run down too, e.g. 4:1
length(1:4)
d <- c(1,5,7,9)   # vector of values
d[2:3]            # pick out a subset
d[2]
quantile(d)[[1]]  # just the number, without the named %
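A minimal sketch of [ ] vs [[ ]] on a list (toy data):
L <- list(a=1, b="two", c=3:5)
L[1]       # a list of length 1, still named: $a
L[[1]]     # the element itself: 1
L[["c"]]   # the vector 3 4 5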
Work with built-in dataset:
library(datasets)          # load built-in datasets (see 00Index.html file in .\R\library\datasets\html\)
library(help="datasets")
plot(airmiles)
plot(beaver1$time, beaver1$temp)
Read data from file:
fn <- "c:/totalcmd/scripts/rdata.dat" # note the / is opposite Windows default d <- read.csv(fn, header=FALSE) names(d) <- c('t','v') d <- read.table(fn,skip=15,sep="\t",fill=TRUE,header=FALSE) names(d) <- c("col1", "col2")
Working with lists:
detecting NA's: is.na(val) # tests element
suppressing NA's: most functions have parameter na.rm=TRUE
removing NA's: d_clean <- d[!is.na(d)]   # remove NA's (note: a hyphen is not legal in an R name)
min(d, na.rm=TRUE) # min value suppressing NA's
Reading/Loading source files/running an R script/Including source files
source("script.r")
lib = "c:\\totalcmd\\R\\lib\\AKE\\akelib.r"
source(lib)
Generic functions
clean <- function(v) { v[!is.na(v)] }   # removes NAs from a list (ex: clean(d))

drawOn <- function(w,h) {
  windows(bg="white", width=w, height=h)
  par(cex.axis=0.9, mar=c(4.5,3.2,2.5,0.5)+0.1, mgp=c(2.2,0.8,0))   # tight margins for insertion into a Word doc
  # LATER: for LAS=2, set margin bottom from 4.5 to 7.0
  # margin order: bottom, left, top, right
}
# < PLOT BODY GOES HERE >
drawOff <- function(w,h,finame) {
  par(cex.axis=1, mar=c(5.1,4.1,4.1,2.1), mgp=c(3,1,0))   # set back to defaults
  fname <- sprintf("fig_%s.png",finame)
  dev.print(png, file=fname, width=w, height=h)
  dev.off()
}
APPENDICES
Appendix 0: My Snippets
clean <- function(v) { v[!is.na(v)] }   # removes NAs from a list (ex: clean(d))
square <- function(val) { val**2 }      # example for demonstrating sapply(clean(d), square)
Appendix 1: Using Packages
R packages are stored locally in the .\R\library path.
Packages must be loaded from the library (the location must be in your current search path):
library(ISwR)
library(datasets)
Packages can be imported from CRAN (R-repository)
To download a package (connect to internet) and run:
download.packages("ISwR", ".")
install.packages("ISwR", .libPaths()[1])
Package Installation
Packages can be installed from within R GUI (easy) OR from the command line:
c:\totalcmd\R\rw261\bin\Rcmd.exe (shows the Rcmd help)
c:\totalcmd\R\rw261\bin\Rcmd.exe INSTALL --help (shows the Rcmd INSTALL help)
c:\totalcmd\R\rw261\bin\Rcmd.exe INSTALL (echoes where it will install to)
Useful Packages (WIP)
teradataR_1.0.1.zip --- not on CRAN, get from Teradata directly --- REQUIRES R 2.11 through 2.xx (cannot be used with R 3.0.0 or later)
library(teradataR)
The Lowdown on Packages
-----------------------
R has some great capabilities out of the box.
But the best thing about R is that it is extensible and has a very active user community of academics and professionals — statisticians and practitioners of machine learning, forecasting, data mining, big data, quantitative finance, and other analytic areas.
There are about 2 dozen packages installed with the base R system.
These aren't loaded in -- you have to load them in yourself if you want to use the functionality they provide.
Then there are thousands of user contributed packages, archived in CRAN (Comprehensive R Archive Network) modelled after CTAN (for TeX users) and CPAN (for Perl users).
Commands:
Within RGui > Packages Menu > Load Packages .... shows you the dozen or so packages included in the base installation.
> Packages Menu > Set CRAN mirror --- set this once so you're not prompted again. Choose a location near you, e.g. UK (London)
> Packages Menu > Install Packages ... choose from CRAN list
> Packages Menu > Install Packages from ZIP ... unfortunately these are not available (as of this writing) on CRAN, so you have to get them externally as ZIP and then load in as ZIP.
Essential Packages --- whatever you can get from CRAN is best kept that way so you get the latest...
# Avail from CRAN (R Menu > Packages > Install Packages > Choose Mirror > Select packages... )
library(RODBC) # for connecting to databases (data base import to data frame)
odbcDataSources()   # shows available connections installed on your machine (e.g. Teradata, DW Master, ebraha01)
library(rgl) # 3D rotatable plot
library(scatterplot3d) # 3D scatterplot / point cloud
library(ggplot2) # beautiful filled plots, http://stackoverflow.com/a/7027666/181638 (installs 13 others)
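As a quick sketch of the 3D packages just listed (a random point cloud, purely illustrative):
library(scatterplot3d)
x <- rnorm(100); y <- rnorm(100); z <- rnorm(100)
scatterplot3d(x, y, z, angle=40, highlight.3d=TRUE)   # static 3D point cloud
library(rgl)
plot3d(x, y, z)   # opens a rotatable OpenGL window (drag with the mouse)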
Appendix 2: Working with Data: load built-in (see Packages above), generate Randomly, import from File, pull from Database
WORKING WITH DATA
Loading data from a built-in library (see Packages above)
library(datasets)
plot(beaver1$time, beaver1$temp)
title("Body Temperature of Beavers, *C\nSource: built-in dataset, beaver1")   # note: can wrap text with \n newline
Sampling from a random distribution:
This is useful for testing, generating data for inputting into a function or plot, or running simulations
General form: rdist(N, distparams...)
runif(100,0,1)   # draw 100 samples from the uniform 0-1 distribution
rnorm rpois rexp rgamma rbeta
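A small sketch of sampling in action (the parameters are arbitrary):
x <- rnorm(1000, mean=10, sd=2)   # 1000 draws from N(10, 2)
mean(x); sd(x)                    # should come out close to 10 and 2
hist(rpois(500, lambda=3))        # count data from a Poisson(3)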
Read / Load data from a file
getwd()   # You'll need to know what directory you're working in.
fn <- "filename.dat"

READ IN A **TABLE** OF NUMBERS INTO A DATAFRAME
data <- read.table(fn)   # reads multi-column, char-separated table into a data frame
d <- read.table(fn, skip=15, sep="-", fill=TRUE, header=FALSE)
    # skip n lines, specify separating character, allow ragged data
    # (i.e. different number of elements per row), no header
names(d)
head(d)      # top 5
head(d,10)   # top 10
tail(d)      # bottom 5
d$V1         # show column by name; default names are V1, V2, ...
d[1]         # show column by index
d <- readLines(fn)   # reads TEXT file line by line as a list of strings
writeLines()         # writes to a file line by line

READ IN A **COLUMN** OF NUMBERS INTO A VECTOR
list = scan("jj.dat")

READ IN A SINGLE **ROW** OF COMMA-SEPARATED NUMBERS
list = scan("list.txt", sep=",")

READ IN A CSV **TABLE** INTO A DATAFRAME
frm = read.csv("file.csv", header=TRUE)
frm = read.csv(filename, header=FALSE)
colnames(frm) = c('t','v')   # to add your own column headers

READ IN DELIMITED DATA
read.delim()   # reads TAB or other delimited file, other than CSV

READ IN A BUNCH OF FILES IN A LOOP:
dir /B *.csv > files   (in cmd shell)
files = readLines("files")                 # read in all filenames
for (filename in files) { do(filename) }

# Example reading in columns of different types
coltype = c('integer','integer','character','integer')
f2026 = read.csv("F20-26.csv", header=TRUE, colClasses=coltype)

QUALITY CHECK --- read in and write out the data
data = read.table("foo.csv", header=TRUE, sep=",")
write.csv(data, "out.csv", row.names=FALSE)    # writes out WITH column names...
data = read.table("out.csv", header=TRUE, sep=",")
write.csv(data, "out2.csv", row.names=FALSE)   # writes out WITH column names...
Writing out data to files
write.table(x, file = "foo.csv", sep = ",", col.names = NA, qmethod = "double") ## and to read this file back into R one needs read.table("foo.csv", header = TRUE, sep = ",", row.names = 1) ## NB: you do need to specify a separator if qmethod = "double". ### Alternatively write.csv(x, file = "foo.csv") read.csv("foo.csv", row.names = 1) ## or without row names write.csv(x, file = "foo.csv", row.names = FALSE) read.csv("foo.csv") WRITE OUT TO PNG, PDF, etc. # a convenient one shot deal -- THIS WORKS! fname = sprintf("plot_%d.png",i) # For use within a loop and automatic naming dev.print(png, file=fname, width=300, height=300) # prints with transparent background, view in mspaint drawOn(500,350) # from one of my library files def=par(las=2, cex=.6, cex.axis=.8) barplot(t1rptot[1:15], ylim=c(0,30), main="Cholesterol Health Assessment", ylab = "Percent of Total Study Population", names = labt1r) par(def) drawOff(500,350,"37a") dev.print(png, file=fname, width=300, height=300) pdf("beamplot.pdf", width=800, height=600, paper="USr") # setup the PDF file to print on landscape Letter (US) sized paper ... plot stuff dev.off() #drivers pdf("file.pdf") # graphics driver - like sink, but to pdf() and only for graphics ... plot stuff dev.off() # close file png() bmp() jpeg() dev2bitmap() bitmap() # closes the stream to the driver dev.off() deviceswin.print() WRITING OUT TO VECTOR GRAPHICS FILE (SO CAN SCALE, ETC.) http://www.stat.wisc.edu/~deepayan/SIBS2005/demos/embed-output.html Bitmap formats are not suitable when there is the possibility of graphs being scaled (either when being viewed on-screen or after printing). This is usually the case for Word documents or Powerpoint presentations. For such use, the appropriate type of output devices to use are those that produce so called vector graphics, where the output file contains the plot information as a collection of geometric objects and text, and not pixels as in bitmap formats. The most popular of these formats (on Windows) is the Windows Metafile format. The following example produces a scatter plot using the iris data in a file called iris.wmf data(iris) win.metafile(file = "iris.wmf") plot(Sepal.Length ~ Sepal.Width, data = iris) dev.off() Once this is done, the file iris.wmf can be included in a Word or Powerpoint file. Another popular vector format is PDF, which can be included in other documents, as well as viewed or printed directly. WRITE OUT A CSV write.csv(frm1, "out.csv", row.names=TRUE) # writes out the rownames and column names... frm2 = read.csv("out.csv", header=TRUE) # reads in the output ? HOW TO APPEND? WRITE OUT A DATA TABLE write.table(data, "akhb2006m9.csv", sep=",", row.names=FALSE, col.names=TRUE) write.table(data, sprintf("%s%s",filename,"_test.csv"), sep=",", row.names=FALSE, col.names=FALSE)
Loading data directly from a database
use package RODBC
db <- odbcDriverConnect("dsn=DW Master;uid=xxxxxxx;pwd=xxxxxxx;")   # make connection to database
qstr <- "select top 10 * from cctr;"                                # store sql query
frm <- sqlQuery(db, qstr, believeNRows=FALSE)                       # submit query and store result in data frame

@@PITFALL! R is CASE SENSITIVE. So myVar is not the same as myvar. Be especially careful when importing from a database.

From Database: use RODBC
sqlString <- " put your sql statement here -- end with ;"   # query string
res <- sqlQuery(channel, sqlString, believeNRows=FALSE)
Data is pulled back into a data.frame.

Tip: Import from the database, then export to a file so that later you can easily re-import the dataset offline.
Tip: Take care to formulate / transform your dataset on the database side -- this keeps the data side separate from the statistical transformations you'll do in R solely for the purpose of analysis, visualisation, summarisation.
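Putting the pieces together, a minimal end-to-end sketch (the DSN, credentials, and table name are placeholders):
library(RODBC)
db  <- odbcConnect("MyDSN", uid="user", pwd="pass")    # or odbcDriverConnect() with a full connection string
frm <- sqlQuery(db, "select * from some_table;", believeNRows=FALSE)
write.csv(frm, "snapshot.csv", row.names=FALSE)        # keep an offline copy for easy re-import later
odbcClose(db)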
@@If you do nothing else, at this point you should be able to see the power of R and you should be able to use it enough to want to (and need to!) learn more.
The next chapter will give you what you need to take control of the power you see around you.
Appendix 3: Mixing Languages with R
Excel
Python
SQL / Teradata databases
Ruby (used to be patchy)
Perl (used to be patchy)
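The links above cover the dedicated bridges; as a minimal R-side sketch, the lowest-common-denominator bridge to Excel (and most other tools) is CSV (file names here are placeholders):
results <- data.frame(id=1:3, score=c(0.2, 0.5, 0.9))   # toy data
write.csv(results, "results.csv", row.names=FALSE)      # opens directly in Excel
xl <- read.csv("from_excel.csv", header=TRUE)           # read back what Excel saved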
R Cookbook and References - Assad Ebrahim
A Reference Collection for R Programmers
Appendix 4. R as a Numerical Platform
library(datasets)   # load built-in datasets (see 00Index.html file in .\R\library\datasets\html\)
d <- beaver1$temp
summary(d)
quantile(d)
quantile(d, 0.9)
Appendix 5. R as a Graphing & Visualization Platform
Plot Types: Plots & Curves
R has many plot types. The 3 most basic are: scatterplot plot(), histogram hist(), and boxplot(). Others are: timeseries plot.ts(), Q-Q plots, contour plots (2D), 3D plots, wireframes, mesh plots, animation plots, heat maps, map overlays...
ADVANTAGE: R can easily graph 1M data points. In Excel, you struggle beyond 10K points. So easily a factor of 100 better.
Syntax:
plot(x,y)
lines(x,y)
curve(f(x), add=TRUE)
hist(x)
boxplot(x)
barplot(x)
Example: Barplot with values above the bars
months = c("Jan", "Feb", "Mar", "Apr", "May", "June", "July") org2020 <- c(3.7, 3.8, 3.7, 3.7, 3.9, 4.0, 4.0) x<-barplot(org2020, names.arg=months, main="Scores from 0 (low) to 5 (high)", ylim=c(0.0,5.0)) y<-org2020 text(x,y*1.1,labels=as.character(y))Examples:
# plot curve with endpoints
x <- c(0,10)
y <- x^2
plot(x,y)
curve(x^2, add=TRUE)
# plot from dataset and add overlay
library(datasets)   # load built-in datasets (see 00Index.html file in .\R\library\datasets\html\)
plot(beaver1$time, beaver1$temp)
curve(0*x + 37.0, add=TRUE)   # overlay function must have an x
# other plots
hist(beaver1$temp)
boxplot(beaver1$temp)
Inspect the data behind a plot by setting plot=FALSE and assigning.
Example: h <- hist(x, plot=FALSE)
Titles, Labels, Ranges: note that the title goes at the top, the subtitle at the bottom of the chart
plot( ..., main="title", sub="subtitle", xlab="x axis label", ylab="y axis label" xlim=c(x0,x1), ylim=c(y0,y1) )Setting axes:
Or do it separately:
title(main="", sub="", cex.main=1, cex.sub=0.75)   # title and subtitle, with sizes
mtext( , line=1)   # subtext below the title; give one line gap between mtext and graph
Plotting an abline()
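A short sketch of abline(), drawn on a built-in dataset (the intercept/slope values are arbitrary):
plot(cars$speed, cars$dist)           # built-in dataset: stopping distance vs speed
abline(h=40)                          # horizontal reference line
abline(a=-17, b=4, lty=2)             # line with intercept a and slope b, dashed
abline(lm(dist ~ speed, data=cars))   # least-squares line through the points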
Advanced: Histograms
hist(x)                # histogram, either of counts (frequencies) or relative frequencies (densities)
hist(runif(100,0,1))   # useful for testing: fill the histogram with randomly generated data
hist(runif(100,0,1), breaks=c(LEFT, breaks..., RIGHT))
Breaks: LEFT and RIGHT must contain the data range. The rest are internal separators and determine the number of bars (bins, buckets).
Example:
test <- rnorm(100)
hist(test, breaks=c(-3,-2,-1,0,0.5,1,1.5,2,2.5,3))
hist(..., xlim=c(x0,x1), ylim=c(y0,y1), density=50, col="blue")
Pretty shaded histograms
>> coloring different bars differently: use library(ggplot2) http://stackoverflow.com/a/7027666/181638
>> relative frequency (probability):
Pitfall: R does not produce *relative frequency* (probability) histograms out of the box. This requires a bit of extra code, as given by @BondedDust (http://stackoverflow.com/a/17433446/181638). Furthermore, if you specify breaks yourself, then R defaults to showing the DENSITY, not the COUNTS (frequencies).
What's the difference? A density integrates to 1: each bar's area (bin width x density) is its share, so the width of your breaks makes a difference.
Solution: to get relative frequency, what you need is counts/total. [Answer to the question: why are the densities NOT adding to 1?]
h$counts <- h$counts/sum(h$counts)
plot(h, freq=TRUE)
boxplot(x)
Boxplots are important for identifying outliers. The neat thing is side-by-side boxplots for easy comparison of two populations a, b. All populations to be plotted separately are presented in one list:
boxplot(list(a,b), names=c("a","b"))
Advanced: Magic with figures
Recommended References:
Tips: http://blog.revolutionanalytics.com/2009/01/10-tips-for-making-your-r-graphics-look-their-best.html
Gallery: http://www.r-project.org/screenshots/screenshots.html
Graphics Export: https://www.stat.berkeley.edu/classes/s133/saving.html

Prettification (Pretty, Beautiful), Annotation, Labelling, Titling, Symbols, Line styles, Colour, Multi-plots, Producing plots (exporting to PDF or EPS)

Plot / Chart Types:
windows()   # deactivate the current plotting window and open a new one, making it active
plot(x,y)   # 2-D scatterplot
par( )      # for setting parameters
scatterplot3d(x,y,z, angle=a, highlight.3d=TRUE)   # external package. angle=80 (view from top), angle=10 (view from side). Highlighting helps viewing.
contour()   # 2-D contour plot
image()     # 2-D image / gradient plot / heat plot
persp()     # 3-D perspective plot
trans3d()
plot3d(x,y,z)   # [in package rgl] rotatable 3-D plot. External package --- R OpenGL. Rotatable (left mouse button and drag) and zoomable (scroll wheel).
par3d( )    # for setting parameters
play3d(spin3d(rpm=6), duration=5)   # to make the image spin
Configuring 3d plots is tricky:
http://stackoverflow.com/a/15587188/181638
http://stackoverflow.com/a/9658576/181638
doc: http://www.inside-r.org/packages/cran/rgl/docs/decorate3d
blog: http://www.stattler.com/article/three-dimensional-plots-using-rgl-package *** EXCELLENT -- including auto rotation animations ***
tutorial: http://scs.math.yorku.ca/index.php/MATH_6627_2012-13_Practicum_in_Statistical_Consulting/R_tutorials/rgl_tutorial

Regression model
For our example we shall use 3 variables: "mpg" (Miles/(US) gallon), "cyl" (Number of cylinders) and "disp" (Displacement in cu.in.). mpg and cyl will be predictors and disp will be our response. The model we are interested in is:
y_disp = b_0 + b_mpg * x_mpg + b_cyl * x_cyl
> fit <- lm(cars.red$disp ~ cars.red$mpg + cars.red$cyl, data = cars.red)
> coefs <- coef(fit)
> planes3d(a=coefs["cars.red$mpg"], b=coefs["cars.red$cyl"], -1, coefs["(Intercept)"], alpha=0.50, col="plum2")

Plot items:
plot(..., xlab="", ylab="")   # x,y,z labels
plot(..., main="", cex.main=1, sub="", cex.sub=0.75)   # title (top of plot), subtitle (bottom of plot), with sizes
mtext=""   # text beneath the title
lty="2"    # dashed line. See the R RefCard.
plot(..., pch='o')   # choose plot symbols, e.g. 'o', 'x', 'v'; pch=20 gives filled circles
cex=2      # twice the size; 0.5 is half the size
plot(x,y, col=(km$cluster+1))   # colour-code plot symbols with a class-assignment vector (classes 1,2,3,...)
palette()  # shows the list/sequence of colours: 1=black, 2=red, 3=green, 4=blue, 5=cyan (light blue), 6=magenta (pink), 7=yellow, 8=gray
plot(..., type="l|p|b")   # plot type: lines (l), points (p), or broken/both (b)
Sizes:
cex=1      # point size
cex.main   # title size
cex.sub    # subtitle size
cex.axis   # axis tick labels size
cex.lab    # axis name labels size

@@PLOT BEAUTIFICATION & PRETTIFICATION
Gallery: http://www.r-project.org/screenshots/screenshots.html
plot(x, y, xlab="", ylab="", type="l")   # the plot, with axis labels, and either a line graph (type l) or points (type p)
title(main="", sub="", cex.main=1, cex.sub=0.75)   # title and subtitle, with sizes
mtext( , line=1)   # subtext below the title; give one line gap between mtext and graph
Margins:
mai   # margin vector in inches
mar   # margin vector for plot area. default: c(5,4,4,2) + 0.1   # bottom, left, top, right (good for subtitle, one-line mtext, roomy around graph)
      # c(5,4,5,2) + 0.1   # with sub, two-line mtext, roomy
      # c(5,4,2,0) + 0.1   # with sub, no mtext, roomy
      # c(4,4,2,0) + 0.1   # no sub, no mtext, tight around title, bottom, and right hand side
mgp   # margin vector for axis area. default: c(3,1,0)   # axis title label, axis tick label, axis line
      # c(2.2,0.8,0)   # less space between axis title labels and plot
      # c(1.8,0.5,0)   # fits 4 graphs per view
Excellent reference: http://research.stowers-institute.org/efg/R/Graphics/Basics/mar-oma/index.htm
plot(..., width=, height=)   # plot sizes; typical nice sizes are 400x300, 350x350, or 350x300

@@Multiple Plots in a Window (View)
windows()                              # create new active window
view <- matrix(1:2, nrow=2, byrow=T)   # define view: matrix(1:totcells_nxm, nrows, orientation)
layout(view)                           # assign view to layout
pdef <- par(cex.axis=0.75, mar=c(4.3,3.2,2,0)+0.1, mgp=c(2.2,0.8,0))   # define tight margins to fit all elements
par(pdef)

@@Outputting Graphs to Screen and File
plot(...)
dev.copy(png,'myplot.png')
dev.off()

@@Easier graphics with the ggplot2 library
Fancy logically coloured histogram: http://stackoverflow.com/a/7027666/181638
test <- rnorm(100)
dat <- data.frame(x=test, above=test>1)
library(ggplot2)
qplot(x, data=dat, geom="histogram", fill=above)
Slick way to plot & save the plot to PNG: http://stackoverflow.com/a/8400798/181638
library(ggplot2)
ggplot_alternative <- function() {
  x <- seq(0, 1, length.out = 100)
  the_data <- data.frame(x = x, y = pbeta(x, 1, 10))
  ggplot(the_data, aes(x, y)) +
    geom_line() +
    xlab("False Positive Rate") +
    ylab("Average true positive rate") +
    coord_cartesian(0:1, 0:1)
}
ggsave("ggtest.png", ggplot_alternative(), width = 3.25, height = 3.25, dpi = 108)
Interactive Graphics and Animation
iPlots: http://www.r-project.org/conferences/DSC-2003/Proceedings/UrbanekTheus.pdf
library(rgl)             # 3D rotatable plot
library(scatterplot3d)   # 3D scatterplot / point cloud
library(ggplot2)         # beautiful filled plots, http://stackoverflow.com/a/7027666/181638 (installs 13 others)
ANIMATIONS
- can animate by adding a slider bar (Excel)
- can animate by looping with a pause in between
- can animate by printing images and then using an image viewer to scroll through
- can animate by printing images and then putting them into a movie using ffmpeg
- can animate by printing images, putting them into ImageJ, and saving as an animated gif: http://imagej.nih.gov/ij/
Plotting to Image File
# draws a graph to png
drawOn <- function(x,y) {
  windows(bg="white", width=x, height=y)
  # LATER: for LAS=2, set margin bottom from 4.5 to 7.0
  # margin order: bottom, left, top, right
  par(cex.axis=0.9, mar=c(4.5,3.2,2.5,0.5)+0.1, mgp=c(2.2,0.8,0))   # set to tight for Word insertion
}
# PUT THE BODY BETWEEN HERE
drawOff <- function(x,y,str_id) {
  par(cex.axis=1, mar=c(5.1,4.1,4.1,2.1), mgp=c(3,1,0))   # set back to defaults
  fname <- sprintf("fig_%s.png",str_id)
  dev.print(png, file=fname, width=x, height=y)
  dev.off()
}

# Example
drawOn(500,500)
plot(rnorm(1000,0,1), rnorm(1000,0,1), xlab="X position", ylab="Y position",
     main="Random Buoy movement above an anchored location")
drawOff(500,500,"01")

# Function for preliminary analysis --- automatically reading in a file, plotting it, and writing the graphic to a file
do = function(filename) {
  # get input
  d1 = read.csv(filename, header=FALSE)
  names(d1) = c('t','v')
  # plot to screen and png
  drawOn(500,500)
  plot(d1$t, d1$v, xlab="Time (s)\n(Sampling Resolution: 1 us)",
       ylab="Volts (V), (Voltage Resolution: 0.01 V)")
  drawOff(500,500,filename)
}

Ref 1
@@1D1 R ON NUMBERS (SCALARS)
Arithmetic: + - * / ^ **
Notice: exponentiation is either ^ or ** (Matlab and Python/Ruby syntax)
Mod: %%
Comparison: < > <= != ==
Strings: ' "
is.na !is.na
Higher Functions: exp, log, log10, pi, sin, cos [note: sin and cos work in radians]
Probabilities: dnorm, pnorm, qnorm; likewise pois, exp, gamma, beta, ...
Assignment to Variables:
Operator: <- (works at every level)
= (only works at the top level, or for setting parameters inside function calls)
Though actually there are arguments for using = to avoid bugs: http://www.win-vector.com/blog/2013/04/prefer-for-assignment-in-r/

@@PITFALL: Assignment: <- not =
The first thing you'll note is that R idiom uses <-. Most newbies would prefer =. But the problem is that = only works at the TOP LEVEL, i.e. it won't work as an assignment inside function calls etc. So while in 99% of basic cases there is no difference, doing anything advanced will require you to use <-, so better get used to it now.
http://stackoverflow.com/a/1742550/181638
http://ig2600.blogspot.co.uk/2012/04/5-assignment-operators-of-r.html

@@PITFALL: Trapdoors from Numerical Mathematics
Machine Numbers: Don't forget that you're working on a machine! 0 is NOT exact. Subtraction of equal numbers is NOT exact. As an example: sin(pi) is NOT zero, but 1.22e-16. So when dealing with 0, you have to TRAP the case, e.g. IF X > -1e-8 AND X < 1e-8 THEN 0 ELSE X ENDIF

@@1D2 Data Structures in R
@@1D2a LIST / ARRAY / VECTOR / TUPLE MODE
A list is the generic object. Because R can add and multiply two lists element-wise, we automatically get tuple-mode and array-mode. Vectors are the mathematical designation -- from a computational point of view, * stays element-wise; for the matrix-style inner product use %*%.
c(1,2,3,4) + c(5,6,7,8)   # any operation will work ELEMENT-WISE: +,-,*,/,** or ^
x <- c(1,2,3,4)
x <- 1:4   # better
sum(x)     # sum
prod(x)    # product
How to do a running sum? You need sapply, or sequential apply:
sapply(sequence, function to apply to each element, other arguments of the function)
rsum <- function(i,x) { sum(x[1:i]) }   # x is a vector, i is the index
sapply(1:length(x), rsum, x)   # runs over each index and calculates the running sum of x, returning the result in a sequence
Note: to pass multiple arguments, list the 2nd and successive ones after the function.
How to do a running product? The same way.

@@1D2b MATRIX MODE
a <- matrix(c(1,2,3,4), nrow=2, ncol=2, byrow=TRUE)
b <- matrix(c(5,6,7,8), nrow=2, ncol=2, byrow=TRUE)
a+b   # NOTE: all operations are still element-wise
a %*% b    # MATRIX-MATRIX multiplication
t(a)       # transpose
solve(a)   # calculate A^-1
NOTE: vectors are treated as COLUMNS when doing matrix multiplication, so the usual conventions apply: v'v
v <- c(1,2,3,4)
t(v) %*% v   # dot product
solve(A,b)   # solves Ax = b if A is square, invertible, and b is in the range of A
. <- function(v1,v2) { t(v1) %*% v2 }   # dot product
How to define a new operator, e.g. v.v? Use the %...% infix syntax:
"%.%" <- function(v1,v2) { t(v1) %*% v2 }   # now v %.% v gives the dot product

@@1E USE R IN GRAPHING MODE.
@@1F USE R IN STATISTICAL SUMMARY MODE.
summary, mean (average), median, quartiles, IQR

Numeric Summarisation:
Categorical (qualitative) variables:
levels(vec)           # list the distinct categories found in a vector of categorical variables
factor(vec)
levels(factor(vec))   # same as levels, but for an integer vector; returns the values as strings
as.integer(levels(factor(frm$report_week)))   # returns as integers
table(vec)            # list the levels and counts
Quantitative variables:
summary(vec)          # produce the 5-number summary
Data frames having both quantitative and qualitative variables:
tapply(frm$measure, frm$category, summary)   # returns an array of lists, each list holding the 5-number summary **for that category**! very useful
Excellent reference: http://stackoverflow.com/a/7141669/181638
But usually you want:
by(frm, frm$category, summary)   # shows how all of the attributes in frm are summarised by class, !! but returns a vector
or:
aggregate(frm$measure, list(frm$category), summary)        # ** returns a data frame ** -- notice you have to coerce the category into a list
s <- aggregate(frm[,c(1:3)], list(frm$category), summary)  # ** returns a data frame ** -- notice you can even have multiple measurements! :)
aggregate(frm[,cols], list(rep(1,nrow(frm))), summary)     # return a summary of everything -- i.e. all rows in the same class
Note: you can use median or mean instead of summary.
or: ave provides the element-wise average, so it is useful with a list of lists (aka a matrix or dataframe), or with sapply
or: ddply (in package plyr)

@@2 A DEEPER UNDERSTANDING OF R
@@2A R's Data Structures
@@2A1 arrays, vectors, lists
Creating:
c( )   # create, concatenate
seq( )
rep( )
1:5
Indexing into an array is done with square brackets, e.g. x[2].
End of an array: length(x). End of a row/column of a matrix: ncol(row_x), nrow(col_x).
Any list of integers can be used as indices into a further list: v[1:5]
You can also specify what you *don't* want, i.e. set complements. This is very powerful. e.g.
x <- c(1,2,3,4)
x[-1]       >> 2 3 4
x[-2]       >> 1 3 4
x[-(1:2)]   >> 3 4
which( condition )   # returns the INDEX/INDICES of a list corresponding to the condition(s)
subset(x, expr)

@@2A2 matrices, tables, frames
frm[1,]    # row 1
frm[,1]    # column 1 (matrix notation)
frm$col    # select column by name
frm[[1]]   # select column by number
subset(x, expr, cols)   # select subset
subset(frm, new_dg=="D1", sig_nois_rat)   # example

@@2A3 data types
numeric, float, int, ...

@@2A4 INTROSPECTION
When things go wrong, as they always will, it is handy to find out what the data type is.

@@2B Working with Lists
In R you should be working with lists (vectors). Scalar operations act on a list but may need na.rm=TRUE.
Example: min, max
library(datasets)
d <- airquality$Ozone
max(d, na.rm=TRUE)
The FUNCTIONAL PROGRAMMING paradigm creates vector operations that act element-wise on the list while avoiding explicit looping in R. In R, these are the *apply functions, summarized here:
**Mnemonics**
* lapply is a *list* apply (acts on a list and returns a list or list of lists)
* sapply is a *shaped* sequential apply (lapply but shaping output naturally -- you will usually use this one)
* vapply is a *vectored* apply (vectored sapply for speed, returns atomic vectors)
* rapply is a *recursive* apply (for recursive lists, i.e. lists within lists)
* tapply is a *tagged* apply (the tags identify the subsets)
* apply is *generic* (applies a function to a matrix's rows or columns)
Using the apply family requires a change in perspective (users of Lisp will recognise the paradigm immediately):
- Advanced R: Functional Programming, by Hadley Wickham
- Simple Functional Programming in R, by Michael Barton
Example:
square <- function(val) { val**2 }
sapply(d[!is.na(d)], square)

@@2C Working with Frames
@@2C1 Subsetting

@@3 PROGRAMMING R
Programming in R is essential for efficiency and reproducibility.
@@PITFALL! R is CASE SENSITIVE. So myVar is not the same as myvar. Be especially careful when importing from a database.

@@3B R programming style
A bit of Forth (refactoring, small definitions), a bit of functional, a bit of objects.
@@PITFALL! DRY: Don't repeat yourself! In R it is very easy to do copy-paste development. This is a terrible habit to get into. It means you will have very long files, and iteration will be slow. Get into the habit of honouring DRY. If you need to repeat yourself, REFACTOR into a function.
@@PITFALL! Clean up as you go. As you refactor, don't forget to clean up old definitions so that you don't stumble across the wrong one. NOTE: You don't have to redefine definitions that include other functions that you have changed: R resolves references at RUN-TIME, not COMPILE-TIME (like Forth). Finally, always re-test a long script after clearing out all of R's memory before finishing, to see that everything works "from scratch".
ls()                   # shows all the user objects
remove(x)
remove(list=ls())      # removes all your objects, variables, functions, definitions
rm(list=ls(all=TRUE))
rm(list= ls()[!(ls() %in% c('keepThis','andThis'))])   # removes all objects except keepThis and andThis

@@3C Pragmatic Programming
Comments

@@3A Defining a function

@@3C Strings & Formatting in R
Working with Console Output:
cat is a console output that adds a space between parameters:
cat("\nLoading",length(rownames(data)),"records...\n")
print()
sprintf("%s %s",str1,str2)   # NICELY FORMATTED OUTPUT

@@3D Console INPUT
scan()

@@3D Looping in R & Conditionals
if
for: for (x in seq(0,1,by=0.3)) { cat(x) }
Filling a list:
# this forms a ragged list of lists
L <- list()
for (x in seq(1,100)) {
  N <- 1+rpois(1,5)        # ensures no 0-length lists
  L[[x]] <- rnorm(N,0,1)   # lists are 1-indexed in R
}
while, repeat, break, next, switch
Q: How can I use a loop to [...insert task here...]?
A: Don't. Use one of the apply functions.
blog: http://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/
See below: numeric summarisation, and the excellent reference: http://stackoverflow.com/a/7141669/181638

@@4 PRODUCING WITH R
In this chapter we look at R AS A PLATFORM FOR PRODUCTIVITY.
@@4A Documentation, Revision Control, SWEAVE

@@5 STATISTICS, DATA MINING, ALGORITHMS, and SIMULATION WITH R
Finally, the place where R really shines --- as a platform for statisticians, applied mathematicians, scientists & data scientists, algorithmicians, quants, and anyone else who likes to play with mathematical ideas professionally or for pleasure.
@@5A References
Good ones are brilliant, and can make the entry smooth even for beginners.
Book: Introduction to Statistical Learning with R, 2014
Site: http://a-little-book-of-r-for-time-series.readthedocs.org/en/latest/src/timeseries.html#forecasts-using-exponential-smoothing

@@5C Local Web App with R Shiny
http://shiny.rstudio.com/
A web application framework for R. Turn your analyses into interactive web applications. No HTML, CSS, or JavaScript knowledge required.

@@5D Platforms, Plugins
RExcel - Excel plugin
RCommander - environment
RGUI - environment
RStudio - IDE

@@6 A COLLECTION OF USEFUL SNIPPETS, RECIPES, TRICKS
USEFUL EXAMPLES

TO KEEP A LOG OF RESULTS -- NOTE --- sink does NOT log the commands, just the output
sink("diary.txt", split=TRUE)                # directs output to sink file AND stdout
sink("diary.txt", append=TRUE, split=TRUE)   # appends to sink file
sink("log.txt")                              # output ONLY to the file
sink()                                       # closes the logging of stdout
print() does work with sink().
WARNING: sprintf() doesn't seem to work with sink??
gettextf() does NOT work with sink (v2.6.1).
capture.output() and sink() both choke on sprintf() and gettextf() -- WEIRD!

TO STORE THE COMMAND HISTORY
history(numlines)
savehistory("history.txt")
loadhistory()

TO SEE A DIRECTORY LISTING
dir()

TO SEE A LIST OF ALL OBJECTS IN MEMORY
objects() or ls()

TO REMOVE OBJECTS (KILL, DELETE, DROP)
remove(..., ...) or rm()

TO REMOVE ALL OBJECTS
remove(list=objects()) or rm(list=ls())

CONFIGURATION
options()           # shows all options
options(digits=4)

TO RUN A SYSTEM COMMAND
shell()    # less primitive
system()   # primitive

Ref 2
Environment: CASE SENSITIVE
R is case sensitive, so nAME is not the same as Name or NAMe.
#            # comment
?ref         # quick way to say help; same as help(ref)
history(N)   # show last N commands
dir()        # show current directory listing
ls()         # see defined objects
rm(obj)      # remove object; rm(list=ls()) to remove all!
sink(filename, append=TRUE, split=TRUE)   # appends to sink file while also directing output to stdout --- useful for logging results
sink()       # close file and stop logging

Finding out what data type an object is (introspection) / dumping / printing its contents:
class(obj)    # e.g. data.frame
typeof(obj)   # e.g. list
str(obj)      # *** displays the contents of the object in a useful way
print(obj)
dput(obj)
dump(obj)

HELP:
help(keyword)
help.start()            # starts online help
help(base)              # the base package of R
help(stats)             # the stats package of R
library(help="stats")   # shows all the stat functions in the stats package
google... stackoverflow... http://www.rseek.org/ [http://cran.r-project.org/web/packages/sos/index.html]

Operators:
sort(vec) sort(vec,TRUE) sort(vec,decreasing=TRUE)   # sort a column, either asc (default) or desc (TRUE)
frm[ with( frm, order(-ColOne,ColTwo)), cols ]       # sort a data table with ColOne desc, then ColTwo asc

Defining Subroutines, Procedures, Functions:
funcname = function(param1) {
  ... body ...
  0   # the last value is returned
}
so: y <- funcname(param1)   # returns 0

Global & Local Variables: http://stackoverflow.com/questions/10904124/global-and-local-variables-in-r
globvar <- ...
globfunc <- function( ) { ... }
f <- function( x ) {
  localvar <- x
  (localvar - median(localvar)) / max(-min(localvar),max(localvar))
}
g <- function ( x ) {
  globvar <<- x   # <<- assigns into the enclosing/global environment
  ...
}

Randomness:
set.seed(2)
rnorm(10)       # return 10 random values from the normal distribution N(0,1)
rnorm(10,u,s)   # normal N(mu,sigma)
runif(10)       # uniform random U[0,1]
runif(10,a,b)   # uniform random U[a,b]
rpois(10,1)     # Poisson random

Data Structures: Know your structures!
sequence, list, array, vector
data frame, table
matrix -- really an array, but treated like a 2D array

Working with Sequences (a specific type of array):
1:5   seq(1,5)   seq(1,6,2)   # create sequence

Working with vectors/arrays: vec
vec <- c(1,2,3)   # creation using the concatenation operator. Any sequence, singleton, or array can be concatenated
vec[a]            # index element(s) a; a is any sequence, singleton or array
length(vec)       # size

Working with matrices: mat
mat <- matrix(rnorm(10*2), ncol=2)   # 10x2 matrix

Working with Data Frames: frm
What is a data frame:
Size: nrow(frm) ncol(frm)   # returns row or col dimensions
View: frm[x,y] frm[x,] frm[,y] frm[1,vec] frm[a:b,1]   # show element, row, column, indexed rows or columns (vec), sequence of rows or columns
frm["ColOne"] frm$ColOne                 # use named columns to show column as column or row ($)
names(frm) colnames(frm) rownames(frm)   # show column or row names
Filter: rows <- which(frm$ColOne > 20)   # returns rows of dataset satisfying the filter condition
Subset (create a subset table based on a filter):
frm2 <- subset(frm, frm$ColOne > 20)[a:b]   # pull off specified columns for all rows satisfying the filter
Subset -- alternative:
w <- which(df$x > 0.9)
data.frame(which=w, df=df[w, ])   # creates and specifies column names
Re-ordering a data frame: frm[with(frm,order(colname1,-colname2)),]
Reset the row numbers (rowID) of a data frame: row.names(frm) <- NULL

Appendix 6: Using Scripts
You can either start rterm.exe or rgui.exe and use the command
source("helloworld.r")
OR you can use
rcmd.exe batch helloworld.r
on the command line, and the output will be written to helloworld.Rout.
### BATCH PROCESSING INFO is at help(BATCH) in R.
options(echo=FALSE)
#suppresses the text from output when running in Rcmd.exe
#when this is used in RGUI, you don't get the PROMPT
# to see the prompt again, run
options(echo=TRUE)
#RGUI is also a LOT slower with echo=FALSE, and you still see the instructions anyway...

Ref 3
CRAN packages: http://cran.r-project.org/web/packages/
Recommended Links:
http://www.statmethods.net/graphs/scatterplot.html
http://www.cyclismo.org/tutorial/R/

options(echo=F)   # setting echo to off means that the OUTPUT to the console is NOT interspersed with > signs. You can then copy and paste items into your script much more easily.

RECOMMENDED DEFAULT OPTIONS
options("digits"=5)   # default is 7; prints 5 SIGNIFICANT FIGURES only, e.g. 3.1416 NOT 3.14159. Rounds everything else.
options("scipen"=1)   # default is 0; biases whether to use fixed-width or scientific notation in numbers
options("width"=80)   # set this so that you have a fixed-width output

USEFUL COMMANDS
cat()   # no quotes, no line numbers, only explicit newlines, so can concatenate values etc. on the same line
format(), write(), print.default()
print(x, quote=FALSE)   # leaves off the quotes
?? why does R print with [1] ? aaah! these are LINE NUMBERS!
noquote()    # print without quotes
methods(print)
formatC()    # formatting numbers using C style
paste()
gettext()
cat(s, sep="\n")
cat(a, fill=TRUE)

HELP
help(command)
help(library=base)
help(any index label)
help(Logic)
??keyword

COMMENT
# this is a comment
if(FALSE){ this can serve as a multiline comment; R will never execute it (but the block must still parse, so it only works for syntactically valid code) }

NOTE! ARRAYS ARE INDEXED FROM *ONE*, not ZERO.
(For installation, see Toolchain\R)

R repositories
CRAN
Rforge: http://www.rforge.net/

R and Reporting
brew: http://www.rforge.net/brew/svn.html
sweave (for documentation): http://www.stat.uni-muenchen.de/~leisch/Sweave/
RSRuby (only for UNIX so far)

Knowledge Bases:
How-to's for statistics: http://www.statmethods.net/advgraphs/axes.html
Numerical Methods in Statistics: http://www4.stat.ncsu.edu/~monahan/jun08/toc.html
R Cookbook: http://www.r-cookbook.com/

---------- Value of R ----------
1. Exploratory Data Analysis using R
John W. Tukey began his classic book, Exploratory Data Analysis (EDA), stating that "EDA is detective work--numerical detective work--or counting detective work--or graphical detective work." He continued: "A detective investigating a crime needs both tools and understanding. If he has no fingerprint powder, he will fail to find fingerprints on most surfaces. If he does not understand where the criminal is likely to have put his fingers, he will not look in the right places. Equally, the analyst of data needs both tools and understanding." (Tukey, 1977, pg 1). The goal of this guide is to introduce readers to using R as a tool to take part in this detective work. http://www.r-cookbook.com/node/32

2. R as a graphing calculator
# setup
R = 10   # uniform [-N,N] range
x = seq(-R,R,.5)
plot(0,0, xlim=c(-R,R), ylim=c(-R,R))
lines(0*x,x)   # draws y axis
lines(x,0*x)   # draws x axis
# functions
y = 2*x+5      # for example
lines(x,y)
# points
points(9,9)                      # black, circle
points(9,7, pch=23, col="red")   # red, diamond
# points from file
points = read.csv("points.csv", header=FALSE)
points(points, pch=23, col="red")

3. R as a matrix calculator

------------- Concepts of R -------------
1. To quit: q()
2. To escape out of a command halfway in progress: ESC
3. # is the comment indicator
4. R can be used as an overgrown calculator
5. Assignment operator: <- (e.g. x <- 2), also = (e.g. x = 2). Any difference? Only that = works just at the top level (see the assignment pitfall above).
6. Symbolic variables:
- names *are* case-sensitive (so WT is not the same as wt)
- reserved names: c, q, t, C, D, F, I, T, diff, df, pt
7. Data vectors can be created with c():
- c(...,...) creates data vectors by listing their elements (e.g. weight <- c(50, 72, 57, 90, 95, 72))
Vector arithmetic is element-wise, with a shorter vector cycled (recycled) if the length of the longer is a multiple of the length of the shorter -- this is the convention for vector arithmetic. This makes it easy to do stats calculations such as mean and std. deviation, e.g. mean(x), sd(x), t.test(x). Notice: use the p-value to assess your hypothesis relative to the theoretical mean mu. Use the 95% confidence interval to see what the sample tells us about the 95% confidence range of the true mean if we were estimating it.
8. Plots:
- plot(x,y) -- one vector against the other, or, by default, against the index of the vector
- lines(x,y) -- plots and connects the plotted points by a line and adds the line to the existing plot. NOTE: x should be sorted!

--------------------------------- HIERARCHY OF MATHEMATICAL OBJECTS ---------------------------------
scalars: pi, exp(1)
functions: sin(pi), cos(pi) = -1
vectors / arrays:
o = c(1,2,3)   # creates vector/array
o[1]           # access first element -- notice, arrays are indexed from 1
rep(vector,2)                        # repeats vector twice, one after the other
rep(vector, rep(2,length(vector)))   # repeats each element of vector twice
a:b   seq(a,b,step)
matrix():
%*% is matrix multiplication
v = 1:14
matrix(v, nrow=2, byrow=FALSE)   # assigning down columns -- this is R's standard for tables
matrix(v, nrow=2, byrow=TRUE)    # assigning across rows
Example: extracting independent tables from a three-way table:
bp5i = table(ofacBPmeds, ofacBP, HasIns)      # 2 levels, 4 levels, 2 levels
t1 = matrix(bp5i[1:8], nrow=2, byrow=FALSE)   # first table of 8
t2 = matrix(bp5i[9:16], nrow=2, byrow=FALSE)  # second table of 8
Matrix by row binding or column binding:
rbind(r1,r2,r3,...)   cbind(c1,c2,c3,...)
v = 1:14          # FIRST assign values to a vector, THEN dimension into a matrix
dim(v) = c(2,7)   # row by column, COLUMN MAJOR!
v[1,2]   # shows element i,j
v[1, ]   # shows row 1
v[ ,1]   # shows column 1
data frame:
col1 = ...
col2 = ...
data = data.frame(col1, col2, ...)
is.data.frame(d)
COLUMN SUM: sum(matrix[,1])
ROW SUM: sum(matrix[1,])
CATEGORICAL VARIABLES
categories = c("level1", "level2", ...)
catcounts = c(x1, x2, ...)
rep(categories, catcounts)
table(factor(natpop, levels=agelab))   # allows ordered labels (not autosorted)

===================== R Language 2/23/2006 11:17AM AKE

QUALITY DATA ANALYSIS WORK
--------------------------
Always three things:
- numbers (absolute counts as well as relative percentages)
- words
- pictures (charts, graphs, visuals)
(summary) (then later statistics) (subgroup exploratory data analysis) (pivot by factors exploratory data analysis)

For categorical data:
table counts (or ftable -- formatted table -- better)
ftable(addmargins(table(HasIns, ofacBPmeds)))
table percentages
plot (percentages)
For numerical data, show:
quantile calculations (5-number summary: min, max, median, mean, quartiles, IQR)
histograms (if there is a notion of categorical grouping into bins, e.g. age bins)
plots
box plot
For comparison of grouped categorical data, show: plot of two-way tables
For comparisons of grouped numerical data, show: side-by-side boxplots

For continuous variables, test for non-normality:
# - remove NaNs
data.Age <- data$Age[!is.na(data$Age)]
# - form histogram
data.Age.hist <- hist(data.Age)
# - view statistics
quantile(data.Age)
mean(data.Age)
sd(data.Age)
# - view **linear** correlation between two continuous variables
cor(Age, Creatinine, use="pairwise")
Is correlation robust? I.e. is it pulled by outliers? NO! It IS pulled by outliers...
# **linear** modeling:
linfit = lm(response ~ predictor terms)
linfit = lm(y ~ 0 + x)   # to force not using a y-intercept
abline(linfit, lty=2)    # plot as a dotted line

# **quadratic** modeling:
quadfit = lm(response ~ term1 + I(term2^2))   # the I is essential so that ^ means power

# How to compare the quality of the fit? You want the magnitude of the sum of squared errors, so you're looking for the R-squared value.
linfit = lm(y ~ x)
summary(linfit)
quadfit = lm(y ~ x + I(x^2))
summary(quadfit)
plot(x, y)
curve(predict(quadfit, newdata=data.frame(x=x)), add=TRUE)
abline(linfit, lty=2)

# general modeling (non-linear least squares)
# note: start names the model *parameters* (here a, b), not the data columns
nlsfit = nls(CholLDL ~ a*Age^2 + b*Age, data=thisframe, trace=TRUE, start=list(a=1, b=1))

# - create normal distribution model
med.Age = median(data$Age, na.rm=TRUE)
IQR.Age = summary(data$Age)[5] - summary(data$Age)[2]   # IQR
norm.Age = rnorm(1000000, mean=med.Age, sd=IQR.Age)
plot(density(sort(data$Age)))
plot(density(sort(norm.Age)))   #, xlab="Age", ylab="Density", sub="Figure 7", main="Age Distribution of the National Sample"
# - set breaks
Agebreaks <- c(0, data.Age.hist$breaks)
# - plot normal distribution
norm.Age.hist <- hist(norm.Age, breaks=Agebreaks)
# - plot comparisons
windows()
view <- matrix(1:2, nrow=1, byrow=T)
layout(view)
hist(data.Age, freq=FALSE)
hist(norm.Age, breaks=Agebreaks, freq=FALSE)
# - count comparisons

============================================ Statistical Methods in R
Assad E.K. Ebrahim, M.Sc.
11/30/2007 8:40AM
3/6/2006 12:59PM
-------------------------

--- ROWS, COLUMNS, AND CONDITIONAL SELECTION

HOW MANY DATA ROWS IN A TABLE? (i.e. not including the header row)
length(rownames(data))

SHOW A ROW
Create a sequence of column indices and then show:
indices <- seq(2,10,1)
data[3,indices]   # shows row three, columns 2 through 10, stepping through cols singly
OR choose your columns:
indices <- c(1,2,3,5,6,9)
data[2,indices]   # shows the selected columns for row 2
data[1:3]         # shows all rows with columns 1 through 3 inclusive
data["BPsys"]     # shows as a row
data$BPsys        # shows as a column

HOW MANY COLUMNS?
length(names(data))

SHOW COLUMN HEADERS
names(data)
colnames(data)

SHOW A SUMMARIZED COLUMN WITH LEVELS AT THE END
data$column   # referenced by header name
data[,1]      # referenced by column number

SHOW A COLUMN
data[1]

SHOW MULTIPLE COLUMNS
data[2:4]   # shows columns 2 through 4

TO SEE ALL THE LEVELS ASSOCIATED WITH A COLUMN:
levels(data$colname)

QUERY
which(data$freq==...)   # does NOT include NA's
which(data$HeightIn >= 11)
rows = which(data$HeightIn >= 12)
cols = c(1,2,28,29,30)
data[rows,cols]

QUERY HOW MANY NA's, Non-NA's
length(which(is.na(data$HasHDinMaleFam)))
length(which(!is.na(data$HasHDinMaleFam)))

SORT A COLUMN
sort(vector)
s = sort(vector, index.return=TRUE)   # returns the index vector for the sort
s$ix   # index vector: element i says WHICH element of the original vector got put into the ith place of the sorted vector
s$x    # sorted result

SORT TABLE BY A COLUMN
order()

SUBTABLES
To create a subtable:
subtable <- data.frame(table$colname, table$colname2)
subset(dataframe, condition to select)
e.g. subset(data, data$Age < 18)[1:9]
Make a two-way table from a dataframe: table(data)

CONDITIONAL DEFINITIONS
Defining a population from data:
mwht.male <- mw$HeightTotalIn[mw$Gender=="M"]

COUNTS
COUNT HOW MANY ARE OF A CERTAIN KIND - either use a table or a conditional statement (Iverson notation!), e.g.
  mw$HeightTotal[mw$HeightTotal>5]          # INCLUDES NA's
  length(which(mw$HeightTotal>5))
  length(mw$HeightTotal[mw$HeightTotal>5])
  table(data$Gender)  # for categorical variables with fixed levels of a factor
OR use which:
  length(which(data$Age<7))                 # does NOT include NA's
  length(which(data$Age>9 & data$Age<10))   # should be zero
  length(which(data$Age>=9 & data$Age<10))
  length(which(data$Age==9))

KEY: in a two-way table, the first argument goes down the rows, i.e. fills a column.
barplot(table()[1:n]) fills the array DOWN the rows, i.e. a column at a time. So! To quickly get a nice visual of a two-way table with adjacent classes, put what you want to be adjacent in the FIRST table argument, then barplot it...
  barplot(table(Age,ofacBP)[1:16])
AND to make a summary percentage table, divide by your second argument, repeated as many times as you have rows, and then columns, e.g.
  barplot(table(Age,ofacBP)[1:16]/rep(table(ofacBP),rep(5,4)))
Use my special function to make life easy (a sketch of it appears at the end of this section):
  t1 = table(Age,ofacBP)
  t1p = pctable(t1)  # make percents according to column sums
To get row percents:
  t2 = table(ofacBP, Age)
  t2p = pctable(t2)
To plot with the BP grouping:
  barplot(t1p)
To plot side by side, grouped with BP adjacent:
  barplot(t1p[1:20])
To plot side by side, grouped by AGE, take a transpose first:
  barplot(t(t1p)[1:20])
  barplot(t)  # barplot with divided bars
  plot(t)     # with sectioned bars
To have the bars' thicknesses reflect the numbers, do barplot(t,t), where t is the table.

3-way tables
Example: extracting independent tables from a three-way table:
  bp5i = table(ofacBPmeds, ofacBP, HasIns)       # 2 levels, 4 levels, 2 levels
  t1 = matrix(bp5i[1:8], nrow=2, byrow=FALSE)    # first table of 8
  t2 = matrix(bp5i[9:16], nrow=2, byrow=FALSE)   # second table of 8
  labels(bp5i)     # labels for all three factors
  labels(bp5i)[1]  # labels for the first factor
# making a labeled table from a matrix
  rownames(t1) = c( ... )
  colnames(t1) = c( ... )
# Form Percentages
  t1 = matrix(bp5i[1:8], nrow=2, byrow=FALSE)
  t2 = matrix(bp5i[9:16], nrow=2, byrow=FALSE)
  rownames(t1) = c("Y","N")
  rownames(t2) = c("Y","N")
  colnames(t1) = labBP
  colnames(t2) = labBP
When barplotting three-way tables, leave as the DARK STUFF the variable that is most "active", i.e. the one you would visualize dramatically as "filling up" the beaker. Let the adjacent beakers (bars) be the experimental sequence, and then finally duplicate the experiment with the variable that can be visualized as the category and control...

OPTIMIZATION IN R
  min() max() which()
If min is:       min(data$freq)
then argmin is:  which(data$freq==min(data$freq))
and the VALUE at which something is minimized, say in a Z vs. freq graph, is found with:
  data$freq[which(data$Z == min(data$Z))]
  max(data$Age, na.rm=TRUE)

GROUPED DATA
Grouped data means having the data in one vector and a parallel vector telling which factor (group) each individual belongs to, by virtue of the particular data value and a membership criterion/selection function.

STATISTICAL SUMMARIES
  summary(dataframe)
  summary(matrix)
  summary(vector)

WORKING WITH SETS
-----------------
  union() intersect() setdiff() setequal()
  %in%  help("%in%")       : a boolean indicating set membership
  is.element(element,set)  : same
  !is.element(element,set)
  rows=which(!(1:(nrow(thisframe)) %in% msTot))  # those NOT in the set
  e.g. 1929 %in% rows_great
  data[rows,]  # show the table with only those rows

screeplot() text() mtext()
Rterm.exe accepts -f filename and -e expression! (v2.5+)
plotmath (v2.6+)
Rwui() provides an R web interface, so that R scripts can run in a website environment.
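The pctable() helper used above is from my own library and isn't reproduced in these notes; a minimal sketch of what it might look like, assuming it forms column percentages (base R's prop.table() does the work):

  pctable <- function(t) round(100 * prop.table(t, margin = 2), 1)  # percents by column sums
  # for row percents, build the table with the arguments swapped, as above,
  # or use prop.table(t, margin = 1)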
Cairo package for better graphics rendering, and bmp, png, tiff, etc. outputs
gcl package for fuzzy rules classification
gWidgets package is probably the way to go for simple, effective user interfaces in R scripts
hints package helps you remember the functions in a package
update.packages()
FKBL package supports fuzzy inference
sets package for fuzzy sets and generalized sets

======================================================
PLOT BEAUTIFICATION / PRETTY PRETTIFICATION

font sizes: cex
  cex.axis (tick marks and their annotations)
  cex.lab  (x axis and y axis labels)
  cex.main (title)
  cex.sub  (subtitle)

PLOTTING SYMBOLS
pchShow <- function(extras = c("*",".", "o","O","0","+","-","|","%","#"),
                    cex = 3,  ## good for both .Device=="postscript" and "x11"
                    col = "red3", bg = "gold", coltext = "brown", cextext = 1.2,
                    main = paste("plot symbols : points (... pch = *, cex =", cex,")")) {
  nex <- length(extras)
  np  <- 26 + nex
  ipch <- 0:(np-1)
  k <- floor(sqrt(np))
  dd <- c(-1,1)/2
  rx <- dd + range(ix <- ipch %/% k)
  ry <- dd + range(iy <- 3 + (k-1) - ipch %% k)
  pch <- as.list(ipch)  # list with integers & strings
  if(nex > 0) pch[26 + 1:nex] <- as.list(extras)
  plot(rx, ry, type="n", axes = FALSE, xlab = "", ylab = "", main = main)
  abline(v = ix, h = iy, col = "lightgray", lty = "dotted")
  for(i in 1:np) {
    pc <- pch[[i]]
    ## 'col' symbols with a 'bg'-colored interior (where available):
    points(ix[i], iy[i], pch = pc, col = col, bg = bg, cex = cex)
    if(cextext > 0) text(ix[i] - 0.3, iy[i], pc, col = coltext, cex = cextext)
  }
}
pchShow()
pchShow(c("o","O","0"), cex = 2.5)
pchShow(NULL, cex = 4, cextext = 0, main = NULL)

colors: col col.axis col.lab col.main col.sub
label rotation: crt; las - 0 parallel, 2 perpendicular

MARGINS NICE AND TIGHT:
  def=par(cex.axis=0.75, mar=c(4.3,3.2,2,0)+0.1, mgp=c(2.2,0.8,0))
  ... plot stuff
  par(def)

DRAWING IT / EXPORTING IT:
It is convenient to use my plotting library, which sets the sizes and outputs to a PNG:
  source("c:\\totalcmd\\R\\lib\\AKE\\akelib.r")  # obtain the drawing functions drawOn(), drawOff()
  drawOn(500,500)
  ... plot stuff
  drawOff(500,500,"fft_joint_relmax")  # produces a PNG
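drawOn()/drawOff() come from my akelib.r and aren't listed here; a guessed-at minimal equivalent, using the dev.print(png, ...) export pattern that appears later in these notes (the 72 dpi conversion is an assumption):

  drawOn <- function(w, h)          # open an on-screen device sized w x h pixels
    windows(width = w/72, height = h/72)
  drawOff <- function(w, h, name)   # copy the current plot out to <name>.png
    dev.print(png, file = paste(name, ".png", sep=""), width = w, height = h)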
SETTING UP A CUSTOM GRID
  plot(..., axes=FALSE, frame.plot=TRUE)  # call plot with no axes, but with a frame -- IF YOU WANT THE DEFAULT AXIS LABELS, make axes=TRUE
  # FOR INFO PURPOSES
  axTicks(1)  # shows you what the x axis is
  axTicks(2)  # shows you what the y axis is
  # SETTING UP GRID
  par(xaxp=c(0,250000,5))  # sets up x axis with 5 ticks over the range 0 to 250000
  par(yaxp=c(0,200,8))     # sets up y axis with 8 ticks over the range 0 to 200
  axis(1)  # plots first axis  -- IF YOU WANT THE DEFAULT AXES, COMMENT OUT THESE TWO LINES
  axis(2)  # plots second axis
  grid()   # then puts on the grid; default is lty="dotted", solid is lty=1

SELECTING COLORS
  colors()  # shows the available list

TYPES OF PLOTS
Make a stem plot from numeric data:
  stem(data$numericcol)
  stem(as.numeric(data$shouldbebutisntnumeric))  # type conversion to the right data type: as.numeric
Make a plot of data:
  plot(data$col)
  plot(table(data$col))  # makes a plot that shows outliers more clearly
Adding plots:
  plot(x)
  curve(sin(x), add=TRUE)
Make a boxplot of numerical data:
  boxplot(jk06$Age)
Make side-by-side boxplots of grouped numerical data:
  boxplot(age.nt, age.se, age.sw, age.ww, names=c("NT", "SE", "SW", "West"), ylab="Age", main="Age by Region")
Make a histogram of numerical data:
  hist(jk06$Age)
You can specify the breaks with breaks= ; you can specify relative frequency (percentages) with freq=FALSE.
  table(data$Region, data$Gender)  # a two-way table

PLOTS FOR TABLES
  barplot(table())  # a plot of a table; even a 2-way table is nice!!

HOW TO GET TEXT LABELS TO BE VERTICAL?
  par(las=2)  # vertical
  par(las=1)  # horizontal

HOW TO GET SUBTEXT
  mtext("Subtext")

HOW TO ADD AN AXIS
  states=c("DYING","SEVERE","MILD","HEALTHY")
  axis(2,at=0:3,labels=states,cex.axis=.6,las=1)

HOW TO SET UP A CUSTOM PLOT
  plot(..., axes=FALSE, frame.plot=TRUE)  # call plot with no axes, but with a frame
  axTicks(1)  # shows you what the x axis is
  axTicks(2)  # shows you what the y axis is
  par(xaxp=c(0,250000,5))  # sets up x axis with 5 ticks over the range 0 to 250000
  par(yaxp=c(0,200,8))     # sets up y axis with 8 ticks over the range 0 to 200
  axis(1)  # plots first axis
  axis(2)  # plots second axis
  grid()   # then puts on the grid

# Plot Parameters
  fmax = 250000
  fstep = 50000
  ymax = 200
  ystep = 25
# Unsmoothed Plot
  drawOn(500,500)
  # call plot with no axes, but with a frame
  plot(rec$freq, rec$fft_db, col="red", type="l", ylim=c(0,200),
       xlab = "Frequency (Hz)", ylab = "Spectral Coefficients (dB)",
       main = "Spectral Decomposition\n(BioSonics 204kHz Transducer)",
       axes=FALSE, frame.plot=TRUE)
  lines(tap$freq, tap$fft_db, col="blue", type="l", ylim=c(0,200))
  par(xaxp=c(0,fmax,fmax/fstep))  # sets up x axis ticks
  par(yaxp=c(0,ymax,ymax/ystep))  # sets up y axis ticks
  axis(1)  # plots first axis
  axis(2)  # plots second axis
  grid(col="blue")  # lays on the grid
  drawOff(500,500,"fft_joint_relmax")

HOW TO ADD A GRID
  grid(col="blue")
  drawOn(500,500)
  plot(rec$freq, filter(rec$fft_db,ma5), col="red", type="l", ylim=c(0,200),
       xlab = "Frequency (Hz)", ylab = "Spectral Coefficients (dB)",
       main = "Spectral Decomposition\n(BioSonics 204kHz Transducer)", axes=TRUE)
  lines(tap$freq, filter(tap$fft_db,ma5), col="blue", type="l", ylim=c(0,200))
  # horizontal lines
  abline(h=25)
  abline(h=75)
  abline(h=125)
  abline(h=175)
  grid(col="blue")
  # handmade darker grid
  for(ll in seq(-45,45,5)) { abline(v=ll,col="gray25") }

PLOTTING COUNT DATA
You can use the following:
(a) table: table(mw$JK)
    This will show the levels and the counts for each level (the levels are discrete categories).
(b) frequency histogram: plot(mw$JK)
    This shows a frequency histogram for each level (the levels are discrete categories).

BOXPLOTS
Boxplots provide a visual presentation of the 5-number summary of the data: min, max, median, and the quartile (25th and 75th percentile) values. So the middle 50% of the values lie within the hinges while the extreme 50% lie outside the hinges.
  boxplot(mw$Age)
  boxplot(mw$Age, sw$Age)  # side-by-side boxplots
  stripchart(mw$Age)       # good for small data sets

SCATTER PLOTS AND OVERPLOTS
  plot(mw$age, mw$BMI, pch='+')  # scatterplot
  points(sw$age, sw$BMI, pch=2)  # adding the points to the same plot, with triangle as the symbol

LINE PLOT
  lines(...)

POINTS & LINES
  type="p", "l", "b", "o"

COLORS
  colors()  # returns all color types
  blue red green orange brown black gray blue1 blue2 blue3 ...

MULTIPLE PLOTS ON A PAGE
  windows()
  view <- matrix(1:2, nrow=2, byrow=T)
  layout(view)
OR
  par(mfrow=c(3,2))  # multifigure setup: 3 rows, 2 cols

ADDING A LEGEND
  windows()
  plot(data$freq, data$Z, ylim=c(0,1500), main="Finding the Optimal Frequency",
       xlab = "Frequency (kHZ)", ylab="Z (ohms), Relative Phase Angle, Im(Z) (ohms)")
  points(data$freq, -1000*data$phrad, pch=2)
  points(data$freq, -1*data$Z*sin(data$phrad), pch=3)
  legend(locator(1), legend = c("Z (ohms)", "Relative Phase Angle (unitless)", "Im(Z) (ohms)"),
         col = c(1,1,1), text.col = "black", lty = c(2, -1, 1), pch = c(1, 2, 3),
         merge = TRUE, bg = 'white')  # locator(1): click on the plot to place the legend

MATH TEXT
  ytext = expression(V^2)
  ytext = expression(paste(plain(sin) * phi, " and ", plain(cos) * phi))
  ytext = expression(paste("Power ( ", V^2, ") (dB)"))
  plot(..., ylab=ytext)

3-D surface plot
  persp(seq(0,1,length.out=238), seq(0,1,length.out=3), d_scale)
  ## Default S3 method:
  persp(x = seq(0, 1, length.out = nrow(z)), y = seq(0, 1, length.out = ncol(z)), z,
        xlim = range(x), ylim = range(y), zlim = range(z, na.rm = TRUE),
        xlab = NULL, ylab = NULL, zlab = NULL, main = NULL, sub = NULL,
        theta = 0, phi = 15, r = sqrt(3), d = 1, scale = TRUE, expand = 1,
        col = "white", border = NULL, ltheta = -135, lphi = 0, shade = NA,
        box = TRUE, axes = TRUE, nticks = 5, ticktype = "simple", ...)
  persp(seq(0,5,length.out=238), seq(0,650,length.out=3), outer(d_scale[,1], d_scale[,2], d_scale[,3]))
  # (as in the original note; NB outer() expects a function as its third argument)

3-D scatter plot
Need to: Packages > Install > search for scatterplot3d
  library(scatterplot3d)
  scatterplot3d(x,y,z)
  # 3D Scatterplot with Coloring and Vertical Lines and Regression Plane
  library(scatterplot3d)
  attach(mtcars)
  s3d <- scatterplot3d(wt, disp, mpg, pch=16, highlight.3d=TRUE, type="h", main="3D Scatterplot")
  fit <- lm(mpg ~ wt+disp)
  s3d$plane3d(fit)
Interactive 3-D scatter plot
  library(rgl)
  plot3d(wt, disp, mpg, col="red", size=3)

===================================
12/27/2007 5:00AM AKE
Numerical Statistics

Arithmetic: scalar and vector; / is elementwise for arrays
  round(scalar, digits=2)  # digits is decimal places
  table(data$Region)       # gives you a counts array
You can create an array of totals:
  tot = c( . , . , .)
  samp = table(data$Region)
The sampling proportions are now:
  round(samp/tot, digits = 2)
  sum(vector)  - returns the sum of all the elements

Trig functions are in RADIANS! So use:
  phi_rad <- phi_deg * pi / 180
  sin( )
  pi
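A worked version of the counts-to-proportions step above, using the regional population totals that appear later in these notes (this assumes data$Region has exactly the seven levels FL, MW, NE, NT, SE, SW, W in R's alphabetical order):

  tot  <- c(3370, 4740, 4140, 7840, 11240, 16890, 5690)  # population totals
  samp <- table(data$Region)                             # sample counts, same order
  round(100 * samp / tot, digits = 2)                    # percent of each region sampled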
For continuous variables, test for non-normality:
- remove NaNs, e.g. jk06.Age <- jk06$Age[!is.na(jk06$Age)]
- form histogram, e.g. hist(jk06.Age)
- view statistics, e.g.
    quantile(jk06.Age)
    mean(jk06.Age)
    sd(jk06.Age)
    mad(jk06.Age)  -- the most robust alternative to the SD

Distributions
  require(stats)
  - gaussian
  - binomial
  - Poisson distribution (stats package):
      dpois - poisson density
      ppois - poisson cumulative distribution function
      qpois - poisson quantile function
      rpois - poisson random draw function
  - glm (generalized linear model)

Draws:
  - d,p,q,r : density, cumulative distribution (probability), quantile, random deviates
  - binom(count, num_trials, success_prob)
      dbinom - mass density function
      pbinom
  - uniform: dunif punif qunif runif
  - norm
  - exp
  - gamma distribution (stats package): help(GammaDist)
      dgamma() pgamma() qgamma() rgamma()
  - pois
  - weibull, cauchy, beta, t, f, chisq, binom, geom, hyper, logis, lnorm, nbinom, unif, wilcox

To draw 1 number randomly from the following distributions:
  rnorm(1)
  runif(1)
  rbinom(1,1,.5)   # a coin flip
  rbinom(1,10,.5)  # number of heads after 10 flips
  rpois(1,.5)      # lambda = .5
Note - for help on these, use the capitalized distribution name, e.g.
  help(Normal) help(Uniform) help(Binomial) help(Poisson)

  normAge <- rnorm(1000, mean=mean(jk06.Age), sd=sd(jk06.Age))
You get a smooth normal curve when n = 1,000,000 and the classes are of width 1, stretching from -50 to 150:
  hist(sort(rnorm(1000000,mean=43,sd=12)), breaks=seq(-50,150,1))

You can compute the probability below a given value as follows:
  pchisq(2.7, df=1)
This returns the probability mass below that amount, i.e. the cumulative distribution.

Test for normality / non-normality, example. For continuous variables, test for non-normality:
  sample <- rnorm(sz,mean=43,sd=12)
  sample.hist <- hist(sample)
  br <- sample.hist$breaks
  sC <- sample.hist$counts
  br.z <- (br - 43)/12
  pnorm(br.z)  # - this gives the cumulative distribution below that amount
  est.cum <- round(1000*pnorm(br.z))
  br.n <- length(br)
  estC <- est.cum[2:br.n] - est.cum[1:br.n-1]
  csq <- sum((sC - estC)^2/estC)
  ddf <- br.n - 3
  pchisq(csq,df=ddf)
  csq
  br.n
  ddf
===================================================
  sz <- 10000
  k <- 1  # no. of extreme values to skip
  brk <- seq(-10,110,5)
  sample <- rnorm(sz,mean=43,sd=12)
  sample.hist <- hist(sample, breaks=brk)
  br <- sample.hist$breaks
  sC <- sample.hist$counts
  br.z <- (br - 43)/12
  pnorm(br.z)  # - this gives the cumulative distribution below that amount
  est.cum <- round(1000*pnorm(br.z))
  br.n <- length(br)
  n <- br.n-(2*k)+1
  estC <- est.cum[(2+k):(br.n-k)] - est.cum[(1+k):(br.n-k-1)]
  csq <- sum((sC[(1+k):(length(sC)-k)] - estC)^2/estC)
  ddf <- n - 3
  pchisq(csq,df=ddf)
  estC
  csq
  br.n
  sz
  k
  n
  ddf
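For a quick built-in check, base R also ships a formal normality test and a Q-Q plot, which complement the hand-rolled chi-squared test above:

  shapiro.test(jk06.Age)  # Shapiro-Wilk test; a small p-value is evidence of non-normality (n must be <= 5000)
  qqnorm(jk06.Age)        # quantile-quantile plot against the normal
  qqline(jk06.Age)        # a straight-ish line suggests approximate normality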
=========================================================
Hypothesis Testing

Single Proportions Testing
--------------------------
  prop.test(numpos, numtot, popparam)
where
  numpos   is the number of positive outcomes
  numtot   is the total number of binary trials (positive, negative)
  popparam is the population proportion that you want to hypothesis-test against

Example: gender balance in the study population. F = 1442, out of 2814 individuals. But the known population proportion of females is .4836.
  > prop.test(1442,2814,.4836)
Gives output:
  1-sample proportions test with continuity correction
  data: 1442 out of 2814, null probability 0.4836
  X-squared = 9.2557, df = 1, p-value = 0.002348
  alternative hypothesis: true p is not equal to 0.4836
  95 percent confidence interval: 0.4937878 0.5310536
  sample estimates: p 0.5124378

Interpretation:
  p = 0.5124378
This is the sample estimate (proportion) from the study data that you have, i.e. p = numpos/numtot.
  95 percent confidence interval: 0.4937878 0.5310536
This is the 95% confidence interval around your study data's sample estimate (proportion value), i.e. the interval within which the proportions of 95% of repeated random samplings of that size would come out.
  X-squared = 9.2557, df = 1, p-value = 0.002348
This is the chi-squared value and the likelihood that a random sampling under your sample estimate's parameters would yield the hypothesized population parameter value. The p-value is the likelihood of obtaining a MORE EXTREME value away from the sample estimate's proportion.
The Null Hypothesis is: the population parameter is popparam.
The Alternate Hypothesis is: the true population parameter for the population from which your sample was drawn is NOT popparam.

Exact Binomial Test
--------------------------
  binom.test(numpos, numtot, popparam)

Fitting Lines
-------------
If A,B are correlated data sets,
  plot(A,B)  shows the cloud
  z = line(A,B)  # fits a line robustly (Tukey's resistant line)
  abline(coef(z))
or
  abline(coef(line(A,B)))

--------------------------------------
Data Editing in R
##**********************************************
###############################
print("Filter #11: HasHistHD");
print("1) NO -> N");
print("2) YES -> Y");
print("3) D -> NA");
### corrections
ct <- 0
colToMod = data$HasHistHD
# PROGRAMMING IN R
# i = row index; j = col index
for (i in 1:length(data$ID)) {
  if (!is.na(colToMod[i]) & colToMod[i] == "NO") {
    colToMod[i] <- "N"
    ct <- ct + 1  # increment counter
  } else if (!is.na(colToMod[i]) & colToMod[i] == "YES") {
    colToMod[i] <- "Y"
    ct <- ct + 1  # increment counter
  } else if (!is.na(colToMod[i]) & colToMod[i] == "D") {
    colToMod[i] <- NA
    ct <- ct + 1  # increment counter
  }
}
colToMod <- factor(colToMod)  # re-assess levels after this modification
if (ct > 0) {
  print(" Number of records modified: "); ct
} else {
  print("No records modified.")
}
data$HasHistHD = colToMod

------------------------------------------
Working with Categorical Data in R
  factor(data$JK)
  levels(data$JK)
You can CHANGE the levels of a data set easily by simply re-assigning the level element, e.g.
  levels(data$JK)[2] = "BGHM"

Tricks and Things
-----------------
prompt(data frame), or prompt(any object), produces a document describing that object.
# Graphing
plot(anything)
dev.copy2eps()  # packages the graphic into an eps file
bitmap(file, type = "png256", height = 6, width = 6, res = 72, pointsize, ...)
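The record-by-record loop above can also be done in one vectorized pass by editing the factor levels directly (same column and codes as the filter above):

  f <- factor(data$HasHistHD)
  levels(f)[levels(f) == "NO"]  <- "N"  # NO -> N
  levels(f)[levels(f) == "YES"] <- "Y"  # YES -> Y
  f[f == "D"] <- NA                     # D -> NA
  f <- factor(f)                        # re-assess levels, as in the loop version
  data$HasHistHD <- f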
### Later define a table reversing function...
  t1 = table(ofacCH, ofacCL)
  t1r = rev(t1)
  dim(t1r) = c(3,5)
  rownames(t1r) = rev(labCH)
  colnames(t1r) = rev(labCL)
  t1rp = pctable(t1r)
  t1rptot = 100*round(t1r/sum(t1r),4)  # table with percentage of total population
  # making the names in prep for a barplot (using my array.r library)
  nc = concat(rep("LDL",5),":",colnames(t1r))
  nr = concat(rep("HDL",3),":",rownames(t1r))
  labc = rep(nc,rep(3,5))  # cycling through 3 repetitions each
  labr = rep(nr,5)         # cycling instances through 1 repetition each
  labt1r = concat(labc,"\n", labr)
  # printing an exploded barplot
  print("Figure 37a shows the Cholesterol Health assessment of the total population")
  drawOn(500,350)
  def=par(las=2, cex=.6, cex.axis=.8)
  barplot(t1rptot[1:15], ylim=c(0,30), main="Cholesterol Health Assessment",
          ylab = "Percent of Total Study Population", names = labt1r)
  par(def)
  drawOff(500,350,"37a")

strsplit()      # allows using regular expressions in R
system()        # makes a system call
capabilities()  # tells you what your system is capable of
CRAN packages can be downloaded from within R: update.packages(), download.file()
= is now allowed as an assignment operator
attach()  # for user-defined tables of variables
match() unique() duplicated()
The grid graphics package as a drawing tool.
split()  # splits a list by factors -- could be very useful
.C() allows for making C function calls!  .Call()  .External()
dotchart() stripchart() barplot() hist() legend() pie() polygon() rect() bxp() boxplot() contour()
sprintf()  # so we can now print variables:
  filename = "akhb2006.csv"
  sprintf("The filename is %s", filename)
princomp()
rpvm package: parallel virtual machine (a cluster of computers working as one)
widgets are available through library(tcltk)
# Having mathematical notation in plots! :)
  plot(0, main = expression(y == alpha*x[1] + beta*x[2]^2))
# VARIABLE SUBSTITUTIONS
  a <- 3.5
  x <- 1:2
  substitute(y == a + alpha*x[1] + beta*x[2]^2, list(a = a))
  plot(0, main = substitute(y == a + alpha*x[1] + beta*x[2]^2, list(a = a)))
rowsum(); colSums(), rowSums()
capabilities("PCRE")
grep(), sub(), gsub() and regexpr() have a new argument `perl' which, if TRUE, uses Perl-style regexps from PCRE (if installed). New capabilities option "PCRE" to say if PCRE is available.
stdin()
installed.packages()
capture.output()  # sends results from an output to a string. Wow!
example(topic)    # gives examples of usage!!!
dir()  # shows the current working directory contents
list.files()
GRASS - open source GIS
which.min(x) which.max(x)
R2HTML package

REGULAR EXPRESSIONS
help("regex")  -- in v1.8 R
For example, ^[[:space:]]* matches any amount of empty space at the start of a string, so
  sub("^[[:space:]]*", "", string)
will remove any leading whitespace characters from the strings (that is, replace it with nothing).
perl=TRUE allows the use of Perl-style regexes.

valid_name is a valid name in v1.9 and above
much faster startup in v1.9 and above!
dir.create()  # v1.9 and above
split.screen()
addmargins()  # adds row and column marginal summaries to tabular information
Graphviz package
quantile() rank()
grid graphics package
all() any()
# combinations
choose(n,k)
menu() message() plot.data.frame() recordGraphics() sunflowerplot()
write.csv() write.csv2()
R.matlab package
gsl wrapper
options()
TeachingDemos package
glpk: The GNU Linear Programming Kit (GLPK) version 4.8. This interface mirrors the GLPK C API. Almost all GLPK lpx routines are supported.
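concat() above comes from my array.r library and isn't listed here; judging from how it is called, a plain-base stand-in might be (an assumption about its behavior):

  concat <- function(a, sep, b) paste(a, b, sep = sep)  # elementwise, recycling the shorter vector
  concat(rep("LDL",3), ":", c("low","med","high"))
  # "LDL:low" "LDL:med" "LDL:high"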
neural network package
relax package -- functions for report writing, presentation and programming
outliers package -- for outlier detection
for() can use raw vectors as indices (like foreach in Perl) in v2.3+
Sys.time() date() timestamp()
options()  # there are many!
args(function)  # shows its arguments, e.g. args(options)
R commander package...
PearsonICA package
xlsReadWrite package -- natively read and write XLS data - in v2.4+
dev.print() -- in v2.5+
file("stdin")

Programming With R
------------------
R.h exists for C++
It is possible to use sockets!  make.socket  write.socket
nsl()  # looks up the IP address of a hostname (handy for testing for a connection)
Rcgi can provide R's output directly to the web!
Interfaces between R and other programming languages: the old ones are C and Fortran. The new ones are XML, Java, RSPerl, Python, CORBA. With v2.5, Objective-C is also supported! RSRuby (only for Unix): R & Ruby.
R Home: c:\totalcmd\R\rw261\..
libR shared library... c:\totalcmd\R\rw261\bin R.dll

Spectral Analysis
-----------------
-------------------------------------------------
In R, trig is in radians, so:
  x = seq(0,2*pi,.01)    # slice the circle into .01 steps
  plot(x,sin(x))         # plot one oscillation, x units are in radians
  plot(x,sin(2*pi*x))    # plot ~6 oscillations, x units are in s, f is 1 Hz
SO! The frequency (Hz) of a sine in radians, sin(k*t), is f = k/(2*pi): you are dividing the frequency coefficient by the natural period of sine (2*pi).
-------------------------------------------------
spectrum - ... of time series
  x = seq(1:1000)
  x = x/100
  y = sin(x)
  plot(x,y)
  A = 1
  k = 2
  c = 0
  h = 0
  y = A*sin(k*x + c) + h
  plot(x,y)
To see in radians:
  x = x / (2*pi)
  plot(x,y)
To see 15 radians' worth:
  x = seq(1:10000)
  x = x/100
  x = x/(2*pi)
  y = sin(x)
  plot(x,y,xlab="Radians")
To see the units in units of pi/2:
  plot(x/(pi/2), y)

Time Series
-----------
ts()      - create a time series from a vector
readts()  - read a time series from a two-column ascii file
plot.ts() - plots the points and connects them with a line; can plot multiple time series on the same plot, like boxplot...
stl()     - seasonal decomposition of a time series by loess

  k = c(.5,1,1,1,.5)  # k is the vector of weights
  (k = k/sum(k))
  [1] 0.125 0.250 0.250 0.250 0.125
  fjj = filter(jj, sides=2, k)  # ?filter for help [but you knew that already]
  plot(jj)
  lines(fjj, col="red")  # adds a line to the existing plot
  lines(lowess(jj), col="blue", lty="dashed")
Now a histogram and a Q-Q plot, one on top of the other:
  par(mfrow=c(2,1))       # set up the graphics
  hist(dljj, prob=T, 12)  # histogram
  lines(density(dljj))    # smooth it - ?density for details
  qqnorm(dljj)            # normal Q-Q plot
  qqline(dljj)            # add a line
Examples: vector jj
  jj = ts(jj, start=1960, frequency=4)   # creates a QUARTERLY earnings time series
  jj = ts(jj, start=1960, frequency=12)  # creates a MONTHLY earnings time series

Spectral Analysis
------------------
  x = arima.sim(list(order=c(2,0,0), ar=c(1,-.9)), n=2^8)  # some data
  (u = polyroot(c(1,-1,.9)))  # x is AR(2) w/ complex roots
  Arg(u[1])/(2*pi)            # dominant frequency around .16
  par(mfcol=c(3,1))
  plot.ts(x)
  spec.pgram(x, spans=c(3,3), log="no")  # nonparametric spectral estimate; also see spectrum()
Finally, note that R tapers and logs by default, so if you simply want the periodogram of a series, the command is spec.pgram(x, taper=0, fast=FALSE, detrend=FALSE, log="no"). If you just asked for spec.pgram(x), you wouldn't get the RAW periodogram, because the data are detrended, possibly padded, and tapered in spec.pgram, even though the title of the resulting graphic would say Raw Periodogram.
An easier way to get a raw periodogram is this: per=abs(fft(x))^2 .... duh. The moral of this story, and the bottom line: pay special attention to the defaults of the functions you're using. So, e.g.
  par(mfcol=c(4,1))  # 4 rows, 1 column
  x = seq(0,2*pi*10,.01)
  y1 = sin(x)
  plot(y1)
  spec.pgram(y1, taper=0, fast=FALSE, detrend=FALSE, log="no")
  y2 = sin(10*x)
  plot(y2)
  spec.pgram(y2, taper=0, fast=FALSE, detrend=FALSE, log="no")
Now the question is, how to get that frequency!
  sp = spec.pgram(y2, taper=0, fast=FALSE, detrend=FALSE, log="no")
The spectrogram is:
  plot(sp$freq,sp$spec)
  sp$freq[which(sp$spec == max(sp$spec))] * 2 * pi
Test:
  par(mfcol=c(2,1))
  x = seq(0,2*pi*10,.01)
  y3 = sin(5*x)  # frequency is 5
  plot(y3)
  spec.pgram(y3, taper=0, fast=FALSE, detrend=FALSE, log="no")
  sp = spec.pgram(y3, taper=0, fast=FALSE, detrend=FALSE, log="no")
  sp$freq[which(sp$spec == max(sp$spec))] * 2 * pi * 100

THIS IS IT -- SPECTRAL ANALYSIS IN R
------------------------------------
  par(mfcol=c(3,1))       # show 3 plots on the same chart
  x = seq(0,2*pi*10,.01)  # domain
  y1 = 3*sin(2*x)         # frequency 2, amplitude 3
  y2 = -1*sin(50*x)       # frequency 50, amplitude 1
  y3 = 2*sin(20*x)        # frequency 20, amplitude 2
  y = y1 + y2 + y3        # superposition of the three functions
  plot(x,y1,xlim=c(0,2),ylim=c(-6,6),main="Component functions",sub="y1 = 3*sin(2*x); y2 = -1*sin(50*x); y3=2*sin(20*x); y=y1+y2+y3 (red)")  # start the plots
  lines(x,y2)
  lines(x,y3)
  lines(x,y,col="red")
  plot(x,y,xlim=c(0,2),ylim=c(-6,6),col="red",main="Sum function",sub="y=y1+y2+y3")  # sum function by itself
  sp = spec.pgram(y, taper=0, fast=FALSE, detrend=FALSE, log="no")  # spectral decomposition
  sp$freq[which(sp$spec > 500)] * 2 * pi * 100  # tell me the frequencies, in order of strength (amplitude)

Min Finding
-----------
  w = seq(-50,50,1)
  fw = ....                 # (some function of w)
  i = which(fw == min(fw))  # this gives the index of the minimum
  w[i]                      # this gives the argument of the minimum, argmin

REVERSE a list, FLIP a list
  rev(c(1,2,3))

SMOOTHING ALGORITHMS
AKE 2:00 PM 10/5/2009
In R:
  smooth()  -- default is 3RS3R; help(smooth)
Tukey's smoothing: 3RS3R
  3 - moving medians of length 3
  R - repeat till convergence is reached
  S - splitting of horizontal stretches of length 2 or 3 ??
  filter(signal, filtercoeffs)  -- moving average
5-point moving average: coefficients uniform at 1; the normalization constant is the total sum of the coeffs:
  ma5 = c(1,1,1,1,1)/5
  ma5 = rep(1,5)/5
Savitzky-Golay smoothing function: 5-point quadratic or cubic smoothing function coeffs:
  sg5 = c(-3,12,17,12,-3)/35
  dLateSmooth19 = filter(d$LateDays, rep(1,19)/19)
Exponential Smoothing
  http://en.wikipedia.org/wiki/Exponential_smoothing
Exponentially Weighted Moving Average Filter (EWMA)
  http://tolstoy.newcastle.edu.au/R/e2/help/06/12/6919.html
Packages:
  expsmooth -- data sets from "Forecasting with exponential smoothing"
  MEWMA -- Multivariate Exponentially Weighted Moving Average (MEWMA) Control Chart
  a_j = alpha(1-alpha)^j

Time series classes
  ts -- built-in time series class
  zoo
  tseries -- package
  library(forecast)
  library(Mcomp)
  http://robjhyndman.com/researchtips/estimation2/

Non-Stationary Time Series
Simple test for stationarity -- compare summaries of the two halves:
  l1 = list[1:floor(length(list)/2)]
  l2 = list[ceiling(length(list)/2):length(list)]
  summary(l1)
  summary(l2)
Exploring differences:
  diff(list,lag,order)
  diffinv(dlist, xi)  # xi is the initial value
so:
  dl = diff(list,1,1)
  idl = diffinv(dl, xi=list[1])  # NB R indexes from 1, so the initial value is list[1]
  summary(list-idl)  # should be all zeros
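A concrete roundtrip with small numbers makes the diff/diffinv pairing clear (illustrative values):

  x  <- c(3, 5, 9, 11, 20)
  dx <- diff(x, lag = 1, differences = 1)  # 2 4 2 9
  xr <- diffinv(dx, xi = x[1])             # re-integrate, seeding with the initial value
  x - xr                                   # all zeros: the original series is recovered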
KEY!! NOTE!!! R STATS COMPUTATIONS DEFAULT TO TREATING LISTS AS SAMPLES, and so use N-1 for sd, var, and cov...
If you're doing population corrections, you'll want to correct with * (N-1)/N
  cov(x,y)
  cor(x,y)
  ccf - cross correlation

To compute autocorrelations:
  cor(list[1:367],list[2:368])
  cor(list[2:368],list[3:369])
But what IS the correlation? To see this, one must plot a scattergram of values separated by lag $k$:
  plot(list[1:367],list[2:368])
  plot(list[2:368],list[3:369])
The AUTOCORRELATION FUNCTION (beginning at lag 0):
  acf(list)
To see more of the ACF tail:
  acf(l,lag.max=90)   # lag.max approx N/4
  acf(il, lag.max=90)
To get the values OUT:
  acf_l  = acf(l,lag.max=90, plot=FALSE)   # lag.max approx N/4
  acf_il = acf(il, lag.max=90, plot=FALSE)
But the values are in a funky table format, where we want just a list:
  acf_l$acf  # gives the actual values
This gives a MATRIX:
  acf2AR(acf_l$acf)

# SHIFT OPERATORS: given list
backward shift (adding a zero at the front), i.e. this is the (1 + wB) z_t operation:
  Bdld = c(0,dld[1:length(dld)-1])  # BACKWARD shift on shocks, i.e. Ba_t
  1 2 3 4 5  ->  0 1 2 3 4
forward shift:
  Flist = c(list[2:length(list)],0)
  1 2 3 4 5  ->  2 3 4 5 0
pacf - partial autocorrelation

SPECTRAL ANALYSIS
  cpgram
  spec.pgram
  fft
fft returns the complex fourier coefficients, so in R use:
  abs(fft(list))
and
  plot(abs(fft(list)),type="l")

System stuff
------------
system()
memory.limit(2048)  -- the max
memory.size()       -- how much is currently in use
LaTeX-style math symbols in R plots: help(plotmath)

Symbols / Names in R:
  as.name  # makes into a name
  is.name  # checks if name
  call("round", 10.5)  # builds a call with an argument
  eval      # evaluates a call
  do.call()
  e = expression( single expression or list of expressions )
  eval()

PROGRAMMING IN R
  http://manuals.bioinformatics.ucr.edu/home/programming-in-r
GUI for programming in R: RStudio
  http://www.johndcook.com/R_language_for_programmers.html
  http://en.wikibooks.org/wiki/R_Programming
  http://zoonek2.free.fr/UNIX/48_R/02.html
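A tiny demonstration of the call/eval machinery above:

  cl <- call("round", 10.5)  # build the unevaluated call round(10.5)
  cl                         # prints: round(10.5)
  eval(cl)                   # 10 (R rounds .5 to the even digit)
  e <- expression(sin(pi/2), cos(0))
  sapply(e, eval)            # 1 1 -- evaluate each expression in the list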
# OUTPUT of a vector as a csv vector
write.csv(h1,"h1.csv",eol=",",row.names=FALSE)

# GRAPHICAL VIEW: 2 graphs side by side
wx = 800
hy = 600
windows(bg="white", width=wx, height=hy)
view <- matrix(1:2, nrow=1, byrow=T)
layout(view)
... plot statements ...
# output to file
dev.print(png, file="figCholTransError.png", width=wx, height=hy)

# Here we look at MAX, MIN, PLOT (to see patterns of possible typos, and spread) and BOXPLOT (to see spread)
max(data$Age, na.rm=TRUE)
min(data$Age, na.rm=TRUE)
plot(data$Age)
boxplot(data$Age)

# variable correlated by Gender
boxplot(data$BMI[data$Gender=="F"],data$BMI[data$Gender=="M"])

# Here we look at a variable, pivoting on Age and Gender separately, and then together.
# These are weights, so we know roughly what they could be.
max(data$WeightLbs[data$Age<18], na.rm=TRUE)
max(data$WeightLbs[data$Age>=18], na.rm=TRUE)
min(data$WeightLbs[data$Age<18], na.rm=TRUE)
min(data$WeightLbs[data$Age>=18], na.rm=TRUE)
max(data$WeightLbs[data$Gender=="F"], na.rm=TRUE)
max(data$WeightLbs[data$Gender=="M"], na.rm=TRUE)
min(data$WeightLbs[data$Gender=="F"], na.rm=TRUE)
min(data$WeightLbs[data$Gender=="M"], na.rm=TRUE)
plot(data$WeightLbs)
boxplot(data$WeightLbs)

# Now, weight is correlated with AGE and with GENDER, and a better measure is BMI
boxplot(data$WeightLbs[data$Age<18],data$WeightLbs[data$Age>=18])
boxplot(data$WeightLbs[data$Gender=="F"],data$WeightLbs[data$Gender=="M"])

# Annotating and Exporting a Plot
plot(data$HeightTotalIn[data$Gender=="F" & data$Age>18], main="Adult Female Height", ylab="Height (Inches)")
boxplot(data$HeightTotalIn[data$Gender=="F" & data$Age>18], main="Adult Female Height", ylab="Height (Inches)")
dev.print(png, file="figHeightF.png", width=wx, height=hy)

# Nice annotation for export
plot(data$CholLDL, main="LDL Cholesterol Level", xlab="Record ID #", ylab="LDL Cholesterol Level", ylim=c(0,300))
plot(data$CholHDL, main="HDL Cholesterol Level", xlab="Record ID #", ylab="HDL Cholesterol Level")
dev.print(png, file="figCholTransError.png", width=wx, height=hy)
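A small helper in this spirit (a hypothetical convenience, not one of the library functions used elsewhere in these notes) bundles the range check and NA count for any numeric column:

  qc.col <- function(x) c(min = min(x, na.rm = TRUE),
                          max = max(x, na.rm = TRUE),
                          NAs = sum(is.na(x)))
  qc.col(data$Age)
  qc.col(data$WeightLbs)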
# 6 correlation plots
## TWO-WAY PLOTS for correlation analysis...
# General population
wx = 800
hy = 600
windows(bg="white", width=wx, height=hy)
view <- matrix(1:6, nrow=2, byrow=T)
layout(view)
plot(data$CholLDL, data$CholTot)
plot(data$CholHDL, data$CholTot)
plot(data$Triglyc, data$CholTot)
plot(data$CholHDL, data$CholLDL)
plot(data$Triglyc, data$CholLDL)
plot(data$Triglyc, data$CholHDL)
dev.print(png, file="CholCorrelation1.png", width=wx, height=hy)

# Age specific correlation plots
# Age specific: 16 - 20
plot(data$Age,data$CholTot,xlim=c(0,30),ylim=c(80,250))
plot(data$Age,data$CholLDL,xlim=c(0,30),ylim=c(40,180))
plot(data$Age,data$CholHDL,xlim=c(0,30),ylim=c(0,150))
plot(data$CholHDL[data$Age<=20 & data$Age>=16],data$CholTot[data$Age<=20 & data$Age>=16],xlim=c(0,150),ylim=c(80,250))
plot(data$CholLDL[data$Age<=20 & data$Age>=16],data$CholTot[data$Age<=20 & data$Age>=16],xlim=c(40,180),ylim=c(80,250))
plot(data$CholHDL[data$Age<=20 & data$Age>=16],data$CholLDL[data$Age<=20 & data$Age>=16],xlim=c(0,150),ylim=c(40,180))
dev.print(png, file="CholCorrelation2.png", width=wx, height=hy)

# Age specific: 60 - 70
plot(data$Age,data$CholTot,xlim=c(50,80),ylim=c(80,300))
plot(data$Age,data$CholLDL,xlim=c(50,80),ylim=c(30,220))
plot(data$Age,data$CholHDL,xlim=c(50,80),ylim=c(20,130))
plot(data$CholHDL[data$Age<=70 & data$Age>=60],data$CholTot[data$Age<=70 & data$Age>=60],xlim=c(20,130),ylim=c(80,300))
plot(data$CholLDL[data$Age<=70 & data$Age>=60],data$CholTot[data$Age<=70 & data$Age>=60],xlim=c(30,220),ylim=c(80,300))
plot(data$CholHDL[data$Age<=70 & data$Age>=60],data$CholLDL[data$Age<=70 & data$Age>=60],xlim=c(20,130),ylim=c(30,220))
dev.print(png, file="CholCorrelation3.png", width=wx, height=hy)

# Looking for a correlation -- a faint correlation -- how faint?
plot(data$BMI, data$Triglyc)

# Comparing two populations
# did NOT fast and yet had testing done
rows = which(data$HasFasted=="N" & !is.na(data$Triglyc))
foodies = data[rows,cols]
length(rows)
# DID fast and had testing done
s = which(data$HasFasted=="Y" & !is.na(data$Triglyc))
fasties = data[s,cols]
length(s)
wx=350
hy=600
windows(bg="white", width=wx, height=hy)
boxplot(foodies$Triglyc, fasties$Triglyc, names=c("Foodies", "Fasties"), ylab="Triglycerides Level", main="Fasting & Triglyceride Levels")
dev.print(png, file="figTriglycFasting.png", width=wx, height=hy)

!! Hunh! It could also be that some people who SAID they fasted actually LIED and didn't fast... and hence that's why they have such high triglycerides?
!! OR could it be that the fasting triggered the body to release fat reserves? And that for those who ate, the body just used up the fats??
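To put a formal test behind the boxplot comparison (a sketch; the notes above stop at the visual comparison):

  t.test(foodies$Triglyc, fasties$Triglyc)       # Welch two-sample t-test of the group means
  wilcox.test(foodies$Triglyc, fasties$Triglyc)  # rank-based alternative, robust to outliers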
wx=600
hy=600
windows(bg="white", width=wx, height=hy)
view <- matrix(1:2, nrow=1, byrow=T)
layout(view)
plot(foodies$Triglyc,foodies$CholTot, xlim=c(0,275), ylim=c(0,300))
plot(fasties$Triglyc,fasties$CholTot, xlim=c(0,275), ylim=c(0,300))
dev.print(png, file="figTriglyFastingCholTot.png", width=wx, height=hy)

wx=600
hy=600
windows(bg="white", width=wx, height=hy)
view <- matrix(1:2, nrow=1, byrow=T)
layout(view)
plot(foodies$Triglyc,foodies$CholLDL, xlim=c(0,275), ylim=c(0,200))
plot(fasties$Triglyc,fasties$CholLDL, xlim=c(0,275), ylim=c(0,200))
dev.print(png, file="figTriglyFastingCholLDL.png", width=wx, height=hy)

wx=600
hy=600
windows(bg="white", width=wx, height=hy)
view <- matrix(1:2, nrow=1, byrow=T)
layout(view)
plot(foodies$Triglyc,foodies$CholHDL, xlim=c(0,275), ylim=c(0,100))
plot(fasties$Triglyc,fasties$CholHDL, xlim=c(0,275), ylim=c(0,100))
dev.print(png, file="figTriglyFastingCholHDL.png", width=wx, height=hy)

# Doing a comparative study of populations and drawing conclusions -- Triglycerides, dieting and females in their 30s... (after one or two kids)
rows = which(data$Triglyc<50)
dieters = data[rows,]
plot(dieters$Gender)  # mostly Female!
boxplot(dieters$Age)  # mostly young! (median 31yrs)
dieters.f = dieters[which(dieters$Gender=="F"),]  # females
dieters.m = dieters[which(dieters$Gender=="M"),]  # males
summary(dieters.f$Age)  # 26 - 36 years are 75% of these women
boxplot(dieters.f$Age)
dieters.f.young = dieters.f[which(dieters.f$Age<=36 & dieters.f$Age>=26),]
length(rownames(dieters.f.young))  # 29
----
s = which(data$Triglyc>=50)  # get the problem rows
nodieters = data[s,]
plot(nodieters$Gender)  # balanced
boxplot(nodieters$Age)  # older (median 41yrs)
nodieters.f = nodieters[which(nodieters$Gender=="F"),]  # females
nodieters.m = nodieters[which(nodieters$Gender=="M"),]  # males
length(rownames(nodieters.f))  # 1393 women
length(rownames(nodieters.m))  # 1357 men
summary(nodieters.f$Age)  # 34 - 51 years are 75% of these women
boxplot(nodieters.f$Age)
nodieters.f.young = nodieters.f[which(nodieters.f$Age<=36 & nodieters.f$Age>=26),]
length(rownames(nodieters.f.young))  # 332
----
# COMPARISONS
wx=1050
hy=600
windows(bg="white", width=wx, height=hy)
view <- matrix(1:3, nrow=1, byrow=T)
layout(view)
boxplot(dieters.f.young$Triglyc, nodieters.f.young$Triglyc, ylim=c(30,230), main="Separating Young Female Pop'n by Triglyc Level", names=c("Low (<50)", "Regular (>=50)"), ylab="Triglyceride Levels")
boxplot(dieters.f.young$BMI, nodieters.f.young$BMI, ylab="Body-Mass-Index (BMI)", names=c("Dieters (Low Triglyc)", "Non-Dieters (Ave Triglyc)"), main="BMI of Young Females with Low Triglyc")
boxplot(dieters.f.young$Age, nodieters.f.young$Age, main="Females Between 26 and 36 years", names=c("Dieters (Low Triglyc)", "Non-Dieters (Ave Triglyc)"), ylab = "Age")
dev.print(png, file="figTriglycAbnormLowYoungWomen.png", width=wx, height=hy)
#!!!! WOW -- look -- these abnormally low numbers are mostly from young females!!!

--------------
IDENTIFYING OUTLIERS

BACKGROUND:
boxplot(numeric vector) shows you the median, the 1st and 3rd quartiles, and the largest and smallest instances that fall WITHIN 1.5 box-widths (IQR - interquartile range) beyond the 1st and 3rd quartiles.
summary(numeric vector) gives you the numerical backing, e.g.:
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
      4.0   110.0   120.0   123.3   134.0   227.0   244.0
So:
  IQR = 134.0 - 110.0 = 24.0
  1.5 * IQR = 36
Therefore:
  TOP WHISKER    <= 134 + 36 = 170
  BOTTOM WHISKER >= 110 - 36 = 74

QUESTION: how do we handle outliers?
We take the sample median and IQR as estimators of the population from which the sample was drawn. Then we use the non-parametric measures of extremeness:
  factor 1.5 corresponds to 2 standard deviations, or 95% of the population
  factor 2.5 corresponds to 3 standard deviations, or 99.7% of the population
  factor 3.5 corresponds to 4 standard deviations, or 99.99% of the population
SEE MY STATS ARTICLE: e:\Repository\Mathematics\Statistics\_statisticalmethods.txt

OF COURSE, YOU ALWAYS HAVE THE PROBLEM OF WHETHER THE GENDER CONFOUNDING, OR THE AGE CONFOUNDING, ETC., MEANS THAT THE ASSUMPTION MAY BE VIOLATED THAT mu-hat and s-hat (the sample parameters) are good estimators of the population parameters. In that case their use to determine outliers is also flawed.

EXPLORATORY TECHNIQUE -- PIVOTING AND TREND ANALYSIS
This is how to slice by age and gender and see the trends:
  rowtest = which(data$Age > 60 & data$Gender=="M")
  boxplot(data[rowtest,cols]$CholTot, data[rowtest,cols]$CholLDL, data[rowtest,cols]$CholHDL, data[rowtest,cols]$Triglyc)
  summary(data[rowtest,cols]$CholHDL)

ADDING LEVELS THAT AREN'T THERE
  reglev=c("FL","MW","NE","NT","SE","SW","W")
  fr = factor(data$Region,levels=reglev)
  table(fr)

# USING SUBTABLES
# FL, MW, NE, NT, SE, SW, W (alphabetical order, like in R)
pop = c(3370,4740,4140,7840,11240,16890,5690)
samp = table(data$Region)
# subtables by region
ne = subset(data, data$Region=="NE")
se = subset(data, data$Region=="SE")
fl = subset(data, data$Region=="FL")
mw = subset(data, data$Region=="MW")
nt = subset(data, data$Region=="NT")
sw = subset(data, data$Region=="SW")
w  = subset(data, data$Region=="W")
# summary(data)
# factor(data$JK)
# plot(data$Region, main="Counts by Region")

# subtable by risk factors
risk.hd = subset(data, data$HasHistHD=="Y")
risk.hd.compl = subset(data, data$HasHistHD=="N")
# summary(risk.hd)
# summary(risk.hd.compl)

# disentangling by gender and age
risk.hd.m = subset(risk.hd, risk.hd$Gender=="M")
risk.hd.f = subset(risk.hd, risk.hd$Gender=="F")
risk.hd.compl.m = subset(risk.hd.compl, risk.hd.compl$Gender=="M")
risk.hd.compl.f = subset(risk.hd.compl, risk.hd.compl$Gender=="F")
# summary( )

# disentangling by gender, generally
m = subset(data, data$Gender=="M")
f = subset(data, data$Gender=="F")
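The whisker arithmetic from the IDENTIFYING OUTLIERS section above, done programmatically (x stands for any numeric column):

  qs    <- quantile(x, c(.25, .75), na.rm = TRUE)
  iqr   <- qs[2] - qs[1]
  fence <- 1.5                   # use 2.5 or 3.5 for the stricter cutoffs above
  upper <- qs[2] + fence * iqr   # top whisker limit
  lower <- qs[1] - fence * iqr   # bottom whisker limit
  which(x > upper | x < lower)   # indices of candidate outliers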
Good Statistical Philosophies:
+ Always explore your data at least two ways: numerically and graphically (Tukey)

Why do it in R?
+ Because it has the BEST of Forth's factoring approach and of the functional approach (implemented as the LISPy functional programming paradigm of lapply and sapply, etc., though these take a little while to get your head around), AND because it has all the built-in probability and visualisation capabilities.
So there are 4 killer reasons for using R.
+ Because probability engines are the heart of simulation -- and R has all 4 of the key functions: pmf/pdf, cdf, qf, and rng (probability mass/density function, cumulative distribution function, quantile function, random number generator) for a huge number of distributions
- but the problem is that R is a COMPLICATED object oriented language and so you'll spend lots of time reading documentation and googling in order to do almost anything --- until you get to a reasonable level of proficiency, which can take a while
This says it best:
"R can be a very unforgiving language. Obtuse error messages and a variety of differing yet similar data structures leads to a frustrating learning curve. Furthermore R is a functional programming language which, if you are used to imperative programming (e.g. Java/Perl/Python/Ruby), makes R seem like a crufty Perl for statistics." - Michael Barton
This led to frustration at the poor object support and the seeming requirement to create large numbers of loops and temporary variables.
[You'll] appreciate the functional heart of R and how much it can simplify your code. R is almost a lisp dialect since it draws much inspiration from Scheme.
Functional programming can be better than imperative approaches in terms of concise, clear code, and for simplifying parallelisation.
http://www.bioinformaticszen.com/post/simple-functional-programming-in-r/
This is eye-opening. My initial use of R for statistical analysis was brute force --- lots of copy-paste; it was not software-engineered by any stretch. It was a tool to get a job done. I should have used a lot more functions, and functional programming would have reduced duplication.

+ R has first-class functions, i.e. functions that are fundamental objects in themselves. You can pass them as arguments, put them into lists, etc.
+ R supports a Forth-like working style: "[to get complex tasks done, you] build a family of composite tools starting from very simple primitives... this is a recurring theme: start with small, easy-to-understand building blocks, combine them into more complex structures, and apply them with confidence."
http://adv-r.had.co.nz/Functional-programming.html

Best conceptual references that you'll need to do advanced R programming & the sequence in which you should read them:
R Objects: http://ww2.coastal.edu/kingw/statistics/R-tutorials/objects.html
R programming for those coming from other languages: http://www.johndcook.com/R_language_for_programmers.html
One Page R: A Survival Guide to Data Science with R: http://www.datasciencecentral.com/profiles/blogs/one-page-r-a-survival-guide-to-data-science-with-r
Functional Programming: Hadley Wickham: http://adv-r.had.co.nz/Functional-programming.html
Functional Programming: Michael Barton http://www.bioinformaticszen.com/post/simple-functional-programming-in-r/
Data Structures: Hadley Wickham http://adv-r.had.co.nz/Data-structures.html
Subsetting & Assignment: Hadley Wickham http://adv-r.had.co.nz/Subsetting.html
Grouping functions (lapply, sapply, apply, tapply, aggregate, by): Joran http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega
Data Frames --- and differences to databases: Chris http://www.r-bloggers.com/select-operations-on-r-data-frames/
Accessing Lists -- there are three operators: [], [[]], $
http://stackoverflow.com/questions/1169456/in-r-what-is-the-difference-between-the-and-notations-for-accessing-the

Tips for coding in R
+ introspection: str(obj)
+ Everything is a LIST. Not a matrix, but a list.
+ To make a table from a list, calculate it as a list and then reshape it as you wish
+ A sequential or iterative programmer will think about LOOPING through something. A functional programmer will think MAP or REDUCE
+ lapply() allows creating a composite operation: combine the worker function you want with lapply(), which can then, in a single pass, do something to each column in a data frame.
+ if something doesn't work, think: do I need to coerce it into a type, or perhaps even **uncoerce** it, e.g. unlist(list of lists)?
+ once you've got a good script, rm.all() (clear the workspace), re-run the script, and then save your workspace. Just loading your workspace next time will restore everything and you can easily demo, etc.

Why consider Lisp or Scheme? For experience in FP, functional programming.
What other languages could give you this?
FP languages: Haskell, OCaml, F#,
Multi-paradigm but FP-friendly languages like Lisp, Scheme, Clojure, Scala

You learn about the key concepts of FP, or functional programming:
+ First-class functions --- functions that can be treated like any other data type --- made into lists, passed as arguments, etc. (Forth can do this using execution tokens)
+ Anonymous functions --- functions without names (Forth has anonymity via the stack --- anonymous variables)
+ Closures --- functions written by other functions. When we have multiple functions that all follow the same basic template, we can create closures: functions that return functions. Closures allow us to make functions based on a template.
+ Lists of functions --- storing functions in a list
+ Functionals --- functions that take functions as arguments and return vectors as output
+ Function operators --- functions that use functions as inputs and return them as outputs
Recently FP has experienced a surge in interest because it provides a complementary set of techniques to object-oriented programming, the dominant style of the last several decades.
+ Composite functions: the key idea here is composition. We take two simple functions, one which does something to each column and one which fixes missing values, and combine them to fix missing values in every column. Writing simple functions that can be understood in isolation and then composed together to solve complex problems is an important technique for effective FP. (Hadley Wickham, http://adv-r.had.co.nz/Functional-programming.html)

apply(M,FUN,...)  --- runs a function against the rows or columns of a matrix
lapply(L,fun,...) --- list apply: runs a function against every element of a list and returns a list; think: for (xi in x) do fun(xi,y,z)
sapply(L,fun,...) --- shaped apply: runs a function against every element of a list but returns an UNlist, i.e. a vector
`lapply` is *list apply*
`sapply` is *shaped apply*
`rapply` is *recursive apply*
`vapply` is *vectored apply*
`tapply` is *tagged apply*
+ Reduce / Filter / Find / Position / Map / Negate --- these are higher-order functions because they operate on a data set with a function...
+ Subsetting a data frame:
  subset(dfrm, condition==a & condition==b)
  which ...
  # idx <- 1+which(diff(pt3[,csl_str])==1)  # identify the breakpoints in target sellable stocks --- one column
+ Removing row 539: d <- frm[-539,]
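A minimal sketch of the closure + composition idea described above, in the spirit of Wickham's missing-values example (the names are illustrative):

  # a closure: a factory that makes "replace NA with m" functions
  fix_missing_with <- function(m) {
    function(x) { x[is.na(x)] <- m; x }
  }
  fix_missing0 <- fix_missing_with(0)  # a function built from the template

  # compose it with lapply to fix missing values in EVERY column
  df <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))
  df[] <- lapply(df, fix_missing0)     # df[] keeps the data.frame shape
  df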
Data Structures in R

case sensitivity --- R names ARE case-sensitive --- so Myname, MyName, myname are each different.
Remember this when importing from a database --- if all names are uppercase, you'll have to type them all upper case.

introspection: str( ), class( )

R has 8 primary objects: http://ww2.coastal.edu/kingw/statistics/R-tutorials/objects.html

cell, element, scalar --- actually this doesn't exist -- everything in R is built on top of vectors

string (as.character)
  grep(str, char.obj)  # remember, strings are CASE SENSITIVE
  tolower(), toupper()

vector v : a standard collection or a hash (dictionary)
  v <- c(...)   c for concatenate, combine, create, or call-out individually
  v <- rep( )   rep for repeat
  v <- a:b      quick sequence: from a to b (inclusive)
  v <- seq( )   seq for sequence
  named vector: names(v) <- c(...)  # the names

list l : a collection of OBJECTS --- so a list in R could be used like a record in another language...
  l <- list(...)
  named list: l[[1]] REMOVES the named bit; l$objname
  Note: data frames are actually LISTS -- named lists, in fact.
  To UNLIST something, use unlist(ll)

matrix m : a table, or two-dimensional array
  m <- matrix(1:12, nrow=3)
  transpose: t(m)

array a : a generalised table, or n-dimensional array
  a <- array(1:16, dim=c(4,2,2))  # rows, columns, layers -- think like a dungeon

table : a summary or frequency table of a vector
  table(v)

data frame frm : a table, a collection of lists, a spreadsheet
  frm = data.frame( ... )  but usually you'll just load these from a file
  removing a row: frm <- frm[-rownum,]

R syntax
  if (bool1) branch1 else if (bool2) branch2 else branch3

Creating your own records & objects in R: use lists.

Subsetting