R for Statistical Computing: Quick Start & Useful Coding Snippets
Assad Ebrahim, http://www.mathscitech.org/articles/computing-toolkits/r-for-stats
2003-2020
This page provides notes and references for using R in statistical computing.
1. Overview of R
What is R? R is a professional-grade open-source statistical computing platform widely used in academia, bio-tech/pharma, and computational finance/investment banking. It has its origins in S/S+, the proprietary statistical computing platform developed at Bell Labs (John Chambers is the author of both). R is useful for statistics, graphing, data visualization, and various types of analysis, including signal processing, forecasting, cluster analysis, PCA, fuzzy logic, and many of the topic areas discussed in Knowledge Engineering. R has many packages (extensions) for specialized applications. By 2010, R was an academic standard in statistics, plotting, signal processing, and data analysis, and was rapidly gaining mindshare in industry.
– runs on Windows and Linux
– has a command line interface but also has a GUI version
– is fully scriptable, and provides a programming language
– integrates with databases seamlessly and handles large data sets
– has bindings that integrate with Python, Ruby (on Unix so far), and Perl
– has commercially available plug-ins for Excel that allow R code to be run on the back-end.
If there's one thing to remember as you learn R, it is to think in VECTORS, LISTS, and TABLES. The point is that everything you used to do one number at a time in a programming language, you should now be able to do in one line on a whole VECTOR (list). In this sense, R is about list processing.
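A minimal sketch of this vectorized style (the data are made up for illustration):
x <- 1:100      # a vector of 1..100
sum(x^2)        # squares the whole vector element-wise, then sums -- no loop
mean(x > 50)    # fraction of elements above 50: comparisons are vectorized too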
Excellent Sites for R and its use in Statistical Modeling, Statistical Learning, Data Science, and AI:
- R Programming for those coming from other languages (by John Cook)
- Drunks and Lampposts (by Simon Raper)
- Coppelia (by Simon Raper)
- Hyndsight (by Rob J Hyndman)
- Vladimir Anisimov
- One Page R
- R in Action, especially for the graphing
- Cross Validated (the statistics Q&A site on Stack Exchange)
- Functional Programming in R (Assad Ebrahim, StackOverflow)
- Spreadsheet to R Translator
- Using R for Time Series Analysis
- Forecasting
- R Language Reference Card
- Check out The R Journal, a refereed journal relating to R and its applications. First issue: June 2009, roughly 3 issues per year. journal.r-project.org/archive. Best is the single-file archive of R News — before it rolled up into R Journal (2001–2009) — and the single-file bib file of refereed articles (2009–present)
- Data manipulation (PDF)
- R Course materials: PDF (Oxford), R basics (PDF), UCLA: basics with annotated outputs, tutorial with info on plots, lecture notes with R code
- Statistical techniques: Regression, Time Series beginner’s tutorial, advanced time series
- Open Source framework for Financial Analysis.
- A brief guide to R and Economics
- quantitative trading: analysis and visualization framework
- Time series packages
- Data Mining: Data Mining in R (Togaware), online e-book
- Advanced Statistics using R
- Credit Scoring using R
- Graph gallery of R plots and charts with supporting code
- Graphics package Lattice – a tutorial
- Ggplot R graphics
- Ggplot Vs Lattice
- Multiple tutorials for using ggplot2 and Lattice
- Text Mining package in R
- Social Network Analysis
- Web Scraping in R
- Embedding R data frames in Excel
- Using R from Excel
- Databases: Connecting to MySQL from R
- Pulling data from SAS, STATA, SPSS, etc.
- Maps with R, Geographic maps
- Google Charts with R
- Using R and GoogleMaps
- Sweave
- R2HTML
- Poor Man's GUI (PMG) for R
- R Commander is a robust GUI for R
- Java GUI for R (JGR)
- Tinn-R good R editor
- Eclipse plugin for R
- Instructions to install StatET in Eclipse
- Komodo R editor
- Omega Hat has a very interesting list of packages that is worth a look
- Commercial versions of R
- A very informative R blog
- Red R for R tasks
- Screenshots in R: KNIME is worth a serious look.
- R Graph Gallery, with source code
Why R? When not R?
R is a powerful though complicated language, and unfortunately it has a somewhat steep learning curve.
I have been using R since 2003 (v1.7.1, rw1071.exe), initially as a graphical tool for exploratory data analysis (EDA) of large data sets with many factors (comprehensive health data), later as a tool for simulation and modelling. Critically, the release of Excel 2007, with its new larger data set handling capabilities and easy-to-use pivot tools, meant that my use of R shifted toward model building, automation of analytics involving the production of many tables and charts, and computationally intensive tasks on large data sets. Should a beginner start with R? It is worth forcing yourself through the learning curve while you have motivation and time, so that you can already use R easily when you need capabilities that are lacking anywhere else.
Why use R?
– you’d like a freely available software package for your home / personal computer that packs all the power of a proprietary, industrial-strength platform, without the cost
– you need to productionise / automate analysis that will be repeated many times, and that possibly involves producing many tables and charts
– you need analyses that are more sophisticated or involved than would be possible in Excel
– specific analysis that is almost impossible in Excel — segmentation, clustering, etc.
– using / exploring a large number of pre-packaged statistical techniques, e.g. browsing / potentially using the vast statistical library of techniques on CRAN [Footnote: CRAN is a user-submitted archive following the model of CTAN, the Comprehensive TeX Archive Network, which was later also adopted by CPAN for Perl.]
When not to use R?
– you’re already a whiz with some other statistical package (S, S+, Stata, SAS, etc.). In this case you probably have all of the same power, much of the same expressiveness, and (presumably) you already own or can use the proprietary software on your computer
– you’re a whiz with Matlab, and are willing to tolerate some slight awkwardness and inefficiency to avoid having to learn another platform
– you’re doing ad-hoc, one-time data analysis — in this case Excel (specifically version 2007 and newer) is all you really need. The 2007 version makes data filtering, pivot tables, and a host of other analysis and graphical tools readily available and easier to use.
2. Getting & Setting up R
Getting Installer
Homepage. Download page. Select your operating system (Windows, Linux, Mac) and platform (32- or 64-bit). Choose the R version — it is recommended to install a version that has been around a while, to avoid package incompatibility problems (as of May 2020, recommend installing 3.6.3 instead of 4.0.0). Save and run the installer to begin.
Installation
Choose Language. Change installation directory to something without spaces so that when you want to run scripts or incorporate R into a data pipeline, you don’t have to worry about spaces in the pathname. I use: c:\totalprogs\R\. Customize startup:
– choose MDI (one big window) — With MDI there is one big container window, and all windows are within it. This means you can minimise, maximise the app in one click — easier if you’re working with lots of other applications also
– choose Plain Text help — this will bring up the help in a separate window in the GUI, rather than shunting you to a browser.
– UNCHECK: Save version number in registry
On Windows 10, the font in R can look blurry. If this happens to you the fix is easy: Find RGui.exe, right click > Properties > Compatibility tab > Change High DPI setting button > check Override high DPI scaling behaviour. The font shown below is the crisp font after the High DPI scaling fix.
Using an Integrated Development Environment (IDE)
There are a few choices:
- RStudio
- Rattle
- R-Commander
- Sweave, for mixing R & LaTeX
Sweave(), Stangle() and friends are in package tools. Sweave allows mixing LaTeX documentation and R code in a single source file: the R code can be replaced by its output (text, figures) to allow automatic report generation. Sweave files found in the package subdir 'inst/doc' are automatically tested by R CMD check and converted to PDF by R CMD build; see the section on package vignettes in Writing R Extensions.
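A minimal sketch of a Sweave source file (a hypothetical example.Rnw; the <<>>= and @ chunk delimiters are standard Sweave syntax):
\documentclass{article}
\begin{document}
<<echo=TRUE, fig=TRUE>>=
x <- rnorm(100)   # generate sample data
hist(x)           # this figure is inserted into the output automatically
@
The sample mean is \Sexpr{mean(x)}.
\end{document}
Process it with R CMD Sweave example.Rnw, then run pdflatex on the resulting example.tex.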
3. Quickstart
To run the GUI version, link to: c:\totalprogs\R\bin\i386\Rgui.exe
Rgui.exe – this brings up the R console (shell). q() to exit.
To run a command line version, link to: c:\totalprogs\R\bin\R.exe
Rcmd.exe – this is a command line front end with a lot of options
rcmd.exe BATCH script.r – runs a script non-interactively; output is written to script.Rout
Rterm.exe – this brings up the console in a DOS terminal. q() to exit.
Files:
.RData – saved workspace (your objects), loaded automatically at startup
.Rhistory – command history
savehistory("name.Rhistory")
history()
savehistory() — saves to .Rhistory
Navigating between R windows: Use Ctrl+Tab. Close window with Ctrl+F4.
Help:
help(dir)   # load help page
dir()       # shows the files in the CURRENT directory [this command added in the 2.xx series]
Arithmetic:
2+3       # This is a comment
2*3
2**3      # Also 2^3
2^(1/3)   # cube root
2/3       # Float
3%%2      # Remainder (modulus)
Logic:
&, |, ! are AND, OR, NOT operators that work element-wise on lists.
&& and || are SINGLETON AND and OR: they use only the first element of each list.
Example: c(TRUE, FALSE, TRUE) & c(FALSE, FALSE, TRUE)
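The example above evaluates as follows (a quick sketch; note that recent R versions make && on vectors of length > 1 an error):
c(TRUE, FALSE, TRUE) & c(FALSE, FALSE, TRUE)    # FALSE FALSE TRUE  (element-wise)
c(TRUE, FALSE, TRUE) && c(FALSE, FALSE, TRUE)   # FALSE: only the first elements are compared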
String Literals:
print("Hello World!") # prints with newline cat("Hello World!") # no newline cat("hello","world") # concatenate text with space in between cat("hello", 2, " world!") # number can be concatenated sprintf("Hello World! %d:%d:%f", 08,53,29.2) # C-style formatted print sprintf("0x%X = %dd",13,0xD) # C-style number conversion and print
Categorical Variables:
Designating a factor variable and specifying its order with levels means that in all subsequent usage these levels will be used in displaying the data.
cc5 <- c("VERY LOW", "LOW", "MEDIUM", "HIGH", "VERY HIGH") table(cc5) cc5 <- factor(cc5,levels=cc5) table(cc5)
Assignment of Variables and Functions:
r <- 5   # Assignment
square <- function(val) { val**2 }
square(3)
classify <- function(val) {
  if (val < 10) print("LOW")
  else if (val > 90) print("HIGH")
  else print("NORMAL")
}
Working with lists and vectors:
R is powerful because it is easy to work with lists (lists of objects or vectors of data) as well as lists of lists (tables of objects or matrices of data).
Lists can be (1) created by hand using c, seq, or rep, (2) generated by sampling from known distributions, (3) read in from pre-existing datasets or (4) loaded in from a file.
Vectors can be named by assigning them at creation, e.g.
v <- c(a=1, b=2, c=3)
There are several ways to unname: unname(v) or names(v) <- NULL both strip the names.
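A quick sketch:
v <- c(a=1, b=2, c=3)
v["b"]      # access by name
unname(v)   # 1 2 3 -- names stripped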
Working with tables and dataframes:
TABLES / dataframes
thisframe$HasIns # one column
thisframe$HasBPmeds # another column
t = table(thisframe$HasIns, thisframe$HasBPmeds) # 2-way table
addmargins(t) # adds the row and column sums, and total sum
ftable(addmargins(t)) # formatted table
t[1,] # row 1
t[2,] # row 2
t[,1] # column 1
t[,2] # column 2
CREATING A DATAFRAME
build it up using rbind and cbind
e.g.
vars2007 = rbind(vCreatin) # This starts off the table...
vars2007 = rbind(vars2007, vTSH) # This adds a row
CREATING other multi-dimensional data structures
### Form data frame:
ddd <- cbind(d, dd, bb, q, dd.vhigh, dd.high, dd.medium, dd.low, dd.vlow)        # matrix: all columns coerced to a common type
ddd <- data.frame(d, dd, bb, q, dd.vhigh, dd.high, dd.medium, dd.low, dd.vlow)   # data frame: columns keep their own types
ddd <- table(d, dd)                                                              # cross-tabulation (table() needs at least one argument)
The most flexible is the list:
ddd<-list(d=d,dd=dd,bb=bb,q=q,vhigh=dd.vhigh,high=dd.high,med=dd.medium,low=dd.low,vlow=dd.vlow) # list with named sub-lists, can all be different lengths.
Basics:
length(d)
min(d)                                # min of a vector treated as a set
which.min(d)
pmin(vec1, vec2)                      # parallel min of two vectors, element-wise
apply(data.frame(vec1,vec2), 1, min)  # min across the rows of the dataframe, 1=rows
table(d)                              # summarizing in a counts table
table(d)/length(d)                    # frequency table
Suppressing verbose list output:
sink("NUL") ... sink()                # no output for the ... commands
unlist(d)                             # useful if getting a list of lists
Vectors and Lists:
Vectors are homogeneous, i.e. all elements have the same data type (e.g. integer or character).
Lists are general and can have elements of mixed types, including other lists.
So there are two operators for accessing elements: [ ] (indexing) and [[ ]] (subsetting).
Indexing [ ] returns a container of the same type: on a list it returns a (sub-)list, even when that sub-list holds a single element.
Subsetting [[ ]] forces the return of the single element itself, stripping out any named extras; on a list of lists it returns the inner object.
1:4               # vector of length 4; a sequence / iterator; can run down too, e.g. 4:1
length(1:4)
d <- c(1,5,7,9)   # vector of values
d[2:3]            # pick out a subset
d[2]
quantile(d)[[1]]  # just the number, without the named %
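A minimal sketch of [ ] vs [[ ]] on a list (toy data):
L <- list(a=1, b="two", c=3:5)
L[1]       # a list of length 1, still named: $a
L[[1]]     # the element itself: 1
L[["c"]]   # the vector 3 4 5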
Work with built-in dataset:
library(datasets)          # load built-in datasets (see 00Index.html file in .\R\library\datasets\html\)
library(help="datasets")
plot(airmiles)
plot(beaver1$time, beaver1$temp)
Read data from file:
fn <- "c:/totalcmd/scripts/rdata.dat" # note the / is opposite Windows default d <- read.csv(fn, header=FALSE) names(d) <- c('t','v') d <- read.table(fn,skip=15,sep="\t",fill=TRUE,header=FALSE) names(d) <- c("col1", "col2")
Working with lists:
detecting NA's: is.na(val) # tests element
suppressing NA's: most functions have parameter na.rm=TRUE
removing NA's: d_clean <- d[!is.na(d)]   # remove NA's (note: a hyphen is not legal in an R name)
min(d, na.rm=TRUE) # min value suppressing NA's
Reading/Loading source files/running an R script/Including source files
source("script.r")
lib = "c:\\totalcmd\\R\\lib\\AKE\\akelib.r"
source(lib)
Generic functions
clean <- function(v) { v[!is.na(v)] }   # removes NAs from a list (ex: clean(d))

drawOn <- function(w,h) {
  windows(bg="white", width=w, height=h)
  par(cex.axis=0.9, mar=c(4.5,3.2,2.5,0.5)+0.1, mgp=c(2.2,0.8,0))   # tight margins for insertion into a Word doc
  # LATER: for LAS=2, set margin bottom from 4.5 to 7.0
  # margin order: bottom, left, top, right
}
# < PLOT BODY GOES HERE >
drawOff <- function(w,h,finame) {
  par(cex.axis=1, mar=c(5.1,4.1,4.1,2.1), mgp=c(3,1,0))   # set back to defaults
  fname <- sprintf("fig_%s.png",finame)
  dev.print(png, file=fname, width=w, height=h)
  dev.off()
}
APPENDICES
Appendix 0: My Snippets
clean <- function(v) { v[!is.na(v)] }   # removes NAs from a list (ex: clean(d))
square <- function(val) { val**2 }      # example for demonstrating sapply(clean(d), square)
Appendix 1: Using Packages
R packages are stored locally in the .\R\library path.
Packages must be loaded from the library (the location must be in your current search path):
library(ISwR)
library(datasets)
Packages can be imported from CRAN (R-repository)
To download a package (connect to internet) and run:
download.packages("ISwR", ".")
install.packages("ISwR", .libPaths()[1])
Package Installation
Packages can be installed from within R GUI (easy) OR from the command line:
c:\totalcmd\R\rw261\bin\Rcmd.exe (shows the Rcmd help)
c:\totalcmd\R\rw261\bin\Rcmd.exe INSTALL --help (shows the Rcmd INSTALL help)
c:\totalcmd\R\rw261\bin\Rcmd.exe INSTALL (echoes where it will install to)
Useful Packages (WIP)
teradataR_1.0.1.zip --- not on CRAN, get from Teradata directly --- REQUIRES R 2.11 through 2.xx (cannot be used with R 3.0.0 or later)
library(teradataR)
The Lowdown on Packages
-----------------------
R has some great capabilities out of the box.
But the best thing about R is that it is extensible and has a very active user community of academics and professionals — statisticians and practitioners of machine learning, forecasting, data mining, big data, quantitative finance, and other analytic areas.
There are about 2 dozen packages installed with the base R system.
These aren't loaded in -- you have to load them in yourself if you want to use the functionality they provide.
Then there are thousands of user contributed packages, archived in CRAN (Comprehensive R Archive Network) modelled after CTAN (for TeX users) and CPAN (for Perl users).
Commands:
Within RGui > Packages Menu > Load Packages .... shows you the dozen or so packages included in the base installation.
> Packages Menu > Set CRAN mirror --- set this once so you're not prompted again. Choose a location near you, e.g. UK (London)
> Packages Menu > Install Packages ... choose from CRAN list
> Packages Menu > Install Packages from ZIP ... unfortunately these are not available (as of this writing) on CRAN, so you have to get them externally as ZIP and then load in as ZIP.
Essential Packages --- whatever you can get from CRAN is best kept that way so you get the latest...
# Avail from CRAN (R Menu > Packages > Install Packages > Choose Mirror > Select packages... )
library(RODBC) # for connecting to databases (data base import to data frame)
odbcDataSources()   # shows available connections installed on your machine (e.g. Teradata, DW Master, ebraha01)
library(rgl) # 3D rotatable plot
library(scatterplot3d) # 3D scatterplot / point cloud
library(ggplot2) # beautiful filled plots, http://stackoverflow.com/a/7027666/181638 (installs 13 others)
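As a quick sketch of the 3D packages just listed (a random point cloud, purely illustrative):
library(scatterplot3d)
x <- rnorm(100); y <- rnorm(100); z <- rnorm(100)
scatterplot3d(x, y, z, angle=40, highlight.3d=TRUE)   # static 3D point cloud
library(rgl)
plot3d(x, y, z)   # opens a rotatable OpenGL window (drag with the mouse)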
Appendix 2: Working with Data: load built-in (see Packages above), generate Randomly, import from File, pull from Database
WORKING WITH DATA
Loading data from a built-in library (see Packages above)
library(datasets)
plot(beaver1$time, beaver1$temp)
title("Body Temperature of Beavers, *C\nSource: built-in dataset, beaver1")   # note: can wrap text with \n newline
Sampling from a random distribution:
This is useful for testing, generating data for inputting into a function or plot, or running simulations
General form: rdist(N, distparams...)
runif(100,0,1)   # draw 100 samples from the uniform 0-1 distribution
rnorm rpois rexp rgamma rbeta
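A small sketch of sampling in action (the parameters are arbitrary):
x <- rnorm(1000, mean=10, sd=2)   # 1000 draws from N(10, 2)
mean(x); sd(x)                    # should come out close to 10 and 2
hist(rpois(500, lambda=3))        # count data from a Poisson(3)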
Read / Load data from a file
getwd()   # You'll need to know what directory you're working in.
fn <- "filename.dat"

READ IN A **TABLE** OF NUMBERS INTO A DATAFRAME
data <- read.table(fn)   # reads multi-column, char-separated table into a data frame
d <- read.table(fn, skip=15, sep="-", fill=TRUE, header=FALSE)
    # skip n lines, specify separating character, allow ragged data
    # (i.e. different number of elements per row), no header
names(d)
head(d)      # top 5
head(d,10)   # top 10
tail(d)      # bottom 5
d$V1         # show column by name; default names are V1, V2, ...
d[1]         # show column by index
d <- readLines(fn)   # reads TEXT file line by line as a list of strings
writeLines()         # writes to a file line by line

READ IN A **COLUMN** OF NUMBERS INTO A VECTOR
list = scan("jj.dat")

READ IN A SINGLE **ROW** OF COMMA-SEPARATED NUMBERS
list = scan("list.txt", sep=",")

READ IN A CSV **TABLE** INTO A DATAFRAME
frm = read.csv("file.csv", header=TRUE)
frm = read.csv(filename, header=FALSE)
colnames(frm) = c('t','v')   # to add your own column headers

READ IN DELIMITED DATA
read.delim()   # reads TAB or other delimited file, other than CSV

READ IN A BUNCH OF FILES IN A LOOP:
dir /B *.csv > files   (in cmd shell)
files = readLines("files")                 # read in all filenames
for (filename in files) { do(filename) }

# Example reading in columns of different types
coltype = c('integer','integer','character','integer')
f2026 = read.csv("F20-26.csv", header=TRUE, colClasses=coltype)

QUALITY CHECK --- read in and write out the data
data = read.table("foo.csv", header=TRUE, sep=",")
write.csv(data, "out.csv", row.names=FALSE)    # writes out WITH column names...
data = read.table("out.csv", header=TRUE, sep=",")
write.csv(data, "out2.csv", row.names=FALSE)   # writes out WITH column names...
Writing out data to files
write.table(x, file = "foo.csv", sep = ",", col.names = NA, qmethod = "double") ## and to read this file back into R one needs read.table("foo.csv", header = TRUE, sep = ",", row.names = 1) ## NB: you do need to specify a separator if qmethod = "double". ### Alternatively write.csv(x, file = "foo.csv") read.csv("foo.csv", row.names = 1) ## or without row names write.csv(x, file = "foo.csv", row.names = FALSE) read.csv("foo.csv") WRITE OUT TO PNG, PDF, etc. # a convenient one shot deal -- THIS WORKS! fname = sprintf("plot_%d.png",i) # For use within a loop and automatic naming dev.print(png, file=fname, width=300, height=300) # prints with transparent background, view in mspaint drawOn(500,350) # from one of my library files def=par(las=2, cex=.6, cex.axis=.8) barplot(t1rptot[1:15], ylim=c(0,30), main="Cholesterol Health Assessment", ylab = "Percent of Total Study Population", names = labt1r) par(def) drawOff(500,350,"37a") dev.print(png, file=fname, width=300, height=300) pdf("beamplot.pdf", width=800, height=600, paper="USr") # setup the PDF file to print on landscape Letter (US) sized paper ... plot stuff dev.off() #drivers pdf("file.pdf") # graphics driver - like sink, but to pdf() and only for graphics ... plot stuff dev.off() # close file png() bmp() jpeg() dev2bitmap() bitmap() # closes the stream to the driver dev.off() deviceswin.print() WRITING OUT TO VECTOR GRAPHICS FILE (SO CAN SCALE, ETC.) http://www.stat.wisc.edu/~deepayan/SIBS2005/demos/embed-output.html Bitmap formats are not suitable when there is the possibility of graphs being scaled (either when being viewed on-screen or after printing). This is usually the case for Word documents or Powerpoint presentations. For such use, the appropriate type of output devices to use are those that produce so called vector graphics, where the output file contains the plot information as a collection of geometric objects and text, and not pixels as in bitmap formats. The most popular of these formats (on Windows) is the Windows Metafile format. The following example produces a scatter plot using the iris data in a file called iris.wmf data(iris) win.metafile(file = "iris.wmf") plot(Sepal.Length ~ Sepal.Width, data = iris) dev.off() Once this is done, the file iris.wmf can be included in a Word or Powerpoint file. Another popular vector format is PDF, which can be included in other documents, as well as viewed or printed directly. WRITE OUT A CSV write.csv(frm1, "out.csv", row.names=TRUE) # writes out the rownames and column names... frm2 = read.csv("out.csv", header=TRUE) # reads in the output ? HOW TO APPEND? WRITE OUT A DATA TABLE write.table(data, "akhb2006m9.csv", sep=",", row.names=FALSE, col.names=TRUE) write.table(data, sprintf("%s%s",filename,"_test.csv"), sep=",", row.names=FALSE, col.names=FALSE)
Loading data directly from a database
use package RODBC
db <- odbcDriverConnect("dsn=DW Master;uid=xxxxxxx;pwd=xxxxxxx;")   # make connection to database
qstr <- "select top 10 * from cctr;"                                # store sql query
frm <- sqlQuery(db, qstr, believeNRows=FALSE)                       # submit query and store result in data frame

@@PITFALL! R is CASE SENSITIVE. So myVar is not the same as myvar. Be especially careful when importing from a database.

From Database: use RODBC
sqlString <- " put your sql statement here -- end with ;"   # query string
res <- sqlQuery(channel, sqlString, believeNRows=FALSE)
Data is pulled back into a data.frame.

Tip: Import from the database, then export to a file so that later you can easily re-import the dataset offline.
Tip: Take care to formulate / transform your dataset on the database side -- this keeps the data side separate from the statistical transformations you'll do in R solely for the purpose of analysis, visualisation, summarisation.
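Putting the pieces together, a minimal end-to-end sketch (the DSN, credentials, and table name are placeholders):
library(RODBC)
db  <- odbcConnect("MyDSN", uid="user", pwd="pass")    # or odbcDriverConnect() with a full connection string
frm <- sqlQuery(db, "select * from some_table;", believeNRows=FALSE)
write.csv(frm, "snapshot.csv", row.names=FALSE)        # keep an offline copy for easy re-import later
odbcClose(db)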
@@If you do nothing else, at this point you should be able to see the power of R and you should be able to use it enough to want to (and need to!) learn more.
The next chapter will give you what you need to take control of the power you see around you.
Appendix 3: Mixing Languages with R
Excel
Python
SQL / Teradata databases
Ruby (used to be patchy)
Perl (used to be patchy)
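The links above cover the dedicated bridges; as a minimal R-side sketch, the lowest-common-denominator bridge to Excel (and most other tools) is CSV (file names here are placeholders):
results <- data.frame(id=1:3, score=c(0.2, 0.5, 0.9))   # toy data
write.csv(results, "results.csv", row.names=FALSE)      # opens directly in Excel
xl <- read.csv("from_excel.csv", header=TRUE)           # read back what Excel saved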
R Cookbook and References - Assad Ebrahim
A Reference Collection for R Programmers
Appendix 4. R as a Numerical Platform
library(datasets)   # load built-in datasets (see 00Index.html file in .\R\library\datasets\html\)
d <- beaver1$temp
summary(d)
quantile(d)
quantile(d, 0.9)
Appendix 5. R as a Graphing & Visualization Platform
Plot Types: Plots & Curves
R has many plot types. The 3 most basic are: scatterplot plot(), histogram hist(), and boxplot(). Others are: timeseries plot.ts(), Q-Q plots, contour plots (2D), 3D plots, wireframes, mesh plots, animation plots, heat maps, map overlays...
ADVANTAGE: R can easily graph 1M data points. In Excel, you struggle beyond 10K points. So easily a factor of 100 better.
Syntax:
plot(x,y)
lines(x,y)
curve(f(x), add=TRUE)
hist(x)
boxplot(x)
barplot(x)
Example: Barplot with values above the bars
months = c("Jan", "Feb", "Mar", "Apr", "May", "June", "July") org2020 <- c(3.7, 3.8, 3.7, 3.7, 3.9, 4.0, 4.0) x<-barplot(org2020, names.arg=months, main="Scores from 0 (low) to 5 (high)", ylim=c(0.0,5.0)) y<-org2020 text(x,y*1.1,labels=as.character(y))Examples:
# plot curve with endpoints
x <- c(0,10)
y <- x^2
plot(x,y)
curve(x^2, add=TRUE)
# plot from dataset and add overlay
library(datasets)   # load built-in datasets (see 00Index.html file in .\R\library\datasets\html\)
plot(beaver1$time, beaver1$temp)
curve(0*x + 37.0, add=TRUE)   # overlay function must have an x
# other plots
hist(beaver1$temp)
boxplot(beaver1$temp)
Inspect the data behind a plot by setting plot=FALSE and assigning.
Example: h <- hist(x, plot=FALSE)
Titles, Labels, Ranges: note that the title goes at the top, the subtitle at the bottom of the chart
plot( ..., main="title", sub="subtitle", xlab="x axis label", ylab="y axis label" xlim=c(x0,x1), ylim=c(y0,y1) )Setting axes:
Or do it separately:
title(main="", sub="", cex.main=1, cex.sub=0.75)   # title and subtitle, with sizes
mtext( , line=1)   # subtext below the title; give one line gap between mtext and graph
Plotting an abline()
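A short sketch of abline(), drawn on a built-in dataset (the intercept/slope values are arbitrary):
plot(cars$speed, cars$dist)           # built-in dataset: stopping distance vs speed
abline(h=40)                          # horizontal reference line
abline(a=-17, b=4, lty=2)             # line with intercept a and slope b, dashed
abline(lm(dist ~ speed, data=cars))   # least-squares line through the points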
Advanced: Histograms
hist(x)                # histogram, either of counts (frequencies) or relative frequencies (densities)
hist(runif(100,0,1))   # useful for testing: fill the histogram with randomly generated data
hist(runif(100,0,1), breaks=c(LEFT, breaks..., RIGHT))
Breaks: LEFT and RIGHT must contain the data range. The rest are internal separators and determine the number of bars (bins, buckets).
Example:
test <- rnorm(100)
hist(test, breaks=c(-3,-2,-1,0,0.5,1,1.5,2,2.5,3))
hist(..., xlim=c(x0,x1), ylim=c(y0,y1), density=50, col="blue")
Pretty shaded histograms
>> coloring different bars differently: use library(ggplot2) http://stackoverflow.com/a/7027666/181638
>> relative frequency (probability):
Pitfall: R does not produce *relative frequency* (probability) histograms out of the box. This requires a bit of extra code, as given by @BondedDust (http://stackoverflow.com/a/17433446/181638). Furthermore, if you specify breaks yourself, then R defaults to showing the DENSITY, not the COUNTS (frequencies).
What's the difference? A density integrates to 1: each bar's area (bin width x density) is its share, so the width of your breaks makes a difference.
Solution: to get relative frequency, what you need is counts/total. [Answer to the question: why are the densities NOT adding to 1?]
h$counts <- h$counts/sum(h$counts)
plot(h, freq=TRUE)
boxplot(x)
Boxplots are important for identifying outliers. The neat thing is side-by-side boxplots for easy comparison of two populations a, b. All populations to be plotted separately are presented in one list:
boxplot(list(a,b), names=c("a","b"))
Advanced: Magic with figures
Recommended References:
Tips: http://blog.revolutionanalytics.com/2009/01/10-tips-for-making-your-r-graphics-look-their-best.html
Gallery: http://www.r-project.org/screenshots/screenshots.html
Graphics Export: https://www.stat.berkeley.edu/classes/s133/saving.html

Prettification (Pretty, Beautiful), Annotation, Labelling, Titling, Symbols, Line styles, Colour, Multi-plots, Producing plots (exporting to PDF or EPS)

Plot / Chart Types:
windows()   # deactivate the current plotting window and open a new one, making it active
plot(x,y)   # 2-D scatterplot
par( )      # for setting parameters
scatterplot3d(x,y,z, angle=a, highlight.3d=TRUE)   # external package. angle=80 (view from top), angle=10 (view from side). Highlighting helps viewing.
contour()   # 2-D contour plot
image()     # 2-D image / gradient plot / heat plot
persp()     # 3-D perspective plot
trans3d()
plot3d(x,y,z)   # [in package rgl] rotatable 3-D plot. External package --- R OpenGL. Rotatable (left mouse button and drag) and zoomable (scroll wheel).
par3d( )    # for setting parameters
play3d(spin3d(rpm=6), duration=5)   # to make the image spin
Configuring 3d plots is tricky:
http://stackoverflow.com/a/15587188/181638
http://stackoverflow.com/a/9658576/181638
doc: http://www.inside-r.org/packages/cran/rgl/docs/decorate3d
blog: http://www.stattler.com/article/three-dimensional-plots-using-rgl-package *** EXCELLENT -- including auto rotation animations ***
tutorial: http://scs.math.yorku.ca/index.php/MATH_6627_2012-13_Practicum_in_Statistical_Consulting/R_tutorials/rgl_tutorial

Regression model
For our example we shall use 3 variables: "mpg" (Miles/(US) gallon), "cyl" (Number of cylinders) and "disp" (Displacement in cu.in.). mpg and cyl will be predictors and disp will be our response. The model we are interested in is:
y_disp = b_0 + b_mpg * x_mpg + b_cyl * x_cyl
> fit <- lm(cars.red$disp ~ cars.red$mpg + cars.red$cyl, data = cars.red)
> coefs <- coef(fit)
> planes3d(a=coefs["cars.red$mpg"], b=coefs["cars.red$cyl"], -1, coefs["(Intercept)"], alpha=0.50, col="plum2")

Plot items:
plot(..., xlab="", ylab="")   # x,y,z labels
plot(..., main="", cex.main=1, sub="", cex.sub=0.75)   # title (top of plot), subtitle (bottom of plot), with sizes
mtext=""   # text beneath the title
lty="2"    # dashed line. See the R RefCard.
plot(..., pch='o')   # choose plot symbols, e.g. 'o', 'x', 'v'; pch=20 gives filled circles
cex=2      # twice the size; 0.5 is half the size
plot(x,y, col=(km$cluster+1))   # colour-code plot symbols with a class-assignment vector (classes 1,2,3,...)
palette()  # shows the list/sequence of colours: 1=black, 2=red, 3=green, 4=blue, 5=cyan (light blue), 6=magenta (pink), 7=yellow, 8=gray
plot(..., type="l|p|b")   # plot type: lines (l), points (p), or broken/both (b)
Sizes:
cex=1      # point size
cex.main   # title size
cex.sub    # subtitle size
cex.axis   # axis tick labels size
cex.lab    # axis name labels size

@@PLOT BEAUTIFICATION & PRETTIFICATION
Gallery: http://www.r-project.org/screenshots/screenshots.html
plot(x, y, xlab="", ylab="", type="l")   # the plot, with axis labels, and either a line graph (type l) or points (type p)
title(main="", sub="", cex.main=1, cex.sub=0.75)   # title and subtitle, with sizes
mtext( , line=1)   # subtext below the title; give one line gap between mtext and graph
Margins:
mai   # margin vector in inches
mar   # margin vector for plot area. default: c(5,4,4,2) + 0.1   # bottom, left, top, right (good for subtitle, one-line mtext, roomy around graph)
      # c(5,4,5,2) + 0.1   # with sub, two-line mtext, roomy
      # c(5,4,2,0) + 0.1   # with sub, no mtext, roomy
      # c(4,4,2,0) + 0.1   # no sub, no mtext, tight around title, bottom, and right hand side
mgp   # margin vector for axis area. default: c(3,1,0)   # axis title label, axis tick label, axis line
      # c(2.2,0.8,0)   # less space between axis title labels and plot
      # c(1.8,0.5,0)   # fits 4 graphs per view
Excellent reference: http://research.stowers-institute.org/efg/R/Graphics/Basics/mar-oma/index.htm
plot(..., width=, height=)   # plot sizes; typical nice sizes are 400x300, 350x350, or 350x300

@@Multiple Plots in a Window (View)
windows()                              # create new active window
view <- matrix(1:2, nrow=2, byrow=T)   # define view: matrix(1:totcells_nxm, nrows, orientation)
layout(view)                           # assign view to layout
pdef <- par(cex.axis=0.75, mar=c(4.3,3.2,2,0)+0.1, mgp=c(2.2,0.8,0))   # define tight margins to fit all elements
par(pdef)

@@Outputting Graphs to Screen and File
plot(...)
dev.copy(png,'myplot.png')
dev.off()

@@Easier graphics with the ggplot2 library
Fancy logically coloured histogram: http://stackoverflow.com/a/7027666/181638
test <- rnorm(100)
dat <- data.frame(x=test, above=test>1)
library(ggplot2)
qplot(x, data=dat, geom="histogram", fill=above)
Slick way to plot & save the plot to PNG: http://stackoverflow.com/a/8400798/181638
library(ggplot2)
ggplot_alternative <- function() {
  x <- seq(0, 1, length.out = 100)
  the_data <- data.frame(x = x, y = pbeta(x, 1, 10))
  ggplot(the_data, aes(x, y)) +
    geom_line() +
    xlab("False Positive Rate") +
    ylab("Average true positive rate") +
    coord_cartesian(0:1, 0:1)
}
ggsave("ggtest.png", ggplot_alternative(), width = 3.25, height = 3.25, dpi = 108)
Interactive Graphics and Animation
iPlots: http://www.r-project.org/conferences/DSC-2003/Proceedings/UrbanekTheus.pdf
library(rgl)             # 3D rotatable plot
library(scatterplot3d)   # 3D scatterplot / point cloud
library(ggplot2)         # beautiful filled plots, http://stackoverflow.com/a/7027666/181638 (installs 13 others)
ANIMATIONS
- can animate by adding a slider bar (Excel)
- can animate by looping with a pause in between
- can animate by printing images and then using an image viewer to scroll through
- can animate by printing images and then putting them into a movie using ffmpeg
- can animate by printing images, putting them into ImageJ, and saving as an animated gif: http://imagej.nih.gov/ij/
Plotting to Image File
# draws a graph to png
drawOn <- function(x,y) {
  windows(bg="white", width=x, height=y)
  # LATER: for LAS=2, set margin bottom from 4.5 to 7.0
  # margin order: bottom, left, top, right
  par(cex.axis=0.9, mar=c(4.5,3.2,2.5,0.5)+0.1, mgp=c(2.2,0.8,0))   # set to tight for Word insertion
}
# PUT THE BODY BETWEEN HERE
drawOff <- function(x,y,str_id) {
  par(cex.axis=1, mar=c(5.1,4.1,4.1,2.1), mgp=c(3,1,0))   # set back to defaults
  fname <- sprintf("fig_%s.png",str_id)
  dev.print(png, file=fname, width=x, height=y)
  dev.off()
}

# Example
drawOn(500,500)
plot(rnorm(1000,0,1), rnorm(1000,0,1), xlab="X position", ylab="Y position",
     main="Random Buoy movement above an anchored location")
drawOff(500,500,"01")

# Function for preliminary analysis --- automatically reading in a file, plotting it, and writing the graphic to a file
do = function(filename) {
  # get input
  d1 = read.csv(filename, header=FALSE)
  names(d1) = c('t','v')
  # plot to screen and png
  drawOn(500,500)
  plot(d1$t, d1$v, xlab="Time (s)\n(Sampling Resolution: 1 us)",
       ylab="Volts (V), (Voltage Resolution: 0.01 V)")
  drawOff(500,500,filename)
}

Ref 1
@@1D1 R ON NUMBERS (SCALARS)
Arithmetic: + - * / ^ **
Notice: exponentiation is either ^ or ** (Matlab and Python/Ruby syntax)
Mod: %%
Comparison: < > <= != ==
Strings: ' "
is.na !is.na
Higher Functions: exp, log, log10, pi, sin, cos [note: sin and cos work in radians]
Probabilities: dnorm, pnorm, qnorm; likewise pois, exp, gamma, beta, ...
Assignment to Variables:
Operator: <- (works at every level)
= (only works at the top level, or for setting parameters inside function calls)
Though actually there are arguments for using = to avoid bugs: http://www.win-vector.com/blog/2013/04/prefer-for-assignment-in-r/

@@PITFALL: Assignment: <- not =
The first thing you'll note is that R idiom uses <-. Most newbies would prefer =. But the problem is that = only works at the TOP LEVEL, i.e. it won't work as an assignment inside function calls etc. So while in 99% of basic cases there is no difference, doing anything advanced will require you to use <-, so better get used to it now.
http://stackoverflow.com/a/1742550/181638
http://ig2600.blogspot.co.uk/2012/04/5-assignment-operators-of-r.html

@@PITFALL: Trapdoors from Numerical Mathematics
Machine Numbers: Don't forget that you're working on a machine! 0 is NOT exact. Subtraction of equal numbers is NOT exact. As an example: sin(pi) is NOT zero, but 1.22e-16. So when dealing with 0, you have to TRAP the case, e.g. IF X > -1e-8 AND X < 1e-8 THEN 0 ELSE X ENDIF

@@1D2 Data Structures in R
@@1D2a LIST / ARRAY / VECTOR / TUPLE MODE
A list is the generic object. Because R can add and multiply two lists element-wise, we automatically get tuple-mode and array-mode. Vectors are the mathematical designation -- from a computational point of view, * stays element-wise; for the matrix-style inner product use %*%.
c(1,2,3,4) + c(5,6,7,8)   # any operation will work ELEMENT-WISE: +,-,*,/,** or ^
x <- c(1,2,3,4)
x <- 1:4   # better
sum(x)     # sum
prod(x)    # product
How to do a running sum? You need sapply, or sequential apply:
sapply(sequence, function to apply to each element, other arguments of the function)
rsum <- function(i,x) { sum(x[1:i]) }   # x is a vector, i is the index
sapply(1:length(x), rsum, x)   # runs over each index and calculates the running sum of x, returning the result in a sequence
Note: to pass multiple arguments, list the 2nd and successive ones after the function.
How to do a running product? The same way.

@@1D2b MATRIX MODE
a <- matrix(c(1,2,3,4), nrow=2, ncol=2, byrow=TRUE)
b <- matrix(c(5,6,7,8), nrow=2, ncol=2, byrow=TRUE)
a+b   # NOTE: all operations are still element-wise
a %*% b    # MATRIX-MATRIX multiplication
t(a)       # transpose
solve(a)   # calculate A^-1
NOTE: vectors are treated as COLUMNS when doing matrix multiplication, so the usual conventions apply: v'v
v <- c(1,2,3,4)
t(v) %*% v   # dot product
solve(A,b)   # solves Ax = b if A is square, invertible, and b is in the range of A
. <- function(v1,v2) { t(v1) %*% v2 }   # dot product
How to define a new operator, e.g. v.v? Use the %...% infix syntax:
"%.%" <- function(v1,v2) { t(v1) %*% v2 }   # now v %.% v gives the dot product

@@1E USE R IN GRAPHING MODE.
@@1F USE R IN STATISTICAL SUMMARY MODE.
summary, mean (average), median, quartiles, IQR

Numeric Summarisation:
Categorical (qualitative) variables:
levels(vec)           # list the distinct categories found in a vector of categorical variables
factor(vec)
levels(factor(vec))   # same as levels, but for an integer vector; returns the values as strings
as.integer(levels(factor(frm$report_week)))   # returns as integers
table(vec)            # list the levels and counts
Quantitative variables:
summary(vec)          # produce the 5-number summary
Data frames having both quantitative and qualitative variables:
tapply(frm$measure, frm$category, summary)   # returns an array of lists, each list holding the 5-number summary **for that category**! very useful
Excellent reference: http://stackoverflow.com/a/7141669/181638
But usually you want:
by(frm, frm$category, summary)   # shows how all of the attributes in frm are summarised by class, !! but returns a vector
or:
aggregate(frm$measure, list(frm$category), summary)        # ** returns a data frame ** -- notice you have to coerce the category into a list
s <- aggregate(frm[,c(1:3)], list(frm$category), summary)  # ** returns a data frame ** -- notice you can even have multiple measurements! :)
aggregate(frm[,cols], list(rep(1,nrow(frm))), summary)     # return a summary of everything -- i.e. all rows in the same class
Note: you can use median or mean instead of summary.
or: ave provides the element-wise average, so it is useful with a list of lists (aka a matrix or dataframe), or with sapply
or: ddply (in package plyr)

@@2 A DEEPER UNDERSTANDING OF R
@@2A R's Data Structures
@@2A1 arrays, vectors, lists
Creating:
c( )   # create, concatenate
seq( )
rep( )
1:5
Indexing into an array is done with square brackets, e.g. x[2].
End of an array: length(x). End of a row/column of a matrix: ncol(row_x), nrow(col_x).
Any list of integers can be used as indices into a further list: v[1:5]
You can also specify what you *don't* want, i.e. set complements. This is very powerful. e.g.
x <- c(1,2,3,4)
x[-1]       >> 2 3 4
x[-2]       >> 1 3 4
x[-(1:2)]   >> 3 4
which( condition )   # returns the INDEX/INDICES of a list corresponding to the condition(s)
subset(x, expr)

@@2A2 matrices, tables, frames
frm[1,]    # row 1
frm[,1]    # column 1 (matrix notation)
frm$col    # select column by name
frm[[1]]   # select column by number
subset(x, expr, cols)   # select subset
subset(frm, new_dg=="D1", sig_nois_rat)   # example

@@2A3 data types
numeric, float, int, ...

@@2A4 INTROSPECTION
When things go wrong, as they always will, it is handy to find out what the data type is.

@@2B Working with Lists
In R you should be working with lists (vectors). Scalar operations act on a list but may need na.rm=TRUE.
Example: min, max
library(datasets)
d <- airquality$Ozone
max(d, na.rm=TRUE)
The FUNCTIONAL PROGRAMMING paradigm creates vector operations that act element-wise on the list while avoiding explicit looping in R. In R, these are the *apply functions, summarized here:
**Mnemonics**
* lapply is a *list* apply (acts on a list and returns a list or list of lists)
* sapply is a *shaped* sequential apply (lapply but shaping output naturally -- you will usually use this one)
* vapply is a *vectored* apply (vectored sapply for speed, returns atomic vectors)
* rapply is a *recursive* apply (for recursive lists, i.e. lists within lists)
* tapply is a *tagged* apply (the tags identify the subsets)
* apply is *generic* (applies a function to a matrix's rows or columns)
Using the apply family requires a change in perspective (users of Lisp will recognise the paradigm immediately):
- Advanced R: Functional Programming, by Hadley Wickham
- Simple Functional Programming in R, by Michael Barton
Example:
square <- function(val) { val**2 }
sapply(d[!is.na(d)], square)

@@2C Working with Frames
@@2C1 Subsetting

@@3 PROGRAMMING R
Programming in R is essential for efficiency and reproducibility.
@@PITFALL! R is CASE SENSITIVE. So myVar is not the same as myvar. Be especially careful when importing from a database.

@@3B R programming style
A bit of Forth (refactoring, small definitions), a bit of functional, a bit of objects.
@@PITFALL! DRY: Don't repeat yourself! In R it is very easy to do copy-paste development. This is a terrible habit to get into. It means you will have very long files, and iteration will be slow. Get into the habit of honouring DRY. If you need to repeat yourself, REFACTOR into a function.
@@PITFALL! Clean up as you go. As you refactor, don't forget to clean up old definitions so that you don't stumble across the wrong one. NOTE: You don't have to redefine definitions that include other functions that you have changed: R resolves references at RUN-TIME, not COMPILE-TIME (like Forth). Finally, always re-test a long script after clearing out all of R's memory before finishing, to see that everything works "from scratch".
ls()                   # shows all the user objects
remove(x)
remove(list=ls())      # removes all your objects, variables, functions, definitions
rm(list=ls(all=TRUE))
rm(list= ls()[!(ls() %in% c('keepThis','andThis'))])   # removes all objects except keepThis and andThis

@@3C Pragmatic Programming
Comments

@@3A Defining a function

@@3C Strings & Formatting in R
Working with Console Output:
cat is a console output that adds a space between parameters:
cat("\nLoading",length(rownames(data)),"records...\n")
print()
sprintf("%s %s",str1,str2)   # NICELY FORMATTED OUTPUT

@@3D Console INPUT
scan()

@@3D Looping in R & Conditionals
if
for: for (x in seq(0,1,by=0.3)) { cat(x) }
Filling a list:
# this forms a ragged list of lists
L <- list()
for (x in seq(1,100)) {
  N <- 1+rpois(1,5)        # ensures no 0-length lists
  L[[x]] <- rnorm(N,0,1)   # lists are 1-indexed in R
}
while, repeat, break, next, switch
Q: How can I use a loop to [...insert task here...]?
A: Don't. Use one of the apply functions.
blog: http://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/
See below: numeric summarisation, and the excellent reference: http://stackoverflow.com/a/7141669/181638

@@4 PRODUCING WITH R
In this chapter we look at R AS A PLATFORM FOR PRODUCTIVITY.
@@4A Documentation, Revision Control, SWEAVE

@@5 STATISTICS, DATA MINING, ALGORITHMS, and SIMULATION WITH R
Finally, the place where R really shines --- as a platform for statisticians, applied mathematicians, scientists & data scientists, algorithmicians, quants, and anyone else who likes to play with mathematical ideas professionally or for pleasure.
@@5A References
Good ones are brilliant, and can make the entry smooth even for beginners.
Book: Introduction to Statistical Learning with R, 2014
Site: http://a-little-book-of-r-for-time-series.readthedocs.org/en/latest/src/timeseries.html#forecasts-using-exponential-smoothing

@@5C Local Web App with R Shiny
http://shiny.rstudio.com/
A web application framework for R. Turn your analyses into interactive web applications. No HTML, CSS, or JavaScript knowledge required.

@@5D Platforms, Plugins
RExcel - Excel plugin
RCommander - environment
RGUI - environment
RStudio - IDE

@@6 A COLLECTION OF USEFUL SNIPPETS, RECIPES, TRICKS
USEFUL EXAMPLES

TO KEEP A LOG OF RESULTS -- NOTE --- sink does NOT log the commands, just the output
sink("diary.txt", split=TRUE)                # directs output to sink file AND stdout
sink("diary.txt", append=TRUE, split=TRUE)   # appends to sink file
sink("log.txt")                              # output ONLY to the file
sink()                                       # closes the logging of stdout
print() does work with sink().
WARNING: sprintf() doesn't seem to work with sink??
gettextf() does NOT work with sink (v2.6.1).
capture.output() and sink() both choke on sprintf() and gettextf() -- WEIRD!

TO STORE THE COMMAND HISTORY
history(numlines)
savehistory("history.txt")
loadhistory()

TO SEE A DIRECTORY LISTING
dir()

TO SEE A LIST OF ALL OBJECTS IN MEMORY
objects() or ls()

TO REMOVE OBJECTS (KILL, DELETE, DROP)
remove(..., ...) or rm()

TO REMOVE ALL OBJECTS
remove(list=objects()) or rm(list=ls())

CONFIGURATION
options()           # shows all options
options(digits=4)

TO RUN A SYSTEM COMMAND
shell()    # less primitive
system()   # primitive

Ref 2
Environment: CASE SENSITIVE
R is case sensitive, so nAME is not the same as Name or NAMe.
#            # comment
?ref         # quick way to say help; same as help(ref)
history(N)   # show last N commands
dir()        # show current directory listing
ls()         # see defined objects
rm(obj)      # remove object; rm(list=ls()) to remove all!
sink(filename, append=TRUE, split=TRUE)   # appends to sink file while also directing output to stdout --- useful for logging results
sink()       # close file and stop logging

Finding out what data type an object is (introspection) / dumping / printing its contents:
class(obj)    # e.g. data.frame
typeof(obj)   # e.g. list
str(obj)      # *** displays the contents of the object in a useful way
print(obj)
dput(obj)
dump(obj)

HELP:
help(keyword)
help.start()            # starts online help
help(base)              # the base package of R
help(stats)             # the stats package of R
library(help="stats")   # shows all the stat functions in the stats package
google... stackoverflow... http://www.rseek.org/ [http://cran.r-project.org/web/packages/sos/index.html]

Operators:
sort(vec) sort(vec,TRUE) sort(vec,decreasing=TRUE)   # sort a column, either asc (default) or desc (TRUE)
frm[ with( frm, order(-ColOne,ColTwo)), cols ]       # sort a data table with ColOne desc, then ColTwo asc

Defining Subroutines, Procedures, Functions:
funcname = function(param1) {
  ... body ...
  0   # the last value is returned
}
so: y <- funcname(param1)   # returns 0

Global & Local Variables: http://stackoverflow.com/questions/10904124/global-and-local-variables-in-r
globvar <- ...
globfunc <- function( ) { ... }
f <- function( x ) {
  localvar <- x
  (localvar - median(localvar)) / max(-min(localvar),max(localvar))
}
g <- function ( x ) {
  globvar <<- x   # <<- assigns into the enclosing/global environment
  ...
}

Randomness:
set.seed(2)
rnorm(10)       # return 10 random values from the normal distribution N(0,1)
rnorm(10,u,s)   # normal N(mu,sigma)
runif(10)       # uniform random U[0,1]
runif(10,a,b)   # uniform random U[a,b]
rpois(10,1)     # Poisson random

Data Structures: Know your structures!
sequence, list, array, vector
data frame, table
matrix -- really an array, but treated like a 2D array

Working with Sequences (a specific type of array):
1:5   seq(1,5)   seq(1,6,2)   # create sequence

Working with vectors/arrays: vec
vec <- c(1,2,3)   # creation using the concatenation operator. Any sequence, singleton, or array can be concatenated
vec[a]            # index element(s) a; a is any sequence, singleton or array
length(vec)       # size

Working with matrices: mat
mat <- matrix(rnorm(10*2), ncol=2)   # 10x2 matrix

Working with Data Frames: frm
What is a data frame:
Size: nrow(frm) ncol(frm)   # returns row or col dimensions
View: frm[x,y] frm[x,] frm[,y] frm[1,vec] frm[a:b,1]   # show element, row, column, indexed rows or columns (vec), sequence of rows or columns
frm["ColOne"] frm$ColOne                 # use named columns to show column as column or row ($)
names(frm) colnames(frm) rownames(frm)   # show column or row names
Filter: rows <- which(frm$ColOne > 20)   # returns rows of dataset satisfying the filter condition
Subset (create a subset table based on a filter):
frm2 <- subset(frm, frm$ColOne > 20)[a:b]   # pull off specified columns for all rows satisfying the filter
Subset -- alternative:
w <- which(df$x > 0.9)
data.frame(which=w, df=df[w, ])   # creates and specifies column names
Re-ordering a data frame: frm[with(frm,order(colname1,-colname2)),]
Reset the row numbers (rowID) of a data frame: row.names(frm) <- NULL

Appendix 6: Using Scripts
You can either start rterm.exe or rgui.exe and use the command
source("helloworld.r")
OR you can use
rcmd.exe batch helloworld.r
on the command line, and the output will be written to helloworld.Rout.
### BATCH PROCESSING INFO is at help(BATCH) in R.
options(echo=FALSE)
#suppresses the text from output when running in Rcmd.exe
#when this is used in RGUI, you don't get the PROMPT
# to see the prompt again, run
options(echo=TRUE)
#RGUI is also a LOT slower with echo=FALSE, and you still see the instructions anyway...

Ref 3
CRAN packages: http://cran.r-project.org/web/packages/
Recommended Links:
http://www.statmethods.net/graphs/scatterplot.html
http://www.cyclismo.org/tutorial/R/

options(echo=F)   # setting echo to off means that the OUTPUT to the console is NOT interspersed with > signs. You can then copy and paste items into your script much more easily.

RECOMMENDED DEFAULT OPTIONS
options("digits"=5)   # default is 7; prints 5 SIGNIFICANT FIGURES only, e.g. 3.1416 NOT 3.14159. Rounds everything else.
options("scipen"=1)   # default is 0; biases whether to use fixed-width or scientific notation in numbers
options("width"=80)   # set this so that you have a fixed-width output

USEFUL COMMANDS
cat()   # no quotes, no line numbers, only explicit newlines, so can concatenate values etc. on the same line
format(), write(), print.default()
print(x, quote=FALSE)   # leaves off the quotes
?? why does R print with [1] ? aaah! these are LINE NUMBERS!
noquote()    # print without quotes
methods(print)
formatC()    # formatting numbers using C style
paste()
gettext()
cat(s, sep="\n")
cat(a, fill=TRUE)

HELP
help(command)
help(library=base)
help(any index label)
help(Logic)
??keyword

COMMENT
# this is a comment
if(FALSE){ this can serve as a multiline comment; R will never execute it (but the block must still parse, so it only works for syntactically valid code) }

NOTE! ARRAYS ARE INDEXED FROM *ONE*, not ZERO.
(For installation, see Toolchain\R)

R repositories
CRAN
Rforge: http://www.rforge.net/

R and Reporting
brew: http://www.rforge.net/brew/svn.html
sweave (for documentation): http://www.stat.uni-muenchen.de/~leisch/Sweave/
RSRuby (only for UNIX so far)

Knowledge Bases:
How-to's for statistics: http://www.statmethods.net/advgraphs/axes.html
Numerical Methods in Statistics: http://www4.stat.ncsu.edu/~monahan/jun08/toc.html
R Cookbook: http://www.r-cookbook.com/

---------- Value of R ----------
1. Exploratory Data Analysis using R
John W. Tukey began his classic book, Exploratory Data Analysis (EDA), stating that "EDA is detective work--numerical detective work--or counting detective work--or graphical detective work." He continued: "A detective investigating a crime needs both tools and understanding. If he has no fingerprint powder, he will fail to find fingerprints on most surfaces. If he does not understand where the criminal is likely to have put his fingers, he will not look in the right places. Equally, the analyst of data needs both tools and understanding." (Tukey, 1977, pg 1). The goal of this guide is to introduce readers to using R as a tool to take part in this detective work. http://www.r-cookbook.com/node/32

2. R as a graphing calculator
# setup
R = 10   # uniform [-N,N] range
x = seq(-R,R,.5)
plot(0,0, xlim=c(-R,R), ylim=c(-R,R))
lines(0*x,x)   # draws y axis
lines(x,0*x)   # draws x axis
# functions
y = 2*x+5      # for example
lines(x,y)
# points
points(9,9)                      # black, circle
points(9,7, pch=23, col="red")   # red, diamond
# points from file
points = read.csv("points.csv", header=FALSE)
points(points, pch=23, col="red")

3. R as a matrix calculator

------------- Concepts of R -------------
1. To quit: q()
2. To escape out of a command halfway in progress: ESC
3. # is the comment indicator
4. R can be used as an overgrown calculator
5. Assignment operator: <- (e.g. x <- 2), also = (e.g. x = 2). Any difference? Only that = works just at the top level (see the assignment pitfall above).
6. Symbolic variables:
- names *are* case-sensitive (so WT is not the same as wt)
- reserved names: c, q, t, C, D, F, I, T, diff, df, pt
7. Data vectors can be created with c():
- c(...,...) creates data vectors by listing their elements (e.g. weight <- c(50, 72, 57, 90, 95, 72))
Vector arithmetic is element-wise, with a shorter vector cycled (recycled) if the length of the longer is a multiple of the length of the shorter -- this is the convention for vector arithmetic. This makes it easy to do stats calculations such as mean and std. deviation, e.g. mean(x), sd(x), t.test(x). Notice: use the p-value to assess your hypothesis relative to the theoretical mean mu. Use the 95% confidence interval to see what the sample tells us about the 95% confidence range of the true mean if we were estimating it.
8. Plots:
- plot(x,y) -- one vector against the other, or, by default, against the index of the vector
- lines(x,y) -- plots and connects the plotted points by a line and adds the line to the existing plot. NOTE: x should be sorted!

--------------------------------- HIERARCHY OF MATHEMATICAL OBJECTS ---------------------------------
scalars: pi, exp(1)
functions: sin(pi), cos(pi) = -1
vectors / arrays:
o = c(1,2,3)   # creates vector/array
o[1]           # access first element -- notice, arrays are indexed from 1
rep(vector,2)                        # repeats vector twice, one after the other
rep(vector, rep(2,length(vector)))   # repeats each element of vector twice
a:b   seq(a,b,step)
matrix():
%*% is matrix multiplication
v = 1:14
matrix(v, nrow=2, byrow=FALSE)   # assigning down columns -- this is R's standard for tables
matrix(v, nrow=2, byrow=TRUE)    # assigning across rows
Example: extracting independent tables from a three-way table:
bp5i = table(ofacBPmeds, ofacBP, HasIns)      # 2 levels, 4 levels, 2 levels
t1 = matrix(bp5i[1:8], nrow=2, byrow=FALSE)   # first table of 8
t2 = matrix(bp5i[9:16], nrow=2, byrow=FALSE)  # second table of 8
Matrix by row binding or column binding:
rbind(r1,r2,r3,...)   cbind(c1,c2,c3,...)
v = 1:14          # FIRST assign values to a vector, THEN dimension into a matrix
dim(v) = c(2,7)   # row by column, COLUMN MAJOR!
v[1,2]   # shows element i,j
v[1, ]   # shows row 1
v[ ,1]   # shows column 1
data frame:
col1 = ...
col2 = ...
data = data.frame(col1, col2, ...)
is.data.frame(d)
COLUMN SUM: sum(matrix[,1])
ROW SUM: sum(matrix[1,])
CATEGORICAL VARIABLES
categories = c("level1", "level2", ...)
catcounts = c(x1, x2, ...)
rep(categories, catcounts)
table(factor(natpop, levels=agelab))   # allows ordered labels (not autosorted)

===================== R Language 2/23/2006 11:17AM AKE

QUALITY DATA ANALYSIS WORK
--------------------------
Always three things:
- numbers (absolute counts as well as relative percentages)
- words
- pictures (charts, graphs, visuals)
(summary) (then later statistics) (subgroup exploratory data analysis) (pivot by factors exploratory data analysis)

For categorical data:
table counts (or ftable -- formatted table -- better)
ftable(addmargins(table(HasIns, ofacBPmeds)))
table percentages
plot (percentages)
For numerical data, show:
quantile calculations (5-number summary: min, max, median, mean, quartiles, IQR)
histograms (if there is a notion of categorical grouping into bins, e.g. age bins)
plots
box plot
For comparison of grouped categorical data, show: plot of two-way tables
For comparisons of grouped numerical data, show: side-by-side boxplots

For continuous variables, test for non-normality:
# - remove NaNs
data.Age <- data$Age[!is.na(data$Age)]
# - form histogram
data.Age.hist <- hist(data.Age)
# - view statistics
quantile(data.Age)
mean(data.Age)
sd(data.Age)
# - view **linear** correlation between two continuous variables
cor(Age, Creatinine, use="pairwise")
Is correlation robust? I.e. is it pulled by outliers? NO! It IS pulled by outliers...
# **linear** modeling:
linfit = lm(response ~ predictor terms)
linfit = lm(y ~ 0 + x)   # to force not using a y-intercept
abline(linfit, lty=2)    # plot as a dotted line

# **quadratic** modeling:
quadfit = lm(response ~ term1 + I(term2^2))   # the I is essential so that ^ means power

# How to compare the quality of the fit? You want the magnitude of the sum of squared errors, so you're looking for the R-squared value.
linfit = lm(y ~ x)
summary(linfit)
quadfit = lm(y ~ x + I(x^2))
summary(quadfit)
plot(x, y)
curve(predict(quadfit, newdata=data.frame(x=x)), add=TRUE)
abline(linfit, lty=2)

# general modeling (non-linear least squares)
# note: start names the model *parameters* (here a, b), not the data columns
nlsfit = nls(CholLDL ~ a*Age^2 + b*Age, data=thisframe, trace=TRUE, start=list(a=1, b=1))

# - create normal distribution model
med.Age = median(data$Age, na.rm=TRUE)
IQR.Age = summary(data$Age)[5] - summary(data$Age)[2]   # IQR
norm.Age = rnorm(1000000, mean=med.Age, sd=IQR.Age)
plot(density(sort(data$Age)))
plot(density(sort(norm.Age)))   #, xlab="Age", ylab="Density", sub="Figure 7", main="Age Distribution of the National Sample"
# - set breaks
Agebreaks <- c(0, data.Age.hist$breaks)
# - plot normal distribution
norm.Age.hist <- hist(norm.Age, breaks=Agebreaks)
# - plot comparisons
windows()
view <- matrix(1:2, nrow=1, byrow=T)
layout(view)
hist(data.Age, freq=FALSE)
hist(norm.Age, breaks=Agebreaks, freq=FALSE)
# - count comparisons

============================================ Statistical Methods in R
Assad E.K. Ebrahim, M.Sc.
11/30/2007 8:40AM
3/6/2006 12:59PM
-------------------------

--- ROWS, COLUMNS, AND CONDITIONAL SELECTION

HOW MANY DATA ROWS IN A TABLE? (i.e. not including the header row)
length(rownames(data))

SHOW A ROW
Create a sequence of column indices and then show:
indices <- seq(2,10,1)
data[3,indices]   # shows row three, columns 2 through 10, stepping through cols singly
OR choose your columns:
indices <- c(1,2,3,5,6,9)
data[2,indices]   # shows the selected columns for row 2
data[1:3]         # shows all rows with columns 1 through 3 inclusive
data["BPsys"]     # shows as a row
data$BPsys        # shows as a column

HOW MANY COLUMNS?
length(names(data))

SHOW COLUMN HEADERS
names(data)
colnames(data)

SHOW A SUMMARIZED COLUMN WITH LEVELS AT THE END
data$column   # referenced by header name
data[,1]      # referenced by column number

SHOW A COLUMN
data[1]

SHOW MULTIPLE COLUMNS
data[2:4]   # shows columns 2 through 4

TO SEE ALL THE LEVELS ASSOCIATED WITH A COLUMN:
levels(data$colname)

QUERY
which(data$freq==...)   # does NOT include NA's
which(data$HeightIn >= 11)
rows = which(data$HeightIn >= 12)
cols = c(1,2,28,29,30)
data[rows,cols]

QUERY HOW MANY NA's, Non-NA's
length(which(is.na(data$HasHDinMaleFam)))
length(which(!is.na(data$HasHDinMaleFam)))

SORT A COLUMN
sort(vector)
s = sort(vector, index.return=TRUE)   # returns the index vector for the sort
s$ix   # index vector: element i says WHICH element of the original vector got put into the ith place of the sorted vector
s$x    # sorted result

SORT TABLE BY A COLUMN
order()

SUBTABLES
To create a subtable:
subtable <- data.frame(table$colname, table$colname2)
subset(dataframe, condition to select)
e.g. subset(data, data$Age < 18)[1:9]
Make a two-way table from a dataframe: table(data)

CONDITIONAL DEFINITIONS
Defining a population from data:
mwht.male <- mw$HeightTotalIn[mw$Gender=="M"]

COUNTS
COUNT HOW MANY ARE OF A CERTAIN KIND - either use a table or a conditional statement (Iverson notation!), e.g.
  mw$HeightTotal[mw$HeightTotal>5]          # INCLUDES NA's
  length(which(mw$HeightTotal>5))
  length(mw$HeightTotal[mw$HeightTotal>5])
  table(data$Gender)  # for categorical variables with fixed levels of a factor
OR use which:
  length(which(data$Age<7))                 # does NOT include NA's
  length(which(data$Age>9 & data$Age<10))   # should be zero
  length(which(data$Age>=9 & data$Age<10))
  length(which(data$Age==9))

KEY: in a two-way table, the first argument goes down the rows, i.e. fills a column.
barplot(table()[1:n]) fills the array DOWN the rows, i.e. a column at a time. So! To quickly get a nice visual of a two-way table with adjacent classes, put what you want to be adjacent in the FIRST table argument, then barplot it...
  barplot(table(Age,ofacBP)[1:16])
AND to make a summary percentage table, divide by your second argument, repeated as many times as you have rows, and then columns, e.g.
  barplot(table(Age,ofacBP)[1:16]/rep(table(ofacBP),rep(5,4)))
Use my special function to make life easy (a sketch of it appears at the end of this section):
  t1 = table(Age,ofacBP)
  t1p = pctable(t1)  # make percents according to column sums
To get row percents:
  t2 = table(ofacBP, Age)
  t2p = pctable(t2)
To plot with the BP grouping:
  barplot(t1p)
To plot side by side, grouped with BP adjacent:
  barplot(t1p[1:20])
To plot side by side, grouped by AGE, take a transpose first:
  barplot(t(t1p)[1:20])
  barplot(t)  # barplot with divided bars
  plot(t)     # with sectioned bars
To have the bars' thicknesses reflect the numbers, do barplot(t,t), where t is the table.

3-way tables
Example: extracting independent tables from a three-way table:
  bp5i = table(ofacBPmeds, ofacBP, HasIns)       # 2 levels, 4 levels, 2 levels
  t1 = matrix(bp5i[1:8], nrow=2, byrow=FALSE)    # first table of 8
  t2 = matrix(bp5i[9:16], nrow=2, byrow=FALSE)   # second table of 8
  labels(bp5i)     # labels for all three factors
  labels(bp5i)[1]  # labels for the first factor
# making a labeled table from a matrix
  rownames(t1) = c( ... )
  colnames(t1) = c( ... )
# Form Percentages
  t1 = matrix(bp5i[1:8], nrow=2, byrow=FALSE)
  t2 = matrix(bp5i[9:16], nrow=2, byrow=FALSE)
  rownames(t1) = c("Y","N")
  rownames(t2) = c("Y","N")
  colnames(t1) = labBP
  colnames(t2) = labBP
When barplotting three-way tables, leave as the DARK STUFF the variable that is most "active", i.e. the one you would visualize dramatically as "filling up" the beaker. Let the adjacent beakers (bars) be the experimental sequence, and then finally duplicate the experiment with the variable that can be visualized as the category and control...

OPTIMIZATION IN R
  min() max() which()
If min is:       min(data$freq)
then argmin is:  which(data$freq==min(data$freq))
and the VALUE at which something is minimized, say in a Z vs. freq graph, is found with:
  data$freq[which(data$Z == min(data$Z))]
  max(data$Age, na.rm=TRUE)

GROUPED DATA
Grouped data means having the data in one vector and a parallel vector telling which factor (group) each individual belongs to, by virtue of the particular data value and a membership criterion/selection function.

STATISTICAL SUMMARIES
  summary(dataframe)
  summary(matrix)
  summary(vector)

WORKING WITH SETS
-----------------
  union() intersect() setdiff() setequal()
  %in%  help("%in%")       : a boolean indicating set membership
  is.element(element,set)  : same
  !is.element(element,set)
  rows=which(!(1:(nrow(thisframe)) %in% msTot))  # those NOT in the set
  e.g. 1929 %in% rows_great
  data[rows,]  # show the table with only those rows

screeplot() text() mtext()
Rterm.exe accepts -f filename and -e expression! (v2.5+)
plotmath (v2.6+)
Rwui() provides an R web interface, so that R scripts can run in a website environment.
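The pctable() helper used above is from my own library and isn't reproduced in these notes; a minimal sketch of what it might look like, assuming it forms column percentages (base R's prop.table() does the work):

  pctable <- function(t) round(100 * prop.table(t, margin = 2), 1)  # percents by column sums
  # for row percents, build the table with the arguments swapped, as above,
  # or use prop.table(t, margin = 1)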
Cairo package for better graphics rendering, and bmp, png, tiff, etc. outputs
gcl package for fuzzy rules classification
gWidgets package is probably the way to go for simple, effective user interfaces in R scripts
hints package helps you remember the functions in a package
update.packages()
FKBL package supports fuzzy inference
sets package for fuzzy sets and generalized sets

======================================================
PLOT BEAUTIFICATION / PRETTY PRETTIFICATION

font sizes: cex
  cex.axis (tick marks and their annotations)
  cex.lab  (x axis and y axis labels)
  cex.main (title)
  cex.sub  (subtitle)

PLOTTING SYMBOLS
pchShow <- function(extras = c("*",".", "o","O","0","+","-","|","%","#"),
                    cex = 3,  ## good for both .Device=="postscript" and "x11"
                    col = "red3", bg = "gold", coltext = "brown", cextext = 1.2,
                    main = paste("plot symbols : points (... pch = *, cex =", cex,")")) {
  nex <- length(extras)
  np  <- 26 + nex
  ipch <- 0:(np-1)
  k <- floor(sqrt(np))
  dd <- c(-1,1)/2
  rx <- dd + range(ix <- ipch %/% k)
  ry <- dd + range(iy <- 3 + (k-1) - ipch %% k)
  pch <- as.list(ipch)  # list with integers & strings
  if(nex > 0) pch[26 + 1:nex] <- as.list(extras)
  plot(rx, ry, type="n", axes = FALSE, xlab = "", ylab = "", main = main)
  abline(v = ix, h = iy, col = "lightgray", lty = "dotted")
  for(i in 1:np) {
    pc <- pch[[i]]
    ## 'col' symbols with a 'bg'-colored interior (where available):
    points(ix[i], iy[i], pch = pc, col = col, bg = bg, cex = cex)
    if(cextext > 0) text(ix[i] - 0.3, iy[i], pc, col = coltext, cex = cextext)
  }
}
pchShow()
pchShow(c("o","O","0"), cex = 2.5)
pchShow(NULL, cex = 4, cextext = 0, main = NULL)

colors: col col.axis col.lab col.main col.sub
label rotation: crt; las - 0 parallel, 2 perpendicular

MARGINS NICE AND TIGHT:
  def=par(cex.axis=0.75, mar=c(4.3,3.2,2,0)+0.1, mgp=c(2.2,0.8,0))
  ... plot stuff
  par(def)

DRAWING IT / EXPORTING IT:
It is convenient to use my plotting library, which sets the sizes and outputs to a PNG:
  source("c:\\totalcmd\\R\\lib\\AKE\\akelib.r")  # obtain the drawing functions drawOn(), drawOff()
  drawOn(500,500)
  ... plot stuff
  drawOff(500,500,"fft_joint_relmax")  # produces a PNG
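drawOn()/drawOff() come from my akelib.r and aren't listed here; a guessed-at minimal equivalent, using the dev.print(png, ...) export pattern that appears later in these notes (the 72 dpi conversion is an assumption):

  drawOn <- function(w, h)          # open an on-screen device sized w x h pixels
    windows(width = w/72, height = h/72)
  drawOff <- function(w, h, name)   # copy the current plot out to <name>.png
    dev.print(png, file = paste(name, ".png", sep=""), width = w, height = h)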
SETTING UP A CUSTOM GRID
  plot(..., axes=FALSE, frame.plot=TRUE)  # call plot with no axes, but with a frame -- IF YOU WANT THE DEFAULT AXIS LABELS, make axes=TRUE
  # FOR INFO PURPOSES
  axTicks(1)  # shows you what the x axis is
  axTicks(2)  # shows you what the y axis is
  # SETTING UP GRID
  par(xaxp=c(0,250000,5))  # sets up x axis with 5 ticks over the range 0 to 250000
  par(yaxp=c(0,200,8))     # sets up y axis with 8 ticks over the range 0 to 200
  axis(1)  # plots first axis  -- IF YOU WANT THE DEFAULT AXES, COMMENT OUT THESE TWO LINES
  axis(2)  # plots second axis
  grid()   # then puts on the grid; default is lty="dotted", solid is lty=1

SELECTING COLORS
  colors()  # shows the available list

TYPES OF PLOTS
Make a stem plot from numeric data:
  stem(data$numericcol)
  stem(as.numeric(data$shouldbebutisntnumeric))  # type conversion to the right data type: as.numeric
Make a plot of data:
  plot(data$col)
  plot(table(data$col))  # makes a plot that shows outliers more clearly
Adding plots:
  plot(x)
  curve(sin(x), add=TRUE)
Make a boxplot of numerical data:
  boxplot(jk06$Age)
Make side-by-side boxplots of grouped numerical data:
  boxplot(age.nt, age.se, age.sw, age.ww, names=c("NT", "SE", "SW", "West"), ylab="Age", main="Age by Region")
Make a histogram of numerical data:
  hist(jk06$Age)
You can specify the breaks with breaks= ; you can specify relative frequency (percentages) with freq=FALSE.
  table(data$Region, data$Gender)  # a two-way table

PLOTS FOR TABLES
  barplot(table())  # a plot of a table; even a 2-way table is nice!!

HOW TO GET TEXT LABELS TO BE VERTICAL?
  par(las=2)  # vertical
  par(las=1)  # horizontal

HOW TO GET SUBTEXT
  mtext("Subtext")

HOW TO ADD AN AXIS
  states=c("DYING","SEVERE","MILD","HEALTHY")
  axis(2,at=0:3,labels=states,cex.axis=.6,las=1)

HOW TO SET UP A CUSTOM PLOT
  plot(..., axes=FALSE, frame.plot=TRUE)  # call plot with no axes, but with a frame
  axTicks(1)  # shows you what the x axis is
  axTicks(2)  # shows you what the y axis is
  par(xaxp=c(0,250000,5))  # sets up x axis with 5 ticks over the range 0 to 250000
  par(yaxp=c(0,200,8))     # sets up y axis with 8 ticks over the range 0 to 200
  axis(1)  # plots first axis
  axis(2)  # plots second axis
  grid()   # then puts on the grid

# Plot Parameters
  fmax = 250000
  fstep = 50000
  ymax = 200
  ystep = 25
# Unsmoothed Plot
  drawOn(500,500)
  # call plot with no axes, but with a frame
  plot(rec$freq, rec$fft_db, col="red", type="l", ylim=c(0,200),
       xlab = "Frequency (Hz)", ylab = "Spectral Coefficients (dB)",
       main = "Spectral Decomposition\n(BioSonics 204kHz Transducer)",
       axes=FALSE, frame.plot=TRUE)
  lines(tap$freq, tap$fft_db, col="blue", type="l", ylim=c(0,200))
  par(xaxp=c(0,fmax,fmax/fstep))  # sets up x axis ticks
  par(yaxp=c(0,ymax,ymax/ystep))  # sets up y axis ticks
  axis(1)  # plots first axis
  axis(2)  # plots second axis
  grid(col="blue")  # lays on the grid
  drawOff(500,500,"fft_joint_relmax")

HOW TO ADD A GRID
  grid(col="blue")
  drawOn(500,500)
  plot(rec$freq, filter(rec$fft_db,ma5), col="red", type="l", ylim=c(0,200),
       xlab = "Frequency (Hz)", ylab = "Spectral Coefficients (dB)",
       main = "Spectral Decomposition\n(BioSonics 204kHz Transducer)", axes=TRUE)
  lines(tap$freq, filter(tap$fft_db,ma5), col="blue", type="l", ylim=c(0,200))
  # horizontal lines
  abline(h=25)
  abline(h=75)
  abline(h=125)
  abline(h=175)
  grid(col="blue")
  # handmade darker grid
  for(ll in seq(-45,45,5)) { abline(v=ll,col="gray25") }

PLOTTING COUNT DATA
You can use the following:
(a) table: table(mw$JK)
    This will show the levels and the counts for each level (the levels are discrete categories).
(b) frequency histogram: plot(mw$JK)
    This shows a frequency histogram for each level (the levels are discrete categories).

BOXPLOTS
Boxplots provide a visual presentation of the 5-number summary of the data: min, max, median, and the quartile (25th and 75th percentile) values. So the middle 50% of the values lie within the hinges while the extreme 50% lie outside the hinges.
  boxplot(mw$Age)
  boxplot(mw$Age, sw$Age)  # side-by-side boxplots
  stripchart(mw$Age)       # good for small data sets

SCATTER PLOTS AND OVERPLOTS
  plot(mw$age, mw$BMI, pch='+')  # scatterplot
  points(sw$age, sw$BMI, pch=2)  # adding the points to the same plot, with triangle as the symbol

LINE PLOT
  lines(...)

POINTS & LINES
  type="p", "l", "b", "o"

COLORS
  colors()  # returns all color types
  blue red green orange brown black gray blue1 blue2 blue3 ...

MULTIPLE PLOTS ON A PAGE
  windows()
  view <- matrix(1:2, nrow=2, byrow=T)
  layout(view)
OR
  par(mfrow=c(3,2))  # multifigure setup: 3 rows, 2 cols

ADDING A LEGEND
  windows()
  plot(data$freq, data$Z, ylim=c(0,1500), main="Finding the Optimal Frequency",
       xlab = "Frequency (kHZ)", ylab="Z (ohms), Relative Phase Angle, Im(Z) (ohms)")
  points(data$freq, -1000*data$phrad, pch=2)
  points(data$freq, -1*data$Z*sin(data$phrad), pch=3)
  legend(locator(1), legend = c("Z (ohms)", "Relative Phase Angle (unitless)", "Im(Z) (ohms)"),
         col = c(1,1,1), text.col = "black", lty = c(2, -1, 1), pch = c(1, 2, 3),
         merge = TRUE, bg = 'white')  # locator(1): click on the plot to place the legend

MATH TEXT
  ytext = expression(V^2)
  ytext = expression(paste(plain(sin) * phi, " and ", plain(cos) * phi))
  ytext = expression(paste("Power ( ", V^2, ") (dB)"))
  plot(..., ylab=ytext)

3-D surface plot
  persp(seq(0,1,length.out=238), seq(0,1,length.out=3), d_scale)
  ## Default S3 method:
  persp(x = seq(0, 1, length.out = nrow(z)), y = seq(0, 1, length.out = ncol(z)), z,
        xlim = range(x), ylim = range(y), zlim = range(z, na.rm = TRUE),
        xlab = NULL, ylab = NULL, zlab = NULL, main = NULL, sub = NULL,
        theta = 0, phi = 15, r = sqrt(3), d = 1, scale = TRUE, expand = 1,
        col = "white", border = NULL, ltheta = -135, lphi = 0, shade = NA,
        box = TRUE, axes = TRUE, nticks = 5, ticktype = "simple", ...)
  persp(seq(0,5,length.out=238), seq(0,650,length.out=3), outer(d_scale[,1], d_scale[,2], d_scale[,3]))
  # (as in the original note; NB outer() expects a function as its third argument)

3-D scatter plot
Need to: Packages > Install > search for scatterplot3d
  library(scatterplot3d)
  scatterplot3d(x,y,z)
  # 3D Scatterplot with Coloring and Vertical Lines and Regression Plane
  library(scatterplot3d)
  attach(mtcars)
  s3d <- scatterplot3d(wt, disp, mpg, pch=16, highlight.3d=TRUE, type="h", main="3D Scatterplot")
  fit <- lm(mpg ~ wt+disp)
  s3d$plane3d(fit)
Interactive 3-D scatter plot
  library(rgl)
  plot3d(wt, disp, mpg, col="red", size=3)

===================================
12/27/2007 5:00AM AKE
Numerical Statistics

Arithmetic: scalar and vector; / is elementwise for arrays
  round(scalar, digits=2)  # digits is decimal places
  table(data$Region)       # gives you a counts array
You can create an array of totals:
  tot = c( . , . , .)
  samp = table(data$Region)
The sampling proportions are now:
  round(samp/tot, digits = 2)
  sum(vector)  - returns the sum of all the elements

Trig functions are in RADIANS! So use:
  phi_rad <- phi_deg * pi / 180
  sin( )
  pi
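A worked version of the counts-to-proportions step above, using the regional population totals that appear later in these notes (this assumes data$Region has exactly the seven levels FL, MW, NE, NT, SE, SW, W in R's alphabetical order):

  tot  <- c(3370, 4740, 4140, 7840, 11240, 16890, 5690)  # population totals
  samp <- table(data$Region)                             # sample counts, same order
  round(100 * samp / tot, digits = 2)                    # percent of each region sampled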
For continuous variables, test for non-normality:
- remove NaNs, e.g. jk06.Age <- jk06$Age[!is.na(jk06$Age)]
- form histogram, e.g. hist(jk06.Age)
- view statistics, e.g.
    quantile(jk06.Age)
    mean(jk06.Age)
    sd(jk06.Age)
    mad(jk06.Age)  -- the most robust alternative to the SD

Distributions
  require(stats)
  - gaussian
  - binomial
  - Poisson distribution (stats package):
      dpois - poisson density
      ppois - poisson cumulative distribution function
      qpois - poisson quantile function
      rpois - poisson random draw function
  - glm (generalized linear model)

Draws:
  - d,p,q,r : density, cumulative distribution (probability), quantile, random deviates
  - binom(count, num_trials, success_prob)
      dbinom - mass density function
      pbinom
  - uniform: dunif punif qunif runif
  - norm
  - exp
  - gamma distribution (stats package): help(GammaDist)
      dgamma() pgamma() qgamma() rgamma()
  - pois
  - weibull, cauchy, beta, t, f, chisq, binom, geom, hyper, logis, lnorm, nbinom, unif, wilcox

To draw 1 number randomly from the following distributions:
  rnorm(1)
  runif(1)
  rbinom(1,1,.5)   # a coin flip
  rbinom(1,10,.5)  # number of heads after 10 flips
  rpois(1,.5)      # lambda = .5
Note - for help on these, use the capitalized distribution name, e.g.
  help(Normal) help(Uniform) help(Binomial) help(Poisson)

  normAge <- rnorm(1000, mean=mean(jk06.Age), sd=sd(jk06.Age))
You get a smooth normal curve when n = 1,000,000 and the classes are of width 1, stretching from -50 to 150:
  hist(sort(rnorm(1000000,mean=43,sd=12)), breaks=seq(-50,150,1))

You can compute the probability below a given value as follows:
  pchisq(2.7, df=1)
This returns the probability mass below that amount, i.e. the cumulative distribution.

Test for normality / non-normality, example. For continuous variables, test for non-normality:
  sample <- rnorm(sz,mean=43,sd=12)
  sample.hist <- hist(sample)
  br <- sample.hist$breaks
  sC <- sample.hist$counts
  br.z <- (br - 43)/12
  pnorm(br.z)  # - this gives the cumulative distribution below that amount
  est.cum <- round(1000*pnorm(br.z))
  br.n <- length(br)
  estC <- est.cum[2:br.n] - est.cum[1:br.n-1]
  csq <- sum((sC - estC)^2/estC)
  ddf <- br.n - 3
  pchisq(csq,df=ddf)
  csq
  br.n
  ddf
===================================================
  sz <- 10000
  k <- 1  # no. of extreme values to skip
  brk <- seq(-10,110,5)
  sample <- rnorm(sz,mean=43,sd=12)
  sample.hist <- hist(sample, breaks=brk)
  br <- sample.hist$breaks
  sC <- sample.hist$counts
  br.z <- (br - 43)/12
  pnorm(br.z)  # - this gives the cumulative distribution below that amount
  est.cum <- round(1000*pnorm(br.z))
  br.n <- length(br)
  n <- br.n-(2*k)+1
  estC <- est.cum[(2+k):(br.n-k)] - est.cum[(1+k):(br.n-k-1)]
  csq <- sum((sC[(1+k):(length(sC)-k)] - estC)^2/estC)
  ddf <- n - 3
  pchisq(csq,df=ddf)
  estC
  csq
  br.n
  sz
  k
  n
  ddf
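For a quick built-in check, base R also ships a formal normality test and a Q-Q plot, which complement the hand-rolled chi-squared test above:

  shapiro.test(jk06.Age)  # Shapiro-Wilk test; a small p-value is evidence of non-normality (n must be <= 5000)
  qqnorm(jk06.Age)        # quantile-quantile plot against the normal
  qqline(jk06.Age)        # a straight-ish line suggests approximate normality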
=========================================================
Hypothesis Testing

Single Proportions Testing
--------------------------
  prop.test(numpos, numtot, popparam)
where
  numpos   is the number of positive outcomes
  numtot   is the total number of binary trials (positive, negative)
  popparam is the population proportion that you want to hypothesis-test against

Example: gender balance in the study population. F = 1442, out of 2814 individuals. But the known population proportion of females is .4836.
  > prop.test(1442,2814,.4836)
Gives output:
  1-sample proportions test with continuity correction
  data: 1442 out of 2814, null probability 0.4836
  X-squared = 9.2557, df = 1, p-value = 0.002348
  alternative hypothesis: true p is not equal to 0.4836
  95 percent confidence interval: 0.4937878 0.5310536
  sample estimates: p 0.5124378

Interpretation:
  p = 0.5124378
This is the sample estimate (proportion) from the study data that you have, i.e. p = numpos/numtot.
  95 percent confidence interval: 0.4937878 0.5310536
This is the 95% confidence interval around your study data's sample estimate (proportion value), i.e. the interval within which the proportions of 95% of repeated random samplings of that size would come out.
  X-squared = 9.2557, df = 1, p-value = 0.002348
This is the chi-squared value and the likelihood that a random sampling under your sample estimate's parameters would yield the hypothesized population parameter value. The p-value is the likelihood of obtaining a MORE EXTREME value away from the sample estimate's proportion.
The Null Hypothesis is: the population parameter is popparam.
The Alternate Hypothesis is: the true population parameter for the population from which your sample was drawn is NOT popparam.

Exact Binomial Test
--------------------------
  binom.test(numpos, numtot, popparam)

Fitting Lines
-------------
If A,B are correlated data sets,
  plot(A,B)  shows the cloud
  z = line(A,B)  # fits a line robustly (Tukey's resistant line)
  abline(coef(z))
or
  abline(coef(line(A,B)))

--------------------------------------
Data Editing in R
##**********************************************
###############################
print("Filter #11: HasHistHD");
print("1) NO -> N");
print("2) YES -> Y");
print("3) D -> NA");
### corrections
ct <- 0
colToMod = data$HasHistHD
# PROGRAMMING IN R
# i = row index; j = col index
for (i in 1:length(data$ID)) {
  if (!is.na(colToMod[i]) & colToMod[i] == "NO") {
    colToMod[i] <- "N"
    ct <- ct + 1  # increment counter
  } else if (!is.na(colToMod[i]) & colToMod[i] == "YES") {
    colToMod[i] <- "Y"
    ct <- ct + 1  # increment counter
  } else if (!is.na(colToMod[i]) & colToMod[i] == "D") {
    colToMod[i] <- NA
    ct <- ct + 1  # increment counter
  }
}
colToMod <- factor(colToMod)  # re-assess levels after this modification
if (ct > 0) {
  print(" Number of records modified: "); ct
} else {
  print("No records modified.")
}
data$HasHistHD = colToMod

------------------------------------------
Working with Categorical Data in R
  factor(data$JK)
  levels(data$JK)
You can CHANGE the levels of a data set easily by simply re-assigning the level element, e.g.
  levels(data$JK)[2] = "BGHM"

Tricks and Things
-----------------
prompt(data frame), or prompt(any object), produces a document describing that object.
# Graphing
plot(anything)
dev.copy2eps()  # packages the graphic into an eps file
bitmap(file, type = "png256", height = 6, width = 6, res = 72, pointsize, ...)
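The record-by-record loop above can also be done in one vectorized pass by editing the factor levels directly (same column and codes as the filter above):

  f <- factor(data$HasHistHD)
  levels(f)[levels(f) == "NO"]  <- "N"  # NO -> N
  levels(f)[levels(f) == "YES"] <- "Y"  # YES -> Y
  f[f == "D"] <- NA                     # D -> NA
  f <- factor(f)                        # re-assess levels, as in the loop version
  data$HasHistHD <- f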
### Later define a table reversing function...
  t1 = table(ofacCH, ofacCL)
  t1r = rev(t1)
  dim(t1r) = c(3,5)
  rownames(t1r) = rev(labCH)
  colnames(t1r) = rev(labCL)
  t1rp = pctable(t1r)
  t1rptot = 100*round(t1r/sum(t1r),4)  # table with percentage of total population
  # making the names in prep for a barplot (using my array.r library)
  nc = concat(rep("LDL",5),":",colnames(t1r))
  nr = concat(rep("HDL",3),":",rownames(t1r))
  labc = rep(nc,rep(3,5))  # cycling through 3 repetitions each
  labr = rep(nr,5)         # cycling instances through 1 repetition each
  labt1r = concat(labc,"\n", labr)
  # printing an exploded barplot
  print("Figure 37a shows the Cholesterol Health assessment of the total population")
  drawOn(500,350)
  def=par(las=2, cex=.6, cex.axis=.8)
  barplot(t1rptot[1:15], ylim=c(0,30), main="Cholesterol Health Assessment",
          ylab = "Percent of Total Study Population", names = labt1r)
  par(def)
  drawOff(500,350,"37a")

strsplit()      # allows using regular expressions in R
system()        # makes a system call
capabilities()  # tells you what your system is capable of
CRAN packages can be downloaded from within R: update.packages(), download.file()
= is now allowed as an assignment operator
attach()  # for user-defined tables of variables
match() unique() duplicated()
The grid graphics package as a drawing tool.
split()  # splits a list by factors -- could be very useful
.C() allows for making C function calls!  .Call()  .External()
dotchart() stripchart() barplot() hist() legend() pie() polygon() rect() bxp() boxplot() contour()
sprintf()  # so we can now print variables:
  filename = "akhb2006.csv"
  sprintf("The filename is %s", filename)
princomp()
rpvm package: parallel virtual machine (a cluster of computers working as one)
widgets are available through library(tcltk)
# Having mathematical notation in plots! :)
  plot(0, main = expression(y == alpha*x[1] + beta*x[2]^2))
# VARIABLE SUBSTITUTIONS
  a <- 3.5
  x <- 1:2
  substitute(y == a + alpha*x[1] + beta*x[2]^2, list(a = a))
  plot(0, main = substitute(y == a + alpha*x[1] + beta*x[2]^2, list(a = a)))
rowsum(); colSums(), rowSums()
capabilities("PCRE")
grep(), sub(), gsub() and regexpr() have a new argument `perl' which, if TRUE, uses Perl-style regexps from PCRE (if installed). New capabilities option "PCRE" to say if PCRE is available.
stdin()
installed.packages()
capture.output()  # sends results from an output to a string. Wow!
example(topic)    # gives examples of usage!!!
dir()  # shows the current working directory contents
list.files()
GRASS - open source GIS
which.min(x) which.max(x)
R2HTML package

REGULAR EXPRESSIONS
help("regex")  -- in v1.8 R
For example, ^[[:space:]]* matches any amount of empty space at the start of a string, so
  sub("^[[:space:]]*", "", string)
will remove any leading whitespace characters from the strings (that is, replace it with nothing).
perl=TRUE allows the use of Perl-style regexes.

valid_name is a valid name in v1.9 and above
much faster startup in v1.9 and above!
dir.create()  # v1.9 and above
split.screen()
addmargins()  # adds row and column marginal summaries to tabular information
Graphviz package
quantile() rank()
grid graphics package
all() any()
# combinations
choose(n,k)
menu() message() plot.data.frame() recordGraphics() sunflowerplot()
write.csv() write.csv2()
R.matlab package
gsl wrapper
options()
TeachingDemos package
glpk: The GNU Linear Programming Kit (GLPK) version 4.8. This interface mirrors the GLPK C API. Almost all GLPK lpx routines are supported.
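concat() above comes from my array.r library and isn't listed here; judging from how it is called, a plain-base stand-in might be (an assumption about its behavior):

  concat <- function(a, sep, b) paste(a, b, sep = sep)  # elementwise, recycling the shorter vector
  concat(rep("LDL",3), ":", c("low","med","high"))
  # "LDL:low" "LDL:med" "LDL:high"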
neural network package
relax package -- functions for report writing, presentation and programming
outliers package -- for outlier detection
for() can use raw vectors as indices (like foreach in Perl) in v2.3+
Sys.time() date() timestamp()
options()  # there are many!
args(function)  # shows its arguments, e.g. args(options)
R commander package...
PearsonICA package
xlsReadWrite package -- natively read and write XLS data - in v2.4+
dev.print() -- in v2.5+
file("stdin")

Programming With R
------------------
R.h exists for C++
It is possible to use sockets!  make.socket  write.socket
nsl()  # looks up the IP address of a hostname (handy for testing for a connection)
Rcgi can provide R's output directly to the web!
Interfaces between R and other programming languages: the old ones are C and Fortran. The new ones are XML, Java, RSPerl, Python, CORBA. With v2.5, Objective-C is also supported! RSRuby (only for Unix): R & Ruby.
R Home: c:\totalcmd\R\rw261\..
libR shared library... c:\totalcmd\R\rw261\bin R.dll

Spectral Analysis
-----------------
-------------------------------------------------
In R, trig is in radians, so:
  x = seq(0,2*pi,.01)    # slice the circle into .01 steps
  plot(x,sin(x))         # plot one oscillation, x units are in radians
  plot(x,sin(2*pi*x))    # plot ~6 oscillations, x units are in s, f is 1 Hz
SO! The frequency (Hz) of a sine in radians, sin(k*t), is f = k/(2*pi): you are dividing the frequency coefficient by the natural period of sine (2*pi).
-------------------------------------------------
spectrum - ... of time series
  x = seq(1:1000)
  x = x/100
  y = sin(x)
  plot(x,y)
  A = 1
  k = 2
  c = 0
  h = 0
  y = A*sin(k*x + c) + h
  plot(x,y)
To see in radians:
  x = x / (2*pi)
  plot(x,y)
To see 15 radians' worth:
  x = seq(1:10000)
  x = x/100
  x = x/(2*pi)
  y = sin(x)
  plot(x,y,xlab="Radians")
To see the units in units of pi/2:
  plot(x/(pi/2), y)

Time Series
-----------
ts()      - create a time series from a vector
readts()  - read a time series from a two-column ascii file
plot.ts() - plots the points and connects them with a line; can plot multiple time series on the same plot, like boxplot...
stl()     - seasonal decomposition of a time series by loess

  k = c(.5,1,1,1,.5)  # k is the vector of weights
  (k = k/sum(k))
  [1] 0.125 0.250 0.250 0.250 0.125
  fjj = filter(jj, sides=2, k)  # ?filter for help [but you knew that already]
  plot(jj)
  lines(fjj, col="red")  # adds a line to the existing plot
  lines(lowess(jj), col="blue", lty="dashed")
Now a histogram and a Q-Q plot, one on top of the other:
  par(mfrow=c(2,1))       # set up the graphics
  hist(dljj, prob=T, 12)  # histogram
  lines(density(dljj))    # smooth it - ?density for details
  qqnorm(dljj)            # normal Q-Q plot
  qqline(dljj)            # add a line
Examples: vector jj
  jj = ts(jj, start=1960, frequency=4)   # creates a QUARTERLY earnings time series
  jj = ts(jj, start=1960, frequency=12)  # creates a MONTHLY earnings time series

Spectral Analysis
------------------
  x = arima.sim(list(order=c(2,0,0), ar=c(1,-.9)), n=2^8)  # some data
  (u = polyroot(c(1,-1,.9)))  # x is AR(2) w/ complex roots
  Arg(u[1])/(2*pi)            # dominant frequency around .16
  par(mfcol=c(3,1))
  plot.ts(x)
  spec.pgram(x, spans=c(3,3), log="no")  # nonparametric spectral estimate; also see spectrum()
Finally, note that R tapers and logs by default, so if you simply want the periodogram of a series, the command is spec.pgram(x, taper=0, fast=FALSE, detrend=FALSE, log="no"). If you just asked for spec.pgram(x), you wouldn't get the RAW periodogram, because the data are detrended, possibly padded, and tapered in spec.pgram, even though the title of the resulting graphic would say Raw Periodogram.
An easier way to get a raw periodogram is this: per=abs(fft(x))^2 .... duh. The moral of this story, and the bottom line: pay special attention to the defaults of the functions you're using. So, e.g.
  par(mfcol=c(4,1))  # 4 rows, 1 column
  x = seq(0,2*pi*10,.01)
  y1 = sin(x)
  plot(y1)
  spec.pgram(y1, taper=0, fast=FALSE, detrend=FALSE, log="no")
  y2 = sin(10*x)
  plot(y2)
  spec.pgram(y2, taper=0, fast=FALSE, detrend=FALSE, log="no")
Now the question is, how to get that frequency!
  sp = spec.pgram(y2, taper=0, fast=FALSE, detrend=FALSE, log="no")
The spectrogram is:
  plot(sp$freq,sp$spec)
  sp$freq[which(sp$spec == max(sp$spec))] * 2 * pi
Test:
  par(mfcol=c(2,1))
  x = seq(0,2*pi*10,.01)
  y3 = sin(5*x)  # frequency is 5
  plot(y3)
  spec.pgram(y3, taper=0, fast=FALSE, detrend=FALSE, log="no")
  sp = spec.pgram(y3, taper=0, fast=FALSE, detrend=FALSE, log="no")
  sp$freq[which(sp$spec == max(sp$spec))] * 2 * pi * 100

THIS IS IT -- SPECTRAL ANALYSIS IN R
------------------------------------
  par(mfcol=c(3,1))       # show 3 plots on the same chart
  x = seq(0,2*pi*10,.01)  # domain
  y1 = 3*sin(2*x)         # frequency 2, amplitude 3
  y2 = -1*sin(50*x)       # frequency 50, amplitude 1
  y3 = 2*sin(20*x)        # frequency 20, amplitude 2
  y = y1 + y2 + y3        # superposition of the three functions
  plot(x,y1,xlim=c(0,2),ylim=c(-6,6),main="Component functions",sub="y1 = 3*sin(2*x); y2 = -1*sin(50*x); y3=2*sin(20*x); y=y1+y2+y3 (red)")  # start the plots
  lines(x,y2)
  lines(x,y3)
  lines(x,y,col="red")
  plot(x,y,xlim=c(0,2),ylim=c(-6,6),col="red",main="Sum function",sub="y=y1+y2+y3")  # sum function by itself
  sp = spec.pgram(y, taper=0, fast=FALSE, detrend=FALSE, log="no")  # spectral decomposition
  sp$freq[which(sp$spec > 500)] * 2 * pi * 100  # tell me the frequencies, in order of strength (amplitude)

Min Finding
-----------
  w = seq(-50,50,1)
  fw = ....                 # (some function of w)
  i = which(fw == min(fw))  # this gives the index of the minimum
  w[i]                      # this gives the argument of the minimum, argmin

REVERSE a list, FLIP a list
  rev(c(1,2,3))

SMOOTHING ALGORITHMS
AKE 2:00 PM 10/5/2009
In R:
  smooth()  -- default is 3RS3R; help(smooth)
Tukey's smoothing: 3RS3R
  3 - moving medians of length 3
  R - repeat till convergence is reached
  S - splitting of horizontal stretches of length 2 or 3 ??
  filter(signal, filtercoeffs)  -- moving average
5-point moving average: coefficients uniform at 1; the normalization constant is the total sum of the coeffs:
  ma5 = c(1,1,1,1,1)/5
  ma5 = rep(1,5)/5
Savitzky-Golay smoothing function: 5-point quadratic or cubic smoothing function coeffs:
  sg5 = c(-3,12,17,12,-3)/35
  dLateSmooth19 = filter(d$LateDays, rep(1,19)/19)
Exponential Smoothing
  http://en.wikipedia.org/wiki/Exponential_smoothing
Exponentially Weighted Moving Average Filter (EWMA)
  http://tolstoy.newcastle.edu.au/R/e2/help/06/12/6919.html
Packages:
  expsmooth -- data sets from "Forecasting with exponential smoothing"
  MEWMA -- Multivariate Exponentially Weighted Moving Average (MEWMA) Control Chart
  a_j = alpha(1-alpha)^j

Time series classes
  ts -- built-in time series class
  zoo
  tseries -- package
  library(forecast)
  library(Mcomp)
  http://robjhyndman.com/researchtips/estimation2/

Non-Stationary Time Series
Simple test for stationarity -- compare summaries of the two halves:
  l1 = list[1:floor(length(list)/2)]
  l2 = list[ceiling(length(list)/2):length(list)]
  summary(l1)
  summary(l2)
Exploring differences:
  diff(list,lag,order)
  diffinv(dlist, xi)  # xi is the initial value
so:
  dl = diff(list,1,1)
  idl = diffinv(dl, xi=list[1])  # NB R indexes from 1, so the initial value is list[1]
  summary(list-idl)  # should be all zeros
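A concrete roundtrip with small numbers makes the diff/diffinv pairing clear (illustrative values):

  x  <- c(3, 5, 9, 11, 20)
  dx <- diff(x, lag = 1, differences = 1)  # 2 4 2 9
  xr <- diffinv(dx, xi = x[1])             # re-integrate, seeding with the initial value
  x - xr                                   # all zeros: the original series is recovered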
KEY!! NOTE!!! R STATS COMPUTATIONS DEFAULT TO TREATING LISTS AS SAMPLES, and so use N-1 for sd, var, and cov...
If you're doing population corrections, you'll want to correct with * (N-1)/N
  cov(x,y)
  cor(x,y)
  ccf - cross correlation

To compute autocorrelations:
  cor(list[1:367],list[2:368])
  cor(list[2:368],list[3:369])
But what IS the correlation? To see this, one must plot a scattergram of values separated by lag $k$:
  plot(list[1:367],list[2:368])
  plot(list[2:368],list[3:369])
The AUTOCORRELATION FUNCTION (beginning at lag 0):
  acf(list)
To see more of the ACF tail:
  acf(l,lag.max=90)   # lag.max approx N/4
  acf(il, lag.max=90)
To get the values OUT:
  acf_l  = acf(l,lag.max=90, plot=FALSE)   # lag.max approx N/4
  acf_il = acf(il, lag.max=90, plot=FALSE)
But the values are in a funky table format, where we want just a list:
  acf_l$acf  # gives the actual values
This gives a MATRIX:
  acf2AR(acf_l$acf)

# SHIFT OPERATORS: given list
backward shift (adding a zero at the front), i.e. this is the (1 + wB) z_t operation:
  Bdld = c(0,dld[1:length(dld)-1])  # BACKWARD shift on shocks, i.e. Ba_t
  1 2 3 4 5  ->  0 1 2 3 4
forward shift:
  Flist = c(list[2:length(list)],0)
  1 2 3 4 5  ->  2 3 4 5 0
pacf - partial autocorrelation

SPECTRAL ANALYSIS
  cpgram
  spec.pgram
  fft
fft returns the complex fourier coefficients, so in R use:
  abs(fft(list))
and
  plot(abs(fft(list)),type="l")

System stuff
------------
system()
memory.limit(2048)  -- the max
memory.size()       -- how much is currently in use
LaTeX-style math symbols in R plots: help(plotmath)

Symbols / Names in R:
  as.name  # makes into a name
  is.name  # checks if name
  call("round", 10.5)  # builds a call with an argument
  eval      # evaluates a call
  do.call()
  e = expression( single expression or list of expressions )
  eval()

PROGRAMMING IN R
  http://manuals.bioinformatics.ucr.edu/home/programming-in-r
GUI for programming in R: RStudio
  http://www.johndcook.com/R_language_for_programmers.html
  http://en.wikibooks.org/wiki/R_Programming
  http://zoonek2.free.fr/UNIX/48_R/02.html
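A tiny demonstration of the call/eval machinery above:

  cl <- call("round", 10.5)  # build the unevaluated call round(10.5)
  cl                         # prints: round(10.5)
  eval(cl)                   # 10 (R rounds .5 to the even digit)
  e <- expression(sin(pi/2), cos(0))
  sapply(e, eval)            # 1 1 -- evaluate each expression in the list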
# OUTPUT of a vector as a csv vector
write.csv(h1,"h1.csv",eol=",",row.names=FALSE)

# GRAPHICAL VIEW: 2 graphs side by side
wx = 800
hy = 600
windows(bg="white", width=wx, height=hy)
view <- matrix(1:2, nrow=1, byrow=T)
layout(view)
... plot statements ...
# output to file
dev.print(png, file="figCholTransError.png", width=wx, height=hy)

# Here we look at MAX, MIN, PLOT (to see patterns of possible typos, and spread) and BOXPLOT (to see spread)
max(data$Age, na.rm=TRUE)
min(data$Age, na.rm=TRUE)
plot(data$Age)
boxplot(data$Age)

# variable correlated by Gender
boxplot(data$BMI[data$Gender=="F"],data$BMI[data$Gender=="M"])

# Here we look at a variable, pivoting on Age and Gender separately, and then together.
# These are weights, so we know roughly what they could be.
max(data$WeightLbs[data$Age<18], na.rm=TRUE)
max(data$WeightLbs[data$Age>=18], na.rm=TRUE)
min(data$WeightLbs[data$Age<18], na.rm=TRUE)
min(data$WeightLbs[data$Age>=18], na.rm=TRUE)
max(data$WeightLbs[data$Gender=="F"], na.rm=TRUE)
max(data$WeightLbs[data$Gender=="M"], na.rm=TRUE)
min(data$WeightLbs[data$Gender=="F"], na.rm=TRUE)
min(data$WeightLbs[data$Gender=="M"], na.rm=TRUE)
plot(data$WeightLbs)
boxplot(data$WeightLbs)

# Now, weight is correlated with AGE and with GENDER, and a better measure is BMI
boxplot(data$WeightLbs[data$Age<18],data$WeightLbs[data$Age>=18])
boxplot(data$WeightLbs[data$Gender=="F"],data$WeightLbs[data$Gender=="M"])

# Annotating and Exporting a Plot
plot(data$HeightTotalIn[data$Gender=="F" & data$Age>18], main="Adult Female Height", ylab="Height (Inches)")
boxplot(data$HeightTotalIn[data$Gender=="F" & data$Age>18], main="Adult Female Height", ylab="Height (Inches)")
dev.print(png, file="figHeightF.png", width=wx, height=hy)

# Nice annotation for export
plot(data$CholLDL, main="LDL Cholesterol Level", xlab="Record ID #", ylab="LDL Cholesterol Level", ylim=c(0,300))
plot(data$CholHDL, main="HDL Cholesterol Level", xlab="Record ID #", ylab="HDL Cholesterol Level")
dev.print(png, file="figCholTransError.png", width=wx, height=hy)
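A small helper in this spirit (a hypothetical convenience, not one of the library functions used elsewhere in these notes) bundles the range check and NA count for any numeric column:

  qc.col <- function(x) c(min = min(x, na.rm = TRUE),
                          max = max(x, na.rm = TRUE),
                          NAs = sum(is.na(x)))
  qc.col(data$Age)
  qc.col(data$WeightLbs)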
# 6 correlation plots
## TWO-WAY PLOTS for correlation analysis...
# General population
wx = 800
hy = 600
windows(bg="white", width=wx, height=hy)
view <- matrix(1:6, nrow=2, byrow=T)
layout(view)
plot(data$CholLDL, data$CholTot)
plot(data$CholHDL, data$CholTot)
plot(data$Triglyc, data$CholTot)
plot(data$CholHDL, data$CholLDL)
plot(data$Triglyc, data$CholLDL)
plot(data$Triglyc, data$CholHDL)
dev.print(png, file="CholCorrelation1.png", width=wx, height=hy)

# Age specific correlation plots
# Age specific: 16 - 20
plot(data$Age,data$CholTot,xlim=c(0,30),ylim=c(80,250))
plot(data$Age,data$CholLDL,xlim=c(0,30),ylim=c(40,180))
plot(data$Age,data$CholHDL,xlim=c(0,30),ylim=c(0,150))
plot(data$CholHDL[data$Age<=20 & data$Age>=16],data$CholTot[data$Age<=20 & data$Age>=16],xlim=c(0,150),ylim=c(80,250))
plot(data$CholLDL[data$Age<=20 & data$Age>=16],data$CholTot[data$Age<=20 & data$Age>=16],xlim=c(40,180),ylim=c(80,250))
plot(data$CholHDL[data$Age<=20 & data$Age>=16],data$CholLDL[data$Age<=20 & data$Age>=16],xlim=c(0,150),ylim=c(40,180))
dev.print(png, file="CholCorrelation2.png", width=wx, height=hy)

# Age specific: 60 - 70
plot(data$Age,data$CholTot,xlim=c(50,80),ylim=c(80,300))
plot(data$Age,data$CholLDL,xlim=c(50,80),ylim=c(30,220))
plot(data$Age,data$CholHDL,xlim=c(50,80),ylim=c(20,130))
plot(data$CholHDL[data$Age<=70 & data$Age>=60],data$CholTot[data$Age<=70 & data$Age>=60],xlim=c(20,130),ylim=c(80,300))
plot(data$CholLDL[data$Age<=70 & data$Age>=60],data$CholTot[data$Age<=70 & data$Age>=60],xlim=c(30,220),ylim=c(80,300))
plot(data$CholHDL[data$Age<=70 & data$Age>=60],data$CholLDL[data$Age<=70 & data$Age>=60],xlim=c(20,130),ylim=c(30,220))
dev.print(png, file="CholCorrelation3.png", width=wx, height=hy)

# Looking for a correlation -- a faint correlation -- how faint?
plot(data$BMI, data$Triglyc)

# Comparing two populations
# did NOT fast and yet had testing done
rows = which(data$HasFasted=="N" & !is.na(data$Triglyc))
foodies = data[rows,cols]
length(rows)
# DID fast and had testing done
s = which(data$HasFasted=="Y" & !is.na(data$Triglyc))
fasties = data[s,cols]
length(s)
wx=350
hy=600
windows(bg="white", width=wx, height=hy)
boxplot(foodies$Triglyc, fasties$Triglyc, names=c("Foodies", "Fasties"), ylab="Triglycerides Level", main="Fasting & Triglyceride Levels")
dev.print(png, file="figTriglycFasting.png", width=wx, height=hy)

!! Hunh! It could also be that some people who SAID they fasted actually LIED and didn't fast... and hence that's why they have such high triglycerides?
!! OR could it be that the fasting triggered the body to release fat reserves? And that for those who ate, the body just used up the fats??
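To put a formal test behind the boxplot comparison (a sketch; the notes above stop at the visual comparison):

  t.test(foodies$Triglyc, fasties$Triglyc)       # Welch two-sample t-test of the group means
  wilcox.test(foodies$Triglyc, fasties$Triglyc)  # rank-based alternative, robust to outliers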
wx=600
hy=600
windows(bg="white", width=wx, height=hy)
view <- matrix(1:2, nrow=1, byrow=T)
layout(view)
plot(foodies$Triglyc,foodies$CholTot, xlim=c(0,275), ylim=c(0,300))
plot(fasties$Triglyc,fasties$CholTot, xlim=c(0,275), ylim=c(0,300))
dev.print(png, file="figTriglyFastingCholTot.png", width=wx, height=hy)

wx=600
hy=600
windows(bg="white", width=wx, height=hy)
view <- matrix(1:2, nrow=1, byrow=T)
layout(view)
plot(foodies$Triglyc,foodies$CholLDL, xlim=c(0,275), ylim=c(0,200))
plot(fasties$Triglyc,fasties$CholLDL, xlim=c(0,275), ylim=c(0,200))
dev.print(png, file="figTriglyFastingCholLDL.png", width=wx, height=hy)

wx=600
hy=600
windows(bg="white", width=wx, height=hy)
view <- matrix(1:2, nrow=1, byrow=T)
layout(view)
plot(foodies$Triglyc,foodies$CholHDL, xlim=c(0,275), ylim=c(0,100))
plot(fasties$Triglyc,fasties$CholHDL, xlim=c(0,275), ylim=c(0,100))
dev.print(png, file="figTriglyFastingCholHDL.png", width=wx, height=hy)

# Doing a comparative study of populations and drawing conclusions -- Triglycerides, dieting and females in their 30s... (after one or two kids)
rows = which(data$Triglyc<50)
dieters = data[rows,]
plot(dieters$Gender)  # mostly Female!
boxplot(dieters$Age)  # mostly young! (median 31yrs)
dieters.f = dieters[which(dieters$Gender=="F"),]  # females
dieters.m = dieters[which(dieters$Gender=="M"),]  # males
summary(dieters.f$Age)  # 26 - 36 years are 75% of these women
boxplot(dieters.f$Age)
dieters.f.young = dieters.f[which(dieters.f$Age<=36 & dieters.f$Age>=26),]
length(rownames(dieters.f.young))  # 29
----
s = which(data$Triglyc>=50)  # get the problem rows
nodieters = data[s,]
plot(nodieters$Gender)  # balanced
boxplot(nodieters$Age)  # older (median 41yrs)
nodieters.f = nodieters[which(nodieters$Gender=="F"),]  # females
nodieters.m = nodieters[which(nodieters$Gender=="M"),]  # males
length(rownames(nodieters.f))  # 1393 women
length(rownames(nodieters.m))  # 1357 men
summary(nodieters.f$Age)  # 34 - 51 years are 75% of these women
boxplot(nodieters.f$Age)
nodieters.f.young = nodieters.f[which(nodieters.f$Age<=36 & nodieters.f$Age>=26),]
length(rownames(nodieters.f.young))  # 332
----
# COMPARISONS
wx=1050
hy=600
windows(bg="white", width=wx, height=hy)
view <- matrix(1:3, nrow=1, byrow=T)
layout(view)
boxplot(dieters.f.young$Triglyc, nodieters.f.young$Triglyc, ylim=c(30,230), main="Separating Young Female Pop'n by Triglyc Level", names=c("Low (<50)", "Regular (>=50)"), ylab="Triglyceride Levels")
boxplot(dieters.f.young$BMI, nodieters.f.young$BMI, ylab="Body-Mass-Index (BMI)", names=c("Dieters (Low Triglyc)", "Non-Dieters (Ave Triglyc)"), main="BMI of Young Females with Low Triglyc")
boxplot(dieters.f.young$Age, nodieters.f.young$Age, main="Females Between 26 and 36 years", names=c("Dieters (Low Triglyc)", "Non-Dieters (Ave Triglyc)"), ylab = "Age")
dev.print(png, file="figTriglycAbnormLowYoungWomen.png", width=wx, height=hy)
#!!!! WOW -- look -- these abnormally low numbers are mostly from young females!!!

--------------
IDENTIFYING OUTLIERS

BACKGROUND:
boxplot(numeric vector) shows you the median, the 1st and 3rd quartiles, and the largest and smallest instances that fall WITHIN 1.5 box-widths (IQR - interquartile range) beyond the 1st and 3rd quartiles.
summary(numeric vector) gives you the numerical backing, e.g.:
     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
      4.0   110.0   120.0   123.3   134.0   227.0   244.0
So:
  IQR = 134.0 - 110.0 = 24.0
  1.5 * IQR = 36
Therefore:
  TOP WHISKER    <= 134 + 36 = 170
  BOTTOM WHISKER >= 110 - 36 = 74

QUESTION: how do we handle outliers?
We take the sample median and IQR as estimators of the population from which the sample was drawn. Then we use the non-parametric measures of extremeness:
  factor 1.5 corresponds to 2 standard deviations, or 95% of the population
  factor 2.5 corresponds to 3 standard deviations, or 99.7% of the population
  factor 3.5 corresponds to 4 standard deviations, or 99.99% of the population
SEE MY STATS ARTICLE: e:\Repository\Mathematics\Statistics\_statisticalmethods.txt

OF COURSE, YOU ALWAYS HAVE THE PROBLEM OF WHETHER THE GENDER CONFOUNDING, OR THE AGE CONFOUNDING, ETC., MEANS THAT THE ASSUMPTION MAY BE VIOLATED THAT mu-hat and s-hat (the sample parameters) are good estimators of the population parameters. In that case their use to determine outliers is also flawed.

EXPLORATORY TECHNIQUE -- PIVOTING AND TREND ANALYSIS
This is how to slice by age and gender and see the trends:
  rowtest = which(data$Age > 60 & data$Gender=="M")
  boxplot(data[rowtest,cols]$CholTot, data[rowtest,cols]$CholLDL, data[rowtest,cols]$CholHDL, data[rowtest,cols]$Triglyc)
  summary(data[rowtest,cols]$CholHDL)

ADDING LEVELS THAT AREN'T THERE
  reglev=c("FL","MW","NE","NT","SE","SW","W")
  fr = factor(data$Region,levels=reglev)
  table(fr)

# USING SUBTABLES
# FL, MW, NE, NT, SE, SW, W (alphabetical order, like in R)
pop = c(3370,4740,4140,7840,11240,16890,5690)
samp = table(data$Region)
# subtables by region
ne = subset(data, data$Region=="NE")
se = subset(data, data$Region=="SE")
fl = subset(data, data$Region=="FL")
mw = subset(data, data$Region=="MW")
nt = subset(data, data$Region=="NT")
sw = subset(data, data$Region=="SW")
w  = subset(data, data$Region=="W")
# summary(data)
# factor(data$JK)
# plot(data$Region, main="Counts by Region")

# subtable by risk factors
risk.hd = subset(data, data$HasHistHD=="Y")
risk.hd.compl = subset(data, data$HasHistHD=="N")
# summary(risk.hd)
# summary(risk.hd.compl)

# disentangling by gender and age
risk.hd.m = subset(risk.hd, risk.hd$Gender=="M")
risk.hd.f = subset(risk.hd, risk.hd$Gender=="F")
risk.hd.compl.m = subset(risk.hd.compl, risk.hd.compl$Gender=="M")
risk.hd.compl.f = subset(risk.hd.compl, risk.hd.compl$Gender=="F")
# summary( )

# disentangling by gender, generally
m = subset(data, data$Gender=="M")
f = subset(data, data$Gender=="F")
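The whisker arithmetic from the IDENTIFYING OUTLIERS section above, done programmatically (x stands for any numeric column):

  qs    <- quantile(x, c(.25, .75), na.rm = TRUE)
  iqr   <- qs[2] - qs[1]
  fence <- 1.5                   # use 2.5 or 3.5 for the stricter cutoffs above
  upper <- qs[2] + fence * iqr   # top whisker limit
  lower <- qs[1] - fence * iqr   # bottom whisker limit
  which(x > upper | x < lower)   # indices of candidate outliers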
Good Statistical Philosophies:
+ Always explore your data at least two ways: numerically and graphically (Tukey)

Why do it in R?
+ Because it has the BEST of Forth's factoring approach and of the functional approach (implemented as the LISPy functional programming paradigm of lapply and sapply, etc., though these take a little while to get your head around), AND because it has all the built-in probability and visualisation capabilities.
So there are 4 killer reasons for using R.
+ Because probability engines are the heart of simulation -- and R has all 4 of the key functions: pmf/pdf, cdf, qf, and rng (probability mass/density function, cumulative distribution function, quantile function, random number generator) for a huge number of distributions
- but the problem is that R is a COMPLICATED object oriented language and so you'll spend lots of time reading documentation and googling in order to do almost anything --- until you get to a reasonable level of proficiency, which can take a while
This says it best:
"R can be a very unforgiving language. Obtuse error messages and a variety of differing yet similar data structures leads to a frustrating learning curve. Furthermore R is a functional programming language which, if you are used to imperative programming (e.g. Java/Perl/Python/Ruby), makes R seem like a crufty Perl for statistics." - Michael Barton
This led to frustration at the poor object support and the seeming requirement to create large numbers of loops and temporary variables.
[You'll] appreciate the functional heart of R and how much it can simplify your code. R is almost a lisp dialect since it draws much inspiration from Scheme.
Functional programming can be better than imperative approaches in terms of concise, clear code, and for simplifying parallelisation.
http://www.bioinformaticszen.com/post/simple-functional-programming-in-r/
This is eye-opening. My initial use of R for statistical analysis was brute force --- lots of copy-paste; it was not software-engineered by any stretch. It was a tool to get a job done. I should have used a lot more functions, and functional programming would have reduced duplication.

+ R has first-class functions, i.e. functions that are fundamental objects in themselves. You can pass them as arguments, put them into lists, etc.
+ R supports a Forth-like working style: "[to get complex tasks done, you] build a family of composite tools starting from very simple primitives... this is a recurring theme: start with small, easy-to-understand building blocks, combine them into more complex structures, and apply them with confidence."
http://adv-r.had.co.nz/Functional-programming.html

Best conceptual references that you'll need to do advanced R programming & the sequence in which you should read them:
R Objects: http://ww2.coastal.edu/kingw/statistics/R-tutorials/objects.html
R programming for those coming from other languages: http://www.johndcook.com/R_language_for_programmers.html
One Page R: A Survival Guide to Data Science with R: http://www.datasciencecentral.com/profiles/blogs/one-page-r-a-survival-guide-to-data-science-with-r
Functional Programming: Hadley Wickham: http://adv-r.had.co.nz/Functional-programming.html
Functional Programming: Michael Barton http://www.bioinformaticszen.com/post/simple-functional-programming-in-r/
Data Structures: Hadley Wickham http://adv-r.had.co.nz/Data-structures.html
Subsetting & Assignment: Hadley Wickham http://adv-r.had.co.nz/Subsetting.html
Grouping functions (lapply, sapply, apply, tapply, aggregate, by): Joran http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega
Data Frames --- and differences to databases: Chris http://www.r-bloggers.com/select-operations-on-r-data-frames/
Accessing Lists -- there are three operators: [], [[]], $
http://stackoverflow.com/questions/1169456/in-r-what-is-the-difference-between-the-and-notations-for-accessing-the

Tips for coding in R
+ introspection: str(obj)
+ Everything is a LIST. Not a matrix, but a list.
+ To make a table from a list, calculate it as a list and then reshape it as you wish
+ A sequential or iterative programmer will think about LOOPING through something. A functional programmer will think MAP or REDUCE
+ lapply() allows creating a composite operation: combine the worker function you want with lapply(), which can then, in a single pass, do something to each column in a data frame.
+ if something doesn't work, think: do I need to coerce it into a type, or perhaps even **uncoerce** it, e.g. unlist(list of lists)?
+ once you've got a good script, rm.all() (clear the workspace), re-run the script, and then save your workspace. Just loading your workspace next time will restore everything and you can easily demo, etc.

Why consider Lisp or Scheme? For experience in FP, functional programming.
What other languages could give you this?
FP languages: Haskell, OCaml, F#,
Multi-paradigm but FP-friendly languages like Lisp, Scheme, Clojure, Scala

You learn about the key concepts of FP, or functional programming:
+ First-class functions --- functions that can be treated like any other data type --- made into lists, passed as arguments, etc. (Forth can do this using execution tokens)
+ Anonymous functions --- functions without names (Forth has anonymity via the stack --- anonymous variables)
+ Closures --- functions written by other functions. When we have multiple functions that all follow the same basic template, we can create closures: functions that return functions. Closures allow us to make functions based on a template.
+ Lists of functions --- storing functions in a list
+ Functionals --- functions that take functions as arguments and return vectors as output
+ Function operators --- functions that use functions as inputs and return them as outputs
Recently FP has experienced a surge in interest because it provides a complementary set of techniques to object-oriented programming, the dominant style of the last several decades.
+ Composite functions: the key idea here is composition. We take two simple functions, one which does something to each column and one which fixes missing values, and combine them to fix missing values in every column. Writing simple functions that can be understood in isolation and then composed together to solve complex problems is an important technique for effective FP. (Hadley Wickham, http://adv-r.had.co.nz/Functional-programming.html)

apply(M,FUN,...)  --- runs a function against the rows or columns of a matrix
lapply(L,fun,...) --- list apply: runs a function against every element of a list and returns a list; think: for (xi in x) do fun(xi,y,z)
sapply(L,fun,...) --- shaped apply: runs a function against every element of a list but returns an UNlist, i.e. a vector
`lapply` is *list apply*
`sapply` is *shaped apply*
`rapply` is *recursive apply*
`vapply` is *vectored apply*
`tapply` is *tagged apply*
+ Reduce / Filter / Find / Position / Map / Negate --- these are higher-order functions because they operate on a data set with a function...
+ Subsetting a data frame:
  subset(dfrm, condition==a & condition==b)
  which ...
  # idx <- 1+which(diff(pt3[,csl_str])==1)  # identify the breakpoints in target sellable stocks --- one column
+ Removing row 539: d <- frm[-539,]
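A minimal sketch of the closure + composition idea described above, in the spirit of Wickham's missing-values example (the names are illustrative):

  # a closure: a factory that makes "replace NA with m" functions
  fix_missing_with <- function(m) {
    function(x) { x[is.na(x)] <- m; x }
  }
  fix_missing0 <- fix_missing_with(0)  # a function built from the template

  # compose it with lapply to fix missing values in EVERY column
  df <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))
  df[] <- lapply(df, fix_missing0)     # df[] keeps the data.frame shape
  df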
Data Structures in R

case sensitivity --- R names ARE case-sensitive --- so Myname, MyName, myname are each different.
Remember this when importing from a database --- if all names are uppercase, you'll have to type them all upper case.

introspection: str( ), class( )

R has 8 primary objects: http://ww2.coastal.edu/kingw/statistics/R-tutorials/objects.html

cell, element, scalar --- actually this doesn't exist -- everything in R is built on top of vectors

string (as.character)
  grep(str, char.obj)  # remember, strings are CASE SENSITIVE
  tolower(), toupper()

vector v : a standard collection or a hash (dictionary)
  v <- c(...)   c for concatenate, combine, create, or call-out individually
  v <- rep( )   rep for repeat
  v <- a:b      quick sequence: from a to b (inclusive)
  v <- seq( )   seq for sequence
  named vector: names(v) <- c(...)  # the names

list l : a collection of OBJECTS --- so a list in R could be used like a record in another language...
  l <- list(...)
  named list: l[[1]] REMOVES the named bit; l$objname
  Note: data frames are actually LISTS -- named lists, in fact.
  To UNLIST something, use unlist(ll)

matrix m : a table, or two-dimensional array
  m <- matrix(1:12, nrow=3)
  transpose: t(m)

array a : a generalised table, or n-dimensional array
  a <- array(1:16, dim=c(4,2,2))  # rows, columns, layers -- think like a dungeon

table : a summary or frequency table of a vector
  table(v)

data frame frm : a table, a collection of lists, a spreadsheet
  frm = data.frame( ... )  but usually you'll just load these from a file
  removing a row: frm <- frm[-rownum,]

R syntax
  if (bool1) branch1 else if (bool2) branch2 else branch3

Creating your own records & objects in R: use lists.

Subsetting