Computational Methods in Astrophysics Dr Rob Thacker AT
Computational Methods in Astrophysics Dr Rob Thacker (AT 319 E) thacker@ap. smu. ca
More on R n Useful things for scripting: Data frames n I/O n Looping n Graphics n Models n n Note: R is fussy about having the right quotes, copy and paste often fails because of this
Data frames n Commonly used in R for multi-variable data Consider people data: heights, masses, hair colour etc n Non-numeric values are called factors e. g. “blue” “brown” for eye colour n n Logically it’s a matrix of columns/vectors of equal length but potentially different types n n Once set-up you can address variables using $ sign Could use built-in data, but let’s see how
Data frames n Start the console, create a script and height <- c(1. 7, 1. 65, 1. 34, 1. 5, 1. 8) enter name <- c(“Izzy”, ”Chris”, ”Mel”, ”Viv”, ”Alex”) mass <- c(70, 55, 50, 62, 80) eyes <- c(“brown”, ”green”, ”brown”, ”blue”, ”brown”) hair <- c(“brown”, ”blonde”, ”brown”) ourpop <-data. frame(name, eyes, hair, height, mass)
Interacting with data frames The full table can be printed using ourpop n Try ourpop$height n Can also use ourpop[, 1] to get column n ourpop[1, ] to get first row n Note if you define something in R using a variable with a value e. g. mylist <n list(a=2, b=1) [2] will report the variable name & value n [[2]] will report just the value (try them!) n
Subsetting & sampling data frames n The subset() function allows you to quickly select data that matches criteria, e. g. try mypop <- subset(ourpop, height>1. 5) mypop 2 <- subset(ourpop, height>1. 5 & height < 1. 79) n n mypop, mypop 2 will be a data frames as well You can randomly sample any data using sample(), e. g. try mysamp <ourpop$height[sample(1: nrow(ourpop), 10, replace=TRUE)] n In this case you’ll produce a vector of samples
Simple Output n print() – this is generic output function that is specified for different datatypes n n Try print(ourpop); n n n Depending on what you pass, you’ll get different results print(“Hello World”) Try removing quotes – it fails why? Essentially same as ourpop in console methods(print) will tell you what it is defined for (in this case a lot!)
Concatenated Output n is much simpler than print() – can’t handle a data frame for example cat() n n n But it does handle newline Allows you to write to a file as well Separate lines of output still need a loop Try cat(“Hi “, ourpop$height, ”n”) n You can restrict the number of pieces too: n cat(“Hi “, ourpop$height[1: 3], ”n”) n Plus add a file argument and separator cat(“Hi “, ourpop$height[1: 3], ”n”, file=“myfile”, sep=“ , ”)
Redirecting n sink(“myfile. txt”) will redirect the console (strictly the R output) to myfile. txt n n sink() restores output Can also check how many are being used with sink. number() Try, sink(“list. txt”); 1: 10 ; sink() For graphics there are specific devices, e. g. pdf(“myplot. pdf”) n jpeg(“myplot. jpeg”) n n More on this later when we look at graphics
Reading from files n R understands directory handling & paths: To get the current working directory getwd() n To set the current working directory setwd e. g. n n setwd(“C: /Users/Rob/Documents”) n dir() lists current directory contents Already noted: source(“myfile. R”) will execute script n Simplest way to read a table of separate vals: n n n mytab <- read. table(“list. txt”) Check help – can specify separator
Reading from files: specialist n R can read other stats-related formats too Excel – read. xls() n SPSS – read. spss() n Minitab – read. mtp() n n Comma separated variable files too: read. csv() n Normally expects variables names in first line, e. g Height, mass, name n 1. 7, 60, John
Reading from files: the web! n Instead of a directory name, you can give an http address! Try this: n mytab <read. table(“http: //ap. smu. ca/~thacker/list. txt”) n Note if you need passwords then there are options, including using the Rcurl package n n Obvious point – passwords in scripts are a bad idea! You can easily forget and send someone a script with your passwords!
Quick thoughts on “data manipulation” n Selecting, inserting, deleting are all supported in R, but not always in a simple way n Strictly speaking a data manipulation language like SQL is needed – see the RSQLite package So never a bad idea to preprocess data first n If your data is small enough you could always use a spreadsheet n
Looping & timing n R is “vector language”, and you should try and think that way n n Of course it isn’t always possible to vectorize To time how long operation takes use Sys. time() start. time <- Sys. time() end. time <- Sys. time() time. taken = start. time – end. time print(time. taken)
Looping – what can you do? n R supports three types of loops for n while n repeat n
For loops n Think about a vector of loop values controlling loop for (i in 1: 10000) { An operation } n For strides: steps then <- seq(1, 10000, by=2) for (i in steps) { An operation }
for loop: Exercise Create a vector mylist with values 1: 100000 n Create a vector mylist. sq = NULL n Now write a loop from 1: 100000 that sets each element of mylist. sq to the square of n mylist Time this using Sys. time() n Nxt instead declare mylist. sq = n rep(0, 100000) n rm(mylist, mylist. sq) rerun
while loop n While loops are the next step up in complexity n n Consider the condition at the beginning of each iteration General format: while (condition){ An operation } n Example, for loop as while loop (try it): j = 1 while (j<=10) { cat(“j=“, j, ” n”) j=j+1 }
if constructs n R supports if (condition) control structures x = 1 if (x>0) { cat(“x is positive”) } else if (x < 0) { cat(“x is negative”) } else { cat(“x is zero”) } n Can use if’s to break out of loops
Using breaks Flow control like “break out goto” n Can use in any kind of loop structure – even for n x <- 1: 10 for (j in x) { if (j == 7) { break } cat(“j=“, j, ”n”) } n Remember – position of break condition will determine whether following code is executed n Easy to trip yourself up on this…
repeat loop n Repeat loops differ from while loops two ways n n n 1) There’s no explicit condition following repeat 2) you must break using a condition to leave the loop Example for loop in repeat format: j = 1 repeat { if (j == 7) { break } cat(“j=“, j, ”n”) j=j+1 Important to watch where you put the break point – easy to get your loop logic wrong. When in doubt, print out…
Tips for better performance of “for loops” n Ensure the list/vector you are writing to is the right length before you start Growing the list/vector on each iteration is expensive n Even if you don’t know exact length, you probably have an upper bound n n Get as many operations outside the loop as you can
Why is vectorization so much better? It goes beyond just compiled vs interpreted n Each call to a function requires R to determine what type data is being passed and then send the correct data type to a compiled function n For vectors this is straightforward – all same datatype n Doing this *once* rather than repeated calls is obviously better! n These issues are sorted out at compile time in compiled languages n
When do you have to use loops? If one iteration depends on the previous one (recall data dependence issues) n If a function doesn’t take a vector input n Sometimes recursive situations require it too n
Plots! n As for most packages, simple plots are easy, more detailed ones need more qualifiers n Try this: plot(ourpop$height, ourpop$mass) n To create a line-point plot try plot(ourpop$height, ourpop$mass, type=“o”) n Highlights that R plots in data order when creating line graphs So need to create an ordering array – that’s not difficult n op. sort = order(ourpop$height) Now try n plot(ourpop$height[op. sort], ourpop$mass[op. sort], ty pe=“o”)
Labels and ranges n Axis labels: plot(ourpop$height[op. sort], ourpop$mass[op. sort], type="o", ylab="Mass/kg", xlab="Height/m") n Setting ranges plot(ourpop$height[op. sort], ourpop$mass[op. sort], typ e="o", ylab="Mass/kg", xlab="Height/m", ylim=c(40, 90), x lim=c(1, 2)) n Tip: make sure you don’t use a colon e. g. ylim=c(1: 2) – that will fail n Colour: try plot(ourpop$height[op. sort], ourpop$mass[op. sort], typ e="o", ylab="Mass/kg", xlab="Height/m", ylim=c(40, 90), x lim=c(1, 2), col=“blue”) n To add a title, use the title function: title(main="Mass vs weight")
Creating hardcopy Need to pipe to file and appropriate device n pdf example: n pdf(“myfile. pdf”) plot(ourpop$height[op. sort], ourpop$mass[op. sort], typ e="o", ylab="Mass/kg", xlab="Height/m", ylim=c(40, 90), x lim=c(1, 2), col=“blue”) dev. off() #flush to file n n **Always remember to close with dev. off()** help(“device”) will tell which graphical devices are available (typically, pdf, ps, xfig, bitmap+…)
Annotation & legends n You can add text using the text command e. g. text(1. 2, 80, “Hi there") x, y position are the first two values Note you can also use text to label points: text(ourpop$height[op. sort], ourpop$mass[op. sort], our pop$name[op. sort], cex=0. 6, pos=4, col="red") n Legends: legend("topleft", lty=1, col="blue", pch=21, "H eights") n A bit messy, but it works! Try help(“legend”) for more info
Simple fitting n Linear fitting can be done with lm(y~x) Try: lm(ourpop$height~ourpop$mass) n Should get Call: lm(formula = ourpop$mass ~ ourpop$height) Coefficients: (Intercept) ourpop$height -24. 85 55. 23 n Better to put into a “fit” object, e. g. fit = lm(ourpop$height~ourpop$mass) summary(fit) n You can plot the residuals etc using plot(fit), & plot the fit with abline(fit)
Summary Data frames are a powerful way of storing data that can be easily subsetted n Avoid loops when you can – vectorization is much faster n Basic I/O is much like a terminal, but be aware there are more sophisticated packages out there n Plotting is tricky, but amazingly powerful, we’ve only just touched on things today n
- Slides: 30