Wednesday, March 13, 2013

for loops vs. sapply

Loops are very useful for doing the same (or similar) things multiple times. Unfortunately, in R, loops can be very clunky and slow.

For loops are perhaps more intuitive than sapply because the result you get is the same as if you ran the code within the for loop multiple times. What do I mean?

> x = numeric(10)
> y = numeric(10)
> z = for(i in 1:10) {
+ y[i] = i
+ x[i] = i*2
+ x[i]
+ }
> x
 [1]  2  4  6  8 10 12 14 16 18 20
> y
 [1]  1  2  3  4  5  6  7  8  9 10
> z
NULL

So the code within the for loop actually changes what is stored in x and y, but it does not return anything itself. Thus, z is NULL.

Let's use very similar code, except using sapply:

> x = numeric(10)
> y = numeric(10)
> z = sapply(1:10, function(i){
+    y[i] = i
+    x[i] = i*2
+    x[i]
+  })
> x
 [1] 0 0 0 0 0 0 0 0 0 0
> y
 [1] 0 0 0 0 0 0 0 0 0 0
> z
 [1]  2  4  6  8 10 12 14 16 18 20

Wait, why are x and y still full of 0s? This happens because assignments made inside the function passed to sapply are local to that function call and do not affect the global environment. So y[i] = i within sapply modifies a local copy of y, not the vector y itself, which stays a vector of 0s as it was initialized. The trouble with sapply is that, because of this, one iteration cannot depend on a different iteration; i.e., we cannot calculate x based on what x was in a previous iteration. This is in direct contrast to a for loop, where, because the changes happen in the environment the loop runs in (here, the global environment), we can use a previous iteration to determine the current one, like in this example:

> x = numeric(10)
> y = numeric(10)
> z = for(i in 2:10) {
+    y[i] = i
+    x[i] = x[i-1]+y[i-1]
+  }
> x
 [1]  0  0  2  5  9 14 20 27 35 44
> y
 [1]  0  2  3  4  5  6  7  8  9 10
> z
NULL
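
If you want both x and y out of sapply, one option is to return them from the function instead of assigning to outside vectors. The sketch below uses base R only; the names res, run_with_superassign and x2 are made up for illustration. Its second half uses the superassignment operator <<-, which writes to a variable in an enclosing environment rather than a local copy. It is shown only to contrast with the scoping behavior described above, and it is not a recommended replacement for a for loop.

```r
# Return both values from each call; sapply binds the length-2 results
# into a matrix with rows "y" and "x" and one column per iteration.
res = sapply(1:10, function(i) c(y = i, x = i * 2))
y = res["y", ]
x = res["x", ]

# Contrast: <<- modifies x and y in the enclosing environment, so later
# iterations can see what earlier iterations stored. Wrapped in a
# (hypothetical) function so the vectors being modified are easy to find.
run_with_superassign = function() {
  x = numeric(10)
  y = numeric(10)
  sapply(2:10, function(i) {
    y[i] <<- i
    x[i] <<- x[i - 1] + y[i - 1]
    x[i]
  })
  x
}
x2 = run_with_superassign()
```

Here x and y hold the same values as in the first for loop, and x2 matches the x produced by the last for loop. When one iteration really depends on the previous one, a plain for loop is usually the clearer choice.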
