So I've created a file that reads into a data frame in the same way as yours:
> str(v2)
'data.frame': 360 obs. of 720 variables:
BUT data.frame isn't really the right thing here. Its really meant for record-oriented data, where each row is a record and each column is a potentially different variable for that record (eg each row is a person, the columns are name, age, height, etc).
So you really only need to scan
the data in as one long vector and feed it to a raster.
Step 1, define an empty raster of the right size and shape (note I'm assuming the raster covers the whole world, so the limits are not the cell centres):
> m2=raster(nrow=360,ncol=720,xmn=-180,xmx=180,ymn=-90,ymx=90)
Step 2, read numeric values into the raster data slot:
> m2[]=scan("d.txt",what=1)
Read 259200 items
And give it a projection if needed:
> projection(m2)="+init=epsg:4326"
> plot(m2)
If you want to check that the resolution and the cell centres are as expected, use these functions:
> res(m2)
[1] 0.5 0.5
> xFromCol(m2,1:10)
[1] -179.75 -179.25 -178.75 -178.25 -177.75 -177.25 -176.75 -176.25 -175.75
[10] -175.25
> yFromRow(m2,1:10)
[1] 89.75 89.25 88.75 88.25 87.75 87.25 86.75 86.25 85.75 85.25
which shows the resolution is half a degree and the cell centres (or at least the first 10) are at those specified coordinates.
1) resample
results in 50% improvement
I was able to get about 50% improvement by resampling directly from the cld
raster to a new raster with the same extent/resolution as r
and a nearest neighbor sampling method:
system.time({
mat<-as.data.frame(getValues(r))
mat$landuse<- NA
mat$landuse<-getValues(resample(cld,r,method='ngb'))
})
user system elapsed
2.60 0.00 2.61
vs.
system.time({
mat<-as.data.frame(getValues(r)) # getting values from the stack
xy<-xyFromCell(r,c(1:ncell(r)),spatial = TRUE)
cells<-cellFromXY(clip1,xy)
mat$landuse<- NA
mat$landuse<- extract(clip1,cells) #this line takes 5 mins based on profiling
})
user system elapsed
4.98 0.00 5.02
On a smaller dataset, and with a much smaller memory footprint
2) Parallelization could get you a lot more
That will improve things significantly but you can get massive improvement if you can parallelize this. R comes with a couple parallelization backends and they all run through foreach
. I assume you are going to either process mat
in place or save it for later. Since it takes so much work to get that resampled data let's just assume we'll save it for later. The most convenient form is probably a raster alongside the data_robin
files.
Unfortunately, Windows and Unix parallelization options differ. On linux, use doMC
, on Windows use doSNOW
. Assuming we employ 4 workers:
linux initialization:
library(doMC)
registerDoMC(4) # number of workers should be less than number of CPU cores
windows initialization:
library(doSNOW)
cluster<-makeCluster(4, type = "SOCK") # num workers should be < num CPU cores
registerDoSNOW(cluster)
next:
library(foreach)
library(tools)
# Assume you have an array of filenames called 'files'
foreach (i=1:length(filenames), .packages=c('raster')) %dopar% {
r <- stack(paste0(path, "/data_robin/", files[i]))
outFilename=paste0(path, "/data_robin/", file_path_sans_ext(files[i]), "_cld.tif")
cldResampled <- resample(cld,r,method='ngb')
writeRaster(cldResampled, filename=outFilename, format="GTiff")
}
One of the drawbacks of the parallel foreach
is that it's hard to tell when something goes wrong. It would be good to do this serially first by replacing the %dopar%
with %do%
until you know it is working, and then let it run through the whole thing.
Caveats
In my simple example above (each raster had 1/100 the pixels in cld
and r
, respectively) I only improved an additional 30% by engaging 5 workers over doing it serially with just the single process. I was unable to parallelize the example with large rasters without getting Error in { : task 1 failed - "cannot allocate vector of size xx.x Mb"
. I think conceptually this should work but I wasn't able to get it working at the scale you are working at.
Best Answer
It should be a one-liner:
That gets you a raster of 0/1. If you want NAs for 0, then do:
as well.