[GIS] Seeking function to calculate mean for missing or NA values in dataframe using R

data-frame r

I am trying to develop a function that will calculate a mean lat/long to fill in location (lat/long) values that are missing or recorded as zero in my dataset.

I have locations for many SETS within the same TRIP, which tend to be close together in space. However, in many cases the location information is missing for some of my SETS. I would like to replace these NA or zero values with the mean lat/long calculated from the remaining SETS within that TRIP.

> head(newdata[,c(3,12,26:27)])#This is my data
        TRIP_ID SET_NO SH_LATITUDE SH_LONGITUDE
13365 100020848      3    41.88983    -66.29183
13373 100020848      6    41.93317    -66.29767
13430 100020848     11    42.01200    -66.45217
13442 100020848      7    41.92383    -66.29733
23207 100020848      4    41.94817    -66.29750
24669 100020848      9    41.88300    -66.30567

For example, SH_LONGITUDE is missing (recorded as zero) for 17239 SETS spread across different TRIP_IDs.

> look<-newdata[newdata$SH_LONGITUDE==0,]
> dim(look)
[1] 17239   130
> head(table(look$TRIP_ID))

100020997 100021109 100021204 100021306 100021337 100021340 
        1         2         1         1         2         1 

I am a newbie with user-defined functions, but this is roughly what I would like to be able to call:

replacebymean<-function(dat, x...?) {
    if dat[dat$x==0, ] or dat[is.na(dat$x), ] 
    then "find these NAs/zeros associated TRIP_ID" and 
          calculate the mean lat/long for each TRIP_ID as  
             mean(dat$x[dat$TRIP_ID=="e.g.,100022478"], na.rm=T) and 
          replace these NAs/zeros by this mean value.
}

Best Answer

You can use the is.na function to select values that are NA, e.g.:

> t<-c(1:10,NA,1:5)
> t
 [1]  1  2  3  4  5  6  7  8  9 10 NA  1  2  3  4  5
> mean(t)
[1] NA
> mean(t, na.rm=T)
[1] 4.666667
> t[is.na(t) ]<-mean(t, na.rm=T)
> t
 [1]  1.000000  2.000000  3.000000  4.000000  5.000000  6.000000  7.000000
 [8]  8.000000  9.000000 10.000000  4.666667  1.000000  2.000000  3.000000
[15]  4.000000  5.000000

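So in your case, for a single trip (this handles NA only; zeros would first need to be recoded to NA so that they neither bias the mean nor get skipped):

dat$x[dat$TRIP_ID == 100022478 & is.na(dat$x)] <- mean(dat$x[dat$TRIP_ID == 100022478], na.rm = TRUE)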

You could use a loop to iterate over your TRIP_ID values:

for (trip in unique(dat$TRIP_ID)) {
    dat$x[dat$TRIP_ID == trip & is.na(dat$x)] <- mean(dat$x[dat$TRIP_ID == trip], na.rm = TRUE)
}
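
If you want this wrapped up as the kind of function the question asks for, here is a minimal sketch. It assumes that zeros really do mean "missing" in your coordinate columns, and it uses base R's ave() to get per-trip means instead of an explicit loop. The function name replace_by_group_mean, its arguments, and the toy data frame are all made up for illustration; they are not from your dataset.

```r
# Hypothetical helper: replace NA (and optionally zero) values in the given
# columns with the mean of the remaining values in the same group.
replace_by_group_mean <- function(dat, cols, group = "TRIP_ID",
                                  zero_is_missing = TRUE) {
  for (col in cols) {
    x <- dat[[col]]
    if (zero_is_missing) {
      # Assumption: a coordinate of exactly 0 is a placeholder, not a real value
      x[!is.na(x) & x == 0] <- NA
    }
    # ave() applies FUN within each group and returns a vector aligned with x
    group_mean <- ave(x, dat[[group]], FUN = function(v) mean(v, na.rm = TRUE))
    # A group with no valid values yields NaN; keep those as NA
    group_mean[is.nan(group_mean)] <- NA
    missing <- is.na(x)
    x[missing] <- group_mean[missing]
    dat[[col]] <- x
  }
  dat
}

# Toy data, invented for illustration only
toy <- data.frame(
  TRIP_ID      = c(1, 1, 1, 2, 2, 3),
  SH_LATITUDE  = c(41.0, 0, 43.0, NA, 44.0, 0),
  SH_LONGITUDE = c(-66.0, -66.4, NA, -65.0, 0, NA)
)
filled <- replace_by_group_mean(toy, c("SH_LATITUDE", "SH_LONGITUDE"))
print(filled)
```

Trips with no valid position at all (TRIP_ID 3 in the toy data) are left as NA, since there is nothing to average. On your data the call would presumably look like replace_by_group_mean(newdata, c("SH_LATITUDE", "SH_LONGITUDE")).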