I am trying to write a function that calculates a mean lat/long to fill in location (lat/long) values that are missing or zero in my dataset.
I have locations for many SETS within the same TRIP, which tend to be close in space. However, in many cases, I am missing the location information for some of my SETS. I would like to replace these NA or zero values with the mean lat/long calculated from the remaining SETS within that TRIP.
> head(newdata[,c(3,12,26:27)])#This is my data
        TRIP_ID SET_NO SH_LATITUDE SH_LONGITUDE
13365 100020848      3    41.88983    -66.29183
13373 100020848      6    41.93317    -66.29767
13430 100020848     11    42.01200    -66.45217
13442 100020848      7    41.92383    -66.29733
23207 100020848      4    41.94817    -66.29750
24669 100020848      9    41.88300    -66.30567
For example, SH_LONGITUDE is zero (i.e., missing) for 17239 sets spread across different TRIP_IDs.
> look<-newdata[newdata$SH_LONGITUDE==0,]
> dim(look)
[1] 17239 130
> head(table(look$TRIP_ID))
100020997 100021109 100021204 100021306 100021337 100021340
        1         2         1         1         2         1
I am new to user-defined functions, but this is roughly what I would like to be able to call:
replacebymean<-function(dat, x...?) {
if dat[dat$x==0, ] or dat[is.na(dat$x), ]
then "find these NAs/zeros associated TRIP_ID" and
calculate the mean lat/long for each TRIP_ID as
mean(dat$x[dat$TRIP_ID=="e.g.,100022478"], na.rm=T) and
replace these NAs/zeros by this mean value.
}
Best Answer
You can use the is.na function to select the values that are NA.
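As a minimal illustration of how is.na works with logical indexing, here is a sketch on a made-up vector (the values are invented for the example and are not from your data):

```r
# Made-up vector with two missing values
x <- c(41.9, NA, 42.0, NA)

# is.na() returns a logical vector marking the missing entries
missing <- is.na(x)

# Logical indexing selects (or assigns to) just those entries;
# here they are filled with the mean of the non-missing values
x[missing] <- mean(x, na.rm = TRUE)
```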
So in your case:
# Replace the NAs in x for one trip with the mean of that trip's non-missing values
dat$x[dat$TRIP_ID == 100022478 & is.na(dat$x)] <- mean(dat$x[dat$TRIP_ID == 100022478], na.rm = TRUE)
You could use a loop to iterate over your TRIP_ID values:
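One way such a loop might look, wrapped in a function with the name from your question. This is only a sketch under a few assumptions: zeros are treated as missing (as you describe), the column names are passed as strings, and the group argument and the toy data frame are made up for illustration. Trips with no valid values at all are left as NA.

```r
# Hypothetical helper: replace NA/zero values in column `x` of `dat`
# with the mean of the valid values from the same trip.
replacebymean <- function(dat, x, group = "TRIP_ID") {
  # Assumption: a zero means "missing", as described in the question
  dat[[x]][!is.na(dat[[x]]) & dat[[x]] == 0] <- NA

  for (id in unique(dat[[group]])) {
    in_trip <- which(dat[[group]] == id)   # row indices for this trip
    vals    <- dat[[x]][in_trip]
    to_fill <- in_trip[is.na(vals)]        # rows in this trip that need a value
    # Skip trips with nothing to fill, or with no valid values to average
    if (length(to_fill) > 0 && !all(is.na(vals))) {
      dat[[x]][to_fill] <- mean(vals, na.rm = TRUE)
    }
  }
  dat
}

# Toy data, invented for illustration
toy <- data.frame(
  TRIP_ID      = c(1, 1, 1, 2, 2),
  SH_LATITUDE  = c(41.9, 0, 42.1, NA, 43.0),
  SH_LONGITUDE = c(-66.3, -66.5, NA, -65.0, 0)
)
toy <- replacebymean(toy, "SH_LATITUDE")
toy <- replacebymean(toy, "SH_LONGITUDE")
```

On your data the calls would presumably be replacebymean(newdata, "SH_LATITUDE") and replacebymean(newdata, "SH_LONGITUDE"). With many trips an explicit loop can be slow; base R's ave() is one loop-free alternative worth looking at.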