R - Specify different types of missing values (NAs)


I'm interested in specifying types of missing values. I have data with different types of missing values, and I am trying to code these values as missing in R, but I am looking for a solution where I can still distinguish between them.

Say I have some data that looks like this,

set.seed(667)
df <- data.frame(a = sample(c("don't know/not sure", "unknown", "refused",
                              "blue", "red", "green"), 20, replace = TRUE),
                 b = sample(c(1, 2, 3, 77, 88, 99), 10, replace = TRUE),
                 f = round(rnorm(n = 10, mean = .90, sd = .08), digits = 2),
                 g = sample(c("c", "m", "y", "k"), 10, replace = TRUE))
df
#                      a  b    f g
# 1              unknown  2 0.78 m
# 2              refused  2 0.87 m
# 3                  red 77 0.82 y
# 4                  red 99 0.78 y
# 5                green 77 0.97 m
# 6                green  3 0.99 k
# 7                  red  3 0.99 y
# 8                green 88 0.84 c
# 9              unknown 99 1.08 m
# 10             refused 99 0.81 c
# 11                blue  2 0.78 m
# 12               green  2 0.87 m
# 13                blue 77 0.82 y
# 14 don't know/not sure 99 0.78 y
# 15             unknown 77 0.97 m
# 16             refused  3 0.99 k
# 17                blue  3 0.99 y
# 18               green 88 0.84 c
# 19             refused 99 1.08 m
# 20                 red 99 0.81 c

If I make two tables with the missing values ("don't know/not sure", "unknown", "refused", and 77, 88, 99) included as regular data,

table(df$a, df$g)
#                     c k m y
# blue                0 0 1 2
# don't know/not sure 0 0 0 1
# green               2 1 2 0
# red                 1 0 0 3
# refused             1 1 2 0
# unknown             0 0 3 0

and

table(df$b, df$g)
#    c k m y
# 2  0 0 4 0
# 3  0 2 0 2
# 77 0 0 2 2
# 88 2 0 0 0
# 99 2 0 2 2

I now recode the three factor levels "don't know/not sure", "unknown", and "refused" to <NA>

is.na(df$a) <- df$a %in% c("don't know/not sure", "unknown", "refused")

and remove the empty levels

df$a <- factor(df$a)  

and the same is done for the numeric values 77, 88, and 99

is.na(df) <- df == "77" | df == "88" | df == "99"

table(df$a, df$g, useNA = "always")
#       c k m y <NA>
# blue  0 0 1 2    0
# green 2 1 2 0    0
# red   1 0 0 3    0
# <NA>  1 1 5 1    0

table(df$b, df$g, useNA = "always")
#      c k m y <NA>
# 2    0 0 4 0    0
# 3    0 2 0 2    0
# <NA> 4 0 4 4    0

Now the missing categories are recoded to NA, but they are all lumped together. Is there a way in R to recode something as missing, but retain the original values? I want R to treat "don't know/not sure", "unknown", "refused" and 77, 88, 99 as missing, but I want to be able to still have the information in the variable.

To my knowledge, base R doesn't have an in-built way to handle different NA types. (Editor: it does: NA_integer_, NA_real_, NA_complex_, and NA_character_. See ?base::NA.)
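A quick base R check may help put the editor's note in context. The typed NA constants differ in storage type only, so as far as this sketch shows they cannot carry a reason such as "refused" versus "unknown". The vector `x` and the list `typed_nas` are made up for illustration.

```r
# The typed NA constants differ only in their storage type.
typed_nas <- list(NA, NA_integer_, NA_real_, NA_character_)
na_types  <- sapply(typed_nas, typeof)
na_types
# "logical" "integer" "double" "character"

# Illustrative vector using the missing codes from the question.
x <- c(1, 77, 88, 99)
x[x %in% 77] <- NA_real_
x[x %in% 88] <- NA_integer_  # coerced to the vector's type (double)
x[x %in% 99] <- NA           # logical NA, also coerced to double

# All three are now the same NA; the original codes are gone.
x
is.na(x)
```

Once assigned, the three values are indistinguishable, which is the lumping problem described in the question.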

One option is to use a package to do so, for instance "memisc". It's a little bit of work, but it seems to do what you're looking for.

Here's an example:

First, your data. I've made a copy since I will be making some pretty significant changes to the dataset, and it's always nice to have a backup.

set.seed(667)
df <- data.frame(a = sample(c("don't know/not sure", "unknown",
                              "refused", "blue", "red", "green"),
                            20, replace = TRUE),
                 b = sample(c(1, 2, 3, 77, 88, 99), 10,
                            replace = TRUE),
                 f = round(rnorm(n = 10, mean = .90, sd = .08),
                           digits = 2),
                 g = sample(c("c", "m", "y", "k"), 10,
                            replace = TRUE))
df2 <- df

Let's convert variable "a" to a factor with numeric codes as labels:

df2$a <- factor(df2$a,
                levels = c("blue", "red", "green",
                           "don't know/not sure",
                           "refused", "unknown"),
                labels = c(1, 2, 3, 77, 88, 99))

Load the "memisc" library:

library(memisc) 

Now, convert variables "a" and "b" to items in "memisc":

df2$a <- as.item(as.character(df2$a),
                 labels = structure(c(1, 2, 3, 77, 88, 99),
                                    names = c("blue", "red", "green",
                                              "don't know/not sure",
                                              "refused", "unknown")),
                 missing.values = c(77, 88, 99))
df2$b <- as.item(df2$b,
                 labels = c(1, 2, 3, 77, 88, 99),
                 missing.values = c(77, 88, 99))

By doing this, we have a new data type. Compare the following:

as.factor(df2$a)
#  [1] <NA>  <NA>  red   red   green green red   green <NA>  <NA>  blue
# [12] green blue  <NA>  <NA>  <NA>  blue  green <NA>  red
# Levels: blue red green
as.factor(include.missings(df2$a))
#  [1] *unknown             *refused             red
#  [4] red                  green                green
#  [7] red                  green                *unknown
# [10] *refused             blue                 green
# [13] blue                 *don't know/not sure *unknown
# [16] *refused             blue                 green
# [19] *refused             red
# Levels: blue red green *don't know/not sure *refused *unknown

We can use this information to create tables that behave the way you describe, while retaining all the original information.

table(as.factor(include.missings(df2$a)), df2$g)
#
#                        c k m y
#   blue                 0 0 1 2
#   red                  1 0 0 3
#   green                2 1 2 0
#   *don't know/not sure 0 0 0 1
#   *refused             1 1 2 0
#   *unknown             0 0 3 0
table(as.factor(df2$a), df2$g)
#
#         c k m y
#   blue  0 0 1 2
#   red   1 0 0 3
#   green 2 1 2 0
table(as.factor(df2$a), df2$g, useNA = "always")
#
#         c k m y <NA>
#   blue  0 0 1 2    0
#   red   1 0 0 3    0
#   green 2 1 2 0    0
#   <NA>  1 1 5 1    0

The tables for the numeric column with missing data behave the same way.

table(as.factor(include.missings(df2$b)), df2$g)
#
#       c k m y
#   1   0 0 0 0
#   2   0 0 4 0
#   3   0 2 0 2
#   *77 0 0 2 2
#   *88 2 0 0 0
#   *99 2 0 2 2
table(as.factor(df2$b), df2$g, useNA = "always")
#
#        c k m y <NA>
#   1    0 0 0 0    0
#   2    0 0 4 0    0
#   3    0 2 0 2    0
#   <NA> 4 0 4 4    0

As a bonus, you get the facility to generate nice codebooks:

> codebook(df2$a)
========================================================================

   df2$a

------------------------------------------------------------------------

   Storage mode: character
   Measurement: nominal
   Missing values: 77, 88, 99

            Values and labels    N    Percent

    1   'blue'                   3   25.0 15.0
    2   'red'                    4   33.3 20.0
    3   'green'                  5   41.7 25.0
   77 M 'don't know/not sure'    1         5.0
   88 M 'refused'                4        20.0
   99 M 'unknown'                3        15.0

However, I suggest you also read the comment from @maxim.k about what constitutes missing values.
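If adding a package is not an option, the same idea can be roughly approximated in base R by storing the reason for missingness in a companion column. This is only a sketch and not part of "memisc": `split_missing()` is a hypothetical helper written for this illustration, and `b` repeats the first ten values of column "b" from the data above.

```r
# Hypothetical helper: replace the given codes with NA, but keep the
# original code in a separate "reason" column.
split_missing <- function(x, missing_codes) {
  is_miss <- x %in% missing_codes
  reason  <- ifelse(is_miss, as.character(x), NA_character_)
  x[is_miss] <- NA
  data.frame(value = x, reason = reason, stringsAsFactors = FALSE)
}

# First ten values of column "b" from the example data.
b   <- c(2, 2, 77, 99, 77, 3, 3, 88, 99, 99)
res <- split_missing(b, c(77, 88, 99))

table(res$value, useNA = "always")  # 77, 88 and 99 lumped together as <NA>
table(res$reason)                   # the reasons remain distinguishable
```

The trade-off is that the two columns have to be kept in sync by hand, whereas an item in "memisc" carries its missing-value definition with it.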

