R - Specify different types of missing values (NAs)
I'm interested in specifying types of missing values. I have data with different types of missing values, and I am trying to code these values as missing in R, but I am looking for a solution where I can still distinguish between them.
Say I have some data that looks like this:
    set.seed(667)
    df <- data.frame(
      a = sample(c("don't know/not sure", "unknown", "refused",
                   "blue", "red", "green"), 20, replace = TRUE),
      b = sample(c(1, 2, 3, 77, 88, 99), 10, replace = TRUE),
      f = round(rnorm(n = 10, mean = .90, sd = .08), digits = 2),
      g = sample(c("c", "m", "y", "k"), 10, replace = TRUE)
    )
    df
    #                      a  b    f g
    # 1              unknown  2 0.78 m
    # 2              refused  2 0.87 m
    # 3                  red 77 0.82 y
    # 4                  red 99 0.78 y
    # 5                green 77 0.97 m
    # 6                green  3 0.99 k
    # 7                  red  3 0.99 y
    # 8                green 88 0.84 c
    # 9              unknown 99 1.08 m
    # 10             refused 99 0.81 c
    # 11                blue  2 0.78 m
    # 12               green  2 0.87 m
    # 13                blue 77 0.82 y
    # 14 don't know/not sure 99 0.78 y
    # 15             unknown 77 0.97 m
    # 16             refused  3 0.99 k
    # 17                blue  3 0.99 y
    # 18               green 88 0.84 c
    # 19             refused 99 1.08 m
    # 20                 red 99 0.81 c
If I make two tables where my missing values ("don't know/not sure", "unknown", "refused", and 77, 88, 99) are included as regular data,
    table(df$a, df$g)
    #                       c k m y
    #   blue                0 0 1 2
    #   don't know/not sure 0 0 0 1
    #   green               2 1 2 0
    #   red                 1 0 0 3
    #   refused             1 1 2 0
    #   unknown             0 0 3 0
and
    table(df$b, df$g)
    #      c k m y
    #   2  0 0 4 0
    #   3  0 2 0 2
    #   77 0 0 2 2
    #   88 2 0 0 0
    #   99 2 0 2 2
I recode the three factor levels "don't know/not sure", "unknown", and "refused" to <NA>:
    is.na(df$a) <- df$a %in% c("don't know/not sure", "unknown", "refused")
and remove the empty levels:
    df$a <- factor(df$a)
and the same is done for the numeric values 77, 88, and 99:
    is.na(df) <- df == "77" | df == "88" | df == "99"
    table(df$a, df$g, useNA = "always")
    #         c k m y <NA>
    #   blue  0 0 1 2    0
    #   green 2 1 2 0    0
    #   red   1 0 0 3    0
    #   <NA>  1 1 5 1    0
    table(df$b, df$g, useNA = "always")
    #        c k m y <NA>
    #   2    0 0 4 0    0
    #   3    0 2 0 2    0
    #   <NA> 4 0 4 4    0
Now the missing categories are recoded to NA, but they are all lumped together. Is there a way in R to recode something as missing but retain the original values? I want R to treat "don't know/not sure", "unknown", "refused", and 77, 88, 99 as missing, but I still want to have that information in the variable.
To my knowledge, base R doesn't have a built-in way to handle different NA types. (Editor: It does: NA_integer_, NA_real_, NA_complex_, and NA_character_. See ?base::NA.)
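The constants in the editor's note are worth a quick check, because they distinguish the storage type of a missing value, not the reason it is missing. The following is a minimal sketch in base R; the vector `x` is a made-up example, not the data from the question.

```r
# The typed NA constants differ only in storage type.
typed_nas <- list(
  logical   = NA,
  integer   = NA_integer_,
  double    = NA_real_,
  character = NA_character_,
  complex   = NA_complex_
)
na_types <- vapply(typed_nas, typeof, character(1))
print(na_types)

# All of them are simply "missing" as far as is.na() is concerned.
all_missing <- all(vapply(typed_nas, is.na, logical(1)))
print(all_missing)

# Hypothetical numeric codes: once recoded, 77, 88 and 99 all become
# the same NA_real_ and can no longer be told apart.
x <- c(1, 77, 88, 99)
x[x %in% c(77, 88, 99)] <- NA_real_
print(x)
```

So the typed constants do not, on their own, solve the problem in the question: they cannot record whether a value was "refused" or "unknown".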
One option is to use a package that does so, for instance "memisc". It's a little bit of work, but it seems to do what you're looking for.
Here's an example:
First, your data. I've made a copy since we will be making some pretty significant changes to the dataset, and it's always nice to have a backup.
    set.seed(667)
    df <- data.frame(
      a = sample(c("don't know/not sure", "unknown", "refused",
                   "blue", "red", "green"), 20, replace = TRUE),
      b = sample(c(1, 2, 3, 77, 88, 99), 10, replace = TRUE),
      f = round(rnorm(n = 10, mean = .90, sd = .08), digits = 2),
      g = sample(c("c", "m", "y", "k"), 10, replace = TRUE)
    )
    df2 <- df
Let's factor the variable "a":
    df2$a <- factor(df2$a,
                    levels = c("blue", "red", "green",
                               "don't know/not sure", "refused", "unknown"),
                    labels = c(1, 2, 3, 77, 88, 99))
Load the "memisc" library:
    library(memisc)
Now, convert the variables "a" and "b" to items in "memisc":
    df2$a <- as.item(as.character(df2$a),
                     labels = structure(c(1, 2, 3, 77, 88, 99),
                                        names = c("blue", "red", "green",
                                                  "don't know/not sure",
                                                  "refused", "unknown")),
                     missing.values = c(77, 88, 99))
    df2$b <- as.item(df2$b,
                     labels = c(1, 2, 3, 77, 88, 99),
                     missing.values = c(77, 88, 99))
By doing this, we have a new data type. Compare the following:
    as.factor(df2$a)
    #  [1] <NA>  <NA>  red   red   green green red   green <NA>  <NA>  blue
    # [12] green blue  <NA>  <NA>  <NA>  blue  green <NA>  red
    # Levels: blue red green
    as.factor(include.missings(df2$a))
    #  [1] *unknown             *refused             red
    #  [4] red                  green                green
    #  [7] red                  green                *unknown
    # [10] *refused             blue                 green
    # [13] blue                 *don't know/not sure *unknown
    # [16] *refused             blue                 green
    # [19] *refused             red
    # Levels: blue red green *don't know/not sure *refused *unknown
We can use this information to create tables that behave the way you describe, while retaining all the original information.
    table(as.factor(include.missings(df2$a)), df2$g)
    #
    #                        c k m y
    #   blue                 0 0 1 2
    #   red                  1 0 0 3
    #   green                2 1 2 0
    #   *don't know/not sure 0 0 0 1
    #   *refused             1 1 2 0
    #   *unknown             0 0 3 0
    table(as.factor(df2$a), df2$g)
    #
    #         c k m y
    #   blue  0 0 1 2
    #   red   1 0 0 3
    #   green 2 1 2 0
    table(as.factor(df2$a), df2$g, useNA = "always")
    #
    #         c k m y <NA>
    #   blue  0 0 1 2    0
    #   red   1 0 0 3    0
    #   green 2 1 2 0    0
    #   <NA>  1 1 5 1    0
The tables for the numeric column with missing data behave the same way.
    table(as.factor(include.missings(df2$b)), df2$g)
    #
    #       c k m y
    #   1   0 0 0 0
    #   2   0 0 4 0
    #   3   0 2 0 2
    #   *77 0 0 2 2
    #   *88 2 0 0 0
    #   *99 2 0 2 2
    table(as.factor(df2$b), df2$g, useNA = "always")
    #
    #        c k m y <NA>
    #   1    0 0 0 0    0
    #   2    0 0 4 0    0
    #   3    0 2 0 2    0
    #   <NA> 4 0 4 4    0
As a bonus, you get the facility to generate nice codebooks:
    > codebook(df2$a)
    ========================================================================
       df2$a
    ------------------------------------------------------------------------
       Storage mode: character
       Measurement: nominal
       Missing values: 77, 88, 99

              Values and labels      N    Percent
        1   'blue'                   3   25.0 15.0
        2   'red'                    4   33.3 20.0
        3   'green'                  5   41.7 25.0
       77 M 'don't know/not sure'    1         5.0
       88 M 'refused'                4        20.0
       99 M 'unknown'                3        15.0
However, I suggest you also read the comment from @maxim.k about what constitutes missing values.
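If adding a package is not an option, a similar effect can be approximated in base R by saving the reason for missingness in a companion column before recoding to NA. This is a different technique from the one "memisc" uses, offered only as a rough sketch; the toy data and the column name `a_missing` are made up for illustration.

```r
# Hypothetical toy data, not the df from the question.
d <- data.frame(
  a = c("blue", "unknown", "red", "refused", "green",
        "don't know/not sure", "red"),
  stringsAsFactors = FALSE
)
missing_codes <- c("don't know/not sure", "unknown", "refused")

# Remember why each value is missing before overwriting it with NA.
is_missing <- d$a %in% missing_codes
d$a_missing <- ifelse(is_missing, d$a, NA_character_)
d$a[is_missing] <- NA

# Analyses now treat all three codes as ordinary NA ...
print(table(d$a, useNA = "always"))
# ... while the reasons are still available in the companion column.
print(table(d$a_missing))
```

The trade-off is that the two columns have to be kept in sync by hand, which is the bookkeeping that a dedicated item type takes care of.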