sql server - Data Warehouse design - Handling NULL and empty values in the OLTP -
i creating dw oltp creaking somewhat.
a problem i'm faced there isn't data integrity in oltp database. example suburb field.
this suburb field free text field on oltp ui means we've got values in field, plus we've got empty strings , we've got null values.
how handle this? scenarios i've come are:
- import data (not ideal)
- in etl process, treat empty string same null , replace word 'unknown' in dw
- import both empty strings , null's empty strings in dw
just fyi, i'm using microsoft bi stack (sql server, ssis, ssas, ssrs)
the short answer is, depends on null
, empty strings mean in source system.
this general question (handling null
) has been discussed lot, e.g. here, here, here etc. think important point remember data warehouse database; may have specific type of schema , designed 1 purpose, it's still database , general advice on null
still applies.
(as side note, prefer talk "reporting database" rather "data warehouse", because keeps things in perspective. dbas , developers start making plans huge server farms , multi-year etl projects hear words "data warehouse", in end it's reporting database.)
anyway, isn't clear want use null
looks may attribute on dimension.
i (probably) wouldn't use of 3 approaches, depends on meaning of data. importing data as-is not useful because part of value of data warehouse data has been cleaned , consistent, makes querying , comparing data along other dimensions easier.
replacing empty strings 'unknown' may or may not correct: empty string mean in source system? there's big difference between "it means there's no suburb" , "it means don't know if there's suburb". assuming empty string means "no suburb" , null
means "unknown" import empty strings are, replace null
'unknown'. main reason doing if suburb field used filter condition in report, it's easier users (and possibly reporting tool) work non-null
value 'unknown'. , if there no consistency in source system , don't know empty strings , nulls mean, need clarify first , ideally fix source system (another benefit of dwh helps identify inconsistencies , data handling errors in source systems).
your last idea convert null
s empty strings same issue: null
mean in source system? if means "no suburb" replacing empty string idea, if means else should handle else.
so summarize, preference import empty strings as-is, , convert null
'unknown', can't sure makes sense in case. there's no single answer question because depends on specific data , means. there's no problem using null
in data warehouse (or other database) long consistently , clear understanding of how source systems handle data.
Comments
Post a Comment