All about schema drift – Part 1
Source
When you choose to allow schema drift on the source, the additional columns that are not part of the dataset definition are added to the end of the source projection. Let’s look at an example.
The actual data file looks like this. In this case it is a CSV with headers turned on.
Scenario 1
As you can see, ‘gross’ is missing in this file and ‘rated’ is a new column. The resulting data preview with schema drift on will look like this.
Pay special attention to the column order: all columns are retained in the same order as defined in the dataset. The ‘gross’ column has NULL values since it was not found, and ‘rated’ is added as a new column at the end. Schema drift works even if the column order in the actual source file is different.
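For reference, a drifting source like this is just a normal source transformation with the schema drift option turned on. Here is a minimal sketch of what the data flow script behind it might look like, assuming an illustrative projection and the made-up stream name MoviesSource; note that the drifted ‘rated’ column never appears in the script, it is only picked up at runtime:

    source(output(
            movieId as string,
            title as string,
            gross as string
        ),
        allowSchemaDrift: true,
        validateSchema: false) ~> MoviesSource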
Scenario 2
Let’s go to the extreme and use an empty dataset with no schema, and see what the behavior is.
The data preview will look like this.
It worked like Scenario 1, except that every column was found to be new and was added to the end of the list. Hence the order is the same as what you see in the source file.
TIP – This is often misinterpreted. If you forget to ‘Import schema’ in your dataset, you still get data preview results because of schema drift. However, if you interrogate the columns in subsequent transforms, you will find the list empty! The tool only presents columns defined in the dataset; data preview shows how things work at runtime.
Scenario 3
Let’s try the same exercise with headers in the CSV turned off. It should give the same behavior, correct? Not quite.
The projection in the source looks like this.
Tip: Since it is a header-less CSV, you can change the column names in the source projection. Otherwise they will appear as Col0, Col1, and so on.
The data preview on this shows a result different from Scenario 1.
What happened here?! A CSV with no headers matches everything positionally. So the first column in your data maps to movieId, the third to gross, and so on.
Does this mean schema drift does not apply to header-less CSVs? Look at the next scenario to understand better.
Scenario 4
The data file has a new ‘gross’ column at the end.
The projection in the source looks like this.
Now if we preview, the data shows up as follows.
So the ‘gross’ column shows up at the end, but it is named _col0 because the data flow generates a pseudo name for it. Great, after all this I have a column with a cryptic name which I don’t know how to deal with. TIP: You can use the ‘pos’ function to identify a column by its position. That’s for a different blog.
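If you do want to deal with that cryptic column straight away, the data flow expression language can also reference a column by its position from a derived column. Here is a minimal sketch of that positional lookup, assuming the drifted column lands at position 7 and reusing the hypothetical stream name MoviesSource; I am using the byPosition() expression to illustrate the idea:

    MoviesSource derive(gross = toInteger(byPosition(7))) ~> NameDriftedColumn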
Scenario 5
This is a common scenario where users complain that they are not getting the data preview results they expect.
The actual data file is a CSV with no headers (as in Scenario 3), but you unintentionally checked ‘First row as header’ in the dataset.
The projection looks like this.
The preview shows up as follows.
What happened here is that the data flow considered the first row to be the header, and ‘1’ and ‘Rambo’ turned out to be column names!
So why am I getting NULL values for the rest of the columns? Because the data flow matched the columns by name, did not find movieId, title, etc., and filled them with NULLs.
To test this, let’s turn schema drift off and check the results.
OK, I lose the drifted columns and, in both cases, I lose a row to the header.
TIP: If you write this to a sink with schema drift OFF, you will get a file filled with NULLs!
Validate schema
Validate schema is intended to make sure that the columns you have defined in the dataset actually exist in the source. If you had turned on validate schema in Scenario 1, the data flow would have failed with a validation exception at runtime. It is a way to detect missing columns in the schema early, so the downstream transforms do not have to do the hard work.
What do ‘missing columns’ mean for a header-less CSV? If you have 6 columns defined in the dataset, it checks positionally that at least 6 columns exist in the source file.
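In the script behind the source, validate schema is just one more flag next to schema drift. A minimal sketch with an illustrative projection and stream name; if any of these columns (or, for a header-less CSV, that many positional columns) is missing at runtime, the data flow fails with a validation error:

    source(output(
            movieId as string,
            title as string,
            gross as string
        ),
        allowSchemaDrift: true,
        validateSchema: true) ~> ValidatedSource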
You can use validate schema with schema drift turned off too! Powerful, right?
Schema drift on SQL Sources
A SQL source works similarly to a CSV with headers. If the table or query returns new columns, they are added to the end of the projection. Missing columns are filled with NULLs.
Validate schema for SQL not only checks the column names but also checks the column datatype compatibility against what is defined in the dataset.
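The same two options apply to a SQL source in the script view. Here is a hedged sketch with illustrative names and types, leaving out the connector-specific settings; with validate schema on, a ‘gross’ column that comes back as a string when the dataset declares a number would also fail validation:

    source(output(
            movieId as integer,
            title as string,
            gross as double
        ),
        allowSchemaDrift: true,
        validateSchema: true) ~> SqlMovies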
We will explore schema drift in the sink, and how you can use it in conjunction with the source, in my next blog.