Wednesday, 12 June 2019

All about schema drift – Part 1

ADF data flow is built from the ground up to support schema drift. Schema drift is a scenario where the schema of the dataset changes over time by adding or deleting columns. With traditional ETL tools, such changes make the integration fragile. We will first explore how to deal with schema drift at the source and sink, and then extend that to more complex transformation scenarios.
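Schema drift is a per-source and per-sink toggle ('Allow schema drift') in the data flow designer. As a rough sketch, with illustrative stream and column names (not taken from the examples below), the data flow script generated behind the designer looks something like this:

    source(output(
            movieId as string,
            title as string
        ),
        allowSchemaDrift: true,
        validateSchema: false) ~> moviesSource
    moviesSource sink(allowSchemaDrift: true,
        validateSchema: false) ~> moviesSink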

Source


When you choose to allow schema drift on the source, any additional columns that are not part of the dataset definition are added to the end of the source projection. Let's look at an example.

The actual data file looks like this. In this case it is a CSV, and 'First row as header' is turned on.


Scenario 1
The dataset schema was defined as follows –


As you can see, ‘gross’ is missing from this file and ‘rated’ is a new column. The resulting data preview with schema drift will look like this.



Pay special attention to the column order: all columns are retained in the same order as defined in the dataset. The gross column has NULL values since it was not found in the file, and rated is added as a new column at the end. Schema drift works even if the column order in the actual source file is different.
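In data flow script terms, the Scenario 1 source is roughly the following (the column types are assumed). Note that 'rated' never appears in the output() projection; with schema drift on it is simply appended to the end of the stream at runtime, while the defined-but-missing 'gross' column is filled with NULLs:

    source(output(
            movieId as string,
            title as string,
            gross as string
        ),
        allowSchemaDrift: true,
        validateSchema: false) ~> scenario1Source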

Scenario 2
Let’s go to the extreme: an empty dataset with no schema at all, and see what the behavior is.

The data preview will look like this.


It works like Scenario 1, except that every column is found to be new and added to the end of the list. Hence the order matches what you see in the source file.
TIP – This is often misinterpreted. If you forget to ‘Import schema’ in your dataset, you still get data preview results because of schema drift. However, if you inspect the columns in subsequent transforms, you will find the list empty! The designer only presents columns defined in the dataset; data preview shows how things work at runtime.
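To make the TIP concrete, here is a hedged sketch of a schema-less source in data flow script. Even though the designer shows no columns downstream, a drifted column can still be referenced at runtime with the byName() expression (the column and stream names here are assumptions):

    source(allowSchemaDrift: true,
        validateSchema: false) ~> schemalessSource
    schemalessSource derive(movieTitle = toString(byName('title'))) ~> useDriftedColumn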

Scenario 3

Let's try the same exercise with the CSV headers turned off. It should give the same behavior, correct? Not quite.

The projection in the source looks like this.



Tip: Since it is a header-less CSV, you can change the column names in the source projection. Otherwise they will appear as Col0, Col1, and so on.

The data preview on this shows a result different from Scenario 1.


What happened here?! A CSV with no headers matches everything positionally. So the first column in your data maps to movieId, the third to gross, and so on.

Does this mean schema drift does not apply to header-less CSVs? Look at the next scenario to understand better.

Scenario 4

The data file has a new 'gross' column at the end.


The projection in the source looks like this.


Now if we preview, the data shows up like this.


So, the 'gross' column shows up at the end, but it is named _col0 because data flow generates a pseudo name for it. Great, after all this I have a column with a cryptic name that I don’t know how to deal with. TIP: You can use the ‘pos’ function to identify a column by its position. That’s for a different blog.
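Until then, here is a hedged sketch that renames such a drifted column downstream using the byPosition() expression, which looks a column up by its ordinal (the position number, stream names and types are assumptions):

    source(output(
            movieId as string,
            title as string
        ),
        allowSchemaDrift: true,
        validateSchema: false) ~> headerlessSource
    headerlessSource derive(gross = toString(byPosition(3))) ~> nameDriftedColumn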

Scenario 5
This is a common scenario where users complain that they are not getting the data preview results they expect.

The actual data file is a CSV with no headers (as in Scenario 3), but you unintentionally checked ‘First row as header’ in the dataset.

The projection looks like this.


The preview shows up like this.



What happened here is that data flow treated the first data row as the header, so ‘1’ and ‘Rambo’ turned out to be column names!
So why am I getting NULL values for the rest of the columns? Because data flow matched the columns by name, did not find movieId, title, etc., and filled them with NULLs.

To test this, let’s turn schema drift off and check the results.


OK, I lose the drifted columns and, in both cases, I lose a row to the header.

TIP: If you write this to a sink with schema drift off, you will get a file filled with NULLs!
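For the drifted columns to make it to the output at all (more on sinks in the next post), the sink needs schema drift turned on as well. A minimal hedged sketch, with the stream names assumed:

    misconfiguredSource sink(allowSchemaDrift: true,
        validateSchema: false) ~> csvSink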

Validate schema –
Validate schema is intended to make sure that the columns you have defined in the dataset actually exist in the source. If you had turned validate schema on in Scenario 1, the data flow would have failed with a validation exception at runtime. It is a way to detect missing columns early, so the downstream transforms do not need to do the hard work.

What does ‘missing columns’ mean for a header-less CSV? If you have six columns defined in the dataset, it checks positionally whether at least six columns exist in the source file.

You can use validate schema with schema drift turned off too! Powerful, right?
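In the script, validate schema is just another flag on the source and can be combined with schema drift either way. A hedged sketch with assumed column names:

    source(output(
            movieId as string,
            title as string,
            gross as string
        ),
        allowSchemaDrift: false,
        validateSchema: true) ~> strictSource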

Schema drift on SQL Sources
A SQL source works similarly to a CSV with headers. If the table or query returns new columns, they are added to the end of the projection. Missing columns are filled with NULLs.

Validate schema for SQL not only checks the column names but also column datatype compatibility as defined in the dataset.
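For a SQL source the projection carries datatypes, and that is what the extra compatibility check runs against. A hedged sketch (the table columns and types are assumptions):

    source(output(
            movieId as integer,
            title as string,
            gross as double
        ),
        allowSchemaDrift: true,
        validateSchema: true) ~> sqlMoviesSource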

We will explore schema drift in the sink and how you can use it in conjunction with the source in my next blog.
