All about dataflow metadata - Part I
Let’s get started by creating a simple dataflow that reads a CSV file, upper-cases a column, and writes the result to a sink.
Visually, the dataflow looks like this.
The JSON code looks like this.
Unfortunately, JSON escaping makes the metadata extremely hard to read! And if I switch to ‘Plan’, I can view it but cannot edit anything! I ended up using https://jsonformatter.org/json-unescape to unescape the metadata and make it readable.
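For illustration, here is a trimmed sketch of what the dataflow JSON can look like, with the entire script embedded as a single escaped string. The names and columns below are made up for this example:

{
  "name": "MovieDataflow",
  "properties": {
    "type": "MappingDataFlow",
    "typeProperties": {
      "sources": [ { "name": "MovieSource" } ],
      "sinks": [ { "name": "MovieSink" } ],
      "transformations": [ { "name": "NormalizeTitle" } ],
      "script": "source(output(\n\t\ttitle as string\n\t)) ~> MovieSource\nMovieSource derive(title = upper(title)) ~> NormalizeTitle"
    }
  }
}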
We will examine the source UI and its metadata together. I have annotated every configuration that appears in the UI with how it is expressed in the metadata.
The metadata for the source looks like this.
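In case the screenshot is hard to read, a minimal source in the script has this shape. The column names here are assumptions I will reuse throughout this example:

source(output(
        movieId as string,
        title as string,
        ticketPrice as double
    ),
    allowSchemaDrift: true,
    validateSchema: false) ~> MovieSource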
Did you not tell me this was a programming language? Isn’t the UI generating Scala code that gets compiled and run on Databricks? Where are the for loops, variables, cursors, and all the things I love in programming??
Fortunately, all of the statements above are false. Don’t be misled by the script tag in the JSON into thinking this is a programming platform like VBScript.
Dataflow metadata is a specification that solves a data integration problem. The dataflow engine interprets this specification dynamically at runtime and runs it at scale on any Spark distribution! No code, no compilation, no libraries to link, no deployment: just metadata, with full data lineage.
Let’s move to the derive transform and examine its metadata. It follows a syntax of one stream applying a transformation to produce another stream:
inputStream transformType(...) ~> outputStream
Derive has the structure:
MovieSource derive(...) ~> NormalizeTitle
As the source does not have any inputs, its structure looks like:
source(...) ~> MovieSource
The metadata for the derive looks like this.
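As a sketch, assuming the column being upper-cased is called title, the derive in the script reads:

MovieSource derive(title = upper(title)) ~> NormalizeTitle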
Finally, let’s look at the sink. It has the input columns from the dataset and the columns I choose to map. In the metadata it is expressed as follows.
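A minimal sketch, assuming a sink stream named MovieSink and the same made-up columns:

NormalizeTitle sink(allowSchemaDrift: true,
    validateSchema: false,
    mapColumn(
        movieId,
        title,
        ticketPrice
    )) ~> MovieSink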
Let’s try to manually change the metadata for the derive and add a new column, fullPrice, by rounding ticketPrice. We will use the escape facility (https://jsonformatter.org/json-escape) to escape the edited content back into the JSON.
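Inside the JSON, the edited script ends up as an escaped one-liner, roughly like this (again using the made-up columns):

"script": "MovieSource derive(title = upper(title),\n\t\tfullPrice = round(ticketPrice)) ~> NormalizeTitle"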
After updating the script tag in ADF, the derive transformation looks like this.
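Unescaped, the updated derive reads as follows; round() is a built-in dataflow expression function:

MovieSource derive(title = upper(title),
    fullPrice = round(ticketPrice)) ~> NormalizeTitle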
Voilà! I can switch back and forth to a textual mode whenever I need to. It gives me the best of both worlds: a UI that can assist novices, with the flexibility of power editing at any time.
Understanding dataflow metadata empowers you to:
1. Generate dataflows outside ADF without using the UI.
2. Make mass edits to a dataflow, like search-and-replace of column names across transforms.
3. Understand runtime errors, which are reported with script line numbers.
4. Handle mundane edits that would require too many clicks in the UI.
This was a primer on dataflow metadata, but we are just scratching the surface. We will explore transformation syntax, datatypes, expression syntax, hierarchical functions, etc. in future blogs. You can start to see how this differs from M, SQL, notebooks, or U-SQL.