Friday 21 June 2019

All about dataflow script - Part I


ADF dataflow is a fully metadata-driven data integration service on Azure. It is powered by a definition of the data flow that manifests as a JSON script tag. The dataflow script, or metadata, is a declarative language for describing your ETL flow. There is no official Microsoft term for this, so don't go looking for it in the docs; we will just call it metadata. It is declarative like SQL, not imperative like procedural languages such as Java, Scala or Python. Is this yet another Microsoft language? Not quite. It is expressive metadata describing how data gets read, transformed and written. If you peel back the layers, it has a 1-to-1 correspondence to the dataflow UI.

Let’s get introduced by creating a simple dataflow that reads a CSV file, upper-cases a column and writes it to a sink.

Visually the dataflow looks like this.



The JSON code looks like this



Unfortunately, JSON escaping makes the metadata extremely hard to read! If I switch to ‘Plan’ I can view the script but cannot edit anything! I ended up using https://jsonformatter.org/json-unescape to unescape the metadata and make it readable.
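
Once unescaped, the whole flow reads roughly like the sketch below. This assumes a movies dataset with movieId, title and ticketPrice columns and a sink I have named MovieSink; your column names and settings will differ, but the shape is the same.

source(output(
    movieId as string,
    title as string,
    ticketPrice as double
  ),
  allowSchemaDrift: true,
  validateSchema: false) ~> MovieSource
MovieSource derive(title = upper(title)) ~> NormalizeTitle
NormalizeTitle sink(input(
    movieId as string,
    title as string,
    ticketPrice as double
  ),
  mapColumn(
    movieId,
    title
  )) ~> MovieSink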


We will examine the source UI and metadata together. I have annotated every configuration that appears in the UI with how it is expressed in the metadata.



Metadata for source looks like
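something along these lines (a sketch using the assumed movieId, title and ticketPrice columns; options such as allowSchemaDrift mirror settings in the source UI):

source(output(
    movieId as string,
    title as string,
    ticketPrice as double
  ),
  allowSchemaDrift: true,
  validateSchema: false) ~> MovieSource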

Did you not tell me this was a programming language? Where is the UI generating Scala code that gets compiled and shipped to Databricks? Where are the for loops, variables, cursors and all the things I love in programming?
Fortunately, all the statements above are false. Don't be misled by the script tag in the JSON into thinking this is a programming platform like VBScript.
Dataflow metadata is a specification for solving a data integration problem. The dataflow engine can interpret this specification dynamically at runtime and run it at scale on any Spark distribution!!! No code, no compilation, no libraries to link, no deployment: just metadata, with full data lineage.

Let’s move to the derive transform and examine the metadata. Every transformation follows the same pattern: one stream applies a transformation to produce another stream.
inputStream transformType(….) ~> outputStream

Derive has the structure
MovieSource derive(….) ~> NormalizeTitle

As the source does not have any inputs, its structure looks like
source(….) ~> MovieSource


Metadata for it looks like
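something like this one-liner (a sketch; upper() is the expression that upper-cases the title column):

MovieSource derive(title = upper(title)) ~> NormalizeTitle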


Finally, let's look at the sink. It has the input columns from the dataset and the columns I choose to map.



In metadata it is expressed as
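roughly as follows (a sketch: input() carries the dataset's columns and mapColumn() lists the ones I chose to map; the MovieSink name is just my label):

NormalizeTitle sink(input(
    movieId as string,
    title as string,
    ticketPrice as double
  ),
  mapColumn(
    movieId,
    title
  )) ~> MovieSink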


Let’s try to manually change the metadata for the derive transform and add a new column, fullPrice, by rounding ticketPrice. We will use the escape facility (https://jsonformatter.org/json-escape) to escape the content back to JSON.
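
For example, the derive line could be hand-edited to something like this before escaping it back into the JSON (round() is the expression function doing the rounding; fullPrice is the new column):

MovieSource derive(title = upper(title),
  fullPrice = round(ticketPrice)) ~> NormalizeTitle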


After updating the script tag in ADF, the derive transformation looks like


Voila! I can switch back and forth to a textual mode whenever I need to. It gives me the best of both worlds: a UI that assists novices, plus the flexibility of power editing at any time.

Understanding dataflow metadata empowers you to:
1. Generate dataflows outside of ADF without using the UI
2. Make mass edits to a dataflow, like search-and-replace of column names across transforms
3. Understand runtime errors, which are reported with script line numbers
4. Handle mundane edits that would require too many clicks in the UI

This was a primer on dataflow metadata, but we are just scratching the surface. We will explore transformation syntax, data types, expression syntax, hierarchical functions, etc. in future blogs. You can already start to see how this differs from M, SQL, notebooks or U-SQL.


