Sunday, 23 June 2019

All about dataflow expressions - Primer

ADF dataflow has a very rich expression language for describing transformation intent, and it is part of the dataflow metadata specification. Expressions are ingrained everywhere in the metadata, from creating new columns to grouping data to partitioning for optimization. They are the ABCs of dataflow metadata. So, if you know Excel, M or SQL expressions, this is equivalent, right? Yes and no. 'Yes' in the sense that the syntax looks alike; 'no' in the sense that the similarity stops there. To understand this better, you first need to know the datatypes of dataflow.




Dataflow supports the following datatypes for column values:
1.     String – Unicode set of characters
2.     Boolean – a true or false value
3.     Binary – byte array
4.     Date – date with month/day/year
5.     Timestamp – date with hour/minute/second/millisecond precision
6.     Integral type – any non-fractional type or whole number
      a.     Byte – 8 bit
      b.     Short – 16 bit
      c.     Integer – 32 bit
      d.     Long – 64 bit
7.     Fractional type – any number that can represent fractions
      a.     Float – single precision with 32 bits
      b.     Double – double precision with 64 bits
      c.     Decimal – high precision fractional with up to 38 digits of precision and scale. If you have financial math that requires a high degree of precision and cannot tolerate rounding errors, use Decimal. Decimal math can be a shade slower than float or double. More on this after you master the sword.

Dataflow also has an array type, structures to represent hierarchical content, maps for mapping one type to another, and combinations of all of these. That's for a different blog, after you get the basics.

Ok, so my favorite GUID datatype in SQL has no representation in dataflow? Dataflow can support newer datatypes with their behavior in the future, but until then a value maps to the closest thing it can (a string or a binary type). Don't confuse this with the datatypes your source system provides. Dataflow does the mapping behind the scenes to make sure you have no data loss when you read from or write to that source system.

Back to the difference between dataflow metadata and any other expression system. It has to do with the grammar. Just like natural language grammar, the same words can be interpreted differently in different countries. Let's take a simple example.
col1 + col2

That's easy. Adding two numbers. Not so fast!!
If col1 and col2 are strings, it means concatenation of strings.
If col1 is a timestamp and col2 is a number, it means adding that many milliseconds to the timestamp.
If col1 is a high precision decimal (an account balance, say) and col2 is an integer, the result is a high precision decimal.
If col1 is a string and col2 is a number, it is an error!!!
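
To make this concrete, here is a minimal hypothetical derive; the stream and column names (Orders, firstName, lastName, orderTime, delayMs, balance, qty) are made up for illustration:

Orders derive(
            fullName = firstName + lastName,
            shiftedTime = orderTime + delayMs,
            newBalance = balance + qty
 ) ~> PlusExamples

fullName concatenates two strings, shiftedTime adds delayMs milliseconds to a timestamp, and newBalance stays a high precision decimal.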

Now do you start to see why expressions look similar but can mean different things? Don't go running to paste in your U-SQL or Excel expressions without careful inspection.

All expressions are made of three things: operands, operators and functions.
If this is all sounding like the compiler class you skipped in school, dataflowninja will help you cover for it :)

An operand is a value you are acting on. In its simplest form, it is a column.
A function is a routine you call on one or more operands.
An operator is a shorthand notation for a function.

Let's look at a small snippet of dataflow metadata.

Movies derive(
            upperTitle := upper(title),

            fullName = :upperTitle + concat(initCap(firstName), lastName),

            balance = amount * floor($discount) % 100,

            partitionBy('key', trim(country))
 ) ~> DeriveMovies

The right-hand side of each assignment in this snippet is an expression, built from the components described below.

Operands

There are 6 types of operands you can use:

1.     Column – a reference to a column in the input stream. If the name is alphanumeric (letters optionally followed by numbers), you can use it directly in the expression. If not, you should enclose it in {}.
      dumboCount
      {count of #dumbos} references a column named count of #dumbos

2.     Constant – a constant value. These can be:

       a.     String constant – a set of characters enclosed in single quotes. Use a \ character to escape a single quote in the value. Everything within the single quotes is the actual value.

      'dumbo' equates to dumbo
      'dumbo\'o big' equates to dumbo'o big

          b.   Integer constant – a whole number that can be held in an integer datatype
              123

          c.   Long constant – a whole number that can be held in a long datatype. Notice it is suffixed with 'L'
              123123123123L

          d.   Double constant – a fractional number containing a . (dot)
              123.456

          e.   Float constant – a fractional number that can be held in a float datatype, suffixed with 'f'
              123.456f

          f.   Boolean constant – true or false. You can also use the true() or false() functions if you have columns named true or false!
              true

          g.   Regex constant – a special version of the string constant where you don't need regex escaping rules. It starts and ends with a backquote `
              `(\d+) and (\d+)` is the same as '(\\d+) and (\\d+)'

3.     Parameter – if you are passing parameters into your dataflow, they can be referenced in expressions with the $ syntax. Parameters have datatypes like any other operand.
      $discountCode

4.     Transient – transients allow you to store a repeated computation in a temporary column. They cannot be used outside the transformation they are defined in. You reference a transient with the : prefix (see the combined sketch after this list).
      :orderTempTotal

5.     Function outputs – the output of a function can itself be used as an operand.
      concat('dum', upper('bo'))

6.     Meta operands – special inputs that result from pattern matching functions. We will cover these in a future blog on schema drift and transformations.
      $1
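
Here is a combined sketch that pulls several operand types together: columns, constants, a parameter, a transient and function outputs. It is hypothetical; the stream, column and parameter names are made up, the transient syntax follows the snippet shown earlier, and the toString() conversion assumes numeric values need explicit conversion before string concatenation.

Orders derive(
            orderTempTotal := price * qty,
            withTax = :orderTempTotal * (1 + $taxRate),
            label = concat(upper(region), ' #', toString(:orderTempTotal))
 ) ~> ComputeTotals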

Functions

Functions take operands as input and return an output, which is again of a certain datatype. Functions in dataflow have either a fixed or a variable number of inputs.

upper(title)
concat('d', 'u', 'm', 'b', 'o')

Some functions also have optional inputs; you can choose to pass a value or omit it.
trim(title)
trim(title, '-')

Operators

Operators are shorthand notation for some functions. It is highly encouraged to use operators for brevity instead of function equivalents.
orderTotal * {discount%} + 100
add(multiply(orderTotal, {discount%}), 100)

Operators supported are
1.     -  unary minus to negate a numeric value
2.     !  negate a boolean value
3.     *  multiply
4.     /  divide
5.     %  modulus
6.     + add
7.     -  subtract
8.     == equals
9.     === equal with null (we will learn about NULL in a different blog)
10.  != not equal
11.  <=> equalsIgnoreCase for strings
12.  > greater
13.  < lesser
14.  >= greaterOrEqual
15.  <= lesserOrEqual

Operators have precedence, meaning some apply before others. The operators above are listed in precedence order.

100 + orderTotal * {discount%} is the same as orderTotal * {discount%} + 100, as * precedes +. When you need a different order of evaluation, use parentheses: (100 + orderTotal) * {discount%}.

TIP: Did you know dataflow metadata has no reserved words? You can name your columns Select or Table or {I love Dumbledore!@!#^&}
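
As a quick, hypothetical sketch of that tip in action (the stream and output names are made up):

Movies derive(
            {Select} = upper({I love Dumbledore!@!#^&})
 ) ~> NoReservedWords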

Now that you are armed with the basics of how expressions work, we will explore tips and tricks and advanced expression techniques in subsequent blogs.




Friday, 21 June 2019

All about dataflow script - Part I



ADF dataflow is a fully metadata driven data integration service on Azure. It is powered by a definition of the dataflow that manifests as a JSON script tag. The dataflow script, or metadata, is a declarative language for describing your ETL flow. There is no official Microsoft term for this, so don't go looking for it in the docs; we will just call it metadata. It is declarative like SQL, not imperative like procedural languages such as Java, Scala or Python. Is this yet another Microsoft language? Not quite. It is expressive metadata describing how data gets read, transformed and written. If you peel the layers, it has a 1-to-1 correspondence to the dataflow UI.

Let's get introduced by creating a simple dataflow that reads a CSV file, upper-cases a column and writes the result to a sink.

Visually the dataflow looks like this.



The JSON code looks like this
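
Since the script travels inside the pipeline JSON as an escaped string, it looks roughly like this; the resource shape and names here are an illustrative sketch, not copied from the product:

{
  "name": "MovieFlow",
  "properties": {
    "type": "MappingDataFlow",
    "typeProperties": {
      "script": "source(output(\n\t\ttitle as string,\n\t\tticketPrice as double\n\t)) ~> MovieSource\nMovieSource derive(upperTitle = upper(title)) ~> NormalizeTitle\nNormalizeTitle sink() ~> MovieSink"
    }
  }
}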



Unfortunately, JSON escaping makes the metadata extremely hard to read! If I switch to 'Plan' I can view it but cannot edit anything! I ended up using https://jsonformatter.org/json-unescape to unescape the metadata and make it readable.


We will examine the source UI and metadata together. I have annotated every configuration that appears in the UI with how it is expressed in the metadata.



Metadata for source looks like
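
A minimal sketch of what the unescaped source metadata might look like; the column list and the option names (allowSchemaDrift, validateSchema) are assumptions for illustration:

source(output(
        movieId as string,
        title as string,
        ticketPrice as double
    ),
    allowSchemaDrift: true,
    validateSchema: false) ~> MovieSource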

Did you not tell me this was a programming language? Where is the UI generating Scala code that gets compiled into Databricks? Where are the for loops, variables, cursors and the things I love in programming?
Fortunately, all the statements above are false. Don't be misled by the script tag in the JSON into thinking this is a programming platform like VBScript.
Dataflow metadata is a specification for solving a data integration problem. The dataflow engine interprets this specification dynamically at runtime and runs it at scale on any Spark distribution!!! No code, no compilation, no libraries to link, no deployment; just metadata driven, with full data lineage.

Let's move to the derive transform and examine the metadata. It follows a syntax of one stream applying a transformation to produce another stream.
inputStream transformType(...) ~> outputStream

Derive has the structure
MovieSource derive(...) ~> NormalizeTitle

As the source does not have an input stream, its structure looks like
source(...) ~> MovieSource


Metadata for it looks like
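
A sketch of what the derive metadata might contain, assuming the title column from the source:

MovieSource derive(upperTitle = upper(title)) ~> NormalizeTitle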


Finally, let's look at the sink. It has the input columns from the dataset and the columns I choose to map.



In metadata it is expressed as
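
A rough sketch of the sink metadata; the input() schema and the mapColumn() mapping shown here are illustrative assumptions:

NormalizeTitle sink(input(
        movieId as string,
        title as string
    ),
    mapColumn(
        movieId,
        title = upperTitle
    )) ~> MovieSink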


Let's try to manually change the metadata for the derive transformation and add a new column fullPrice by rounding ticketPrice. We will use the escape facility (https://jsonformatter.org/json-escape) to escape the content back to JSON.
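
The edited derive metadata might look like this; round() is from the expression language, and the column names follow the earlier sketches:

MovieSource derive(upperTitle = upper(title),
        fullPrice = round(ticketPrice)) ~> NormalizeTitle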


After updating the script tag in ADF, the derive transformation looks like


Voila!!!! I can switch back and forth to a textual mode whenever I need to. It gives me the best of both worlds: a UI that assists novices, with the flexibility of power editing at any time.

Understanding dataflow metadata can empower you to –
1.     Generate dataflows outside ADF without using the UI
2.     Make mass edits to a dataflow, like search-and-replace of column names across transforms
3.     Understand runtime errors, which are reported with script line numbers
4.     Handle mundane edits that would require too many clicks in the UI

This was a primer on dataflow metadata, but we are just scratching the surface. We will explore transformation syntax, datatypes, expression syntax, hierarchical functions etc. in future blogs. You can start to see how this differs from M, SQL, notebooks or U-SQL.