File handling
ADF Data Flow offers rich file handling for both source and sink. Let's explore some facilities that will improve file handling in your data flows.
Sources
Wildcard paths –
It is highly recommended to use the wildcard feature to iterate over files instead of using a Lookup activity and ForEach loop in the pipeline; it is more performant to iterate and partition your data within the data flow. You can enter multiple wildcard directories for any file-based source, including ADLS Gen1/Gen2, Blob, S3 or SFTP storage. Wildcards follow the popular Linux globbing conventions. Assume a directory structure of the sort
/data
/sales
/<year>
/<month>
/<day>
* – Represents any set of characters
** – Represents recursive directory nesting
? – Replaces one character
[] – Matches one of the characters in the brackets
/data/sales/**/*.csv – Gets all csv files under /data/sales
/data/sales/20??/** – Gets all files under year directories starting with 20 (2000–2099)
/data/sales/2004/*/12/[XY]1?.csv – Gets all csv files under day 12 of any month in 2004 whose names start with X or Y followed by 1 and one more character
The container has to be specified in the dataset, and make sure your wildcard path includes the folder path from the container root.
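If you want to sanity-check a pattern before wiring it into a source, Python's built-in glob module supports the same *, **, ? and [] syntax, so you can preview locally which files a wildcard would pick up. A minimal sketch, assuming a local copy of the hypothetical /data/sales layout above:

import glob

# Every csv file anywhere under /data/sales (like /data/sales/**/*.csv above).
all_csv = glob.glob("/data/sales/**/*.csv", recursive=True)

# Everything under year directories whose names start with 20 (2000-2099).
recent = glob.glob("/data/sales/20??/**", recursive=True)

# Day 12 of any month in 2004, file names starting with X or Y,
# then "1", then exactly one more character (for example X15.csv).
day12 = glob.glob("/data/sales/2004/*/12/[XY]1?.csv")

for path in sorted(all_csv):
    print(path)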
Base directory –
This is used to specify the root of a partitioned directory when your wildcard path reads only a subset of the content. Assume a directory that is partitioned as per Spark guidelines:
/data
/sales
/year=2000
/month=1
/month=2
…
/year=2001
If your wildcard path is /data/sales/*/month=12/*, it will read the December content of every year. The projection of this source will not include year and month as columns.
However, if you specify the base path to be /data/sales, year and month are included in the resulting projection. The data for these columns is derived from the directory names (everything that comes after the =).
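Conceptually, those extra columns come from splitting each matched path below the base directory into key=value segments. A rough Python sketch of that idea, as an illustration of the behaviour rather than ADF's actual implementation:

import os

def partition_columns(file_path, base_path):
    """Derive partition columns (for example year and month) from the
    key=value directory names below the base path."""
    relative = os.path.relpath(file_path, base_path)
    columns = {}
    for segment in relative.split(os.sep)[:-1]:  # skip the file name itself
        if "=" in segment:
            key, value = segment.split("=", 1)
            columns[key] = value
    return columns

# /data/sales is the base path; the wildcard matched a December file.
print(partition_columns("/data/sales/year=2000/month=12/part-0001.csv", "/data/sales"))
# {'year': '2000', 'month': '12'}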
List of files –
Instead of compiling the files to process in the pipeline, you can prepare an external file which contains a list of wildcarded paths to process. This file can be created by an external process outside ADF. Each line in this file points to a path you would otherwise have entered manually in the wildcard path.
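Following that description, a hypothetical list file (reusing the sample /data/sales layout, with made-up file names) might look like:

/data/sales/2004/01/12/sales.csv
/data/sales/2004/02/12/sales.csv
/data/sales/2005/03/01/sales.csv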
Column to store file name –
This is a powerful feature for capturing the lineage of every row in your source. You can add a synthetic column to your source that holds the full path of the file each row originated from. You can use this facility together with the symmetric option in the sink ('As data in column') to transform your data and potentially mirror the directory structure of the source. Make sure the column name does not conflict with any existing column name in your dataset.
If your directory structure does not follow the Spark guidelines for key partitioning, you can use the file URL data and parse out the columns in a subsequent Derived Column transformation:
Source (with file URL column) -> Derived Column (extracts year/month as two new columns from the file URL) -> Sink
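As a rough sketch of what that Derived Column step computes (plain Python for illustration, not the data flow expression language; it assumes the /data/sales/<year>/<month>/<day> layout from earlier and a made-up file name):

def year_month_from_url(file_url):
    """Pull year and month out of a path shaped like
    /data/sales/<year>/<month>/<day>/<file>."""
    parts = file_url.strip("/").split("/")
    # parts: ['data', 'sales', year, month, day, file name]
    return parts[2], parts[3]

year, month = year_month_from_url("/data/sales/2004/12/25/orders.csv")
print(year, month)  # 2004 12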
Post processing –
After successful completion of your data flow, you can choose to perform certain actions on the source files: either delete or move the files that were sourced.
The move action moves files from one directory path to another. If you are reading from a folder with no wildcard paths, the 'from' directory will typically be the same as your source folder.
However, let's look at an example where you have a wildcarded path:
/data/sales/20??/**/*.csv
You can specify ‘from’ as
/data/sales
And ‘to’ as
/backup/priorSales
Here all subdirectories under /data/sales that were sourced (the csv files under year folders starting with 20) are moved relative to /backup/priorSales (see the sketch after the listing below).
Your resulting directory after the move would look like
/backup/priorSales
/2000
/01
/02
…
/2001
….
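In other words, the move keeps each matched file's path relative to the 'from' directory and re-roots it under the 'to' directory. A small Python sketch of that relative re-rooting (an illustration only, with a made-up file name, not how ADF performs the move):

import os

def moved_path(source_file, move_from, move_to):
    """Re-root a sourced file from the 'from' directory to the 'to' directory,
    keeping the subdirectory structure underneath 'from'."""
    relative = os.path.relpath(source_file, move_from)
    return os.path.join(move_to, relative)

print(moved_path("/data/sales/2000/01/05/sales.csv", "/data/sales", "/backup/priorSales"))
# /backup/priorSales/2000/01/05/sales.csv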
Sink
Output options –
There are multiple options to name and output your files at the sink.
a. Default – This option follows the Spark guidelines of generating files with the 'part' prefix. The directory will also have the _SUCCESS marker file. This is suitable if your downstream applications read from a folder and understand the rules for ignoring marker files.
b. Pattern – This option allows you to specify a naming pattern for your files. If the pattern is order_[2n]_prior.csv, you will get files for every partition named
order_01_prior.csv
order_02_prior.csv
…
c. Per partition – If you know the number of partitions in the sink, you can explicitly name the file for every partition.
Partition 1 – January_Sales.csv
Partition 2 – Feb_Sales.csv
…
d. As data in column – This is the mirror of the 'Column to store file name' option in the source. You can identify which path every row in your sink needs to go to; this is completely data dependent. The column is synthetic and never makes it into the actual data file. The folderPath in the dataset is disregarded with this option, and you can write your data to multiple files across different directories. As a performance tip, this option is not recommended if the output file size per partition is large: all data for a single file has to be co-located on one node of the cluster, which may run out of memory. Either choose a bigger cluster or further split the large files into smaller files by using a secondary key.
e. Output to single file – This option writes the entire data set to one file. It is only recommended when your data size is not large.
f. Key partition – This allows you to place your data on disk partitioned by the keys you specify. Every key starts a new directory level, and the order of the keys determines the nesting structure in the output.
/data/orders
/year=2000
/country=US
/country=UK
…
/year=2001
…
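As a rough illustration of how key partitioning lays files out, here is a plain Python sketch of the directory naming for a single row, assuming hypothetical year and country keys like the tree above (ADF/Spark does the actual writing):

import os

def partition_dir(base, keys, row):
    """Build the nested key=value directory for one row,
    nesting in the order the partition keys are listed."""
    parts = [f"{key}={row[key]}" for key in keys]
    return os.path.join(base, *parts)

row = {"year": 2000, "country": "US", "amount": 42}
print(partition_dir("/data/orders", ["year", "country"], row))
# /data/orders/year=2000/country=US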