File handling
ADF Data Flow offers rich file handling for both source and sink. Let's explore some facilities that will improve file handling in your data flows.
Sources
Wildcard paths –
It is highly recommended to use the wildcard feature to iterate over files instead of using a Lookup activity and ForEach loop in the pipeline; it is more performant to iterate and partition your data within the data flow. You can enter multiple wildcard directories for any file-based source, including ADLS Gen1/Gen2, Blob, S3 or SFTP storage. Wildcards follow the popular Linux globbing conventions. Assume a directory structure of the sort
/data
/sales
/<year>
/<month>
/<day>
* – Represents any set of characters
** – Represents recursive directory nesting
? – Replaces one character
[] – Matches one of the characters in the brackets
/data/sales/**/*.csv – Gets all csv files under /data/sales
/data/sales/20??/** – Gets all files under year directories starting with 20 (2000–2099)
/data/sales/2004/*/12/[XY]1?.csv – Gets all csv files under day 12 of any month in 2004 whose names start with X or Y followed by 1 and one more character
The container has to be specified in the dataset, and make sure your wildcard path includes the folder path from the container root.
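If you want to sanity-check a pattern before wiring it into a source, Python's built-in glob module supports the same *, **, ? and [] syntax, so you can preview locally which files a wildcard would pick up. A minimal sketch, assuming a local copy of the hypothetical /data/sales layout above:

import glob

# Every csv file anywhere under /data/sales (like /data/sales/**/*.csv above).
all_csv = glob.glob("/data/sales/**/*.csv", recursive=True)

# Everything under year directories whose names start with 20 (2000-2099).
recent = glob.glob("/data/sales/20??/**", recursive=True)

# Day 12 of any month in 2004, file names starting with X or Y,
# then "1", then exactly one more character (for example X15.csv).
day12 = glob.glob("/data/sales/2004/*/12/[XY]1?.csv")

for path in sorted(all_csv):
    print(path)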
Base directory –
This is used to specify the root of a partitioned directory when your wildcard path reads only a subset of the content. Assume a directory that is partitioned as per Spark guidelines:
/data
/sales
/year=2000
/month=1
/month=2
…
/year=2001
If your wildcard path is /data/sales/*/month=12/*, it will read the December content of every year. The projection of this source will not include year and month as columns.
However, if you specify the base path to be /data/sales, year and month are included in the resulting projection. The data for these columns is derived from the directory names (everything that comes after the =).
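Conceptually, those extra columns come from splitting each matched path below the base directory into key=value segments. A rough Python sketch of that idea, as an illustration of the behaviour rather than ADF's actual implementation:

import os

def partition_columns(file_path, base_path):
    """Derive partition columns (for example year and month) from the
    key=value directory names below the base path."""
    relative = os.path.relpath(file_path, base_path)
    columns = {}
    for segment in relative.split(os.sep)[:-1]:  # skip the file name itself
        if "=" in segment:
            key, value = segment.split("=", 1)
            columns[key] = value
    return columns

# /data/sales is the base path; the wildcard matched a December file.
print(partition_columns("/data/sales/year=2000/month=12/part-0001.csv", "/data/sales"))
# {'year': '2000', 'month': '12'}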
List of files –
Instead of compiling the files to process in the pipeline, you can prepare an external file which contains a list of wildcarded paths to process. This file can be created by an external process outside ADF. Each line in this file points to a path you would otherwise have entered manually in the wildcard path.
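Following that description, a hypothetical list file (reusing the sample /data/sales layout, with made-up file names) might look like:

/data/sales/2004/01/12/sales.csv
/data/sales/2004/02/12/sales.csv
/data/sales/2005/03/01/sales.csv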
Column to store file name –
This is a powerful feature for capturing the lineage of every row in your source. You can add a synthetic column to your source that holds the full path of the file each row originated from. You can use this facility together with the symmetric option in the sink ('As data in column') to transform your data and potentially mirror the directory structure of the source. Make sure the column name does not conflict with any existing column name in your dataset.
If your directory structure does not follow the Spark guidelines for key partitioning, you can use the file URL data and parse out the columns in a subsequent Derived Column transformation:
Source (with file URL column) -> Derived Column (extracts year/month as two new columns from the file URL) -> Sink
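As a rough sketch of what that Derived Column step computes (plain Python for illustration, not the data flow expression language; it assumes the /data/sales/<year>/<month>/<day> layout from earlier and a made-up file name):

def year_month_from_url(file_url):
    """Pull year and month out of a path shaped like
    /data/sales/<year>/<month>/<day>/<file>."""
    parts = file_url.strip("/").split("/")
    # parts: ['data', 'sales', year, month, day, file name]
    return parts[2], parts[3]

year, month = year_month_from_url("/data/sales/2004/12/25/orders.csv")
print(year, month)  # 2004 12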
Post processing –
After successful completion of your data flow, you can choose to perform certain actions on the source files: either delete or move the files that were sourced.
The move action moves files from one directory path to another. If you are reading from a folder with no wildcard paths, the 'from' directory will typically be the same as your source folder.
However, let's look at an example where you have a wildcarded path:
/data/sales/20??/**/*.csv
You can specify ‘from’ as
/data/sales
And ‘to’ as
/backup/priorSales
Here all subdirectories under /data/sales that were sourced (the csv files under year folders starting with 20) are moved relative to /backup/priorSales (see the sketch after the listing below).
Your resulting directory after the move would look like
/backup/priorSales
/2000
/01
/02
…
/2001
….
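In other words, the move keeps each matched file's path relative to the 'from' directory and re-roots it under the 'to' directory. A small Python sketch of that relative re-rooting (an illustration only, with a made-up file name, not how ADF performs the move):

import os

def moved_path(source_file, move_from, move_to):
    """Re-root a sourced file from the 'from' directory to the 'to' directory,
    keeping the subdirectory structure underneath 'from'."""
    relative = os.path.relpath(source_file, move_from)
    return os.path.join(move_to, relative)

print(moved_path("/data/sales/2000/01/05/sales.csv", "/data/sales", "/backup/priorSales"))
# /backup/priorSales/2000/01/05/sales.csv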
Sink
Output options –
There are multiple options to name and output your files at the sink.
a. Default – This option follows the Spark guidelines of generating files with the 'part' prefix. The directory will also have the _SUCCESS marker file. This is suitable if your downstream applications read from a folder and understand the rules for ignoring marker files.
b. Pattern – This option allows you to specify a naming pattern for your files. If the pattern is order_[2n]_prior.csv, you will get files for every partition named
order_01_prior.csv
order_02_prior.csv
…
c. Per partition – If you know the number of partitions in the sink, you can explicitly name the file for every partition.
Partition 1 – January_Sales.csv
Partition 2 – Feb_Sales.csv
…
d. As data in column – This is the mirror of the 'Column to store file name' option in the source. You can identify which path every row in your sink needs to go to; this is completely data dependent. The column is synthetic and never makes it into the actual data file. The folderPath in the dataset is disregarded with this option, and you can write your data to multiple files across different directories. As a performance tip, this option is not recommended if the output file size per partition is large: all data for a single file has to be co-located on one node of the cluster, which may run out of memory. Either choose a bigger cluster or further split the large files into smaller files by using a secondary key.
e. Output to single file – This option writes the entire data set to one file. It is only recommended when your data size is not large.
f. Key partition – This allows you to place your data on disk partitioned by the keys you specify. Every key starts a new directory level, and the order of the keys determines the nesting structure in the output.
/data/orders
/year=2000
/country=US
/country=UK
…
/year=2001
…
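As a rough illustration of how key partitioning lays files out, here is a plain Python sketch of the directory naming for a single row, assuming hypothetical year and country keys like the tree above (ADF/Spark does the actual writing):

import os

def partition_dir(base, keys, row):
    """Build the nested key=value directory for one row,
    nesting in the order the partition keys are listed."""
    parts = [f"{key}={row[key]}" for key in keys]
    return os.path.join(base, *parts)

row = {"year": 2000, "country": "US", "amount": 42}
print(partition_dir("/data/orders", ["year", "country"], row))
# /data/orders/year=2000/country=US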