ETL SIMPLIFIED WITH CLOUD DATAPREP BY TRIFACTA ON GCP

Hi guys, remember how in my last post we spent a long time preprocessing our data to get it ready for modelling and analytics?

Now, what if I told you there is a simple, no-code way to preprocess data, without following the long procedures from my last post?

Yes! Dataprep has you covered: it saves time and lets you reuse the recipes you build when processing other datasets.

What is Dataprep?

Cloud Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis, reporting, and machine learning. Because Cloud Dataprep is serverless and works at any scale, there is no infrastructure to deploy or manage. Your next ideal data transformation is suggested and predicted with each UI input, so you don’t have to write code. With the automatic schema, datatype, possible joins, and anomaly detection, you can skip time-consuming data profiling and focus on data analysis.

Features of Dataprep

i. Predictive transformation

Cloud Dataprep uses a proprietary inference algorithm to interpret the data transformation intent of a user’s data selection. A ranked set of suggestions and patterns for the selections to match are automatically generated.

ii. Parameterization

Execute a recipe across multiple instances of identical datasets by parameterizing a variable to replace the parts of the file path that change with each refresh. This variable can be modified as needed at job runtime.
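Dataprep handles this through its UI, with no code needed. Just to make the idea concrete, here is a minimal Python sketch of what parameterizing a file path means. The bucket name, folder layout, and the `run_date` variable are all made up for illustration, and `apply_recipe` is only a stand-in for a real recipe, not anything from Dataprep itself.

```python
from string import Template

# Hypothetical path pattern: the bucket, folder, and file names are invented.
# Only the $run_date part changes from one refresh to the next.
PATH_TEMPLATE = Template("gs://example-bucket/sales/$run_date/orders.csv")


def resolve_path(run_date: str) -> str:
    """Fill in the part of the path that changes with each refresh."""
    return PATH_TEMPLATE.substitute(run_date=run_date)


def apply_recipe(path: str) -> str:
    """Stand-in for a recipe: it only reports which file it would process."""
    return f"processed {path}"


# The same "recipe" runs against each dated copy of the dataset.
for day in ["2020-01-01", "2020-01-02"]:
    print(apply_recipe(resolve_path(day)))
```

The point is that the recipe stays the same while only the variable part of the path is swapped in at run time.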

iii. Collaboration

In team environments, it helps to have multiple users work on the same assets, or to create copies of good-quality work to serve as templates for others. Cloud Dataprep enables users to collaborate on the same flow objects in real time or to create copies for others to use for independent work.

iv. Pattern matching

Utilize columnar pattern matching to identify data patterns of interest to you and to surface them in the interface for use in building your recipes. Additionally, in your recipe steps, you can apply regular expressions or Cloud Dataprep patterns to locate patterns and transform the matching data in your datasets.
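To give a feel for what a regular-expression step does, here is a small Python sketch using the standard `re` module. It is not Dataprep's own pattern syntax or engine; the sample column values and the phone-number pattern are assumptions chosen for illustration.

```python
import re

# Assumed pattern: US-style phone numbers such as "(555) 123-4567" or "555.123.4567".
PHONE = re.compile(r"\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})")


def normalize_phone(value: str) -> str:
    """Rewrite a matching phone number as 555-123-4567; leave other values unchanged."""
    match = PHONE.fullmatch(value.strip())
    if match is None:
        return value
    return "-".join(match.groups())


# Hypothetical column with mixed formats and one value that does not match.
column = ["(555) 123-4567", "555.123.4567", "n/a"]
cleaned = [normalize_phone(v) for v in column]
print(cleaned)
```

Values that match the pattern come out in one consistent format, and anything that does not match is left alone so it can be reviewed separately.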

v. Visual profiling

See and explore your data through interactive visual distributions of your data to assist in discovery, cleansing, and transformation. Visual representations help interpret large volumes of data, and Cloud Dataprep’s innovative profiling techniques visualize key statistical information in a dynamic, easy-to-consume format.

vi. Sampling

For performance optimization, Cloud Dataprep automatically generates one or more samples of the data for display and manipulation in the client application. However, you can easily change the size of samples, the scope of the sample, and the method by which the sample is created.
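As a rough illustration of how the sampling method and sample size change what you see, here is a short Python sketch. It is not how Dataprep builds its samples internally; the two functions and the toy `rows` list are assumptions made up for this example.

```python
import random


def first_rows_sample(rows, size):
    """Take the first `size` rows: fast, but can miss patterns later in the file."""
    return rows[:size]


def random_sample(rows, size, seed=0):
    """Pick `size` rows at random from the whole dataset (seeded so it is repeatable)."""
    rng = random.Random(seed)
    return rng.sample(rows, min(size, len(rows)))


# Toy dataset standing in for a large file.
rows = list(range(1000))

head = first_rows_sample(rows, 10)
spread = random_sample(rows, 10)
print(head)
print(spread)
```

A first-rows sample only ever shows the top of the file, while a random sample draws from across the whole dataset, which is why switching the method can surface values you had not seen before.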

vii. Scheduling

Schedule the execution of recipes in your flows on a recurring or as-needed basis. When the scheduled job successfully executes, you can collect the wrangled output in the specified output location, where it is available in the published form you specify.

viii. Target matching

Define target schemas, through imported or created datasets, and assign them to an existing recipe to systematize and speed up your wrangling efforts. Targets appear in the Transformer page and can be applied against the entire dataset or selected columns of the dataset you need to wrangle.

ix. Common data types

Transform structured or unstructured datasets, stored in CSV, JSON, or relational table formats, of any size, from megabytes to petabytes, with equal ease and simplicity.

x. Integrated with Google Cloud Platform

Process data stored in Cloud Storage, BigQuery, or from your desktop, then export refined data to BigQuery or Cloud Storage for storage, analysis, visualization, or machine learning. User access and data security are seamlessly managed with Cloud Identity and Access Management.

That was quite a lot of work done by the Trifacta team. Let me share some resources that will guide you through the process of getting started on Dataprep.

Should you find anything challenging and want to talk about it, please reach me via any of the following channels:

Twitter, Email