AWS Glue Databrew: A Guide to Visual Data Preparation

Our Practice Leaders

Venkata Potturu

What is AWS Glue DataBrew?

AWS Glue DataBrew is a new visual data preparation tool that helps enterprises analyze data by cleaning, normalizing, and structuring datasets up to 80% faster than traditional data preparation tasks. It can interface with Amazon S3, S3 buckets, AWS data lakes, Aurora PostgreSQL, RedShift tables, Snowflake, and many other data sources.

Being that AWS Glue DataBrew is a no-code ETL tool, it provides data analysts and data scientists with self-service access to perform data preparation, without writing a single line of code. As a result, a workflow typically reserved for engineers and ETL developers is accessible and usable by anyone with knowledge of SQL.

Users can choose from up to 250 built-in transformations to clean, combine, pivot, and transpose terabytes of data directly from their data warehouse, data lake, or database platforms using a simple interactive interface. It also allows the data to be ready for immediate use in analytics and machine learning.

Learn more: AWS Glue Implementation Services

AWS Glue DataBrew Visual Interface

AWS-Glue-DataBrew-Visual-Interface-Graphic

What is Data Preparation Used For?

Since enterprises host vast quantities of data in a variety of formats, it is often a time-consuming and resource-intensive process to transform data into actionable insights. With AWS Glue DataBrew, users are provided with a visual interface to make sense of the following data formats:

Raw, Unstructured Data

This can include system logs, sensor data, text files, and so forth. Raw data is generated as-is, without further processing. It has no model or structure. Raw data is suited to platforms like AWS data lakes, which can house and load large unstructured data sets.

Semi-Structured Data

An application or service may output data with a basic model or structure. This includes Excel files, which use the comma-separated values (CSV) and columns or rows to form a hierarchy.

Structured Data

Structured data could include a combined list of first and last names, dates, zip codes, or payment card information. Enterprises can query structured data using SQL, NoSQL, or any other data language.

Since AWS Glue DataBrew is used to clean, normalize, and structure datasets, all the raw, unstructured data, and semi-structured data can be loaded and transformed into structured data. This allows analysts to quickly shape the data for their analytics solutions.

Top 4 DataBrew Features for Enterprises

As you can see, DataBrew has the potential to help enterprises prepare and transform large datasets into structured formats for analytical insight. Let’s explore the top four features of this visual data preparation tool:

Visualized Data Preparation

Where most data are viewed as alphanumeric values in columnar databases, DataBrew works different. It visualizes all loaded data sources so you can understand the hierarchy and data relations. On the topic of no code, the graphical user interface (GUI) allows users to click, drag, and edit data without a single line of code

250+ Data Preparation Automations

Data scientists follow several isolated, repeatable workflows and processes as part of their role. AWS has modeled these processes and workflows as language and data-agnostic modules, creating a library of actions that are usable by end users. This can include creating pivot tables, handling missing data values, data filters, and data encoding. Then, you can create a DataBrew recipe to create singular or batch action automation templates.

Data Lineage

Much like audit logs are used to track customer activities in an IT network, data lineage allows you to track data transformation activities within AWS Glue DataBrew. This includes the data source, any transformations that have been applied, and the data output including target location. With transformations, it covers how data was transformed, what changed, and why it was transformed, so that everyone can interpret the chain of events. Much like a family tree, this allows you to go back and understand how the data came to be.

Data Mapping

Trying to find the similarities between two databases? Databrew helps you find matching fields that are present in two data sources. As matching fields are identified, they are loaded into a schema. This enables the creation of a homogenous, centralized database that condenses multiple data sources into one.

Benefits of Using AWS Glue DataBrew

AWS Glue DataBrew is focused on improving the accessibility and usability of data preparation tasks. Here are a few benefits and advantages of using DataBrew in more detail:

Reduced Barrier of Entry to Data Preparation

The no-code format of the platform means anyone, of any skillset, can prepare data. While functionality is not advanced enough to replace data engineers, it allows non-technical employees to perform an ETL process, data exploration, and data quality workflows much like data analysts would. This reduces the barrier of entry to data preparation, which directly tackles the shortage of data science personnel across all industries.

Automated Data Profile Creation

As you work on projects in DataBrew, you will be shown a dataset preview, profile overview, column or row statistics, and a full data lineage. This falls under a data profile, giving you a snapshot overview of the new dataset.

Automate 250+ Common Data Preparation Processes

Using models for data preparation can significantly reduce the amount of effort required. Rather than writing a line of code, or copying a code template and editing the parameters, you can simply configure and execute an action to prepare data. This is known as workflow enhancement, speeding up low-level repeatable processes to get more done in less time.

Intelligent Prescriptive Suggestions

Glue DataBrew incorporates light prescriptive analytics while you work. This means you get suggestions to speed up preprocessing workflows, reducing utilization costs with a faster time to benefit (FTTB).

Disadvantages of Using AWS Glue DataBrew

While AWS Glue DataBrew transforms the accessibility and usability of data for business users, it also has the following limitations:

No Code Means No Code

DataBrew uses standardized, tested modules that write the code for you while allowing editing of parameters through a GUI. When AWS says “no code,” it really does mean no code. However, you cannot introduce custom code to this platform, as it is heavily locked down to the 250+ data transformation actions. For advanced technical users, this may be a deal breaker. But most non-technical users will appreciate the no-code nature of DataBrew.

Cloud Only Leads to No Offline Support

Since AWS is a cloud service provider, naturally there is no offline support. It is web only, accessible through a web application. This means that on-premises data centers cannot use their LAN datasets in DataBrew if broadband connections fail — a complication for hybrid cloud users.

No Custom Visualizations

The no-code nature of the platform means visual interface customizations of any sort are scarce. You are limited to DataBrew visualizations, such as basic graphs, charts, and bars, rather than integrating third-party services like PowerBI, Tableau, or Looker.

AWS Glue DataBrew Pricing

DataBrew offers 40 interactive sessions for first-time users that are completely free. This provides ample leeway to test and confirm that the platform is suited to your business requirements.

DataBrew Interactive Sessions

Per Session Pricing

DataBrew sessions are charged on a per-session basis. Each session costs $1.

Session Timescales

AWS DataBrew sessions are 30 minutes each. Each session is interactive, so moving the mouse means the clock is ticking. No activity at the end of a 30-minute period

Example Pricing for AWS Glue DataBrew Interactive Sessions

You log on at 2 PM and work until 2:29PM = $1 cost.

You log on at 2 PM and work until 2:29 PM. Then, you log back in at 2:35 PM and forget to log off. You are forced to log off at 3 PM by DataBrew = $2 costs.

You log in 5 minutes between 2 and 2:30 PM, 5 minutes between 2:30 and 3 PM, and 5 minutes between 3 and 3:30 PM = $3 costs.

As you can see, pricing is based on 30-minute blocks. Interacting with the service within these blocks will trigger a $1 charge, even if you only spend a minute logged in. You can reduce costs by maximizing work completed in each session as well as by logging off immediately once work is complete.

To learn more about Interactive Session pricing, you can access the AWS calculator for DataBrew.

Learn More: Amazon Redshift Consulting Services

DataBrew Jobs

Per Node Pricing

DataBrew jobs use nodes to calculate pricing. Each node includes four virtual CPUs and 16GB of RAM. The default for each job is five nodes.

Node Hours

A node hour costs $0.48. This is calculated per minute.

Additional AWS Services

You will be charged for each additional service if you use supplementary AWS services alongside DataBrew jobs. For example, the AWS Glue Data Catalog.

Example Pricing for DataBrew Jobs per Node Hour

You run a DataBrew data profile job for 60 minutes using 5 nodes for one job. This means you have 5 node hours total: 5 nodes x $0.48 per hour = $2.40

You run a DataBrew job for 30 minutes, using 10 nodes for two jobs: 10 nodes x 0.5 hours = 5 full node hours or $2.40 again.

Alternative Competitor Platforms to AWS Glue DataBrew

Despite its cost-effective pricing, AWS Glue DataBrew might not be the best fit for your use cases. Here are a few DataBrew competitors and alternatives to consider:

Pandas

Pandas is an open-source tool for data analysis and manipulation. Built in the Python programming language, it allows users to create objects, view and select data, populate missing data, merge datasets, and much more. It supports CSV, TXT, HTML, JSON, LaTeX, XML, SQL, SAS, and SPSS as “IO tools.”

NumPy

Numerical Python (NumPy) is another tool used across all data industries. It’s an open-source Python library focused on numerical data (Pandas is actually based on this as a source). It allows users to create arrays, add and remove elements, alter shapes and sizes of arrays, convert from 1D to 2D arrays, and create or reshape matrices.

Alteryx

Where the previous two competitors are free and open source, Alteryx is a paid automated analytics service. It matches DataBrew with data quality and preparation tools, including data access, catalogs, exploration and profiling, and enrichment. More broadly, Alteryx also offers an analytics cloud, analytics process automation, and machine learning.

Dask

Dask is another free, open-source Python library. It is built in coordination with NumPy and Pandas. It allows users to create a Dask array, Dask bags, DataFrame, collections, and more. The focus is enabling parallel, scalable Python solutions for data preparation using schedulers that execute task graphs.

Considering AWS Glue for Your Next Migration Initiative?

As an AWS premier partner, Trianz can help you connect data with data engineers and ETL developers using AWS Glue to make it easy to discover, prepare, and combine data for analytics, machine learning, and application development. Whether you are looking to easily explore new data sets in a visual interface using DataBrew, or want to learn more about how AWS Glue can save you up to 50% on your legacy migrations — Trianz is here to help.

For more in-depth information on AWS Glue, check out our ETL migration page or schedule a call with one of our ETL migration experts.

Digital Transformation and the Business Urgency for Analytics

Experience the Trianz Difference

Trianz enables digital transformations through effective strategies and excellence in execution. Collaborating with business and technology leaders, we help formulate and execute operational strategies to achieve intended business outcomes by bringing the best of consulting, technology expertise, and execution models.

Powered by knowledge, research, and perspectives, we enable clients to transform their business ecosystems and achieve superior performance by leveraging infrastructure, cloud, analytics, digital, and security paradigms. Reach out or get in touch to learn more.

Benchmarking & Strategy

Technology Implementation

Managed Services

Platform Services

Benchmarking & Strategy

Technology Services

Managed Services

Platform Services

Benchmarking & Strategy

Technology Services

Managed Services

Platform Services

Benchmarking & Strategy

Technology Implementation

Managed Services

Benchmarking & Strategy

Technology Implementation

Managed Services

Platform Services

Benchmarking & Strategy

Technology Services

Managed Services

Platform Services

Benchmarking & Strategy

Technology Services

Managed Services

Platform Services

Benchmarking & Strategy

Technology Implementation

Managed Services

AWS Glue Databrew: A Guide to Visual Data Preparation

Our Practice Leaders

Venkata Potturu

What is AWS Glue DataBrew?

AWS Glue DataBrew Visual Interface

What is Data Preparation Used For?

Top 4 DataBrew Features for Enterprises

Benefits of Using AWS Glue DataBrew

Disadvantages of Using AWS Glue DataBrew

AWS Glue DataBrew Pricing

DataBrew Interactive Sessions

Example Pricing for AWS Glue DataBrew Interactive Sessions

DataBrew Jobs

Example Pricing for DataBrew Jobs per Node Hour

Alternative Competitor Platforms to AWS Glue DataBrew

Considering AWS Glue for Your Next Migration Initiative?

Experience the Trianz Difference

You might also like...

Lake House Infrastructure on AWS: A Game - Changer for Faster Analytics at Lower Costs

Unlocking Innovation: Modernizing Windows Workloads with AWS

Implementing a Data Mesh on AWS with Trianz Extrica

Data Evolution: A Story of Search| Secure | Share

Why Data Mesh?

Managing Enterprise Data Sets

Get in Touch

Let us help youtransform and grow

Follow us on social media

Let us help you
transform and grow