Data provides the amazing power to understand and grow businesses. But data always comes in different shapes and formats, and need to be transformed into a more analysis friendly structure to be able to get meaningful insights out of it.
Are you a Google BigQuery user? Do you have different data sources other than Google Analytics? Would you like to have a tool to collect all your data sources into BigQuery to have better control over it and start analyzing, visualizing and building various models to discover hidden patterns? If yes, then Talend is the perfect tool for you. Continue reading this post to find out how to use it to build your data pipeline.
Talend is one of our data integration partners for Google BigQuery, and is one of the most powerful data integration and ETL tools available in the market. It is easy to use, with more than 900+ connectors and components. Talend enables users to design their data pipelines, using drag-n-drop building block and each block is highly configurable, then converts this flow into Java code that can run on different machines making use of Java Portability. Another big feature is that developers can set Talend to produce Java code that runs on a single machine or produce MapReduce code that runs on a cluster of servers, making use of parallel/distributed computing, processing large amounts of data in a short time. Talend is also considered to be the next-generation leader in cloud and big data integration software.
What Is ETL?
Before we consider our use case, let’s first understand what ETL is. ETL stands for Extract, Transform and Load. It’s the process of moving raw data from various sources to a destination (data warehouse or a database) after manipulating the data.
- Extract: is the process of reading/fetching raw data, from database(s), file(s) and/or any other source(s). Extracting can happen from one or more sources at the same time.
- Transform: is the process of altering the raw data extracted, and preparing it to be loaded in the target database or data warehouse.
- Load: is the process of injecting/writing the new modified data into target database or data warehouse.
There are various ETL tools in the market. ETL tools assist with the process, providing capabilities and advantages unavailable if you were to complete the ETL process on your own by using a programming language.
Advantages of using ETL Tools:
- It is very easy to use as it is almost just drag and drop steps, and eliminates the need for writing the code.
- ETL tools are GUI based, which makes it easier to understand the flow of the data.
- ETL tools provide better data management when dealing with large and complex data.
- Performance of the ETL tools is much better, and the process can be dispatched to several servers to run in parallel, reducing the execution time.
The use case at hand is one of a Chief Marketing Officer of a home appliance distribution company (let’s call it OME Appliances). Our CMO is in charge of marketing for a global network of retailers and dealers. Her Data/BI team is responsible for the consolidation and automation of reporting across all retailers and across all the brands they carry.
Daily files from different retailers containing transactions and revenue data are consolidated into two aggregated files: one for sales and another for services. Those files are then transformed and loaded it into BigQuery.
Hence the three main steps mapping to the ETL process:
- Extract: Connect to a SFTP server and download the latest aggregated sales and services .txt files. Files are named according to timestamp of file generation.
- Transform: Read and parse the .txt files, filter out unneeded columns, add some new columns, manipulate existing ones to handle formats and/or modify values.
- Load: Inject the resulting data from step 2 into Google BigQuery partitioned tables.
At first, we considered custom development, using a programming language like Java or C#. But we realized it would have taken about 4 months.
We then decided, this is a task that should be implemented easily and efficiently using an ETL tool, and what was estimated to take 4 months, was created in just one week.
Now, we’ll get into some details of how we implemented the solution using Talend integration studio.
First let’s take an overall look on the job structure, then break it into the three ETL steps.
- Extraction: In this step, we used Talend’s FTP components (tFTPConnection & tFTPGet), to securely connect to the online marketing company’s SFTP server, and retrieve the pre-prepared sales and services text files that include transactions with all needed details (datetime, company name, revenue, appliance… etc), using the current date as the files name’s mask to get today’s files only.
- Transform: the transformation step is where the raw data gets cleaned, i.e. removing unneeded columns, formatting existing columns to match the target schema, and adding whatever new columns required like snapshot date and file name. The output of this step is CSV files ready to be loaded into BigQuery, each file is named with the partition date it will be inserted in into the BigQuery partitioned table. For more info about partitioned tables please visit this page.
- Load: the final step, is loading the CSV files into BigQuery, which is done using tBigQueryBulkExec component, which uploads the given CSV files into Google Cloud Storage, then loads them as a bulk into BigQuery partitioned table.
This step can be split into two, first uploading the files into GS then loading them into BigQuery, but it’s recommended to use this compact component.
- Additional advantages: as we previously mentioned, ETL tools provide better data management. Here are some extra components that we used here just for saving backups and archiving the generated files in case we needed to use the historical data for any purpose. Here we use Google Cloud Storage components “tGSConnection” to open a connection to GS, “tGSBucketCreate” to create a new archive folder with today’s date, “tGSPut” to upload the generated CSV files to the created archive folder.
Turning data into insights
After loading the data into BigQuery, now we can explore it in many ways, and make use of all the cool BigQuery features and tools that connect to BigQuery to analyze and visualize the data, to get meaningful insights, and answer business questions.
One of the powerful visualization tools that connect to BigQuery is Data Studio, below are sample screenshots of what can be accomplished in Data Studio using the external data loaded in BigQuery.
For a home appliances retailer, these Data Studio dashboards provide a high level overview and some detailed insight into the sales and service data for decision making.
For a brick and mortar retailer it helps to know for planning purposes which states sell the highest number of appliances, which brand of appliances have the highest gross profit. This will help the retailer move inventory in a way that maximizes revenue.
To allocate resources and plan ahead a home appliance servicing business needs to know which brands require the most labor hours and have the highest repair costs, and what is the average repair order by brand. This will help them procure and stock the required spare parts reducing delays. By knowing if the appliance being serviced is new or used the retailer can send personalized coupons or emails to customers encouraging them to get their appliances serviced regularly.