As a Senior Business Intelligence consultant with extensive data preparation experience, I'm often asked by clients to recommend a data prep/analytics platform. Whether I recommend KNIME or Alteryx depends on the project requirements and data sources, because each product has specific capabilities, pros, and cons.
In this blog, I’m going to look at the two tools side by side in terms of their ETL capabilities:
KNIME is a powerful, free, open source ETL and BI tool.
Alteryx is a commercially licensed, self-service, analytic process automation platform with ETL capabilities and complex analytics, including predictive, spatial, and statistical analysis.
ETL tools, at their core, enable loading data from multiple data sources and combining and transforming it into a format that can then be loaded into a database for further querying. Beyond these primary functions, many of these tools contain a wide set of extra features, ranging from data analytics capabilities such as predictive modeling to the ability to create graphics, charts, and full-fledged dashboards.
KNIME is an open source analytics platform that grows continuously by integrating new developments. KNIME provides a GUI (graphical user interface) based platform where reusable workflows can be built quickly by simple drag and drop to perform ETL, business intelligence analytics, and machine learning. It’s a popular tool because it provides functionality such as natural language processing, text mining, and information retrieval for reading, processing, mining, and visualizing textual data.
Alteryx is an analytic process automation platform that provides automation capabilities for all analytics functions (ETL, diagnostic, predictive, prescriptive, and geospatial analytics). It combines code-free and code-friendly data science, machine learning, artificial intelligence, and business process automation in one platform. One of Alteryx’s differentiators is adding location intelligence through easy-to-use spatial analysis tools.
Both Alteryx and KNIME use a workbench-style approach.
KNIME: There is a list of nodes (i.e., tools) in a repository, divided into different segments. Each node can be dragged onto the canvas and connected by a line from an output to an input of the same or a different node. Double-clicking or right-clicking a node lets you configure it based on its functionality.
Alteryx: The interface is quite similar to that of KNIME. Tools are grouped into understandable, color-coded categories such as In/Out and Data Preparation along the top of the application. Clicking a category reveals its tools, and each tool’s configuration interface can be expanded or collapsed as needed.
Verdict:
While both interfaces appear similar, navigating between nodes is considerably easier in Alteryx than in KNIME. With KNIME, one can end up with many windows all over the place, consuming a lot of memory, which can be a factor on a slow computer. Purely from an interface perspective, Alteryx has the more intuitive workflow.
Both tools can pull data from a wide set of sources, ranging from CSV files to databases and cloud sources.
KNIME: Double-clicking or right-clicking a node shows all of its configuration options. Below is an example of the File Reader node opening a CSV file. While loading, you can preview the data and modify settings, including column types and more. It also gives you the ability to filter not only rows but also columns, which can be useful if the data has a range of keys in the database or other data source.
Alteryx: It provides easy drag and drop and visual selection of data types. After connecting the Input/Output tool to a database or a file, we are automatically provided with a quick visualization containing comprehensive information about each data type.
Verdict:
Alteryx’s Data Cleansing tool is easy to understand, and modifying data types is simple because it uses a drop-down to select them. Conversely, data type conversion in KNIME is a time-consuming process. Alteryx’s data prep tooling seems a bit superior to KNIME’s, although KNIME has a slight edge when it comes to filtering data by columns.
Both Alteryx and KNIME provide great tools for combining data; however, the ease of use varies between them.
KNIME: The join node is easy to understand and can combine datasets on a shared identifier. We can choose which columns from each dataset appear in the result. Some joins, such as left and right joins, can be a bit tricky since there are no explicit left or right join options in the JOIN node.
Alteryx: The Join tool works similarly to KNIME’s. Users simply choose the identifier that links the datasets together, which lets them construct SQL-style joins without writing a line of code. Alteryx has also included some higher-level functionality by combining analysis tools that make sense to work together. However, this doesn’t always provide the expected user experience, which can cause trouble with larger datasets. For example, to simplify the analyst’s effort, the Join tool produces the results of the inner join and the left/right unmatched records all at once, much like combining join types with a UNION in SQL, so the user never specifies a join type. This is great for smaller datasets, but for larger datasets it can be time- and resource-consuming, because Alteryx needs to process every join output to complete the processing. KNIME, on the other hand, is completely modular and treats each activity as a discrete step in a separate node that can be fully managed by the ETL designer. This requires the ETL designer to be more technically proficient in KNIME, but it allows more control and can avoid some of the excessive processing and resource consumption that occurs in Alteryx with large datasets.
Verdict:
Both tools have great, easy-to-use joining and data manipulation capabilities for combining datasets on the fly. However, when working with larger datasets, KNIME comes out stronger since it treats each activity as a discrete step.
Both tools provide built-in predictive analytics capabilities that are useful for analyzing preliminary trends, which can later be used for more in-depth analytics.
KNIME: KNIME is quite strong when it comes to different predictive and analytical nodes. Since KNIME is open source, many developers have created a wide range of plugins and adapters that extend its existing functionality. KNIME also has many statistical tools, as can be seen in the following image.
Alteryx: It does not offer the same range of analytical tools that KNIME offers, but it does include a few useful data investigation tools such as Pearson and Spearman correlation. It also offers various analytical models and some simulation sampling tools.
Verdict:
KNIME has more tools and has an advantage over Alteryx when it comes to analytical models and machine learning, since it is an open source application with a large population of independent developers creating additional tools.
Both tools are extremely powerful when it comes to ETL; however, the selection of a tool largely depends on the intent and the user. If the user is looking for a user-friendly tool that lets semi-technical or technical users handle the necessary data preparation, Alteryx offers an advantage. On the other hand, if the user is looking for heavy analytical options and has a more technical understanding of the data, KNIME comes out stronger in terms of its capabilities. Also, KNIME is open source and a newer tool with a medium-sized user community; Alteryx has a larger community, is more adaptive to users, and has a more welcoming support environment.
Contact a Data Expert if you need help assessing your environment and deciding which data prep/analytics platform best meets your needs.
Over the last month, I’ve come across several problems that clients were encountering with their visuals. In this post, I’d like to talk about three of them. These three problems were similar in the sense that they were all solved with lesser-known table calculation functions. In each case, I’ll present the problem and how it was solved. I will be using the Superstore dataset to recreate the scenarios encountered.
A client had a simple bar chart showing sales by month for the current year. They wanted to hide the last bar since it represented a partial month due to the fact that we were only halfway through June. This problem was a perfect candidate to be solved with the LAST () table calculation function.
The LAST () function returns the number of cells from the current cell to the last cell in the partition. This means that if we have six bars representing January to June, June would have a value of 0 since it is the last bar and January would have a value of 5 since it is 5 bars away from the last bar. To demonstrate this, I created a calculation called “Last Label” with the formula LAST () and put this on the label shelf.
To remove the last bar, we can create a calculated field called “Last Filter” with the formula LAST () = 0. This will return either true or false. We can put this field on the filter shelf and select false. This removes the partial month.
A client had a simple bar chart showing sales broken down by segment and product category. They wanted to emphasize the product category with the most sales per segment. This problem was a perfect candidate to use the WINDOW_MAX () table calculation function.
The WINDOW_MAX () function returns the maximum of the expression within the partition. If you are unfamiliar with the concept of a partition, I would encourage you to consider taking our Tableau Master Training. Understanding addressing and partitioning is fundamental to correctly using table calculations in Tableau. You can think of a partition as the area of your chart that WINDOW_MAX looks at to compute the maximum value of the expression we specify.
We can use the WINDOW_MAX function to identify the product category with the most sales per segment. Then we can colour this bar so that it stands out against the other product categories. We can create a calculation “Is Window Max?” with the formula SUM([Sales]) = WINDOW_MAX(SUM([Sales])).
Let’s put this field on the colour shelf. You can see that Tableau is currently colouring the largest bar across the entire visual with a different colour, indicating that the current partition is in fact the entire table. This is not quite what we want. Let’s click on the “Is Window Max?” pill and change the “Compute Using” value to “Pane (down)”. There we have it. Tableau is now colouring the product category with the highest sales per segment.
A client had a visual that displayed sales by month for the past two years. The visual also showed subtotals and grand totals, which was a requirement. Displaying subtotals and grand totals created the problem of shrinking the monthly bars, because the large total values increased the axis range of the chart. This made it difficult to see the monthly sales values, which were the most important part of the visual. The client wanted to display the subtotal and grand total values, but they didn’t have to be in the form of bars; they could simply be labels. This problem was a perfect candidate for the SIZE () table calculation function.
The SIZE () function returns the number of cells in the partition. If you are already knowledgeable about table calculations, then you might know that totals are their own partition. We can take advantage of this to solve the problem.
Our desired result will be to display the bars normally for the monthly figures. For subtotals and the grand total, we will display a label rather than a bar so we don’t skew the axis range. Let’s create a calculation called “Size” with the formula SIZE (). We’ll put this on the label shelf to see how this works. Let’s also change the addressing to “Pane (across)” so that our calculation is computing per year.
The monthly bars all have a label of 12. This is because there are 12 months per year. In other words, there are 12 cells or marks per partition. The subtotals and grand total on the other hand all have a value of 1. This is because totals are their own partition. The 2018 subtotal sales value is not lumped together with the 2018 monthly sales partition. It exists separately and is its own partition.
We can use this to our advantage in our next calculation. Let’s create a calculation called “Sales for Months” with the formula IF SIZE () = 1 THEN 0 ELSE SUM([Sales]) END. This calculation goes to each bar and checks how many bars are within the partition in which that bar resides. For all the monthly bars, the size calculation returns 12, so our new field returns SUM([Sales]) due to the ELSE condition. For the subtotals and grand total, the size calculation returns 1, and as a result the “Sales for Months” calculation returns 0. This means that when we encounter a total, we won’t be displaying a bar.
Let’s replace SUM([Sales]) on the rows shelf with the “Sales for Months” field. Let’s make sure the addressing is “Pane (across)”. Let’s also put SUM([Sales]) on the label shelf. There we have it. We are no longer displaying bars for our totals, which allows us to see the sales trend across months much more easily now that our axis scale isn’t inflated by the total values.
It is also worth mentioning that this approach won’t work if your partition legitimately has only one bar. For example, if you were displaying sales by year and we were in the month of January, then the partition would only have one bar. Consequently, the size function would return a value of 1 and the “Sales for Months” formula would treat January as a total.
And there you have it: three visualization challenges solved with lesser-known table calculation functions.
If you’re encountering a challenge with a Tableau visual, contact one of our Tableau professionals for assistance.
The following list of changes, tweaks, and considerations is designed to improve the performance of a Tableau dashboard. Not all suggestions will work in all situations, so even with this list, experience and an understanding of how Tableau works are needed to evaluate the suggestions properly. There are three broad categories of things to do, some requiring more technical expertise and access to source systems than others.
We have color-coded the actions to be taken based on their category. The list is generally presented from “most impactful” to “least impactful”. Again, not all “most impactful” suggestions will work on all structures of data, and in some cases entries at the bottom of the list may be much more impactful in certain scenarios. For example, if a dashboard is running on a heavily used Tableau Server, the last entry in the list, “Use fixed sized dashboards”, may actually produce significantly better results. Caveats aside… here is the list.
If you need assistance to improve the performance of your Tableau dashboards, we would be happy to help! CONTACT US
RFM (Recency, Frequency, Monetary) analysis is commonly used for customer segmentation, to split our users into types. For example, people who visit a website regularly but don’t buy much would be a high “frequency” but a low “monetary” visitor. This analysis is especially common in retail where we want to create a view of our customer types without drowning in all the transactional data.
This page contains a comprehensive explanation and description.
When constructing RFM scores, one challenge is that the metrics are on completely different scales:
To compare them, we divide each metric into quintiles. If you are above the 80th percentile, your score is 5; if you are in the 60th to 80th percentile, your score is 4; and so on.
Using the included Superstore dataset, we’ll define RFM as follows:
Four Steps to Achieve our Goal:
Notice that with 5 x 5 x 5 scores, we have 125 possible combinations. However, in practice our data won’t necessarily contain every combination. Certain combinations, like 4-1-5 or 2-5-1, are very uncommon because you generally don’t have customers who buy frequently yet spend very little and haven’t ordered recently.
In situations like these, with many possible combinations, it’s often helpful to provide Tableau with a scaffold. A scaffold is a separate table that lists all the possible values (here, the 125 combinations). When we join or blend it to our results, it guarantees that every possible value appears in the results, even if there are zero customers in that bin. Scaffolds are frequently helpful with dates, where not every day exists in a dataset but we would like to see activity charted for every day. To start, I’ve created a sheet called “RFM all codes.” All it contains is a list of the 125 possible combinations:
With that set up, we need to calculate each person’s percentile. This can be easily done with the RANK_PERCENTILE table calculation:
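As a rough sketch, assuming Frequency is measured as the number of items each customer has purchased (one row per line item in Superstore), the frequency version of this table calculation could look something like:
RANK_PERCENTILE(COUNT([Order ID]))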
We’ll then convert each percentile into a score from 1 to 5 with a simple if-then statement:
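A sketch of the frequency version, using a hypothetical “Frequency Percentile” field to hold the RANK_PERCENTILE result, might be:
IF [Frequency Percentile] > 0.8 THEN 5
ELSEIF [Frequency Percentile] > 0.6 THEN 4
ELSEIF [Frequency Percentile] > 0.4 THEN 3
ELSEIF [Frequency Percentile] > 0.2 THEN 2
ELSE 1
END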
Completing this for each metric and combining them, we can now see the customers assigned to each RFM combination:
Great! Now let’s remove the customer names, so we can just see counts by RFM category.
Wait a minute… how did everyone get in 555?
Unfortunately, this won’t work with table calculations like RANK_PERCENTILE. The issue is the view granularity. Since we haven’t broken it down by customer name, the calculations for percentiles all break. There’s only a single value for frequency now – the overall frequency – which is equal to the maximum, making it the 100th percentile. What to do?
Generally, in Tableau, many formulas that can be done as a table calculation can be re-done as a level of detail calculation. Table calculations are limited to the granularity of the view; there is no way to override that or change an option. To make the percentile work, the view has to be broken down by customer, and I can’t then roll it back up to the level of the whole dataset without breaking the calculation. However, with a level of detail expression, we can bypass the level of the view and define an overall 80th percentile, 60th percentile, and so on.
Table calculations are at the core of Tableau, and they solve many common calculation problems. The most common calculations like running total, moving average, change from prior period, and Year-to-Date are available as “quick” table calculations, meaning Tableau will write the formula for you. But occasionally we run into situations like this where the table calculation doesn’t work. Enter the Level of Detail formula.
Level of detail calculations were introduced in Tableau 9. They allow us to ignore filters or create formulas that aren’t dependent on dimensions in the view. The key to resolving this problem is the FIXED level of detail formula. In my experience, this is the most commonly used level of detail formula – see the Tableau documentation for a full explanation.
I’ll re-do my approach with Level of Detail expressions. First, to get the number of items for each customer, I’ll use the FIXED expression with Customer Name.
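A minimal sketch of that field, named “Frequency LOD” to match the percentile formula further down, and assuming each row of the Superstore data represents a purchased line item, could be:
{ FIXED [Customer Name] : COUNT([Order ID]) }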
This guarantees that the count of items purchased is at the customer level. Regardless of what view I make – a summary by state, or zip code, or product subcategory – the Frequency is still calculated for each customer.
How do we get the overall 80th percentile, so I can compare each customer’s score to that value? With another FIXED level of detail:
{ FIXED : PERCENTILE([Frequency LOD],0.8) }
Note this is a bit confusing in Tableau. PERCENTILE is an aggregation, like SUM(), MEDIAN(), COUNT() and so on, but RANK_PERCENTILE is a table calculation that is used on top of an aggregation. Table calculations and levels of detail can sometimes be combined, but in this case, we’ll keep them separate.
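The cut-offs for the other quintiles can be built the same way; for example, a hypothetical “Frequency 60th Percentile” field would simply change the fraction:
{ FIXED : PERCENTILE([Frequency LOD], 0.6) }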
After creating these for each of recency, frequency, and monetary, I need to turn them into scores: each customer needs to be assigned to a group from 1 to 5 for each value. Time for another calculated field.
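A sketch of the frequency scoring field, reusing the hypothetical percentile cut-off fields just described (the recency and monetary versions follow the same pattern), might look like:
IF [Frequency LOD] >= [Frequency 80th Percentile] THEN "5"
ELSEIF [Frequency LOD] >= [Frequency 60th Percentile] THEN "4"
ELSEIF [Frequency LOD] >= [Frequency 40th Percentile] THEN "3"
ELSEIF [Frequency LOD] >= [Frequency 20th Percentile] THEN "2"
ELSE "1"
END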
This ensures everyone is assigned to a “bin” for frequency. Notice I’ve made these text fields, so it will be easier to combine them, but they could just be numbers as well.
We’ll do the same for recency and monetary. Now I’ll combine everything into an RFM score with one more formula:
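Assuming the three bin fields are text, as noted above, this can be as simple as string concatenation (the field names here are illustrative):
[Recency Bin] + [Frequency Bin] + [Monetary Bin]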
We could also create a group with this data, collapsing the 125 RFM combinations into 4 categories:
RFM values can be further grouped and boiled down. For example, if you have an R of 1 or 2 and also an F of 1 or 2, you may be called a “hibernating” customer who is no longer engaged.
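A rule like that could also be written as a calculated field rather than a group; a sketch using the illustrative bin fields from above might be:
IF ([Recency Bin] = "1" OR [Recency Bin] = "2") AND ([Frequency Bin] = "1" OR [Frequency Bin] = "2") THEN "Hibernating"
ELSE "Other"
END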
The challenge for RFM is visualizing it. First, we have a lot of different categories: 110 of the 125 possible categories are present in our data. Second, there are three dimensions (R/F/M), and viewing anything in three dimensions is challenging.
You can review two dashboards that summarize customers by RFM score here. The workbook is available for download as well from Tableau Public.
To sum up, we’ve resolved the calculation of RFM scores by leveraging Level of Detail expressions. When you find yourself in a situation where table calculations are causing you problems, level of detail expressions or “LODs” are frequently the answer.
Often this comes down to understanding how Tableau thinks. It’s not necessarily obvious at first that PERCENTILE is an aggregation while RANK_PERCENTILE is a table calculation. However, this is the key to resolving the RFM calculation: combining PERCENTILE with a FIXED level of detail expression.
Essentially, we need data at two levels at once: the value for a single customer, and the value across all customers. Level of detail formulas allow us to work on both levels at the same time, while table calculations do not.
Thank you for following along. Please feel free to contact us to discuss RFM analysis in Tableau, or any other data question that may arise.
Automate and Digitize: Where Should Credit Unions Start to Build Efficiency?
Credit unions that want to improve efficiency should start by identifying their pain points, building a business case for digitization/automation, conducting a pilot program, implementing quick wins, and exploring new opportunities.
CAO Leadership Series: Municipal Budget Development
Addressing the challenges of municipal budget planning requires a holistic and forward-thinking approach. It necessitates active and continuous engagement with community members, proactive risk reduction strategies, and efficient procurement practices.
Unlocking Security Excellence: Essential IAM & RBAC Best Practices for Robust Application Access and IT Risk Management
The rise of new technologies and the ever-evolving IT landscape have necessitated the implementation of a robust Identity and Access Management (IAM) system accompanied by a Role-Based Access Control (RBAC) framework.
How Canadian Credit Unions Can Leverage ESG Principles and Technology to Reach a Younger Generation of Members
The incorporation of ESG principles by CCUs, coupled with the effective use of technology and targeted marketing, presents a powerful strategy for attracting younger members and securing future growth.
Building High Performing Teams: The 8 Components of Resilience
In a world where change is the only constant, organizations are awakening to the undeniable truth – resilience is the secret weapon for survival and success.
Natural Language in Data Visualization: A Showdown Between Tableau and Power BI
Two industry-leading data visualization tools, Tableau and Power BI, both offer the ability to query data using natural language. But how do they stack up?
7 Drivers of Economic Development
These seven drivers of economic development bring new money into the municipalities, accelerate the velocity of money within the city, increase the engagement of citizens, and propel the generation of new ideas, technologies, talent, success stories, wealth, and global rankings.
Steering Through Uncertainty: The Impact of IFRS 17 on Risk Management and Control Strategies
With a strong emphasis on accuracy and integrity, insurers are faced with the task of redefining their control environments and governance structures for financial reporting.
Analytical Data Mart vs. Data Lake: Which Approach is Better for Your Analytics?
Welcome to the world of data-driven organizations where it is crucial to have a well governed repository to efficiently store and manage your valuable data.
Mastering IFRS 17 with a Strategic Target Operating Model
When applied specifically to the realm of IFRS 17, a strategic Target Operating Model provides a high-level view of the end-to-end solution design, processes, controls, and close schedule required to execute the new finance model.
The Push for Companies to Prioritize Leadership Development
Leaders have been expected to do more than ever in the past few years. Navigating through uncertainty, dealing with new challenges, and responding to rapid change have all become commonplace demands for management teams.
Navigating a Hybrid Work Environment with Gen Z Employees
Millennials, who have dominated the workforce for the past decade, are now ceding the stage to the next generation of employees – Generation Z.
Developing Early Career Talent: 5 Strategies for Success
A robust and effective early career talent development program is essential for companies looking to grow their future leaders from within.
How to Capitalize on Your IFRS 17 Investment
With guidance and support insurers can move from IFRS 17 compliance to business as usual (BAU) and fully capitalize on their investment.
Leading & Engaging Gen Zs – The Bold Approach
Gen Zs are the new age workforce that is gradually changing the landscape of the corporate world. Leading and engaging Gen Zs in this environment requires a bold approach.
Power BI vs Tableau – Which is Better?
Although Tableau and Power BI are similar business intelligence tools, there are key differences that organizations should be aware of when considering analytical requirements.