{"id":22533,"date":"2020-12-30T10:17:46","date_gmt":"2020-12-30T10:17:46","guid":{"rendered":"https:\/\/www.experfy.com\/blog\/data-ingestion-almost-solved-problem\/"},"modified":"2023-09-13T13:05:27","modified_gmt":"2023-09-13T13:05:27","slug":"data-ingestion-almost-solved-problem","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/data-ingestion-almost-solved-problem\/","title":{"rendered":"Data Ingestion Is (Almost) A Solved Problem"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"22533\" class=\"elementor elementor-22533\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-70599e1 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"70599e1\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-ef66444\" data-id=\"ef66444\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-26604ac elementor-widget elementor-widget-text-editor\" data-id=\"26604ac\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p class=\"has-medium-font-size\">A review of two out of the box tools to solve a traditionally tedious problem<\/p>\n\n<p id=\"2b1c\">Ask anyone who has been involved in a data related job over the past 10\u201315 years what is the most boring task they would rather avoid, and chances are many would answer \u2018data ingestion\u2019.<\/p>\n\n<p id=\"b791\">Everyone would like to have data ready 
to be analysed. Sure, data cleaning and pre-processing is another good candidate to avoid, but there is something uniquely annoying in the feeling of spending time purely on replicating data, without supposedly adding any value to it. At least data cleaning gives you a sense of purpose, of \u2018putting your house in order\u2019.<\/p>\n\n<p id=\"04de\">Yet,\u00a0data ingestion is a fundamental task, and, until some time ago, you had to work on a combination of <a href=\"https:\/\/www.experfy.com\/blog\/software\/a-laymans-guide-for-data-scientists-to-create-apis-in-minutes\/\" target=\"_blank\" rel=\"noreferrer noopener\">API <\/a>calls, CSV get requests, web-hooks, incremental load rules, streaming services and ODBC connections just to replicate external data into your data warehouse or local system.<\/p>\n\n<p id=\"d5b1\">I\u2019ve faced lots of these challenges myself over the years, which is why I was particularly excited when in my last project I had the chance to explore in detail how data ingestion tools have transformed this space.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-04ff717 elementor-widget elementor-widget-heading\" data-id=\"04ff717\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">There are two data ingestion tools that I am going to review here:<\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-26da726 elementor-widget elementor-widget-text-editor\" data-id=\"26da726\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul><li><a href=\"https:\/\/fivetran.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Fivetran<\/a><\/li><li><a 
href=\"https:\/\/www.stitchdata.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Stitch Data<\/a><\/li><\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9adfaa3 elementor-widget elementor-widget-heading\" data-id=\"9adfaa3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">When a data ingestion tool is suitable?<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a22d444 elementor-widget elementor-widget-heading\" data-id=\"a22d444\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">The ideal scenario of using a data ingestion tool is when:<\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b73e151 elementor-widget elementor-widget-text-editor\" data-id=\"b73e151\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul><li>You have a data warehouse and want to enrich it with new data sources<\/li><li>You do not need real-time synchronisation<\/li><li>Your sources are within a relatively large set of standard pre-defined databases or SaaS tools (lists\u00a0<a href=\"https:\/\/fivetran.com\/connectors\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>\u00a0and\u00a0<a href=\"https:\/\/www.stitchdata.com\/integrations\/sources\/\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>)<\/li><li>Your source data is already structured in a relational format or in simple JSON structures<\/li><li>Your data pipeline requirements fit an\u00a0<a 
href=\"https:\/\/en.wikipedia.org\/wiki\/Extract,_load,_transform\" target=\"_blank\" rel=\"noreferrer noopener\">ELT approach<\/a>\u00a0(Extract-Load-Transform), rather than an\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Extract,_transform,_load\" target=\"_blank\" rel=\"noreferrer noopener\">ETL one<\/a><\/li><\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1e38092 elementor-widget elementor-widget-text-editor\" data-id=\"1e38092\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"71e9\">The ELT requirement is a crucial one here. While, I would argue, ELT vs ETL is often down to individual preference, there are scenarios where it is preferable to transform data before ingestion, for example if lots of data cleaning, transformation or filtering is required before ingestion, to avoid un-necessary complicated processing after the data has already been loaded into the DB.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5f76e2c elementor-widget elementor-widget-heading\" data-id=\"5f76e2c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">When this approach is NOT suitable?<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-04ccf4b elementor-widget elementor-widget-heading\" data-id=\"04ccf4b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">While the above use cases cover a vast range of use cases, as exclusion of points above, they won\u2019t cover cases 
where:<\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6dbca0e elementor-widget elementor-widget-text-editor\" data-id=\"6dbca0e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul><li>You need real-time replication<\/li><li>You have unstructured or heavily nested JSON source data<\/li><li>Your data source systems or data warehouse use unsupported technologies<\/li><li>You want to transform your source data before ingestion<\/li><\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d87e13a elementor-widget elementor-widget-image\" data-id=\"d87e13a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t<figure class=\"wp-caption\">\n\t\t\t\t\t\t\t\t\t\t<img fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"693\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0_xySWjVOJLmcUF8Y-1024x693.jpeg\" class=\"attachment-large size-large wp-image-18313\" alt=\"Data Ingestion Is (Almost) A Solved Problem\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0_xySWjVOJLmcUF8Y-1024x693.jpeg 1024w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0_xySWjVOJLmcUF8Y-300x203.jpeg 300w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0_xySWjVOJLmcUF8Y-768x520.jpeg 768w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0_xySWjVOJLmcUF8Y-1536x1039.jpeg 1536w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0_xySWjVOJLmcUF8Y-2048x1386.jpeg 2048w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0_xySWjVOJLmcUF8Y-610x413.jpeg 610w, 
https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0_xySWjVOJLmcUF8Y-750x508.jpeg 750w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/0_xySWjVOJLmcUF8Y-1140x771.jpeg 1140w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t\t\t\t\t\t\t\t<figcaption class=\"widget-image-caption wp-caption-text\">Photo by\u00a0<a href=\"https:\/\/unsplash.com\/@romankraft?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener\">Roman Kraft<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener\">Unsplash<\/a><\/figcaption>\n\t\t\t\t\t\t\t\t\t\t<\/figure>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-445c0ea elementor-widget elementor-widget-text-editor\" data-id=\"445c0ea\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"a9d6\">Now let\u2019s compare the two tools on a number of key dimensions.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e5a8cf7 elementor-widget elementor-widget-heading\" data-id=\"e5a8cf7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Replication methods<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9de1e8a elementor-widget elementor-widget-text-editor\" data-id=\"9de1e8a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"e014\">An obvious advantage of using a data ingestion tool is that it will handle incremental uploads for you. 
If you have small enough data sources that can be replicated in full every day, of course you don\u2019t need to worry too much about this point.<\/p>\n\n<p id=\"73b9\">Generally speaking, both Fivetran and Stitch Data perform log-based replication, that is, they access source systems\u2019 change-logs to consume the incremental changes in the data, and replicate them downstream. This happens under the hood, so you don\u2019t have to worry about it, apart from making sure your source systems have change-logs enabled.<\/p>\n\n<p id=\"2fc6\">The other point to consider is whether your data has a primary key. You obviously want to have a primary key, because this is the only way the tools can recognize that an existing record has been updated. If not, they might insert a new copy of the record each time it is modified, which is an undesired behaviour.<\/p>\n\n<p id=\"e6a8\">Fivetran here is a bit more sophisticated in that it <a href=\"https:\/\/fivetran.com\/docs\/getting-started\/consumption-based-pricing#tableswithoutprimarykeys\" target=\"_blank\" rel=\"noreferrer noopener\">generates a synthetic primary key<\/a>\u00a0even when source tables have none.<\/p>\n\n<p id=\"e7b4\">Stitch Data, on the other hand, resorts to an append-only replication method when no primary key is present at source, as it explains in its\u00a0<a href=\"https:\/\/www.stitchdata.com\/docs\/replication\/loading\/understanding-loading-behavior\" target=\"_blank\" rel=\"noreferrer noopener\">documentation<\/a>.<\/p>\n\n<p id=\"599c\">In my experience, in most practical cases you will deal with sources that have a primary key (and if you have control over the source databases, you need to make sure the source tables have one).<\/p>\n\n<p id=\"4120\">However, there are situations where this is not the case; a notable example I have encountered is the\u00a0<a href=\"https:\/\/www.stitchdata.com\/docs\/integrations\/saas\/mixpanel\/v23-12-2015\" target=\"_blank\" 
rel=\"noreferrer noopener\">Stitch Data Mixpanel integration<\/a>, which lacks a primary key. In this case, the same record can sometimes be inserted multiple times, which means you have to dedupe it at destination, for example by taking a unique combination of some fields (such as event_type, user_distinct_id, page_url and timestamp).<\/p>\n\n<p id=\"114c\">This is far from ideal, but should not generate too much trouble as long as you are careful with your queries to consume the data.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6190563 elementor-widget elementor-widget-heading\" data-id=\"6190563\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Handling JSON data sources<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2cb4dff elementor-widget elementor-widget-text-editor\" data-id=\"2cb4dff\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"7ae0\">JSON data sources can in general be a pain to handle, but how much of a pain would depend on the actual data model and the tools used.<\/p>\n\n<p id=\"4ca8\">This scenario is commonly found if you have a NoSql data source such as MongoDB, but would also apply if you have a relational data source with some JSON fields, such as in a Postgres database.<\/p>\n\n<p id=\"26c6\">The most critical case here is if you have source tables (ie collections), that are truly unstructured, that is, they contain a high number of different fields each. This could be the case, for example, if each object represents a book and its fields represent the chapter names. Chances are that the chapter names would be different for each book. 
Cases like this would not match the JSON-to-relational mapping well, and, both with Fivetran and Stitch Data, you would find your destination tables having as many columns as there are unique fields in the source collection. This is not only impractical, but it would also throw an error above a certain threshold, as for example Postgres has a limit of 1600 columns per table.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-069f908 elementor-widget elementor-widget-heading\" data-id=\"069f908\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">If instead you have a limited number of unique fields in each collection (say, up to a few hundred at the very most), the behaviour changes depending on the tool:<\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-57928e8 elementor-widget elementor-widget-text-editor\" data-id=\"57928e8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul><li>Fivetran will map each JSON field at source to a JSON field at destination, as long as the destination DB handles dictionary-type structures<\/li><li>Stitch Data will do the same, but only for a limited set of destinations, such as\u00a0<a href=\"http:\/\/www.snowflake.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Snowflake<\/a>, but not, for example, with Postgres. This is quite annoying as Postgres natively handles JSON fields. 
More details on this are\u00a0<a href=\"https:\/\/www.stitchdata.com\/docs\/data-structure\/nested-data-structures-row-count-impact\" target=\"_blank\" rel=\"noreferrer noopener\">explained here<\/a>, and you need to bear in mind that this will also affect your row count for pricing (more on that later).<\/li><\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1d8a3a2 elementor-widget elementor-widget-text-editor\" data-id=\"1d8a3a2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"6bb1\">In summary, there is no magic a tool can do if you have truly unstructured source data; however, if you do have some structure, under certain conditions these tools will allow you to preserve the dictionary-like format in the destination database.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-afa9cf9 elementor-widget elementor-widget-heading\" data-id=\"afa9cf9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Tables and field selection<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-edfefc8 elementor-widget elementor-widget-text-editor\" data-id=\"edfefc8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"e90b\">Both Fivetran and Stitch Data allow you to import only certain tables from a source schema, but Stitch Data also offers the ability to selectively import only chosen fields, which can be useful if you have tables with lots of fields, and particularly if some of 
these fields contain nested JSON that could complicate the target structure.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-85a5147 elementor-widget elementor-widget-heading\" data-id=\"85a5147\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Row count and charging methods<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-55d5fae elementor-widget elementor-widget-heading\" data-id=\"55d5fae\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">On this point, Fivetran and Stitch Data have two different approaches:<\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b2bb226 elementor-widget elementor-widget-text-editor\" data-id=\"b2bb226\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul><li>Fivetran charges on unique monthly rows updated<\/li><li>Stitch Data charges on overall monthly rows updated (that is, the same row can be counted multiple times)<\/li><\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-169c24e elementor-widget elementor-widget-text-editor\" data-id=\"169c24e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"24d8\">Why does this matter? 
If your source stream is mainly a sequence of appended new records, this won\u2019t make much difference, but if your source systems tend to frequently update the same records over and over, Fivetran can offer the advantage of clearer and more predictable pricing.<\/p>\n\n<p id=\"343a\">Another point to be very careful about here is historical backfills and table schema changes. If, for example, you have a large table and you add one additional column and populate it for the whole history, both systems will spot these changes and update the full record history, even if you might not be genuinely interested in that additional field. This could generate quite high unexpected charges with both tools.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1ba54fc elementor-widget elementor-widget-heading\" data-id=\"1ba54fc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Pricing<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1f66367 elementor-widget elementor-widget-text-editor\" data-id=\"1f66367\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"3901\">Based on the charging methods described above, it\u2019s not possible to compare the two tools exactly like for like; however, in a common scenario where the vast majority of source changes are actually new inserts, Stitch Data proves to be significantly cheaper than Fivetran.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9ac74e5 elementor-widget elementor-widget-heading\" data-id=\"9ac74e5\" data-element_type=\"widget\" data-e-type=\"widget\" 
data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">As an example, as of today:<\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6236228 elementor-widget elementor-widget-text-editor\" data-id=\"6236228\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul><li>Stitch Data has a\u00a0<a href=\"https:\/\/www.stitchdata.com\/pricing\/\" target=\"_blank\" rel=\"noreferrer noopener\">pricing plan<\/a>\u00a0that starts at $100\/month for 5M modified rows a month, with a month-by-month commitment and 10 integrations sources. The next level is $180 for 10M rows.<\/li><li><a href=\"https:\/\/fivetran.com\/pricing\" target=\"_blank\" rel=\"noreferrer noopener\">Fivetran pricing<\/a>\u00a0is a bit more obscure, and you need to navigate through their\u00a0<a href=\"https:\/\/fivetran.com\/pricing\" target=\"_blank\" rel=\"noreferrer noopener\">documentation pages<\/a>\u00a0a bit to properly understand it. 
In summary, it starts in practice at $1000\u20131500 a month for 1M unique rows updated per month, but it then increases more gently for higher consumption volumes.<\/li><\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d158805 elementor-widget elementor-widget-heading\" data-id=\"d158805\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Replication frequency<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-43010ea elementor-widget elementor-widget-text-editor\" data-id=\"43010ea\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"175a\">The replication frequency that you can set up will depend both on the tool used and the source system. 
Fivetran claims a lower limit of 5 minutes; Stitch Data has a lower limit of 30 minutes for most sources, although some sources allow a lower value. In general, you need to be careful that two subsequent runs do not overlap.<\/p>\n\n<p id=\"efa4\">Your replication frequency will also affect the number of rows modified, which would particularly affect Stitch Data pricing.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0d24d82 elementor-widget elementor-widget-heading\" data-id=\"0d24d82\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Transformation capabilities<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d751893 elementor-widget elementor-widget-text-editor\" data-id=\"d751893\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"0453\">Let\u2019s remember we are talking about ingestion, not data transformation, so it\u2019s not surprising that neither of these tools has a focus on data transformation capabilities.<\/p>\n\n<p id=\"c63e\">However, Fivetran also offers integration with dbt (data build tool) and Stitch Data is part of the Talend family, traditionally a big player in the ETL domain.<\/p>\n\n<p id=\"3584\">That said, there are no obvious advantages in using their sister data transformation tools, so that part can be genuinely decoupled from ingestion.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1818940 elementor-widget elementor-widget-heading\" data-id=\"1818940\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Outcome<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b30e395 elementor-widget elementor-widget-text-editor\" data-id=\"b30e395\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"20a0\">Both Fivetran and Stitch Data are great tools for data ingestion. Which one fits the bill would depend on the actual use cases and needs.<\/p>\n\n<p id=\"0040\">Fivetran has in some way more complex replication capabilities, but this also comes with an increased cost.<\/p>\n\n<p id=\"27e5\">Stitch Data is a great entry-level tool, which can also handle replication at scale for a number of standard scenarios.<\/p>\n\n<p id=\"07ff\">Fivetran pricing is quite steep at the beginning, but can become more justifiable if you have very high ingestion volumes (eg billions unique rows a month), and can also offer more predictability due to its unique rows pricing model.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Data ingestion is a fundamental task, and everyone would like to have data ready to be analysed. 
Here is a review of two out-of-the-box tools to solve a traditionally tedious problem.<\/p>\n","protected":false},"author":1006,"featured_media":18314,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[205,942,611,1195,94,1196],"ppma_author":[3843],"class_list":["post-22533","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-data","tag-data-cleaning","tag-data-engineering","tag-data-ingestion","tag-data-science","tag-database"],"authors":[{"term_id":3843,"user_id":1006,"is_guest":0,"slug":"gianluca-gindro","display_name":"Gianluca Gindro","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/Gianluca-Gindro-150x150.jpeg","user_url":"https:\/\/www.thedatamba.com\/","last_name":"Gindro","first_name":"Gianluca","job_title":"","description":"Gianluca is Head of Data Science &amp; Analytics at Commonplace, a company with the mission to foster engagement of local communities. He is also founder and lead instructor at THE DATA MBA, an MBA-style school for Data Science and Analytics. 
He has worked in the data sector since 2010 when he set-up the data operations at Groupon and worked there as Head of Analytics."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22533","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/1006"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=22533"}],"version-history":[{"count":4,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22533\/revisions"}],"predecessor-version":[{"id":32905,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22533\/revisions\/32905"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/18314"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=22533"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=22533"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=22533"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=22533"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}