{"id":22569,"date":"2021-01-19T10:06:51","date_gmt":"2021-01-19T10:06:51","guid":{"rendered":"https:\/\/www.experfy.com\/blog\/how-to-become-data-engineer\/"},"modified":"2023-09-05T18:27:37","modified_gmt":"2023-09-05T18:27:37","slug":"how-to-become-data-engineer","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/how-to-become-data-engineer\/","title":{"rendered":"How To Become a Data Engineer"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"22569\" class=\"elementor elementor-22569\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-1ae8f89 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"1ae8f89\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-a8d816c\" data-id=\"a8d816c\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-1af5b88 elementor-widget elementor-widget-text-editor\" data-id=\"1af5b88\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>The demand for data engineers is growing rapidly. According to&nbsp;<a href=\"http:\/\/hired.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">hired.com<\/a>&nbsp;the&nbsp;<a href=\"https:\/\/hired.com\/state-of-software-engineers\" target=\"_blank\" rel=\"noreferrer noopener\">demand has increased by 45%<\/a>&nbsp;in 2019. The median salary for Data Engineers in SF Bay Area is around $160k. 
So the question is: how do you become a data engineer?<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-31f4b54 elementor-widget elementor-widget-heading\" data-id=\"31f4b54\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">What Data Engineering Is<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-88d5ba3 elementor-widget elementor-widget-text-editor\" data-id=\"88d5ba3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>As the name suggests, data engineering is closely tied to data. But whereas data analytics usually means extracting insights from existing data, data engineering means building the infrastructure to deliver, store and process that data. According to\u00a0<a href=\"https:\/\/hackernoon.com\/the-ai-hierarchy-of-needs-18f111fcc007\" target=\"_blank\" rel=\"noreferrer noopener\">The AI Hierarchy of Needs<\/a>, the data engineering process sits at the very bottom of the hierarchy: Collect, Move &amp; Store, Data Preparation. 
So if your organization wants to be data- or AI-driven, it should hire or train data engineers.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4af1216 elementor-widget elementor-widget-image\" data-id=\"4af1216\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img fetchpriority=\"high\" decoding=\"async\" width=\"500\" height=\"321\" src=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1_7IMev5xslc9FLxr9hHhpFw.png\" class=\"attachment-large size-large wp-image-18467\" alt=\"How To Become a Data Engineer\" srcset=\"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1_7IMev5xslc9FLxr9hHhpFw.png 500w, https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/1_7IMev5xslc9FLxr9hHhpFw-300x193.png 300w\" sizes=\"(max-width: 500px) 100vw, 500px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0a83569 elementor-widget elementor-widget-text-editor\" data-id=\"0a83569\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>But what do data engineers actually do? The amount of data grows rapidly every single day. We are entering an era in which anybody can create content from a mobile phone or other gadget. Even small devices are connected to the Internet. In the past, data engineers were responsible for writing complex SQL queries and building ETL (extract, transform &amp; load) processes using big enterprise tools like Informatica ETL, Pentaho ETL, Talend etc. But now the market demands a broader skillset. 
If you want to work as a data engineer, you need to have:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-dfb6e06 elementor-widget elementor-widget-text-editor\" data-id=\"dfb6e06\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul><li>Intermediate knowledge of SQL and Python<\/li><li>Experience working with cloud providers like AWS, Azure or GCP<\/li><li>Knowledge of Java\/Scala is a big plus<\/li><li>Understanding of SQL\/NoSQL databases (data modeling, data warehousing, performance optimization)<\/li><\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ba9debb elementor-widget elementor-widget-text-editor\" data-id=\"ba9debb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>The skillset is very similar to what backend engineers usually know. In practice, if an organization's data is growing, the ideal candidate to become a data engineer is a backend engineer.<\/p>\n\n<p>The particular technologies and tools can differ depending on company size, data volumes and data velocity. 
If we look at FAANG companies, for example, they usually require:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2827622 elementor-widget elementor-widget-text-editor\" data-id=\"2827622\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul><li>Knowledge of Python, Java or Scala<\/li><li>Experience working with big data tools like Apache Hadoop, Kafka and Spark<\/li><li>Solid knowledge of algorithms and data structures<\/li><li>Understanding of distributed systems<\/li><li>Experience with Business Intelligence tools like Tableau, QlikView, Looker or Superset<\/li><\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5c0fba7 elementor-widget elementor-widget-heading\" data-id=\"5c0fba7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Data Engineer's Skillset<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ef3ef6b elementor-widget elementor-widget-text-editor\" data-id=\"ef3ef6b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Data engineering is an engineering discipline, which is why knowledge of computer science fundamentals is required, especially an understanding of the most popular algorithms and data structures (Hello Mr. Cormen!).<\/p>\n\n<p>Because data engineers deal with data on a daily basis, understanding how databases work is a huge plus. 
For example, popular SQL databases such as SQLite, PostgreSQL and MySQL use B-tree data structures under the hood.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c8097e7 elementor-widget elementor-widget-heading\" data-id=\"c8097e7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Algorithms &amp; Data Structures<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ff46553 elementor-widget elementor-widget-text-editor\" data-id=\"ff46553\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>If you prefer video courses, I would recommend looking at the\u00a0<a href=\"https:\/\/www.coursera.org\/specializations\/data-structures-algorithms\" target=\"_blank\" rel=\"noreferrer noopener\">Data Structures and Algorithms Specialization<\/a>. 
I took these courses and think they are quite good as a starting point.<\/p>\n\n<p>As for talks, I highly recommend taking a look at Alex Petrov&#8217;s presentation called&nbsp;<em>What Every Programmer has to know about Database Storage:<\/em><\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f1d5a6d elementor-widget elementor-widget-video\" data-id=\"f1d5a6d\" data-element_type=\"widget\" data-e-type=\"widget\" data-settings=\"{&quot;youtube_url&quot;:&quot;https:\\\/\\\/youtu.be\\\/e1wbQPbFZdk&quot;,&quot;video_type&quot;:&quot;youtube&quot;,&quot;controls&quot;:&quot;yes&quot;}\" data-widget_type=\"video.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t<div class=\"elementor-wrapper elementor-open-inline\">\n\t\t\t<div class=\"elementor-video\"><\/div>\t\t<\/div>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7b652b2 elementor-widget elementor-widget-heading\" data-id=\"7b652b2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">Alex also has a great series of posts about databases:<\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d14ba70 elementor-widget elementor-widget-text-editor\" data-id=\"d14ba70\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul><li><a href=\"https:\/\/medium.com\/databasss\/on-disk-io-part-1-flavours-of-io-8e1ace1de017\" target=\"_blank\" rel=\"noreferrer noopener\" class=\"broken_link\">On Disk IO, Part 1: Flavors of IO<\/a><\/li><li><a href=\"https:\/\/medium.com\/databasss\/on-disk-io-part-2-more-flavours-of-io-c945db3edb13\" target=\"_blank\" rel=\"noreferrer 
noopener\" class=\"broken_link\">On Disk IO, Part 2: More Flavours of IO<\/a><\/li><li><a href=\"https:\/\/medium.com\/databasss\/on-disk-io-part-3-lsm-trees-8b2da218496f\" target=\"_blank\" rel=\"noreferrer noopener\" class=\"broken_link\">On Disk IO, Part 3: LSM Trees<\/a><\/li><li><a href=\"https:\/\/medium.com\/databasss\/on-disk-storage-part-4-b-trees-30791060741\" target=\"_blank\" rel=\"noreferrer noopener\" class=\"broken_link\">On Disk IO, Part 4: B-Trees and RUM Conjecture<\/a><\/li><li><a href=\"https:\/\/medium.com\/databasss\/on-disk-io-access-patterns-in-lsm-trees-2ba8dffc05f9\" target=\"_blank\" rel=\"noreferrer noopener\" class=\"broken_link\">On Disk IO, Part 5: Access Patterns in LSM Trees<\/a><\/li><\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-af9dd3c elementor-widget elementor-widget-text-editor\" data-id=\"af9dd3c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Courses and video presentations are good, but what about books? I would recommend just one book, by Thomas Cormen and co-authors, called\u00a0<a href=\"https:\/\/www.amazon.com\/gp\/product\/0262033844\/ref=as_li_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0262033844&amp;linkCode=as2&amp;tag=adilkhash-20&amp;linkId=74742875db503b1a899ca35159749067\" target=\"_blank\" rel=\"noreferrer noopener\">Introduction to Algorithms<\/a>. It is the most comprehensive reference on algorithms and data structures. To practice and strengthen your knowledge, go to\u00a0<a href=\"http:\/\/leetcode.com\/\" target=\"_blank\" rel=\"noreferrer noopener\" class=\"broken_link\">leetcode.com<\/a>\u00a0and start solving problems. 
Practice makes perfect.<\/p><p>For databases, Carnegie Mellon University uploads its lectures to YouTube:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e7f5f72 elementor-widget elementor-widget-text-editor\" data-id=\"e7f5f72\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul><li><a href=\"https:\/\/www.youtube.com\/watch?v=oeYBdghaIjc&amp;list=PLSE8ODhjZXjbohkNBWQs_otTrBTrjyohi\" target=\"_blank\" rel=\"noreferrer noopener\">Introduction to Database Systems (Fall 2019)<\/a><\/li><li><a href=\"https:\/\/www.youtube.com\/watch?v=SdW5RKUboKc&amp;list=PLSE8ODhjZXjasmrEd2_Yi1deeE360zv5O\" target=\"_blank\" rel=\"noreferrer noopener\">Advanced Database Systems (Spring 2020)<\/a><\/li><\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-58ad71e elementor-widget elementor-widget-heading\" data-id=\"58ad71e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">SQL \u2014 the lingua franca for databases<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1b5e90b elementor-widget elementor-widget-text-editor\" data-id=\"1b5e90b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>SQL was developed back in the 1970s and is still the most popular language for working with data. Periodically some experts claim that SQL is going to die very soon, but it is still alive despite the rumors. I think we will stick with SQL for another decade or two (or even more). 
If you look at the modern and popular databases, you will see that almost all of them support SQL:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-834650e elementor-widget elementor-widget-text-editor\" data-id=\"834650e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul><li>PostgreSQL, MySQL, MS SQL Server, Oracle DB<\/li><li>Amazon Redshift, Apache Druid, Yandex ClickHouse<\/li><li>HP Vertica, Greenplum<\/li><\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0271d30 elementor-widget elementor-widget-text-editor\" data-id=\"0271d30\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>In the big data ecosystem there are many different SQL engines: Presto (Trino), Hive, Impala etc. I highly recommend investing some time in mastering SQL.<\/p>\n\n<p>If you are new to SQL, start with Mode&#8217;s SQL guide:\u00a0<a href=\"https:\/\/mode.com\/sql-tutorial\/introduction-to-sql\/\" target=\"_blank\" rel=\"noreferrer noopener\">Introduction to SQL<\/a>. Once you feel comfortable, you can continue with\u00a0<a href=\"https:\/\/bit.ly\/2XrRH4i\" target=\"_blank\" rel=\"noreferrer noopener\" class=\"broken_link\">DataCamp&#8217;s interactive courses<\/a>. 
I would recommend these:<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e96cc6e elementor-widget elementor-widget-text-editor\" data-id=\"e96cc6e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul><li><a href=\"https:\/\/www.datacamp.com\/courses\/intermediate-sql?tap_a=5644-dce66f&amp;tap_s=1331588-6fc352&amp;utm_medium=affiliate&amp;utm_source=adylzhankhashtamov\" target=\"_blank\" rel=\"noreferrer noopener\" class=\"broken_link\">Intermediate SQL<\/a><\/li><li><a href=\"https:\/\/www.datacamp.com\/courses\/joining-data-in-postgresql?tap_a=5644-dce66f&amp;tap_s=1331588-6fc352&amp;utm_medium=affiliate&amp;utm_source=adylzhankhashtamov\" target=\"_blank\" rel=\"noreferrer noopener\" class=\"broken_link\">Joining Data in SQL<\/a><\/li><li><a href=\"https:\/\/datacamp.com\/courses\/postgresql-summary-stats-and-window-functions?tap_a=5644-dce66f&amp;tap_s=1331588-6fc352&amp;utm_medium=affiliate&amp;utm_source=adylzhankhashtamov\" target=\"_blank\" rel=\"noreferrer noopener\" class=\"broken_link\">PostgreSQL Summary Stats and Window Functions<\/a><\/li><\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8b5a591 elementor-widget elementor-widget-text-editor\" data-id=\"8b5a591\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>The best resources on SQL are\u00a0<a href=\"https:\/\/modern-sql.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Modern SQL<\/a>\u00a0and\u00a0<a href=\"https:\/\/use-the-index-luke.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Use The Index, Luke!<\/a>\u00a0Practice makes perfect, so go to the\u00a0<a href=\"https:\/\/leetcode.com\/problemset\/database\/\" target=\"_blank\" 
rel=\"noreferrer noopener\" class=\"broken_link\">Leetcode Databases problemset<\/a>\u00a0and start practicing. By the way, do not forget to read my article on\u00a0<a href=\"https:\/\/khashtamov.com\/en\/sql-window-functions\/\" target=\"_blank\" rel=\"noreferrer noopener\">SQL window functions<\/a>\u00a0as well.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-844a3f3 elementor-widget elementor-widget-heading\" data-id=\"844a3f3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">Programming: Python, Java and Scala<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c3680f7 elementor-widget elementor-widget-text-editor\" data-id=\"c3680f7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Python is a very popular programming language for building web apps as well as for data analytics &amp; data science. It has a very rich ecosystem and a huge community. 
According to the\u00a0<a href=\"https:\/\/www.tiobe.com\/tiobe-index\/\" target=\"_blank\" rel=\"noreferrer noopener\">TIOBE Index<\/a>, Python was among the top three most widely used programming languages at the time of writing, after C &amp; Java.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b184a66 elementor-widget elementor-widget-heading\" data-id=\"b184a66\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">As for the other two languages, many big data systems are written in Java or Scala:<\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3e18b28 elementor-widget elementor-widget-text-editor\" data-id=\"3e18b28\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul><li>Apache Kafka (Scala)<\/li><li>Hadoop HDFS (Java)<\/li><li>Apache Spark (Scala)<\/li><li>Apache Cassandra (Java)<\/li><li>HBase (Java)<\/li><li>Apache Hive, Presto (Java)<\/li><\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b4f8ae9 elementor-widget elementor-widget-text-editor\" data-id=\"b4f8ae9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>In order to understand how these systems work, I would recommend learning the language in which they are written. 
The biggest concern with <a href=\"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/rewiring-your-brain-from-python-to-java\/\" target=\"_blank\" rel=\"noreferrer noopener\">Python <\/a>is its relatively poor runtime performance, so knowledge of a more efficient language will be a big plus for your skillset.<\/p>\n<p>If you are interested in Scala, I would recommend taking a look at\u00a0<a href=\"https:\/\/twitter.github.io\/scala_school\/\" target=\"_blank\" rel=\"noreferrer noopener\">Twitter&#8217;s Scala School<\/a>. The book\u00a0<a href=\"https:\/\/www.amazon.com\/gp\/product\/0981531687\/ref=as_li_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=0981531687&amp;linkCode=as2&amp;tag=adilkhash-20&amp;linkId=69314619f1e546e3921eb97c72cf4850\" target=\"_blank\" rel=\"noreferrer noopener\">Programming in Scala<\/a>\u00a0by its creator is also a good starting point.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1e15acd elementor-widget elementor-widget-heading\" data-id=\"1e15acd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">The Big Data Tools<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2dc72e3 elementor-widget elementor-widget-heading\" data-id=\"2dc72e3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\">There are lots of different technologies in the Big Data landscape. 
The most popular are:<\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-26b94e1 elementor-widget elementor-widget-text-editor\" data-id=\"26b94e1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul><li><strong>Apache Kafka<\/strong>&nbsp;is the leading message queue\/event bus\/event streaming platform<\/li><li><strong>Apache Spark<\/strong>&nbsp;is the unified analytics engine for large-scale data processing<\/li><li><strong>Apache Hadoop<\/strong>&nbsp;is a big data framework consisting of different tools, libraries and frameworks, including a distributed file system (HDFS), Apache Hive, HBase etc.<\/li><li><strong>Apache Druid<\/strong>&nbsp;is a real-time analytics database<\/li><\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-13bbc6d elementor-widget elementor-widget-text-editor\" data-id=\"13bbc6d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>It is really difficult to learn everything, so focus on the most popular tools and learn the fundamental concepts behind them. 
For example, back in 2013 Jay Kreps (co-creator of Apache Kafka) wrote an article called\u00a0<a href=\"https:\/\/engineering.linkedin.com\/distributed-systems\/log-what-every-software-engineer-should-know-about-real-time-datas-unifying\" target=\"_blank\" rel=\"noreferrer noopener\">The Log: What every software engineer should know about real-time data&#8217;s unifying abstraction<\/a>.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-db47ae9 elementor-widget elementor-widget-heading\" data-id=\"db47ae9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">Cloud Platforms<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8f7de54 elementor-widget elementor-widget-text-editor\" data-id=\"8f7de54\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Everything is moving to the cloud. You should have experience working with at least one cloud provider. I would recommend going with&nbsp;<strong>Amazon Web Services<\/strong>, which is the leading cloud provider in the world. Second place goes to&nbsp;<strong>Microsoft Azure<\/strong>, and third place to&nbsp;<strong>Google Cloud Platform<\/strong>.<\/p>\n\n<p>All providers offer certifications. For example, the most suitable certification for a data engineer on AWS is\u00a0<a href=\"https:\/\/aws.amazon.com\/ru\/certification\/certified-data-analytics-specialty\/\" target=\"_blank\" rel=\"noreferrer noopener\">AWS Certified Data Analytics \u2013 Specialty<\/a>. 
If you decide to proceed with GCP, the right choice is\u00a0<a href=\"https:\/\/cloud.google.com\/certification\/data-engineer\" target=\"_blank\" rel=\"noreferrer noopener\">Professional Data Engineer<\/a>; for Microsoft Azure it is\u00a0<a href=\"https:\/\/docs.microsoft.com\/en-us\/learn\/certifications\/azure-data-engineer\" target=\"_blank\" rel=\"noreferrer noopener\">Azure Data Engineer Associate<\/a>.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5ecee4f elementor-widget elementor-widget-heading\" data-id=\"5ecee4f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">Fundamentals of Distributed Systems<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f777852 elementor-widget elementor-widget-text-editor\" data-id=\"f777852\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>The amount of data generated nowadays is tremendous. It cannot fit on a single computer, so it has to be distributed across multiple nodes. If you want to be a good data engineer, you have to understand the fundamentals of distributed systems. 
There are lots of resources where you can start your journey into this field:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8998a8c elementor-widget elementor-widget-text-editor\" data-id=\"8998a8c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul><li><a href=\"https:\/\/www.youtube.com\/playlist?list=PLrw6a1wE39_tb2fErI4-WkMbsvGQk9_UB\" target=\"_blank\" rel=\"noreferrer noopener\" class=\"broken_link\">Distributed Systems<\/a>\u00a0lectures from MIT by Robert Morris<\/li><li><a href=\"https:\/\/www.youtube.com\/playlist?list=PLeKd45zvjcDFUEv_ohr_HdUFe97RItdiB\" target=\"_blank\" rel=\"noreferrer noopener\" class=\"broken_link\">Distributed Systems<\/a>\u00a0lectures by Martin Kleppmann<\/li><li><a href=\"https:\/\/www.youtube.com\/playlist?list=PLNPUF5QyWU8O0Wd8QDh9KaM1ggsxspJ31\" target=\"_blank\" rel=\"noreferrer noopener\" class=\"broken_link\">Distributed Systems<\/a>\u00a0by Lindsey Kuper<\/li><\/ul>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7828bf2 elementor-widget elementor-widget-text-editor\" data-id=\"7828bf2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>I would also highly recommend the book\u00a0<a href=\"https:\/\/www.amazon.com\/gp\/product\/1449373321\/ref=as_li_tl?ie=UTF8&amp;camp=1789&amp;creative=9325&amp;creativeASIN=1449373321&amp;linkCode=as2&amp;tag=adilkhash-20&amp;linkId=e7e0e096aa5761066245eb90965ac849\" target=\"_blank\" rel=\"noreferrer noopener\">Designing Data-Intensive Applications<\/a>\u00a0by Martin Kleppmann. He has a\u00a0<a href=\"https:\/\/martin.kleppmann.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">blog<\/a>. 
If you prefer blogs, I would also recommend taking a look at the\u00a0<a href=\"https:\/\/medium.com\/baseds\" target=\"_blank\" rel=\"noreferrer noopener\" class=\"broken_link\">series of posts<\/a>\u00a0about distributed systems by Vaidehi Joshi.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-413cddb elementor-widget elementor-widget-heading\" data-id=\"413cddb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\">Data Pipelines<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-37a4134 elementor-widget elementor-widget-text-editor\" data-id=\"37a4134\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>Building a data pipeline is one of the main responsibilities of a data engineer. A data pipeline is a process of data consolidation. A data engineer should be able to reliably deliver, load and transform data from multiple sources into a specific destination, usually a central data warehouse or data lake. There are many tools that can help you build this process. Take a look at\u00a0<a href=\"https:\/\/khashtamov.com\/en\/introduction-to-apache-airflow\/\" target=\"_blank\" rel=\"noreferrer noopener\">Apache Airflow<\/a>,\u00a0<a href=\"https:\/\/github.com\/spotify\/luigi\" target=\"_blank\" rel=\"noreferrer noopener\">Luigi<\/a>\u00a0from Spotify,\u00a0<a href=\"https:\/\/www.prefect.io\/\" target=\"_blank\" rel=\"noreferrer noopener\">Prefect<\/a>\u00a0or\u00a0<a href=\"https:\/\/dagster.io\/\" target=\"_blank\" rel=\"noreferrer noopener\">Dagster<\/a>. 
If you prefer a no-code solution,\u00a0<a href=\"https:\/\/nifi.apache.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">Apache NiFi<\/a>\u00a0is the way to go.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5acae6e elementor-widget elementor-widget-heading\" data-id=\"5acae6e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Summary<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-be1a92c elementor-widget elementor-widget-text-editor\" data-id=\"be1a92c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p>A data engineer is a team player who works with data analysts, data scientists, infrastructure engineers and other stakeholders. So do not forget about soft skills like empathy, understanding of the business domain, open-mindedness etc.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>A data engineer is a team player who works with data analysts, data scientists, infrastructure engineers and other stakeholders. 
So do not forget about soft skills like empathy, understanding a business domain, open-mindedness etc.<\/p>\n","protected":false},"author":1025,"featured_media":18468,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[1260,1261,1262,408],"ppma_author":[3684],"class_list":["post-22569","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-data-engineer","tag-data-pipelines","tag-distributed-systems","tag-programming"],"authors":[{"term_id":3684,"user_id":1025,"is_guest":0,"slug":"adylzhan-khashtamov","display_name":"Adylzhan Khashtamov","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2021\/05\/Adylzhan-Khashtamov-150x150.jpeg","user_url":"https:\/\/khashtamov.com\/en\/","last_name":"Khashtamov","first_name":"Adylzhan","job_title":"","description":"Adylzhan Khashtamov, a Software engineer, is Product Owner at Playrix, and technopreneur, writer and content 
creator."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22569","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/1025"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=22569"}],"version-history":[{"count":4,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22569\/revisions"}],"predecessor-version":[{"id":32444,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/22569\/revisions\/32444"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/18468"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=22569"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=22569"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=22569"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=22569"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}