
Data Preprocessing For Non-Techies: Basic Terms And Definitions, Part One — Data Structures, Types And Values

by Melody Ucros
March 27, 2018
in Big Data & Cloud
7 min read

Ready to learn Data Science? Browse courses like Data Science Training and Certification developed by industry thought leaders and Experfy in Harvard Innovation Lab.

If you are getting started in your data science journey and don’t come from a technical background (like me), then you definitely understand the struggle of keeping up with the terminology of data pre-processing.

For the past few months, I've tried to really understand all the terms and transformation strategies my professors use, yet their descriptions, and the impact each one has on my models, still weren't entirely clear to me.

This was obviously a concern, considering that data scientists spend about 60% of their time cleaning and organizing data!

Source: Forbes

Therefore, I’ve decided to dive deeper into the topic of data pre-processing, outline the basics, and share it with all of you.

This is the FIRST article, so we will only focus on key terms. Make sure to follow me in order to catch the next posts, which will focus on feature engineering, model selection, etc.

Keep in mind that some of these terms differ depending on the language or platform you are using, but I hope this gives you a nice overview.

Basics of Data Structure:

Data objects: an instance or observation containing a set of characteristics. For example, every person (row) in the table.

Attributes: characteristics of an object. Also called features, variables, fields, or dimensions. For example, the marital status (column) per person (row).

Record: data that consists of a collection of objects, each of which has a fixed set of attributes. For example, the table above. Records aren't the only type of data set, but they are the most common, so we will focus on them for now.

Vector: a collection of values for one attribute, for example 'Single, Married, Divorced' for Marital Status. All values should be of the same data type.

Matrix: in general terms, the same as a table, but more flexible. The key is that a matrix is composed entirely of data of the same type, so you can apply algebraic functions to it.

Frame: a frame can be seen as a "snapshot" of a table. It is usually used in R to reduce the size of the table we are working with, or to create a new format. We use data frames instead of matrices or vectors when columns are of different data types (numeric/character/logical, etc.).

Source: Wellesley Research Guides
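To make these structures concrete, here is a minimal sketch in Python with pandas and NumPy (the article itself references R; the column names and values below are made up purely for illustration):

```python
import numpy as np
import pandas as pd

# A record: a collection of data objects (rows) sharing a fixed set of attributes (columns).
people = pd.DataFrame({
    "name": ["Ana", "Ben", "Cleo"],                       # attribute / feature
    "age": [34, 28, 45],                                   # numeric attribute
    "marital_status": ["Single", "Married", "Divorced"],   # categorical attribute
})

row = people.iloc[0]                  # a data object: one observation (one person)
vector = people["marital_status"]     # a vector: one attribute's values, all of one type
print(row["name"], list(vector))

# Matrix: all values share one type, so algebraic functions apply element-wise.
matrix = people[["age"]].to_numpy()   # a 3x1 numeric matrix
print(matrix * 2)

# Frame: columns may hold different types (numeric / character / logical, ...).
print(people.dtypes)
```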

Types of Records:

  • tabular: flat; rows represent instances and columns represent features. It could look like the table above, but also like the table below.

  • transactional: rows represent transactions, so if a customer makes multiple purchases, for example, each purchase is a separate record, with the associated items linked by a customer ID (see the sketch after this list).
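A small sketch of the difference, again in pandas; the customers, items, and amounts are invented purely to show the two layouts:

```python
import pandas as pd

# Tabular (flat) record: one row per instance, one column per feature.
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "country": ["US", "DE"],
})

# Transactional record: one row per transaction; repeat purchases by the same
# customer appear as separate rows, linked back to the customer by customer_id.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "item": ["bread", "milk", "coffee"],
    "amount": [2.50, 1.20, 4.00],
})

# Joining the two views through the shared key recovers a flat, tabular layout.
print(transactions.merge(customers, on="customer_id"))
```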

Basics on Attributes:

There are four levels of data that may be gathered, each one adding more information than the one before (if gathered correctly); thus ordinal data is also nominal, and so on. A useful acronym to remember them is NOIR (nominal, ordinal, interval, ratio; French for 'black').

Type of Data:

  • nominal (qualitative): used to "name" or label data (categorical, unordered).

  • ordinal (qualitative): provides useful information about the order of choices, such as in a customer satisfaction survey (categorical, ordered).

  • interval (quantitative): gives us the order of values plus the ability to quantify the difference between each one; usually used with continuous data.

  • ratio/scale (quantitative): gives us the most information: order, interval values, plus the ability to calculate ratios, since a "true zero" can be defined (see the sketch after this list).
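Here is one way the four NOIR levels can be represented in pandas. The survey answers and temperatures are made-up examples, and ordered categoricals are just one reasonable encoding for ordinal data:

```python
import pandas as pd

# Nominal: labels with no inherent order.
color = pd.Series(["red", "blue", "red"], dtype="category")
print(color.cat.categories)

# Ordinal: labels with a meaningful order (e.g. a satisfaction survey).
satisfaction = pd.Series(pd.Categorical(
    ["low", "high", "medium"],
    categories=["low", "medium", "high"], ordered=True))
print(satisfaction.min(), satisfaction.max())   # order is defined, but differences are not

# Interval: differences are meaningful, but there is no true zero (e.g. degrees Celsius).
temps_c = pd.Series([10.0, 20.0, 30.0])
print(temps_c.diff())          # 20 - 10 = 10 degrees is meaningful; "twice as hot" is not

# Ratio: a true zero exists, so ratios are meaningful (e.g. income).
income = pd.Series([30000, 60000])
print(income.iloc[1] / income.iloc[0])   # 2.0: one value really is twice the other
```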

— — — — Side Note #1:

Interval and ratio data are parametric, used with parametric tools in which distributions are predictable (and often Normal).

Nominal and ordinal data are non-parametric, and do not assume any particular distribution. They are used with non-parametric tools such as the histogram.

Qualitative data is commonly summarized using percentages/proportions, while numeric data is summarized using averages/means.
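As a quick illustration of that last point (the column names and numbers below are invented):

```python
import pandas as pd

survey = pd.DataFrame({
    "satisfaction": ["high", "low", "high", "medium", "high"],  # qualitative
    "spend": [12.5, 3.0, 9.9, 7.5, 11.1],                       # numeric
})

# Qualitative data: summarize with proportions / percentages.
print(survey["satisfaction"].value_counts(normalize=True))

# Numeric data: summarize with an average (or a median for skewed distributions).
print(survey["spend"].mean())
```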

— — — — end.

Sub-set of Data Types:

  • binary/dichotomous (qualitative): a type of categorical data with only two categories; it can describe either nominal or ordinal data, e.g. M vs. F (or 0 vs. 1 when converted into dummy variables).
  • discrete (numeric): gaps between possible values, e.g. the number of students.
  • continuous (numeric): no gaps between possible values, e.g. temperature. (See the sketch after this list.)
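A small sketch of these three sub-types, including the dummy-variable conversion mentioned above. The data frame is made up, and `pd.get_dummies` is just one common way to create dummy variables:

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["M", "F", "F", "M"],                  # binary/dichotomous: two categories
    "num_students": [18, 25, 22, 30],             # discrete: gaps between possible values
    "temperature": [21.3, 19.8, 23.1, 20.5],      # continuous: no gaps between values
})

# Binary categories are often converted into 0/1 dummy variables for modeling.
dummies = pd.get_dummies(df["sex"], drop_first=True, dtype=int)  # one column, "M": 1 vs 0
print(dummies)
```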

Common Types of Values:

  • decimal: numeric values with digits to the right of the decimal point; you should specify precision and scale (see source: SQL Data Types).
  • integer: accepts numeric values with an implied scale of zero. It stores any whole number between -2³¹ and 2³¹ - 1.
  • boolean: accepts the storage of two values: TRUE or FALSE.
  • date/time/timestamp: accepts values based on the format specified.
  • string: basically a "word" made up of characters. Sometimes you need to convert an integer into a string in order to treat it as non-numeric (see the sketch after this list).
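The same value types can be seen in a pandas data frame; the columns below are invented, and the last line shows the integer-to-string conversion mentioned above, here for a code-like column that should never be averaged:

```python
import pandas as pd

df = pd.DataFrame({
    "price": [19.99, 4.50],                      # decimal-style values (stored as floats)
    "quantity": [3, 7],                          # integer
    "in_stock": [True, False],                   # boolean
    "order_date": pd.to_datetime(                # date/time, parsed from a given format
        ["2018-03-01", "2018-03-15"]),
    "zip_code": [2138, 1581],                    # numeric digits, but not really a number
})

# Convert the integer into a string to treat it as non-numeric.
df["zip_code"] = df["zip_code"].astype(str)
print(df.dtypes)
```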

— — — — Side Note #2:

When the precision provided by decimal (up to 38 digits) is insufficient, use the float or real types of values.

  • FLOAT[(n)]: used to store single-precision and double-precision floating-point numbers.
  • REAL: A single-precision floating-point number.
  • DOUBLE [PRECISION]: A double-precision floating-point number.

A single-precision floating-point number is a 32-bit approximation of a real number. The number can be zero or can range from -3.402E+38 to -1.175E-37, or from 1.175E-37 to 3.402E+38. The range of n is 1 to 24. IBM DB2 internally represents the single-precision FLOAT data type as the REAL data type.

A double-precision floating-point number is a 64-bit approximation of a real number. The number can be zero or can range from -1.79769E+308 to -2.225E-307, or from 2.225E-307 to 1.79769E+308. The range of n is 25 to 53. IBM DB2 internally represents the double-precision FLOAT data type as the DOUBLE [PRECISION] data type.

If n is not specified the default value is 53.
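To see what single vs. double precision means in practice, here is a minimal sketch with NumPy (32-bit and 64-bit floats, roughly analogous to REAL and DOUBLE PRECISION; the input number is arbitrary):

```python
import numpy as np

x = 0.1234567890123456789

# Single precision: a 32-bit approximation, roughly 7 significant decimal digits.
single = np.float32(x)

# Double precision: a 64-bit approximation, roughly 15-16 significant decimal digits.
double = np.float64(x)

print(single)   # 0.12345679          (precision lost after ~7 digits)
print(double)   # 0.12345678901234568
```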

— — — — end.

Key Takeaway: You can have, for example, a record with an attribute that is quantitative, continuous, measured on a ratio scale, stored as a decimal value, and that you might need to declare as REAL or FLOAT to increase precision. Each of these descriptors will determine how you clean, model and test the data.

Tags: Data Science

Melody Ucros

Melody Ucros, an entrepreneurial techie with an interest in sales strategy and digital transformations, is Director of Operations at Fundie Ventures, an impact investment consultancy startup. A full-time student in IE Business School's Master in Big Data program, she is passionate about helping start-ups, playing with data, and exchanging knowledge with impact-makers around the world.
