facebook-pixel
  • Big Data & Technology
  • Ezz El Din Abdullah
  • AUG 19, 2019

SQL Case Study: Helping a Startup CEO Manage His Data

You may have heard that if you know English, you can mostly write queries with SQL. This is almost right, because SQL is very intuitive, and it is really like plain English.

In this tutorial, you will learn how to create a table, insert values into it, use and understand some data types, use SELECT statements, UPDATE records, use some aggregate functions like COUNT(), MIN(), SUM(), use WHERE clause, remove a character from a string, cast a string to numeric values, write subquery, retrieve the top and bottom candidates, add a column, do a for loop, and make a histogram.

Image

Though, if you love videos more than articles while learning, I have published this article as a practical course at “Intuitive SQL Database Case Study”. This course includes all the snippets of codes written here and more in an interactive way on a SQL shell.

All the previous tasks can be done by showing a case study to help you find some guidance to play with data in a practical way. The best thing to do while reading this tutorial is to write the codes I write and please do yourself a favor and make your hands dirty especially if you are a beginner because this is what makes you brilliant in this language (actually true for almost anything you learn). If you know some SQL, I hope you find this article useful as well and see some other things might add to your knowledge.

I’m working with PostgreSQL which is a general-purpose database management system, but you can still do all the codes for MySQL, Oracle, DB2, etc. If you just happen to find that a piece of code does not work, please ask google about the syntax of the database management system you use.

Let’s say, we would like to have this table:

Id name age city salary
1 Wael 23 Cairo 800$
2 Ragab 24 Giza 900$
3 Ali 24 Giza 1500$
4 Othman 25 Giza 2000$
5 Waleed 26 Cairo 2100$
6 Abdo 23 Alex 900$
7 Moetaz 22 Cairo 800$

 
 
Creating a table:

We can create such a table using the following query:

CREATE TABLE startup(
id int NOT NULL,
name varchar(10) NOT NULL,
age int CHECK(age > 0),
city varchar(10),
salary varchar(5),
PRIMARY KEY(id));

 

Let’s see in more detail what happened for what we wrote. We created a table using create(). The table is called startup and it includes the id city with the data type integer with no possible NULL values; we’ll see why later. The second column is ‘name’ whose data type is a varchar(10) which is a variable-length character of 10 characters i.e. ‘Ali’ will take 3 character-size in memory while ‘Othman’ will take 6 character-size in memory. This column can also be stored as a char which is a fixed-length character of 10 but in this case ‘Ali’ will take 10 character-size in memory while ‘Othman’ will take also 10 character-size in memory. The ‘name’ column can also be stored as a text which is not very wise for storage because text can take up to 2GB of text data in memory. When writing queries, you should put storage into consideration because your table can contain large amounts of data and retrieving them may not be as fast as you desire.

Back to the query, the ‘name’ column cannot be NULL (i.e. should contain a name not nothing). The ‘age’ column contains integer values and only the positive numbers are allowed. The ‘city’ and ‘salary’ contain variable-length character of 10 and 5 characters respectively. Finally, the ‘id’ column contain unique values (a primary key). This means it cannot contain duplicate values (integers in our case). This syntax could be done when defining id as int, so it can be done as such ‘id int primary key not null’.

Notice that SQL is case insensitive, so any key words from the above query can be lower case, but it’s a good practice to make them in upper case.

 
Retrieving the data:

Now, we can retrieve what we have so far through:

SELECT *
FROM startup;

 

The SELECT clause is the most used key word in SQL. This query simply can retrieve all the data from the startup table. As we can use from the result, it includes all the columns we created with 0 rows in it.

 
Inserting values into a table:

Let’s fill the first row in this table:

INSERT INTO startup
VALUES(1, 'Wael', 23, 'Cairo', '800$');

 

Now the first row is done:

Id name age city salary
1 Wael 23 Cairo 800$

 

Let’s write the previous query again and see what happens:

INSERT INTO startup
VALUES(1, 'Wael', 23, 'Cairo', '800$');

 

We see an error telling us:

ERROR:  duplicate key value violates unique constraint "startup_pkey"
DETAIL:  Key (id)=(1) already exists.

 

That’s because the id is a primary key and can only contain unique values not duplicates like the id = 1.

If you made a mistake and enter a negative number for the age like this:

INSERT INTO startup
VALUES(1, 'Wael', -23, 'Cairo', '800$');

 

You will get an error:

ERROR:  new row for relation "startup" violates check constraint "startup_age_check"
DETAIL:  Failing row contains (1, 'Wael', -23, 'Cairo', '800$').

 

That’s because you already made a constraint with a CHECK clause to include only the positive numbers.

Let’s do another query to form the second row:

INSERT INTO startup
VALUES(2, 'Ragab', 24, 'Giza', '900$');

 

Let’s see what will happen if I do the following to fill the third row:

INSERT INTO startup
VALUES(3, Ali, 24, 'Giza', '1500$');

 

There is an error telling us:

ERROR:  column "ali" does not exist

 

That’s because the column ‘name’ which contains ‘Ali’ is not appearing as a string so we need to put it between single quotes.

When forming the fourth row:

INSERT INTO startup
VALUES(4, 'Othman', 25, 'Giza', '2000$');

 

Let’s fill the rest of the columns:

INSERT INTO startup
VALUES(5, 'Waleed', 26, 'Cairo', '2100$');
INSERT INTO startup
VALUES(6, 'Abdo', 23, 'Alex', '900$');
INSERT INTO startup
VALUES(7, 'Moetaz', 22, 'Cairo', '2000$');

 

In the last row, I made a mistake and put the salary 2000$ not 800$. We can fix this by an updating query:

UPDATE startup
SET salary = '800$'
WHERE id = 7;

 

The UPDATE clause is used followed by the name of the table setting the salary to 800$ for a condition where we do know that it defines the row of interest like the id = 7.

 
Inserting into specific columns:

Let’s try to add another row with specific columns; id and salary values:

INSERT INTO startup(
id, salary)
VALUES(8, '1000$');

 

This will introduce an error telling us:

ERROR:  null value in column "name" violates not-null constraint
DETAIL:  Failing row contains (8, null, null, null, 1000$).

 

This means you must fill the column ‘name’ because we defined it as NOT NULL which means we can NOT let it empty; it must take value.

So, we can fix this by adding the name column to the query:

INSERT INTO startup(
id, name, salary)
VALUES(8, 'Mohammed', '1000$');

 

How many rows do exist in a table?

We would like to know how many persons in this startup. We do that by counting the number of rows using count(*):

SELECT COUNT(*)
FROM startup;

 

 How many rows with minimum values in a column?

We would like to know which persons in the startup have the lowest salary and how many of them. Let’s try this query:

SELECT *
FROM startup
WHERE salary = MIN(salary);

 

This will introduce an error which says:

ERROR:  aggregate functions are not allowed in WHERE

 

Aggregate functions are like COUNT(), MIN(), MAX(), AVG(), SUM(), etc. which take values in a column as an input and returns a single value (or NULL) . Here MIN() is used after WHERE clause, so we can do that by checking the salary if it equals the minimum value (or not) which we can get from another query, not by aggregating it, like this:

SELECT MIN(salary)
FROM startup;

 

But this will produce illogical value which is 1000$, so what happened?!

This is done because of the collation sequence, if you are interested in the ASCII collating orders you can see this link.

So, what should we do now?

Actually, I did something not efficient at the beginning of creating the table which is storing the column salary as varchar. The reason exists at the RidFilter‘s answer of this stackoverflow question.

But believe it or not, I dealt with some data like that. It includes dollar sign stored as varachar so let’s fix it to be able to do some operations on it.

 
Removing a character from a string:

This problem can be solved by first removing the dollar sign and then converting this varchar to numeric value which can be integer.

SELECT REPLACE(salary, '$', '')
FROM startup;

 

This REPLACE() function will result in the values in salary column without the dollar sign (i.e. replacing the $ by nothing), but beware that this is not edited into the table. So, we need to use it next time we need to operate on it. Also, beware that this column is still string not numeric, so we need to cast it into decimal.

Now, we can apply the MIN() on the casted replaced values of the salaries:

SELECT MIN(
CAST(
REPLACE(salary, '$', '')
as DECIMAL))
FROM startup;

 

The CAST() function converts the new salary column into decimal values.

Remember, we still see the salary with the dollar sign in the startup table.

To make it available in the table without the dollar sign, we use the UPDATE() function:

UPDATE startup
SET salary = REPLACE(salary, '$', '');

 

We, now, return to our problem which is finding out the persons inside the startup who have the lowest salary.

SELECT *
FROM startup
WHERE CAST(salary AS DECIMAL) = 800;

 

We could use just one query with a subquery instead of the last two separate queries:

SELECT *
FROM startup
WHERE CAST(
salary AS DECIMAL) = (
SELECT MIN( CAST(salary AS DECIMAL) ) FROM startup );

 

We can also count them using COUNT(*):

SELECT COUNT(*)
FROM startup
WHERE CAST(salary AS DECIMAL) = 800;

 

 The three lowest paid:

We would like to get the three lowest paid engineers in the startup. We can do that by first querying all the rows using ORDER BY clause followed by ASC or just ORDER BY which will order the output ascendingly by default.

SELECT *
FROM startup
ORDER BY
CAST( salary AS DECIMAL );

 

Adding to ‘LIMIT 3’ will bring the first three corresponding to the rows with the lowest salaries.

 
The three highest paid:

Notice what we changed!

SELECT *
FROM startup
ORDER BY
CAST( salary AS DECIMAL ) DESC
LIMIT 3;

 

 Total salaries:

Let’s say the CEO wants to know the total cost of the salaries, so we can do that using the aggregate function SUM():

SELECT SUM(
CAST( salary AS DECIMAL )
) FROM startup;

 

 Adding a column to a table:

Let’s say, he wants now to hire females, so he would add another column named sex. This can be done using ALTER clause:

ALTER TABLE
startup
ADD sex char(1);

 

We should update each row like that:

UPDATE startup
SET sex = 'M'
WHERE id = 1;

 

 Using for loop:

Of course, this is a cumbersome if we do it manually. That’s why we should use loops instead.

SQL does not have loops, but it can only be used inside a procedural language function or a ‘Do’ statement as answered here:

DO
$do$
BEGIN
FOR i IN 2..8 LOOP
UPDATE startup2 set sex = 'M' WHERE id = i;
END LOOP;
END
$do$;

 

(Kind of) histogram:

One of the possible requirements is to know the frequency of something occurrence. We can get the frequency of the city among the engineers in the startup by counting each row occurrence for each city value. That’s why we use GROUPY BY clause:

SELECT city, COUNT(*)
FROM startup
GROUP BY city;

 

This is like histogram; it shows us how frequent value occurs.

We can name any column if we follow it by AS:

SELECT city, COUNT(*) AS frequency
FROM startup
GROUP BY city;

 
I hope you find this article useful. 

The Harvard Innovation Lab

Made in Boston @

The Harvard Innovation Lab

350

Matching Providers

Matching providers 2
comments powered by Disqus.