SQL

1 Selecting and Retrieving Data with SQL

1.1 What is SQL?

Structured Query Language (SQL): standard computer language for data management and data manipulation. It is the way you can communicate with databases. It is a non-procedural language, meaning you cannot write complete applications with it however you will be able to read, write, and update data.

Just because SQL is the language you are using, there are differences in the syntax depending on the database management system (DBMS) you are using.

1.2 Data Models

Databases: a container to store organized data; a set of related information.

Tables: a structured list of data or a specific type.

Columns: a single field in a table
Rows: a record in a table

What is data modelling? Organizes and structures information into multiple, related tables. Can represent a business process or show relationships between business processes. Data models should closely represent the real world processes

NoSQL (not only SQL): a mechanism for storage and retrieval of unstructured data modelled by means other than tabular relations in relational databases.

All relational models have an entity, attribute, and relationship.

Entity: person, place, thing, or event. Distinguishable, unique and distinct.
Attribute: characterstic of an entity.
Relationship: describes association among entities (e.g., one-to-many, many-to-many, one-to-one)

ER diagrams depict the relationships between tables. Helps represent business processes. Joined by primary-foreign keys.

Primary keys: a column (or set of columns) whose values uniquely identify every row in a table.
Foriegn key: one or more columns that can be used together to identify a single row in another table.
Different types of notation (Chen Notation, Crow’s Foot Notation, UML Class Diagram Notation)

1.3 Retrieving data with a SELECT statement

The select statement you need to specific what you want and where you want it from.

SELECT item1,
        item2,
        item3
FROM tableName;

Selects items 1-3 from tableName.

SELECT *
FROM tableName;

Selects all items from tableName.

A lot of times you want to limit results (e.g., when the database is large). To select a sample from the database.

SELECT columnsYouWantToSee
FROM tableName
LIMIT 5;

Will show you the first 5 columns you want to see from tableName.

*Note: LIMITS have different syntaxes across DBMS.

1.4 Creating tables

Creating tables is useful if you want to extract data from other sources.

CREATE TABLE tableName (
    Id   char(10)    PRIMARY KEY,
    Variable1   char(10)    NOT NULL,
    Variable2   char(250)   NOT NULL,
    Variable3   decimal(8,2)  NOT NULL,
    Variable4   Varchar(750)    NULL
);

1.4.1 Nulls and primary keys

Every column is either NULL or NOT NULL. You will receive an error if you omit this with no value. *Note: Do not confuse null values with empty strings**. Primary keys cannot be null.

Primary keys MUST have a value.

1.4.2 Adding data to the table

Two ways described below.

INSERT INTO tableName
VALUES ('000001', 
    'StringValue1', 
    'Stringvalue2', 
    '10.45', 
    NULL
    );

Using INSERT. This is not a recommended way because you are manually inserting values into columns and if you mess up the order, the entry will be wrong.

INSERT INTO tableName (
    Id, 
    Variable1, 
    Variable2, 
    Variable3, 
    Variable4
    )
VALUES (
    '000001', 
    'StringValue1', 
    'Stringvalue2', 
    '10.45', 
    NULL
    );

Now you are guaranteeing that the values are going into the right column.

1.4.3 Creating temporary tables

Temporary tables will be deleted when the current session is terminated. Faster than creating a real table. Useful for complex queries using subsets and joints.

CREATE TEMPORARY TABLE tmpTableName AS (
    SELECT *
    FROM tableName
    WHERE variableName1 = "conditionalString"
);

Depending on the DBMS, you will not be able to write data to temporary tables.

1.4.4 Adding comments to SQL

Mutes the expression of code. Also helps you remember what and why you were doing.

Two ways to add comments

Single line comments

SELECT variableName1,
--variableName2,
variableName3
from tableName

The above comments out variableName2.

Section comments

SELECT variableName1,
/* variableName2,
variableName3 */
from tableName

The above comments out variableName2, and variableName3.

1.5 Star Schema vs Snowflake Schema

These are two types of datawarehouse models. They are ways to organize data using relational databases.

Star schema: Aggregated central fact table that contains relationships to dimension tables.

Snowflake schema: Stores the same data as the star schema however the main difference is that the dimensional tables in the snowflake schema are normalized producing snowflaking.

1.5.1 First difference: Normalization

Snowflake schemas will use less space to store dimension tables because normalized databases produce far fewer redundant records. As a result of this, denormalized data models increase the chance of data integrity issues where records across dimension tables start to vary.

1.5.2 Second difference: Query complexity

Snowflake schemas, because the dimension tables are normalized, require more complex querying. In star schema, you will only join the fact table with dimension tables you need - you will only have one JOIN per dimension table. In snowflake schema querying, you will need to use multiple joins to find the right dimension. Because of this, querying takes slightly longer in snowflake schema however the absolute difference is meaningless (e.g., 1ms vs 1s).

Source for snowflake vs star schema

2 Filtering, Sorting, and Calculating Data with SQL

2.1 Basics of Filtering with SQL

Allows you to narrow the data you want to retrieve, reduce the number of records you retrieve, increase query performance, reduce the strain on the client application, and governance limitations.

2.1.1 WHERE clause operator

SELECT *
FROM tableName
WHERE columnName operator value;

List of operators:

= equal
<> not equal (in some operators it may be written as !=)
> greater than
< less than
>= greater than or equal
<= less than or equal to
BETWEEN between an inclusive range
- You need to specific the range (e.g., WHERE columnName BETWEEN value1 AND value2;)
IS NULL is a null value

2.2 Advanced Filtering: IN, OR, NOT

2.2.1 IN operator

Specifics a range of conditions. Comma deliminted list of values. Enclosed in ().

SELECT variableName
FROM tableName
WHERE variableName IN (value1, value2, value3)

2.2.2 OR operator

DBMS will not evaluate the second conditions in a WHERE clause if the first condition is met. Use for any rows matching the specific conditions.

SELECT variableName
FROM tableName
WHERE variableName = 'string1' OR 'string2';

2.2.3 IN vs OR

In works the same as OR, however there are some benefits of IN

Long list of options
IN executes faster than OR
Don’t have to think about the order with IN
Can contain another select

2.2.4 ORDER with AND

SQL will process an AND before an OR, so you will need to include parathesis to specific the groups.

2.2.5 NOT operator

A way to exclude certain variables.

SELECT *
FROM tableName
WHERE NOT variableName = 'string1' AND 
NOT variableName = 'string2';

2.3 Using Wildcards in SQL

2.3.1 Percent (%) wildcard

Wildcards: special character used to match parts of a value. Search pattern made from literal text, wildcard charracter, or a combination. Uses LIKE as an operator (though technically a predicate). Can only be used with strings and not non-text datatypes.

To use a wildcard, you use a % before, in the middle, and/or after the word.

Another way to use a wildcard is to put it between letters to produce a more dynamic search. (E.g., t%@gmail.com will grab gmail addresses that start with “t”).

% wildcars will not match null values. NULL represents no value in a column.

SELECT variableName1, variableName2
FROM tableName
WHERE variableName2 LIKE '%substring'

2.3.2 Underscore (_) wildcard

_ will match a single character (however not supported in all DBMSs)

Instead of % which will match any length string, _ will only match a single character.

2.3.3 Bracket [] wildcard

Used to specific a set of characters in a specific location (also does not work with all DBMSs)

2.3.4 Downsides of wildcards

Takes longer to run so better to use other operators. Statements with wildcards will take longer to run if used at the end of search patterns (so placement of wildcards is important).

2.4 Sorting with ORDER BY

Why sort data? Data displayed appears in the order of the underlying tables. Updated and deleted data can change this order. Sequence of retrieved data cannot be assumed if order was not specified? Sorting data logically helps keep information you want on top. ORDER BY clause allows user to sort data by particular columns

SELECT *
FROM tableName
ORDER BY variableName43

2.4.1 Rules for ORDER BY

Takes the name of one or more columns. Add a comma after each additional column name. You can sort by a column not retrieved (in the original SELECT statement). Must always be the last clause in a select statement.

2.4.2 Sorting by column position

ORDER BY 2,3

Will sort by column position where 2 means 2nd column, and 3 means 3rd column, etc.

2.4.3 Sort direction

DESC for descending order

ASC for ascending order

Only applies to the column names it directly precedes.

2.5 Math operations

+, -, *, /

SELECT variableName1, 
    variableName2, 
    variableName3 * variableName1 AS variableName4
FROM tableName

Creates a new column based on the math operation between variable1 and variable3.

2.5.1 Order of operations

Parantheses, exponents, multiplication, division, addition, substraction

2.5.2 Combining math operations

SELECT variableName1, 
    variableName2, 
    variableName3
    (variableName4 - variableName3) * variableName1 AS variableName5
FROM tableName

2.6 Aggregate functions

Provide various ways to summarize data. Useful for finding highest/lowest values, total number of rows, or average values.

AVG() averages a column of values
COUNT() counts the number of values
MIN() finds the minimum value
MAX() finds the maximum value
SUM() sums the column value

SELECT AVG(variableName) AS variableNameAlias
FROM tableName

Rows containing null values will be ignored.

COUNT(*) will count all the rows in a table containing values or NULL values

COUNT(variableName) will count all rows in a specific column while ignoring NULL values

SELECT MIN(variableName) as min_value,
    MAX(variableName) as max_value
FROM tableName

MIN/MAX will ignore null values

SELECT SUM(variableName1*variableName2) AS variableNameAlias
FROM tableName
WHERE variableName3 = 'valueID';

Will do math operations on data that matches condition and store the new value as a new column.

2.6.1 Using DISTINCT on Aggregate functions

If DISTINCT is not specified, ALL is assumed. Cannot use DISTINCT on COUNT(*). No value to use with MIN and MAX functions.

SELECT COUNT(DISTINCT variableName)
FROM tableName

2.7 Grouping data with SQL

2.7.1 Grouping data

SELECT variableName1,
    COUNT(variableName2) AS variableName3
FROM tableName
GROUP BY variableName1

if you did not have the group by, it would count the whole table.

GROUP BY clauses can contain multiple columns.

Every column in your SELECT statement MUST be present in a GROUP BY clause, except for the aggregated calculations.

NULLs will be grouped together if your GROUP BY column contain NULLs.

2.7.2 HAVING Clause - Filtering for groups

WHERE does not work for groups. WHERE filters on rows. Instead, use HAVING clause to filter for groups.

SELECT variableName1,
    COUNT (*) AS variableName2
FROM tableName
GROUP BY variableName1
HAVING COUNT (*) >= 2;

Counts variableName1 that had more than 2 of variableName2.

2.7.3 WHERE vs. HAVING

WHERE filters before data is grouped.

HAVING filters after data is grouped.

Rows eliminated by the WHERE clause will not be a included in the group.

2.7.4 ORDER BY with GROUP BY

ORDER BY sorts data.

GROUP BY does not sorta data.

3 Subqueries and Joins in SQL

3.1 Using Subqueries

Subqueries: Queries embedded in other queries. Relational databases store data across multiple tables. Subqueries merge data from multiple sources together. Helps with adding other filtering criteria.

SELECT CustomerID, 
    CompanyName, 
    Region
FROM Customers
WHERE CustomerID in (SELECT customerID
    FROM Orders
    WHERE Freight > 100);

Example from Coursera. In the above example, we queried the Orders table to get customerID with Freight over 100 to look up their CompanyName and Region. We used subquering to filter one table to retrieve data from another.

The query will always perform the innermost SELECT portion first.

3.2 Subquery best practices and considerations

There is no limit to the number of subqueries you can have. Performance slows when you nest too deeply.

Subquery selects can only retrieve a single column

Properly indent subqueries to know where each SELECT statement starts and ends. www.poorsql.com will reformat code with proper indenting.

3.2.1 Subqueries for calculations

SELECT customer_name,
    customer_state
    (SELECT COUNT(*) AS orders
    FROM Orders
    WHERE Orders.customer_id = Customer.customer_id) AS orders
FROM customers
ORDER BY Customer_name

The subquery is enclosed in the brackets. Retrieves the total number of orders placed by each customer.

3.3 Joining tables: An introduction

The reasons why we break data into multiple tables

Efficient storage
Easier manipulations
Greater scalability
Logically models a process
Tables are related through common values (i.e., keys)

3.3.1 Joins

Associates correct records from each table on the fly. Allows data retrieval from multiple tables in one query. Joins are not physical - they persist for the duration of the query execution.

3.4 Cartesian (Cross) Joins

What is a Cartesian (Cross) join? Each row from the first table joins with all the rows of another table.

Computational expensive (e.g., a table with x rows and another with y rows will produce a new table of x * y rows)

SELECT variableName1,
    variableName2,
    variableName3
FROM tableName1 CROSS JOIN tableName2;

This is a simple join that you can do, but it does not match anything. Therefore, it will return products with the incorrect information (therefore, it is not frequently used).

3.5 Inner joins

Inner joins: used to select records that have matching values (usually by keys) in both tables.

SELECT variableName1,
    variableName2,
    variableName3
FROM tableName1 INNER JOIN tableName2
ON tableName1.keyName = tableName2.keyName;

NOTE: If you have a common variableName that is found in both tables, you need to prequalify the variableName (that is a column name) with the table Name. E.g., SELECT tableName1.variableName1, if variableName1 is found in both.

You can perform on multiple tables. But becareful not to overly join because they are computationally taxing.

SELECT a.variableName1,
    b.variableName2,
    c.variableName3
FROM (
        (
        tableName1 a INNER JOIN tableName2 b ON a.keyName = b.keyName
        )
        INNER JOIN tableName3 c ON a.keyName = c.keyName
    );

Be as specific which table a column is from.

3.6 Aliases and self joins

3.6.1 Aliases

Alias: shortens a table or a column with a temporary name. Makes it more readable and does not physically change the table/column name.

SELECT variableName1, 
    variableName2
FROM tableName1 as t1, 
    and tableName2 as t2
WHERE t1.keyName = t2.keyName;

3.6.2 Self joins

When you join a table to itself.

SELECT t1.variableName1 as name1, 
    t2.variableName1 as name2,
    t1.variableName2
FROM tableName1 t1,
    tableName2 t2
WHERE t1.keyName = t2.keyName
    AND t1.variableName2 = t2.variableName2
ORDER BY t1.variableName2

The following matches variableName1 that have the same variableName2.

3.7 Advanced Joins: left, right, and full outer joins

3.7.1 Left Joins

Returns all records from the left table (table1) and the matched records from the right table (table2).

The result is NULL from the right side, if there is no match.

E.g., if Table1 is a customers table and Table2 is an orders table. A left join would produce a new table with all the customers and their associated order information (or null if no order).

SELECT C.CustomerName, 
    O.OrderID
FROM Customers C
LEFT JOIN Orders O ON C.CustomerID = O.CustomerID
ORDER BY C.CustomerName;

3.7.2 Right Joins

Returns all records from the right table (table2) and the matched records from left table (table1).

The result is NULL from the left side, when there is no match.

E.g., if Table1 is a employees table and Table2 is an orders table. A right join would produce a new table with all the orders and the employees associated with the orders.

SELECT O.OrderID, 
    E.LastName, 
    E.FirstName
FROM Orders O
RIGHT JOIN Employees E ON O.EmployeeID = E.EmployeeID
ORDER BY O.OrderID;

3.7.3 Difference between right and left

If you switch the order of when the tables appear, a left join can be turned into right joins and vice versa.

3.7.4 Full Outer Joins

Returns all records when there is a match in either left (table1) or right (table2) table records.

Gives you everything.

SELECT C.CustomerName,
    O.OrderID
FROM Customers C
FULL OUTER JOIN Orders O ON C.CustomerID=O.CustomerID
ORDER BY C.CustomerName;

3.8 Unions

The UNION operator is used to combine the result-set of two or more SELECT statements. Each SELECT statement within UNION must have the same number of columns. Columns MUST have similar data types. The columns in each SELECT statement MUST be in the same order.

Sort of like appending two tables on top of each other.

SELECT variableName(s)
FROM tableName1
UNION
SELECT variableName(s)
FROM tableName2;

SELECT variableName1, 
    variableName2
FROM tableName1
WHERE variableName2='someString'
UNION
SELECT variableName1, 
    variableName2
FROM tableName2
WHERE variableName2='someString'
ORDER BY variableName1

3.8.1 Union vs Union All

The UNION command is used to select related information from two tables, much like the JOIN command. However, when using the UNION command all selected columns need to be of the same data type. With UNION, only distinct values are selected.

The UNION ALL command is equal to the UNION command, except that UNION ALL selects all values.

The difference between Union and Union all is that Union all will not eliminate duplicate rows, instead it just pulls all rows from all tables fitting your query specifics and combines them into a table.

From SQL Authority

4 Modifying and Analyzing Data with SQL

4.1 Working with text strings

Allows you to retrieve the data in the format you need. There are differences between client vs server formatting.

Support joins

String functions

Concatenate
Substring
Trim
Upper
Lower

4.1.1 Concatenate


```SQL
SELECT variableName1, 
    variableName2,
    variableName3 || ' ('|| variableName4||')'
FROM tableName

Concatenates the variableName3 and variableName4. In the above example we are adding variableName4 in between ().

NOTE: SQL server supports + instead of

4.1.2 Substring

Returns the specified number of characters from a particular position of a given string.

SUBSTR(string name, string position, number of characters to be returned);

SELECT firstName, SUBSTR(firstName,3,4)
FROM labMembers
WHERE labName="JMT";

Gives the 3-7 characters of the first name of lab members that belong to the JMT Lab.

4.1.3 Trim

Trims the leading or trailing space from a string (e.g., TRIM, RTRIM, LTRIM)

SELECT TRIM("     This is an example string.      ") AS TrimmedString;

The above trims left and right trailing space.

4.1.4 Upper and Lower

Changes case of string.

SELECT UPPER(variableName) FROM tableName;

SELECT UCASE(variableName) FROM tableName;

UCASE does the same as UPPER.

SELECT LOWER(variableName) FROM tableName;

4.2 Working with date and time strings

Dates are stored as a datatype. Each DBMS uses it’s own variety of datatypes.

Having a time portion makes queries a bit more complex in contrast to just a date portion.

There are different formats

DATE Format: YYYY-MM-DD
DATETIME Format: YYYY-MM-DD HH:MI:SS
TIMESTAMP Format: YYYY-MM-DD HH:MI:SS

SQLite supports 5 date and time functions:

DATE(timestring, modifier, modifier, ...)
TIME(timestring, modifier, modifier, ...)
DATETIME(timestring, modifier, modifier, ...)
JULIANDAY(timestring, modifier, modifier, ...)
STRFTIME(format, timestring, modifier, modifier, ...)

Timestrings can also be in a variety of formats.

YYYY-MM-DD
YYYY-MM-DD HH:MM
YYYY-MM-DD HH:MM:SS
YYYY-MM-DD HH:MM:SS.SSS
YYYY-MM-DDTHH:MM
YYYY-MM-DDTHH:MM:SS
YYYY-MM-DDTHH:MM:SS.SSS
HH:MM
HH:MM:SS
HH:MM:SS.SSS

Modifiers

NNN days
NNN hours
NNN minutes
NNN.NNNN seconds
NNN months
NNN years
start of month
start of year
start of day
weekday N
unixepoch
localtime
utc

Date and Time String examples

SELECT Birthdate,
    STRFTIME('%Y', Birthdate) AS Year,
    STRFTIME('%m', Birthdate) AS Month,
    STRFTIME('%d', Birthdate) AS Day
FROM labMembers

Flattens/Retrieves year, month, and day from a date string.

To compute the current date:

SELECT DATE('now')

SELECT STRFTIME('%Y %m %d', 'now')

You can compute operations

SELECT STRFTIME(('now') -Birthdate)

4.3 Case statements

Mimics if-then-else statement found in most programming languages. Can be used in SELECT, INSERT, UPDATE, and DELETE statements

CASE
WHEN C1 THEN E1
WHEN C2 THEN E2
...
ELSE [result else]
END

SELECT employeeid,
    firstname,
    lastname,
    city,
    CASE City
        WHEN 'Calgary' THEN 'Calgary'
    ELSE 'Other'
        END calgary
FROM Employees
ORDER BY LastName, FirstName;

If you had a list of cities in a column but only wanted to create a column if the city was calgary or not.

You can also use CASE as a search (below).

CASE WHEN Boolean_expression
THEN result_expression [...n]
[ ELSE else_result_expression ]
END columnName

4.4 Views

A view is a stored query. Can add or remove columns without changing schema. Use it to encapsualte queries. The view will be removed after database connection has ended.

CREATE [TEMP] VIEW [IF NOT EXISTS] 
view_name(column-name-list)
AS
select-statements;

Example:

CREATE VIEW my_view
AS
SELECT
r.regiondescription,
t.territorydescription,
e.Lastname,
e.FirstName,
e.Hiredate,
e.Reportsto
FROM Region r
INNER JOIN Territories t on r.regionid = t.regionid
INNER JOIN Employeeterritories et on t.TerritoryID = et.TerritoryID
INNER JOIN Employees e on et.employeeid = e.EmployeeID

Here we combine three tables based on keys and create a View that contains select variables.

SELECT *
FROM view_name

To view the data, you will need to use a SELECT statement

DROP VIEW view_name

Drops the view.

Once you create a view, you can treat it as a normal table to run additional queries.

Views are useful to create stepping stones in multiple-level queries and prevents transfer of data during ETL processes. Only good for the SQL session.

4.5 Data Governance and Profiling

Data profiling: when you look at descriptive statistics or object data information -examining data for completeness and accuracy. Helps understand the data. E.g., number of rows, table size, when the objects were last updated.

Column data profiling: column date type, number of distinct values, number of rows with NULL values, descriptive statistics (max, average, std for column values).

You need to test your data along the way during query generation.

4.5.1 Governance best practices

Depends on the organization. Understand the read and write capabilities. Clean up your environments after you’re done. Understand your promotion/value process.

4.6 Using SQL for data science

4.6.1 Data Understanding

The most important step. Understand the relationships in your data (data and business understanding).

Data understanding: data types, contatenated date and times, etc.
- Subject area understanding: may be new until you write more queries until you know how the data relates to each other. Understanding where data joins are. Differentiating integers from strings.
Business understanding: ask question about the business problem you are solving. Hard to separate date and business understanding.

Beware of the unspoken need! These are specific details that are necessary to understand the business understanding (which are sometimes somewhat vague).

4.6.2 Profiling data

Get into the details of your data. Create a data model and map the fields and tables you need. Consider joins and calculations. Understand any data quality or format issues.

Start with SELECT

Start simple
Any query begins with SELECT statement
Add in special formatting
If using subqueries, start with the inner-most query and work outward

4.6.3 Test and troubleshoot

Do not wait until the end to test queries. Test after each join or filter. Are you getting the results you expected. Start small and go step-by-step when troubleshooting a query.

4.6.4 Format and comment

Use correct formatting and indentation. Comment strategically. Clean code and comments help when you revist or hand off code.

4.6.5 Review

Always review old queries, changes to business rules, data changes, data indicators. Work the problem from beginning to end.

SQL