Add Relational Databases course (#21)
sumeshpremraj committed Nov 27, 2020
1 parent f357523 commit d42a09c
Showing 12 changed files with 481 additions and 3 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@
.DS_Store
.venv
site/
98 changes: 98 additions & 0 deletions courses/databases_sql/concepts.md
@@ -0,0 +1,98 @@
* Relational DBs are used for data storage. Even a file can be used to store data, but relational DBs are designed with specific goals:
* Efficiency
* Ease of access and management
* Organization
* Handling relations between data (represented as tables)
* Transaction: a unit of work that can comprise multiple statements, executed together
* ACID properties

Set of properties that guarantee data integrity of DB transactions

* Atomicity: Each transaction is atomic (succeeds or fails completely)
* Consistency: Transactions only result in valid state (which includes rules, constraints, triggers etc.)
* Isolation: Each transaction is executed independently of others safely within a concurrent system
* Durability: Completed transactions will not be lost due to any later failures

Let’s take some examples to illustrate the above properties.

* Account A has a balance of ₹200 & B has ₹400. Account A is transferring ₹100 to Account B. This transaction involves a deduction from the sender’s balance and an addition to the recipient’s balance. If the first operation succeeds while the second fails, A’s balance would be ₹100 while B would still have ₹400 instead of ₹500. **Atomicity** in a DB ensures this partially failed transaction is rolled back.
* If the second operation above fails, it leaves the DB inconsistent (sum of balance of accounts before and after the operation is not the same). **Consistency** ensures that this does not happen.
* There are three operations, one to calculate interest for A’s account, another to add that to A’s account, then transfer ₹100 from B to A. Without **isolation** guarantees, concurrent execution of these 3 operations may lead to a different outcome every time.
* What happens if the system crashes before the transactions are written to disk? **Durability** ensures that the changes are applied correctly during recovery.
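The transfer example can be expressed as a single SQL transaction. A minimal sketch, assuming a hypothetical `accounts` table with `id` and `balance` columns on a running MySQL server:

```
-- Hypothetical schema: accounts(id VARCHAR(10) PRIMARY KEY, balance DECIMAL(10,2))
START TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE id = 'A';
UPDATE accounts SET balance = balance + 100 WHERE id = 'B';
-- If either UPDATE fails, issue ROLLBACK instead of COMMIT; atomicity
-- guarantees a partial transfer is never left behind.
COMMIT;
```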
* Relational data
* Tables represent relations
* Columns (fields) represent attributes
* Rows are individual records
* Schema describes the structure of the DB
* SQL

A query language to interact with and manage data.

[CRUD operations](https://stackify.com/what-are-crud-operations/) - create, read, update, delete queries

Management operations - create DBs/tables/indexes etc, backup, import/export, users, access controls
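As an illustration, the four CRUD operations map directly to SQL statements. A minimal sketch against a hypothetical `users` table:

```
-- Hypothetical table: users(id INT PRIMARY KEY, name VARCHAR(50))
INSERT INTO users (id, name) VALUES (1, 'Asha');   -- Create
SELECT name FROM users WHERE id = 1;               -- Read
UPDATE users SET name = 'Usha' WHERE id = 1;       -- Update
DELETE FROM users WHERE id = 1;                    -- Delete
```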

Exercise: Classify the queries below into the four types - DDL (definition), DML (manipulation), DCL (control) and TCL (transactions) - and explain each in detail.

insert, create, drop, delete, update, commit, rollback, truncate, alter, grant, revoke

You can practise these in the [lab section](../lab.md).



* Constraints

Rules that restrict the data that can be stored. A query fails if it violates any of the constraints defined on a table.


Primary key: one or more columns that contain UNIQUE values, and cannot contain NULL values. A table can have only ONE primary key. An index on it is created by default.

Foreign key: links two tables together. Its value(s) match a primary key in a different table \
Not null: Does not allow null values \
Unique: Value of column must be unique across all rows \
Default: Provides a default value for a column if none is specified during insert

Check: Allows only particular values (like Balance >= 0)
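All of the constraints above can appear in a single table definition. A sketch using a hypothetical `accounts` table (the referenced `users` table is also assumed to exist):

```
CREATE TABLE accounts (
    acc_no   INT           PRIMARY KEY,          -- unique + not null; indexed by default
    owner_id INT           NOT NULL,
    email    VARCHAR(255)  UNIQUE,
    balance  DECIMAL(10,2) DEFAULT 0.00,         -- default used when no value is inserted
    CHECK (balance >= 0),                        -- only particular values allowed
    FOREIGN KEY (owner_id) REFERENCES users(id)  -- links to another table's primary key
);
```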



* [Indexes](https://datageek.blog/en/2018/06/05/rdbms-basics-indexes-and-clustered-indexes/)

Most indexes use a B+ tree structure.

Why use them: they speed up queries (e.g. in large tables where only a few rows are fetched, or for min/max queries) by eliminating rows from consideration early.

Types of indexes: unique, primary key, fulltext, secondary

Write-heavy workloads, and queries that mostly do full table scans or access a large number of rows, do not benefit from indexes.
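For instance, a secondary index on a column speeds up selective lookups on that column, while a full table scan cannot benefit from it. A sketch, assuming the `employees` table from the lab section:

```
CREATE INDEX idx_last_name ON employees (last_name);
-- Selective lookup: can use idx_last_name instead of scanning every row
SELECT emp_no FROM employees WHERE last_name = 'Facello';
-- Reads every row regardless, so the index does not help here
SELECT * FROM employees;
```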



* [Joins](https://www.sqlservertutorial.net/sql-server-basics/sql-server-joins/)

Joins let you fetch related data from multiple tables by linking them on a common field. They are powerful but resource-intensive, and they make scaling databases difficult. Joins are behind many slow-performing queries at scale, and the solution is almost always to find ways to reduce the number of joins.
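For example, in the employees sample database used in the lab, a query can link employees to their department names through the shared `emp_no` and `dept_no` fields:

```
SELECT e.first_name, e.last_name, d.dept_name
FROM employees e
JOIN dept_emp de ON de.emp_no = e.emp_no
JOIN departments d ON d.dept_no = de.dept_no
LIMIT 5;
```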



* [Access control](https://dev.mysql.com/doc/refman/8.0/en/access-control.html)

DBs have privileged accounts for admin tasks, and regular accounts for clients. There are fine-grained controls on which actions (DDL, DML, etc., discussed earlier) are allowed for these accounts.

The DB first verifies the user’s credentials (authentication), and then examines whether the user is permitted to perform the request (authorization) by looking up this information in internal tables.

Other controls include activity auditing that allows examining the history of actions done by a user, and resource limits which define the number of queries, connections etc. allowed.
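A sketch of these controls in MySQL, using a hypothetical read-only reporting account:

```
-- Hypothetical account; use a strong password in practice
CREATE USER 'reporter'@'%' IDENTIFIED BY 'use_a_real_password';
GRANT SELECT ON employees.* TO 'reporter'@'%';    -- read-only DML; no DDL/DCL
SHOW GRANTS FOR 'reporter'@'%';                   -- inspect what was granted
REVOKE SELECT ON employees.* FROM 'reporter'@'%'; -- take the privilege back
DROP USER 'reporter'@'%';
```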


### Popular databases

Commercial, closed source - Oracle, Microsoft SQL Server, IBM DB2

Open source with optional paid support - MySQL, MariaDB, PostgreSQL

Individuals and small companies have always preferred open source DBs because of the huge cost associated with commercial software.

In recent times, even large organizations have moved away from commercial software to open source alternatives because of the flexibility and cost savings associated with it.

Lack of support is no longer a concern because of the paid support available from the developer and third parties.

MySQL is the most widely used open source DB, and it is widely supported by hosting providers, making it easy for anyone to use. It is part of the popular Linux-Apache-MySQL-PHP ([LAMP](https://en.wikipedia.org/wiki/LAMP_(software_bundle))) stack that became popular in the 2000s. We have many more choices for a programming language, but the rest of that stack is still widely used.
13 changes: 13 additions & 0 deletions courses/databases_sql/conclusion.md
@@ -0,0 +1,13 @@
# Conclusion
We have covered basic concepts of SQL databases. We have also covered some of the tasks that an SRE may be responsible for - there is so much more to learn and do. We hope this course gives you a good start and inspires you to explore further.


### Further reading

* More practice with online resources like [this one](https://www.w3resource.com/sql-exercises/index.php)
* [Normalization](https://beginnersbook.com/2015/05/normalization-in-dbms/)
* [Routines](https://dev.mysql.com/doc/refman/8.0/en/stored-routines.html), [triggers](https://dev.mysql.com/doc/refman/8.0/en/trigger-syntax.html)
* [Views](https://www.essentialsql.com/what-is-a-relational-database-view/)
* [Transaction isolation levels](https://dev.mysql.com/doc/refman/8.0/en/innodb-transaction-isolation-levels.html)
* [Sharding](https://www.digitalocean.com/community/tutorials/understanding-database-sharding)
* [Setting up HA](https://severalnines.com/database-blog/introduction-database-high-availability-mysql-mariadb), [monitoring](https://blog.serverdensity.com/how-to-monitor-mysql/), [backups](https://dev.mysql.com/doc/refman/8.0/en/backup-methods.html)
26 changes: 26 additions & 0 deletions courses/databases_sql/innodb.md
@@ -0,0 +1,26 @@
### Why should you use this?

InnoDB is a general-purpose storage engine offering row-level locking, ACID compliance, transactions, crash recovery, and multi-version concurrency control.


### Architecture

![alt_text](images/innodb_architecture.png "InnoDB components")


### Key components:

* Memory:
* Buffer pool: LRU cache of frequently used data(table and index) to be processed directly from memory, which speeds up processing. Important for tuning performance.
* Change buffer: Caches changes to secondary index pages when those pages are not in the buffer pool, and merges them when the pages are fetched. Merging may take a long time and impact live queries. It also takes up part of the buffer pool, but avoids the extra I/O needed to read secondary index pages into memory.
* Adaptive hash index: Supplements InnoDB’s B-Tree indexes with fast hash lookup tables like a cache. Slight performance penalty for misses, also adds maintenance overhead of updating it. Hash collisions cause AHI rebuilding for large DBs.
* Log buffer: Holds log data before flush to disk.

The size of each memory area above is configurable and has a large impact on performance. Optimal settings require careful analysis of the workload and available resources, along with benchmarking and tuning.
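These sizes are set through server configuration. An illustrative my.cnf fragment follows — the values are placeholders, not recommendations, and must be tuned per workload:

```
[mysqld]
# Buffer pool: in-memory cache for table and index data
innodb_buffer_pool_size = 1G
# Log buffer: holds redo log data before it is flushed to disk
innodb_log_buffer_size = 16M
# Adaptive hash index: can be disabled if its maintenance overhead hurts
innodb_adaptive_hash_index = ON
```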

* Disk:
* Tables: Stores data within rows and columns.
* Indexes: Helps find rows with specific column values quickly, avoids full table scans.
* Redo Logs: All transactions are written to them; after a crash, the recovery process corrects data written by incomplete transactions and replays any pending ones.
* Undo Logs: Records associated with a single transaction, containing information on how to undo that transaction’s latest change.

21 changes: 21 additions & 0 deletions courses/databases_sql/intro.md
@@ -0,0 +1,21 @@
# Relational Databases

### What to expect from this training
You will have an understanding of what relational databases are, their advantages, and some MySQL specific concepts.

### What is not covered under this course
* In depth implementation details

* Advanced topics like normalization, sharding

* Specific tools for administration

### Introduction
The main purpose of database systems is to manage data. This includes storing data, adding new data, deleting unused data, updating existing data, retrieving data within a reasonable response time, and other maintenance tasks that keep the system running.

### Prerequisites
* Complete [Linux course](/linux_basics/intro/)
* Install Docker (for lab section)

### Pre-reads
[RDBMS Concepts](https://beginnersbook.com/2015/04/rdbms-concepts/)
207 changes: 207 additions & 0 deletions courses/databases_sql/lab.md
@@ -0,0 +1,207 @@
**Prerequisites**

Install Docker


**Setup**

Create a working directory named sos or something similar, and cd into it.

Enter the following into a file named my.cnf under a directory named custom.


```
sos $ cat custom/my.cnf
[mysqld]
# These settings apply to MySQL server
# You can set port, socket path, buffer size etc.
# Below, we are configuring slow query settings
slow_query_log=1
slow_query_log_file=/var/log/mysqlslow.log
long_query_time=0.1
```


Start a container and enable slow query log with the following:


```
sos $ docker run --name db -v custom:/etc/mysql/conf.d -e MYSQL_ROOT_PASSWORD=realsecret -d mysql:8
sos $ docker cp custom/my.cnf $(docker ps -qf "name=db"):/etc/mysql/conf.d/custom.cnf
sos $ docker restart $(docker ps -qf "name=db")
```


Import a sample database


```
sos $ git clone [email protected]:datacharmer/test_db.git
sos $ docker cp test_db $(docker ps -qf "name=db"):/home/test_db/
sos $ docker exec -it $(docker ps -qf "name=db") bash
root@3ab5b18b0c7d:/# cd /home/test_db/
root@3ab5b18b0c7d:/# mysql -uroot -prealsecret mysql < employees.sql
root@3ab5b18b0c7d:/etc# touch /var/log/mysqlslow.log
root@3ab5b18b0c7d:/etc# chown mysql:mysql /var/log/mysqlslow.log
```


_Workshop 1: Run some sample queries_
Run the following
```
$ mysql -uroot -prealsecret mysql
mysql>
# inspect DBs and tables
# the last 4 are MySQL internal DBs
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| employees |
| information_schema |
| mysql |
| performance_schema |
| sys |
+--------------------+
> use employees;
mysql> show tables;
+----------------------+
| Tables_in_employees |
+----------------------+
| current_dept_emp |
| departments |
| dept_emp |
| dept_emp_latest_date |
| dept_manager |
| employees |
| salaries |
| titles |
+----------------------+
# read a few rows
mysql> select * from employees limit 5;
# filter data by conditions
mysql> select count(*) from employees where gender = 'M' limit 5;
# find count of particular data
mysql> select count(*) from employees where first_name = 'Sachin';
```

_Workshop 2: Use explain and explain analyze to profile a query, identify and add indexes required for improving performance_
```
# View all indexes on table
#(\G is to output horizontally, replace it with a ; to get table output)
mysql> show index from employees from employees\G
*************************** 1. row ***************************
Table: employees
Non_unique: 0
Key_name: PRIMARY
Seq_in_index: 1
Column_name: emp_no
Collation: A
Cardinality: 299113
Sub_part: NULL
Packed: NULL
Null:
Index_type: BTREE
Comment:
Index_comment:
Visible: YES
Expression: NULL
# This query uses an index, identified by the 'key' field
# By prefixing explain keyword to the command,
# we get query plan (including key used)
mysql> explain select * from employees where emp_no < 10005\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: employees
partitions: NULL
type: range
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: NULL
rows: 4
filtered: 100.00
Extra: Using where
# Compare that to the next query which does not utilize any index
mysql> explain select first_name, last_name from employees where first_name = 'Sachin'\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: employees
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 299113
filtered: 10.00
Extra: Using where
# Let's see how much time this query takes
mysql> explain analyze select first_name, last_name from employees where first_name = 'Sachin'\G
*************************** 1. row ***************************
EXPLAIN: -> Filter: (employees.first_name = 'Sachin') (cost=30143.55 rows=29911) (actual time=28.284..3952.428 rows=232 loops=1)
-> Table scan on employees (cost=30143.55 rows=299113) (actual time=0.095..1996.092 rows=300024 loops=1)
# Cost(estimated by query planner) is 30143.55
# actual time=28.284ms for first row, 3952.428 for all rows
# Now let's try adding an index and running the query again
mysql> create index idx_firstname on employees(first_name);
Query OK, 0 rows affected (1.25 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> explain analyze select first_name, last_name from employees where first_name = 'Sachin';
+--------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN |
+--------------------------------------------------------------------------------------------------------------------------------------------+
| -> Index lookup on employees using idx_firstname (first_name='Sachin') (cost=81.20 rows=232) (actual time=0.551..2.934 rows=232 loops=1)
|
+--------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.01 sec)
# Actual time=0.551ms for first row
# 2.934ms for all rows. A huge improvement!
# Also notice that the query involves only an index lookup,
# and no table scan (reading all rows of table)
# ..which vastly reduces load on the DB.
```

_Workshop 3: Identify slow queries on a MySQL server_
```
# Run the command below in two terminal tabs to open two shells into the container.
docker exec -it $(docker ps -qf "name=db") bash
# Open a mysql prompt in one of them and execute this command
# We have configured the server to log queries that take longer than 0.1s,
# so the sleep(3) query below will be logged
mysql -uroot -prealsecret mysql
mysql> select sleep(3);
# Now, in the other terminal, tail the slow log to find details about the query
root@62c92c89234d:/etc# tail -f /var/log/mysqlslow.log
/usr/sbin/mysqld, Version: 8.0.21 (MySQL Community Server - GPL). started with:
Tcp port: 3306 Unix socket: /var/run/mysqld/mysqld.sock
Time Id Command Argument
# Time: 2020-11-26T14:53:44.822348Z
# User@Host: root[root] @ localhost [] Id: 9
# Query_time: 5.404938 Lock_time: 0.000000 Rows_sent: 1 Rows_examined: 1
use employees;
# Time: 2020-11-26T14:53:58.015736Z
# User@Host: root[root] @ localhost [] Id: 9
# Query_time: 10.000225 Lock_time: 0.000000 Rows_sent: 1 Rows_examined: 1
SET timestamp=1606402428;
select sleep(3);
```

These were simulated examples with minimal complexity. In real life, the queries would be much more complex and the explain/analyze and slow query logs would have more details.
