Databases
Database
A Database is an organized collection of structured or unstructured data,
typically stored electronically in a computer system. It is managed by a
Database Management System (DBMS), which acts as the software interface allowing
users and applications to securely create, read, update, and delete (CRUD) the data.
1. Types of Databases: SQL vs. NoSQL
The two primary paradigms for storing data are Relational (SQL) and Non-Relational (NoSQL). Choosing between them depends entirely on the structure of your data and how it needs to scale.
1.1 Relational Databases (SQL)
Data is stored in highly structured Tables (Relations) consisting of rows (records) and columns (attributes). Tables are strictly linked together using Primary and Foreign Keys to form complex relationships. Examples: MySQL, PostgreSQL, Oracle, Microsoft SQL Server.
SQL Characteristics
- Rigid Schema: You must define exactly what data type (integer, string, date) goes into each column before inserting any data. If you want to add a new attribute (like "MiddleName") later, you must alter the entire table structure.
- ACID Compliance: Guarantees strong data integrity and absolute reliability for complex, multi-step transactions (like financial bank transfers).
- Vertical Scaling (Scale-Up): Primarily scales by increasing the horsepower of a single server (buying a bigger, more expensive server with more RAM/CPU). Distributing a relational database across many servers is complex.
1.2 Non-Relational Databases (NoSQL)
Data is stored in flexible formats, often as JSON-like documents, key-value pairs, wide-columns, or graph nodes, rather than strict tables. Examples: MongoDB (Document), Redis (Key-Value), Cassandra (Wide-Column), Neo4j (Graph).
NoSQL Characteristics
- Dynamic Schema: Schema-less design. You can insert varied data structures into the same collection without altering the database. One user document might have a "phone_number" field, while another does not.
- Horizontal Scaling (Scale-Out): Designed from the ground up to scale horizontally by adding cheaper, commodity servers to distribute the load across a massive cluster. Perfect for massive, unstructured Big Data and real-time web apps.
- BASE Properties: Often trades strict ACID consistency for higher availability and performance (Basically Available, Soft state, Eventual consistency).
Key Takeaways
- SQL databases use rigid tables with predefined schemas, ideal for highly structured data requiring complex joins and absolute transactional integrity.
- NoSQL databases use flexible, unstructured formats, ideal for rapid agile development, changing data shapes, and massive horizontal scaling.
2. Core Relational Concepts (Keys)
Relational databases rely on specific keys to mathematically organize and link data across multiple tables.
Key Types
- Primary Key (PK): A column (or combination of columns) that uniquely identifies every single row in a table. It must be unique and cannot be NULL. (e.g., A
StudentID, aSocialSecurityNumber, or an auto-incrementing integerID). - Foreign Key (FK): A column in one table that specifically references the Primary Key of another table. This creates the relational link between the tables, enforcing referential integrity. (e.g., An
Orderstable has aCustomerIDcolumn linking back to theCustomerstable). - Composite Key: A Primary Key that consists of two or more columns combined to create a unique identifier (often used in linking tables for Many-to-Many relationships).
Key Takeaways
- Primary keys guarantee row uniqueness within a single table.
- Foreign keys mathematically link tables together, preventing orphan records (referential integrity).
3. The ACID Properties
Relational DBMS systems guarantee reliability even during power failures, system crashes, or massive concurrent access by enforcing four strict rules known as the ACID properties for every database transaction.
Transaction
A Transaction is a logical unit of work that contains one or more SQL statements (e.g., deducting money from Account A, and adding it to Account B).
ACID Transactions
- Atomicity ("All or Nothing"): A transaction must either complete entirely, or not happen at all. If a server crashes halfway through a bank transfer, the database entirely "rolls back" the partial changes to the exact starting state.
- Consistency: A transaction can only bring the database from one mathematically valid state to another valid state. It must obey all predefined rules, constraints (like NOT NULL), and cascading triggers.
- Isolation: Concurrent executions of transactions by hundreds of users must leave the database in the exact same state as if they were executed sequentially (one after the other). This prevents "dirty reads" where two users withdraw from the same account simultaneously, corrupting the final balance.
- Durability: Once a transaction has been successfully committed, the changes are permanently stored on non-volatile disk. They will survive a catastrophic system crash or power outage immediately afterward.
Key Takeaways
- ACID properties (Atomicity, Consistency, Isolation, Durability) ensure bulletproof data reliability, particularly for financial, medical, and critical systems.
- They prevent partial data updates, corruption from concurrent access, and data loss during sudden crashes.
4. Database Normalization
Normalization is the systematic process of organizing data in a relational database to reduce data redundancy (duplicate data) and improve data integrity. It involves breaking large, messy tables into smaller, tightly-related tables and defining relationships between them.
4.1 The Normal Forms
Procedure
- First Normal Form (1NF): Eliminate repeating groups or arrays in individual columns. Create a separate table for each set of related data, and identify each row with a unique Primary Key. Rule: Every column must contain atomic (indivisible) single values. (e.g., A "PhoneNumbers" column cannot contain "555-1234, 555-9876").
- Second Normal Form (2NF): Must fulfill 1NF. Create separate tables for sets of values that apply to multiple records, and relate these tables with a Foreign Key. Rule: No partial dependency. Every non-key column must depend on the ENTIRE primary key (only relevant if the PK is a composite key).
- Third Normal Form (3NF): Must fulfill 2NF. Eliminate fields that do not depend strictly on the primary key. Rule: No transitive dependency. A non-key column cannot depend on another non-key column. (e.g., If a table has
ZipCodeandCity,Citydepends onZipCode, not theCustomerIDprimary key. Move them to a separateLocationstable).
The Goal of Normalization
The goal is to ensure that every non-key attribute in a table provides a fact about the Primary Key, the whole Primary Key, and nothing but the Primary Key. This prevents Insertion, Update, and Deletion Anomalies (e.g., updating a customer's address in one place automatically updates it everywhere, preventing conflicting data).
Key Takeaways
- Normalization organizes schemas to eliminate data redundancy and prevent anomalies during database modifications.
- 1NF requires atomic values; 2NF removes partial dependencies; 3NF removes transitive dependencies.
5. Structured Query Language (SQL)
SQL is the standard programming language used to communicate with, manage, and query Relational Databases.
Common SQL Commands (CRUD)
- SELECT (Read): Retrieves data from a database. (e.g.,
SELECT FirstName, LastName FROM Users WHERE Age > 18 ORDER BY LastName;) - INSERT (Create): Adds new rows (records) into a table.
- UPDATE (Update): Modifies existing data within specific rows based on a
WHEREcondition. - DELETE (Delete): Removes specific rows from a table based on a
WHEREcondition. - JOIN: The most powerful feature of SQL. It dynamically combines rows from two or more separate tables based on a related column (usually Primary/Foreign keys) to return a unified dataset.
6. Indexing and Performance
An Index is a specialized data structure (often a B-Tree) created on a database column that drastically improves the speed of data retrieval operations (SELECTs), at the cost of slower writes (INSERTs/UPDATEs) and increased storage space.
How Indexing Works
Without an index, the database engine must perform a "Full Table Scan" (checking every single row from top to bottom) to find
WHERE Username = 'John'. If the Username column is indexed, the database uses a sorted tree structure to find 'John' nearly instantly (O(log n) time) instead of scanning millions of rows (O(n) time).The Trade-off: Every time a new user is inserted or updated, the index tree must be computationally rebuilt and re-sorted. Therefore, over-indexing a table slows down write operations significantly.
Key Takeaways
- SQL provides standardized commands (SELECT, INSERT, UPDATE, JOIN) to query and manipulate relational data.
- Indexing drastically speeds up data retrieval (SELECTs) by maintaining sorted lookups using B-Trees.
- Over-indexing slows down data modification (INSERT/UPDATE/DELETE) because the indices must be constantly maintained.
Summary
Key Takeaways
- Databases organize data securely. SQL (Relational) uses rigid tables for complex relations and guarantees integrity; NoSQL uses flexible documents for massive horizontal scalability.
- Primary and Foreign Keys establish the vital referential links between separate tables in relational models.
- ACID properties (Atomicity, Consistency, Isolation, Durability) guarantee absolute transactional reliability.
- Normalization (1NF, 2NF, 3NF) structures tables to eliminate redundancy and update anomalies.
- SQL is the standard querying language, utilizing JOINs to combine data and Indexing to optimize retrieval speeds at the cost of write performance.