Data Management Fundamentals: Databases, Algorithms, & Distributed Systems
Understanding Algorithms
A programming algorithm is a computer procedure, much like a recipe, that tells your computer precisely what steps to take to solve a problem or reach a goal.
Evolution of Data Management
Early data management systems include file and database systems that were designed prior to the relational database in the 1970s. These include:
- Flat File Data Management
- Hierarchical Data Management Systems
- Network Data Management Systems
Database Generations & Models
1. Flat File Data Model
An organized set of data stored in a long-term storage medium, such as a disk or magnetic tape.
2. Hierarchical Data Model
Files are related in a parent/child manner, with each child file having at most one parent file.
3. Network Data Model
Made of data records linked together. Data records are known as “nodes” and the links as “edges.” Unlike the hierarchical model, a record is not restricted to a single parent, and the model defines both a schema and a database. It was standardized in 1971 by the CODASYL group (Conference on Data Systems Languages).
4. Relational Database Model (1970 – E. F. Codd)
A Relational Database Management System (RDBMS) is an application made of multiple programs that manage data and allow users to add, update, read, and delete data. It is designed to use a common and standardized language to manipulate data, called SQL. The minimal requirements to implement an RDBMS include four components:
- Storage Management Programs
- Memory Management Programs
- Data Dictionary
- Query Language
Relational Database Components
- User Interface
- Business Logic
- Database Code
Relational Database Disadvantages
Relational databases can struggle to support large volumes of read and write operations, often leading to slow response times and challenges with high availability. When too many users access a relational database, adding CPUs or memory to the server may help, but this solution only works up to a certain point.
SQL Query Example
SELECT first_name, last_name FROM employees;
5. Object-Oriented Database Model
This model supports the modeling and creation of data as objects. It can efficiently manage a large number of different data types. Objects with complex behaviors are easy to handle using concepts like inheritance and polymorphism.
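As a minimal sketch of the idea (the classes and data below are illustrative, not from any particular object database), objects of related types can live in one collection, and inheritance with polymorphism lets each type supply its own behavior:

```python
# Hypothetical sketch: related object types stored together, with
# polymorphism letting each subclass override shared behavior.

class Media:
    def __init__(self, title):
        self.title = title

    def describe(self):
        return f"Media: {self.title}"

class Book(Media):
    def __init__(self, title, pages):
        super().__init__(title)
        self.pages = pages

    def describe(self):  # polymorphism: subclass overrides behavior
        return f"Book: {self.title} ({self.pages} pages)"

class Film(Media):
    def __init__(self, title, minutes):
        super().__init__(title)
        self.minutes = minutes

    def describe(self):
        return f"Film: {self.title} ({self.minutes} min)"

# An object store can hold all three types in one collection and
# dispatch behavior without knowing each concrete class.
store = [Book("SQL Basics", 300), Film("Big Data", 95), Media("Misc")]
descriptions = [obj.describe() for obj in store]
```

This is what makes complex behaviors "easy to handle": code that iterates over the store never needs to branch on the concrete type.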
Key Characteristics for Large-Scale Data Management
Four characteristics are particularly important for large-scale data management:
- Scalability
- Cost
- Flexibility
- Availability
Scaling Out Explained
Adding servers as needed, depending on the traffic, to distribute the workload.
Scaling Up Explained
Upgrading an existing database server to add additional processors, memory, or network bandwidth to improve performance.
Scaling out is generally more flexible than scaling up.
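A toy sketch of scaling out, under the assumption that requests are routed to servers by hashing a key over the pool size (server names and keys are illustrative):

```python
# Minimal sketch of scaling out: requests are routed by hashing a key
# over the pool, and adding a server redistributes part of the load.

def hash_key(key):
    # A stable toy hash (Python's built-in hash() is salted per run).
    return sum(ord(c) for c in key)

def route(key, servers):
    """Pick a server for a key by hashing it over the pool size."""
    return servers[hash_key(key) % len(servers)]

servers = ["db-1", "db-2"]
before = {k: route(k, servers) for k in ["user1", "user2", "user3"]}

servers.append("db-3")  # scaling out: add a server as traffic grows
after = {k: route(k, servers) for k in ["user1", "user2", "user3"]}
```

Note that growing the pool moves some keys to the new server; production systems often use consistent hashing to limit how many keys move.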
Entity-Relationship Model (ERD)
An Entity-Relationship Diagram (ERD) shows the relationships of entity sets stored in a database. An entity set is a collection of similar entities. These entities can have attributes that define their properties. For example, an HR schema might include employees, managers, and departments, while an inventory schema could include warehouses, products, and suppliers.
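The ingredients of an ERD can be sketched in code: entity sets with attributes, plus a relationship linking them. The HR example below follows the text; the specific field names are illustrative assumptions:

```python
# Hypothetical sketch of ERD concepts: two entity sets (Department,
# Employee) with attributes, and a "works in" relationship between them.
from dataclasses import dataclass

@dataclass
class Department:          # entity set: departments
    dept_id: int
    name: str

@dataclass
class Employee:            # entity set: employees
    emp_id: int
    name: str
    dept_id: int           # "works in" relationship to Department

sales = Department(10, "Sales")
ada = Employee(1, "Ada", dept_id=10)
works_in_sales = ada.dept_id == sales.dept_id
```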
Flat File Limitations
Some limitations of flat files include:
- It is inefficient to access data in any way other than the order in which it was organized in the file.
- Changes to file structure require changes to programs.
- Different kinds of data have different security requirements.
- Data can be stored in multiple files, making it difficult to maintain consistent data sets.
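The first limitation can be sketched concretely. Assuming a file ordered by employee ID (the records below are illustrative), answering a query on any other field forces a scan of every record:

```python
# Sketch of the access limitation: a flat file organized by employee ID
# must be read line by line to answer a query on a different field.
import io

flat_file = io.StringIO(
    "101,Ada,Engineering\n"
    "102,Grace,Research\n"
    "103,Alan,Engineering\n"
)

# Finding everyone in Engineering requires reading every record,
# because the file is organized only by the leading ID.
matches = []
for line in flat_file:
    emp_id, name, dept = line.strip().split(",")
    if dept == "Engineering":
        matches.append(name)
```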
Major Data Storage Types & Examples
Punched Cards
The leading card format was the IBM 80-column card, which remained dominant until magnetic storage began to displace punched cards in the 1950s.
Magnetic Tapes
Initially used for sound recording, magnetic tapes were adapted for data storage. Examples include:
- Half-inch tape formats originating from IBM reel-to-reel tape (e.g., IBM cartridges, StorageTek cartridges, DLT, and LTO)
- Quarter-inch and 8-mm QIC tape formats (e.g., 3M QIC, SLR, and Travan)
Magnetic Disks
The primary component in many modern storage systems, magnetic disks began with the IBM 350 Disk File, developed by the IBM team led by Reynold B. Johnson.
Optical/Magneto-Optical Storage Media
These are storage media that record information by changing photo-physical forms on their recording surfaces and read it back by emitting light beams against the surface and sensing their reflection. An early example is the LaserDisc (LD), based on optical disc technology invented by David Paul Gregg and first marketed in 1978.
Storage Class Memory
Non-mechanical storage media, such as flash memory, are currently deployed for secondary storage in computer systems. Flash memory is a type of electrically erasable programmable read-only memory (EEPROM).
Storage Networking (SAN)
Storage networking can connect arbitrary storage devices and computers via a network often designed specifically for storage devices. A Storage Area Network (SAN) links together multiple storage devices and provides block-level storage that can be accessed by servers.
Cloud Storage & Future Trends
Storage service providers (SSPs) began managing customers' storage systems in their data centers, allowing customers to access their business data via broadband networks. From the customers' viewpoint, this trend amounted to storage management outsourcing, enabled by emerging storage virtualization technology. Currently, major cloud-based storage services include Amazon S3, Windows Azure Storage, and Google Cloud Storage.
Network Storage vs. Cloud Storage
Cloud storage involves renting space from a provider, which can be imagined as renting space on several NAS devices located remotely. Network Attached Storage (NAS) provides space for the entire local network and is often found in local networks or small offices. With the ability to store and share data, most NAS devices can also run as servers (e.g., for websites, FTP, or other services).
SQL and Python: Complementary Tools
SQL provides the foundation for data retrieval, allowing you to build a dataset into a final table with all necessary attributes. From this large dataset, Python offers the power and flexibility to perform deeper analysis and answer complex questions.
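A minimal sketch of that workflow using only the standard library's sqlite3 module (the table and rows are illustrative): SQL performs the retrieval, and Python computes a derived answer from the result set.

```python
# Sketch of the SQL-then-Python workflow: SQL builds the dataset,
# Python answers a follow-up analytical question.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (first_name TEXT, last_name TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ada", "Lovelace", 90000.0),
     ("Grace", "Hopper", 95000.0),
     ("Alan", "Turing", 88000.0)],
)

# SQL does the retrieval...
rows = conn.execute("SELECT first_name, last_name, salary FROM employees").fetchall()

# ...and Python does the deeper analysis on the result set.
average_salary = sum(r[2] for r in rows) / len(rows)
conn.close()
```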
The Need for NoSQL Databases
NoSQL databases emerged due to the need for distributed systems that offer operational simplicity, allowing servers to be added and removed as needed, and providing flexibility for handling diverse data types and large volumes.
Types of NoSQL Databases
- Key-Value Database: Based on keys (identifiers for looking up data) and values (e.g., a baggage tag for a suitcase).
- Document Database: Also uses identifiers to look up values, but the values are typically more complex. Documents are collections of data items stored together in a flexible structure (e.g., First_name, Last_name, Position, Office_Number).
- Column-Family Database: Shares similarities with relational databases in its use of columns and rows, but groups related columns into column families and typically denormalizes data for performance rather than joining tables.
- Graph Database: Suited to model objects and relationships between objects, representing data as nodes and edges in a graph structure.
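The document model's flexible structure can be sketched with plain dictionaries, reusing the field names from the example above (the store itself is an in-memory stand-in, not a real document database):

```python
# Sketch of a document database's flexibility: documents in the same
# collection need not share identical fields.
documents = {
    "emp-1": {"First_name": "Ada", "Last_name": "Lovelace",
              "Position": "Engineer", "Office_Number": 12},
    # A second document can omit or add fields with no schema change.
    "emp-2": {"First_name": "Grace", "Last_name": "Hopper",
              "Position": "Scientist"},
}

# Look up a document by its identifier, then read a value inside it.
position = documents["emp-1"]["Position"]
has_office = "Office_Number" in documents["emp-2"]
```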
The Three Vs of Big Data
The three Vs of big data are:
- Volume
- Variety
- Velocity
CAP Theorem Explained
The CAP Theorem, proposed by Eric Brewer, states that it is impossible for a distributed data store to simultaneously provide more than two of the following three guarantees:
- Consistency: Every read receives the most recent write or an error.
- Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write.
- Partition Tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.
ACID vs. BASE Consistency Models
ACID (Atomicity, Consistency, Isolation, Durability)
ACID provides a safe environment in which to operate on your data:
- Atomic: All operations in a transaction succeed, or every operation is rolled back.
- Consistent: On the completion of a transaction, the database is structurally sound.
- Isolated: Transactions do not contend with one another. Contentious access to data is moderated by the database so that transactions appear to run sequentially.
- Durable: The results of applying a transaction are permanent, even in the presence of failures.
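Atomicity can be demonstrated with sqlite3: used as a context manager, the connection commits the transaction on success and rolls it back on an exception (the accounts table below is illustrative).

```python
# Sketch of atomicity: if any statement in the transaction fails,
# every operation in it is rolled back.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    with conn:  # one transaction: both statements succeed or neither does
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
        # A primary-key violation aborts the transaction...
        conn.execute("INSERT INTO accounts VALUES ('alice', 0)")  # duplicate key
except sqlite3.IntegrityError:
    pass

# ...so alice's debit was rolled back along with the failed insert.
balance = conn.execute("SELECT balance FROM accounts WHERE name = 'alice'").fetchone()[0]
conn.close()
```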
BASE (Basically Available, Soft State, Eventual Consistency)
BASE offers a less strict assurance than ACID, prioritizing availability and flexibility:
- Basic Availability: The database appears to work most of the time.
- Soft-state: Data stores don’t have to be write-consistent, nor do different replicas have to be mutually consistent all the time.
- Eventual Consistency: Data stores exhibit consistency at some later point. Overall, the BASE consistency model provides a less strict assurance than ACID: data will be consistent in the future, either at read time or after a propagation delay.
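A toy simulation of soft state and eventual consistency (the replica lists and propagation queue are illustrative, not a real replication protocol):

```python
# Toy sketch of eventual consistency: a write lands on one replica and
# propagates to the others later, so reads may briefly disagree.
replicas = [{"x": 1}, {"x": 1}, {"x": 1}]
pending = []  # updates waiting to propagate

def write(key, value):
    replicas[0][key] = value          # one replica accepts the write
    pending.append((key, value))      # propagation is deferred

def propagate():
    while pending:
        key, value = pending.pop(0)
        for replica in replicas[1:]:
            replica[key] = value      # replicas converge

write("x", 2)
inconsistent = replicas[0]["x"] != replicas[1]["x"]  # soft state: replicas differ
propagate()
consistent = all(r["x"] == 2 for r in replicas)      # eventual consistency
```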
Graph Databases: Nodes and Relationships
- A node is an object that has an identifier and a set of attributes.
- A relationship is a link between two nodes that contains attributes about their connection.
For example, nodes can represent entities (e.g., people), and relationships can represent their connections (e.g., in a social network).
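A minimal in-memory sketch of the social-network example, with nodes as attribute dictionaries and relationships as attributed links (identifiers and attributes are illustrative):

```python
# Sketch of the graph model: nodes carry attributes, and relationships
# link two nodes with attributes about the connection itself.
nodes = {
    "p1": {"name": "Ada"},
    "p2": {"name": "Alan"},
}
relationships = [
    {"from": "p1", "to": "p2", "type": "FOLLOWS", "since": 2020},
]

def followers_of(node_id):
    """Traverse relationships to find who follows a given node."""
    return [nodes[r["from"]]["name"]
            for r in relationships
            if r["type"] == "FOLLOWS" and r["to"] == node_id]

result = followers_of("p2")
```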
Key-Value Database Features
A key-value database stores data as a collection of key-value pairs, where a key serves as a unique identifier. Both keys and values can be anything, ranging from simple objects to complex compound objects. Key-value databases are highly partitionable, allowing for horizontal scaling at levels often unmatched by other database types. For example, Amazon DynamoDB allocates additional partitions to a table if an existing partition fills to capacity and more storage space is required.
Understanding Distributed Systems
A distributed system is a system with multiple components located on different machines that communicate and coordinate actions to appear as a single coherent system to the end-user. This means systems run on multiple servers rather than a single machine.
Two-Phase Commit Protocol
The two-phase commit protocol helps ensure data consistency in distributed transactions:
- Phase 1 (Prepare): The database writes, or commits, the data to the disk of the primary server.
- Phase 2 (Commit): The database writes data to the disk of the backup server.
This protocol helps ensure consistency because if the primary server fails, the system can switch to the backup database, which has the committed data.
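The primary/backup pattern described above can be sketched as follows. This is a deliberately simplified toy (real two-phase commit involves a coordinator collecting votes from all participants); the rollback step is an assumption added to keep the two copies from diverging:

```python
# Toy sketch of the two phases described above: commit to the primary,
# then to the backup; a failed backup write undoes the primary write
# so the two copies never diverge.
primary, backup = {}, {}

def two_phase_write(key, value, backup_fails=False):
    primary[key] = value                  # phase 1: commit to primary
    try:
        if backup_fails:
            raise IOError("backup unreachable")
        backup[key] = value               # phase 2: commit to backup
        return True
    except IOError:
        del primary[key]                  # undo phase 1 to stay consistent
        return False

ok = two_phase_write("order-42", "shipped")
failed = two_phase_write("order-43", "pending", backup_fails=True)
```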
Monotonic Write Consistency & Importance
Monotonic write consistency means that if you issue several update commands, they will be executed in the order they were issued. This ensures that the results of a set of commands are predictable, and repeating the same commands with the same starting data will yield the same results.
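Why ordering matters can be shown with two non-commuting updates (the balance and operations are illustrative):

```python
# Toy illustration of monotonic write consistency: updates applied in
# the order they were issued always produce the same final value.
balance = 100
updates = [("add", 50), ("multiply", 2)]   # issued in this order

for op, amount in updates:
    if op == "add":
        balance += amount
    else:
        balance *= amount

# In-order execution: (100 + 50) * 2 = 300. Reordering the same two
# updates would give 100 * 2 + 50 = 250, a different result.
```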
Key-Value Database: Values per Key
Only one value can be stored with a single key in a key-value database.
Namespace in Key-Value Databases
A namespace is a collection of identifiers. In a key-value database, it is important because keys must be unique within a given namespace to ensure proper data retrieval and organization.
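Both points above can be sketched together: within a namespace a key maps to exactly one value (a new write overwrites the old one), while the same key can coexist in different namespaces. The store below is an illustrative in-memory stand-in:

```python
# Sketch of namespaces in a key-value store: keys are unique within a
# namespace, and each key holds exactly one value.
store = {}  # namespace -> {key: value}

def put(namespace, key, value):
    store.setdefault(namespace, {})[key] = value

def get(namespace, key):
    return store[namespace][key]

put("sessions", "user:1", "token-abc")
put("profiles", "user:1", {"name": "Ada"})   # same key, different namespace
put("sessions", "user:1", "token-xyz")       # overwrites: one value per key

session = get("sessions", "user:1")
profile = get("profiles", "user:1")
```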