Data is a collection of raw, unorganized facts and details such as text, observations, figures, symbols, and descriptions of things
Data does not carry any specific purpose and has no significance by itself
Data is measured in terms of bits and bytes
Types
Quantitative
Numerical form => Weight, volume, cost of an item
Qualitative
Descriptive, but not numerical => Name, gender, hair color of a person
Information
Processed data is called Information
It provides context of the data and enables decision making
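The step from data to information can be sketched in a few lines of Python. The values and the summarize helper below are made up for illustration; the point is only that raw quantitative and qualitative values become useful once they are processed into something a decision can rest on.

```python
from statistics import mean

# Raw data: unorganized facts with no significance on their own (made-up values).
raw_costs = [120.0, 80.0, 100.0]            # quantitative data
raw_colors = ["black", "brown", "black"]    # qualitative data


def summarize(costs, colors):
    """Process raw data into information that gives it context."""
    return {
        "average_cost": mean(costs),
        "most_common_color": max(set(colors), key=colors.count),
    }


info = summarize(raw_costs, raw_colors)
print(info)
```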
Database
A database is an electronic system where data is stored so that it can be easily accessed, managed, and updated
DBMS
A database-management system (DBMS) is a collection of interrelated data and a set of programs to access those data
The collection of data, usually referred to as the database, contains information relevant to an enterprise
The primary goal of a DBMS is to provide a way to store and retrieve database information that is both convenient and efficient.
A DBMS is the database itself, along with all the software and functionality. It is used to perform different operations, like addition, access, updating, and deletion of the data
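The four operations named above can be tried with SQLite, a small DBMS that ships with Python as the sqlite3 module. This is a minimal sketch: the item table and its rows are hypothetical, and an in-memory database is used so nothing is written to disk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT, cost REAL)")

# Addition: insert new rows
conn.execute("INSERT INTO item (name, cost) VALUES (?, ?)", ("pen", 1.5))
conn.execute("INSERT INTO item (name, cost) VALUES (?, ?)", ("book", 12.0))

# Access: read the rows back
rows = conn.execute("SELECT name, cost FROM item ORDER BY id").fetchall()

# Updating: change an existing row
conn.execute("UPDATE item SET cost = ? WHERE name = ?", (2.0, "pen"))

# Deletion: remove a row
conn.execute("DELETE FROM item WHERE name = ?", ("book",))

remaining = conn.execute("SELECT name, cost FROM item").fetchall()
print(rows, remaining)
conn.close()
```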
Disadvantages of File System
Slow searching and inefficient memory utilization
Difficulty in accessing data
Concurrency => Data Inconsistency
Data Redundancy
Data isolation
Integrity problems
Atomicity problems
Security
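To make the atomicity and integrity points concrete, here is a sketch of what a DBMS offers that a plain file does not. It uses Python's built-in sqlite3 module; the account table, the names, and the transfer helper are hypothetical. A transfer that would break the balance rule is rolled back as a whole, so no half-finished update is left behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Integrity rule enforced by the DBMS: a balance can never go negative.
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # fits within alice's balance
failed = transfer(conn, "alice", "bob", 500)  # violates the CHECK, rolled back
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, failed, balances)
```

In the failed transfer the credit to bob is applied first and then undone by the rollback, which is the behaviour a file-based system would have to implement by hand.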
CAP Theorem
Concept in Distributed Databases
The CAP theorem states that a distributed system can only provide two of three properties simultaneously: consistency, availability, and partition tolerance
CAP
Consistency
In a consistent system, all nodes see the same data simultaneously
A read sent to any node should return the same data, reflecting the most recent write
Availability
It means that the system remains operational all of the time
Every request will get a response regardless of the individual state of the nodes
Unlike a consistent system, there’s no guarantee that the response will reflect the most recent write operation
Partition Tolerance
When a distributed system encounters a partition, it means that there’s a break in communication between nodes
If a system is partition-tolerant, the system does not fail, regardless of whether messages are dropped or delayed between nodes within the system
To have partition tolerance, the system must replicate records across combinations of nodes and networks
NoSQL Databases => Great for distributed networks, allow for horizontal scaling, and can quickly scale across multiple nodes
CA Databases
CA databases enable consistency and availability across all nodes
Unfortunately, CA databases can’t tolerate network partitions
In any distributed system, partitions are bound to happen, which means this type of database isn’t a very practical choice
Some relational databases, such as MySQL or PostgreSQL, provide consistency and availability when they are not spread across a network that can partition (for example, a single-node deployment)
CP Databases
CP databases enable consistency and partition tolerance, but not availability
When a partition occurs, the system has to turn off inconsistent nodes until the partition can be fixed
MongoDB is an example of a CP database
MongoDB is structured so that there’s only one primary node that receives all of the write requests in a given replica set
Secondary nodes replicate the data from the primary node, so if the primary node fails, a secondary node can stand in
AP Databases
AP databases enable availability and partition tolerance, but not consistency
In the event of a partition, all nodes are available, but they’re not all updated
When the partition is eventually resolved, most AP databases will sync the nodes to ensure consistency across them
Apache Cassandra is an example of an AP database
It’s a NoSQL database with no primary node, meaning that all of the nodes remain available
Cassandra provides eventual consistency because nodes can re-sync their data once a partition is resolved
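The CP and AP trade-off can be illustrated with a toy simulation. This is not how MongoDB or Cassandra is implemented: the Replica class, the reachable flag, and both write functions are hypothetical, and the CP rule is simplified to "every replica must be reachable" rather than a realistic majority vote.

```python
class Replica:
    """A toy node holding a single value; not a real database client."""

    def __init__(self, value):
        self.value = value
        self.reachable = True  # False simulates being cut off by a partition


class Unavailable(Exception):
    """Raised when a CP-style write is refused during a partition."""


def cp_write(replicas, value):
    """CP-style: refuse the write unless every replica can be updated."""
    if not all(r.reachable for r in replicas):
        raise Unavailable("partition: write rejected to stay consistent")
    for r in replicas:
        r.value = value


def ap_write(replicas, value):
    """AP-style: accept the write on whichever replicas are reachable."""
    for r in replicas:
        if r.reachable:
            r.value = value


nodes = [Replica("v1"), Replica("v1")]
nodes[1].reachable = False  # a partition cuts off the second node

try:
    cp_write(nodes, "v2")
    cp_result = "accepted"
except Unavailable:
    cp_result = "rejected"
cp_values = [r.value for r in nodes]  # unchanged: consistent but unavailable

ap_write(nodes, "v2")
ap_values = [r.value for r in nodes]  # nodes disagree: available but stale

print(cp_result, cp_values, ap_values)
```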
BASE property
Basically Available
System remains operational and provides basic functionality even in the presence of failures or partitioning
Soft state
The state of the system may change over time, even without any input or activity
Eventually consistent
The system guarantees that the data will eventually become consistent, but there may be a temporary period of inconsistency
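A rough sketch of eventual consistency, assuming two replicas represented as plain dictionaries and a last-write-wins rule based on a timestamp. Both the rule and the sync helper are assumptions chosen for simplicity, not a description of any particular database.

```python
def sync(replicas):
    """Background sync pass: every replica adopts the newest (ts, value) pair."""
    newest = max(replicas, key=lambda r: r["ts"])
    for r in replicas:
        r["ts"], r["value"] = newest["ts"], newest["value"]


a = {"ts": 1, "value": "v1"}
b = {"ts": 1, "value": "v1"}

# A write reaches replica a while replica b is unreachable.
a.update(ts=2, value="v2")

stale_read = b["value"]  # temporary inconsistency: b still serves the old value

sync([a, b])             # b's state changes without any client writing to it
fresh_read = b["value"]

print(stale_read, fresh_read)
```

The read before sync shows the temporary period of inconsistency, and the change to b during sync is an instance of soft state: the replica's contents change with no new input from a client.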
Master-Slave Architecture
Master-Slave is a general way to optimize IO in a system where the number of requests grows so high that a single DB server cannot handle them efficiently
The true, latest data is kept in the master DB, so write operations are directed there; read operations are served only by the slaves
This architecture helps safeguard site reliability and availability, and reduces latency
If a site receives a lot of traffic and the only available database is a single master, it will be overloaded with read and write requests, making the entire system slow for everyone on the site
DB replication takes care of distributing data from the master machine to the slave machines
This can be synchronous or asynchronous depending upon the system’s need
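The read and write routing described above can be sketched with in-memory dictionaries standing in for database servers. The MasterSlaveCluster class and its methods are hypothetical; real replication happens inside the database engine, and the replicate method here only imitates an asynchronous catch-up step.

```python
import itertools


class MasterSlaveCluster:
    """Toy cluster: writes go to the master, reads rotate across the slaves."""

    def __init__(self, slave_count, synchronous=True):
        self.master = {}
        self.slaves = [{} for _ in range(slave_count)]
        self.synchronous = synchronous
        self._pending = []  # writes not yet copied to the slaves (async mode)
        self._next_slave = itertools.cycle(range(slave_count))

    def write(self, key, value):
        """All writes are directed to the master."""
        self.master[key] = value
        if self.synchronous:
            # Synchronous: the slaves are updated before the write returns.
            for slave in self.slaves:
                slave[key] = value
        else:
            # Asynchronous: the write returns now and is replicated later.
            self._pending.append((key, value))

    def replicate(self):
        """Copy pending writes from the master to every slave."""
        for key, value in self._pending:
            for slave in self.slaves:
                slave[key] = value
        self._pending.clear()

    def read(self, key):
        """Reads are served only by slaves, chosen round-robin."""
        slave = self.slaves[next(self._next_slave)]
        return slave.get(key)


sync_db = MasterSlaveCluster(slave_count=2, synchronous=True)
sync_db.write("user:1", "Ada")
sync_read = sync_db.read("user:1")

async_db = MasterSlaveCluster(slave_count=2, synchronous=False)
async_db.write("user:1", "Ada")
lagging_read = async_db.read("user:1")    # slaves have not caught up yet
async_db.replicate()
caught_up_read = async_db.read("user:1")

print(sync_read, lagging_read, caught_up_read)
```

Synchronous replication keeps reads fresh at the cost of slower writes, while asynchronous replication keeps writes fast but lets a slave briefly return missing or stale data.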