View: A set of SQL instructions that comprise a table set. Every time a view is queried the underlying SQL that comprises the view is run to retrieve the query
Can see view source in dbeaver by going to table properties -> source. This is just SQL code
Materialized view: A database view that consists of all of the data already loaded - you do not need to run the queries defining the view to return the data
Window functions: Operate on a set of rows (like a group by) but does not squash the results to one aggregated rows
Inverted index: mapping of content/key/term to location on disk
Very effective for finding “needle in a haystack” type values
These are search indexes
Each term points to a posting list (list of locations on disk for documents that contain this term)
It is inverted because it maps terms -> documents, rather than documents -> terms
It is no relation to the indexes used in rows/columns
A document is just anything that is stored in the database
Document Sharding -> Sharded by document, not term. Thus, each document will be shared on different machines. This allows for the use of multiple CPU cores to process a query. Allows for lower latency, but comes at the cost of lower throughput
Elasticsearch uses document sharding, as does Rockset, but it is less common
Traditional sharding shards by term -> each document corresponding to a term will live on a single node (optimizing for throughput)
Correlated Subquery - Where data in a subquery references a value from an outer scope
JDBC - Java Database Connectivity: a java API that allows you to connect to a database, issue commands, etc.
Handles Java applications connecting to a database
Flow is: Application code -> JDBC -> Database specific JDBC drivers -> Databases
You need a JDBC driver for a given database for a Java program to interact with it
Postgres is not threaded, it just spawns new processes
A schema is a named collection of tables
Can contain views, indexes, etc.
RDBMS Types
OLAP (Online analytical processing) - low number of complex queries for business analytics (e.g.)
OLTP (Online transactional processing) - high number of simple queries (webapps)
Online refers to the fact that the queries execute immediately, not in batches
RocksDB - optimized for SSDs -> can be used as a storage layer
Sharding is a more general form of database partitioning which is splitting up your data into multiple tables/databases
Sharding is just horizontal partitioning
In sharding, you replicate the schema across multiple databases, and have some logic or identifier to determine which server
to look for the data at. This identifier is called a shard key
One common way to shard is based on the alphabet (e.g. A-G key 1, etc. ), hash the user id (if you’re searching for a user, etc.)
If sending a query for a column that the data is not sharded on, the query will be sent to all nodes and the results will be combined
In distributed SQL databases the data needs to be partitioned across nodes automatically
Take up more space than RDBMS’s because they must store key:value for all documents where in RDBMS’s the key is pulled into the column named and only stored once
Turning an interpreted program into a native one at runtime
In databases, instead of having general purpose code to evaluate a predicate, can generate a function specific to that expression (e.g. return val=3, etc. to run natively by the CPU and gain performance improvements)
Many databases (like Postgres) have support for doing this with LLVM