- Part 1: Into to Databases and the Relational DB Model:
- Background, Relational Model,
- Database Constraints
- Part 2: DB Access and the SQL declarative language
- SQL basics,
- Advanced SQL,
- SQL as a DDL (Database Definition Language) and DML (Database Manipulation Language)
- Part 3: Accessing a SQL DBS programmatically
- Embedded SQL (C), and
- JDBC (Java)
- Part 4: Logical DB Design
- Data Modelling, ER and EER models
- Mapping to the Relational Model
- Normalisation
- Relational Algebra and Calculus
- Data Modelling, ER and EER models
- Part 5: Database Security
- Structured - tables with predefined columns
- Unstructured - web pages, text docs, images, videos
- Semi structured data - JSON or XML
- Traditionally, DBs managed structured (tabular data)
- Nowadays “modern” DBs handle text, documents, graphs…
- DBs were designed for “structured” data
- Un/semi-structured is textual data and is the subject of Information Retrieval (IR)
- Nowadays, IR and DBs are coming closer
- Also, many graphs DBSs have emerged to handle graph-type data (social networks, proteins, …)
- Un/semi-structured is textual data and is the subject of Information Retrieval (IR)
- Data modelling “defines” the data space:
- If you give me a table, I will ask for rows and columns…,
- If you give me a graph I will ask of nodes and edges and paths…,
- Which are the data items that I care about?
- What are the data items’ attributes (characteristics)?
- What are the relationships between data items?
- What are the attributes of such relationships?
- As mentioned, the data model employed in Relational DBs is table-centric.
How can a relationship be build between different linking bits of data?
- What are the principal functions of a DBS?
- In a nutshell, DB systems provide software too “manage” data
- Model data
- Access data: query + update
- Analyse data: complex queries
- Store data
- Secure and ensure the integrity/consistence of data
What do they allow us to do?
- 1:
- Ensure/maintain data consistency, integrity, and security in the dace of:
- Attempts for unauthorized accesses and/or to corrupt data
- Updates (a.k.a transactions) may violate “integrity constraints” on the data (e.g. bank account balances less than 0)
- Failures (think of machine crashes due to sw bugs, power outages, disk crashes) and recovery from failures.
- Concurrent transactions (think of multiple users or programs reading/writing the same data at the same time)…
- Ensure/maintain data consistency, integrity, and security in the dace of:
- 2:
- Optimize data accesses - that is, ensure tshort response times, proper utilization of resources, high throughput, etc.
- Create and employ indexes - speed up finding where desired data items are
- Find best way to execute a set of operators in a query
- Find the best order for executing relational data operations:
- Selection, progection, join
- Order of joins in a multi-way
- Optimize data accesses - that is, ensure tshort response times, proper utilization of resources, high throughput, etc.
- Can be seen as a box with a nice interface offering the previous functionality
- DBS interface is based on :
- An easy way to understand the data space (model)
- On a declarative programming language
- Once a query is received by the box, all previously mentioned functionality will be provided to us, transparently - the “box” worries about it ; NOT THE USER
Architectural schematic of a database:
A definition for a database - a collection of related data - known facts that can be recorded and have implicit meaning
- This definition is typically too general
A database has the following implicit properties:
- A database represents some aspect of the real world, sometimes called the miniworld or the universe of Discourse changes to the minimiworld are reflected in the database
- A database is a logically coherent collection of data with some ingerent meaning. A random assortment of data cannot correctly be referred to as a database.
- A database is designed, build, and populated with data for a specific purpose. It has an intended group of users and some preconcieved applications in which the users are interested.
A database can be of any size and complexity.
A database management system is a computerized system that enables users to create and maintain a database. The DBMS is a general purpose software system, that facilitates the process of defining, constructing, manipluating and sharing databases among various users and applications.
- Defining a DB entails specifying the data types, structures and constraints of the data to be stored in the database.
- On creation of the database, the above data is stored in the metadata of the database.
- Constructing the DB is the process of storing data on some medium that is controlled by the DBMS.
- Manipulating a database includes functions such as querying the database to retrieve specific data, updating the database to reflect changes, and generating reports from the data.
- Sharing a DB allows multiple users and programs access the database simultaneously.
- An application program accesses the database by sending queries or request for data to the DBMS
- A query typically causes some data to be retrieved
- A transaction may cause some data to be read and some data to be written.
- Protecting the database and maintiaing it over time
- Protection can entail:
- hardware or software malfunction
- security protection against unauthorized or malicious access
In a DB approach, a single repository maintains data that is defined once and then accessed by vaiour users repeatedly through queries, transactions and application programs.
The main characteristics of the DB approach are:
- Self-desciribing nature of the database system
- Insulation between programs and data, and data abstraction
- Support of multiple views of the data
- Sharing of data and multiuser transaction processing
The DBS contains not only itself, but also a complete definition or description of the database structure and contraints.
This definition is stored in the DBMS catalogue - metadata.
- This is used by the DBMS software and users who need information about the DB structure
A DBMS is a very multi-purpose application, and must be able to work equally well with any number of DB applications - a university, a bank, or a company DB, so long as the DB definition is stored in the catalogue.
The structure of data files is stored in the DBMS catalogue seperately from the access programs:
- This is program data independence.
- Informally, a relation is a table of values, having:
- A set of rows. The data elements in each row represent certain facts that correspond to a real-world entity or relationship
- An employee, a bank account, a student etc …
- Formally, rows are called tuples
- Each column
- Represents a characteristic/attribute of interest of that entity
- Has a column header that gives an indication of the meaning of the data items in that column
- Formally called attribute name
- Key of relation:
- Each row must be uniquely identifiable in the table
- The key of the row does this
- Sometimes row-ids or sequenctial numbers are assigned as keys to indentity the rows in a table
- Called artificial key or surrogate kay
- Each row must be uniquely identifiable in the table
- A set of rows. The data elements in each row represent certain facts that correspond to a real-world entity or relationship
- A prevalent data model and DBS type
- Based on tables (relations);
- With known table names, and attribute (name, type) pairs.
- To formulate queries:
- Specify the table name(s), and attribute names of interest
- special constraints (predicates) that must be identified
- The Schema of a Relation:
- Denoted by
$R(A1, A2, …, An)$ -
$R$ is the name of the relation - The attributes of the relation are:
$A1,A2, …,An$
- Denoted by
- Example: CUSTOMER(Cust-Id, Cust-name)
- CUSTOMER is the relation name
- Defined over the two attributes
- Each attribute has a domain or a set of values
- A tuple (row) of a relation (table) is an (ordered) set if values, enclosed in angled brackets.
- Each value is derived from an appropriate domain
- A relation is an (unordered) set of tuples.
- A domain may have a specific format
- Relation state is the set of tuples currently in the relation.
- The relation state is a subset of the Cartesian product of the domains of its attributes
- Each domain contains the set of all possible values the corresponding attribute can take.
- Formally,
- Given
$R(A1,A2,… ,An)$
- Given
-
$R(A1,A2,… ,An)$ is the schema of the relation -
$R$ is the name of the relation -
$(A1,A2,…,An)$ are the attributes of the relation -
$r(R)$ : a specific state [instance] or relation$R$ - an actual set of tuples [rows]-
$r(R) = t_1,t_2,...,t_n$ where each$t_i$ is an$n$ tuple -
$t_i = \langle v_1,v_2,...,v_n\rangle$ where each$v_i$ comes from$dom(An)$ $r(R) \sub dom(A1) \cross dom(A2) \cross ... \cross dom(An)$
-
Why use the mathematical model for the DB:
- Presents a human with an understandable format for our data
- With the goal to simplify how:
- To interrogate it
- Maintain it
- Database = collection of relations
- Relation schema = relation name and attribute list
- Optionally: types are specified.
- Database schema = set of all relation schemas in the database.
- Order of
- Tuples in a relation is not important
- Attributes in a tuple is not important - same order is assumes
- Collection of relation schemas
- Relational database schema
$\longleftarrow$ intension - model of your data, how you want it to be stored
- Relational database schema
- Corresponding relation instances/states
- Relational database
$\longleftarrow$ extension - when that data has been inserted into the database
- Relational database
- NB:
- intension vs. extension
- schema vs. data
- Metadata of the DB: schema and other key table info
- Why Relations?
- Very simple
- Abstract model
- Has been enriched with declarative languages, such as SQL, the most important in DBMS today, simplifying users’ interactions with data
- If you just select a particular field, then will only return the individual instances.
- Set oriented
- Sets of attributes in a tuple
- Sets of tuples in a relation
- SQL is based on bags (multisets)
- NOTE Sets
$\neq$ Bags
- on keys
- on entities (entity integrity)
- on references (referential integrity)
- Possible violations of contrains
- Functional Dependencies
- Used to determine the permissible states of relation instances
- Three (explicit schema-based) constrains used in the R-DBMSs:
- Key constrains
- Entity contraints
- Referential integrity constraints
- Domain constraints
- Values in an attribute in a tuple must come from the domain of that attribute
- values could also be NULL
- NULL could be where we do not yet know
- Values in an attribute in a tuple must come from the domain of that attribute
- SuperKey of
$R$ a subset of attributes$R, SK$ such that: - In any valid state for
$r(R)$ :- For any two distinct tuples:
$t_1,t_2 \in r(R)$ $t_1[SK] \neq t_2[SK]$
- E.g.
-
$R$ (Name, Country, PhoneNumber) - $SK = {$County, PhoneNumber$}$
-
- For any two distinct tuples:
- Candidate key of
$R$ : a minimal superkey- Any key
$K$ is a “minimal” superkey if:- The removal of any attribute from
$K$ results in a set of attributes that is no longer a superkey.
- The removal of any attribute from
- if a relation has several candidate keys, one is chosen arbitrarily.
- Any key
- Entity Integrity:
- The primary key attributes (
$PK$ ) cannot be NULL in any tuple of$r(R)$ - This is because primary key values are used to indetiy the individual tuples
- If PK has several attributes, null is not allowed as a value in any of these attributes
- Any attribute or
$R$ may not be allowed to be null- Specified when the table is created
- The primary key attributes (
- A table Students lists data about students such as age, grades etc.
- Students has an attribute Student_Id
- This is the PK.
- Another table Student-Courses has an attribute Student_ID
- Listing, for each module which students are taking it, at which term
- (Bad idea to put this into one table)
Splitting data across different tables allows for much ‘better’ queries with more depth, and prevents too much NULL data in feilds.
- A set of attributes
$FK$ from relation$R1$ is a foriegn jey that references relation$R2$ if:- Attribute(s) in the
$FK$ from$R1$ have the same domain as the attribute(s) in the primark key$PK$ of$R2$ , and - A value of
$FJ$ in a tuple$t_1$ of$R!$ mustt either:- Refer to a value of the
$PK$ of some tuple$t_2$ in$R2$ - Or be NULL
- Refer to a value of the
- Attribute(s) in the
Summary
Keys, Permissibility of NULL, Candidate Keys (called unique), Foreign Keys, Referential Integrity are expressed by the CREATE TABLE statement.
- Other “sematic attribute integrity constraints” also exist.
- For example bank balance > 0
- Constraints exist to define permissible values
- This concerns DB updates
- Tuple insertions/deletions
- Attribute-value updates which may, by definition, cause constraint violations
- The DB ensures all contraints are respected at any point
-
Insert:
- Domain
- One of the attribute values for new tuple is not in the attribute domain
- Key
- The value of a key attribute in new tuple already exists
- Referential Integrity
- A foreign key value in new tuple references a primary key value that does not exist in referenced relation
- Entity Integrity
- If primary key value is null in new tuple
- Domain
-
Delete
- May only violate referential integrity
- If PK value of the tuple being deleted is referenced from other tuples in other relations
- Some option must be specified during DB design for each foreign key constraint on how to handle such deletions leading to RI violations
-
UPDATE
- The domain and NOT NULL contraints may be violated on an attribute being modified
- Updating PK
- Updating a FK
- May violate RI
- Updating an ordinary attribute may only violate domain contraints
- Why do we need
- A formal tool that allows derivation of “good” DB design
- A good design depends on the dependecies between attributes chosen to be in the same relation.
- Assume
-
$X,Y,Z$ represents sets of attributes -
$A,B,C$ represents single attributes.$ABC$ denotes the set${ A, B, C}$
-
-
$X\Rightarrow Y$ is an assertion about a relation$R$ and two sets of attributes from$R$ - Specifies a constraint on the possible tuples in $ r(R)$
- If
$X \Rightarrow Y$ the values of the$Y$ component of a tuple depend on values of the$X$ components- The values of the
$X$ component of a tuple uniquely (functionlally) determone those of the$Y$ component
- The values of the
- So,
$X \Rightarrow Y$ specifies that whenever two tuples of$R$ agree on values of all the attributes of$X$ , then they must also agree on values of attribute of$Y$ - Formally:
$t_1[X] = t_2[X] \Rightarrow t_1[y] = t_2[Y]$
- Formally:
- Claim: If
$K$ is a superkey, then$K \Rightarrow A$ , for any (subset of) attribute(s)$A \in R$ - Claim: If
$X\Rightarrow R$ then$X$ is a superkey.
- We could assert a key
$K$ - Then onlys FDs asserted are that
$K \Rightarrow A$ for every attribute$A$ - No surprise: K is the only key for those FDs, according to the fomal defintition of key
- Then onlys FDs asserted are that
- Or we could assert some FDs and deduce one or more keys by the formal definition.
- Rule of Thumb: Often, DFs either come from keyness or from physics.
Data Abstraction generally refers to the suppression of details of data organization and store, only highlighting the essential features for an improved understanding of data. One of the main features of the DB approach is to support data abstraction so that different users can perceive data at a preferred level of details.
Data model - a collection of concepts that can be used to describe the structure of a database - provides the necessary means to achieve this abstraction.
Database Structre - the data types, relationships, and contraints that apply to the data. Most data models also include a set of basic operations for specifying reteivals and updates on the DB.
In addition to this, it is becoming common to include concepts in the data model to specify the dynamic ascpect or behabiour of a DB application, allowing the DB designer to specify a set of valid user-defined operations that are allowed on the Db objects.
The basic operations such as INSERT, DELETE, UPDATE or RETRIEVE any kind of object are typically included in the basic data model operations.
Data models range from the high level or conceptual models provide concecpts that are close to the way the user would typically perceive the dat, whereas low level or physica; models typically describe the details of how data is stored on the computer storeage - mostly magnetic disls.
Between these two, are representational or implementation data models, which provide concepts more easily understood by end users, but not too far removed from the way data is organized in computer storage. Representational data models hide many details of data storage on disk, but can be implemented on a computer system directly.
Conceptual data models use concepts such as entities, attributes, and relationships. An entity represents real-world object or concept, such as en employee or a person which is described in the DB. An attribute represents some property of interest that further describes an entity. A relationship across two or more entities represents an association among the entities. E.g. an employee and a project to which they have been assigned.
Representational or implementation data models are those used most frequently in traditional DBMSs, including the relational data model, the network and hierarchical data models.
The description of a database is called the DB Schema, which is specified during design, and not expected to change often. This describes the structure of each record type, but not eh actual instances.
- The internal level has an internal schema, which describes the physical storage structure of the DB, it uses a physical data model and describes the complete details of data storage and access paths in the DB.
- The conceptual level has a conceptual schema, describing the structure of the whole database for a community of users. This hides the details of storage, and concentrates on describing entities, data types, relationships, user operations and constraints. Typically, a representational data model is used to describe the conceptual schema when the DB is implemented.
- The external view includes a number of external schemas or user view. Each external schema describes the part of the DB that a particular user group is interest in and hides the rest from that UG. A above, each schema is normally implemented using a representational data model based on an external schema.
This idea helps the user visualise the schema levels in the DBS, despite this however most DBMSs do not separate these explicitly.
Note - all three levels of schemas are only descriptions of the data, not actual parts of the data itself.
The three schema architecture can be used to explain the concept of data independence.
- Logical data independence is the capacity to change the conceptual schema without having to change external schemas or application programs
- Physical data independence is the capacity to change the internal schema without having to change the conceptual schema
Wherever there is a multiple level DBMS its catalogue must be expanded to include information on how to map requests and data among the various levels.
Where no strict separation of levels is maintained:
- DDL - data definition language is used by the DBA and the DB designers to define both schemas.
- The DMBS will have a DDL compiler whose function it is to process the DDL statements in order to identify descriptions of the schema construct and to store the schema
Where a clear separation is maintained between the conceptual and internal levels, the DDL is used to specify the conceptual schema only:
- The SDL - storage definition language is used to specify the internal schema. The mapping between the two can be specified in either of the two languages
- Typically, now a days, there is no specific language that performs the role of SDL. The internal schema is specified by a combination of functions, parameters and specifications related to storage of files, permitting the DBA staff to control indexing choices and mapping of data to storage.
For a true three level architecture, we need a third language the:
- VDL - view definition language to specifiy user viewsa and the mappings to the schema
- Typically, the DDL is used to define both conceptual and external schemas.
- In relational databases, SQL is used in the role of VDL to define user applications views as results of predefine querues
Once this has been compiled, and the DB populated, the language used to manipulate the data is called the:
- DML - data manipulation language.
NOTE - currently, the previous types if languages are not considered distinct languages, rather an integrated language is used that includes the functionality of all of them combined. Excluding the storage definition, as it is used for defining physical storage structure to fine tune DB performance
Typically, DBMSs allow the user to either enter DML statements from a command line, a GUI, or embedded into a general-purpose programming language.
- A low level or procedural DML must be embedded in a gen-pop programming language.
- This type of DML typically retrieved individual records or objects from the DB and processes each separately, therefore must use programming language construct, such as looping, to reteive and process each record. These are also often called record at time DMLs.
- High level DMLs such as SQL can specify and retrieve as many records in a single DML statement - called set at a time or set oriented DMLs.
- This generally specifies what data to retrieve rather than how - also called declarative.
-
The image below shows a simplfied diagram of the DBMS components. This is divided into two parts:
- The top half references the users of the database and the appropriate interfaces available to them.
- The bottom half references the internal modules of the DBMS responsible for storeage of data and transactions.
-
The database and the DBMS catalogue are usually stored on disk, access to the disk is controlled primarily by the OS, which schedules disk read/write. Many DBMSs have their own buffer management module to schedule dis read/write.
- Management of buffer space has a considerable effect on performance.
- Reducing tee number of disk read/write improves performance considerably.
-
A higher level stored data manager module controls access to the DBMS information that is stored
User Section
- This interface shows the DBA staff, casual users, application programmers, and parametric users who do data entry by supplying parameters to pre-defined transactions.
- The DBA staff work on defining the database and tuning it by making changes ti uts definition using the DDL and other privileged commands.
- Casual users and persons with occasional need for information from the DB interact using the interact query interact. These queries are validated for correctness etc by the query compiler that compiles the queries into an internal form. This query is then optimised - reordering of operations etc.
- Application programmers write programs in host languages such as Java or C, that are submitted to a precomplier. This extracts DML command from an application program written in the host programming language. These are send to the DML compiler for compilation into object code for DB access. The rest is sent to the host language compiler. The object codes for the DML commands and the rest of the program are linked.
- It is becoming common to use scripting language such as Python or #PHP to wrtie database programs
Most DBMSs have database utilities that help the DBA manage the DB system.
- Loading - a loading utility used to load existing files - text or sequential files into the DB.
- Typically, the source format of the current and the target database file structure are specified to the loading utility. This automatically reformats the data and stores it in the DB.
- It has been very common to transfer data across databases, therefore some DBMSs come with conversion tools.
- Backup a backup utility created a copy of the entire database. Typically by dumping the entire DB onto tape or a mass storage medium. Incremental backups are also used to backup any changes made to the DB.
- DB storage reorganisation - This can be used to reorganise a set of DB files into different organizations and create new access paths to improve performance.
- Performance Monitoring - This monitors DB usage and provides statistics to the DBA. This uses the stats in making decisions such as whether to reorganise files or where to add or drop indexes to improve performance.
- Data dictionary/ repository system - in addition to storing catalogue information about schemas …, the data dictionary stored information such as design decisions, usage standards, application program descriptions and user info.
- This can be accessed directly by the user or the DBA when needed.
- Application Development Environment provide an environment for developing DB applications and inculde facilities that help in many aspects of DB systems such as DB design, GUI development, query and updating, and AP development.
- Communications software - This allows users to remote access to the DB system site through computer terminal, workstations, or personal computers. These are connected to the DB site through various forms of data/network connection hardware.
- The main model is the relational data model, and the systems based of this are SQL systems.
- Occasionally some NOSQL systems
- Some object-relational DBSMs
This gives rise to the following classifications:
- Relational
- Object
- Object-Relational
- NOSQL
- Key-value
- Hierarchical
- Network
Conceptual modelling is a key phase in designing a DB application.
- DB application tends to refer to a particular database and the associated programs the implement the DB queires and updates
The image below shows a diagram of the rough process for designing a DB application.
Entity Relationship (ER) Diagrams
- Entity relationship diagrams give a graphical representation of the relationships between different entities and their attributes.
Entities and Attributes
The basic concept that ER model represents is an entity - something that is a thing or an object in the real world with an independent existence. Either physical or conceptual.
Each entity has attributes - the particula properties that describe it (A person may have an attribute AGE)
Composite verus Simple (Atomic) Attributes
- Composite attributes can be divided into sub-parts, which represent more basic attributes with independent meaning.
- An Address for example may have multiple components:
- Street name
- Post code
- House Number
- An Address for example may have multiple components:
- Simple attributes are those that are not divisible any more
- The composition of composite and atomic attributes can leave you with a hierarchy
Single-Valued versus Multivalued Attributes
Most attributes have a single value for each entity: these are called single valued entities (a person has one age). Some attributes have multiple values: the colour of a car which has been painted red and blue has two values.
Stored versus Derived Attributes
In some cases, two (or more) attribute values are related, for example the age and birth date of a person are related. They are both derivable from each other.
- Say we store the Birth_Date of a person, the Age of that person is called a derived attribute, and is derivable form the birth date, which is called the stored attribute.
- Some attribute values can be derived from related entities. - The number of students on a course at university can be derived from the number of entries of students.
A database typically contrains groups of entities that are similar - a company may want to store similar information on each of the employess - they share the same attributes but with different values. An entity type defines a collection (or set) of entities that have the same attributes.
- The collection of all entities of a particular type in the database at any point is called an entity set or collection, this is usually referred to using the same name as the entity type, despite being two different concepts.
- An entity type is represented in ER diagrams as a rectangular box enclosing the type name. Attribute names are enclosed in ovals and attached to the entity with straight lines.
- Composite attributes are attached to their components by straight lines
- Multivalued attributes are depicted by double ovals.
- An entity type descibes the schema or insertion for a set of entities that share the same structure
An important constraint on the entities of an entity type is the key or uniqueness constraint on attributes. An entity type usually has one or more attributers whose values are distinct for each entity in the set. This is called a:
- Key Attribute, and its values can be used to identity each entity uniqeuly. For example a company name could be a key as no company can have the same name as another.
- Sometimes multiple values may make up a key, meaning the combination of each attribute must form a unique key: an address. Lots of houses can share a house number or street, but none can have the same combination of both.
- If this is the case, the proper practice is to define a composite attribute and designate it as a kay attribute of the entity type.
- In an ER diagram, they key attribute has its name underlined in the oval.
- Defining something as a key, means that it must be unique for every entry in the set, at any point in time - not just a particular set.
- Some entities may have multiple keys.
(Covered in notes above in detail - see RDBM section)
A relationship type
- Mathematically, the set
$R$ is the set of relationship instances$r_i$ where each$r_i$ associates$n$ individual entities:$(e_1,e_2,…,e_n)$ , and each of these entities$e_j$ in$r_i$ is the remember of the entity set$E_j : 1 \leq j \leq n$ , - Therefore, a relationship set is a mathematical relation on
$E_1,E_2,…,E_n$ , or can be defined as a subset of the cartesian product of the entity sets. - Each of the entity types is said to participate in the relationship type
$R$ , and each of the individual entities is said to participate in the relationship instance$r_i = (e_1,e_2,…,e_n)$
From the textbook “Informally, each relationship instance
View the example below:
- In ER diagrams, relationship types are displayed as diamond box shapres, which are connected by straigt lines to the rectangular boxes representing the entity types. The relationship name is showed in a diamond shaped box.
The degree of a relationship type is the number of participating eneity types.
- A degree of two is called binary, and of three is called ternary. An example of a ternary is shown below:
- Relationships can be of any degree, but tend to be binary.
Sometimes it is useful to think of a binary relationship as an attribute.
As Stated by text book: - come back to this to explain better
Consider the WORKS_FOR relationship type in Figure 3.9. One can think of an attribute called Department of the EMPLOYEE entity type, where the value of Department for each EMPLOYEE entity is (a reference to) the DEPARTMENT entity for which that employee works. Hence, the value set for this Department attribute is the set of all DEPARTMENT entities, which is the DEPARTMENT entity set. This is what we did in Figure 3.8 when we specified the initial design of the entity type EMPLOYEE for the COMPANY database. However, when we think of a binary relationship as an attribute, we always have two options or two points of view. In this example, the alternative point of view is to think of a multivalued attribute Employees of the entity type DEPARTMENT whose value for each DEPARTMENT entity is the set of EMPLOYEE entities who work for that department. The value set of this Employees attribute is the power set of the EMPLOYEE entity set. Either of these two attributes—Department of EMPLOYEE or Employees of DEPARTMENT—can represent the WORKS_FOR relationship type. If both are represented, they are constrained to be inverses of each other.
Each entity that participates in a relationship type plays a particular role in the relationship. The role name signifies the role that a participating entity from the entity type plays in each relationship instance.
-
Taking the example for a WORKS_FOR relationship:
- An EMPLOYEE plays the role of employee or worker
- DEPARTMENT plays the role of department or employer.
-
Role names are not technically necessary in relationship types where all the participating entity types are distinct, since each entity type name can be used as the role name.
-
In some cases, the same entity type participates more than once in a relationship type in different roles
- Here it becomes essential for distinguishing the meaning of the role that each entity plays
-
This is called recursive relationships or self referencing relationships
Take the example below:
- The SUPERVISION relationship can have multiple connections for each employee.
- The lines labelled 1 represent the supervisor role, and 2 represent the supervisee role.
Relationship types have constraints that limit the certain possible combinations of entities that may participate in the relationship set.
The cardinality ration specifies the maximum number of relationship instances that an entity can participate in. The possible values for these rations are:
$1:1$ $1:N$ $N:1$ $M:N$
The participation constraint specifies whether the existence of an entity depends on its being relation to another entity via the relationship type.
-
This specifies the minimum number of relationship instances that each entity can participate in (minimum cardinality constraint). There are two types of participation constraints:
- Total: Take the example of a company policy that every employee must work in a department is total participation. In other words, this is called an existence dependency
- Partial: Take the example where not every employee is expected to run a department, this is partial participation.
-
In ER diagrams, total participation is displayed as a double line, and partial participation is displayed as a single line
- Relationships can have attributes similar to those of entity types.
- To record the number of hours per week that an employee works on a project, an attribute HOURS can be added to a the WORKS_ON relationship
Entities without key attributes are called weak entity types - all previous so far, are called strong entity types. Entities belonging to a weak entity type are identified by being related to specific entities from another type combined with one of their attribute values.
The other entity is called the identifying/owner entity type, and call the relationship between them the identifying relationship.
A weak entity always has a total participation constraint with respect to its identifying relationship, as it cannot be identified without its owner.
Take the example of CHILDREN related to EMPLOYEE, used to keep track of each employee with a 1:N relationship. The children of different employees may be called the same thing, and have the same DOB, but they are still distinct. In this case what separates the children is the EMPLOYEE which relates to that child.
This means that each weak entity type typically has a partial key, which can identify different weak entities related to the same owner entity.
- Come back to this, mainly slightly more in depth than what was on slides.
-
SQL is a Higher level language
- Declarative
- What to do rather than how
- How to do is typically used in procedural languages
- Declarative languages avoid a lot of the complexity involved for data-manipulation which is present in languages like Java
- Even during data definition, we simply state the constraints and the DB behind the scenes ensures the consistency
-
The Philosophy is:
- Tell a DBS what is wanted
- The DBS will find an “optimal” solution to execute the query
- Referred to as query optimisation
- Tell a DBS what is wanted
-
SQL is both:
- DDL - Data definition Language: schemas, relations constrains
- DML - Data manipulation language: updates, queries
| Relational | SQL |
|---|---|
| Relation | Table |
| Tuple | Row |
| Attribute | Column |
Definitions created via the CREATE command.
- Evlolving over time
- Later standards have core specifications and extensions
- Extensions are typicallh for emerging applications - data mining for example
- SQL-2006 added XML and 2008 added O-O features
- SQL-3 is the current standard which started with SQL-1999
- SQL must operate with the provided DBMS
Each DB contains one or more schemas
- Each schema contains one or more tables
The CREATE SCHEMA statement: CREATE SCHEMA company AUTHORIZATION ‘leon’;
- Creates a schema company owned by leon
- The schema company can create one or more tables
Catalogue
- Name collection of schemas in an SQL environment
- This specifies new relations they provide:
- Name of the table
- The attributes, along with their types and names
- Any constrains
- All tables belong in a schema
- Base tables (relations)
- Relation and its tuples are created and stored as a file by the DBMS
- Virtual relations (views)
- Created through the CREATE VIEW command - doesn’t necessarily correspond to any physical file.
-
SQL Create table:
CREATE TABLE <tableName> (...);
- tableName is subsituted with the name of the table/relation variable
- (…) is the list of attributes and constraints
-
Attribute/Constraint List Format
(<Att1Name><Att1DataType><AttConstraints>,<Att2Name><Att2DataType><AttConstraints>)
| ANSI Data Type | Desc. | PostgreSQL | SQLITE3 |
|---|---|---|---|
| CHARACAT(n) | Character string on n chars, fixed length | CHAR(n) | TEXT |
| CHARACTER VARYING(n) | Character String up to n chars, variable length | VARCHAR(n), TEXT | TEXT |
| NUMERIC(p,s) | Number with selectable precision | NUMERIC(p,s) DECIMAL(p,s) | NUMERIC |
| INTEGER, INT, SMALLINT | Integers with 4 or 2 bytes respectively | INT, INT4, INT2 | INTEGER |
| FLOAT(b), DOUBLE PRECISION(c) REAL(d) | Floating point numbers | FLOAT4, FLOAT8 | REAL |
- ANSI datatypes are standard
- A DBMS may have its own datatypes
- Many DBMS recognise ANSI and convert to their own
CREATE TABLE students (
studentID INTEGER PRIMARY KEY,
studentName CHAR VARYING(30),
courseID INTEGER FOREIGN KEY
);- This created this table schema and recorded it and the constraints in the catalog
-
Seen that SQL supports 3 contraint types
- Key constraint: A PK value is unique
- Entity integrity Constraint: a primary key value cannot be null
- Referential Integrity contraints: a “foreign key” must have a value that is already present as a primary key or it may be null
-
Additional restrictions on attribute domains can be specified.
- Null not permitted
- Default value
AccountBalance REAL NOT NULL DEFAULT 0.0;
- Check Clauses
Age INT NOT NULL CHECK (Age > 0 AND Age < 125)
-
PRIMARY KEY clause
- Specifies one or more attributes that make up the primary key of a relation
Dnumber INT PRIMARY KEY;
-
UNIQUE clause
- Specifies alternate keys - candidate keys in the relational model
Dname VARCHAR(15) UNIQUE;
- FOREIGN KEY clause induces contraints which must be checked and handled during updates
- Default operation: reject updates on violation
- Attach referential triggered action clause
- SET NULL
- CASCADE
- SET DEFAULT
- Attach action to be taken upon violation can be made to vary based on updates
- ON DELETE or ON UPDATE
-
CONSTRAINT SuperSSN-SSN FOREIGN KEY (Super_ssn) REFERENCES EMPLOYEE(SSN) ON DELETE SET NULL ON UPDATE CASCADE
-
If a supervisors tuple is deleted from EMPLOYEE,
- The tuples referencing (through the Foreign key Super_ssn) the deleted tuple will have its Super_ssn value set to NULL
-
If a supervisors tuple’s SSN value in EMPLOYEE is updated:
- The new SSN value will be cascaded to the value of Super_ssn of all tuples referencing the updated EMPLOYEE tuple
Primary KEY
<attName><attType> NOT NULL PRIMARY KEY,
<attName><attType> NOT NULL, ..., PRIMARY KEY(<attName>),This identifies the <attNAme> attribute as primary key - also means it is unique
The syntax is the same for UNIQUE
FOREIGN KEY(<attName1>) REFERENCES <tableName>(<attName2>),Sets <attName1> to act as a foreign key that refers to attribute <attName2> in tableName.
Note <attName1> can be both a primary key and a foreign key.
Example:
FOREIGN KEY(students.courseID) REFERENCES courses(courseID)Note The foreign key does not have to reference an attribute of the same name - the integrity check only cares about comparing values
- Imposes a fixed order to operations:
- Must add a course with ID=3 before adding a student with a reference to 3.
Domain restrictive constraints:
<attName><attType> CHECK(<boolean expr.>),
<attName><attType>,..., CHECK(<boolean expr.>), - CHECK applies restrictions on the accepted values, they do not only need to match the domain constraints, but must also evaluate the boolean expression to true.
- CHECK can be added at the end of CREATE TABLE - regardless, it will be run whenever a tuple is inserted or updated
Example
StudentID INT PRIMARY KEY
CHECK (StudentID > 0 AND StudentID <5000)Constraint Naming:
- Naming constraints allows for easier references, especially if a constraint is more complex.
CONSTRAINT <constraintName> <constraint>
StudentID INT CONSTRAINT nonzero CHECK(StudentID > 0)
CourseID INT DEFAULT 1,FOREIGN KEY(students.courseID) REFERENCES courses(courseID) ON UPDATE CASCADE
FOREIGN KEY(students.courseID) REFERENCES courses(courseID) ON DELETE SET NULL
FOREIGN KEY(students.courseID) REFERENCES courses(courseID) ON DELETE SET DEFAULTProvide actions based on triggerss:
DROP Table <tableName>;- Deletes the table - the entire relation, including the schema and the data.
DROP Schema <schemaName>;- Deletes the schema and all its objects - all the tables contained in it.
- The relational model is set-theory based:
- Relations in RM are defined to be a set of tuples:
${a,b,c,…}$
- Unlike the RM, SQL DBs allow tables to contain duplicates of tuples
-
$\Rightarrow$ Multi-set semantics or bag semantics: ${a,b,b,a,c,…}$
-
- Why allow duplicates
- SQL query results are also tables
- It is expensive to delete duplicates:
- requires a sorting operation - expensive.
- Duplicates may be desirable
- Different ‘entities’ have the same value - name, address, or the same car
- For aggregates (AVG,SUM,COUNT)
- There must still be a way to identify each tuple in a table.
- A tuple-id (auto inc.) may be used as a key to ensure uniqueness
SELECT <attribute1,attributei> FROM <table list> WHERE <condition>- Defines a list of attributes
- Defined a list of the relation names/tables needed to process the query
- Defines a boolean expression
- Only return tuples who satisfy the boolean condition.
SELECT DISTINCT studentName FROM students;- This removes any duplicates from the output
- The where clause can have multiple expressions, with AND, OR, NOT combinations.
General view:
WHERE val1 BETWEEN 0 AND 1;
WHERE boolExpr1 AND boolExpr2 OR boolExpr1 AND NOT boolExpr3;SELECT * FROM students:SELECT students.studentID FROM students:- references the studentID column from the students Table
- Useful if tables in your table list have a shared attribute name between them
SELECT S.studentID FROM students S;- The students table has been aliased to the name S
- The name S is now used for any qualification - even in a WHERE clause
- Start with the relation in the FROM clause
- Apply the row selection indicated by the WHERE clause
- Apply the column projection indicated by the SELECT clause
Operational semantics
- Imagine a tuple variable ranging over all tuples
$t_v$ ranging over all tuples in the relation mentioned in FROM- For each tuple:
- Check if it satisfies the predicates/expressions in the WHERE clause
- IF so, computer the attributes or expressions of the SELECT clause using the components of the tuple
- Move to the next
- For each tuple:
- A SQL query only produces a tuple in the answer if its truth value for the WHERE clause evaluates to TRUE
Comparing a Java method with the same variables:
SELECT dateOfBirth, houseNumber, street
FROM Employee
WHERE city = 'Warwick';public void employeeDetails(){
System.out.println("dateOfBirth" + " " + "houseNumber" + " " + "street");
for(int e = 0; e < nEmployees; e ++)
if (city == "Warwick") System.out.println("dateOfBirth" + " " + "houseNumber" + " " + "street");
}- Both are iterating through the set of tuples
- For each one, the boolean expression is evaluated
- If TRUE, output the values specified in SELECT clause
Select *
FROM Teas
WHERE manf='Tetley'
ORDER BY name;By default, done alphabetically or in growing size, however this can be defined.
-
To change the attribute name on a query:
As <newName> renames the attribute - this can be useful where two columns have the same name in different tables that are being queried
- Any expressions which make sense can appear as an element of a SELECT clause
SELECT productName, price*147 AS priceInYen
FROM Sells; - a constant can also be added here - where it makes sense - a particular string would work in a
VARCHARfield, but not in anINTfield
- Can also have conditions for strings, in which a string is compared with a pattern to see if it matches
- General form:
<StringAttr> LIKE <pattern><StringAttr> NOT LIKE <pattern>
- A pattern is a quotes string with special chars:
- % = “any string” of 0 or more chars
- _ = “any characters” - exactly one char
- Tuples in SQL can have NULL values as a special value for one or more attributes
- Needs care - They are not real values
- Each individual NULL is considered from every other NULL values
- Very iseful:
- Unknown value
- Unavailable value
- Inapplicable
The WHERE clause in SQL provides logical Boolean expressions in order to identify the tuples desired by the query.
-
a SQL query only proeduces a tuple in the answer if its true value for the WHERE-clause evaluates to true
SELECT name WHERE surname ='chipchase';
-
To account for NULLs, the logic of conditions in SQL is not just TRUE,FALSE, but includes a third state - UNKNOWN
-
When any values is compared with NULL, the true value is UNKNOWN
| TRUE | FALSE | UNKNOWN | AND |
|---|---|---|---|
| TRUE | FALSE | UNKNOWN | TRUE |
| FALSE | FALSE | UNKNOWN | FALSE |
| UNKOWN | FALSE | UNKNOWN | UNKNOWN |
| TRUE | FALSE | UNKNOWN | OR |
|---|---|---|---|
| TRUE | TRUE | TRUE | TRUE |
| TRUE | FALSE | UNKNOWN | UNKNOWN |
| TRUE | UNKNOWN | UNKNOWN | UNKNOWN |
| NOT | |
|---|---|
| TRUE | FALSE |
| FALSE | TRUE |
| UNKNOWN | UNKNOWN |
- Never use “A = NULL” will never return true
- USE
<Attr> IS NOT NULLor theIS NULLoperator.