Difference between revisions of "Structured Query Language in Python"

From Sustainability Methods
m (This article provides an introduction to structure query language (SQL) and compares database management systems which can interact with SQL)
 
Line 280: Line 280:
 
“What Is SQLite and When to Use SQLite.” 2021. Simplilearn.com. July 22, 2021. https://www.simplilearn.com/tutorials/sql-tutorial/what-is-sqlite.
 
“What Is SQLite and When to Use SQLite.” 2021. Simplilearn.com. July 22, 2021. https://www.simplilearn.com/tutorials/sql-tutorial/what-is-sqlite.
  
[[Category:Statistics]]
 
[[Category:Python basics]]
 
  
 
The [[Table of Contributors|authors]] of this entry are XX Reinhardt and Hajar Zare. Edited by Milan Maushart
 
The [[Table of Contributors|authors]] of this entry are XX Reinhardt and Hajar Zare. Edited by Milan Maushart

Latest revision as of 12:28, 3 September 2024

THIS ARTICLE IS STILL IN EDITING MODE

This article will provide an overview of the functioning of Structured Query Language (SQL), and discuss which package in Python should be used under which circumstances. First, we will have a look at how SQL actually works!

Introduction to SQL

SQL is the standard database language for relational databases. It is used for retrieving and manipulating data. When analyzing data, SQL is great to retrieve and organize data. It is used by many different institutions, such as hospitals and large companies. Basically anywhere, where data needs to be analyzed! To use SQL in Python, you need to choose a package. We will have a look at the different options later. For now, we will focus on how SQL works.

Relational Databases

A relational database is a collection of data elements having specified relationships following the relational model. This model for data organization was developed by E. F. Codd in 1970 and pursues the goal of making data from large databases accessible without knowledge about its structure and location in the database.

The relational model allows data to be organized into tables. Each table contains one type of record which is described by a set of attributes. In the tables, each attribute is represented by a column whereas each record is represented by a row. Relationships between the records of different tables are established using primary and foreign keys. Primary keys are unique identifiers for each row of a table. Foreign keys contain the primary key of another table and thus link the rows of the tables (Codd 1970).

Figure 1 shows an exemplary relational database that could be applied in a hospital. Its first table contains the patients of the hospital (rows) which are described by patient_id, name, address, and insurance_number (columns). In the second table, the treatments (rows) are stored, described by treatment_id, drug, intakes_per_day, and patient_id as attributes (columns). The first column of both tables contains a unique identifier for each record which is the primary key. The patient ID in the treatments table is a foreign key since it contains a primary key from the patients' table and thus allows to link each treatment to a patient.

Figure 1: Exemplary Relational Database

Figure 1 database.png

Query Relational Databases with SQL

The following sections provide an overview of the main SQL elements used for querying relational databases. Code examples based on the exemplary database shown above are used to illustrate the use of the language elements in typical query structures.

Remark: SQL is standardized by an ISO norm (ISO/IEC 2016). Nonetheless, the use of the language slightly varies between database systems.

SELECT (DISTINCT) Statement

The SELECT statement is used to retrieve data from a relational database. You can either specify a list of columns that should be selected from a table or select all columns using a “*”. Furthermore, the source table has to be defined using the FROM clause. Instead of the SELECT statement the SELECT DISTINCT statement can be used in order to show only unique rows in the output table.

Code example 1: Simple SELECT statement

SELECT
    name,
    address,
    insurance_number
FROM patients

The query would return a table containing the name, address, and insurance number of all patients.

Aliases

Aliases can be assigned to tables and columns using the AS keyword. Renaming columns can help to make output tables more readable (code example 5); aliases for tables are used to make queries shorter and clearer (code example 4).

Filtering Data

Data can be filtered by using the WHERE clause. It is followed by a boolean function that outputs “True” for all records which should be included and “False” for those which should be excluded.

Such Boolean functions usually contain comparisons of attributes with fixed values or other attributes. Comparisons are expressed using comparison operators (table 1).

Table 1: Common comparison operators

Operator Description
= Equal to
> Greater than
< Less than
>= Greater than or equal to
<= Less than or equal to
<> Not equal to

(Source: adapted from w3schools 2022b)

Missing values can be identified by the IS NULL operator returning “True” for all records having a null value in the data field of interest. IS NOT NULL is the negation of this operator.

Expressions can be linked with the logical operators AND and OR. AND returns "True" if both linked expressions are true; OR returns "True" if one of those is.

Code example 2: WHERE clause and filter condition

SELECT
    *
FROM treatments
WHERE
    intakes_per_day > 1
    AND drug = “ibuprofen”
    AND patient_id IS NOT NULL

The query would return a table with all treatments where ibuprofen is taken more than one time per day and the patient_id is not missing.

Sorting Data

The rows of the output table can be sorted using the ORDER BY keyword. The keyword is followed by the columns by which the output should be sorted. Thereby, the order of the columns determines the priority with which they are considered. The keywords ASC and DESC after a column determine whether the values are put in ascending or descending order.

Code example 3: ORDER BY keyword

SELECT
    *
FROM treatments
ORDER BY intakes_per_day DESC

The query would return all treatments and order them by descending intakes_per_day.

Joins

Joins are used to connect rows from different tables based on matching attributes. Different types of joins are differentiated by the rows contained in the output table. Common join clauses are the (INNER) JOIN, LEFT (OUTER) JOIN, RIGHT (OUTER) JOIN, and FULL (OUTER) JOIN. Inner joins only return rows that match a row from the other table. Full joins return all rows of both tables. Left joins return all rows from the left table and the matching rows from the right table. Right joins work vice versa. Thereby, the table mentioned before the join clause is considered to be the left table. Join clauses are used in conjunction with the ON keyword which is followed by the join condition defining one or more attributes that will be checked for matching.

Remark: In case of similar column names in two or more source tables, column names must be supplemented by table names. Aliases can help to make those references shorter (code example 4).

Figure 2: Common joins

Visualisation joins.png

(Source: adapted from w3schools 2022a)

Code example 4: LEFT JOIN

SELECT
    p.name,
    p.insurance_number,
    t.drug,
    t.intakes_per_day
FROM treatments AS t
LEFT JOIN patients AS p
ON t.patient_id = p.patient_id

The query would return all treatments, described by drug and intakes_per_day as well as the patient's name and insurance_number if there is an entry with a matching patient_id in the patients' table.

Aggregation functions

Aggregation functions aggregate the values of a specified column over several rows. Common aggregation functions are listed in table 2.

Common aggregation functions

Function Description
MIN() returns the minimal value
MAX() returns the maximal value
SUM() returns the sum of the values
AVG() returns the average of the values
COUNT() counts the values
COUNT(DISTINCT) counts the unique values

Table 2 The GROUP BY clause is used to define attributes that form the groups whose values will be aggregated (see table 2). If such attributes are not defined, the whole table gets aggregated into one row.

Code example 5: Aggregation functions and GROUP BY clause

SELECT
    patient_id,
    COUNT(DISTINCT drugs) AS number_of_drugs,
    SUM(intakes_per_day) AS total_intakes_per_day
FROM treatments
GROUP BY patient_id

The query would count the different drugs administered and sum up the intakes per day over all drugs per patient_id.

SQL in Python

There are several packages you can choose from to use SQL in Python. The two most important ones are SQLite and oracle. We will have a look at these two options and discuss their advantages and disadvantages.

SQLite

SQLite is a relational database management system (RDBMS). It can also be used as a software library, which is a topic for another article. The "Lite" refers to the lightness of setup, administration, and required resources, SQLite stores the data in a file that can be used on multiple operating systems, which is not the case for many systems making use of SQL. Deploying it is as simple as linking a library and creating a regular file, this fact causes a high capacity of accessibility.

# import the package for SQLite
import sqlite3

# Connect to an SQLite database. If no such file exists, one will be created
conn = sqlite3.connect('example.db')

# Create a table
conn.execute('''CREATE TABLE IF NOT EXISTS stocks
             (date text, trans text, symbol text, qty real, price real)''')

#The triple quotes allow spanning the statement over multiple lines. "CREATE TABLE IF NOT #EXISTS are case-insensitive SQL keywords that have been written in capital letters to make #the code more readable. A table with five columns is created. The "text" or "real" behind the #name of the columns determines whether the values should be text or floating-point numbers.

# Insert some data into the table
conn.execute("INSERT INTO stocks VALUES ('2023-04-19', 'BUY', 'AAPL', 100, 136.71)")

# Commit the changes
conn.commit()

# Query the table
cursor = conn.execute("SELECT * FROM stocks")
for row in cursor:
    print(row)

# Close the database connection
conn.close()

Here are some benefits and limitations of SQLite:

Benefits

  • SQLite is an in-memory open-source library
  • Does not need any configuration or installation and license
  • The size is less than 500kb
  • It is serverless and does not need a server process
  • Enables You to work on multiple databases in the same session
  • It is a cross-platform system and does not need a specific platform to run, such as Linux
  • Easy to backup
  • It's atomic and transactional property which means if part of the operation is failed, all the successful operations will be reverted to the original state.
  • Database can be recovered from system crashes and power outage
  • Resilient to out-of-memory and storage errors

Limitations

  • Loose approach towards the datatype handling, allows the insertion of invalid values, you can insert a string in an integer column
  • It does not support all the datatypes for example there is no Boolean or Datetime
  • The previous drawback can cause your data to be invalid when moving it from SQLite to another platform
  • On the same file, simultaneous write is impossible
  • The previous limit causes lockage when writing on file and it reduces performance in heavy scenarios
  • Database-level user management is less applicable in SQLite
  • There are a lot of limitation in the maximum usage of features for instance the maximum number of attached databases are 125, the maximum number of rows in a table is 264, the number of tables in a schema is 2147483646, the maximum number of columns is 32767 in a table, the maximum number of tables in a join is 64 tables, maximum number of arguments on a function is 127, the SQLite dataset is limited to 281 terabytes, which is why it is not a good choice for large scale projects
  • SQLite stores the entire data in a single file which makes it hard to handle under filesystem limitations
  • Due to the database accessibility, it may require more work to provide the security of

private data

When to Use SQLite?

  • SQLite is a good choice when you want to combine the SQL querying and storage ability with the ease of access
  • Can be a good selection in environments where end users should not be aware of data existence
  • As SQLite requires no installation and configuration, beginners can use SQLite for the learning purpose

What is Oracle?

Oracle database is another (RDBMS). It stocks the largest part of the current market among relational database management systems. The company provides multiple editions including: Enterprise Edition, Standard Edition, Express Edition (XE), Oracle Lite, Personal Edition. As Oracle is an RDBMS software, it is built on SQL and has some additional extensions. Oracle is a fully scalable database architecture. It is therefore being used by big enterprises which require management and process within their local network that is spread across the world. It also has its own network components to facilitate connections across different networks.

Benefits

  • Cost reduction: It Enables you to consolidate several databases into one which can cause shrinking operations and costs. You have the option of cloud bursting when peaks occur, which means that they can temporally increase their cloud resources. This increases the resistance to crashes
  • One single database is easier to manage and maintain
  • High availability and performance, including performance optimization features
  • Secure hybrid cloud environment offers more options by providing more tools than the other platforms
  • User control and identity management that is hard to reproduce which means high safety standards
  • Quick backup and recovery options to protect and reconstruct data
  • Flashback technology: this option allows to recover of missed or lost data, decreases human errors, and makes the administration much easier
  • Reliability and portability: It can run on almost 20 network protocols and 100 hardware platforms easily
  • Market presence: Oracle is the most famous and largest RDBMS vendor with more experience and researchers, tons of supported third parties, and professional staff
  • Support for enterprise applications
  • Oracle DB is compatible with the major platforms, including Windows, UNIX, Linux, and macOS. The Oracle database is supported on multiple operating systems, including HP-UX, Linux, Microsoft Windows Server, Solaris, SunOS, and macOS

Limitations

  • Complexity: It requires specialized skills to install and maintain due to its complex engine
  • Cost: While it can save costs due to the option to consolidate databases into one, it might also be costly if the potential of the offers is not fully exhausted
  • Difficult to manage: It is generally more difficult in managing certain operations
  • Not suitable for small or medium size databases.
  • High hardware requirements

When to Use Oracle?

Oracle Database supports SQL language to interact with the database. It's considered one of the best databases because it supports all data types like relational, graph, structured, and unstructured information. It is apparently a wise choice for large-scale projects and can support large quantities of data. From the announcement of 256 big companies, such as Netflix, Linkedin, ebay, it can be concluded that it is appropriate for a large amount of data.

References

“10 Benefits of Oracle’s Data Management Platform.” n.d. https://www.oracle.com/a/ocom/docs/10-benefits-of-oracle-data-management-platform.pd

Codd, E.F. 1970. "A Relational Model of Data for Large Shared Data Banks." *Communications of the ACM* 13, no. 6: 377-387. [1](https://dl.acm.org/doi/pdf/10.1145/362384.362685)

“Introduction to Oracle Database,” n.d. https://docs.oracle.com/database/121/CNCPT/intro.htm.

ISO/IEC. 2016. "ISO/IEC 9075-1:2016(en): Information technology — Database languages — SQL — Part 1: Framework (SQL/Framework)." [2](https://www.iso.org/obp/ui/#iso:std:iso-iec:9075:-1:ed-5:v1:en)

Koulianos, Petros. 2020. “5 Reasons to Use SQLite the Tiny GIANT for Your next Project.” The Startup. August 9, 2020. https://medium.com/swlh/5-reasons-to-use-sqlite-the-tiny-giant-for-your-next-project-a6bc384b2df4.

Nguyen, Spencer. 2022. “The Benefits of Oracle DBMS for Your Organization.” DreamFactory Software- Blog. May 25, 2022. https://blog.dreamfactory.com/the-benefits-of-oracle-dbms-for- your-organization/.

Singh, Hemendra. 2019. “Mobile App Development Company | Web Development Company USA India.” The NineHertz: August 14, 2019. https://theninehertz.com/blog/advantages-of-using-oracle-database.

“SQLite Pros and Cons: A.” 2022. ThinkAutomation. February 17, 2022. https://www.thinkautomation.com/our-two-cents/sqlite-pros-and-cons-a/.

“SQLite Python.” 2018. SQLite Tutorial. 2018. https://www.sqlitetutorial.net/sqlite-python/.

Team, CherryRoad Consultancy. 2021. “What Are the Top 8 Reasons to Use an Oracle Database?” CherryRoad Technologies. October 22, 2021. https://www.cherryroad.com/2021/10/22/oracle-database-cloud/.

w3schools. "SQL Joins." Accessed December 23, 2022a. [3](https://www.w3schools.com/sql/sql_join.asp)

w3schools. "SQL Operators." Accessed December 23, 2022b. [4](https://www.w3schools.co /sql/sql_operators.asp)

“What Are the Limitations of SQLite.” n.d. Www.dbtalks.com. https://www.dbtalks.com/tutorials/learn-sqlite/what-are-the-limitations-of-sqlite.

“What Is Oracle? - Definition from WhatIs.com.” n.d. SearchOracle. https://www.techtarget.com/searchoracle/definition/Oracle.

“What Is an Oracle Database? | NetApp.” https://www.netapp.com/oracle/what-is-oracle-database/.

“What Is SQLite?” n.d. Codecademy. https://www.codecademy.com/article/what-is-sqlite.

“What Is SQLite and When to Use SQLite.” 2021. Simplilearn.com. July 22, 2021. https://www.simplilearn.com/tutorials/sql-tutorial/what-is-sqlite.


The authors of this entry are XX Reinhardt and Hajar Zare. Edited by Milan Maushart