In this chapter
SQL is based on the relational model but doesn’t implement it faithfully. One departure from the model is that in SQL, primary keys are optional rather than mandatory. Consequently, tables without keys will accept duplicate rows, rendering some data inaccessible. A complete review of the many disparities is beyond the scope of this book (see the “Learning database design” section in Chapter 2). The upshot of these discrepancies is that DBMS users, and not the DBMS itself, are responsible for enforcing a relational structure. Another result is that the Model and SQL terms in Table 2.1 in Chapter 2 aren’t interchangeable.
With that warning, it’s time to learn SQL. An SQL program is a sequence of SQL statements executed in order. To write a program, you must know the rules that govern SQL syntax. This chapter explains how to write valid SQL statements and also covers data types and nulls.
Figure 3.1 shows the syntax of an example SQL statement (don’t worry about the actual meaning, or semantics, of the statement).
SQL is a free-form language whose statements can:
Despite this flexibility, you should adopt a consistent style (Figure 3.2). I use uppercase keywords and lowercase identifiers and indent each clause on its own line; see “About This Book” for my typographic and syntax conventions.
authors WhErE state
= 'NY' order
An expression is any legal combination of symbols that evaluates to a single data value. You can combine mathematical or logical operators, identifiers, literals, functions, column names, aliases, and so on. Table 3.1 lists some common expressions and examples. These expressions are covered in more detail later.
Some common SQL programming errors are:
SELECT royalty_share FROM authorsinstead of
SELECT royalty_share FROM title_authors, for example)
These errors usually are easy to catch and correct, even if your DBMS returns an obscure or unhelpful error message. Remember that the real error actually can occur well before the statement the DBMS flags as an error. For example, if you run
CREATE TABLE misspelled_name
then your DBMS will straightaway create a table with the bad name. Your error won’t show up until later, when you try to reference the table with, say,
SELECT * FROM correct_name
Tip: A common way to test and fix (debug) SQL programs is to use comments to temporarily stop SQL code from being executed. If you’re working on a long SQL statement and want to test only part of it, then you can comment out some of the code so that the DBMS sees it as comments and ignores it.
The ISO SQL technical committee has been revising the official SQL standard every few years since 1986. Each revision:
The standard is enormous—thousands of pages of dense specifications—and no vendor conforms (or ever will conform) to the entire thing. Instead, vendors try to conform to a subset of the standard called Core SQL. This level of conformance is the minimal category that vendors have to achieve to claim that they conform to standard SQL. The SQL-92 revision introduced levels of conformance, and later standards have them too, so when you read a DBMS’s conformance statement, note which SQL standard it’s referring to and which level. In fact, SQL-92 often is thought of as the standard because it defined many of the most vital and unchanging parts of the language. Except where noted, the SQL elements in this book are part of SQL-92 as well as later standards. The lowest level of SQL-92 conformance is called Entry (not Core).
Your programs should follow the SQL standard as closely as possible. Ideally, you should be able to write portable SQL programs without even knowing which DBMS you’re programming for. Unfortunately, the SQL committee is not made up of language theorists and relational-model purists but is top-heavy with commercial DBMS vendors, all jockeying and maneuvering. The result is that each DBMS vendor devotes resources to approach minimal Entry or Core SQL conformance requirements and then scampers off to add nonstandard features that differentiate their products in the marketplace—meaning that your SQL programs won’t be portable. These vendor-specific lock-ins often force you to modify or rewrite SQL programs to run on different DBMSs.
Of the DBMSs covered in this book, PostgreSQL is the “purest” with respect to the SQL standard. Your DBMS might offer settings that make it better conform to the SQL standard. MySQL has ANSI mode, for example, and Microsoft SQL Server has SET ANSI_DEFAULTS ON.
Tip: For information about using SQL across different versions of the same DBMS, see “Older DBMSs and Backward Compatibility of SQL”.
An identifier is a name that lets you refer to an object unambiguously within the hierarchy of database objects (whether a schema, database, column, key, index, view, constraint, or anything created with a CREATE statement). An identifier must be unique within its scope, which defines where and when it can be referenced. In general:
These rules let you duplicate names for objects whose scopes don’t overlap. You can give the same name to columns in different tables, for example, or to tables in different databases.
Tip: For information about addressing database objects, see Table 2.2 in Chapter 2.
DBMS scopes vary in the extent to which they require identifier names to be unique. Microsoft SQL Server requires an index name to be unique for only its table, for example, whereas Oracle and Db2 require an index name to be unique throughout the database. Search your DBMS documentation for identifiers or names.
Standard SQL has the following identifier rules for names:
Standard SQL distinguishes between reserved and non-reserved keywords. You can’t use reserved keywords as identifiers because they have special meaning in SQL. You can’t name a table “select” or a column “sum”, for example. Non-reserved keywords have a special meaning in only some contexts and can be used as identifiers in other contexts. Most non-reserved keywords actually are the names of built-in tables and functions, so it’s safest never to use them as identifiers either.
You can use a quoted identifier, also called a delimited identifier, to break some of SQL’s identifier rules. A quoted identifier is a name surrounded by double quotes. The name can contain spaces and special characters, is case sensitive, and can be a reserved keyword. Quoted identifiers can annoy other programmers and cause problems with third-party and even a vendor’s own tools, so using them usually is a bad idea.
Here’s some more advice for choosing identifier names:
Although you can’t use (unquoted) reserved words as identifiers, you can embed them in identifiers. group and max are illegal identifiers, for example, but groups and max_price are valid. If you’re worried that your identifier might be a reserved word in some other SQL dialect, then just add an underscore to the end of the name (element_, for example); no reserved keyword ends with an underscore.
You can surround Microsoft SQL Server quoted identifiers with double quotes or brackets (); brackets are preferred. In Db2, you can use reserved words as identifiers (but doing so isn’t a good idea because your program won’t be portable). MySQL ANSI_QUOTES mode allows double-quoted identifiers. DBMSs have their own nonstandard keywords; search your DBMS documentation for keywords or reserved words.
In MySQL, the case sensitivity of the underlying operating system determines the case sensitivity of database and table names.
The SQL standard directs DBMSs to convert identifier names to uppercase internally. So in the guts of your SQL compiler, the unquoted identifier myname is equivalent to the quoted identifier "MYNAME" (not "myname"). PostgreSQL doesn’t conform to the standard and converts to lowercase. To write portable programs, always quote a particular name or never quote it (don’t mix them). DBMSs aren’t consistent when it comes to case sensitivity, so the best practice is always to respect case for user-defined identifiers.
Recall from “Tables, Columns, and Rows” in Chapter 2 that a domain is the set of valid values allowed in a column. To define a domain, you use a column’s data type (and constraints, described in Chapter 11). A data type, or column type, has these characteristics:
A data type falls into one of categories listed in Table 3.2 (each covered in the following sections).
|Category||Stores These Data|
|Character string||Strings of characters|
|Binary large object||Binary data|
|Exact numeric||Integers and decimal numbers|
|Approximate numeric||Floating-point numbers|
|Boolean||Truth values: true, false, or unknown|
|Datetime||Date and time values|
|Interval||Date and time intervals|
You store literal values (constants) in character, numeric, boolean, datetime, and interval columns. Table 3.3 shows some examples; the following sections have more examples. Be sure not to confuse the string literal '2009' with the numeric literal 2009. The SQL standard defines a literal as any constant that isn’t null.
Use character string data types to represent text. A character string, or simply string, has these characteristics:
|CHARACTER||Represents a fixed number of characters. A string stored in a column defined as CHARACTER(length) can have up to length characters, where length is an integer greater than or equal to 1; the maximum length depends on the DBMS. When you store a string with fewer than length characters in a CHARACTER(length) column, the DBMS pads the end of the string with spaces to create a string that has exactly length characters. A CHARACTER(6) string 'Jack' is stored as 'Jack ', for example. CHARACTER and CHAR are synonyms.|
|CHARACTER VARYING||Represents a variable number of characters. A string stored in a column defined as CHARACTER VARYING(length) can have up to length characters, where length is an integer greater than or equal to 1; the maximum length depends on the DBMS. Unlike CHARACTER, when you store a string with fewer than length characters in a CHARACTER VARYING(length) column, the DBMS stores the string as is and doesn’t pad it with spaces. A CHARACTER VARYING(6) string 'Jack' is stored as 'Jack', for example. CHARACTER VARYING, CHAR VARYING, and VARCHAR are synonyms.|
|NATIONAL CHARACTER||This data type is the same as CHARACTER except that it holds standardized multibyte characters or Unicode characters. In SQL statements, NATIONAL CHARACTER strings are written like CHARACTER strings but have an N in front of the first quote: N'βæþ', for example. NATIONAL CHARACTER, NATIONAL CHAR, and NCHAR are synonyms.|
|NATIONAL CHARACTER VARYING||This data type is the same as CHARACTER VARYING except that it holds standardized multibyte characters or Unicode characters (see NATIONAL CHARACTER). NATIONAL CHARACTER VARYING, NATIONAL CHAR VARYING, and NCHAR VARYING are synonyms.|
|CLOB||The character large object (CLOB) type is intended for use in library databases that hold vast amounts of text. A single CLOB value might hold an entire webpage, book, or genetic sequence, for example. CLOBs can’t be used as keys or in indexes and support fewer functions and operations than do CHAR and VARCHAR. In host languages, CLOBs are referenced with a unique locator (pointer) value, avoiding the overhead of transferring entire CLOBs across a client–server network. CLOB and CHARACTER LARGE OBJECT are synonyms.|
|NCLOB||The national character large object (NCLOB) type is the same as CLOB except that it holds standardized multibyte characters or Unicode characters (see NATIONAL CHARACTER). NCLOB, NCHAR LARGE OBJECT, and NATIONAL CHARACTER LARGE OBJECT are synonyms.|
'don''t'to represent don't, for example. A double-quote character (") is a separate character and doesn’t need this special treatment.
Table 3.5 lists character-string and similar types for the DBMSs. See the DBMS documentation for size limits and usage restrictions.
Oracle treats empty strings as nulls; see “Nulls” later in this chapter.
In MySQL ANSI_QUOTES mode, string literals can be quoted only with single quotes; a string quoted with double quotes is interpreted as an identifier.
|Access||short text, long text (in older versions: text, memo)|
|SQL Server||char, varchar, nchar, nvarchar|
|Oracle||char, varchar2, clob, nchar, nvarchar2, nclob|
|Db2||char, varchar, clob, nchar, nvarchar, nclob, graphic, vargraphic, dbclob|
|MySQL||char, varchar, nchar, nvarchar, tinytext, text, mediumtext, longtext|
|PostgreSQL||char, varchar, text|
Computers store characters (letters, digits, punctuation, control characters, and other symbols) internally by assigning them unique numeric values. An encoding determines the mapping of characters to numeric values; different languages and computer operating systems use many different native encodings. Standard U.S.-English strings use ASCII encoding, which assigns values to 128 (27) different characters—not much, and not even enough to hold all the Latin characters used in modern European languages, much less all the Chinese ideographs.
Unicode is a single character set that represents the characters of almost all the world’s written languages. Unicode can encode up to about 4.3 billion (232) characters (using UTF-32 encoding). The Unicode Consortium develops and maintains the Unicode standard. The actual Unicode mappings are available in the latest online or printed edition of The Unicode Standard, available at unicode.org.
Use the binary large object (BLOB) data type to store binary data. A BLOB has these characteristics:
|Access||ole object, attachment|
|SQL Server||binary, varbinary|
|Oracle||raw, long raw, blob, bfile|
|Db2||binary, varbinary, blob|
|MySQL||binary, varbinary, tinyblob, blob, mediumblob, longblob|
Use exact numeric data types to represent exact numerical values. An exact numerical value has these characteristics:
|NUMERIC||Represents a decimal number, stored in a column defined as NUMERIC(precision [,scale]). precision is greater than or equal to 1; the maximum precision depends on the DBMS. scale is a value from 0 to precision. If scale is omitted, then it defaults to zero (which makes the number effectively an INTEGER).|
|DECIMAL||This data type is similar to NUMERIC, and some DBMSs define them equivalently. The difference is that the DBMS can choose a precision greater than that specified by DECIMAL(precision [,scale]), so precision specifies the minimum precision, not an exact precision as in NUMERIC. DECIMAL and DEC are synonyms.|
|INTEGER||Represents an integer. The minimum and maximum values that can be stored in an INTEGER column depend on the DBMS. INTEGER takes no arguments. INTEGER and INT are synonyms.|
|SMALLINT||This data type is the same as INTEGER except that it might hold a smaller range of values, depending on the DBMS. SMALLINT takes no arguments.|
|BIGINT||This data type is the same as INTEGER except that it might hold a larger range of values, depending on the DBMS. BIGINT takes no arguments.|
Table 3.8 shows how the number 123.89 is stored for different precision and scale values.
|Specified As||Stored As|
Table 3.9 lists exact-numeric and similar types for the DBMSs. See the DBMS documentation for size limits and usage restrictions. Some DBMSs accept type names that they don’t implement, converting them to suitable, supported types; Oracle converts INT to NUMBER(32), for example.
DBMSs usually implement SMALLINT as 16-bit values (−32,768 through 32,767), INTEGER as 32-bit values (−2,147,483,648 through 2,147,483,647), and BIGINT as 64-bit values (quintillions). The SQL:2003 standard introduced BIGINT to the SQL language (but most DBMSs already had a similar data type by then).
|Access||byte, decimal, integer, long integer|
|SQL Server||tinyint, smallint, int, bigint, bit, numeric, decimal, smallmoney, money|
|Db2||smallint, integer, bigint, decimal, numeric|
|MySQL||tinyint, smallint, mediumint, int, bigint, decimal, numeric|
|PostgreSQL||smallint, integer, bigint, decimal, numeric|
Use approximate numeric data types to represent approximate numerical values. An approximate numerical value has these characteristics:
|FLOAT||Represents a floating-point number, stored in a column defined as FLOAT(precision). precision is greater than or equal to 1 and expressed as the number of bits (not the number of digits); the maximum precision depends on the DBMS.|
|REAL||This data type is the same as FLOAT except that the DBMS defines the precision. REAL numbers usually are called single-precision numbers. REAL takes no arguments.|
|DOUBLE PRECISION||This data type is the same as FLOAT except that the DBMS defines the precision, which must be greater than that of REAL. DOUBLE PRECISION takes no arguments.|
|SQL Server||float, real|
|Oracle||float, binary_float, binary_double|
|Db2||real, double, float, decfloat|
|PostgreSQL||real, double precision|
Use the boolean data type to store truth values. A boolean value has these characteristics:
|MySQL||boolean, bit, tinyint(1)|
Use datetime data types to represent the date and time of day. A datetime value has these characteristics:
|DATE||Represents a date. A date stored in a column defined as DATE has three integer fields—YEAR, MONTH, and DAY—and is formatted yyyy-mm-dd (length 10) (2006-03-17, for example). Table 3.14 lists the valid values for the fields. DATE takes no arguments.|
|TIME||Represents a time of day. A time stored in a column defined as TIME has three fields—HOUR, MINUTE, and SECOND—and is formatted hh:mm:ss (length 8) (22:06:57, for example). You can specify fractional seconds with TIME(precision). precision is the number of fractional digits and is greater than or equal to zero. The maximum precision, which is at least 6, depends on the DBMS. HOUR and MINUTE are integers, and SECOND is a decimal number. The format is hh:mm:ss.ssss... (length 9 plus the number fractional digits) (22:06:57.1333, for example). Table 3.14 lists the valid values for the fields.|
|TIMESTAMP||Represents a combination of DATE and TIME values separated by a space. The TIMESTAMP format is yyyy-mm-dd hh:mm:ss (length 19) (2006-03-17 22:06:57, for example). You can specify fractional seconds with TIMESTAMP(precision). The format is yyyy-mm-dd hh:mm:ss.ssss... (length 20 plus the number fractional digits).|
|TIME WITH TIMEZONE||This data type is the same as TIME except that it adds a field, TIME_ZONE_OFFSET, to indicate the offset in hours from UTC. TIME_ZONE_OFFSET is formatted as INTERVAL HOUR TO MINUTE (see “Interval Types” later in this chapter) and can contain the values listed in Table 3.14. Append AT TIME ZONE time_zone_offset to the TIME to assign a value to the time zone (22:06:57 AT TIME ZONE -08:00, for example). Alternatively, you can append AT LOCAL to indicate that the time zone is the default for the session (22:06:57 AT LOCAL, for example). If the AT clause is omitted, then all times default to AT LOCAL.|
|TIMESTAMP WITH TIMEZONE||This data type is the same as TIMESTAMP except that it adds a field, TIME_ZONE_OFFSET, to indicate the offset in hours from UTC. The syntax rules are the same as those of TIME WITH TIME ZONE except that you must include a date (2006-03-17 22:06:57 AT TIME ZONE -08:00, for example).|
|YEAR||0001 to 9999|
|MONTH||01 to 12|
|DAY||01 to 31|
|HOUR||00 to 23|
|MINUTE||00 to 59|
|SECOND||00 to 61.999|
|TIME_ZONE_OFFSET||-12:59 to +13:00|
Table 3.15 lists datetime and similar types for the DBMSs. See the DBMS documentation for size limits and usage restrictions.
DBMSs let you enter date values in month-day-year, day-month-year, and other formats and time values based on a 12-hour (a.m./p.m.) clock. The format in which dates and times are displayed can differ from the format in which they’re entered.
In Microsoft Access, surround datetime literals with # characters instead of quotes and omit the data type name prefix. The standard SQL date DATE '2006-03-17' is equivalent to the Access date #2006-03-17#, for example.
In Microsoft SQL Server, omit the data type name prefix from datetime literals. The standard SQL date DATE '2006-03-17' is equivalent to the SQL Server date '2006-03-17', for example.
In Db2, omit the data type name prefix from datetime literals. The standard SQL date DATE '2006-03-17' is equivalent to the Db2 date '2006-03-17', for example.
|SQL Server||date, datetime, datetime2, datetimeoffset, smalldatetime, time|
|Db2||date, time, timestamp|
|MySQL||date, datetime, timestamp, time, year|
|PostgreSQL||date, time, timestamp|
DBMS conformance to standard SQL interval types is spotty or nonexistent, so you might not find this section to be useful in practice. DBMSs have their own extended data types and functions that calculate intervals and perform date and time arithmetic.
Use interval data types to represent sets of time values or spans of time. An interval value has these characteristics:
It has a single-field or multiple-field qualifier. A single-field qualifier is specified as YEAR, MONTH, DAY, HOUR, MINUTE, or SECOND. A multiple-field qualifier is specified as:
start_field TO end_field
start_field is YEAR, DAY, HOUR, or MINUTE, and end_field is YEAR, MONTH, DAY, HOUR, MINUTE, or SECOND. end_field must be a smaller time period than start_field.
A single-field column defined as INTERVAL HOUR could store intervals such as “4 hours” or “25 hours”, for example. A multiple-field column defined as INTERVAL DAY TO MINUTE could store intervals such as “2 days, 5 hours, 10 minutes”, for example.
A single-field column can have a precision that specifies the length (number of positions) of the field; INTERVAL HOUR(2), for example. The precision defaults to 2 if omitted. A SECOND field can have an additional fractional precision that specifies the number of digits to the right of the decimal point—INTERVAL SECOND(5,2), for example. The fractional precision defaults to 6 if omitted.
A multiple-field column can have a precision for start_field but not end_field (unless end_field is SECOND, in which case it can have a fractional precision)—INTERVAL DAY(3) TO MINUTE and INTERVAL MINUTE(2) TO SECOND(4), for example.
|Year-month||These intervals contain only a year value, only a month value, or both. The valid column types are INTERVAL YEAR, INTERVAL YEAR(precision), INTERVAL MONTH, INTERVAL MONTH(precision), INTERVAL YEAR TO MONTH, or INTERVAL YEAR(precision) TO MONTH.|
|Day-time||These intervals can contain a day value, hour value, minute value, second value, or some combination thereof. Some examples of the valid column types are INTERVAL MINUTE, INTERVAL DAY(precision), INTERVAL DAY TO HOUR, INTERVAL DAY(precision) TO SECOND, and INTERVAL MINUTE(precision) TO SECOND(frac_precision).|
|SQL Server||Not supported|
|Oracle||interval year to month, interval day to second|
Unique identifiers are used to generate primary-key values to identify rows (see “Primary Keys” in Chapter 2). An identifier can be unique universally (large random numbers unique in any context) or only within a specific table (simple serial numbers 1, 2, 3,...). Table 3.18 lists unique-identifier types and attributes for the DBMSs. See the DBMS documentation for size limits and usage restrictions. The SQL standard calls columns with auto-incrementing values identity columns. See also “Generating Sequences” in Chapter 15.
|DBMS||Types or Attributes|
|Access||autonumber, replication id|
|SQL Server||uniqueidentifier, identity|
|Oracle||rowid, urowid, sequences|
|Db2||rowid, identity columns and sequences|
|PostgreSQL||smallserial, serial, bigserial, uuid|
A universally unique ID is called a Universally Unique Identifier (UUID) or a Globally Unique Identifier (GUID). When you define a column to have a UUID data type, your DBMS will generate a random UUID automatically in each new row, probably according to ISO/IEC 9834-8 (iso.org) or IETF RFC 4122 (ietf.org).
A UUID in standard form looks like:
The letters actually are hexadecimal digits (a–f). Your DBMS might use an alternative form with uppercase hex digits, surrounding braces, or omitted hyphens. UUIDs aren’t technically guaranteed to be unique, but the probability of generating a duplicate ID is so tiny that they should be considered singular. For more information, read the Wikipedia article “Universally unique identifier”.
DBMSs provide functions that return UUIDs. Microsoft SQL Server has NEWID(), Oracle has SYS_GUID(), MySQL has UUID(), and PostgreSQL has gen_random_uuid(). Note that the Db2 function GENERATE_UNIQUE() returns a string that is not a UUID.
The SQL standard defines other data types than the ones covered in the preceding sections, but some of them rarely are implemented or used in practice (ARRAY, MULTISET, REF, and ROW, for example). More useful are the extended (nonstandard) data types that are available in various DBMSs. Depending on your DBMS, you can find data types for:
Microsoft SQL Server, Oracle, Db2, and PostgreSQL let you create user-defined types (UDTs). The simplest UDT is a standard or built-in data type (CHARACTER, INTEGER, and so on) with additional check and other constraints. You can define the data type marital_status, for example, as a single-character CHARACTER data type that allows only the values S, M, W, D, or NULL (for single, married, widowed, divorced, or unknown). More-complex UDTs are similar to classes in object-oriented programming languages such as Java or Python. You can define a UDT once and use it in multiple tables, rather than repeat its definition in each table in which it’s used. Search your DBMS documentation for user-defined type. UDTs are created in standard SQL with the statement CREATE TYPE.
When your data are incomplete, you can use a null to represent a missing or unknown value. A null has these characteristics:
You can get a null from a column that doesn’t allow nulls. The column au_id in the table authors doesn’t allow nulls, but the SELECT statement in Figure 3.3 returns a null for the maximum au_id.
WHERE au_lname = 'XXX';
If nulls appear in a column because actual values are not meaningful (rather than unknown), then you can split the column off into its own table with a one-to-one relationship with the other table. In Figure 3.4, the original table employees has the column commission, which specifies an employee’s sales commission. commission contains mostly nulls because most employees aren’t salespeople. To avoid the proliferation of nulls, move commission to its own table.
The display of nulls in results varies by DBMS. A null might appear as NULL, (NULL), <NULL>, -, or empty space, for example.
Oracle treats an empty string ('') as a null. This treatment might not continue to be true in future releases, however, and Oracle recommends that you do not treat empty strings the same as nulls in your SQL code. This behavior can cause conversion problems among DBMSs. In the sample database, for example, the column au_fname in the table authors is defined as NOT NULL. In Oracle, the first name of the author Kellsey (author A06) is a space (' '); in the other DBMSs, the first name is an empty string (''). For information about the sample database, see “The Sample Database” in Chapter 2.