When I was first getting started with e-discovery, I would worry a lot about what I called "proprietary file types." I was concerned that by asking for native files I would end up with a bunch of files with an extension I'd never seen before and which I would not be able to preview with any of the software I owned. As it happens, this fear was not warranted: the native files exchanged in discovery almost always consist of familiar file formats such as MS Office documents, images, video, audio files, and PDFs. So much for the scary proprietary file types!
In reality, I was simply worried about the wrong thing. There is still a 500-pound gorilla in the room when you are doing e-discovery, but it's not proprietary files: it's DATABASES.
Databases are involved in nearly all cases because they are part of nearly every aspect of our lives. Here’s how you can verify this for yourself: consider a typical day, from the time you wake up, your morning routine, arriving at work, the workday, the evening, and bed. How much software did you interact with?
Most of us:
- Read the news on our smartphones in the morning using an app, mobile website or email digest
- Have our smartphone, smart watch, or other wearable device track our footsteps, heart rate, and other fitness info
- Send and receive SMS, iMessages, or other short text and media-based messages
- Check voicemail
- Send and receive email at work
- Order items online from sites such as Amazon.com
- Use software for managing tasks, orders, projects, or other data related to our work
These common human-technology interactions rely on databases to handle what software engineers refer to as “data persistence.” This term refers to ability of applications to have a “state” that remains in place over time. Data persistence is what enables you to see your past orders, track old invoices, and see your footstep count from last week. Without data persistence, most software would be pretty useless.
Most applications rely on databases for data persistence. Accordingly, if the data in some application is relevant to your case, it is helpful to understand how it is stored in a database —and how it can be extracted. Before we cover the e-discovery side (in a later post), let’s consider what exactly a database is.
Database Essentials: Schema
As a person who has helped develop several database-backed applications, I have a pretty good grasp on what a database is. But I am not a theorist. Hence, I will give a very practically oriented definition of a database. Theorists may find flaws in my description, but I hope you’ll find it useful all the same.
A database is a software system that allows users to create and view records of various data types that are organized into a “schema.” A database’s schema defines the various types of records that can be stored in the database. It specifies fields for each type of record and, usually, how records relate to each other. The schema’s definition of a datatype generally specifies the names, primitive data types, and validation rules for each field of each type.
As an example, let's consider a very simple database to track time entries billed by a lawyer. The database would have at least three different kinds of records: lawyer records, time entry records, and matter records. Each of the records types will have various attributes and will also have a unique identifier, known as a primary key. We'll name the primary key field "ID." For this example, we'll assume that the database is set up to assign integer keys to each new record of a given type, incrementing by one with each new record. Hence, the time entry records will be numbered 1, 2, 3, and so on. As you'll see, these primary keys are immensely important because they serve as unique identifiers for each record.
The lawyer and matter records can be set up with only a few fields. Here a listing of the fields and basic data types for lawyer records:
- ID (integer)
- first name (up to 30 characters)
- last name (up to 30 characters)
Obviously, a real database would store lots of other data about lawyers, such as their job titles, salaries, addresses, and standard rates, but we're keeping things deliberately simple in this example. Matter records also will be as simple as possible:
- ID (integer)
- name (up to 100 characters)
You'll see that we specified the data type for the fields in the lawyer and case models. (By the way, in database lingo, model and table are simply words for a type of record.) Now for the time-entry model, which is where things get a bit more interesting:
- ID (integer)
- rate (decimal)
- hours (decimal)
- date (date)
- lawyer (integer foreign key to lawyer record)
- matter (integer foreign key to matter record)
Here we have a few more basic (or "primitive" in database terminology) data types (decimal and date) and we have two relational fields. The lawyer field simply contains an integer that corresponds to the record ID in the lawyers table for the attorney who billed the relevant time. The matter field is also an integer that refers to a record in another table, the matters table. These fields are called foreign key fields because they refer to the primary key of a different record.
With the help of foreign key fields, it's easy to set up one-to-many relationships among data. Here, for example, one case has many time entries and one lawyer has many time entries, too. Inversely, one time entry belongs to one lawyer and one matter.
Another important relationship that databases can track is a many-to-many relationship. A good example for this sort of relationship is for tracking students attending each course during a semester of school. You’d want to have a “Course” record for each class being offered, and you’d want to have a “Student” record for each student enrolled in school. To account for the fact that each student attends multiple courses and each course has multiple students, we would have to create another table to represent each link between a student and a course. This table would have columns for the ID of the course and the ID of the student.
Wrapping Up
The basics of databases are not as complicated as they sound at first because databases almost always deal with the same sorts of things that we encounter in the domain that the database represents - students, cases, courses, time entries, etc. If you think carefully about the things themselves and how they relate to each other, you’ll be well on your way to understanding how they’ll be represented in a database.
A database seems forbidding because it’s not something we interact with directly as computer users. If someone sent us a database file, we’d be challenged to know how to open it or explore it. What you’ll find in practice, however, is that you don’t often need or want databases themselves, so you’ll rarely need to deal with one directly. Instead, in litigation, you’ll see that there are numerous ways to extract data from a database in a format that’s easy to work with, such as Excel. Understanding database fundamentals - and being able to interpret a database schema - will ensure that you get the right data that answers the right questions for your case.
Stay tuned for the next installment, and be sure to sign up for our blog if you liked this post.