By Rebecca Pinsky
Data use is an increasingly scrutinized practice. Terms like “data use,” “data privacy,” “analytics,” and “machine learning” can be opaque to people without experience working with data. Understanding data doesn’t have to be difficult, though. The short guide below is meant to help current and future attorneys gain a foundational understanding of core data concepts so they will be better equipped to analyze data-related issues.
What is Data?
Data vs. Information
Data simply means recorded values.[1] Data can be qualitative or quantitative.[2] Typically, data is stored as text or numerals. Data is granular and specific. Information is broader and purpose-driven. Think of information as the higher-level insights that can be derived from data.[3]
Types of Data
Recording data as either “text or numerals” is still not a particularly precise methodology. That is why types of data are usually classified further. Different programming languages store data differently, but the basics typically do not vary much:
- A character is a single entry, whether text or numerical. “R” is one character, “law” is three characters, and “2022” is four characters.[4]
- Boolean data can be one of two values, usually TRUE or FALSE, YES or NO, or 1 or 0.[5]
- Date data represents dates or times, like a YYYY-MM-DD date, a hh:mm:ss timestamp, a “datetime” combination of a date and a timestamp, or a date part, such as a year, month, or day.[6]
- Numeric data types store numbers. Languages vary here. Numeric data may be further classified as integers, decimals, floating point numbers, or complex numbers.[7]
- Strings are series of characters. Strings may be fixed in length, like a product ID, or variable in length, like the names of colors.[8]
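The data types listed above can be sketched in Python, one of the languages the sources for this section describe. This is a minimal illustration rather than a definitive reference: the variable names and most of the values (the ceremony time, the guest count, the budget) are hypothetical, and other languages name and divide these types differently.

```python
from datetime import date, datetime

# Character and string: Python has no separate character type,
# so a single character is simply a string of length one.
initial = "R"
word = "law"
year_text = "2022"  # four characters, stored as text rather than as a number

# Boolean: one of two values.
is_registered = True

# Date and datetime values (the time of day here is a hypothetical example).
wedding_date = date(2021, 6, 10)
ceremony_start = datetime(2021, 6, 10, 16, 30, 0)

# Numeric types: integer, floating point, and complex (hypothetical values).
guest_count = 120
gift_budget = 49.99
complex_value = complex(3, 4)

# Print the type name next to each value.
values = (initial, word, year_text, is_registered, wedding_date,
          ceremony_start, guest_count, gift_budget, complex_value)
for value in values:
    print(type(value).__name__, repr(value))
```

Note that "2022" stored as a string and 2022 stored as an integer are different values to a program, even though they look the same to a reader.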
Storing Data
Data is commonly stored in tables made up of columns and rows, where the columns are categories and the rows are individual entries. Collections of data are datasets. A simple example is the wedding registry dataset in Exhibit A. Datasets may stand alone or make up larger databases, which are organized collections of data.[9] For example, the data in Exhibits A and B could be organized into one larger database. A well-constructed database follows a database schema, which is the empty framework that lays out the tables in the database and the fields included in each table.[10]
Often, data interacts with other data or media. Database models define the logical structures of databases and how the data within is connected, processed, and stored.[11] Relational databases are the most common.[12] In a relational model, the columns in the tables describe the rows, and tables can refer to data from other tables. In the wedding data examples, couple_id is created in Exhibit B and referred to by Exhibit A.
Exhibit A: Wedding Registry Dataset
couple_id | product | product_type | product_maker | product_priority |
1 | waffle_iron | cooking | cuisinart | mid |
2 | china_set | dining | wedgwood | low |
3 | vacuum | cleaning | dyson | high |
4 | skillet | cooking | lodge | high |
Exhibit B: Engaged Couples Dataset
couple_id | partner_one | partner_two | wedding_date | wedding_city |
1 | mary | matthew | 2021-06-10 | downton |
2 | jodie | alexandra | 2020-10-20 | los_angeles |
3 | billy | adam | 2020-12-31 | manhattan |
4 | cameron | tom | 2021-04-01 | dallas |
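The relationship between Exhibits A and B can be sketched with Python's built-in sqlite3 module. This is one possible arrangement, not the only way to model the data: the table names (engaged_couples, wedding_registry) and the column types are assumptions chosen for illustration, and the database lives only in memory.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary in-memory database

# The schema: the empty framework of tables and fields.
# Table names and column types are illustrative assumptions.
conn.executescript("""
CREATE TABLE engaged_couples (
    couple_id    INTEGER PRIMARY KEY,
    partner_one  TEXT,
    partner_two  TEXT,
    wedding_date TEXT,
    wedding_city TEXT
);
CREATE TABLE wedding_registry (
    couple_id        INTEGER REFERENCES engaged_couples (couple_id),
    product          TEXT,
    product_type     TEXT,
    product_maker    TEXT,
    product_priority TEXT
);
""")

# Rows from Exhibit B.
couples = [
    (1, "mary", "matthew", "2021-06-10", "downton"),
    (2, "jodie", "alexandra", "2020-10-20", "los_angeles"),
    (3, "billy", "adam", "2020-12-31", "manhattan"),
    (4, "cameron", "tom", "2021-04-01", "dallas"),
]
# Rows from Exhibit A.
registry = [
    (1, "waffle_iron", "cooking", "cuisinart", "mid"),
    (2, "china_set", "dining", "wedgwood", "low"),
    (3, "vacuum", "cleaning", "dyson", "high"),
    (4, "skillet", "cooking", "lodge", "high"),
]
conn.executemany("INSERT INTO engaged_couples VALUES (?, ?, ?, ?, ?)", couples)
conn.executemany("INSERT INTO wedding_registry VALUES (?, ?, ?, ?, ?)", registry)

# Use couple_id to connect each registry item to the couple who wants it.
high_priority = conn.execute("""
    SELECT c.partner_one, c.partner_two, r.product
    FROM wedding_registry AS r
    JOIN engaged_couples AS c ON c.couple_id = r.couple_id
    WHERE r.product_priority = 'high'
    ORDER BY c.couple_id
""").fetchall()

for row in high_priority:
    print(row)
```

Because couple_id appears in both tables, the query can answer a question that neither exhibit answers alone: which couples listed a high-priority item, and what it was.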
More Data Terms to Know
- An Algorithm is a process or set of instructions used to solve a problem or answer a question from data.
- Analytics is the process of identifying patterns in data to answer questions.[13] Analytics is less sophisticated than data science and typically answers less complex questions. Analytical findings describe rather than predict.
- Big Data, in essence, means lots and lots of data, often compiled from different sources. The sheer scale of the data makes it difficult to avoid uncertainty and inconsistency.[14]
- Cookies are small pieces of data stored on a user’s computer when that user visits a website. The purpose of cookies is to allow websites to recognize users’ preferences.[15]
- Cloud means hosted on remote servers run by an outside provider rather than on your own device. When you store photos on Dropbox, Google Photos, or iCloud, you’re utilizing cloud storage.
- Data Architecture is the overarching term for the governance documentation of an organization’s databases and data systems.[16]
- A Data Dictionary is documentation associated with a database that catalogs and describes what is in the database. The data dictionary may also give information about the database’s structure and operation.[17]
- A Data Lake is a repository for raw data from a range of sources. The data remains formatted the way the sources formatted it.[18] The primary purpose of a data lake is to collect and hold large amounts of data.[19]
- Data Science is advanced statistical analysis and modelling used to make predictions or make data easier to understand.[20]
- A Data Warehouse is a system of aggregated data used for querying and analyzing data.[21]
- ETL stands for extract, transform, and load. It is a common process used to change varied data from multiple sources[22] so that it is more usable in the context where it is needed.
- Machine Learning refers to a process of feeding data to a program or system that will use some of the data you gave it to find patterns and make predictions about the rest of the data.[23] Machine learning models are only as good as their source data—if the source data contains historical bias or is missing important populations, the predictions the model makes will be flawed.
- Metadata is data about data that furthers how the data can be used and understood.[24] For example, metadata of users who registered for a website might include date and time of registration, whether the user was a referral, and whether the user used their phone to access the site.
- Open means free to use and distribute without restriction.[25] A foundational idea behind open data is that the sharing and use of open data is subject to an honor-bound social contract.[26]
- PII stands for personally identifiable information. Data that is directly linked to a person’s identity and data that can be used to ascertain a person’s identity when used with other data can both be considered PII.[27]
- Visualization is a way to represent information and data. Data visualizations like charts, graphs, infographics, dashboards, and maps can make it easier to understand patterns in data.[28]
- Web Scraping is the process of taking data from a website and converting it into a more convenient format.[29]
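The ETL entry in the list above can be made concrete with a short Python sketch. Everything here is hypothetical: the two sources, their field names, their delimiters, and their date formats are invented to show how differently formatted data might be brought into one consistent shape. A plain Python list stands in for the destination system.

```python
import csv
import io
from datetime import datetime

# Extract: two hypothetical sources that record similar facts differently.
SOURCE_A = (
    "couple_id,wedding_date,wedding_city\n"
    "1,2021-06-10,downton\n"
    "2,2020-10-20,los_angeles\n"
)
SOURCE_B = (
    "id;date;city\n"
    "3;12/31/2020;Manhattan\n"
    "4;04/01/2021;Dallas\n"
)


def extract(text, delimiter):
    """Read delimited text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def transform_b(row):
    """Rename fields and normalize values so source B matches source A."""
    parsed = datetime.strptime(row["date"], "%m/%d/%Y")
    return {
        "couple_id": row["id"],
        "wedding_date": parsed.strftime("%Y-%m-%d"),
        "wedding_city": row["city"].lower().replace(" ", "_"),
    }


def load(rows, target):
    """Append cleaned rows to the target (a stand-in for a destination table)."""
    target.extend(rows)


destination = []
load(extract(SOURCE_A, ","), destination)
load([transform_b(row) for row in extract(SOURCE_B, ";")], destination)

for row in destination:
    print(row)
```

After the transform step, every row uses the same field names, the same YYYY-MM-DD date format, and the same lowercase city style, so the combined data can be searched and compared as a single dataset.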
[1] See What is the difference between data and information?, Computer Hope, https://www.computerhope.com/issues/ch001629.htm (last updated Aug. 31, 2020).
[2] See What is Data?, School of Data, https://schoolofdata.org/handbook/courses/what-is-data/ (last updated Sept. 2, 2013).
[3] See Computer Hope, supra note 1.
[4] See Character, Computer Hope, https://www.computerhope.com/jargon/c/charact.htm (last updated Apr. 2, 2019).
[5] See Boolean, Computer Hope, https://www.computerhope.com/jargon/b/boolean.htm (last updated May 16, 2020).
[6] See SQL Data Types for MySQL, SQL Server, and MS Access, W3Schools, https://www.w3schools.com/sql/sql_datatypes.asp.
[7] See id. See also Python Data Types, W3Schools, https://www.w3schools.com/python/python_datatypes.asp; JavaScript Data Types, W3Schools, https://www.w3schools.com/js/js_datatypes.asp.
[8] See W3Schools, supra note 6.
[9] See Database defined, Oracle, https://www.oracle.com/database/what-is-database/.
[10] See What is a Database Schema, Lucidchart, https://www.lucidchart.com/pages/database-diagram/database-schema.
[11] See What is a Database Model, Lucidchart, https://www.lucidchart.com/pages/database-diagram/database-models.
[12] See Lucidchart, supra note 11.
[13] See generally Analytics defined, Oracle, https://www.oracle.com/business-analytics/what-is-analytics/.
[14] See The Four V’s of Big Data, IBM, https://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg.
[15] See Cookie, PC Mag: Encyclopedia, https://www.pcmag.com/encyclopedia/term/cookie.
[16] See Data Architecture, Snowflake: Data Warehousing Glossary, https://www.snowflake.com/data-warehousing-glossary/data-architecture/.
[17] See Data Dictionaries, USGS, https://www.usgs.gov/products/data-and-tools/data-management/data-dictionaries.
[18] See Data Lake vs Data Warehouse, Snowflake, https://www.snowflake.com/trending/data-lake-vs-data-warehouse.
[19] See Snowflake, supra note 18.
[20] See Data Science Terms and Jargon: A Glossary, Dataquest (Feb. 20, 2018), https://www.dataquest.io/blog/data-science-glossary/.
[21] See Data Warehousing, Snowflake: Data Warehousing Glossary, https://www.snowflake.com/data-warehousing-glossary/data-warehousing/.
[22] See zoinerTejada et al., Extract, transform, and load (ETL), GitHub: MicrosoftDocs (Nov. 20, 2019), https://github.com/microsoftdocs/architecture-center/blob/master/docs/data-guide/relational-data/etl.md.
[23] See Machine Learning Glossary, Google Developers, https://developers.google.com/machine-learning/glossary/#m (last updated Aug. 11, 2020).
[24] See Metadata Creation, USGS, https://www.usgs.gov/products/data-and-tools/data-management/metadata-creation.
[25] See What is open?, Open Knowledge Foundation, https://okfn.org/opendata/.
[26] See generally Open Definition 2.1, Open Knowledge Foundation, https://opendefinition.org/od/2.1/en/.
[27] See Guidance on the Protection of Personal Identifiable Information, U.S. Department of Labor, https://www.dol.gov/general/ppii.
[28] See Data visualization beginner’s guide: a definition, examples, and learning resources, Tableau, https://www.tableau.com/learn/articles/data-visualization.
[29] See What is Web Scraping and What is it Used For?, ParseHub: Blog (Aug. 6, 2019), https://www.parsehub.com/blog/what-is-web-scraping/.
Image Source: https://www.needpix.com/file_download.php?url=https://storage.needpix.com/rsynced_images/database-schema-1895779_1280.png