September 27, 2018

What is Data? (Part I of III)

It’s easy to get caught up in the myriad of technologies, advances, vendors, and choices in the Information Technology (“IT”) field. One of the most valuable ways to step back and reframe IT toward effective solutions is to think in terms of data before any other considerations. Data is the cornerstone of all IT; all IT systems are built to do one or more of the following operations:

  • Gather data
  • Move data
  • Transform data
  • Store data
  • Output data
  • Control machines based on data

To illustrate this reality, the following table shows a few examples of each data operation for Scientific, Business, and Personal computing. If you think through this table and add a few of your own examples, it’s easy to see how central data is to all IT.

[Table: Data Operations Examples Across Scientific, Business, and Personal Computing]
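
To make the six operations concrete, here is a minimal Python sketch of a toy temperature monitor. Everything in it is an assumption made for illustration: the function names, the hard-coded readings, and the 23.0 degree limit are hypothetical, an in-memory dictionary stands in for a real database, and passing values between functions stands in for moving data between systems.

```python
import json
import statistics


def gather():
    """Gather: collect raw readings (hard-coded here; a real system might poll a sensor)."""
    return ["21.5", "22.0", "bad", "23.5"]


def transform(raw):
    """Transform: convert text to numbers, dropping values that fail to parse."""
    cleaned = []
    for item in raw:
        try:
            cleaned.append(float(item))
        except ValueError:
            continue  # skip unparseable readings such as "bad"
    return cleaned


def store(values, storage):
    """Store: keep the cleaned values (a dict stands in for a database)."""
    storage["temperatures_c"] = values


def output(storage):
    """Output: produce a summary report as a JSON string."""
    values = storage["temperatures_c"]
    return json.dumps({"count": len(values), "mean_c": statistics.mean(values)})


def control(storage, limit_c=23.0):
    """Control: choose a machine action based on the most recent reading."""
    return "fan_on" if storage["temperatures_c"][-1] > limit_c else "fan_off"


storage = {}
# Move: data flows from one step to the next via function arguments.
store(transform(gather()), storage)
report = output(storage)
action = control(storage)
print(report, action)
```

Even in a toy like this, every function exists only to do something with data, which is the point of thinking about data first.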

Thinking this way means first understanding your organization’s data requirements, and then, with that understanding, aligning other IT features like security, compute, and storage to those needs. In fact, if you DON’T think this way, you will likely end up with computer systems that don’t fully meet your organization’s needs. The result is poor support of business activities and higher-than-necessary costs to buy and maintain misaligned systems. Worse, security risks are much higher when IT security approaches aren’t closely aligned to the types and uses of data unique to each organization.

Three questions are key to understanding your data well enough to make effective IT decisions:

  1. What kinds of data do we have?
  2. What do we need to do with our data?
  3. Who can access our data?

In this post, we’ll discuss number 1, the types of data.

What Kinds of Data Do We Have?

This graphic shows the four major types of data found in organizations today. Every organization with any IT systems at all has at least some quantity of each type, but organizations differ in how much of each type they hold and in how much each type matters to them. For example, a website that sells stock images would have a great quantity of important Unstructured Data (image files), while an online banking system would contain primarily Structured Data that must be highly secured.

As the following list shows, each of these four types of data is quite different from the others and is typically stored and managed very differently.

  • Unstructured Data
    • Definition: “File System” data, documents, images, reports, multimedia, etc. Any kind of data represented as a single, distinct file intended to be used by typical computer users.
    • Storage: Unstructured Data is most commonly found on file systems that can be browsed from a computer. Newer “NoSQL” databases can house Unstructured Data as key-value pairs, with the piece of data, like an image file, as the “value” and a unique “key” assigned to retrieve it. Typical examples include NoSQL databases such as Cassandra and MongoDB, along with the Hadoop Distributed File System (HDFS), which is a distributed file system rather than a database. Finally, Cloud services such as Amazon’s S3 are highly optimized to manage Unstructured Data and offer great value at scale.
  • Structured Data
    • Definition: Data that can be easily stored and retrieved in a Row-Column format. For example, data in each cell of an Excel spreadsheet or individual pieces of information inside database tables (e.g., “Last Name”).
    • Storage: Structured Data is most often stored and retrieved in some kind of Row-Column format, typically a relational (SQL) database such as Microsoft Access, Oracle RDBMS, MySQL, or Microsoft SQL Server. However, Structured Data can also be stored very simply in an Excel spreadsheet or even in a text file. When exchanged between systems, Structured Data is often represented using a markup language like XML.
  • Metadata
    • Definition: Any data that describes characteristics of a thing. While Structured and Unstructured Data refer to the thing itself that is being stored, Metadata refers to information about the thing. For example, an image file might have metadata including a date, a location, and a person with whom it’s associated.
    • Storage: Metadata is most often stored in the same manner as Structured Data, in some type of Row-Column format. This makes it easy to search by metadata characteristics, for example the location where a picture was taken. Interestingly, while metadata itself is typically stored as Structured Data, the thing metadata describes is usually an image or some other type of file, stored as Unstructured Data!
  • Machine Data
    • Definition: Information used within an IT system or software application that is critical for operations, but not easily usable by humans. Examples include computer log files, properties files, and security events reported by devices and software.
    • Storage: Machine data is some of the most important data an organization has for ensuring security and properly maintaining the IT environment. However, many organizations have very weak processes and systems in place to capture, manage, and use this data. Machine Data is most typically output as a file of some sort, and therefore stored as Unstructured Data. When well-designed, IT systems that capture Machine Data also store Metadata describing the machine output, so it can be easily identified for later analysis.
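
The four types above can be sketched side by side in a few lines of Python. This is only an illustration under stated assumptions: the table names, column names, file name, location, and log line format are all hypothetical, and an in-memory SQLite database stands in for whatever SQL database an organization actually uses.

```python
import re
import sqlite3

# Unstructured Data: the raw bytes of a file (a stand-in for a real image).
image_bytes = b"\x89PNG placeholder bytes, not a real image"

conn = sqlite3.connect(":memory:")

# Structured Data: individual values stored in a Row-Column table.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, last_name TEXT)")
conn.execute("INSERT INTO customers (last_name) VALUES (?)", ("Nguyen",))

# Metadata: information about the image, itself stored as Structured Data.
conn.execute(
    "CREATE TABLE image_metadata "
    "(file_name TEXT, taken_on TEXT, location TEXT, size_bytes INTEGER)"
)
conn.execute(
    "INSERT INTO image_metadata VALUES (?, ?, ?, ?)",
    ("beach.png", "2018-09-27", "Lisbon", len(image_bytes)),
)

# Searching by a metadata characteristic, such as where a picture was taken.
rows = conn.execute(
    "SELECT file_name FROM image_metadata WHERE location = ?", ("Lisbon",)
).fetchall()

# Machine Data: a log line written for software, parsed into named fields.
log_line = "2018-09-27T10:15:02Z sshd WARN failed login for user admin"
pattern = re.compile(
    r"^(?P<timestamp>\S+) (?P<source>\S+) (?P<level>\S+) (?P<message>.*)$"
)
event = pattern.match(log_line).groupdict()

print(rows)
print(event["level"], event["message"])
```

Notice that the image itself never enters the database here. Only its metadata does, which mirrors the split described above between the file and the information about the file.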

Knowing how much of each type of data your organization has, and how much each type of data is expected to grow, is the first step toward understanding your data. Knowing this gives great insight into the types and scale of IT systems needed to efficiently and cost-effectively manage this data.
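
One simple way to reason about growth is a compound-growth projection. The sketch below assumes a constant yearly growth rate for each data type, which is a simplification, and every size and rate in it is a made-up placeholder rather than a benchmark. Swap in your own measurements.

```python
def projected_size_gb(current_gb, annual_growth_rate, years):
    """Compound growth: size after `years` at a constant yearly growth rate."""
    return current_gb * (1 + annual_growth_rate) ** years


# Hypothetical inventory: (current size in GB, assumed yearly growth rate).
inventory = {
    "Unstructured Data": (5000.0, 0.40),
    "Structured Data": (200.0, 0.10),
    "Metadata": (20.0, 0.40),
    "Machine Data": (800.0, 0.60),
}

# Project each data type three years ahead.
projections = {
    name: projected_size_gb(size_gb, rate, years=3)
    for name, (size_gb, rate) in inventory.items()
}

for name, size_gb in projections.items():
    print(f"{name}: about {size_gb:,.0f} GB in 3 years")
```

Even a rough projection like this can show which type of data will dominate storage and management costs a few years out.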

Coming next in What is Data Part II of III: What do we need to do with our data?

Thanks for reading!

