Columns Aren't Just for Advice and Holding Up Buildings: They Can also Help Your Analytics
Traditional databases, like an Association Management Systems, are designed to handle frequent transactions and store data. This is very different from dimensional data models that are specifically designed for analysis while aligning with the analytical workflow.
Data, Files, and Blocks
"Cloud computing" is a bit of a misnomer. Data is still stored in files made up of blocks on a computer.
To increase efficiency, databases store entire table rows of data in the same block. For example, all of a customer's attributes such as name, address, member type, and previous event attendance are stored in a single block for fast retrieval. In this scenario, each “row” represents an individual customer while each “column” represents their different attributes.
If you think about it, most analytics involves aggregating data such as sums, counts, and averages that span many rows. Your exploration might eventually lead you to detailed individual records, but it will likely take several steps to identify these records. This means that if you looking at say, average event revenue, the database will need to retrieve entire records from several blocks just to get the revenue field for the eventual calculation. Imagine having to individually navigate many shelves from left to right when you could just quickly create a stack of what you need!
Columns and Rows
Similar to the goal of a dimensional data model, database technologies can further optimize analytics by primarily storing data in columns instead of rows. For this scenario involving average event revenue, the database simply accesses a single block with all of the data for the revenue column across all rows.
These columnar databases significantly improve performance and storage while providing several other key benefits.
- Data compression: Since columns are generally the same data type, compression methods best suited for the type of data can be applied to reduce needed storage. In addition, aggregations can sometimes be performed directly on compressed data.
- Advanced analytics: Many of the algorithms underlying advanced analytics leverage vector and matrix structures that are easily populated by single columns of data.
- Change tracking: Some technologies track changes at the column level, so you can maintain granular history without having to unnecessarily repeat other data.
- Sparse data storage: For columns that maintain valuable data that is infrequently populated such as specify product purchases; traditional database technologies need to maintain “NULL” values while column-based databases avoid this storage.
- Efficient distributed processing: Similar to managing file blocks, column-based technologies can distribute data across machines based on column to rapidly process data in parallel.
Examples of columnar database technologies include Apache HBase, Google BigQuery, and Amazon Redshift. HBase is part of the open-source Hadoop ecosystem, BigQuery is a cloud-based service based on technology that served as a precursor to Hadoop, and Amazon Redshift is a cloud-based service that is part of the popular Amazon Web Services (AWS) offering.
Speaking of holding up buildings, our friends at the National Council of Architectural Registration Boards created some great visualizations based on Amazon Redshift using Tableau Public. Analytics tools such as Tableau and Microsoft Power BI offer native connectors to Amazon Redshift and other big data technologies. These technologies are another way that you can enhance your analytics using data and tools that you already have with cloud services to rapidly make data-guided decisions for your association.