Big data is loaded with big words. Having a good grasp of common data terms helps you not only understand, but also join in and influence conversations around data initiatives.
OK, let’s get started and demystify some terms you’ve heard before and introduce a couple that may be brand new.
Combining equal parts of science, business, and art, the data scientist uses knowledge of algorithms, tools, and processes to extract value from data. A data scientist will often apply machine learning or artificial intelligence techniques to mine, group, or analyze data sets.
Heteroscedasticity and heteroscedastic data
HeteroWHAT? This may be a new term for you, so let’s walk through a very basic example of what this means.
Some data is constant and never changes. Yesterday’s weblogs are a constant. Until we invent time travel, you won’t be able to go back and change what someone did yesterday.
The next level of complexity for data is linear. A queue or voicemail is an example of linear growth. If one worker can process ten messages per hour, then we’d need five workers to handle 50 messages per hour. Data that grows in quadratic fashion grows much faster: doubling the input quadruples the output. An example of this might be social media. When you write a post, 4, 10, 100, or even millions of people may read it. Those people may share your post, comment on it, or otherwise generate some metadata that changes every second. This is where we start getting into heteroscedasticity. Strictly speaking, statisticians use the term for data whose variability is not constant across its range. In the big data sense used here, it’s characterized by high velocity (the data moves and changes quickly) and high variability (there is no easy way to predict who comments on, shares, or likes a post, or what the speed of response will be).
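The growth patterns above can be sketched in a few lines of Python. This is only an illustration: the function names and the engagement numbers are made up, and the last function takes heteroscedasticity in its statistical sense of unequal spread across groups.

```python
import math
import statistics

def workers_needed(messages_per_hour, rate_per_worker=10):
    """Linear scaling: workers grow in direct proportion to the load."""
    return math.ceil(messages_per_hour / rate_per_worker)

def quadratic_growth(n):
    """Quadratic scaling: doubling n quadruples the result."""
    return n * n

def spread_by_group(groups):
    """Standard deviation per group; very unequal spreads hint at heteroscedasticity."""
    return {size: statistics.stdev(values) for size, values in groups.items()}

# Hypothetical engagement counts per post, grouped by audience size (invented data).
engagement_by_audience = {
    100: [3, 4, 5, 4, 3],
    10_000: [50, 400, 120, 900, 30],
}

workers = workers_needed(50)
spreads = spread_by_group(engagement_by_audience)
print(workers, quadratic_growth(10), quadratic_growth(20), spreads)
```

In this made-up sample, the small audience produces nearly identical engagement from post to post, while the large audience swings widely. That widening spread is what makes such data hard to predict.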
Another great analogy is cooking. When cooking a meal, we’re combining ingredients in different ways to try to create something that’s (hopefully) delicious. As anyone who’s tried to cook knows, any number of small changes – adding a little salt, cooking for 2 minutes too long, chopping the tomatoes too large or small – can have a profound impact on the outcome and on how quickly you converge on the final recipe for that signature dish.
Even if you’ve never used this term before, heteroscedasticity is something you’ll run into more and more with industrial IoT workloads. This is especially true when dealing with high-velocity data (like streaming) or with unstructured, rapidly changing data like the HTML pages that the Google web crawler traverses.
Machine Learning (ML) is a field of computer science that enables computers to recognize and extract patterns from raw data through rigorous training of data models.
ML enables “the three C’s of big data” – classification, clustering, and collaborative filtering.
Classification is the problem of identifying which category (or sub-category) a new observation belongs to, based on training sets of data in which the category of each instance is already known. For example, classification might involve training an algorithm to recognize tumors in a set of MRI scans, then asking the algorithm to identify other scans that have tumors.
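As a rough sketch of the idea, here is a tiny nearest-centroid classifier. Real MRI analysis is far more involved. The two features per scan (a size and a brightness score) and all the numbers are hypothetical, chosen only to show the train-then-predict flow.

```python
def train_centroids(samples):
    """samples: list of (features, label). Returns label -> mean feature vector."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def classify(centroids, features):
    """Assign the label whose centroid is closest (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(centroids[label], features))

# Hypothetical labeled training data: (size_mm, brightness) per scan.
training = [
    ([12.0, 0.9], "tumor"),
    ([10.0, 0.8], "tumor"),
    ([1.0, 0.2], "clear"),
    ([2.0, 0.3], "clear"),
]

centroids = train_centroids(training)
print(classify(centroids, [9.0, 0.7]))
print(classify(centroids, [1.2, 0.1]))
```

The key point is that the labels in the training set are already known. The algorithm only learns how to assign them to new, unlabeled inputs.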
Clustering involves grouping raw data points into sets or “clusters.” An example here might be an ML algorithm that runs over web logs in real time, grouping valid traffic (to allow) in one category and possible attacks (to block) in another.
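Unlike classification, clustering starts with no labels at all. The sketch below is a minimal one-dimensional k-means with two clusters, run over made-up requests-per-minute figures for a handful of clients. It works on a fixed batch rather than in real time, and the numbers and the seeding choice are assumptions for illustration.

```python
def two_means(values, iterations=10):
    """Minimal 1-D k-means with k=2, seeded at the smallest and largest value.

    Assumes the input contains at least two distinct values.
    """
    low, high = min(values), max(values)
    clusters = ([], [])
    for _ in range(iterations):
        clusters = ([], [])
        for v in values:
            # Put each point in the cluster whose center is nearer.
            nearer_low = abs(v - low) <= abs(v - high)
            clusters[0 if nearer_low else 1].append(v)
        # Move each center to the mean of its points.
        low = sum(clusters[0]) / len(clusters[0])
        high = sum(clusters[1]) / len(clusters[1])
    return clusters

# Hypothetical requests per minute, one number per client.
requests_per_minute = [4, 6, 5, 7, 480, 510, 495]

normal, suspicious = two_means(requests_per_minute)
print(sorted(normal), sorted(suspicious))
```

The algorithm is never told what an attack looks like. It simply finds that the data falls into two groups, and a human (or a downstream rule) decides which group to block.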
Collaborative filtering is just a fancy word for “recommendations.” An example is determining and displaying products that show some affinity with each other.
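One simple way to approximate that affinity is to count how often two products show up in the same order. The sketch below does exactly that. The baskets are invented, and production recommenders use far richer signals than raw co-occurrence.

```python
from collections import Counter
from itertools import permutations

def co_purchase_counts(baskets):
    """For each item, count how often every other item appears in the same basket."""
    counts = {}
    for basket in baskets:
        for a, b in permutations(set(basket), 2):
            counts.setdefault(a, Counter())[b] += 1
    return counts

def recommend(counts, item, top_n=2):
    """Return up to top_n items most often bought together with `item`."""
    return [other for other, _ in counts.get(item, Counter()).most_common(top_n)]

# Hypothetical order history.
baskets = [
    ["tent", "sleeping bag", "lantern"],
    ["tent", "sleeping bag"],
    ["tent", "lantern", "stove"],
    ["sleeping bag", "pillow"],
]

counts = co_purchase_counts(baskets)
print(recommend(counts, "tent"))
print(recommend(counts, "pillow", top_n=1))
```

Here a shopper looking at a tent would be shown the sleeping bag and the lantern, because those two items appear alongside tents most often in the sample orders.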
Much of what we do in ML is called “shallow learning.” Deep learning is usually a component in true Artificial Intelligence.
Artificial Intelligence (AI) encompasses and expands on ML by providing computers with the ability to perform a deep cognitive analysis.
Whereas ML typically involves some sort of initial human intervention in the way of algorithm creation, tuning, or training (like feeding scans of tumors to the computer), AI enables the computer to select, tune, and train itself to perform some specific function. Ultimately AI uses deep learning to emulate human decision-making and learning processes.
You might not realize it, but AI is probably part of your daily life. More on this in the NLP definition below.
Virtual Reality (VR) allows users to step into virtual worlds that look and sound completely different from their physical surroundings.
Augmented Reality (AR) strives to overlay digital artifacts on top of the real world, enabling interaction. Recently, AR has become widely successful with the popularity of gameplay apps.
Natural language processing
Natural Language Processing (NLP) allows computers to parse and understand written or spoken human language. If you talk to your phone or a smart home device, you have probably experienced NLP.
NLP is a great place to explain the difference between deep and shallow learning. First generation NLP (shallow learning) focused on breaking a sentence up into tokens (words), and then applying some rules to the tokens. Today’s deep learning NLP, however, looks at the whole context of a statement and reasons out the true meaning.
Imagine a written web review. Shallow learning would simply look at a limited number of data tokens like “number of review rating stars” and basic “sentiment analysis.” This may involve counting the number of positive vs. negative words. These data points are fed through an often-brittle set of rules to arrive at a conclusion about whether the review was positive or negative.
A deep learning engine applies more intelligence to this analysis – almost like what a human might surmise if they read the same review. For example, if a review had lots of positive signals, like a five-star rating and a high ratio of positive to negative words, a shallow NLP engine might conclude it was a positive review. A deep learning NLP engine, however, might interpret (as a human would) that the review was actually negative upon reading “I will never buy this product again.” That sentence alone negates any positive sentiments a user may have provided.
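The shallow approach is easy to sketch, which also makes its brittleness easy to see. The word lists and the sample review below are made up for illustration. The function simply counts positive and negative words.

```python
# Hypothetical word lists; a real system would use a much larger lexicon.
POSITIVE = {"great", "love", "excellent", "good", "fast"}
NEGATIVE = {"bad", "broken", "terrible", "slow", "awful"}

def shallow_sentiment(review):
    """Label a review by counting positive vs. negative words."""
    words = [w.strip(".,!?").lower() for w in review.split()]
    positives = sum(w in POSITIVE for w in words)
    negatives = sum(w in NEGATIVE for w in words)
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return "neutral"

review = (
    "Great packaging, fast shipping, good price. "
    "I will never buy this product again."
)
print(shallow_sentiment(review))
```

The counter finds three positive words and no negative ones, so it labels the review as positive. None of the words in the final sentence appear on either list, so the one statement that matters most is ignored entirely.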
Image recognition gives computers the ability to suss meaning out of a simple visual image. It is frequently bundled in a provider’s ML or AI offerings (along with NLP).
Image recognition allows computers to read written language using Optical Character Recognition, or OCR (such as the text on billboards), tag objects (like “mountain,” “tree,” “car,” “skyscraper”), and even perform facial analysis (like drawing bounding boxes around faces).
Image recognition is currently being taken to a whole new level by the automotive industry, which is applying facial analysis to detect and alert drivers who may be feeling fatigued.
Structured, unstructured, semi-structured data
Historically, much of the data we worked with was heavily structured. This means it fit nicely into a row / column format (like databases). As a result, many computer systems were designed to ingest and generate that form of data.
Humans are a different beast. We excel at generating and consuming unstructured data like free-flowing text, voice, and images like camera snapshots. All of this data inherently has no “structure” to it. We can’t “depend” on certain languages, words, intonations, etc.
Semi-structured data sits somewhere in the middle. A good example is email. It has some structure like “subject,” “to,” “from,” “date,” but the main payload is a blob of unstructured text in the “body” of the email.
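Python’s standard `email` module makes the split easy to see. In this small sketch the message itself is invented. The headers come back as named fields, while the body is just one string that still needs interpretation.

```python
from email import message_from_string

# A made-up message: structured headers, then a blank line, then free text.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Quarterly numbers\n"
    "Date: Mon, 01 Jan 2024 09:00:00 +0000\n"
    "\n"
    "Hi Bob, the numbers look good but shipping costs worry me.\n"
)

msg = message_from_string(raw)

# The structured part: predictable fields you can query by name.
structured = {field: msg[field] for field in ("From", "To", "Subject", "Date")}

# The unstructured part: a blob of text with no schema.
unstructured = msg.get_payload()

print(structured)
print(unstructured)
```

Filtering by sender or date is trivial with the structured fields. Working out that the author is worried about shipping costs requires something like NLP over the body.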
Only in the last 10 years have our computer systems become powerful enough to perform analyses on unstructured data.
Many analytics engines, like Hadoop, provide both storage and compute, often in a tightly coupled arrangement. Every time you add more processing, you inherently add more storage.
Many organizations, however, are sitting on mountains (petabytes) of data that they want to durably retain, but not analyze immediately. One reason for delay is the pre-processing and cleansing the data might need prior to analysis.
A data lake provides low-cost, highly durable, accessible-from-anywhere storage with limited compute. It lets you retain far more data than you process at any one time.
To use a cooking analogy, a data lake is like your pantry of raw ingredients (vegetables, rice, bouillon). Only when you want to cook do you pull out the right subset of ingredients, per the recipe, and prepare them for that meal.
What we commonly refer to as “a database” is also known as a Relational Database Management System (RDBMS), and the workload it typically serves is called OLTP (Online Transaction Processing). Oracle, MySQL, and SQL Server are all common examples.
RDBMSes are characterized by many small “transactions” that (typically) come from end users.
Think of retail e-commerce websites. At any given moment, hundreds of thousands of users may be performing small reads (queries) and writes (inserts) as they browse for products, read reviews, place orders, etc. There is an expectation that these systems perform these queries very quickly.
A data warehouse (also known as an enterprise data warehouse or EDW) is where the company runs analytics to answer several important business questions. What is our fastest growing product line? Which product categories have the best ROI? What are our worst performing regions, categories, salespeople, and so on?
EDWs are typically only used by a handful (maybe a dozen or few dozen) internal users, running long-running queries on massive (possibly hundreds of TB or dozens of PB) datasets.
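The difference between the two workloads shows up in the shape of the queries. The sketch below uses Python’s built-in `sqlite3` module with a tiny, made-up `orders` table purely for illustration. A real EDW would run the second kind of query over vastly more data on a dedicated engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, category TEXT, region TEXT, amount REAL)"
)

# Hypothetical order rows.
rows = [
    ("books", "east", 12.5),
    ("books", "west", 20.0),
    ("games", "east", 60.0),
    ("games", "west", 45.0),
    ("toys", "east", 8.0),
]
conn.executemany(
    "INSERT INTO orders (category, region, amount) VALUES (?, ?, ?)", rows
)

# OLTP-style: a small, fast lookup of a single record by its key.
one_order = conn.execute(
    "SELECT category, amount FROM orders WHERE id = ?", (3,)
).fetchone()

# EDW-style: scan and aggregate everything to answer a business question.
revenue_by_category = conn.execute(
    "SELECT category, SUM(amount) FROM orders "
    "GROUP BY category ORDER BY SUM(amount) DESC"
).fetchall()

print(one_order)
print(revenue_by_category)
conn.close()
```

The first query touches one row and must return almost instantly for each of many concurrent users. The second reads the whole table to answer a question like “which category earns the most?”, which is why it is usually run by a few analysts and tolerated as long-running.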
A visualization tool provides a visual front end to do complex analytics.
Using simple drag-and-drop, even users with no programming background can build fairly complex reports on quarterly sales, bestselling products, growth, etc.
These systems typically require that the engine you’re connecting them to have a SQL interface, which (not coincidentally) virtually every RDBMS and EDW provides. If you’re like a lot of data analysts, 95% of your interaction with your systems will be via one of these visualization tools.
Hope you’ve enjoyed this quick walkthrough of common terms we find in big data. Feel free to now impress the folks at the water cooler by discussing how the visualization of unprecedented data growth, the advantages of creating a data lake, and unlocking the value of heteroscedastic data through ML and AI are thoroughly changing the world.