Data Preparation Is the Key to Big Data Success

Big data is often hyped, but I encourage taking a more realistic stance. I’ve seen many organizations attempt to adopt big data solutions and ultimately fail. I fear these missteps may eventually sour the market on adopting big data solutions. That would be unfortunate, because I view big data as a transformational capability and an essential part of a new IT infrastructure. To improve the odds of success, this post presents common barriers to adoption that I’ve observed and offers insight into how to adapt and overcome them.

One of the primary barriers to big data success is the lack of a data preparation strategy. Data preparation includes all the steps necessary to acquire, prepare, curate, and manage the data assets of the organization. Sound data is the foundation for the actionable insights delivered by advanced analytic applications. If the data is tainted, conclusions based on it become questionable, even indefensible, and big data that is not backed by accurate intelligence can add to confusion and organizational turmoil.

It’s not unlike the Hippocratic oath: “First, do no harm.” The worst outcome of a big data undertaking would be to make poor decisions based on bad information while being supremely confident in them.

Interestingly, most companies contemplating big data, and unfortunately most vendors selling such solutions, rarely consider the implications of data preparation. Building the hardware infrastructure and software to support a big data lake can be complex and expensive, leading adopters to conclude that this is the most challenging element of the big data equation. However, once the infrastructure is in place, they are often dismayed to discover that it is simply the tip of the iceberg. Collecting and managing trusted data can be much more expensive, especially if the big data project begins with a poorly understood idea of what data will ultimately be required.

So, what constitutes a good data preparation strategy?

As I have noted previously, the following six-step process will help:

1. Identify your decision set

Knowing the context of a company’s decision-making is an essential first step. It defines the data sets you will use to support a decision, how the data will be manipulated, and ultimately the analytical process that will generate insight. Many assume that if data is simply cleansed and curated effectively, any analytic process can be supported. This is not true. Organizational leaders need to define the end game first; data preparation then becomes much simpler.

2. Select the data sources to support the desired decisions

Granted, you cannot know in advance every possible data source that might be needed, but it is possible to identify the primary data sources that will need to be used. These will not only define the types of data available but will largely define the kinds of data cleansing that will need to be done.

3. Choose the right vendor of data cleansing technology

You will want technology that not only offers solutions that accommodate your initially identified data types but also offers a platform that feeds your existing cross-organizational analytic tools. In an analytics-driven company, many different levels in an organization will be using tools to inform decisions. It is essential that the data-preparation tool provides a platform, accessible by all, to enable access to curated and trusted data. Only by starting from the same basic data set will decisions be consistent across the organization.

4. Assess and ingest additional data sets

As noted in step two, it is not possible to predefine every data set that might be necessary to support a decision; additional data sets that might be useful are constantly being discovered. Assessing new data sources is an ongoing and important part of decision-making.

5. Identify any new analytic tools that will produce the desired insights

There are many excellent analytic packages on the market, from simple statistical tools all the way to very advanced machine-learning-based applications. They each provide different insights and require different degrees of data cleansing. For example, machine learning may be able to handle data in an essentially native mode, while a statistical tool might need very clean data, with every field reconciled.
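To make the contrast concrete, here is a minimal sketch in Python using pandas, with entirely hypothetical records: the raw frame has the inconsistent labels and missing values an ML pipeline might tolerate, while the cleaned frame has every field reconciled, as a statistical tool would require.

```python
import pandas as pd

# Hypothetical raw records: inconsistent casing, stray whitespace, missing values.
raw = pd.DataFrame({
    "region": [" East", "east", "WEST", None],
    "sales": ["100", "250", None, "75"],
})

# A statistical tool typically needs every field reconciled:
clean = raw.copy()
clean["region"] = clean["region"].str.strip().str.lower()  # normalize labels
clean["sales"] = pd.to_numeric(clean["sales"])             # enforce a numeric type
clean = clean.dropna()                                     # drop incomplete rows

print(clean)
```

After this pass, only the two fully reconciled rows survive; the rows with a missing region or missing sales figure are excluded from the statistical analysis.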

6. Extend data preparation to incorporate new data into existing data sets

Many data sets are dynamic and evolving. As new data is discovered or becomes available, data preparation must be conducted to ensure its ready availability.
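One way to keep preparation consistent as data evolves, sketched here with hypothetical field names and cleaning rules, is to capture the normalization in a single function and run every new batch through it before merging:

```python
import pandas as pd

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the same normalization rules to every batch (hypothetical rules)."""
    out = df.copy()
    out["customer"] = out["customer"].str.strip().str.title()       # normalize names
    out["amount"] = pd.to_numeric(out["amount"], errors="coerce")   # enforce numbers
    return out.dropna()                                             # drop bad records

existing = prepare(pd.DataFrame({"customer": ["alice", "bob "],
                                 "amount": ["10", "20"]}))
new_batch = prepare(pd.DataFrame({"customer": ["BOB", "carol"],
                                  "amount": ["20", "x"]}))

# Incorporate the new data, dropping records already present.
combined = pd.concat([existing, new_batch]).drop_duplicates().reset_index(drop=True)
print(combined)
```

Because the same `prepare` function governs both the existing data set and each new batch, newly discovered data is ready for use the moment it is merged.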

By recognizing that data preparation is a necessary first step to big data value, an enterprise can significantly reduce the costs of big data while accelerating the delivery of actionable insights that drive good decision-making. Fortunately, the data preparation market is growing rapidly, with Frost & Sullivan projecting it to exceed $9 billion globally by 2025, so solutions and expertise to support a data preparation strategy are readily available.

This article was written by Michael Jude from InfoWorld and was legally licensed through the NewsCred publisher network. Please direct all licensing questions to legal@newscred.com.