By Bo Hagler
This article originally ran in Supply Chain Best Practices
Built on relational databases, data warehouses were widely adopted in the closing decades of the 20th century and became a standard business requirement after the turn of this century. Data warehouses are tremendously useful, but they show their limits as the sources and formats of 21st-century data multiply and grow more complex.
For many organizations, integrating an existing data warehouse into a “data lake” will let them keep using their legacy systems while taking advantage of the large volume of unstructured data their supply chains generate.
Secure, Solid … Slow
People are familiar with data warehouses and know how to work with them. Over the past two decades, organizations have invested in building on-premises data warehouses. This is no small investment: beyond servers and the people who support the system, it requires physical facilities, energy, front-end systems, integration work and so on.
Because data warehouses are generally on-site and built with mature technology, they tend to be fairly secure, which is a solid benefit.
Data warehouses provide a snapshot in time through basic reporting of structured data: clearly defined data types whose pattern makes them easily searchable. They are a simple solution for reporting, or for calculations over a huge amount of data. They support descriptive analysis, housing some current information alongside historical data.
Yet the data is often out of date by the time it reaches the analysis, and it’s impossible to glean real-time insights from a data warehouse when reports can take days to generate. Further, it’s time-consuming and difficult to write the new queries needed to collect new information, or information that was previously judged unnecessary.
When you begin a data warehouse implementation, the end goal must be articulated up front. If KPIs aren’t defined in advance, or if business goals or market conditions change, you’re back to square one: figuring out how to aggregate the data from the source files, then building a data structure in the warehouse to support the way the KPI is reported. Data lakes require neither that pre-work nor the burden of writing an entirely new query to get to the data.
Be a Supply Chain Hero
On the other hand, a data lake provides more complete information for analysis and facilitates the discovery of trends. Unconstrained by the fixed structure of a relational database, the data lake collates and aggregates all of an organization’s available data sources: not just structured data, but also unstructured data such as documents, emails and social media engagements.
It’s easier to run reports with this diverse information. Take supplier scorecards as an example. Building one means gathering data on the supplier’s financials, locations, the orders they’ve been given and their track record of on-time delivery. After developing the scorecard, you may later decide that the supplier’s non-conformance (the number of defective parts they deliver) should also count toward the overall score. In a data lake, that data is quick and easy to access.
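As a rough illustration of the scorecard scenario, the sketch below keeps raw supplier records as loosely structured JSON lines and computes the metrics at read time. Everything in it is hypothetical: the supplier names, the field names (`on_time`, `parts`, `defective`) and the values are invented, and treating a missing `defective` field as zero is an assumption made for the example, not a recommendation.

```python
import json
from collections import defaultdict

# Hypothetical raw records as they might land in a data lake: one JSON object
# per line, with no schema enforced up front. All names and values are invented.
RAW_RECORDS = """\
{"supplier": "Acme", "order_id": 1, "on_time": true, "parts": 100, "defective": 2}
{"supplier": "Acme", "order_id": 2, "on_time": false, "parts": 50, "defective": 1}
{"supplier": "Bolt", "order_id": 3, "on_time": true, "parts": 200}
"""


def load_records(raw):
    """Parse each non-empty line as one JSON record."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def scorecard(records):
    """Build a per-supplier scorecard from the raw records."""
    by_supplier = defaultdict(list)
    for record in records:
        by_supplier[record["supplier"]].append(record)

    cards = {}
    for supplier, orders in by_supplier.items():
        parts = sum(o["parts"] for o in orders)
        # Assumption: a record with no "defective" field counts as zero defects.
        defective = sum(o.get("defective", 0) for o in orders)
        cards[supplier] = {
            "orders": len(orders),
            "on_time_rate": sum(o["on_time"] for o in orders) / len(orders),
            # Non-conformance metric added after the first version of the
            # scorecard; the stored records did not have to be reshaped.
            "defect_rate": defective / parts,
        }
    return cards


if __name__ == "__main__":
    for name, card in scorecard(load_records(RAW_RECORDS)).items():
        print(name, card)
```

The point of the sketch is that the non-conformance metric is one extra line in the read-time calculation, because the raw records were kept whole rather than trimmed to fit a predefined report.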
A data lake allows organizations not only to see what happened yesterday and what’s going on now, but also to employ predictive analytics to forecast what will happen tomorrow. The real edge of a data lake goes beyond using machine learning to inform decisions: prescriptive analytics can go a step further and identify the optimal decision.
This has the potential to make supply chain executives heroes in their organizations by avoiding disruptions, minimizing program launch delays, highlighting opportunities for cost reductions, flagging anticipated cost increases and driving accurate, on-time shipments.
The data lake itself may reduce costs while allowing more agility and scalability. Because it most often lives in the cloud, it requires no on-site hardware and only limited resources and support. Because the data lake is still a new technology, and one that holds on to every piece of data that might ever need to be accessed, developers must define its security parameters themselves.
Further, there is a perception that the data lake has more vulnerability or lacks some of the security features that are expected from a traditional data warehouse with preconfigured security. These concerns are being addressed, and security features are considered to be integral to the value of data lake solutions.
A Hybrid Solution
The flexibility and analysis potential of a data lake are hard to argue against, but what about the hundreds of thousands of dollars and hours that have been invested in a data warehouse? For established organizations, a combination solution is probably the best way forward. A hybrid will help companies with existing data warehouses to access and run complex, standardized reports on the data stored within them for historical analysis. Increasingly, those companies will leverage the technology of the data lake for real-time analysis of the full spectrum of their structured and unstructured data.
The hybrid solution isn’t as simple as bolting on a data lake, though. These systems will probably hit the books differently: A data warehouse is a CapEx (capital expense of assets and people) while a data lake is an OpEx (operating expense of the service) if it lives in the cloud. This may trigger a larger look at the budget and will most likely require the supply chain executive advocating the data lake to make a strong business case for it.
The data lake requires fewer resources to manage the hardware and software. But because, unlike a data warehouse, it does not require the analysis to be defined in advance, the organization may need to bring a business intelligence analyst on staff just to find the data hiding in the depths of the data lake.
Certainly, not all companies need to jump into a data lake. Some organizations see little need to spot trends or run predictive analytics because their KPIs and analytics are standard.
For example, a data warehouse is well-suited for day-to-day reports and indicators drawing on predefined information to provide a snapshot of where the business stands today. Companies that aren’t worried about tomorrow but need to know what was sold yesterday — how many items, in which stores, across what regions — are often served well by a data warehouse. They wouldn’t necessarily take advantage of the additional tools powered by a data lake unless the organization’s analysis requirements changed significantly.
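The kind of predefined, day-to-day report described here can be sketched with a small relational table and a fixed query. The example below uses Python's built-in `sqlite3` module as a stand-in for a warehouse; the `sales` table, its columns and the sample rows are all hypothetical and exist only to show the shape of the report.

```python
import sqlite3

# Hypothetical sales rows: (sale_date, region, store, items). Invented values.
ROWS = [
    ("2024-01-01", "East", "Store 1", 5),
    ("2024-01-01", "East", "Store 1", 3),
    ("2024-01-01", "West", "Store 7", 4),
    ("2024-01-02", "East", "Store 1", 9),
]


def sales_for_day(rows, day):
    """Return items sold per region and store for a single day."""
    conn = sqlite3.connect(":memory:")
    try:
        # The structure is fixed in advance, as it would be in a warehouse.
        conn.execute(
            "CREATE TABLE sales ("
            "sale_date TEXT, region TEXT, store TEXT, items INTEGER)"
        )
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
        return conn.execute(
            "SELECT region, store, SUM(items) FROM sales "
            "WHERE sale_date = ? "
            "GROUP BY region, store "
            "ORDER BY region, store",
            (day,),
        ).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for region, store, items in sales_for_day(ROWS, "2024-01-01"):
        print(region, store, items)
```

When the questions stay this stable (how many items, in which stores, across what regions), a fixed table and a fixed query are all the report needs.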
A data lake solution is faster and less costly to initiate. With data warehouses, the company must buy the servers and allocate personnel for two to three months of setup, and it can easily be four months before the first report is generated. Stakeholders may then determine that the data warehouse can’t give them what they want, particularly in today’s agile business environment. Based in the cloud, data lakes can begin delivering almost immediately.
Companies can start small and scale easily; if the data lake isn’t giving them what they need, they can simply stop the service and move on to another solution. Data lakes are agile.
This new technology requires a different attitude, a different way of thinking and working. The data lake is an abstract concept, and it can be hard to grasp; sometimes, people just don’t get it, or they may be skeptical of the benefit. As companies move more of their systems into the cloud and integrate more tightly with their suppliers’ systems, however, the data lake model is establishing itself as a prevailing technology, even if many organizations continue to support their traditional data warehouses.