Have your lake and swim in it too
Looking around the industry, it is clear that many organizations are focused on providing an enterprise-wide data access strategy, notably through the creation of a “data lake.” The data lake methodology focuses on copying all data from across the enterprise into a single, centralized repository. The data lake is usually hosted on a scalable clustered environment, from which data scientists can then consume data as needed.
On the surface, this approach seems simple, effective and comprehensive. However, there may be requirements which make data lakes less than ideal. For instance, this would be the case when there are changes to the information that the data represents and when those changes happen more often than scheduled extract-translate-load (ETL) operations.
The IBM Z brand has long been known for housing critical applications and the associated data. This is typically structured data stored in database systems or as data sets (structured files in the z/OS environment). With the enablement of a number of analytics and machine-learning software on z/OS and Linux on IBM Z, various use cases can be optimized locally on the IBM Z data. Not only that, but if the data on IBM Z is sensitive from a security or auditing perspective, that data can now remain within the IBM Z environment.
What does this enable? The data lake no longer has to be a one-size-fits-all solution, allowing data architects to provide more options for their data science team. Now machine learning use cases that primarily consume IBM Z data can be executed on data that is resident on that system. This means that it is possible that some data may not have to be moved into the data lake!
What are the implications? This will require your data scientists to think differently. They may have to do some of their analytics on another platform and then use the output of that as a feature for their IBM Z machine learning, in the event that more than just the IBM Z data is required to build a model. For example, social media data can be processed in the cloud, and the sentiment from that analysis can then be used to augment the IBM Z data. This has the benefit of keeping your systems’ data in-place, which means that access to it remains within the same auditing system, the physical encryption of that data on disk is not affected, and the permissions to that data continue to be enforced. This can be a huge benefit for data that is under scrutiny from a security perspective.
It is time to rethink the idea that everything needs to flow into a data lake. While there are many good use cases for data copying and aggregation, this is not a one-size-fits-all approach and there are legitimate cases where bringing the processing to the data can provide great benefits for an organization.