The Crumbling Foundations of the Data Warehouse

There is no debating the fact that the typical data warehouse is growing rapidly - by as much as 65 percent per year, according to some estimates. But what is the impact of this growth on organizations and their IT infrastructures? How do the roles and expectations of users affect the situation - do they hamper data warehouse performance? Are the architectures adopted in the 1990s able to accommodate this century's information demands? A recent study conducted by Dynamic Markets Limited, tellingly entitled The Crumbling Foundations of the Data Warehouse, offers some clues.

In a survey conducted in October 2004, data warehouse managers in the UK's top 2,000 companies indicated that they were feeling pressures directly related to the phenomenal growth of the data they were responsible for managing. Of this group, 88 percent indicated that their original data warehouse architecture would have to change significantly to accommodate current business demands. A full 97 percent of those surveyed indicated that their users were becoming more demanding, and that key business-critical requirements had the greatest impact. The problems these demands created were not isolated instances: 62 percent of respondents said that the volume and complexity of queries were having negative effects on a day-to-day basis. Increased demands were concentrated in the following areas:

- Speed: 62 percent
- Complexity: 51 percent
- Time data must be kept: 41 percent

Forty-four percent of those surveyed acknowledged a direct correlation between the amount of data in the warehouse and the cost of maintaining it, in terms of both hardware resources and skilled human resources. Sixty-two percent said that data volume and query complexity constrained the performance of the warehouse, or taxed their warehouse team, on at least a daily basis, with 11 percent saying this was true all the time. Alarmingly, 86 percent admitted that they managed performance in part by restricting user queries, a strategy that undermines the very mission of the organization's data warehouse program.

These figures represent only the responses related to operational growth. The formidable challenges posed by regulatory and compliance issues add another dimension to the potential impact of data growth. While most managers surveyed said they did not have compliance projects in place, 64 percent said their jobs would be in jeopardy if they were not able to support compliance demands.
Reflecting these concerns, 39 percent said their policy is now to keep all historical data in the warehouse rather than discard it. A notable disconnect in expectations emerged on the issue of old data, which is often critical for historical comparison of key performance indicators: 32 percent of those surveyed acknowledged that queries on older, less frequently used data are significantly more expensive in terms of the resources required to assist users, and 38 percent thought that users did not understand how much time was required to carry out queries on older data. Finally, 62 percent of respondents said they expected to allocate more manpower to the warehouse over the next three to five years - on average, a 15 percent increase was anticipated, with some expecting to require as much as 70 percent more resources.

Surely there must be a more manageable and economical way to deal with the explosion of data and user demands? It turns out that there is another option - if we simply steal some secrets from the content management and application management arena. After all, the data kept for purposes of business intelligence represents only about 20 - 40 percent of the data that an organization has to manage. The other 60 - 80 percent is either unstructured or semi-structured data, such as e-mail, or in-production structured data associated with transaction applications. In these areas, information life cycle management (ILM) has already been embraced as an approach to managing storage costs and enhancing performance for users.

The basic principle of an ILM strategy is to match the availability of data - and therefore the resources it is assigned - with its current value to the organization. High-usage, high-value data is assigned the best online infrastructure to ensure high levels of user satisfaction. Data of potential value is kept near-line so that it can be restored online if and when it is required.
Finally, data that is no longer an active part of the application, but which will be useful in the future - for compliance purposes, for example - is kept offline in an environment where it can still service users, but where its availability does not negatively impact the performance of the primary application.

Why should this discipline be applied to data warehousing? According to 48 percent of respondents in the Dynamic Markets survey, only 20 percent of the data in the warehouse is used regularly, while the other 80 percent is there just in case. Since the data warehouse as we know it takes a one-size-fits-all approach, this just-in-case data is as costly to the organization as the data that is actively used. Keeping it available in the warehouse requires resources and, more importantly, compromises the warehouse's ability to expediently service users of the high-value, high-use data. A big warehouse equals a slow warehouse - which in turn equals unhappy users.

Applying an ILM discipline to the data warehouse would involve the following actions:

- All current, actively used data would be kept online in the warehouse, with the highest possible level of availability, in order to expedite operational business intelligence.
- The just-in-case data - data that has a high value but is associated with specific BI applications that run only occasionally - would be kept near-line, with the possibility of being restored to the online warehouse as and when it is required.
- Data that is past its usefulness to the BI applications, but which has historical or audit-related value, would be kept offline in an accessible repository, where it would be available to a new class of users using simple query applications.

The online warehouse would be assigned the most powerful infrastructure, in order to ensure high service levels to regular users.
It would be indexed to support the persistent applications, and would probably also support dependent data marts for complex analytical applications. The ideal near-line warehouse would be able to store data very efficiently and would allow rapid restoration of data to the online environment, either on demand or according to a business schedule, without any requirement for additional indexing. The restored data would be used as required and then purged, so as not to bloat the online warehouse. Copies of this data would be maintained at all times in the efficient near-line archive. The offline archive would also offer efficient storage for keeping audit-quality, read-only records, and would be accessible to users via standard business intelligence tools and methods.

According to the Dynamic Markets survey, 94 percent of data warehouse managers recognize that the value of business data changes over time, but only 24 percent have strategies in place to address this aspect of the data warehouse, and only 18 percent tier their storage hierarchy according to the business value of the data. On the positive side, fully 38 percent say they may adopt this approach in the near future. The benefits of a tiered warehouse architecture are four-fold:

- Users of the primary data in the online warehouse can rely on consistently high business intelligence application performance.
- The organization is able to practice better business intelligence because it has an effective way of storing historical and other secondary data, thereby making available a broader and deeper view of the business than might currently be offered.
- The data warehouse is kept to a manageable size and user satisfaction is maintained without moving to specialized hardware platforms, thereby containing costs.
- The regulatory and compliance burden is effectively handled by a specialized archive, without burdening the operational warehouse.

The conclusions of the Dynamic Markets research are certainly alarming. With the increasing automation of business functions, organizations are continuing to generate more data. Masses of older data that once would have been discarded must now be stored in order to respond to compliance requirements. The need for managers and analysts to correctly identify business opportunities on the basis of their organizations' stored data becomes more and more urgent. There is no time like the present for taking steps to restructure the data warehouse infrastructure. Information life cycle management is already the gospel of other data-intensive areas. META Group believes that the database archiving market will be worth $2.7 billion in 2007. Perhaps it is time for the data warehousing industry as a whole to take a closer look at this efficient and economical approach to managing warehouse growth.
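
To make the tiering discipline described above concrete, here is a minimal sketch of how an ILM tier-assignment rule might look in code. All names and thresholds (a 90-day hot window, a two-year near-line window, ten queries per month as "actively used") are illustrative assumptions, not figures from the survey; in practice these values would be set by business and compliance requirements.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DataSet:
    """A unit of warehouse data tracked by the ILM policy (hypothetical model)."""
    name: str
    last_accessed: date       # most recent user query against this data
    queries_per_month: int    # observed usage frequency

# Illustrative thresholds - real values come from business and compliance rules.
HOT_AGE_DAYS = 90          # recently used data stays online
WARM_AGE_DAYS = 2 * 365    # near-line window: restorable on demand
HOT_QUERY_RATE = 10        # queries/month needed to count as actively used

def assign_tier(ds: DataSet, today: date) -> str:
    """Map a data set to 'online', 'near-line', or 'offline' storage."""
    age_days = (today - ds.last_accessed).days
    if age_days <= HOT_AGE_DAYS and ds.queries_per_month >= HOT_QUERY_RATE:
        return "online"      # best infrastructure, highest availability
    if age_days <= WARM_AGE_DAYS:
        return "near-line"   # efficient storage, restorable to the warehouse
    return "offline"         # audit-quality archive for compliance users

# Example classification run
today = date(2005, 1, 1)
for ds in (DataSet("current_sales", date(2004, 12, 15), 50),
           DataSet("seasonal_forecast", date(2004, 3, 1), 1),
           DataSet("fy1999_ledger", date(2001, 6, 30), 0)):
    print(ds.name, "->", assign_tier(ds, today))
```

A real implementation would drive data movement (archiving, restore, purge) from this classification on a schedule, but the core idea - matching each data set's storage tier to its current business value - is just a rule of this shape.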