SOA Data Strategy

The adoption of Service Oriented Architecture (SOA) promises to further decouple monolithic applications by decomposing business functions and processes into discrete services. While this makes enterprise computing assets more accessible and reusable, SOA implementation patterns are primarily an iteration over previous application development models. Like most application development evolutions, SOA approaches inject more layers and flexibility into the application tier, but have often neglected the most fundamental building block of all applications: the underlying data.

Current Data Environment of Most IT Organizations
The condition of a typical organization's data environment is usually not where it needs to be before the organization can begin an SOA transformation - from an enterprise perspective, there's often a lack of authoritative sources and a wide array of technologies used for storing and processing data. Generally, there's no single system that offers a complete view of the organization's core business objects, since most large IT organizations have their core enterprise data spread out and replicated across multiple stove-piped systems. Each system in an enterprise often maintains data within its specific context rather than the context of the enterprise. Data quality and interoperability issues abound, especially when data-consuming systems access a variety of data-producing systems, each of which maintains an isolated view of enterprise data. These differences lead to inconsistent and inaccurate views of the business processes. Figure 1 illustrates these data access and management challenges impacting SOA transition initiatives.

An SOA transformation amplifies and exacerbates an organization's existing data problems. Because of the integrated nature of SOA-based applications, an organization will be building on top of a very weak foundation unless it first addresses the issues with its current data environment. This is, in many ways, analogous to constructing a high-rise building on top of a landfill.

Consider the lack of authoritative enterprise sources as an illustrative example. Suppose that in an organization's supply chain systems' portfolio there are five systems that hold supplier information internally. Each of these can be considered a legitimate source of supplier data within the owning department. When building a service to share supplier data, where should the source of supplier data be?

Each of these solutions has its pros and cons; there's no right or wrong approach. The point is that these data issues must be resolved before an implementation team can proceed. By the time the implementation team takes over and begins building the requisite services and infrastructure, these kinds of questions should already have been answered by the organization at the business level. Unless they are, these data issues will often persist and undermine the benefits of creating services that share data. In other words, a service may end up sharing an incomplete set of data or, worse, exhibit incorrect behavior because it's not working with the "right" data.

Target Vision of Data Environment in an SOA
The way an organization thinks about applications and data must evolve - it must stop thinking about data as a second-class citizen that only supports specific applications and begin to recognize data as a standalone asset that has both value and utility. Organizations should establish their data environments with "hubs of specific data families" that expose data services that comply with industry standards and service contracts. The goal is to create a set of services that becomes the authoritative way to access enterprise data. In this target service-oriented environment, applications and data work together as peers. Thus, both an organization's business functionality and data can be leveraged as enterprise assets that are reusable across multiple departments and lines of business. This target vision, illustrated in Figure 2, enables the following desired characteristics of the enterprise's data environment:

SOA Data Strategy
A comprehensive strategy that defines how the enterprise's data should be managed in an SOA environment is needed to achieve the target vision. This strategy addresses issues such as data governance, data modeling from an enterprise SOA perspective, data quality, security, and technology solutions such as data services.

Data Governance
Governance is often cited as an important part of SOA. However, this generally refers to the governance of services and not the data shared through services. Just as proper governance of services is critical to an SOA, proper governance of the data is equally, if not more, important. Many of the problems associated with an organization's data environment can't be solved through technology solutions alone. Decisions and policies must be issued at the organizational level that can then be implemented through the technology. For example, the absence of an enterprise data ownership concept is a classic data governance issue. Different divisions in an organization control the data within their own system boundaries. They can make changes to that data as they see fit, and these changes can ripple across other divisions and ultimately impact the interoperability of the enterprise as a whole.

Without a definition of enterprise ownership and stewardship of the data, controlling such changes is difficult. So an SOA data strategy should include establishing an enterprise data management function as the data governance mechanism. A centralized management function is needed to treat data as an enterprise asset instead of as the assets of individual departments. The group responsible for this function addresses data issues and establishes policies and processes that cut across multiple departments. The responsibilities of such a group should include:

Enterprise Data Models
To realize the target data environment, some agreement is needed about which core data elements and structural business rules are represented by those services accessing them. While it's possible to implement services on top of the current data sources by leveraging the existing data models in those systems, this is not optimal. Such an approach will continue to proliferate non-authoritative data sources, each with its own model designed to support specific needs without enterprise-level consistency. When creating the enterprise data models, an organization must shift away from modeling the data from a systems-only perspective. In other words, the organization must look at the data families themselves and focus less on the details of the specific applications that are using them.

How does an organization make this shift? First, it must decide on its "core" data families, which are sometimes also referred to as "master" data. Core data is relatively easy to deduce, given a general understanding of the key business processes. For example, the "supplier" data family in a supply chain business could be considered core data. While it may be tempting to model every core data family in full detail, it may be wiser to identify "a good first set" and begin with that. A good approach is to simply tackle the obvious core data families first, learn from the experience, and then apply those lessons learned to model the rest of the data.

Next, the enterprise must decide which data elements are strategic for operational, reporting, and accountability purposes, and which are relevant only to one or several subsets of the business. Common strategic data should be thought of as the subset of fields that any application that uses this data family can fully support. All the other data attributes should be considered "optional," even if they're critical to certain applications. In the supply chain example, "supplier performance" isn't part of the core data but it may be critical to one or two systems in the organization. Since the enterprise data model is applicable to the entire organization, the standard would always expect the core data to be provided. It should also give each system the flexibility to be extended with additional data that's relevant to its own purpose.
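The split between required core data and optional, system-specific data can be sketched in code. The following is a minimal illustration, not a prescribed model: the `SupplierRecord` class, its field names, and the `extensions` mapping are all hypothetical and chosen only to mirror the supply chain example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Hypothetical list of core (strategic) fields for the "supplier" data family.
CORE_FIELDS = ("supplier_id", "legal_name", "country")


@dataclass
class SupplierRecord:
    """Enterprise view of a supplier: required core fields plus optional extensions."""

    # Core fields: every system that uses this data family is expected to provide them.
    supplier_id: str
    legal_name: str
    country: str
    # Optional attributes that matter only to some systems, keyed by attribute name.
    extensions: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject records whose core fields are missing or blank.
        for name in CORE_FIELDS:
            value = getattr(self, name)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"core field {name!r} must be a non-empty string")


# One system attaches "supplier_performance" without changing the core model.
record = SupplierRecord(
    supplier_id="S-1001",
    legal_name="Acme Components",
    country="US",
    extensions={"supplier_performance": 0.97},
)
```

Keeping the extensions in a separate mapping is one possible design; it lets the core contract stay stable while individual systems add their own attributes.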

Technical Considerations
SOA implementations usually exhibit decentralized federated topologies. So the ability to merge data properly and enable authorized enterprise-wide access is necessary to ensure that information can be leveraged to support enterprise objectives. Enabling these capabilities presents numerous challenges in the areas of data quality, security, and the data services architecture. The next sections describe some of these challenges and provide recommendations for addressing them.

Data Quality
SOA initiatives often focus on the implications of connecting disparate systems. A fundamental concern of implementing such connections is how to ensure that the data exchanged is accurate, meaningful, and understandable by all participating parties. Users, consuming services, and data sources all operate as peers in an SOA. These peers will often use data in new and unanticipated ways. So it becomes increasingly difficult to serve meaningful information without normalizing existing data assets. This includes not just schematic normalization, but instance-level de-confliction as well.

To this end, data quality studies are paramount in ensuring the success of an SOA implementation. Typically this includes understanding what data is available, where this data is located, and what state it's in:


Resolving Data Conflicts
Besides these scoping and profiling exercises to manage data quality, it's also imperative to resolve value-level conflicts that exist in the data. These conflicts can be categorized into three major types (C.H. Goh, "Representing and Reasoning about Semantic Conflicts in Heterogeneous Information Systems," Sloan School of Management, Massachusetts Institute of Technology, 16-22, January 1997):

  • Structural and Formatting Conflicts: Conflicts in the formats of the data values and schemas used for structuring and organizing the data. Some examples of structural and formatting conflicts include type conflicts in which different data types are used to represent the same element. For example, customer ID is stored as a double in one system and as a string in another system. Another example is labeling conflicts where similar concepts are labeled differently such as "supplier" versus "vendor."
  • Semantic Conflicts: Conflicts in how the meanings of certain data values are interpreted. Examples of semantic conflicts include naming conflicts, in which the same concept is expressed with different values. This is similar to the labeling conflict but occurs in the data value, whereas with labeling, the conflict is in the label on the data structure (metadata). The significance of this difference is that with the semantic naming conflict, detection and resolution may be more difficult, and the detection and resolution mechanism has to be applied multiple times over the entire set of values.
  • Intensional conflicts: Conflicts arising when consumer assumptions and expectations of data content differ from those of data producers. These conflicts are prevalent when structural representations are identical but the data domains that are encapsulated in these structures vary with the data producers. Intensional conflicts often arise when varying producers have fundamentally different conceptions of integrity constraints between related entities: cardinality, nillability, or uniqueness.
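As a rough illustration of the first two conflict types, the sketch below normalizes a record from a source system. The mapping tables, field names, and the `normalize_record` function are hypothetical; a real enterprise would derive such mappings from its own data profiling.

```python
from typing import Any, Dict

# Labeling conflict (metadata): different labels for the same concept.
# Hypothetical mapping from source labels to enterprise labels.
LABEL_MAP = {"vendor": "supplier", "vendor_id": "supplier_id", "cust_id": "customer_id"}

# Semantic naming conflict (data values): different values for the same concept.
# Hypothetical mapping from source values to canonical values, per field.
VALUE_MAP = {"country": {"USA": "US", "United States": "US", "U.S.": "US"}}


def normalize_id(value: Any) -> str:
    """Type conflict: render an ID stored as a double or a string as a canonical string."""
    if isinstance(value, float):
        if not value.is_integer():
            raise ValueError(f"ID {value!r} is not a whole number")
        return str(int(value))
    return str(value).strip()


def normalize_record(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Apply label, type, and value normalization to one source record."""
    normalized: Dict[str, Any] = {}
    for label, value in raw.items():
        label = LABEL_MAP.get(label, label)          # resolve labeling conflicts
        if label.endswith("_id"):
            value = normalize_id(value)              # resolve type conflicts
        elif label in VALUE_MAP:
            value = VALUE_MAP[label].get(value, value)  # resolve naming conflicts
        normalized[label] = value
    return normalized


print(normalize_record({"cust_id": 10042.0, "country": "USA"}))
# prints {'customer_id': '10042', 'country': 'US'}
```

Note that the label mapping is applied once per field, while the value mapping must be applied to every value, which reflects why value-level conflicts are costlier to resolve. Intensional conflicts are not covered here because they depend on producer-specific integrity assumptions.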

    These data conflicts can often be addressed by using commercial data management tools and methodologies, as well as enterprise data modeling software. Another emerging possibility is semantics-centric modeling environments. Instead of hard-coding data cleansing routines, these tools use a semantic description of the enterprise - the business concepts and relationships between those concepts, as well as any business rules governing the relationships - and provide a mechanism to describe how legacy systems support the semantics of the enterprise. This useful abstraction lets the enterprise deterministically identify how each enterprise data asset supports the enterprise business functions, as well as any gaps between the enterprise semantic model and the underlying data representation schemes. This modeling approach can then be used to determine where physical data conflicts or duplications may exist, as well as forward engineer data consolidation and cleansing scripts.

    Data Access Controls
    In traditional application architectures, data access security is typically governed by application-specific mechanisms. In this environment, each source has its own set of users, roles, and access control policies, which means that user profiles, roles, and access control policies lack consistency across the enterprise. An SOA environment magnifies this problem by making data sources visible across the organization. So it becomes increasingly important to move away from individual application-specific and data source-specific mechanisms in favor of enterprise-level SOA identity management and access control mechanisms.

    This means that when creating the central data services layer, the data sources must rely on central provisioning of some security functions so they can be managed centrally. The challenge is in finding the right balance between the security functions that should be managed centrally and what should be managed as part of the data sources. There are several options in implementing such a scheme, including a centrally managed data security layer, or using layered authorization through multiple policy decision points (PDP).

    With the central management option, the data sources relinquish security and rely solely on the data services to protect the access to their data. Within each data source, a single user profile is created for the data service that has full access to the data. Any request to the data through this service is authorized through this user profile. So there's no longer a concern about whether the principal's identity from the overarching security domain exists or means anything in the data source. However, this option pushes security checks into the data service layer and reduces the granularity of accountability. As a consequence, any access control policies from the data source along with the associated roles and privileges should now be re-created and maintained at the central enterprise points.

    In contrast, layering the use of multiple policy decision points encourages the reuse of existing authorization capabilities, user profiles, and access control policies of the underlying data sources. This approach allows some of the more fine-grained access control decisions to be made at the data sources rather than elevating them into the enterprise layer. Although many variations exist for this design, the premise is that different layers of authorization with multiple PDPs are making the decisions. The basic flow of this approach is as follows: Authentication still occurs at the edge using enterprise authentication services. Requests for data originate at different security domains in the enterprise. A PDP in each of these domains evaluates requests for resources in that domain. When a data service is invoked it calls the enterprise policy decision point to authorize access to the data service as well as the specific operation requested. The data service then delegates the decision to each data source so they can authorize access to their specific data object(s). Thus, coarse-grained decisions are made at the enterprise level while finer-grained decisions use data source-specific profiles and policies that aren't exposed to the enterprise.
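The layered flow described above can be sketched as two kinds of decision points. This is a simplified illustration under several assumptions: the class names, the role-based policies, and the field-level grants are all hypothetical, and authentication is assumed to have already happened at the edge.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set, Tuple


@dataclass(frozen=True)
class Request:
    roles: FrozenSet[str]  # roles established by enterprise authentication
    service: str
    operation: str


class EnterprisePDP:
    """Coarse-grained decision: may these roles invoke this service operation?"""

    def __init__(self, policy: Dict[Tuple[str, str], Set[str]]) -> None:
        self._policy = policy

    def permits(self, request: Request) -> bool:
        allowed = self._policy.get((request.service, request.operation), set())
        return bool(allowed & request.roles)


class DataSourcePDP:
    """Fine-grained decision kept at the data source: which fields may be read?"""

    def __init__(self, name: str, readable_fields: Dict[str, Set[str]]) -> None:
        self.name = name
        self._readable_fields = readable_fields

    def readable(self, request: Request) -> Set[str]:
        fields: Set[str] = set()
        for role in request.roles:
            fields |= self._readable_fields.get(role, set())
        return fields


def authorize(request: Request, enterprise: EnterprisePDP,
              sources: List[DataSourcePDP]) -> Dict[str, Set[str]]:
    """Check the enterprise PDP first, then delegate to each data source's PDP."""
    if not enterprise.permits(request):
        raise PermissionError(f"{request.operation} on {request.service} denied")
    return {source.name: source.readable(request) for source in sources}


enterprise = EnterprisePDP({("SupplierService", "read"): {"buyer", "auditor"}})
erp = DataSourcePDP("erp", {"buyer": {"name", "address"},
                            "auditor": {"name", "address", "tax_id"}})
crm = DataSourcePDP("crm", {"buyer": {"contact"}})

granted = authorize(Request(frozenset({"buyer"}), "SupplierService", "read"),
                    enterprise, [erp, crm])
```

In this sketch the data sources key their policies on enterprise roles for brevity; in practice each source would map the caller to its own profiles, which stay hidden from the enterprise layer.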

    Data Services Architecture
    From an architectural perspective, the heart of this solution is an enterprise layer that logically centralizes access to the data spread across the enterprise. This set of logically centralized data services provides several architectural advantages. First, the enterprise can assert greater control over the governance and implementation of data access mechanisms. Second, clients use a consistent mechanism to access data. Third, the enterprise can design and implement a solution in a holistic fashion instead of the typical one-off models that are the norm in data integration. Finally, besides the basic Create, Read, Update, and Delete (CRUD) operations, the underlying architecture must also support data aggregation, inter-service transactions, and multiple access and usage patterns, all while ensuring acceptable levels of quality of service.

    Data Aggregation Scenarios
    This data services layer acts as a façade over the enterprise assets - it logically provides access to enterprise data assets in a singular manner, while physically dispatching requests and aggregations across relevant co-located assets. Three main scenarios should be considered for data aggregation:

  • The unified view of a data entity is defined by combining attributes from multiple sources. The actual data of that view is also obtained by combining data from multiple sources. The main difficulty with this aggregation scenario is linking related data from multiple systems that may not share unique identifiers. This often requires the creation of a cross-reference table to link related records.
  • The unified view of an entity is derived from the model of a single source. However, the actual data is obtained from multiple sources with different models. The main difficulty here is de-duplication - tapping multiple systems to get a complete set of instance data can result in multiple instance records about the same thing. In this case, once duplicates are identified, which one survives to become the "golden copy"? In this model, identification and use of authoritative sources becomes important.
  • The unified view of an entity is partitioned across multiple instances of a single model. Data distribution can be the result of planned partitioning or just the ad hoc use of the same source system across multiple departments resulting in multiple instances. In case of planned partitioning, the partitioning schema can be used to optimize the performance of the data access layer, while in the case of ad hoc distribution duplicates are a problem and should be addressed through the use of authoritative data sources.
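The cross-reference table and the "golden copy" question from these scenarios can be illustrated together. The sketch below is an assumption-laden example: the `XREF` table, the source names, and the rule that the most authoritative source wins per attribute are hypothetical choices, not the only possible survivorship policy.

```python
from typing import Dict, List, Tuple

# Hypothetical cross-reference table: (source system, local key) -> enterprise ID.
XREF: Dict[Tuple[str, str], str] = {
    ("erp", "V-17"): "SUP-001",
    ("crm", "8841"): "SUP-001",
    ("erp", "V-22"): "SUP-002",
}

# Hypothetical ranking of authoritative sources, most trusted first.
SOURCE_PRIORITY = ["erp", "crm"]

SourceRow = Tuple[str, str, Dict[str, str]]  # (source, local key, attributes)


def golden_copies(rows: List[SourceRow]) -> Dict[str, Dict[str, str]]:
    """Merge source rows into one record per enterprise ID."""
    rank = {name: position for position, name in enumerate(SOURCE_PRIORITY)}
    merged: Dict[str, Dict[str, str]] = {}
    # Visit the least trusted sources first so more trusted values overwrite them.
    ordered = sorted(rows, key=lambda row: rank.get(row[0], len(rank)), reverse=True)
    for source, local_key, attributes in ordered:
        enterprise_id = XREF.get((source, local_key))
        if enterprise_id is None:
            continue  # unlinked record; a real system would route it to a data steward
        merged.setdefault(enterprise_id, {}).update(attributes)
    return merged


rows = [
    ("erp", "V-17", {"name": "Acme Corp"}),
    ("crm", "8841", {"name": "ACME", "phone": "555-0100"}),
]
print(golden_copies(rows))
# prints {'SUP-001': {'name': 'Acme Corp', 'phone': '555-0100'}}
```

Here the conflicting `name` is taken from the more authoritative source, while `phone`, which only one source holds, is carried into the unified view.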

    Some of these aggregation capabilities can be supported through Enterprise Information Integration (EII) technology, which provides SOA-centric capabilities for accessing and querying co-located data in real-time. EII products provide adapters to legacy data sources and expose their underlying data in a service-oriented fashion. EII is best used in discrete query-based mechanisms where data volumes are moderate. EII isn't meant to be a replacement for traditional ETL (extract, transform, load), EAI (enterprise application integration), or MDM (master data management) technologies. For example, some of the aggregation scenarios requiring de-duplication capabilities can require the use of MDM technologies.

    The data services layer allows creates and updates to be requested once by a client and then decomposed by the supporting architecture into individual write commands to targeted data sources. Therefore, the architecture must support transactionality - ensuring that writes are consistent so that underlying data across all affected data sources are left in a consistent state. This isn't significantly different from current data integration pains. However, most systems today requiring multi-write transaction capabilities leverage the XA standards. Similar standards for the Web Services environment are only starting to emerge. OASIS has recently formed a Web Services Transaction Technical Committee (WS-TX TC) responsible for stewarding WS-AtomicTransaction, WS-Coordination, and WS-BusinessActivity specifications through the standardization process. None of these standards have been ratified yet. Because these specifications are still being developed, most SOA-related transaction support is being custom-developed, typically through the use of homegrown compensation mechanisms - effectively an "undoing" of a previously executed service invocation. Instead of providing true rollback semantics, compensation is an additional service invocation that rewrites data to its original state. While it may be beneficial to take a wait-and-see approach to building transactionality, solutions aligned with the three specifications seeding WS-TX deliberations will likely provide the path of least resistance to standards compliance.
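A homegrown compensation mechanism of the kind described above might look like the following sketch. It is not an implementation of WS-AtomicTransaction or any other specification; the `run_with_compensation` function, the `update` helper, and the in-memory dictionaries standing in for data sources are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[Callable[[], None], Callable[[], None]]  # (action, compensation)


def run_with_compensation(steps: List[Step]) -> bool:
    """Run each action in order; on failure, compensate completed steps in reverse."""
    completed: List[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # No true rollback: each compensation is a further call
            # that rewrites the data to its earlier state.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True


def update(store: Dict[str, str], key: str, value: str) -> Step:
    """Build a write to one data source together with its compensating write."""
    saved: Dict[str, str] = {}

    def action() -> None:
        saved["old"] = store[key]
        store[key] = value

    def compensate() -> None:
        store[key] = saved["old"]

    return action, compensate


def failing_write() -> None:
    raise RuntimeError("second data source unavailable")


erp = {"S-1001": "Acme"}
ok = run_with_compensation([
    update(erp, "S-1001", "Acme Corp"),
    (failing_write, lambda: None),
])
print(ok, erp)
# prints False {'S-1001': 'Acme'}
```

Unlike an XA-style rollback, other readers could observe the intermediate value between the first write and its compensation, which is one reason compensation is weaker than true transactional isolation.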

    Quality of Service
    With all the data access operations going through this data services layer, a major concern is the potential bottleneck at this layer that may limit scalability. The obvious way to resolve this problem is to create a clustered environment with multiple instances of this data services layer.

    The complexities of clustering depend on whether the enterprise is using a purely federated approach or has some level of data replication. If using a purely federated approach, then it can be simple to have a cluster with multiple instances. However, the architecture must still address the issue of affinity for a particular instance - especially in the case of inter-service transactions. The architecture must address questions such as: Are all operations that are part of a transaction forced to go to the same data service instance? Can different operations that use different data service instances still be part of the transaction?

    A simple solution is to require all operations in a single transaction to interact with a single service instance. However, this solution isn't without its disadvantages since it can affect how well the load is distributed across the cluster. With some replication, clustering becomes more difficult. In addition to the server affinity issue, the architecture must include a partitioning strategy. This strategy answers questions such as: Do all instances of the data services allow access to all the data? Or are data services partitioned so that only certain instances allow access to certain data?
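One way to enforce the single-instance rule is to derive the target instance from the transaction identifier, so that every operation in a transaction is routed to the same place. The sketch below assumes a fixed list of instance names and a hypothetical `instance_for` helper; it does not address partitioning or changes in cluster membership.

```python
import hashlib
from typing import List


def instance_for(transaction_id: str, instances: List[str]) -> str:
    """Pick a data service instance deterministically from the transaction ID."""
    if not instances:
        raise ValueError("at least one data service instance is required")
    digest = hashlib.sha256(transaction_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(instances)
    return instances[index]


# Hypothetical cluster of three data service instances.
cluster = ["data-svc-1", "data-svc-2", "data-svc-3"]

# Every operation carrying the same transaction ID lands on the same instance.
first = instance_for("tx-42", cluster)
second = instance_for("tx-42", cluster)
```

This illustrates the trade-off noted above: routing by transaction keeps state in one place, but a few long or heavy transactions can skew the load across the cluster.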

    Data Access and Usage Patterns
    It's important to note that different applications have different data access and usage patterns. Some applications can produce many transactions but access only a small amount of data in each transaction. For other applications, the transaction throughput can be small but the volume of data that's accessed very large. The way to tune data source performance for these patterns is very different. When using a data services solution to provide centralized access to enterprise data sources, the enterprise must accommodate all the various access and usage patterns of the applications that will be integrated with this solution. Tuning the infrastructure to support a single application's performance requirements is complicated; tuning it to adequately support multiple patterns of use and access is even more difficult. Often, there will be conflicting configurations - something that optimizes the performance of one application will degrade the performance of another. The enterprise should analyze and model the access and use patterns of the applications that will be using the data services and ensure that well-defined performance criteria for each scenario have been developed. Additionally, enough time should be planned for testing the performance of a particular solution with simulations that reflect the access and usage patterns that are common to the enterprise environment.

    Summary
    Harmonizing data assets has always been a challenging problem; the problems and urgency are further exacerbated when migrating to an SOA. Developing a strategy for handling this kind of transition is essential to properly enabling data access in an enterprise SOA environment. By developing appropriate requirements and use cases and by analyzing data assets and data usage, organizations can better understand the breadth and depth of their data integration issues and begin to take steps to address them. Ultimately, every organization must develop a strategy tailored to its specific needs, but the overall approach described in this article provides guidance in understanding what types of questions should be asked and how to leverage possible technology solutions to address the resulting issues that are identified. This guidance will enable organizations to fully leverage and exploit their most important strategic asset: their data.

  • © 2008 SYS-CON Media Inc.