Saturday, January 29, 2011

Understanding Database Basics - Part 1


Real-time data analysis and business intelligence are essential in the modern business world for keeping track of the vast amounts of information a business collects and for making the best use of it. Modern data analysis systems are highly complex, able to slice and dice huge amounts of complex information and present it to the end user in a way that is both easy to understand and useful.
In fact, collecting this information, analyzing it, and making business decisions based on it is the key to success for many businesses. However, many end users do not understand the basics of even the simplest database, or how the information is organized and stored. So in this two-part article we look at the basics of databases and database management.
Basically, a database is a collection of organized information that is stored on a computer. This information can be about anything - people, flights, cars, you name it. A library catalog is a typical example of a working database, and we will continue to use it as an example in these articles.

A simple relational database contains three essential components - tables, records and fields. So, using the library example, you may have a table that contains the information on every book in the library. This table, let's call it books, will look like any table with rows and columns.
Each row in this table is known as a record and represents a unique item in the database, in this case a book. There is a lot of information to store about each book, and it is held in the columns, or fields; each record will therefore have at least one field. In this case, for each book the fields will include information like a book ID number, book title, author, publisher, publication date, whether it is checked out and so on.
However, there is more to keeping a library catalog than storing information about books. You will also need a table containing the details of library members. In this case, each record will represent a member and the fields will contain details like membership number, name, address, and so on.
There will probably be more tables as well. There may be a book categories table (fiction, non-fiction, autobiographies, reference, etc), a book status table (checked in, checked out, reserved, missing, etc), and many more tables.
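As a rough sketch only (the table and column names below are hypothetical, not taken from any particular library system), the books and members tables described above could be defined in SQL like this:

    -- One record (row) per book; each column is a field
    CREATE TABLE books (
        book_id          INTEGER PRIMARY KEY,  -- unique book ID number
        title            VARCHAR(200) NOT NULL,
        author           VARCHAR(100),
        publisher        VARCHAR(100),
        publication_date DATE,
        checked_out      CHAR(1)               -- 'Y' or 'N'
    );

    -- One record per library member
    CREATE TABLE members (
        member_id INTEGER PRIMARY KEY,         -- membership number
        name      VARCHAR(100) NOT NULL,
        address   VARCHAR(200)
    );

Every row inserted into books is one record, and each of its columns holds one field of information about that book.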
This is all essential information for a library. In the second part of this article we will look at how tables are linked to form a relational database, and how the end user can make sense of the huge amount of data stored in the database.

SOURCE:http://www.relativitycorp.com/data/article2.html
newer post

Cubes

0 comments

What is an OLAP Cube? As you saw in the definition of OLAP, the key requirement is multidimensional. OLAP achieves the multidimensional functionality by using a structure called a cube. The OLAP cube provides the multidimensional way to look at the data. The cube is comparable to a table in a relational database.
The specific design of an OLAP cube optimizes it for reporting. Many databases are designed for online transaction processing and efficiency in data storage, whereas an OLAP cube is designed for efficiency in data retrieval. In other words, OLAP cube data is stored in a way that makes reporting easy and efficient. A traditional relational database treats all data in a similar manner. However, OLAP cubes have categories of data called dimensions and measures. For now, a simple definition of dimensions and measures will suffice. A measure represents some fact (or number) such as cost or units of service. A dimension represents descriptive categories of data such as time or location.
The term cube comes from the geometric object and implies three dimensions, but in actual use, the cube may have more than three dimensions.
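Although an OLAP server stores a cube in its own multidimensional format, the distinction between dimensions and measures can be illustrated with an ordinary relational table; the names below are purely illustrative:

    -- Hypothetical fact table behind a cube:
    -- time_period, product and location are dimensions; units_sold is the measure
    CREATE TABLE sales_facts (
        time_period VARCHAR(10),   -- e.g. '2003', '2003 Q1'
        product     VARCHAR(50),   -- e.g. 'Wireless Mouse'
        location    VARCHAR(50),   -- e.g. 'Asia'
        units_sold  INTEGER        -- the fact (number) being analyzed, e.g. 174
    );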
The following illustration graphically represents the concept of an OLAP cube.


Figure 1: OLAP Cube

In figure 1, time, product, and location represent the dimensions of the cube, while 174 represents the measure. Recall that a dimension is a category of data and a measure is a fact or value.
Three important concepts associated with analyzing data using OLAP cubes and an OLAP reporting tool are slicing, dicing, and rotating.


Slicing
 

A slice is a subset of a multidimensional array corresponding to a single value for one or more members of the dimensions not in the subset. For example, if the member Actuals is selected from the Scenario dimension, then the sub-cube of all the remaining dimensions is the slice that is specified. The data omitted from this slice would be any data associated with the non-selected members of the Scenario dimension, for example Budget, Variance, Forecast, etc. From an end-user perspective, the term slice most often refers to a two-dimensional page selected from the cube.


Figure 2: Slicing-Wireless Mouse

Figure 2 illustrates slicing on the product Wireless Mouse. When you slice in this way, the result is the data for the Wireless Mouse across all years and locations. Stated another way, you have effectively filtered the data to display the measures associated with the Wireless Mouse product.


Figure 3: Slicing-Asia

Figure 3 illustrates slicing on the location Asia. When you slice in this way, the result is the data for Asia across all products and years. Stated another way, you have effectively filtered the data to display the measures associated with the Asia location.
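In relational terms, a slice is simply a filter on one dimension while the remaining dimensions stay free. Assuming the hypothetical sales_facts table sketched earlier, the two slices above might be expressed as:

    -- Slice on the product dimension (Figure 2): Wireless Mouse across all years and locations
    SELECT time_period, location, SUM(units_sold) AS units
    FROM sales_facts
    WHERE product = 'Wireless Mouse'
    GROUP BY time_period, location;

    -- Slice on the location dimension (Figure 3): Asia across all products and years
    SELECT time_period, product, SUM(units_sold) AS units
    FROM sales_facts
    WHERE location = 'Asia'
    GROUP BY time_period, product;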

Dicing
 

Dicing is an operation related to slicing. When you dice, you define a subcube of the original space; in this example, the data you see is that of a single cell from the cube. Dicing thus provides the smallest available slice.
Figure 4 provides a graphical representation of dicing.


Figure 4: Dicing
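Continuing with the same hypothetical sales_facts table, a dice fixes a member on every dimension, leaving a single cell of the cube:

    -- Dice down to one cell: one product, one location, one time period
    SELECT SUM(units_sold) AS units
    FROM sales_facts
    WHERE product     = 'Wireless Mouse'
      AND location    = 'Asia'
      AND time_period = '2003 Q1';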

Rotating
 

Rotating changes the dimensional orientation of the report built from the cube data. For example, rotating may consist of swapping the rows and columns, or moving one of the row dimensions into the column dimension, or swapping an off-spreadsheet dimension with one of the dimensions in the page display (either to become one of the new rows or columns), etc. You may also hear the term pivoting; rotating and pivoting are the same thing.


Figure 5: Rotating
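Rotation changes only the layout, not the data. A relational approximation, again using the hypothetical sales_facts table (the location members listed are illustrative), is conditional aggregation that moves locations from rows into columns:

    -- Pivot: time periods on rows, locations as columns instead of rows
    SELECT time_period,
           SUM(CASE WHEN location = 'Asia'          THEN units_sold ELSE 0 END) AS asia,
           SUM(CASE WHEN location = 'Europe'        THEN units_sold ELSE 0 END) AS europe,
           SUM(CASE WHEN location = 'North America' THEN units_sold ELSE 0 END) AS north_america
    FROM sales_facts
    GROUP BY time_period;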

Dimensions

Recall that a dimension represents descriptive categories of data such as time or location. In other words, dimensions are broad groupings of descriptive data about a major aspect of a business, such as dates, markets, or products. Much of the value of OLAP in reporting comes from having levels within the dimensions. Each dimension includes different levels of categories. Dimension levels allow you to view general things about your data and then look at the details of your data.
Think of the levels of categories as a hierarchy. For example, your OLAP cube could have a time dimension. The time dimension then could have year, quarter, and month as the levels, as in Figure 6. Another example is location as a dimension. For the location dimension, you could have region, country, and city as the levels (shown in Figure 6). An important concept in OLAP is drilling, which refers to the ability to drill up or drill down. These levels of categories (hierarchies) are what provide the ability to drill up or drill down on data in an OLAP cube. When you drill down on a dimension, you increase the level of detail at which you view the data. For example, you start with the year and view that data, but you want to see the data by each quarter of the year. You would drill down to see the quarterly data. You could drill down again to see the monthly data within a specific quarter.
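If the levels of the time hierarchy were stored as explicit columns (a hypothetical detail table, sales_facts_detail, with sales_year, sales_quarter and sales_month columns), drilling down is simply grouping at a finer level:

    -- Year level
    SELECT sales_year, SUM(units_sold) AS units
    FROM sales_facts_detail
    GROUP BY sales_year;

    -- Drill down to quarters within 2003
    SELECT sales_quarter, SUM(units_sold) AS units
    FROM sales_facts_detail
    WHERE sales_year = 2003
    GROUP BY sales_quarter;

    -- Drill down again to months within 2003 Q1
    SELECT sales_month, SUM(units_sold) AS units
    FROM sales_facts_detail
    WHERE sales_year = 2003 AND sales_quarter = 1
    GROUP BY sales_month;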



Figure 6: Dimensions

Categories

Dimensions have members or categories. A category is an item that matches a specific description or classification such as years in a time dimension. Categories can be at different levels of information within a dimension. You can group any category into a more general category. For instance, you can group a set of dates into a month, a set of months into quarters, and a set of quarters into years. In this example, years, quarters, and months are all categories of the time dimension.
Categories have parents and children. A parent category is the next higher level of another category in a drill-up path. For example, 2003 is the parent category of 2003 Q1, 2003 Q2, 2003 Q3, and 2003 Q4. A child category is the next lower level category in a drill-down path. For example, January is a child category of 2003 Q1.
Figure 7 below shows you the time, product, and location dimensions with the parent and children categories. In the time dimension, you see the parent Year Category and the children Quarter Category and Month Category.


Figure 7: Categories

Measures

The tutorial to this point has provided you with a definition of OLAP, a description of cubes, the meaning of dimensions, and a description of categories. The focus now is an explanation of measures.
The measures are the actual data values that occupy the cells defined by the dimensions selected. Measures include facts or variables, typically stored as numerical fields, which provide the focal point of investigation using OLAP. For instance, suppose you are a manufacturer of cellular phones. The question you want answered is how many xyz model cell phones (product dimension) a particular plant (location dimension) produced during the month of January 2003 (time dimension). Using OLAP, you find that Plant A produced 2,500 xyz model cell phones during January 2003. The measure in this example is the 2,500.
Additionally, measures occupy a confusing area of OLAP theory. Some believe that measures are like any other dimension. For example, one can think of a spreadsheet containing cell phones produced by month and plant as a two-dimensional picture, but the values (measures)—the cell phones produced—effectively form a third dimension. However, although this dimension does have members (e.g., actual production, forecasted production, planned production), it does not have its own hierarchy. It adopts the hierarchy of the dimension it is measuring, so production by month consolidates into production by quarter, and production by quarter consolidates into production by year.
The example in Figure 8 shows the measure Volume of Product. Each value in Figure 8 is the measure of Volume of Product per year listed by product. From Figure 8, ABC Company produced 84,000 modems in 2003.


Figure 8: Measures

Nesting

Nesting is a function in which you have more than one dimension or category in a row or column. Nesting provides a display that shows the results of a multi-dimensional query that returns a sub-cube, i.e., more than a two-dimensional slice or page. The column/row labels will display the extra dimensionality of the output by nesting the labels describing the members of each dimension.
Figure 9 illustrates nesting. In the top table, the nested dimensions are time (year) and location (countries) in the rows; by nesting in this way, you can see the volume of products for each North American country in 2003. In the bottom table, the nested dimensions are products (devices) and location (countries) in the columns; by nesting in this way, you can see the volume of products by quarter for each North American country for 2003 Q1 and Q2.
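Relationally, nesting corresponds to grouping by two dimensions at once so that the row labels carry both levels of description; a sketch against the hypothetical sales_facts_detail table (assuming it also carries a country column):

    -- Nested rows: year and country together, as in the top table of Figure 9
    SELECT sales_year, country, SUM(units_sold) AS units
    FROM sales_facts_detail
    GROUP BY sales_year, country
    ORDER BY sales_year, country;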


Figure 9: Nesting


Aggregation

One of the keywords for OLAP is fast. Recall that fast refers to the speed at which an OLAP system is able to deliver most responses to the end user. Producing fast query times is essential for OLAP; one of its basic tenets is that manipulating data naturally requires quick retrieval of information. In general, the more calculations OLAP needs to perform to produce a piece of information, the slower the response time. Consequently, to keep query times fast, pieces of information that users will access frequently, but that need to be calculated, are pre-aggregated. Pre-aggregated data means that the OLAP system calculates the values and then stores them in the database as new data. An example of data that may be precalculated is summary data, such as failure rates for days, months, quarters, or years.
Approaches to aggregation affect both the size of the database and the response time of queries. The more values a system precalculates, the more likely it is that a user request can be answered from a value that has already been calculated, improving the response time because the value does not need to be computed on demand. On the other hand, if a system precalculates all possible values, not only will the size of the database be unmanageable, but the time it takes to aggregate will also be long. Additionally, when values are added to the database, or perhaps changed, that information needs to be propagated through the precalculated values that depend on the new data. Thus, updating the database can be time-consuming if too many values are pre-calculated.
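A simple relational way to pre-aggregate is to build summary rows once, at load time, so that user queries read the small summary instead of recalculating from the detail; a minimal sketch, assuming the hypothetical sales_facts_detail table (materialized-view syntax varies by database, so a plain summary table is used here):

    -- Precalculate quarterly totals when the data is loaded
    CREATE TABLE quarterly_sales_summary AS
    SELECT sales_year, sales_quarter, product, SUM(units_sold) AS total_units
    FROM sales_facts_detail
    GROUP BY sales_year, sales_quarter, product;

    -- User queries then read the summary directly instead of scanning the detail
    SELECT total_units
    FROM quarterly_sales_summary
    WHERE sales_year = 2003 AND sales_quarter = 1 AND product = 'Modem';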


Figure 10: Aggregation

Summary

To summarize what you learned in this tutorial:
  • OLAP is a category of software tools that provides Fast Analysis of Shared Multidimensional Information (FASMI).
  • An OLAP cube is the concept that achieves the multidimensional functionality that provides the method to look at data from a variety of angles.
  • Slicing, dicing, and rotating are methods in OLAP that change the view of the data in different ways.
  • Dimensions are the descriptive categories of data that apply to major aspects of a business, and dimensions have hierarchies of levels.
  • Categories are the items within a dimension that match a specific description or classification and that you use to perform drill-up and drill-down operations.
  • Measures are the actual data values for a select set of dimensions.
  • Nesting provides the ability to add more than one dimension or category to a column or row.
  • Aggregation is the process of precalculating summary data values.

SOURCE:http://training.inet.com/OLAP

Online analytical processing (OLAP)

Online analytical processing (OLAP) allows you to access aggregated and organized data from business data sources, such as data warehouses, in a multidimensional structure called a cube.

Microsoft SQL Server 2008 Analysis Services (SSAS) provides tools and features for OLAP that you can use to design, deploy, and maintain cubes and other supporting objects.

Before you start integrating cubes and other OLAP functionality into your business intelligence solutions, make sure you engage with Quantix for guidance on foundations, concepts, and strategic considerations.

OLAP
Working with Relational Schemas
You generally define OLAP objects based on an existing data source, such as a data warehouse or production database. However, you can also define OLAP objects without a data source and then have the Schema Generation Wizard create the underlying relational schema based on the OLAP objects defined in an Analysis Services project. For more information about working with relational schemas, see Working with Relational Schemas.
Dimensions, attributes, and hierarchies are OLAP objects that you define and configure at the Analysis Services database level. These OLAP objects exist independent of any OLAP cubes and can be used in one or more cubes. To a limited extent, these objects can be customized within each cube. For more information about defining and configuring dimensions, attributes, and hierarchies, see Defining and Configuring Dimensions, Attributes, and Hierarchies. For conceptual information about dimensions, attributes, and hierarchies, see Dimensions (Analysis Services).

Cubes are OLAP objects consisting of related measures and dimensions that you configure within an Analysis Services database. You can define and configure multiple cubes within a single database, and each cube can use some or all of the same dimensions. You can also define a single cube that contains multiple measure groups in the same database rather than defining separate cubes. When you define a cube with multiple measure groups, you need to define how dimensions relate to each measure group and customize, as appropriate, dimension objects within each cube and measure group. When defining a cube, you also define advanced cube properties, including calculations, KPIs, actions, partitions and aggregations, perspectives and translations.


SOURCE:http://www.quantix-uk.com/olap.aspx

OLAP Workshop Part 2 : Understanding OLAP Technology

In the last posting I hopefully explained some of the basic concepts behind OLAP. In this posting I want to explore how those basic concepts are exposed by the various OLAP aware ETL and reporting tools provided by Oracle and other BI vendors.

Architecture of Oracle OLAP





For a long time now Oracle has been unique in the marketplace. With Oracle Database 9i, 10g and 11g, all data (relational and multidimensional) is stored in one Oracle database. Only Oracle OLAP provides native multidimensional data types within the database.

A high-level architectural view of the Oracle OLAP option contains three parts:

  • Oracle Database 10g OLAP option, which comprises:
    • Multidimensional data types
    • OLAP calculation engine
    • Open-access interfaces
  • End-user tools, which provide access to OLAP data for a wide spectrum of analytic needs
  • Administrative tools used to create and manage multidimensional data types. Oracle provides two administrative tools that can be used to create multidimensional data types in Oracle Database 10g:
    • Oracle Warehouse Builder
    • Analytic Workspace Manager
The following sections examine these three layers in more detail.

Components of Oracle OLAP
With the Oracle OLAP option, you get two powerful arenas of functionality: the OLAP API and the analytic workspace (AW).

OLAP API Functionality





Analytic Workspace
The analytic workspace is the container for the multidimensional data types and leverages the multidimensional calculation engine of the OLAP option. The AW also provides a standard SQL interface to the multidimensional model. This provides an industry-standard access layer that can be used by any BI reporting tool that generates SQL commands, from SQL Developer and Application Express to more sophisticated tools such as BI EE.

The AW also provides an XML API for administration, and a programming language (OLAP DML).

Multidimensional Data Store
The OLAP option provides true array-based multidimensional data types within the Oracle database. These multidimensional data types are contained in special tables in Oracle called analytic workspaces. Some data types are used to store data, whereas others are calculated instantaneously using the multidimensional engine.

Multidimensional Calculation Engine
The OLAP option's multidimensional engine includes an impressive library of multidimensional-aware calculation functions and support for planning functionality such as statistical forecasts, models, allocations, projections, and "what-if" scenarios, in the context of analytic workspaces. The multidimensional engine interacts with the multidimensional data types in the analytic workspace in the Oracle database.

The Oracle OLAP option provides a specialized Java API that developers can use to exploit the full power of the Oracle OLAP option by using advanced dimensionally-aware tools and applications. This API is used by Oracle Business Intelligence tools such as OracleBI Beans, OracleBI Discoverer, OracleBI Spreadsheet Add-In, and Oracle Reports OLAP Plug-in to provide a true multi-dimensional query and calculation environment.

OLAP DML
The OLAP DML is an extremely powerful and analytically rich feature of the AW. It is a dimensionally-aware, high-level procedural language that runs in the database and exploits the multidimensional engine and multidimensional data types.

With the Oracle OLAP option developers can exploit the OLAP DML to add more sophisticated calculations and analysis to AWs and to extend the functionality of applications that access them. The OLAP DML is briefly introduced in the lesson titled “Previewing Advanced Oracle OLAP Features.”

AW API
The AW API is a Java API that is used to define and physically build multidimensional analytic workspaces inside Oracle Database 10g. The AW API is used by administrative tools such as Analytic Workspace Manager and may be used by developers, if required, to enhance and extend analytic workspaces as necessary for a specific application.

Query Access to Oracle OLAP
Different users with different end-user tool requirements can all access the same data, taking advantage of the same calculations, and benefit from the same security, scalability, performance, and availability of the Oracle database.




OLAP API
The OLAP API is designed to work with both multidimensional data types and relational data types in the Oracle database.

  • The API enables you to directly access multidimensional data types in the AW.
  • To access relational data types, you can register a relational physical dimensional model (such as a star or snowflake schema) to the Oracle Database OLAP Catalog.

Therefore, tools that use the OLAP API can be deployed against multidimensional analytic workspaces or suitable relational schemas that are registered to the Oracle OLAP option’s metadata layer.


Many Oracle business intelligence tools take full advantage of the multidimensional query data model provided by the Oracle OLAP option through the OLAP API. These include OracleBI Spreadsheet Add-In, OracleBI Discoverer, OracleBI Beans, and OracleBI Reports.

SQL Interface
Unlike other multidimensional OLAP server products, Oracle OLAP provides not only a specialized API but also industry-standard SQL to access multidimensional data types.

You can use a simple SQL query with relationally oriented tools and applications to gain access to the multidimensional data types in the Oracle database. As a result, your SQL-based applications (such as report generators and ad hoc query tools) can access multidimensional data and calculations managed by the Oracle OLAP option.
SQL and PL/SQL are also used to manage and maintain multidimensional analytic workspaces and to move data between relational and multidimensional data types within the Oracle database.
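To make this concrete, here is a minimal sketch of the kind of query a SQL-based tool could issue once an analytic workspace has been exposed through a relational view; the view and column names are hypothetical, not actual objects created by Oracle OLAP.

    -- Hypothetical relational view over an analytic workspace cube
    SELECT product, region, SUM(sales_amount) AS total_sales
    FROM sales_cube_view
    WHERE time_period = '2003-Q1'
    GROUP BY product, region;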

For example, Oracle Application Express is an easy-to-use tool that is supplied with Oracle Database 10g for Web access to the Oracle database. Oracle Application Express is an example of a SQL-based application with no built-in OLAP knowledge that can nevertheless leverage the power of Oracle OLAP.

Many third-party tools from independent software vendors and Oracle partners, such as Arcplan, Business Objects, Cognos, and a large and growing number of business intelligence vendors throughout the world, access OLAP data through the Oracle OLAP option. Some of these vendors’ tools leverage the OLAP API, whereas others exploit the SQL query interface.

Dimensionally Aware Products
Oracle BI Spreadsheet Add-In
OracleBI Spreadsheet Add-In makes it easy to access OLAP data through the familiar spreadsheet environment of Microsoft Excel. After installation of OracleBI Spreadsheet Add-In, "OracleBI" appears as a new menu item in Excel. By using OracleBI Spreadsheet Add-In, you can establish a secure connection to the OLAP data source and use Excel as the front-end access tool to the data in the database.

Here are some of the features of OracleBI Spreadsheet Add-In:

  • It combines the flexibility and familiarity of Excel and the power, scalability, and security of the Oracle OLAP option.
  • OracleBI Query and Calculation Builders: After the connection is established, you can use the wizard-driven interface to drill, pivot, page through large cubes, and create reports.
  • Access to native Excel features
    • Powerful data-formatting features of Excel
    • Combine Oracle OLAP data with other Excel data
    • Write Excel macros that leverage all your data
    • Create formulas and graphs in Excel
Excel users can quickly and easily combine the powerful analytic capabilities of Oracle OLAP with standard Excel functions that you know and use each day.



When using Excel, OLAP calculations are performed directly in the database. The benefit of using OracleBI Spreadsheet Add-In is that you no longer need to download massive amounts of data to your spreadsheet. Oracle Database 10g OLAP performs all the OLAP calculations quickly and efficiently in the database. The calculations and business logic are defined only once in the database and then shared across the user community.

SOURCE:http://oraclebi.blogspot.com/2007/11/olap-workshop-part-2-understanding-olap.html

Online Analytical Processing (OLAP) and Business

There are many business intelligence tools available to help make sense of the huge volumes of data that the average business must process. One of the most commonly used methods is Online Analytical Processing, also known as OLAP.
OLAP is only one kind of data analysis method but it is particularly useful in certain business circumstances such as process management, forecasting, budgeting, marketing and financial reporting among other areas. One of its defining features is that the databases allow complex queries that return results quickly.
OLAP systems run on the OLAP cube concept and have the potential to return results from queries in a fraction of the time taken by other methods. There are a number of different types of OLAP system.
    Multidimensional - MOLAP
    This is also known simply as OLAP and uses databases whose structures are optimized in advance for certain functionality. Advantages include the fact that query results are returned very quickly and that data can be stored in much less space than in relational databases.
    Relational - ROLAP
    ROLAP differs from MOLAP in that it does not require any pre-computation, nor does it store any additional information; all information is accessed directly from the relational database. It can use Structured Query Language (SQL) queries to request certain information from the database and present it to the end user (a sketch of such a query appears below). One of the main advantages of ROLAP is that it can handle very large volumes of data. Also, because data is stored in a standard relational database, it can be accessed with SQL queries and presented with normal reporting tools.
    Hybrid - HOLAP
    There is no exact consensus on what a HOLAP is precisely, but as you might imagine it combines features of both multidimensional and relational OLAPs. In this scenario, some of the data will be kept in the MOLAP store and the rest in the ROLAP store. It is up to the designer how much and what kind of data is kept in each, and how it is partitioned.
    These are the three main types of OLAPs you will encounter. However, you may also hear of the following:
          o WOLAP - Web-based OLAP
          o DOLAP - Desktop OLAP
          o RTOLAP - Real-Time OLAP
          o SOLAP - Spatial OLAP
    There is a range of OLAP products on the market today; however, the most commonly used by far is Microsoft Analysis Services, which has been available since 1998. There are also a number of other commercial offerings such as Hyperion and Cognos. Since 2000, a number of open source products have emerged, including Palo and Mondrian.
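    As mentioned under ROLAP above, here is a sketch of the kind of plain SQL a ROLAP engine might generate against a star schema to answer an aggregate question; the table and column names are hypothetical:

        -- ROLAP-style query: total sales by region for 2010, built from a star schema
        SELECT d.region, SUM(f.sales_amount) AS total_sales
        FROM sales_fact f
        JOIN store_dim d ON f.store_key = d.store_key
        JOIN date_dim  t ON f.date_key  = t.date_key
        WHERE t.calendar_year = 2010
        GROUP BY d.region;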

SOURCE:http://www.relativitycorp.com/data/article1.html

Tuesday, January 25, 2011

The Evolution of a Real-time Data Warehouse

Understanding Real-time Systems
Today, real-time computing is everywhere, from customer information control systems (CICS) to real-time data warehouse systems. Real-time systems have the ability to respond to user actions in a very short period of time. This computing behavior gives real-time systems special features such as instant interaction: users request information from the system and they receive the answer immediately. Also, users can remain connected (online) and start this interaction with the system at any time; such a system is called an online transaction processing (OLTP) system.
In general, a real-time system generates a large amount of up-to-date data and operates in a transactional way (meaning it is used to register business transactions). Even though a traditional real-time system can store historical information, it is not designed to exploit this information for analytical purposes. In real-time systems, the importance of data resides mainly in the possibility of reacting or responding to the most current data in the system; historical data is used for monitoring purposes. Common real-time systems are designed for bookkeeping purposes (e.g., invoice information, flight reservations, etc.). Transactional data stored in the traditional real-time system will be changed (updated) according to operational purposes, which explains why some traditional real-time systems are also called "operational systems."
The Data Warehouse
Operational systems generate a large amount of data to support the transactional operations of an organization. The data generated by these systems is fresh and valuable; it can also be very diverse and heterogeneous, and can come from a wide variety of sources. It is common for many companies to have several systems to support their operations. In order to solve the resulting problem (gathering, integrating, cleaning, and analyzing this heterogeneous information), new software systems were developed, of which the data warehouse is the most important.
A collection of subject-oriented, integrated, time-variant, and non-volatile data is what we can call a data warehouse. This data is used to support the decision-making process of an organization's management team. A data warehouse is used to integrate all of an organization's historical data, and has the ability to store snapshots of its transactions. All information generated from the operational data source is extracted, cleansed, transformed, and loaded into the data warehouse. Once in the data warehouse, the data can be subject to a wide range of analysis and exploration processes.
In a traditional data warehouse design, data generated within the company is concentrated in a single place and is not deleted. This ensures that all data is integrated and consolidated to reflect one single version of the truth throughout the organization (every quarter, month, week, or even day). There are many tools that enable data analysis from very diverse points of view—reporting and analysis tools, online analytical processing (OLAP) analysis tools, data mining, forecasting, etc.
Managers and decision makers realized that a data warehouse—and its technology—had the ability to provide a complete view of the status of an organization. But as operational systems evolved, businesses merged, and globalization advanced, the number of business transactions increased substantially. Soon it became clear that having only historical information available wouldn't be enough to give companies all the necessary data for decision support. Managers needed data to be refreshed at a high speed, forcing data warehouse refresh cycles to increase in frequency.
Real Time and the Data Warehouse
At first glance, the real-time and data warehouse concepts appear to be very distant and disparate, but if we look more closely, we will see that the real-time data warehouse paradigm makes more sense than we might have thought. As mentioned before, the volume of data generated by operational systems has grown immensely. The speed of that data has forced data warehouses to radically change the way data is stored and handled.
As a natural process and consequence of the increasing speed of data generation, the traditional data warehouse design was forced to incorporate more frequent refresh cycles, and new challenges in data warehouse design had to be confronted. One of the first steps in managing this major issue was the creation of the near real-time data warehouse.
The Near Real-time Data Warehouse
The extraction, transformation, and load (ETL) process represents one of the major challenges in real-time data warehouse design. All ETL data warehouse processes were originally designed to be executed in batch mode, during previously scheduled downtimes. All operational data from distinct sources (e.g., ERP systems) was extracted, cleansed in a staging repository, and loaded into the data warehouse over long periods of time, mostly at night. These processes can take minutes or hours, depending on the volume of data being uploaded to the data warehouse.
With the pressure to load more recent data into the data warehouse, ETL processes were forced to run more frequently, with new designs. This approach derives from a very simple observation: if the business does not really need true real-time technology, it may be enough to implement a near real-time data warehouse, which essentially means running the ETL processes more often. The near real-time approach has its own challenges, such as increased downtime frequency, pressure to shorten each downtime period, and the need to avoid inconsistent data results. If there is no actual need for a real-time data warehouse solution, a good option can be to implement a near real-time data warehouse.
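In practice, moving from nightly batches toward near real time often means running an incremental load much more often; a minimal sketch, assuming the source table carries a last-updated timestamp (a common but by no means universal design):

    -- Incremental (near real-time) load: pick up only rows changed since the last run
    INSERT INTO dw_orders (order_id, customer_id, amount, updated_at)
    SELECT o.order_id, o.customer_id, o.amount, o.updated_at
    FROM source_orders o
    WHERE o.updated_at >
          (SELECT COALESCE(MAX(updated_at), TIMESTAMP '1900-01-01 00:00:00') FROM dw_orders);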
The Real-time Data Warehouse
A real-time data warehouse enables data to be stored at the very moment it is generated and it is immediately captured, cleansed, and stored within the data warehouse structure. Traditional refresh cycles are no longer valid. The data warehouse is capable of reading the same data that moves around the operational systems at the same time it is generated. Despite the technical difficulties of implementing a true real-time data warehouse, there are some advantages.
  • It shortens information delivery times.
  • It improves integration throughout the organization.
  • It eases the analysis of future trends.
Basic Principles to Consider
With the growing popularity and increasing implementation of real-time data warehouses, it is important to consider some basic principles when considering a real-time data warehouse implementation.
Data on Time, at the Right Time. The data must flow to the real-time data warehouse at the necessary speed in order to be considered valuable data. In a real-time data warehouse, the ETL batch mechanism based on a table or file transfer is replaced with a design in which data flows from very different yet synchronized data sources into the data warehouse. Data will be considered valuable if it flows at the right speed and not necessarily at real-time speed. This data cycle will depend on the analytical purpose the data is used for. Real-time data is an essential part of the analytical process, but historical data remains an essential part of the design paradigm. While tactical analysis requires immediate or recent data, other sorts of strategic decision making require a higher volume of historical data. For all business analysis, a combination of both historical and real-time data is necessary.
The Analytical Purpose. Real-time data warehouses are not intended to replace traditional operational systems. One of the functions of a real-time data warehouse is to support analytical purposes, and not to perform operational functions. Even when a real-time data warehouse has the ability to store real-time data, its design is intended to perform intensive data loading, not transaction-based registrations. Real-time data warehouses are based on business data flows rather than business transactions, and need to maintain their main analytical purpose.
The Enterprise Focus. One of the key differentiators from other real-time systems such as ERPs, CRMs etc. is integration. A real-time data warehouse integrates data that flows from different sources to one single site. Consider a real-time data warehouse as an enterprise data warehouse—this repository will be accessible for all business units across the organization.
Some Players in the Real-time Data Warehouse Field
There are several vendors in the data warehouse space. Some are big and offer very robust solutions. Other vendors are innovative and offer state-of-the-art solutions. Here are some data warehouse vendors to consider when planning a real-time data warehouse implementation (listed in alphabetical order).
Greenplum
Greenplum Database is data warehouse software built to support large-scale analytics processing. It is based on a massively parallel processing (MPP) architecture and provides multi-level fault tolerance. The database supports industry-standard interfaces (Structured Query Language [SQL], Open Database Connectivity [ODBC], and Java Database Connectivity [JDBC]) and interoperates with most common business intelligence (BI) and ETL tools.
IBM
IBM counts on InfoSphere Warehouse to deliver an entire data warehouse deployment solution, with different editions (e.g., Enterprise and Departmental), and a complete warehousing solution (InfoSphere Balanced Warehouse). Recently, IBM announced InfoSphere System Z as part of a complete IBM data warehouse solution. It enables applications to populate data warehouses under DB2 for z/OS operating systems.
Microsoft
SQL Server 2008 provides a scalable data warehouse solution for BI. This product's features include data compression, partitioned tables, parallelism capabilities, and change data capture. The scalable integration services tools enable fast ETL operations and connectivity to non-SQL server data sources.
Netezza
Netezza Data Warehouse Appliances are a combination of software and hardware tools designed for analytical processing of very large amounts of data. Netezza's architecture is massively parallel processing (MPP)-based and uses "streaming" processing to help enable advanced or complex analytical processes.
Oracle
Oracle offers a set of products for deploying data warehousing solutions, such as Exadata Storage Server (based on the HP ProLiant DL180 G5 server), HP Oracle Database Machine designed for multiterabyte data warehouses, and the Oracle Warehouse Builder integration tool that supports advanced data loads and data capture for low latency databases. They also offer other products like Oracle Partitioning, which can lower query times and increase data availability.
Teradata
Among other products offered, Teradata Database 12 is an enterprise data warehouse (EDW) solution with parallel operation capability, mission-critical availability, and ease of integration. With support for symmetric multiprocessing (SMP) and massively parallel processing (MPP), and its own tools and utilities such as TPump, FastLoad, and MultiLoad, data is loaded continuously. Also, the Teradata Replication Services are used for real-time capture and delivery of changed data.
Vertica
The Vertica Analytic Database is a column-oriented, MPP architecture-based database built for new-generation data warehouses. It is designed to handle large-scale data analysis and many concurrent users, and features aggressive data compression.
Data warehouse implementation has changed radically. The data warehouse is still an enterprise data repository that enables data analysis, but nowadays it is able to incorporate not only historical data but also real-time data. This new capability expands the functionality of data warehouses and enables them to store information at a higher speed. It also makes them capable of supporting tactical and analytical decisions for an organization at the right time. The data warehouse is not only alive and well, but evolving and maturing. It is enabling companies to evolve and expand by using new technology to analyze past and present information to support future decisions.
SOURCE:http://www.technologyevaluation.com/research/articles/the-evolution-of-a-real-time-data-warehouse-20472/

The Necessity of Data Warehousing

Why the market is necessary
Data warehousing is an integral part of the "information age". Corporations have long known that some of the keys to their future success could be gleaned from their existing data, both current and historical. Until approximately 1990, many factors made it difficult, if not impossible, to extract this data and turn it into useful information. Some examples:
Data storage peripherals such as DASD (Direct Access Storage Device) were extremely expensive on a per-megabyte basis. Therefore, much of the needed data was stored offline, typically on magnetic tape.
Processing power was very expensive as measured in MIPS (millions of instructions per second). Mainframes had to reserve most of their processing power for day-to-day operations, so reports could only be run overnight in batch mode (without interaction from the user).
Relational database technology was still in its infancy, and server engines were not powerful enough to support the data loads required.
The type of programming that had to be done with third generation languages (3GL's) was tedious and expensive. Fourth generation languages were needed to abstract some of the required coding, but 4GL's were still in their infancy.
Most operational data is stored in what is referred to as an OLTP (On-Line Transaction Processing) system. These systems are specifically designed for high levels of transaction volume with many concurrent users. If the database is relational, it has probably been "normalized" (the process of organizing data in accordance with the rules of a relational database). If the database is non-relational, custom programs have to be written to store and retrieve data from the database. (This is often accomplished with the COBOL programming language). Whether relational or non-relational, the very design that makes an OLTP system efficient for transaction processing makes it inefficient for end-user queries. In the 1980s, many business users referred to their mainframes as "the black hole", because all the information went into it but none ever came back out - all requests for reports had to be programmed by the Information Systems staff. Only "pre-canned" reports could be generated on a scheduled basis; ad hoc real-time querying was virtually impossible.
To resolve these issues, data warehousing was created. The idea was to create a database infrastructure that was always online, contained all the information from the OLTP systems, including historical data, and was structured in such a way that it was fast and efficient for querying. The most common of these schemas (logical and physical database designs) is known as the star schema. A star schema consists of facts (actual business facts) and dimensions (ways of looking at the facts). One simple way to look at a star schema is that it is designed so that the maximum amount of information can be derived from the fewest number of table reads. Another way to reduce the amount of data being read is to pre-define aggregations (summaries of detail data, such as monthly total sales) within the star, since most queries ask questions like "how many were sold last month?"
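A minimal star-schema sketch (with hypothetical names) shows the split between facts and dimensions and a pre-defined monthly aggregation of the kind described above:

    -- Dimension table: a way of looking at the facts
    CREATE TABLE date_dim (
        date_key      INTEGER PRIMARY KEY,
        calendar_date DATE,
        month_number  INTEGER,
        calendar_year INTEGER
    );

    -- Fact table: the actual business facts, keyed by the dimensions
    CREATE TABLE sales_fact (
        date_key     INTEGER REFERENCES date_dim (date_key),
        product_key  INTEGER,
        sales_amount DECIMAL(12,2)
    );

    -- Pre-defined aggregation: monthly totals, so "how many were sold last month?"
    -- reads a small summary table instead of scanning the detail rows
    CREATE TABLE monthly_sales_agg AS
    SELECT d.calendar_year, d.month_number, SUM(f.sales_amount) AS total_sales
    FROM sales_fact f
    JOIN date_dim d ON f.date_key = d.date_key
    GROUP BY d.calendar_year, d.month_number;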
Data warehousing also led to the development of the concept of metadata management. Metadata is data about data, such as table and column names, and datatypes. Managing metadata makes it possible to understand relationships between data elements and assists in the mapping of source to target fields. (For more information on metadata, see "Metadata Standards in the Marketplace".)
Next came the creation of Extract/Transform/Load (ETL) tools, which made use of the metadata to get the information from the source systems into the data warehouse.
Additional tools, which made use of SQL (Structured Query Language), were developed to give end-users direct access to the data in the warehouse. As time went by, the query tools became user-friendly, and many now have a parser that can turn plain English questions into valid SQL. These end-user tools are now loosely referred to as "business intelligence" tools. In addition, there are other database constructs used to assist business intelligence tools in multi-dimensional analysis of data in the warehouse. These databases are referred to as hypercubes (also known as cubes, multi-dimensional cubes, or MDB's).
Since the early 1990's, data warehouses have become ubiquitous, technology and methodology have been improving, and costs have been decreasing. In 1998, data warehousing was a $28 Billion (USD) industry, and growing at over 10% per year. In addition, a recent survey of top IT executives indicated that data warehousing would be the number one post-Y2K priority. Data warehousing is now recognized as an important way to add business value and improve return on investment, if it is properly planned and implemented.
Selection Issues
Selecting a set of products for a data warehouse effort is complex. The first and most important issue is to ensure that the Extract/Transform/Load tool that is chosen can effectively and efficiently extract the source data from all the required systems.
The selection of the ETL tool requires an understanding of the source data feeds. The following issues should be considered:
Many warehouses are built from "legacy" systems that may be difficult to access from the computer network. ETL tools often do not reside on the same machine as the source data.
The data structures of the legacy systems may be hard to decompose into raw data.
Legacy data is often "dirty" (containing invalid data, or missing data). Care must be taken in the evaluation of the tool to ensure it has an adequate function library for cleansing the data. Depending on the complexity of the cleansing required, a separate tool designed specifically for cleansing and validation may have to be purchased in addition to the ETL tool.
The ETL tool should have a metadata ("data about data") repository, which allows the data sources, targets, and transformations to be tracked in an effective manner.
The tool should be able to access legacy data without the need for pre-processing (usually with COBOL programs) to get the data into sequential "flat files". This becomes increasingly complex when working with filesystems like VSAM (Virtual Sequential Access Method), and files that contain COBOL Occurs and Re-Defines clauses (repeating groups and conditionally defined fields). It should be noted that a large percentage of the world's data is stored in VSAM files.
A final issue is whether the ETL tool moves all the data through its own engine on the way to the target, or can be a "proxy" and move the data directly from the source to the target.
Selection of the business intelligence tool(s) requires decisions such as:
Will multi-dimensional analysis be necessary, or does the organization need only generalized queries? Not all warehouse implementations require sophisticated analysis techniques such as data mining (statistical analysis to discover trends in the data), data visualization (graphical display of query results), or multi-dimensional analysis (the so called "slice and dice").
Will the architecture be two-tiered or three-tiered? Three-tiered architectures offload some of the processing to an "application server" which sits between the database server and the end-user.
Will the tool employ a "push" or a "pull" technology? ("Push" technology publishes the queries to subscribed users, much like PointCast worked; "pull" requires that the user request the query.)
Will the information be broadcast over a corporate intranet, extranet, or the Internet?
How will the organization implement data security, especially if information is being broadcast outside the corporate firewalls?
Average acquisition cost
A minimum tool set, excluding hardware, would include a database engine, ETL tool, and a number of BI tools for the individual users. A reasonable starting point for this set would be in the range of $100,000. Costs vary greatly based on vendor selection, number of seats purchased, software maintenance fees, and many other factors. In addition, consulting services drive the cost up immensely. A full blown, large-scale data warehouse in the terabyte range with hundreds of user licenses could cost tens of millions of dollars to implement and maintain.
Conclusions
Data Warehousing has taken many years to develop into its current form, and is still an immature technology. Improvements in hardware and software have made it possible to build warehouses that are many terabytes (one terabyte is 1,099,511,627,776 bytes) in size. The ability to access current and historical data to examine market trends and to perform business analysis has proven to provide significant return on investment. In today's dynamic marketplace, companies will continue to use data warehousing to derive competitive advantage.
SOURCE:http://www.technologyevaluation.com/research/articles/the-necessity-of-data-warehousing-15998/

SAS/Warehouse 2.0 Goes Live

"CARY, N.C. (Feb. 24, 2000) - SAS Institute, the market leader in integrated data warehousing and decision support, has announced the production availability of SAS/Warehouse Administrator software, Version 2.0. Demonstrated at the Data Warehousing Institute conference in Anaheim, Calif., this new version provides IT the ability to proactively publish data warehouse information and track its usage, plus aggressively manage the process of change in the data warehouse."
"Data warehouses and data marts have become a vital component of all successful data mining, knowledge management, business portal, e-intelligence, customer relationship management (CRM) and supplier relationship management (SRM) applications today," said Frank Nauta, product manager for SAS/Warehouse Administrator at SAS Institute.
Nauta added, "Successful warehouses are continuously changing to keep pace with the changing business rules that they support. SAS/Warehouse Administrator simplifies change management by providing information delivery through e-enabled viewers like MetaSpace Explorer and integration with business intelligence and reporting tools. It helps make IT professionals more productive by giving them the ability to publish data from the warehouse - putting information in the hands of those who need it and freeing up IT staff for other projects."
Version 2.0 offers proactive information delivery with the addition of publish-and-subscribe tools robust enough for the entire enterprise, and offers enhanced intelligent access to data sources including SAP AG's R3, Baan, Oracle, DB2, Teradata, SQL Server 7.0, Sybase and many others.
SAS Institute was voted No. 1 in data warehousing and business intelligence in DM Review's 1999 Data Warehouse 100. SAS/Warehouse Administrator is a key component of this award-winning solution. Installed at more than 600 sites worldwide, SAS/Warehouse Administrator is the leading tool to help IT professionals meet the demands of administering a warehouse.
"IT and business users will find significant enhancements when building and designing warehouses and extraction, transformation and loading (ETL) processes," Nauta said. "By tracking the usage of information in the warehouse, IT staff can identify and remove data that is not being used. Removing unnecessary data makes the warehouse more efficient and maximizes hardware investments."

Publish/subscribe metaphors are becoming much more common in the data warehouse arena. The ability for users to subscribe (request information on a regular basis), and for the server to publish ("push") that information automatically to those users, is a powerful feature. Platinum Technology had already begun an effort in this area, known as "Project C", which was delayed by their acquisition by Computer Associates.
It has been made clear by the success of products such as PointCast that push technology is an important aspect of information distribution. In addition, although SAS Institute has not been well known in the Extract/Transform/Load arena, they have a strong offering.
SAS has also formed a business intelligence alliance with IBM which may be leveraged by customers in the general data warehousing arena.
Customers evaluating data warehouse solutions should include SAS Institute on a long list of candidates for selection. With their background in statistical products, SAS is particularly strong in the data-mining arena. SAS Institute's extract/transform/load tool has native drivers for many of the legacy data formats on the market (including databases such as CA/IDMS, CISAM, Model 204, MUMPS, and Unidata), and should have an interface for almost any customer's requirement. The products also provide integration with data modeling tools such as CA/ERWin to simplify the consolidation of warehouse metadata.
SOURCE:http://www.technologyevaluation.com/research/articles/sas/warehouse-2.0-goes-live-15410/

IBM Builds on Hadoop with New Storage Architecture

At the Supercomputing 2010 conference, IBM pulled back the curtain on a new storage architecture that, according to Big Blue, can double analytics processing speed for big data and the cloud.
Created by scientists at IBM Research – Almaden, the new General Parallel File System-Shared Nothing Cluster (GPFS-SNC) architecture was built on the IBM (NYSE: IBM) GPFS and incorporates the Hadoop Distributed File System (HDFS) to provide high availability through advanced clustering technologies, dynamic file system management and advanced data replication techniques.
The cluster "shares nothing": it is a distributed computing architecture in which each node is self-sufficient. GPFS-SNC divides tasks between independent nodes, and no node waits on another.
According to IBM, the GPFS-SNC can “convert terabytes of pure information into actionable insights twice as fast as previously possible.”
In addition, the GPFS-SNC design won the Supercomputing 2010 Storage Challenge. The judging system for the Storage Challenge measures technologies based on performance, scalability and storage subsystem utilization, to determine the most innovative and effective design in high performance computing.
Prasenjit Sarkar, master inventor, Storage Analytics and Resiliency, IBM Research – Almaden, described the GPFS-SNC as a general purpose file system that allows IBM to compete in “all worlds,” whether it be against Google’s (NASDAQ: GOOG) MapReduce framework, in traditional data warehouse environments against Oracle’s (NASDAQ: ORCL) Exadata Database Machine and EMC’s (NYSE: EMC) Greenplum Data Computing Appliance, or in the cloud.
Sarkar said the GPFS-SNC boasts twice the performance of competing architectures, supports POSIX for backward compatibility, and includes advanced storage features such as caching, replication, backup and recovery, and wide area replication for disaster recovery.
“The world is overflowing with petabytes to exabytes of data and the challenge is to store this data efficiently so that it can be accessed quickly at any point in time. This new way of storage partitioning is another step forward on this path as it gives businesses faster time-to-insight without concern for traditional storage limitations,” Sarkar said.
IBM’s GPFS currently serves as the basis for the IBM Scale Out Network Attached Storage (SONAS) platform, which is capable of scaling both capacity and performance while providing parallel access to data and a global name space that can manage billions of files and up to 14.4PB of capacity. It is also used in IBM's Information Archive and the IBM Smart Business Compute Cloud.
Sarkar did not comment on when or how the GPFS-SNC storage technology will find its way into IBM’s commercially available product portfolio. However, it stands to reason that the GPFS-SNC will be used in IBM’s recently announced VISION Cloud initiative, which has been formed to develop a new approach to cloud storage, where data is represented by smart objects that include information describing the content of the data and how the object should be handled, replicated, or preserved.
IBM announced the VISION project last week as a joint research initiative of 15 European partners to develop a so-called "smart cloud storage architecture." The effort centers on delivering storage services within and across cloud boundaries through a better understanding of what's inside the data.
The VISION Cloud storage cloud architecture concept combines (a) a rich object data model, (b) execution of computations close to the stored content, (c) content-centric access, and (d) full data interoperability.
The VISION Cloud initiative will be spearheaded by scientists at IBM Research in Haifa, Israel, and supported by partners including SAP AG, Siemens Corporate Technology, Engineering and ITRicity, Telefónica Investigación y Desarrollo, Orange Labs and Telenor, RAI and Deutsche Welle, and the SNIA Europe standards organization. The National Technical University of Athens, Umea University, the Swedish Institute of Computer Science, and the University of Messina will also contribute to the effort.
newer post

Oracle Warehouse Builder: Better Late than Never?

0 comments
"REDWOOD SHORES, Calif., Mar. 6, 2000 - Oracle announced the immediate availability of Oracle Warehouse Builder, an extensible and easy-to-use data warehouse design and deployment framework. As part of Oracle's Intelligent WebHouse Initiative, Oracle Warehouse Builder automates much of the work that goes into creating a powerful, single data store for e-business analysis, specifically with its ability to integrate historical data with the massive, daily influxes of online data from web sites. The product, already in production at nearly 20 beta sites worldwide, is available for purchase tomorrow."
The Oracle Warehouse Builder framework eliminates the previous time-consuming and error-prone methods for deploying data warehouses, data marts, and e-business applications, even when integrating ERP data from multiple vendors such as Oracle and SAP. These previous methods included hand coding the software and stitching together point solutions. Oracle Warehouse Builder enables customers to deploy even terabyte-sized data warehouses quickly, an essential benefit in today's cutthroat business environment.
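For readers who never had to write it, the sketch below shows the kind of hand-coded flat-file extract-transform-load script that tools like Oracle Warehouse Builder are meant to generate and manage instead. It is a hypothetical example using Python and SQLite with an invented sales-file layout; Warehouse Builder itself generates PL/SQL against Oracle rather than anything like this.

```python
import csv
import sqlite3

def load_flat_file(csv_path: str, db_path: str = "warehouse.db") -> int:
    """Hand-coded ETL: read a legacy flat file, clean it, and load a staging table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stg_sales "
        "(order_id TEXT, region TEXT, amount REAL, order_date TEXT)"
    )
    rows = 0
    with open(csv_path, newline="") as handle:
        for record in csv.DictReader(handle):
            # Transform: trim text fields and coerce the amount to a number.
            conn.execute(
                "INSERT INTO stg_sales VALUES (?, ?, ?, ?)",
                (record["order_id"].strip(),
                 record["region"].strip().upper(),
                 float(record["amount"]),
                 record["order_date"].strip()),
            )
            rows += 1
    conn.commit()
    conn.close()
    return rows
```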
"With Oracle Warehouse Builder, Honeywell is building a data warehouse infrastructure that top management will use daily to spot inefficiencies that affect the bottom line. For example, we can find out if an order is being held up on the manufacturing floor or in shipping, and then take the needed steps to get the order out and revenue booked," said Jason Haugland, consultant for Honeywell. "Because of the great data integration features in Oracle Warehouse Builder, we can easily pull flat-file data from our legacy system into our warehouse. The product also interfaces extremely well with Oracle8i, allowing us to create specific features such as materialized views and dimension capabilities, which we use to build high-level aggregate summaries directly into the database."
"Oracle Warehouse Builder integrates the areas that were previously addressed by separate tools into one common environment. Functions include modeling and design, data extraction, movement and loading, aggregation, metadata management and integration of analysis tools are all part of Oracle Warehouse Builder's easy-to-use and easy-to-implement framework."
Oracle has publicly stated that its late arrival to market was caused by "lengthy beta testing." In reality, Oracle has had problems meeting release dates and difficulty integrating the various technologies involved. The company has also stated that the data quality management software acquired with Carleton Software will not be integrated for "a few months." The tool does currently include modeling software, tools for generating PL/SQL (Oracle's proprietary stored procedure language), a workflow manager, and gateway software for data access.
Whether this product is too late to market to meet customers' needs remains to be seen. Doubtless, Oracle will throw their considerable marketing muscle into the fray to try to offset the delay.
Customers evaluating complete data warehouse solutions should include Oracle Warehouse Builder on a long list of candidates, along with offerings from Informatica, Ardent Software (now part of Informix), and Computer Associates, among others. Oracle has stated that the key attraction is that "the product offers warehouse design, warehouse administration, and metadata management capabilities all in one suite."
Given that the product has just been released into general availability, the full functionality of the product is unknown. Oracle can throw a wide array of resources at the product to ensure its success, so it should not be ignored.
Customers should also evaluate "best-of-breed" solutions which combine warehouse design tools such as CA/Erwin, administration tools such as Embarcadero's DBArtisan, and metadata management facilities such as CA/Platinum Open Enterprise Edition Repository. Complete solutions from a single vendor may not comprise the best answer to a customer's problem.

SOURCE:http://www.technologyevaluation.com/research/articles/oracle-warehouse-builder-better-late-than-never-15622/
newer post

50 Open Source Replacements for Storage Software

1 comments
Open source storage software clearly fills a need.
According to researchers at IDC, the digital universe included 1.2 million petabytes, or 1.2 zettabytes, of data by the end of 2010. In case you have trouble picturing those numbers, that's enough data to fill a stack of DVDs that stretches to the moon and back.
With the amount of data growing exponentially, storage has become big business. 2010 saw a number of large tech firms snap up smaller storage-related vendors, including HP's notable acquisition of 3PAR after a bidding war with Dell.
But enterprises and small businesses don't have to spend a lot of money on their storage solutions. A number of open source projects offer backup, network attached storage (NAS), data warehouse, compression, encryption and other storage-related capabilities. Even if companies pay for support or related services, these open source options usually cost considerably less than their commercial counterparts.
Here are 50 noteworthy open source replacements for commercial storage-related tools. As always, feel free to suggest any others you think we should have included in the comments section below.
Open Source: Backup
1. Amanda Replaces Symantec NetBackup, NovaBackup, Barracuda Backup Service
Suitable for both small and large organizations, Amanda allows IT administrators to back up multiple multi-platform systems to tape, disk, or other media. Development of the software is now sponsored by Zmanda, which provides cloud-based backup services that rely on the same software. Operating System: Windows, Linux, OS X.
2. Areca Backup Replaces Norton Ghost, McAfee Online Backup, NovaBackup
More suitable for home users, Areca Backup offers advanced features like compression, encryption, file version tracking and backup simulation, along with a fairly easy-to-use interface. It offers full (backs up the entire drive), incremental (backs up files changed since the last backup), differential (backs up files changed since the last full backup), and delta (backs up the portions of files changed since the last backup) backup options; a minimal file-selection sketch appears at the end of this Backup section. Operating System: Windows, Linux.
3. Bacula Replaces Symantec NetBackup, NovaBackup, Barracuda Backup Service
This enterprise-ready network backup solution claims to be the most popular open source backup option for enterprise users. Commercial support is available through Bacula Systems. Operating System: Windows, Linux, OS X.
4. Clonezilla Replaces Symantec NetBackup, Norton Ghost
Specifically designed as a replacement for Norton Ghost and other Symantec backup products, Clonezilla is an open-source backup/imaging/cloning app that allows bare metal recovery and has unicasting and multicasting capabilities. It comes in two flavors: Clonezilla Live for backing up a single machine, and Clonezilla SE, which can clone more than 40 systems at once. Operating System: Linux.
5. Create Synchronicity Replaces Norton Ghost, McAfee Online Backup, NovaBackup
Extremely lightweight and easy-to-use, Create Synchronicity is great for home or small business users. It has good scheduling capabilities and is available in a number of different languages. Operating System: Windows.
6. FOG Replaces Symantec NetBackup, NovaBackup, Barracuda Backup Service
FOG creates a Linux-based server that can backup both Windows and Linux systems. It's a good choice for small organizations, because it's very easy to use but also very powerful with a number of advanced features. Operating System: Linux, Windows.
7. Partimage Replaces Norton Ghost
This network tool can back up entire disks or just partitions. Note that it runs on Linux, but can also back up Windows machines connected to the network. Operating System: Linux.
8. Redo Replaces Norton Ghost, McAfee Online Backup, NovaBackup
Redo Backup and Recovery boasts that it can do a bare-metal restore in as little as 10 minutes. Unlike many other backup and recovery tools for home users, it can boot from a CD even if you can't run your operating system due to a virus or system crash. Operating System: Windows, Linux.
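To make the full/incremental/differential distinction from the Areca Backup entry above concrete, here is a rough sketch of how a backup tool might decide which files to copy, assuming a simple modification-time rule. It is an illustration only: real products track metadata and deletions, and delta backups work at the sub-file level, which this sketch does not attempt.

```python
import os
import time

def files_to_back_up(root: str, mode: str, last_backup: float, last_full: float):
    """Select files for a backup run based on modification time.

    mode: 'full' copies everything, 'incremental' copies files changed since
    the last backup of any kind, 'differential' copies files changed since
    the last full backup.
    """
    selected = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mtime = os.path.getmtime(path)
            if mode == "full":
                selected.append(path)
            elif mode == "incremental" and mtime > last_backup:
                selected.append(path)
            elif mode == "differential" and mtime > last_full:
                selected.append(path)
    return selected

# Example: files changed in the last day, relative to a full backup a week ago.
now = time.time()
print(len(files_to_back_up(".", "differential", now - 86400, now - 7 * 86400)))
```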
Open Source: Compression
9. 7-zip Replaces WinZip
This utility offers a 2-10 percent better compression ratio on zip files than WinZip, and it also lets you create self-extracting 7z files with 30-70 percent better compression than zip files (a quick size-comparison sketch appears at the end of this Compression section). Operating System: Windows, Linux, OS X.
10. KGB Archiver Replaces WinZip
The KGB Archiver offers both good compression and good encryption capabilities, with all files protected by AES-256 encryption automatically. It's also pretty fast, but you will need a fairly robust system in order to use it. Operating System: Windows.
11. PeaZip Replaces WinZip
PeaZip writes to 10 different compression file formats and extracts from 129 file types, making it one of the most versatile archiving tools available. It also offers encryption capabilities, and it's available in a portable version that you can use from a thumb drive. Operating System: Windows, Linux, OS X.
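The compression-ratio claims above are easy to sanity-check on your own data. The sketch below uses Python's standard library to compare DEFLATE (the algorithm behind classic zip files) with LZMA (the algorithm behind the 7z format); the sample text is invented and the printed percentages will vary with the input.

```python
import lzma
import zlib

sample = b"Open source storage software clearly fills a need. " * 500

deflate_size = len(zlib.compress(sample, 9))
lzma_size = len(lzma.compress(sample, preset=9))

print(f"original:  {len(sample)} bytes")
print(f"zlib/zip:  {deflate_size} bytes")
print(f"lzma/7z:   {lzma_size} bytes")
print(f"lzma output is {100 * (deflate_size - lzma_size) / deflate_size:.1f}% "
      f"smaller than zlib output on this sample")
```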
Open Source: Databases
12. Kexi Replaces Microsoft Access, FileMaker
Its Web site proclaims Kexi to be “a long-awaited competitor for programs like MS Access or Filemaker.” It's part of the KDE suite, but can also be used on its own. Operating System: Windows, Linux, OS X.
13. LucidDB Replaces Microsoft SQL Server
Unlike the other databases on our list, LucidDB was specifically designed to serve the analytics needs of data warehouse and business intelligence projects. It was created with "flexible, high-performance data integration and sophisticated query processing in mind." Operating System: Windows, Linux.
14. MySQL Replaces Microsoft SQL Server
MySQL claims to be the "world's most popular open source database." It's very popular with Web developers because of its excellent performance and scalability capabilities. In addition to the free community edition, Oracle offers a number of paid commercial editions, as well as services and training. Operating System: Windows, Linux, OS X
15. PostgreSQL Replaces Microsoft SQL Server
This database touts its reliability saying, "Unlike many proprietary databases, it is extremely common for companies to report that PostgreSQL has never, ever crashed for them in several years of high activity operation. Not even once." It offers a wide array of advanced enterprise-class features like Multi-Version Concurrency Control (MVCC), point-in-time recovery, tablespaces, asynchronous replication, nested transactions (savepoints), online/hot backups and much more. Operating System: Windows, Linux, OS X.
Open Source: Data Destruction
16. BleachBit Replaces Easy System Cleaner
Among its many functions, BleachBit can "shred" files previously stored on your system so that they can't be recovered. It also cleans up your cache, temporary files, logs, cookies, etc. to improve system performance. Operating System: Windows, Linux.
17. Darik's Boot And Nuke Replaces Kill Disk, BCWipe Total WipeOut
If you need to get rid of all of the files stored on a system, DBAN does the job quickly and easily. It runs from a disk and automatically erases all drives it can detect so that they cannot be recovered. Operating System: OS Independent.
18. Eraser Replaces BCWipe Enterprise
If you need to eliminate personal or sensitive files completely, Eraser does the job by overwriting the old file with random data (see the sketch at the end of this Data Destruction section). It also comes with a scheduler so you can eliminate old archives on a set date. Operating System: Windows.
19. FileKiller Replaces BCWipe Enterprise
FileKiller works much like Eraser, but it gives the user the option of determining how many times the old data is overwritten. It's also very fast. Operating System: Windows.
20. Wipe Replaces BCWipe Enterprise
Wipe also performs the same data destruction as Eraser, only for Linux instead of Windows. Operating System: Linux.
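As a rough illustration of what these data destruction tools do, the sketch below overwrites a file with random bytes before unlinking it. Real products such as Eraser and DBAN make multiple configurable passes, wipe free space, and deal with filesystem and SSD behavior that this simplified example ignores, so treat it as a teaching aid rather than a secure-deletion tool.

```python
import os

def overwrite_and_delete(path: str, passes: int = 3) -> None:
    """Overwrite a file with random data in chunks, then remove it (illustrative only)."""
    size = os.path.getsize(path)
    chunk = 1024 * 1024
    with open(path, "r+b") as handle:
        for _ in range(passes):
            handle.seek(0)
            remaining = size
            while remaining > 0:
                block = os.urandom(min(chunk, remaining))
                handle.write(block)
                remaining -= len(block)
            handle.flush()
            os.fsync(handle.fileno())  # push each pass to disk before the next
    os.remove(path)
```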
Open Source: Data Warehouse (DW) Tools
21. Clover ETL Replaces Oracle Data Integration Suite, Informatica
Designed for those with "modest data transformation and ETL requirements," the community edition of CloverETL makes it easy to move data between different types of databases and spreadsheets. For those with more advanced needs, CloverETL also offers a variety of fee-based versions. Operating System: OS Independent.
22. DataCleaner Replaces DataFlux Data Management Studio, IBM InfoSphere Master Data Management, Oracle Master Data Management Suite
DataCleaner profiles, validates, and compares data in order to ensure its quality. It can work with nearly any kind of datastore, including Oracle, MySQL, XML files, Microsoft SQL, Excel spreadsheets, and more. Operating System: OS Independent.
23. KETL Replaces Oracle Data Integration Suite, Informatica
This "premier" extract, transform, load (ETL) tool offers excellent scalability, which allows it to compete against commercial tools. It integrates with most popular security and data management tools. Operating System: Linux, Unix.
24. MailArchiva Replaces ArcMail, Barracuda Message Archiver, GFI Mail Archiver
Highly scalable, MailArchiva offers long-term storage of e-mails in an easy-to-search format for compliance purposes. In addition to the free community version, it's also available in a paid enterprise edition, which includes additional features. Operating System: Windows, Linux.
25. Talend Open Studio Replaces Oracle Data Integration Suite, Informatica
Talend offers a variety of open-source tools for data integration, data quality, master data management and application integration. In addition to the free, open-source version of their software, they also offer paid SaaS versions and commercial support, services and training. Operating System: Windows, Linux, Unix.
Open Source: Document Management Systems (DMS)
26. Epiware Replaces Documentum, Microsoft SharePoint, OpenText
Epiware merges document management capabilities with some collaboration capabilities, including calendaring, project management and a wiki. Other features include search, check-in/check-out, and access and version history. Operating System: Windows, Linux.
27. LogicalDOC Replaces: Documentum, Microsoft SharePoint, OpenText
This app aims to help enterprises replace their paper files with digital versions that make it fast and easy to find the information you need. In addition to the community version, it's also available in paid enterprise and SaaS versions. Operating System: OS Independent.
28. OpenKM Replaces: Documentum, Microsoft SharePoint, OpenText
While not quite as polished as some of the other open-source document management systems, OpenKM also offers enterprise-ready capabilities. Key features include drag-and-drop functionality, a powerful search engine, and tag clouds. Operating System: OS Independent.
Open Source: Encryption
29. AxCrypt Replaces McAfee Anti-Theft, Symantec Endpoint Encryption, CryptoForge
The "leading open source file encryption software for Windows," AxCrypt lets you encrypt files from the Windows file manager with a simple right click, and decrypting takes just a double-click. You can also use it to send self-decrypting files via e-mail (a minimal encryption sketch appears at the end of this Encryption section). Operating System: Windows.
30. Crypt Replaces McAfee Anti-Theft, Symantec Endpoint Encryption, CryptoForge
Designed for speed, Crypt is very lightweight—just 44 kb. Note that it runs from the command line (no GUI), so it might be a little intimidating to less experienced computer users. Operating System: Windows.
31. Gnu Privacy Guard Replaces McAfee Anti-Theft, Symantec Endpoint Encryption, CryptoForge
Based on the OpenPGP standard, GPG lets you sign and encrypt both data and e-mail. Like Crypt, this app runs from the command line, but it does have a number of front-ends available if you prefer a GUI. Operating System: Linux.
32. gpg4win Replaces McAfee Anti-Theft, Symantec Endpoint Encryption, CryptoForge
If you want to use GPG on a Windows machine, this is the app for you. Operating System: Windows.
33. Mac GNU Privacy Guard Replaces McAfee Anti-Theft, Symantec Endpoint Encryption, CryptoForge
Yet another version of GPG—this time for Mac users. Operating System: OS X.
34. MailCrypt Replaces McAfee Anti-Theft, Symantec Endpoint Encryption, CryptoForge
If you want to encrypt all of your e-mail, try MailCrypt. It lets you choose between PGP and GnuPG encryption (though it recommends GnuPG). Note that this app is not as user-friendly as some of the other options. Operating System: OS Independent.
35. NeoCrypt Replaces McAfee Anti-Theft, Symantec Endpoint Encryption, CryptoForge
A better option for beginners, NeoCrypt offers a simple user interface and right-click integration with Windows Explorer. It supports 10 different encryption algorithms and has bulk encryption options for quickly protecting a lot of stored files. Operating System: Windows.
36. TrueCrypt Replaces McAfee Anti-Theft, Symantec Endpoint Encryption, CryptoForge
Downloaded more than 17 million times, this app is one of the most sophisticated and easy-to-use open-source encryption tools available. It can encrypt an entire disk or a disk partition, or create a virtual encrypted disk within a file and mount it as a real disk. Operating System: Windows.
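None of the code below comes from the tools listed in this section; it is a minimal sketch of the encrypt/decrypt round trip these utilities wrap in friendlier interfaces, written against the third-party Python cryptography package's Fernet recipe (symmetric authenticated encryption). The file name is an assumed placeholder.

```python
from pathlib import Path
from cryptography.fernet import Fernet  # third-party: pip install cryptography

def encrypt_file(path: str, key: bytes) -> str:
    """Encrypt a file's contents and write the ciphertext next to the original."""
    token = Fernet(key).encrypt(Path(path).read_bytes())
    encrypted_path = path + ".enc"
    Path(encrypted_path).write_bytes(token)
    return encrypted_path

def decrypt_file(encrypted_path: str, key: bytes) -> bytes:
    """Decrypt a previously encrypted file and return the plaintext."""
    return Fernet(key).decrypt(Path(encrypted_path).read_bytes())

if __name__ == "__main__":
    key = Fernet.generate_key()           # losing this key means losing the data
    enc = encrypt_file("notes.txt", key)  # "notes.txt" is an assumed placeholder
    assert decrypt_file(enc, key) == Path("notes.txt").read_bytes()
```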
Open Source: File Managers
37. Explorer++ Replaces Windows Explorer
This utility doesn't so much replace Windows Explorer as enhance it. It adds tabbed browsing, a preview window, keyboard shortcuts and other features to the familiar interface. Operating System: Windows.
38. muCommander Replaces Windows Explorer, xplorer²
This Java-based file manager supports FTP, SFTP, SMB, NFS, HTTP, Amazon S3, Hadoop HDFS and Bonjour. Other key features include built-in compression and decompression tools and a universal bookmarks and credentials manager. Operating System: Windows, Linux, OS X.
39. PCManFM Replaces Windows Explorer, xplorer²
While it's the default file manager for LXDE, you can also use PCManFM with other Linux desktop environments. It's currently undergoing a major re-write, but you can still download the older code. Operating System: Linux.
40. QTTabBar Replaces Windows Explorer
Like Explorer++, QTTabBar adds tabs to the Windows Explorer interface. It's technically still in alpha, but has a large and enthusiastic user base. Operating System: Windows.
41. SurF Replaces Windows Explorer, xplorer²
This file manager offers a unique, tree-based view. Other key features include real-time highlighting of new and changed files, directory auto-complete search and support for advanced NTFS capabilities. Operating System: Windows.
42. TuxCommander Replaces Windows Explorer, xplorer²
For Linux only, TuxCommander offers a familiar two-panel tabbed graphical interface. It's not flashy, but it is easy to use. Operating System: Linux.
Open Source: File Transfer
43. FileZilla Replaces CuteFTP, FTP Commander
With FileZilla you can set up a Windows-based FTP server or download FTP files from another site using any operating system. It supports FTP, FTP over SSL/TLS (FTPS) and SSH File Transfer Protocol (SFTP), and it's ready for IPv6. Operating System: Windows, Linux, OS X.
44. WinSCP Replaces: CuteFTP/GlobalSCAPE, FTP Commander
This very popular FTP client has been downloaded more than 40 million times. It supports SFTP, FTP, SSH, and SCP transfers. Operating System: Windows.
Open Source: NAS Software
45. FreeNAS Replaces Isilon products, IPDATA appliances, Netgear ReadyNAS
This BSD-based app allows you to create your own NAS server. It supports CIFS (samba), FTP, NFS, TFTP, AFP, RSYNC, Unison, iSCSI (initiator and target) and UPnP. Operating System: FreeBSD.
46. Gluster Replaces Isilon products, IPDATA appliances, Netgear ReadyNAS
Gluster offers open-source scale-out storage for on-premise NAS, virtual machine images and cloud storage. In addition to the free community edition, the company offers several subscription-based support options. Operating System: Linux.
47. Openfiler Replaces IPDATA appliances, Netgear ReadyNAS
With Openfiler, you can convert almost any standard PC into a unified NAS/SAN appliance. You will need to be fairly knowledgeable in order to use it, but support and other add-ons are available for a fee. Operating System: Linux.
48. Turnkey Linux Replaces IPDATA appliances, Netgear ReadyNAS
Turnkey offers a variety of pre-configured appliances, including a NAS appliance, which lands it on our list. In addition, all Turnkey appliances come with TurnKey Linux Backup and Migration pre-installed. Operating System: Linux.
Open Source: Utilities
49. Bulk File Manager Replaces NoClone 2010, FalconStor Data Deduplication
This helpful utility finds duplicate files and gives users the option of deleting or moving them (a minimal duplicate-detection sketch follows this list). It also offers bulk file moving, renaming and secure deletion capabilities. Operating System: Windows.
50. Copy Handler Replaces standard Windows copy and move capabilities
If you're re-organizing or cleaning up your hard drive, Copy Handler can help. It's faster than the standard Windows copy and move features, and it gives you more options, like task queuing, filtering files according to specific criteria, pausing and resuming copying files and changing the copying parameters on the fly. Operating System: Windows.
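Duplicate detection of the kind Bulk File Manager performs is easy to prototype: hash every file and group paths that share a digest. The sketch below is a simplified, hypothetical version; real tools add size pre-filtering, partial hashing, and safe delete or move options.

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(root: str):
    """Group files under root by content hash; groups with more than one path are duplicates."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(path, "rb") as handle:
                for block in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(block)
            by_digest[digest.hexdigest()].append(path)
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}

if __name__ == "__main__":
    for digest, paths in find_duplicates(".").items():
        print(digest[:12], paths)
```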

SOURCE:http://itmanagement.earthweb.com/osrc/article.php/12068_3920996_4/50-Open-Source-Replacements-for-Storage-Software.htm
newer post

Syncsort Sigma Manages Database Aggregates

0 comments
Programmers and administrators of large databases and data warehouses can create, manage and process data aggregates with speed, ease and accuracy using a new software product unveiled by Syncsort Incorporated.
The vendor states that the product, called Sigma, combines a high-performance data aggregation engine with an intuitive graphical user interface. By simplifying aggregate creation, Sigma reduces the time it takes to define and maintain aggregates, saves computer resources when processing aggregates, and dramatically cuts response time for database queries.
"Having the right aggregates available can improve runtime performance up to a thousand times according to leading data warehouse experts," said Vic Werner, director of marketing at Syncsort. "Sigma helps database administrators manage the explosive growth of data in their companies and meet the demands by their users for faster query processing of click-stream data from web sites, 24/7 business operations and electronic commerce initiatives. Sigma frees administrators to concentrate on selecting those aggregates that will optimize query performance."
Data warehouse administrators have long been vexed by the difficulty of creating aggregations of data. (An aggregate is defined by Ralph Kimball in his "Data Warehouse Toolkit" as "the items in a SQL Select list that are one of: SUM, COUNT, MIN, MAX, or AVG".) Aggregates are a way to improve query response time for frequently asked reporting questions such as "What were my sales numbers by region for the last month?"
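Kimball's definition is easier to see with a concrete query. The sketch below, a minimal illustration using Python's built-in SQLite module with invented table and column names, answers the "sales by region for a month" question once against the detail rows and once against a pre-computed aggregate table of the kind a product like Sigma is designed to build and maintain.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, sale_month TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EAST", "2000-02", 1200.0), ("EAST", "2000-02", 850.0),
     ("WEST", "2000-02", 430.0), ("WEST", "2000-01", 900.0)],
)

# The aggregate query end users actually ask, run against the detail rows ...
detail_query = ("SELECT region, SUM(amount) FROM sales "
                "WHERE sale_month = '2000-02' GROUP BY region")

# ... and a pre-computed aggregate table that answers it without scanning detail rows.
conn.execute("CREATE TABLE agg_sales_month AS "
             "SELECT region, sale_month, SUM(amount) AS total, COUNT(*) AS n "
             "FROM sales GROUP BY region, sale_month")

print(conn.execute(detail_query).fetchall())
print(conn.execute("SELECT region, total FROM agg_sales_month "
                   "WHERE sale_month = '2000-02'").fetchall())
```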
IT professionals have used Syncsort for many years to sort data into a predetermined order to reduce processing time in programs (e.g., sorting data for batch input to a COBOL program on mainframes), so the company has extensive experience in this area. Many data warehousing products take very counter-intuitive approaches to creating aggregates. Syncsort, in contrast, should have a strong influence in this market because IT professionals are already familiar with the Syncsort product family.
Customers evaluating aggregation engines should include Syncsort's Sigma product on their list of vendors to be considered. The importance of aggregation in a data warehouse for reducing query response time and overall database load cannot be overstated. Syncsort has much experience in the details of sorting data, and will doubtless make a strong impact on this market. The largest drawback we currently see in the product is its inability to output data to IBM's DB2 Universal Database except via an ODBC driver. However, it does have native support for Oracle and Microsoft SQL Server.

SOURCE:http://www.technologyevaluation.com/research/articles/syncsort-sigma-manages-database-aggregates-15977/
newer post

5 Signs of an Aging Data Warehouse

0 comments
It’s not a stretch to say that data warehouses are maturing. Most of our clients’ warehouses and marts have not only reached adulthood, many are showing their age. This has prompted some executives and business users to inquire about their retirement. Read on to see whether or not predictions of your data warehouse’s demise are premature.
In consulting with companies on their data warehouses over the past 20 years, I’ve witnessed the evolution of data warehouses from a technology proof of concept reserved for the “early adopters” to a widespread solution found, in one form or another, in most of today’s companies. Information has gone from a luxury to a must-have for these companies, and thus their IT environments have likewise evolved to support data delivery. With this evolution comes the assumption of the data warehouse as a commodity. Data warehousing is simply not as sexy as it used to be, and many warehouses and marts are showing their age. Outdated ETL processes, shaky logical data models, slipshod business requirements, and an emerging “back burner” approach to new BI needs have put many otherwise valuable data warehouses in peril.
How does your data warehouse measure up? This report introduces five signs that indicate your data warehouse or mart might need help:
1. No description of the overall role of the warehouse
People assume—and we know what that means!—that the data warehouse is fulfilling a stated purpose, be it to integrate cross functional data, or to serve as the system-of-record for critical data subject areas like Customer. Although the warehouse may be operating well and providing data to a range of business users who are making powerful and differentiating business decisions, if there is no consensus around its purpose, it’s likely to be put on the chopping block when the inevitable IT budget cuts are taken up. I once sat in a meeting where an executive got no answer to the question, “What’s the EDW doing for us that it didn’t do last year?” After an uncomfortable silence, someone spoke up and added that new compliance measures were going to drive the need for data audits from the warehouse. Everyone breathed a sigh of relief, and the executive erased the data warehouse from her list. You can be sure, though, that the team convened the next day to come up with a mission statement and application inventory for the warehouse. Better late than never.
2. Lack of consistent business requirements
To practitioners who have been around data warehousing since the early days, this one is what Baseline Partner Jill Dyche calls “an old standby.” We learned the hard way that a business-driven data warehouse—in which the data answers key business questions, enabling end-users to take quick business action—is one of the basic tenets of a sustainable data warehouse.
You’d be surprised at how few companies actually create and maintain business requirements for BI applications in a consistent and repeatable way. A formal requirements definition process enlists usually reluctant business people to become involved in defining new information capabilities for the enterprise. It also makes it easier to incrementally extend an existing environment with new requirements to add functional capabilities. In our Data Warehouse Scorecard projects, Baseline actually maps the clients’ requirements against a comprehensive set of best practices. These include a solid overall requirements definition process; differentiating between business and data requirements; and making requirements activities iterative. The exercise usually illustrates that companies aren’t doing as good a job as they think they are when gathering and documenting BI business requirements.
3. Few internal checks and balances
You can usually spot a data warehouse development team that is under fire. They hoard project plans and don’t publicize successes. They aren’t user focused. They don’t really answer to anyone. Most of our clients now have some sort of governance body in place that establishes accountability measures for a range of IT projects, including the data warehouse. Governance committees not only sign off on new BI business cases, they approve capital expenditures and upgrades, making it easier for the data warehouse to deliver value to the business over time. This committee needs to understand how the data warehouse has delivered value in the past—back to the concept of internal PR!—so that it can approve its future growth. There are other checks and balances involved in the data warehouse. Data auditability, security, and control are common and often high-profile concerns. The data warehouse development team should be beholden to a Chief Security Officer or a privacy czar for the deployment and use of company-sensitive information.
4. Rigid and inflexible infrastructure
You’ve probably heard about a data warehouse that can’t support end-user, ad-hoc queries, or one that can support only ten concurrent queries. Many of these data warehouses have hit the funding wall and are thus constrained by inadequate resources. (Sorry, but I have to bring up that all-important internal PR on this one. If the value is easily perceived, funding is probably flowing.)
However, other data warehouses were simply built in an inflexible and rigid way. Bad designs, poor ETL processes, technologies that can’t scale, or standards and conventions that have become “religion” can reduce the data warehouse to a mere technology platform for discrete processing. We recently asked a Scorecard client, “If a new, high-profile business initiative required the addition of a dozen new tables and a few hundred gigabytes of data from a handful of source systems, would the work take weeks, months, or years?” The quantified answer isn’t as interesting to us as the reasons behind it, which point to the problem areas and the potential fixes.
5. Nothing new
It doesn’t matter how valuable your actuarial reporting is, or the fact that you can match your customers with the products they’ve purchased. If the data warehouse isn’t adding business value in a regular and visible way, it will eventually be at risk. Sure, as long as you have a consistent and updated return-on-investment story, you’re probably safe for now. But the constituency will take you for granted, and people’s attention will wander to the technologies or projects they perceive are adding bang for the buck. The data warehouse must be perceived to be continuously creating value, be it through the support of new business analytics, the deployment of new data, or the ability to make fresh and deliberate business decisions as a result of data inquiry. If no one has this opinion, the data warehouse is just another isolated island in a long archipelago of technologies.
In a follow-up article, I’ll suggest some tips and techniques for rejuvenating an aging data warehouse. Some may need major surgery, while others may only require a simple “nip and tuck.”

SOURCE:http://www.baseline-consulting.com/uploads/BCG_pub_5SignsOfAgingDW_2009.pdf
newer post