Wednesday, February 15, 2012

Informatica Scenario Based Questions - Part 2

1. Consider the following employees data as source

employee_id, salary
10, 1000
20, 2000
30, 3000
40, 5000


Q1. Design a mapping to load the cumulative sum of employee salaries into the target table.
The target table data should look as follows:

employee_id, salary, cumulative_sum
10, 1000, 1000
20, 2000, 3000
30, 3000, 6000
40, 5000, 11000

Solution:

Connect the source qualifier to an expression transformation. In the expression transformation, create a variable port V_cum_sal and in the expression editor write V_cum_sal + salary. Create an output port O_cum_sal and assign V_cum_sal to it. Connect employee_id, salary and O_cum_sal to the target.
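
For reference, a minimal Python sketch of the row-by-row logic these ports implement (Python stands in for the Informatica pipeline; the rows are the source data above):

rows = [(10, 1000), (20, 2000), (30, 3000), (40, 5000)]

v_cum_sal = 0                            # plays the role of the variable port V_cum_sal
for employee_id, salary in rows:
    v_cum_sal += salary                  # V_cum_sal = V_cum_sal + salary
    o_cum_sal = v_cum_sal                # O_cum_sal = V_cum_sal
    print(employee_id, salary, o_cum_sal)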


Q2. Design a mapping to get the previous row's salary for the current row. If no previous row exists for the current row, then the previous row salary should be displayed as null.
The output should look as follows:

employee_id, salary, pre_row_salary
10, 1000, Null
20, 2000, 1000
30, 3000, 2000
40, 5000, 3000

Solution:

Connect the source qualifier to an expression transformation. In the expression transformation, create a variable port V_count and increment it by one for each row entering the transformation. Also create a variable port V_salary and assign the expression IIF(V_count=1,NULL,V_prev_salary) to it. Then create one more variable port V_prev_salary and assign salary to it. Because ports are evaluated top to bottom, V_salary reads V_prev_salary before it is overwritten with the current row's salary, which is exactly what yields the previous row's value. Now create an output port O_prev_salary and assign V_salary to it. Connect the expression transformation to the target ports.

In the expression transformation, the ports will be

employee_id
salary
V_count=V_count+1
V_salary=IIF(V_count=1,NULL,V_prev_salary)
V_prev_salary=salary
O_prev_salary=V_salary
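
A minimal Python sketch of the same logic (rows taken from the source data above); it makes the port evaluation order visible:

rows = [(10, 1000), (20, 2000), (30, 3000), (40, 5000)]

v_count = 0
v_prev_salary = None
for employee_id, salary in rows:
    v_count += 1                                         # V_count = V_count + 1
    v_salary = None if v_count == 1 else v_prev_salary   # V_salary = IIF(V_count=1, NULL, V_prev_salary)
    v_prev_salary = salary                               # V_prev_salary = salary
    print(employee_id, salary, v_salary)                 # O_prev_salary = V_salary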


Q3. Design a mapping to get the next row salary for the current row. If there is no next row for the current row, then the next row salary should be displayed as null.
The output should look as follows:

employee_id, salary, next_row_salary
10, 1000, 2000
20, 2000, 3000
30, 3000, 5000
40, 5000, Null

Solution:

Step1: Connect the source qualifier to two expression transformations. In each expression transformation, create a variable port V_count and in the expression editor write V_count + 1. Now create an output port O_count in each expression transformation. In the first expression transformation, assign V_count to O_count. In the second expression transformation, assign V_count - 1 to O_count.

In the first expression transformation, the ports will be

employee_id
salary
V_count=V_count+1
O_count=V_count

In the second expression transformation, the ports will be

employee_id
salary
V_count=V_count+1
O_count=V_count-1

Step2: Connect both expression transformations to a joiner transformation and join them on the port O_count. Treat the first expression transformation as the master and the second one as the detail, and in the joiner specify the join type as Detail Outer Join, so the last row (which has no matching next row) is kept with a null salary. In the joiner transformation, check the sorted input property; a joiner accepts two branches of the same source pipeline only when sorted input is enabled.

Step3: Pass the output of the joiner transformation to the target table. From the joiner, connect the employee_id and salary ports obtained from the first expression transformation to the employee_id and salary ports in the target table. Then, from the joiner, connect the salary port obtained from the second expression transformation to the next_row_salary port in the target table.
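
A minimal Python sketch of the whole pipeline (rows from the source data above); the dict lookup plays the role of the detail outer join:

rows = [(10, 1000), (20, 2000), (30, 3000), (40, 5000)]

first = [(i + 1, emp, sal) for i, (emp, sal) in enumerate(rows)]   # O_count = V_count
second = {i: sal for i, (emp, sal) in enumerate(rows)}             # O_count = V_count - 1

for o_count, employee_id, salary in first:      # all master rows are kept
    next_row_salary = second.get(o_count)       # no match -> None, like the outer join
    print(employee_id, salary, next_row_salary)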


Q4. Design a mapping to find the sum of salaries of all employees; this sum should repeat for all the rows.
The output should look as follows:

employee_id, salary, salary_sum
10, 1000, 11000
20, 2000, 11000
30, 3000, 11000
40, 5000, 11000

Solution:

Step1: Connect the source qualifier to an expression transformation. In the expression transformation, create a dummy output port O_dummy and assign the value 1 to it.

In the expression transformation, the ports will be

employee_id
salary
O_dummy=1

Step2: Pass the output of the expression transformation to an aggregator transformation. Create a new output port O_sum_salary and in the expression editor write SUM(salary). Do not specify a group by on any port, so the aggregator returns a single row holding the total.

In the aggregator transformation, the ports will be

salary
O_dummy
O_sum_salary=SUM(salary)

Step3: Pass the outputs of the expression transformation and the aggregator transformation to a joiner transformation and join on the O_dummy port; this fans the single total row out to every detail row. In the joiner transformation, check the sorted input property; a joiner accepts two branches of the same source pipeline only when sorted input is enabled.

Step4: Pass the output of the joiner to the target table.
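
A minimal Python sketch (rows from the source data above); computing the total once and attaching it to every row mirrors the aggregator-plus-joiner design:

rows = [(10, 1000), (20, 2000), (30, 3000), (40, 5000)]

o_sum_salary = sum(sal for _, sal in rows)      # aggregator with no group by
for employee_id, salary in rows:                # join on O_dummy fans the total out
    print(employee_id, salary, o_sum_salary)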


2. Consider the following employees table as source

department_no, employee_name
20, R
10, A
10, D
20, P
10, B
10, C
20, Q
20, S


Q1. Design a mapping to load a target table with the following values from the above source.

department_no, employee_list
10, A
10, A,B
10, A,B,C
10, A,B,C,D
20, A,B,C,D,P
20, A,B,C,D,P,Q
20, A,B,C,D,P,Q,R
20, A,B,C,D,P,Q,R,S

Solution:

Step1: Use a sorter transformation and sort the data with department_no as the sort key (add employee_name as a secondary sort key, so the names appear in the alphabetical order shown), then pass the output to the expression transformation. In the expression transformation, the ports will be

department_no
employee_name
V_employee_list = IIF(ISNULL(V_employee_list),employee_name,V_employee_list||','||employee_name)
O_employee_list = V_employee_list

Step2: Now connect the expression transformation to a target table.
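
A minimal Python sketch (rows from the source data above); the running list is never reset, so it keeps growing across departments:

rows = sorted([(20, 'R'), (10, 'A'), (10, 'D'), (20, 'P'),
               (10, 'B'), (10, 'C'), (20, 'Q'), (20, 'S')])    # sorter step

v_employee_list = None
for department_no, employee_name in rows:
    v_employee_list = (employee_name if v_employee_list is None
                       else v_employee_list + ',' + employee_name)
    print(department_no, v_employee_list)                      # O_employee_list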


Q2. Design a mapping to load a target table with the following values from the above source.

department_no, employee_list
10, A
10, A,B
10, A,B,C
10, A,B,C,D
20, P
20, P,Q
20, P,Q,R
20, P,Q,R,S

Solution:

Step1: Use a sorter transformation and sort the data with department_no as the sort key (add employee_name as a secondary sort key, so the names appear in the alphabetical order shown), then pass the output to the expression transformation. In the expression transformation, the ports will be

department_no
employee_name
V_curr_deptno=department_no
V_employee_list = IIF(V_curr_deptno != V_prev_deptno, employee_name, V_employee_list||','||employee_name)
V_prev_deptno=department_no
O_employee_list = V_employee_list

Step2: Now connect the expression transformation to a target table.
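
A minimal Python sketch (rows from the source data above); comparing the current and previous department numbers is what resets the list:

rows = sorted([(20, 'R'), (10, 'A'), (10, 'D'), (20, 'P'),
               (10, 'B'), (10, 'C'), (20, 'Q'), (20, 'S')])    # sorter step

v_prev_deptno = None
v_employee_list = None
for department_no, employee_name in rows:
    if department_no != v_prev_deptno:             # V_curr_deptno != V_prev_deptno
        v_employee_list = employee_name            # new department: start over
    else:
        v_employee_list += ',' + employee_name     # same department: append
    v_prev_deptno = department_no
    print(department_no, v_employee_list)          # O_employee_list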


Q3. Design a mapping to load a target table with the following values from the above source.

department_no, employee_names
10, A,B,C,D
20, P,Q,R,S

Solution:

The first step is the same as in the above problem. Pass the output of the expression transformation to an aggregator transformation and specify group by on department_no; with group by set, the aggregator returns the last row of each group, which carries the complete list. Now connect the aggregator transformation to a target table.
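
A minimal Python sketch of the aggregation step; overwriting the dict entry per department keeps only the last (complete) list, mirroring the aggregator:

running = [(10, 'A'), (10, 'A,B'), (10, 'A,B,C'), (10, 'A,B,C,D'),
           (20, 'P'), (20, 'P,Q'), (20, 'P,Q,R'), (20, 'P,Q,R,S')]

last_per_dept = {}
for department_no, employee_list in running:
    last_per_dept[department_no] = employee_list   # later rows overwrite earlier ones
print(last_per_dept)                               # {10: 'A,B,C,D', 20: 'P,Q,R,S'}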

Informatica Scenario Based Interview Questions with Answers - Part 1

1. How to generate sequence numbers using expression transformation?

Solution:
In the expression transformation, create a variable port and increment it by 1. Then assign the variable port to an output port. In the expression transformation, the ports are:
V_count=V_count+1
O_count=V_count

2. Design a mapping to load the first 3 rows from a flat file into a target.

Solution:
You have to assign row numbers to each record. Generate the row numbers either with the expression transformation as mentioned above or with a sequence generator transformation.
Then pass the output to a filter transformation and specify the filter condition as O_count <= 3.
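
A minimal Python sketch (the sample rows are an assumption); enumerate plays the role of the row-number port and the if plays the role of the filter:

rows = ['a', 'b', 'c', 'd', 'e']                 # sample flat-file rows (assumed)

for o_count, row in enumerate(rows, start=1):    # expression / sequence generator
    if o_count <= 3:                             # filter: O_count <= 3
        print(row)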

3. Design a mapping to load the last 3 rows from a flat file into a target.

Solution:
Consider the source has the following data.
col
a
b
c
d
e

Step1: You have to assign row numbers to each record. Generate the row numbers using the expression transformation as mentioned above and call the generated row number port O_count. Create a DUMMY output port in the same expression transformation and assign 1 to it, so that the DUMMY output port always returns 1 for each row.

In the expression transformation, the ports are
V_count=V_count+1
O_count=V_count
O_dummy=1

The output of expression transformation will be
col, o_count, o_dummy
a, 1, 1
b, 2, 1
c, 3, 1
d, 4, 1
e, 5, 1

Step2: Pass the output of the expression transformation to an aggregator and do not specify any group by condition. Create an output port O_total_records in the aggregator and assign the O_count port to it. Without a group by, the aggregator returns only the last row; its output therefore contains the DUMMY port with value 1 and the O_total_records port holding the total number of records in the source.

In the aggregator transformation, the ports are
O_dummy
O_count
O_total_records=O_count

The output of aggregator transformation will be
O_total_records, O_dummy
5, 1

Step3: Pass the outputs of the expression transformation and the aggregator transformation to a joiner transformation and join on the DUMMY port. In the joiner transformation, check the sorted input property; a joiner accepts two branches of the same source pipeline only when sorted input is enabled.

In the joiner transformation, the join condition will be
O_dummy (port from aggregator transformation) = O_dummy (port from expression transformation)

The output of joiner transformation will be
col, o_count, o_total_records
a, 1, 5
b, 2, 5
c, 3, 5
d, 4, 5
e, 5, 5

Step4: Now pass the output of the joiner transformation to a filter transformation and specify the filter condition as O_total_records (port from aggregator) - O_count (port from expression) <= 2.

In the filter transformation, the filter condition will be
O_total_records - O_count <=2

The output of filter transformation will be
col, o_count, o_total_records
c, 3, 5
d, 4, 5
e, 5, 5
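
A minimal Python sketch of the four steps (rows from the example above):

rows = ['a', 'b', 'c', 'd', 'e']

numbered = [(i + 1, r) for i, r in enumerate(rows)]   # Step1: O_count
o_total_records = numbered[-1][0]                     # Step2: aggregator keeps the last row
for o_count, row in numbered:                         # Step3: joined back on O_dummy
    if o_total_records - o_count <= 2:                # Step4: filter condition
        print(row, o_count, o_total_records)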

4. Design a mapping to load the first record from a flat file into table A, the last record into table B, and the remaining records into table C.

Solution:
This is similar to the above problem; the first 3 steps are the same. In the last step, instead of using a filter transformation, use a router transformation and create two output groups in it.

In the first group, the condition should be O_count=1 and connect the corresponding output group to table A. In the second group, the condition should be O_count=O_total_records and connect the corresponding output group to table B. The output of default group should be connected to table C.
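
A minimal Python sketch of the router step (rows from the example above); the three lists stand in for the three targets:

rows = ['a', 'b', 'c', 'd', 'e']
o_total_records = len(rows)

table_a, table_b, table_c = [], [], []
for o_count, row in enumerate(rows, start=1):
    if o_count == 1:
        table_a.append(row)                  # group 1: O_count = 1
    elif o_count == o_total_records:
        table_b.append(row)                  # group 2: O_count = O_total_records
    else:
        table_c.append(row)                  # default group
print(table_a, table_b, table_c)             # ['a'] ['e'] ['b', 'c', 'd']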

5. Consider the following products data which contain duplicate rows.
A
B
C
C
B
D
B

Q1. Design a mapping to load all unique products in one table and the duplicate rows in another table.
The first table should contain the following output
A
D

The second target should contain the following output
B
B
B
C
C

Solution:
Use a sorter transformation to sort the products data. Pass the output to an expression transformation and create a dummy output port O_dummy, assigning 1 to it so that it always returns 1 for each row.

The output of expression transformation will be
Product, O_dummy
A, 1
B, 1
B, 1
B, 1
C, 1
C, 1
D, 1

Pass the output of the expression transformation to an aggregator transformation. Check group by on the product port. In the aggregator, create an output port O_count_of_each_product and write the expression COUNT(product).

The output of aggregator will be
Product, O_count_of_each_product
A, 1
B, 3
C, 2
D, 1

Now pass the outputs of the expression transformation and the aggregator transformation to a joiner transformation and join on the product port. In the joiner transformation, check the sorted input property; a joiner accepts two branches of the same source pipeline only when sorted input is enabled.

The output of joiner will be
product, O_dummy, O_count_of_each_product
A, 1, 1
B, 1, 3
B, 1, 3
B, 1, 3
C, 1, 2
C, 1, 2
D, 1, 1

Now pass the output of the joiner to a router transformation, create one group and specify the group condition as O_dummy=O_count_of_each_product. Then connect this group to one table. Connect the output of the default group to another table.
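
A minimal Python sketch (products from the example above); Counter plays the role of the aggregator and the per-row comparison plays the role of the router condition:

from collections import Counter

products = ['A', 'B', 'C', 'C', 'B', 'D', 'B']
counts = Counter(products)                    # aggregator grouped by product

unique, duplicates = [], []
for p in sorted(products):                    # sorter step
    (unique if counts[p] == 1 else duplicates).append(p)
print(unique)                                 # ['A', 'D']
print(duplicates)                             # ['B', 'B', 'B', 'C', 'C']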

Q2. Design a mapping to load each product once into one table and the remaining products which are duplicated into another table.
The first table should contain the following output
A
B
C
D

The second table should contain the following output
B
B
C

Solution:
Use a sorter transformation to sort the products data. Pass the output to an expression transformation and create a variable port, V_curr_product, and assign the product port to it. Then create a V_count port and in the expression editor write IIF(V_curr_product=V_prev_product, V_count+1, 1). Create one more variable port V_prev_product and assign the product port to it. Now create an output port O_count and assign the V_count port to it.

In the expression transformation, the ports are
Product
V_curr_product=product
V_count=IIF(V_curr_product=V_prev_product,V_count+1,1)
V_prev_product=product
O_count=V_count

The output of expression transformation will be
Product, O_count
A, 1
B, 1
B, 2
B, 3
C, 1
C, 2
D, 1

Now pass the output of the expression transformation to a router transformation, create one group and specify the condition as O_count=1. Then connect this group to one table. Connect the output of the default group to another table.
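
A minimal Python sketch (products from the example above) of the running per-product count:

products = sorted(['A', 'B', 'C', 'C', 'B', 'D', 'B'])   # sorter step

v_prev_product, v_count = None, 0
first_seen, repeats = [], []
for product in products:
    v_count = v_count + 1 if product == v_prev_product else 1   # IIF(...) logic
    v_prev_product = product
    (first_seen if v_count == 1 else repeats).append(product)   # router: O_count = 1
print(first_seen)   # ['A', 'B', 'C', 'D']
print(repeats)      # ['B', 'B', 'C']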




Convert single row from source to three rows in target

Scenario1:
We have a source table containing 3 columns: Col1, Col2 and Col3. There is only 1 row in the table, as follows:
 
Col1 Col2 Col3
a b c

There is a target table containing only 1 column, Col. Design a mapping so that the target table contains 3 rows as follows:
 
Col
a
b
c
Do this without using the Normalizer transformation.
Solution:
 
Create 3 expression transformations, exp_1, exp_2 and exp_3, with 1 port each. Connect Col1 from the source qualifier to the port in exp_1, Col2 to the port in exp_2, and Col3 to the port in exp_3. Make 3 instances of the target. Connect the port from exp_1 to target_1, the port from exp_2 to target_2, and the port from exp_3 to target_3.
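
A minimal Python sketch of the fan-out; each column feeds its own branch, just as each expression transformation feeds its own target instance:

source_row = {'Col1': 'a', 'Col2': 'b', 'Col3': 'c'}

for col in ('Col1', 'Col2', 'Col3'):   # one branch per column
    print(source_row[col])             # each branch writes to its own target instance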

Sunday, February 12, 2012

Optimal decision tree based multi-class support vector machine

In this paper, a decision tree SVM architecture is constructed to solve multi-class problems. To maintain high generalization ability, the optimal structure of the decision tree is determined using statistical measures of class separability. The proposed optimal decision tree SVM (ODT-SVM) takes advantage of both the efficient computation of the decision tree architecture and the high classification accuracy of SVM. A robust non-parametric test is carried out for statistical comparison of the proposed ODT-SVM with other classifiers over multiple data sets. Performance is evaluated in terms of classification accuracy and computation time. The statistical analysis on UCI repository datasets indicates that the ten-fold cross-validation accuracy of our proposed framework is significantly better than that of widely used multi-class classifiers. Experimental results and statistical tests show that the proposed ODT-SVM is significantly better than conventional OvO and OAA in terms of both training and testing time.

Povzetek (Slovenian abstract): The decision tree method with SVM achieves significantly better results than the original SVM.

Keywords: support vector machine, decision tree, class separability, information gain, Gini index, Chi-squared, interclass scatter, intraclass scatter.

1 Introduction

Support Vector Machine (SVM) has proved to be a successful learning machine in the literature, especially for classification. SVM is based on statistical learning theory developed by Vapnik [6, 25]. Since it was originally designed for binary classification [3], it is not easy to extend binary SVM to multi-class problems. Constructing k-class SVMs (k > 2) is an ongoing research issue [1, 4]. Two approaches are suggested in the literature to solve multi-class SVM. One considers all data in one optimization [7]. The other decomposes the multi-class problem into a series of binary SVMs, such as "One-Against-All" (OAA) [25] and "One-versus-One" (OvO) [16].
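
For context, a minimal scikit-learn sketch (an illustration only, not the paper's ODT-SVM) contrasting the two decompositions: OvO trains k(k-1)/2 binary SVMs, OAA trains k of them:

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                        # k = 3 classes

ovo = OneVsOneClassifier(SVC(kernel='rbf')).fit(X, y)    # k(k-1)/2 = 3 binary SVMs
oaa = OneVsRestClassifier(SVC(kernel='rbf')).fit(X, y)   # k = 3 binary SVMs
print(len(ovo.estimators_), len(oaa.estimators_))        # 3 3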
$IG(C \mid E) = H(C) - H(C \mid E)$  (4)

where $IG(C \mid E)$ is the information gain of the class label $C$ for a given attribute $E$, $H(C)$ is the system's entropy and $H(C \mid E)$ is the system's relative entropy when the value of the attribute $E$ is known.

The system's entropy indicates its degree of disorder and is given by the following formula

$H(C) = -\sum_i P(C_i) \log P(C_i)$  (5)

where $P(C_i)$ is the probability of class $C_i$. The relative entropy is calculated as

$H(C \mid E) = -\sum_j P(e_j) \sum_i P(C_i \mid e_j) \log P(C_i \mid e_j)$  (6)

where $P(e_j)$ is the probability of value $j$ for attribute $E$, and $P(C_i \mid e_j)$ is the probability of class $C_i$ given $e_j$.

The optimal binary SVM model is selected on the basis of the maximum value of the class-separability measure, which signifies more separability between patterns belonging to two different classes. For a given independent binary SVM containing $n_i$ elements of $C_i$ and $n_j$ elements of $C_j$, it can be calculated as

[equations (7)-(9) not reproducible in the source]

where TP, FP, TN and FN denote the numbers of true positive, false positive, true negative and false negative data points, respectively.

A higher value of the measure signifies less overlap, i.e. more distance, between two different classes of data points. Hence it is a natural measure of the separability of different classes of data points.

Similarly, for every independent binary OAA SVM, assume there are two classes, $C_i$ and $C_j$, and the training set $D$ contains $n_i$ elements of $C_i$ and $n_j$ elements of $C_j$. The measure for a given OAA SVM model $i$ can be calculated from

[equations (19)-(21) not reproducible in the source]

Using the kernel trick [7], data points from $C_i$ and $C_j$ are implicitly mapped from $\mathbb{R}^d$ to a high-dimensional feature space $H$. Let $\phi(x): \mathbb{R}^d \to H$ denote the mapping and $k_\theta(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$ denote the kernel function, where $\theta$ is the set of kernel parameters and $\langle \cdot, \cdot \rangle$ is the inner product. $K$ denotes the kernel matrix, with entries $K_{ij} = k_\theta(x_i, x_j)$. Let $K_{A,B}$ be the kernel matrix computed with the data points from $A$ and $B$, which denote two subsets of the training sample set $D$. Let $S_b^\phi$ and $S_w^\phi$ denote the between-class scatter matrix and the within-class scatter matrix in $H$, respectively, defined as follows

[equations (22) and (23) not reproducible in the source]


Email task, Session and Workflow notification : Informatica

One of the advantages of using ETL tools is that functionality such as monitoring, logging and notification is either built in or very easy to incorporate into your ETL with minimal coding. This post explains the Email task, which is part of the notification framework in Informatica. I have added some guidelines at the end on a few standard practices when using email tasks and the reasons behind them.
1. Workflow and session details.
2. Creating the Email Task (Re-usable)
3. Adding Email task to sessions
4. Adding Email Task at the Workflow Level
5. Emails in the Parameter file (Better maintenance, Good design).
6. Standard (Good) Practices
7. Common issues/Questions
1. Workflow and session details.
Here is the sample workflow that I am using. The workflow (wkf_Test) has 2 sessions.
s_m_T1 : Loads data from Source to Staging table (T1).
s_m_T2 : Loads data from Staging (T1) to Target (T2).
The actual mappings are almost irrelevant for this example, but we need at least two sessions to illustrate the different scenarios possible.
Test workflow with the two sessions
2. Creating the Email Task (Re-usable)
Why re-usable? Because we'd be using the same email task for all the sessions in this workflow.
1. Go to Workflow Manager and connect to the repository and the folder in which your workflow is present.
2. Go to the Workflow Designer Tab.
3. Click on Workflow > Edit (from the menu) and create a workflow variable as below (to hold the failure email address).
Failure Email workflow variable
4. Go to the “Task Developer” Tab and click create from the menu.
5. Select “Email Task”, enter “Email_Wkf_Test_Failure” for the name (since this email task is for different sessions in wkf_test).
Click “Create” and then “Done”. Save changes (Repository -> Save or the good old ctrl+S).
6. Double click on the Email Task and enter the following details in the properties tab.
Email User Name : $$FailureEmail   (Replace the pre-populated service variable $PMFailureEmailUser,
                                    since we will be setting this for each workflow as needed).
Email subject   : Informatica workflow ** WKF_TEST **  failure notification.
Email text      : (see below. Note that the server variables might appear disabled, but they will be available at run time).
Please see the attached log for details. Contact ETL_RUN_AND_SUPPORT@XYZ.COM for further information.
 
%g
Folder : %n
Workflow : wkf_test
Session : %s
Create Email Task
3. Adding Email task to sessions
7. Go to the Workflow Tab and double click on session s_m_T1. You should see the “edit task” window.
8. Make sure "Fail parent if this task fails" is checked on the General tab and "Stop on errors" is set to 1 on the Config Object tab.
Go to “Components” tab.
9. In the "On failure email" section, select "Reusable" for Type and click the LOV in the Value column.
10. Select the email task that we just created (Email_Wkf_Test_Failure), and click OK.
Adding Email Task to a session
4. Adding Email Task at the Workflow Level
Workflow-level failure/suspension email.
If you are already implementing the failure email for each session (and getting the session log for the failed session), then you should consider just suspending the workflow. If you don’t need session level details, using the workflow suspension email makes sense.
There are two settings you need to set for Failure notification emails at workflow level.
a) Suspend on error (Check)
b) Suspension email (select the email task as before). Again, remember that if you have both session- and workflow-level emails, you'll get two emails when a session fails and causes the parent to fail.
Informatica workflow suspension email
Workflow success email
In some cases, you might have a requirement to add a success email once the entire workflow is complete.
This helps people know the workflow status for the day without having to access workflow monitor or asking run teams for the status each day. This is particularly helpful for business teams who are more concerned whether the process completed for the day.
1) Go to the workflow tab in workflow manager and click Task > Create > Email Task.
2) Enter the name of the email task and click OK.
3) In the General tab, select "Fail parent if this task fails". In the Properties tab, add the necessary details.
Note that the session variables are no longer available, since they are only applicable at the session level.
4) Add the necessary link condition, e.g. $s_m_T2.Status = SUCCEEDED, on the links from the preceding tasks.
Here’s how your final workflow will look.
Informatica success emails
5. Emails in the Parameter file (Better maintenance, Good design).
We’ve created the workflow variable $$FailureEmail and used it in the email task. But how and when is the value assigned?
You can manage the failure emails by assigning the value in the parameter file.
Here is my parameter file for this example. You can separate multiple emails using commas.
infa@ DEV /> cat wkf_test.param
[rchamarthi.WF:wkf_Test]
$$FailureEmail=rajesh@etl-developer.com
 
[rchamarthi.WF:wkf_Test.ST:s_m_T1]
$DBConnection_Target=RC_ORCL102
 
[rchamarthi.WF:wkf_Test.ST:s_m_T2]
$DBConnection_Target=RC_ORCL102
While it might look like a simpler approach initially, hard-coding email IDs in the email task is a bad idea. Here's why.
Like every other development cycle, Informatica ETLs go through Dev, QA and Prod, and the failure email for each environment will be different. When you promote components from Dev to QA and then to Prod, everything from mapping to session to workflow should be identical in all environments. Anything that changes or might change should be handled using parameter files (similar to env files in Unix). This also works the other way around: when you copy a workflow from Production to Development and make changes, the failure emails will not go to business users or QA teams, because the development parameter file only has the developer email IDs.
If you use parameter files, here is how it would be set up in different environments once.
After the initial set up, you’ll hardly change it in QA and Prod and migrations will never screw this up.
In development   : $$FailureEmail=developer1@xyz.com,developer2@xyz.com
In QA / Testing  : $$FailureEmail=developer1@xyz.com,developer2@xyz.com,QA_TEAM@xyz.com
In Production    : $$FailureEmail=IT_OPERATIONS@xyz.com,ETL_RUN@xyz.com,BI_USERS@xyz.com
6. Standard (Good) Practices
These are some of the standard practices related to Email Tasks that I would recommend. The reasons have been explained above.
a) A reusable email task that is used by all sessions in the workflow.
b) Suspend on error set at the workflow level and a failure email specified for each session.
c) Fail parent if this task fails (might not be applicable in 100% of cases).
d) A workflow success email (based on requirements).
e) Emails mentioned only in the parameter file (no hard-coding).
7. Common issues/Questions
Warning "unused variable $$FailureEmail" and/or no failure emails:
Make sure you use the double dollar sign, as all user-defined variables should (unless you are just using the integration service variable $PMFailureEmailUser). Once that is done, the reason for the above warning and/or the missing failure email could be:
a) you forgot to declare the workflow variable as described in step 3 above, or
b) the workflow parameter file is not being read correctly (wrong path, no read permissions, invalid parameter file entry, etc.).
Once you fix these two, you should be able to see the success and failure emails as expected.

Information Technology – A Definition:

We use the term information technology or IT to refer to an entire industry. In actuality, information technology is the use of computers and software to manage information. In some companies, this is referred to as Management Information Services (or MIS) or simply as Information Services (or IS). The information technology department of a large company would be responsible for storing information, protecting information, processing the information, transmitting the information as necessary, and later retrieving information as necessary.
History of Information Technology:

In relative terms, it wasn't long ago that the Information Technology department might have consisted of a single Computer Operator, who might be storing data on magnetic tape, and then putting it in a box down in the basement somewhere. The history of information technology is fascinating! Check out these history of information technology resources for information on everything from the history of IT to electronics inventions and even the top 10 IT bugs.
Modern Information Technology Departments:

In order to perform the complex functions required of information technology departments today, the modern Information Technology Department would use computers, servers, database management systems, and cryptography. The department would be made up of several System Administrators, Database Administrators and at least one Information Technology Manager. The group usually reports to the Chief Information Officer (CIO).