Monday 18 November 2013

Unix Commands

Man command.
man is the help command; it displays the online manual pages. You can also use man in conjunction with any command to learn more about that command. For example:
  • man ls explains the ls command and how you can use it.
  • man -k pattern searches the manual page descriptions for the given pattern.
Banner command.
banner prints its arguments as a sort of ASCII-art poster. For example, to print wait in big letters, I will type
banner wait at the Unix command line or in my script. This is how it will look.
  
 #    #    ##       #     #####
 #    #   #  #      #       #
 #    #  #    #     #       #
 # ## #  ######     #       #
 ##  ##  #    #     #       #
 #    #  #    #     #       #
 
Cal command
cal will print the calendar of the current month by default. If you want to print the calendar for August 1965, the eighth month of 1965,
cal 8 1965 will print the following results.
    August 1965
 S  M Tu  W Th  F  S
 1  2  3  4  5  6  7
 8  9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25 26 27 28
29 30 31
 
Clear command
clear clears the screen and puts the cursor at the beginning of the first line.
Calendar command
calendar reads your calendar file and displays only the lines that match the current day.
For example in your calendar file if you have this
12/20   Test new software.
1/15    Test newly developed 3270 product.
1/20    Install memory on HP 9000 machine.
On December 20th the first line will be displayed. You can use this command from your crontab file or in your login files.
Nohup command.
nohup, if added in front of any command, will keep the command or process running even if you shut down your terminal or close your session to the machine. For example, suppose I want to run a job called update_entries_tonight that takes a lot of time and must be run from a terminal.
nohup update_entries_tonight will keep the job running even if the terminal is shut down in the middle of the job.
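A common way to run it (a sketch; tonight.log is a hypothetical file name) is to background the job and redirect its output; without redirection, nohup appends output to a file named nohup.out:
nohup update_entries_tonight > tonight.log 2>&1 &
The trailing & puts the job in the background so you get your prompt back, and tonight.log collects both standard output and errors.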
Tty command
tty displays the name of your terminal device. Syntax is
tty [options]
Options
  • -l will print the synchronous line number.
  • -s suppresses output and returns only an exit code: 0 (a terminal), 1 (not a terminal), 2 (invalid options). This is good for scripts; see the sketch below.
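A minimal sketch of the -s option in a script, testing whether you are attached to a terminal:
if tty -s
then
    echo "running interactively"
fi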

File Management commands.


Pwd command.
pwd prints your current working directory on screen; pwd means print working directory.
 /u0/ssb/sandeep
is the output for the command when I use pwd in the /u0/ssb/sandeep directory.
Ls command
ls is the most widely used command; it displays the contents of a directory.
options
  • ls will list the files in the current directory. This command has many options.
  • ls -l will list all the file names, permissions, group, etc. in long format.
  • ls -a will list all the files, including hidden files whose names start with a dot (.).
  • ls -lt will list all files sorted by modification time, newest files first.
  • ls -Fx will list files in rows, with directory names followed by a slash.
  • ls -R will list all the files, and the files in all subdirectories, recursively.
  • ls -R | more will list all the files, and the files in all subdirectories, one page at a time.
Mkdir command.
mkdir balaji will create a new directory; here, the directory balaji is created.
Cd command.
cd balaji will change the current directory to the balaji directory.
Use pwd to check your current directory, and ls to see whether the balaji directory is there or not.
You can then use cd balaji to change to this new directory.

Cat command
cat cal.txt displays the contents of a file (here cal.txt) on screen (standard output).

Head command.
head filename by default will display the first 10 lines of a file.
If you want the first 50 lines you can use head -50 filename, or for 37 lines head -37 filename, and so forth.
Tail command.
tail filename by default will display the last 10 lines of a file.
If you want last 50 lines then you can use tail -50 filename.
More command.
more will display a file one page at a time, waiting for input (the spacebar) before showing the next page. For example, if you have a file that is 500 lines long and you want to read it all, you can use
more filename
Wc command
wc command counts the characters, words or lines in a file depending upon the option.
Options
  • wc -l filename will print total number of lines in a file.
  • wc -w filename will print total number of words in a file.
  • wc -c filename will print total number of characters in a file.
File command.
file reports the type of a given file, i.e. whether it is a text (ASCII) or binary file. To use it, type
file filename. For example, I have cal.txt, which contains ASCII characters for the calendar of the current month, and resume1.doc, which is a binary Microsoft Word file. I will get
file resume1.doc
resume1.doc:    data   
file cal.txt
cal.txt:        ascii text

Cp command.
cp copies a file. If I want to copy a file named oldfile in the current directory to a file named newfile in the current directory:
cp oldfile newfile
If I want to copy oldfile to another directory, for example /tmp, then
cp oldfile /tmp/newfile. Useful options available with cp are -p and -r. The -p option preserves the modification time and permissions; -r recursively copies a directory and its files, duplicating the tree structure.

Rcp command.
rcp will copy files between two Unix systems and works just like the cp command (it takes -p and -r options too).
For example, if you are on a Unix system called Cheetah and want to copy a file from the current directory to the /usr/john/ directory on a system called lion, you can use the rcp command:
rcp filename lion:/usr/john
You will also need permissions set up between the two machines. For more information, type man rcp at the command line.

Mv command.
mv command is used to move a file from one directory to another directory or to rename a file.
Some examples:
  • mv oldfile newfile will rename oldfile to newfile.
  • mv -i oldfile newfile for confirmation prompt.
  • mv -f oldfile newfile will force the rename even if target file exists.
  • mv * /usr/bajwa/ will move all the files in current directory to /usr/bajwa directory.

Ln command.
Instead of copying, you can also make links to existing files using the ln command.
If you want to make mycoolfile available as /usr/local/bin/coolfile, you can enter this command.
ln mycoolfile /usr/local/bin/coolfile
Some examples:
  • ln -s fileone filetwo will create a symbolic link; symbolic links can point across file systems.
  • ln -n will not overwrite existing files.
  • ln -f will force the link to occur.

Rm command.
To delete files use rm command.
Options:
  • rm oldfile will delete file named oldfile.
  • rm -f option will remove write-protected files without prompting.
  • rm -r option will delete the entire directory as well as all the subdirectories, very dangerous command.

Rmdir command.
rmdir will remove a directory (or directories), provided the directory is empty.
Options:
  • rm -r directory_name will remove all files even if directory is not empty.
  • rmdir sandeep is how you use it to remove sandeep directory.
  • rmdir -p will remove directories and any parent directories that are empty.
  • rmdir -s will suppress standard error messages caused by -p.

Comparison and Searching


Diff command.
diff will compare two files and print out the differences between them.
Here I have two ASCII text files, fileone and filetwo.
Contents of fileone are
This is first file 
this is second line
this is third line
this is different    as;lkdjf 
this is not different

filetwo contains
This is first file
this is second line
this is third line
this is different    xxxxxxxas;lkdjf
this is not different

diff fileone filetwo will give the following output
 4c4
< this is different    as;lkdjf
---
> this is different    xxxxxxxas;lkdjf

Cmp command.
cmp compares two files. For example, I have two different files, fileone and filetwo.
cmp fileone filetwo will give me
fileone filetwo differ: char 80, line 4 

If I run cmp on identical files, nothing is returned.
The -s option can be used to return only exit codes: 0 if the files are identical, 1 if the files are different, 2 if the files are inaccessible.
The following command prints the message 'no changes' if the files are the same (here file1 is a copy of fileone):
cmp -s fileone file1 && echo 'no changes'
no changes

Dircmp Command.
dircmp compares two directories. If I have two directories in my home directory named
dirone and dirtwo, and each has 5-10 files in it, then
dircmp dirone dirtwo will return this
 Dec  9 16:06 1997  dirone only and dirtwo only Page 1
 
 
./cal.txt                                   ./fourth.txt
./dohazaar.txt                              ./rmt.txt
./four.txt                                  ./te.txt
./junk.txt                                  ./third.txt
./test.txt 

Grep Command
grep is the most useful search command. You can use it to find processes running on the system, to find a pattern in a file, and so on. It can be used to search one or more files to match an expression.
It can also be used in conjunction with other commands, as in the following example, where the output of the ps command is piped to grep; this means: search all processes in the system and find the pattern sleep.
ps -ef | grep sleep will display all the sleep processes running in the system, as follows.
     ops 12964 25853  0 16:12:24 ttyAE/AAES  0:00 sleep 60
     dxi 12974 15640  0 16:12:25 ttyAH/AAHP  0:00 sleep 60
     ops 12941 25688  0 16:12:21 ttyAE/AAEt  0:00 sleep 60
     ops 12847 25812  0 16:11:59 ttyAH/AAH6  0:00 sleep 60
     ops 12894 25834  0 16:12:12 ttyAE/AAEX  0:00 sleep 60
     dxi 13067 27253  2 16:12:48 ttyAE/ABEY  0:00 sleep 1
     ops 13046 25761  0 16:12:44 ttyAE/AAE0  0:00 sleep 60
     dxi 12956 13078  0 16:12:23 ttyAG/AAG+  0:00 sleep 60
     ops 12965 25737  0 16:12:24 ttyAE/AAEp  0:00 sleep 60
     ops 12989 25778  0 16:12:28 ttyAH/AAHv  0:00 sleep 60
     ssb 13069 26758  2 16:12:49 ttyAH/AAHs  0:00 grep sleep
     pjk 27049  3353  0 15:20:23 ?           0:00 sleep 3600
Options:
  • -b option will precede each line with its block number.
  • -c option will only print the count of matched lines.
  • -i ignores uppercase and lowercase distinctions.
  • -l lists filenames but not matched lines.
Other commands associated with grep are egrep and fgrep; egrep typically runs faster. For more information type man egrep or man fgrep on your system. A quick example combining the options above follows.
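Reusing fileone and filetwo from the diff example above (a small sketch):
grep -c first fileone
prints 1, the number of lines in fileone matching first, and
grep -il first fileone filetwo
lists only the names of the files containing the pattern, ignoring case.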

Find command.
find is an extremely useful command. You can search for any file anywhere using this command, provided the files and directories you are searching have read attributes set for you, your group, or all. find descends the directory tree beginning at each pathname and finds the files that meet the specified conditions. Here are some examples.

Some Examples:
find $HOME -print will list all files in your home directory.
find /work -name chapter1 -print will list all files named chapter1 in the /work directory.
find / -type d -name 'man*' -print will list all manpage directories.
find / -size 0 -ok rm {} \; will remove all empty files on the system, asking for confirmation on each.
Conditions of find:
  • -atime +n | -n | n will find files that were last accessed more than n days ago, less than n days ago, or exactly n days ago.
  • -ctime +n | -n | n will find files that were changed more than n days ago, less than n days ago, or exactly n days ago.
  • -depth descends the directory structure, working on actual files first and then on directories. You can use it with the cpio command.
  • -exec command {} \; runs the given Unix command on each file matched by find. A very useful condition.
  • -print prints or lists the results to standard output (the screen).
  • -name pattern finds files matching the pattern.
  • -perm nnn finds files whose permission flags match the octal number nnn.
  • -size n finds files that contain n blocks.
  • -type c finds files whose type is c, where c can be b (block special file), c (character special file), d (directory), p (FIFO or named pipe), l (symbolic link), or f (plain file).
Text processing


Cut command.
cut selects a list of columns or fields from one or more files.
The -c option is for columns and -f for fields. It is entered as
cut options [files]
for example if a file named testfile contains
this is firstline
this is secondline
this is thirdline
Examples:
cut -c1,4 testfile will print this to standard output (screen)
ts
ts
ts
It is printing columns 1 and 4 of this file, which contain t and s (part of this).
Options:
  • -c list cuts the column positions identified in list.
  • -f list cuts the fields identified in list; see the sketch below.
  • -s can be used with -f to suppress lines without delimiters.
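A field-based sketch using a real delimited file: /etc/passwd uses colons between fields, and the first field is the login name.
cut -d: -f1 /etc/passwd
The -d option names the delimiter; without it, -f assumes fields are separated by tabs.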

Paste Command.
paste merges the lines of one or more files into vertical columns separated by a tab.
for example if a file named testfile contains
this is firstline
and a file named testfile2 contains
this is testfile2
then running this command
paste testfile testfile2 > outputfile
will put this into outputfile
this is firstline       this is testfile2 
It contains the contents of both files in columns.
who | paste - - will list users in two columns.
Options:
  • -d'char' separate columns with char instead of a tab.
  • -s merge subsequent lines from one file.

Sort command.
sort sorts the lines of a file or files in alphabetical order. For example, if you have a file named testfile with these contents
zzz
aaa
1234
yuer
wer
qww
wwe
Then running
sort testfile
will give us output of
1234
aaa
qww
wer
wwe
yuer
zzz
Options:
  • -b ignores leading spaces and tabs.
  • -c checks whether the files are already sorted.
  • -d ignores punctuation (dictionary order).
  • -i ignores non-printing characters.
  • -n sorts in arithmetic (numeric) order.
  • -o file puts the output in a file.
  • +n[-m] skips n fields before sorting, and sorts up to field position m.
  • -r reverses the order of the sort.
  • -u makes identical input lines appear only once in the output.

Uniq command.
uniq removes duplicate adjacent lines from a sorted file, sending one copy of each line to the output (or to a second file, if one is given).
Examples

sort names | uniq -d will show which lines appear more than once in names file.
Options:
  • -c print each line once, counting instances of each.
  • -d print duplicate lines once, but no unique lines.
  • -u print only unique lines.

Awk and Nawk command.
awk is a small scripting language built into all Unix systems, used mostly for text processing.
Here are some examples which are connected with other commands.
Examples:
df -t | awk 'BEGIN {tot=0} $2 == "total" {tot=tot+$1} END {print (tot*512)/1000000}' will give the total space in your system in megabytes.
Here the output of the command df -t is piped into awk, which accumulates field 1 on the lines where field 2 is "total". In the same way, if you change $1 to $4, it will accumulate and display the sum of field 4,
which is used space.
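Another small sketch in the same spirit, assuming the usual ls -l output layout where field 5 is the file size:
ls -l | awk '{ tot = tot + $5 } END { print tot }'
This pipes a long directory listing into awk and prints the sum of the size column, i.e. the total bytes used by files in the current directory.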
for more information about awk and nawk command in your system enter man awk or man nawk.

Sed command.
sed is a stream editor that you can use from the command line.
You can also put your sed commands in a file and apply them to a text file with the -f option. It is invoked as
sed [options] files
options:
  • -e 'instruction' Apply the editing instruction to the files.
  • -f script Apply the set of instructions from the editing script.
  • -n suppress default output.
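Two small sketches, reusing the testfile from the cut example above:
sed -e 's/firstline/FIRSTLINE/' testfile
prints the file with firstline replaced by FIRSTLINE, and
sed -n '1p' testfile
prints only the first line (-n suppresses the default output, and 1p prints line one).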

for more information about sed, enter man sed at command line in your system.

Vi editor.
vi launches a visual editor. To edit a file, type
vi filename
vi is the default editor on all Unix systems. It has several modes. In order to type text you need to hit i to enter insert mode and then start typing. Make sure that your terminal has the correct settings; vt100 emulation works well if you are logged in from a PC.
Once you are done typing, hit <escape> to return to command mode, where you can write and search; :w filename will write the file,
and when you are done and want to save and exit,
:wq will write and exit (:q! quits without saving).
options:
  • i for insert mode.
    • I inserts text at the beginning of the line.
    • A appends text at the end of the line.
    • a appends text after the cursor.
    • O opens a new line of text above the cursor.
    • o opens a new line of text below the cursor.
  • : for command mode.
    • <escape> to invoke command mode from insert mode.
    • :!sh to run unix commands.
    • x to delete a single character.
    • dd to delete an entire line
    • ndd to delete n number of lines.
    • d$ to delete from cursor to end of line.
    • yy to copy a line to buffer.
    • P to paste text from buffer.
    • nyy copy n number of lines to buffer.
    • :%s/stringA/stringB/g to replace stringA with stringB in the whole file.
    • G to go to last line in file.
    • 1G to go to the first line in file.
    • w to move forward to next word.
    • b to move backwards to the previous word.
    • $ to move to the end of line.
    • J join a line with the one below it.
  • /string to search string in file.
  • n to search for the next occurrence of the string.

Report Testing

Report testing basic scenarios:

1. Verify data shown on the dashboard/reports is as expected.
2. Users can see reports according to their user profile (check authentication and authorization).
3. Verification of the report layout to ensure it is in sync with the original document.
4. Verification of the accuracy and completeness of the reports.
5. Verify drill-through reports, cross-tab reports, parent/child reports etc. to see if they are all working as expected.
6. 'Analysis functions' and 'Data Analysis' should be working.
7. Preview and/or export reports to different formats such as spreadsheet, PDF, HTML, e-mail etc. to ensure two things:
a. report layout in the new format is not distorted
b. data shown is accurate and consistent
8. If a report shows a graph as well as the related data in tabular format, then both should reflect consistent data.
9. Create SQL/MDX queries to fetch the data from the source, to compare and validate against the report.
10. Verify the reports to ensure the data shown on them is consistent with what is being supplied by upstream systems (could be an OLTP database, datamart, DWH, cube, etc.). Data from the upstream systems can be extracted using the SQL/MDX specified in point 9; see the sketch after this list.
11. Verify calculations or business rules (if any) that apply to the underlying data (e.g. check the business formula that applies to a metric/KPI).
12. Verify report-level rules/validations/errors/warnings. For example: error message for wrong input, optional/mandatory prompts, 'No Data' warnings, field/prompt validation etc.
13. Verify usability of reports (e.g. general appearance (color, look and feel), navigation etc.).
14. Verify performance of the reports.
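For points 9 and 10, a hypothetical validation query (sales_fact, region and sale_amount are placeholders for your own upstream table and columns):
select region, sum(sale_amount) as total_sales
from sales_fact
group by region;
Run this against the upstream source and compare the result set with the totals the report displays.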

In General Test scenarios for reports
Headers and Footers:

1)Is there a report title?
2)Is the report title centered? Is it underlined?
3)Was the correct font used for the Report Title?
4)Is the last report refresh datetime displayed in the right corner of page header?
5)Is the report data refresh date displayed in the right corner of the page header?
6)Was the correct font (for example, Arial Narrow 9) used in the page header?
7)Was the correct font (for example, Arial Narrow 7) used in the page footer?
8)Does the footer include the page number and # of pages?
9)Does the footer contain text (for example, "Confidential")?
10)Does the footer contain "XYZ Company"?

11)Does the "XYZ Company" text appear in blue color?

Report Body/Formatting:

1)Was Arial Narrow 9 font used for column/data headers?
2)Was Arial Narrow 9 font used for data fields?
3)Were commas used for numbers (ex 2,000)?
4)For percentages, was the % sign included in either the table header or in each corresponding data cell, as well as any subtotals and grand totals?
5)For percentages and currency, was the decimal limit set to two?
6)For currency, was the correct currency sign included in each cell, subtotal, and grand total?
7)For percentages and currency, were parentheses used for negatives?
8)Was the format "dd mon yyyy" (ex 29 Jul 2009) used for dates?
9)Was 24 hour military time used?
10)Were column headers centered?
11)Were row labels left justified?
12)Was text data left justified?
13)Was numeric data right justified?
14)Was calculated data right justified?
15)Were domestic phone numbers formatted like 999-999-9999?


Readability & Accuracy:

1)Are prompt names meaningful?
2)Is the overall report readable (e.g. does shading print appropriately)?
3)Does the report print out in the required format(s)?
4)Has the report been checked for spelling and grammatical errors?

5)Does the report name conform to Centre of Excellence (COE) report naming standards?

Query Design:

1)Is the data drawn from the appropriate source (e.g., does the data come from the folders that provide accurate results and optimum performance)?
2)Are all of the queries named?
3)Are the query results correct? Have they been checked against an appropriate source?
4)Is the performance acceptable?

Dashboard Testing scenarios:

1)Verify data shown on the dashboard/reports is as expected.
2)Users can see reports according to their user profile (check authentication and authorization).
3)Verification of the report layout to ensure it is in sync with the original requirement.
4)Verification of the accuracy and completeness of the reports.
5)Verify drill-through reports, cross-tab reports, parent/child reports etc. to see if they are all working as expected.
6)'Analysis Functions' and 'Data Analysis' should be working.
7)Previewing and/or exporting of reports to different formats such as spreadsheet, PDF, HTML, e-mail etc. to ensure two things:
a)report layout in the new format is not distorted
b)data shown is accurate and consistent
8)If a report shows a graph as well as the related data in tabular format, then both should reflect consistent data.
9)Create SQL/MDX queries to fetch the data from the source to compare and validate with the target (in our case 'reports').
10)Verify reports to ensure the data shown on them is consistent with what is being supplied by upstream systems (could be an OLTP database, datamart, data warehouse, cube etc.). Data from the upstream systems can be extracted using the SQL/MDX as specified in point 9.
11)Verify calculations or business rules (if any) that apply to the underlying data (e.g. check the business formula that applies to a metric/KPI).
12)Verify report-level rules/validations/errors/warnings. For example: error message for wrong input, optional/mandatory prompts, 'No data' warnings, field/prompt validation etc.
13)Verify usability of reports (e.g. general appearance (color, look and feel), navigation etc.).
14)Verify performance of the reports.
15)Reference data check: validate the report data against the records in the reference table.


ETL Process


Why do organizations need Data Warehouse?

Organizations with organized IT practices are looking to create the next level of technology transformation. They are now trying to make themselves much more operational with easy-to-interoperate data. That said, data is the most important part of any organization, whether everyday data or historical data. Data is the backbone of any report, and reports are the baseline on which all the vital management decisions are taken.
Most companies are taking a step forward by constructing a data warehouse to store and monitor real-time as well as historical data. Crafting an efficient data warehouse is not an easy job. Many organizations have distributed departments with different applications running on distributed technology. An ETL tool is employed in order to make flawless integration between the different data sources from different departments. The ETL tool works as an integrator, extracting data from different sources, transforming it into the preferred format based on the business transformation rules, and loading it into a cohesive database known as the Data Warehouse.

A well planned, well defined and effective testing scope guarantees smooth conversion of the project to production. A business gains real buoyancy once the ETL processes are verified and validated by an independent group of experts, making sure that the data warehouse is concrete and robust.


ETL or Data Warehouse testing is categorized into four different engagements, irrespective of the technology or ETL tools used:

• New Data Warehouse Testing – a new DW is built and verified from scratch. Data input is taken from customer requirements and different data sources, and the new data warehouse is built and verified with the help of ETL tools.

• Migration Testing – in this type of project the customer has an existing DW and ETL performing the job, but is looking to adopt a new tool in order to improve efficiency.

• Change Request – in this type of project new data is added from different sources to an existing DW. Also, there might be a situation where the customer needs to change an existing business rule or integrate a new rule.

• Report Testing – reports are the end result of any Data Warehouse and the basic purpose for which a DW is built. Reports must be tested by validating the layout, the data in the report, and the calculations.

ETL Testing Techniques:

1) Verify that data is transformed correctly according to various business requirements and rules.

2) Make sure that all projected data is loaded into the data warehouse without any data loss and truncation.

3) Make sure that ETL application appropriately rejects, replaces with default values and reports invalid data.

4) Make sure that data is loaded into the data warehouse within the prescribed and expected time frames, to confirm improved performance and scalability. Apart from these four main ETL testing methods, other testing methods like integration testing and user acceptance testing are also carried out to make sure everything is smooth and reliable.

ETL Testing Process:

Similar to any other testing that lies under Independent Verification and Validation, ETL also goes through the same phases:

• Business and requirement understanding

• Validating

• Test Estimation

• Test planning based on the inputs from test estimation and business requirement

• Designing test cases and test scenarios from all the available inputs

• Once all the test cases are ready and approved, the testing team proceeds to perform pre-execution checks and test data preparation for testing

• Lastly execution is performed till exit criteria are met

• Upon successful completion, a summary report is prepared and the closure process is done.
It is necessary to define a test strategy, mutually accepted by the stakeholders, before starting actual testing. A well defined test strategy will make sure that the correct approach has been followed to meet the testing expectations. ETL testing might require the testing team to write SQL statements extensively, or to tailor the SQL provided by the development team. In any case, the testing team must be aware of the results they are trying to get using those SQL statements.

Data Warehouse Testing scenarios

ETL testing:

- Verify that data transformation from source to destination works as expected

- Verify that expected data is added in target system

- Verify that all DB fields and field data is loaded without any truncation

- Verify data checksum for record count match

- Verify that for rejected data proper error logs are generated with all details

- Verify NULL value fields

- Verify that duplicate data is not loaded

- Verify data integrity (a SQL sketch for some of these checks follows)
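A quick sketch of two of these checks in SQL; src_customer and dw_customer are hypothetical source and target tables:
-- record count match between source and target
select (select count(*) from src_customer) as src_count,
       (select count(*) from dw_customer) as tgt_count
from dual;
-- duplicate check in the target
select customer_id, count(*)
from dw_customer
group by customer_id
having count(*) > 1;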

ETL Testing Challenges:

ETL testing is quite different from conventional testing. There are many challenges we face while performing data warehouse testing. Here is a list of a few ETL testing challenges I experienced on my project:

- Incompatible and duplicate data.

- Loss of data during ETL process.

- Unavailability of inclusive test bed.

- Testers have no privileges to execute ETL jobs on their own.

- The volume and complexity of the data is huge.

- Faults in business processes and procedures.

- Trouble acquiring and building test data.

- Missing business flow information.

Data is important for businesses to make critical business decisions. ETL testing plays a significant role in validating and ensuring that the business information is exact, consistent and reliable. It also minimizes the hazard of data loss in production.

Sunday 17 November 2013

ETL Testing concepts

What is ETL? What are the scenarios we can check in the ETL process?

Ans) The process of extracting data from different source systems (operational databases), integrating the data, transforming it into a homogeneous format, and loading it into the target warehouse database is called ETL (Extraction, Transformation and Loading). It is sometimes also called the Data Acquisition Process.

Extraction
1) The extraction is able to pull all the required fields.
2) The execution logic for each source system is working.
3) Extraction scripts are granted security access to the source systems.
4) Movement from source to the extraction destination works in terms of completeness and accuracy.
5) Extraction is getting completed within the expected time window.

Transformation:
1)Transformation scripts are transforming the data as per the expected logic.
2)The one-time transformation for historical snapshots is working.
3)Detailed and aggregated data sets are created and are matching.

Loading:
1)Movement of data sets from staging to the loading destination is working.
2)Any transformations during the loading process are working.
3)One-time historical snapshots are working.
4)Both incremental and total refresh are working.
5)Loading is getting completed within the expected time window.

ETL Testing Common scenarios:

1)Verify all the DB objects are in place as defined in the mapping document (Source, Staging/ODS, DWH).
2)Verify all the ETL objects are deployed to the QA environment and are valid.
3)Verify the table structures in the DB are the same as in the document.
4)All the flat files required for QA to run tests are in place and in the required format.
5)Verify that data consolidation of duplicate, truncated or merged data was properly handled.
6)Verify that data field types, formats, field precision and field length are as specified (especially if it supports South Asian languages).
7)Verify that default values, null records, error records, invalid data rows and status-type rows are properly handled.
8)Verify that expected ranges of field contents are specified where known.
9)Verify data completeness in the target table.
10)Verify data integrity.
11)Verify that no null rows are present in the EDW tables.
12)Compare record counts between source and target; check for any rejected records (see the sketch after this list).
13)Verify data between source and target; it should match.
14)Verify all possible outcomes of the transformation rules, default values and straight moves, as specified in the Business Specification document.
15)Verify incorrect sequences, referential integrity failures, incorrect end-dating (if applicable), rejected records etc.
16)Verify that if ETL jobs/processes fail for any of a number of reasons (for example database-related failures, connectivity failures etc.), all data is rolled back up to the last successful batch run.
17)Verify that if ETL fails during data validation, all errors are handled through the error handling process.
18)Check that unique values are loaded into the target; no duplicate records should exist.
19)Verify that the ETL process completes within the load window.
20)Verify with invalid inputs and out-of-boundary scenarios to check the behavior of the application.
21)If there are unix/shell/perl scripts in the workflow, verify the results of those scripts as well.
22)Reference data check: validate the report data against the records in the reference table.
23)Verify CDC (change data capture).
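For items 12 and 13, a completeness sketch using the set operators covered in the Oracle section below (src_customer and dw_customer are hypothetical tables):
-- rows present in the source but missing from the target
select customer_id, customer_name from src_customer
minus
select customer_id, customer_name from dw_customer;
An empty result means every source row made it to the target; any rows returned are candidates for the rejected-records investigation.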





UNIX Real Time Scenarios

1.How to find out row count in a file?
A. wc -l filename

2. How to display  first 5 lines?
A. head -5 filename

3.How to display last 5 rows?
A. tail -5 filename

4.How to copy files from one server to another server?
A. scp filename username@servername:path
ex: scp abcd.txt computer@124.127.55:/ETL/source/files

5.How to delete files
A. rm filename

6.How to display a file?
A. cat filename

7.How to display a zipped file?
A. zcat filename

8.How to unzip file?
A. gunzip filename

9.How to rename file?
A. mv oldfilename newfilename
Ex: mv abc.txt abcd.txt

10.How to display first 5 lines in zipfile?
A. zcat filename | head -5

11. How to display last 5 lines in zipfile?
A. zcat filename | tail -5

12. How to get the process id?
A. ps -ef | grep jobname (the second column of the output is the process id)

13.How to kill a job?
A. kill -9 processid

14. How to display the list of files (in reverse order by date)?
A. ls -lrt





Oracle

Oracle :- sometimes expanded as Oak Ridge Analytical Computer Logical Engine, though this is a folk backronym rather than an official name.

Statements of Oracle:-
DDL – Data definition language
1)      Create – used to make a new database object
2)      Alter – used to modify the structure of a table
3)      Add – clause of Alter used to add a new column
4)      Modify – clause of Alter used to modify the structure of existing columns
5)      Drop – used to remove columns or whole objects

DML – Data manipulation language
1)      Insert – New statement to be inserted
2)      Update – Used to modify the existing records
3)      Delete – Used to remove the records from the table

DCL – Data control language
1)      Grant – Used to give permissions
2)      Revoke – Used to cancel permissions for (insert, update, delete and select)

TCL – Transaction control language
1)      Commit – Used to save changes made by DML statements
2)      Roll back – It is used to cancel transactions which are not saved
3)      Save Point – used to mark a point within a transaction to which you can later roll back

DRL – Data retrieval language
1)      Select – used to retrieve data from tables; it is for read-only purposes
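A minimal sketch exercising these statements, using a hypothetical emp table:
create table emp (empno number, ename varchar2(30));          -- DDL: Create
alter table emp add (sal number(8,2));                        -- DDL: Alter ... Add
insert into emp (empno, ename, sal) values (1, 'SMITH', 800); -- DML: Insert
update emp set sal = 900 where empno = 1;                     -- DML: Update
commit;                                                       -- TCL: Commit
select * from emp;                                            -- DRL: Select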

Set Operators:-

Union – Union operator returns all rows selected by either query. Use the union operator to return all rows from multiple tables and eliminate any duplicate rows.
Null values are not ignored during duplicate checking.

Union all – Combines the result of two select statements into one result set including the duplicates.

Minus – Minus operator returns rows from the first query that are not present in the second query.
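A sketch of the three operators, assuming two hypothetical tables emp_2012 and emp_2013 with a compatible ename column:
select ename from emp_2012
union
select ename from emp_2013;      -- all names, duplicates removed
select ename from emp_2012
union all
select ename from emp_2013;      -- all names, duplicates kept
select ename from emp_2012
minus
select ename from emp_2013;      -- names in 2012 but not in 2013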

Mathematical Functions:-

ABS – This function returns the absolute value of a number.
Syntax – ABS(number)
Ex- select ABS(-32) from dual; o/p – 32
ABS(-32.4)
o/p: 32.4

ABS(-32.45)
o/p: 32.45

ABS(32.67)
o/p: 32.67

ABS(32.67 * -1)
o/p: 32.67

CEIL - This function returns the smallest integer value that is greater than or equal to a number.

Syntax- CEIL( number )
Ex - CEIL(32.65)
o/p: 33

CEIL(32.1)
o/p: 33

CEIL(32)
o/p: 32

CEIL(-32.65)
o/p: -32

CEIL(-32)
o/p: -32

EXTRACT - This function extracts a value from a date or interval value.
You can only extract YEAR, MONTH, and DAY from a DATE.
You can only extract TIMEZONE_HOUR and TIMEZONE_MINUTE from a timestamp with a time zone datatype.
Syntax:

EXTRACT (
{ YEAR | MONTH | DAY | HOUR | MINUTE | SECOND }
| { TIMEZONE_HOUR | TIMEZONE_MINUTE }
| { TIMEZONE_REGION | TIMEZONE_ABBR }
FROM { date_value | interval_value } )

Ex -
EXTRACT(YEAR FROM DATE '2003-08-22')
o/p: 2003

EXTRACT(MONTH FROM DATE '2003-08-22')
o/p: 8

EXTRACT(DAY FROM DATE '2003-08-22')
o/p: 22

FLOOR - This function returns the largest integer value that is equal to or less than a number.
Syntax- FLOOR( number )
Ex -
FLOOR(5.9)
o/p: 5

FLOOR(34.29)
o/p: 34

FLOOR(-5.9)
o/p: -6

ROUND - This function returns a number rounded to a certain number of decimal places.
Syntax (WITH NUMBERS): ROUND( number, [ decimal_places ] )
Ex-
ROUND(125.315)
o/p: 125

ROUND(125.315, 0)
o/p: 125

ROUND(125.315, 1)
o/p: 125.3

ROUND(125.315, 2)
o/p: 125.32

ROUND(125.315, 3)
o/p: 125.315

ROUND(-125.315, 2)
o/p: -125.32

TRUNC - This function returns a number truncated to a certain number of decimal places.
Syntax (WITH NUMBERS) : TRUNC( number, [ decimal_places ] )
Ex-
TRUNC(125.815)
o/p: 125

TRUNC(125.815, 0)
o/p: 125

TRUNC(125.815, 1)
o/p: 125.8

TRUNC(125.815, 2)
o/p: 125.81

TRUNC(125.815, 3)
o/p: 125.815

TRUNC(-125.815, 2)
o/p: -125.81

TRUNC(125.815, -1)
o/p: 120

TRUNC(125.815, -2)
o/p: 100

TRUNC(125.815, -3)
o/p: 0

TO_CHAR – This function converts a number or date to a string.

Syntax : TO_CHAR( value, [ format_mask ], [ nls_language ] )
Ex-
TO_CHAR(1210.73, '9999.9')
o/p: ' 1210.7'

TO_CHAR(-1210.73, '9999.9')
o/p: '-1210.7'

TO_CHAR(1210.73, '9,999.99')
o/p: ' 1,210.73'

TO_CHAR(1210.73, '$9,999.00')
o/p: ' $1,210.73'

TO_CHAR(21, '000099')
o/p: ' 000021'

TO_CHAR(sysdate, 'yyyy/mm/dd')
o/p: '2003/07/09'

TO_CHAR(sysdate, 'Month DD, YYYY')
o/p: 'July 09, 2003'

TO_CHAR(sysdate, 'FMMonth DD, YYYY')
o/p: 'July 9, 2003'

TO_CHAR(sysdate, 'MON DDth, YYYY')
o/p: 'JUL 09TH, 2003'

TO_CHAR(sysdate, 'FMMON DDth, YYYY')
o/p: 'JUL 9TH, 2003'

TO_CHAR(sysdate, 'FMMon ddth, YYYY')
o/p: 'Jul 9th, 2003'

TO_DATE - This function converts a string to a date.

Syntax : TO_DATE( string1, [ format_mask ], [ nls_language ] )
Ex-
TO_DATE('2003/07/09', 'yyyy/mm/dd')
o/p: date value of July 9, 2003

TO_DATE('070903', 'MMDDYY')
o/p: date value of July 9, 2003

TO_DATE('20020315', 'yyyymmdd')
o/p: date value of Mar 15, 2002

TO_NUMBER - This function converts a string to a number.

Syntax : TO_NUMBER( string1, [ format_mask ], [ nls_language ] )
Ex-
TO_NUMBER('1210.73', '9999.99')
o/p: 1210.73

TO_NUMBER('546', '999')
o/p: 546

TO_NUMBER('23', '99')
o/p: 23

TO_NUMBER('1210.73')
o/p: 1210.73

TO_TIMESTAMP - This function converts a string to a timestamp.

Syntax : TO_TIMESTAMP( string1, [ format_mask ] [ 'nlsparam' ] )
Ex:

TO_TIMESTAMP('2003/12/13 10:13:18', 'YYYY/MM/DD HH:MI:SS')
would return '13-DEC-03 10.13.18.000000000 AM' as a timestamp value.

TO_TIMESTAMP('2003/DEC/13 10:13:18', 'YYYY/MON/DD HH:MI:SS')
would also return '13-DEC-03 10.13.18.000000000 AM' as a timestamp value.

TO_TIMESTAMP_TZ - This function converts a string to a timestamp with time zone.

Syntax:TO_TIMESTAMP_TZ( string1 , [ format_mask ] [ 'nlsparam' ] )
Ex-
TO_TIMESTAMP_TZ('2003/12/13 10:13:18 -8:00', 'YYYY/MM/DD HH:MI:SS TZH:TZM')
would return '13-DEC-03 10.13.18.000000000 AM -08:00' as a timestamp with time zone value.

TO_TIMESTAMP_TZ('2003/DEC/13 10:13:18 -8:00', 'YYYY/MON/DD HH:MI:SS TZH:TZM')
would also return '13-DEC-03 10.13.18.000000000 AM -08:00' as a timestamp with time zone value.




Data Warehousing concepts


Data warehousing  Architecture


Data Warehouse Architecture

 Data Warehouse server – almost always a relational DBMS, rarely flat files

 OLAP servers – to support and operate on multi-dimensional data structures

 Clients – Query and reporting tools – Analysis tools – Data mining tools

Difference between OLTP vs. Data Warehouse???

In brief: an OLTP database is designed for transactional purposes and holds current, everyday data that is frequently updated, while a data warehouse is designed for analysis, holds historical data, and is nonvolatile (data once recorded is not updated).
What is Data warehousing???
Definition:- 
A data warehouse is collection of data marts.

Represents historical data. A DW is a relational database specially designed for analysis purposes rather than for transactional purposes. It is a:

-subject-oriented,

-integrated,

-time-variant,

-nonvolatile

collection of data in support of management’s decision making process.

Subject-oriented

 Data warehouse is organized around subjects such as sales, product, customer.

Integration

 Data Warehouse is constructed by integrating multiple heterogeneous sources.

 Data preprocessing is applied to ensure consistency.

Time-variant

 Provides information from historical perspective

Eg. past 5-10 years.

 Which supports business users in analyzing the business and comparing it across different time periods.

Ex:- Yearly, Half yearly, Quarterly, Monthly,weekly, Daily.

Nonvolatile

 Data once recorded cannot be updated.

 The data warehouse requires only two operations in data accessing: initial loading of data, and access of data.

Data mart

 A data mart is a subset of the data warehouse which concentrates on a specific business area.

Ex:- Sales dept, Purchase dept, HR dept

Why a Data Warehouse?

 For analysis and decision support, end users require access to data captured and stored in an organization’s operational or production systems

 This data is stored in multiple formats, on multiple platforms, in multiple data
structures, with multiple names, and probably created using different business rules

Dimension Table

 Contains descriptive or textual attributes.

 Relatively static (occasionally data will change – SCDs Type I, II, III).

 Hierarchical data.

 Joined to other tables – joined to a fact table through a foreign key reference.

Fact table

 A fact table contains measures and the keys of the dimension tables.

 Can hold large volumes of data.

 Joined to dimension tables through foreign keys that reference primary keys in the
dimension tables.

Star Schema

 A star schema is defined as a logical database design in which there is a centrally located fact table surrounded by one or more dimension tables.
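As a sketch, a star-schema query joins the central fact table to its dimensions through their keys (sales_fact, date_dim and product_dim are hypothetical tables):
select d.year, p.product_name, sum(f.sale_amount) as total_sales
from sales_fact f
join date_dim d on f.date_key = d.date_key
join product_dim p on f.product_key = p.product_key
group by d.year, p.product_name;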

SnowFlake Schema

 A variant of the star schema model.

 A single, large, central fact table and one or more tables for each dimension.

 Dimension tables are normalized, i.e. dimension table data is split into additional tables.

Fact Constellation

 Multiple fact tables share dimension tables.

 This schema is viewed as a collection of stars, hence it is called a galaxy schema or fact constellation.

 Sophisticated applications require such a schema.

Refresh???
          Propagate updates on source data to the warehouse.
          Issues:
        when to refresh

        how to refresh -- refresh techniques (see the sketch below)

When to Refresh???
          periodically (e.g., every night, every week) or after significant events
          on every update: not warranted unless the warehouse data requires current data (up-to-the-minute stock quotes)
          refresh policy set by the administrator based on user needs and traffic

          possibly different policies for different sources
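One common refresh technique is an upsert from a staging table into the warehouse table; a hypothetical sketch with stg_customer feeding dw_customer:
merge into dw_customer t
using stg_customer s
on (t.customer_id = s.customer_id)
when matched then
  update set t.customer_name = s.customer_name
when not matched then
  insert (t.customer_id, t.customer_name)
  values (s.customer_id, s.customer_name);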


Data Warehousing Tools

ETL tools:
- Informatica
- DataStage
- Ab Initio
- SAS
- MSBI

Reporting tools:
- Cognos
- BO (Business Objects)
- MicroStrategy
- MSBI