Organize Africa world Cup 2010 participants Database

The aim of this project is to build understanding of various database systems and then to propose the best solution for sport organization database system. Database systems are one of the most popular and widely used IT software in this information age. One decade ago, when memory was expensive, companies were storing only very important business data and that to in normalized form to ensure that data redundancy is minimum with constraint of hitting storage space capacity, but this has thrown other challenges 1. Company lost quite useful information without storing it  2. Accessing simple info became very complex and for simple reports, business users had to rely on the IT team. Now memory is available in abundance and so each and every organization is trying to collect and store each and every bit of meaningful data. Nowadays all the organizations have their IT teams managing their data, but unfortunately this is the harsh fact that even today most of the time business users still rely on IT team to supply the information to make the business decision . The users most often that required information does not come at the right time and in right format. Thus it has become very important that we select the appropriate database management system and store the data in such a format so that it becomes easier for business users to get the required info and their dependency on IT team is reduced and limited only for typical cases. Amidst all this, another challenge would be that the solution should fit the budget of the organization.

Scope  
Under this project we have tried to design one Sport Management system. System will store information like who all attended this mega event along with other information such as visitors types (if they were athlete or Staff, Volunteer or just a spectator- Who came to see the event or any other type), what all events they participated in, what all bonus or incentives they availed and other.

We have following types of data which we want to store in appropriate database management system in easy to access format.

People
People (employees, volunteers, athletes) involved in the Africa world cup.

Event
There are various types of sports events in world cup

BonusIncentive
Bonuses to the employees depending upon the events they were involved into and incentive to the volunteers.

Objective
The main objective of this project is
To study and understand various important database systems and do the comparison analysis amongst them
To understand the requirement of the sport organization and then to design the best suitable database solution for them.

Technologies Available
Database Management System
There are various types of database management systems available in market and depending upon the need and budget, organization chooses the best one for their business needs. These could be as simple as flat file system or may be sophisticated solutions like Relation Databases system and multidimensional database systems (e.g. SAS OLAP Cube). Database systems evolved from simple flat file to relational databases. RDBMS systems are rich in features and can address multifarious needs of business organization. Before we move forward lets discuss different available database technologies and their usage.
Flat Model Databases
Hierarchical Model Databases
Network Model Databases
Relational Model Databases

Flat Model Databases
Flat model databases comprise of two dimensional flat layouts of data. There are columns and each column holds a relation with an entity or entities in other coulmn. For instance, a flat model database includes only zip codes. In this database there will be only column and several rows where each row will contain new zip code. One could easily see that storing multiple types of data like World cup data would be difficult to store and retrieve. This solution would unnecessarily complicate the management of our Sport Management System and its recommended not to use it for this purpose.

Network Model Databases
A network model databases has records linked to one another in other worked they are networked. These networks which are referred to as pointers can be different types of information like node numbers or  a disk addresses.

Hierarchical Model Databases
The structure of hierarchical model database is same as tree one node is connected to another node with above hierarchy, an example of same is the organization of files and folders in MS Windows operating system . For instance, a hierarchal database of a manufacturing company, may list each days production units as a separate file. Within this nested file are all of the production related data (same types of data) for the day. Storing the data in this format is particularly useful when we have lot of hierarchy in data. As we can easily infer that data in our Sports Management is not hierarchical in nature, therefore it is not advisable to use Hierarchical Model Database.

Relational Model Database
The relational model is the most widely used database management system and is an extremely powerful tool, not just for storing the information, but also to access it that is why these kinds of databases are being widely used in the Modern world. In a typical RDBMS Systems tables are linked to each other through a relationship. These tables can be updated without reorganizing the data of the table per se. Tables are also referred as a relation. Relational databases use a program interface called SQL or Standard Query Language. SQL is used on practically all relational databases. Different relational databases may support slightly different SQL language. Relational databases are extremely easy to customize to fit into almost any kind of data storage. You can easily create relations for items that you sell, employees that work for your company, etc. Some of the leading RDBMS products are  ORACLE, DB2, MYSQL and MSSQL.

Multidimensional database
A multidimensional database (MDDB) is able to pull data from a data warehouse or from any other data sources and stores the data in a matrix-like format. The data in multidimensional database stores the data in pre-summarized format which helps user to analyze and drill down the data across multiple levels and dimensions very quickly. The MDDB enables users to quickly retrieve multiple levels of pre-summarized data through a multidimensional view. An MDDB is not about any specific dataset like SQL table or SAS data set, rather it is way of storing the data in multidimensional format and depending upon the database we use it is called multidimensional tables of Cubes (SAS). A crossing is a definition of one or more classification variables that exist in the data from which the multidimensional database is created. Each crossing represents a grouping on which summary statistics could be calculated. For building Cubes in SAS, we use SAS OLAP studio or we can launch the same from SAS DI studio.

Star Schema Vs Normalized data
Star schema architecture
Star schema architecture is the most basic data warehouse design. The main feature of a The table at the centre, called the fact table is the characteristic feature of this architecture .Besides this there are dimension tables which allow browsing of specific categories. The fact tables in a star schema are mostly in database third normal form, while dimensional tables are de-normalized (second normal form). Despite the fact that the star schema keeps the data redundancy,  but it is still most commonly used in the data warehouse implementations across the world today (about 90-95 cases) due to its simplicity because accessing the information from start schema is easy. Generally we have 1-2 fact tables in one start schema. When we try to attain some kind of normalization within the start schema, then it gets distorted a bit and called snowflake schema. In fact table, we generally keep numeric measurable fact rather in dimension tables we keep generally non-additive characteristics of the data.  

Data normalization
Let me first define that what is data normalization is Data normalization is a method in which data attributes within a data model are organized to increase the cohesion of entity types. The purpose of data normalization is to reduce the storage requirement by minimizing the data redundancy and to store consistent information. Data consistency is very important for any organization because if we do not maintain the consistency in data, then two people from the same company would be talking about different number while we know that both can not be true at the same time. There are different levels of data normalization, but third level of normalization data is considered to be best for all practical purposes and more normalization is not advisable as it can create unnecessary more complexity in the solution and then it would be again tedious task to retrieve the information.

LevelRuleFirst normalized form (1NF)An entity type is in 1NF when it contains no repeating groups of data.Second normalized form (2NF)An entity type is in 2NF when it is in 1NF and when all of its non-key attributes are fully dependent on its primary key. Third normaizedl form (3NF) An entity type is in 3NF when it is in 2NF and when all of its attributes are directly dependent on the primary key.

Frontend Technologies
Very similar to backend technologies, people use several front end technologies to create user interfaces which could help user to access the system. There are various technologies available like HTML, Java, .Net, MFC and VB. Depending upon systems usage one can plan of using the technologies. If wide number of people will be using this application from different locations in that it would make sense to write one web application. Anyone can access the functionality from anywhere to everywhere provided place has internet access, which is not a big deal now a days. Web application can be developed in .Net or Java along with HTML and if one wants to give this management system a user friendly interface people can use CSS etc. J2EE framework is Sun Microsystemss proprietary technology, while .Net is Microsofts one. But both technologies are supported by most of the platform and browsers and hence can be used But if this systems usage is not globally and it will be used from one or couple of desktops only in that case it would be easy to write one desktop application rather then web application. Desktop application can be written easily with out any major hazard.

Assuming this application would not be used by the limited people within the organization, I will suggest using VB interface as it is easy, simple and economic to create unlike web application where we need to register the domain and buying space for hosting.

Selection of Appropriate Solution 
To store data in flat file database model would be difficult to store and to retrieve the information, also it would be difficult to check the data consistency and thus it is not advisable to consider this solution.  Since our data is not hierarchical in nature, so we would discard this as well. Network data model is also not applicable in our case. Storing data in MDDB is also not advisable here because we do not want to much slice and dice on the data. The kind of data we have which is relation in nature and thus we would recommend relational database system solution.

Looking at the various type of databases and their pros and cons, my personal recommendation is to use some relational database. But then immediate question arise that we have so many relational database available in the market which one should be used for this project.

Both Oracle 9i Database and DB2 Universal Database version 8.1 support all known platforms, including Windows-based platforms, AIX-Based Systems, HP-UX systems, Linux Intel, Sun Solaris and so on. And one of the most widely used databases for ERP software. In order to install Oracle on some windows desktop, you need following configuration.

HardwareRequirementProcessorPentium 166 MHz or higherPrimary Memory (RAM)RAM 128 MB (256 MB recommended)Virtual Memory Initial Size 200 MB, Maximum Size 400 MBHard disk space140 MB on the System Driveplus 4.5 GB for the Oracle Home Drive (FAT)or 2.8 GB for the Oracle Home Drive (NTFS)

While DB2 requires following configuration on windows based operating system
HardwareRequirementProcessorPentium 166 MHz or higherPrimary Memory(RAM)A minimum of 512 MB RAMHard disk space4.5 GB but then it also require minimum of 2 x RAM or 400 MB, whichever is greater of swap sapceOnce we have looked at the requirements for the ORACLE or DB2, we should look at our system requirement and think do we really need such a heavy databases. I guess your answer will also be same as mine NO. We really dont need such heavy system and then cost may also be a concern for us. And hence should evaluate some smaller database systems like MS SQL, MY SQL (freeware) or MS Access etc.

My SQL is freeware and has as good support as any other database system (mentioned above) and hence I would recommend use of MySQL, it is free and easy to manage and maintain.

User Interface
Assuming this application would not be used by the limited people within the organization, I will suggest using VBA interface as it is easy, simple and economic to create.

Design Discussion
We will store our data into four tables and out of these four, three will be populated by the data keyed in by the company and one will be generated automatically using some trigger event.

People Table
This table will contain information related to the  people who would be part of the organization or the events.The people would fall in three categories . (1 --- employee,  2 ---- volunteer, 3 ---- athlete)

Primary Key  Person_id

SchemaTable NameVariable
Data LoadingUpdate flow
Company will have their employee information in some SAP database. Information about the athletes and volunteers would come via csv files or excel sheet. So, job written in PLSQL would extract the data from SAP and these flat files and will store this in People table of Sport schema of MYSQL database
Event Table
This table will contain the information (name of event, starting time, end time, stadium, bonuses and incentive amount) about all the sporting events which would take place during the world cup.

Data LoadingUpdate Flow
The schedule of the events, its venue and bonus amount would come from flat files in beginning and would be loaded at first time, later on to make any update, there would be one user interface written in VBA to make any amendment.

Information about the people who would be involved in particular event would come from flat files in beginning and would be loaded at first time, later on to  make any update, there would be one user interface written in VBA to make any amendment. The reason behind loading from flat file is the ease of user input, because if user will try to input one by one then it will take more effort thats why this batch process of reading data from the flat file is introduced.

BONUS_INCENTIVE Table
This table will contain the information about bonusincentive earned by each and every person who was involved in world cup events.
Primary Key  Person_id

Data LoadingUpdate Flow
This table will be calculatedupdated by one routine which would be triggered at every refresh of the database. This routine will calculate the total amount of bonusincentive for a person (employeevolunteer).

SchemaTable NameVariable

Workflow
Using the user interface written in VBA, we would do following
Enter details that Athlete ABC came to visit 2012 world cup.
Create an entry for Athlete in table, if this already not present in the table and assign a unique person_id.
Createamend an entry of any event and if it is entered for the first time, then create unique event id and ask for all remaining details.

Enter the bonusincentives for each event and if there is any change in decision of bonusincentive of any pre entered event, then make the required change using the user interface.

Enter BonusesIncentives for Athletes
In order to do that, we need to perform following steps
Check if athlete is already stored in the People_ID table if no insert an entry for athlete and get the unique id and if this already exists then get the unique id assigned to athlete.

Use the unique id obtained from step1 to check if there is a unique entry in AvailableMap table or else add one entry (Note that TypeIDs are enum in our case or else we can have another table for that also). This table will show if bonus or incentives are allowed for athletes or not.
Use the unique ID from step1 to add details in Bonus_Incentive_Event_Map.
ABC wants to buy an ice cream of retail price 10USD, how much he has to pay
Get the VisitorsID and ID( Type  Athlete or Volunteer etc) from Visitor Table using hisher name (ABC in this case)
Use the ID to check if bonus or incentive assigned to ABC ( using table Available Map) and get the Bonuses or Incentive IDs if these were assigned
Go through all the bonuses or incentives list to check if there is some incentive applied on Ice-cream price if there is something apply the discount and charge the new price from ABC
Also add this Entry in to DataWarehouse for later reporting.

Data Quality
To ensure that we store the valid and quality data in our database, there are various data quality solution available. In this project, we would validate the data by writing our own data validation rules. For example, if we are adding the phone no, we would ensure that it has only numbers and that to a certain limit, like 10 digit mobile numbers in most of the countries or but few validation will be imposed by design only such as uniqueness validation rule, where we say that one person could be either athlete or staff or volunteer but not all 3.

Reporting Interface
We will create one reporting interface in VBA which would enable the organizers and people from management team to extract the required information in pre-designed format. For example, some one would want to find out the events that would take place on particular day or he may like to find out the time when there is no activity, so that stadium can be cleaned during that period. Another example, organization committee wants to know how much bonus they have to pay, just to ensure that it is under budget or if they have made some changes, and then just to analyze the impact of change on budget. These kinds of predefined reports can be designed and created using user interface where person can choose appropriate parameter to generate the report.

Pros and Cons of the solution

Pros
Above mentioned solution doesnt keep redundancy in data as it stores the data in highly normalized form.
Above mentioned solution is a generic solution so if tomorrow we try to add another field referee it can be added without changing any schema or if we want to add another type of  bonus and incentive then it can be easily added
Selection of Relational database adds value to the solution as data consistency can be checked and at the same time data integrity can be maintained by creating Integrity constraint rule.

Cons
It requires much more CPU, memory, and IO to process normalized data and therefore Server performance degradation is observed during peak loading times
It requires more joins to get the desired result. A poorly-written query can halt the entire system.
Maintenance overhead. The higher the level of normalization, the greater the number of tables in the database.
Business users need to understand database design to access the required information or they will have to rely on IT team.

Considering the size, type and business requirement of data, we have proposed MYSQL database and data is stored in normalized form. This solution fits our need , but if we want to do some more detailed analytical analysis, then we may have to change our database design and might require to change it to de-normalized  format from normalized format.

0 comments:

Post a Comment