Sunday, July 6, 2008

General Overview

For my first post, I decided that it would probably be beneficial to give a general overview on what parts of Oracle I work with, the type of work I do, etc. I finally decided having a blog would allow me an outlet of sorts to gather my thoughts in some sort of cohesive manner where I could easily find them for further study and/or clarification.

I've been working with Oracle since around 1990, version 5.3C, on a Prime minicomputer (which really bogged down with just 4 users) and the corresponding PC version, which basically hogged all the system resources to point where nothing else could run.

I started as a contractor to the U.S. Bureau of Mines (BOM) to fix simple little problems, like typos on the data entry screens (Forms 2.3) created by the original contractors from Oracle, for the Minerals Availability System (MAS). Before this, I was a dBase programmer for almost 10 years, having written a complete (at least for the business) accounting program to track labor costs, billable and non-billable, supplies, parts, etc., with automatic inventory reduction, and monthly/yearly reports for taxes, sales commisions, etc.

Anyway back to the Oracle 'stuff'. The guy I worked for in BOM headed off to Australia for three weeks to take a look at their Oracle system for mineral resource related data. While he was gone, I finally finished all of the typo corections and needed something else to do. Without anyone to give me guidance on what else to work on, I ran the program as normal new user would and identified further areas for improvements or fixes. The biggest problems I quickly saw were with the hard-copy reporting system. There were separate screens to request each of the various reports, which was a very cumbersome way to generate the reports, so I first re-wrote the report request screens. I simply added a few more fields to the report request screen and consolidated them all into one screen, which would then fire off whatever report was selected with the appropriate parameters. This only took a day, so I still needed more to do.

The reports themselves were a bit more problemmatic. They were all written in Fortran, and I'd never done any Fortran programming before. Luckily the language itself wasn't to convoluted and I was able to easily understand it with the help of a decent book. I went through the 8 different Fortran reports and fixed numerous errors in them, added a few more input parameters to make the reports more flexible, formatting problems (since in those days we had perform our own page breaks), etc. After finishing this, I still had about a week left before my boss returned from Australia, so I went into his office to peruse through the stack of user reported problems (screen shots with handwritten comments about problems or errors). Finding some of the easier ones, I went to work on getting those corrected and working correctly.

Well, when my boss returned, he was quite surprised and pleased with the initiative I'd showed while he was gone and the extent of the fixes and enhancements I'd made to the system. A few years later, one of the government employess in the office left for a job with a different agency, and her job was pretty much identical to what I was doing, so I applied for it and got it. I've been a government employee ever since.

My background with MAS was very beneficial. The BOM had a well thought out strategy towards databases, and was making strong progress on migrating numerous small databases into one larger, cohesive database with ties into other related systems. However, in 1995, there was a mounting pressure from Sen. Ralph Regula (D-OH) to help the Democratic party show they could cut governemnt spending, so they decided to shut down the BOM in '96. Luckily for me, some members of Congress opposed to Regula's moves, allowed for the U.S. Geological Survey (USGS) to take some of us from BOM. I was in a group of 14 people here in Denver who managed to transfer. My boss and I were the only computer folks to transfer, along with the MAS database.

Now that I was in USGS, the plan was to merge MAS with the USGS's mineral database, the Mineral Resources Data System (MRDS), to create a larger, more comprehensive database. Well, plans and actions are generally two separate things, especially in government. Anybody reading this from some non-US countries, or even those who have a decent grasp of some foriegn languages, will notice that MRDS sounds like something similar what people gnerally would want to scrape off their shoes, and the database structure really was that way as well. My boss decided to take an early retirement shortly after the transfer, leaving just me to deal with the mess. Of course, coming from an agency that had just been shut down, I obviously couldn't have been knowledgeable, so I was put in the position of having to point out all of the deficiencies in the original MRDS 'structure' and when that fell on deaf ears, an outside contractor from Oracle was finally brought in, who after a careful analysis of MRDS, basically said the same thing I'd already been saying for almost three years, though much more bluntly. I remember it very well, when he finally said the existing MRDS structure "...was piece of shit, it's not a database, it's an oversized spreadsheet".

At least after this, my opinions and other statements finally seemed to carry a bit more weight, as they were confirmed by a nuetral outside expert, and some progress finally started to get underway on the database consolidation. We spent well over a year just haggling out the database structure for the new consolidated database, which when it was finally over, was remarkably similar to the structure of MAS. The hardest part was trying to figure out how to map the MRDS fields (and pieces of fields), into the new structure. MRDS was contained in some RDMS system called Universe (now owned by IBM), which allowed multi-valued fields. Unfortunately, the designers of MRDS probably shouldn't have ever been allowed anywhere near a computer, let alone a database, as even though Universe allowed a true relational structure, MRDS was designed as just one big 'table' with around 250 fields (mostly unrelated), every field was allowed to be free-form unlimited text, and no data integrity or validation rules.

We spent the most of ten years developing ways to try and parse as much data as possible out various fields into the correct locations in the new structure, but there were still many data errors even after this. MRDS wan't designed to be a real database, with interactice users. It was designed to simply be an archive of existing data from various projects, and all acces was performed by the MRDS database group. When users wanted MRDS information, they requested what the wanted, and eventually they recieved a spreadsheet of all 250 or so fields. Well, the first thing anybody does with a spreadsheet is to delete the columns they don't want, then reorder the ones they do to be in the order they want. Sometimes they'll rename the columns to make more sense to them, and then add in a few new columns as well. This is well and good, but unfortunately, once the 'project' was done with the spreadsheet, they sent it back to the MRDS group to update the 'database'. Another unfortunate step occered here, where the database group didn't verify the spreadsheet, and just read it in as is to replace the existing record in the database. This resulted in a lot of data going into the wrong field, like a bibliographic reference going into the year of discovery field, or a geologic age (unsure if it was for host rock or associated rock or age of mineralization) winding up in something like total yearly production, etc.

Another major problem was with the multi-value fields themselves. They used different delimiters all over the place, with no apparent rhyme or reason. Sometimes it was a ';', other times a ':', other times a '', and other times just a space or some other character. Sometimes they varied with the same multi-valued field, and again at other times, they inserted multi-valued data within the same field ("One value;Another value-Yet another;And a third value", using ";" as the delimiter in this example). So while this was supposed to be four separate values, it got parsed out as only three, since the second one was input incorrectly. Usually, these cases also had a ill-defined relation with another field or two as well, which may or may not have had the corresponding data input correctly, to further compound the problem.

Anyway, we finally got most of the data migrated into the new structure with minimal data loss. The new structure is (by far), a much more flexible relaional structure, with both data integrity constraints and data validation routines.

For the last three years or so, I've been working on creating an interface to the database, so users can directly work with the data, instead of working on a copy that becomes extremely problemmatic to get back into the system, especially with multiple users and the time spans involved in how long some of these projects can run. I know how this makes many (most?) of you database people cringe at the very thought, but it's a mindset within USGS that I've been trying to change for around 12 years now. The other huge mindset change I've been working on is the almost sense of aproval for all the various projects (studies), to create their own little 'database' in FileMaker, Excel, Access, etc. with no thought or regard to the new MRDS database structure, and then they give us their 'database' upon the completion of the project, and we're stuck trying to figure out how and where to place all the various pieces of information, after (and iff) it passes the data validation routines required of our database.

Besides being very inefficient, it's extremely error-prone. There's also the problem that this approach basically has each project gathering and correcting the same types of information over and over again, instead of sharing the new or corrected piece of information at the same time. Imagine there is a project studying gold deposits in the Western US (project 1). And then there's another project studying the mineral sites of Nevada (project 2), and yet another project studying placer mines in Idaho and Nevada (project 3), and all three of these projects start at the same time. Well, all three of these projects can easily encompass many of the same locations. Under the current approach, each project is going to wind up gathering and correcting many of the same pieces of information (site name, commodities present, amount of commodities, location, etc.). Say project 3 finishes in two years, and provides their 'database' (an Excel spreadsheet) to us to update the database with. This project was meticulous with the data, ensuring the validity of each piece of information, but unfortunately, didn't meet many of the business rules required of the MRDS database. We spend (at a minimum) six months parsing the data out of the spreadsheet into the correct fileds and tables and correcting various pieces of data so that it does meet our database requirements.

Well, project 3 is extremely detail oriented and takes a longer time than project 1 did, as they also performed a very detailed research of various literature to populate the resource and production tables, along with numerous field studies of the placer sites, and their locations are based upon GPS data gathered in the field. This project finishes one year after project 1 did, and we subsequently parse their Access database into MRDS in yet another 6 months. We now have two different sets of of information from the two projects, that are very good sources of data.

Well, a year later, project 2 finally finishes. This project had team members who were spread over numerous other projects as well, and they were under extreme pressure to finally wrap this project up. So instead of having the time to actaully go out and get good GPS coordinates of where the sites are located at, they use the information they find in old reference material and plot a few of the locations through a GIS system, based upon some point of reference in this old reference material (like "3 miles NW of Boondock, Nevada, near the headwaters of the stream that originates at the top of the gulch"). Okay, this is an extreme case, but also a very common one as well. This project finally finishes and provides us this type of information, and since we have no way to guage the validity or accuracy of this project's work, we spend yet another 6 months getting this data parsed into the database to update the existing records.

So, after 4 1/2 years, all three projects have finally completed their work, and their data is finally in the master MRDS database. The first two projects, which were meticulous in their data work, though covering different subject areas and therefore concentrated more on different aspects of the data, didn't have to many data differences. Project 2, taking a year longer, had an oppurtunity to get a few name changes on some sites that changed owners during that extra one year. There's a few other minor data differences, but they aren't that major, most of them are just the more detailed resource and production data.

However, project 3, being overworked and spread thin, did things the fast or easy way, and therefore they couldn't validate as much of the informations as they wanted to, and wound up using old and sometimes inaccurate data to update their FileMaker program. This data, being the last in, winds up replacing some of the very good data previously parsed into the database from the earlier projects.

So after 4 1/2 years what do we wind up with? Some good data, some bad data, and no way to really distinguish the good from the bad. We could have just created a new record for each of those from the various projects, but then we'd wind up with different records for the same site in many cases. This gets extremely cumbersome when trying to plot all the locations in a given area, especially if they had different locations. Then there's the problem of trying to consolidate the information from all three records into just one that no only is accurate (or at least at accurate as we can make it), but also of trying to determine which property is which (because of name changes), and which location is correct (project one used the north end of the site, project 3 used the center of the site, and project 2 used an estimate based on some old reference source and plotted on a large scale GIS coverage).

We (the database 'staff'), spent 1 1/12 years years parsing that data into the database, and making a few small mistakes ourselves (not knowing the data, and trying to get the data to fit within the business rules of the database, changed a few pieces of information or just dropped the information). The projects published three different studies, sometimes with different information contradicting the other publications, and possibly even reaching different conclusions. The 1 1/2 years the database staff spent parsing information into what we assumed were the correct fields in the correct format could have easily been spent on adding new features, functionality, or even adding more ties into other database systems, to allow a more comprehensive overview for future studies. Intead, the three different projects all spent time trying to gather many of the same pieces of information with varying degrees of accuracy, and wound up with three different reports which weren't based upon the same information, and therefor reached different conclusions.

If all three projects were told they must use the master database, and not allowed to create their own, all three projects could have greatly increased their chance to share the same information, and base their results upon the same information. They could have all saved time by being able to take advantage of the newer corrections or additions to the data from the other projects, when the newer data was available.

There are signs that some of these inefficiencies are slowly being addressed and identified, but it's a long, slow tedious process. Part is managemant, part is the reluctance of users to quit doing something the way they've always done it, and part of it is just plain staffing problems. For most of this, I am the 'database staff'. There are a few others working on this as well, but they are primarily working on the parsing of the older data sets into the new structure or correcting existing data problems from the consolidation. I'm trying to work primarily on enhancing the data entry and retrieval part of the system, making it as simple and useful for the suers (and therefore the various projects), as possible, so there is less of a reason or inclination to try and create their own sub-system.

Since I also work on data cleanup, etc. as problems arise, I don't have the oppurtunity to solely concentrate on the interface. However, Oracle's Application Express (Apex) program has greatly simplified this for me, allowing me to fairly easily create and maintain a web-based interface to the database with a minimum of training and without me being a full time duh-veloper (so I don't 'lose' some knowledge from lack of use), and utilizes my existing skill set without learning new, cumbersome ones. I also perform the DBA duties, end-user computer support (hardware and software) for the office I work out of, system administration and web server duties, and also work on another database that is semi-related to MRDS. If I can ever get both of them semi-snycronized in field names and types (where similarities exist), then I can create an interface to pull data from both systems, which would greatly improve the usability for many projects.

If you read this blog this far, you're probably shaking your head at the waste and inefficincies (and probably typos). Thanks for reading, and hopefully future posts won't be anywhere near as long. This was just long to kind of 'set the stage' for future posts.