How Do We Talk To A Data Architect About An Agile Approach?
I've found that one place you will see a level of resistance to an agile approach is when you work with the data and data analytics people. It took me a while to understand why. I kept hearing “but we are dealing with gigabytes of data, not small change” and “we test close to a production environment” and so on.
Over time I started to understand that I needed a different toolbox when talking to the data people. I had to learn that this problem space is different and that I needed to learn a different approach.
My “ah-ha” moment came when attending a presentation by Lynn Winterboer and Cher Fox, “Test Automation - Agile Enablement for Data Warehousing and Business Intelligence Teams”. They pointed out that developers and data people speak different languages and that, when we talk about automating tests, using the language of developers puts up barriers to understanding for the data people. Developers talk in terms of APIs (their testing approaches require this). For data people the primary language is SQL. If we want to talk to data people, we need to present ideas in those terms.
One of the big areas we have to work on is the understanding of how an incremental approach to testing would work. Data people are used to dealing with large sets of data. The equivalent of a unit test in this space is a representative sample of the data. As we work on, for example, a data mapping, we identify the maximum and minimum values of fields and so on, and then use these discussions to drive the selection of a much smaller set of data for testing.
This does not mean we don't test against production data. It's just that we use this approach to gain a reasonable expectation that things work before we get to that point (we should be surprised, and generate additional specific cases, if a problem is detected in production).
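As a rough illustration of picking a representative sample, the sketch below pulls the rows that hold the minimum value, the maximum value, or a NULL for each field of interest. The `orders` table, its columns, and the `representative_sample` helper are all hypothetical, and a real selection would be driven by the mapping discussion rather than by these three rules alone.

```python
import sqlite3

def representative_sample(conn, table, columns):
    """Return ids of rows holding the MIN, the MAX, or a NULL for each column.

    Hypothetical helper: assumes the table has an integer `id` column and
    that `table` and `columns` are trusted names (they are interpolated
    into the SQL text, not bound as parameters).
    """
    ids = set()
    for col in columns:
        for agg in ("MIN", "MAX"):
            row = conn.execute(
                f"SELECT id FROM {table} "
                f"WHERE {col} = (SELECT {agg}({col}) FROM {table}) "
                f"ORDER BY id LIMIT 1"
            ).fetchone()
            if row:
                ids.add(row[0])
        # NULLs are a boundary case of their own, so keep one example per column.
        row = conn.execute(
            f"SELECT id FROM {table} WHERE {col} IS NULL ORDER BY id LIMIT 1"
        ).fetchone()
        if row:
            ids.add(row[0])
    return sorted(ids)

# Stand-in for a much larger source table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, 3),
        (2, 0.01, 5),
        (3, 99999.99, 1),
        (4, 50.0, None),
        (5, 20.0, 1000),
        (6, 30.0, 7),
    ],
)

sample = representative_sample(conn, "orders", ["amount", "quantity"])
print(sample)  # ids of the boundary rows to copy into the test data set
```

Rows 1 and 6 sit in the middle of every range, so they are left out; the four remaining rows cover each boundary that was identified.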
From an implementation perspective we also put these tests and the data under version control. As coverage increases over time, we build up a set of regression tests for mappings in general.
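One way such a test might look is sketched below: the mapping is expressed in SQL, and the sample rows and expected output sit next to it so that all three can be versioned together. The `src_customer` table, the mapping rules, and the sample values are invented for illustration.

```python
import sqlite3

# Hypothetical mapping under test: clean up country codes and flag
# negative balances.
MAPPING_SQL = """
SELECT id,
       UPPER(TRIM(country)) AS country_code,
       CASE WHEN balance < 0 THEN 'OVERDRAWN' ELSE 'OK' END AS status
FROM src_customer
ORDER BY id
"""

# Small representative sample: padded lower-case text, a value just below
# zero, and a very large value.
SAMPLE_ROWS = [
    (1, " us ", 0.0),
    (2, "de", -0.01),
    (3, "FR", 1000000.0),
]

# The expected result is recorded separately from the mapping logic.
EXPECTED = [
    (1, "US", "OK"),
    (2, "DE", "OVERDRAWN"),
    (3, "FR", "OK"),
]

def run_mapping(rows):
    """Load rows into a throwaway in-memory database and apply the mapping."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE src_customer (id INTEGER, country TEXT, balance REAL)"
    )
    conn.executemany("INSERT INTO src_customer VALUES (?, ?, ?)", rows)
    return conn.execute(MAPPING_SQL).fetchall()

def test_customer_mapping():
    assert run_mapping(SAMPLE_ROWS) == EXPECTED

test_customer_mapping()
```

Because the expectation is stored independently of the SQL, a change to either one without the other makes the test fail, which is the signal we want.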
For data people, the next issue is “how do we deal with the changes?” For example, how do we merge two tables together (if this is something we want to do)? Won't all the existing tests fail?
And that is the point. Yes, some subset of the existing tests will fail: the ones that are affected by the change. Automated tests function like double-entry bookkeeping in accounting. If a test fails, then either the logic we are now using is the problem, or the existing test is the problem. Once we identify the source of the problem, we fix it so that everything is “green” again. And because all the other tests still pass, we have confidence that we have not inadvertently broken something else.
There is often a question about how you evolve a schema forward so you can do incremental work. An agile approach to this is “database refactoring”. Simplistically, the approach works as follows. Let's say we want to rename a field from “FName” to “FirstName”. One refactoring approach could be:
- Add a field with the new name to the database
- Set and publicize a date for retirement of the old field
- Behind the scenes, set things up so that changes in either field are kept synced until the retirement date
- Eventually remove the old field and the syncing process
This is just to provide a flavor of the incremental approach in the data world. There are whole books on the thinking here. And clearly we are only having a technical discussion of the steps here. You would also have to work the “process” side of the conversation.
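The rename steps above can be sketched against an in-memory SQLite database. This is only one possible shape, under stated assumptions: the `person` table and the trigger names are hypothetical, the syncing is done with triggers (other databases offer other mechanisms), and the final `DROP COLUMN` needs SQLite 3.35 or newer, so it is guarded by a version check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, FName TEXT)")
conn.execute("INSERT INTO person (FName) VALUES ('Ada')")

# Step 1: add a field with the new name and backfill it from the old one.
conn.execute("ALTER TABLE person ADD COLUMN FirstName TEXT")
conn.execute("UPDATE person SET FirstName = FName")

# Step 3: keep the two fields in sync until the retirement date.
# The WHEN clauses stop a trigger from acting once the values already match.
conn.executescript("""
CREATE TRIGGER person_sync_insert AFTER INSERT ON person
BEGIN
    UPDATE person
    SET FirstName = COALESCE(NEW.FirstName, NEW.FName),
        FName     = COALESCE(NEW.FName, NEW.FirstName)
    WHERE id = NEW.id;
END;

CREATE TRIGGER person_sync_old AFTER UPDATE OF FName ON person
WHEN NEW.FName IS NOT NEW.FirstName
BEGIN
    UPDATE person SET FirstName = NEW.FName WHERE id = NEW.id;
END;

CREATE TRIGGER person_sync_new AFTER UPDATE OF FirstName ON person
WHEN NEW.FirstName IS NOT NEW.FName
BEGIN
    UPDATE person SET FName = NEW.FirstName WHERE id = NEW.id;
END;
""")

# A legacy writer still uses FName; a migrated writer uses FirstName.
conn.execute("UPDATE person SET FName = 'Augusta' WHERE id = 1")
conn.execute("INSERT INTO person (FirstName) VALUES ('Grace')")
during = conn.execute(
    "SELECT FName, FirstName FROM person ORDER BY id"
).fetchall()
print(during)  # both fields should agree on every row

# Step 4: on the retirement date, remove the syncing process and the old field.
conn.executescript("""
DROP TRIGGER person_sync_insert;
DROP TRIGGER person_sync_old;
DROP TRIGGER person_sync_new;
""")
if sqlite3.sqlite_version_info >= (3, 35, 0):
    conn.execute("ALTER TABLE person DROP COLUMN FName")

columns = [row[1] for row in conn.execute("PRAGMA table_info(person)")]
print(columns)
```

Step 2, publicizing the retirement date, has no code: it is the part of the refactoring that happens between people rather than in the database.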
One issue that I have seen many businesses struggle with is the value of all this data. It's not that businesses don't see value in having reports and analytics, but rather that business people cannot easily use the data to get the information they need. Data people explain “all the data is here, and there is all this other data as well.” An agile approach would suggest starting with the “simplest approach that can work” for the customer. So rather than working from the data modeling perspective, you start with a report or an API that the customer is requesting and then provide the data required to fulfill that request, evolving the data model over time. Feedback from the customer comes through use of the report or API, which is why that is the first thing delivered.