Data preparation
Data preparation is the first stage of data analysis. It is relatively unglamorous compared to the process of running statistics and building models, but is essential for those statistics to work.
Preparation includes investigating the data structures, understanding the types of data, cleaning and deduplicating records, and coding, aggregating and transforming the data for analysis.
Data for analysis can come in many forms and types - from simple flat-file data structures, to complex relational databases, to unstructured data in the form of text, images or even video.
The process of data preparation involves understanding what types of data are available and how the data is structured and coded. The data will need to be pulled from pre-existing sources into a dedicated analysis database, usually using extraction scripts so that the data pull can be repeated if necessary. Never work directly on the raw data.
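A repeatable extraction might be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module, not a recommended architecture: the table name customers, its columns, the target table customers_raw and the function extract_customers are all assumptions made for the example. The script only reads from the source and rebuilds the analysis copy on every run, so it can be rerun safely.

```python
import sqlite3


def extract_customers(source: sqlite3.Connection, analysis: sqlite3.Connection) -> int:
    """Copy the (hypothetical) customers table into the analysis database.

    The source is only read, never modified. The analysis copy is dropped
    and rebuilt on each run, so repeating the extraction gives the same result.
    """
    rows = source.execute(
        "SELECT customer_id, name, postcode FROM customers"
    ).fetchall()

    analysis.execute("DROP TABLE IF EXISTS customers_raw")
    analysis.execute(
        "CREATE TABLE customers_raw (customer_id INTEGER, name TEXT, postcode TEXT)"
    )
    analysis.executemany("INSERT INTO customers_raw VALUES (?, ?, ?)", rows)
    analysis.commit()
    return len(rows)


# In-memory databases stand in for the real source and analysis stores.
source_db = sqlite3.connect(":memory:")
source_db.execute(
    "CREATE TABLE customers (customer_id INTEGER, name TEXT, postcode TEXT)"
)
source_db.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "A. Smith", "AB1 2CD"), (2, "B. Jones", None)],
)
source_db.commit()

analysis_db = sqlite3.connect(":memory:")
extracted = extract_customers(source_db, analysis_db)
print(f"Extracted {extracted} customer records")
```

In practice the source would be a read-only connection to the operational system, and the extraction script would be kept under version control alongside the cleaning scripts.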
The data will then need to be cleaned. This can involve deduplication, looking for missing values, and checking for coding consistency and outliers. This stage normally proceeds through active investigation and summarisation of the data. Do all addresses have postcodes? Is the naming format consistent throughout the data? Do the data link properly, e.g. linking customers to purchases, or web visitors to their journey history?
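The questions above can be turned into a simple summary script. The sketch below is one possible approach using only the Python standard library; the field names (customer_id, name, postcode) and the function profile_records are hypothetical, and the duplicate rule (same name and postcode after trimming and lower-casing) is just one assumption about what counts as a duplicate.

```python
from collections import Counter


def profile_records(customers: list[dict], purchases: list[dict]) -> dict:
    """Summarise basic quality issues. Nothing is changed or removed here."""
    # Missing values: postcode absent or blank.
    missing_postcode = sum(
        1 for c in customers if not (c.get("postcode") or "").strip()
    )

    # Duplicates: same name and postcode after trimming and lower-casing.
    keys = Counter(
        (
            (c.get("name") or "").strip().lower(),
            (c.get("postcode") or "").strip().lower(),
        )
        for c in customers
    )
    duplicate_records = sum(n - 1 for n in keys.values() if n > 1)

    # Linkage: purchases whose customer_id matches no known customer.
    known_ids = {c["customer_id"] for c in customers}
    orphan_purchases = sum(
        1 for p in purchases if p["customer_id"] not in known_ids
    )

    return {
        "customers": len(customers),
        "missing_postcode": missing_postcode,
        "duplicate_records": duplicate_records,
        "orphan_purchases": orphan_purchases,
    }


# Made-up records for illustration only.
sample_customers = [
    {"customer_id": 1, "name": "Ann Smith", "postcode": "AB1 2CD"},
    {"customer_id": 2, "name": "ann smith ", "postcode": "ab1 2cd"},
    {"customer_id": 3, "name": "Bob Jones", "postcode": ""},
]
sample_purchases = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 3, "amount": 4.0},
    {"customer_id": 9, "amount": 7.5},
]

quality_report = profile_records(sample_customers, sample_purchases)
print(quality_report)
```

A report like this is for investigation only. Decisions about what to remove or recode are made afterwards, in the cleaning scripts.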
The cleaning stage is an attempt to standardise and normalise the data, removing low-quality records and coding or recoding fields if required. It also includes building aggregation fields such as total sales or total purchases, and deciding how to deal with missing values or erroneous data.
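As a rough sketch of what standardising, recoding and aggregating can look like in a script, the example below cleans one customer record and builds a total sales field. The recoding table TITLE_CODES, the field names and both functions are assumptions for illustration; real rules would come out of the investigation stage.

```python
from collections import defaultdict

# Hypothetical recoding table: variants found in the data mapped to one code.
TITLE_CODES = {"mr": "MR", "mr.": "MR", "mrs": "MRS", "mrs.": "MRS", "ms": "MS"}


def clean_customer(record: dict) -> dict:
    """Return a standardised copy of one customer record."""
    cleaned = dict(record)

    # Standardise: collapse repeated spaces and use one naming format.
    # (str.title() is crude and will mis-case names such as "McDonald".)
    cleaned["name"] = " ".join((record.get("name") or "").split()).title()

    # Missing-value rule: blank postcodes become an explicit None.
    postcode = (record.get("postcode") or "").strip().upper()
    cleaned["postcode"] = postcode or None

    # Recode: map known variants to one code, anything else to UNKNOWN.
    title = (record.get("title") or "").strip().lower()
    cleaned["title"] = TITLE_CODES.get(title, "UNKNOWN")
    return cleaned


def total_sales_by_customer(purchases: list[dict]) -> dict:
    """Aggregate purchase amounts into a total per customer_id."""
    totals = defaultdict(float)
    for purchase in purchases:
        totals[purchase["customer_id"]] += purchase["amount"]
    return dict(totals)


raw_record = {"customer_id": 1, "name": "  ann   SMITH ", "postcode": "ab1 2cd", "title": "Mr."}
cleaned_record = clean_customer(raw_record)

sales_totals = total_sales_by_customer(
    [
        {"customer_id": 1, "amount": 10.0},
        {"customer_id": 1, "amount": 5.5},
        {"customer_id": 2, "amount": 3.0},
    ]
)
# Attach the aggregate field; customers with no purchases get 0.0.
cleaned_record["total_sales"] = sales_totals.get(cleaned_record["customer_id"], 0.0)
print(cleaned_record)
```

Returning a cleaned copy rather than editing the record in place keeps the extracted data untouched, so the cleaning rules can be changed and rerun.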
The result is a cleaned file for analysis. Cleaning can take a number of rounds, particularly with large datasets, or where rules are being applied to a large dataset based on analysis of a sample of records. For this reason all data extraction and cleaning should be scripted, so that each round can be rerun consistently.
Once the data has been cleaned, a second stage is often data enhancement. Enhancement normally means adding or linking additional data to the analysis file. Transaction data may be matched to web-journey data, for instance, or external data sources such as GIS information can be added by address.
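Linking an external source to the analysis file can be sketched as a lookup on a shared key. The example below assumes a hypothetical lookup table keyed on postcode, with made-up segment labels; the function enhance_with_segment and the matching rule (postcode with spaces removed, upper-cased) are illustrative assumptions rather than a standard method.

```python
def normalise_postcode(postcode) -> str:
    """Build a matching key: remove spaces and upper-case (assumed rule)."""
    return (postcode or "").replace(" ", "").upper()


def enhance_with_segment(customers: list[dict], geo_lookup: dict) -> list[dict]:
    """Add a segment field to each customer from an external lookup.

    Every customer is kept (a left join); unmatched records get None so
    that the match rate can be checked afterwards.
    """
    enhanced = []
    for customer in customers:
        key = normalise_postcode(customer.get("postcode"))
        enhanced.append({**customer, "segment": geo_lookup.get(key)})
    return enhanced


# Hypothetical external data: postcode key mapped to a made-up classifier.
geo_lookup = {"AB12CD": "Urban professionals", "EF34GH": "Rural retirees"}

cleaned_customers = [
    {"customer_id": 1, "postcode": "AB1 2CD"},
    {"customer_id": 2, "postcode": "ef3 4gh"},
    {"customer_id": 3, "postcode": None},
]

enhanced_customers = enhance_with_segment(cleaned_customers, geo_lookup)
matched = sum(1 for c in enhanced_customers if c["segment"] is not None)
print(f"Matched {matched} of {len(enhanced_customers)} customers")
```

Checking the match rate after each link is worthwhile, since a low rate usually points back to a cleaning problem in the key field.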
Blending different data sources opens up the possibility of deeper data analysis. For example, adding geoclassifiers enables the data to be analysed by potential income or lifestage based on location. This then allows analysis by different segments, or identification of sales per subgroup for core target groups.
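Once a classifier has been attached, sales per subgroup is a straightforward aggregation. The sketch below assumes each record already carries hypothetical segment and total_sales fields; the segment names and figures are invented for the example.

```python
from collections import defaultdict


def sales_by_segment(records: list[dict]) -> dict:
    """Summarise customer count, total sales and mean sales per segment."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        # Records with no classifier are grouped rather than dropped.
        segment = record.get("segment") or "Unclassified"
        totals[segment] += record["total_sales"]
        counts[segment] += 1

    return {
        segment: {
            "customers": counts[segment],
            "total_sales": totals[segment],
            "mean_sales": totals[segment] / counts[segment],
        }
        for segment in totals
    }


# Invented records for illustration only.
analysis_records = [
    {"customer_id": 1, "segment": "Segment A", "total_sales": 100.0},
    {"customer_id": 2, "segment": "Segment A", "total_sales": 50.0},
    {"customer_id": 3, "segment": "Segment B", "total_sales": 30.0},
    {"customer_id": 4, "segment": None, "total_sales": 20.0},
]

segment_summary = sales_by_segment(analysis_records)
for name, stats in segment_summary.items():
    print(name, stats)
```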
Alternatively, AI tools might be used to extract sentiments or topics from text, which can then be used to tune models of customer retention.
The outcome from the data preparation phase provides the bedrock for analysis. It also often has to be revisited as analyses and models are developed, both to validate modelling hypotheses and to ensure that relationships in the data are genuine.
For help and advice on transforming data into insight contact info@dobney.com