Website accessibility
Show or hide the menu bar

Database analysis and Big Data

X26_1_data_analysis1.jpg Companies have vast stores of information in their databases that cover customer behaviour over time, but often this is locked away and difficult to use. Data science (previously data analysis) is the process of extracting this data, cleaning it, coding it, potentially merging it with other data and performing a statistical analysis to better understand customers and customer behaviour, and increasingly using automated algorithmic systems or AI to extract insight from the databases to target offers and adverts to different customers.

Increasingly, the phrase Market Intelligence is being used to describe the use of data science to link and analyse databases of information held within an organisation. Although, we see market intelligence more broadly than, database analysis is now a central business function for market insight.

The aim of data science and database analysis is to build predictive statistical models with the aim of increasing a customers interest in purchasing, the amount they spend or to influence their purchase behaviour.

Data is typically found in transaction or sales databases, contact and customer service databases, loyalty programs, vast web or internet app-based databases of online behaviour and purchasing, and can be combined with external data.

For an analyst, the basic procedures for analysing database information, whether a simple contact database or Big Data are:

It is not uncommon for there to be many separate databases in an organisation, each holding different information. Newer companies are more likely to have unified database systems, but it is more common to have operational databases that are lightly linked (eg customer ID) that then need pulling together and unifying for analysis.

For on-going database analysis, automating as many of these tasks as possible becomes vital with a large dataset, both to ensure that the data is of the same quality for each run of analysis and to save time and effort repeating the same work with each data snapshot. While smaller data snapshots can be handled by hand, anything over a few tens of thousands of records needs to be properly automated and documented.

Our data science analysts start by building scripts to extract, pool and link the data sources before then taking the data to statistical analysis, modeling or into machine learning/AI.

Extracting information

Many internal databases grow and develop through use and contingency, and consequently identifying and extracting the data can be complicated. For long-standing or legacy systems, particularly where an operational database has evolved over time, databases and tables can be poorly documented, with data that is missing or has been moved, or where table schema or data fields have changed in definition over time.

The data from live systems needs to be pulled for analysis, and fields matched and checked for content quality and table relationships confirmed and validated.

For external data, such as social media feeds, data may be brought in from data brokers, or obtained directly by scraping (subject to privacy rules). These data feeds also need to be cleaned and matched and may bring additional complications such as de-translating.


Once data has been obtained, it has to be cleaned. Many databases tend to build up inaccuracies and duplications over time. For instance as addresses change, postcodes are entered incorrectly, or there may be duplication of records sometimes caused by mistaken data entry, but more often than not, because customers have changed and duplicate records have been created (in a typical business-to-business database 20-25% of the data will be out of date after a year just because of people changing jobs). Similarly text feeds need a level of processing to standardise the data and to screen for potential problems.

Within an internal database, or when merging datasets, deduping is an important, but sometimes challenging task. Automated systems exist, but some level of 'eyeballing' has to be done to check the quality of the dedupe.

Next data may need to be recoded, and missing or erroneous values identified. When looking at aspects such as purchase histories, it is often the case that the data has to be grouped up and reclassified. For instance each product on a database will have a separate product code, but for analysis several individual products may need to be grouped together.

The process of cleaning eventually leads to automated scripts including de-duplication and cleaning up missing or bad data, but often there is an element of verification that needs doing by hand - often by examining smaller samples of data.


Once the individual data sources have been cleaned, they can be merged with other data sources. Merging again is not entirely straightforward as some allowance may be necessary for the same customer having a different name on different databases. For instance Acme Building Contractors might also be known as ABC. Consequently, there may also be a second period of cleaning necessary once the data has been merged.

A common merge for consumer sets is to add geographical-based classification data from external agencies such as the ACORN or MOSAIC or to link in external data from consumer data companies such as Experian. These provide an additional layer of classification or segmentation data on top of the existing data that can add fine detail for modeling.


There are many different types of analysis that can be carried out on the data from the database. The first part of any analysis is usually an exploratory stage just to see what's there. A very common simple approach is called Pareto Analysis which involves ranking customers by value and then breaking them into quintiles or deciles to see who the most valuable customers are and what their purchasing characteristics are. In text analysis it might be a simple word frequency count prior to any attempt at sentiment or concept analysis.

Standard transactional database measures are recency, frequency and value. So who bought in the last month, 3 months, 6 months? Who has bought once a year, twice a year etc? How much did they spend? What was the difference between those spending a lot and those spending in the next category down (and so can we get uplift).

Increasingly businesses look to track customers and then look at customer journeys - particularly for web-based analytics - what transactions happened when and how did a customer move from one transaction to the next; what did the customer journey look like?

The core aim for many types of analysis is to build a 'propensity model'. That is a model to identify customers who are most likely act in a certain way. For instance, those people who are most likely to buy,  or those people who would be most likely to respond to a particular communication, or those who are likely to leave or stop buying.

Various types of statistical tools and analysis can be used to build propensity models. From classifying, grouping and labelling customers, to various forms of regression. Much large scale database analysis is done via machine learning using automated statistical investigation or artificial intelligence using deep neural networks. Data is typically analysed, and then validated against hold out samples to reduce the likelihood of overfitting.

The classification and grouping means database data can be used for segmentation. A major difference between database segmentation and market research segmentation is that the results can be marked back onto the database - each customer is labelled with their segment. This means that if you need to contact or track a particular segment from the database this is entirely possible, whereas for market research you are typically taking a second level guess.

Implement, and learn

Once the analysis is creating marketing insights, the next stage is implementation - that is to use the data to affect customer behaviour. For instance, to apply a segmentation with tailored communication, specifically targeted offers and a system of response measurement and management.

Implementation means tracking how well the analysis performs compared to the modeling, and so reflects back onto the databases.

This need for multi-faceted implementation leads to the development of algorithmic and experimental marketing and the importance of bringing the analysis back to websites.

Blending Big Data and research

A recurring view of Big Data is the idea that all the information you need is sitting in the databases and just needs to proper analysis and the business will be able to predict exactly what the customer wants and will do. Unfortunately, that is far from the truth.

Big Data analysis can find relationships and correlations in the data and therefore help improve and optimise products and services, but the main problem with database data is that it is backwards looking - that is it tells you what customers have done. If a new competitor enters the market, or you launch a new product, there is no data about what will happen next. There is also an 'analytical delay' - that is analysis, finding useful insights, takes time. By the time the analysis is finished the market may have moved on to new things.

For this reason, research and experimentation are also still required. Big Data can be combined with small-scale live experimentation to test how people react to changes, offers and communications, or blended with research to understand the why of behaviour, for instance tracking e-commerce journeys and then following up with research into purchase motivations and objectives.

For help and advice on the effective use of database analysis and Big Data contact

Previous article: Competitive intelligence Next article: Using published reports
More details

Go to Notanant menuWebsite accessibility

Access level: public

This site uses essential cookies only. By continuing to use this site you accept our use of cookies: OK