Using the CRISP-DM Method to Analyze Big Sales Data

A method of processing data from the human sales process (calls/meetings) to optimize the funnel, shorten the deal cycle and increase conversions.

Basic Process

Calls and meetings are recorded, and integration with the CRM system makes it possible to collect clean, accurate data that can then be analyzed with Big Data techniques: Process Mining, Spaghetti Diagrams, Reverse Engineering, etc.

Given a large, high-quality array of clean data, we can reverse-engineer the sales process: rebuild the sales funnel, identify the shortest path to a closed deal, or collect the tactics and phrases that most often lead to conversion. The reverse problem can also be solved: identifying the epic fails in the process.
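As a minimal sketch of what such a reverse-engineering pass might look like, the snippet below mines path variants from a hypothetical CRM event log. All deal IDs and stage names are illustrative assumptions, not the actual data model:

```python
from collections import Counter

# Hypothetical event log: one ordered list of stage names per deal ID.
# In practice these rows would come from the CRM's activity history.
event_log = {
    "deal-1": ["qualify", "demo", "proposal", "negotiate", "won"],
    "deal-2": ["qualify", "proposal", "won"],
    "deal-3": ["qualify", "demo", "proposal", "lost"],
    "deal-4": ["qualify", "proposal", "won"],
}

# Group deals by their path (a process "variant"), rank variants by frequency.
variants = Counter(tuple(path) for path in event_log.values())

for path, count in variants.most_common():
    print(f"{count}x  {' -> '.join(path)}  (outcome: {path[-1]})")

# The shortest variant ending in "won" is a candidate for the optimized funnel.
won_paths = [p for p in variants if p[-1] == "won"]
shortest_won = min(won_paths, key=len)
print("Shortest winning path:", " -> ".join(shortest_won))
```

On real data the same frequency ranking also surfaces the "epic fail" variants: the most common paths that end in a lost deal.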

Below we walk through the process step by step and describe the results it can deliver.

CRISP-DM methodology

Fig.1 CRISP-DM methodology


Let's break the process into steps according to the CRISP-DM methodology.


Business Understanding

The goal of the project is to identify growth points for sales conversion, depending on products, regions and other variables, through an in-depth analysis of the available internal data. A secondary goal is to determine which external data should be taken into account for a more accurate result.


- Identify growth areas

- Determine loss zones

- Describe a new and effective process

- Determine process performance metrics

Risks:

- the sample size may be insufficient for unambiguous conclusions

- data storage and the ability to obtain the data

- lack of insights that interest the customer

Data Understanding

Ideally, since all dialogs pass through the system and there is CRM integration, all the necessary data is stored in the CRM: the list of events for each deal, recordings of calls and meetings, the deal status history and related objects, and likely meeting minutes, emails, contracts, terms of reference (TOR), commercial proposals (CP, all versions), pre-sale documentation, etc. The task is to understand the strengths and weaknesses of the existing document set.

Understanding the total volume and categories of data available for analysis. Enriching data with third-party sources. Identifying initial hypotheses for analysis:


- data sufficiency

- data relevance

Descriptive statistics of the data (target metrics) are calculated and their graphs plotted
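As a minimal sketch of this deliverable, descriptive statistics for target metrics can be computed with the standard library. The metric names and values below are purely illustrative:

```python
import statistics

# Hypothetical target metrics per closed deal: deal-cycle length in days
# and number of calls. Real values would come from the CRM export.
cycle_days = [12, 30, 21, 45, 18, 27, 33, 15]
calls_per_deal = [3, 7, 5, 9, 4, 6, 8, 3]

def describe(name, values):
    # Print the basic descriptive statistics for one metric.
    print(f"{name}: n={len(values)} "
          f"mean={statistics.mean(values):.1f} "
          f"median={statistics.median(values)} "
          f"stdev={statistics.stdev(values):.1f} "
          f"min={min(values)} max={max(values)}")

describe("cycle_days", cycle_days)
describe("calls_per_deal", calls_per_deal)
```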

Adjustment of business goals based on data status

The problem with data consistency is that very few deals may have a complete set of data united by one transaction ID: no historical data, CRM data with errors (up to 90%), incomplete records, untimely updates, missing pieces, documents not saved or stored in different places, unreadable files, communication conducted in unauthorized channels that are not logged, etc.

The main post-factum problem is collecting the full chain of events for each deal (Data Evidence). Otherwise, the accuracy of the analysis will be too low.

That's why retrospective analysis is hard: you have to dig through the archives for a long time to find everything.

The most efficient way is to implement a Data Management Policy first, then collect the data, and only then run the analysis on the collected data.
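One way to make the Data Evidence requirement concrete is a completeness audit per transaction ID. A minimal sketch, with hypothetical artifact types and deal IDs:

```python
# Hypothetical audit: for each transaction ID, check that every required
# artifact type is present before the deal enters the analysis dataset.
REQUIRED = {"call_record", "status_history", "proposal", "contract"}

deals = {
    "deal-1": {"call_record", "status_history", "proposal", "contract"},
    "deal-2": {"call_record", "proposal"},  # broken chain
    "deal-3": {"call_record", "status_history", "proposal", "contract"},
}

# Deals whose evidence chain is complete vs. exactly which pieces are missing.
complete = [d for d, artifacts in deals.items() if REQUIRED <= artifacts]
broken = {d: sorted(REQUIRED - a) for d, a in deals.items()
          if not REQUIRED <= a}

print("complete chains:", complete)
print("broken chains:", broken)
```

Running such an audit before the analysis shows immediately how much of the archive is usable and where the "missing bricks" are.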

Data Preparation

Suppose the deals are simple: they are conducted over the phone, with no long correspondence or document exchange, and the commercial proposal (CP) is reworked at most once. Suppose we need to figure out what the manager should say on the phone, and how, so that conversion grows. Then we take the entire volume of available call recordings for the last 3-24 months and do the preparatory work.

Preparation of datasets:

1. Arrange calls (events) into chronological chains linked to each transaction ID, since a single deal may include several stages of negotiations.

2. Transcribe voice to text.

3. Data partitioning, step 1: determine the purpose of each call and assign a label to each conversation, including whether the call was useful. For example, some calls qualify a lead, others close a deal, etc. The initiator of the call (client or manager) should also be reflected in the properties.

4. Data partitioning, step 2, a deeper level: decompose each call into elements of context, i.e. what happened inside. Here you may have to work from a hypothesis that defines a set of entities and decompose everything within that set, or from several hypotheses, each with its own set of entities. Later you can determine which hypothesis was most successful.

5. Carry out clustering with the model and compare the results with the hypotheses used.
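Partitioning step 1 can be sketched with simple keyword rules. A production setup would more likely use a trained classifier or an LLM; the labels, keywords and sample transcripts below are all assumptions for illustration:

```python
# Hypothetical keyword rules mapping a call transcript to a purpose label.
RULES = [
    ("qualification", ("budget", "decision maker", "timeline")),
    ("closing", ("contract", "sign", "discount")),
    ("demo", ("feature", "show you", "screen")),
]

def label_call(transcript: str) -> str:
    # Score each label by how many of its keywords appear in the transcript;
    # fall back to "other" when nothing matches.
    text = transcript.lower()
    scores = {label: sum(kw in text for kw in kws) for label, kws in RULES}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"

calls = [
    "Do you have budget approved and who is the decision maker?",
    "Let me show you the feature on my screen.",
    "We can sign the contract this week with a small discount.",
    "Just calling to say happy holidays!",
]

labels = [label_call(c) for c in calls]
print(labels)
```

The same label set then becomes a property on each event in the chain, which is what step 2 and the clustering in step 5 build on.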

Risks:

- recognition quality

- the set of hypotheses may be false and produce false positives

- wrong choice of entities, etc.

- few unbroken chains


Data visualization and scoring model creation

1. Determine the representativeness of the sample with a Spaghetti Diagram, which shows:

- the number of complete chains

- the number of successful chains

- the number of broken event chains

2. Detail the successful chains:

- the chain of events on the timeline that increased the lead's score, with the score indicated

- key success factors at every milestone

- detail down to the key phrases

- successful negotiation frameworks

3. Do the same for the unsuccessful chains.

4. General characteristics of the clients whose deals close successfully: the qualification criteria.

5. How many deals could have been closed on that array using the new process.

6. What the new process looks like on the time/acceleration graph.

Risk: the difficulty of making the zoom-in/zoom-out view of the process presentable.
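The chain counts behind the Spaghetti Diagram in step 1 can be sketched as follows. The chains and the `None` gap marker are illustrative assumptions:

```python
from collections import Counter

# Hypothetical per-deal event chains; None marks a gap (a broken chain).
chains = {
    "deal-1": ["qualify", "demo", "proposal", "won"],
    "deal-2": ["qualify", None, "proposal", "lost"],
    "deal-3": ["qualify", "proposal", "won"],
    "deal-4": ["qualify", "demo", None, "lost"],
    "deal-5": ["qualify", "demo", "proposal", "won"],
}

# Classify every chain as broken, complete-and-successful, or
# complete-but-unsuccessful.
stats = Counter()
for events in chains.values():
    if None in events:
        stats["broken"] += 1
    elif events[-1] == "won":
        stats["complete_successful"] += 1
    else:
        stats["complete_unsuccessful"] += 1

total = sum(stats.values())
for kind, n in stats.items():
    print(f"{kind}: {n}/{total} ({100 * n / total:.0f}%)")
```

If the broken share dominates, the sample is not representative and the data-collection step has to be fixed before any conclusions are drawn.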


Assessing the quality of the model

Conduct a test simulation (a retrospective analysis) or a pilot as an A/B test: for example, how much we could have sold in the past (on the same dataset) if we had applied the new process.

Evaluate the effectiveness of the new process against metrics defined in advance.

Design the experiment as an A/B test:

- the volume of the control and target samples

- the duration of the experiment

- the criteria for stopping the experiment

- the resources and technology needed to conduct the experiment
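To size the control and target samples, a standard two-proportion z-test approximation can be used. The baseline and target conversion rates below are purely illustrative assumptions:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_group(p1: float, p2: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate per-group sample size for detecting a change in
    conversion rate from p1 to p2 with a two-sided z-test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # significance quantile
    z_b = NormalDist().inv_cdf(power)           # power quantile
    p_bar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

# E.g. to detect a lift from a 10% to a 13% call-to-deal conversion:
n = sample_size_per_group(0.10, 0.13)
print(f"{n} deals per group ({2 * n} total)")
```

The same function also gives a natural stopping criterion: the experiment should not be read out before each group has reached the computed size.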

Launch the pilot.

Make another iteration of changes if necessary. Repeat the experiment if needed; once the result is achieved (the metrics have changed), proceed to implementation.

Risks:

- influence of the external environment

- the human factor



Implementation Plan:

- Implement data collection automation.

- Present the results of the analysis to the team.

- Hyperbolize the difference: how much they spent versus how much they could have earned working differently.

- Explain what the new process will look like.

- Show what routine work they will no longer have.

- Give sales reps real-time tips.

A clean data collection pipeline and feedback loop let you modify elements of the process and actually tailor it to each customer segment, or even to each target customer (ABM in action).

Risks:

- human factor / self-preservation instinct

- unwillingness to change

- changes in the external environment

- the learning curve






There are two options:

  1. One-time process improvements at the depth at which you have clustering; this can be done manually if you have clean data.

  2. A continuous process of improvement coupled with consistently clean data.

🤖 Schedule a demo meeting to increase the efficiency of your sales department.

