The goal of this project is to develop software to predict revenue from the Lead Generation (LeadGen) business model at startup.
The LeadGen revenue is fluctuating because of various factors including seasonality, number of paying customers, the number of generated leads and so on. The startup has been collecting data from the very beginning of this business model but only starting from January 1st, 2016 has it been free from budget constraints on marketing and this date will be used as the starting point.
Dataset exploration and selection of the algorithm
I have been exploring available data trying to engineer sensible features. Some of the features will be generated by the website code, some will be added by the predictor before training.
After trying out various linear regressions including polynomial with varying regularizations (Lasso and Ridge) I have finally chosen SVR (Support Vector Regression) with the linear kernel. For the SVR I also tried to use polynomial preprocessing, but after tests, the polynomial degree was set to 1.
The idea of the predictor is to use the dataset generated on a given date to receive a revenue prediction for the entire month. At any time of the month (with the exception of the first day) I have the cumulated monthly revenue value available. This means, I only need to predict the total revenue of the remaining days and add it to the already known part of the monthly revenue.
Because of this, the SVR predictor actually predicts the monthly average daily revenue. The predicted value is then multiplied by the number of the remaining days in the month to get the unknown part of the monthly revenue.
So, the pipeline of the application will be as follows:
- Receive the dataset and convert it to the pandas DataFrame.
- Add additional features.
- Train the SVR predictor using the data from the past months.
- Generate the prediction for the daily average revenue for the days that have passed in the current month.
- Sum up the already known daily revenue from the days passed, add them to the predicted monthly average daily revenue to get the total revenue prediction for the month.
- Return the prediction and other relevant data to the requesting application.
Using this approach, I can generate revenue predictions also for the months in the past. I only need to pretend not to know the actual total revenue of the month and treat it as if it is not over yet (by picking any day in it). For the training, I then use only the months preceding the given one.
By doing so I was able to pinpoint the SVR as the best prediction algorithm for my task and fine-tune its parameters using the GridSearch.
Predictor application
The predictor application is implemented in two modules predictor.py
and source.py
located within the mtm_rev_pred
module. The main script initializes classes Source
and Pedictor
, feeds the dataset file to the Source instance to be converted into the pandas DataFrame and then channels the output to the Predictor instance that trains the algorithm and returns the prediction:
import sys import json from mtm_rev_pred.source import Source from mtm_rev_pred.predictor import Predictor source = Source() predictor = Predictor() try: fn = sys.argv[1] except IndexError: print('File name is required') exit(1) try: dataset = source.get_dataset(fn) res = predictor.predict(dataset) result = json.dumps(res) print(result) except FileNotFoundError as ex: print('No train dataset file found under {:s}'.format(fn)) except ValueError as ex: print(str(ex))
The Source module
The source.py
module contains a single class, Source
. The purpose of this module is to provide the functionality required to:
- Read the dataset file from the pre-set source folder
- Convert the data into pandas DataFrame
- Separate target values from the features
- Generate additional features
- Scale the data
In the main script, the Source
class instance only reads the file and converts it to the pandas DataFrame. The actual preprocessing is called from the Predictor
class’ method predict
.
The original features provided by the website are:
revenue
– revenue made on the given daycum_month_rev
– cumulated month revenue up to the given dayavg_daily
– average daily revenue that is calculated by the diving the cumulated month revenue by the number of days passed in the monthlead_count
– number of leads made on the given daylead_num_14
– number of leads made in the last 14 dayslead_num_7
– number of leads made in the last 7 daysrev_14
– revenue cumulated in the last 14 daysrev_7
– revenue cumulated in the last 7 dayslead_num_14_ly
– number of leads in 14 days last year (this date last year and 14 days back)lead_num_7_ly
– number of leads in 7 days last yearrev_14_ly
– cumulated 14 days revenue last yearrev_7_ly
– cumulated 7 days revenue last yearnum_days
– number of days in the current monthweekdays
– number of weekdays in the current monthactive_partners_14_d
– number of partners that received leads in the last 14 days
The additional features that the Source class generates are:
- Months – the source module translated this categorical feature into 12 separate features. That is the row will get value 1 if the day’s month matches the feature and 0 if not.
- Days left in the month – calculated by subtracting the current day from the number of days in the month
- Average daily revenue – the total monthly revenue divided by the number of days in the month. This is the target feature that the predictor will try to estimate. It is discarded from the data in the evaluated month as it is presumed unknown.
After the additional features are generated, a sample entry from the dataset looks like this (actual numbers are hidden):
lead_count ---.000000 revenue ---.000000 lead_num_14 ---.000000 lead_num_7 ---.000000 rev_14 ---.0000 rev_7 ---.0000 lead_num_14_ly ---.000000 lead_num_7_ly ---.000000 rev_14_ly ---.000000 rev_7_ly ---.000000 monthly_rev ---.0000 num_days ---.000000 weekdays ---.000000 cum_month_rev ---.0000 active_partners_14_d ---.000000 avg_daily ---.00000 Jan ---.000000 Feb ---.000000 Mar ---.000000 Apr ---.000000 May ---.000000 Jun ---.000000 Jul ---.000000 Aug ---.000000 Sep ---.000000 Oct ---.000000 Nov ---.000000 Dec ---.000000 month_days_left ---.000000
Next, these data are scaled using the StandardScaler
. This process removes the mean and scales the data to unit variance. The docstring of theStandardScaler
reads:
The standard score of a sample `x` is calculated as:
z = (x – u) / s
where `u` is the mean of the training samples or zero if `with_mean=False`,
and `s` is the standard deviation of the training samples or one if
`with_std=False`.
Finally, the Source
instance returns a tuple of the scaled features, target values, and the original (non-scaled, target values together with features) dataset. The latter will be used to calculate the current month’s known cumulated revenue later.
Predicting the monthly revenue
The Predictor
class receives the dataset as a Pandas DataFrame. Before using it to train the estimator and get the predictions, the method predict
instantiates an instance of the Source
class that performs the preprocessing. Then, the preprocessed data are fed to the method monthly_predictor
that also receives the estimator object and the month for which the prediction must be made.
def predict(self, dataset, date=None): """Provides predictions for each date of the month of the given date :param DataFrame dataset: :param string date: :return: """ if date is None: date = dataset.index[-1].to_pydatetime() else: date = datetime.datetime.strptime(str(date), '%Y-%m-%d') X, y, dataset = source.process_dataset(dataset) self.dataset = dataset res = self.monthly_predictor( self.get_svr(degree=1, include_bias=False, kernel='linear', c=1.9, epsilon=66 ), X, y.to_frame(), str(date.year), str(date.month), raw_dataset=dataset) return res
Method predict_period
To get a monthly revenue estimation, the system first generates the predicted average daily revenue for the month. This is done in the method predict_period
. This method receives the estimator model, the dataset, and three dates. The start_date
and end_date
limit the period in which the value is predicted. The period between the init_date
and start_date
is used to train the estimator.
This method returns a DataFrame with predicted values in the column avg_daily
:
def predict_period(self, model, X, y, start_date, end_date, date_init='2016-01-01'): """ Returns predictions for the specified date range using a model trained in the specified training date range :param model the predicting model :param pandas.DataFrame X: the whole available dataset of features :param pandas.DataFrame y: the whole available target values :param string start_date: date to start predicting from :param string end_date: date to predict until :param string date_init: date to start training from :return pandas.DataFrame: predictions """ X_train = source.get_data_from_date_range(X, date_init, start_date) y_train = source.get_data_from_date_range(y, date_init, start_date) X_val = source.get_data_from_date_range(X, start_date, end_date) y_val = source.get_data_from_date_range(y, start_date, end_date) if isinstance(model.named_steps['lin_reg'], SVR): model.fit(X_train, y_train.values.ravel()) else: model.fit(X_train, y_train) pred = model.predict(X_val) y_pred = y_val.copy() y_pred['avg_daily'] = pred return y_pred
Method predict_month
The method predict_month
takes these predictions of the average daily earnings, multiplies it by the number of the remaining days in the month and adds to that the known cumulated revenue of the month. This produces the prediction of the total revenue for the month,
def predict_month(self, month_data, start_date, end_date, montlhy_cum, monthly_rev): """ Predicts monthly revenue on each day of the given month :param pandas.DataFrame month_data: a dataframe with predicted average daily revenue for the month :param string start_date: starting date of the month :param string end_date: end date of the month :param pandas.DataFrame montlhy_cum: the monthly cumulative revenue values for the entire dataset :param pandas.DataFrame monthly_rev: real monthly revenue :return pandas.DataFrame: predicted monthly revenue on each day of the month """ X_month = source.get_data_from_date_range(montlhy_cum, start_date, end_date) month_data['month_rev_pred'] = ( X_month['cum_month_rev'] + (month_data['avg_daily'] * self.dataset['month_days_left']).astype(float)) month_data['monthly_rev'] = monthly_rev return month_data
Method monthly_predictor
The monthly_predictor
method takes the estimator model, the features and the target values to produce predictions of the monthly revenues for all months starting from the specified year and month. The predicted month, however, must be present in the features dataset.
This method will try to evaluate the prediction quality by comparing the predictions on each day of the month with the actual monthly revenue. This, however, only possible when the estimation is run for a month that has known real total revenue.
def monthly_predictor(self, model, X, y, start_year, start_month, raw_dataset, date_init='2016-01-01'): """ Generates monthly revenue predictions for all months from the specified year and month using training data from the date_init. For each month the speciafied model is trained on data from the init date up to the given month. The month's data is used to generate predictions and calculate score :param model: predicting model :param pandas.DataFrame X: features dataset :param pandas.DataFrame y: target values dataset :param string start_year: year to start predicting from :param string start_month: month to start predicting from :param pandas.DataFrame raw_dataset: the initial dataset containint the monthly cumulated and montly total revenues :param string date_init: date to start training from :return dict """ months = self.get_months(date_init, X.index) date_str = str(start_year) + '-' + str(start_month) + '-01' start_date_ = datetime.datetime.strptime(date_str, '%Y-%m-%d') date_str = start_date_.strftime('%Y-%m-%d') last_month_date = months[-1]['end'] monthly_rev = self.dataset['monthly_rev'].to_frame() y_comp = source.get_data_from_date_range(monthly_rev, date_init, last_month_date) y_pred = source.get_data_from_date_range(monthly_rev, date_init, date_str) last_month_params = None for month in months: if month['start'] >= date_str: y_month_avg_daily_pred = self.predict_period(model, X, y, month['start'], month['end']) month_data = self.predict_month(y_month_avg_daily_pred, month['start'], month['end'], raw_dataset['cum_month_rev'].to_frame(), raw_dataset['monthly_rev'].to_frame() ) y_month_rev_pred = month_data['month_rev_pred'].to_frame() y_month_rev_pred['monthly_rev'] = y_month_rev_pred['month_rev_pred'] y_month_rev_pred = y_month_rev_pred.drop(columns=['month_rev_pred']) y_pred = pd.concat([y_pred, y_month_rev_pred]) if isinstance(model.named_steps['lin_reg'], SVR): coef = model.named_steps['lin_reg'].coef_[0] last_month_params = pd.Series(coef.round(0), index=X.columns.values).sort_values() score = np.sqrt(mean_squared_error(y_comp, y_pred)) d_y_comp = self.process_ts_dict(y_comp.to_dict(orient='dict')['monthly_rev']) d_y_pred = self.process_ts_dict(y_pred.to_dict(orient='dict')['monthly_rev']) d_lmp = last_month_params.to_dict() return {'y_comp': d_y_comp, 'y_pred': d_y_pred, 'score': score, 'last_month_params': d_lmp}
Processing the results
The dictionary returned by the predictor is converted into a JSON string that is sent to the standard output. The website application that runs the predictor using a command-line expression, receives and parses the output.
The prediction data containing the monthly revenue prediction and the data are then saved to the database of the website.
Tests
The application is covered by unit tests. The tests are found in the subpackage tests
that contains two modules:
helper.py
– here I placed functions that generate mock data for the teststests.py
actual unit tests covering the methods of both thesource.py
andpredictor.py
modules.
Deployment
The application is deployed using a fabric script. The installation path is at /var/python/mtm_predictor/leadgen_revenue/
. The application can be installed as master
, testing
, or stage
. The repository has to have these branches present. To indicate the type of deployment, provide one of these three values to the dep_type
parameter of the installation command:
fab deploy:host=USERNAME@HOSTNAME,dep_type=TYPE -i /PATH_TO_PRIVATE_KEY
Under the target directory (e,g, /var/python/mtm_predictor/leadgen_revenue/master
) the installation script will create a releases
sub-directory. There, the installation will create sub-directories where it will clone the current state of the deployment branch to. At the end of the installation, a symlink will be created (/var/python/mtm_predictor/leadgen_revenue/master/app
) that points to the latest release. The installation will keep a maximum of 3 releases sub-directories and delete older ones after each installation.
Besides the releases sub-directories, the installation script creates alocal
subdirectory outside of the releases
directory hierarchy. This directory will contain the dataset file that the website application updates each time it asks for new predictions. There is no need to destroy it between deployments and it is created one time.
The local
sub-directory is available via a symlink from the mtm_rev_pred
package sub-directory.
Thus, the deployment directory structure is like this:
app > /var/python/mtm_predictor/leadgen_revenue/master/releases/1555083334.4494643/ local/ releases/ 1555083334.4494643/ mtm_rev_pred/ local -> /var/python/mtm_predictor/leadgen_revenue/master/local/ ... ... 1555083330.4494625/ 1555083324.4494600/
The build is set up in Jenkins as a freestyle project using the Virtualenv builder:
0 Comments