Applying Telemarketing Speech Text Data to Marketing Models
Improving the accuracy of speech-content classification by incorporating the output of a Convolutional Neural Network's (CNN) last fully connected layer as supplementary features. (This article is migrated from my Medium post.)
In the rapidly changing Chinese FinTech landscape, traditional customer-acquisition channels such as in-feed (stream) advertising, SMS messaging, and cold calls are quickly losing effectiveness. Potential new customers are becoming harder to reach, and acquisition costs are soaring as the pool of reachable prospects shrinks and channel service charges rise with inflation. Companies are therefore strongly encouraged to make the most of legally obtained data from their existing channels to expand their reach and opportunities.
In marketing, predicting user behavior with machine learning models is a highly effective strategy. By analyzing a customer’s recent actions across various channels, tailored advertising campaigns can be applied to different segments of the customer base. One medium available to most companies worldwide is the text transcribed from customer speech recorded during telemarketing calls.
I recently embarked on a project in which I developed a tree-based classification model from the recorded conversations between our company’s AI saleswoman and customers. Remarkably, both the AUC (Area Under the Receiver Operating Characteristic Curve) and the KS (Kolmogorov-Smirnov) statistic compared favorably with those of other classification models built on non-speech user behaviors.
After cleaning the data, I kept a straightforward two-column table (illustrated below). The first column, “customer_speech,” records only the customers’ responses, deliberately omitting the AI’s repetitive scripted speeches, which could not be customized due to technological limitations. The second column labels whether the customer’s subsequent action was positive or negative.
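Since the original table image is not reproduced here, below is a hypothetical illustration of that layout in pandas; the speech snippets and the column name `label` are invented for demonstration:

```python
import pandas as pd

# Invented records for illustration only; 1 marks a positive follow-up action.
sample = pd.DataFrame({
    'customer_speech': ['我现在不太方便，晚点再联系吧', '可以，额度是多少？你发个链接给我'],
    'label': [0, 1],
})
```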
Preparations for training
The following steps were designed to prepare the data for training, with a specific focus on processing Chinese text; a code sketch of the whole pipeline follows the list.
- Data Splitting: The data is divided into two parts based on the specified ‘out of time’ date. The portion of data before this date is used for training and testing, and the portion on or after this date is designated as ‘out of time’ data. The train and test data are saved to CSV files.
- Chinese Text Processing: Libraries such as jieba are used to tokenize the Chinese text in the ‘customer_speech’ column. Tokenization splits a paragraph into sentences or words; since written Chinese has no spaces between words, this step amounts to word segmentation.
- Feature Extraction: scikit-learn’s CountVectorizer converts the tokenized text into a numerical bag-of-words representation, enabling machine learning algorithms to process it.
- Training and Testing Split: The ‘train.csv’ file is read into a pandas DataFrame, and then split into training and testing sets using train_test_split from the sklearn library, with 10% of the data reserved for testing.
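A minimal sketch of these four steps, assuming the two-column layout shown earlier plus a `call_date` column for the out-of-time split; the file names, cutoff date, and random seed are illustrative:

```python
import jieba
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical file and column names; 'call_date' holds 'YYYY-MM-DD' strings.
df = pd.read_csv('speech_data.csv')  # columns: customer_speech, label, call_date

# Out-of-time split: rows before the cutoff go to train/test, the rest to OOT.
cutoff = '2023-06-01'  # illustrative cutoff date
df[df['call_date'] < cutoff].to_csv('train.csv', index=False)
df[df['call_date'] >= cutoff].to_csv('oot.csv', index=False)

# Tokenize the Chinese text with jieba and rejoin with spaces so that
# CountVectorizer can split on whitespace.
train_df = pd.read_csv('train.csv')
train_df['tokens'] = train_df['customer_speech'].astype(str).map(
    lambda s: ' '.join(jieba.cut(s)))

# Bag-of-words features; the relaxed token pattern keeps single-character
# Chinese words that the default pattern would drop.
vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
X = vectorizer.fit_transform(train_df['tokens'])
y = train_df['label']

# 90/10 train/test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)
```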
Next, construct a basic Convolutional Neural Network (CNN) and extract the output of its last fully connected layer to serve as input features for a subsequent tree-based model.
- Defining the CNN Model: A class named CNNModel is created using PyTorch’s nn.Module. The constructor (__init__) initializes the layers, including:
- An embedding layer to transform the input words into continuous vectors (nn.Embedding).
- A one-dimensional convolutional layer (nn.Conv1d) to detect local n-gram patterns across the token sequence.
- A fully connected linear layer (nn.Linear) for binary classification.
- Forward Pass: The forward method defines how the data flows through the network: through the embedding layer, a transpose operation to switch dimensions, a ReLU activation following the convolution, and a max-pooling operation. The fully connected layer then produces a raw logit; no sigmoid is applied here, because the loss function chosen below (BCEWithLogitsLoss) applies the sigmoid internally, and applying it twice would be a bug.
- Creating the Model: An instance of CNNModel is created with the specified vocabulary size and embedding dimension. The Adam optimizer and the Binary Cross-Entropy with Logits loss function (BCEWithLogitsLoss) are set up for training.
- Training Loop: A loop goes through the training data for one epoch (see the full sketch after this list). In each iteration:
- The forward pass is computed.
- The loss between the predicted and target values is calculated.
- The gradients are computed using backpropagation.
- The model’s weights are updated using gradient descent.
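Putting the four steps above together, here is a minimal PyTorch sketch. It reuses `train_df` from the earlier snippet; the vocabulary construction, padding length, and hyperparameters (embedding size, filter count, kernel size, batch size, learning rate) are my own illustrative choices rather than the author’s exact code:

```python
import torch
import torch.nn as nn

# Map each jieba token to an integer id; 0 is reserved for padding/unknown.
vocab = {tok: i + 1 for i, tok in enumerate(
    {t for s in train_df['tokens'] for t in s.split()})}

def encode(sentence, max_len=50):
    ids = [vocab.get(t, 0) for t in sentence.split()][:max_len]
    return ids + [0] * (max_len - len(ids))  # pad to a fixed length

X_ids = torch.tensor([encode(s) for s in train_df['tokens']])
y_t = torch.tensor(train_df['label'].values, dtype=torch.float32)

class CNNModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, n_filters=64, kernel_size=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size)
        self.fc = nn.Linear(n_filters, 1)  # binary classification head

    def forward(self, x):
        e = self.embedding(x).transpose(1, 2)  # (batch, embed_dim, seq_len)
        c = torch.relu(self.conv(e))           # convolution + ReLU
        p = torch.max(c, dim=2).values         # global max-pool over time
        return self.fc(p).squeeze(1)           # raw logits for BCEWithLogitsLoss

model = CNNModel(vocab_size=len(vocab) + 1, embed_dim=100)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()  # applies the sigmoid internally

# One training epoch over mini-batches of 64.
model.train()
for i in range(0, len(X_ids), 64):
    xb, yb = X_ids[i:i + 64], y_t[i:i + 64]
    optimizer.zero_grad()
    loss = criterion(model(xb), yb)  # forward pass + loss
    loss.backward()                  # backpropagation
    optimizer.step()                 # weight update (gradient descent)
```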
Performance comparison
The concluding step involved retrieving the fully connected layer’s features from the CNN and merging them with the original bag-of-words vectors to serve as input features for a tree-based model. The newly trained XGBoost classification model displayed a noticeably higher KS (Kolmogorov-Smirnov) value than the initial model that lacked these additional features.
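A sketch of this stacking step, reusing `model`, `X_ids`, `X`, and `y` from the snippets above. Here I read the supplementary features as the pooled activations that feed the final linear layer; taking the layer’s logit output instead would work similarly. The XGBoost hyperparameters are illustrative:

```python
import xgboost as xgb
from scipy.sparse import csr_matrix, hstack

# Run the trained CNN up to the representation feeding the final linear layer.
model.eval()
with torch.no_grad():
    e = model.embedding(X_ids).transpose(1, 2)
    pooled = torch.max(torch.relu(model.conv(e)), dim=2).values

# Concatenate the CNN features with the original bag-of-words vectors.
X_aug = hstack([X, csr_matrix(pooled.numpy())])

# Train the tree-based model on the augmented feature matrix.
clf = xgb.XGBClassifier(n_estimators=300, max_depth=4, eval_metric='auc')
clf.fit(X_aug, y)
```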
[Figure: result of the original model without supplementary features]
[Figure: result of the current model with supplementary features included]
Copyright notice:
This article was written by [等等肯尼亚]. If you repost it, please include a link to the original. Thank you.
https://blog.csdn.net/kqw1983/article/details/132141951