Generating More Attractive News Titles

Himaeda
Apr 28, 2021

Introduction

We used to get our news from a single outlet, such as the New York Times or CNN, but in recent years we have direct access to news from many sites across the internet. If we were employees at a news site, what would we do to get more readers to read our stories?

In this project, we analyze how to generate more attractive news titles. We identify the features that make a title engaging and try to generate more attractive titles from news content. We tackled the following three tasks:

Classifying news topics:

We are interested in comparing the attractiveness of titles across news stories that cover the same content. To group such stories, we applied a topic model and then aggregated the news data by topic.

Exploring the features that affect title attractiveness:

We explore which features affect the attractiveness of news titles. Specifically, we built a prediction model targeting attractiveness and measured the feature importance.

Generating more attractive news titles:

We built an ensemble model that combines a model for generating titles from news content with the prediction model.

We believe that this analysis can be used not only for news titles but also for generating catchphrases for marketing advertisements.

Dataset

For this analysis, we used webhose.io to retrieve news data. We retrieved 24,236 English news articles covering the 30 days from March 15th. This dataset contains the columns shown in Table 1. You can see our code in our GitHub repository.

Table 1: data columns
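As a reference, the sketch below shows one way to pull articles from the webhose.io REST API with the requests library. It is only an illustrative assumption: the endpoint path, query parameters, pagination fields, and the YOUR_API_TOKEN placeholder follow the public webhose.io documentation at the time rather than our exact retrieval script.

```python
import requests

# Hypothetical sketch: fetch English-labeled news posts from webhose.io.
# Endpoint and parameters are assumptions, not the exact script we used.
API_TOKEN = "YOUR_API_TOKEN"                    # placeholder
url = "http://webhose.io/filterWebContent"
params = {"token": API_TOKEN, "format": "json", "q": "language:english"}

posts = []
while True:
    resp = requests.get(url, params=params).json()
    batch = resp.get("posts", [])
    posts.extend(batch)
    # webhose.io returns a relative "next" URL for pagination
    next_url = resp.get("next")
    if not batch or not next_url:
        break
    url, params = "http://webhose.io" + next_url, {}

print(f"Retrieved {len(posts)} articles")
```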

Although the articles were labeled as English by the API provider, some of them were actually in other languages. To extract the English articles, we used the langdetect library and removed the non-English ones. As a result, this analysis deals with 23,280 news articles.
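A minimal sketch of that filtering step with langdetect is shown below; the dataframe and its text column name are assumptions about how the articles are stored.

```python
from langdetect import detect

def is_english(text: str) -> bool:
    """Return True when langdetect identifies the text as English."""
    try:
        return detect(text) == "en"
    except Exception:
        # Empty or undetectable text: drop it
        return False

# df is the dataframe of retrieved articles; 'text' holds the article body
df = df[df["text"].apply(is_english)].reset_index(drop=True)
print(len(df))  # 23,280 articles remain in our run
```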

Exploratory Data Analysis

Table 2 shows the top 5 countries and the top 5 news providers in this dataset. Most of the articles are about the United States.

Table 2: Top 5 countries and news sites

Word distribution

We also analyzed the characteristics of the words in the entire dataset. The top five words by number of occurrences are shown in Table 3. Since these are news articles, ‘say’ is the most frequent word, and other words we often see in the news, such as ‘new’ and ‘people’, are also in the top five.

Table 3: top 5 frequent words
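A frequency table like this can be produced with a few lines; the sketch below assumes a `cleaned_text` column holding the preprocessed, stopword-free text of each article (the column name is ours).

```python
from collections import Counter

# Count tokens across all articles; cleaned_text is assumed to hold
# preprocessed, whitespace-separated tokens for each article.
counter = Counter()
for doc in df["cleaned_text"]:
    counter.update(doc.split())

print(counter.most_common(5))  # e.g. 'say', 'new', 'people', ...
```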

Method

Topic modeling

To classify the topics of news articles, we apply an LDA model to the dataset. In this analysis, we use the MALLET LDA model through the gensim library. Although the native LDA model in gensim converges faster, its results are harder to interpret well, which is why we chose the MALLET LDA model for this project. The hyperparameter, the number of topics, was determined based on coherence and perplexity.

To install MALLET in your Google Colab environment, you can run a setup cell like the one below.
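This is a typical setup sketch rather than our exact cell: the MALLET 2.0.8 download URL and the /content paths assume a Colab runtime, and the gensim wrapper shown here requires gensim < 4.0.

```python
# Colab cell: download and unpack MALLET, then point gensim's wrapper at it.
# MALLET is a Java toolkit, so a JDK must be available on the runtime.
!apt-get install -y -qq default-jdk
!wget -q http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
!unzip -q mallet-2.0.8.zip -d /content

import os
os.environ["MALLET_HOME"] = "/content/mallet-2.0.8"
mallet_path = "/content/mallet-2.0.8/bin/mallet"

# Available in gensim < 4.0; the wrapper was removed in gensim 4.x
from gensim.models.wrappers import LdaMallet
# lda_model = LdaMallet(mallet_path, corpus=corpus, num_topics=25, id2word=id2word)
```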

Figure 1 shows the perplexity and coherence for each number of topics. Based on this figure, we set the number of topics to 25.

Figure 1: Perplexity and Coherence for each number of topics
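The sweep over candidate topic counts can be sketched as follows. This is a rough outline, assuming `corpus`, `id2word`, and the tokenized `texts` come from the preprocessing step; the candidate range is illustrative.

```python
from gensim.models import CoherenceModel
from gensim.models.wrappers import LdaMallet

# Train a MALLET LDA model for each candidate topic count and measure
# c_v coherence on the tokenized documents.
coherence_scores = []
for k in range(5, 55, 5):
    model = LdaMallet(mallet_path, corpus=corpus, num_topics=k, id2word=id2word)
    cm = CoherenceModel(model=model, texts=texts,
                        dictionary=id2word, coherence="c_v")
    coherence_scores.append((k, cm.get_coherence()))
    # Perplexity can be obtained from gensim's native LdaModel via
    # log_perplexity(corpus) when comparing with that implementation.

for k, score in coherence_scores:
    print(f"num_topics={k}: coherence={score:.4f}")
```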

Prediction model

Baseline model: We apply the topic model with 25 topics to determine the topic of each article. Figure 2 shows the distribution of the number of documents per topic, and Figure 3 shows a bar plot of the log number of Facebook shares per topic.

Figure 2: Distribution of topics
Figure 3: Log of shares for each topic

Next, we create a baseline model using LightGBM, which was developed by Microsoft and has achieved excellent results in many data science competitions. Because it is a tree-based machine learning algorithm, we can easily read the feature importances off the results.

Figure 4: training performance of baseline model
Figure 5: feature importance of the baseline model

To build a valid model, we split the dataset into training, validation, and test data. As features for the model, we used ‘news site’, ‘image in news’, ‘country’, ‘page views’, ‘title length’, ‘text length’, and ‘length of cleaned text’. The resulting RMSE on the test set is 0.7294. Figures 4 and 5 show the training performance and the feature importance of the model.
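The baseline can be reproduced with a few lines of LightGBM. The sketch below assumes the feature columns above are already encoded in `X` and that `y` is the log of Facebook shares; the split ratios and hyperparameters are illustrative, not tuned values.

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Split into train / validation / test sets
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

train_set = lgb.Dataset(X_train, y_train)
valid_set = lgb.Dataset(X_valid, y_valid, reference=train_set)

params = {"objective": "regression", "metric": "rmse", "learning_rate": 0.05}
booster = lgb.train(params, train_set, num_boost_round=1000,
                    valid_sets=[valid_set], early_stopping_rounds=50)

rmse = np.sqrt(mean_squared_error(y_test, booster.predict(X_test)))
print(f"test RMSE: {rmse:.4f}")   # ~0.73 in our run

# Feature importance plot (Figure 5)
lgb.plot_importance(booster)
```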

Prediction model with article sentiment: Jin et al. (2020) suggested in their paper that article titles can be improved by using the sentiment of the article. In this project, we also investigate how the accuracy of the model changes when sentiment is extracted from the articles. There are a number of pre-trained models for sentiment analysis; in this analysis, we used the Transformers sentiment-analysis pipeline. We run the code below to get the article sentiment (it’s that easy!).
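A minimal sketch of that step is shown below. The pipeline defaults to a DistilBERT model fine-tuned on SST-2; mapping the POSITIVE/NEGATIVE labels to a single signed score, and the character cutoff for long articles, are our own conventions for illustration.

```python
from transformers import pipeline

# Pre-trained sentiment-analysis pipeline from Hugging Face Transformers
sentiment = pipeline("sentiment-analysis")

def article_sentiment(text: str) -> float:
    # The first 512 characters keep us safely under the model's token limit
    result = sentiment(text[:512])[0]
    # Fold label and confidence into one signed score
    return result["score"] if result["label"] == "POSITIVE" else -result["score"]

df["sentiment"] = df["text"].apply(article_sentiment)
```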

We trained the same model as the baseline with sentiment added as a feature, and the RMSE on the test data was 0.728.

Figure 6: feature importance with sentiment analysis

Prediction model with a DNN and a language model: Next, we attempt to build a model using a deep neural network (DNN). Specifically, we take the text of the article as input, compute a text score using the DistilBERT model proposed by Sanh et al. (2020), and then combine that score with the article’s metadata inside the network. Our model framework is shown in Figure 7.

Figure 7: model framework with text information
Figure 8: training performance of the DNN with text information

By adding the score from the text to the metadata, we can train the text-scoring model and the metadata DNN at the same time. The training results are shown in Figure 8. The RMSE was 1.131, still worse than the LightGBM score, but this suggests that the article’s content also affects the number of shares.

We created the Dataset and Network as shown below, and you can also see all of the code in our repository.
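Below is a condensed sketch of what such a Dataset and Network can look like. The layer sizes, column handling, and the way the DistilBERT score is concatenated with the metadata are assumptions for illustration, not a copy of the repository code.

```python
import torch
import torch.nn as nn
from torch.utils.data import Dataset
from transformers import DistilBertModel, DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

class NewsDataset(Dataset):
    """Pairs tokenized article text with its metadata features and target."""
    def __init__(self, texts, meta, targets, max_len=256):
        self.enc = tokenizer(list(texts), truncation=True, padding="max_length",
                             max_length=max_len, return_tensors="pt")
        self.meta = torch.tensor(meta, dtype=torch.float32)
        self.targets = torch.tensor(targets, dtype=torch.float32)

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        return (self.enc["input_ids"][idx], self.enc["attention_mask"][idx],
                self.meta[idx], self.targets[idx])

class ShareRegressor(nn.Module):
    """Scores the text with DistilBERT and combines it with the metadata."""
    def __init__(self, n_meta_features):
        super().__init__()
        self.bert = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.text_head = nn.Linear(self.bert.config.dim, 1)   # text score
        self.regressor = nn.Sequential(
            nn.Linear(1 + n_meta_features, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, input_ids, attention_mask, meta):
        hidden = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = hidden.last_hidden_state[:, 0]     # embedding at the [CLS] position
        text_score = self.text_head(cls)
        return self.regressor(torch.cat([text_score, meta], dim=1)).squeeze(-1)
```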

Prediction model with a DNN and a generative model: Finally, we propose a prediction model that includes a generative model. For the generative model, we used the T5 model proposed by Raffel et al. (2020). Ideally, the generative model should be tuned at the same time by feeding its generated titles into the network, as shown in Figure 9, but here we used a pre-trained model outside the network in order to examine the changes caused by differences in title content, as shown in Figure 10.

Figure 9: model framework with text information and generate model
Figure 10: model framework with text information and generate model in this project

The RMSE of 2.564 was much worse than the LightGBM score, most likely because we used the pre-trained model without any fine-tuning.
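Generating a candidate title with the pre-trained T5 model can be sketched as below. We use T5’s summarization prefix as a stand-in headline generator since no fine-tuning is involved, and the generation parameters (beam size, length limits) are illustrative choices.

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def generate_title(article_text: str) -> str:
    # T5 was pre-trained with task prefixes; "summarize:" is the closest
    # off-the-shelf task to headline generation without fine-tuning.
    inputs = tokenizer("summarize: " + article_text, return_tensors="pt",
                       truncation=True, max_length=512)
    output_ids = model.generate(inputs["input_ids"], max_length=20,
                                num_beams=4, early_stopping=True)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(generate_title(df.loc[0, "text"]))
```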

Evaluation

For topic modeling, we used coherence and perplexity to decide the appropriate number of topics. Also, judging from the important words in Table 4, the topics seem to be well separated. For the model predicting the attractiveness of news titles, accuracy was evaluated by RMSE and related metrics. As shown in Table 5, the simple LightGBM model had the highest accuracy.

Table 4: keywords in each topic
Table 5: model comparison

On the other hand, if we focus only on the DNN models, the model that uses text information has better accuracy, so we will keep developing text-based models in the future.

Conclusion

In this report, we categorized the topics of news articles and built a model for predicting the number of shares. As expected, we found that the number of shares is related to both the news topic and the article’s information.

In addition, by combining a DNN with language models, we built a model that scores the text and predicts the number of shares. This approach can not only handle text information that LightGBM cannot, but can also be extended to embed generative models in the future.

However, since language models have a huge number of parameters that need tuning, we also have to consider the computational cost.

What’s Next

As mentioned in the conclusion, we plan to work on a generative model embedded in the DNN, like the one shown in Figure 9. However, since the titles produced by the generative model need to be processed by the scoring model, training may take a huge amount of time. We may need to rethink the model architecture to make training more efficient.

Thank you for reading!

Reference

Di Jin, Zhijing Jin, Joey Tianyi Zhou, Lisa Orii, and Peter Szolovits. 2020. Hooks in the headline: Learning to generate headlines with controlled styles. [paper]

J. G. Lee, S. Moon, and K. Salamatian. 2010. An approach to model and predict the popularity of online contents with explanatory factors. [paper]

Sho Takase, Jun Suzuki, Naoaki Okazaki, Tsutomu Hirao, and Masaaki Nagata. 2016. Neural headline generation on Abstract Meaning Representation. [paper]

Chenhao Tan, Lillian Lee, and Bo Pang. 2014. The effect of wording on message propagation: Topic- and author-controlled natural experiments on Twitter. [paper]
