DESCRIPTION

Our sentiment indices are a high-frequency measure of consumer and market emotion, based on text scraped from a range of web sources. These texts are scored in high volume by several NLP models, primarily BERT-based scoring models.

The indices measure the overall sentiment of the market, ranging from a score of 0 (most negative) to 100 (most positive). Scores are normalized such that 50 is a baseline level of sentiment.

  • The Social Media Financial Sentiment Index measures social media sentiment regarding financial markets (primarily U.S. stock markets). Text data is sourced primarily from Reddit.
  • The Social Media News Sentiment Index measures social media sentiment regarding general news and current events. Text data is sourced primarily from Reddit.
  • The Social Media Labor Market Index measures social media sentiment regarding job searching, career advancement, and other labor market details. Text data is sourced primarily from Reddit.
  • The Traditional Media Financial Sentiment Index measures traditional news sentiment regarding financial markets. Text data is sourced primarily from Reuters, the Financial Times, and CNBC.

Indices are updated twice daily, at roughly 2PM and 8PM Eastern time (UTC 18:00/19:00 and UTC 00:00/01:00, depending on daylight saving time). Due to data lags, index values for dates less than 3 days old are preliminary and may be subject to minor revisions over the following few days.
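
The update schedule and the preliminary window can be illustrated with a short Python sketch. The function names and the `lag_days` parameter are hypothetical and not part of our pipeline; the sketch only assumes that "Eastern time" means the America/New_York time zone.

```python
from datetime import date, datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo

EASTERN = ZoneInfo("America/New_York")  # assumed meaning of "Eastern time"


def update_times_utc(day: date) -> list[datetime]:
    """Return the approximate 2PM and 8PM Eastern update times for `day`, in UTC."""
    return [
        datetime.combine(day, time(hour), tzinfo=EASTERN).astimezone(timezone.utc)
        for hour in (14, 20)
    ]


def is_preliminary(index_date: date, today: date, lag_days: int = 3) -> bool:
    """True if the index value for `index_date` is less than `lag_days` days old."""
    return (today - index_date) < timedelta(days=lag_days)
```

In summer (daylight saving time) the two updates fall at 18:00 and 00:00 UTC; in winter they fall at 19:00 and 01:00 UTC.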


Scraping

Our scripts run daily to collect thousands of news articles and Reddit posts. The table below shows which sources are scraped, which posts/articles are retained for scoring, and which sentiment index each source feeds into.

Source                            Retention Condition   Sentiment Index
CNBC (All Articles)               Retain All            Traditional Media - Financial
FT (Articles tagged "Economy")    Retain All            Traditional Media - Financial
Reuters (Business Articles)       Retain All            Traditional Media - Financial
reddit.com/r/StockMarket          Votes > 10            Social Media - Financial
reddit.com/r/Economics            Votes > 10            Social Media - Financial
reddit.com/r/wallstreetbets       Votes > 100           Social Media - Financial
reddit.com/r/CryptoCurrency       Votes > 100           Social Media - Financial
reddit.com/r/stocks               Votes > 10            Social Media - Financial
reddit.com/r/investing            Votes > 5             Social Media - Financial
reddit.com/r/news                 Votes > 1000          Social Media - News
reddit.com/r/worldnews            Votes > 1000          Social Media - News
reddit.com/r/politics             Votes > 1000          Social Media - News
reddit.com/r/jobs                 Votes > 5             Social Media - Labor Market
reddit.com/r/careerguidance       Votes > 5             Social Media - Labor Market
reddit.com/r/personalfinance      Votes > 10            Social Media - Labor Market

News articles are collected from Reuters, the Financial Times, and CNBC. Articles are scraped from the business section of each website; the title and full text of each article are retained and stored in our database.

Reddit posts are collected using the following procedure:

  • All posts made in the boards listed in the table above are scraped, using Pushshift as well as the official Reddit REST API.
  • For each board, only posts with more votes than the Retention Condition threshold stated above are retained. For these posts, the title as well as any textual post content (if it exists) are retained and stored in our database.
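
The retention step can be sketched in Python as follows. The vote thresholds are copied from the table above; the `Post` class and the `retain` and `text_for_scoring` helpers are hypothetical names used for illustration, and the sketch does not reflect how posts are actually fetched or stored.

```python
from dataclasses import dataclass

# Vote thresholds per board, copied from the table above.
RETENTION_THRESHOLDS = {
    "StockMarket": 10,
    "Economics": 10,
    "wallstreetbets": 100,
    "CryptoCurrency": 100,
    "stocks": 10,
    "investing": 5,
    "news": 1000,
    "worldnews": 1000,
    "politics": 1000,
    "jobs": 5,
    "careerguidance": 5,
    "personalfinance": 10,
}


@dataclass
class Post:
    """Hypothetical container for a scraped Reddit post."""
    subreddit: str
    title: str
    body: str
    votes: int


def retain(posts: list[Post]) -> list[Post]:
    """Keep posts with strictly more votes than their board's threshold."""
    kept = []
    for post in posts:
        threshold = RETENTION_THRESHOLDS.get(post.subreddit)
        if threshold is not None and post.votes > threshold:
            kept.append(post)
    return kept


def text_for_scoring(post: Post) -> str:
    """Combine the title with any textual post content."""
    return f"{post.title}\n{post.body}".strip()
```

Note that the comparison is strict: a post in r/wallstreetbets with exactly 100 votes is dropped, while one with 101 votes is kept.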

Scrape jobs run daily. Each run attempts to collect the current day's new articles/posts and to backfill any older articles/posts that were missed.

Scoring

Each retained article/post is then cleaned, tokenized and fed through 3 NLP scoring models:

  1. A RoBERTa model trained on the CARER (Contextualized Affect Representations for Emotion Recognition) Twitter emotions dataset. This yields a dominant emotion (joy, sadness, neutral, anger, fear, surprise) for the article/post.
  2. A DistilBERT model trained on the SST-2 dataset, which outputs a dominant sentiment (positive or negative) for the article/post.
  3. A simple lexicon-based dictionary score, which splits the article/post into its component words, classifies each word as positive or negative (based on the NRC lexicon), then takes the more common of the two.
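
As an illustration of the third (dictionary) model, here is a minimal Python sketch. The word lists are a tiny hypothetical stand-in and are not the NRC lexicon; the tokenization rule and the "neutral" result for ties are also assumptions, since the description above does not specify them.

```python
import re

# Toy stand-in word lists for illustration only; NOT the NRC lexicon.
POSITIVE_WORDS = {"gain", "growth", "strong", "rally", "hired"}
NEGATIVE_WORDS = {"loss", "crash", "weak", "layoffs", "fear"}


def dictionary_label(text: str) -> str:
    """Label text by whichever polarity has more matching words."""
    words = re.findall(r"[a-z']+", text.lower())
    positive_count = sum(word in POSITIVE_WORDS for word in words)
    negative_count = sum(word in NEGATIVE_WORDS for word in words)
    if positive_count > negative_count:
        return "positive"
    if negative_count > positive_count:
        return "negative"
    return "neutral"  # assumption: ties and texts with no matches are neutral
```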

Index Creation

For each article/post, we then take the output of all 3 models and convert them to a numerical score: positive and negative scores from the DistilBERT and dictionary models are assigned 1 and -1 respectively. From the RoBERTa model, joy/surprise are assigned 1, neutral 0, and sadness/anger/fear -1.
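
The label-to-score mapping described above can be written out directly. This is a sketch in Python; the dictionary and function names are hypothetical, while the numeric values follow the paragraph above.

```python
# Numeric scores for the RoBERTa emotion labels.
ROBERTA_SCORES = {
    "joy": 1,
    "surprise": 1,
    "neutral": 0,
    "sadness": -1,
    "anger": -1,
    "fear": -1,
}

# Numeric scores for the DistilBERT and dictionary labels.
BINARY_SCORES = {"positive": 1, "negative": -1}


def to_scores(roberta_label: str, distilbert_label: str, dictionary_label: str) -> dict[str, int]:
    """Convert the three model outputs for one article/post into numeric scores."""
    return {
        "roberta": ROBERTA_SCORES[roberta_label],
        "distilbert": BINARY_SCORES[distilbert_label],
        "dictionary": BINARY_SCORES[dictionary_label],
    }
```

For example, a post labeled fear by RoBERTa, positive by DistilBERT, and negative by the dictionary model would receive scores of -1, 1, and -1.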

For each model type and index type (see table above), we sum the scores of all posts/articles from each component board, then take the seven-day moving average of the resulting daily score. Minor cleaning and deseasonalization steps are also conducted at this point.
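
A minimal Python sketch of the aggregation step for a single model and index is shown below. It assumes a trailing (backward-looking) window and treats days with no posts as zero; neither detail is specified above. The function names are hypothetical, and the cleaning and deseasonalization steps are omitted.

```python
from collections import defaultdict
from datetime import date, timedelta


def daily_sums(scored_posts):
    """Sum numeric scores per day from an iterable of (day, score) pairs."""
    totals = defaultdict(int)
    for day, score in scored_posts:
        totals[day] += score
    return dict(totals)


def seven_day_moving_average(totals, window=7):
    """Trailing moving average over `window` days.

    Assumptions: the window looks backward from each day, and days
    with no posts contribute a daily sum of 0.
    """
    if not totals:
        return {}
    first_day, last_day = min(totals), max(totals)
    averaged = {}
    day = first_day + timedelta(days=window - 1)
    while day <= last_day:
        window_days = [day - timedelta(days=offset) for offset in range(window)]
        averaged[day] = sum(totals.get(d, 0) for d in window_days) / window
        day += timedelta(days=1)
    return averaged
```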

Lastly, for each index type, we stack all 3 models using an equal-weighting scheme to generate a single score for each index.
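
Equal weighting amounts to a simple mean across the three model series. The Python sketch below illustrates this under the assumption that only days present in every model's series are combined; the function name and input layout are hypothetical, and the final rescaling to the 0 to 100 range is not shown.

```python
def stack_equal_weight(model_series):
    """Average several per-model series with equal weights.

    `model_series` maps a model name to a dict of {day: value}.
    Assumption: only days present in every model's series are kept.
    """
    if not model_series:
        return {}
    series_list = list(model_series.values())
    common_days = set(series_list[0])
    for series in series_list[1:]:
        common_days &= set(series)
    count = len(series_list)
    return {
        day: sum(series[day] for series in series_list) / count
        for day in sorted(common_days)
    }
```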

Sentiment Analysis

For more details please contact charles (at) cmefi (dot) com.

The charts below show the emotion-level scores (see Methodology section for details) for each social media index and subreddit.
