Initial setup steps

This commit is contained in:
2026-05-20 12:01:39 +02:00
commit 1315ff3d99
6 changed files with 589 additions and 0 deletions

1
.gitignore vendored Normal file
View File

@@ -0,0 +1 @@
*.csv

21
LICENSE Normal file
View File

@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2025 Marcel Weschke
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

58
README.md Normal file
View File

@@ -0,0 +1,58 @@
# Google Trends Market Sentiment Analysis Tool
## Overview
Traditional market data captures what has happened, but rarely explains *why* or what happens next. This project introduces a systematic framework that leverages alternative data—specifically online search volumes via Google Trends—as a leading indicator for tactical asset allocation and risk control.
By analyzing real-time shifts in collective investor attention, the tool quantifies market psychology before it fully materializes into trading decisions.
---
## The Core Scaling Challenge & Solution
> **The Problem:** Google Trends normalizes search volume to a relative $0 \text{ to } 100$ scale *per individual request*. This makes it statistically impossible to directly compare or chain together data from different batch requests.
>
> **The Algorithmic Solution:** This script implements an **"Anchor-Logic"** to establish a unified global scale. Every automated batch request includes a high-volume, neutral reference term (configurable via `--anchor`, default: `'weather'`). The pipeline then dynamically rescales parallel batches using the **median ratio** of the overlapping anchor series:
>
> $$\text{Scaling Factor} = \text{median}\left(\frac{\text{Anchor}_{\text{Target Batch}}}{\text{Anchor}_{\text{Reference Batch}}}\right)$$
>
> This technique achieves true cross-batch comparability across independent API calls.
---
## Methodology & Pipeline Architecture
The prototype (`google_trends_sentiment_prototype.py`) is structured as a modular quantitative pipeline:
### 1. Data Ingestion (Anchor-Based)
Automated retrieval of pre-defined Risk-On, Risk-Off, and Macroeconomic keywords via the `pytrends` API, structurally unified globally using the Anchor-Logic described above.
### 2. Normalization Layer
Applies a **Z-score transformation** to the rescaled raw data. This establishes statistical parity across keywords with vastly different structural search volumes by centering the mean at $0$ and scaling variance to $1$:
$$z = \frac{x - \mu}{\sigma}$$
Where:
* $x$ is the anchor-adjusted search volume intensity.
* $\mu$ is the historical mean of that specific keyword series.
* $\sigma$ is the historical standard deviation of the series.
### 3. Index Construction & Signal Extraction
* **Sentiment Spread:** Measures the relative strength of optimism versus pessimism in the market:
$$\text{Sentiment Spread} = \left( \frac{1}{N} \sum_{i=1}^{N} z_{\text{Risk-On}, i} \right) - \left( \frac{1}{M} \sum_{j=1}^{M} z_{\text{Risk-Off}, j} \right)$$
* **Macro PCA Factor:** Extracts the first principal component ($PC_1$) from the combined Z-score feature matrix using Singular Value Decomposition (SVD) via `scikit-learn`:
$$\mathbf{Z} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T \implies PC_1 = \mathbf{Z}\mathbf{v}_1$$
This isolates the dominant underlying psychological driver capturing the highest common variance.
### 4. Market Validation (Optional)
Resamples the extracted signals to a weekly frequency and performs quantitative correlation analysis against live financial benchmarks using `yfinance` without compromising the statistical independence of the core signal.
*Note: This prototype currently focuses on contemporaneous correlation as a proof-of-concept. Time horizons and keyword definitions are structurally predefined rather than data-driven optimized.*
---
## Getting Started
### Dependencies
Install the required quantitative stack:
```bash
pip install pytrends pandas numpy scikit-learn yfinance matplotlib

Binary file not shown.

After

Width:  |  Height:  |  Size: 445 KiB

View File

@@ -0,0 +1,504 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Author: Marcel Weschke
Date: 2026-02-19
Script Name: google_trends_sentiment_prototype.py
Description:
------------
Task: Prototyping a Market Sentiment Indicator using Google Trends.
This tool extracts and analyzes search query intensities as leading indicators
for risk management and tactical asset allocation.
Scaling Challenge & Algorithmic Solution:
-----------------------------------------
Google Trends normalizes search volume to a 0-100 scale per individual request.
To ensure cross-batch comparability, this script implements an "Anchor-Logic":
- A common reference term (default: 'weather') is included in every request.
- Batches are rescaled using the median ratio of the anchor series to
establish a unified global scale.
Objective:
----------
Construct a robust sentiment factor by synthesizing search intensities
of risk-related and macroeconomic keywords.
Methodology:
------------
1) Data Ingestion: Automated retrieval of Risk-On, Risk-Off, and Macro terms.
2) Normalization: Applying Z-score transformations for cross-keyword statistical parity.
3) Index Construction:
- Sentiment Spread: (Avg Risk-On Z-score) - (Avg Risk-Off Z-score).
- Macro PCA Factor: Extraction of the first principal component (common variance).
4) Validation: Quantitative correlation analysis against market benchmarks (e.g., MSCI World).
Outputs:
--------
- trends_raw_<geo>.csv
- trends_features_<geo>.csv
- sentiment_data_<geo>.csv
- sentiment_plot_<geo>.png / combined_sentiment_analysis_<geo>.png
Dependencies:
-------------
pip install pytrends pandas numpy scikit-learn yfinance matplotlib
Usage Examples:
---------------
python google_trends_sentiment_prototype.py --geo GLOBAL --no-plot
python google_trends_sentiment_prototype.py --geo GLOBAL --ticker URTH
python google_trends_sentiment_prototype.py --geo DE --ticker ^GDAXI
"""
import argparse
import time
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np
import pandas as pd
from pytrends.request import TrendReq
from sklearn.decomposition import PCA
import yfinance as yf
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
pd.set_option('future.no_silent_downcasting', True) # Prevends future Warning
# ------
# Config
# ------
@dataclass
class TrendsConfig:
"""Configuration settings for the Google Trends API request."""
timeframe: str = "today 5-y" # e.g. "today 12-m", "2019-01-01 2024-12-31"
geo: str = "" # ""=GLOBAL, "DE"=Deutschland, "US"=USA, ...
hl: str = "en-US"
tz: int = 360 # Timezone (Minutes); 360 ~ CET/CEST (proxy)
cat: int = 0 # 0=all categories
gprop: str = "" # ""=web search, "news", "images", "youtube", "froogle"
retries: int = 5
sleep_s: float = 1.0
anchor: str = "weather" # Used to normalize across different keyword batches
batch_size: int = 4 # +1 Anchor => max 5 Keywords pro pytrends call
# Anchor-Keyword-Sets: For DE-Geo may add german synonyms - default: GLOBAL
RISK_ON = ["buy stocks", "equity rally", "risk on", "emerging markets", "carry trade"]
#RISK_OFF = ["recession", "market crash", "credit spread", "default", "safe haven"]
RISK_OFF = ["energy crisis", "market crash", "credit spread", "debt ceiling", "trade war"]
MACRO = ["inflation", "interest rates", "central bank", "unemployment", "bond yields"]
# -------------------------
# Utility: pytrends wrapper
# -------------------------
def _chunks(xs: List[str], n: int) -> List[List[str]]:
"""Split a list into smaller chunks of size n."""
return [xs[i:i+n] for i in range(0, len(xs), n)]
def build_pytrends(cfg: TrendsConfig) -> TrendReq:
"""Initialize the Pytrends request object."""
return TrendReq(hl=cfg.hl, tz=cfg.tz)
def fetch_interest_over_time(pytrends: TrendReq, keywords: List[str], cfg: TrendsConfig) -> pd.DataFrame:
"""
Fetch search volume data for a specific keyword list with retry logic.
Returns a DataFrame with the search interest over the specified timeframe.
"""
last_err = None
for attempt in range(1, cfg.retries + 1):
try:
pytrends.build_payload(
kw_list=keywords,
timeframe=cfg.timeframe,
geo=cfg.geo,
cat=cfg.cat,
gprop=cfg.gprop,
)
df = pytrends.interest_over_time()
if df is None or df.empty:
raise RuntimeError("Empty response from Google Trends")
if "isPartial" in df.columns:
df = df.drop(columns=["isPartial"])
return df
except Exception as e:
last_err = e
time.sleep(cfg.sleep_s * attempt)
raise RuntimeError(f"Failed to fetch trends after {cfg.retries} retries. Last error: {last_err}")
# --------------------------------
# Core: Batch-Rescaling via Anchor
# --------------------------------
def rescale_batches_via_anchor(batch_frames: List[pd.DataFrame], anchor: str) -> pd.DataFrame:
"""
Normalizes multiple API batches by using a common 'anchor' keyword.
This overcomes Google's 0-100 scaling limitation for different requests.
"""
if not batch_frames:
return pd.DataFrame()
base = batch_frames[0].copy()
if anchor not in base.columns:
raise ValueError("Anchor not present in base batch")
out = base.drop(columns=[anchor], errors="ignore")
base_anchor = base[anchor].replace(0, np.nan)
for i in range(1, len(batch_frames)):
df = batch_frames[i].copy()
if anchor not in df.columns:
raise ValueError(f"Anchor not present in batch {i}")
a = df[anchor].replace(0, np.nan)
# Calculate the ratio between the current batch's anchor and the base batch
ratio = (base_anchor / a).replace([np.inf, -np.inf], np.nan)
scale = np.nanmedian(ratio.values)
if not np.isfinite(scale) or scale <= 0:
scale = 1.0
df_rescaled = df.drop(columns=[anchor], errors="ignore") * scale
out = out.join(df_rescaled, how="outer")
return out.sort_index()
def fetch_trends_all_keywords(cfg: TrendsConfig, all_keywords: List[str]) -> pd.DataFrame:
"""Orchestrates the fetching and rescaling of all keywords in batches."""
pytrends = build_pytrends(cfg)
keywords = [kw for kw in all_keywords if kw.lower() != cfg.anchor.lower()]
batches = _chunks(keywords, cfg.batch_size)
batch_frames = []
for b in batches:
kw_list = b + [cfg.anchor]
df = fetch_interest_over_time(pytrends, kw_list, cfg)
batch_frames.append(df)
rescaled = rescale_batches_via_anchor(batch_frames, cfg.anchor)
# weekly frequency alignment (Google Trends usually returns weekly for multi-year)
rescaled = rescaled.asfreq("W-SUN").ffill()
return rescaled
# ----------------------------
# Features + Sentiment Indices
# ----------------------------
def zscore(df: pd.DataFrame) -> pd.DataFrame:
"""Calculate the Z-score (Standardization) for each column."""
mu = df.mean(skipna=True)
sd = df.std(skipna=True).replace(0, np.nan)
return (df - mu) / sd
def ewma(df: pd.DataFrame, span: int = 8) -> pd.DataFrame:
"""Apply Exponentially Weighted Moving Average to smooth the time series."""
return df.ewm(span=span, adjust=False).mean()
def build_sentiment_indices(trends: pd.DataFrame, risk_on: list, risk_off: list):
"""
Calculates Z-Scores, applies EWMA smoothing, and performs PCA.
Includes a robustness check against keywords with no data (NaN/Zero-Variance).
"""
features = pd.DataFrame(index=trends.index)
# Calculate Z-Score & EWMA for each keyword
for col in trends.columns:
features[f"raw_{col}"] = trends[col]
# Z-Score Normalization
std = trends[col].std()
if std > 0:
z = (trends[col] - trends[col].mean()) / std
features[f"z_{col}"] = z
features[f"z_ewma_{col}"] = z.ewm(span=10).mean()
else:
# Handle keywords with zero variance or all NaN
features[f"z_{col}"] = np.nan
features[f"z_ewma_{col}"] = np.nan
# Robustness-Check: Identify usable EWMA columns for PCA
z_ewma_cols = [f"z_ewma_{c}" for c in trends.columns]
# Keep only columns that do not contain all NaNs and have variance > 0
valid_cols = []
for c in z_ewma_cols:
if c in features.columns and not features[c].isnull().all():
if features[c].std() > 0:
valid_cols.append(c)
if not valid_cols:
raise ValueError("No valid keyword data found for PCA calculation!")
print(f"PCA-Input: Using {len(valid_cols)} of {len(z_ewma_cols)} keywords (rest had insufficient volume).")
# PCA Calculation using only valid data
# Fill remaining NaNs with 0 for PCA stability, though valid_cols should be clean
pca_data = features[valid_cols].fillna(0)
pca = PCA(n_components=1)
features["sentiment_pca"] = pca.fit_transform(pca_data)
# Correct PCA sign (should correlate positively with the average of Z-Scores)
if np.corrcoef(features["sentiment_pca"], pca_data.mean(axis=1))[0, 1] < 0:
features["sentiment_pca"] *= -1
# Difference Index (Risk-On vs Risk-Off) - only use valid columns
on_cols = [f"z_ewma_{c}" for c in risk_on if f"z_ewma_{c}" in valid_cols]
off_cols = [f"z_ewma_{c}" for c in risk_off if f"z_ewma_{c}" in valid_cols]
avg_on = features[on_cols].mean(axis=1) if on_cols else 0
avg_off = features[off_cols].mean(axis=1) if off_cols else 0
features["sentiment_diff"] = avg_on - avg_off
# Final Sentiment DataFrame for plotting
sentiment = features[["sentiment_pca", "sentiment_diff"]].copy()
return features, sentiment
# -------------------------------
# Extra:
# Validation Feature: Correlation
# -------------------------------
def validate_against_ticker(sentiment: pd.DataFrame, ticker: str, timeframe: str) -> float:
"""
Fetches ticker data, aligns schedules, and calculates correlation with sentiment_pca.
"""
print(f"Validating sentiment against ticker: {ticker}...")
# Download daily ticker data to ensure we have enough data points
data = yf.download(ticker, period="5y", interval="1d")
if data.empty:
print("Warning: Could not fetch ticker data.")
return 0.0
# Resample ticker data to weekly, taking the Friday close
ticker_weekly = data['Close'].resample('W-FRI').last()
# Ensure sentiment data is also mapped to Friday for alignment
# Google Trends usually gives Sunday, so we shift it to Friday to match yfinance
sentiment_aligned = sentiment.copy()
sentiment_aligned.index = sentiment_aligned.index + pd.Timedelta(days=5)
# Align dataframes (inner join ensures we only compare dates present in both)
combined = pd.concat([sentiment_aligned['sentiment_pca'], ticker_weekly], axis=1).dropna()
if combined.empty:
print("Warning: Date alignment failed. Cannot calculate correlation.")
return 0.0
# Calculate Pearson Correlation
correlation = combined.corr().iloc[0, 1]
print(f"Correlation between Sentiment PCA and {ticker}: {correlation:.2f}")
return correlation
def print_statistical_summary(features: pd.DataFrame):
"""
Prints a clean descriptive statistics summary to the terminal.
Focuses on the Z-Scores relevant for PCA analysis.
"""
print(f"\n{'='*25} STATISTICAL ANALYSIS {'='*25}")
# Filter columns representing the smoothed Z-Scores (PCA Input)
z_cols = [c for c in features.columns if c.startswith('z_ewma_')]
stats = features[z_cols].describe().transpose()
# Filter keywords that provided valid data (std > 0 and no NaNs)
clean_stats = stats[stats['std'] > 0].dropna()
print(f"\n[Key Metrics] Descriptive Statistics of Input Signals (Smoothed Z-Scores):")
if not clean_stats.empty:
# Display key metrics to validate normalization
print(clean_stats[['mean', 'std', 'min', 'max']].to_string(float_format=lambda x: f"{x:,.4f}"))
print(f"\nNote: Means close to 0 and Std Dev close to 1 validate successful normalization.")
else:
print("Note: No valid data found for statistical summary.")
print(f"\n{'='*72}\n")
# ---------------------------------------------------------------------------
# I/O + Plot (mapping Investor Psychology)
# - What people are searching for on Google -> a leading indicator
# - Example: If the Red Line drops sharply, it suggests that market anxiety
# is rising rapidly, which usually (can) precedes a drop in
# equity funds or ETFs.
# ---------------------------------------------------------------------------
def save_outputs(prefix: str, trends: pd.DataFrame, features: pd.DataFrame, sentiment: pd.DataFrame) -> None:
"""Exports data to CSV files for further analysis in Excel or Bloomberg."""
trends.to_csv(f"trends_raw_{prefix}.csv")
features.to_csv(f"trends_features_{prefix}.csv")
sentiment.to_csv(f"sentiment_index_{prefix}.csv")
# Mapping for cleaner labels in plots and reports
TICKER_MAP = {
# --- Good Examples: Equity & Growth (High Risk Sensitivity) ---
"^GSPC": "S&P 500 (US Proxy)",
"URTH": "MSCI World (Global Proxy)",
"^GDAXI": "DAX 40 (EU/DE Proxy)",
"^STOXX50E": "Euro Stoxx 50 (EU Proxy)",
"^IXIC": "NASDAQ Composite (Growth/Tech Proxy)",
# --- Good Examples: Risk Metrics (Volatility & Credit) ---
"^VIX": "CBOE Volatility Index (Fear Barometer - Expect Inverse Corr)",
"HYG": "iShares High Yield Corporate Bond ETF (Credit Risk)",
# --- Less Good Examples (Specific/Inverse Drivers) ---
"BTC-USD": "Bitcoin (Speculative/Idiosyncratic)",
"GC=F": "Gold Futures (Safe Haven/Often Inverse)"
}
def plot_sentiment(prefix: str, sentiment: pd.DataFrame, ticker: str = None, correlation: float = None) -> None:
"""
Generates the visualization. If a ticker is provided, a combined dual-axis plot is created;
otherwise, a single sentiment index plot is shown.
"""
# Reference the global TICKER_MAP. If ticker not found, use the raw ticker symbol.
display_name = TICKER_MAP.get(ticker, ticker)
# Determine filename and layout based on ticker presence
if ticker:
filename = f"combined_sentiment_analysis_{prefix}.png"
fig = plt.figure(figsize=(14, 10))
gs = gridspec.GridSpec(2, 1, height_ratios=[2, 1])
else:
filename = f"sentiment_plot_{prefix}.png"
fig = plt.figure(figsize=(12, 7))
gs = gridspec.GridSpec(1, 1)
# --- Top Plot: Sentiment Indices (Always present) ---
ax1 = fig.add_subplot(gs[0])
ax1.plot(sentiment.index, sentiment["sentiment_diff"],
label="Risk-On/Off Spread", color='royalblue', alpha=0.4, linewidth=1.5)
ax1.plot(sentiment.index, sentiment["sentiment_pca"],
label="Macro PCA Factor", color='crimson', linestyle='--', linewidth=2.5)
ax1.axhline(0, color='black', linewidth=1)
title = f"Market Sentiment Index ({prefix})"
if ticker and correlation is not None:
title += f"\nValidation Correlation: {display_name} vs. PCA Factor = {correlation:.2f}"
ax1.set_title(title, fontweight='bold', fontsize=14)
ax1.set_ylabel("Z-Score")
ax1.legend(loc='upper left')
ax1.grid(True, linestyle=':', alpha=0.6)
# --- Bottom Plot: Ticker Comparison (Only if ticker is provided) ---
if ticker:
data = yf.download(ticker, start=sentiment.index.min(), end=sentiment.index.max())
if not data.empty:
# Handle potential MultiIndex from yfinance
price_series = data['Close'][ticker] if isinstance(data.columns, pd.MultiIndex) else data['Close']
ax2 = fig.add_subplot(gs[1], sharex=ax1)
ax2.plot(price_series.index, price_series, color='darkgreen', linewidth=2, label=display_name)
# Use .values.flatten() to avoid Pandas Series attribute errors
ax2.fill_between(price_series.index, price_series.values.flatten(), color='darkgreen', alpha=0.1)
ax2.set_ylabel("Price / Index Level")
ax2.legend(loc='upper left')
ax2.grid(True, linestyle=':', alpha=0.6)
plt.tight_layout()
plt.savefig(filename, dpi=300)
print(f"-> Plot saved as: {filename}")
plt.show()
# ----
# Main
# ----
def parse_args():
"""Parses command-line arguments for tool configuration."""
p = argparse.ArgumentParser(
description="Market Sentiment Analysis Tool using Google Trends and Ticker Correlation."
)
p.add_argument(
"--geo",
type=str,
default="GLOBAL",
help="Geographic region code (ISO 3166-1 alpha-2). Use 'US', 'DE', etc. Default: 'GLOBAL'."
)
p.add_argument(
"--ticker",
type=str,
default=None,
help="Yahoo Finance ticker symbol for validation (e.g., '^GSPC', 'URTH'). Default: None."
)
p.add_argument(
"--timeframe",
type=str,
default="today 5-y",
help="Data duration. Use 'today 12-m', 'today 5-y', or 'YYYY-MM-DD YYYY-MM-DD'. Default: 'today 5-y'."
)
p.add_argument(
"--gprop",
type=str,
default="",
help="Google property to filter (e.g., 'news', 'images', 'froogle', 'youtube'). Default: '' (Web Search)."
)
p.add_argument(
"--anchor",
type=str,
default="weather",
help="Reference term used to rescale and link multiple keyword batches. Default: 'weather'."
)
p.add_argument(
"--no-plot",
action="store_true",
help="Disable visual plot generation and only save CSV data. Default: False."
)
return p.parse_args()
def main():
"""Main execution flow for the sentiment analysis tool."""
# Parse arguments using your defined function
args = parse_args()
# Determine region and prefix
geo = "" if args.geo.upper() == "GLOBAL" else args.geo.upper()
prefix = "GLOBAL" if geo == "" else geo
# Create config based on PARSED arguments
cfg = TrendsConfig(
geo=geo,
timeframe=args.timeframe, # Use parsed timeframe
gprop=args.gprop, # Use parsed gprop
anchor=args.anchor # Use parsed anchor
)
all_keywords = sorted(set(RISK_ON + RISK_OFF + MACRO))
print(f"Starting sentiment extraction for {prefix}...")
trends = fetch_trends_all_keywords(cfg, all_keywords)
features, sentiment = build_sentiment_indices(trends, RISK_ON, RISK_OFF)
# Aave CSV files (trends_raw_GLOBAL.csv / sentiment_data_GLOBAL.csv)
save_outputs(prefix, trends, features, sentiment)
# Perform Validation if ticker is provided
corr = None
if args.ticker:
print(f"Validating against Ticker: {args.ticker}...")
corr = validate_against_ticker(sentiment, args.ticker, cfg.timeframe)
save_outputs(prefix, trends, features, sentiment)
# Handle plot result by --ticker and/or --no-plot flag call
if not args.no_plot:
# With Ticker flag -> combined_sentiment_analysis_GLOBAL.png
# Without Ticker flag -> sentiment_plot_GLOBAL.png
plot_sentiment(prefix, sentiment, args.ticker, corr)
# Extra: Descriptive Statistics:
print_statistical_summary(features)
print(f"--- Process complete. Files saved with prefix: {prefix} ---")
if __name__ == "__main__":
main()

5
requirements.txt Normal file
View File

@@ -0,0 +1,5 @@
# Main requirements
pandas
pytrends
yfinance
scikit-learn