Initial setup steps

2026-05-20 12:01:39 +02:00
commit 1315ff3d99
6 changed files with 589 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,58 @@
+# Google Trends Market Sentiment Analysis Tool
+
+## Overview
+Traditional market data records what has already happened but rarely explains *why*, or what is likely to happen next. This project provides a systematic framework that uses alternative data, specifically online search volumes from Google Trends, as a candidate leading indicator for tactical asset allocation and risk control.
+
+By analyzing shifts in collective investor attention, the tool aims to quantify market psychology before it fully shows up in trading decisions.
+
+---
+
+## The Core Scaling Challenge & Solution
+
+> **The Problem:** Google Trends normalizes search volume to a relative $0$ to $100$ scale *per individual request*, so a value of $50$ in one batch and a value of $50$ in another do not represent the same search volume. Values from different batch requests therefore cannot be compared or chained together directly.
+>
+> **The Algorithmic Solution:** The script uses **"Anchor-Logic"** to establish a single common scale. Every automated batch request includes the same high-volume, neutral reference term (configurable via `--anchor`, default: `'weather'`). The pipeline then rescales each batch using the **median ratio** of the anchor series over the dates the batches share:
+>
+> $$\text{Scaling Factor} = \text{median}\left(\frac{\text{Anchor}_{\text{Target Batch}}}{\text{Anchor}_{\text{Reference Batch}}}\right)$$
+>
+> Because the anchor term appears in every batch, this puts all keywords on one approximately common scale, which makes series from independent API calls comparable.
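+
+As a worked illustration with made-up numbers (not output of the script), suppose the anchor reads $80, 60, 100$ in the reference batch and $40, 33, 50$ in the target batch over the same three weeks. Assuming the target batch is mapped onto the reference scale by dividing by the factor (the script may use the reciprocal convention):
+
+$$\text{Scaling Factor} = \text{median}\left(\frac{40}{80}, \frac{33}{60}, \frac{50}{100}\right) = \text{median}(0.50, 0.55, 0.50) = 0.50$$
+
+A target-batch keyword reading $20$ would then correspond to $20 / 0.50 = 40$ on the reference scale. Using the median keeps the one outlying ratio ($0.55$) from shifting the factor.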
+
+---
+
+## Methodology & Pipeline Architecture
+
+The prototype (`google_trends_sentiment_prototype.py`) is structured as a modular quantitative pipeline:
+
+### 1. Data Ingestion (Anchor-Based)
+Retrieves predefined Risk-On, Risk-Off, and Macroeconomic keywords automatically through the `pytrends` library (an unofficial Google Trends client) and places all batches on a common scale using the Anchor-Logic described above.
+
+### 2. Normalization Layer
+Applies a **Z-score transformation** to each rescaled keyword series. This makes keywords with very different baseline search volumes comparable by centering each series at a mean of $0$ and scaling it to a variance of $1$:
+
+$$z = \frac{x - \mu}{\sigma}$$
+
+Where:
+* $x$ is the anchor-adjusted search volume intensity.
+* $\mu$ is the historical mean of that specific keyword series.
+* $\sigma$ is the historical standard deviation of the series.
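+
+A small numeric illustration with hypothetical values shows why this step matters. Take a high-volume keyword with $\mu = 40$, $\sigma = 10$ and a low-volume keyword with $\mu = 4$, $\sigma = 1$, each observed $1.5$ standard deviations above its own mean:
+
+$$z_{\text{high}} = \frac{55 - 40}{10} = 1.5 \qquad z_{\text{low}} = \frac{5.5 - 4}{1} = 1.5$$
+
+Both keywords contribute the same signal strength even though their raw levels differ by a factor of ten.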
+
+### 3. Index Construction & Signal Extraction
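+* **Sentiment Spread:** Measures the relative strength of optimism versus pessimism as the mean Risk-On Z-score minus the mean Risk-Off Z-score, where $N$ and $M$ are the numbers of Risk-On and Risk-Off keywords: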
+    $$\text{Sentiment Spread} = \left( \frac{1}{N} \sum_{i=1}^{N} z_{\text{Risk-On}, i} \right) - \left( \frac{1}{M} \sum_{j=1}^{M} z_{\text{Risk-Off}, j} \right)$$
+* **Macro PCA Factor:** Extracts the first principal component ($PC_1$) from the combined Z-score feature matrix using Singular Value Decomposition (SVD) via `scikit-learn`:
+    $$\mathbf{Z} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T \implies PC_1 = \mathbf{Z}\mathbf{v}_1$$
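+    Here $\mathbf{v}_1$ is the first column of $\mathbf{V}$. This isolates the single linear combination of keywords that explains the largest share of common variance, interpreted here as the dominant psychological driver.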
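+
+For illustration, assume hypothetical Z-scores of $0.8$ and $0.4$ for two Risk-On keywords ($N = 2$) and $1.5$, $0.9$, and $0.6$ for three Risk-Off keywords ($M = 3$) in a given week:
+
+$$\text{Sentiment Spread} = \frac{0.8 + 0.4}{2} - \frac{1.5 + 0.9 + 0.6}{3} = 0.6 - 1.0 = -0.4$$
+
+A negative spread means Risk-Off search attention is running further above its own norm than Risk-On attention. For the PCA factor, note that the sign of $PC_1$ is a convention rather than a finding, because $-\mathbf{v}_1$ is an equally valid singular vector:
+
+$$\mathbf{Z}(-\mathbf{v}_1) = -PC_1$$
+
+One reasonable approach is therefore to inspect the keyword loadings in $\mathbf{v}_1$ before reading the factor as risk-seeking or risk-averse.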
+
+### 4. Market Validation (Optional)
+Resamples the extracted signals to weekly frequency and computes their correlation with financial benchmarks downloaded via `yfinance`. Market data is used only for this validation step and never enters signal construction, so the core signal stays independent of prices.
+
+*Note: This prototype currently focuses on contemporaneous correlation as a proof of concept. Time horizons and keyword definitions are predefined rather than optimized from data.*
+
+---
+
+## Getting Started
+
+### Dependencies
+Install the required quantitative stack:
+```bash
+pip install pytrends pandas numpy scikit-learn yfinance matplotlib