# Project Presentation — Tourist POI Recommendation System

---

## Opening

This project is a local-first tourist recommendation system. The goal was simple — a tourist arrives in a city and wants to know which places are worth visiting, and what is similar to something they already liked. Everything runs locally, all data is stored in a SQLite database, and the interface is a Django web application.

The system covers four cities: **Berlin, London, New York, and Paris**.

---

## User Interface — City View

> *Show: browser open on the city list page, then navigate to Berlin*

When the user opens the application they see a list of cities. Clicking a city opens the city view. Every place here has an **interest score** that determines the order. That score comes from PageRank — explained in detail below. The user can also sort alphabetically or search by name. Each card shows the place name, its category, and a thumbnail if an image was collected.

---

## User Interface — Place Detail

> *Show: click on a specific place to open its detail page*

When the user opens a place they see the full information — name, description from Wikidata, location with a link to Google Maps, opening hours if available, and the images collected. Below that there are **three recommendation sections**:

1. **Structural similarity** — similar places in the same city based on what kind of place it is
2. **Image similarity (same city)** — visually similar places within the same city
3. **Image similarity (other cities)** — visually similar places in the other three cities

This is the core navigation loop: from a city → to a place → to its neighbours → and back.

---

## Data & Project Structure

> *Show: the `databases/` folder, then the project root*

The project is fully self-contained. The `databases/` folder holds two SQLite databases:

| File | Size | Contents |
|---|---|---|
| `bigdata_ld2_database.sqlite3` | 102 MB | Raw OpenStreetMap data + Wikidata enrichment for all four cities |
| `ld3_claude.sqlite3` | 8.2 MB | Pre-computed PageRank scores from the MPI pipeline |

The Django application **never reads these source databases at runtime**. They are only used once by the `load_data` management command, which processes everything and writes into the application's own `data/db.sqlite3`.

---

## Database Schema

The schema is **normalised**, with each entity in its own table and relations expressed through foreign keys and junction tables:

```
City        (id, name, slug, place_count)
Place       (id, city_id, osm_id, osm_type, name, lat, lon, wikidata_qid,
             description, website, phone, opening_hours,
             historic, tourism, amenity, shop, leisure, man_made,
             memorial, artwork_type, heritage,
             pagerank, interest_score)
Category    (id, name)
PlaceCategory (place_id, category_id)
SimilarPlace  (main_place_id, similar_place_id, score, method)
ImageFeature  (id, image_path, city_id, place_id, category_name, feature_vector)
PlaceImage    (id, place_id, image_path, caption)
```

One place can belong to many categories and one category can contain many places; the `PlaceCategory` junction table models this many-to-many relation. `SimilarPlace` stores all pre-computed recommendations, each tagged with the method that produced it (`structural`, `image_same_city`, `image_other_city`). The UI computes no recommendations at request time; it only reads the stored rows.
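
As a minimal sketch of how the junction table and the stored recommendations could be queried, the following uses Python's built-in `sqlite3` with an in-memory database. The lower-cased table names, the reduced column set, and the sample rows are assumptions made for illustration; they are not the tables Django generates for this project.

```python
import sqlite3

# In-memory stand-in for the application database (illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE place (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE place_category (
    place_id INTEGER REFERENCES place(id),
    category_id INTEGER REFERENCES category(id),
    PRIMARY KEY (place_id, category_id)
);
CREATE TABLE similar_place (
    main_place_id INTEGER REFERENCES place(id),
    similar_place_id INTEGER REFERENCES place(id),
    score REAL,
    method TEXT
);
""")

# Invented sample rows.
conn.executemany("INSERT INTO place VALUES (?, ?)",
                 [(1, "Museum A"), (2, "Museum B"), (3, "Tower C")])
conn.executemany("INSERT INTO category VALUES (?, ?)",
                 [(1, "museum"), (2, "attraction")])
conn.executemany("INSERT INTO place_category VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 1), (3, 2)])
conn.executemany("INSERT INTO similar_place VALUES (?, ?, ?, ?)",
                 [(1, 2, 0.9, "structural"),
                  (1, 3, 0.4, "structural"),
                  (1, 3, 0.7, "image_same_city")])


def categories_of(place_id):
    """Resolve the many-to-many relation through the junction table."""
    rows = conn.execute(
        "SELECT c.name FROM category c "
        "JOIN place_category pc ON pc.category_id = c.id "
        "WHERE pc.place_id = ? ORDER BY c.name",
        (place_id,),
    ).fetchall()
    return [name for (name,) in rows]


def recommendations(place_id, method):
    """Read stored neighbours for one method, best score first."""
    return conn.execute(
        "SELECT p.name, s.score FROM similar_place s "
        "JOIN place p ON p.id = s.similar_place_id "
        "WHERE s.main_place_id = ? AND s.method = ? "
        "ORDER BY s.score DESC",
        (place_id, method),
    ).fetchall()
```

Both functions are plain reads, which mirrors the point above: a detail page only needs to select rows that were written earlier.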

---

## PageRank — Parallel Pipeline (MPI)

> *Show: `pagerank/` folder — process.py, graph.py, pagerank_mpi.py, database.py*

The PageRank pipeline lives in the `pagerank/` directory and runs in four steps.

### Step 1 — `process.py` (Task 1)

Reads all place objects from the LD2 source database and computes a **wiki relevance score** for each one, based on the number of Wikidata claims, the number of sitelinks, and text richness.

**MPI collectives used:**
- `scatter` — rank 0 distributes place chunks to all processes
- `gather` — each process returns its computed scores to rank 0
- `allgather` — all ranks share chunk sizes
- `Allreduce(SUM)` — global total score sum
- `Allreduce(MAX)` — global maximum score

**Output:** `pagerank/data/task1_results.pickle`
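
The data flow behind these collectives can be imitated in a single process. The sketch below is not `mpi4py` code and not the contents of `process.py`: it replaces each collective with its plain-Python equivalent so the pattern is easy to follow. The `relevance` formula and its weights are purely hypothetical, as are the field names `claims`, `sitelinks`, and `description`.

```python
def split_chunks(items, n_ranks):
    """Split items into n_ranks near-equal chunks, as rank 0 would before a scatter."""
    base, extra = divmod(len(items), n_ranks)
    chunks, start = [], 0
    for rank in range(n_ranks):
        size = base + (1 if rank < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def relevance(place):
    """Hypothetical score: the real weighting is not given in this document."""
    return place["claims"] + 2 * place["sitelinks"] + len(place["description"].split())


def run(places, n_ranks=4):
    chunks = split_chunks(places, n_ranks)                   # scatter
    local_scores = [[relevance(p) for p in chunk]            # work done on each rank
                    for chunk in chunks]
    chunk_sizes = [len(chunk) for chunk in chunks]           # allgather of chunk sizes
    total = sum(sum(scores) for scores in local_scores)      # Allreduce(SUM)
    peak = max((max(scores) for scores in local_scores if scores),
               default=0)                                    # Allreduce(MAX)
    gathered = [s for scores in local_scores for s in scores]  # gather on rank 0
    return gathered, chunk_sizes, total, peak
```

With real MPI each inner list would live on a different process; here the outer list index plays the role of the rank.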

### Step 2 — `graph.py` (Task 3)

Builds a **directed weighted graph** where nodes are places. Four edge types:

| Edge type | Description | Weight |
|---|---|---|
| `wikidata_claim_ref` | Place A's Wikidata article references Place B | 1.0 |
| `shared_instance_of` | Both places are instances of the same Wikidata class | 0.25 |
| `lexical_jaccard` | Descriptions share vocabulary (Jaccard similarity) | jaccard score |
| `shared_osm_classifier` | Same OSM tag value (e.g. `historic=*`, `shop=*`) | 0.15 |

Each node is capped at 32 outgoing edges, keeping the highest-weight connections. The resulting graph has approximately **76,000 edges across 5,500 nodes**.
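
Two parts of this step lend themselves to a short illustration: the Jaccard weight and the cap on outgoing edges. The sketch below assumes simple word tokenisation and an edge representation of `(source, target, weight)` tuples; neither detail is taken from `graph.py`.

```python
import heapq
import re


def jaccard(text_a, text_b):
    """Jaccard similarity of the two word sets: |A ∩ B| / |A ∪ B|."""
    tokens_a = set(re.findall(r"\w+", text_a.lower()))
    tokens_b = set(re.findall(r"\w+", text_b.lower()))
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


def cap_out_edges(edges, max_out=32):
    """Keep only the max_out highest-weight outgoing edges of every source node."""
    by_source = {}
    for source, target, weight in edges:
        by_source.setdefault(source, []).append((source, target, weight))
    kept = []
    for source_edges in by_source.values():
        kept.extend(heapq.nlargest(max_out, source_edges, key=lambda e: e[2]))
    return kept
```

Capping per source keeps hub nodes from dominating the graph while preserving each node's strongest links.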

**Output:** `pagerank/data/task3_edges.pickle`

### Step 3 — `pagerank_mpi.py` (Task 4)

Distributed PageRank computation.

**Algorithm:**
- Damping factor `d = 0.85`
- Maximum 60 iterations, convergence tolerance `1e-6`
- N is padded to be divisible by the number of MPI processes

**MPI collectives used:**
- `bcast` — broadcast the full graph and initial scores to all ranks
- `scatter` — distribute node index ranges to each rank
- `allgather` — synchronise the full PageRank vector after each iteration
- `Allreduce(MAX)` — global convergence check (max score change across all ranks)
- `Allreduce(SUM)` — sum dangling node mass across all ranks

Each rank updates its slice of PageRank scores using the backward graph. Dangling nodes (no outgoing edges) redistribute their mass globally via `Allreduce(SUM)`. Convergence is checked globally via `Allreduce(MAX)`.
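
The update rule can be sketched without MPI by looping over simulated ranks. This is a single-process illustration under stated assumptions, not the code in `pagerank_mpi.py`: the function name, the edge format `(source, target, weight)`, and the uniform redistribution of dangling mass are choices made here. Comments mark where each collective would sit in a distributed run.

```python
def pagerank(n_nodes, edges, n_ranks=4, damping=0.85, max_iter=60, tol=1e-6):
    """Weighted PageRank with node slices owned by simulated ranks."""
    # Pad N so that every rank owns a slice of identical length.
    padded = -(-n_nodes // n_ranks) * n_ranks
    slice_len = padded // n_ranks
    bounds = [(r * slice_len, min((r + 1) * slice_len, n_nodes))
              for r in range(n_ranks)]

    # Backward graph: for each target, the sources pointing at it.
    out_weight = [0.0] * n_nodes
    backward = [[] for _ in range(n_nodes)]
    for source, target, weight in edges:
        out_weight[source] += weight
        backward[target].append((source, weight))

    # Padding slots hold 0 and are never updated.
    scores = [1.0 / n_nodes] * n_nodes + [0.0] * (padded - n_nodes)

    for _ in range(max_iter):
        # Each rank sums the dangling mass inside its own slice.
        local_dangling = [sum(scores[i] for i in range(lo, hi) if out_weight[i] == 0.0)
                          for lo, hi in bounds]
        dangling = sum(local_dangling)                      # Allreduce(SUM)
        base = (1.0 - damping) / n_nodes + damping * dangling / n_nodes

        new_slices, local_deltas = [], []
        for rank, (lo, hi) in enumerate(bounds):
            new_slice = [0.0] * slice_len
            for i in range(lo, hi):
                incoming = sum(scores[s] * w / out_weight[s] for s, w in backward[i])
                new_slice[i - rank * slice_len] = base + damping * incoming
            local_deltas.append(max((abs(new_slice[i - rank * slice_len] - scores[i])
                                     for i in range(lo, hi)), default=0.0))
            new_slices.append(new_slice)

        scores = [v for part in new_slices for v in part]   # allgather
        if max(local_deltas) < tol:                         # Allreduce(MAX)
            break

    return scores[:n_nodes]
```

Because the dangling mass is added back through `base`, the scores keep summing to 1 from one iteration to the next.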

**Output:** `pagerank/data/task4_pagerank.pickle`

### Step 4 — `database.py` (Task 2)

Assembles the results of Tasks 1, 3, and 4 into `databases/ld3_recomputed.sqlite3`, which uses the same schema as the original LD3 database, so that `load_data` can reload the Django application with the newly computed scores.

### How to run the full pipeline

```bash
cd pagerank

# Step 1 — wiki relevance scores
mpirun --oversubscribe -np 4 python process.py

# Step 2 — build graph
python graph.py

# Step 3 — distributed PageRank
mpirun --oversubscribe -np 4 python pagerank_mpi.py

# Step 4 — write SQLite
python database.py

# Step 5 — reload Django with new scores
cd ..
LD3_DB_PATH=databases/ld3_recomputed.sqlite3 python manage.py load_data --clear
```

---

## KMeans — Image Feature Extraction

> *Show: `processing/algorithms/kmeans.py`*

To compare images numerically, each image is represented as a **colour palette feature vector**. The KMeans algorithm is implemented from scratch, without scikit-learn.

**Pipeline per image:**
1. Load image with Pillow, resize to at most 100×100 pixels
2. Reshape to a flat array of RGB pixels (shape `N × 3`)
3. Run **KMeans++** initialisation — first centroid is random, subsequent ones are chosen with probability proportional to squared distance from the nearest existing centroid
4. Iterate for at most 20 steps, stopping early once the centroid shift (measured in RGB space) falls below the convergence threshold of 1.0
5. Sort the 10 resulting cluster centroids by how many pixels they represent (dominant colours first)
6. Flatten and normalise to `[0, 1]` → **30-dimensional feature vector** (10 colours × RGB)

This vector is stored in the `ImageFeature` table as a JSON array. On the place detail page the extracted dominant colours are displayed visually below the main image.
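
Steps 2 to 6 of the pipeline can be sketched with NumPy alone. This is an illustrative reimplementation under assumptions, not the code in `kmeans.py`: the function names and the fixed seed are invented here, the shift is taken as the largest centroid movement, normalisation is assumed to be division by 255, and the input is an already flattened `N × 3` pixel array so that image loading with Pillow is left out.

```python
import numpy as np


def kmeans_pp_init(pixels, k, rng):
    """KMeans++ seeding: later centroids are drawn proportionally to squared distance."""
    centroids = [pixels[rng.integers(len(pixels))]]
    for _ in range(1, k):
        diffs = pixels[:, None, :] - np.array(centroids)[None, :, :]
        nearest_sq = (diffs ** 2).sum(axis=2).min(axis=1)
        total = nearest_sq.sum()
        if total == 0:
            # Every pixel coincides with a centroid: fall back to a uniform draw.
            idx = rng.integers(len(pixels))
        else:
            idx = rng.choice(len(pixels), p=nearest_sq / total)
        centroids.append(pixels[idx])
    return np.array(centroids, dtype=float)


def assign(pixels, centroids):
    """Index of the nearest centroid for every pixel."""
    diffs = pixels[:, None, :] - centroids[None, :, :]
    return (diffs ** 2).sum(axis=2).argmin(axis=1)


def palette_features(pixels, k=10, max_iter=20, tol=1.0, seed=0):
    """Return a flat vector of k RGB centroids, dominant colour first, scaled to [0, 1]."""
    pixels = np.asarray(pixels, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = kmeans_pp_init(pixels, k, rng)
    for _ in range(max_iter):
        labels = assign(pixels, centroids)
        updated = centroids.copy()
        for j in range(k):
            members = pixels[labels == j]
            if len(members):               # an empty cluster keeps its old centroid
                updated[j] = members.mean(axis=0)
        shift = np.linalg.norm(updated - centroids, axis=1).max()
        centroids = updated
        if shift < tol:
            break
    counts = np.bincount(assign(pixels, centroids), minlength=k)
    order = np.argsort(-counts, kind="stable")
    return (centroids[order] / 255.0).ravel()
```

Sorting by cluster size gives the vector a stable layout, so position 0 to 2 always describes the most dominant colour.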

---

## KNN — Similarity Computation

> *Show: `processing/algorithms/knn.py` and `processing/management/commands/compute_similarities.py`*

### Structural KNN (same city)

Each place is encoded as a **binary vector** that one-hot encodes the value of each of these OSM tags:
`historic`, `tourism`, `amenity`, `shop`, `leisure`, `man_made`, `memorial`, `artwork_type`

KNN finds the **10 most similar places** in the same city using Euclidean distance in this feature space. Places with no tag data are excluded. Distance is converted to a similarity score in `[0, 1]` by linear scaling against the maximum possible Euclidean distance.
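
A brute-force version of this computation fits in a few lines. The sketch below is a hedged illustration rather than the code in `knn.py`: places are assumed to be dictionaries with a `name` key, and the maximum possible distance is assumed to be `sqrt(2 × number of tag keys)`, which is the distance between two places that disagree on every key.

```python
import math

TAG_KEYS = ("historic", "tourism", "amenity", "shop",
            "leisure", "man_made", "memorial", "artwork_type")


def encode(places):
    """One dimension per observed (key, value) pair; 1 if the place carries it."""
    vocab = sorted({(key, p[key]) for p in places for key in TAG_KEYS if p.get(key)})
    index = {pair: i for i, pair in enumerate(vocab)}
    vectors = []
    for p in places:
        vec = [0] * len(vocab)
        for key in TAG_KEYS:
            if p.get(key):
                vec[index[(key, p[key])]] = 1
        vectors.append(vec)
    return vectors


def structural_neighbours(places, k=10):
    """Return {place name: [(neighbour name, similarity), ...]} for one city."""
    # Places without any tag data are excluded.
    tagged = [p for p in places if any(p.get(key) for key in TAG_KEYS)]
    vectors = encode(tagged)
    max_dist = math.sqrt(2 * len(TAG_KEYS))   # assumed upper bound, see lead-in
    result = {}
    for i, place in enumerate(tagged):
        scored = []
        for j, other in enumerate(tagged):
            if i == j:
                continue
            dist = math.dist(vectors[i], vectors[j])
            scored.append((other["name"], 1.0 - dist / max_dist))
        scored.sort(key=lambda item: -item[1])
        result[place["name"]] = scored[:k]
    return result
```

Calling `structural_neighbours` once per city matches the same-city restriction, since places from different cities never share a candidate list.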

### Image KNN (same city and other cities)

Image KNN uses the 30-dimensional colour feature vectors and finds the **5 most visually similar places** per image. It runs as two separate passes: one restricted to the same city, one restricted to other cities.
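
The two passes differ only in how the candidate set is filtered. The following sketch assumes each image is a dictionary with the hypothetical keys `path`, `place`, `city`, and `vector`; images of the query's own place are skipped so that a place is not recommended to itself. It illustrates the idea and is not taken from `compute_similarities.py`.

```python
import math


def nearest(query, candidates, k):
    """Places of the k candidates closest to the query in feature space."""
    ranked = sorted(candidates,
                    key=lambda c: math.dist(query["vector"], c["vector"]))
    return [c["place"] for c in ranked[:k]]


def image_neighbours(images, k=5):
    """For every image, run one same-city pass and one other-city pass."""
    result = {}
    for query in images:
        others = [c for c in images if c["place"] != query["place"]]
        same_city = [c for c in others if c["city"] == query["city"]]
        other_city = [c for c in others if c["city"] != query["city"]]
        result[query["path"]] = {
            "image_same_city": nearest(query, same_city, k),
            "image_other_city": nearest(query, other_city, k),
        }
    return result
```

The dictionary keys reuse the method names stored in `SimilarPlace`, so each list maps directly onto one recommendation section.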

### Parallel execution

Both computations use Python's `multiprocessing.Pool` with `cpu_count()` workers:
- Structural: each city is an independent job — cities run in parallel
- Image: each image is an independent job — all images processed in parallel

Results are bulk-inserted into `SimilarPlace` in batches of 1000 records.

To run:
```bash
python manage.py compute_similarities
```

---

## Recommendations in the UI

> *Show: place detail page — scroll through the three recommendation sections*

All three recommendation types are visible on every place detail page:

- **Structural neighbours** — places of the same character. A museum will be neighbours with other museums and monuments that share classification tags.
- **Image neighbours (same city)** — places that look visually similar regardless of type. The matching is based purely on dominant colour palette.
- **Cross-city image neighbours** — the most interesting for a traveller. Finds the equivalent visual experience in another city — what Paris has that looks like something in Berlin.

Every recommendation was computed once and stored. The view is a simple database read.

---

## Data Quality & Scale

| City | Places loaded |
|---|---|
| Berlin | 2,308 |
| London | 1,646 |
| Paris | 1,171 |
| New York | 466 |

Places without a valid name were excluded during loading. Quality is maintained by requiring either a Wikidata entity or at least one meaningful OSM tag. All cities exceed the 50-place minimum by a large margin.

PageRank scores are **normalised per city** using min-max scaling to `[0, 1]`, so the interest score is comparable within a city but not across cities. The correlation between raw PageRank and the Wikidata wiki relevance score is `0.16`. A low value is expected: PageRank captures graph structure (how places relate to each other), while wiki relevance captures editorial richness (how much is written about a place). They measure different things.
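
Per-city min-max scaling maps each raw score to `(x - min) / (max - min)` using only the values from the same city. The sketch below assumes `(city, name, pagerank)` tuples with unique names and returns 0.0 when all scores in a city are equal; both are choices made for this example, and the numbers in the usage note are invented.

```python
from collections import defaultdict


def normalise_per_city(places):
    """Map raw PageRank to an interest score in [0, 1], separately for each city."""
    by_city = defaultdict(list)
    for city, name, pagerank in places:
        by_city[city].append((name, pagerank))

    scores = {}
    for members in by_city.values():
        values = [pagerank for _, pagerank in members]
        lowest, highest = min(values), max(values)
        span = highest - lowest
        for name, pagerank in members:
            # Guard against division by zero when a city has a single distinct score.
            scores[name] = (pagerank - lowest) / span if span > 0 else 0.0
    return scores
```

For example, raw scores of 0.002, 0.003, and 0.004 within one city become 0.0, 0.5, and 1.0, regardless of how large the raw scores are in any other city.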

---

## Installation

```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Run migrations
python manage.py migrate

# 3. Load data (databases already included)
python manage.py load_data

# 4. Compute similarities
python manage.py compute_similarities

# 5. Start the server
python manage.py runserver
```

The `databases/` folder with all source data is included in the project. No external API calls or downloads are needed.

---

## Summary

| Component | Technology | Location |
|---|---|---|
| Web framework | Django | `places/`, `processing/`, `config/` |
| PageRank pipeline | MPI (`mpi4py`) | `pagerank/` |
| KMeans (image features) | NumPy, custom implementation | `processing/algorithms/kmeans.py` |
| KNN (similarity) | NumPy, `multiprocessing` | `processing/algorithms/knn.py` |
| Similarity computation | `multiprocessing.Pool` | `processing/management/commands/compute_similarities.py` |
| Source databases | SQLite | `databases/` |
| Application database | SQLite | `data/db.sqlite3` |

The data pipeline runs once, offline, using MPI for PageRank and multiprocessing for KNN and KMeans. The Django application serves only pre-computed results. The entire project is self-contained — source databases, processing scripts, and application are all in one directory with no dependencies on external files or services.
