# POI Recommender — Project Documentation

## Table of Contents

1. [Collected Data Analysis](#1-collected-data-analysis)
2. [System Documentation](#2-system-documentation)
   - [Project Structure](#21-project-structure)
   - [Algorithms](#22-algorithms)
   - [Installation Guide](#23-installation-guide)
   - [User Interface Guide](#24-user-interface-guide)

---

## 1. Collected Data Analysis

### 1.1 Data Sources

The project integrates data from three external sources:

| Source | File | Purpose |
|--------|------|---------|
| LD2 database | `bigdata_ld2_database.sqlite3` | Place records and city metadata from OpenStreetMap |
| LD3 database | `ld3_database.sqlite3` | PageRank scores for places |
| Openverse API | fetched on demand | Creative-commons images per place |

### 1.2 Data Amounts

#### Places

| City | Places loaded |
|------|--------------|
| Berlin | 2 152 |
| London | 1 450 |
| New York | 466 |
| Paris | 1 097 |
| **Total** | **5 165** |

All places come directly from the LD2/LD3 lab databases. Nearly every place has a PageRank score; the few without one receive the heuristic fallback described in section 1.4.

#### Images

| City | Images stored |
|------|--------------|
| Berlin | — |
| London | — |
| New York | — |
| Paris | — |

Per-city image counts are not recorded here. Run `python manage.py fetch_poi_images --limit 0` to download images for all places.

Images are downloaded from Openverse and stored under `media/poi/<city_slug>/`. Each image has a corresponding `ImageFeature` record (924 feature vectors total).

#### Similarity records

| Method | Records |
|--------|---------|
| Structural KNN (same city) | 51 650 |
| Image KNN — same city | computed after image download |
| Image KNN — other cities | computed after image download |

### 1.3 Place Attributes Collected

Each place record stores the following fields. Most are sourced from OpenStreetMap and Wikidata; `pagerank` comes from the LD3 database and `interest_score` is computed:

| Field | Source | Description |
|-------|--------|-------------|
| `osm_id`, `osm_type` | OSM | Unique OSM identifier |
| `name` | OSM / Wikidata label | English display name (Wikidata label preferred) |
| `lat`, `lon` | OSM | Geographic coordinates |
| `wikidata_qid` | OSM tag | Linked Wikidata entity |
| `description` | Wikidata / OSM | English description |
| `website`, `phone`, `opening_hours` | OSM | Contact and operational info |
| `historic`, `tourism`, `amenity` | OSM tags | Primary classification tags |
| `shop`, `leisure`, `man_made` | OSM tags | Secondary classification tags |
| `memorial`, `artwork_type`, `heritage` | OSM tags | Specialised classification tags |
| `pagerank` | LD3 database | Graph-based importance score |
| `interest_score` | Computed | Normalised ranking score [0, 1] |

### 1.4 Interest Score Computation

PageRank scores for all four cities (Berlin, London, New York, and Paris) are loaded from the LD3 graph database (`ld3_database.sqlite3`). Scores are normalised per city to [0, 1] with min-max scaling: `(score − city_min) / (city_max − city_min)`. The few places without a PageRank entry receive a heuristic fallback score instead: +0.5 for a Wikidata QID, +0.3 for a `tourism` tag, and +0.2 for a `historic` tag, capped at 1.0.
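
The scoring rule can be illustrated with a minimal sketch. The function names, the dictionary input, and the handling of a city whose scores are all equal (mapped to 0.0 here) are assumptions made for illustration, not taken from `load_data`.

```python
from typing import Optional


def fallback_score(has_wikidata_qid: bool, has_tourism: bool, has_historic: bool) -> float:
    """Heuristic score for a place that has no PageRank entry."""
    score = 0.0
    if has_wikidata_qid:
        score += 0.5
    if has_tourism:
        score += 0.3
    if has_historic:
        score += 0.2
    return min(score, 1.0)  # cap at 1.0


def normalise_city_scores(pageranks: dict[str, Optional[float]]) -> dict[str, Optional[float]]:
    """Min-max scale the PageRank values of ONE city to [0, 1].

    Entries that are None (no PageRank) are passed through unchanged so the
    caller can apply fallback_score() to them.
    """
    known = [v for v in pageranks.values() if v is not None]
    if not known:
        return dict(pageranks)
    city_min, city_max = min(known), max(known)
    span = city_max - city_min
    return {
        key: None if v is None else ((v - city_min) / span if span > 0 else 0.0)
        for key, v in pageranks.items()
    }
```

Normalising per city rather than globally means the top-ranked place in each city always scores 1.0, so cities with generally lower raw PageRank values still produce a full range of star ratings.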

### 1.5 Image Feature Vectors

Each downloaded image is processed into a **30-dimensional feature vector** (10 dominant colours × 3 RGB channels, values normalised to [0, 1]). These vectors are stored in the `ImageFeature` table and used for visual similarity search.

### 1.6 Data Processing Pipeline (end-to-end)

```
LD2 SQLite  ──► load_data ──► Django DB (places, categories, cities)
LD3 SQLite  ──► load_data ──► PageRank → interest_score (per city)
                                │
Openverse API ──► fetch_poi_images ──► media/poi/<city>/<id>_<n>.ext
                                        │
                                   KMeans clustering
                                        │
                               ImageFeature (30-dim vector)
                                        │
                              compute_similarities
                                        │
                          ┌─────────────┴────────────┐
                    Structural KNN            Image KNN
                  (one-hot OSM tags)    (colour feature vectors)
                          │                    │
                     SimilarPlace         SimilarPlace
                  method=structural   method=image_same_city
                                      method=image_other_city
```

---

## 2. System Documentation

### 2.1 Project Structure

```
engineering/
├── manage.py                  # Django entry point
├── requirements.txt           # Python dependencies
├── config/                    # Django project config
│   ├── settings.py            # App settings (DB paths, media root, installed apps)
│   ├── urls.py                # Root URL routing
│   └── wsgi.py
├── places/                    # Main Django app — models, views, templates
│   ├── models.py              # City, Place, Category, PlaceCategory, PlaceImage, ImageFeature, SimilarPlace
│   ├── views.py               # city_list, city_detail, place_detail
│   ├── urls.py                # URL patterns
│   ├── admin.py
│   └── templates/places/
│       ├── base.html          # Shared layout (Tailwind CSS, Bootstrap Icons)
│       ├── city_list.html     # Home page — city grid
│       ├── city_detail.html   # Place listing for one city
│       ├── place_detail.html  # Individual place page
│       ├── _similar_card.html # Reusable similar-place card component
│       └── _stars.html        # Interest-score star display
├── processing/                # Data processing app
│   ├── algorithms/
│   │   ├── kmeans.py          # KMeans++ colour clustering (NumPy, no sklearn)
│   │   └── knn.py             # KNN search + one-hot encoding (NumPy)
│   └── management/commands/
│       ├── load_data.py       # Import LD2/LD3 source databases
│       ├── fetch_poi_images.py  # Download images from Openverse API
│       └── compute_similarities.py  # Compute all similarity records
├── data/
│   ├── db.sqlite3             # Application database (ready to use)
│   └── sources/               # Raw source databases (needed only for load_data)
│       ├── README.txt
│       ├── bigdata_ld2_database.sqlite3
│       └── ld3_database.sqlite3
└── media/
    └── poi/                   # Downloaded place images
        ├── berlin/
        ├── bologna/
        ├── london/
        └── new_york/
```

#### Database models

| Model | Purpose |
|-------|---------|
| `City` | A city with name, URL slug, bounding box, and place count |
| `Place` | A point of interest with OSM tags, coordinates, scores, and Wikidata link |
| `Category` | A single OSM tag value (e.g. `tourism:museum`) |
| `PlaceCategory` | Many-to-many join between Place and Category |
| `PlaceImage` | File path and caption for one image linked to a Place |
| `ImageFeature` | 30-dim colour feature vector for one image |
| `SimilarPlace` | A scored pair of places with a `method` label |

---

### 2.2 Algorithms

#### PageRank (interest scoring)

Raw PageRank scores are loaded from the LD3 graph database, which encodes the importance of OSM objects based on their link structure. Scores are normalised per city to [0, 1] with min-max scaling and stored on each `Place` as `interest_score`. This field drives the default sort order in the UI.

#### KMeans++ (colour palette extraction)

**File:** `processing/algorithms/kmeans.py`

A from-scratch NumPy implementation — no scikit-learn dependency.

**Pipeline:**

1. **Load** — open image with Pillow, convert to RGB, resize to at most 100×100 pixels, reshape to an (N, 3) float32 array.
2. **Initialise centroids** — KMeans++ seeding: the first centroid is chosen at random; each subsequent centroid is sampled with probability proportional to the squared distance from the nearest existing centroid.
3. **Iterate** — assign each pixel to its nearest centroid (Euclidean distance), recompute centroid means, repeat until convergence (`max_iter=20`, centroid shift tolerance `tol=1.0`).
4. **Build feature vector** — sort clusters by pixel count (largest first), flatten the k centroid RGB values into a 1-D array of length `k×3`, and normalise to [0, 1].

Default: **k = 10 clusters → 30-dimensional feature vector**.

The result is stored as a JSON array in `ImageFeature.feature_vector` and displayed in the UI as a colour swatch row.
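
Steps 2 to 4 of the pipeline can be sketched as follows. This is an illustrative NumPy version, not the contents of `kmeans.py`: the function names, the `seed` parameter, and the interpretation of `tol` as the overall Euclidean shift of all centroids are assumptions. The image-loading step is left out so the sketch depends only on NumPy; it expects an (N, 3) array of RGB values in the range 0 to 255.

```python
import numpy as np


def kmeans_pp_init(pixels: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """KMeans++ seeding: sample centroids proportionally to squared distance."""
    centroids = [pixels[rng.integers(len(pixels))]]
    for _ in range(1, k):
        # Squared distance from each pixel to its nearest chosen centroid.
        diffs = pixels[:, None, :] - np.asarray(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        total = d2.sum()
        if total > 0:
            idx = rng.choice(len(pixels), p=d2 / total)
        else:
            # Fewer distinct colours than k: fall back to a uniform pick.
            idx = rng.integers(len(pixels))
        centroids.append(pixels[idx])
    return np.asarray(centroids)


def colour_feature_vector(pixels, k=10, max_iter=20, tol=1.0, seed=0) -> np.ndarray:
    """Return a flat vector of k dominant colours (length k*3, values in [0, 1])."""
    pixels = np.asarray(pixels, dtype=np.float64).reshape(-1, 3)
    rng = np.random.default_rng(seed)
    centroids = kmeans_pp_init(pixels, k, rng)

    def assign(c):
        # Index of the nearest centroid for every pixel.
        return ((pixels[:, None, :] - c[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)

    for _ in range(max_iter):
        labels = assign(centroids)
        updated = centroids.copy()
        for j in range(k):
            members = pixels[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                updated[j] = members.mean(axis=0)
        shift = np.linalg.norm(updated - centroids)
        centroids = updated
        if shift < tol:
            break

    # Largest cluster first, then flatten and scale 0..255 to 0..1.
    counts = np.bincount(assign(centroids), minlength=k)
    order = np.argsort(-counts, kind="stable")
    return (centroids[order] / 255.0).ravel()
```

With the default `k=10` the returned array has 30 entries, matching the feature vector length described above.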

#### K-Nearest Neighbours (similarity search)

**File:** `processing/algorithms/knn.py`

Pure NumPy Euclidean-distance KNN. Three similarity methods are computed:

| Method | Input features | Scope |
|--------|---------------|-------|
| `structural` | One-hot encoded OSM categorical tags (`historic`, `tourism`, `amenity`, `shop`, `leisure`, `man_made`, `memorial`, `artwork_type`) | Same city only |
| `image_same_city` | 30-dim KMeans colour vectors | Same city only |
| `image_other_city` | 30-dim KMeans colour vectors | Other cities only |

**One-hot encoding** (`build_onehot_matrix`): builds a vocabulary per tag column from all place records, then creates a binary feature matrix where each cell is 1 if the place has that tag value. Places with no recognised tags are excluded from structural similarity.
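
The encoding described above might look like the sketch below. It is named `build_onehot_sketch` to make clear that it is not the project's `build_onehot_matrix`; the dictionary-based input and the returned tuple are assumptions.

```python
import numpy as np

# Tag columns used for structural similarity, as listed in the table above.
TAG_COLUMNS = ("historic", "tourism", "amenity", "shop",
               "leisure", "man_made", "memorial", "artwork_type")


def build_onehot_sketch(places, columns=TAG_COLUMNS):
    """Encode places (dicts of tag -> value) as a binary feature matrix.

    Returns (kept, matrix, vocab):
      kept   - indices into `places` of the rows that have at least one tag
      matrix - float32 array of shape (len(kept), len(vocab))
      vocab  - list of (column, value) pairs, one per matrix column
    """
    # One vocabulary per tag column, concatenated in column order.
    vocab = []
    for col in columns:
        values = sorted({p.get(col) for p in places if p.get(col)})
        vocab.extend((col, v) for v in values)
    index = {pair: i for i, pair in enumerate(vocab)}

    kept, rows = [], []
    for n, place in enumerate(places):
        row = np.zeros(len(vocab), dtype=np.float32)
        for col in columns:
            value = place.get(col)
            if value:
                row[index[(col, value)]] = 1.0
        if row.any():  # places with no recognised tags are excluded
            kept.append(n)
            rows.append(row)

    matrix = np.vstack(rows) if rows else np.zeros((0, len(vocab)), dtype=np.float32)
    return kept, matrix, vocab
```

Returning `kept` alongside the matrix lets the caller map matrix rows back to the original place records after untagged places have been dropped.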

**Distance → score**: raw Euclidean distance is converted to a similarity score in [0, 1] by `score = max(0, 1 − distance / max_distance)` where `max_distance = √(number_of_features)`.
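
A brute-force version of the search and the score conversion is sketched below. The function name and return shape are assumptions, and the full pairwise distance matrix is only practical for small inputs; the actual `knn.py` may process rows in chunks. The choice of `max_distance` follows from the fact that two vectors with all components in [0, 1] can be at most √d apart in d dimensions.

```python
import numpy as np


def knn_with_scores(features, k):
    """Return (neighbours, scores) for every row of `features`.

    neighbours[i] holds the indices of the k nearest other rows to row i,
    scores[i] the matching similarity values in [0, 1].
    """
    features = np.asarray(features, dtype=np.float64)
    n, d = features.shape

    # Full pairwise Euclidean distance matrix (n x n).
    diffs = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a row is never its own neighbour

    k = min(k, n - 1)
    neighbours = np.argsort(dist, axis=1, kind="stable")[:, :k]
    nearest = np.take_along_axis(dist, neighbours, axis=1)

    # score = max(0, 1 - distance / sqrt(number_of_features))
    max_distance = np.sqrt(d)
    scores = np.maximum(0.0, 1.0 - nearest / max_distance)
    return neighbours, scores
```

Identical vectors score 1.0, and vectors that differ in every component by the full range score 0.0.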

**Parallelism**: `compute_similarities` uses Python `multiprocessing.Pool` with one worker per CPU core. The database connection is closed before forking and re-opened inside each worker.

---

### 2.3 Installation Guide

#### Prerequisites

- Python 3.12 or later (the pinned Django 6.0 release does not support older versions)
- pip

#### Steps

**1. Clone or extract the project**

```bash
cd /path/to/engineering
```

**2. Create and activate a virtual environment** (recommended)

```bash
python -m venv .venv
source .venv/bin/activate        # macOS / Linux
.venv\Scripts\activate           # Windows
```

**3. Install dependencies**

```bash
pip install -r requirements.txt
```

Dependencies: `Django==6.0.5`, `numpy==2.4.4`, `Pillow==12.2.0`, `requests==2.33.1`.

**4. Apply database migrations**

```bash
python manage.py migrate
```

The application database (`data/db.sqlite3`) is already populated. This step only ensures the schema is up to date.

**5. Start the development server**

```bash
python manage.py runserver
```

Open `http://127.0.0.1:8000` in a browser.

---

#### Optional: re-loading data from source databases

Required only if you want to rebuild the database from the raw LD2/LD3 sources.

Place the source files as described in `data/sources/README.txt`, then run:

```bash
# Load cities and places from LD2, PageRank from LD3
python manage.py load_data

# Download images from Openverse (requires internet access)
python manage.py fetch_poi_images

# Compute KMeans colour features for downloaded images
python manage.py fetch_poi_images --features-only

# Compute all similarity records
python manage.py compute_similarities
```

Environment variables can be used instead of copying files:

```bash
export LD2_DB_PATH=/path/to/bigdata_ld2_database.sqlite3
export LD3_DB_PATH=/path/to/ld3_database.sqlite3
export LD4_IMAGES_ROOT=/path/to/images/
```
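
One way such overrides could be resolved is sketched below. The helper name `source_path` and the default directory are assumptions based on the project layout shown earlier, not a copy of `config/settings.py`.

```python
import os
from pathlib import Path

# Default location of the raw source databases (assumed from the project tree).
SOURCES_DIR = Path("data") / "sources"


def source_path(env_var: str, default_name: str) -> Path:
    """Use the environment variable if it is set, else the file under data/sources/."""
    override = os.environ.get(env_var)
    return Path(override) if override else SOURCES_DIR / default_name


LD2_DB_PATH = source_path("LD2_DB_PATH", "bigdata_ld2_database.sqlite3")
LD3_DB_PATH = source_path("LD3_DB_PATH", "ld3_database.sqlite3")
```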

#### Command reference

| Command | Description |
|---------|-------------|
| `python manage.py load_data` | Import cities, places, and PageRank from LD2/LD3 source databases |
| `python manage.py load_data --clear` | Clear all data before importing |
| `python manage.py fetch_poi_images` | Download images from Openverse for all cities |
| `python manage.py fetch_poi_images --city berlin` | Limit to one city |
| `python manage.py fetch_poi_images --compute-features` | Also compute KMeans features after downloading |
| `python manage.py fetch_poi_images --features-only` | Only compute missing features, skip downloading |
| `python manage.py compute_similarities` | Compute all three similarity methods |
| `python manage.py compute_similarities --method structural` | Compute one method only |
| `python manage.py compute_similarities --clear` | Clear existing similarity data before computing |

---

### 2.4 User Interface Guide

The web interface has three pages accessible via a browser at `http://127.0.0.1:8000`.

#### Home page — City list (`/`)

Displays a card grid of all available cities.

- Each card shows a representative image, the city name, and the number of places of interest.
- Clicking a card navigates to the city's place listing.
- A "How it works" section at the bottom explains the three algorithms (PageRank, KNN, KMeans) used by the system.

#### City detail page (`/city/<slug>/`)

Lists all places of interest for a selected city.

- **Search** — type any part of a place name in the search box and press *Search* to filter results. The *×* button clears the search.
- **Sort** — toggle between *Interest* (by `interest_score`, descending by default) and *Name* (alphabetical, ascending by default). Clicking the active sort button reverses the direction.
- **Place cards** — each card shows a thumbnail image, place name, up to two category tags, an interest-score star rating, and the numeric score. Clicking a card opens the place detail page.
- **Pagination** — results are paginated at 20 places per page. Previous / Next links and nearby page numbers are shown below the grid.
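
The search, sort, and pagination behaviour can be summarised in a framework-free sketch. The function name, parameter names, and the dictionary-based place records are hypothetical; the actual view presumably works on Django querysets rather than lists.

```python
from math import ceil

PAGE_SIZE = 20
# Default direction per sort field: interest is descending, name ascending.
DEFAULT_DESCENDING = {"interest": True, "name": False}


def list_places(places, query="", sort="interest", reverse=False, page=1):
    """Return (rows_on_page, total_pages) for the given filter and sort options."""
    q = query.strip().lower()
    rows = [p for p in places if q in p["name"].lower()]

    # Clicking the active sort button flips the default direction.
    descending = DEFAULT_DESCENDING[sort] != reverse
    if sort == "interest":
        rows.sort(key=lambda p: p["interest_score"], reverse=descending)
    else:
        rows.sort(key=lambda p: p["name"].lower(), reverse=descending)

    total_pages = max(1, ceil(len(rows) / PAGE_SIZE))
    page = min(max(page, 1), total_pages)  # clamp out-of-range page numbers
    start = (page - 1) * PAGE_SIZE
    return rows[start:start + PAGE_SIZE], total_pages
```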

#### Place detail page (`/place/<id>/`)

Shows full information for one place of interest.

**Left column — image and colour palette**

- Primary image for the place.
- Thumbnail strip if the place has more than one image.
- **Dominant colours** swatch: up to 10 colour tiles extracted by KMeans clustering from the primary image, displayed as hex-colour squares. Hovering a square shows its hex code.
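
Turning a stored feature vector into swatch colours amounts to scaling each group of three values back to 0 to 255. A minimal sketch, with a hypothetical function name:

```python
def vector_to_hex(feature_vector):
    """Convert a flat [r, g, b, r, g, b, ...] vector in [0, 1] to hex colour codes."""
    if len(feature_vector) % 3 != 0:
        raise ValueError("feature vector length must be a multiple of 3")
    swatches = []
    for i in range(0, len(feature_vector), 3):
        # Scale each channel to 0..255 and clamp against rounding overshoot.
        r, g, b = (min(255, max(0, round(c * 255))) for c in feature_vector[i:i + 3])
        swatches.append(f"#{r:02x}{g:02x}{b:02x}")
    return swatches
```

A 30-dimensional vector yields the 10 tiles shown in the swatch row.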

**Right column — place information**

- City badge and category tags.
- Place name heading.
- Interest-score star rating with numeric value.
- Details table: GPS coordinates with a *Maps* link (opens Google Maps), description, website, phone, opening hours, historic/tourism tag values, Wikidata QID link, and raw PageRank value.

**Similar places sections (below)**

Up to three horizontal scrollable rows of similar place cards:

| Section | Method | Basis |
|---------|--------|-------|
| *Similar in \<city\>* | `structural` | Shared OSM categorical tags, KNN |
| *Visually similar in \<city\>* | `image_same_city` | Image colour vectors, KMeans + KNN |
| *Visually similar elsewhere* | `image_other_city` | Image colour vectors across other cities |

Each similar-place card shows a thumbnail, name, city, and similarity score. Clicking a card opens that place's detail page.

If no similarity data exists, a notice instructs the user to run `python manage.py compute_similarities`.