# Defence Preparation — POI Recommender System

---

## 1. UI Implementation and Navigation _(max 1 pt)_

**Q: Walk me through how a user navigates the system.**

Start at the home page (`/`) — it shows all four cities as cards. Click a city to open the city detail page (`/city/<slug>/`) which lists all places for that city, paginated 20 per page. Click any place card to open the place detail page (`/place/<id>/`) with full info and similarity sections. From there, breadcrumb links at the top and a *Back to \<city\>* button at the bottom navigate back up.

**Q: What does the home page show?**

A grid of city cards with a representative image, city name, and total place count. Below it, a "How it works" section explains the three algorithms — PageRank, KNN, KMeans.

**Q: What happens if a place or city has no image?**

A styled placeholder is shown. Every `<img>` tag has an `onerror` fallback and templates use `{% if img %}` guards, so the UI never breaks on missing images.

---

## 2. POI Sorting and Display _(max 1.5 pt)_

**Q: How is the interest score calculated?**

PageRank scores come from the LD3 lab database (`ld3_claude.sqlite3`), which contains scores for all four cities: Berlin, London, New York, and Paris. The raw scores are normalised per city to [0, 1] using min-max scaling: `(score − city_min) / (city_max − city_min)`. Every city's scores therefore span the same range, and a place's interest score expresses its importance relative to other places in the same city. A place with no PageRank entry would get a heuristic fallback (+0.5 for a Wikidata QID, +0.3 for a tourism tag, +0.2 for a historic tag), but no loaded place currently needs it.
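The normalisation and fallback described above can be sketched in a few lines. This is a minimal illustration, not the project's `load_data` code: the function names are hypothetical, and returning 0.0 when all scores in a city are equal is an assumed way to avoid dividing by zero.

```python
def normalise_scores(raw: dict[str, float]) -> dict[str, float]:
    """Min-max scale one city's raw PageRank values to [0, 1]."""
    if not raw:
        return {}
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        # Assumed behaviour: identical scores would divide by zero, so use 0.0.
        return {place: 0.0 for place in raw}
    return {place: (score - lo) / (hi - lo) for place, score in raw.items()}


def fallback_score(has_qid: bool, has_tourism: bool, has_historic: bool) -> float:
    """Heuristic for a place with no PageRank entry: +0.5, +0.3, +0.2."""
    return 0.5 * has_qid + 0.3 * has_tourism + 0.2 * has_historic
```

Because the scaling is applied per city, the top place in every city ends up with a score of 1.0 and the lowest with 0.0.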

**Q: Where exactly is PageRank computed — is it computed by us or loaded?**

It is loaded, not computed by us. The LD3 lab database already contains pre-computed PageRank scores produced during Lab 3 using MPI-based parallel graph processing across all four cities. Our `load_data` command reads those scores, normalises them, and stores the result as `interest_score` on each place.

**Q: What is PageRank and why is it appropriate here?**

PageRank scores a node in a graph by the importance of other nodes that point to it. In the LD3 dataset, OSM places are connected by semantic and geographic relationships. A place that many important places link to gets a higher score — which correlates well with real-world significance and tourist interest.

**Q: How does sorting work in the UI?**

The city detail page has a sort toggle: *Interest* (by `interest_score`, descending by default) and *Name* (alphabetical, ascending by default). Clicking the active button reverses direction. Sort and order are passed as query parameters (`?sort=score&order=desc`) and applied via Django ORM `.order_by()`.
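One way to turn the query parameters into an `.order_by()` argument is sketched below. The helper names, the whitelist, and the `sort=name` URL value are assumptions made for illustration; only `sort=score`, `order=desc`, and the `interest_score` field come from the description above.

```python
from typing import Optional

# Hypothetical whitelist: URL value -> model field name.
SORT_FIELDS = {"score": "interest_score", "name": "name"}
# Default direction per sort key, as described for the toggle.
DEFAULT_ORDER = {"score": "desc", "name": "asc"}


def build_ordering(sort: Optional[str], order: Optional[str]) -> str:
    """Translate ?sort=...&order=... into a string for QuerySet.order_by()."""
    if sort not in SORT_FIELDS:
        sort = "score"  # unknown or missing value: fall back to interest
    if order not in ("asc", "desc"):
        order = DEFAULT_ORDER[sort]
    prefix = "-" if order == "desc" else ""  # Django uses "-" for descending
    return prefix + SORT_FIELDS[sort]


def toggle_order(order: str) -> str:
    """Direction to put in the link of the already-active sort button."""
    return "asc" if order == "desc" else "desc"
```

In a view, the result would be passed along the lines of `places.order_by(build_ordering(request.GET.get("sort"), request.GET.get("order")))`. Mapping through a whitelist keeps arbitrary user input out of the ORM call.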

**Q: What main information is shown on a place detail page?**

Primary image with a thumbnail strip (if multiple images), place name, city badge, category tags, interest score with a star display, GPS coordinates with a Google Maps link, description, website, phone, opening hours, historic/tourism tag values, Wikidata QID link, and the raw PageRank value.

---

## 3. Similar POI Recommendations _(max 1.5 pt)_

**Q: What are the three types of similarity and how do they differ?**

| Section shown in UI | Method stored | Input data |
|---------------------|--------------|------------|
| Similar in same city | `structural` | One-hot encoded OSM categorical tags |
| Visually similar in same city | `image_same_city` | KMeans colour feature vectors |
| Visually similar in other cities | `image_other_city` | KMeans colour feature vectors |

**Q: How does structural similarity work?**

Each place's OSM categorical tags (`historic`, `tourism`, `amenity`, `shop`, `leisure`, `man_made`, `memorial`, `artwork_type`) are one-hot encoded into a binary feature matrix — one column per unique tag value. KNN finds the k=10 nearest neighbours by Euclidean distance, restricted to the same city. Distance is converted to a score with `max(0, 1 − distance / max_distance)` where `max_distance = √(number of features)`.
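The encoding and scoring steps can be illustrated with a small pure-Python sketch. It is a brute-force stand-in for whatever KNN implementation the project uses, and the function names and the `key:value` tag strings are illustrative assumptions.

```python
import math


def one_hot(places: dict[str, set[str]]) -> tuple[list[str], dict[str, list[int]]]:
    """Build one binary column per unique tag value."""
    columns = sorted(set().union(*places.values()))
    vectors = {
        pid: [1 if col in tags else 0 for col in columns]
        for pid, tags in places.items()
    }
    return columns, vectors


def structural_neighbours(places: dict[str, set[str]], target: str, k: int = 10):
    """Return up to k (place_id, score) pairs, most similar first."""
    columns, vectors = one_hot(places)
    if not columns:
        return []
    # Largest possible Euclidean distance between two binary vectors.
    max_distance = math.sqrt(len(columns))
    scored = []
    for pid, vec in vectors.items():
        if pid == target:
            continue
        distance = math.dist(vectors[target], vec)
        scored.append((pid, max(0.0, 1.0 - distance / max_distance)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```

Passing only one city's places as `places` gives the same-city restriction. Two places with identical tags score 1.0, and two places that differ in every column score 0.0.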

**Q: How does image similarity work?**

Each downloaded image is resized to at most 100×100 px and its pixels are clustered into 10 colour groups using KMeans++. The 10 centroid RGB values (sorted by cluster size, largest first) are flattened into a 30-dimensional vector normalised to [0, 1] and stored as `ImageFeature`. KNN then searches this vector space — within the same city for `image_same_city`, across other cities for `image_other_city`, k=5 neighbours each.
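The last step, turning the clustered colours into a feature vector, might look like the sketch below. It assumes the clustering has already produced centroids and per-cluster pixel counts, and it assumes normalisation means dividing each 0 to 255 channel by 255; the function name is hypothetical.

```python
def colour_feature(centroids: list[tuple[float, float, float]],
                   sizes: list[int]) -> list[float]:
    """Flatten RGB centroids, largest cluster first, scaled to [0, 1]."""
    # Indices of the clusters ordered by pixel count, largest first.
    order = sorted(range(len(centroids)), key=lambda i: sizes[i], reverse=True)
    # Assumption: channels are 0..255, so dividing by 255 maps them to [0, 1].
    return [channel / 255.0 for i in order for channel in centroids[i]]
```

With 10 clusters this yields 10 × 3 = 30 values. Sorting by cluster size gives the dimensions a consistent meaning across images: the first three values are always the dominant colour.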

**Q: What is KMeans++ and how does it differ from basic KMeans?**

Basic KMeans picks initial centroids uniformly at random, which can place several of them in the same cluster and lead to slow convergence or a poor local optimum. KMeans++ chooses the first centroid randomly, then samples each subsequent centroid from the data points with probability proportional to each point's squared distance from its nearest existing centroid. This spreads the initial centroids out and gives faster, more stable convergence.
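The seeding step can be written out directly. This is a generic sketch of KMeans++ initialisation for illustration, not the project's clustering code; the function name and the optional `rng` parameter are assumptions.

```python
import random


def kmeans_pp_init(points, k, rng=None):
    """Pick k initial centroids from points using KMeans++ seeding."""
    rng = rng or random.Random()
    centroids = [rng.choice(points)]  # first centroid: uniform at random
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [
            min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
            for p in points
        ]
        if sum(d2) == 0:
            # Every point coincides with a centroid; fall back to uniform choice.
            centroids.append(rng.choice(points))
            continue
        # Sample the next centroid with probability proportional to d2.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids
```

A point that already coincides with a centroid has weight zero, so it cannot be picked again while other distinct points remain.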

**Q: Where are similarity results stored?**

In the `places_similarplace` table: `main_place_id`, `similar_place_id`, `score`, `method`. This corresponds to the required schema `similar_places(main_place_id, similar_place_id, sim_score)`, with `score` holding the `sim_score` value. The additional `method` column distinguishes the three types.

**Q: Why do some places have no image-based similarity?**

Image KNN requires a downloaded image and a computed `ImageFeature` vector, so places without an image are excluded. To cover every place, images must be fetched with `fetch_poi_images --limit 0`; the default `--limit 50` only covers the top 50 places per city.

---

## 4. Data Collection and Storage _(max 1 pt)_

**Q: What data sources are used?**

Two lab databases and one API:
- `bigdata_ld2_database.sqlite3` (LD2) — OSM place records with all categorical tags and contact info
- `ld3_claude.sqlite3` (LD3) — PageRank scores and Wikidata English labels/descriptions for all cities
- Openverse API — creative-commons licensed images fetched per place

**Q: How many places are stored per city?**

Berlin: 2 152, London: 1 450, New York: 466, Paris: 1 097 — all well above the required minimum of 50.

**Q: How does `load_data` combine LD2 and LD3?**

It loads all place records from `final_features` (LD2), then enriches each one by joining on `(city_name, osm_id, osm_type)` with `ld3_object` (LD3) to get PageRank, Wikidata label, and description. Contact details (website, phone, opening hours) come from a second join with `intermediate_data` (also LD2). All three sources share the same OSM IDs so the join is 100% lossless.
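The enrichment join can be demonstrated against two in-memory SQLite databases. The table names and the three key columns come from the description above; the `name` and `pagerank` columns, the sample rows, and the helper functions are hypothetical stand-ins.

```python
import sqlite3

JOIN_ON = (
    "o.city_name = f.city_name AND o.osm_id = f.osm_id "
    "AND o.osm_type = f.osm_type"
)


def build_demo() -> sqlite3.Connection:
    """Create tiny LD2/LD3 stand-ins; the real data lives in two files."""
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS ld3")
    conn.executescript("""
        CREATE TABLE main.final_features
            (city_name TEXT, osm_id INTEGER, osm_type TEXT, name TEXT);
        CREATE TABLE ld3.ld3_object
            (city_name TEXT, osm_id INTEGER, osm_type TEXT, pagerank REAL);
        INSERT INTO main.final_features VALUES
            ('Berlin', 1, 'node', 'Place A'), ('Paris', 2, 'way', 'Place B');
        INSERT INTO ld3.ld3_object VALUES
            ('Berlin', 1, 'node', 0.004), ('Paris', 2, 'way', 0.001);
    """)
    return conn


def enrich(conn: sqlite3.Connection):
    """LEFT JOIN keeps every LD2 row even if LD3 has no match."""
    sql = (
        "SELECT f.name, o.pagerank FROM main.final_features AS f "
        f"LEFT JOIN ld3.ld3_object AS o ON {JOIN_ON} ORDER BY f.name"
    )
    return conn.execute(sql).fetchall()


def count_unmatched(conn: sqlite3.Connection) -> int:
    """Number of LD2 rows with no LD3 partner (0 means a lossless join)."""
    sql = (
        "SELECT COUNT(*) FROM main.final_features AS f "
        f"LEFT JOIN ld3.ld3_object AS o ON {JOIN_ON} WHERE o.osm_id IS NULL"
    )
    return conn.execute(sql).fetchone()[0]
```

A `LEFT JOIN` is used so that a place missing from LD3 would still be loaded with a `NULL` PageRank, which is the case the heuristic fallback is meant for.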

**Q: How is data quality ensured?**

`load_data` skips rows with an empty name. Wikidata English labels are preferred over raw OSM names. `fetch_poi_images` scores each Openverse result by how many place-name tokens appear in its title/tags and only downloads results above a 50% match threshold (`--min-match 0.5`). Wikimedia-heavy results can be filtered out with `--exclude-wikimedia`.
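The token-matching filter could work roughly as follows. The tokeniser, the stop-word list, and the function names are assumptions, as is treating a score of exactly 0.5 as acceptable; only the 0.5 default threshold comes from `--min-match`.

```python
import re

# Assumed list of insignificant tokens that are ignored when matching.
STOPWORDS = {"the", "of", "and", "a", "an", "de", "la"}


def tokens(text: str) -> set[str]:
    """Lower-case word tokens with stop words removed."""
    return {t for t in re.findall(r"\w+", text.lower()) if t not in STOPWORDS}


def match_score(place_name: str, title: str, tags=()) -> float:
    """Fraction of place-name tokens found in the result's title or tags."""
    wanted = tokens(place_name)
    if not wanted:
        return 0.0
    found = tokens(title + " " + " ".join(tags))
    return len(wanted & found) / len(wanted)


def accept(place_name: str, title: str, tags=(), min_match: float = 0.5) -> bool:
    """Keep a search result only if enough of the name is matched."""
    return match_score(place_name, title, tags) >= min_match
```

For example, a result titled "London skyline" matches one of the two significant tokens in "Tower of London", giving a score of 0.5.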

**Q: How is the database schema normalised?**

The schema follows the requirements exactly:
- `places_city (id, name, slug, …)`
- `places_place (id, city_id, name, …)`
- `places_category (id, name)` — one row per unique tag value
- `places_placecategory (place_id, category_id)` — many-to-many join table

A category like `tourism:museum` exists once in `places_category` and is shared across every museum in the database.

---

## 5. Data Processing and Parallel Computing _(max 2 pt)_

**Q: How is data processing separated from the web application?**

All heavy processing runs as standalone Django management commands — `load_data`, `fetch_poi_images`, `compute_similarities`. They are invoked from the terminal independently of the web server. The web app only reads from the database; it never runs any processing itself.

**Q: How is parallel computing implemented?**

`compute_similarities` uses Python's `multiprocessing.Pool` spawning one worker per CPU core (`cpu_count()`). The database connection is explicitly closed before forking (`connection.close()`) to avoid sharing a connection across processes. Each worker receives a batch of data, computes KNN independently, and returns results. The main process collects everything with `pool.map()`.
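A stripped-down version of this pattern is sketched below, with one work item per city. The function names and the job layout are hypothetical, and the `fork` start method is requested explicitly on the assumption of a POSIX host; the Django-specific `connection.close()` call appears only as a comment so the sketch runs without Django.

```python
import math
import multiprocessing as mp


def knn_for_city(job):
    """Worker: brute-force k nearest neighbours within one city."""
    city, vectors, k = job
    rows = []
    for main_id, main_vec in vectors.items():
        dists = sorted(
            (math.dist(main_vec, vec), other_id)
            for other_id, vec in vectors.items()
            if other_id != main_id
        )
        rows.extend((main_id, other_id, d) for d, other_id in dists[:k])
    return rows


def compute_all(cities, k=10, workers=None):
    """Distribute one job per city to the pool and merge the results."""
    jobs = [(city, vectors, k) for city, vectors in cities.items()]
    # In the management command, django.db.connection.close() would be
    # called here so that workers do not inherit the parent's connection.
    ctx = mp.get_context("fork")  # assumption: POSIX host
    with ctx.Pool(processes=workers or mp.cpu_count()) as pool:
        per_city = pool.map(knn_for_city, jobs)  # results keep job order
    return [row for rows in per_city for row in rows]
```

Each worker returns plain tuples rather than writing to the database, so only the main process needs a database connection after the pool finishes.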

**Q: What collective communication operations are used?**

`pool.map()` follows the scatter-gather pattern, which is analogous to two collective operations:
- **Scatter**: the main process splits the work items into chunks and distributes them across the worker processes
- **Gather**: the main process collects the results from all workers, in input order, once they finish

Workers do not communicate with each other; each processes its batch independently, which avoids synchronisation overhead and is well-suited to this embarrassingly parallel workload.

**Q: What is the difference between this and MPI collective communication?**

MPI provides explicit low-level collective primitives (broadcast, scatter, gather, reduce, all-reduce, barrier) in which all processes participate in a shared operation, possibly mid-computation. `pool.map()` is a higher-level abstraction that only communicates at the boundaries (scatter at start, gather at end). The current implementation does not use MPI; the workload is embarrassingly parallel, so inter-worker communication is not needed.

---

## 6. User Interface and Database Integration _(max 1 pt)_

**Q: How does the UI communicate with the database?**

All three views (`city_list`, `city_detail`, `place_detail`) query the SQLite database directly through Django's ORM on every HTTP request. There is no intermediate cache or file layer — HTML is rendered from live database queries each time.

**Q: What framework is used and why?**

Django. It provides the ORM for database access, a template engine for HTML, URL routing, and the management command system used for all data processing — everything needed in one framework, matching the project requirements.

---

## 7. Collected Data Analysis _(max 1 pt)_

**Q: Summarise the collected data.**

5 165 places across 4 cities (Berlin, London, New York, Paris). Every place has a PageRank-based interest score. There are 51 650 structural similarity records, which is 10 neighbours for each of the 5 165 places. Image counts and image-based similarity records depend on how many images have been fetched; each downloaded image produces a 30-dimensional KMeans colour feature vector stored in `ImageFeature`. Full data tables are in `DOCUMENTATION.md` Section 1.

**Q: How are the two source databases related?**

`final_features` (LD2) and `ld3_object` (LD3) contain exactly the same 5 591 places identified by the same `osm_id + osm_type + city_name` keys — verified by a join that produces zero unmatched rows. LD2 provides OSM tags; LD3 adds PageRank and Wikidata enrichment on top.

**Q: What were the main data quality challenges?**

Openverse image search uses free-text queries, so generic place names like "Park" or "Museum" return many irrelevant results. This is handled with a token-matching scorer that requires at least 50% of the place name's significant tokens to appear in the result. Wikimedia images can dominate results for well-known places; the `--exclude-wikimedia` flag avoids this.

**Q: Why do some places have no PageRank score?**

All 5 165 loaded places have a PageRank score, because the LD3 database covers all four cities completely. The heuristic fallback would only activate for a place whose `final_features` row has no matching `ld3_object` entry, which does not occur in practice because both tables contain the same set of places.

---

## 8. System Documentation _(max 1 pt)_

**Q: Where is the documentation?**

`DOCUMENTATION.md` at the project root covers data analysis, project structure, all algorithms, installation guide, and UI guide.

**Q: What does the installation guide cover?**

Virtual environment setup, `pip install -r requirements.txt`, `python manage.py migrate`, and `runserver`. It also documents every management command with all relevant flags, and environment variables for overriding source database paths.

**Q: How would someone extend the system with a new city?**

Add the city's records to the LD2/LD3 source databases and re-run `load_data`. Then run `fetch_poi_images --city <slug>` to download images and `compute_similarities` to generate similarity records. No code changes are needed.
