Wikidata: Events SPARQL Query

Alexander Dunkel, Institute of Cartography, TU Dresden


Last updated: Jul-26-2023, Carto-Lab Docker Version 0.14.0

Visualization of events queried from Wikidata using SPARQL, with Nevada as the example region.

Preparations

Create environment

In [4]:
!python -m venv /envs/wikidata_venv

Install qwikidata in a venv and link the Python Kernel to Jupyter Lab.

In [9]:
%%bash
if [ ! -d "/envs/wikidata_venv/lib/python3.10/site-packages/qwikidata" ]; then
    /envs/wikidata_venv/bin/python -m pip install qwikidata ipykernel pandas > /dev/null 2>&1
else
  echo "Already installed."
fi
# link
if [ ! -d "/root/.local/share/jupyter/kernels/qwikidata" ]; then
    echo "Linking environment to jupyter"
    /envs/wikidata_venv/bin/python -m ipykernel install --user --name=qwikidata
else
  echo "Already linked."
fi
Already installed.
Linking environment to jupyter
Installed kernelspec qwikidata in /root/.local/share/jupyter/kernels/qwikidata

Hit F5 and select the qwikidata Kernel on the top-right corner of Jupyter Lab.

See the package versions used below.

List of package versions used in this notebook
package | python  | ipykernel | pandas | qwikidata
version | 3.10.12 | 6.24.0    | 2.0.3  | 0.4.2
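
A version listing like the one above can be produced with the standard library alone. The following is a minimal sketch, not the notebook's own (collapsed) cell; the helper name `package_versions` is hypothetical and introduced here only for illustration.

```python
import platform
from importlib import metadata


def package_versions(packages):
    """Return a dict mapping 'python' and each package name to its version string.

    Hypothetical helper: packages that are not installed in the active
    environment are reported as 'not installed' instead of raising.
    """
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


# Print one "package version" pair per line for the packages used in this part.
for name, version in package_versions(["ipykernel", "pandas", "qwikidata"]).items():
    print(name, version)
```

Run inside the qwikidata kernel, this reports the versions of that virtual environment rather than those of the base environment.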

Query Wikidata using SPARQL

Import dependencies

In [4]:
import csv
import pandas as pd
from qwikidata.sparql import return_sparql_query_results

Define query:

  • use distance query to Nevada (centroid)
  • filtering by the state boundary (geometry) is done later in GeoPandas
  • see SPARQL examples here and here
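
To make the distance filter more tangible, here is a minimal offline sketch of the same idea: keep a point only if its great-circle distance to a reference point is below a threshold in kilometers. It assumes a spherical Earth and the haversine formula, which may differ slightly from what the query service computes internally. The center coordinates below are an illustrative guess, not the value stored for Q1227, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two lon/lat points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_distance(point, center, max_km):
    """True if point (lon, lat) lies closer than max_km to center (lon, lat)."""
    return haversine_km(*point, *center) < max_km


# Illustrative center only (roughly central Nevada), NOT taken from Wikidata.
center = (-117.0, 39.0)
las_vegas_hotel = (-115.161261, 36.140165)  # a location from the results table
print(within_distance(las_vegas_hotel, center, 400))
```

Because the filter is a circle around a single point, it also returns events in neighboring states, which is why the boundary-based filtering step follows later.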

Parameters

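Two parameters need to be modified: the entity (the Wikidata ID whose coordinates are used as the centroid location) and the geodistance (the search radius around that centroid, in kilometers, used for filtering). loc_name is only a label; it is used for the query title, the output file names and the shapefile selection further below.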

In [1]:
## Example 1:
loc_name = "Nevada"
entity = "Q1227"
geodistance = 400

## Example 2:
# loc_name = "Leipzig"
# geodistance = 80
# entity = "Q2079" # Leipzig, Germany

In [6]:
sparql_query = f"""
#title: All events in {loc_name}, based on distance query ({geodistance})
SELECT ?event ?eventLabel ?date ?location ?eventDescription
WITH {{
  SELECT DISTINCT ?event ?date ?location
  WHERE {{
    # coordinates of the reference entity (centroid)
    wd:{entity} wdt:P625 ?loc_ref.
    # find events
    ?event wdt:P31/wdt:P279* wd:Q1190554.
           # wdt:P17 wd:Q30;
    # with a point in time or start date
    OPTIONAL {{ ?event wdt:P585 ?date. }}
    OPTIONAL {{ ?event wdt:P580 ?date. }}
    ?event wdt:P625 ?location.
    FILTER(geof:distance(?location, ?loc_ref) < {geodistance}).
  }}
  LIMIT 5000
}} AS %i
WHERE {{
  INCLUDE %i
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en,de" .}}
}}
"""
In [9]:
%%time
result = return_sparql_query_results(sparql_query)
CPU times: user 26.6 ms, sys: 0 ns, total: 26.6 ms
Wall time: 39.8 s
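
The query above is a Python f-string, so every literal SPARQL brace has to be doubled (`{{` and `}}`), while single braces insert the parameters. A reduced sketch of that templating step follows; `build_query` and its entity check are hypothetical additions for illustration, and the shortened query is not the one used in this notebook.

```python
import re


def build_query(entity: str, geodistance: float, limit: int = 10) -> str:
    """Fill a reduced SPARQL template; doubled braces become literal braces."""
    # Guard against typos such as passing the label instead of the Q-ID.
    if not re.fullmatch(r"Q[1-9]\d*", entity):
        raise ValueError(f"Not a Wikidata item ID: {entity!r}")
    return f"""
SELECT ?event ?location
WHERE {{
  wd:{entity} wdt:P625 ?loc_ref.
  ?event wdt:P625 ?location.
  FILTER(geof:distance(?location, ?loc_ref) < {geodistance}).
}}
LIMIT {limit}
"""


query = build_query("Q1227", 400)
print(query)
```

Printing the rendered string before sending it is a cheap way to spot a missing or undoubled brace, which would otherwise surface as a Python error or a malformed query.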

Format and convert to pandas DataFrame

In [10]:
event_list = []
for event in result["results"]["bindings"]:
    # ?date is OPTIONAL in the query; values that cannot be parsed
    # (or are out of range for pandas) are coerced to NaT
    date_val = event.get('date')
    if date_val:
        date_val = pd.to_datetime(
            date_val.get('value'), errors='coerce', utc=True)
    # descriptions are missing for some entities
    event_desc = event.get('eventDescription')
    if event_desc:
        event_desc = event_desc.get('value')
    # tuple order must match the SELECT variable order (result['head']['vars'])
    event_tuple = (
        event['event']['value'],
        event['eventLabel']['value'],
        date_val,
        event['location']['value'],
        event_desc)
    event_list.append(event_tuple)
In [11]:
df = pd.DataFrame(event_list, columns=result['head']['vars'])
In [12]:
df.head()
Out[12]:
  | event | eventLabel | date | location | eventDescription
0 | http://www.wikidata.org/entity/Q116448291 | California Revealed | NaT | Point(-121.49633 38.575783) | online project of archival resources
1 | http://www.wikidata.org/entity/Q29098186 | Hilton Grand Vacations Club | NaT | Point(-115.161261 36.140165) | hotel in Las Vegas, Nevada
2 | http://www.wikidata.org/entity/Q29098186 | Hilton Grand Vacations Club | NaT | Point(-115.160386 36.140174) | hotel in Las Vegas, Nevada
3 | http://www.wikidata.org/entity/Q4602566 | 2004 Bridgestone 400 | 2004-09-25 00:00:00+00:00 | Point(-115.01112 36.27134) | motor car race
4 | http://www.wikidata.org/entity/Q16274840 | 1964 LPGA Championship | 1964-01-01 00:00:00+00:00 | Point(-115.125 36.128) | golf tournament
In [13]:
print(len(df))
328
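
The conversion above relies on the shape of the SPARQL JSON result: `head.vars` lists the selected variables, and each entry in `results.bindings` maps a variable to a dict with a `value` key, omitting variables that are unbound. The sketch below flattens such a structure with the standard library only. The sample dict is hand-written for illustration (modeled on two rows of the table above, with the raw date string format assumed), and `flatten_bindings` is a hypothetical helper.

```python
# Hand-written sample in SPARQL JSON results layout; not a real response.
sample_result = {
    "head": {"vars": ["event", "eventLabel", "date", "location", "eventDescription"]},
    "results": {"bindings": [
        {
            "event": {"type": "uri", "value": "http://www.wikidata.org/entity/Q4602566"},
            "eventLabel": {"type": "literal", "value": "2004 Bridgestone 400"},
            "date": {"type": "literal", "value": "2004-09-25T00:00:00Z"},
            "location": {"type": "literal", "value": "Point(-115.01112 36.27134)"},
            "eventDescription": {"type": "literal", "value": "motor car race"},
        },
        {
            # No "date" key: the OPTIONAL patterns did not match for this event.
            "event": {"type": "uri", "value": "http://www.wikidata.org/entity/Q116448291"},
            "eventLabel": {"type": "literal", "value": "California Revealed"},
            "location": {"type": "literal", "value": "Point(-121.49633 38.575783)"},
            "eventDescription": {"type": "literal", "value": "online project of archival resources"},
        },
    ]},
}


def flatten_bindings(result):
    """Turn SPARQL JSON bindings into a list of flat dicts, None for unbound variables."""
    columns = result["head"]["vars"]
    return [
        {col: binding.get(col, {}).get("value") for col in columns}
        for binding in result["results"]["bindings"]
    ]


rows = flatten_bindings(sample_result)
for row in rows:
    print(row["eventLabel"], row["date"])
```

A list of dicts like `rows` can be passed directly to `pd.DataFrame(rows)`; date parsing would then be a separate step on the `date` column.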

Store to disk

In [14]:
from pathlib import Path
OUTPUT = Path.cwd().parents[0] / "out"
# make sure the output folder exists before writing
OUTPUT.mkdir(exist_ok=True)
df.to_pickle(OUTPUT / f"wikidata_events_{loc_name.lower()}.pkl")

Visualize on a map

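Select worker_env as the visualization environment (kernel). Because the kernel changes, the Parameters cell has to be run again so that loc_name is defined.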

In [2]:
%load_ext autoreload
%autoreload 2

Load dependencies

In [3]:
import sys
import pandas as pd
import geopandas as gp
from pathlib import Path
from shapely.geometry import Point
from shapely import wkt
module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)
from modules.base import tools
List of package versions used in this notebook
package | python | Shapely | geopandas | pandas
version | 3.9.15 | 1.7.1   | 0.13.2    | 2.0.3
In [5]:
OUTPUT = Path.cwd().parents[0] / "out" 
df = pd.read_pickle(OUTPUT / f"wikidata_events_{loc_name.lower()}.pkl") 
In [6]:
CRS_WGS = "epsg:4326"

df['geometry'] = df.location.apply(wkt.loads)
gdf = gp.GeoDataFrame(df, crs=CRS_WGS)
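
The location strings returned by Wikidata are WKT points in longitude/latitude order, for example `Point(-115.161261 36.140165)`. The notebook parses them with `shapely.wkt.loads`; the sketch below only illustrates the axis order with a standard-library regular expression. `parse_wkt_point` is a hypothetical helper and handles plain 2D points only.

```python
import re

# Matches "Point(x y)" with optional whitespace; case-insensitive.
WKT_POINT = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$",
    re.IGNORECASE,
)


def parse_wkt_point(text):
    """Return (lon, lat) as floats from a 2D WKT point string."""
    match = WKT_POINT.match(text)
    if match is None:
        raise ValueError(f"Not a 2D WKT point: {text!r}")
    lon, lat = (float(group) for group in match.groups())
    return lon, lat


lon, lat = parse_wkt_point("Point(-115.161261 36.140165)")
print(f"longitude={lon}, latitude={lat}")
```

Keeping x as longitude and y as latitude matches how the points are plotted further below, where the y limits fall in the range of Nevada's latitudes.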

Get shapefile for US states / Germany

In [7]:
if loc_name == "Nevada":
    source_zip = "https://www2.census.gov/geo/tiger/GENZ2018/shp/"
    filename = "cb_2018_us_state_5m.zip"
    shapes_name = "cb_2018_us_state_5m.shp"
elif loc_name == "Leipzig":
    source_zip = "https://daten.gdz.bkg.bund.de/produkte/vg/vg2500/aktuell/"
    filename = "vg2500_12-31.utm32s.shape.zip"
    shapes_name = "vg2500_12-31.utm32s.shape/vg2500/VG2500_LAN.shp"
In [8]:
SHAPE_DIR = (OUTPUT / "shapes")
SHAPE_DIR.mkdir(exist_ok=True)

if not (SHAPE_DIR / shapes_name).exists():
    tools.get_zip_extract(uri=source_zip, filename=filename, output_path=SHAPE_DIR)
else:
    print("Already exists")
Already exists
In [9]:
shapes = gp.read_file(SHAPE_DIR / shapes_name)
shapes = shapes.to_crs("EPSG:4326")
In [10]:
ax = shapes.plot(color='none', edgecolor='black', linewidth=0.5)
ax = gdf.plot(ax=ax)
ax.set_axis_off()
buffer = 0.5
minx, miny, maxx, maxy = gdf.total_bounds
ax.set_xlim(minx-buffer, maxx+buffer)
ax.set_ylim(miny-buffer, maxy+buffer)
Out[10]:
(35.117, 43.0)

Highlight/Select all in Region

We want to keep only those events whose location falls within the state boundary (Nevada or Saxony).

In [11]:
if loc_name == "Nevada":
    state_name = "Nevada"
    col_name = "NAME"
elif loc_name == "Leipzig":
    state_name = "Sachsen"
    col_name = "GEN"
In [12]:
sel_geom = shapes[shapes[col_name]==state_name].copy()
In [13]:
tools.drop_cols_except(df=sel_geom, columns_keep=["geometry", col_name])
sel_geom.rename(columns={col_name: "country"}, inplace=True)
In [14]:
gdf_overlay = gp.overlay(
    gdf, sel_geom,
    how='intersection')
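
Conceptually, the intersection overlay keeps only the points that lie inside the selected polygon. A minimal sketch of such a point-in-polygon test using ray casting is shown below. It is not how GeoPandas implements the overlay, and the rectangle is a hypothetical stand-in for the real state boundary.

```python
def point_in_polygon(x, y, ring):
    """Ray casting: count how often a ray going right from (x, y) crosses the ring."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical rectangle in lon/lat, NOT the actual Nevada boundary.
box = [(-120.0, 35.0), (-114.0, 35.0), (-114.0, 42.0), (-120.0, 42.0)]

events = [
    {"label": "Hilton Grand Vacations Club", "lon": -115.161261, "lat": 36.140165},
    {"label": "California Revealed", "lon": -121.49633, "lat": 38.575783},
]
selected = [e for e in events if point_in_polygon(e["lon"], e["lat"], box)]
print([e["label"] for e in selected])
```

With the real state polygon, points near the border are decided by the boundary geometry itself, so the resolution of the shapefile (here the 5m generalization for US states) affects edge cases.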
In [15]:
ax = shapes.plot(color='none', edgecolor='black', linewidth=0.5)
ax = gdf.plot(ax=ax)
ax = gdf_overlay.plot(ax=ax, color='red')
ax.set_axis_off()
buffer = 1
minx, miny, maxx, maxy = gdf.total_bounds
ax.set_xlim(minx-buffer, maxx+buffer)
ax.set_ylim(miny-buffer, maxy+buffer)
Out[15]:
(34.617, 43.5)
In [16]:
print(f'{len(gdf_overlay)} events queried from wikidata that are located in {state_name}')
117 events queried from wikidata that are located in Nevada
In [17]:
gdf_overlay.head(20)
Out[17]:
   | event | eventLabel | date | location | eventDescription | country | geometry
0  | http://www.wikidata.org/entity/Q29098186 | Hilton Grand Vacations Club | NaT | Point(-115.161261 36.140165) | hotel in Las Vegas, Nevada | Nevada | POINT (-115.16126 36.14017)
1  | http://www.wikidata.org/entity/Q29098186 | Hilton Grand Vacations Club | NaT | Point(-115.160386 36.140174) | hotel in Las Vegas, Nevada | Nevada | POINT (-115.16039 36.14017)
2  | http://www.wikidata.org/entity/Q4602566 | 2004 Bridgestone 400 | 2004-09-25 00:00:00+00:00 | Point(-115.01112 36.27134) | motor car race | Nevada | POINT (-115.01112 36.27134)
3  | http://www.wikidata.org/entity/Q16274840 | 1964 LPGA Championship | 1964-01-01 00:00:00+00:00 | Point(-115.125 36.128) | golf tournament | Nevada | POINT (-115.12500 36.12800)
4  | http://www.wikidata.org/entity/Q4571929 | 1965 LPGA Championship | 1965-01-01 00:00:00+00:00 | Point(-115.125 36.128) | golf tournament | Nevada | POINT (-115.12500 36.12800)
5  | http://www.wikidata.org/entity/Q4570360 | 1961 LPGA Championship | 1961-01-01 00:00:00+00:00 | Point(-115.125 36.128) | golf tournament | Nevada | POINT (-115.12500 36.12800)
6  | http://www.wikidata.org/entity/Q4572336 | 1966 LPGA Championship | 1966-01-01 00:00:00+00:00 | Point(-115.125 36.128) | golf tournament | Nevada | POINT (-115.12500 36.12800)
7  | http://www.wikidata.org/entity/Q4570751 | 1962 LPGA Championship | 1962-01-01 00:00:00+00:00 | Point(-115.125 36.128) | golf tournament | Nevada | POINT (-115.12500 36.12800)
8  | http://www.wikidata.org/entity/Q4571127 | 1963 LPGA Championship | 1963-01-01 00:00:00+00:00 | Point(-115.125 36.128) | golf tournament | Nevada | POINT (-115.12500 36.12800)
9  | http://www.wikidata.org/entity/Q111021622 | 3-Cushion World Cup 2022-2 | 2022-01-01 00:00:00+00:00 | Point(-115.18708 36.116869) | Internationales Karambolageturnier | Nevada | POINT (-115.18708 36.11687)
10 | http://www.wikidata.org/entity/Q24906942 | Real World: Go Big or Go Home | NaT | Point(-115.140444444 36.170972222) | thirty-first season of Real World | Nevada | POINT (-115.14044 36.17097)
11 | http://www.wikidata.org/entity/Q7759664 | The Real World: Las Vegas, 2002 season | 2002-09-17 00:00:00+00:00 | Point(-115.194 36.1139) | twelth season of The Real World | Nevada | POINT (-115.19400 36.11390)
12 | http://www.wikidata.org/entity/Q7759665 | The Real World: Las Vegas, 2011 season | 2011-03-09 00:00:00+00:00 | Point(-115.154 36.11) | twenty-fifth season of The Real World | Nevada | POINT (-115.15400 36.11000)
13 | http://www.wikidata.org/entity/Q104786210 | 2021 NHL Outdoor Games | NaT | Point(-119.949 38.968) | outdoor National Hockey League game | Nevada | POINT (-119.94900 38.96800)
14 | http://www.wikidata.org/entity/Q25316469 | 1954 NCAA Skiing Championships | 1954-01-01 00:00:00+00:00 | Point(-119.872 39.318) | None | Nevada | POINT (-119.87200 39.31800)
15 | http://www.wikidata.org/entity/Q15092916 | Sparks Middle School shooting | NaT | Point(-119.76838889 39.55191667) | Shooting in Sparks, Nevada, on October 21, 2013 | Nevada | POINT (-119.76839 39.55192)
16 | http://www.wikidata.org/entity/Q15806674 | Dreiband-Weltmeisterschaft 1978 | 1978-01-01 00:00:00+00:00 | Point(-115.172816 36.114646) | 33. Turnier des Karambolagebillards | Nevada | POINT (-115.17282 36.11465)
17 | http://www.wikidata.org/entity/Q15806682 | 1986 UMB World Three-cushion Championship | 1986-01-01 00:00:00+00:00 | Point(-115.172816 36.114646) | 41. Turnier des Karambolagebillards | Nevada | POINT (-115.17282 36.11465)
18 | http://www.wikidata.org/entity/Q15806666 | Dreiband-Weltmeisterschaft 1970 | 1970-01-01 00:00:00+00:00 | Point(-115.172816 36.114646) | 25. Turnier des Karambolagebillards | Nevada | POINT (-115.17282 36.11465)
19 | http://www.wikidata.org/entity/Q6492580 | Las Vegas Grind | NaT | Point(-115.193 36.1166) | ls Vegas Grind Festival | Nevada | POINT (-115.19300 36.11660)

Store results as CSV

In [18]:
gdf_overlay.to_csv(OUTPUT / f"wikidata_events_{loc_name.lower()}.csv")

Create notebook HTML

In [19]:
!jupyter nbconvert --to html_toc \
    --output-dir=../resources/html/ ./03_wikidata_event_query.ipynb \
    --output 03_wikidata_event_query_{loc_name.lower()} \
    --template=../nbconvert.tpl \
    --ExtractOutputPreprocessor.enabled=False >&- 2>&-