Reading Data#

In practice, we rarely create spatial data entirely from scratch using raw coordinates. Instead, we most often work with existing datasets that have already been collected, structured, and stored in various formats.

In this section, we’ll learn how to read spatial data from different sources using Python. This includes:

  • Loading vector data formats such as Shapefiles, GeoJSON, and GeoPackages using GeoPandas

  • Working with CSV files that contain coordinate information

  • Exploring the contents of spatial datasets, including attributes and geometry

Knowing how to read and explore spatial data is an essential step before performing any kind of mapping, spatial analysis, or transformation.

Import libraries#

import pandas as pd
import geopandas as gpd
  • pandas (pandas) — a powerful Python library for data analysis and manipulation. It provides easy-to-use data structures, such as DataFrame, which is ideal for working with tabular (non-spatial) data like CSV files, spreadsheets, or database tables.

  • GeoPandas (geopandas) — an extension of pandas that makes working with geospatial data easy. It builds on the familiar DataFrame structure and adds support for spatial operations, geometry columns, and reading/writing spatial file formats like Shapefile, GeoJSON, and GeoPackage.
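To see how the two libraries relate, here is a minimal sketch (the station names and coordinates below are illustrative, not taken from the course data): a GeoDataFrame is just a pandas DataFrame with an extra geometry column and a CRS attached.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

# Start from a plain pandas DataFrame...
df = pd.DataFrame({"name": ["Stephansplatz", "Karlsplatz"]})

# ...and add a geometry column plus a CRS to get a GeoDataFrame.
gdf = gpd.GeoDataFrame(
    df,
    geometry=[Point(16.3721, 48.2083), Point(16.3699, 48.2005)],
    crs="EPSG:4326",
)

print(isinstance(gdf, pd.DataFrame))  # True: all pandas methods still work
print(gdf.geometry.iloc[0])           # the geometry column holds shapely objects
```

Because GeoDataFrame subclasses DataFrame, everything you know from pandas (filtering, grouping, joining) carries over unchanged.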

Reading Data#

From Different Spatial Data Formats#

Spatial data can be stored in many different file formats — each designed for specific use cases, tools, and types of analysis. In this section, we’ll take a closer look at three of the most commonly used formats for vector spatial data:

  • Shapefile (SHP) — a classic format developed by Esri; stores geometry and attributes across multiple files (.shp, .shx, .dbf, etc.).

  • GeoJSON — a lightweight, human-readable format based on JSON; ideal for web mapping and simple spatial data sharing.

  • GeoPackage (GPKG) — a modern, single-file SQLite-based format that supports multiple layers (vector, raster, and more).

GeoJSON (or Shapefile)#

We can read spatial data directly using the read_file() function from GeoPandas. In this example, we’re loading a GeoJSON file that contains metro station data in Vienna:

Once loaded, we can display it on an interactive map using .explore():

metro = gpd.read_file('../data/vienna_metro.geojson')

metro.explore(tiles='cartodbpositron')

GeoPackage#

We can also read data stored in a GeoPackage (GPKG) using the same read_file() function from GeoPandas. Here, we’re loading administrative boundaries of Vienna:

admin = gpd.read_file('../data/vienna_admin.gpkg')

admin.explore(tiles='cartodbpositron')

Done! One caveat: GeoPackage files can contain multiple layers, so if needed you can specify the layer name using the layer= parameter. Otherwise, only the first layer in the file is loaded.

To find out which layers a GeoPackage contains, we can use the listlayers() function from the fiona library. Let's import it and check:

import fiona

layers = fiona.listlayers('../data/vienna_admin.gpkg')

print(layers)
['districts']

Now we know which layers are in our GeoPackage, and we can access them by name:

admin_district = gpd.read_file('../data/vienna_admin.gpkg', layer="districts")
admin_district.explore(tiles='cartodbpositron')

Great! We’ve learned how to read spatial data from various commonly used formats — including GeoJSON, Shapefile, and GeoPackage.

Now that we can confidently load spatial datasets, we’re ready to start working with them!

From Tabular Data#

Sometimes, we don’t start with a spatial file — instead, we may have tabular data (like a CSV file) that contains coordinates for each object. In this section, we’ll learn how to read tabular data and create a GeoDataFrame for further spatial analysis.

CSV#

CSV (Comma-Separated Values) is a plain-text format commonly used to store tabular data. While it’s not a spatial format by design, it’s frequently used in spatial workflows when a file includes coordinate fields such as longitude and latitude.

In such cases, we can extract these coordinates and use them to create spatial objects (like points).

Let’s read the CSV file using the pandas library and take a look at its structure. This will help us understand how the data is organized and where the coordinate information is stored.

poi = pd.read_csv('../data/top_locations_wien.csv', sep=";", decimal=',')

poi.head()
title category Beschreibung address zip city geo_latitude geo_longitude tel_1 email web_url
0 21er Haus museum Das Museum wurde 2011 saniert und stellt unter... Arsenalstraße 1 1030 Wien 48.185771 16.383622 +43 1 795 57-134 NaN http://www.21erhaus.at/
1 A.E. Köchert shopping Dieser Traditions-Juwelier schmückt heute mit ... Neuer Markt 15 1010 Wien 48.206573 16.370589 NaN NaN http://www.koechert.com/
2 Aida cafes Aida ist eine Wiener Konditoreikette. Das Desi... Stock-im-Eisen-Platz 2 1010 Wien 48.208019 16.372047 +43 1 512 79 25 NaN http://www.aida.at
3 Akademietheater musicstage Seit 1922 ist das Akademietheater die zweite S... Lisztstraße 1 1030 Wien 48.200246 16.377087 +43 1 51444 4140 NaN http://www.burgtheater.at
4 Albertina museum Die Albertina besitzt nicht nur eine der größt... Albertinaplatz 1 1010 Wien 48.204854 16.368159 +43 1 534 83 0 info@albertina.at http://www.albertina.at/

This is a list of “Top locations” in Vienna. The coordinates are stored in the geo_latitude and geo_longitude columns.

To work with this dataset as spatial data, we need to convert it into a GeoDataFrame.

To create a GeoDataFrame from a regular DataFrame, we need to:

  1. Generate geometry objects (in our case — points) using the coordinate columns. We do this using the points_from_xy() function from GeoPandas, which takes longitude and latitude values and returns a list of Point geometries.

  2. Assign those geometries to a new geometry column.

  3. Define the Coordinate Reference System (CRS) — here we’ll use EPSG:4326, which corresponds to the standard WGS84 latitude/longitude system used in GPS.

poi_gdf = gpd.GeoDataFrame(poi, geometry=gpd.points_from_xy(poi['geo_longitude'], poi['geo_latitude']), crs=4326)

Let’s check the output:

poi_gdf.explore(tiles='cartodbpositron')

And that’s it — we’ve successfully transformed our tabular data into a spatial dataset!
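To make the same pipeline easy to try without the course data file, here is a self-contained sketch using a tiny in-memory stand-in for the CSV (the rows below mimic top_locations_wien.csv but are typed out by hand; the real file additionally uses decimal=','):

```python
from io import StringIO

import pandas as pd
import geopandas as gpd

# A two-row stand-in for the semicolon-separated CSV with coordinate columns.
csv_text = (
    "title;geo_latitude;geo_longitude\n"
    "Albertina;48.204854;16.368159\n"
    "Aida;48.208019;16.372047\n"
)
poi = pd.read_csv(StringIO(csv_text), sep=";")

# Steps 1-3 from above: build Point geometries, assign them, set the CRS.
poi_gdf = gpd.GeoDataFrame(
    poi,
    geometry=gpd.points_from_xy(poi["geo_longitude"], poi["geo_latitude"]),
    crs="EPSG:4326",
)
print(poi_gdf.geom_type.unique())  # every row has become a Point geometry
```

Note the argument order in points_from_xy(): x (longitude) first, then y (latitude) — a common source of swapped coordinates.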

Exploring Data#

Once you’ve loaded a spatial dataset into a GeoDataFrame, it’s important to understand what it contains. Here are some key characteristics you can inspect to better understand your data.

Let’s take the example of the metro dataset we loaded earlier from vienna_metro.geojson.

Basic Info#

metro.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 98 entries, 0 to 97
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   OBJECTID    98 non-null     float64 
 1   LINFO       98 non-null     float64 
 2   HSTNR       0 non-null      float64 
 3   HTXT        98 non-null     object  
 4   HBEM        5 non-null      object  
 5   EROEFFNUNG  98 non-null     float64 
 6   EROEFFNUN0  98 non-null     float64 
 7   geometry    98 non-null     geometry
dtypes: float64(5), geometry(1), object(2)
memory usage: 6.2+ KB

Displays a summary of the DataFrame: number of entries, column names, data types, and missing values.

Preview the Data#

metro.head()
OBJECTID LINFO HSTNR HTXT HBEM EROEFFNUNG EROEFFNUN0 geometry
0 341256.0 6.0 NaN Am Schöpfwerk NaN 1995.0 4.0 POINT (16.32423 48.16072)
1 341257.0 3.0 NaN Stubentor NaN 1991.0 4.0 POINT (16.37913 48.20682)
2 341258.0 3.0 NaN Simmering NaN 2000.0 12.0 POINT (16.42070 48.16965)
3 341259.0 4.0 NaN Meidling Hauptstraße NaN 1980.0 10.0 POINT (16.32776 48.18365)
4 342444.0 4.0 NaN Friedensbrücke NaN 1976.0 5.0 POINT (16.36401 48.22777)

Shows the first 5 rows of the dataset — a quick way to understand the structure and content.

Number of Features#

len(metro)
# or
metro.shape
(98, 8)

Returns the number of rows (features). .shape also gives you the number of columns.

Geometry Type#

metro.geom_type.unique()
array(['Point'], dtype=object)

Tells you what kind of geometries are included (e.g. Point, Polygon).

Coordinate Reference System (CRS)#

metro.crs
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

Shows the coordinate reference system — for example, EPSG:4326 (WGS84).

Bounding Box#

metro.total_bounds
array([16.26083539, 48.13051529, 16.50843802, 48.27751795])

Returns the extent of the dataset: [minx, miny, maxx, maxy].
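Since total_bounds is just a NumPy array in [minx, miny, maxx, maxy] order, it can be unpacked directly — for example to compute the width and height of the extent. A minimal sketch with a hypothetical two-point dataset (not the metro data):

```python
import geopandas as gpd
from shapely.geometry import Point

# Two illustrative corner points spanning a small extent.
gdf = gpd.GeoDataFrame(
    geometry=[Point(16.26, 48.13), Point(16.51, 48.28)],
    crs="EPSG:4326",
)

# Unpack the bounding box and derive the extent's dimensions.
minx, miny, maxx, maxy = gdf.total_bounds
print(round(maxx - minx, 2), round(maxy - miny, 2))  # extent in degrees: 0.25 0.15
```

Keep in mind that with EPSG:4326 these dimensions are in degrees; reproject to a metric CRS if you need distances in metres.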

Geometry Column#

metro.geometry
0     POINT (16.32423 48.16072)
1     POINT (16.37913 48.20682)
2     POINT (16.42070 48.16965)
3     POINT (16.32776 48.18365)
4     POINT (16.36401 48.22777)
                ...            
93    POINT (16.41480 48.17472)
94    POINT (16.38129 48.21913)
95    POINT (16.26084 48.19696)
96    POINT (16.34295 48.18848)
97    POINT (16.31896 48.18605)
Name: geometry, Length: 98, dtype: geometry

Displays the geometry objects for each row — these represent the spatial component of the data.

Attribute Fields#

metro.columns
Index(['OBJECTID', 'LINFO', 'HSTNR', 'HTXT', 'HBEM', 'EROEFFNUNG',
       'EROEFFNUN0', 'geometry'],
      dtype='object')

Lists all columns in the GeoDataFrame, including the geometry column and any additional attributes.

Overview#

print("CRS:", metro.crs)
print("Number of features:", len(metro))
print("Geometry types:", metro.geom_type.unique())
print("Bounds:", metro.total_bounds)
metro.head()
CRS: EPSG:4326
Number of features: 98
Geometry types: ['Point']
Bounds: [16.26083539 48.13051529 16.50843802 48.27751795]
OBJECTID LINFO HSTNR HTXT HBEM EROEFFNUNG EROEFFNUN0 geometry
0 341256.0 6.0 NaN Am Schöpfwerk NaN 1995.0 4.0 POINT (16.32423 48.16072)
1 341257.0 3.0 NaN Stubentor NaN 1991.0 4.0 POINT (16.37913 48.20682)
2 341258.0 3.0 NaN Simmering NaN 2000.0 12.0 POINT (16.42070 48.16965)
3 341259.0 4.0 NaN Meidling Hauptstraße NaN 1980.0 10.0 POINT (16.32776 48.18365)
4 342444.0 4.0 NaN Friedensbrücke NaN 1976.0 5.0 POINT (16.36401 48.22777)

This gives you a quick overview of what your spatial data contains and how it’s structured — a crucial step before analysis or visualization.

Summary#

In this module, we learned how to read spatial data from different sources and formats using Python.

Specifically, we covered:

  • How to load vector data formats such as Shapefile (SHP), GeoJSON, and GeoPackage (GPKG) using GeoPandas

  • How to work with CSV files that contain latitude and longitude fields, and convert them into a proper GeoDataFrame using points_from_xy()

  • How to explore the structure and content of spatial datasets, including geometry types and attributes

By the end of this section, you should be comfortable with reading and inspecting spatial data in various formats, preparing it for mapping and further geospatial analysis.