Reading Data#

In practice, we rarely create spatial data entirely from scratch using raw coordinates. Instead, we most often work with existing datasets that have already been collected, structured, and stored in various formats.

In this section, we’ll learn how to read spatial data from different sources using Python. This includes:

  • Loading vector data formats such as Shapefiles, GeoJSON, and GeoPackages using GeoPandas

  • Working with CSV files that contain coordinate information

  • Exploring the contents of spatial datasets, including attributes and geometry

Knowing how to read and explore spatial data is an essential step before performing any kind of mapping, spatial analysis, or transformation.

Import libraries#

import pandas as pd
import geopandas as gpd
  • pandas (pandas) — a powerful Python library for data analysis and manipulation. It provides easy-to-use data structures, such as DataFrame, which is ideal for working with tabular (non-spatial) data like CSV files, spreadsheets, or database tables.

  • GeoPandas (geopandas) — an extension of pandas that makes working with geospatial data easy. It builds on the familiar DataFrame structure and adds support for spatial operations, geometry columns, and reading/writing spatial file formats like Shapefile, GeoJSON, and GeoPackage.
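To see how the two libraries relate, here is a minimal sketch (the station names and coordinates below are illustrative, not taken from the course data): a GeoDataFrame is just a pandas DataFrame with an extra geometry column and a CRS attached.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

# Start from a plain pandas DataFrame...
df = pd.DataFrame({"name": ["Stephansplatz", "Karlsplatz"]})

# ...and add a geometry column plus a CRS to get a GeoDataFrame.
gdf = gpd.GeoDataFrame(
    df,
    geometry=[Point(16.3721, 48.2083), Point(16.3699, 48.2005)],
    crs="EPSG:4326",
)

print(isinstance(gdf, pd.DataFrame))  # True: all pandas methods still work
print(gdf.geometry.iloc[0])           # the geometry column holds shapely objects
```

Because GeoDataFrame subclasses DataFrame, everything you know from pandas (filtering, grouping, joining) carries over unchanged.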

Reading Data#

From Different Spatial Data Formats#

Spatial data can be stored in many different file formats — each designed for specific use cases, tools, and types of analysis. In this section, we’ll take a closer look at three of the most commonly used formats for vector spatial data:

  • Shapefile (SHP) — a classic format developed by Esri; stores geometry and attributes across multiple files (.shp, .shx, .dbf, etc.).

  • GeoJSON — a lightweight, human-readable format based on JSON; ideal for web mapping and simple spatial data sharing.

  • GeoPackage (GPKG) — a modern, single-file SQLite-based format that supports multiple layers (vector, raster, and more).

GeoJSON (or Shapefile)#

We can read spatial data directly using the read_file() function from GeoPandas. In this example, we’re loading a GeoJSON file that contains metro station data in Vienna:

Once loaded, we can display it on an interactive map using .explore():

metro = gpd.read_file('../data/vienna_metro.geojson')

metro.explore(tiles='cartodbpositron')

GeoPackage#

We can also read data stored in a GeoPackage (GPKG) using the same read_file() function from GeoPandas. Here, we’re loading administrative boundaries of Vienna:

admin = gpd.read_file('../data/vienna_admin.gpkg')

admin.explore(tiles='cartodbpositron')

Done! One caveat: GeoPackage files can contain multiple layers, so if needed you can specify the layer name using the layer= parameter. Otherwise, only the first layer in the file is loaded.

To find out which layers a GeoPackage contains, we can use the listlayers() function from the fiona library. Let's import it and check:

import fiona

layers = fiona.listlayers('../data/vienna_admin.gpkg')

print(layers)
['districts']

Now we know which layers are in our GeoPackage, and we can access them by name:

admin_district = gpd.read_file('../data/vienna_admin.gpkg', layer="districts")
admin_district.explore(tiles='cartodbpositron')

Great! We’ve learned how to read spatial data from various commonly used formats — including GeoJSON, Shapefile, and GeoPackage.

Now that we can confidently load spatial datasets, we’re ready to start working with them!

From Tabular Data#

Sometimes, we don’t start with a spatial file — instead, we may have tabular data (like a CSV file) that contains coordinates for each object. In this section, we’ll learn how to read tabular data and create a GeoDataFrame for further spatial analysis.

CSV#

CSV (Comma-Separated Values) is a plain-text format commonly used to store tabular data. While it’s not a spatial format by design, it’s frequently used in spatial workflows when a file includes coordinate fields such as longitude and latitude.

In such cases, we can extract these coordinates and use them to create spatial objects (like points).

Let’s read the CSV file using the pandas library and take a look at its structure. This will help us understand how the data is organized and where the coordinate information is stored.

poi = pd.read_csv('../data/top_locations_wien.csv', sep=";", decimal=',')

poi.head()
title category Beschreibung address zip city geo_latitude geo_longitude tel_1 email web_url
0 21er Haus museum Das Museum wurde 2011 saniert und stellt unter... Arsenalstraße 1 1030 Wien 48.185771 16.383622 +43 1 795 57-134 NaN http://www.21erhaus.at/
1 A.E. Köchert shopping Dieser Traditions-Juwelier schmückt heute mit ... Neuer Markt 15 1010 Wien 48.206573 16.370589 NaN NaN http://www.koechert.com/
2 Aida cafes Aida ist eine Wiener Konditoreikette. Das Desi... Stock-im-Eisen-Platz 2 1010 Wien 48.208019 16.372047 +43 1 512 79 25 NaN http://www.aida.at
3 Akademietheater musicstage Seit 1922 ist das Akademietheater die zweite S... Lisztstraße 1 1030 Wien 48.200246 16.377087 +43 1 51444 4140 NaN http://www.burgtheater.at
4 Albertina museum Die Albertina besitzt nicht nur eine der größt... Albertinaplatz 1 1010 Wien 48.204854 16.368159 +43 1 534 83 0 info@albertina.at http://www.albertina.at/

This is a list of “Top locations” in Vienna. The coordinates are stored in the geo_latitude and geo_longitude columns.

To work with this dataset as spatial data, we need to convert it into a GeoDataFrame.

To create a GeoDataFrame from a regular DataFrame, we need to:

  1. Generate geometry objects (in our case — points) using the coordinate columns. We do this using the points_from_xy() function from GeoPandas, which takes longitude and latitude values and returns a list of Point geometries.

  2. Assign those geometries to a new geometry column.

  3. Define the Coordinate Reference System (CRS) — here we’ll use EPSG:4326, which corresponds to the standard WGS84 latitude/longitude system used in GPS.

poi_gdf = gpd.GeoDataFrame(poi, geometry=gpd.points_from_xy(poi['geo_longitude'], poi['geo_latitude']), crs=4326)

Let’s check the output:

poi_gdf.explore(tiles='cartodbpositron')

And that’s it — we’ve successfully transformed our tabular data into a spatial dataset!
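To make the same pipeline easy to try without the course data file, here is a self-contained sketch using a tiny in-memory stand-in for the CSV (the rows below mimic top_locations_wien.csv but are typed out by hand; the real file additionally uses decimal=','):

```python
from io import StringIO

import pandas as pd
import geopandas as gpd

# A two-row stand-in for the semicolon-separated CSV with coordinate columns.
csv_text = (
    "title;geo_latitude;geo_longitude\n"
    "Albertina;48.204854;16.368159\n"
    "Aida;48.208019;16.372047\n"
)
poi = pd.read_csv(StringIO(csv_text), sep=";")

# Steps 1-3 from above: build Point geometries, assign them, set the CRS.
poi_gdf = gpd.GeoDataFrame(
    poi,
    geometry=gpd.points_from_xy(poi["geo_longitude"], poi["geo_latitude"]),
    crs="EPSG:4326",
)
print(poi_gdf.geom_type.unique())  # every row has become a Point geometry
```

Note the argument order in points_from_xy(): x (longitude) first, then y (latitude) — a common source of swapped coordinates.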

Exploring Data#

Once you’ve loaded a spatial dataset into a GeoDataFrame, it’s important to understand what it contains. Here are some key characteristics you can inspect to better understand your data.

Let’s take the example of the metro dataset we loaded earlier from vienna_metro.geojson.

Basic Info#

metro.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 98 entries, 0 to 97
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   OBJECTID    98 non-null     float64 
 1   LINFO       98 non-null     float64 
 2   HSTNR       0 non-null      float64 
 3   HTXT        98 non-null     object  
 4   HBEM        5 non-null      object  
 5   EROEFFNUNG  98 non-null     float64 
 6   EROEFFNUN0  98 non-null     float64 
 7   geometry    98 non-null     geometry
dtypes: float64(5), geometry(1), object(2)
memory usage: 6.2+ KB

Displays a summary of the DataFrame: number of entries, column names, data types, and missing values.

Preview the Data#

metro.head()
OBJECTID LINFO HSTNR HTXT HBEM EROEFFNUNG EROEFFNUN0 geometry
0 341256.0 6.0 NaN Am Schöpfwerk NaN 1995.0 4.0 POINT (16.32423 48.16072)
1 341257.0 3.0 NaN Stubentor NaN 1991.0 4.0 POINT (16.37913 48.20682)
2 341258.0 3.0 NaN Simmering NaN 2000.0 12.0 POINT (16.42070 48.16965)
3 341259.0 4.0 NaN Meidling Hauptstraße NaN 1980.0 10.0 POINT (16.32776 48.18365)
4 342444.0 4.0 NaN Friedensbrücke NaN 1976.0 5.0 POINT (16.36401 48.22777)

Shows the first 5 rows of the dataset — a quick way to understand the structure and content.

Number of Features#

len(metro)
# or
metro.shape
(98, 8)

Returns the number of rows (features). .shape also gives you the number of columns.

Geometry Type#

metro.geom_type.unique()
array(['Point'], dtype=object)

Tells you what kind of geometries are included (e.g. Point, Polygon).

Coordinate Reference System (CRS)#

metro.crs
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

Shows the coordinate reference system — for example, EPSG:4326 (WGS84).

Bounding Box#

metro.total_bounds
array([16.26083539, 48.13051529, 16.50843802, 48.27751795])

Returns the extent of the dataset: [minx, miny, maxx, maxy].
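Since total_bounds is just a NumPy array in [minx, miny, maxx, maxy] order, it can be unpacked directly — for example to compute the width and height of the extent. A minimal sketch with a hypothetical two-point dataset (not the metro data):

```python
import geopandas as gpd
from shapely.geometry import Point

# Two illustrative corner points spanning a small extent.
gdf = gpd.GeoDataFrame(
    geometry=[Point(16.26, 48.13), Point(16.51, 48.28)],
    crs="EPSG:4326",
)

# Unpack the bounding box and derive the extent's dimensions.
minx, miny, maxx, maxy = gdf.total_bounds
print(round(maxx - minx, 2), round(maxy - miny, 2))  # extent in degrees: 0.25 0.15
```

Keep in mind that with EPSG:4326 these dimensions are in degrees; reproject to a metric CRS if you need distances in metres.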

Geometry Column#

metro.geometry
0     POINT (16.32423 48.16072)
1     POINT (16.37913 48.20682)
2     POINT (16.42070 48.16965)
3     POINT (16.32776 48.18365)
4     POINT (16.36401 48.22777)
                ...            
93    POINT (16.41480 48.17472)
94    POINT (16.38129 48.21913)
95    POINT (16.26084 48.19696)
96    POINT (16.34295 48.18848)
97    POINT (16.31896 48.18605)
Name: geometry, Length: 98, dtype: geometry

Displays the geometry objects for each row — these represent the spatial component of the data.

Attribute Fields#

metro.columns
Index(['OBJECTID', 'LINFO', 'HSTNR', 'HTXT', 'HBEM', 'EROEFFNUNG',
       'EROEFFNUN0', 'geometry'],
      dtype='object')

Lists all columns in the GeoDataFrame, including the geometry column and any additional attributes.

Overview#

print("CRS:", metro.crs)
print("Number of features:", len(metro))
print("Geometry types:", metro.geom_type.unique())
print("Bounds:", metro.total_bounds)
metro.head()
CRS: EPSG:4326
Number of features: 98
Geometry types: ['Point']
Bounds: [16.26083539 48.13051529 16.50843802 48.27751795]
OBJECTID LINFO HSTNR HTXT HBEM EROEFFNUNG EROEFFNUN0 geometry
0 341256.0 6.0 NaN Am Schöpfwerk NaN 1995.0 4.0 POINT (16.32423 48.16072)
1 341257.0 3.0 NaN Stubentor NaN 1991.0 4.0 POINT (16.37913 48.20682)
2 341258.0 3.0 NaN Simmering NaN 2000.0 12.0 POINT (16.42070 48.16965)
3 341259.0 4.0 NaN Meidling Hauptstraße NaN 1980.0 10.0 POINT (16.32776 48.18365)
4 342444.0 4.0 NaN Friedensbrücke NaN 1976.0 5.0 POINT (16.36401 48.22777)

This gives you a quick overview of what your spatial data contains and how it’s structured — a crucial step before analysis or visualization.

Summary#

In this module, we learned how to read spatial data from different sources and formats using Python.

Specifically, we covered:

  • How to load vector data formats such as Shapefile (SHP), GeoJSON, and GeoPackage (GPKG) using GeoPandas

  • How to work with CSV files that contain latitude and longitude fields, and convert them into a proper GeoDataFrame using points_from_xy()

  • How to explore the structure and content of spatial datasets, including geometry types and attributes

By the end of this section, you should be comfortable with reading and inspecting spatial data in various formats, preparing it for mapping and further geospatial analysis.