Reading Data#
In practice, we rarely create spatial data entirely from scratch using raw coordinates. Instead, we most often work with existing datasets that have already been collected, structured, and stored in various formats.
In this section, we’ll learn how to read spatial data from different sources using Python. This includes:
- Loading vector data formats such as Shapefile, GeoJSON, and GeoPackage using `GeoPandas`
- Working with CSV files that contain coordinate information
- Exploring the contents of spatial datasets, including attributes and geometry
Knowing how to read and explore spatial data is an essential step before performing any kind of mapping, spatial analysis, or transformation.
Import libraries#
import pandas as pd
import geopandas as gpd
- `pandas` — a powerful Python library for data analysis and manipulation. It provides easy-to-use data structures, such as `DataFrame`, which is ideal for working with tabular (non-spatial) data like CSV files, spreadsheets, or database tables.
- `geopandas` — an extension of `pandas` that makes working with geospatial data easy. It builds on the familiar `DataFrame` structure and adds support for spatial operations, geometry columns, and reading/writing spatial file formats like Shapefile, GeoJSON, and GeoPackage.
Reading Data#
From Different Spatial Data Formats#
Spatial data can be stored in many different file formats — each designed for specific use cases, tools, and types of analysis. In this section, we’ll take a closer look at three of the most commonly used formats for vector spatial data:
- Shapefile (SHP) — a classic format developed by Esri; stores geometry and attributes across multiple files (.shp, .shx, .dbf, etc.).
- GeoJSON — a lightweight, human-readable format based on JSON; ideal for web mapping and simple spatial data sharing.
- GeoPackage (GPKG) — a modern, single-file SQLite-based format that supports multiple layers (vector, raster, and more).
GeoJSON (or Shapefile)#
We can read spatial data directly using the `read_file()` function from GeoPandas.
In this example, we’re loading a GeoJSON file that contains metro station data in Vienna. Once loaded, we can display it on an interactive map using `.explore()`:
metro = gpd.read_file('../data/vienna_metro.geojson')
metro.explore(tiles='cartodbpositron')
GeoPackage#
We can also read data stored in a GeoPackage (GPKG) using the same `read_file()` function from GeoPandas.
Here, we’re loading administrative boundaries of Vienna:
admin = gpd.read_file('../data/vienna_admin.gpkg')
admin.explore(tiles='cartodbpositron')
Done! But GeoPackage files can contain multiple layers, so if needed, you can specify the layer name using the `layer=` parameter.
Otherwise, by default, only the first layer in the file will be loaded.
To find out which layers are included in a GeoPackage, we can use the `listlayers()` function from the fiona library. Let’s import the library and check.
import fiona
layers = fiona.listlayers('../data/vienna_admin.gpkg')
print(layers)
['districts']
Now we know which layers are in our GeoPackage, and we can access them by name:
admin_district = gpd.read_file('../data/vienna_admin.gpkg', layer="districts")
admin_district.explore(tiles='cartodbpositron')
(Requesting a layer that does not exist, such as layer="cadastral_districts", raises a DataLayerError, which is why it pays to check the available layer names first.)
Great! We’ve learned how to read spatial data from various commonly used formats — including GeoJSON, Shapefile, and GeoPackage.
Now that we can confidently load spatial datasets, we’re ready to start working with them!
From Tabular Data#
Sometimes, we don’t start with a spatial file — instead, we may have tabular data (like a CSV file) that contains coordinates for each object. In this section, we’ll learn how to read tabular data and create a GeoDataFrame for further spatial analysis.
CSV#
CSV (Comma-Separated Values) is a plain-text format commonly used to store tabular data. While it’s not a spatial format by design, it’s frequently used in spatial workflows when a file includes coordinate fields such as longitude and latitude.
In such cases, we can extract these coordinates and use them to create spatial objects (like points).
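Many European CSV exports use a semicolon as the field separator and a comma as the decimal mark, which is why the file in this section is read with `sep=";"` and `decimal=','`. A minimal sketch of what those options do, using an inline string instead of a file:

```python
from io import StringIO

import pandas as pd

# A tiny inline CSV in the same European style as the file in this section:
# semicolon-separated fields, comma as the decimal mark (values are illustrative)
raw = "title;geo_latitude;geo_longitude\nAlbertina;48,204854;16,368159\n"

df = pd.read_csv(StringIO(raw), sep=";", decimal=",")
print(df.dtypes)  # the coordinate columns come out as floats
```

Without `decimal=','`, the coordinate columns would be read as strings and could not be used to build geometries directly.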
Let’s read the CSV file using the pandas library and take a look at its structure. This will help us understand how the data is organized and where the coordinate information is stored.
poi = pd.read_csv('../data/top_locations_wien.csv', sep=";", decimal=',')
poi.head()
|   | title | category | Beschreibung | address | zip | city | geo_latitude | geo_longitude | tel_1 | email | web_url |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21er Haus | museum | Das Museum wurde 2011 saniert und stellt unter... | Arsenalstraße 1 | 1030 | Wien | 48.185771 | 16.383622 | +43 1 795 57-134 | NaN | http://www.21erhaus.at/ |
| 1 | A.E. Köchert | shopping | Dieser Traditions-Juwelier schmückt heute mit ... | Neuer Markt 15 | 1010 | Wien | 48.206573 | 16.370589 | NaN | NaN | http://www.koechert.com/ |
| 2 | Aida | cafes | Aida ist eine Wiener Konditoreikette. Das Desi... | Stock-im-Eisen-Platz 2 | 1010 | Wien | 48.208019 | 16.372047 | +43 1 512 79 25 | NaN | http://www.aida.at |
| 3 | Akademietheater | musicstage | Seit 1922 ist das Akademietheater die zweite S... | Lisztstraße 1 | 1030 | Wien | 48.200246 | 16.377087 | +43 1 51444 4140 | NaN | http://www.burgtheater.at |
| 4 | Albertina | museum | Die Albertina besitzt nicht nur eine der größt... | Albertinaplatz 1 | 1010 | Wien | 48.204854 | 16.368159 | +43 1 534 83 0 | info@albertina.at | http://www.albertina.at/ |
This is a list of “Top locations” in Vienna. The coordinates are stored in the `geo_latitude` and `geo_longitude` columns.
To work with this dataset as spatial data, we need to convert it into a GeoDataFrame.
To create a GeoDataFrame from a regular DataFrame, we need to:

1. Generate geometry objects (in our case — points) using the coordinate columns. We do this using the `points_from_xy()` function from GeoPandas, which takes longitude and latitude values and returns a list of `Point` geometries.
2. Assign those geometries to a new `geometry` column.
3. Define the Coordinate Reference System (CRS) — here we’ll use EPSG:4326, which corresponds to the standard WGS84 latitude/longitude system used in GPS.
poi_gdf = gpd.GeoDataFrame(poi, geometry=gpd.points_from_xy(poi['geo_longitude'], poi['geo_latitude']), crs=4326)
Let’s check the output:
poi_gdf.explore(tiles='cartodbpositron')
And that’s it — we’ve successfully transformed our tabular data into a spatial dataset!
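The same three steps work on any coordinate table. Here is a self-contained sketch using two rows of illustrative values in place of the CSV:

```python
import geopandas as gpd
import pandas as pd

# A toy coordinate table standing in for the CSV (values are illustrative)
df = pd.DataFrame({
    "title": ["Albertina", "Aida"],
    "geo_latitude": [48.204854, 48.208019],
    "geo_longitude": [16.368159, 16.372047],
})

# Build geometries from the coordinate columns: x = longitude, y = latitude
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["geo_longitude"], df["geo_latitude"]),
    crs=4326,
)
print(gdf.geometry.iloc[0])  # POINT (16.368159 48.204854)
```

Note the argument order in `points_from_xy()`: longitude (x) comes first, latitude (y) second — a common source of flipped maps.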
Exploring Data#
Once you’ve loaded a spatial dataset into a `GeoDataFrame`, it’s important to understand what it contains.
Here are some key characteristics you can inspect to better understand your data.
Let’s take the example of a dataset called `metro` (loaded from `vienna_metro.geojson`).
Basic Info#
metro.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 98 entries, 0 to 97
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 OBJECTID 98 non-null float64
1 LINFO 98 non-null float64
2 HSTNR 0 non-null float64
3 HTXT 98 non-null object
4 HBEM 5 non-null object
5 EROEFFNUNG 98 non-null float64
6 EROEFFNUN0 98 non-null float64
7 geometry 98 non-null geometry
dtypes: float64(5), geometry(1), object(2)
memory usage: 6.2+ KB
Displays a summary of the DataFrame: number of entries, column names, data types, and missing values.
Preview the Data#
metro.head()
|   | OBJECTID | LINFO | HSTNR | HTXT | HBEM | EROEFFNUNG | EROEFFNUN0 | geometry |
|---|---|---|---|---|---|---|---|---|
| 0 | 341256.0 | 6.0 | NaN | Am Schöpfwerk | NaN | 1995.0 | 4.0 | POINT (16.32423 48.16072) |
| 1 | 341257.0 | 3.0 | NaN | Stubentor | NaN | 1991.0 | 4.0 | POINT (16.37913 48.20682) |
| 2 | 341258.0 | 3.0 | NaN | Simmering | NaN | 2000.0 | 12.0 | POINT (16.42070 48.16965) |
| 3 | 341259.0 | 4.0 | NaN | Meidling Hauptstraße | NaN | 1980.0 | 10.0 | POINT (16.32776 48.18365) |
| 4 | 342444.0 | 4.0 | NaN | Friedensbrücke | NaN | 1976.0 | 5.0 | POINT (16.36401 48.22777) |
Shows the first 5 rows of the dataset — a quick way to understand the structure and content.
Number of Features#
len(metro)
# or
metro.shape
(98, 8)
Returns the number of rows (features). `.shape` also gives you the number of columns.
Geometry Type#
metro.geom_type.unique()
array(['Point'], dtype=object)
Tells you what kind of geometries are included (e.g. `Point`, `Polygon`).
Coordinate Reference System (CRS)#
metro.crs
<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
Shows the coordinate reference system — for example, `EPSG:4326` (WGS84).
Bounding Box#
metro.total_bounds
array([16.26083539, 48.13051529, 16.50843802, 48.27751795])
Returns the extent of the dataset: `[minx, miny, maxx, maxy]`.
Geometry Column#
metro.geometry
0 POINT (16.32423 48.16072)
1 POINT (16.37913 48.20682)
2 POINT (16.42070 48.16965)
3 POINT (16.32776 48.18365)
4 POINT (16.36401 48.22777)
...
93 POINT (16.41480 48.17472)
94 POINT (16.38129 48.21913)
95 POINT (16.26084 48.19696)
96 POINT (16.34295 48.18848)
97 POINT (16.31896 48.18605)
Name: geometry, Length: 98, dtype: geometry
Displays the geometry objects for each row — these represent the spatial component of the data.
Attribute Fields#
metro.columns
Index(['OBJECTID', 'LINFO', 'HSTNR', 'HTXT', 'HBEM', 'EROEFFNUNG',
'EROEFFNUN0', 'geometry'],
dtype='object')
Lists all columns in the GeoDataFrame, including the `geometry` column and any additional attributes.
Overview#
print("CRS:", metro.crs)
print("Number of features:", len(metro))
print("Geometry types:", metro.geom_type.unique())
print("Bounds:", metro.total_bounds)
metro.head()
CRS: EPSG:4326
Number of features: 98
Geometry types: ['Point']
Bounds: [16.26083539 48.13051529 16.50843802 48.27751795]
|   | OBJECTID | LINFO | HSTNR | HTXT | HBEM | EROEFFNUNG | EROEFFNUN0 | geometry |
|---|---|---|---|---|---|---|---|---|
| 0 | 341256.0 | 6.0 | NaN | Am Schöpfwerk | NaN | 1995.0 | 4.0 | POINT (16.32423 48.16072) |
| 1 | 341257.0 | 3.0 | NaN | Stubentor | NaN | 1991.0 | 4.0 | POINT (16.37913 48.20682) |
| 2 | 341258.0 | 3.0 | NaN | Simmering | NaN | 2000.0 | 12.0 | POINT (16.42070 48.16965) |
| 3 | 341259.0 | 4.0 | NaN | Meidling Hauptstraße | NaN | 1980.0 | 10.0 | POINT (16.32776 48.18365) |
| 4 | 342444.0 | 4.0 | NaN | Friedensbrücke | NaN | 1976.0 | 5.0 | POINT (16.36401 48.22777) |
This gives you a quick overview of what your spatial data contains and how it’s structured — a crucial step before analysis or visualization.
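The one-off checks above can be wrapped into a small helper so they are easy to reuse on any dataset; a sketch, where `describe_gdf` is a hypothetical name rather than a GeoPandas function:

```python
import geopandas as gpd
from shapely.geometry import Point

def describe_gdf(gdf: gpd.GeoDataFrame) -> dict:
    """Collect the basic characteristics inspected above in one place."""
    return {
        "crs": str(gdf.crs),
        "n_features": len(gdf),
        "geom_types": sorted(gdf.geom_type.unique()),
        "bounds": list(gdf.total_bounds),  # [minx, miny, maxx, maxy]
    }

# Quick check on a two-point GeoDataFrame with illustrative coordinates
demo = gpd.GeoDataFrame(
    geometry=[Point(16.37, 48.21), Point(16.42, 48.17)], crs="EPSG:4326"
)
info = describe_gdf(demo)
print(info["n_features"], info["geom_types"])  # 2 ['Point']
```

Returning a plain dict keeps the helper easy to log, compare across datasets, or turn into a DataFrame row.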
Summary#
In this module, we learned how to read spatial data from different sources and formats using Python.
Specifically, we covered:
- How to load vector data formats such as Shapefile (SHP), GeoJSON, and GeoPackage (GPKG) using `GeoPandas`
- How to work with CSV files that contain latitude and longitude fields, and convert them into a proper `GeoDataFrame` using `points_from_xy()`
- How to explore the structure and content of spatial datasets, including geometry types and attributes
By the end of this section, you should be comfortable with reading and inspecting spatial data in various formats, preparing it for mapping and further geospatial analysis.