4  Data Preparation

Every gePoints workflow begins with a data frame. How that data frame is built matters — not just for the code, but for the clarity and reproducibility of the document. This chapter describes two strategies for bringing data into a gePoints analysis, and argues for a discipline of presenting data in formatted tables from the start.

4.1 Strategy 1: Inline data

For small datasets — roughly 5 to 30 rows — the best approach is to embed the data directly in the code chunk using read_csv() with an inline string. The data is visible, the columns are aligned, and the reader sees exactly what the analysis operates on:

library(readr)
library(gt)

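sites <- read_csv(
  # I() marks the string as literal data rather than a file path;
  # the default trim_ws = TRUE strips the alignment padding.
  I("text,             lat,        lon
Harvard Forest,    42.53690,  -72.17266
Ordway-Swisher,    29.68927,  -81.99343
Konza Prairie,     39.10077,  -96.56390
Yellowstone,       44.95350, -110.53914
Toolik Lake,       68.66109, -149.37047")
)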

gt(sites) |>
  tab_caption(caption = "Selected NEON Core Sites") |>
  tab_source_note(source_note = "Source: NEON Strategy Document, 2011")

Aligning the columns takes a few minutes but makes an entire category of errors easy to catch. A latitude value in the longitude column stands out immediately when the numbers are stacked. With unaligned CSV, that same error is easy to miss.

This is the approach used throughout the example chapters of this book. The datasets are small — 8 to 21 rows — and the visual inspection of aligned data is worth the formatting effort.
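Visual inspection can be paired with a programmatic check. The sketch below is a minimal illustration and not part of gePoints: check_coords() is a hypothetical helper, written in base R, that returns any rows whose lat or lon is missing or falls outside the valid ranges (±90 and ±180 degrees). The sample rows reuse two of the sites above, with the second deliberately swapped.

```r
# Hypothetical helper (not part of gePoints): return the rows whose
# coordinates are missing or outside the valid latitude/longitude ranges.
check_coords <- function(df) {
  stopifnot(all(c("lat", "lon") %in% names(df)))
  bad <- is.na(df$lat) | is.na(df$lon) |
    abs(df$lat) > 90 | abs(df$lon) > 180
  df[bad, , drop = FALSE]
}

# Two of the sites above; the second has lat and lon swapped on purpose.
sites_demo <- data.frame(
  text = c("Harvard Forest", "Yellowstone (swapped)"),
  lat  = c(42.53690, -110.53914),
  lon  = c(-72.17266, 44.95350)
)

check_coords(sites_demo)  # flags the swapped row only
```

A range check of this kind only catches a swap when the misplaced longitude exceeds 90 degrees in magnitude. A swapped Harvard Forest row (lat -72.17266, lon 42.53690) would pass, which is why the aligned layout and the rendered table remain useful.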

4.2 Strategy 2: External CSV files

For larger datasets, or data that updates independently of the document, reading from an external file is appropriate:

sites <- read_csv("data/NEON_core_sites.txt")

When using external files, the data is hidden — the reader cannot see it without opening the file separately. A gt table in the rendered document compensates for this, but the code itself is less self-contained.

External files also introduce path dependencies. The code chunk must find the file relative to the document’s working directory, which varies depending on whether you render from the command line, from RStudio, or from a Quarto project. For reproducible documentation, inline data avoids this problem entirely.
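When an external file is unavoidable, one way to make a path problem obvious is to fail early with a message that names the working directory. The sketch below assumes nothing about gePoints; require_file() is a hypothetical helper built only from base R functions.

```r
# Hypothetical helper (not part of gePoints or readr): stop early with a
# message that names the working directory R is actually using.
require_file <- function(path) {
  if (!file.exists(path)) {
    stop(
      "Cannot find '", path, "' from working directory '", getwd(), "'.",
      call. = FALSE
    )
  }
  normalizePath(path)
}

# Intended use (commented out because the file may not exist on your machine):
# sites <- readr::read_csv(require_file("data/NEON_core_sites.txt"))
```

The here package's here() function is a common alternative that builds paths from the project root, at the cost of one more dependency.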

4.3 The gt table discipline

Every data frame in this book is displayed as a formatted gt table with a caption and a source note. This is not decorative; it serves three purposes:

  1. Verification. The table shows exactly what create_kml() will receive. If a coordinate is wrong, it is wrong in the table too, and a careful reader will catch it.
  2. Provenance. The source note records where the data came from. This is metadata that a bare data frame does not carry.
  3. Presentation. A well-formatted table in the rendered document respects the reader. It signals that the data was inspected, not just dumped.

The gt package is already a dependency of gePoints, so no additional installation is needed.

4.4 Column naming conventions

create_kml() expects specific column names. When your source data uses different names — Latitude instead of lat, Site_Name instead of text — you have two choices. For inline data, simply name the columns correctly when you type them. For external files, rename after import:

library(readr)  # read_csv()
library(dplyr)  # rename()

# rename() takes new_name = old_name pairs
sites <- read_csv("data/external_file.csv") |>
  rename(
    text = Site_Name,
    lat  = Latitude,
    lon  = Longitude
  )

The inline approach is simpler: you control the column names at the point of entry. This is another advantage of embedding small datasets directly in the code.
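Whichever route you take, a quick check before calling create_kml() can confirm that the expected names are present. The sketch below is an illustration under one assumption: that text, lat, and lon are the columns to look for, since those are the three this chapter names. missing_columns() is a hypothetical helper, not a gePoints function.

```r
# Assumption: text, lat, and lon are the columns create_kml() needs;
# the chapter names these three but does not give a full list.
missing_columns <- function(df, required = c("text", "lat", "lon")) {
  setdiff(required, names(df))
}

# A one-row stand-in for an external file with different column names.
raw <- data.frame(
  Site_Name = "Konza Prairie",
  Latitude  = 39.10077,
  Longitude = -96.56390
)
missing_columns(raw)      # all three names are reported as missing

renamed <- raw
names(renamed) <- c("text", "lat", "lon")
missing_columns(renamed)  # character(0): nothing missing
```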

4.5 Adding styling columns

Styling columns — color, symbol, symbol_scale, text_color, text_scale — can be included in the inline data or added after import with mutate(). For datasets where every row has the same style, mutate() is cleaner:

sites <- sites |>
  mutate(
    color  = "green",
    symbol = "paddle"
  )

For datasets where style varies by row — color by category, size by value — including the styling columns in the inline data makes the mapping explicit and inspectable. The example chapters demonstrate both approaches.
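When the style follows a category, a named lookup vector keeps the mapping in one place. The sketch below uses base R and hypothetical data: the biome column, its values, and the colour names other than "green" are assumptions made for illustration, and whether create_kml() accepts those colour names is not confirmed by this chapter. The same lookup expression also works inside mutate().

```r
# Hypothetical data: the biome column and its values are illustrative only.
sites_demo <- data.frame(
  text  = c("Harvard Forest", "Konza Prairie", "Toolik Lake"),
  biome = c("forest", "grassland", "tundra")
)

# Assumption: these colour names are accepted by create_kml();
# only "green" appears in this chapter.
biome_colors <- c(forest = "green", grassland = "yellow", tundra = "white")

# Index the named vector by category; unname() drops the lookup names.
sites_demo$color <- unname(biome_colors[as.character(sites_demo$biome)])

# A category missing from the lookup yields NA, so check before mapping.
stopifnot(!anyNA(sites_demo$color))
```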

4.6 Previewing maps with preview_map()

The preview_map() function takes the same data frame as create_kml() and renders an interactive Google Maps satellite view directly in a Quarto document or the RStudio viewer. It is a preview tool — a way to see your points on real imagery before taking the KML file to Google Earth for the full 3D experience.

create_kml(sites, "sites.kml")
preview_map(sites)

The map defaults to satellite imagery. Other options are "roadmap", "terrain", and "hybrid":

preview_map(sites, map_type = "hybrid")

Click any circle to see a popup built from the text and comment columns. The circle colors match the color column in your data frame. The map auto-zooms to fit all points.

4.6.1 Google Maps API key

preview_map() requires a Google Maps API key. The function reads the key from the environment variable GGMAP_GOOGLE_API_KEY. If you have not set this up, the steps are:

  1. Go to the Google Cloud Console and create a project (or use an existing one).
  2. Enable the Maps JavaScript API for that project.
  3. Create an API key under Credentials.
  4. Add the key to your .Renviron file so R finds it automatically:

# Open .Renviron for editing
usethis::edit_r_environ()

Add this line to the file:

GGMAP_GOOGLE_API_KEY=your-key-here

Save, restart R, and verify:

Sys.getenv("GGMAP_GOOGLE_API_KEY")

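Storing the key in .Renviron keeps it on your machine and out of your source code. Because the Maps JavaScript API runs in the browser, a rendered HTML page that embeds the map may still expose the key, so consider restricting it in the Google Cloud Console before publishing. GGMAP_GOOGLE_API_KEY is the same environment variable the ggmap package uses, so if you have already saved a key for ggmap with register_google(write = TRUE), it is already in place, provided the Maps JavaScript API is enabled for that key.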
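Printing the key with Sys.getenv() confirms it is set but also echoes it to the console. A guard along the following lines checks for the key without displaying it. This is a sketch in base R; has_maps_key() is a hypothetical helper, not part of gePoints.

```r
# Hypothetical guard: TRUE only when the variable is set and non-empty.
# Sys.getenv() returns "" for an unset variable, and nzchar("") is FALSE.
has_maps_key <- function(var = "GGMAP_GOOGLE_API_KEY") {
  nzchar(Sys.getenv(var))
}

if (!has_maps_key()) {
  message("GGMAP_GOOGLE_API_KEY is not set; add it to .Renviron and restart R.")
}
```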

You can also pass the key directly if needed, although doing so puts the key in your source code:

preview_map(sites, api_key = "your-key-here")

4.6.2 Circle radius

The circle size auto-scales to the geographic extent of your data. For fine control, set the radius in meters:

# Smaller circles for dense, local data
preview_map(sites, radius = 2000)

# Larger circles for continent-scale data
preview_map(sites, radius = 50000)
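This chapter does not say how the automatic scaling is computed. If you want a reasoned starting value for radius, one option is to take a small fraction of the bounding-box diagonal of your data. The sketch below does that with the haversine formula in base R. It is an assumption-laden illustration and not gePoints' algorithm: haversine_m(), suggest_radius(), the 1% fraction, and the 100 m floor are all hypothetical choices.

```r
# Great-circle distance in meters between two points given in degrees.
# r is the mean Earth radius in meters.
haversine_m <- function(lat1, lon1, lat2, lon2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))  # pmin() guards against rounding above 1
}

# Hypothetical rule of thumb: a fraction of the bounding-box diagonal,
# never smaller than min_radius. Does not handle data that crosses the
# antimeridian (longitude +/-180).
suggest_radius <- function(lat, lon, fraction = 0.01, min_radius = 100) {
  diagonal <- haversine_m(min(lat), min(lon), max(lat), max(lon))
  max(fraction * diagonal, min_radius)
}

# Coordinates of the five sites from section 4.1.
lat <- c(42.53690, 29.68927, 39.10077, 44.95350, 68.66109)
lon <- c(-72.17266, -81.99343, -96.56390, -110.53914, -149.37047)

radius_m <- suggest_radius(lat, lon)
# preview_map(sites, radius = radius_m)
```

The fraction is a matter of taste: a larger value gives circles that are easier to click, while a smaller value reduces overlap in dense clusters.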