Data Sources

Seahorse supports Data Sources of the following types:

Read more about supported file formats.


Data Sources List

Setting up JDBC Data Source with custom JDBC Drivers

Reading data from and writing data to JDBC-compatible databases is supported.

This functionality requires placing the appropriate JDBC driver JAR file in the Seahorse shared folder jars. The file must be placed there before you start editing a workflow that uses a JDBC connection; otherwise, you will have to stop the running session and start it again.
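The driver placement step can be scripted. The following is a minimal sketch, assuming you know where the shared jars folder lives on your machine; the helper name install_jdbc_driver and both path arguments are hypothetical and not part of Seahorse.

```python
import shutil
from pathlib import Path


def install_jdbc_driver(driver_jar: str, shared_jars_dir: str) -> Path:
    """Copy a JDBC driver JAR into the shared `jars` folder (hypothetical helper)."""
    source = Path(driver_jar)
    if source.suffix.lower() != ".jar":
        raise ValueError(f"expected a .jar file, got: {source.name}")
    target_dir = Path(shared_jars_dir)
    # Assumption: the folder location depends on your installation,
    # so it is passed in rather than hard-coded.
    target_dir.mkdir(parents=True, exist_ok=True)
    # copy2 keeps file metadata and returns the destination path.
    return Path(shutil.copy2(source, target_dir / source.name))
```

Run it before opening the workflow in the editor, so that the session starts with the driver already in place.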

Setting up a Google Spreadsheet Data Source

Google Sheets Data Source has two parameters that require more detailed description:


Google Spreadsheet Params Forms


In the following sections you will learn how to set up a Google Service Account for the Seahorse instance and how to share the spreadsheets with the Seahorse instance.

Setting up a Google Service Account

  1. Set up Google Project with Google Drive API enabled
    1. Go to Drive API Page
      If you don’t have a Google Project yet, create a new one.
    2. Enable Google Drive API.
  2. Set up Google Service Account
    1. Go to Service Accounts Page
    2. Select your Google Project
    3. Create new Google Service Account.
      IMPORTANT: Tick the ’Furnish a new private key’ option and select the JSON key type. Store the downloaded JSON credentials file securely; you will need it later.


    Create Service Account Form


  3. Paste the content of the JSON credentials file into the Google Service Account Credentials JSON field of the Data Source.
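Before pasting the credentials, it can help to check that the downloaded file really is a service account key. The sketch below relies on three fields that Google service account JSON keys contain (type, client_email, private_key); the helper name check_service_account_json is hypothetical, and Seahorse may perform its own validation.

```python
import json

# Fields present in a Google service account JSON key file.
REQUIRED_FIELDS = ("type", "client_email", "private_key")


def check_service_account_json(raw: str) -> str:
    """Validate pasted credentials and return the account e-mail (hypothetical helper)."""
    data = json.loads(raw)
    missing = [field for field in REQUIRED_FIELDS if not data.get(field)]
    if missing:
        raise ValueError(f"credentials JSON is missing fields: {missing}")
    if data["type"] != "service_account":
        raise ValueError(f"unexpected credential type: {data['type']!r}")
    return data["client_email"]
```

The returned e-mail address is the one to share the spreadsheet with.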

Sharing Google Spreadsheet with your Seahorse Instance

  1. Obtain the e-mail address of your Google Service Account from the list of Service Accounts.


    Email Address for Google Service Account


  2. Share your Google Spreadsheet with your Google Service Account using the e-mail address from step 1.


    Google Spreadsheet Share Form


  3. You can now use the Google Spreadsheet and your Google Service Account credentials to define a Google Spreadsheet Data Source in Seahorse.


    Google Spreadsheet Param Forms


The Data Source is now ready to use in Seahorse!

Supported File Formats Details

CSV

Comma-separated values

When reading a CSV file, Seahorse infers column types. If a column contains values of multiple types, Seahorse chooses the narrowest type that can represent all of the values.

Empty cells are treated as null, unless the column type is inferred as String; in that case, they are treated as empty strings.

If the convert to boolean mode is enabled, the columns that contain only zeros, ones or empty values will be inferred as Boolean. In particular, a column consisting of empty cells will be inferred as Boolean with null values only.
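The inference rules above can be illustrated with a simplified sketch. It assumes only three candidate types (Boolean, Double, String) and a hypothetical function infer_column; the real set of types and their ordering in Seahorse are not described here, and the fallback to String for an all-empty column when the boolean mode is off is an assumption of this sketch.

```python
from typing import List, Optional, Tuple


def _is_number(cell: str) -> bool:
    try:
        float(cell)
        return True
    except ValueError:
        return False


def infer_column(cells: List[str],
                 convert_to_boolean: bool = False) -> Tuple[str, List[Optional[object]]]:
    """Pick the narrowest of Boolean < Double < String that fits every cell."""
    non_empty = [c for c in cells if c != ""]
    # Only zeros, ones or empty cells: Boolean (also true for an all-empty column).
    if convert_to_boolean and all(c in ("0", "1") for c in non_empty):
        return "Boolean", [None if c == "" else c == "1" for c in cells]
    if non_empty and all(_is_number(c) for c in non_empty):
        return "Double", [None if c == "" else float(c) for c in cells]
    # String columns keep empty cells as empty strings instead of null.
    return "String", list(cells)
```

For example, infer_column(["1", "", "abc"]) falls back to String and keeps the empty cell as an empty string, while infer_column(["1", "", "2.5"]) yields Double with a null in the middle.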

While reading, Seahorse assumes that each row in the file has the same number of fields. When this condition is not met, the behavior is undefined.

If the file defines column names, they will be used in the DataFrame. If a column’s name is empty or absent, the column will be named unnamed_X, where X is the smallest non-negative number that keeps all column names unique.
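One way to read the naming rule is sketched below. The function fill_column_names is hypothetical, and the left-to-right order in which missing names are filled is an assumption; the rule itself only fixes that the smallest free number is used.

```python
from typing import List, Optional


def fill_column_names(names: List[Optional[str]]) -> List[str]:
    """Replace empty or absent names with unnamed_X, keeping all names unique."""
    used = {name for name in names if name}
    result = []
    next_index = 0
    for name in names:
        if not name:
            # Skip numbers already taken by explicit or generated names.
            while f"unnamed_{next_index}" in used:
                next_index += 1
            name = f"unnamed_{next_index}"
            used.add(name)
        result.append(name)
    return result
```

With the header ["a", "", "unnamed_0", None], this sketch produces ["a", "unnamed_1", "unnamed_0", "unnamed_2"], because unnamed_0 is already taken by an explicit column name.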

You can escape a column separator with a backslash. For example, assuming that a comma is the separator, the following line

1,abc,"a,b,c","\"x\"",, z ," z  "

will be parsed as:

1.0 abc a,b,c "x"   _z_ _z__

where _ denotes a space and the fifth value is an empty string. Note that "\"x\"" is parsed as "x", since \" inside an already quoted value translates to ".
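The same line can be split with Python's standard csv module, used here only as a stand-in to illustrate the quoting and escaping rules; it is not Seahorse's parser. The function parse_line is hypothetical. Unlike Seahorse, the csv module performs no type inference, so the first value stays the string 1 rather than becoming 1.0.

```python
import csv
from typing import List


def parse_line(line: str, separator: str = ",") -> List[str]:
    """Split one CSV line, honouring double quotes and backslash escapes."""
    reader = csv.reader([line], delimiter=separator, quotechar='"', escapechar="\\")
    return next(reader)


fields = parse_line(r'1,abc,"a,b,c","\"x\"",, z ," z  "')
for position, value in enumerate(fields, start=1):
    # repr() makes the surrounding spaces and the empty fifth value visible.
    print(position, repr(value))
```

Printing with repr shows that spaces are preserved both in the unquoted value and in the quoted one.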

PARQUET

Parquet

Note that Parquet format does not allow using any of the characters , ;{}()\n\t= in column names.
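Column names can be checked against this restriction before writing. The sketch below reads the character list as comma, space, semicolon, braces, parentheses, newline, tab and equals sign; the function names and the choice of an underscore as the replacement are assumptions.

```python
from typing import List

# Characters that may not appear in Parquet column names, per the note above.
FORBIDDEN = set(" ,;{}()\n\t=")


def invalid_parquet_chars(column_name: str) -> List[str]:
    """Return the forbidden characters found in a column name, sorted."""
    return sorted({ch for ch in column_name if ch in FORBIDDEN})


def sanitize_column_name(column_name: str, replacement: str = "_") -> str:
    """Replace every forbidden character (hypothetical clean-up strategy)."""
    return "".join(replacement if ch in FORBIDDEN else ch for ch in column_name)
```

For instance, a column called price (USD) contains a space and two parentheses, and this sketch would rename it to price__USD_.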

JSON

JSON

Note that JSON file format does not preserve the order of columns.

When saving a DataFrame, Seahorse converts Timestamp columns to String type (the values of those columns are converted to their string representations by Apache Spark).

Null values are omitted from the JSON output. This might result in a schema mismatch if all values in a particular column are null, because that column will be missing from the output JSON file entirely.
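The effect of omitting nulls can be imitated with Python's json module. This is only an illustration of the behaviour described above, not Seahorse or Spark code: row_to_json is a hypothetical function, the sample rows are made up, and the timestamp string produced here by str() may differ from the representation Spark writes.

```python
import json
from datetime import datetime


def row_to_json(row: dict) -> str:
    """Serialize one row, dropping nulls and turning timestamps into strings."""
    out = {}
    for key, value in row.items():
        if value is None:
            continue  # null values are simply left out
        out[key] = str(value) if isinstance(value, datetime) else value
    return json.dumps(out)


rows = [
    {"id": 1, "note": None, "seen": datetime(2017, 1, 1, 12, 0)},
    {"id": 2, "note": None, "seen": None},
]
lines = [row_to_json(row) for row in rows]
# Columns that survive in at least one output line.
columns_in_output = set().union(*(json.loads(line).keys() for line in lines))
print(sorted(columns_in_output))
```

The note column is null in every row, so it never appears in the output, and a reader of the file has no way to learn that the column existed.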