hipscat.io.file_io

Package Contents

Functions

delete_file(file_handle[, storage_options])

Deletes file from filesystem.

load_csv_to_pandas(→ pandas.DataFrame)

Load a csv file to a pandas dataframe

load_json_file(→ dict)

Load a json file to a dictionary

load_parquet_to_pandas(→ pandas.DataFrame)

Load a parquet file to a pandas dataframe

load_text_file(file_pointer[, encoding, storage_options])

Load a text file content to a list of strings.

make_directory(file_pointer[, exist_ok, storage_options])

Make a directory at a given file pointer

read_fits_image(map_file_pointer[, storage_options])

Read the object spatial distribution information from a healpix FITS file.

read_parquet_dataset(dir_pointer, storage_options, ...)

Read parquet dataset from directory pointer.

read_parquet_file(file_pointer[, storage_options])

Read parquet file from file pointer.

read_parquet_file_to_pandas(→ pandas.DataFrame)

Reads a parquet file to a pandas DataFrame

read_parquet_metadata(→ pyarrow.parquet.FileMetaData)

Read FileMetaData from footer of a single Parquet file.

remove_directory(file_pointer[, ignore_errors, ...])

Remove a directory, and all contents, recursively.

write_dataframe_to_csv(dataframe, file_pointer[, ...])

Write a pandas DataFrame to a CSV file

write_dataframe_to_parquet(dataframe, file_pointer[, ...])

Write a pandas DataFrame to a parquet file

write_fits_image(histogram, map_file_pointer[, ...])

Write the object spatial distribution information to a healpix FITS file.

write_parquet_metadata(schema, file_pointer[, ...])

Write a metadata only parquet file from a schema

write_string_to_file(file_pointer, string[, encoding, ...])

Write a string to a text file

append_paths_to_pointer(→ FilePointer)

Append directories and/or a file name to a specified file pointer.

directory_has_contents(→ bool)

Checks if a directory already has some contents (any files or subdirectories)

does_file_or_directory_exist(→ bool)

Checks if a file or directory exists for a given file pointer

find_files_matching_path(→ List[FilePointer])

Find files or directories matching the provided path parts.

get_basename_from_filepointer(→ str)

Returns the base name of a regular file. May return empty string if the file is a directory.

get_directory_contents(→ List[FilePointer])

Finds all files and directories in the specified directory.

get_file_pointer_for_fs(→ FilePointer)

Creates the filesystem path from the file_pointer, with the protocol stripped.

get_file_pointer_from_path(→ FilePointer)

Returns a file pointer from a path string

get_file_protocol(→ str)

Method to parse filepointer for the filesystem protocol.

is_regular_file(→ bool)

Checks if a regular file (NOT a directory) exists for a given file pointer.

strip_leading_slash_for_pyarrow(→ FilePointer)

Strips the leading slash for pyarrow read/write functions.

Attributes

FilePointer

Unified type for references to files.

delete_file(file_handle: hipscat.io.file_io.file_pointer.FilePointer, storage_options: Dict[Any, Any] | None = None)

Deletes file from filesystem.

Parameters:
  • file_handle – location of the file to delete

  • storage_options – dictionary that contains filesystem credentials

load_csv_to_pandas(file_pointer: hipscat.io.file_io.file_pointer.FilePointer, storage_options: Dict[Any, Any] | None = None, **kwargs) → pandas.DataFrame

Load a csv file to a pandas dataframe

Parameters:
  • file_pointer – location of csv file to load

  • storage_options – dictionary that contains abstract filesystem credentials

  • **kwargs – arguments to pass to pandas read_csv loading method

Returns:

pandas dataframe loaded from CSV

load_json_file(file_pointer: hipscat.io.file_io.file_pointer.FilePointer, encoding: str = 'utf-8', storage_options: Dict[Any, Any] | None = None) → dict

Load a json file to a dictionary

Parameters:
  • file_pointer – location of file to read

  • encoding – string encoding method used by the file

  • storage_options – dictionary that contains abstract filesystem credentials

Returns:

dictionary of key value pairs loaded from the JSON file

load_parquet_to_pandas(file_pointer: hipscat.io.file_io.file_pointer.FilePointer, storage_options: Dict[Any, Any] | None = None, **kwargs) → pandas.DataFrame

Load a parquet file to a pandas dataframe

Parameters:
  • file_pointer – location of parquet file to load

  • storage_options – dictionary that contains abstract filesystem credentials

  • **kwargs – arguments to pass to pandas read_parquet loading method

Returns:

pandas dataframe loaded from parquet

load_text_file(file_pointer: hipscat.io.file_io.file_pointer.FilePointer, encoding: str = 'utf-8', storage_options: Dict[Any, Any] | None = None)

Load a text file content to a list of strings.

Parameters:
  • file_pointer – location of file to read

  • encoding – string encoding method used by the file

  • storage_options – dictionary that contains abstract filesystem credentials

Returns:

text contents of the file, as a list of strings.

make_directory(file_pointer: hipscat.io.file_io.file_pointer.FilePointer, exist_ok: bool = False, storage_options: Dict[Any, Any] | None = None)

Make a directory at a given file pointer

Raises an error if the directory already exists, unless exist_ok is True, in which case any existing directory is left unmodified.

Parameters:
  • file_pointer – location in file system to make directory

  • exist_ok – Default False. If False, an error is raised when the directory already exists. If True, an existing directory is ignored and not modified.

  • storage_options – dictionary that contains abstract filesystem credentials

Raises:

OSError
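The exist_ok behaviour described above can be illustrated with a local-only stand-in. This is a minimal sketch, not the hipscat implementation: make_directory_sketch is a hypothetical name, it uses os.makedirs on the local filesystem, and it ignores storage_options entirely.

```python
import os
import tempfile


def make_directory_sketch(path: str, exist_ok: bool = False) -> None:
    """Hypothetical local stand-in that mirrors the exist_ok flag."""
    # os.makedirs raises FileExistsError (a subclass of OSError) when the
    # directory exists and exist_ok is False.
    os.makedirs(path, exist_ok=exist_ok)


with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "catalog")
    make_directory_sketch(target)                 # first call creates the directory
    make_directory_sketch(target, exist_ok=True)  # existing directory is left alone
    try:
        make_directory_sketch(target)             # default exist_ok=False
        raised = False
    except OSError:
        raised = True
```

With the default exist_ok=False, the third call hits the already existing directory and the OSError branch is taken.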

read_fits_image(map_file_pointer: hipscat.io.file_io.file_pointer.FilePointer, storage_options: Dict[Any, Any] | None = None)

Read the object spatial distribution information from a healpix FITS file.

Parameters:
  • map_file_pointer – location of file to be read

  • storage_options – dictionary that contains abstract filesystem credentials

Returns:

one-dimensional numpy array of long integers where the value at each index corresponds to the number of objects found at the healpix pixel.

read_parquet_dataset(dir_pointer: hipscat.io.file_io.file_pointer.FilePointer, storage_options: Dict[Any, Any] | None = None, **kwargs)

Read parquet dataset from directory pointer.

Note that pyarrow.dataset reads require that directory pointers don’t contain a leading slash, and the protocol prefix may additionally be removed. As such, we also return the directory path that is formatted for pyarrow ingestion for follow-up.

Parameters:
  • dir_pointer – location of the directory to read the dataset from

  • storage_options – dictionary that contains abstract filesystem credentials

Returns:

Tuple containing a path to the dataset (that is formatted for pyarrow ingestion) and the dataset read from disk.

read_parquet_file(file_pointer: hipscat.io.file_io.file_pointer.FilePointer, storage_options: Dict[Any, Any] | None = None)

Read parquet file from file pointer.

Parameters:
  • file_pointer – location of the parquet file to read

  • storage_options – dictionary that contains abstract filesystem credentials

read_parquet_file_to_pandas(file_pointer: hipscat.io.file_io.file_pointer.FilePointer, storage_options: Dict[Any, Any] | None = None, **kwargs) → pandas.DataFrame

Reads a parquet file to a pandas DataFrame

Parameters:
  • file_pointer (FilePointer) – File Pointer to a parquet file

  • storage_options – dictionary that contains abstract filesystem credentials

  • **kwargs – Additional arguments to pass to pandas read_parquet method

Returns:

Pandas DataFrame with the data from the parquet file

read_parquet_metadata(file_pointer: hipscat.io.file_io.file_pointer.FilePointer, storage_options: Dict[Any, Any] | None = None, **kwargs) → pyarrow.parquet.FileMetaData

Read FileMetaData from footer of a single Parquet file.

Parameters:
  • file_pointer – location of file to read metadata from

  • storage_options – dictionary that contains abstract filesystem credentials

  • **kwargs – additional arguments to be passed to pyarrow.parquet.read_metadata

remove_directory(file_pointer: hipscat.io.file_io.file_pointer.FilePointer, ignore_errors=False, storage_options: Dict[Any, Any] | None = None)

Remove a directory, and all contents, recursively.

Parameters:
  • file_pointer – directory in file system to remove

  • ignore_errors – if True errors resulting from failed removals will be ignored

  • storage_options – dictionary that contains abstract filesystem credentials
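As a rough illustration of recursive removal and the ignore_errors flag, the following sketch uses shutil.rmtree on a local temporary directory. remove_directory_sketch is a hypothetical stand-in, not the library's code, and it does not handle remote filesystems or storage_options.

```python
import os
import shutil
import tempfile


def remove_directory_sketch(path: str, ignore_errors: bool = False) -> None:
    """Hypothetical local stand-in: remove a directory and everything below it."""
    shutil.rmtree(path, ignore_errors=ignore_errors)


with tempfile.TemporaryDirectory() as tmp_dir:
    catalog_dir = os.path.join(tmp_dir, "catalog")
    os.makedirs(os.path.join(catalog_dir, "Norder=0", "Dir=0"))

    remove_directory_sketch(catalog_dir)
    removed = not os.path.exists(catalog_dir)

    # The directory is already gone; with ignore_errors=True the failed
    # removal is silently skipped instead of raising.
    remove_directory_sketch(catalog_dir, ignore_errors=True)
```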

write_dataframe_to_csv(dataframe: pandas.DataFrame, file_pointer: hipscat.io.file_io.file_pointer.FilePointer, storage_options: Dict[Any, Any] | None = None, **kwargs)

Write a pandas DataFrame to a CSV file

Parameters:
  • dataframe – DataFrame to write

  • file_pointer – location of file to write to

  • storage_options – dictionary that contains abstract filesystem credentials

  • **kwargs – args to pass to pandas to_csv method

write_dataframe_to_parquet(dataframe: pandas.DataFrame, file_pointer, storage_options: Dict[Any, Any] | None = None)

Write a pandas DataFrame to a parquet file

Parameters:
  • dataframe – DataFrame to write

  • file_pointer – location of file to write to

  • storage_options – dictionary that contains abstract filesystem credentials

write_fits_image(histogram: numpy.ndarray, map_file_pointer: hipscat.io.file_io.file_pointer.FilePointer, storage_options: Dict[Any, Any] | None = None)

Write the object spatial distribution information to a healpix FITS file.

Parameters:
  • histogram (np.ndarray) – one-dimensional numpy array of long integers where the value at each index corresponds to the number of objects found at the healpix pixel.

  • map_file_pointer – location of file to be written

  • storage_options – dictionary that contains abstract filesystem credentials

write_parquet_metadata(schema: Any, file_pointer: hipscat.io.file_io.file_pointer.FilePointer, metadata_collector: list | None = None, storage_options: Dict[Any, Any] | None = None, **kwargs)

Write a metadata only parquet file from a schema

Parameters:
  • schema – schema to be written

  • file_pointer – location of file to be written to

  • metadata_collector – where to collect metadata information

  • storage_options – dictionary that contains abstract filesystem credentials

  • **kwargs – additional arguments to be passed to pyarrow.parquet.write_metadata

write_string_to_file(file_pointer: hipscat.io.file_io.file_pointer.FilePointer, string: str, encoding: str = 'utf-8', storage_options: Dict[Any, Any] | None = None)

Write a string to a text file

Parameters:
  • file_pointer – file location to write file to

  • string – string to write to file

  • encoding – Default: ‘utf-8’, encoding method to write to file with

  • storage_options – dictionary that contains abstract filesystem credentials

FilePointer

Unified type for references to files.

append_paths_to_pointer(pointer: FilePointer, *paths: str) → FilePointer

Append directories and/or a file name to a specified file pointer.

Parameters:
  • pointer – FilePointer object to add path to

  • paths – any number of directory names optionally followed by a file name to append to the pointer

Returns:

New file pointer to path given by joining given pointer and path names
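The joining behaviour can be pictured with a plain-string sketch. It assumes forward-slash separated paths and uses posixpath.join; append_paths_sketch and the example path segments are hypothetical, and the real function works on FilePointer values rather than bare strings.

```python
import posixpath


def append_paths_sketch(pointer: str, *paths: str) -> str:
    """Hypothetical stand-in: join a base pointer with further path segments."""
    return posixpath.join(pointer, *paths)


# Any number of directory names, optionally followed by a file name.
pixel_file = append_paths_sketch("/data/catalog", "Norder=0", "Dir=0", "Npix=11.parquet")
```

The base pointer is not modified; a new path string is returned, matching the "new file pointer" wording above.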

directory_has_contents(pointer: FilePointer, storage_options: Dict[Any, Any] | None = None) → bool

Checks if a directory already has some contents (any files or subdirectories)

Parameters:
  • pointer – File Pointer to check for existing contents

  • storage_options – dictionary that contains abstract filesystem credentials

Returns:

True if there are any files or subdirectories below this directory.
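A local-only sketch of this check, assuming the directory exists on the local filesystem. directory_has_contents_sketch is a hypothetical helper built on os.scandir and is not the library's implementation.

```python
import os
import tempfile


def directory_has_contents_sketch(path: str) -> bool:
    """Hypothetical stand-in: True if the directory holds any file or subdirectory."""
    with os.scandir(path) as entries:
        # Stop at the first entry found; no need to list everything.
        return any(True for _ in entries)


with tempfile.TemporaryDirectory() as tmp_dir:
    empty_result = directory_has_contents_sketch(tmp_dir)
    os.makedirs(os.path.join(tmp_dir, "Norder=0"))
    filled_result = directory_has_contents_sketch(tmp_dir)
```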

does_file_or_directory_exist(pointer: FilePointer, storage_options: Dict[Any, Any] | None = None) → bool

Checks if a file or directory exists for a given file pointer

Parameters:
  • pointer – File Pointer at which to check for a file or directory

  • storage_options – dictionary that contains abstract filesystem credentials

Returns:

True if file or directory at pointer exists, False if not

find_files_matching_path(pointer: FilePointer, *paths: str, include_protocol=False, storage_options: Dict[Any, Any] | None = None) → List[FilePointer]

Find files or directories matching the provided path parts.

Parameters:
  • pointer – base File Pointer in which to find contents

  • paths – any number of directory names optionally followed by a file name. Directory or file names may be replaced with * as a wildcard.

  • include_protocol – whether to include the filesystem protocol in the returned file pointers

  • storage_options – dictionary that contains abstract filesystem credentials

Returns:

New file pointers to files found matching the path
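The wildcard matching described above resembles glob matching on joined path parts. The sketch below is a local-filesystem approximation using the standard glob module; find_matching_sketch and the directory layout are hypothetical, and include_protocol and storage_options are not modelled.

```python
import glob
import os
import tempfile


def find_matching_sketch(base: str, *paths: str) -> list:
    """Hypothetical stand-in: glob for entries under base that match the path parts."""
    # Escape the base so only the supplied parts act as patterns.
    pattern = os.path.join(glob.escape(base), *paths)
    return sorted(glob.glob(pattern))


with tempfile.TemporaryDirectory() as tmp_dir:
    for order in ("Norder=0", "Norder=1"):
        os.makedirs(os.path.join(tmp_dir, order))
        with open(os.path.join(tmp_dir, order, "Npix=0.parquet"), "w", encoding="utf-8"):
            pass  # an empty placeholder file is enough for matching

    matches = find_matching_sketch(tmp_dir, "*", "Npix=0.parquet")
    relative_matches = [os.path.relpath(match, tmp_dir) for match in matches]
```

Here the * stands in for a directory name, so both Norder directories contribute one match each.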

get_basename_from_filepointer(pointer: FilePointer) → str

Returns the base name of a regular file. May return empty string if the file is a directory.

Parameters:

pointer – FilePointer object to find a basename within

Returns:

string representation of the basename of a file.
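The note about directories returning an empty string matches how basename works on paths with a trailing slash. A small sketch with posixpath.basename on plain strings follows; basename_sketch is a hypothetical name, and whether the library behaves identically for every input is an assumption.

```python
import posixpath


def basename_sketch(pointer: str) -> str:
    """Hypothetical stand-in: final path component, empty for a trailing slash."""
    return posixpath.basename(pointer)


file_name = basename_sketch("/data/catalog/Npix=11.parquet")
directory_name = basename_sketch("/data/catalog/")
```

The first call yields the file name; the second yields an empty string because the path ends in a slash.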

get_directory_contents(pointer: FilePointer, include_protocol=False, storage_options: Dict[Any, Any] | None = None) → List[FilePointer]

Finds all files and directories in the specified directory.

NB: This is not recursive, and will return only the first level of directory contents.

Parameters:
  • pointer – File Pointer in which to find contents

  • include_protocol – whether to include the filesystem protocol in the returned directory contents

  • storage_options – dictionary that contains abstract filesystem credentials

Returns:

New file pointers to files or subdirectories below this directory.
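To make the non-recursive behaviour concrete, here is a local-only sketch built on os.listdir. directory_contents_sketch and the file names are hypothetical, and the ordering of results in the real function is not specified in this page (the sketch sorts them for readability).

```python
import os
import tempfile


def directory_contents_sketch(path: str) -> list:
    """Hypothetical stand-in: first-level entries of a directory, as full paths."""
    return sorted(os.path.join(path, name) for name in os.listdir(path))


with tempfile.TemporaryDirectory() as tmp_dir:
    os.makedirs(os.path.join(tmp_dir, "Norder=0", "Dir=0"))
    with open(os.path.join(tmp_dir, "catalog_info.json"), "w", encoding="utf-8") as out_file:
        out_file.write("{}")

    names = [os.path.basename(entry) for entry in directory_contents_sketch(tmp_dir)]
```

Only Norder=0 and catalog_info.json are listed; the nested Dir=0 directory is one level deeper and is not returned.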

get_file_pointer_for_fs(protocol: str, file_pointer: FilePointer) → FilePointer

Creates the filesystem path from the file_pointer.

This will strip the protocol so that the file_pointer can be accessed from the filesystem:

  • abfs filesystems DO NOT require the account_name in the pathway

  • s3 filesystems DO require the account_name/container name in the pathway

Parameters:
  • protocol – str filesystem protocol, file, abfs, or s3

  • file_pointer – filesystem pathway
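Stripping the protocol prefix might look roughly like the sketch below. strip_protocol_sketch is a hypothetical helper operating on plain strings; it only removes a leading "protocol://" and does not model the abfs or s3 account and container details mentioned above.

```python
def strip_protocol_sketch(protocol: str, file_pointer: str) -> str:
    """Hypothetical stand-in: drop a leading 'protocol://' from a path string."""
    prefix = f"{protocol}://"
    if file_pointer.startswith(prefix):
        return file_pointer[len(prefix):]
    # Assumption: paths without the prefix are returned unchanged.
    return file_pointer


s3_path = strip_protocol_sketch("s3", "s3://bucket/catalog")
local_path = strip_protocol_sketch("file", "/data/catalog")
```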

get_file_pointer_from_path(path: str, include_protocol: str = None) → FilePointer

Returns a file pointer from a path string

get_file_protocol(pointer: FilePointer) → str

Method to parse filepointer for the filesystem protocol. If it doesn’t follow the pattern of protocol://pathway/to/file, then it assumes that it is a local filesystem.

Parameters:

pointer – filesystem pathway pointer
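The parsing rule can be sketched on plain strings as follows. parse_protocol_sketch is a hypothetical name, and returning "file" for local paths is an assumption based on the protocol names listed for get_file_pointer_for_fs (file, abfs, s3).

```python
def parse_protocol_sketch(pointer: str) -> str:
    """Hypothetical stand-in: extract the protocol from 'protocol://pathway/to/file'."""
    if "://" not in pointer:
        # Assumption: anything without a protocol prefix is a local path.
        return "file"
    return pointer.split("://", 1)[0]


protocols = [
    parse_protocol_sketch(pointer)
    for pointer in ("s3://bucket/catalog", "abfs://container/catalog", "/data/catalog")
]
```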

is_regular_file(pointer: FilePointer, storage_options: Dict[Any, Any] | None = None) → bool

Checks if a regular file (NOT a directory) exists for a given file pointer.

Parameters:
  • pointer – File Pointer to check for a regular file

  • storage_options – dictionary that contains abstract filesystem credentials

Returns:

True if a regular file exists at pointer; False if nothing exists there or if it is a directory
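The difference between this check and does_file_or_directory_exist can be shown with local stand-ins. exists_sketch and is_regular_file_sketch are hypothetical helpers built on os.path, not the library's code, and they cover local paths only.

```python
import os
import tempfile


def exists_sketch(path: str) -> bool:
    """Hypothetical stand-in: True for any existing file or directory."""
    return os.path.exists(path)


def is_regular_file_sketch(path: str) -> bool:
    """Hypothetical stand-in: True only for an existing regular file."""
    return os.path.isfile(path)


with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "catalog_info.json")
    with open(file_path, "w", encoding="utf-8") as out_file:
        out_file.write("{}")
    missing_path = os.path.join(tmp_dir, "missing.json")

    # Each entry holds (exists, is_regular_file) for one kind of path.
    results = {
        "file": (exists_sketch(file_path), is_regular_file_sketch(file_path)),
        "directory": (exists_sketch(tmp_dir), is_regular_file_sketch(tmp_dir)),
        "missing": (exists_sketch(missing_path), is_regular_file_sketch(missing_path)),
    }
```

A directory passes the existence check but not the regular-file check; a missing path fails both.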

strip_leading_slash_for_pyarrow(pointer: FilePointer, protocol: str) → FilePointer

Strips the leading slash for pyarrow read/write functions. This is required for pyarrow’s underlying filesystem abstraction.

Parameters:

pointer – FilePointer object

Returns:

New file pointer with leading slash removed.
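A plain-string sketch of the idea follows. strip_leading_slash_sketch is a hypothetical helper; leaving local ("file") paths untouched is an assumption about how the protocol argument is used, not something this page states.

```python
def strip_leading_slash_sketch(pointer: str, protocol: str) -> str:
    """Hypothetical stand-in: drop one leading slash for non-local protocols."""
    # Assumption: local paths need their leading slash to stay absolute.
    if protocol != "file" and pointer.startswith("/"):
        return pointer[1:]
    return pointer


remote_path = strip_leading_slash_sketch("/bucket/catalog", "s3")
local_path = strip_leading_slash_sketch("/data/catalog", "file")
```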