How to search through a given path and list all files that end with .txt?


This recipe shows how to search through a given path and list all files that end with .txt.


Recipe Objective

This recipe lists files in the local file system using Python. The code is reusable and can be used to automate data processes with orchestration tools such as Airflow. It lists the files with a given extension at a given location in HDFS or the local file system, and the resulting list can then be passed into a dataframe for further analysis such as cleaning and validation, as sketched in the example below.
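
For illustration only, here is a minimal sketch of that end-to-end idea: list the .txt files in a directory and load them into a pandas DataFrame. The directory path data_dir and the use of pandas are assumptions for this sketch, not part of the original recipe.

import glob
import os

import pandas as pd  # assumption: pandas is available for the dataframe step

# Hypothetical directory path for illustration; replace with your own location.
data_dir = '.'

# Collect all .txt files in the directory.
txt_files = glob.glob(os.path.join(data_dir, '*.txt'))

# Load each file name and its contents into a dataframe
# for further analysis (cleaning, validation, etc.).
records = []
for f in txt_files:
    with open(f) as fh:
        records.append({'file_name': os.path.basename(f), 'content': fh.read()})

df = pd.DataFrame(records)
print(df.head())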

System requirements:

  • Python 3 should be installed on your system.
  • The os, sys, and glob modules used below are part of the Python standard library, so no separate pip installation is required.
  • The below code can be run in a Jupyter notebook or any Python console.

Step 1: Using the 'OS' library

The os module provides a portable way of using operating system dependent functionality. The os.listdir() method in Python returns a list of all files and directories in the specified directory. If no directory is specified, the files and directories in the current working directory are returned.

Simple code to list the files in the current directory:

import os, sys

# '.' means the current directory; you can give a directory path
# in between the single quotes instead.
dirs = os.listdir('.')

# This prints all the files and directories.
for file in dirs:
    print(file)

It will return a list containing the names of the entries in the directory given by path.

Output of the above code:
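
Since os.listdir() mixes files and directories in the same list, a small sketch like the one below can separate the two; this helper logic is an illustrative assumption, not part of the original recipe.

import os

# Separate plain files from directories in the current directory.
entries = os.listdir('.')
files = [e for e in entries if os.path.isfile(e)]
dirs = [e for e in entries if os.path.isdir(e)]

print("Files:", files)
print("Directories:", dirs)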

Step 2: Using the 'glob' library

glob is primarily a filename pattern-matching library, but it can also be used to list the items in the current directory, as follows:

# Importing the glob library
import glob

# Path to the directory
path = ''  # or path = './'
# You can pass a directory path in between the single quotes;
# '' or './' means the current directory.

# Extract the list of filenames
files = glob.glob(path + '*', recursive=False)

# Loop to print the filenames
for filename in files:
    print(filename)

Output of the above code:

The wildcard character '*' is used to match all the items in the current directory. Since we only want to display the items of the current directory, the recursive behaviour of the glob() function is switched off. If you do want to descend into subdirectories, the '**' pattern with recursive=True can be used instead, as in the sketch below.
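
As a point of contrast, here is a minimal sketch of the recursive mode, assuming you want to match .txt files in the current directory and all of its subdirectories.

import glob

# '**' matches this directory and all subdirectories when recursive=True.
txt_files = glob.glob('**/*.txt', recursive=True)

for filename in txt_files:
    print(filename)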

Step 3: Using the string methods

Simple code to search for and print filenames with a specific extension in the current directory, using the string methods .startswith() and .endswith(). The code below uses .endswith(); a companion .startswith() sketch follows the output.

import os

# Get .txt files
for f_name in os.listdir('.'):
    if f_name.endswith('.txt'):
        print(f_name)

Output of the above code:
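
A companion sketch with .startswith() could look like the following, where the 'data_' prefix is just an illustrative assumption.

import os

# Print files in the current directory whose names start with a given prefix.
for f_name in os.listdir('.'):
    if f_name.startswith('data_'):  # hypothetical prefix; replace with your own
        print(f_name)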

Step 4: List files in a directory with a specific extension and a given path

import os

req_path = input("Enter your dir path :")

if os.path.isfile(req_path):
    print(f"the given path {req_path} is a file. please pass only dir path")
else:
    all_f_dir = os.listdir(req_path)
    if len(all_f_dir) == 0:
        print(f"the given path {req_path} is an empty path")
    else:
        req_ext = input("Enter the required files extension")
        req_files = []
        print(f"Listing all {req_ext} files")
        for each_f in all_f_dir:
            if each_f.endswith(req_ext):
                req_files.append(each_f)
                # print(each_f)
        if len(req_files) == 0:
            print(f"No {req_ext} files in the location of {req_path}")
        else:
            print(f"there are {len(req_files)} files in the location of {req_path} with an extension of {req_ext}")
            print(f"so, the files are: {req_files}")

Output of the above code: Here it prints the list of files with the .txt extension.
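
As an alternative to the manual loop above, here is a minimal sketch that combines a given path and extension using glob; the example path and extension values are assumptions for illustration.

import glob
import os

# Hypothetical inputs; replace with your own directory and extension.
req_path = '.'
req_ext = '.txt'

# Build the pattern, e.g. './*.txt', and let glob do the matching.
req_files = glob.glob(os.path.join(req_path, '*' + req_ext))

print(f"there are {len(req_files)} {req_ext} files in {req_path}")
print(f"so, the files are: {req_files}")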
