
Data scraping with Python scripts (Pandas).

This article presents a Python script that uses the pandas_datareader library to download data from various sources. In particular, we will focus on downloading price data for shares listed on the New York Stock Exchange (NYSE) via Yahoo Finance. If you’re interested in learning more about pandas_datareader, please visit its official website.

Libraries.

To begin with, we need to install two libraries – pandas_datareader and schedule – which we can do via pip:


pip install pandas-datareader
pip install schedule

We’ll also need to import datetime, pandas_datareader, and schedule for our script:

import datetime as dt 
from pandas_datareader import data as pdr
import schedule

Next, we need to specify the stock tickers we want to download. We can either define them directly as a list, like so:


stocks = ['GAPL', 'GIPL']

Or we can save all the tickers in a file. The file should contain a Python list literal such as ['GAPL', 'GIPL'], written with straight quotes. We can then open the file and read the tickers like this:


import ast  # literal_eval reads the list without executing file contents, unlike eval

with open(r'C:\folder_name\file_name.txt') as file:
    stocks = ast.literal_eval(file.read())
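
Because the ticker file is plain text that anyone could edit, it may be worth checking its contents before using them. The sketch below assumes a hypothetical helper named load_tickers (not part of the original script) that parses the file with ast.literal_eval and rejects anything that is not a list of strings. The temporary file only stands in for the real ticker file.

```python
import ast
import tempfile
from pathlib import Path


def load_tickers(path):
    """Read a Python-style list of ticker strings from a text file."""
    text = Path(path).read_text(encoding="utf-8")
    # literal_eval only accepts literals (lists, strings, numbers, ...),
    # so code placed in the file is rejected instead of executed.
    tickers = ast.literal_eval(text)
    if not isinstance(tickers, list) or not all(isinstance(t, str) for t in tickers):
        raise ValueError("expected a list of ticker strings")
    # Normalise whitespace and case so 'gipl ' and 'GIPL' are treated alike.
    return [t.strip().upper() for t in tickers]


# Demo with a temporary file standing in for the real ticker file.
with tempfile.TemporaryDirectory() as tmp:
    ticker_file = Path(tmp) / "tickers.txt"
    ticker_file.write_text("['GAPL', 'gipl ']", encoding="utf-8")
    print(load_tickers(ticker_file))
```

In the script itself, the call would simply be stocks = load_tickers(r'C:\folder_name\file_name.txt').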

We also need to specify the path and file name where we want to save the downloaded data:


path = 'C:\\folder_name\\'  # ends with one separator so that path + file_name is a valid path
file_name = 'file_name.csv'
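
Joining paths by string concatenation works only as long as the directory ends with a separator. As an alternative sketch, the standard pathlib module can insert the separator for us. PureWindowsPath is used here only so that the example behaves the same on any operating system; the folder and file names are the placeholders from the text above.

```python
from pathlib import PureWindowsPath

# PureWindowsPath applies Windows path rules on any operating system,
# which keeps this example reproducible; a script that actually runs on
# Windows could use pathlib.Path in the same way.
directory = PureWindowsPath(r"C:\folder_name")
file_name = "file_name.csv"

# The / operator inserts exactly one separator between the parts.
single_file = directory / file_name
print(single_file)

# One file per ticker: build each name from the ticker symbol.
for ticker in ["GAPL", "GIPL"]:
    print(directory / f"{ticker}.csv")
```

A path object built this way can be passed directly to to_csv in place of path + file_name.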

Now we need to specify the time period for which we want to download the data. We can either specify a fixed period of time:


start = dt.datetime(2022, 1, 18)
end = dt.datetime(2022, 12, 22)

Or we can set the start date to be a certain number of days before the end date. In the example below, we set the start date to be yesterday:


end = dt.datetime.now()
start = end - dt.timedelta(days=1)
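
One thing to keep in mind with a relative start date is that exchanges do not trade on weekends, so "yesterday" may contain no data. The sketch below uses two hypothetical helpers (previous_weekday and date_window, neither of which is in the original script) to show how the window could be computed with datetime alone. It ignores exchange holidays, which would need a separate calendar.

```python
import datetime as dt


def previous_weekday(day):
    """Return the latest Monday-to-Friday date strictly before `day`."""
    candidate = day - dt.timedelta(days=1)
    # weekday() is 0 for Monday ... 6 for Sunday, so 5 and 6 are the weekend.
    while candidate.weekday() >= 5:
        candidate -= dt.timedelta(days=1)
    return candidate


def date_window(end, days):
    """Return (start, end) where start lies `days` days before end."""
    return end - dt.timedelta(days=days), end


# 19 December 2022 was a Monday, so the previous weekday is Friday the 16th.
print(previous_weekday(dt.date(2022, 12, 19)))

start, end = date_window(dt.datetime(2022, 12, 22), days=7)
print(start, end)
```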

Finally, we can use pandas_datareader to download the data. We can either create one file for all the data, or save each ticker in a separate file. Here is an example of how to save all the data in one file:


# Passing the whole list downloads every ticker in one call, so no loop is needed
df = pdr.get_data_yahoo(symbols=stocks, start=start, end=end, interval='d').stack("Symbols")
df.to_csv(path + file_name)

And here is an example of how to save each ticker in a separate file:


for item in stocks:
    df = pdr.get_data_yahoo(item, start=start, end=end, interval='d')
    df.to_csv(path + item + '.csv')  # save the downloaded DataFrame, not the ticker list
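
When each ticker is downloaded separately, a single failing ticker does not have to stop the whole run. The following is a minimal sketch of that idea, not the author's implementation: download_all and fake_fetch are hypothetical names, and fake_fetch merely simulates a download so that the example runs without a network connection.

```python
def download_all(tickers, fetch):
    """Call fetch(ticker) for every ticker; return (results, failed)."""
    results = {}
    failed = []
    for ticker in tickers:
        try:
            results[ticker] = fetch(ticker)
        except Exception as exc:  # broad on purpose: any failure means "retry later"
            print(f"{ticker}: download failed ({exc})")
            failed.append(ticker)
    return results, failed


def fake_fetch(ticker):
    """Stand-in for a real download; fails for one made-up ticker."""
    if ticker == "BAD":
        raise ValueError("no data returned")
    return f"rows for {ticker}"


results, failed = download_all(["GAPL", "BAD", "GIPL"], fake_fetch)
print(sorted(results), failed)
```

In the real script, fetch could be a small function that wraps pdr.get_data_yahoo for one ticker, and the failed list could be written to a file so that a later run retries only those tickers.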

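Once we have defined all these parameters, we can wrap the code in a function (named nyse in the full script below) and automate it using the schedule library. For example, we can run the function every 40 seconds or at a specific time each day. The jobs only run while a loop keeps calling schedule.run_pending(), as in the final example of this article:

schedule.every(40).seconds.do(nyse)
schedule.every().day.at('01:00').do(nyse)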
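
To make the "at a specific time each day" rule more concrete, here is a small standard-library sketch that computes how long to wait until the next occurrence of a given clock time. It is only an illustration of the idea, with a hypothetical function name (seconds_until), and is not how the schedule library is implemented.

```python
import datetime as dt


def seconds_until(now, hour, minute):
    """Seconds from `now` until the next occurrence of hour:minute."""
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        # Today's slot has already passed, so aim for tomorrow.
        target += dt.timedelta(days=1)
    return (target - now).total_seconds()


# Half an hour before 01:00 on the same day.
print(seconds_until(dt.datetime(2022, 12, 22, 0, 30), 1, 0))
# One hour after 01:00, so the next run is tomorrow.
print(seconds_until(dt.datetime(2022, 12, 22, 2, 0), 1, 0))
```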

Here is an example of how the script might look when fully implemented:


from pandas_datareader import data as pdr
import ast
import datetime as dt
import schedule

def nyse():
    # Read the ticker list; literal_eval avoids executing the file contents as code
    with open(r'C:\Pdata\NYSE\OUT\nyse_ticker.txt') as file:
        stocks = ast.literal_eval(file.read())
    directory = 'C:\\Pdata\\NYSE\\IN\\'
    file_name = 'nyse.csv'
    end = dt.datetime.now()
    start = dt.datetime(2022, 11, 4)
    # One call downloads all tickers, so no loop over stocks is needed
    df = pdr.get_data_yahoo(symbols=stocks, start=start, end=end).stack("Symbols")
    df.to_csv(directory + file_name)
    print('NYSE DONE', end)

Example:

Every night at 3:10 AM, the script downloads the data up to the previous day and saves it in one file.
At 7:01 AM, it runs again and downloads the data that was missed the first time.

from pandas_datareader import data as pdr
import ast
import datetime as dt
import time
import schedule


def time_test():
    # Hourly heartbeat showing that the scheduler is still running
    print('Time test', dt.datetime.now())


def nyse():
    # Read the ticker list; literal_eval avoids executing the file contents as code
    with open(r'C:\Pdata\NYSE\OUT\nyse_ticker.txt') as file:
        stocks = ast.literal_eval(file.read())
    directory = 'C:\\Pdata\\NYSE\\IN\\'
    file_name = 'nyse.csv'
    end = dt.datetime.now()
    start = dt.datetime(2022, 11, 4)
    # One call downloads all tickers, so no loop over stocks is needed
    df = pdr.get_data_yahoo(symbols=stocks, start=start, end=end).stack("Symbols")
    df.to_csv(directory + file_name)
    print('NYSE DONE', end)


def nyse_error():
    # error.csv is expected to hold a list of tickers in the same format as the ticker file
    with open(r'C:\Pdata\NYSE\OUT\error.csv') as file:
        stocks = ast.literal_eval(file.read())
    directory = 'C:\\Pdata\\NYSE\\IN\\'
    file_name = 'nyse_error.csv'
    end = dt.datetime.now()
    start = dt.datetime(2022, 11, 4)
    df = pdr.get_data_yahoo(symbols=stocks, start=start, end=end).stack("Symbols")
    df.to_csv(directory + file_name)
    print('NYSE error DONE', end)


def main():
    schedule.every(1).hour.do(time_test)
    schedule.every().day.at('03:10').do(nyse)
    schedule.every().day.at('07:01').do(nyse_error)

    while True:
        schedule.run_pending()
        time.sleep(1)  # pause between checks so the loop does not spin the CPU


if __name__ == '__main__':
    main()

Finally, I would like to remind you that this is just an example of how a Python scraper can be implemented. By changing the parameters of the script, you can customize it to fetch whatever information you need.
Don’t forget that this is raw data, which you will still need to load into a database.
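
As a rough sketch of that last step, the CSV output could be loaded into SQLite with the standard library alone. Everything specific here is an assumption made for illustration: the table name prices, the helper load_prices, the column names Date, Symbols, Close and Volume, and the sample rows, which are made up. The columns in a real file should be checked before adapting this.

```python
import csv
import io
import sqlite3

# Made-up sample rows in an assumed layout; a real file would be opened with open().
SAMPLE_CSV = """Date,Symbols,Close,Volume
2022-12-21,GAPL,10.5,1000
2022-12-21,GIPL,20.25,2000
2022-12-22,GAPL,10.75,1500
"""


def load_prices(conn, csv_file):
    """Insert rows from csv_file into a prices table; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prices ("
        "date TEXT, symbol TEXT, close REAL, volume INTEGER, "
        "PRIMARY KEY (date, symbol))"
    )
    rows = [
        (r["Date"], r["Symbols"], float(r["Close"]), int(r["Volume"]))
        for r in csv.DictReader(csv_file)
    ]
    # INSERT OR REPLACE keeps reruns of the script from duplicating rows.
    conn.executemany("INSERT OR REPLACE INTO prices VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
print(load_prices(conn, io.StringIO(SAMPLE_CSV)))
print(conn.execute("SELECT COUNT(*) FROM prices WHERE symbol = 'GAPL'").fetchone()[0])
```

The primary key on date and symbol is one possible design choice: since the scheduled job downloads overlapping date ranges, it lets repeated loads overwrite existing rows instead of piling up duplicates.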

Kseno avatar
