Data tells story
A blog about using technology such as Python/pandas to analyze different data and dig out the stories underneath.
Wednesday, 9 September 2020
Ways to get existing data and organize it for your own use.
Sunday, 2 December 2018
Schedule regular tasks in AWS
I spent a lot of time on this tiny problem, so let me save you some: the advice on other websites telling you to use 'crontab -e' or other Linux methods did not work on my AWS instance; the system seemed to ignore those settings completely. I don't like articles that copy from each other without actually validating the steps, because they waste a lot of readers' time.
The only way I could get cron to schedule regular tasks on my AWS instance was to create a new file (without any extension) under this folder:
/etc/cron.d/
This is my setting:
vi /etc/cron.d/dailyPriceUpdate
This file tells cron to use the 'root' user to do the following (a reconstructed example of the file appears after this list):
1. change to the directory /home/ubuntu/aushare; remember that your usual environment PATH is not applied automatically in cron;
2. execute /home/ubuntu/anaconda3/bin/python /home/ubuntu/aushare/ASXScrapShareDailyPrice.py;
3. write the output to /home/ubuntu/dailyPriceUpdate.log;
4. redirect errors to the same file with 2>&1.
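Putting those four points together, the file content looks roughly like this; the daily 18:00 schedule and the appending '>>' redirection are my example choices, not necessarily the exact original line:

# /etc/cron.d/dailyPriceUpdate
# minute hour day-of-month month day-of-week user command
0 18 * * * root cd /home/ubuntu/aushare && /home/ubuntu/anaconda3/bin/python /home/ubuntu/aushare/ASXScrapShareDailyPrice.py >> /home/ubuntu/dailyPriceUpdate.log 2>&1

Note that files under /etc/cron.d/, unlike a personal crontab, need the user field ('root' here) between the schedule and the command.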
Wednesday, 28 November 2018
build a REST API service to provide market data for yourself
The framework I used is Python Flask, a micro web framework that is very efficient for building REST API services. I considered doing this with AWS's managed API services, which provide elastic scalability, but the application would then rely tightly on AWS. I wanted to keep my application independent and portable, so I only used the EC2 (Elastic Compute Cloud) service.
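To give a flavour of how little code this takes, here is a minimal Flask sketch that serves the daily price CSV files produced by the scraping scripts described later in this blog; the route name and port are my own choices, not taken from the original project:

from flask import Flask, abort, jsonify
import os
import pandas as pd

app = Flask(__name__)

@app.route('/price/<symbol>')
def daily_price(symbol):
    # Serve the historical prices scraped into data/daily/<symbol>_price.csv
    file_name = 'data/daily/%s_price.csv' % symbol.upper()
    if not os.path.isfile(file_name):
        abort(404)
    df = pd.read_csv(file_name)
    return jsonify(df.to_dict(orient='records'))

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

A request such as GET /price/Z1P would then return the stored daily prices for that symbol as JSON.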
How to access the EC2 instance in AWS
After building an instance in AWS, the next step is to access the instance and install software such as Python, Python modules and the REST API framework.
I am not going to reinvent the wheel; instead, use the existing official Amazon documentation, which is quite useful and informative:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html
Before starting, we need to prepare the private key file and convert it to a .ppk file, following the 'To convert your private key' instructions in the link above.
With PuTTY, we can access the terminal to install software and run applications. The first thing we need to do is install Python 3. My EC2 system is Ubuntu, so I used the following command:
$ sudo apt-get install python3
WinSCP is a powerful tool for downloading and uploading files to and from remote computers. Simply follow the instructions above; during setup it will prompt you to import the PuTTY settings and keys. Then we can exchange files with the AWS EC2 instance.
create an instance in aws for free
It takes ages for a normal computer to run through the script and download the full price history of every listed company, so I turned to Amazon Web Services. AWS has a one-year free tier, so we can use that option to keep the cost down. There is another advantage: you can access your personal data wherever you are.
There are many articles on the internet explaining how to apply for and set up a free EC2 instance in AWS, so I won't repeat them here. This is the one I found very informative:
For me, the operating system I set up is Ubuntu. You need to save the private key file to your computer for later use when accessing the AWS instance.
Some tips:
- Be aware of the geographic region you choose. I forgot to set this and it defaulted to 'Ohio' in America; it takes a little money and some trouble to migrate to another region, so think about it from the start.
- It is best to run just one instance, since the one-year free tier allows 750 EC2 hours per month, which is just enough for a single always-on instance.
- There is no need to use the S3 service for data storage, since EC2 storage is enough. I once created an S3 bucket and uploaded all those files; the number of requests exceeded 2,000 and I was charged a little.
- If you register a domain name somewhere and want to point it at your IP address in AWS, you will be using the Route 53 service, which charges 0.55 USD per month. You also need a fixed IP address to bind to your domain name; to get one, use 'Allocate new address'.
- Also check whether the previous IP address has been released, otherwise you will be charged by Amazon. To check this, right-click the IP address: the 'Release address' option should be greyed out.
get historical data with python
When we want share prices, the first thing we may ask ourselves is: how many companies are listed on the ASX? Here is what we can see on the ASX official website:
(there is a constant in stock.cons.py that records this:
ASXLIST_FILE = 'https://www.asx.com.au/asx/research/ASXListedCompanies.csv')
So we can download historical market data based on this list. But be aware: although there are 2,257 companies listed in this spreadsheet, there are many 'dead' codes, either delisted without the document being updated or with no trades at all, so in practice there are only about 1,800 active equity codes. We will use some data-cleaning techniques to filter out these 'dead' instruments.
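As an illustration of the kind of cleaning involved (this is my own sketch, not the exact logic in the script, and the local file names are assumptions), one simple rule is to keep only symbols whose downloaded price file, in the data/daily layout described later in this post, exists and contains a recent trade:

import os
import pandas as pd

def is_alive(symbol, days=90):
    # Treat a symbol as 'alive' if its price file exists, is non-empty,
    # and has at least one trade within the last `days` days.
    file_name = 'data/daily/%s_price.csv' % symbol
    if not os.path.isfile(file_name):
        return False
    df = pd.read_csv(file_name, parse_dates=['Date'])
    if df.empty:
        return False
    return (pd.Timestamp.today() - df['Date'].max()).days <= days

codes = pd.read_csv('ASXListedCompanies.csv', header=1)['ASX code'].values
active_codes = [s for s in codes if is_alive(s)]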
Some preconditions should be met before you run this Python script, and I will explain the most important parts of the code.
ASXScrapShareDailyPrice.py is the script you can run. Before that, a few things need to be done:
- go to https://www.python.org/downloads/ to download the latest Python version; 3.7.1 should be fine;
- use pip install to set up these modules: selenium, bs4, lxml, pandas;
- download PhantomJS (http://phantomjs.org/download.html), unzip it and put it in this folder: /usr/local/bin/phantomjs, or any other folder you specify in the code:
browser = webdriver.PhantomJS(executable_path='/usr/local/bin/phantomjs')
To run the Python script:
cd /aushare
python ASXScrapShareDailyPrice.py
These two lines define the URL templates used to obtain daily historical data; the '%s' placeholders are filled in later in the code. DAILY_PRICE is used to fetch the full history when the script has never been executed before, while DAILY_PRICE_DELTA is used to keep an existing historical data file up to date.
DAILY_PRICE_DELTA = "https://au.finance.yahoo.com/quote/%s.AX/history?period1=%s&period2=%s&interval=1d&filter=history&frequency=1d"
DAILY_PRICE ='https://au.finance.yahoo.com/quote/%s.AX/history?period1=0&period2=%s&interval=1d&filter=history&frequency=1d'
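For example, with DAILY_PRICE_DELTA defined as above, filling the template for one symbol and date range looks like this (the symbol and dates are chosen only for illustration):

import time
from datetime import datetime

period1 = int(time.mktime(datetime(2018, 11, 1).timetuple()))   # start of the missing range
period2 = int(time.mktime(datetime(2018, 11, 28).timetuple()))  # end of the missing range
url = DAILY_PRICE_DELTA % ('BHP', period1, period2)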
This code reads all the codes listed on the ASX:
df = pd.read_csv(ASXLIST_FILE_NAME, header=1)
codelist = df['ASX code'].values
for symbol in codelist:
If you have already executed the script and a historical price data file exists, the script will read the existing file, find the latest date, and then fetch data only from that point onward, so no work is repeated for the existing data. This is implemented here:
if os.path.isfile(file_name):
    df = pd.read_csv(file_name, header=0, index_col=0, date_parser=_parser,
                     skipfooter=1, engine='python')
    if df.empty:
        continue
    df.reset_index(inplace=True)
    recent_date = df['Date'].max()
    print(recent_date)
    s1 = recent_date + timedelta(days=1)
    print(s1)
    period1 = int(time.mktime(s1.timetuple()))
    url = DAILY_PRICE_DELTA % (symbol, period1, period2)
    no_of_pagedowns = 2
else:
    period2 = int(time.mktime(s2.timetuple()))
    url = DAILY_PRICE % (symbol, period2)
    no_of_pagedowns = 50
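The no_of_pagedowns variable hints at how the page is loaded: Yahoo Finance only renders more history rows as you scroll down. A rough sketch of how that scrolling and parsing might look, continuing from the url and no_of_pagedowns set above and using the PhantomJS driver mentioned earlier with BeautifulSoup (the element selection details are my assumptions, not the author's exact code):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

browser = webdriver.PhantomJS(executable_path='/usr/local/bin/phantomjs')
browser.get(url)
body = browser.find_element_by_tag_name('body')
for _ in range(no_of_pagedowns):   # more page-downs when the full history is needed
    body.send_keys(Keys.PAGE_DOWN)
    time.sleep(0.5)                # give the page time to load the newly revealed rows

soup = BeautifulSoup(browser.page_source, 'lxml')
rows = soup.find_all('tr')         # rows of the historical price table, to be parsed into a DataFrame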
After downloading the data, we need to filter out the duplicated rows. I noticed quite a lot of invalid duplicated prices on Yahoo Finance, but we should not complain since it is free. This line cleans out the duplicates:
result.drop_duplicates(inplace=True)
The final format and location of the historical data file are defined in stock.cons.py:
DAILY_PRICE_FILE = 'data/daily/%s_price.csv'
So if the symbol name is 'Z1P', the file will be located in data/daily and the file name will be Z1P_price.csv.
Some people (like me) prefer to analyze the share market using fundamental financial reports. To fetch the balance sheet, annual revenue report and cash flow statement, you need to run these Python scripts with Jupyter Notebook (to install Jupyter, go here: http://jupyter.org/):
ASXDataScrapBalance.ipynb
ASXDataScrapRevenue.ipynb
ASXDateScrapCashflow.ipynb
The mechanism of these scripts is almost the same as ASXScrapShareDailyPrice.py, but much simpler.
Please note: don't use a thread pool to download the data; this can overwhelm the server and you will get error code 429 (too many requests) or even an IP ban. A simple alternative is shown below.
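A minimal sketch of the polite approach, downloading one symbol at a time with a short pause (the two-second delay is only an example, and the download step is the code shown earlier in this post):

import time

for symbol in codelist:
    # ... fetch and save the history for this symbol using the steps above ...
    time.sleep(2)   # keep requests sequential and spaced out to avoid HTTP 429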
All the code is on GitHub for educational purposes: