Code in Python that parses fund holdings pulled from EDGAR, given a ticker or CIK
Challenge:
Write code in Python that parses fund holdings pulled from EDGAR, given a ticker or CIK.
Example:
- For this example, we will use this CIK: 0001166559
- Start on this page.
- Enter in the CIK (or ticker), and it will take you here.
- Find the "13F" report documents from the ones listed. Here is a "13F-HR".
- Parse and generate tab-delimited text from the XML.
Goals:
The code should be able to use any mutual fund ticker. Try morningstar.com or lipperweb.com to find valid tickers.
Be sure to check multiple tickers, since the format of the 13F reports can differ.
My solution
from bs4 import BeautifulSoup
import requests
import re


def getHoldings(cik):
    """
    Main function that first finds the most recent 13F form
    and then passes it to scrapeForm to get the holdings
    for a particular institutional investor.
    """
    urlSec = "https://www.sec.gov"
    urlForms = "{}/cgi-bin/browse-edgar?action=getcompany&CIK={}&type=13F".format(urlSec, cik)
    urlRecentForm = urlSec + BeautifulSoup(requests.get(urlForms).content, \
                                           'lxml').find('a', {"id": "documentsbutton"})['href']
    contents = BeautifulSoup(requests.get(urlRecentForm).content, 'lxml')
    urlTable = "{}{}".format(urlSec, contents.find_all('tr',
                             {"class": 'blueRow'})[-1].find('a')['href'])
    return scrapeForm(urlTable)


def scrapeForm(url):
    """
    This function scrapes holdings from particular URL
    """
    soup = BeautifulSoup(requests.get(url).content, 'lxml')
    holdings = set([h.text for h in soup.find_all((lambda tag: 'issuer' in tag.name.lower()))])
    if(not holdings):
        print("No Holdings at: {}".format(url))
        return
    return holdings
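For reference, I call it with the example CIK from the challenge:

holdings = getHoldings('0001166559')
print(holdings)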
Could you provide me some feedback on my code? I completed this challenge recently and just received a general rejection email, so I want to know how I could improve my solution.
python web-scraping
edited Apr 16 at 15:51 by Solomon Ucko
asked Apr 16 at 15:37 by Stoicas
1 Answer
There are a few improvements I would apply to the code.
Code Style
- address PEP8 violations, in particular:
- variable and function naming - your functions and variables follow the camel-case convention, but PEP8 and the Python community advocate for the lower_case_with_underscores naming style
- watch the use of whitespace around operators and in expressions
- remove unused imports - the re module is unused
- the backslash line continuation is unnecessary and can be removed
- the parentheses around not holdings are redundant and can be removed
- you can create a set using a set comprehension directly:
  holdings = {h.text for h in soup.find_all(lambda tag: 'issuer' in tag.name.lower())}
- I would also define the urlSec URL and the urlForms URL template as proper constants
- I think you are also overloading the code with too many things in a single expression. Apply the "Extract Variable" refactoring method to improve readability and simplify the code
- use urljoin() to join parts of a URL (a quick illustration follows this list)
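A quick illustration of why urljoin() beats manual string concatenation here - it normalizes the joining slash regardless of how the path fragment is written:

from urllib.parse import urljoin

base = "https://www.sec.gov"
print(urljoin(base, "/cgi-bin/browse-edgar"))  # https://www.sec.gov/cgi-bin/browse-edgar
print(urljoin(base, "cgi-bin/browse-edgar"))   # https://www.sec.gov/cgi-bin/browse-edgar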
Web-scraping and HTML-parsing
- since you are issuing multiple requests to the same domain, you may re-use a requests.Session() instance, which may have a positive impact on performance. As the requests documentation notes: "if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase"
- you may also win performance on HTML parsing by utilizing the SoupStrainer class, which allows parsing only the specific things in the DOM tree (a minimal sketch combining both ideas follows this list)
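A minimal sketch of the two ideas combined - one shared Session plus a SoupStrainer that restricts parsing to the anchors we care about (the URL is just the EDGAR search page used above, with the example CIK):

from bs4 import BeautifulSoup, SoupStrainer
import requests

session = requests.Session()  # the underlying TCP connection is reused across .get() calls
only_buttons = SoupStrainer('a', {"id": "documentsbutton"})  # parse only these <a> tags

url = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001166559&type=13F"
soup = BeautifulSoup(session.get(url).content, 'lxml', parse_only=only_buttons)
print([a['href'] for a in soup.find_all('a')])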
Improved code:

from urllib.parse import urljoin

from bs4 import BeautifulSoup, SoupStrainer
import requests


class Scraper:
    BASE_URL = "https://www.sec.gov"
    FORMS_URL_TEMPLATE = "/cgi-bin/browse-edgar?action=getcompany&CIK={cik}&type=13F"

    def __init__(self):
        self.session = requests.Session()

    def get_holdings(self, cik):
        """
        Main function that first finds the most recent 13F form
        and then passes it to scrape_document to get the holdings
        for a particular institutional investor.
        """
        forms_url = urljoin(self.BASE_URL, self.FORMS_URL_TEMPLATE.format(cik=cik))

        # get the recent form address
        parse_only = SoupStrainer('a', {"id": "documentsbutton"})
        soup = BeautifulSoup(self.session.get(forms_url).content, 'lxml', parse_only=parse_only)
        recent_form_url = soup.find('a', {"id": "documentsbutton"})['href']
        recent_form_url = urljoin(self.BASE_URL, recent_form_url)

        # get the form document URL
        parse_only = SoupStrainer('tr', {"class": 'blueRow'})
        soup = BeautifulSoup(self.session.get(recent_form_url).content, 'lxml', parse_only=parse_only)
        form_url = soup.find_all('tr', {"class": 'blueRow'})[-1].find('a')['href']
        form_url = urljoin(self.BASE_URL, form_url)

        return self.scrape_document(form_url)

    def scrape_document(self, url):
        """
        This function scrapes holdings from a particular document URL
        """
        soup = BeautifulSoup(self.session.get(url).content, 'lxml')
        holdings = {h.text for h in soup.find_all(lambda tag: 'issuer' in tag.name.lower())}
        if not holdings:
            print("No Holdings at: {}".format(url))
            return
        return holdings
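For completeness, here is one possible way to drive the class and emit the tab-delimited text the challenge asks for (a minimal sketch; get_holdings returns a flat set of issuer names, so the output is effectively a single column - with richer per-holding fields you would join each row's values with tabs):

scraper = Scraper()
holdings = scraper.get_holdings('0001166559')  # the example CIK from the challenge
if holdings:
    print('\n'.join(sorted(holdings)))  # one issuer per line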
answered Apr 16 at 18:15 by alecxe

Thank you for such a detailed response. I appreciate that. - Stoicas, Apr 17 at 5:51