Reducing the amount of List in a WebScraper
At the moment, I'm learning and experimenting with scraping content from a variety of web pages. But I've come across a common code smell in several of my applications: I have many repetitive lists that have data appended to them.
    from requests import get
    import requests
    import json
    from time import sleep
    import pandas as pd

    url = 'https://shopee.com.my/api/v2/flash_sale/get_items?offset=0&limit=16&filter_soldout=true'
    list_name = []
    list_price = []
    list_discount = []
    list_stock = []
    response = get(url)
    json_data = response.json()

    def getShockingSales():
        index = 0
        if response.status_code is 200:
            print('Response: ' + 'OK')
        else:
            print('Unable to access')
        total_flashsale = len(json_data['data']['items'])
        total_flashsale -= 1
        for i in range(index, total_flashsale):
            print('Getting data from site... please wait a few seconds')
            while i <= total_flashsale:
                flash_name = json_data['data']['items'][i]['name']
                flash_price = json_data['data']['items'][i]['price']
                flash_discount = json_data['data']['items'][i]['discount']
                flash_stock = json_data['data']['items'][i]['stock']
                list_name.append(flash_name)
                list_price.append(flash_price)
                list_discount.append(flash_discount)
                list_stock.append(flash_stock)
                sleep(0.5)
                i += 1
                if i > total_flashsale:
                    print('Task is completed...')
                    return

    getShockingSales()
    new_panda = pd.DataFrame({'Name': list_name, 'Price': list_price,
                              'Discount': list_discount, 'Stock Available': list_stock})
    print('Converting to Panda Frame....')
    sleep(5)
    print(new_panda)
Would one list be more than sufficient? Am I approaching this wrongly?
python python-3.x json
asked 23 hours ago by Minial (new contributor), edited 20 hours ago
2 Answers
Review
- Remove unnecessary imports.
- Don't work in the global namespace. This makes it harder to track bugs.
- Constants (`url`) should be `UPPER_SNAKE_CASE`.
- Functions (`getShockingSales()`) should be `lower_snake_case`.
- You don't break or return when an invalid status is encountered.
- `if response.status_code is 200:` should use `==` instead of `is`. There is a function for this though: `response.raise_for_status()` raises an exception when there is a 4xx or 5xx status.
- Why use a `while` inside the `for`, and return when finished with the `while`? This is really odd! Either loop with a `for` or a `while`, not both, because the `while` currently disregards the `for` loop. I suggest sticking with `for` loops; Python excels at readable `for` loops (see "Loop like a native").

> Would one list be more than sufficient? Am I approaching this wrongly?

Yes.
You don't have to use 4 separate lists; you can instead create one list and add the column names afterwards.
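To illustrate the `==` versus `is` point from the list above: `==` compares values, while `is` checks whether two names refer to the same object. A minimal sketch, using made-up list values rather than a real response object:

```python
# Two distinct list objects that hold equal values (illustrative data).
first = [200, 'OK']
second = [200, 'OK']

# `==` compares the contents, so equal values compare equal.
values_equal = first == second

# `is` compares identity: these are two separate objects.
same_object = first is second

# An alias refers to the very same object, so `is` holds here.
alias = first
alias_is_same = alias is first

print(values_equal, same_object, alias_is_same)
```

With small integers such as 200, `is` often appears to work in CPython because small integer objects are cached, but that is an implementation detail, and recent Python versions emit a `SyntaxWarning` for `is` comparisons against a literal.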
Code
    from requests import get
    import pandas as pd

    URL = 'https://shopee.com.my/api/v2/flash_sale/get_items?offset=0&limit=16&filter_soldout=true'

    def get_stocking_sales():
        response = get(URL)
        response.raise_for_status()
        return [
            (item['name'], item['price'], item['discount'], item['stock'])
            for item in response.json()['data']['items']
        ]

    def create_pd():
        return pd.DataFrame(
            get_stocking_sales(),
            columns=['Name', 'Price', 'Discount', 'Stock']
        )

    if __name__ == '__main__':
        print(create_pd())
answered 20 hours ago by Ludisposed

Thank you for showing where and what I did wrong, where I can improve, and how to make the code much cleaner! I followed what you said; I never knew about the `if __name__ == '__main__':` concept. Not only did you help, but I've learned more from your insight. Thank you so much~ – Minial, 4 hours ago
Review
- Creating functions that read and modify global variables is not a good idea. For example, if someone wants to reuse your function, they won't know about its side effects.
- `index` is not useful, and `range(0, n)` is the same as `range(n)`.
- Using `==` is more appropriate than `is` in general, hence `response.status_code == 200`.
- If `response.status_code != 200`, I think the function should raise an exception rather than return an empty result, as @Ludisposed said.
- You use `json_data["data"]["items"]` a lot; you could define `items = json_data["data"]["items"]` instead, but see below.
- Your usage of `i` is totally messy. Never use both `for` and `while` on the same variable. I think you just want to get the information for each item, so just use `for item in json_data["data"]["items"]:`.
- Actually, `print("Getting data from site... please wait a few seconds")` is wrong, as you already got the data at `response = get(url)`. Also, `sleep(0.5)` and `sleep(5)` don't make any sense.
- Speaking of which, `requests.get` is more explicit.
- You can actually create a pandas DataFrame directly from a list of dictionaries.
- Actually, if you don't use the response in another place, you can use the URL as an argument of the function.
- Putting spaces in column names of a DataFrame is not a good idea. It removes the possibility of accessing the column named `stock` (for example) with `df.stock`. If you still want that, you can use `pandas.DataFrame.rename`.
- You don't need to import `json`.
- The discounts are given as strings like `"59%"`. I think integers are preferable if you want to perform computations on them. I used `df.discount = df.discount.apply(lambda s: int(s[:-1]))` to perform this.
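The advice above about iterating over the items directly, instead of juggling a counter, can be sketched as follows. The `items` list is a hypothetical sample shaped like the `json_data["data"]["items"]` list in the question (the values are made up), and both function names are illustrative:

```python
# Hypothetical sample shaped like the 'items' list in the API response.
items = [
    {'name': 'Phone case', 'price': 1500, 'discount': '59%', 'stock': 12},
    {'name': 'USB cable', 'price': 800, 'discount': '20%', 'stock': 40},
    {'name': 'Earbuds', 'price': 4500, 'discount': '35%', 'stock': 7},
]


def rows_by_index(items):
    """Index-based loop, shown only for comparison."""
    rows = []
    for i in range(len(items)):
        item = items[i]
        rows.append((item['name'], item['price'], item['discount'], item['stock']))
    return rows


def rows_direct(items):
    """Iterate over the items themselves; there is no counter to manage."""
    return [
        (item['name'], item['price'], item['discount'], item['stock'])
        for item in items
    ]


# Both approaches build the same single list of rows.
assert rows_by_index(items) == rows_direct(items)
print(rows_direct(items)[0])
```

Either way, a single list of rows replaces the four parallel lists; the direct version simply has fewer moving parts.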
Optional: you might want to use `logging` instead of printing everything. Or at least print to stderr with:

    from sys import stderr
    print('Information', file=stderr)
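As a rough sketch of the `logging` suggestion, using only the standard library: the logger name `flash_sale` and the helper `report_progress` are hypothetical, chosen just for this illustration.

```python
import logging

# Configure once, near program start. The default handler writes to stderr.
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
logger = logging.getLogger('flash_sale')  # hypothetical logger name


def report_progress(item_count):
    """Log a progress message instead of printing it."""
    logger.info('Fetched %d items', item_count)
    return item_count


report_progress(16)
```

One advantage over `print` is that the verbosity can be changed in one place (the `level` argument) without touching the functions that emit messages.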
Code
    import requests
    import pandas as pd

    def getShockingSales(url):
        response = requests.get(url)
        columns = ["name", "price", "discount", "stock"]
        response.raise_for_status()
        print("Response: OK")
        json_data = response.json()
        df = pd.DataFrame(json_data["data"]["items"])[columns]
        df.discount = df.discount.apply(lambda s: int(s[:-1]))
        print("Task is completed...")
        return df

    URL = "https://shopee.com.my/api/v2/flash_sale/get_items?offset=0&limit=16&filter_soldout=true"
    df = getShockingSales(URL)
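The discount conversion in the code above can be tried without pandas or a network call. This is a small sketch with made-up values; the helper name `parse_discount` is hypothetical, and it uses `rstrip('%')` rather than the `s[:-1]` slice from the answer, which is a deliberate variation so that a value without a trailing percent sign is not truncated.

```python
def parse_discount(text):
    """Convert a percentage string such as '59%' to an int."""
    # rstrip removes only trailing '%' characters, so '59' also parses.
    return int(text.rstrip('%'))


discounts = ['59%', '20%', '5%']  # illustrative values
parsed = [parse_discount(d) for d in discounts]
print(parsed)  # [59, 20, 5]
```

The same function could be passed to `df.discount.apply(...)` in place of the lambda, assuming every discount is a string of digits optionally followed by `%`.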
answered 20 hours ago by Labo (new contributor)

Thank you for your insight~ I've learned more than I could hope for by reading your review. It even helped me solve and fix a few errors in other areas of my application. I wish I could give you more upvotes v.v – Minial, 4 hours ago