How to find and delete duplicate video (.mp4) files? [closed]

I have 16,000+ short video clips, and many of them are exactly alike to the human eye; but if you examine them very closely, you will find that one or the other may have an extra second (or much, much less) of duration at the beginning or the end.
I have already tried several methods and had zero success in finding duplicates. You would think that comparing files byte for byte would be good enough, but it is not: a few extra (or missing) milliseconds at the beginning or end of a clip make the files differ, so any duplicate finder that relies on byte-for-byte comparison reports no duplicates at all.
So although the bulk of a clip is exactly like several others, nothing I have used finds any duplicates, because of the few milliseconds of difference at the beginning or end of the compared .mp4 files.
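To illustrate, here is the style of exact-content check that keeps coming up empty: a minimal sketch that groups files by a whole-file MD5 hash (the directory is a placeholder). Clips that differ by even a few milliseconds hash differently, so every group ends up with a single member:

from hashlib import md5
from os import listdir, path
from collections import defaultdict

video_dir = '/path/to/clips'  # placeholder

groups = defaultdict(list)
for name in listdir(video_dir):
    if not name.lower().endswith('.mp4'):
        continue
    hasher = md5()
    with open(path.join(video_dir, name), 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):  # read in 1 MiB chunks
            hasher.update(chunk)
    groups[hasher.hexdigest()].append(name)

# Near-duplicates land in different groups, so nothing is printed.
for digest, names in groups.items():
    if len(names) > 1:
        print(digest, names)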
Does anyone know how I might succeed in finding duplicates among these short .mp4 clips? On average they are about 30 seconds each, and the difference between two of them is only a few milliseconds. To the human eye they are exactly the same, so I am seeing duplicates, but I don't want to have to watch and compare 16,000+ video clips all on my own.
Any suggestions?
I found a great working answer to my question; can you allow me to post it?
... it seems I can't while the question is on hold ...
linux files find video file-comparison
closed as too broad by Rui F Ribeiro, Evan Carroll, jimmij, X Tian, Michael Homer Feb 13 at 23:20
Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. Avoid asking multiple distinct questions at once. See the How to Ask page for help clarifying this question. If this question can be reworded to fit the rules in the help center, please edit the question.
Kindly let us know: do you want to delete only the videos whose size and name are the same?
– Praveen Kumar BS
Feb 12 at 4:25
Generating predictable thumbnails might be a decent idea: video.stackexchange.com/a/5315/14767
– Haxiel
Feb 12 at 5:28
You haven't defined "exactly alike to the human eye" at all. Does that mean same clarity? Same starting frame?
– Evan Carroll
Feb 12 at 7:03
@EvanCarroll No. It means that they are the same to the human eye.
– peterh
Feb 17 at 9:33
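For what it's worth, Haxiel's thumbnail suggestion can be sketched in a few lines, assuming the opencv-python, Pillow, and imagehash packages (the directory and the choice of the middle frame are placeholder assumptions): grab one frame from a fixed position in each clip, perceptually hash it, and group clips whose hashes collide.

from os import listdir, path
from collections import defaultdict

from cv2 import VideoCapture, CAP_PROP_FRAME_COUNT, CAP_PROP_POS_FRAMES
from PIL import Image
from imagehash import average_hash

video_dir = '/path/to/clips'  # placeholder

groups = defaultdict(list)
for name in listdir(video_dir):
    cap = VideoCapture(path.join(video_dir, name))
    frames = int(cap.get(CAP_PROP_FRAME_COUNT))
    cap.set(CAP_PROP_POS_FRAMES, frames // 2)  # seek to the middle frame
    ok, frame = cap.read()
    cap.release()
    if not ok:
        continue
    # Perceptual hash: visually identical frames collide even when the
    # files differ byte for byte. A few extra leading frames shift the
    # midpoint slightly, but adjacent frames usually hash the same.
    groups[str(average_hash(Image.fromarray(frame)))].append(name)

for h, names in groups.items():
    if len(names) > 1:
        print(h, names)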
asked Feb 12 at 3:33 by Anonymous User, edited Feb 17 at 3:47
1 Answer
I had the same problem, so I wrote a program myself. The catch was that my videos were of various formats and resolutions, so I needed to take a hash of each video frame and compare those across files.
https://github.com/gklc811/duplicate_video_finder
You can just change the directories at the top and you are good to go.
from os import path, walk, makedirs, rename
from time import perf_counter  # time.clock() was removed in Python 3.8
from imagehash import average_hash
from PIL import Image
from cv2 import VideoCapture, CAP_PROP_FRAME_COUNT, CAP_PROP_FRAME_WIDTH, CAP_PROP_FRAME_HEIGHT, CAP_PROP_FPS
from json import dump, load
from multiprocessing import Pool, cpu_count

# Change these four directories to suit your setup (keep the trailing separator).
input_vid_dir = 'C:\\Users\\gokul\\Documents\\data\\'
json_dir = 'C:\\Users\\gokul\\Documents\\db\\'
analyzed_dir = 'C:\\Users\\gokul\\Documents\\analyzed\\'
duplicate_dir = 'C:\\Users\\gokul\\Documents\\duplicate\\'

if not path.exists(json_dir):
    makedirs(json_dir)

if not path.exists(analyzed_dir):
    makedirs(analyzed_dir)

if not path.exists(duplicate_dir):
    makedirs(duplicate_dir)


def write_to_json(filename, data):
    # One JSON file per video, holding its metadata and frame hashes.
    file_full_path = json_dir + filename + ".json"
    with open(file_full_path, 'w') as file_pointer:
        dump(data, file_pointer)
    return


def video_to_json(filename):
    file_full_path = input_vid_dir + filename
    start = perf_counter()
    size = round(path.getsize(file_full_path) / 1024 / 1024, 2)  # size in MB

    video_pointer = VideoCapture(file_full_path)
    frame_count = int(video_pointer.get(CAP_PROP_FRAME_COUNT))
    width = int(video_pointer.get(CAP_PROP_FRAME_WIDTH))
    height = int(video_pointer.get(CAP_PROP_FRAME_HEIGHT))
    fps = int(video_pointer.get(CAP_PROP_FPS))

    # Perceptually hash every frame; hash strings key the dict, so
    # visually identical frames collapse into one entry.
    success, image = video_pointer.read()
    video_hash = {}
    while success:
        frame_hash = average_hash(Image.fromarray(image))
        video_hash[str(frame_hash)] = filename
        success, image = video_pointer.read()
    video_pointer.release()

    stop = perf_counter()
    time_taken = stop - start

    print("Time taken for ", file_full_path, " is : ", time_taken)

    data_dict = dict()
    data_dict['size'] = size
    data_dict['time_taken'] = time_taken
    data_dict['fps'] = fps
    data_dict['height'] = height
    data_dict['width'] = width
    data_dict['frame_count'] = frame_count
    data_dict['filename'] = filename
    data_dict['video_hash'] = video_hash

    write_to_json(filename, data_dict)
    return


def multiprocess_video_to_json():
    # Hash the videos in parallel, one worker per CPU core.
    files = next(walk(input_vid_dir))[2]
    processes = cpu_count()
    print(processes)
    pool = Pool(processes)
    start = perf_counter()
    pool.map_async(video_to_json, files)
    pool.close()
    pool.join()
    stop = perf_counter()
    print("Time Taken : ", stop - start)


def key_with_max_val(d):
    # Return the key with the largest value (the most-overlapping file).
    max_value = 0
    required_key = ""
    for k in d:
        if d[k] > max_value:
            max_value = d[k]
            required_key = k
    return required_key


def duplicate_analyzer():
    files = next(walk(json_dir))[2]
    data_dict = {}
    for file in files:
        filename = json_dir + file
        with open(filename) as f:
            data = load(f)
        video_hash = data['video_hash']
        count = 0
        duplicate_file_dict = dict()
        for key in video_hash:
            count += 1
            if key in data_dict:
                # Frame hash already seen in an earlier file: count the overlap.
                if data_dict[key] in duplicate_file_dict:
                    duplicate_file_dict[data_dict[key]] = duplicate_file_dict[data_dict[key]] + 1
                else:
                    duplicate_file_dict[data_dict[key]] = 1
            else:
                data_dict[key] = video_hash[key]
        if duplicate_file_dict:
            duplicate_file = key_with_max_val(duplicate_file_dict)
            duplicate_percentage = ((duplicate_file_dict[duplicate_file] / count) * 100)
            # More than half the frames match an earlier file: treat as duplicate.
            if duplicate_percentage > 50:
                file = file[:-5]  # strip the ".json" suffix
                print(file, " is dup of ", duplicate_file)
                src = analyzed_dir + file
                tgt = duplicate_dir + file
                if path.exists(src):
                    rename(src, tgt)
                # else:
                #     print("File already moved")


def mv_analyzed_file():
    # Move every video that already has a JSON record out of the input dir.
    files = next(walk(json_dir))[2]
    for filename in files:
        filename = filename[:-5]  # strip the ".json" suffix
        src = input_vid_dir + filename
        tgt = analyzed_dir + filename
        if path.exists(src):
            rename(src, tgt)
        # else:
        #     print("File already moved")


if __name__ == '__main__':
    mv_analyzed_file()
    multiprocess_video_to_json()
    mv_analyzed_file()
    duplicate_analyzer()
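The reason this tolerates the millisecond offsets in the question is the per-frame hashing combined with the 50% threshold in duplicate_analyzer: two clips that differ only by a few leading or trailing frames still share almost all of their frame hashes. A toy illustration with made-up hash strings:

# Made-up frame hashes: clip_b shares four of its five hashes with clip_a
# (one differing frame at the end), an 80% overlap, well above the 50%
# cut-off used by duplicate_analyzer above.
clip_a = {'8f00', '8f01', '8f02', '8f03', '8f04'}
clip_b = {'8f01', '8f02', '8f03', '8f04', '9a10'}
overlap = len(clip_a & clip_b) / len(clip_b) * 100
print(overlap)  # 80.0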
answered Feb 12 at 9:19 by Gokul C