How to find and delete duplicate video (.mp4) files? [closed]

I have 16,000+ short video clips, and many of them are exactly alike to the human eye; examine them very closely, though, and you will find that one or the other has up to an extra second (often far less) at the beginning or the end.



I have already tried several methods, with zero success in finding duplicates. You would think that comparing exact byte sizes would be good enough. It isn't: the few extra (or missing) milliseconds at the beginning or end of a clip make the files differ at the byte level, so any duplicate finder that relies on byte-for-byte comparison reports no duplicates at all.
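
For illustration, here is a minimal sketch of that kind of byte-for-byte duplicate finder (Python; the clips/ directory is hypothetical). It groups files by an exact MD5 checksum, so two clips that differ by even a few milliseconds of footage land in different groups and nothing is reported:

import hashlib
from collections import defaultdict
from pathlib import Path

# Group files by an exact checksum of their raw bytes; only groups with
# more than one member are byte-identical duplicates.
groups = defaultdict(list)
for clip in Path('clips').glob('*.mp4'):  # hypothetical directory
    digest = hashlib.md5(clip.read_bytes()).hexdigest()
    groups[digest].append(clip.name)

for digest, names in groups.items():
    if len(names) > 1:
        print('byte-identical:', names)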



And although the majority of each such clip is exactly like several others, nothing I use finds any duplicates, all because of those few milliseconds of difference at the beginning or end of the compared .mp4 files.



Does anyone know how I might succeed in finding duplicates among these short .mp4 clips? On average they are about 30 seconds each, and the difference between two near-duplicates is only a few milliseconds. To the human eye they are exactly the same, so I am clearly seeing duplicates, but I don't want to watch and compare 16,000+ video clips on my own.



Any suggestions?





I found a great working answer to my question. Can you allow me to answer it?



... seems I can't when it's put on hold ...










linux files find video file-comparison

asked Feb 12 at 3:33, edited Feb 17 at 3:47
Anonymous User

closed as too broad by Rui F Ribeiro, Evan Carroll, jimmij, X Tian, Michael Homer Feb 13 at 23:20

Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. Avoid asking multiple distinct questions at once. See the How to Ask page for help clarifying this question. If this question can be reworded to fit the rules in the help center, please edit the question.

  • Kindly let us know: do you want to delete only videos whose size and name are the same?

    – Praveen Kumar BS
    Feb 12 at 4:25











    Generating predictable thumbnails might be a decent idea: video.stackexchange.com/a/5315/14767

    – Haxiel
    Feb 12 at 5:28
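
    (A minimal sketch of that thumbnail idea, assuming ffmpeg is on the PATH and a hypothetical clips/ directory: grab one frame at a fixed offset from each clip, then compare perceptual hashes of the thumbnails. Clips padded by an extra second at the start may need a coarser hash or several sample points.)

    import subprocess
    from pathlib import Path
    from PIL import Image
    from imagehash import average_hash

    hashes = {}
    for clip in Path('clips').glob('*.mp4'):
        thumb = clip.with_suffix('.png')
        # Extract a single frame 5 seconds in as a predictable thumbnail.
        subprocess.run(['ffmpeg', '-y', '-ss', '5', '-i', str(clip),
                        '-frames:v', '1', str(thumb)], check=True)
        h = str(average_hash(Image.open(thumb)))
        hashes.setdefault(h, []).append(clip.name)

    for h, names in hashes.items():
        if len(names) > 1:
            print('possible duplicates:', names)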











  • You haven't defined "exactly alike to the human eye" at all. Does that mean same clarity? Same starting frame?

    – Evan Carroll
    Feb 12 at 7:03











  • @EvanCarroll No. It means that they are the same for the human eye.

    – peterh
    Feb 17 at 9:33

1 Answer
I had the same problem, so I wrote a program myself.



The catch was that my videos were in various formats and resolutions, so I needed to take a hash of each video frame and compare those.



https://github.com/gklc811/duplicate_video_finder



You can just change the directories at the top and you are good to go.



from os import path, walk, makedirs, rename
from time import perf_counter  # time.clock() was removed in Python 3.8
from imagehash import average_hash
from PIL import Image
from cv2 import VideoCapture, CAP_PROP_FRAME_COUNT, CAP_PROP_FRAME_WIDTH, CAP_PROP_FRAME_HEIGHT, CAP_PROP_FPS
from json import dump, load
from multiprocessing import Pool, cpu_count

# The backslashes in these paths were mangled in the original post;
# escaped backslashes are used here so the strings are valid Python.
input_vid_dir = 'C:\\Users\\gokul\\Documents\\data\\'
json_dir = 'C:\\Users\\gokul\\Documents\\db\\'
analyzed_dir = 'C:\\Users\\gokul\\Documents\\analyzed\\'
duplicate_dir = 'C:\\Users\\gokul\\Documents\\duplicate\\'

if not path.exists(json_dir):
    makedirs(json_dir)

if not path.exists(analyzed_dir):
    makedirs(analyzed_dir)

if not path.exists(duplicate_dir):
    makedirs(duplicate_dir)


def write_to_json(filename, data):
    file_full_path = json_dir + filename + ".json"
    with open(file_full_path, 'w') as file_pointer:
        dump(data, file_pointer)


def video_to_json(filename):
    # Hash every frame of one video and dump the result to a JSON file.
    file_full_path = input_vid_dir + filename
    start = perf_counter()
    size = round(path.getsize(file_full_path) / 1024 / 1024, 2)  # size in MB
    video_pointer = VideoCapture(file_full_path)
    frame_count = int(video_pointer.get(CAP_PROP_FRAME_COUNT))
    width = int(video_pointer.get(CAP_PROP_FRAME_WIDTH))
    height = int(video_pointer.get(CAP_PROP_FRAME_HEIGHT))
    fps = int(video_pointer.get(CAP_PROP_FPS))
    success, image = video_pointer.read()
    video_hash = {}
    while success:
        # Perceptual (average) hash: visually identical frames get the same key.
        frame_hash = average_hash(Image.fromarray(image))
        video_hash[str(frame_hash)] = filename
        success, image = video_pointer.read()
    video_pointer.release()
    stop = perf_counter()
    time_taken = stop - start
    print("Time taken for ", file_full_path, " is : ", time_taken)
    data_dict = dict()
    data_dict['size'] = size
    data_dict['time_taken'] = time_taken
    data_dict['fps'] = fps
    data_dict['height'] = height
    data_dict['width'] = width
    data_dict['frame_count'] = frame_count
    data_dict['filename'] = filename
    data_dict['video_hash'] = video_hash
    write_to_json(filename, data_dict)


def multiprocess_video_to_json():
    # Hash all videos in input_vid_dir, one worker process per CPU core.
    files = next(walk(input_vid_dir))[2]
    processes = cpu_count()
    print(processes)
    pool = Pool(processes)
    start = perf_counter()
    pool.map_async(video_to_json, files)
    pool.close()
    pool.join()
    stop = perf_counter()
    print("Time Taken : ", stop - start)


def key_with_max_val(d):
    # Return the key with the largest value (the most-matched earlier file).
    max_value = 0
    required_key = ""
    for k in d:
        if d[k] > max_value:
            max_value = d[k]
            required_key = k
    return required_key


def duplicate_analyzer():
    # Compare frame hashes across videos and move likely duplicates aside.
    files = next(walk(json_dir))[2]
    data_dict = {}
    for file in files:
        filename = json_dir + file
        with open(filename) as f:
            data = load(f)
        video_hash = data['video_hash']
        count = 0
        duplicate_file_dict = dict()
        for key in video_hash:
            count += 1
            if key in data_dict:
                # Frame hash already seen: count a hit against the earlier file.
                if data_dict[key] in duplicate_file_dict:
                    duplicate_file_dict[data_dict[key]] += 1
                else:
                    duplicate_file_dict[data_dict[key]] = 1
            else:
                data_dict[key] = video_hash[key]
        if duplicate_file_dict:
            duplicate_file = key_with_max_val(duplicate_file_dict)
            duplicate_percentage = (duplicate_file_dict[duplicate_file] / count) * 100
            if duplicate_percentage > 50:
                file = file[:-5]  # strip the ".json" extension
                print(file, " is dup of ", duplicate_file)
                src = analyzed_dir + file
                tgt = duplicate_dir + file
                if path.exists(src):
                    rename(src, tgt)
                # else: the file was already moved


def mv_analyzed_file():
    # Move already-hashed videos from the input directory to analyzed_dir.
    files = next(walk(json_dir))[2]
    for filename in files:
        filename = filename[:-5]  # strip ".json" to get the video filename
        src = input_vid_dir + filename
        tgt = analyzed_dir + filename
        if path.exists(src):
            rename(src, tgt)
        # else: the file was already moved


if __name__ == '__main__':
    mv_analyzed_file()
    multiprocess_video_to_json()
    mv_analyzed_file()
    duplicate_analyzer()
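
The idea, in short: every frame's average hash becomes a dictionary key, and a later video is flagged as a duplicate once more than 50% of its frame hashes were already seen in an earlier file. An extra second at the beginning or end only contributes a few unmatched hashes, so it does not break the match. Flagged files are moved from analyzed_dir to duplicate_dir rather than deleted, so nothing is lost if the 50% threshold misfires.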





answered Feb 12 at 9:19
Gokul C