How to find and delete duplicate video (.mp4) files? [closed]

I have 16,000+ short video clips, and many of them are exactly alike to the human eye; but if you examine them very closely, you will find that one or the other may have an extra second (or much, much less) of duration at the beginning or the end.
I have already tried several methods and had zero success in finding duplicates. You would think that comparing files byte for byte would be good enough, but it is not: a few extra (or missing) milliseconds at the beginning or end of a clip make the files differ, so any duplicate finder that relies on byte-for-byte comparison reports no duplicates at all.
So although the bulk of a clip is exactly like several others, nothing I have used finds any duplicates, because of the few milliseconds of difference at the beginning or end of the compared .mp4 files.
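To illustrate, here is the style of exact-content check that keeps coming up empty: a minimal sketch that groups files by a whole-file MD5 hash (the directory is a placeholder). Clips that differ by even a few milliseconds hash differently, so every group ends up with a single member:

from hashlib import md5
from os import listdir, path
from collections import defaultdict

video_dir = '/path/to/clips'  # placeholder

groups = defaultdict(list)
for name in listdir(video_dir):
    if not name.lower().endswith('.mp4'):
        continue
    hasher = md5()
    with open(path.join(video_dir, name), 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):  # read in 1 MiB chunks
            hasher.update(chunk)
    groups[hasher.hexdigest()].append(name)

# Near-duplicates land in different groups, so nothing is printed.
for digest, names in groups.items():
    if len(names) > 1:
        print(digest, names)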
Does anyone know how I might succeed in finding duplicates among these short .mp4 clips? On average they are about 30 seconds each, and the difference between two of them is only a few milliseconds. To the human eye they are exactly the same, so I am seeing duplicates, but I don't want to have to watch and compare 16,000+ video clips all on my own.
Any suggestions?
I found a great working answer to my question; can you allow me to post it?
... it seems I can't while the question is on hold ...
linux files find video file-comparison
closed as too broad by Rui F Ribeiro, Evan Carroll, jimmij, X Tian, Michael Homer Feb 13 at 23:20
Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. Avoid asking multiple distinct questions at once. See the How to Ask page for help clarifying this question. If this question can be reworded to fit the rules in the help center, please edit the question.
Kindly let us know: do you want to delete only the videos whose size and name are the same?
– Praveen Kumar BS
Feb 12 at 4:25
Generating predictable thumbnails might be a decent idea: video.stackexchange.com/a/5315/14767
– Haxiel
Feb 12 at 5:28
You haven't defined "exactly alike to the human eye" at all. Does that mean same clarity? Same starting frame?
– Evan Carroll
Feb 12 at 7:03
@EvanCarroll No. It means that they are the same to the human eye.
– peterh
Feb 17 at 9:33
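For what it's worth, Haxiel's thumbnail suggestion can be sketched in a few lines, assuming the opencv-python, Pillow, and imagehash packages (the directory and the choice of the middle frame are placeholder assumptions): grab one frame from a fixed position in each clip, perceptually hash it, and group clips whose hashes collide.

from os import listdir, path
from collections import defaultdict

from cv2 import VideoCapture, CAP_PROP_FRAME_COUNT, CAP_PROP_POS_FRAMES
from PIL import Image
from imagehash import average_hash

video_dir = '/path/to/clips'  # placeholder

groups = defaultdict(list)
for name in listdir(video_dir):
    cap = VideoCapture(path.join(video_dir, name))
    frames = int(cap.get(CAP_PROP_FRAME_COUNT))
    cap.set(CAP_PROP_POS_FRAMES, frames // 2)  # seek to the middle frame
    ok, frame = cap.read()
    cap.release()
    if not ok:
        continue
    # Perceptual hash: visually identical frames collide even when the
    # files differ byte for byte. A few extra leading frames shift the
    # midpoint slightly, but adjacent frames usually hash the same.
    groups[str(average_hash(Image.fromarray(frame)))].append(name)

for h, names in groups.items():
    if len(names) > 1:
        print(h, names)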
asked Feb 12 at 3:33 by Anonymous User, edited Feb 17 at 3:47
1 Answer
I had the same problem, so I wrote a program myself. The catch was that my videos were of various formats and resolutions, so I needed to take a hash of each video frame and compare those across files.
https://github.com/gklc811/duplicate_video_finder
You can just change the directories at the top and you are good to go.
from os import path, walk, makedirs, rename
from time import perf_counter  # time.clock() was removed in Python 3.8
from imagehash import average_hash
from PIL import Image
from cv2 import VideoCapture, CAP_PROP_FRAME_COUNT, CAP_PROP_FRAME_WIDTH, CAP_PROP_FRAME_HEIGHT, CAP_PROP_FPS
from json import dump, load
from multiprocessing import Pool, cpu_count

# Change these four directories to suit your setup (keep the trailing separator).
input_vid_dir = 'C:\\Users\\gokul\\Documents\\data\\'
json_dir = 'C:\\Users\\gokul\\Documents\\db\\'
analyzed_dir = 'C:\\Users\\gokul\\Documents\\analyzed\\'
duplicate_dir = 'C:\\Users\\gokul\\Documents\\duplicate\\'

if not path.exists(json_dir):
    makedirs(json_dir)

if not path.exists(analyzed_dir):
    makedirs(analyzed_dir)

if not path.exists(duplicate_dir):
    makedirs(duplicate_dir)


def write_to_json(filename, data):
    # One JSON file per video, holding its metadata and frame hashes.
    file_full_path = json_dir + filename + ".json"
    with open(file_full_path, 'w') as file_pointer:
        dump(data, file_pointer)
    return


def video_to_json(filename):
    file_full_path = input_vid_dir + filename
    start = perf_counter()
    size = round(path.getsize(file_full_path) / 1024 / 1024, 2)  # size in MB

    video_pointer = VideoCapture(file_full_path)
    frame_count = int(video_pointer.get(CAP_PROP_FRAME_COUNT))
    width = int(video_pointer.get(CAP_PROP_FRAME_WIDTH))
    height = int(video_pointer.get(CAP_PROP_FRAME_HEIGHT))
    fps = int(video_pointer.get(CAP_PROP_FPS))

    # Perceptually hash every frame; hash strings key the dict, so
    # visually identical frames collapse into one entry.
    success, image = video_pointer.read()
    video_hash = {}
    while success:
        frame_hash = average_hash(Image.fromarray(image))
        video_hash[str(frame_hash)] = filename
        success, image = video_pointer.read()
    video_pointer.release()

    stop = perf_counter()
    time_taken = stop - start

    print("Time taken for ", file_full_path, " is : ", time_taken)

    data_dict = dict()
    data_dict['size'] = size
    data_dict['time_taken'] = time_taken
    data_dict['fps'] = fps
    data_dict['height'] = height
    data_dict['width'] = width
    data_dict['frame_count'] = frame_count
    data_dict['filename'] = filename
    data_dict['video_hash'] = video_hash

    write_to_json(filename, data_dict)
    return


def multiprocess_video_to_json():
    # Hash the videos in parallel, one worker per CPU core.
    files = next(walk(input_vid_dir))[2]
    processes = cpu_count()
    print(processes)
    pool = Pool(processes)
    start = perf_counter()
    pool.map_async(video_to_json, files)
    pool.close()
    pool.join()
    stop = perf_counter()
    print("Time Taken : ", stop - start)


def key_with_max_val(d):
    # Return the key with the largest value (the most-overlapping file).
    max_value = 0
    required_key = ""
    for k in d:
        if d[k] > max_value:
            max_value = d[k]
            required_key = k
    return required_key


def duplicate_analyzer():
    files = next(walk(json_dir))[2]
    data_dict = {}
    for file in files:
        filename = json_dir + file
        with open(filename) as f:
            data = load(f)
        video_hash = data['video_hash']
        count = 0
        duplicate_file_dict = dict()
        for key in video_hash:
            count += 1
            if key in data_dict:
                # Frame hash already seen in an earlier file: count the overlap.
                if data_dict[key] in duplicate_file_dict:
                    duplicate_file_dict[data_dict[key]] = duplicate_file_dict[data_dict[key]] + 1
                else:
                    duplicate_file_dict[data_dict[key]] = 1
            else:
                data_dict[key] = video_hash[key]
        if duplicate_file_dict:
            duplicate_file = key_with_max_val(duplicate_file_dict)
            duplicate_percentage = ((duplicate_file_dict[duplicate_file] / count) * 100)
            # More than half the frames match an earlier file: treat as duplicate.
            if duplicate_percentage > 50:
                file = file[:-5]  # strip the ".json" suffix
                print(file, " is dup of ", duplicate_file)
                src = analyzed_dir + file
                tgt = duplicate_dir + file
                if path.exists(src):
                    rename(src, tgt)
                # else:
                #     print("File already moved")


def mv_analyzed_file():
    # Move every video that already has a JSON record out of the input dir.
    files = next(walk(json_dir))[2]
    for filename in files:
        filename = filename[:-5]  # strip the ".json" suffix
        src = input_vid_dir + filename
        tgt = analyzed_dir + filename
        if path.exists(src):
            rename(src, tgt)
        # else:
        #     print("File already moved")


if __name__ == '__main__':
    mv_analyzed_file()
    multiprocess_video_to_json()
    mv_analyzed_file()
    duplicate_analyzer()
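The reason this tolerates the millisecond offsets in the question is the per-frame hashing combined with the 50% threshold in duplicate_analyzer: two clips that differ only by a few leading or trailing frames still share almost all of their frame hashes. A toy illustration with made-up hash strings:

# Made-up frame hashes: clip_b shares four of its five hashes with clip_a
# (one differing frame at the end), an 80% overlap, well above the 50%
# cut-off used by duplicate_analyzer above.
clip_a = {'8f00', '8f01', '8f02', '8f03', '8f04'}
clip_b = {'8f01', '8f02', '8f03', '8f04', '9a10'}
overlap = len(clip_a & clip_b) / len(clip_b) * 100
print(overlap)  # 80.0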
answered Feb 12 at 9:19 by Gokul C